David Mirabito
2017-04-04 08:52:21 UTC
Hello List,
I understand PTP in general needs stable/symmetrical round-trip times to be
at its best, and that things like non-PTP-aware switches will introduce
noise.
It makes sense that as the network gets less deterministic and symmetrical,
PTP will necessarily perform worse - probably both in terms of the
self-reported rms/max offsets and as measured by some external "correct"
reference.
Is it safe to assume that, given a crappy PTP network, ptp4l would just
degrade to NTP-like performance at worst? Or are there other thresholds or
sanity checks which would cause it to throw in the towel in situations
where NTP - designed for such conditions and necessarily more robust -
would keep trucking?
I have one machine which is admittedly on a not-ideal PTP network, but with
some bizarre logs I'm having difficulty understanding:
Mar 19 09:06:50 user.notice ptp4l: [1337049.193] selected best master clock
94bc40
Mar 19 09:06:50 user.warning ptp4l: [1337049.193] running in a temporal
vortex
Mar 19 09:06:50 user.notice ptp4l: [1337049.193] port 1: LISTENING to
UNCALIBRATED on RS_SLAVE
Mar 19 09:06:51 user.info phc2sys: [1337049.328] port d24202-1 changed state
Mar 19 09:06:51 user.info phc2sys: [1337049.328] reconfiguring after port
state change
Mar 19 09:06:51 user.info phc2sys: [1337049.328] master clock not ready,
waiting...
Mar 19 09:06:52 user.warning ptp4l: [1337050.328] clockcheck: clock jumped
backward or running slower than expected!
Mar 19 09:06:52 user.warning ptp4l: [1337050.520] clockcheck: clock jumped
forward or running faster than expected!
Mar 19 09:06:53 user.notice ptp4l: [1337051.884] port 1: UNCALIBRATED to
SLAVE on MASTER_CLOCK_SELECTED
Mar 19 09:06:54 user.info phc2sys: [1337052.328] port d24202-1 changed state
Mar 19 09:06:54 user.info phc2sys: [1337052.328] reconfiguring after port
state change
Mar 19 09:06:54 user.info phc2sys: [1337052.328] selecting CLOCK_REALTIME
for synchronization
Mar 19 09:06:54 user.info phc2sys: [1337052.328] selecting eth2 as the
master clock
Mar 19 09:06:56 user.warning ptp4l: [1337054.329] clockcheck: clock jumped
backward or running slower than expected!
Mar 19 09:06:56 user.notice ptp4l: [1337054.331] port 1: SLAVE to
UNCALIBRATED on SYNCHRONIZATION_FAULT
Mar 19 09:06:56 user.warning ptp4l: [1337054.562] clockcheck: clock jumped
forward or running faster than expected!
Mar 19 09:06:57 user.info phc2sys: [1337055.329] port d24202-1 changed state
Mar 19 09:06:57 user.info phc2sys: [1337055.329] reconfiguring after port
state change
Mar 19 09:06:57 user.info phc2sys: [1337055.329] master clock not ready,
waiting...
Mar 19 09:06:57 user.notice ptp4l: [1337055.887] port 1: UNCALIBRATED to
SLAVE on MASTER_CLOCK_SELECTED
Mar 19 09:06:58 user.info phc2sys: [1337056.329] port d24202-1 changed state
Mar 19 09:06:58 user.info phc2sys: [1337056.329] reconfiguring after port
state change
Mar 19 09:06:58 user.info phc2sys: [1337056.329] selecting CLOCK_REALTIME
for synchronization
Mar 19 09:06:58 user.info phc2sys: [1337056.329] selecting eth2 as the
master clock
Is this a possible failure mode of a network / master that is just too poor
to survive?
In particular, the clockcheck differences - swinging by ±20% from the
system time *within the same second*?
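For what it's worth, my (possibly wrong) reading of linuxptp's clockcheck.c
is that it measures the checked clock's progress against CLOCK_MONOTONIC and
warns when the implied frequency error exceeds sanity_freq_limit, which
defaults to 200000000 ppb, i.e. 20% - which would explain that magic figure.
A simplified Python sketch of the idea (names are mine, not the real ones,
and the real code also accounts for the servo's current frequency
adjustment):

import time

SANITY_FREQ_LIMIT_PPB = 200_000_000  # linuxptp's default, i.e. 20%

class ClockCheck:
    def __init__(self, freq_limit_ppb=SANITY_FREQ_LIMIT_PPB):
        self.freq_limit = freq_limit_ppb
        self.last_ts = None    # previous sample of the checked clock (ns)
        self.last_mono = None  # CLOCK_MONOTONIC at that sample (ns)

    def sample(self, ts_ns):
        mono_ns = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
        if self.last_ts is not None:
            interval = mono_ns - self.last_mono  # "true" elapsed time
            observed = ts_ns - self.last_ts      # elapsed on checked clock
            # Allowed deviation scales with the interval and the ppb limit.
            slack = interval * self.freq_limit // 1_000_000_000
            if observed > interval + slack:
                print("clock jumped forward or running faster than expected!")
            elif observed < interval - slack:
                print("clock jumped backward or running slower than expected!")
        self.last_ts, self.last_mono = ts_ns, mono_ns

If that's roughly right, the back-to-back warnings mean the clock looked
>20% slow and then >20% fast against CLOCK_MONOTONIC within ~200 ms, which
is what baffles me.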
Given we are also running phc2sys, I am wondering if we've hit some case
where the system clock is lagging the PHC and there's some resonance
between this lag and the swings of the PHC. The system clock would speed up
to try to catch the PHC, and be at its fastest just as the PHC has turned
around and is at its slowest; so the system clock then slows down. I have
an inkling this could be additive over time, the swings getting larger and
larger until each is around 10% in adjusted frequency from "true", which
puts them ~20% from one another and clockcheck bites.
Is this a reasonable explanation?
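The arithmetic at least works out. Here is a toy model (nothing like the
real PI servos in ptp4l and phc2sys, just the phase-lag idea): if the PHC
frequency alternates ±10% and the system clock chases the PHC frequency it
saw one step earlier, the relative error between the two settles at ~20%:

def simulate(amp=0.10, gain=1.0, steps=8):
    """PHC frequency alternates +/-amp; the system clock applies a
    proportional correction toward the PHC frequency it saw last step."""
    sys_freq = 0.0   # fractional frequency error of the system clock
    prev_phc = 0.0   # stale view of the PHC's frequency error
    for step in range(steps):
        phc_freq = amp if step % 2 == 0 else -amp
        sys_freq += gain * (prev_phc - sys_freq)  # chase the stale value
        prev_phc = phc_freq
        print(f"step {step}: phc {phc_freq:+.2f}  sys {sys_freq:+.2f}  "
              f"relative {phc_freq - sys_freq:+.2f}")

simulate()

Whether the real servos can actually get into that regime, I don't know.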
Alternatively, are there other things which could cause near-simultaneous
clockcheck faster and slower complaints?
The config is pretty vanilla, just UDP in slave-only mode on a single port.
The network is not ideal for PTP, but I was hoping it would at least truck
along, even if not with record-breaking accuracy.
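For concreteness, it boils down to roughly this (reconstructed from memory;
eth2 is our port, option names as in ptp4l's stock configuration):

# /etc/ptp4l.conf (essentials only)
[global]
slaveOnly          1
network_transport  UDPv4
[eth2]

with the daemons started as approximately:

ptp4l -f /etc/ptp4l.conf
phc2sys -a -r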
Analysing ~5000 status log messages from ptp4l (in between state machine
resets) shows:
* freq: mean 42714, stddev 406
* delay: mean 3577, stddev 40
* err_rms: mean 1005, stddev 10058 (nb: mean/stddev of an already-RMS term
might not be meaningful, and this one is super-spiky, probably due to
always hitting fault mode and restarting)
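For reference, those numbers came from a quick script along these lines,
run over ptp4l's summary lines (the regex is approximate and assumes the
usual "rms ... max ... freq ... +/- ... delay ... +/- ..." format):

import re
import statistics
import sys

# Matches ptp4l summary lines of the form:
#   rms <ns> max <ns> freq <ppb> +/- <ppb> delay <ns> +/- <ns>
pat = re.compile(r"rms (\d+) max \d+ freq ([+-]?\d+) \+/- \d+ delay (\d+)")

rms, freq, delay = [], [], []
for line in sys.stdin:
    m = pat.search(line)
    if m:
        rms.append(int(m.group(1)))
        freq.append(int(m.group(2)))
        delay.append(int(m.group(3)))

for name, vals in (("freq", freq), ("delay", delay), ("err_rms", rms)):
    print(f"{name}: mean {statistics.mean(vals):.0f} "
          f"stddev {statistics.stdev(vals):.0f}")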
Am happy to provide more info or run tests if there are any ideas or
suggestions.
Thanks,
David