[Linuxptp-users] clockcheck jumps forwards and backwards

Discussion:

David Mirabito

2017-04-04 08:52:21 UTC

Hello List,

I understand PTP in general needs stable/symmetrical round-trip times to be
at its best and that things like non-PTP-aware switches will introduce
noise.

It makes sense that as the network gets less deterministic and symmetrical
then PTP will necessarily perform worse - probably both in terms of the
self-reported rms/max offsets as well as measured by some external
"correct" reference.

Is it safe to assume, that given a crappy PTP network ptp4l would just
degrade to ntp-like performance at worst? Or are there some other
thresholds or sanity checks which would cause it to throw in the towel in
situations where NTP would keep trucking, being designed for such
situations and necessarily more robust?

I have one machine which is admittedly on a not-ideal PTP network, but with
some bizarre logs I'm having difficulty understanding:

Mar 19 09:06:50 user.notice ptp4l: [1337049.193] selected best master clock 94bc40
Mar 19 09:06:50 user.warning ptp4l: [1337049.193] running in a temporal vortex
Mar 19 09:06:50 user.notice ptp4l: [1337049.193] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
Mar 19 09:06:51 user.info phc2sys: [1337049.328] port d24202-1 changed state
Mar 19 09:06:51 user.info phc2sys: [1337049.328] reconfiguring after port state change
Mar 19 09:06:51 user.info phc2sys: [1337049.328] master clock not ready, waiting...
Mar 19 09:06:52 user.warning ptp4l: [1337050.328] clockcheck: clock jumped backward or running slower than expected!
Mar 19 09:06:52 user.warning ptp4l: [1337050.520] clockcheck: clock jumped forward or running faster than expected!
Mar 19 09:06:53 user.notice ptp4l: [1337051.884] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Mar 19 09:06:54 user.info phc2sys: [1337052.328] port d24202-1 changed state
Mar 19 09:06:54 user.info phc2sys: [1337052.328] reconfiguring after port state change
Mar 19 09:06:54 user.info phc2sys: [1337052.328] selecting CLOCK_REALTIME for synchronization
Mar 19 09:06:54 user.info phc2sys: [1337052.328] selecting eth2 as the master clock
Mar 19 09:06:56 user.warning ptp4l: [1337054.329] clockcheck: clock jumped backward or running slower than expected!
Mar 19 09:06:56 user.notice ptp4l: [1337054.331] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Mar 19 09:06:56 user.warning ptp4l: [1337054.562] clockcheck: clock jumped forward or running faster than expected!
Mar 19 09:06:57 user.info phc2sys: [1337055.329] port d24202-1 changed state
Mar 19 09:06:57 user.info phc2sys: [1337055.329] reconfiguring after port state change
Mar 19 09:06:57 user.info phc2sys: [1337055.329] master clock not ready, waiting...
Mar 19 09:06:57 user.notice ptp4l: [1337055.887] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Mar 19 09:06:58 user.info phc2sys: [1337056.329] port d24202-1 changed state
Mar 19 09:06:58 user.info phc2sys: [1337056.329] reconfiguring after port state change
Mar 19 09:06:58 user.info phc2sys: [1337056.329] selecting CLOCK_REALTIME for synchronization
Mar 19 09:06:58 user.info phc2sys: [1337056.329] selecting eth2 as the master clock

Is this a possible failure mode of a network / master that is just too poor
to survive?

In particular, the clockcheck differences: swinging by +/-20% from the
system time *within the same second*?
Given we are also running phc2sys, I am wondering if we've hit some case
where the system clock is lagging the PHC and there's some resonance
between this lag and the swings of the PHC. The system clock would speed
up to try to catch the PHC and be at its fastest just as the PHC has
turned around and is at its slowest, so the system clock then slows down.
I have an inkling this could be additive over time, with the swings
getting larger and larger until each clock is adjusted by around 10% in
frequency away from "true", which puts them ~20% from one another, and
clockcheck bites.

Is this a reasonable explanation?

Alternatively, are there other things which could cause near-simultaneous
clockcheck faster and slower complaints?
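
To make the question concrete, here is a minimal sketch of a frequency
sanity check of the kind the warnings suggest. It is not the linuxptp
source: the function name, the 20% limit (taken from the figure above) and
the sample intervals are assumptions for illustration, and it ignores any
frequency correction currently applied to the clock. It shows that a clock
stepped backward and then forward again a fraction of a second later would
trip both warnings in quick succession without any sustained frequency
error.

```python
# Simplified sketch of a frequency sanity check; NOT the linuxptp source.
FREQ_LIMIT_PPB = 200_000_000  # assumed limit: 20%, in parts per billion


def check_interval(clock_interval_ns, mono_interval_ns, limit_ppb=FREQ_LIMIT_PPB):
    """Classify one measurement interval.

    clock_interval_ns: time elapsed on the checked clock (e.g. the PHC)
    mono_interval_ns:  time elapsed on a free-running reference clock
    """
    # Apparent frequency offset of the checked clock relative to the reference.
    offset_ppb = 1e9 * (clock_interval_ns / mono_interval_ns - 1.0)
    if offset_ppb > limit_ppb:
        return "forward/faster"
    if offset_ppb < -limit_ppb:
        return "backward/slower"
    return "ok"


# Hypothetical intervals: the checked clock is stepped back 0.3 s during a
# 1 s reference interval, then stepped forward 0.3 s during the next 0.2 s.
print(check_interval(700_000_000, 1_000_000_000))
print(check_interval(500_000_000, 200_000_000))
# A clock that is merely 50 ppm fast stays well inside the limit.
print(check_interval(1_000_050_000, 1_000_000_000))
```

In this reading, a pair of opposite warnings about 0.2 s apart looks more
like two steps of the clock than like a real +/-20% frequency swing.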

The config is pretty vanilla, just UDP in slave-only mode on a single port.
The network is not ideal for PTP, but I was hoping it would at least truck
along, even if not with record-breaking accuracy.

Analysing ~5000 status log messages from ptp4l (in between state machine
resets) shows:
* freq: mean 42714 stddev 406
* delay: mean 3577 stddev 40
* err_rms: mean 1005 stddev 10058 (nb: mean/dev of an already RMS term
might not be meaningful, and this is super-spiky, probably due to always
hitting fault mode and restarting)
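
For reference, statistics like these can be pulled out of the log with a
few lines of Python. This is only a sketch: it assumes the periodic summary
line format that appears elsewhere in this thread ("rms ... max ... freq
... +/- ... delay ... +/- ..."), the helper name summarize is made up, and
the two sample lines are copied from Ian's log rather than from the
problematic box.

```python
import re
import statistics

# Matches the periodic summary line, e.g.
# "rms 123 max 599 freq +255 +/- 39 delay 7362 +/- 48"
SUMMARY_RE = re.compile(
    r"rms\s+(?P<rms>\d+)\s+max\s+(?P<max>\d+)\s+"
    r"freq\s+(?P<freq>[+-]?\d+)\s+\+/-\s+\d+"
    r"(?:\s+delay\s+(?P<delay>-?\d+)\s+\+/-\s+\d+)?"
)


def summarize(lines):
    """Return {field: (mean, population stddev)} over all summary lines."""
    fields = {"rms": [], "freq": [], "delay": []}
    for line in lines:
        m = SUMMARY_RE.search(line)
        if not m:
            continue  # state changes, warnings, etc. are skipped
        for name, values in fields.items():
            if m.group(name) is not None:
                values.append(int(m.group(name)))
    return {
        name: (statistics.mean(values), statistics.pstdev(values))
        for name, values in fields.items()
        if values
    }


# Sample input copied from another log in this thread.
SAMPLE = [
    "Apr 4 13:42:04 localhost user.info ptp4l: [537.164] rms 123 max 599 freq +255 +/- 39 delay 7362 +/- 48",
    "Apr 4 13:42:29 localhost user.err ptp4l: [561.387] port 1: send delay request failed",
    "Apr 4 13:51:02 localhost user.info ptp4l: [1074.413] rms 116 max 681 freq +256 +/- 48 delay 7373 +/- 88",
]

for name, (mean, dev) in summarize(SAMPLE).items():
    print(f"{name}: mean {mean} stddev {dev}")
```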

I am happy to try to provide more info or run tests if there are any
ideas or suggestions.

Thanks,
David

Miroslav Lichvar

2017-04-04 10:57:52 UTC

Post by David Mirabito
Is it safe to assume, that given a crappy PTP network ptp4l would just
degrade to ntp-like performance at worst? Or are there some other
thresholds or sanity checks which would cause it to throw in the towel in
situations where NTP would keep trucking, being designed for such
situations and necessarily more robust?

No, NTP would generally perform better in non-ideal conditions (e.g.
busy network using switches without PTP support), but PTP should still
work. What you describe looks like a bug to me.

Post by David Mirabito
Mar 19 09:06:54 user.info phc2sys: [1337052.328] reconfiguring after port state change
Mar 19 09:06:54 user.info phc2sys: [1337052.328] selecting CLOCK_REALTIME for synchronization
Mar 19 09:06:54 user.info phc2sys: [1337052.328] selecting eth2 as the master clock
Mar 19 09:06:56 user.warning ptp4l: [1337054.329] clockcheck: clock jumped backward or running slower than expected!
Mar 19 09:06:56 user.notice ptp4l: [1337054.331] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Mar 19 09:06:56 user.warning ptp4l: [1337054.562] clockcheck: clock jumped forward or running faster than expected!
Is this a possible failure mode of a network / master that is just too poor
to survive?

A broken master shouldn't trigger the clockcheck messages on slaves. I
suspect the phc2sys process is adjusting/stepping the PHC when it
shouldn't, possibly triggering some feedback loop between ptp4l and
phc2sys.

What HW/driver and linuxptp version is this? Can you please post your
ptp4l config and command-line options used for ptp4l and phc2sys?
Having full logs with measurements (-l 6) might help too.

--
Miroslav Lichvar

Ian Thompson

2017-04-04 15:45:16 UTC

Possibly following on from David's post.

We have a system with 18 boards in a rack. Each board has an Altera SoC with the STM Ethernet MAC connected via gigabit Ethernet to an Arista ptp-aware switch and then a Spectracom GrandMaster.
The boards are running Linux kernel 3.15.0.

They lock quickly after boot and can remain locked for several hours, but usually any one of the boards may do the following ...

Apr 4 13:42:04 localhost user.info ptp4l: [537.164] rms 123 max 599 freq +255 +/- 39 delay 7362 +/- 48
Apr 4 13:42:29 localhost user.err ptp4l: [561.387] timed out while polling for tx timestamp
Apr 4 13:42:29 localhost user.err ptp4l: [561.387] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
Apr 4 13:42:29 localhost user.err ptp4l: [561.387] port 1: send delay request failed
Apr 4 13:42:29 localhost user.notice ptp4l: [561.387] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Apr 4 13:42:45 localhost user.notice ptp4l: [577.388] port 1: FAULTY to LISTENING on FAULT_CLEARED
Apr 4 13:42:45 localhost user.warn ptp4l: [577.414] clockcheck: clock jumped backward or running slower than expected!
Apr 4 13:42:45 localhost user.notice ptp4l: [577.414] port 1: new foreign master 000cec.fffe.0a085d-1
Apr 4 13:42:47 localhost user.notice ptp4l: [579.414] selected best master clock 000cec.fffe.0a085d
Apr 4 13:42:47 localhost user.notice ptp4l: [579.414] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
Apr 4 13:42:54 localhost user.notice ptp4l: [587.164] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Apr 4 13:46:46 localhost user.info ptp4l: [818.414] rms 2312500092 max 37000001557 freq +246 +/- 250 delay 7358 +/- 46
Apr 4 13:51:02 localhost user.info ptp4l: [1074.413] rms 116 max 681 freq +256 +/- 48 delay 7373 +/- 88

Does this imply that one lost delay request can do this, or is there a retry mechanism?
Notice that the system recovers but we can't afford the large timing glitch that gets introduced.
We have a lot of traffic leaving the boards but only PTP traffic coming in. As we increase the off board transfer rates the problem seems to occur more often.
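
One way to read the huge rms/max pair in the summary line after recovery is
sketched below. It assumes that rms is the square root of the mean squared
offset over the summary window and that a single sample dominates; under
those assumptions the ratio of max to rms gives the number of samples in
the window. The variable names are illustrative only.

```python
# Values copied from the "rms 2312500092 max 37000001557" summary line.
rms_ns = 2_312_500_092
max_ns = 37_000_001_557

# If one sample of magnitude max_ns dominates a window of n samples, then
# rms ~= max / sqrt(n), so n ~= (max / rms) ** 2.
n_samples = round((max_ns / rms_ns) ** 2)
outlier_s = max_ns / 1e9

print(f"estimated samples in window: {n_samples}")
print(f"outlier size: {outlier_s:.3f} s")
```

Under these assumptions the numbers are consistent with a single offset
sample of about 37 s in a window of about 256 samples, rather than a long
period of bad sync. 37 s also happens to equal the TAI-UTC offset in effect
since January 2017, though nothing in these logs establishes a connection.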

Thanks for any help,
Ian T.

David Mirabito

2017-04-04 23:18:09 UTC

Hi,

The device is a:
00:14.0 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)

Using
bash-4.3# ethtool -i ma2
driver: igb
version: 5.3.0-k
firmware-version: 0.0.0
expansion-rom-version:
bus-info: 0000:00:14.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

And:
# ethtool -T ma2
Time stamping parameters for ma2:
Capabilities:
hardware-transmit (SOF_TIMESTAMPING_TX_HARDWARE)
software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE)
hardware-receive (SOF_TIMESTAMPING_RX_HARDWARE)
software-receive (SOF_TIMESTAMPING_RX_SOFTWARE)
software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
hardware-raw-clock (SOF_TIMESTAMPING_RAW_HARDWARE)
PTP Hardware Clock: 1
Hardware Transmit Timestamp Modes:
off (HWTSTAMP_TX_OFF)
on (HWTSTAMP_TX_ON)
Hardware Receive Filter Modes:
none (HWTSTAMP_FILTER_NONE)
all (HWTSTAMP_FILTER_ALL)

The config is:

[global]
slaveOnly 1
summary_interval 6
priority1 255

[ma2]

And running as:
/usr/sbin/ptp4l -f /etc/ptp4l.conf
/usr/sbin/phc2sys -a -r -u 64 -n 5

We are running version 1.8, downloaded from the sourceforge mirror. It's
built with openembedded/bitbake and their recipe defines some extra
cflags. I can look into why these were deemed necessary or if they could
affect anything:
EXTRA_OEMAKE = "'CFLAGS=-D_GNU_SOURCE -DHAVE_CLOCK_ADJTIME -DHAVE_POSIX_SPAWN -DHAVE_ONESTEP_SYNC'"

I will look into obtaining more verbose logs.

For what it's worth, this exact same setup works elsewhere; it is just
this one physical setup that exhibits this, although it is unclear if the
cause is a physical fault or something about the network/master outside.

Additionally, since Ian brought it up:
a) We do sometimes see tx timestamp timeouts too
b) We also occasionally see UNEXPECTED_SYSWRAP messages from igb
My understanding is that b) is an Intel bug (bad per-device assumptions
made in code regarding default state of PPS IRQ) on this HW and seems to be
generally treated as benign. I do have a slight suspicion that a) and b) may
be somehow related (backing out of the unexpected wrap IRQ 'forgets' to
notice the available tx timestamp being ready?) but I have some digging to
do on that front.

I currently expect (although happy to be proven wrong) that both a) and b)
are unrelated to the clockcheck jumps, since a+b happens readily and
doesn't affect sync *too* badly, whereas the constant clockcheck aborts
happen only in one place and are apparently disastrous to sync quality.

Cheers, and thanks for your replies,
Dave

Ian Thompson

2017-04-05 14:13:50 UTC

All

Here's a log with phc2sys output. This board ran for 11 hours without an error before this happened ...

Apr 5 09:18:21 localhost user.info phc2sys: [38796.187] rms 28 max 65 freq -242 +/- 1 delay 1099 +/- 11
Apr 5 09:20:21 localhost user.info phc2sys: [38916.211] rms 29 max 73 freq -242 +/- 3 delay 1100 +/- 12
Apr 5 09:20:40 localhost user.info ptp4l: [38934.698] rms 29 max 76 freq -242 +/- 6 delay 2029 +/- 4
Apr 5 09:20:52 localhost user.err ptp4l: [38946.961] timed out while polling for tx timestamp
Apr 5 09:20:52 localhost user.err ptp4l: [38946.961] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
Apr 5 09:20:52 localhost user.err ptp4l: [38946.962] Failure in transport_send(to)()
Apr 5 09:20:52 localhost user.err ptp4l: [38946.962] port 1: send delay request failed
Apr 5 09:20:52 localhost user.notice ptp4l: [38946.962] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] port 2cee26.fffe.000189-1 changed state
Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] reconfiguring after port state change
Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] selecting eth0 for synchronization
Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] nothing to synchronize
Apr 5 09:21:08 localhost user.notice ptp4l: [38962.962] port 1: FAULTY to LISTENING on INIT_COMPLETE
Apr 5 09:21:08 localhost user.warn ptp4l: [38963.198] clockcheck: clock jumped backward or running slower than expected!
Apr 5 09:21:10 localhost user.notice ptp4l: [38964.698] port 1: new foreign master 000cec.fffe.0a0f8d-1
Apr 5 09:21:14 localhost user.notice ptp4l: [38968.698] selected best master clock 000cec.fffe.0a0f8d
Apr 5 09:21:14 localhost user.notice ptp4l: [38968.698] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] port 2cee26.fffe.000189-1 changed state
Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] reconfiguring after port state change
Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] master clock not ready, waiting...
Apr 5 09:21:21 localhost user.notice ptp4l: [38975.698] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Apr 5 09:21:21 localhost user.info phc2sys: [38976.229] port 2cee26.fffe.000189-1 changed state
Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] reconfiguring after port state change
Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] selecting CLOCK_REALTIME for synchronization
Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] selecting eth0 as the master clock
Apr 5 09:22:50 localhost user.info phc2sys: [39065.260] rms 34 max 88 freq -241 +/- 5 delay 1100 +/- 10
Apr 5 09:24:50 localhost user.info phc2sys: [39185.288] rms 28 max 67 freq -242 +/- 1 delay 1099 +/- 11
Apr 5 09:25:23 localhost user.info ptp4l: [39217.698] rms 3270369174 max 37000003713 freq -238 +/- 95 delay 2030 +/- 8
Apr 5 09:26:50 localhost user.info phc2sys: [39305.330] rms 29 max 70 freq -241 +/- 2 delay 1099 +/- 10
Apr 5 09:28:50 localhost user.info phc2sys: [39425.376] rms 29 max 71 freq -242 +/- 1 delay 1100 +/- 11
Apr 5 09:29:39 localhost user.info ptp4l: [39473.698] rms 29 max 75 freq -243 +/- 2 delay 2023 +/- 9

Ian T.

Richard Cochran

2017-04-06 04:37:16 UTC

Ian,

The problem you are seeing shares the same symptom as in the original
post, but the root cause is different because you have totally
different HW. Let's take this discussion into its own thread, please.
I'll start by answering your other recent post.

Thanks,
Richard

Richard Cochran

2017-04-06 06:01:01 UTC

Post by David Mirabito
[global]
slaveOnly 1
summary_interval 6
priority1 255
[ma2]
/usr/sbin/ptp4l -f /etc/ptp4l.conf
/usr/sbin/phc2sys -a -r -u 64 -n 5

Why are you using domain number 5 for phc2sys (-n 5) ?

That would cause ptp4l to ignore all of phc2sys's queries, unless
ptp4l were also using that domain.

Or did you actually use -N 5 ?
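
A quick way to catch this kind of mismatch is to compare the two values
directly. The sketch below is not part of linuxptp; the helper names are
made up, and it assumes that both ptp4l (domainNumber in the config file)
and phc2sys (-n on the command line) fall back to domain 0 when nothing is
specified. The inputs are the config and command line quoted above.

```python
import shlex


def conf_domain(conf_text):
    """Return domainNumber from ptp4l.conf text, or 0 if it is not set."""
    for raw in conf_text.splitlines():
        parts = raw.split("#", 1)[0].split()  # drop trailing comments
        if len(parts) == 2 and parts[0] == "domainNumber":
            return int(parts[1])
    return 0  # assumed default domain


def phc2sys_domain(cmdline):
    """Return the value given to -n on a phc2sys command line, or 0."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args):
        # Only lowercase -n is the domain; uppercase -N is a different option.
        if arg == "-n" and i + 1 < len(args):
            return int(args[i + 1])
    return 0  # assumed default domain


CONF = """[global]
slaveOnly 1
summary_interval 6
priority1 255

[ma2]
"""
CMDLINE = "/usr/sbin/phc2sys -a -r -u 64 -n 5"

ptp4l_dom = conf_domain(CONF)
phc2sys_dom = phc2sys_domain(CMDLINE)
if ptp4l_dom != phc2sys_dom:
    print(f"domain mismatch: ptp4l uses {ptp4l_dom}, phc2sys uses {phc2sys_dom}")
```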

Thanks,
Richard

David Mirabito

2017-04-06 06:21:11 UTC

Gah, sorry - I goofed and pulled the 'ps' output from a different box than
the config. That indeed is the domain and does match the conf for any given
machine, but not when I mix'n'match (ttl, domain, and transport are
site-specific).
bash-4.3# cat /etc/ptp4l.conf
[global]
slaveOnly 1
summary_interval 6
priority1 255
udp_ttl 16

[ma2]

and

S root 20462 1 0 80 0 2452 2920 poll_s 1 Mar03 ? 00:58:13 /usr/sbin/ptp4l -f /etc/ptp4l.conf
S root 20468 1 0 80 0 1068 2902 hrtime 0 Mar03 ? 00:01:17 /usr/sbin/phc2sys -a -r -u 64

So this is how the config file and cmdlines look when the default domain 0
is configured (in this case taken from the problematic box).

Apologies for the mixup,
Dave
