[Linuxptp-users] ptp4l and network connectivity interruption
Brian Walsh
2015-12-08 16:27:29 UTC
Sorry if this has been asked before. The archives are unreachable on
sourceforge. I keep getting an "Error 403 Read access required" when trying
to view the list archives.

I am having an issue with the ptp4l client and network connectivity. The
client works just fine and syncs the hardware clock on an Intel e1000
device. However, if anything interrupts that connectivity for a couple of
seconds the clock seems to drop the fact that it is synced to a TAI time
source with a leap second offset. It will panic that it is behind and jump
forward 36 seconds (the current leap second offset). Then a few seconds
later when connectivity is restored and resynced, it realizes it is now 36
seconds fast and takes 20 minutes or more to work back to the correct time.

I am able to reproduce this by temporarily blocking access to 1588 udp
ports 319 and 320 through iptables. Wait a few seconds and the clock will
jump ahead by the leap second offset. Unblock the udp ports and then the
clock begins the long process of adjusting back to the actual time.

Is there a setting that I have missed or something I have overlooked? The
ptp4l client does not have many options. I would think that the clock
should maintain the last known offset during the brief interruption.

Thanks,
Brian
Richard Cochran
2015-12-10 09:25:33 UTC
Post by Brian Walsh
Sorry if this has been asked before. The archives are unreachable on
sourceforge. I keep getting an "Error 403 Read access required" when trying
to view the list archives.
Yes, SF does have issues, and I want to move away from there,
eventually. In the meantime, you can use the archives on Gmane:

http://news.gmane.org/gmane.comp.linux.ptp.user
http://news.gmane.org/gmane.comp.linux.ptp.devel
Post by Brian Walsh
I am having an issue with the ptp4l client and network connectivity. The
client works just fine and syncs the hardware clock on an Intel e1000
device.
Which device?
Post by Brian Walsh
However, if anything interrupts that connectivity for a couple of
seconds the clock seems to drop the fact that it is synced to a TAI time
source with a leap second offset. It will panic that it is behind and jump
forward 36 seconds (the current leap second offset). Then a few seconds
later when connectivity is restored and resynced, it realizes it is now 36
seconds fast and takes 20 minutes or more to work back to the correct time.
IIRC, this problem is due to the fact that the e1000 HW and driver require
a complete reset when the link goes down. The old time value gets
lost, and the driver simply initializes the clock with the current
system time.
Post by Brian Walsh
I am able to reproduce this by temporarily blocking access to 1588 udp
ports 319 and 320 through iptables. Wait a few seconds and the clock will
jump ahead by the leap second offset. Unblock the udp ports and then the
clock begins the long process of adjusting back to the actual time.
Hm, I wouldn't expect that behavior, but it does sound like the link
loss symptom.
Post by Brian Walsh
Is there a setting that I have missed or something I have overlooked? The
ptp4l client does not have many options. I would think that the clock
should maintain the last known offset during the brief interruption.
I think the source of the jump is not in ptp4l but rather in the
driver or HW.

Thanks,
Richard

------------------------------------------------------------------------------
Brian Walsh
2015-12-10 17:06:19 UTC
On Thu, Dec 10, 2015 at 4:25 AM, Richard Cochran
Post by Richard Cochran
Post by Brian Walsh
I am having an issue with the ptp4l client and network connectivity. The
client works just fine and syncs the hardware clock on an Intel e1000
device.
Which device?
It is an Intel 82574L. 8086:10d3
Post by Richard Cochran
Post by Brian Walsh
However, if anything interrupts that connectivity for a couple of
seconds the clock seems to drop the fact that it is synced to a TAI time
source with a leap second offset. It will panic that it is behind and jump
forward 36 seconds (the current leap second offset). Then a few seconds
later when connectivity is restored and resynced, it realizes it is now 36
seconds fast and takes 20 minutes or more to work back to the correct time.
IIRC, this problem is due to the fact that the e1000 HW and driver require
a complete reset when the link goes down. The old time value gets
lost, and the driver simply initializes the clock with the current
system time.
Post by Brian Walsh
I am able to reproduce this by temporarily blocking access to 1588 udp
ports 319 and 320 through iptables. Wait a few seconds and the clock will
jump ahead by the leap second offset. Unblock the udp ports and then the
clock begins the long process of adjusting back to the actual time.
Hm, I wouldn't expect that behavior, but it does sound like the link
loss symptom.
Post by Brian Walsh
Is there a setting that I have missed or something I have overlooked? The
ptp4l client does not have many options. I would think that the clock
should maintain the last known offset during the brief interruption.
I think the source of the jump is not in ptp4l but rather in the
driver or HW.
I am running tests using kernel version 4.1.7. I will try and trace it
down some more.

Looking again, it appears it may be the opposite of what I thought.
ptp4l is maintaining the offset value while the hardware clock has
switched back to UTC time. I am not seeing anywhere that ptp4l is
resetting the offset to 0 during this state.

Connectivity working:
***@host:~> phc_ctl eth0 cmp get
phc_ctl[92833.880]: offset from CLOCK_REALTIME is -36000012151ns
phc_ctl[92833.880]: clock time is 1449766596.912500774 or Thu Dec 10 16:56:36 2015

Ports blocked:
***@host:~> phc_ctl eth0 cmp get
phc_ctl[92834.718]: offset from CLOCK_REALTIME is 7518ns
phc_ctl[92834.719]: clock time is 1449766561.750694117 or Thu Dec 10 16:56:01 2015
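
The two readings above can be checked with a little arithmetic. The following is only a minimal sketch with hypothetical helper names (`to_ns`, `phc_step_ns`). It assumes the bracketed prefix printed by phc_ctl is a timestamp in seconds, which is used here only to estimate how much time passed between the two readings.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical helper: turn "seconds.nanoseconds" into a nanosecond count. */
static int64_t to_ns(int64_t sec, int64_t nsec)
{
	assert(nsec >= 0 && nsec < NSEC_PER_SEC);
	return sec * NSEC_PER_SEC + nsec;
}

/*
 * Size of the step in the PHC between two readings, after removing the
 * time that elapsed between them. A negative result means the clock
 * jumped backwards.
 */
static int64_t phc_step_ns(int64_t phc_before_ns, int64_t phc_after_ns,
			   int64_t elapsed_ns)
{
	return phc_after_ns - (phc_before_ns + elapsed_ns);
}
```

Feeding in the values from the output, `phc_step_ns(to_ns(1449766596, 912500774), to_ns(1449766561, 750694117), 839000000)` gives -36000806657 ns, so the hardware clock stepped back by about 36 seconds. The reported offset from CLOCK_REALTIME changed by nearly the same amount (from -36000012151 ns to 7518 ns).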

------------------------------------------------------------------------------
Richard Cochran
2015-12-11 15:30:06 UTC
Post by Brian Walsh
It is an Intel 82574L. 8086:10d3
Ok, I have that card. The driver is the e1000e (and not the e1000).
Can you send me your iptables script so that I can try and reproduce
the problem?
Post by Brian Walsh
Looking again, it appears it may be the opposite of what I thought.
ptp4l is maintaining the offset value while the hardware clock has
switched back to UTC time. I am not seeing anywhere that ptp4l is
resetting the offset to 0 during this state.
Right, it is in the driver or HW. I remember that card resetting the
clock after link loss. I complained about this, but Intel said it was
a HW limitation, IIRC.

However, I wouldn't expect this to happen just from the action of the
firewall. That sounds more like a driver bug.

Thanks,
Richard

------------------------------------------------------------------------------
Brian Walsh
2015-12-11 20:09:56 UTC
Post by Richard Cochran
Post by Brian Walsh
It is an Intel 82574L. 8086:10d3
Ok, I have that card. The driver is the e1000e (and not the e1000).
Can you send me your iptables script so that I can try and reproduce
the problem?
I am just dropping UDP packets on INPUT for ports 319 and 320:

iptables -A INPUT -p udp --dport 319 -j DROP
iptables -A INPUT -p udp --dport 320 -j DROP

After a few seconds I just delete those rules.
Post by Richard Cochran
Post by Brian Walsh
Looking again, it appears it may be the opposite of what I thought.
ptp4l is maintaining the offset value while the hardware clock has
switched back to UTC time. I am not seeing anywhere that ptp4l is
resetting the offset to 0 during this state.
Right, it is in the driver or HW. I remember that card resetting the
clock after link loss. I complained about this, but Intel said it was
a HW limitation, IIRC.
However, I wouldn't expect this to happen just from the action of the
firewall. That sounds more like a driver bug.
I was looking at the linuxptp code to see if it could possibly detect the
condition. It does detect the initial jump when the hardware starts
receiving packets again. Maybe it could check the jump against the last
known offset value. Have it wait for a few packets while the device
settles before trusting that jump if it is close to the offset.

Brian
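
The check described above could look roughly like the following. This is only a sketch of the idea, not linuxptp code: the names `jump_matches_utc_offset`, `jump_filter` and `jump_filter_accept`, the tolerance, and the number of samples to wait are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/*
 * Hypothetical check: does a sudden change in the measured offset have
 * the same magnitude as the TAI-UTC offset, within a tolerance?
 */
static bool jump_matches_utc_offset(int64_t jump_ns, int utc_offset_s,
				    int64_t tolerance_ns)
{
	int64_t magnitude = jump_ns < 0 ? -jump_ns : jump_ns;
	int64_t diff = magnitude - (int64_t)utc_offset_s * NSEC_PER_SEC;

	assert(tolerance_ns >= 0);
	if (diff < 0)
		diff = -diff;
	return diff <= tolerance_ns;
}

/* Hypothetical hold-off state: distrust a suspicious jump for a few samples. */
struct jump_filter {
	int required;	/* suspicious samples to ignore before trusting one */
	int seen;	/* consecutive suspicious samples ignored so far */
};

/* Returns true when the current sample may be trusted. */
static bool jump_filter_accept(struct jump_filter *f, bool suspicious)
{
	if (!suspicious) {
		f->seen = 0;
		return true;
	}
	if (f->seen < f->required) {
		f->seen++;
		return false;
	}
	return true;
}
```

With `required` set to 3, the first three samples that show a jump of about 36 seconds would be ignored, and the jump would be acted on only if it is still present on the fourth.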

------------------------------------------------------------------------------
Richard Cochran
2015-12-12 17:50:45 UTC
Post by Brian Walsh
I was looking at the linuxptp code to see if it could possibly detect the
condition. It does detect the initial jump when the hardware starts
receiving packets again. Maybe it could check the jump against the last
known offset value. Have it wait for a few packets while the device
settles before trusting that jump if it is close to the offset.
This is definitely a driver bug.

Looking at drivers/net/ethernet/intel/e1000e/netdev.c, in the function
e1000e_config_hwtstamp(), the time is reset whenever time stamping is
activated. That doesn't make any sense.

It looks like the calls to e1000e_get_base_timinca() and
timecounter_init() are misplaced. They should go into the probe
function instead.
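
To illustrate why re-initializing the counter from system time loses the offset, here is a simplified user-space model. It is not the kernel's struct timecounter: the names `model_clock`, `model_clock_init`, `model_clock_read` and `model_clock_step` are made up for this sketch, and one hardware cycle is assumed to equal one nanosecond to keep the arithmetic simple.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a driver's software clock (hypothetical). */
struct model_clock {
	int64_t base_ns;	/* time assigned at the last init or step */
	uint64_t base_cycles;	/* hardware counter value at that moment */
};

/* Start (or restart) the clock at start_ns, discarding any earlier state. */
static void model_clock_init(struct model_clock *c, uint64_t cycles_now,
			     int64_t start_ns)
{
	c->base_ns = start_ns;
	c->base_cycles = cycles_now;
}

/* Current time: base plus elapsed cycles (1 cycle == 1 ns in this model). */
static int64_t model_clock_read(const struct model_clock *c,
				uint64_t cycles_now)
{
	assert(cycles_now >= c->base_cycles);
	return c->base_ns + (int64_t)(cycles_now - c->base_cycles);
}

/* Step the clock, as a servo would when first synchronizing to the master. */
static void model_clock_step(struct model_clock *c, int64_t delta_ns)
{
	c->base_ns += delta_ns;
}
```

In this model, initializing from UTC system time and then stepping by +36 s leaves the clock 36 s ahead of the system clock. Calling `model_clock_init()` again with the current system time, which is the equivalent of what happens each time time stamping is configured, puts the difference back to zero.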

Thanks,
Richard

------------------------------------------------------------------------------
Brian Walsh
2015-12-12 20:18:01 UTC
Post by Richard Cochran
This is definitely a driver bug.
Looking at drivers/net/ethernet/intel/e1000e/netdev.c, in the function
e1000e_config_hwtstamp(), the time is reset whenever time stamping is
activated. That doesn't make any sense.
It looks like the calls to e1000e_get_base_timinca() and
timecounter_init() are misplaced. They should go into the probe
function instead.
I see that. Comparing that code to what happens in the ixgbe driver it
looks like resetting the clock should be part of e1000e_ptp_init. Then
the e1000e_ptp_init code should be called in the device open function to
initialize whenever the device is made active. Pull ptp_init out of the
probe function.

I will see what I can put together to test based off of the ixgbe code.

Brian

------------------------------------------------------------------------------
Richard Cochran
2015-12-12 20:50:32 UTC
Post by Brian Walsh
I see that. Comparing that code to what happens in the ixgbe driver it
looks like resetting the clock should be part of e1000e_ptp_init. Then
the e1000e_ptp_init code should be called in the device open function to
initialize whenever the device is made active. Pull ptp_init out of the
probe function.
Sorry, I mixed up the Intel cards WRT the unfortunate HW limitation.
The 82574 does not need to reset the clock at link loss, or at least
it doesn't appear to need it.

I wouldn't follow ixgbe, because putting the reset in the 'open'
method means that the clock will be reset during ifup/ifdown. For
the ixgbe this is necessary, IIRC, but I wouldn't do it unless you are
absolutely forced to by some HW quirk.

I would try something like the following untested patch...

Thanks,
Richard


diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 0a854a4..1823148 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3732,16 +3732,6 @@ static int e1000e_config_hwtstamp(struct e1000_adapter *adapter,
 	er32(RXSTMPH);
 	er32(TXSTMPH);

-	/* Get and set the System Time Register SYSTIM base frequency */
-	ret_val = e1000e_get_base_timinca(adapter, &regval);
-	if (ret_val)
-		return ret_val;
-	ew32(TIMINCA, regval);
-
-	/* reset the ns time counter */
-	timecounter_init(&adapter->tc, &adapter->cc,
-			 ktime_to_ns(ktime_get_real()));
-
 	return 0;
 }

@@ -6980,6 +6970,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	u16 eeprom_data = 0;
 	u16 eeprom_apme_mask = E1000_EEPROM_APME;
 	s32 rval = 0;
+	u32 regval;

 	if (ei->flags2 & FLAG2_DISABLE_ASPM_L0S)
 		aspm_disable_flag = PCIE_LINK_STATE_L0S;
@@ -7270,6 +7261,16 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	/* carrier off reporting is important to ethtool even BEFORE open */
 	netif_carrier_off(netdev);

+	/* Get and set the System Time Register SYSTIM base frequency */
+	err = e1000e_get_base_timinca(adapter, &regval);
+	if (err)
+		goto err_register;
+	ew32(TIMINCA, regval);
+
+	/* reset the ns time counter */
+	timecounter_init(&adapter->tc, &adapter->cc,
+			 ktime_to_ns(ktime_get_real()));
+
 	/* init PTP hardware clock */
 	e1000e_ptp_init(adapter);


------------------------------------------------------------------------------
Brian Walsh
2015-12-12 20:58:31 UTC
Post by Richard Cochran
I wouldn't follow ixgbe, because putting the reset in the 'open'
method means that the clock will be reset during ifup/ifdown. For
the ixgbe this is necessary, IIRC, but I wouldn't do it unless you are
absolutely forced to by some HW quirk.
I was not sure if having it reset during ifup makes more sense. Does the
clock go away when the interface is down? I can't test that right now.
It is my primary interface so it is always up on my device.
Post by Richard Cochran
I would try something like the following untested patch...
Just finished an initial test after quickly making the same changes
you sent. Looks like it fixes the problem I was seeing.

Makes sense. Stop resetting the clock and it will not reset.

Brian

------------------------------------------------------------------------------
Richard Cochran
2015-12-12 21:09:46 UTC
Post by Brian Walsh
I was not sure if having it reset during ifup makes more sense. Does the
clock go away when the interface is down? I can't test that right now.
It is my primary interface so it is always up on my device.
The /dev/ptpX persists from ptp_clock_register() until
ptp_clock_unregister(). Ideally, the clock should appear when the
device is probed and stay running until either the HW is unplugged or
the driver gets unloaded.

There are some HW designs out there that cause the clock to go away or
become unusable when the link state changes, but I think the 82574
does not have those kinds of issues.
Post by Brian Walsh
Makes sense. Stop resetting the clock and it will not reset.
Yup.

Thanks,
Richard

