Discussion:
[Linuxptp-users] [E1000-devel] Need help with igb driver suspend crash issue
Keller, Jacob E
2016-05-11 16:24:50 UTC
Hi,
-----Original Message-----
Sent: Wednesday, May 11, 2016 3:25 AM
Subject: Re: [E1000-devel] Need help with igb driver suspend crash issue
Hi,
I'm using an Intel IGB I350 NIC card on one of our ARM-based platforms.
During suspend, I see a "PCIe link lost, device now detached" print in the
log, and the subsequent resume
causes the system to crash. After digging through the code (BTW, I'm using the
kernel-3.18 release), it looks like the above print comes from the following
call flow, which is executed after igb_suspend() is called (I confirmed this
with the help of prints):
[10846.434381] [<ffffffc000089ce4>] dump_backtrace+0x0/0xf8
[10846.434386] [<ffffffc000089ea0>] show_stack+0x10/0x1c
[10846.434393] [<ffffffc000bc3b70>] dump_stack+0x80/0xc4
[10846.434397] [<ffffffc000613d3c>] igb_rd32+0xb0/0x1a8
[10846.434400] [<ffffffc00062eb0c>] igb_ptp_read_82580+0x18/0x48
[10846.434407] [<ffffffc000106e6c>] timecounter_read+0x1c/0x60
[10846.434410] [<ffffffc00062f338>] igb_ptp_gettime_82576+0x2c/0x88
[10846.434413] [<ffffffc00062f41c>] igb_ptp_overflow_check+0x1c/0x58
[10846.434419] [<ffffffc0000ba584>] process_one_work+0x154/0x414
[10846.434424] [<ffffffc0000bb338>] worker_thread+0x13c/0x4e4
[10846.434428] [<ffffffc0000bfc4c>] kthread+0xf8/0x110
It looks like reading the timer registers would have returned all F's, as the
device is already in the D3hot state.
Is my understanding correct? Is there any patch available to fix this
issue?
Let me know if more information is needed.
Maybe there is an ordering bug during suspend, where we try to read things too late. Is that stack trace the actual crash, or did you add the dump_stack() yourself?
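
To make the "all F's" theory concrete, here is a minimal userspace sketch of the kind of check a register-read helper can perform. It assumes, as suggested above, that a read from a device in D3hot comes back as all ones. The names `fake_nic`, `fake_mmio_read` and `fake_rd32` are hypothetical stand-ins for illustration; this is not the igb driver's actual code, which does more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a NIC; not the igb driver's real structures. */
struct fake_nic {
    bool in_d3hot;   /* true once the device has been put into D3hot */
    bool detached;   /* set when a read looks like a lost PCIe link */
    uint32_t systim; /* pretend contents of a timer register */
};

/* Raw read: assume a device in D3hot does not answer and the read yields all ones. */
static uint32_t fake_mmio_read(const struct fake_nic *nic)
{
    return nic->in_d3hot ? UINT32_MAX : nic->systim;
}

/* Checked read: treat an all-ones value as a lost link and mark the device detached. */
static uint32_t fake_rd32(struct fake_nic *nic)
{
    assert(nic != NULL);

    uint32_t value = fake_mmio_read(nic);

    if (value == UINT32_MAX)
        nic->detached = true; /* this is where a "link lost" message would be logged */

    return value;
}
```

In this model, any read issued after the device has been powered down flips `detached`, which matches the symptom of a periodic PTP read running after igb_suspend().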

Thanks,
Jake
Thanks,
Vidya Sagar
Keller, Jacob E
2016-05-11 17:51:55 UTC
From: vidya sagar [mailto:***@gmail.com]
Sent: Wednesday, May 11, 2016 10:44 AM
To: Keller, Jacob E <***@intel.com>
Cc: e1000-***@lists.sourceforge.net; linuxptp-***@lists.sourceforge.net
Subject: Re: [E1000-devel] Need help with igb driver suspend crash issue

I added the dump_stack() to see the full flow, because the error print "PCIe link lost, device now detached" is part of the igb_rd32() API, which is called from many places. Are we not supposed to cancel the delayed work for igb_ptp_overflow_check() when the system goes to suspend (and reschedule it when the system resumes)?

Keller, Jacob E
2016-05-11 17:57:32 UTC
Post by Keller, Jacob E
I added the dump_stack() to see the full flow, because the error print
"PCIe link lost, device now detached" is part of the igb_rd32() API,
which is called from many places. Are we not supposed to cancel the
delayed work for igb_ptp_overflow_check() when the system goes to
suspend (and reschedule it when the system resumes)?
It is probably because we're triggering a surprise removal event, which
we shouldn't be doing. It likely happens because the suspend puts us
part way through this state. We need to disable the overflow check on
suspend and re-enable it at resume time, yep.
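
The ordering described here can be sketched in a small userspace model. Everything below is hypothetical (`fake_ptp_nic`, `fake_suspend`, `fake_resume` and friends are made-up names), and it only illustrates the idea; it is not the actual igb patch. In a real driver the two steps would more likely be done with the kernel's `cancel_delayed_work_sync()` in the suspend path and `schedule_delayed_work()` in the resume path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a NIC with a periodic PTP overflow check. */
struct fake_ptp_nic {
    bool in_d3hot;            /* device has been powered down */
    bool detached;            /* a register read hit a powered-down device */
    bool overflow_work_armed; /* periodic overflow check is scheduled */
};

/* Body of the periodic work: it reads the timer, which fails in D3hot. */
static void fake_overflow_check(struct fake_ptp_nic *nic)
{
    if (nic->in_d3hot)
        nic->detached = true;
}

/* Stand-in for the workqueue firing: runs the check only if it is armed. */
static void fake_run_pending_work(struct fake_ptp_nic *nic)
{
    assert(nic != NULL);
    if (nic->overflow_work_armed)
        fake_overflow_check(nic);
}

/* Suspend: optionally cancel the periodic work before powering down. */
static void fake_suspend(struct fake_ptp_nic *nic, bool cancel_work_first)
{
    if (cancel_work_first)
        nic->overflow_work_armed = false;
    nic->in_d3hot = true;
}

/* Resume: power the device back up, then re-arm the periodic work. */
static void fake_resume(struct fake_ptp_nic *nic)
{
    nic->in_d3hot = false;
    nic->overflow_work_armed = true;
}
```

With `cancel_work_first` set to false, a work item that fires after suspend marks the device detached, which is the failure reported in this thread. With it set to true, nothing touches the device until `fake_resume()` has re-armed the check.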

Thanks,
Jake
