FreeBSD Mail Archives

Date:      Wed, 11 Feb 2015 17:55:17 +0100
From:      Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
To:        Jack Vogel <jfvogel@gmail.com>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>
Subject:   Re: igb(4) watchdog timeout, lagg(4) fails
Message-ID:  <54DB8975.2030001@omnilan.de>
In-Reply-To: <54B10432.8050909@omnilan.de>
References:  <54ACC6A2.1050400@omnilan.de> <54AE565D.50208@omnilan.de> <54AE5A6B.7040601@omnilan.de> <54AFA784.6020102@omnilan.de> <CAFOYbcn0F1QXajUZ2XOncSg8z9xjuCQtzC=Siteyrq%2BDkvAw-A@mail.gmail.com> <54B10432.8050909@omnilan.de>


[-- Attachment #1 --]
 Regarding Harald Schmalzbauer's message of 10.01.2015 11:51
(localtime):
>  Regarding Jack Vogel's message of 09.01.2015 18:46 (localtime):
>> The tuneable interrupt rate code is not mine, and looking at it I'm not
>> entirely
>> sure it works. Why are you focused on the interrupt rate anyway, do you have
>> some reason to tie it to the watchdog?
>>
>> You could turn AIM off (enable_aim) and see if that changed anything?
>>
>> It seems most the time problems show up they involve the use of lagg, if you
>> take it out of the mix does the problem go away?
…

> Is there a way to reset the interface without rebooting the machine? The
> watchdog doesn't really reset the device; it's in a non-operating state
> afterwards. I need to 'ifconfig down' it to bring lagg(4) back into an
> operational state.
> Some kind of D3/D0 state switch for a single address? kldunloading would
> destroy the remaining interface too…

I was able to narrow down the igb watchdog timeout problem a bit.
It only happens on NICs which take the PCH-PCIe route. NICs that are
connected to the CPU's PCIe root complex never show this issue.

Currently, I suffer from one unresponsive NIC, which shows the following
state:
dev.igb.1.%desc: Intel(R) PRO/1000 Network Connection version - 2.4.0
dev.igb.1.%driver: igb
dev.igb.1.%location: slot=0 function=0 handle=\_SB_.PCI0.PE70.S1F0
dev.igb.1.%pnpinfo: vendor=0x8086 device=0x10c9 subvendor=0x8086 subdevice=0xa03c class=0x020000
dev.igb.1.%parent: pci11
dev.igb.1.nvm: -1
dev.igb.1.enable_aim: 1
dev.igb.1.fc: 3
dev.igb.1.rx_processing_limit: 250
dev.igb.1.link_irq: 848
^^^^^^^^^^^^^^ 848???
dev.igb.1.dropped: 0
dev.igb.1.tx_dma_fail: 0
dev.igb.1.rx_overruns: 0
dev.igb.1.watchdog_timeouts: 414
dev.igb.1.device_control: 1488978497
dev.igb.1.rx_control: 67272738
dev.igb.1.interrupt_mask: 4
dev.igb.1.extended_int_mask: 2147483655
dev.igb.1.tx_buf_alloc: 0
dev.igb.1.rx_buf_alloc: 0
dev.igb.1.fc_high_water: 47488
dev.igb.1.fc_low_water: 47472
dev.igb.1.queue0.interrupt_rate: 0
dev.igb.1.queue0.txd_head: 0
dev.igb.1.queue0.txd_tail: 0
dev.igb.1.queue0.no_desc_avail: 2520
dev.igb.1.queue0.tx_packets: 43894
dev.igb.1.queue0.rxd_head: 0
dev.igb.1.queue0.rxd_tail: 0
dev.igb.1.queue0.rx_packets: 1918054
dev.igb.1.queue0.rx_bytes: 0
dev.igb.1.queue0.lro_queued: 0
dev.igb.1.queue0.lro_flushed: 0
dev.igb.1.queue1.interrupt_rate: 0
dev.igb.1.queue1.txd_head: 0
dev.igb.1.queue1.txd_tail: 0
dev.igb.1.queue1.no_desc_avail: 17
dev.igb.1.queue1.tx_packets: 36813
dev.igb.1.queue1.rxd_head: 0
dev.igb.1.queue1.rxd_tail: 0
dev.igb.1.queue1.rx_packets: 63738
dev.igb.1.queue1.rx_bytes: 0
dev.igb.1.queue1.lro_queued: 0
dev.igb.1.queue1.lro_flushed: 0
…
dev.igb.1.interrupts.asserts: 5890499
dev.igb.1.interrupts.rx_pkt_timer: 2103707
dev.igb.1.interrupts.rx_abs_timer: 0
dev.igb.1.interrupts.tx_pkt_timer: 0
dev.igb.1.interrupts.tx_abs_timer: 1983448
dev.igb.1.interrupts.tx_queue_empty: 50493
dev.igb.1.interrupts.tx_queue_min_thresh: 0
dev.igb.1.interrupts.rx_desc_min_thresh: 0
dev.igb.1.interrupts.rx_overrun: 0
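
Several leaves in the dump above (device_control, rx_control, interrupt_mask, extended_int_mask) are printed in decimal, which makes individual bits hard to read. The following is a minimal sketch that converts them to hexadecimal. It assumes these leaves are raw 32-bit register snapshots, which the names suggest but the dump itself does not confirm; the function name and the list of leaves are illustrative only.

```python
import re

# Leaf names come from the sysctl dump; treating them as raw 32-bit
# register snapshots is an assumption made for this sketch.
REGISTER_LIKE = ("device_control", "rx_control",
                 "interrupt_mask", "extended_int_mask")


def registers_as_hex(sysctl_text):
    """Return {leaf_name: '0x........'} for the register-like sysctl leaves."""
    result = {}
    for line in sysctl_text.splitlines():
        match = re.match(r"dev\.igb\.\d+\.(\w+):\s*(-?\d+)\s*$", line)
        if match and match.group(1) in REGISTER_LIKE:
            # Mask to 32 bits so a negative printout would still decode.
            value = int(match.group(2)) & 0xFFFFFFFF
            result[match.group(1)] = f"0x{value:08x}"
    return result


sample = """\
dev.igb.1.device_control: 1488978497
dev.igb.1.rx_control: 67272738
dev.igb.1.interrupt_mask: 4
dev.igb.1.extended_int_mask: 2147483655
"""

for name, value in registers_as_hex(sample).items():
    print(f"{name:20s} {value}")
```

The hexadecimal form can then be compared bit by bit against the controller's datasheet.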

The dev.igb.1.link_irq value doesn't really make sense, does it?

The rest isn't unusual; it just shows the watchdog timeout problem, because
of dev.igb.1.queue0.no_desc_avail, I guess.
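
To compare no_desc_avail across queues, the per-queue leaves can be pulled out of the sysctl output and set against tx_packets. This is only a sketch: the function name is made up, and the ratio is an ad-hoc indicator chosen for illustration, not a metric defined by the driver.

```python
import re
from collections import defaultdict


def per_queue_counters(sysctl_text, leaves=("no_desc_avail", "tx_packets")):
    """Collect selected per-queue counters as {queue_index: {leaf: value}}."""
    queues = defaultdict(dict)
    pattern = re.compile(r"dev\.igb\.\d+\.queue(\d+)\.(\w+):\s*(-?\d+)\s*$")
    for line in sysctl_text.splitlines():
        match = pattern.match(line)
        if match and match.group(2) in leaves:
            queues[int(match.group(1))][match.group(2)] = int(match.group(3))
    return dict(queues)


sample = """\
dev.igb.1.queue0.no_desc_avail: 2520
dev.igb.1.queue0.tx_packets: 43894
dev.igb.1.queue1.no_desc_avail: 17
dev.igb.1.queue1.tx_packets: 36813
"""

for queue, counters in sorted(per_queue_counters(sample).items()):
    # Ad-hoc indicator: how often a transmit found no free descriptor.
    ratio = counters["no_desc_avail"] / counters["tx_packets"]
    print(f"queue{queue}: no_desc_avail={counters['no_desc_avail']} "
          f"tx_packets={counters['tx_packets']} ratio={ratio:.4f}")
```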

I manually adjusted:
hw.igb.num_queues: 2
hw.igb.rx_process_limit: 250
hw.igb.rxd: 4096
hw.igb.txd: 4096

As mentioned, the NICs that don't go through PCH-PCIe don't show this
problem; I have tested this.

This is the regular timeout interval for the last 24h (~3 minutes):
Feb 11 17:26:53 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:26:53 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
Feb 11 17:26:53 vega kernel: igb1: TX(911600000) desc avail = 0,Next TX to Clean = 0
Feb 11 17:26:53 vega kernel: igb1: link state changed to DOWN
Feb 11 17:26:56 vega kernel: igb1: link state changed to UP
Feb 11 17:26:56 vega devd: Executing '/etc/rc.d/dhclient quietstart igb1'
Feb 11 17:30:10 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:30:10 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
Feb 11 17:30:10 vega kernel: igb1: TX(911600000) desc avail = 0,Next TX to Clean = 0
Feb 11 17:30:10 vega kernel: igb1: link state changed to DOWN
Feb 11 17:30:13 vega kernel: igb1: link state changed to UP
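
The roughly three-minute interval can be checked directly from the log. The sketch below computes the gap in seconds between consecutive "Watchdog timeout" lines. The function name is hypothetical, and the year is passed in as an assumption because syslog timestamps do not carry one.

```python
from datetime import datetime


def watchdog_intervals(log_text, year=2015):
    """Return seconds between consecutive 'Watchdog timeout' log lines."""
    stamps = []
    for line in log_text.splitlines():
        if "Watchdog timeout" in line:
            # The first 15 characters hold the syslog timestamp,
            # e.g. "Feb 11 17:26:53"; the year is supplied by the caller.
            stamp = datetime.strptime(f"{year} {line[:15]}",
                                      "%Y %b %d %H:%M:%S")
            stamps.append(stamp)
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


sample = """\
Feb 11 17:26:53 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:26:53 vega kernel: igb1: link state changed to DOWN
Feb 11 17:26:56 vega kernel: igb1: link state changed to UP
Feb 11 17:30:10 vega kernel: igb1: Watchdog timeout -- resetting
"""

print(watchdog_intervals(sample))
```

Fed with a full day of /var/log/messages, the same function would show whether the interval is really constant or drifts.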

But these resets don't bring the interface back into a working state :-(
"Queue" value remains constant, but "tdh" and "tdt" vary from time to
time, for example:
igb1: Queue(911600000) tdh = -337225283, hw tdt = 398180458
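
The negative tdh suggests the driver prints these values as signed 32-bit integers. Assuming tdh and tdt are the hardware transmit descriptor head and tail, and assuming they index a ring of hw.igb.txd = 4096 descriptors, a quick plausibility check could look like the sketch below. Both assumptions are inferred from the names and the tunables, not confirmed by the log, and the helper names are made up.

```python
def as_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF


def plausible_ring_index(value, ring_size=4096):
    """True if value could index a descriptor ring of ring_size entries.

    ring_size defaults to the hw.igb.txd setting from the message; that
    tdh/tdt are ring indices is an assumption of this sketch.
    """
    return 0 <= as_unsigned32(value) < ring_size


for label, raw in (("tdh", -337225283), ("tdt", 398180458)):
    print(label, hex(as_unsigned32(raw)), plausible_ring_index(raw))
```

Under these assumptions none of the logged values would be a valid index, which would point at bogus register reads rather than a merely full ring.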

Unfortunately, I don't know what they stand for. Is there an explanation
for people who can't simply look it up in the driver's code?
Any idea where the problem could be?

Thanks,

-Harry


[-- Attachment #2 --]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (FreeBSD)

iEYEARECAAYFAlTbiXUACgkQLDqVQ9VXb8jktgCgxQgBLy0fLL1lIRhwHEHcS6NA
OKUAoKE3Unzf0vukXjy7N/oJWA+h3KH1
=Rw5U
-----END PGP SIGNATURE-----

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?54DB8975.2030001>
