From: Eric Joyner
Date: Tue, 24 Jan 2017 17:57:06 +0000
Subject: Re: intel 10gbe nic bug in 10.3 - no carrier
To: Daniel Genis, freebsd-stable@freebsd.org
In-Reply-To: <11f0e9e6-cfe7-1cc7-49a0-4bc42fd0f99a@gmx.de>

On Tue, Jan 10, 2017 at 2:38 AM Daniel Genis wrote:

> Hello everyone,
>
> we're trying to tackle a rare bug that is very hard to debug.
>
> Our 10.3-RELEASE servers can panic boot and subsequently can come up
> without network (2x - no carrier). We've seen this on 10.3-RELEASE-p0; we
> have never seen this before.
>
> root@storage ~ # pciconf -lv | grep -B3 network
> ix0@pci0:2:0:0: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
>     vendor = 'Intel Corporation'
>     device = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
>     class = network
> --
> ix1@pci0:2:0:1: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
>     vendor = 'Intel Corporation'
>     device = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
>     class = network
>
> Our network is configured as active/passive using lagg.
> (/etc/rc.conf):
>
> ifconfig_ix0="up"
> ifconfig_ix1="up"
> cloned_interfaces="lagg0"
> ifconfig_lagg0="laggproto failover laggport ix0 laggport ix1 10.1.2.31/16"
>
> After panic boot the network shows up like this:
>
> ix0: flags=8843 metric 0 mtu 1500
>         options=8407bb
>         ether 60:08:10:d0:4e:9f
>         nd6 options=29
>         media: (autoselect)
>         status: no carrier
> ix1: flags=8843 metric 0 mtu 1500
>         options=8407bb
>         ether 60:08:10:d0:4e:9f
>         nd6 options=29
>         media: (autoselect)
>         status: no carrier
>
> The network switch sees the connection as online. The LEDs of the NICs
> suggest the same: they see the network as online (the LEDs are on as in
> normal operation). Unplugging and replugging the network cable will get
> the network online. Shutting the port on the switch and re-enabling it
> will also get the network online. However, another reboot will return the
> machine to the no-carrier state.
>
> I've built various kernels trying to find where the regression is,
> without success. I tried porting the 10.2 NIC driver (2.8.3) to 10.3 and
> subsequently the lagg code as well. I ported NIC driver 3.1.14 from
> pfSense into 10.3-STABLE (2 December kernel) to no avail; porting the
> lagg code from 10.2 did not make any difference either. After rebooting
> with those kernels the server remains in the no-carrier state.
>
> We install our systems using mfsbsd for PXE boot. If I boot a machine
> which is in the "no carrier" state using the 10.3 PXE boot, both NICs
> come online. If I then boot from disk again, the machine returns to the
> "no carrier" state.
>
> Recently we got some new machines, so we can shuffle more around and
> also try to debug this. We base-installed it using mfsbsd 10.3 PXE and
> configured it like always. Here, interestingly, one of the two NICs
> entered the "no carrier" state while the other NIC remained operational.
> This remained persistent across reboots.
>
> The issue disappears after many reboots, but it's not conclusive.
> I've had 2 machines to experiment with.
>
> On one, the problem disappeared on its own after a reboot (kernel
> 10.3-STABLE git hash d99ba5c aka r299900(?)).
>
> On the other one I PXE-booted a 10.1 live environment and then
> subsequently booted into kernel 10.3-STABLE (git hash 3673260fc9 aka
> r308456(?)). But I don't think anything can be concluded from that. That
> was the machine which had both NICs online after booting into the 10.3
> PXE environment but subsequently returned to the no-carrier state when
> booting 10.3 from disk.
>
> We also tried many sysctl flags (and many reboots), but without success.
> For example: hw.ix.enable_msix=0 and hw.ix.enable_msi=0
>
> At the moment I have no spare/empty machine in this state; we will empty
> one machine, though, which currently has this state (but is in production
> right now).
> I don't know how to trigger this state manually, which doesn't help with
> debugging.
>
> I could link references where others report similar issues, for example
> https://www.reddit.com/r/PFSENSE/comments/45bcuq/10_gig_woes/
> Here they suggest that the new Intel NIC driver 3.1.14 fixes it. However,
> I was not able to resolve the state by booting into a kernel with this
> driver.
>
> If I can provide any additional information please do not hesitate to ask.
>
> Any tips and suggestions for debugging are most welcome!
>
> With kind regards,
>
> Daniel
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"

This is a late follow-up, but could you file this as a bug on
bugs.freebsd.org?

- Eric
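
For anyone gathering data for such a report, here is a minimal sketch (not from the thread itself) of how the "no carrier" state shown in the quoted ifconfig output could be detected from a script, for example to log it after each reboot. The function name no_carrier_ifaces is hypothetical, and the parsing assumes the usual ifconfig layout: the interface name at column 0 and an indented "status:" line.

```shell
#!/bin/sh
# Sketch only: prints the name of every interface whose ifconfig block
# contains "status: no carrier". It reads ifconfig-style text on stdin,
# so it can also be run against saved output from an affected machine.
# The function name is illustrative, not an existing tool.
no_carrier_ifaces() {
    awk '
        # A new interface block starts at column 0, e.g. "ix0: flags=...".
        /^[a-z0-9]+: / { sub(/:$/, "", $1); iface = $1 }
        # Indented status line belonging to the current block.
        /^[ \t]+status: no carrier/ { print iface }
    '
}
```

On a live system this might be used as "ifconfig | no_carrier_ifaces", or against a single port with "ifconfig ix0 | no_carrier_ifaces". It only reports what ifconfig prints, so it says nothing about why the link is down.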