From owner-freebsd-net@FreeBSD.ORG Tue Oct 28 01:50:30 2014
From: Yonghyeon PYUN
Date: Tue, 28 Oct 2014 10:50:20 +0900
To: Mason Loring Bliss
Subject: Re: Very bad Realtek problems
Message-ID: <20141028015020.GB1054@michelle.fasterthan.com>
Reply-To: pyunyh@gmail.com
References: <20141027195124.GI17150@blisses.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20141027195124.GI17150@blisses.org>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-net@freebsd.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
X-List-Received-Date: Tue, 28 Oct 2014 01:50:30 -0000

On Mon, Oct 27, 2014 at 03:51:24PM -0400, Mason Loring Bliss wrote:
> Hi, all.
>
> I've been having sporadic and serious problems with the Realtek gigabit
> interface built into my motherboard. Periodically, it just freezes up. I've
> tried several things to no avail: turning on DEVICE_POLLING, frobbing
> bootloader options and sysctl settings, etc.
>

[...]

> It's not clear what's happening. I have been capturing stats periodically
> with 'sysctl dev.re.0.stats=1', but that doesn't always show a problem.
> For instance, during one of the lock-ups last night, after a reboot, I got this:
>
> re0 statistics:
> Tx frames               : 171306
> Rx frames               : 20271
> Tx errors               : 0
> Rx errors               : 0
> Rx missed frames        : 0
> Rx frame alignment errs : 0
> Tx single collisions    : 0
> Tx multiple collisions  : 0
> Rx unicast frames       : 20271
> Rx broadcast frames     : 0
> Rx multicast frames     : 0
> Tx aborts               : 0
> Tx underruns            : 0
>
> After running overnight, with sporadic automated transfers:
>
> re0 statistics:
> Tx frames               : 4658945
> Rx frames               : 1258514
> Tx errors               : 0
> Rx errors               : 33
> Rx missed frames        : 0
> Rx frame alignment errs : 3591
> Tx single collisions    : 0
> Tx multiple collisions  : 0
> Rx unicast frames       : 1255880
> Rx broadcast frames     : 2411
> Rx multicast frames     : 223
> Tx aborts               : 0
> Tx underruns            : 0
>
> I was seeing the "Rx multicast frames" creep up each time I saw a freeze last
> night, which was confusing in that I'm not sure why there'd be any multicast
> traffic.

RealTek controllers have only a small number of H/W MAC counters, so
it's somewhat hard to guess what's happening there. But RX frame
alignment errors normally indicate a cabling issue or a speed/duplex
mismatch with the link partner. It's normal to see multicast frames
on a local LAN.

>
> Here's the card from dmesg, with MSI/X turned off:
>
> re0: port 0xe800-0xe8ff mem 0xfbfff000-0xfbffffff,0xfbff8000-0xfbffbfff irq 18 at device 0.0 on pci2
> re0: Chip rev. 0x2c000000
> re0: MAC rev. 0x00200000

It seems your controller is RTL8168E.

[...]

> In general I've been saying "ifconfig re0 down ; ifconfig re0 up" to kick the
> interface, but last night a friendly person from IRC mentioned that I could
> work around this by running a steady ping and frobbing mediatype when I see
> the pings fail. So, I've got this running:
>
>     while true
>     do
>         ping -c 1 -t 1 firewall > /dev/null 2>&1
>         if [ $? -ne 0 ]; then
>             date
>             echo "toggling re0"
>             echo
>             ifconfig re0 media 1000baseT mediaopt full-duplex,flowcontrol,master
>             ifconfig re0 media autoselect mediaopt flowcontrol
>             sleep 3
>         fi
>         sleep 1
>     done

Please don't manually set media types for 1000baseT. It will result
in speed/duplex mismatches and other issues. This is probably the
main reason why you see RX alignment errors. You should always stick
to auto-negotiation with 1000baseT (flow control can still be set,
though). Manual media configuration is meant for working around buggy
link partners.

>
> This has been noting failures sporadically throughout the day, but it's
> allowing traffic to continue moving, albeit with the occasional hiccough.
>
> This hardware has been running Debian for a couple years, and it's never had
> so much as a short hiccough, so I have confidence that the hardware is fine.
> It suggests that there's something the Linux driver is doing to handle this
> hardware that FreeBSD isn't doing. For a while I was dual-booting and I'd see
> errors with FreeBSD running that were't there under Debian.
>
> I'd started diving into the source, both Linux and FreeBSD, but I lack
> sufficient exposure to ethernet driver code to be able to get a high-level
> picture of what they're doing, and as such I haven't yet noticed any
> special-case or hardware glitch handling that we're missing, although I might
> find something eventually.
>

Data sheets for RealTek controllers are not publicly available. Linux
uses firmware for every RealTek controller. My vague guess is that the
firmware contains PHY DSP fixups, but I don't have any detailed
information about it.

> I'm struggling with finding a way to see what's actually happening with this.
> I've toggled MSI and MSI-X handling, I've turned down interrupt handling
> delays, I've tried both I/O and memory register transfers, although I'd not
> actually clear what's happening differently there. I've had polling variously
> enabled and disabled.
>
> One thing to note is that last night's horror while I was trying to move some
> back-up data was after rebooting from Windows. (Installed on a partition for
> gaming...) It made me wonder if we're not fully setting up some state on the
> card. I'd have what felt like a solid, glitchless week before that.
>

The vendor's Windows driver may access/program a large set of
registers unknown to re(4). Currently re(4) relies heavily on
power-on default settings, since detailed register configuration
information is not available. Some register settings made in Windows
can survive a warm boot. Does a cold boot after running Windows make
any difference for you?

> FWIW, I'm running 10.1-RC3 on this box and I've seen issues from early on
> while I was still running 10.0-RELEASE.
>
> Thanks in advance for clues. This is a showstopper for futher deployment for
> me, as I've got these Realtek on-board cards in several boxes, and while the
> media frobbing largely works, it's not something I can inflict on my users.
>

When you notice re(4) is locked up,
 - do the H/W MAC counters still increase?
 - are interrupts still generated for TX/RX?
 - could you narrow down which part of the MAC (TX or RX) is stuck?

Thanks.
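
For the interrupt question, a rough sketch along the following lines
might help. It only parses `vmstat -i` output and compares two samples
taken a few seconds apart. The helper names irq_total and
counter_moved are made up for illustration (they are not part of
re(4) or any FreeBSD tool), and the parsing assumes the usual
"irqN: device ... total rate" column layout.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Read `vmstat -i` output on stdin and print the running interrupt
# total for the device named in $1 (e.g. re0). The total is assumed
# to be the second-to-last column of the matching line.
irq_total() {
    awk -v dev="$1" '
        {
            for (i = 1; i < NF - 1; i++)
                if ($i == dev) { print $(NF - 1); exit }
        }
    '
}

# Succeed if the later sample ($2) is larger than the earlier one ($1).
counter_moved() {
    [ "$2" -gt "$1" ]
}

# Example use while the interface is wedged:
#   a=$(vmstat -i | irq_total re0)
#   sleep 5
#   b=$(vmstat -i | irq_total re0)
#   if counter_moved "$a" "$b"; then
#       echo "re0 interrupts still arriving"
#   else
#       echo "no new re0 interrupts"
#   fi
```

The same before/after comparison can be done by hand for the MAC
counters by running 'sysctl dev.re.0.stats=1' twice and comparing the
Tx frames and Rx frames lines.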