Date:      Tue, 28 Oct 2014 10:50:20 +0900
From:      Yonghyeon PYUN <pyunyh@gmail.com>
To:        Mason Loring Bliss <mason@blisses.org>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Very bad Realtek problems
Message-ID:  <20141028015020.GB1054@michelle.fasterthan.com>
In-Reply-To: <20141027195124.GI17150@blisses.org>
References:  <20141027195124.GI17150@blisses.org>

On Mon, Oct 27, 2014 at 03:51:24PM -0400, Mason Loring Bliss wrote:
> Hi, all.
> 
> I've been having sporadic and serious problems with the Realtek gigabit
> interface built into my motherboard. Periodically, it just freezes up. I've
> tried several things to no avail: turning on DEVICE_POLLING, frobbing
> bootloader options and sysctl settings, etc.
> 

[...]

> It's not clear what's happening. I have been capturing stats periodically
> with 'sysctl dev.re.0.stats=1', but that doesn't always show a problem. For
> instance, during one of the lock-ups last night, after a reboot, I got this:
> 
> re0 statistics:
> Tx frames : 171306
> Rx frames : 20271
> Tx errors : 0
> Rx errors : 0
> Rx missed frames : 0
> Rx frame alignment errs : 0
> Tx single collisions : 0
> Tx multiple collisions : 0
> Rx unicast frames : 20271
> Rx broadcast frames : 0
> Rx multicast frames : 0
> Tx aborts : 0
> Tx underruns : 0
> 
> After running overnight, with sporadic automated transfers:
> 
> re0 statistics:
> Tx frames : 4658945
> Rx frames : 1258514
> Tx errors : 0
> Rx errors : 33
> Rx missed frames : 0
> Rx frame alignment errs : 3591
> Tx single collisions : 0
> Tx multiple collisions : 0
> Rx unicast frames : 1255880
> Rx broadcast frames : 2411
> Rx multicast frames : 223
> Tx aborts : 0
> Tx underruns : 0
> 
> I was seeing the "Rx multicast frames" creep up each time I saw a freeze last
> night, which was confusing in that I'm not sure why there'd be any multicast
> traffic.

RealTek controllers have only a small number of H/W MAC counters, so
it's somewhat hard to guess what's happening there.  But RX frame
alignment errors normally indicate a cabling issue or a speed/duplex
mismatch with the link partner.  It's normal to see multicast frames
on a local LAN.
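
A minimal sketch for checking what the link actually negotiated,
assuming the usual "media:" line that ifconfig(8) prints.  The
function name check_media and the "run" argument are made up for
illustration; nothing here is specific to re(4).

```shell
#!/bin/sh
# Sketch only: check_media is a hypothetical helper, not a system tool.
# It reads ifconfig output on stdin and comments on the "media:" line.
check_media() {
    awk '/media:/ {
             found = 1
             if ($0 ~ /half-duplex/)
                 print "half-duplex negotiated: possible duplex mismatch"
             else if ($0 !~ /1000baseT/)
                 print "link is not at 1000baseT"
             else
                 print "ok"
         }
         END { if (!found) print "no media line found" }'
}

# Only touch the real interface when asked to, e.g. "sh script run re0".
if [ "${1:-}" = "run" ]; then
    ifconfig "${2:-re0}" | check_media
fi
```

A half-duplex result on a gigabit port would point at the mismatch
described above rather than at the driver.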

> 
> Here's the card from dmesg, with MSI/X turned off:
> 
> re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe800-0xe8ff mem 0xfbfff000-0xfbffffff,0xfbff8000-0xfbffbfff irq 18 at device 0.0 on pci2
> re0: Chip rev. 0x2c000000
> re0: MAC rev. 0x00200000

It seems your controller is an RTL8168E.

[...]

> In general I've been saying "ifconfig re0 down ; ifconfig re0 up" to kick the
> interface, but last night a friendly person from IRC mentioned that I could
> work around this by running a steady ping and frobbing mediatype when I see
> the pings fail. So, I've got this running:
> 
> while true
> do
> ping -c 1 -t 1 firewall > /dev/null 2>&1
> if [ $? -ne 0 ]; then
>     date
>     echo "toggling re0"
>     echo
>     ifconfig re0 media 1000baseT mediaopt full-duplex,flowcontrol,master
>     ifconfig re0 media autoselect mediaopt flowcontrol              
>     sleep 3
> fi
> sleep 1
> done

Please don't manually set the media type for 1000baseT.  It will
result in speed/duplex mismatches and other issues.  This is probably
the main reason why you see RX alignment errors.  You should always
stick to auto-negotiation with 1000baseT (flow control can be set,
though).  Manual media configuration is only meant to work around
buggy link partners.
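
As a rough sketch, the quoted loop could keep the link on
auto-negotiation and fall back to the down/up kick mentioned earlier
in the quoted message instead of forcing 1000baseT.  IFACE, TARGET,
the helper names and the "run" argument are assumptions made for
illustration; this is a workaround, not a fix.

```shell
#!/bin/sh
# Sketch only: names below are illustrative, not part of any tool.
IFACE="${IFACE:-re0}"
TARGET="${TARGET:-firewall}"

# Succeeds if one ping with a 1 second timeout gets a reply.
link_alive() {
    ping -c 1 -t 1 "$1" > /dev/null 2>&1
}

# Kick the interface without changing its media settings.
kick_iface() {
    ifconfig "$1" down
    ifconfig "$1" up
}

# One pass of the watchdog: returns 1 if the interface was kicked.
watch_once() {
    if ! link_alive "$TARGET"; then
        date
        echo "kicking $IFACE"
        kick_iface "$IFACE"
        return 1
    fi
    return 0
}

# Start the loop only when invoked as "sh script run".
if [ "${1:-}" = "run" ]; then
    # Select auto-negotiation once; flow control is the only option set.
    ifconfig "$IFACE" media autoselect mediaopt flowcontrol
    while true; do
        watch_once || sleep 3
        sleep 1
    done
fi
```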

> 
> This has been noting failures sporadically throughout the day, but it's
> allowing traffic to continue moving, albeit with the occasional hiccough.
> 
> This hardware has been running Debian for a couple years, and it's never had
> so much as a short hiccough, so I have confidence that the hardware is fine.
> It suggests that there's something the Linux driver is doing to handle this
> hardware that FreeBSD isn't doing. For a while I was dual-booting and I'd see
> errors with FreeBSD running that weren't there under Debian.
> 
> I'd started diving into the source, both Linux and FreeBSD, but I lack
> sufficient exposure to ethernet driver code to be able to get a high-level
> picture of what they're doing, and as such I haven't yet noticed any special-
> case or hardware glitch handling that we're missing, although I might find
> something eventually.
> 

The data sheet for this RealTek controller is not publicly available.
Linux uses firmware for every RealTek controller.  My vague guess is
that it contains PHY DSP fixups, but I don't have any detailed
information about the firmware.

> I'm struggling with finding a way to see what's actually happening with this.
> I've toggled MSI and MSI-X handling, I've turned down interrupt handling
> delays, I've tried both I/O and memory register transfers, although I'm not
> actually clear what's happening differently there. I've had polling variously
> enabled and disabled.
> 
> One thing to note is that last night's horror while I was trying to move some
> back-up data was after rebooting from Windows. (Installed on a partition for
> gaming...) It made me wonder if we're not fully setting up some state on the
> card. I'd have what felt like a solid, glitchless week before that.
> 

The vendor's Windows driver may access/program a large set of
registers unknown to re(4).  Currently re(4) relies heavily on
power-on default settings since no detailed register documentation
is available.  Some register settings made in Windows can survive a
warm boot.
Does a cold boot after running Windows make any difference for you?

> FWIW, I'm running 10.1-RC3 on this box and I've seen issues from early on
> while I was still running 10.0-RELEASE.
> 
> Thanks in advance for clues. This is a showstopper for further deployment for
> me, as I've got these Realtek on-board cards in several boxes, and while the
> media frobbing largely works, it's not something I can inflict on my users.
> 

When you notice re(4) is locked up,
 - do the H/W MAC counters still increase?
 - do interrupts still get generated for TX/RX?
 - could you narrow down which part of the MAC (TX or RX) is stuck?
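
One possible way to collect this, sketched under some assumptions:
the script samples netstat(1) and vmstat(8) twice and reports which
values moved.  Note that netstat shows the driver's packet counters,
not the H/W MAC counters, so the sysctl dump from your mail is still
needed for the first question.  The helper names, the "run" argument
and the 5 second interval are made up for illustration, and the
column lookup assumes the usual netstat header names.

```shell
#!/bin/sh
# Sketch only: helper names and the interval are illustrative.
IFACE="${IFACE:-re0}"

# Read `netstat -n -I <if>` output on stdin and print "Ipkts Opkts"
# from the <Link#N> row, locating the columns by header name.
pkt_counts() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
         $3 ~ /^<Link/ { print $(col["Ipkts"]), $(col["Opkts"]); exit }'
}

# Read `vmstat -i` output on stdin and print the summed "total"
# column of every interrupt line that names the interface.
irq_total() {
    awk -v ifc="$1" 'index($0, ifc) { sum += $(NF - 1) }
                     END { print sum + 0 }'
}

# Compare two samples of one counter.
classify() {
    if [ "$2" -gt "$1" ]; then echo moving; else echo stuck; fi
}

# Print "rx tx irq" for the interface as one line.
sample() {
    set -- $(netstat -n -I "$IFACE" | pkt_counts)
    echo "$1 $2 $(vmstat -i | irq_total "$IFACE")"
}

# Start sampling only when invoked as "sh script run".
if [ "${1:-}" = "run" ]; then
    set -- $(sample); rx1=$1; tx1=$2; irq1=$3
    sleep 5
    set -- $(sample); rx2=$1; tx2=$2; irq2=$3
    echo "RX packets: $(classify "$rx1" "$rx2")"
    echo "TX packets: $(classify "$tx1" "$tx2")"
    echo "interrupts: $(classify "$irq1" "$irq2")"
    # Dump the H/W MAC counters as well (needs root).
    sysctl "dev.re.${IFACE#re}.stats=1"
fi
```

Keeping a ping running while it samples gives the TX side something
to send, so a "stuck" result is meaningful.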

Thanks.


