Date: Mon, 26 Sep 2005 18:12:31 +0300
From: Petri Helenius <pete@he.iki.fi>
To: Benjamin Rosenblum <ben@benswebs.com>
Cc: net@freebsd.org
Subject: Re: em(4) receive part wedging randomly at moderate load
Message-ID: <43380FDF.90706@he.iki.fi>
In-Reply-To: <43380F05.3070005@benswebs.com>
References: <20050926142907.GI91328@cell.sick.ru> <43380F05.3070005@benswebs.com>

Benjamin Rosenblum wrote:

> The em driver itself is extremely buggy. Many people, myself
> included, are hitting major problems with this driver that are
> causing serious issues. I can't transfer any large files to my
> server because the em driver panics and drops the connection for 15-20
> seconds. It's a real pain in the butt when this happens, too, because
> this is my primary network storage server. I have had to resort to the
> backup systems lately because of this problem. I think the entire em
> network driver needs to be reworked and all these bugs really need to
> be taken care of, since this is one of the top three or so network
> cards used in the field today for gigabit transfer.
>

Does anyone have the programming data for the chipsets so the driver
could be taken further? I've been unable to obtain it from Intel
despite repeated attempts.

Pete

> Gleb Smirnoff wrote:
>
>> Colleagues,
>>
>> during the last month we have been experiencing a nasty problem with
>> the em(4) driver. Several times a day the receive path of the driver
>> wedges for a minute or two. During the wedge the transmit part works
>> with no problems. The latter fact makes this problem very nasty,
>> because the problematic router can't be backed up with the help of
>> CARP.
>>
>> Some details: during the wedge all incoming packets are lost and
>> counted as "Missed packets". I've checked this using
>> `sysctl dev.em.0.stats=1`.
>> The `dmesg` output is the following:
>>
>> em0: Excessive collisions = 0
>> em0: Symbol errors = 0
>> em0: Sequence errors = 0
>> em0: Defer count = 0
>> em0: Missed Packets = 1266
>> em0: Receive No Buffers = 220
>> em0: Receive length errors = 0
>> em0: Receive errors = 0
>> em0: Crc errors = 0
>> em0: Alignment errors = 0
>> em0: Carrier extension errors = 0
>> em0: XON Rcvd = 0
>> em0: XON Xmtd = 0
>> em0: XOFF Rcvd = 0
>> em0: XOFF Xmtd = 0
>> em0: Good Packets Rcvd = 28347789
>> em0: Good Packets Xmtd = 30911959
>>
>> There is clear evidence that the command `sysctl dev.em.0.stats=1`
>> itself can trigger the wedge. It is important that the stats are
>> printed to a 9600 baud serial console, and this takes about a second.
>> I suspect that the wedge happens when the kernel doesn't service NIC
>> interrupts for some period of time. Yes, some packets should be lost
>> in this case, but the wedge must not continue for minutes!
>>
>> The box is serving 8 - 15 kpps, 70 - 100 MBps. It runs a stateful
>> pf(4) firewall with 50k - 80k states. IP fastforwarding is enabled.
>> The average state insert/removal rate is 300 states per second;
>> however, sometimes several thousand states can be removed in one
>> pass. The state removal locks the network code for quite a long time,
>> so I guess that the wedge happens exactly when a lot of states are
>> removed. The NIC interrupts aren't serviced for some time and it
>> wedges.
>>
>> The hardware is a Supermicro server with two onboard NICs:
>> dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
>> dev.em.1.%pnpinfo: vendor=0x8086 device=0x1076 subvendor=0x8086 subdevice=0x1076 class=0x020000
>>
>> The NIC is plugged into a Cisco Catalyst 6509 gigabit Ethernet port.
>> No errors are counted on the switch port.
>>
>> To work around the problem, I have made the following patch:
>>
>> @@ -1650,12 +1651,18 @@
>>         struct ifnet   *ifp;
>>         struct adapter * adapter = arg;
>>         ifp = adapter->ifp;
>> +       uint64_t        ompc;
>>
>>         EM_LOCK(adapter);
>>
>>         em_check_for_link(&adapter->hw);
>>         em_print_link_status(adapter);
>> -       em_update_stats_counters(adapter);
>> +       ompc = adapter->stats.mpc;
>> +       em_update_stats_counters(adapter);
>> +       if (adapter->stats.mpc > ompc) {
>> +               printf("em watchdog: mpc %lld->%lld\n", ompc, adapter->stats.mpc);
>> +               em_init_locked(adapter);
>> +       }
>>         if (em_display_debug_stats && ifp->if_drv_flags & IFF_DRV_RUNNING) {
>>                 em_print_hw_stats(adapter);
>>         }
>>
>> It helps to reduce the downtime from a few minutes to 2 seconds, but
>> this is a very dirty approach to the problem. Sample output at
>> runtime with the patch:
>>
>> em watchdog: mpc 1767->2739
>> em watchdog: mpc 2739->4724
>> em watchdog: mpc 4724->7794
>> em watchdog: mpc 7794->10729
>>
>> Every time this is printed, the network wedges for 2 seconds and then
>> it revives.
>>
>> I am asking the developers who work at Intel to pay attention to this
>> problem.
>>
>> From my side I can offer any help in testing and debugging.
>>
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
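
The check that the quoted patch adds to the driver's periodic timer can be sketched outside the kernel as below. This is only a minimal illustration, not driver code: `struct fake_nic`, `fake_update_stats()`, `fake_reinit()` and `fake_watchdog_tick()` are hypothetical stand-ins for `struct adapter`, `em_update_stats_counters()`, `em_init_locked()` and the patched timer routine, and the hardware counter is simulated by a plain field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's per-adapter state. */
struct fake_nic {
    uint64_t hw_mpc;        /* simulated hardware missed-packet total */
    uint64_t stats_mpc;     /* software copy, like adapter->stats.mpc */
    unsigned reinit_count;  /* number of reinitialisations so far */
};

/* Stand-in for em_update_stats_counters(): refresh the software copy. */
static void fake_update_stats(struct fake_nic *nic)
{
    nic->stats_mpc = nic->hw_mpc;
}

/* Stand-in for em_init_locked(): here it only counts the reset. */
static void fake_reinit(struct fake_nic *nic)
{
    nic->reinit_count++;
}

/*
 * One timer tick: remember the old missed-packet count, refresh the
 * statistics, and reinitialise if the count grew since the last tick.
 * Returns true when a reinitialisation was triggered.
 */
static bool fake_watchdog_tick(struct fake_nic *nic)
{
    uint64_t ompc;

    assert(nic != NULL);
    ompc = nic->stats_mpc;
    fake_update_stats(nic);
    if (nic->stats_mpc > ompc) {
        fake_reinit(nic);
        return true;
    }
    return false;
}
```

Note that in this form any growth of the missed-packet counter between two ticks causes a reset, including a handful of packets dropped in an ordinary burst, which is one reason the patch is described in the quoted message as a dirty workaround rather than a fix.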