From: Benjamin Rosenblum
Date: Mon, 26 Sep 2005 11:08:53 -0400
To: net@freebsd.org
Subject: Re: em(4) receive part wedging randomly at moderate load
Message-ID: <43380F05.3070005@benswebs.com>
In-Reply-To: <20050926142907.GI91328@cell.sick.ru>

The em driver itself is extremely buggy. Many people, myself included, are hitting major problems with this driver that are causing serious issues. I can't transfer any large files to my server because the em driver panics and drops the connection for 15-20 seconds. It's a real pain in the butt when this happens, too, because this is my primary network storage server. I have had to resort to the backup systems lately because of this problem.
I think the entire em network driver needs to be reworked, and all these bugs really need to be fixed, since this is one of the top three or so network cards used in the field today for gigabit transfer.

Gleb Smirnoff wrote:

>  Colleagues,
>
>  during last month we are experiencing a nasty problem with em(4)
>driver. Several times a day the receive path of the driver wedges
>for a minute or two. During wedge the transmit part works with
>no problems. The latter fact makes this problem very nasty, because
>the problematic router can't be backed up with help of CARP.
>
>Some details: during the wedge all incoming packets are lost and
>counted as "Missed packets". I've checked this using
>`sysctl dev.em.0.stats=1`. The `dmesg` output is the following:
>
>em0: Excessive collisions = 0
>em0: Symbol errors = 0
>em0: Sequence errors = 0
>em0: Defer count = 0
>em0: Missed Packets = 1266
>em0: Receive No Buffers = 220
>em0: Receive length errors = 0
>em0: Receive errors = 0
>em0: Crc errors = 0
>em0: Alignment errors = 0
>em0: Carrier extension errors = 0
>em0: XON Rcvd = 0
>em0: XON Xmtd = 0
>em0: XOFF Rcvd = 0
>em0: XOFF Xmtd = 0
>em0: Good Packets Rcvd = 28347789
>em0: Good Packets Xmtd = 30911959
>
>There is a clear evidence that command `sysctl dev.em.0.stats=1` itself
>can trigger the wedge. It is important, that the stats are printed
>to a 9600 baud serial console, and this takes about a second. I have
>suspicion, that the wedge happens when kernel doesn't service NIC
>interrupts for some period of time. Yes, some packets should be lost in
>this case, but the wedge must not continue for minutes!
>
>The box is serving 8 - 15 kpps, 70 - 100 MBps. It runs stateful pf(4)
>firewall, with 50k - 80k states. The IP fastforwarding is enabled. The
>average state insert/removal ratio is 300 states per second, however
>sometimes several thousands of states can be removed in one pass.
>The state removal locks the network code for quite a long time, so I guess
>that wedge happens exactly when a lot of states are removed. The NIC
>interrupts aren't serviced for some time and it wedges.
>
>The hardware is Supermicro server, with two onboard NICs:
>
>dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
>dev.em.1.%pnpinfo: vendor=0x8086 device=0x1076 subvendor=0x8086 subdevice=0x1076 class=0x020000
>
>The NIC is plugged in Cisco Catalyst 6509 gigabit ethernet port. No
>errors are counted on switch port.
>
>To workaround the problem, I have made the following patch:
>
>@@ -1650,12 +1651,18 @@
> 	struct ifnet   *ifp;
> 	struct adapter * adapter = arg;
> 	ifp = adapter->ifp;
>+	uint64_t ompc;
>
> 	EM_LOCK(adapter);
>
> 	em_check_for_link(&adapter->hw);
> 	em_print_link_status(adapter);
>-	em_update_stats_counters(adapter);
>+	ompc = adapter->stats.mpc;
>+	em_update_stats_counters(adapter);
>+	if (adapter->stats.mpc > ompc) {
>+		printf("em watchdog: mpc %lld->%lld\n", ompc, adapter->stats.mpc);
>+		em_init_locked(adapter);
>+	}
> 	if (em_display_debug_stats && ifp->if_drv_flags & IFF_DRV_RUNNING) {
> 		em_print_hw_stats(adapter);
> 	}
>
>It helps to reduce downtime from few minutes to 2 seconds, but this
>is very dirty approach to the problem. Sample prints during runtime
>with patch:
>
>em watchdog: mpc 1767->2739
>em watchdog: mpc 2739->4724
>em watchdog: mpc 4724->7794
>em watchdog: mpc 7794->10729
>
>Every time this is printed, the network wedges for 2 seconds and then
>it revives.
>
>I am asking developers, who work in Intel, to pay attention to this problem.
>From my side I can offer any help in testing and debugging.
>
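
The check in the quoted patch boils down to comparing the cumulative missed-packet counter between two timer ticks and reinitializing the interface when it has grown. Below is a minimal user-space sketch of just that comparison, not driver code. `struct mpc_watch` and `mpc_watch_tick()` are hypothetical names that do not exist in em(4), and the sketch prints the 64-bit counters with `PRIu64` instead of `%lld`, which is an assumption about what is portable rather than something the patch does.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-adapter state; not the driver's struct adapter. */
struct mpc_watch {
	uint64_t last_mpc;	/* cumulative missed-packet count at the previous tick */
	unsigned resets;	/* number of times a reinit would have been requested */
};

/*
 * Feed the current cumulative missed-packet count once per timer tick.
 * Returns true when the count grew since the previous tick, which is the
 * condition under which the quoted patch calls em_init_locked().
 */
static bool
mpc_watch_tick(struct mpc_watch *w, uint64_t mpc_now)
{
	bool grew;

	assert(w != NULL);
	grew = mpc_now > w->last_mpc;
	if (grew) {
		printf("em watchdog: mpc %" PRIu64 "->%" PRIu64 "\n",
		    w->last_mpc, mpc_now);
		w->resets++;
	}
	w->last_mpc = mpc_now;	/* remember the count for the next tick */
	return grew;
}
```

A caller would reinitialize the interface whenever `mpc_watch_tick()` returns true. As the original post says, this only shortens the outage and does not explain why the receive path wedges in the first place.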