From owner-freebsd-net@FreeBSD.ORG Fri Apr 1 01:16:09 2011
From: Jack Vogel
To: Arnaud Lacombe
Cc: freebsd-net@freebsd.org
Date: Thu, 31 Mar 2011 18:16:08 -0700
Subject: Re: em(4) hang [Was: Re: igb(4) won't start with "igb0: Could not setup receive structures"]
List-Id: Networking and TCP/IP with FreeBSD

I know how I'm going to handle this, am formulating code for it, should have
something that can be tested tomorrow, time to head out for the night..

Essentially, rather than just looking for equality, I will calculate the number
of unrefreshed mbufs given the check/refresh values, and then call refresh when
anything is unrefreshed. This will happen in rxeof, but I will also put back the
rx interrupt trigger into local timer. I'm pretty sure this will be bullet proof,
at least for this kind of hang.

Jack

On Thu, Mar 31, 2011 at 5:28 PM, Jack Vogel wrote:

> You know what Arnaud, I've looked at the numbers again, and I suddenly saw
> that next_to_check and next_to_refresh are NOT in a good state, exactly the
> opposite, check is BEHIND refresh, which means the whole ring is empty, the
> HEAD (next_to_check) is pointing at 929, but next_to_refresh is at 930, RIGHT
> IN FRONT of it, so the whole ring is depleted!!
>
> What this means is that just a test of check == refresh is not going to be good
> enough to protect against all cases, so let me think about how to handle
> this...
>
> Jack
>
> On Thu, Mar 31, 2011 at 4:38 PM, Jack Vogel wrote:
>
>> My validation group has some kind of hang... happens when they use a certain number
>> of clients each running a stress test to the SUT, it's like this, no real handle on what's
>> wrong, if I knew what was wrong it would be half way or more to fixing it :)
>>
>> The evidence shows you have hit the max clusters at one point, but have freed most
>> of them back up again, there is no shortage right at this point. Your previous data
>> showed a normal idle head/tail relationship....
>>
>> Just as a data point, will you please disable msix, recompile and run in MSI mode,
>> I just want to see if that makes a difference.
>> Search in the driver for em_enable_msix and set it FALSE.
>>
>> Jack
>>
>> On Thu, Mar 31, 2011 at 4:06 PM, Arnaud Lacombe wrote:
>>
>>> Hi,
>>>
>>> On Thu, Mar 31, 2011 at 6:28 PM, Jack Vogel wrote:
>>> > OK, but those are not something present in this data, that was what I'm
>>> > asking.
>>> >
>>> > So, you have a hang for which we do not have a certain cause. What does
>>> > netstat -m show?
>>> >
>>> # netstat -m
>>> 3073/74927/78000 mbufs in use (current/cache/total)
>>> 3070/29698/32768/32768 mbuf clusters in use (current/cache/total/max)
>>> 0/383 mbuf+clusters out of packet secondary zone in use (current/cache)
>>> 0/12800/12800/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
>>> 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
>>> 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
>>> 6908K/129327K/136236K bytes allocated to network (current/cache/total)
>>> 0/1080/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>>> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>>> 0/7/6656 sfbufs in use (current/peak/max)
>>> 0 requests for sfbufs denied
>>> 0 requests for sfbufs delayed
>>> 0 requests for I/O initiated by sendfile
>>> 0 calls to protocol drain routines
>>>
>>> Note that the mbuf allocation denials did not happen all at once. The count
>>> has been progressively increasing in blocks of ~200 over the 5h of uptime
>>> of the machine, until the current condition occurred.
>>>
>>> I have previously been trying to simulate the depletion and the hang,
>>> but the driver recovered. I assume the condition is met in
>>> em_local_timer() to refresh the ring, I'd still need to check that.
>>>
>>> - Arnaud
>>>
>>> > Jack
>>> >
>>> > On Thu, Mar 31, 2011 at 3:15 PM, Arnaud Lacombe wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> On Thu, Mar 31, 2011 at 5:57 PM, Jack Vogel wrote:
>>> >> > So, what is the evidence that the driver is stuck here?
>>> >> >
>>> >> About 800 pps (mostly SYN) present on the wire but never ever seen on em0,
>>> >> plus a couple of ARP replies, which still never hit em0, plus the
>>> >> `missed_packets' count increasing by the same 800 pps in the last
>>> >> hour. Is that enough ?
>>> >>
>>> >> - Arnaud
>>> >>
>>> >> ps: I forgot to add that the MAC addresses on the wire are fine.
>>> >>
>>> >> > I see that next_to_check != next_to_refresh, which is why the
>>> >> > local timer won't schedule anything. OH, and I also realized there
>>> >> > is a problem with local_timer anyway, it will run rxeof, but that won't
>>> >> > help if you can't enter the loop, so I need to add some code at the top to
>>> >> > call em_refresh_mbufs() when in this state.
>>> >> >
>>> >> > On this interrupt cause that you are focused upon, although it's there in
>>> >> > the design, I had talked with some of our most seasoned developers on both
>>> >> > the Windows and Linux side of the house, and NO one has ever used this
>>> >> > 'feature', because (and I'm quoting here) "there's no good use case for
>>> >> > it". Meaning, there's always some simpler way of handling the issue.
>>> >> >
>>> >> > When you use MSIX you can't read causes btw, if you configured it, it
>>> >> > would mean you'd just get into the regular RX handler, same as always, so why
>>> >> > some special bother with this cause?
>>> >> >
>>> >> > On non-MSIX hardware there is just no particular reason to worry about
>>> >> > the cause either, we can just handle the RX situation in the interrupt
>>> >> > handler.
>>> >> >
>>> >> > Jack
>>> >> >
>>> >> > On Thu, Mar 31, 2011 at 2:09 PM, Arnaud Lacombe wrote:
>>> >> >>
>>> >> >> Hi Jack,
>>> >> >>
>>> >> >> On Thu, Mar 31, 2011 at 9:51 AM, Arnaud Lacombe <lacombar@gmail.com> wrote:
>>> >> >> > [...]
>>> >> >> > I'll remove part of the changes I made to keep only `rx_forced_refill'
>>> >> >> > and the associated sysctl, re-run the tests and come back with correct
>>> >> >> > value, hopefully in a few hours.
>>> >> >> >
>>> >> >> Here it is:
>>> >> >>
>>> >> >> # sysctl dev.em.0.%desc
>>> >> >> dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.2.2
>>> >> >>
>>> >> >> # sysctl dev.em.0.mac_stats.missed_packets
>>> >> >> dev.em.0.mac_stats.missed_packets: 917428
>>> >> >>
>>> >> >> # sysctl dev.em.0.debug=1
>>> >> >> dev.em.0.debug: I-1nterface is RUNNING and INACTIVE
>>> >> >> em0: hw tdh = 975, hw tdt = 975
>>> >> >> em0: hw rdh = 884, hw rdt = 885
>>> >> >> em0: Tx Queue Status = 0
>>> >> >> em0: TX descriptors avail = 1024
>>> >> >> em0: Tx Descriptors avail failure = 0
>>> >> >> em0: RX discarded packets = 0
>>> >> >> em0: RX Next to Check = 884
>>> >> >> em0: RX Next to Refresh = 885
>>> >> >> -> -1
>>> >> >>
>>> >> >> So the taskqueue cannot be scheduled to run and the driver is stuck.
>>> >> >>
>>> >> >> > On Wed, Mar 30, 2011 at 2:22 PM, Jack Vogel wrote:
>>> >> >> >> Read the code in HEAD, em_local_timer() has a test of ALL the rx queues
>>> >> >> >> and will schedule a task that refreshes mbufs if they are empty. This has
>>> >> >> >> exactly the same effect as checking for some interrupt cause, a cause that is
>>> >> >> >> not available when using MSIX on 82574, but this approach works for everything.
>>> >> >> >>
>>> >> >> Can you please point me to a reference datasheet (or errata), provided
>>> >> >> by Intel, about the RX Overrun interrupt not being available with
>>> >> >> MSI-X on the 82574 ?
>>> >> >>
>>> >> >> Currently, I only have access to [0], which specifies the following:
>>> >> >>
>>> >> >> 7.4 Interrupts
>>> >> >> 7.4.2 MSI-X Mode
>>> >> >> [...]
>>> >> >> The following configuration and parameters are involved:
>>> >> >> • The IVAR.INT_Alloc[4:0] entries map two Tx queues, two Rx queues and other
>>> >> >> events to 5 interrupt vectors
>>> >> >> • The ICR[24:20] bits reflect specific interrupt causes
>>> >> >> • Five MSI-X interrupt vectors are provided (calculated based on four vectors for
>>> >> >> queues and one vector for other causes). The requested number of vectors is
>>> >> >> loaded from the MSI_X_N fields in the EEPROM into the PCIe MSI-X capability
>>> >> >> structure of the function.
>>> >> >>
>>> >> >> 10.2.4.1 Interrupt Cause Read Register - ICR (0x000C0; RC/WC)
>>> >> >> [...]
>>> >> >>
>>> >> >> about bit 24:
>>> >> >>
>>> >> >> Other Interrupt. Indicates one of the following interrupts was set:
>>> >> >> • Link Status Change.
>>> >> >> • Receiver Overrun.
>>> >> >> • MDIO Access Complete.
>>> >> >> • Small Receive Packet Detected.
>>> >> >> • Receive ACK Frame Detected.
>>> >> >> • Manageability Event Detected.
>>> >> >>
>>> >> >> Thanks in advance,
>>> >> >> - Arnaud
>>> >> >>
>>> >> >> [0]: ftp://download.intel.com/design/network/datashts/82574.pdf
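
For readers following the thread, the check/refresh arithmetic discussed above can be sketched roughly as below. This is only an illustration and not the driver's code: rx_unrefreshed() and rx_needs_refresh() are hypothetical names, the indices are assumed to be smaller than the ring size, and equal indices are treated here as "nothing outstanding", which is the one case the thread says the existing check == refresh test in the local timer already covers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the em(4) driver): count the RX
 * slots that lie between next_to_refresh and next_to_check, walking
 * forward around the ring. Both indices are assumed to be < ring_size.
 */
static inline uint32_t
rx_unrefreshed(uint32_t next_to_check, uint32_t next_to_refresh,
    uint32_t ring_size)
{
	/* Adding ring_size first keeps the subtraction from going negative. */
	return ((next_to_check + ring_size - next_to_refresh) % ring_size);
}

/*
 * Hypothetical policy: ask for a refresh whenever the distance is
 * non-zero, instead of only when the two indices are equal.
 */
static inline bool
rx_needs_refresh(uint32_t next_to_check, uint32_t next_to_refresh,
    uint32_t ring_size)
{
	return (rx_unrefreshed(next_to_check, next_to_refresh, ring_size) != 0);
}
```

With the values from the debug output (Next to Check = 884, Next to Refresh = 885) and an assumed ring of 1024 RX descriptors (the output only reports the TX descriptor count), rx_unrefreshed() returns 1023, so nearly the whole ring counts as unrefreshed even though the two indices are not equal.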