From owner-freebsd-stable@FreeBSD.ORG Fri Jun 4 17:15:09 2010
From: Pyun YongHyeon
Date: Fri, 4 Jun 2010 10:14:44 -0700
To: Nikolay Denev
Message-ID: <20100604171444.GA17648@michelle.cdnetworks.com>
References:
 <77DFF2E5-7A1E-4063-A852-2C7AD9BC3DD4@gmail.com>
 <201005240948.33555.jhb@freebsd.org>
 <20100524171210.GA1418@michelle.cdnetworks.com>
 <87BA8EDC-BE95-4C84-94CD-5CA12961708A@gmail.com>
 <20100604003502.GF13502@michelle.cdnetworks.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-stable@freebsd.org
Subject: Re: if_sge related panics
Reply-To: pyunyh@gmail.com
List-Id: Production branch of FreeBSD source code

On Fri, Jun 04, 2010 at 07:52:19AM +0300, Nikolay Denev wrote:
>
> On Jun 4, 2010, at 3:35 AM, Pyun YongHyeon wrote:
>
> > On Thu, Jun 03, 2010 at 09:29:20AM +0300, Nikolay Denev wrote:
> >> On May 24, 2010, at 8:12 PM, Pyun YongHyeon wrote:
> >>
> >>> On Mon, May 24, 2010 at 09:48:33AM -0400, John Baldwin wrote:
> >>>> On Monday 24 May 2010 6:35:01 am Nikolay Denev wrote:
> >>>>> On May 24, 2010, at 8:57 AM, Nikolay Denev wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Recently I started to experience an if_sge(4) related panic.
> >>>>>> It happens almost every time I try to download a torrent file, for example.
> >>>>>> Copying large files over NFS seems not to trigger it, but I haven't tested extensively.
> >>>>>>
> >>>>>> Here is the panic message:
> >>>>>>
> >>>>>> Fatal trap 12: page fault while in kernel mode
> >>>>>> cpuid = 0; apic id = 00
> >>>>>> fault virtual address   = 0x8
> >>>>>> fault code              = supervisor write data, page not present
> >>>>>> instruction pointer     = 0x20:0xffffffff80230413
> >>>>>> stack pointer           = 0x28:0xffffff80001e9280
> >>>>>> frame pointer           = 0x28:0xffffff80001e9510
> >>>>>> code segment            = base 0x0, limit 0xfffff, type 0x1b
> >>>>>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> >>>>>> processor eflags        = interrupt enabled, resume, IOPL = 0
> >>>>>> current process         = 12 (irq19: sge0)
> >>>>>> trap number             = 12
> >>>>>> panic: page fault
> >>>>>> cpuid = 0
> >>>>>> Uptime: 1d20h56m20s
> >>>>>> Cannot dump. Device not defined or unavailable
> >>>>>> Automatic reboot in 15 seconds - press a key on the console to abort
> >>>>>> Sleeping thread (tid 100039, pid 12) owns a non-sleepable lock
> >>>>>>
> >>>>>> My swap is on a zvol, so I don't have dump. I'll try to attach a disk on the eSATA port and dump there if needed.
> >>>>>
> >>>>> Here is some info from the crashdump:
> >>>>>
> >>>>> (kgdb) #0 doadump () at pcpu.h:223
> >>>>> #1 0xffffffff802fb149 in boot (howto=260)
> >>>>>     at /usr/src/sys/kern/kern_shutdown.c:416
> >>>>> #2 0xffffffff802fb57c in panic (fmt=0xffffffff8055d564 "%s")
> >>>>>     at /usr/src/sys/kern/kern_shutdown.c:590
> >>>>> #3 0xffffffff805055b8 in trap_fatal (frame=0xffffff000288a3e0, eva=Variable "eva" is not available.
> >>>>> )
> >>>>>     at /usr/src/sys/amd64/amd64/trap.c:777
> >>>>> #4 0xffffffff805059dc in trap_pfault (frame=0xffffff80001e91d0, usermode=0)
> >>>>>     at /usr/src/sys/amd64/amd64/trap.c:693
> >>>>> #5 0xffffffff805061c5 in trap (frame=0xffffff80001e91d0)
> >>>>>     at /usr/src/sys/amd64/amd64/trap.c:451
> >>>>> #6 0xffffffff804eb977 in calltrap ()
> >>>>>     at /usr/src/sys/amd64/amd64/exception.S:223
> >>>>> #7 0xffffffff80230413 in sge_start_locked (ifp=0xffffff000270d800)
> >>>>>     at /usr/src/sys/dev/sge/if_sge.c:1591
> >>>>
> >>>> Try this. sge_encap() can sometimes return an error with m_head set to NULL:
> >>>>
> >>>
> >>> Thanks John. Committed in r208512.
> >>>
> >>>> Index: if_sge.c
> >>>> ===================================================================
> >>>> --- if_sge.c (revision 208375)
> >>>> +++ if_sge.c (working copy)
> >>>> @@ -1588,7 +1588,8 @@
> >>>>          if (m_head == NULL)
> >>>>              break;
> >>>>          if (sge_encap(sc, &m_head)) {
> >>>> -            IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
> >>>> +            if (m_head != NULL)
> >>>> +                IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
> >>>>              ifp->if_drv_flags |= IFF_DRV_OACTIVE;
> >>>>              break;
> >>>>          }
> >>>>
> >>>> --
> >>>> John Baldwin
> >>
> >> After the patch I experienced several network outages (ping reporting "no buffer space available")
> >> that were resolved by ifconfig down/up of the sge(4) interface.
> >>
> >
> > Because I don't have access to sge(4) controllers I never had a chance
> > to run it. Does ping(8) generate "no buffer space available" when
> > the system is in idle state? Could you show me more information on
> > how you checked network outages?
> >
>
> It happened 4-5 times recently. I didn't do extensive investigation, but yes, ping
> returned "no buffer space avail" when I tried pinging from the machine itself.
> It was unreachable from other hosts on the network.
> I'm not sure what you mean by idle state, but there was a torrent client running
> on the machine, which printed errors about inability to reach peers.
>

If the system is under heavy TX load (e.g. a 64-byte UDP test), ping(8) may
show that message.

>
> >> I can see that most of the other drivers that handle XXX_encap() returning m_head pointing NULL, break when this condition
> >
> > Yes, most drivers written/touched by me behave like that.
> >
> >> is hit: i.e.:
> >>
> >> Index: if_sge.c
> >> ===================================================================
> >> --- if_sge.c (revision 208375)
> >> +++ if_sge.c (working copy)
> >> @@ -1588,7 +1588,8 @@
> >>          if (m_head == NULL)
> >>              break;
> >>          if (sge_encap(sc, &m_head)) {
> >> -            IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
> >> +            if (m_head == NULL)
> >> +                break;
> >>              IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
> >>              ifp->if_drv_flags |= IFF_DRV_OACTIVE;
> >>              break;
> >>          }
> >>
> >> But here in sge(4) we always set IFF_DRV_OACTIVE.
> >> Do you think this can be the source of the problem?
> >>
> >
> > A more correct way to set IFF_DRV_OACTIVE would be to check the number
> > of queued frames or just exit the transmit loop. If there are no
> > queued frames, IFF_DRV_OACTIVE would never be cleared, which in turn
> > causes ENOBUFS in ping(8). I think your change looks more reasonable
> > to me. Do you still see the same issue with the change you suggested?
>
> I'm running with this change for a day or something now without any issues.

Ok, committed in r208806. Thanks for the patch.
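
For readers of the archive, the behaviour discussed in this thread can be
sketched in plain userland C. This is only an illustration under stated
assumptions, not the if_sge.c code: struct pkt, struct ifq, mock_encap(),
ifq_enqueue(), ifq_dequeue(), ifq_prepend() and start_locked() are
hypothetical stand-ins for the mbuf, the interface send queue, sge_encap(),
the IFQ_DRV_* macros and sge_start_locked(). The loop follows the second
patch: when encapsulation fails and has consumed the packet (m_head is NULL),
it just leaves the loop and does not set the "output active" flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland stand-ins; none of this is the real kernel API. */
#define OACTIVE 0x1              /* plays the role of IFF_DRV_OACTIVE */
#define QLEN    8

struct pkt { int id; };          /* plays the role of struct mbuf */

struct ifq {                     /* plays the role of ifp->if_snd plus flags */
    struct pkt *slot[QLEN];
    int head, count;
    int drv_flags;
};

enum encap_result { ENCAP_OK, ENCAP_RING_FULL, ENCAP_DROPPED };

/* Append a packet at the tail of the queue. */
void ifq_enqueue(struct ifq *q, struct pkt *p)
{
    assert(q->count < QLEN);
    q->slot[(q->head + q->count) % QLEN] = p;
    q->count++;
}

/* Remove the packet at the head; NULL when the queue is empty. */
static struct pkt *ifq_dequeue(struct ifq *q)
{
    struct pkt *p;

    if (q->count == 0)
        return NULL;
    p = q->slot[q->head];
    q->head = (q->head + 1) % QLEN;
    q->count--;
    return p;
}

/* Put a packet back at the head, as IFQ_DRV_PREPEND does in a driver. */
static void ifq_prepend(struct ifq *q, struct pkt *p)
{
    assert(p != NULL && q->count < QLEN);
    q->head = (q->head + QLEN - 1) % QLEN;
    q->slot[q->head] = p;
    q->count++;
}

/*
 * Mock encapsulation. ENCAP_RING_FULL fails and leaves *m_head intact;
 * ENCAP_DROPPED fails and sets *m_head to NULL, imitating a driver that
 * freed the packet (here the caller owns the storage, so nothing is freed).
 */
static int mock_encap(struct pkt **m_head, enum encap_result behaviour)
{
    if (behaviour == ENCAP_RING_FULL)
        return 1;
    if (behaviour == ENCAP_DROPPED) {
        *m_head = NULL;
        return 1;
    }
    return 0;
}

/* Transmit loop in the shape of the second patch; returns packets sent. */
int start_locked(struct ifq *q, enum encap_result behaviour)
{
    int sent = 0;

    for (;;) {
        struct pkt *m_head = ifq_dequeue(q);

        if (m_head == NULL)
            break;
        if (mock_encap(&m_head, behaviour) != 0) {
            if (m_head == NULL)
                break;           /* packet gone: no prepend, no OACTIVE */
            ifq_prepend(q, m_head);
            q->drv_flags |= OACTIVE;
            break;
        }
        sent++;
    }
    return sent;
}
```

With the first patch the flag would also be set in the ENCAP_DROPPED case,
and, as noted above, nothing would clear it if no frames were queued in the
hardware. In this sketch only the ring-full case marks the queue as active.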