From: "David DeSimone"
Date: Fri, 30 Jul 2010 13:10:35 -0500
Mail-Followup-To: freebsd-net@freebsd.org
Message-ID: <20100730181035.GI5168@verio.net>
Subject: Kernel (7.3) crash due to mbuf leak?

After upgrading a couple of our systems from 7.2-RELEASE to 7.3-RELEASE,
we have started to see them running out of mbufs and crashing every
month or so. The panic string is:

    kmem_malloc(16384): kmem_map too small: 335233024 total allocated

The actual panic signature (backtrace) shows a memory allocation failure
occurring in the filesystem code, but I do not think that is where the
problem lies. Instead, it is clear to me that the system is slowly
leaking mbufs until there is no more kernel memory available, and the
filesystem is just the innocent bystander asking for memory and failing
to get it.

Here is the netstat -m output from three of the crash dumps:

    fs0# netstat -m -M vmcore.0
    882167/2902/885069 mbufs in use (current/cache/total)
    351/2041/2392/25600 mbuf clusters in use (current/cache/total/max)
    351/1569 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/199/199/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
    221249K/5603K/226853K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    fs0# netstat -m -M vmcore.1
    894317/2905/897222 mbufs in use (current/cache/total)
    345/2013/2358/25600 mbuf clusters in use (current/cache/total/max)
    350/1358 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/263/263/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
    224274K/5804K/230078K bytes allocated to network (current/cache/total)
    0/1/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    fs1# netstat -m -M vmcore.0
    857844/2890/860734 mbufs in use (current/cache/total)
    317/2139/2456/25600 mbuf clusters in use (current/cache/total/max)
    350/1603 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/263/263/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
    215098K/6052K/221151K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

I also note that my currently running systems are both well on their way
to crashing again:

    fs0# netstat -m
    766618/2927/769545 mbufs in use (current/cache/total)
    276/2560/2836/25600 mbuf clusters in use (current/cache/total/max)
    276/1772 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/550/550/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
    192207K/8051K/200259K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0/7/6656 sfbufs in use (current/peak/max)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    fs0# uptime
    1:00PM up 18 days, 13:52, 1 user, load averages: 0.00, 0.00, 0.00

    fs1# netstat -m
    126949/3356/130305 mbufs in use (current/cache/total)
    263/1917/2180/25600 mbuf clusters in use (current/cache/total/max)
    263/1785 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/295/295/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
    32263K/5853K/38116K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0/7/6656 sfbufs in use (current/peak/max)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    fs1# uptime
    1:00PM up 8 days, 17:23, 1 user, load averages: 0.00, 0.00, 0.00

Note that mbuf usage looks like a function of uptime, which is a classic
indication of a leak.

Can anyone give me some pointers as to how I can analyze these crash
dumps, or my running systems, to determine which network subsystem is
leaking these mbufs?

The services on these systems are extremely simple:

    SSH (though nobody logs in)
    sendmail
    qmail
    ntpd (client only)
    named (BIND)

Firewalling is performed by an uncomplicated PF policy. No special
network features are in use (no VLANs or such):

    em0: flags=8843 metric 0 mtu 1500
            options=19b
            ether 00:30:48:XX:XX:XX
            inet XXX.XXX.XXX.XX netmask 0xfffffff8 broadcast XXX.XXX.XXX.XX
            media: Ethernet autoselect (1000baseTX )
            status: active

What can I do to troubleshoot this problem? Is there any accounting
system built into the mbuf subsystem to help me with this?

-- 
David DeSimone == Network Admin == fox@verio.net
  "I don't like spinach, and I'm glad I don't, because if I liked it
   I'd eat it, and I just hate it."
   -- Clarence Darrow
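
To put a number on the "function of uptime" observation, the running-system figures quoted in the message can be turned into a rough per-day growth estimate. The following is a minimal sketch in Python, not a tool from the original post: the function names are hypothetical, the regular expressions only handle the exact `netstat -m` and `uptime` line formats shown in the message, and it assumes the in-use mbuf count was negligible at boot, so the result is only an approximate average.

```python
import re


def parse_mbufs_in_use(netstat_output):
    """Return the 'current' figure from the 'mbufs in use' line of netstat -m."""
    m = re.search(r"^\s*(\d+)/(\d+)/(\d+) mbufs in use", netstat_output, re.MULTILINE)
    if m is None:
        raise ValueError("no 'mbufs in use' line found")
    return int(m.group(1))


def parse_uptime_days(uptime_output):
    """Convert 'up N days, HH:MM' (the format quoted in the post) to fractional days."""
    m = re.search(r"up (\d+) days?, +(\d+):(\d+)", uptime_output)
    if m is None:
        raise ValueError("unsupported uptime format")
    days, hours, minutes = (int(g) for g in m.groups())
    return days + hours / 24 + minutes / 1440


def mbufs_per_day(netstat_output, uptime_output):
    """Average growth in in-use mbufs per day, assuming roughly zero at boot."""
    return parse_mbufs_in_use(netstat_output) / parse_uptime_days(uptime_output)


# Figures copied from the running-system output quoted in the message.
samples = {
    "fs0": ("766618/2927/769545 mbufs in use (current/cache/total)",
            "1:00PM up 18 days, 13:52, 1 user, load averages: 0.00, 0.00, 0.00"),
    "fs1": ("126949/3356/130305 mbufs in use (current/cache/total)",
            "1:00PM up 8 days, 17:23, 1 user, load averages: 0.00, 0.00, 0.00"),
}

for host, (netstat_text, uptime_text) in samples.items():
    rate = mbufs_per_day(netstat_text, uptime_text)
    print(f"{host}: about {rate:,.0f} mbufs/day")
```

With these inputs the estimate comes out on the order of 41,000 mbufs per day for fs0 and 14,500 for fs1. A single sample per host cannot separate a steady leak from bursty growth, so logging `netstat -m` periodically and feeding successive readings through the same parsing would give a more trustworthy picture.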