From owner-freebsd-pf@FreeBSD.ORG Tue May 3 05:06:41 2011 Return-Path: Delivered-To: freebsd-pf@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 811A4106564A; Tue, 3 May 2011 05:06:41 +0000 (UTC) (envelope-from jhellenthal@gmail.com) Received: from mail-iy0-f182.google.com (mail-iy0-f182.google.com [209.85.210.182]) by mx1.freebsd.org (Postfix) with ESMTP id 282878FC08; Tue, 3 May 2011 05:06:40 +0000 (UTC) Received: by iyj12 with SMTP id 12so7359715iyj.13 for ; Mon, 02 May 2011 22:06:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:sender:date:from:to:cc:subject:message-id :references:mime-version:content-type:content-disposition :in-reply-to:x-openpgp-key-id:x-openpgp-key-fingerprint :x-openpgp-key-url; bh=neU4BmS9TB85aasLW7PY2eL/gr8SmoXox+uWqdd219o=; b=UFMfL3ZwWcOm4C4KIVQMmCYL9E24Mtg9xm+1mroVu3yHONU6Gh376zyAa7NkT8Av3I VQ2cosaeD/tBMgwZYFdzKQUc70dPf+4Wi5MkwHFr8avLJ9hco9L+qwMyy3kd22ZCEnI1 ndl9tfQHAuU9wvQQwPYnO+F7wH+BFOZ2PeeVU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:x-openpgp-key-id :x-openpgp-key-fingerprint:x-openpgp-key-url; b=KoeOo1hQYNSzv1kFi6NZFI/gtphI38eqsxvpfxXprI03NSYf8norXJmq2SoY6U8Spw hbneoKdZpilfCQ6GidEFKHZFY27RGwN5HiLccoUToxHQFlMjoz/+9ODRep0ZgXNfu+ji /cyEKbJ7d94ZdLhZOx4JIXGXSRLdSwJ8SnQWI= Received: by 10.42.115.138 with SMTP id k10mr1900703icq.81.1304399200506; Mon, 02 May 2011 22:06:40 -0700 (PDT) Received: from DataIX.net ([99.190.84.116]) by mx.google.com with ESMTPS id vr5sm2395488icb.12.2011.05.02.22.06.37 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 02 May 2011 22:06:38 -0700 (PDT) Sender: "J. Hellenthal" Received: from DataIX.net (localhost [127.0.0.1]) by DataIX.net (8.14.4/8.14.4) with ESMTP id p4356YJY018800 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 3 May 2011 01:06:35 -0400 (EDT) (envelope-from jhell@DataIX.net) Received: (from jhell@localhost) by DataIX.net (8.14.4/8.14.4/Submit) id p4356YPf018799; Tue, 3 May 2011 01:06:34 -0400 (EDT) (envelope-from jhell@DataIX.net) Date: Tue, 3 May 2011 01:06:34 -0400 From: Jason Hellenthal To: Jeremy Chadwick Message-ID: <20110503050634.GB53414@DataIX.net> References: <20110503015854.GA31444@icarus.home.lan> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="DBIVS5p969aUjpLe" Content-Disposition: inline In-Reply-To: <20110503015854.GA31444@icarus.home.lan> X-OpenPGP-Key-Id: 0x89D8547E X-OpenPGP-Key-Fingerprint: 85EF E26B 07BB 3777 76BE B12A 9057 8789 89D8 547E X-OpenPGP-Key-URL: http://bit.ly/0x89D8547E Cc: freebsd-stable@freebsd.org, freebsd-pf@freebsd.org Subject: Re: RELENG_8 pf stack issue (state count spiraling out of control) X-BeenThere: freebsd-pf@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Technical discussion and general questions about packet filter \(pf\)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 May 2011 05:06:41 -0000 --DBIVS5p969aUjpLe Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Jeremy, On Mon, May 02, 2011 at 06:58:54PM -0700, Jeremy Chadwick wrote: >(Please keep me CC'd as I'm not subscribed to freebsd-pf. And apologies >for cross-posting, but the issue is severe enough that I wanted to make >it known on -stable) > >The below issue I'm describing is from a machine running 8.2-PRERELEASE >(RELENG_8) using src dated Tue Feb 15 05:46:02 PST 2011. > >Please read the story in full, as I have taken the time to describe >everything I did, plus log output, as well as induce a panic via "call >doadump" from ddb so I have a capture of the system at the time. I also >have a theory as to what caused the problem, but how to trigger it is >unknown; it may be a rare race condition. > > >This morning I woke up to find a report from one of our users that he >could not connect to a specific TCP port (not SSH) on one of our >servers. I also found that I couldn't SSH into the same box. Serial >console was working fine, and the serial console log showed no sign of >any problems. > >I started to debug the issue of me not being able to SSH into the >machine and within a few minutes became immediately concerned: pfctl >indicated we had reached the maximum number permitted state table >entries (10,000). > >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D ># pfctl -s info >Status: Enabled for 76 days 06:49:10 Debug: Urgent > >Interface Stats for em0 IPv4 IPv6 > Bytes In 8969748840 0 > Bytes Out 8296135477 0 > Packets In > Passed 128211763 0 > Blocked 621379 0 > Packets Out > Passed 138483868 0 > Blocked 2579 0 > >State Table Total Rate > current entries 10000 > searches 267316807 40.6/s > inserts 4440553 0.7/s > removals 4430553 0.7/s >Counters > match 5067474 0.8/s > bad-offset 0 0.0/s > fragment 324 0.0/s > short 0 0.0/s > normalize 32 0.0/s > memory 336946 0.1/s > bad-timestamp 0 0.0/s > congestion 0 0.0/s > ip-option 0 0.0/s > proto-cksum 1611 0.0/s > state-mismatch 509 0.0/s > state-insert 0 0.0/s > state-limit 0 0.0/s > src-limit 0 0.0/s > synproxy 0 0.0/s > ># pfctl -s memory >states hard limit 10000 >src-nodes hard limit 10000 >frags hard limit 5000 >tables hard limit 1000 >table-entries hard limit 100000 >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > >The above is mainly for em0 (our WAN interface); our LAN interface (em1) >was not impacted because we use "set skip on em1". And it's a good >thing too: we have lots of LAN-based services that this machine provides >that would have been impacted. We also use "set skip on lo0". > >I immediately went to look at our monitoring graphs, which monitor pf >state (specifically state table entries), polled via bsnmpd(8). This >data is even more frightening: > >http://jdc.parodius.com/freebsd/pf-issue/pf_states-day.png >http://jdc.parodius.com/freebsd/pf-issue/pf_states-week.png > >Literally something was spiraling out of control, starting at approx. >2011/05/01 (Sun) at 12:30 PDT. The situation became dire at approx. >19:45 PDT the same day, but I wasn't aware of it until said user brought >an issue to my attention. > >You can see from the network I/O graphs (taken from SNMP polling our >switch, NOT from the host/box itself) that there was no DoS attack or >anything like that occurring -- this was something within FreeBSD >itself. More evidence of that will become apparent. > >http://jdc.parodius.com/freebsd/pf-issue/port_03-day.png >http://jdc.parodius.com/freebsd/pf-issue/port_03-week.png > >The first thing I did was "/etc/rc.d/pf reload". This command hung. >Any attempt to send Ctrl-C/SIGINT did nothing. I was able to >Ctrl-Z/SIGSTOP it, then use kill %1, but the actual reload process did >not truly die (despite csh stating "Terminated"). The only way to kill >it was to kill -9. > >Attempts to shut down any daemons which utilised the network -- >including things that only used em1 -- would not shut down. This >included things like postfix, mysqld, and some inet-based services. I >was forced to kill -9 them. Things like bsnmpd, however, did shut down. > >Equally as uncomfortable, "shutdown -r now" did not reboot the system. >That is to say, wall(1)'s announcement was shown, but the actual >stopping of services did not begin. > >The next thing I tried was "/etc/rc.d/pf stop", which worked. Then I >did "/etc/rc.d/pf start", which also worked. However, what I saw next >surely indicated a bug in the pf layer somewhere -- "pfctl -s states" >and "pfctl -s info" disagreed on the state count: > >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D ># pfctl -s info >Status: Enabled for 0 days 00:00:16 Debug: Urgent > >Interface Stats for em0 IPv4 IPv6 > Bytes In 3459 0 > Bytes Out 0 0 > Packets In > Passed 0 0 > Blocked 29 0 > Packets Out > Passed 0 0 > Blocked 0 0 > >State Table Total Rate > current entries 10000 > searches 29 1.8/s > inserts 0 0.0/s > removals 0 0.0/s >Counters > match 29 1.8/s > bad-offset 0 0.0/s > fragment 0 0.0/s > short 0 0.0/s > normalize 0 0.0/s > memory 18 1.1/s > bad-timestamp 0 0.0/s > congestion 0 0.0/s > ip-option 0 0.0/s > proto-cksum 0 0.0/s > state-mismatch 0 0.0/s > state-insert 0 0.0/s > state-limit 0 0.0/s > src-limit 0 0.0/s > synproxy 0 0.0/s > ># pfctl -s state | wc -l > 0 >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > >The "pf uptime" shown above, by the way, matches system uptime. > >I then attempted "pfctl -F state", but nothing changed (looked the same >as above). > >Since I could not reboot the box, I was forced to drop to ddb via serial >console. I did some commands like "ps" and the like, and then "call >doadump" to induce a kernel panic, and then "reboot" (which worked). > >Once the machine came back up, savecore(8) ran, wrote the data out, and >everything has been fine since. /var/crash/core.txt.0 is ~68KBytes and >I do not feel comfortable sharing its content publicly, but will be >happy to hand it to developer(s) who are interested. Relevant tidbits I >can discern: > >------------------------------------------------------------------------ >ps -axl > > UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND > 0 422 0 0 -16 0 0 0 pftm DL ?? 1362773081:04.00= [pfpurge] >------------------------------------------------------------------------ > >------------------------------------------------------------------------ >vmstat -z > >ITEM SIZE LIMIT USED FREE REQUESTS FAI= LURES >pfsrctrpl: 152, 10000, 0, 0, 0, = 0 >pfrulepl: 912, 0, 40, 88, 806, = 0 >pfstatepl: 392, 10000, 10000, 0, 4440553, 3= 41638 >pfaltqpl: 240, 0, 0, 0, 0, = 0 >pfpooladdrpl: 88, 0, 0, 0, 0, = 0 >pfrktable: 1296, 1002, 4, 20, 112, = 0 >pfrkentry: 216, 100008, 603, 891, 15384, = 0 >pfrkentry2: 216, 0, 0, 0, 0, = 0 >pffrent: 32, 5050, 0, 303, 1620, = 0 >pffrag: 80, 0, 0, 135, 807, = 0 >pffrcache: 80, 10035, 0, 0, 0, = 0 >pffrcent: 24, 50022, 0, 0, 0, = 0 >pfstatescrub: 40, 0, 0, 0, 0, = 0 >pfiaddrpl: 120, 0, 0, 0, 0, = 0 >pfospfen: 112, 0, 696, 30, 18096, = 0 >pfosfp: 40, 0, 407, 97, 10582, = 0 >------------------------------------------------------------------------ > >You can see evidence of processes not exiting/doing what they should do >here: > >------------------------------------------------------------------------ >fstat > >USER CMD PID FD MOUNT INUM MODE SZ|DV R/W >root shutdown 91155 root / 2 drwxr-xr-x 512 r >root shutdown 91155 wd / 2 drwxr-xr-x 512 r >root shutdown 91155 text / 47195 -r-sr-x--- 15912 r >root shutdown 91155 0 /dev 38 crw------- ttyu0 rw >root shutdown 91155 1 /dev 38 crw------- ttyu0 rw >root shutdown 91155 2 /dev 38 crw------- ttyu0 rw >root sh 91129 root / 2 drwxr-xr-x 512 r >root sh 91129 wd / 2 drwxr-xr-x 512 r >root sh 91129 text / 44 -r-xr-xr-x 134848 r >root sh 91129 0 /dev 38 crw------- ttyu0 rw >root sh 91129 1* pipe ffffff01e78fc9e0 <-> ffffff01e78fc888= 0 rw >root sh 91129 2 /dev 20 crw-rw-rw- null w >root shutdown 91115 root / 2 drwxr-xr-x 512 r >root shutdown 91115 wd /storage 5 drwx------ 37 r >root shutdown 91115 text / 47195 -r-sr-x--- 15912 r >root shutdown 91115 0 /dev 38 crw------- ttyu0 rw >root shutdown 91115 1 /dev 38 crw------- ttyu0 rw >root shutdown 91115 2 /dev 38 crw------- ttyu0 rw >root shutdown 91115 3* local dgram ffffff008ff92960 >root sh 90818 root / 2 drwxr-xr-x 512 r >root sh 90818 wd / 70659 drwxr-xr-x 512 r >root sh 90818 text / 44 -r-xr-xr-x 134848 r >root sh 90818 0 /dev 38 crw------- ttyu0 rw >root sh 90818 1* pipe ffffff0043f1ecb8 <-> ffffff0043f1eb60= 0 rw >root sh 90818 2 /dev 20 crw-rw-rw- null w >root csh 90802 root / 2 drwxr-xr-x 512 r >root csh 90802 wd / 2 drwxr-xr-x 512 r >root csh 90802 text / 51 -r-xr-xr-x 358752 r >root csh 90802 15 /dev 38 crw------- ttyu0 rw >root csh 90802 16 /dev 38 crw------- ttyu0 rw >root csh 90802 17 /dev 38 crw------- ttyu0 rw >root csh 90802 18 /dev 38 crw------- ttyu0 rw >root csh 90802 19 /dev 38 crw------- ttyu0 rw >------------------------------------------------------------------------ > >No indication of mbuf exhaustion, putting further focus on the pf stack: > >------------------------------------------------------------------------ >netstat -m > >2054/1786/3840 mbufs in use (current/cache/total) >2048/1414/3462/25600 mbuf clusters in use (current/cache/total/max) >2048/896 mbuf+clusters out of packet secondary zone in use (current/cache) >0/320/320/12800 4k (page size) jumbo clusters in use (current/cache/total/= max) >0/0/0/19200 9k jumbo clusters in use (current/cache/total/max) >0/0/0/12800 16k jumbo clusters in use (current/cache/total/max) >4609K/4554K/9164K bytes allocated to network (current/cache/total) >0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) >0/0/0 requests for jumbo clusters denied (4k/9k/16k) >0 requests for sfbufs denied >0 requests for sfbufs delayed >0 requests for I/O initiated by sendfile >0 calls to protocol drain routines >------------------------------------------------------------------------ > >Here's one piece of core.0.txt which makes no sense to me -- the "rate" >column. I have a very hard time believing that was the interrupt rate >of all the relevant devices at the time (way too high). Maybe this data >becomes wrong only during a coredump? The total column I could believe. > >------------------------------------------------------------------------ >vmstat -i > >interrupt total rate >irq4: uart0 54768 912 >irq6: fdc0 1 0 >irq17: uhci1+ 172 2 >irq23: uhci3 ehci1+ 2367 39 >cpu0: timer 13183882632 219731377 >irq256: em0 260491055 4341517 >irq257: em1 127555036 2125917 >irq258: ahci0 225923164 3765386 >cpu2: timer 13183881837 219731363 >cpu1: timer 13002196469 216703274 >cpu3: timer 13183881783 219731363 >Total 53167869284 886131154 >------------------------------------------------------------------------ > >Here's what a normal "vmstat -i" shows from the command-line: > ># vmstat -i >interrupt total rate >irq4: uart0 518 0 >irq6: fdc0 1 0 >irq23: uhci3 ehci1+ 145 0 >cpu0: timer 19041199 1999 >irq256: em0 614280 64 >irq257: em1 168529 17 >irq258: ahci0 355536 37 >cpu2: timer 19040462 1999 >cpu1: timer 19040458 1999 >cpu3: timer 19040454 1999 >Total 77301582 8119 > >We graph many aspects of this box, including CPU load, memory/swap >usage, etc. and none show any sign that the interrupt rate on all of >those devices was even remotely out of control. (I would expect to see >CPU through the roof given the above data) > >I have since rebuilt/reinstalled world/kernel on the machine with the >latest RELENG_8 code (box is now 8.2-STABLE #0: Mon May 2 14:44:18 PDT >2011), hoping whatever this was may have been fixed. > >As for what I think may have triggered it, but I have no hard evidence >of such: on April 29th, I changed our pf.conf and did "/etc/rc.d/pf >reload". The pf.conf change was a single line: > >Old: scrub on em0 all >New: scrub in on em0 all > >Why it took the problem approximately 3 days to start is unknown. It's >the only change we've made to the system (truly/honestly), and it was a >change to pf.conf. > >If anyone has advice (or has seen the above problem), or is interested >in debugging it -- as I said, I have a vmcore -- I'm happy to assist in >any way I can. I would hate for someone else to get bit by this, and >really am hoping its something that has been fixed between February and >now. > That's quite the deduction there. I've noticed recently that you were also= =20 experimenting with the new NFS server recompiling kernel etc etc. Seeing=20 as weird things can happen with DNS, NFS and mountpoint's, is this the same= =20 machine that you were doing that on ? If so can you check to see how many requests for NFS operations were done= =20 to/from that box ? as well the names that would be being resolved and if=20 that machine can resolve them ? Also I would believe your using tables in your pf.conf, if so do any of=20 those tables contain a FQDN that cannot be resolved from that machine ? I think you probably see what I am getting at here as it could be some=20 sort of concurrent recursive DNS failure that can only be seen from the=20 machine caused by possibly the new NFS backend or a change in one of the=20 tables that pf would use. --=20 Regards, (jhell) Jason Hellenthal --DBIVS5p969aUjpLe Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (FreeBSD) Comment: http://bit.ly/0x89D8547E iQEcBAEBAgAGBQJNv41ZAAoJEJBXh4mJ2FR+kv0H/jLwWK5wBbRHO0t4PEeUTITN pKljwFvrnTSTDC1WIeYAe6AEL5zERfHaYsl2zGG2XiYCiVoSOaVRRYP/4lkztG3h X3K68z0q8WgTksr9L1iIR/oj2YRLGzvTI9ChV2lQh+lORF2mLW+doYlipPa0sAGq PJrruYlQk5uaY/9rEQn+/5QxvFM1A73yBVIoQKI4UAEdlv8tGlFAcyrqkOGnJ8ju ji6VOQfOndbwI6B+0rezkvtFGlSjPQTCVsl8+IAkA2rgcUPkGgOm9LxGFHsVzbc+ Kkj9zRKCSDc92qVKyw7+Asaucnm6lnhF/IHDYgvVPIGcG/eLi/PvVnbh003svmg= =xLPk -----END PGP SIGNATURE----- --DBIVS5p969aUjpLe--