Date:      Tue, 3 May 2011 01:06:34 -0400
From:      Jason Hellenthal <jhell@DataIX.net>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        freebsd-stable@freebsd.org, freebsd-pf@freebsd.org
Subject:   Re: RELENG_8 pf stack issue (state count spiraling out of control)
Message-ID:  <20110503050634.GB53414@DataIX.net>
In-Reply-To: <20110503015854.GA31444@icarus.home.lan>
References:  <20110503015854.GA31444@icarus.home.lan>

Jeremy,


On Mon, May 02, 2011 at 06:58:54PM -0700, Jeremy Chadwick wrote:
>(Please keep me CC'd as I'm not subscribed to freebsd-pf.  And apologies
>for cross-posting, but the issue is severe enough that I wanted to make
>it known on -stable)
>
>The below issue I'm describing is from a machine running 8.2-PRERELEASE
>(RELENG_8) using src dated Tue Feb 15 05:46:02 PST 2011.
>
>Please read the story in full, as I have taken the time to describe
>everything I did, including log output, and I induced a panic via "call
>doadump" from ddb so I have a capture of the system at the time.  I
>also have a theory as to what caused the problem, but how to trigger it
>is unknown; it may be a rare race condition.
>
>
>This morning I woke up to find a report from one of our users that he
>could not connect to a specific TCP port (not SSH) on one of our
>servers.  I also found that I couldn't SSH into the same box.  Serial
>console was working fine, and the serial console log showed no sign of
>any problems.
>
>I started to debug the issue of not being able to SSH into the machine,
>and within a few minutes became seriously concerned: pfctl indicated we
>had reached the maximum permitted number of state table entries
>(10,000).
>
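>(For what it's worth, the limit itself is a pf.conf tunable; a minimal
>sketch, assuming the default of 10000 shown below:
>
>  # pf.conf: raise the hard cap on state table entries
>  set limit states 20000
>
>Though given what follows, raising it would only have bought time,
>since states were apparently no longer being purged.)
>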
>===========================================================================
># pfctl -s info
>Status: Enabled for 76 days 06:49:10          Debug: Urgent
>
>Interface Stats for em0               IPv4             IPv6
>  Bytes In                      8969748840                0
>  Bytes Out                     8296135477                0
>  Packets In
>    Passed                       128211763                0
>    Blocked                         621379                0
>  Packets Out
>    Passed                       138483868                0
>    Blocked                           2579                0
>
>State Table                          Total             Rate
>  current entries                    10000
>  searches                       267316807           40.6/s
>  inserts                          4440553            0.7/s
>  removals                         4430553            0.7/s
>Counters
>  match                            5067474            0.8/s
>  bad-offset                             0            0.0/s
>  fragment                             324            0.0/s
>  short                                  0            0.0/s
>  normalize                             32            0.0/s
>  memory                            336946            0.1/s
>  bad-timestamp                          0            0.0/s
>  congestion                             0            0.0/s
>  ip-option                              0            0.0/s
>  proto-cksum                         1611            0.0/s
>  state-mismatch                       509            0.0/s
>  state-insert                           0            0.0/s
>  state-limit                            0            0.0/s
>  src-limit                              0            0.0/s
>  synproxy                               0            0.0/s
>
># pfctl -s memory
>states        hard limit    10000
>src-nodes     hard limit    10000
>frags         hard limit     5000
>tables        hard limit     1000
>table-entries hard limit   100000
>===========================================================================
>
>The above is mainly for em0 (our WAN interface); our LAN interface (em1)
>was not impacted because we use "set skip on em1".  And it's a good
>thing too: we have lots of LAN-based services that this machine provides
>that would have been impacted.  We also use "set skip on lo0".
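>
>(In pf.conf terms, the equivalent of those two lines is simply:
>
>  set skip on { em1 lo0 }
>)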
>
>I immediately went to look at our monitoring graphs, which monitor pf
>state (specifically state table entries), polled via bsnmpd(8).  This
>data is even more frightening:
>
>http://jdc.parodius.com/freebsd/pf-issue/pf_states-day.png
>http://jdc.parodius.com/freebsd/pf-issue/pf_states-week.png
>
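>(For reference, these counters come from bsnmpd's pf module; a minimal
>sketch of the configuration we use, module path assumed:
>
>  # /etc/snmpd.config
>  begemotSnmpdModulePath."pf" = "/usr/lib/snmp_pf.so"
>
>after which the state-table counters are pollable via the
>BEGEMOT-PF-MIB.)
>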
>Something was literally spiraling out of control, starting at approx.
>2011/05/01 (Sun) at 12:30 PDT.  The situation became dire at approx.
>19:45 PDT the same day, but I wasn't aware of it until said user brought
>the issue to my attention.
>
>You can see from the network I/O graphs (taken from SNMP polling our
>switch, NOT from the host/box itself) that there was no DoS attack or
>anything like that occurring -- this was something within FreeBSD
>itself.  More evidence of that will become apparent.
>
>http://jdc.parodius.com/freebsd/pf-issue/port_03-day.png
>http://jdc.parodius.com/freebsd/pf-issue/port_03-week.png
>
>The first thing I did was "/etc/rc.d/pf reload".  This command hung.
>Any attempt to send Ctrl-C/SIGINT did nothing.  I was able to
>Ctrl-Z/SIGSTOP it, then use kill %1, but the actual reload process did
>not truly die (despite csh stating "Terminated").  The only way to kill
>it was to kill -9.
>
>Daemons which utilised the network -- including things that only used
>em1 -- would not shut down.  This included postfix, mysqld, and some
>inet-based services.  I was forced to kill -9 them.  Things like
>bsnmpd, however, did shut down cleanly.
>
>Equally uncomfortable, "shutdown -r now" did not reboot the system.
>That is to say, wall(1)'s announcement was shown, but the actual
>stopping of services did not begin.
>
>The next thing I tried was "/etc/rc.d/pf stop", which worked.  Then I
>did "/etc/rc.d/pf start", which also worked.  However, what I saw next
>surely indicated a bug in the pf layer somewhere -- "pfctl -s states"
>and "pfctl -s info" disagreed on the state count:
>
>===========================================================================
># pfctl -s info
>Status: Enabled for 0 days 00:00:16           Debug: Urgent
>
>Interface Stats for em0               IPv4             IPv6
>  Bytes In                            3459                0
>  Bytes Out                              0                0
>  Packets In
>    Passed                               0                0
>    Blocked                             29                0
>  Packets Out
>    Passed                               0                0
>    Blocked                              0                0
>
>State Table                          Total             Rate
>  current entries                    10000
>  searches                              29            1.8/s
>  inserts                                0            0.0/s
>  removals                               0            0.0/s
>Counters
>  match                                 29            1.8/s
>  bad-offset                             0            0.0/s
>  fragment                               0            0.0/s
>  short                                  0            0.0/s
>  normalize                              0            0.0/s
>  memory                                18            1.1/s
>  bad-timestamp                          0            0.0/s
>  congestion                             0            0.0/s
>  ip-option                              0            0.0/s
>  proto-cksum                            0            0.0/s
>  state-mismatch                         0            0.0/s
>  state-insert                           0            0.0/s
>  state-limit                            0            0.0/s
>  src-limit                              0            0.0/s
>  synproxy                               0            0.0/s
>
># pfctl -s state | wc -l
>       0
>===========================================================================
>
>The "pf uptime" shown above, by the way, matches system uptime.
>
>I then attempted "pfctl -F state", but nothing changed (looked the same
>as above).
>
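>(A quick shell sketch for spotting this disagreement, should anyone
>want to watch for it:
>
>  # these two numbers should normally agree
>  pfctl -si | awk '/current entries/ { print $3 }'
>  pfctl -ss | wc -l
>)
>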
>Since I could not reboot the box, I was forced to drop to ddb via serial
>console.  I did some commands like "ps" and the like, and then "call
>doadump" to induce a kernel panic, and then "reboot" (which worked).
>
>Once the machine came back up, savecore(8) ran, wrote the data out, and
>everything has been fine since.  /var/crash/core.txt.0 is ~68KBytes and
>I do not feel comfortable sharing its content publicly, but will be
>happy to hand it to developer(s) who are interested.  Relevant tidbits I
>can discern:
>
>------------------------------------------------------------------------
>ps -axl
>
>  UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
>    0   422     0   0 -16  0     0     0 pftm   DL    ??  1362773081:04.00 [pfpurge]
>------------------------------------------------------------------------
>
>------------------------------------------------------------------------
>vmstat -z
>
>ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
>pfsrctrpl:                152,    10000,        0,        0,        0,         0
>pfrulepl:                 912,        0,       40,       88,      806,         0
>pfstatepl:                392,    10000,    10000,        0,  4440553,    341638
>pfaltqpl:                 240,        0,        0,        0,        0,         0
>pfpooladdrpl:              88,        0,        0,        0,        0,         0
>pfrktable:               1296,     1002,        4,       20,      112,         0
>pfrkentry:                216,   100008,      603,      891,    15384,         0
>pfrkentry2:               216,        0,        0,        0,        0,         0
>pffrent:                   32,     5050,        0,      303,     1620,         0
>pffrag:                    80,        0,        0,      135,      807,         0
>pffrcache:                 80,    10035,        0,        0,        0,         0
>pffrcent:                  24,    50022,        0,        0,        0,         0
>pfstatescrub:              40,        0,        0,        0,        0,         0
>pfiaddrpl:                120,        0,        0,        0,        0,         0
>pfospfen:                 112,        0,      696,       30,    18096,         0
>pfosfp:                    40,        0,      407,       97,    10582,         0
>------------------------------------------------------------------------
>
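>(pfstatepl above is the smoking gun: USED equals LIMIT, with 341638
>failed allocations, which lines up with the nonzero "memory" counter
>from pfctl.  That zone can be watched live with something like:
>
>  vmstat -z | egrep '^(ITEM|pfstatepl)'
>)
>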
>You can see evidence of processes not exiting/doing what they should do
>here:
>
>------------------------------------------------------------------------
>fstat
>
>USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W
>root     shutdown   91155 root /             2 drwxr-xr-x     512  r
>root     shutdown   91155   wd /             2 drwxr-xr-x     512  r
>root     shutdown   91155 text /         47195 -r-sr-x---   15912  r
>root     shutdown   91155    0 /dev         38 crw-------   ttyu0 rw
>root     shutdown   91155    1 /dev         38 crw-------   ttyu0 rw
>root     shutdown   91155    2 /dev         38 crw-------   ttyu0 rw
>root     sh         91129 root /             2 drwxr-xr-x     512  r
>root     sh         91129   wd /             2 drwxr-xr-x     512  r
>root     sh         91129 text /            44 -r-xr-xr-x  134848  r
>root     sh         91129    0 /dev         38 crw-------   ttyu0 rw
>root     sh         91129    1* pipe ffffff01e78fc9e0 <-> ffffff01e78fc888      0 rw
>root     sh         91129    2 /dev         20 crw-rw-rw-    null  w
>root     shutdown   91115 root /             2 drwxr-xr-x     512  r
>root     shutdown   91115   wd /storage      5 drwx------      37  r
>root     shutdown   91115 text /         47195 -r-sr-x---   15912  r
>root     shutdown   91115    0 /dev         38 crw-------   ttyu0 rw
>root     shutdown   91115    1 /dev         38 crw-------   ttyu0 rw
>root     shutdown   91115    2 /dev         38 crw-------   ttyu0 rw
>root     shutdown   91115    3* local dgram ffffff008ff92960
>root     sh         90818 root /             2 drwxr-xr-x     512  r
>root     sh         90818   wd /         70659 drwxr-xr-x     512  r
>root     sh         90818 text /            44 -r-xr-xr-x  134848  r
>root     sh         90818    0 /dev         38 crw-------   ttyu0 rw
>root     sh         90818    1* pipe ffffff0043f1ecb8 <-> ffffff0043f1eb60      0 rw
>root     sh         90818    2 /dev         20 crw-rw-rw-    null  w
>root     csh        90802 root /             2 drwxr-xr-x     512  r
>root     csh        90802   wd /             2 drwxr-xr-x     512  r
>root     csh        90802 text /            51 -r-xr-xr-x  358752  r
>root     csh        90802   15 /dev         38 crw-------   ttyu0 rw
>root     csh        90802   16 /dev         38 crw-------   ttyu0 rw
>root     csh        90802   17 /dev         38 crw-------   ttyu0 rw
>root     csh        90802   18 /dev         38 crw-------   ttyu0 rw
>root     csh        90802   19 /dev         38 crw-------   ttyu0 rw
>------------------------------------------------------------------------
>
>No indication of mbuf exhaustion, putting further focus on the pf stack:
>
>------------------------------------------------------------------------
>netstat -m
>
>2054/1786/3840 mbufs in use (current/cache/total)
>2048/1414/3462/25600 mbuf clusters in use (current/cache/total/max)
>2048/896 mbuf+clusters out of packet secondary zone in use (current/cache)
>0/320/320/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
>0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
>0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
>4609K/4554K/9164K bytes allocated to network (current/cache/total)
>0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>0 requests for sfbufs denied
>0 requests for sfbufs delayed
>0 requests for I/O initiated by sendfile
>0 calls to protocol drain routines
>------------------------------------------------------------------------
>
>Here's one piece of core.txt.0 which makes no sense to me -- the "rate"
>column.  I have a very hard time believing that was the interrupt rate
>of all the relevant devices at the time (way too high).  Maybe this data
>becomes wrong only during a coredump?  The total column I could believe.
>
>------------------------------------------------------------------------
>vmstat -i
>
>interrupt                          total       rate
>irq4: uart0                        54768        912
>irq6: fdc0                             1          0
>irq17: uhci1+                        172          2
>irq23: uhci3 ehci1+                 2367         39
>cpu0: timer                  13183882632  219731377
>irq256: em0                    260491055    4341517
>irq257: em1                    127555036    2125917
>irq258: ahci0                  225923164    3765386
>cpu2: timer                  13183881837  219731363
>cpu1: timer                  13002196469  216703274
>cpu3: timer                  13183881783  219731363
>Total                        53167869284  886131154
>------------------------------------------------------------------------
>
>Here's what a normal "vmstat -i" shows from the command-line:
>
># vmstat -i
>interrupt                          total       rate
>irq4: uart0                          518          0
>irq6: fdc0                             1          0
>irq23: uhci3 ehci1+                  145          0
>cpu0: timer                     19041199       1999
>irq256: em0                       614280         64
>irq257: em1                       168529         17
>irq258: ahci0                     355536         37
>cpu2: timer                     19040462       1999
>cpu1: timer                     19040458       1999
>cpu3: timer                     19040454       1999
>Total                           77301582       8119
>
>We graph many aspects of this box, including CPU load, memory/swap
>usage, etc. and none show any sign that the interrupt rate on all of
>those devices was even remotely out of control.  (I would expect to see
>CPU through the roof given the above data)
>
>I have since rebuilt/reinstalled world/kernel on the machine with the
>latest RELENG_8 code (box is now 8.2-STABLE #0: Mon May  2 14:44:18 PDT
>2011), hoping whatever this was may have been fixed.
>
>As for what I think may have triggered it (though I have no hard
>evidence of this): on April 29th, I changed our pf.conf and did
>"/etc/rc.d/pf reload".  The pf.conf change was a single line:
>
>Old: scrub on em0 all
>New: scrub in on em0 all
>
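>(Spelled out as pf.conf fragments, with the rest of the ruleset
>omitted:
>
>  # before: normalize packets crossing em0 in both directions
>  scrub on em0 all
>  # after: normalize inbound packets on em0 only
>  scrub in on em0 all
>)
>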
>Why it took the problem approximately 3 days to start is unknown.  It's
>the only change we've made to the system (truly/honestly), and it was a
>change to pf.conf.
>
>If anyone has advice (or has seen the above problem), or is interested
>in debugging it -- as I said, I have a vmcore -- I'm happy to assist in
>any way I can.  I would hate for someone else to get bit by this, and
>really am hoping its something that has been fixed between February and
>now.
>
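>(For any developer who wants to poke at it: the dump opens in the
>usual way; a sketch, assuming dump number 0 as written by savecore(8):
>
>  kgdb /boot/kernel/kernel /var/crash/vmcore.0
>)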

That's quite the deduction there. I've noticed recently that you were
also experimenting with the new NFS server, recompiling the kernel, and
so on. Seeing as weird things can happen with DNS, NFS, and mountpoints,
is this the same machine you were doing that on?

If so, can you check how many NFS operations were done to/from that box,
as well as which names would have been resolved and whether that machine
can actually resolve them?
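
Something like the following would show the server-side operation
counts (the -e flag covers the new, experimental NFS code):

  nfsstat -s
  nfsstat -e -s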

Also, I believe you're using tables in your pf.conf; if so, do any of
those tables contain an FQDN that cannot be resolved from that machine?
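
A quick way to check, with a placeholder table name:

  pfctl -s Tables
  pfctl -t mytable -T show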

You can probably see what I am getting at here: it could be some sort of
concurrent, recursive DNS failure, visible only from that machine, and
caused possibly by the new NFS backend or by a change in one of the
tables that pf uses.

-- 

 Regards, (jhell)
 Jason Hellenthal

