From owner-freebsd-questions@FreeBSD.ORG  Fri Aug  6 07:17:04 2004
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1DF8C16A4CE
	for <freebsd-questions@freebsd.org>;
	Fri,  6 Aug 2004 07:17:04 +0000 (GMT)
Received: from yami.57thstreet.com (constr1-host1.corridor.net
	[66.100.236.130])
	by mx1.FreeBSD.org (Postfix) with SMTP id 8983D43D55
	for <freebsd-questions@freebsd.org>;
	Fri,  6 Aug 2004 07:17:03 +0000 (GMT)	(envelope-from jeffk@well.com)
Received: (qmail 67025 invoked from network); 6 Aug 2004 07:48:12 -0000
Received: from unknown (HELO ?192.168.0.5?) (66.100.236.133)
  by constr1-host1.corridor.net with SMTP; 6 Aug 2004 07:48:12 -0000
Mime-Version: 1.0
X-Sender: jeffk@mail.well.com
Message-Id: <p06002017bd38da0c39a9@[192.168.0.5]>
Date: Fri, 6 Aug 2004 02:16:50 -0500
To: freebsd-questions@freebsd.org
From: Jeff Kramer <jeffk@well.com>
Content-Type: text/plain; charset="us-ascii" ; format="flowed"
Subject: identifying and fixing server I/O slowdowns
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>,
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>,
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 06 Aug 2004 07:17:04 -0000

Oh great and wise FreeBSD gurus,

I've been running FreeBSD boxes for about five years with great 
results (six of them at the moment), but recently one of my machines 
has started to seriously act up.  Every time a heavy disk operation 
runs (say, tar'ing a 1 gig directory), the system slows to a crawl, 
and requests to apache/php/mysql sites hosted on it just hang.

The system is a dual P3 1.13 GHz box with a gig of RAM and mirrored 80 
gig WD800BB drives on a Promise TX2 controller.  The RAID isn't 
degraded.  There's a dedicated 1.5 gig swap partition and a swap file 
on the /usr partition.  We had some apache processes go nuts one 
time, which is why I added the swap file.

We run about 15 jails on the machine, with MySQL in the server proper 
and apache/php running inside the jails.  I initially thought a rogue 
process was taking down the machine, but it seems that any heavy disk 
activity lasting more than a few minutes brings on the slowdown.  It 
doesn't happen instantly, but after a minute or two things slow to a 
crawl.

I've recompiled the kernel a few times, upgraded to the latest 
4-STABLE rev, and even turned on device polling, but nothing seems to 
be helping.  It doesn't seem to happen on another machine we have 
with identical hardware.

My sysctl.conf:

kern.ipc.somaxconn=4096
net.inet.tcp.sendspace=32768
net.inet.tcp.recvspace=32768
net.inet.icmp.drop_redirect=1
net.inet.icmp.log_redirect=1
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0
net.link.ether.inet.max_age=1200
net.inet.icmp.bmcastecho=0
net.inet.icmp.maskrepl=0
kern.maxfiles=65536
kern.ipc.shm_use_phys=1
kern.polling.enable=1

And a netstat -m:

301/928/131072 mbufs in use (current/peak/max):
         301 mbufs allocated to data
287/874/32768 mbuf clusters in use (current/peak/max)
1980 Kbytes allocated to network (2% of mb_map in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

And here's a typical systat -v snapshot while the machine's 'ok':

     3 users    Load  0.32  0.38  0.31                  Aug  6 00:03

Mem:KB    REAL            VIRTUAL                     VN PAGER  SWAP PAGER
         Tot   Share      Tot    Share    Free         in  out     in  out
Act  221588   38656   747652   117796   39404 count    4           3
All 1024156   41620  1546136   144132         pages   18           5
                                                                  Interrupts
Proc:r  p  d  s  w    Csw  Trp  Sys  Int  Sof  Flt     21 cow    1156 total
      2     2 70       343  63322119 1156   57  397 186992 wire        fxp0 irq2
                                                    623848 act      13 ohci0 irq9
  4.4%Sys   1.0%Intr  2.5%User  0.0%Nice 92.1%Idl   176096 inact    11 mux irq10
|    |    |    |    |    |    |    |    |    |      37220 cache       fdc0 irq6
==+>                                                 2184 free   1004 clk irq0
                                                           daefr   128 rtc irq8
Namei         Name-cache    Dir-cache                  15 prcfr
     Calls     hits    %     hits    %                   5 react
       126      125   99                                   pdwake
                                       340 zfod            pdpgs
Disks   ad4   ad6   fd0   md0         119 ofod          1 intrn
KB/t   0.00 16.72  0.00  0.00          34 %slo-z   114304 buf
tps       0    11     0     0         401 tfree       173 dirtybuf
MB/s   0.00  0.17  0.00  0.00                       70310 desiredvnodes
% busy    0     9     0     0                       64089 numvnodes
                                                     54829 freevnodes


And here's a systat -v snapshot while the machine's choking:

     4 users    Load  0.39  0.35  0.31                  Aug  6 00:08

Mem:KB    REAL            VIRTUAL                     VN PAGER  SWAP PAGER
         Tot   Share      Tot    Share    Free         in  out     in  out
Act  191344   34248   728736   117268   51916 count    1                6
All 1024676   37500  2075520   144188         pages    2               67
                                                                  Interrupts
Proc:r  p  d  s  w    Csw  Trp  Sys  Int  Sof  Flt     29 cow    1698 total
      5     2 70       573  74423171 1699  225  367 180904 wire        fxp0 irq2
                                                    640404 act     335 ohci0 irq9
  5.7%Sys   1.9%Intr  7.5%User  0.0%Nice 84.9%Idl   153116 inact   236 mux irq10
|    |    |    |    |    |    |    |    |    |      50252 cache       fdc0 irq6
===+>>>>                                             1664 free    999 clk irq0
                                                           daefr   128 rtc irq8
Namei         Name-cache    Dir-cache                  93 prcfr
     Calls     hits    %     hits    %                   1 react
      8693     8196   94       12    0                     pdwake
                                       308 zfod       2693 pdpgs
Disks   ad4   ad6   fd0   md0         135 ofod            intrn
KB/t  98.81 16.61  0.00  0.00          43 %slo-z   114304 buf
tps      13   225     0     0        1277 tfree       278 dirtybuf
MB/s   1.23  3.64  0.00  0.00                       70310 desiredvnodes
% busy    2    99     0     0                       64089 numvnodes
                                                     52125 freevnodes
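
As a rough sanity check on the ad6 column in the two snapshots above, 
the sketch below multiplies KB/t by tps to see what throughput those 
figures imply.  The dictionary keys and the helper name 
throughput_mb_s are made up for illustration, and 1 MB is assumed to 
be 1024 KB; only the numbers come from the systat output.

```python
# Figures copied from the ad6 column of the two systat -v snapshots.
# The key names are illustrative, not systat terminology.
snapshots = {
    "ok":      {"kb_per_xfer": 16.72, "tps": 11,  "busy_pct": 9},
    "choking": {"kb_per_xfer": 16.61, "tps": 225, "busy_pct": 99},
}


def throughput_mb_s(kb_per_xfer, tps):
    """Throughput implied by average transfer size and transfers/sec.

    Assumes 1 MB = 1024 KB.
    """
    return kb_per_xfer * tps / 1024.0


for name, s in snapshots.items():
    mb_s = throughput_mb_s(s["kb_per_xfer"], s["tps"])
    print(f"{name}: about {mb_s:.2f} MB/s at {s['busy_pct']}% busy")
```

The results land within rounding of the MB/s row that systat reports 
(0.17 and 3.64), so the counters are at least self-consistent: in the 
choking snapshot ad6 is shown as 99% busy while moving under 4 MB/s in 
transfers of roughly 16 KB each.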


Thoughts?  Is there any way to keep a single process from 
monopolizing the disk controller?

-- 

Jeff Kramer
jeffk@well.com
http://www.keika.org/