Date: Sun, 18 Jul 2010 20:58:44 -0700
From: Jeremy Chadwick
To: Mike Tancsa
Cc: freebsd-stable@freebsd.org
Message-ID: <20100719035844.GA93487@icarus.home.lan>
In-Reply-To: <20100719033424.GA92607@icarus.home.lan>
Subject: Re: deadlock or bad disk ?
 RELENG_8

On Sun, Jul 18, 2010 at 08:34:24PM -0700, Jeremy Chadwick wrote:
> On Sun, Jul 18, 2010 at 11:01:03PM -0400, Mike Tancsa wrote:
> > I do track some basic mem stats via rrd.  Looking at the graphs up to
> > that period, nothing unusual was happening
>
> sysctl vm.stats.vm | grep swap
>
> Here's another post basically reiterating the same thing: that the
> controller the swap slice is on (in your case a 4-disk RAID array) is
> basically taking too long to respond.
>
> http://groups.google.com/group/mailing.freebsd.stable/browse_thread/thread/2e7faeeaca719c52/cdcd4601ce1b90c5
>
> I have no idea where the timeout values are in the kernel.  I do see
> these two entries in sysctl that look to be of interest though.  You
> might try adjusting these (not sure if they're sysctls or loader.conf
> tunables only):
>
> vm.swap_idle_threshold2: 10
> vm.swap_idle_threshold1: 2
>
> Descriptions:
>
> vm.swap_idle_threshold2: Time before a process will be swapped out
> vm.swap_idle_threshold1: Guaranteed swapped in time for a process
>
> I want to point out that the actual amount of data being swapped out is
> fairly small -- note the "size" fields in the swap_pager kernel messages.
> There doesn't necessarily have to be a shortage of memory to cause a
> swapout (case in point, see above).

I took a look at the RELENG_8 code responsible for printing this
message:

src/sys/vm/swap_pager.c

/*
 * SWAP_PAGER_GETPAGES() - bring pages in from swap
 *
 *      Attempt to retrieve (m, count) pages from backing store, but make
 *      sure we retrieve at least m[reqpage].  We try to load in as large
 *      a chunk surrounding m[reqpage] as is contiguous in swap and which
 *      belongs to the same object.
 *
 *      The code is designed for asynchronous operation and
 *      immediate-notification of 'reqpage' but tends not to be
 *      used that way.  Please do not optimize-out this algorithmic
 *      feature, I intend to improve on it in the future.
 *
 *      The parent has a single vm_object_pip_add() reference prior to
 *      calling us and we should return with the same.
 *
 *      The parent has BUSY'd the pages.  We should return with 'm'
 *      left busy, but the others adjusted.
 */
static int
swap_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage)
{
....
        /*
         * wait for the page we want to complete.  VPO_SWAPINPROG is always
         * cleared on completion.  If an I/O error occurs, SWAPBLK_NONE
         * is set in the meta-data.
         */
        VM_OBJECT_LOCK(object);
        while ((mreq->oflags & VPO_SWAPINPROG) != 0) {
                mreq->oflags |= VPO_WANTED;
                PCPU_INC(cnt.v_intrans);
                if (msleep(mreq, VM_OBJECT_MTX(object), PSWP, "swread", hz*20)) {
                        printf(
"swap_pager: indefinite wait buffer: bufobj: %p, blkno: %jd, size: %ld\n",
                            bp->b_bufobj, (intmax_t)bp->b_blkno, bp->b_bcount);
                }
        }

So I believe the message is only printed during swapin, not swapout,
which means it's happening during an I/O read from da0.

Reading msleep(9) provides us some details about what "swread"
correlates with (now I know where that column in ps/top comes from), and
the timeout value (hz*20):

     The parameter wmesg is a string describing the sleep condition for
     tools like ps(1).  Due to the limited space of those programs to
     display arbitrary strings, this message should not be longer than 6
     characters.

     The parameter timo specifies a timeout for the sleep.  If timo is
     not 0, then the thread will sleep for at most timo / hz seconds.
     If the timeout expires, then the sleep function will return
     EWOULDBLOCK.

So what's hz?  Well, I want to assume it's kern.hz, which defaults to 1000.
1000*20 = 20000, so the timeout would be 20000/1000 = 20 seconds.  (The
hz factor cancels, so this comes out to 20 seconds whatever kern.hz is
set to.)  That's a pretty long time to be waiting for an I/O read to
return.

So does vm.swap_idle_threshold1 play a role?  I doubt it.  The code is
in src/sys/vm/vm_glue.c, but I don't understand it (especially since
it's used in a function called swapout_procs()).  I just wish I knew why
the description is "Guaranteed swapped in time for a process" when it
looks more like a guaranteed swapped-out time.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |