From owner-freebsd-stable@FreeBSD.ORG Fri Mar 12 21:27:04 2010 Return-Path: Delivered-To: freebsd-stable@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 84E8B1065677 for ; Fri, 12 Mar 2010 21:27:04 +0000 (UTC) (envelope-from gallasch@free.de) Received: from smtp.free.de (smtp.free.de [91.204.6.103]) by mx1.freebsd.org (Postfix) with ESMTP id 764A18FC23 for ; Fri, 12 Mar 2010 21:27:03 +0000 (UTC) Received: (qmail 31273 invoked from network); 12 Mar 2010 22:27:01 +0100 Received: from smtp.free.de (HELO orwell.free.de) (gallasch@free.de@[91.204.4.103]) (envelope-sender ) by smtp.free.de (qmail-ldap-1.03) with AES128-SHA encrypted SMTP for ; 12 Mar 2010 22:27:01 +0100 Date: Fri, 12 Mar 2010 22:26:59 +0100 From: Kai Gallasch To: Alexander Motin Message-ID: <20100312222659.0198dd03@orwell.free.de> In-Reply-To: <4B9A8A27.8050608@FreeBSD.org> References: <20100311133916.42ba69b0@orwell.free.de> <20100312115028.GG1819@garage.freebsd.pl> <4B9A8A27.8050608@FreeBSD.org> X-Mailer: Claws Mail 3.7.3 (GTK+ 2.18.7; powerpc-apple-darwin9.8.0) X-Face: 7"x0zA5=*cXGZw-xjU<">'+!3(KXTUXZVLD42KVN{'go[UQr"Mc.e(XW92N8plZ(9x.{x; I<|95e+b&GH-36\15F~L$YD*Y +u}o&KV?6.%"mJIkaY3G>BKNt`1|Y+%K1P4t; 47D65&(Y7h5Ll-[ltkhamx.-; ,jggK'}oMpUgEHFG YQ"9oXKAl>!d,J}T{)@uxvfu?YFWC*\~h+,^f Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org, Pawel Jakub Dawidek , freebsd-stable@FreeBSD.org Subject: Re: proliant server lockups with freebsd-amd64-stable (2010-03-10) X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 12 Mar 2010 21:27:04 -0000 Am Fri, 12 Mar 2010 20:38:31 +0200 schrieb Alexander Motin : > Pawel Jakub Dawidek wrote: > > On Thu, Mar 11, 2010 at 01:39:16PM +0100, Kai Gallasch wrote: > >> I have some trouble with an opteron server locking up > >> spontaneously. It looses all networks connectivity and even > >> through console I can get no shell. > >> > >> Lockups occur mostly under disk load (periodic daily, bacula backup > >> running, make buildworld/buildkernel) and I can provoke them > >> easily. > > [...] > >> 4 0 0 0 LL *cissmtx 0xffffff04ed820c00 > >> [g_down] > > [...] > >> 100046 L *cissmtx 0xffffff04ed820c00 > >> [irq257: ciss0] > > [...] > > > > I was analizing similar problem as potential ZFS bug. It turned out > > to be bug in ciss(4) and I believe mav@ (CCed) has fix for that. > > That my patch is already at 8-STABLE since r204873 of 2010-03-08. Make > sure you have it. Updating collection src-all/cvs .. .. Edit src/sys/dev/ciss/ciss.c Edit src/sys/dev/ciss/cissvar.h Didn't have it! Must have been just a few hours I missed it, when I built the last kernel. So I will rebuild my kernel and give it a spin later on. > In this case trap stopped process at ciss_get_request(), which indeed > called holding cissmtx lock. But there is no place to sleep or loop > there, so may be it was just spontaneous. With bugs I was fixing there > was a chance to loop indefinitely between ciss and CAM on resource > constraint. That increases chance for such situation to be caught. > > You may try also look what's going on with `top -HS` and `systat -vm > 1`. FYI. Without your patch of ciss.c and cissvar.h a lockup with top -HS and systat -vm running gives following information below. Is there a pattern visible relating to your patch of the ciss driver? Kai. ------ vmstat -vm ------ 3 users Load 0.21 0.36 0.45 Mar 12 21:47 Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER Tot Share Tot Share Free in out in out Act 2805456 40804 62699320 79936 12358k count All 6182560 95796 1136820k 212452 pages Proc: Interrupts r p d s w Csw Trp Sys Int Sof Flt cow 16018 total 491 533 28 576 19 179 174 zfod atkbd0 1 ozfod uart0 irq4 12.9%Sys 12.5%Intr 0.1%User 0.0%Nice 74.5%Idle %ozfod ata0 irq14 | | | | | | | | | | | daefr uhci0 45 ======+++++++ prcfr 2000 cpu0: time 87 dtbuf totfr 19 bce0 256 Namei Name-cache Dir-cache 100000 desvn react ciss0 257 Calls hits % hits % 40811 numvn pdwak 2000 cpu1: time 17300 frevn pdpgs 2000 cpu4: time intrn 2000 cpu5: time Disks da0 da1 da2 da3 da4 pass0 pass1 4995516 wire 2000 cpu6: time KB/t 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2593276 act 1999 cpu7: time tps 0 0 0 0 0 0 0 369560 inact 2000 cpu3: time MB/s 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9568 cache 2000 cpu2: time %busy 0 0 0 0 0 0 0 12348752 free 1252272 buf ----------- top -HS ------------ last pid: 42561; load averages: 0.35, 0.38, 0.46 up 0+11:24:36 21:53:39 658 processes: 13 running, 623 sleeping, 21 waiting, 1 lock CPU: 0.6% user, 0.0% nice, 12.6% system, 25.0% interrupt, 61.8% idle Mem: 2559M Active, 363M Inact, 4892M Wired, 9548K Cache, 1223M Buf, 12G Free Swap: 21G Total, 21G Free PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND 11 root 171 ki31 0K 128K CPU3 3 672:22 100.00% {idle: cpu3} 11 root 171 ki31 0K 128K CPU1 1 663:50 100.00% {idle: cpu1} 11 root 171 ki31 0K 128K RUN 4 649:18 100.00% {idle: cpu4} 12 root -32 - 0K 384K CPU7 0 6:38 100.00% {swi4: clock} 4 root -8 - 0K 16K CPU5 5 1:16 100.00% g_down 12 root -64 - 0K 384K CPU2 0 1:15 100.00% {swi2: cambio} 11 root 171 ki31 0K 128K CPU6 6 672:13 97.27% {idle: cpu6} 11 root 171 ki31 0K 128K CPU0 0 622:18 96.29% {idle: cpu0} 11 root 171 ki31 0K 128K RUN 7 676:57 10.99% {idle: cpu7} 2046 zope1 46 0 468M 379M ucond 6 7:30 1.76% {python} 11 root 171 ki31 0K 128K RUN 5 663:27 0.00% {idle: cpu5} 11 root 171 ki31 0K 128K RUN 2 656:03 0.00% {idle: cpu2} 12 root -68 - 0K 384K WAIT 0 14:54 0.00% {irq256: bce0} 2046 zope1 46 0 468M 379M ucond 4 6:02 0.00% {python} 1719 zope2 48 0 263M 178M ucond 3 2:00 0.00% {python2.4} 1724 zope2 46 0 321M 235M ucond 5 1:52 0.00% {python2.4} 6314 root 45 0 10432K 2564K ttyin 4 1:36 0.00% systat 3 root -8 - 0K 16K - 2 1:29 0.00% g_up 6301 zope4 46 0 311M 225M ucond 1 1:28 0.00% {python2.4} 2320 zope5 44 0 281M 197M ucond 1 1:22 0.00% {python2.4} 20 root 44 - 0K 16K zio->i 6 1:21 0.00% syncer 6301 zope4 46 0 311M 225M ucond 5 1:19 0.00% {python2.4} 1729 zope6 44 0 311M 228M ucond 4 1:15 0.00% {python2.4} 2320 zope5 45 0 281M 197M ucond 4 1:13 0.00% {python2.4} 2320 zope5 50 0 281M 197M ucond 7 1:08 0.00% {python2.4} 6301 zope4 44 0 311M 225M ucond 4 1:02 0.00% {python2.4} 5138 dave 44 0 341M 248M select 4 0:58 0.00% {initial thread} 1729 zope6 45 0 311M 228M ucond 6 0:56 0.00% {python2.4} 8158 19421 44 0 27644K 7560K select 4 0:52 0.00% centerim 1743 zope7 44 0 287M 201M zio->i 1 0:51 0.00% {python} 2046 zope1 44 0 468M 379M select 1 0:48 0.00% {initial thread} 19943 19420 44 0 31740K 13044K select 4 0:46 0.00% centerim 6301 zope4 44 0 311M 225M ucond 0 0:40 0.00% {python2.4} 1743 zope7 44 0 287M 201M select 0 0:38 0.00% {initial thread} 2052 zope8 44 0 300M 212M select 3 0:37 0.00% {initial thread} 1724 zope3 46 0 321M 235M ucond 4 0:33 0.00% {python2.4} 2320 zope5 44 0 281M 197M ucond 1 0:31 0.00% {python2.4} 2052 zope8 45 0 300M 212M ucond 7 0:30 0.00% {python} 2320 zope5 44 0 281M 197M select 6 0:29 0.00% {initial thread} 12 root -64 - 0K 384K *cissm 1 0:28 0.00% {irq257: ciss0} 6301 zope4 44 0 311M 225M select 6 0:28 0.00% {initial thread} 0 root -16 0 0K 816K - 0 0:25 0.00% {spa_zio} 2052 zope8 47 0 300M 212M ucond 2 0:24 0.00% {python} 1729 zope6 44 0 311M 228M select 3 0:24 0.00% {initial thread} 1719 zope2 44 0 263M 178M select 1 0:23 0.00% {initial thread} 1724 zope3 44 0 321M 235M select 1 0:23 0.00% {initial thread} 12 root -48 - 0K 384K WAIT 0 0:22 0.00% {swi0: uart uart} 1734 zope9 44 0 249M 163M select 4 0:21 0.00% {initial thread} 0 root -16 0 0K 816K - 4 0:16 0.00% {spa_zio_6} 0 root -16 0 0K 816K - 3 0:16 0.00% {spa_zio_4} 0 root -16 0 0K 816K - 0 0:16 0.00% {spa_zio_1} 0 root -16 0 0K 816K - 4 0:16 0.00% {spa_zio_2} 0 root -16 0 0K 816K - 1 0:16 0.00% {spa_zio_5} 0 root -16 0 0K 816K - 2 0:16 0.00% {spa_zio_0} 0 root -16 0 0K 816K - 1 0:16 0.00% {spa_zio_7} 0 root -16 0 0K 816K - 5 0:16 0.00% {spa_zio_3} 4288 mysql 44 0 62460K 21608K select 4 0:16 0.00% {mysqld} 59 root -8 - 0K 92K tx->tx 0 0:12 0.00% {txg_thread_enter} 6129 mysql 44 0 62788K 23420K select 0 0:12 0.00% {mysqld} 59 root -8 - 0K 92K vgeom: 5 0:09 0.00% {vdev da2} 59 root -8 - 0K 92K vgeom: 6 0:09 0.00% {vdev da3} 4288 mysql 44 0 62460K 21608K select 0 0:08 0.00% {mysqld} 6129 mysql 44 0 62788K 23420K select 0 0:08 0.00% {mysqld} 13 root 44 - 0K 16K - 0 0:08 0.00% yarrow 0 root -16 0 0K 816K - 0 0:08 0.00% {spa_zio_6} 0 root -16 0 0K 816K - 7 0:08 0.00% {spa_zio_5} 0 root -16 0 0K 816K - 5 0:08 0.00% {spa_zio_2} 0 root -16 0 0K 816K - 2 0:08 0.00% {spa_zio_4} 0 root -16 0 0K 816K - 3 0:08 0.00% {spa_zio_3} 0 root -16 0 0K 816K - 4 0:07 0.00% {spa_zio_1} 0 root -16 0 0K 816K - 0 0:07 0.00% {spa_zio_7} 0 root -16 0 0K 816K - 1 0:07 0.00% {spa_zio_0} 5099 root 44 0 11380K 4292K CPU4 4 0:07 0.00% top 12 root -44 - 0K 384K WAIT 0 0:07 0.00% {swi1: netisr 0} 4864 news 45 0 24964K 12108K nanslp 6 0:06 0.00% perl 5138 dave 76 0 341M 248M ucond 7 0:05 0.00% {python} 4863 news 76 0 8260K 1920K wait 5 0:05 0.00% sh 5155 root 44 0 442M 6592K kqread 1 0:04 0.00% {cache-kqueue} 1734 zope9 44 0 249M 163M ucond 4 0:04 0.00% {python2.4} 4862 news 44 0 139M 43180K select 6 0:04 0.00% innd 8173 19421 44 0 30928K 8496K select 4 0:04 0.00% irssi