From owner-freebsd-current@FreeBSD.ORG Fri Jul 12 21:34:31 2013
Date: Fri, 12 Jul 2013 23:34:18 +0200
From: Ian FREISLICH <ianf@clue.co.za>
To: Konstantin Belousov
Cc: freebsd-current@freebsd.org
Subject: Re: Filesystem wedges caused by r251446
In-Reply-To: <20130712201051.GI91021@kib.kiev.ua>
References: <20130712201051.GI91021@kib.kiev.ua>
 <201307110923.06548.jhb@freebsd.org>
 <201307091202.24493.jhb@freebsd.org>
List-Id: Discussions about the use of FreeBSD-current

Konstantin Belousov wrote:
> On Fri, Jul 12, 2013 at 08:11:36PM +0200, Ian FREISLICH wrote:
> > John Baldwin wrote:
> > > On Thursday, July 11, 2013 6:54:35 am Ian FREISLICH wrote:
> > > > John Baldwin wrote:
> > > > > On Thursday, July 04, 2013 5:03:29 am Ian FREISLICH wrote:
> > > > > > Konstantin Belousov wrote:
> > > > > > >
> > > > > > > Care to provide any useful information?
> > > > > > >
> > > > > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > > > > >
> > > > > > Well, the system doesn't deadlock; it's perfectly usable so long
> > > > > > as you don't touch the file that's wedged.  A lot of the time the
> > > > > > userland process is unkillable, but sometimes it is killable.  How
> > > > > > do I get from the PID to where the FS is stuck in the kernel?
> > > > >
> > > > > Use kgdb.  'proc <pid>', then 'bt'.
> > > >
> > > > So, I set up a remote kgdb session, but I still can't figure out how
> > > > to get at the information we need.
> > > >
> > > > (kgdb) proc 5176
> > > > only supported for core file target
> > > >
> > > > In the meantime, I'll just force it to make a core dump from ddb.
> > > > However, I can't recreate the issue while the mirror (gmirror) is
> > > > rebuilding, so we'll have to wait for that to finish.
> > >
> > > Sorry, just run 'sudo kgdb' on the box itself.  You can inspect the
> > > running kernel without having to stop it.
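[A live-kernel session of the sort described above looks roughly like
this; a minimal sketch, assuming the stock kgdb(1), which with no
arguments opens /boot/kernel/kernel against the running kernel via
/dev/mem, and a PID taken from ps(1):

    $ sudo kgdb
    (kgdb) proc 23147            (switch to the hung process's thread)
    (kgdb) bt                    (backtrace of its kernel stack)
    (kgdb) frame 4               (select the frame of interest)
    (kgdb) print runningbufreq   (inspect globals in the live kernel)
]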
> > So, this machine's installworld *always* stalls installing clang.
> > The install can be stopped (ctrl-c), leaving behind this process:
> >
> > root 23147  0.0  0.0  9268  1512  1  D  7:51PM  0:00.01 install -s -o root -g wheel -m 555 clang /usr/bin/clang
> >
> > This is the backtrace from kgdb.  I suspect frame 4.
> >
> > (kgdb) proc 23147
> > [Switching to thread 117 (Thread 100059)]#0  sched_switch (
> >     td=0xfffffe000c012920, newtd=0x0, flags=<value optimized out>)
> >     at /usr/src/sys/kern/sched_ule.c:1954
> > 1954            cpuid = PCPU_GET(cpuid);
> > Current language:  auto; currently minimal
> > (kgdb) bt
> > #0  sched_switch (td=0xfffffe000c012920, newtd=0x0,
> >     flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1954
> > #1  0xffffffff8047539e in mi_switch (flags=260, newtd=0x0)
> >     at /usr/src/sys/kern/kern_synch.c:487
> > #2  0xffffffff804acbea in sleepq_wait (wchan=0x0, pri=0)
> >     at /usr/src/sys/kern/subr_sleepqueue.c:620
> > #3  0xffffffff80474ee9 in _sleep (ident=<value optimized out>,
> >     lock=0xffffffff80a20300, priority=84, wmesg=0xffffffff8071129a "wdrain",
> >     sbt=<value optimized out>, pr=0, flags=<value optimized out>)
> >     at /usr/src/sys/kern/kern_synch.c:249
> > #4  0xffffffff804e6523 in waitrunningbufspace ()
> >     at /usr/src/sys/kern/vfs_bio.c:564
> > #5  0xffffffff804e6073 in bufwrite (bp=<value optimized out>)
> >     at /usr/src/sys/kern/vfs_bio.c:1226
> > #6  0xffffffff804f05ed in cluster_wbuild (vp=0xfffffe008fec4000, size=32768,
> >     start_lbn=136, len=<value optimized out>, gbflags=<value optimized out>)
> >     at /usr/src/sys/kern/vfs_cluster.c:1002
> > #7  0xffffffff804efbc3 in cluster_write (vp=0xfffffe008fec4000,
> >     bp=0xffffff80f83da6f0, filesize=4456448, seqcount=127,
> >     gbflags=<value optimized out>) at /usr/src/sys/kern/vfs_cluster.c:592
> > #8  0xffffffff805c1032 in ffs_write (ap=0xffffff8121c81850)
> >     at /usr/src/sys/ufs/ffs/ffs_vnops.c:801
> > #9  0xffffffff8067fe21 in VOP_WRITE_APV (vop=<value optimized out>,
> >     a=<value optimized out>) at vnode_if.c:999
> > #10 0xffffffff80511eca in vn_write (fp=0xfffffe006a5f7410,
> >     uio=0xffffff8121c81a90, active_cred=0x0, flags=<value optimized out>,
> >     td=0x0) at vnode_if.h:413
> > #11 0xffffffff8050eb3a in vn_io_fault (fp=0xfffffe006a5f7410,
> >     uio=0xffffff8121c81a90, active_cred=0xfffffe006a6ca000, flags=0,
> >     td=0xfffffe000c012920) at /usr/src/sys/kern/vfs_vnops.c:983
> > #12 0xffffffff804b506a in dofilewrite (td=0xfffffe000c012920, fd=5,
> >     fp=0xfffffe006a5f7410, auio=0xffffff8121c81a90,
> >     offset=<value optimized out>, flags=0) at file.h:290
> > #13 0xffffffff804b4cde in sys_write (td=0xfffffe000c012920,
> >     uap=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:460
> > #14 0xffffffff8061807a in amd64_syscall (td=0xfffffe000c012920, traced=0)
> >     at subr_syscall.c:134
> > #15 0xffffffff806017ab in Xfast_syscall ()
> >     at /usr/src/sys/amd64/amd64/exception.S:387
> > #16 0x000000000044e75a in ?? ()
> > Previous frame inner to this frame (corrupt stack?)
> > (kgdb)
>
> Please apply the (mostly debugging) patch below, then reproduce the issue.
> I need the backtrace of the 'main' hung process, assuming it is stuck
> in waitrunningbufspace().  Also, from the same kgdb session, print
> runningbufreq, runningbufspace and lorunningspace.

This is the hung process after applying the patch.  I see no console
messages on recreating the issue.
(kgdb) proc 23113
[Switching to thread 102 (Thread 100146)]#0  sched_switch (
    td=0xfffffe006971e000, newtd=0x0, flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1954
1954            cpuid = PCPU_GET(cpuid);
Current language:  auto; currently minimal
(kgdb) bt
#0  sched_switch (td=0xfffffe006971e000, newtd=0x0,
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1954
#1  0xffffffff8047539e in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:487
#2  0xffffffff804acbea in sleepq_wait (wchan=0x0, pri=0)
    at /usr/src/sys/kern/subr_sleepqueue.c:620
#3  0xffffffff80474ee9 in _sleep (ident=<value optimized out>,
    lock=0xffffffff80a20300, priority=84, wmesg=0xffffffff8071129a "wdrain",
    sbt=<value optimized out>, pr=0, flags=<value optimized out>)
    at /usr/src/sys/kern/kern_synch.c:249
#4  0xffffffff804e6527 in waitrunningbufspace ()
    at /usr/src/sys/kern/vfs_bio.c:566
#5  0xffffffff804e6073 in bufwrite (bp=<value optimized out>)
    at /usr/src/sys/kern/vfs_bio.c:1228
#6  0xffffffff804f05ed in cluster_wbuild (vp=0xfffffe007f90d750, size=32768,
    start_lbn=825, len=<value optimized out>, gbflags=<value optimized out>)
    at /usr/src/sys/kern/vfs_cluster.c:1002
#7  0xffffffff804efbc3 in cluster_write (vp=0xfffffe007f90d750,
    bp=0xffffff80f86bcd90, filesize=27033600, seqcount=127,
    gbflags=<value optimized out>) at /usr/src/sys/kern/vfs_cluster.c:592
#8  0xffffffff805c1032 in ffs_write (ap=0xffffff8121e34850)
    at /usr/src/sys/ufs/ffs/ffs_vnops.c:801
#9  0xffffffff8067fe21 in VOP_WRITE_APV (vop=<value optimized out>,
    a=<value optimized out>) at vnode_if.c:999
#10 0xffffffff80511eca in vn_write (fp=0xfffffe000ccb5870,
    uio=0xffffff8121e34a90, active_cred=0x0, flags=<value optimized out>,
    td=0x0) at vnode_if.h:413
#11 0xffffffff8050eb3a in vn_io_fault (fp=0xfffffe000ccb5870,
    uio=0xffffff8121e34a90, active_cred=0xfffffe000cf28800, flags=0,
    td=0xfffffe006971e000) at /usr/src/sys/kern/vfs_vnops.c:983
#12 0xffffffff804b506a in dofilewrite (td=0xfffffe006971e000, fd=5,
    fp=0xfffffe000ccb5870, auio=0xffffff8121e34a90,
    offset=<value optimized out>, flags=0) at file.h:290
#13 0xffffffff804b4cde in sys_write (td=0xfffffe006971e000,
    uap=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:460
#14 0xffffffff8061807a in amd64_syscall (td=0xfffffe006971e000, traced=0)
    at subr_syscall.c:134
#15 0xffffffff806017ab in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:387
#16 0x000000000044e75a in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) frame 4
#4  0xffffffff804e6527 in waitrunningbufspace ()
    at /usr/src/sys/kern/vfs_bio.c:566
566             msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
(kgdb) print runningbufreq
$1 = 1
(kgdb) print runningbufspace
$2 = 0
(kgdb) print lorunningspace
$3 = 4587520
(kgdb) print hirunningspace
$4 = 4194304

-- 
Ian Freislich
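[For reference: frame 4 shows the thread asleep on "wdrain" even though
runningbufspace has already drained to zero, and the printed
lorunningspace (4587520) sits above hirunningspace (4194304), although
the low watermark normally sits below the high one.  What follows is a
paraphrased sketch of the wait/wakeup pair involved, modelled on
sys/kern/vfs_bio.c of that era; it is not the exact patched source:

    /* Writers throttle here when too much write I/O is in flight. */
    void
    waitrunningbufspace(void)
    {
            mtx_lock(&rbreqlock);
            while (runningbufspace > hirunningspace) {
                    runningbufreq = 1;      /* announce a waiting writer */
                    msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
            }
            mtx_unlock(&rbreqlock);
    }

    /* Runs as each in-flight write completes. */
    static void
    runningbufwakeup(struct buf *bp)
    {
            if (bp->b_runningbufspace == 0)
                    return;
            atomic_subtract_long(&runningbufspace, bp->b_runningbufspace);
            bp->b_runningbufspace = 0;
            mtx_lock(&rbreqlock);
            if (runningbufreq && runningbufspace <= lorunningspace) {
                    runningbufreq = 0;
                    wakeup(&runningbufreq);
            }
            mtx_unlock(&rbreqlock);
    }

With runningbufreq still 1 while runningbufspace is 0, the wakeup above
evidently never reached the sleeper, which matches the observed hang.]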