Date: Sun, 31 Jul 2005 15:47:37 -0400 From: Stephan Uphoff <ups@tree.com> To: Matthew Dillon <dillon@apollo.backplane.com> Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-hackers@freebsd.org Subject: Re: Possible softupdates bug when a indirect block buffer is reused Message-ID: <1122839257.1360.5.camel@palm> In-Reply-To: <200507311840.j6VIeWGv028507@apollo.backplane.com> References: <200507311840.j6VIeWGv028507@apollo.backplane.com>
next in thread | previous in thread | raw e-mail | index | archive | help
Hi Matthew, We have been testing a fix for this for a few weeks. I will check it in today. Stephan On Sun, 2005-07-31 at 14:40, Matthew Dillon wrote: > Hi Kirk, hackers! > > I'm trying to track down a bug that is causing a buffer to be left > in a locked state and then causes the filesystem to lock up because > of that. > > The symptoms are that a heavily used filesystem suddenly starts running > out of space. It isn't due to deleted files with open descriptors, it's > due to the syncer getting stuck in a getblk state. This is in DragonFly, > but I can't find anything DFlyish wrong so I'm beginning to think it may > be an actual bug in softupdates. > > I have wound up with a situation where a getblk()'d bp has been > associated with a indirdep dependancy, i.e. stored in > indirdep->ir_savebp, but is never released. When something like > the syncer comes along and tries to access it, it locks up, and this > of course leads to inodes not getting cleared and the filesystem > eventually runs out of space when a lot of files are being created and > deleted. > > What has got me really confused is that the buffer in question seems to > wind up with a D_INDIRDEP dependancy that points back to itself. > > Here's the situation from a live gdb. Here is where the syncer is > stuck: > > (kgdb) back > #0 lwkt_switch () at thread2.h:95 > #1 0xc02a8a79 in tsleep (ident=0x0, flags=0, wmesg=0xc04eadb0 "getblk", > timo=0) at /usr/src-125beta/sys/kern/kern_synch.c:428 > #2 0xc02956bb in acquire (lkp=0xc758b4e0, extflags=33554464, wanted=1536) > at /usr/src-125beta/sys/kern/kern_lock.c:127 > #3 0xc0295a92 in lockmgr (lkp=0xc758b4e0, flags=33620002, interlkp=0x0, > td=0xd68f6400) at /usr/src-125beta/sys/kern/kern_lock.c:354 > #4 0xc02d6828 in getblk (vp=0xc71b3058, blkno=94440240, size=8192, slpflag=0, > slptimeo=0) at thread.h:79 > #5 0xc02d4404 in bread (vp=0xc71b3058, blkno=0, size=0, bpp=0x0) > at /usr/src-125beta/sys/kern/vfs_bio.c:567 > #6 0xc03f24fe in indir_trunc (ip=0xe048fc0c, dbn=94440240, level=1, lbn=2060, > countp=0xe048fbf8) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2221 > #7 0xc03f22df in handle_workitem_freeblocks (freeblks=0xe2fcef98) > at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2138 > #8 0xc03f0462 in process_worklist_item (matchmnt=0x0, flags=0) > at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:726 > #9 0xc03f026c in softdep_process_worklist (matchmnt=0x0) > at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:625 > #10 0xc02e5ff3 in sched_sync () at /usr/src-125beta/sys/kern/vfs_sync.c:244 > #11 0xc0294863 in kthread_create_stk (func=0, arg=0x0, tdp=0xff800000, > stksize=0, fmt=0x0) at /usr/src-125beta/sys/kern/kern_kthread.c:104 > (kgdb) > > The buffer it is stuck on: > > (kgdb) print bp > $62 = (struct buf *) 0xc758b4b8 > (kgdb) print *bp > $63 = { > b_hash = { > le_next = 0x0, > le_prev = 0xc7391348 > }, > b_vnbufs = { > tqe_next = 0xc739b890, > tqe_prev = 0xc76f32b8 > }, > b_freelist = { > tqe_next = 0xc768d610, > tqe_prev = 0xc0565ac0 > }, > b_act = { > tqe_next = 0x0, > tqe_prev = 0x0 > }, > b_flags = 536870912, <<<<<<<<< 0x20000000 (getblk with no bread, etc) > b_qindex = 0, > b_xflags = 2 '\002', > b_lock = { > lk_interlock = { > t_cpu = 0xff800000, > t_reqcpu = 0xff800000, > t_unused01 = 0 > }, > lk_flags = 2098176, > lk_sharecount = 0, > lk_waitcount = 1, > lk_exclusivecount = 1, > lk_prio = 0, > lk_wmesg = 0xc04eadb0 "getblk", > lk_timo = 0, > lk_lockholder = 0xfffffffe > }, > b_error = 0, > b_bufsize = 8192, > b_runningbufspace = 0, > b_bcount = 8192, > b_resid = 0, > b_dev = 0xde0f0e38, > b_data = 0xcf824000 "\205\002", > b_kvabase = 0xcf824000 "\205\002", > b_kvasize = 16384, > b_lblkno = 94440240, > b_blkno = 94440240, > b_offset = 48353402880, > b_iodone = 0, > b_iodone_chain = 0x0, > b_vp = 0xc71b3058, > b_dirtyoff = 0, > b_dirtyend = 0, > b_pblkno = 87503631, > b_saveaddr = 0x0, > b_driver1 = 0x0, > b_caller1 = 0x0, > b_pager = { > pg_spc = 0x0, > pg_reqpage = 0 > }, > b_cluster = { > cluster_head = { > tqh_first = 0x0, > tqh_last = 0xc768d6bc > ---Type <return> to continue, or q <return> to quit--- > }, > cluster_entry = { > tqe_next = 0x0, > tqe_prev = 0xc768d6bc > } > }, > b_xio = { > xio_pages = 0xc758b584, > xio_npages = 2, > xio_offset = 0, > xio_bytes = 0, > xio_flags = 0, > xio_error = 0, > xio_internal_pages = {0xc34e5870, 0xc4aeb2b4, 0x0 <repeats 30 times>} > }, > b_dep = { > lh_first = 0xc7045040 > }, > b_chain = { > parent = 0x0, > count = 0 > } > } > > As you can see from b_flags, which is 0x20000000, the buffer has been > getblk()'d but not bread() or anything else. It is the typical state > that occurs when a buffer is placed in an indirdep->ir_savebp state > in setup_allocindir_phase2(). > > The buffer's dependancy looks like this: > > (kgdb) print bp > $65 = (struct buf *) 0xc758b4b8 > (kgdb) print *(struct indirdep *)bp->b_dep.lh_first > $66 = { > ir_list = { > wk_list = { > le_next = 0x0, > le_prev = 0xc758b604 > }, > wk_type = 5, > wk_state = 33025 <<<<<<<<<<<<< ATTACHED|GOINGAWAY|ONWORKLIST > }, > ir_saveddata = 0xdeadc0de "", > ir_savebp = 0xc758b4b8, <<<<<<<<<<<<< points back to itself > ir_donehd = { > lh_first = 0x0 > }, > ir_deplisthd = { > lh_first = 0x0 > } > } > (kgdb) > > As you can see, the buffer's dependancy seems to point to itself. > As you may know, ir_savebp buffers are left in a locked state, so a > buffer that has gotten into this situation winds up being permanently > deadlocked. > > This is on DragonFly, but I can't find anything else wrong. This is > on a filesystem running a myrid of jails which is constantly creating > and deleting files in parallel... so buffers are being reused quite often. > It takes about a week of this heavy activity for the bug to rear its > ugly head. It's fairly difficult to reproduce (takes about a week to > reproduce). > > When the problem does occur the system remains functional... the disk > device continues to work, the filesystem continues to work, except for > any blockages that chain down to the particular block that has deadlocked > (usually the syncer, but as time goes by more parts of the system will > deadlock). It IS possible to run a live gdb when the situation is > caught early enough. > > I am trying to figure out how the buffer manages to get into this > self-locked state. I'm pretty much stuck once I get to the > deallocate_dependancies() procedure. This procedure seems to be trying > to move a D_INDIRDEP dependancy from the passed bp into the buffer > pointed to by ir_savebp. As far as I can tell this is what is creating > the situation that makes the buffer's dependancy self-referential > and deadlocks the syncer. > > I have noticed a number of FreeBSD bug reports related to blocking in > getblk or to softupdates. I don't know if this is a similar problem > but it's worth Ccing freebsd-hackers on it as well. > > -Matt > > _______________________________________________ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org" > >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1122839257.1360.5.camel>