Date:      Fri, 21 May 1999 18:43:53 -0700 (PDT)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Kevin Day <toasty@home.dragondata.com>
Cc:        current@FreeBSD.ORG
Subject:   Re: -current deadlocks within 5 mins, over NFS
Message-ID:  <199905220143.SAA69156@apollo.backplane.com>
References:   <199905070817.DAA15632@home.dragondata.com>

:
:Matt, I told you about this before, but completely forgot about it. After
:doing considerable testing on my test servers, i thought -current was safe
:enough to try on our production shell servers. I installed -current on one
:of my servers, and to my dismay, it hung. :)
:
:Within 5 minutes of running, nearly every process is blocked on 'inode',
:with the exception of a single 'cp' stuck in vmopar.
:
:I have a very silly, *very* poorly written script i run out of cron, every
:10 mins or so, to update my passwd and group files.
:
:#/bin/sh
:
:cp /home/private/passwd /etc
:cp /home/private/master.passwd /etc
:cp /home/private/group /etc
:rm /etc/spwd.db.tmp >/dev/null 2>&1 
:pwd_mkdb /etc/master.passwd
:
:Kevin
:
    ( Also in a later conversation Kevin indicated that a cron job on the
    server was updating /home/private/, creating a race between the server
    operating on /home/private and the client trying to copy files from
    /home/private.  It is this race which is revealing the bug ).

    I've managed to repeat the problem with two scripts.  On the server:

	while (1)
	    cp file1 file2
	    echo -n "l"
	end

    And on the client:

	while (1)
	    cp file2 /tmp/test3
	    echo -n "c"
	end

    Output on the client:

	ccccccccccccccccp: /tmp/test3: Bad address
	cccccccccccccccp: /tmp/test3: Bad address
	cccp: /tmp/test3: Bad address
	cccccccccp: /tmp/test3: Bad address
	ccccccccccccccccccp: /tmp/test3: Bad address
	cccccccccp: /tmp/test3: Bad address
	cccccccccccccp: /tmp/test3: Bad address
	cccccccccccccccccccccccccccp: /tmp/test3: Bad address
	cccccccccccccccccccccccccccccccccp: /tmp/test3: Bad address
	ccccccccccccccccccccccccccccccccccccccccccccccccccccccccp: /tmp/test3: Bad address
	cc<hang>

    ( The Bad address errors are correct for NFS considering what the server
    is doing to the poor file.  The hang, of course, is not. )

    The cp process on the client gets stuck in vmopar, as previously reported.
    Fortunately I can keep a gdb running against the live kernel on the
    client, so it's easy to see what is going on.

    The problem is a same-process deadlock.  A VM fault occurs accessing an
    NFS-backed page.  The fault locks (PG_BUSY's) the page in question then
    calls vnode_pager_getpages() to bring the page in.  This filters down
    into an nfs_getpages() call which then calls nfs_readrpc().

    nfs_readrpc() normally ( and properly ) tries to keep the vnode
    synchronized to the NFS state returned by the RPC.  The problem is that
    if the state indicates that the server has truncated the file,
    vnode_pager_setsize() will be called ( via nfs_loadattrcache(), see the
    backtrace below ) and will attempt to remove all the pages beyond the
    truncation point from the VM object.

    Unfortunately, at least one of those pages has been locked by the same
    process.  Bewm.  Deadlock.
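
    The call chain above can be modelled in user space.  What follows is
    only a toy sketch, not kernel code: struct toy_page, toy_page_remove()
    and toy_fault() are names made up for illustration, and instead of
    actually sleeping the model reports what the sleeper would be waiting
    for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VM page; NOT the kernel's struct vm_page. */
struct toy_page {
    bool busy;   /* models PG_BUSY */
    int  owner;  /* id of the process that busied the page, -1 if none */
};

enum toy_outcome { TOY_OK, TOY_WOULD_SLEEP, TOY_SELF_DEADLOCK };

/*
 * Models the page-removal step: a busy page cannot be freed, so the
 * caller has to sleep until somebody unbusies it.  If the caller is the
 * one holding the page busy, nobody is left to wake it up.
 */
static enum toy_outcome
toy_page_remove(struct toy_page *pg, int caller)
{
    assert(pg != NULL);
    if (!pg->busy)
        return TOY_OK;                      /* page can be freed */
    return (pg->owner == caller) ? TOY_SELF_DEADLOCK : TOY_WOULD_SLEEP;
}

/*
 * Models the fault path: busy the page, "read" from the server, and if
 * the server reports a truncation, try to remove the page while it is
 * still busied by the faulting process.
 */
static enum toy_outcome
toy_fault(struct toy_page *pg, int caller, bool server_truncated)
{
    assert(pg != NULL);
    pg->busy = true;                        /* the fault busies the page */
    pg->owner = caller;

    if (server_truncated)                   /* setsize path, page still busy */
        return toy_page_remove(pg, caller);

    pg->busy = false;                       /* normal completion */
    pg->owner = -1;
    return TOY_OK;
}
```

    Calling toy_fault() with server_truncated set returns TOY_SELF_DEADLOCK
    for any caller id, which is the state the stuck cp is in.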

    So, how to fix?  The only thing I can think of is to pass a flag to
    nfs_readrpc() so it knows the RPC is related to a VM fault, and to then
    allow nfs_readrpc() to leave np->n_size and vap->va_size *unsynchronized*
    if a file truncation occurs.  i.e. to avoid calling vnode_pager_setsize()
    and thus avoid the deadlock.  This is kinda icky.  We have no opportunity
    anywhere to call vnode_pager_setsize() because the faulted page must remain
    BUSY'd throughout the entire getpages operation.
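
    A rough sketch of what such a flag might look like, again as a
    user-space toy rather than a patch.  The struct, the from_fault
    parameter and the setsize_calls counter are all hypothetical, and
    keeping n_size stale while recording the server's size is just one
    reading of "unsynchronized":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the NFS node state; NOT the kernel's struct nfsnode. */
struct toy_node {
    uint64_t n_size;         /* client's idea of the size (np->n_size) */
    uint64_t va_size;        /* size last reported by the server (va_size) */
    int      setsize_calls;  /* counts stand-in vnode_pager_setsize() calls */
};

/*
 * Apply the size returned by a read RPC.  from_fault is the hypothetical
 * flag: when set and the server has truncated the file, the server size
 * is recorded but n_size is left alone and the pager is not told, so no
 * pages are removed while the faulting page is still busy.
 */
static void
toy_apply_size(struct toy_node *np, uint64_t server_size, bool from_fault)
{
    assert(np != NULL);
    np->va_size = server_size;

    if (from_fault && server_size < np->n_size)
        return;                         /* leave the sizes unsynchronized */

    if (server_size != np->n_size) {
        np->n_size = server_size;
        np->setsize_calls++;            /* stands in for vnode_pager_setsize() */
    }
}
```

    The sketch does not say when the two sizes get reconciled afterwards;
    that part is still open.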

    Comments? ( If I haven't confused the bajeezus out of everyone, that 
    is :-) )

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


	17754 c41d4940 c45e9000   0 15192 15192  804006  S  cp vmopar c049b930

	vm_page_t 0xc049b930:

	  object = 0xc46297b4, 
	  pindex = 0x0, 
	  phys_addr = 0x2732000, 
	  queue = 0x0, 
	  flags = 0x83, 	(PG_BUSY|PG_WANTED|PG_REFERENCED)
	  pc = 0x32, 
	  wire_count = 0x0, 
	  hold_count = 0x0, 
	  act_count = 0x0, 
	  busy = 0x0, 
	  valid = 0x0, 
	  dirty = 0x0
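
	The flags annotation above can be reproduced with a small decoder.
	This is a sketch: the three bit values are assumptions chosen to be
	consistent with 0x83 decoding to the names shown, and should be
	checked against vm/vm_page.h for the kernel in question.
	toy_decode_flags() is a made-up helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit values, consistent with 0x83 above; verify in vm_page.h. */
#define TOY_PG_BUSY        0x01u
#define TOY_PG_WANTED      0x02u
#define TOY_PG_REFERENCED  0x80u

/* Write the names of the known bits set in flags into out, '|'-separated. */
static void
toy_decode_flags(unsigned flags, char *out, size_t outlen)
{
    static const struct { unsigned bit; const char *name; } names[] = {
        { TOY_PG_BUSY,       "PG_BUSY" },
        { TOY_PG_WANTED,     "PG_WANTED" },
        { TOY_PG_REFERENCED, "PG_REFERENCED" },
    };

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if ((flags & names[i].bit) == 0)
            continue;
        if (out[0] != '\0')
            strncat(out, "|", outlen - strlen(out) - 1);
        strncat(out, names[i].name, outlen - strlen(out) - 1);
    }
}
```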

	
#0  mi_switch () at ../../kern/kern_synch.c:827
#1  0xc0137f21 in tsleep (ident=0xc049b930, priority=0x4, 
    wmesg=0xc023bac7 "vmopar", timo=0x0) at ../../kern/kern_synch.c:443
#2  0xc01e8f12 in vm_object_page_remove (object=0xc46297b4, start=0x0, 
    end=0x1, clean_only=0x0) at ../../vm/vm_page.h:555
#3  0xc01ed93f in vnode_pager_setsize (vp=0xc46208c0, nsize=0x0000000000000000)
    at ../../vm/vnode_pager.c:285
#4  0xc01a3017 in nfs_loadattrcache (vpp=0xc45eab94, mdp=0xc45eaba0, 
    dposp=0xc45eaba4, vaper=0x0) at ../../nfs/nfs_subs.c:1383
#5  0xc01abc7c in nfs_readrpc (vp=0xc46208c0, uiop=0xc45eac08, cred=0xc09ba400)
    at ../../nfs/nfs_vnops.c:1060
#6  0xc0184f05 in nfs_getpages (ap=0xc45eac44) at ../../nfs/nfs_bio.c:154
#7  0xc01edefa in vnode_pager_getpages (object=0xc46297b4, m=0xc45eacec, 
    count=0x1, reqpage=0x0) at vnode_if.h:1067
#8  0xc01e2069 in vm_fault (map=0xc41d8d40, vaddr=0x28057000, fault_type=0x1, 
    fault_flags=0x0) at ../../vm/vm_pager.h:130
#9  0xc0207508 in trap_pfault (frame=0xc45ead94, usermode=0x0, eva=0x28057000)
    at ../../i386/i386/trap.c:791


