Date: Fri, 21 May 1999 18:43:53 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Kevin Day <toasty@home.dragondata.com>
Cc: current@FreeBSD.ORG
Subject: Re: -current deadlocks within 5 mins, over NFS
Message-ID: <199905220143.SAA69156@apollo.backplane.com>
References: <199905070817.DAA15632@home.dragondata.com>

:
:Matt, I told you about this before, but completely forgot about it. After
:doing considerable testing on my test servers, i thought -current was safe
:enough to try on our production shell servers. I installed -current on one
:of my servers, and to my dismay, it hung. :)
:
:Within 5 minutes of running, nearly every process is blocked on 'inode',
:with the exception of a single 'cp' stuck in vmopar.
:
:I have a very silly, *very* poorly written script i run out of cron, every
:10 mins or so, to update my passwd and group files.
:
:#/bin/sh
:
:cp /home/private/passwd /etc
:cp /home/private/master.passwd /etc
:cp /home/private/group /etc
:rm /etc/spwd.db.tmp >/dev/null 2>&1
:pwd_mkdb /etc/master.passwd
:
:Kevin
:
(Also, in a later conversation, Kevin indicated that a cron job on the
server was updating /home/private/, creating a race between the server
operating on /home/private and the client trying to copy files from
/home/private. It is this race that reveals the bug.)

I've managed to repeat the problem with two scripts. On the server:

    while (1)
        cp file1 file2
        echo -n "l"
    end

And on the client:

    while (1)
        cp file2 /tmp/test3
        echo -n "C"
    end

Output on the client:

ccccccccccccccccp: /tmp/test3: Bad address
cccccccccccccccp: /tmp/test3: Bad address
cccp: /tmp/test3: Bad address
cccccccccp: /tmp/test3: Bad address
ccccccccccccccccccp: /tmp/test3: Bad address
cccccccccp: /tmp/test3: Bad address
cccccccccccccp: /tmp/test3: Bad address
cccccccccccccccccccccccccccp: /tmp/test3: Bad address
cccccccccccccccccccccccccccccccccp: /tmp/test3: Bad address
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccp: /tmp/test3: Bad address
cc<hang>

(The "Bad address" errors are correct for NFS considering what the server
is doing to the poor file. The hang, of course, is not.)

The cp process on the client gets stuck in vmopar, as previously reported.
Fortunately, I can have a gdb already running on the client against the
live kernel, so it's easy to see what is going on.

The problem is a same-process deadlock. A VM fault occurs accessing an
NFS-backed page. The fault locks (PG_BUSY's) the page in question, then
calls vnode_pager_getpages() to bring the page in. This filters down
into an nfs_getpages() call, which then calls nfs_readrpc().

nfs_readrpc() normally (and properly) tries to keep the vnode
synchronized to the NFS state returned by the RPC. The problem is that
if the state indicates that the server has truncated the file,
vnode_pager_setsize() will be called (via nfs_loadattrcache() in the
backtrace below) and will attempt to remove all the pages beyond the
truncation point from the VM object.

Unfortunately, at least one of those pages has been locked by the same
process. Bewm. Deadlock.
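
The call chain above can be modeled in user space. This is only a minimal sketch, not kernel code: PAGE_BUSY stands in for PG_BUSY, and struct page, busy_owner, page_remove() and fault_on_truncated_file() are hypothetical names invented for the illustration. In particular, the real vm_page records no owner for the busy bit, which is why the kernel simply sleeps in vmopar instead of noticing the problem.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BUSY 0x0001 /* stand-in for PG_BUSY */

/* Toy page. busy_owner is hypothetical: it exists only so the model can
 * tell who set the busy bit. */
struct page {
    int flags;
    int busy_owner;
};

enum remove_result {
    REMOVED,       /* page was not busy, removal can proceed */
    WOULD_SLEEP,   /* busied by someone else, who will eventually wake us */
    SELF_DEADLOCK  /* busied by the caller, so nobody is left to wake us */
};

/* Models the page-removal step: a busy page makes the caller wait. */
static enum remove_result page_remove(const struct page *pg, int caller)
{
    assert(pg != NULL);
    if ((pg->flags & PAGE_BUSY) == 0)
        return REMOVED;
    return pg->busy_owner == caller ? SELF_DEADLOCK : WOULD_SLEEP;
}

/* Models the fault path: the fault busies the page, the read RPC then
 * reports a truncated file, and the resize tries to remove that page. */
static enum remove_result fault_on_truncated_file(struct page *pg, int caller)
{
    assert(pg != NULL);
    pg->flags |= PAGE_BUSY;         /* done by the fault handler */
    pg->busy_owner = caller;
    return page_remove(pg, caller); /* reached through the resize */
}
```

Calling fault_on_truncated_file() on an idle page yields SELF_DEADLOCK in this model, because the process that must clear the busy bit is the same one waiting for it to clear.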

So, how to fix? The only thing I can think of is to pass a flag to
nfs_readrpc() so it knows the RPC is related to a VM fault, and to then
allow nfs_readrpc() to leave np->n_size and vap->va_size *unsynchronized*
if a file truncation occurs, i.e. to avoid calling vnode_pager_setsize()
and thus avoid the deadlock. This is kinda icky. We have no opportunity
anywhere to call vnode_pager_setsize() because the faulted page must remain
BUSY'd throughout the entire getpages operation.
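
The flag idea might look roughly like the following user-space sketch. Every name in it is an assumption made for illustration: RPC_FROM_FAULT, struct node_model, model_setsize() and model_load_attrs() do not exist in the NFS code, and the size_stale marker (with a later non-fault RPC catching up) is one possible way to deal with the deferred resize, not something proposed in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RPC_FROM_FAULT 0x1 /* hypothetical flag: RPC issued on behalf of a VM fault */

/* Toy model of the per-file state involved. */
struct node_model {
    uint64_t cached_size;   /* stands in for np->n_size */
    uint64_t object_size;   /* size the VM object currently believes */
    bool     size_stale;    /* hypothetical: a truncation was seen but deferred */
    int      setsize_calls; /* how many times the resize path ran */
};

/* Stands in for the resize step that would remove pages past nsize. */
static void model_setsize(struct node_model *n, uint64_t nsize)
{
    n->object_size = nsize;
    n->setsize_calls++;
}

/* Stands in for attribute processing after a read RPC. */
static void model_load_attrs(struct node_model *n, uint64_t server_size,
                             int flags)
{
    bool truncated;

    assert(n != NULL);
    truncated = server_size < n->object_size;
    if (truncated && (flags & RPC_FROM_FAULT)) {
        /* The faulting process still holds a busy page in this object,
         * so leave the sizes unsynchronized instead of resizing. */
        n->size_stale = true;
        return;
    }
    /* Normal path: keep the cached size and the VM object in step. */
    n->cached_size = server_size;
    model_setsize(n, server_size);
    n->size_stale = false;
}
```

In this model a truncation reported during a fault leaves object_size untouched and sets size_stale, while the same report on an ordinary read resizes as before. The cost is the one noted above: the cached size and the server's size disagree until something outside the fault path resynchronizes them.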
Comments? ( If I haven't confused the bajeezus out of everyone, that
is :-) )
-Matt
Matthew Dillon
<dillon@backplane.com>

17754 c41d4940 c45e9000 0 15192 15192 804006 S cp vmopar c049b930

vm_page_t 0xc049b930:
    object = 0xc46297b4,
    pindex = 0x0,
    phys_addr = 0x2732000,
    queue = 0x0,
    flags = 0x83, (PG_BUSY|PG_WANTED|PG_REFERENCED)
    pc = 0x32,
    wire_count = 0x0,
    hold_count = 0x0,
    act_count = 0x0,
    busy = 0x0,
    valid = 0x0,
    dirty = 0x0

#0 mi_switch () at ../../kern/kern_synch.c:827
#1 0xc0137f21 in tsleep (ident=0xc049b930, priority=0x4,
    wmesg=0xc023bac7 "vmopar", timo=0x0) at ../../kern/kern_synch.c:443
#2 0xc01e8f12 in vm_object_page_remove (object=0xc46297b4, start=0x0,
    end=0x1, clean_only=0x0) at ../../vm/vm_page.h:555
#3 0xc01ed93f in vnode_pager_setsize (vp=0xc46208c0, nsize=0x0000000000000000)
    at ../../vm/vnode_pager.c:285
#4 0xc01a3017 in nfs_loadattrcache (vpp=0xc45eab94, mdp=0xc45eaba0,
    dposp=0xc45eaba4, vaper=0x0) at ../../nfs/nfs_subs.c:1383
#5 0xc01abc7c in nfs_readrpc (vp=0xc46208c0, uiop=0xc45eac08, cred=0xc09ba400)
    at ../../nfs/nfs_vnops.c:1060
#6 0xc0184f05 in nfs_getpages (ap=0xc45eac44) at ../../nfs/nfs_bio.c:154
#7 0xc01edefa in vnode_pager_getpages (object=0xc46297b4, m=0xc45eacec,
    count=0x1, reqpage=0x0) at vnode_if.h:1067
#8 0xc01e2069 in vm_fault (map=0xc41d8d40, vaddr=0x28057000, fault_type=0x1,
    fault_flags=0x0) at ../../vm/vm_pager.h:130
#9 0xc0207508 in trap_pfault (frame=0xc45ead94, usermode=0x0, eva=0x28057000)
    at ../../i386/i386/trap.c:791
