Date:      Sun, 10 Jan 1999 22:05:22 +1000
From:      Stephen McKay <syssgm@dtir.qld.gov.au>
To:        dg@FreeBSD.ORG
Cc:        freebsd-current@FreeBSD.ORG, syssgm@dtir.qld.gov.au
Subject:   Hangs on "inode" and "thrd_sleep"
Message-ID:  <199901101205.WAA21649@nymph.dtir.qld.gov.au>

My test machine hung last night during 'make -j5 buildworld' with 7 processes
in "thrd_sleep" and 2 in "inode".  Thus began a marathon DDB session
(punctuated by some reluctant sleep).

The machine is a 486DX2/66 with 16MB RAM, AHA1542CF, 1GB hard disk, kernel
from 29/12/98, compiling current from yesterday, ELF binaries, ELF kernel,
softupdates.  No NFS involved.  Plenty of swap, and with only 16MB RAM and
parallel builds it does an awful lot of paging.

The last visible bit of the compilation log went like this:

    cc -fpic -DPIC ... alias_util.so
    building profiled alias library
    building standard alias library
    building shared alias library (version 2)

Since it was a parallel make, possibly all 3 library builds were running in
parallel.  Certainly there are 3 tsort and 3 nm processes active (well, they
would be if the whole thing weren't wedged).

The processes in "thrd_sleep" are trying to lock exec_map.  Exec_map has
1 shared lock and 7 waiting, with LK_NOPAUSE, LK_SHARE_NON_ZERO,
LK_WAIT_NON_ZERO and LK_WANT_EXCL set.  Where's the missing process with
the shared lock?

The processes in "inode" are trying to lock the inode that refers to the
vnode that is "/usr/obj/elf/usr/src/tmp/usr/bin/sed".  There is 1 shared lock
and 2 waiting, with LK_SHARE_NON_ZERO, LK_WAIT_NON_ZERO and LK_WANT_EXCL set.
Similarly, where is the missing process with the shared lock?
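
The lock state described above (1 shared holder, several waiters, an
exclusive request pending) can be reproduced with a toy model.  This is only
a sketch, assuming a shared/exclusive lock that makes new requests wait once
an exclusive request is pending; ToyLock, its fields and the process names
are hypothetical and are not the kernel's lockmgr code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLock:
    """Toy shared/exclusive lock that favours pending exclusive requests.

    Hypothetical model for illustration only; not the kernel's lockmgr.
    """
    name: str
    shared_holders: list = field(default_factory=list)
    exclusive_holder: object = None
    want_exclusive: list = field(default_factory=list)  # blocked exclusive requesters
    want_shared: list = field(default_factory=list)     # blocked shared requesters

    def lock_shared(self, who):
        # Granted only if nobody holds or is waiting for the lock exclusively.
        if self.exclusive_holder is None and not self.want_exclusive:
            self.shared_holders.append(who)
            return True
        self.want_shared.append(who)
        return False

    def lock_exclusive(self, who):
        # Granted only if the lock is completely idle.
        idle = (self.exclusive_holder is None
                and not self.shared_holders
                and not self.want_exclusive)
        if idle:
            self.exclusive_holder = who
            return True
        self.want_exclusive.append(who)
        return False

    def summary(self):
        return {
            "shared": len(self.shared_holders),
            "waiting": len(self.want_shared) + len(self.want_exclusive),
            "want_excl": bool(self.want_exclusive),
        }


# One unidentified holder takes the lock shared and never releases it.
exec_map = ToyLock("exec_map")
exec_map.lock_shared("missing holder")

# Seven processes then ask for it exclusively; all of them block.
for n in range(1, 8):
    exec_map.lock_exclusive(f"proc{n}")

print(exec_map.summary())
```

In this model a single shared holder that never lets go is enough to wedge
every later request, shared or exclusive, which is consistent with the
counts seen in DDB but says nothing about who the holder actually is.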

Well, the exec_map contains 6 entries.  Three are largish and must be from
argument copying.  The other 3 are single pages, and must come from that
peculiar double-mapping-of-the-text-data-boundary bit in elf_load_section().
Two of these pages are from the same "sed" vnode that the processes stuck
in "inode" want.  Of course, what I really should be saying is that the
same page is in exec_map in two places.

The problem was not lack of free pages.  The free list has hundreds of
free pages.

I'd like to say I've got to the bottom of all this and add another one
line patch to the kernel, but I've run out of puff.  I'll be leaving the
machine on (and stuck) for a while and will try again to determine the
root cause.

But I will ask:  What is likely to happen if two processes attempt to
exec the same binary at the same time and the binary is not in core?

The only place I can find that issues a shared lock on exec_map is the
vm_fault() (via vm_map_lookup()) to fill that double mapped text/data page.
Everything else seems to want an exclusive lock.  Thus I point my finger
vaguely in the direction of the elf exec code and yell "Witch!  Burn her!"

What else can I discover from my hung 486 that could help diagnose this?
I've only got DDB and stupidly disconnected my serial console setup.
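
One way to organise what DDB shows is a wait-for graph: each blocked process
points at the lock it sleeps on, and each lock points at the process holding
it.  A cycle in that graph would be a deadlock.  The sketch below assumes
each node waits on at most one thing; find_cycle and the process names "A"
and "B" are hypothetical, and the edges are an invented example, not
something observed on the hung machine.

```python
def find_cycle(waits_for):
    """Return one cycle in a {waiter: thing_waited_for} map, or None."""
    for start in waits_for:
        path = []
        node = start
        # Follow the chain until it ends or revisits a node on this path.
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            return path[path.index(node):]
    return None


# Invented example: A sleeps on the inode lock held by B,
# while B sleeps on exec_map held by A.
hypothetical = {
    "A": "inode(sed)",
    "inode(sed)": "B",
    "B": "exec_map",
    "exec_map": "A",
}

print(find_cycle(hypothetical))
print(find_cycle({"A": "exec_map"}))
```

Filling in the real edges requires finding the holder of each shared lock,
which is exactly the missing piece here.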

Stephen.

PS Finding the name of a vnode from the name cache using ddb is slow and
painful.  What's the easy way?
