Date:      Wed, 12 Dec 2012 00:10:17 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        Tim Kientzle <kientzle@freebsd.org>, freebsd-current Current <freebsd-current@freebsd.org>
Subject:   Re: r244036 kernel hangs under load.
Message-ID:  <20121211221017.GC3013@kib.kiev.ua>
In-Reply-To: <1814631088.1331228.1355262952071.JavaMail.root@erie.cs.uoguelph.ca>
References:  <20121211045225.GY3013@kib.kiev.ua> <1814631088.1331228.1355262952071.JavaMail.root@erie.cs.uoguelph.ca>



On Tue, Dec 11, 2012 at 04:55:52PM -0500, Rick Macklem wrote:
> Konstantin Belousov wrote:
> > On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
> > > Konstantin Belousov wrote:
> > > > On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
> > > > > Adrian Chadd wrote:
> > > > > > .. what was the previous kernel version?
> > > > > >
> > > > > Hopefully Tim has it narrowed down more, but I don't see
> > > > > the hangs on a Sept. 7 kernel from head and I do see them
> > > > > > on a Dec. 3 kernel from head. (Don't know the exact rNNNNNN.)
> > > > >
> > > > > It seems to predate my commit (r244008), which was my first
> > > > > concern.
> > > > >
> > > > > I use old single core i386 hardware and can fairly reliably
> > > > > reproduce it by doing a kernel build and a "svn checkout"
> > > > > concurrently. No NFS activity. These are running on a local
> > > > > disk (UFS/FFS). (The kernel I reproduce it on is built via
> > > > > GENERIC for i386. If you want me to start a "binary search"
> > > > > for which rNNNNNN, I can do that, but it will take a while.:-)
> > > > >
> > > > > I can get out into DDB, but I'll admit I don't know enough
> > > > > about it to know where to look;-)
> > > > > Here's some lines from "db> ps", in case they give someone
> > > > > > useful information. (I can leave this box sitting in DDB for
> > > > > the rest of to-day, in case someone can suggest what I should
> > > > > look for on it.)
> > > > >
> > > > > Just snippets...
> > > > >    Ss pause adjkerntz
> > > > > >    DL sdflush [softdepflush]
> > > > >    RL [syncer]
> > > > >    DL vlruwt [vnlru]
> > > > >    DL psleep [bufdaemon]
> > > > >    RL [pagezero]
> > > > >    DL psleep [vmdaemon]
> > > > >    DL psleep [pagedaemon]
> > > > >    DL ccb_scan [xpt_thrd]
> > > > >    DL waiting_ [sctp_iterator]
> > > > >    DL ctl_work [ctl_thrd]
> > > > >    DL cooling [acpi_cooling0]
> > > > >    DL tzpoll [acpi_thermal]
> > > > >    DL (threaded) [usb]
> > > > >    ...
> > > > >    DL - [yarrow]
> > > > >    DL (threaded) [geom]
> > > > >    D - [g_down]
> > > > >    D - [g_up]
> > > > >    D - [g_event]
> > > > >    RL (threaded) [intr]
> > > > >    I [irq15: ata1]
> > > > >    ...
> > > > >    Run CPU0 [swi6: Giant taskq]
> > > > > --> does this one indicate the CPU is actually running this?
> > > > >    (after a db> cont, wait a while <ctrl><alt><esc> db> ps
> > > > >     it is still the same)
> > > > >    I [swi4: clock]
> > > > >    I [swi1: netisr 0]
> > > > >    I [swi3: vm]
> > > > >    RL [idle: cpu0]
> > > > >    SLs wait [init]
> > > > >    DL audit_wo [audit]
> > > > >    DLs (threaded) [kernel]
> > > > >    D - [deadlkres]
> > > > >    ...
> > > > >    D sched [swapper]
> > > > >
> > > > > I have no idea if this "ps" output helps, unless it indicates
> > > > > that it is looping on the Giant taskq?
> > > > Might be. You could do 'bt <pid>' for the process to see where it
> > > > loops.
> > > > Another good set of hints is at
> > > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > >
> > > Kostik, you must be clairvoyant;-)
> > >
> > > When I did "show alllocks", I found that the syncer process held
> > > - exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
> > > - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
> > > The trace for this process goes like:
> > >  spinlock_exit
> > >  mtx_unlock_spin_flags
> > >  kern_yield
> > >  _mnt_vnode_next_active
> > >  vnode_next_active
> > >  vfs_msync()
> > >
> > > So, it seems like your r244095 commit might have fixed this?
> > > (I'm not good at this stuff, but from your description, it looks
> > >  like it did the kern_yield() with the mutex held and "maybe"
> > >  got into trouble trying to acquire Giant?)
> > >
> > > Anyhow, I'm going to test a kernel with r244095 in it and see
> > > if I can still reproduce the hang.
> > > (There wasn't much else in the "show alllocks", except a
> > >  process that held the exclusive vnode interlock mutex plus
> > >  a ufs vnode lock, but it's just doing a witness_unlock.)
> > There must be a thread blocked on the mount interlock for the loop
> > in mnt_vnode_next_active to cause a livelock.
> >
> Yes. I am getting hangs with the -current kernel and they seem
> easier for me to reproduce.
>
> For the one I just did, the "syncer" seems to be blocked at
>  VI_TRYLOCK() in _mnt_vnode_next_active().
trylock cannot block.

> The vnode interlock mutex is exclusively locked by a "sh"
> process (11627). Now, here is where it gets weird...
> When I do a "db> trace 11627" I get the following:
> witness_unlock+0x1f3  (subr_witness.c:1563)
> mtx_unlock_flags+0x9f (kern_mutex.c:250)
> vdropl+0x63           (vfs_subr.c:2405)
> vputx+0x130           (vfs_subr.c:2116)
> vput+0x10             (vfs_subr.c:2319)
> vm_mmap+0x52e         (vm_mmap.c:1341)
> sys_mmap
>
> So, it seems this process is stuck while trying to unlock
> the mutex, if that makes any sense...
It is probably not stuck; you just happened to catch it at this moment.

The issue sounds more like a livelock. Can you obtain _all_ the information
listed in the deadlock debugging page I sent earlier and provide it to
me? Also, are you using a post-r244095 kernel?

Is your machine SMP?



