From: Konstantin Belousov <kostikbel@gmail.com>
Date: Wed, 12 Dec 2012 00:10:17 +0200
To: Rick Macklem
Cc: Tim Kientzle, freebsd-current
Subject: Re: r244036 kernel hangs under load.

On Tue, Dec 11, 2012 at 04:55:52PM -0500, Rick Macklem wrote:
> Konstantin Belousov wrote:
> > On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
> > > Konstantin Belousov wrote:
> > > > On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
> > > > > Adrian Chadd wrote:
> > > > > > .. what was the previous kernel version?
> > > > >
> > > > > Hopefully Tim has it narrowed down more, but I don't see
> > > > > the hangs on a Sept. 7 kernel from head and I do see them
> > > > > on a Dec. 3 kernel from head. (Don't know the exact rNNNNNN.)
> > > > >
> > > > > It seems to predate my commit (r244008), which was my first
> > > > > concern.
> > > > >
> > > > > I use old single-core i386 hardware and can fairly reliably
> > > > > reproduce it by doing a kernel build and a "svn checkout"
> > > > > concurrently. No NFS activity. These are running on a local
> > > > > disk (UFS/FFS). (The kernel I reproduce it on is built via
> > > > > GENERIC for i386. If you want me to start a "binary search"
> > > > > for which rNNNNNN, I can do that, but it will take a while.:-)
> > > > >
> > > > > I can get out into DDB, but I'll admit I don't know enough
> > > > > about it to know where to look;-)
> > > > > Here are some lines from "db> ps", in case they give someone
> > > > > useful information.
> > > > > (I can leave this box sitting in DDB for
> > > > > the rest of today, in case someone can suggest what I should
> > > > > look for on it.)
> > > > >
> > > > > Just snippets...
> > > > > Ss pause adjkerntz
> > > > > DL sdflush [softdepflush]
> > > > > RL [syncer]
> > > > > DL vlruwt [vnlru]
> > > > > DL psleep [bufdaemon]
> > > > > RL [pagezero]
> > > > > DL psleep [vmdaemon]
> > > > > DL psleep [pagedaemon]
> > > > > DL ccb_scan [xpt_thrd]
> > > > > DL waiting_ [sctp_iterator]
> > > > > DL ctl_work [ctl_thrd]
> > > > > DL cooling [acpi_cooling0]
> > > > > DL tzpoll [acpi_thermal]
> > > > > DL (threaded) [usb]
> > > > > ...
> > > > > DL - [yarrow]
> > > > > DL (threaded) [geom]
> > > > > D - [g_down]
> > > > > D - [g_up]
> > > > > D - [g_event]
> > > > > RL (threaded) [intr]
> > > > > I [irq15: ata1]
> > > > > ...
> > > > > Run CPU0 [swi6: Giant taskq]
> > > > > --> does this one indicate the CPU is actually running this?
> > > > >     (After a "db> cont" and a wait, "db> ps" still shows
> > > > >     the same thing.)
> > > > > I [swi4: clock]
> > > > > I [swi1: netisr 0]
> > > > > I [swi3: vm]
> > > > > RL [idle: cpu0]
> > > > > SLs wait [init]
> > > > > DL audit_wo [audit]
> > > > > DLs (threaded) [kernel]
> > > > > D - [deadlkres]
> > > > > ...
> > > > > D sched [swapper]
> > > > >
> > > > > I have no idea if this "ps" output helps, unless it indicates
> > > > > that it is looping on the Giant taskq?
> > > > Might be. You could do 'bt <pid>' for the process to see where
> > > > it loops.
> > > > Another good set of hints is at
> > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > >
> > > Kostik, you must be clairvoyant;-)
> > >
> > > When I did "show alllocks", I found that the syncer process held
> > > - exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
> > > - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
> > > The trace for this process goes like:
> > > spinlock_exit
> > > mtx_unlock_spin_flags
> > > kern_yield
> > > _mnt_vnode_next_active
> > > vnode_next_active
> > > vfs_msync()
> > >
> > > So, it seems like your r244095 commit might have fixed this?
> > > (I'm not good at this stuff, but from your description, it looks
> > > like it did the kern_yield() with the mutex held and "maybe"
> > > got into trouble trying to acquire Giant?)
> > >
> > > Anyhow, I'm going to test a kernel with r244095 in it and see
> > > if I can still reproduce the hang.
> > > (There wasn't much else in the "show alllocks" output, except a
> > > process that held the exclusive vnode interlock mutex plus
> > > a ufs vnode lock, but it was just doing a witness_unlock.)
> > There must be a thread blocked on the mount interlock for the loop
> > in mnt_vnode_next_active to cause a livelock.
> >
> Yes. I am getting hangs with the -current kernel, and they seem
> easier for me to reproduce.
>
> For the one I just did, the "syncer" seems to be blocked at
> VI_TRYLOCK() in _mnt_vnode_next_active().

trylock cannot block.

> The vnode interlock mutex is exclusively locked by a "sh"
> process (11627). Now, here is where it gets weird...
> When I do a "db> trace 11627" I get the following:
> witness_unlock+0x1f3 (subr_witness.c:1563)
> mtx_unlock_flags+0x9f (kern_mutex.c:250)
> vdropl+0x63 (vfs_subr.c:2405)
> vputx+0x130 (vfs_subr.c:2116)
> vput+0x10 (vfs_subr.c:2319)
> vm_mmap+0x52e (vm_mmap.c:1341)
> sys_mmap
>
> So, it seems this process is stuck while trying to unlock
> the mutex, if that makes any sense...

It is probably not stuck; you just caught it at that moment. The issue
sounds more like a livelock.
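As an illustration of that livelock, here is a minimal userspace analog
using POSIX threads. This is a hypothetical sketch, not the kernel code:
mnt_mtx and vi_mtx stand in for the mount interlock and the vnode
interlock, and the syncer/other thread names are made up. One thread
spins on a trylock while holding a lock that the trylock target's owner
is blocked on, which is the same shape as the loop in
_mnt_vnode_next_active described above:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mnt_mtx = PTHREAD_MUTEX_INITIALIZER; /* "mount interlock" */
static pthread_mutex_t vi_mtx = PTHREAD_MUTEX_INITIALIZER;  /* "vnode interlock" */

/*
 * The "syncer" analog: takes mnt_mtx, then spins on a trylock of
 * vi_mtx, yielding between attempts but never releasing mnt_mtx.
 */
static void *
syncer(void *arg)
{
	int spins, got;

	(void)arg;
	got = 0;
	pthread_mutex_lock(&mnt_mtx);
	for (spins = 0; spins < 5000000; spins++) {
		if (pthread_mutex_trylock(&vi_mtx) == 0) {
			got = 1;
			break;
		}
		/*
		 * The problematic pattern: yield while still holding
		 * mnt_mtx, which the vi_mtx owner is blocked on, so the
		 * trylock can never succeed.  In the kernel case on a
		 * single CPU this loop spins forever.
		 */
		sched_yield();
	}
	if (got) {
		printf("syncer: got the inner lock after %d tries\n", spins);
		pthread_mutex_unlock(&vi_mtx);
	} else
		printf("syncer: livelocked; dropping mnt_mtx around the "
		    "yield would let the owner of vi_mtx run\n");
	pthread_mutex_unlock(&mnt_mtx);
	return (NULL);
}

/*
 * The "sh" analog: holds vi_mtx and must take mnt_mtx before it can
 * release vi_mtx, mirroring the thread blocked on the mount interlock.
 */
static void *
other(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&vi_mtx);
	usleep(10000);			/* let the syncer take mnt_mtx */
	pthread_mutex_lock(&mnt_mtx);	/* blocks until the syncer gives up */
	pthread_mutex_unlock(&mnt_mtx);
	pthread_mutex_unlock(&vi_mtx);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t2, NULL, other, NULL);
	usleep(1000);			/* let "other" take vi_mtx first */
	pthread_create(&t1, NULL, syncer, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return (0);
}

Judging from the thread, the r244095 remedy is along the lines the
failure message suggests: the spinning thread drops the outer lock
around its yield and retry, so the owner of the inner lock can acquire
it and make progress.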
Can you obtain _all_ the information listed on the deadlock debugging
page I sent earlier, and provide it to me? Also, are you running a
post-r244095 kernel? Is your machine SMP?
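For reference, the deadlock-debugging chapter linked earlier in the
thread asks for roughly the following DDB output. This list is a
paraphrase rather than a quote, so consult the URL above for the
authoritative set; note that "show alllocks" needs a kernel built with
WITNESS:

db> ps
db> show pcpu
db> show allpcpu
db> show alllocks
db> show lockedvnods
db> alltrace
db> call doadump

Capturing the DDB session over a serial console, or saving a crash dump
with "call doadump" and extracting the same information with kgdb
afterwards, makes it practical to collect all of this for a report.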