From owner-freebsd-hackers@FreeBSD.ORG Wed Jun 17 21:18:10 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9E22B1065672; Wed, 17 Jun 2009 21:18:10 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 6288D8FC0C; Wed, 17 Jun 2009 21:18:10 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id D184A46B45; Wed, 17 Jun 2009 17:18:09 -0400 (EDT) Received: from jhbbsd.hudson-trading.com (unknown [209.249.190.8]) by bigwig.baldwin.cx (Postfix) with ESMTPA id A2C688A072; Wed, 17 Jun 2009 17:18:08 -0400 (EDT) From: John Baldwin To: Mel Flynn Date: Wed, 17 Jun 2009 17:17:37 -0400 User-Agent: KMail/1.9.7 References: <200906151353.06630.mel.flynn+fbsd.hackers@mailing.thruhere.net> <200906170815.26650.jhb@freebsd.org> <200906171152.55438.mel.flynn+fbsd.hackers@mailing.thruhere.net> In-Reply-To: <200906171152.55438.mel.flynn+fbsd.hackers@mailing.thruhere.net> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200906171717.37677.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Wed, 17 Jun 2009 17:18:08 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: freebsd-hackers@freebsd.org, Robert Watson Subject: Re: How best to debug locking/scheduler problems X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jun 2009 21:18:11 -0000 On Wednesday 17 June 2009 3:52:54 pm Mel Flynn wrote: > On Wednesday 17 June 2009 04:15:26 John Baldwin wrote: > > On Tuesday 16 June 2009 7:01:45 pm Mel Flynn wrote: > > > On Tuesday 16 June 2009 11:02:42 John Baldwin wrote: > > > > On Tuesday 16 June 2009 1:52:23 pm Mel Flynn wrote: > > > > > Hi John, > > > > > > > > > > On Tuesday 16 June 2009 04:19:57 John Baldwin wrote: > > > > > > On Monday 15 June 2009 5:53:05 pm Mel Flynn wrote: > > > > > > > PID TID COMM TDNAME KSTACK > > > > > > > 4283 100215 kdeinit4 - mi_switch > > > > > > > turnstile_wait _mtx_lock_sleep uipc_peeraddr kern_getpeername > > > > > > > getpeername syscall Xint0x80_syscall > > > > > > > % ps -ww 4283 > > > > > > > PID TT STAT TIME COMMAND > > > > > > > 4283 ?? T 0:00.38 kdeinit4: kdeinit4: kio_http http > > > > > > > local:/tmp/ksocket-mel/klauncherxJ1635.slave-socket > > > > > > > local:/tmp/ksocket- mel/plasmayC1653.slave-socket (kdeinit4) > > > > > > > > > > > > > > %ls -l /tmp/ksocket-mel/ > > > > > > > > > > > > > > total 2 > > > > > > > -rw-rw-r-- 1 mel wheel 62 Jun 14 22:55 KSMserver__0 > > > > > > > srw------- 1 mel wheel 0 Jun 14 22:55 kdeinit4__0 > > > > > > > srwxrwxr-x 1 mel wheel 0 Jun 14 22:55 > > > > > > > klauncherxJ1635.slave-socket > > > > > > > > > > > > You can use kgdb and the scripts at www.freebsd.org/~jhb/gdb. > > > > > > Simply run 'kgdb' as root and do 'lcd /folder/with/scripts' and > > > > > > 'source gdb6'. You can then do 'lockchain 4283' to find who holds > > > > > > the lock this thread is blocked on and what state they are in. > > > > > > > > > > Looks like a deadlock: > > > > > > > > > > (kgdb) lockchain 4283 > > > > > thread 100215 (pid 4283, kdeinit4) blocked on lock 0xc64374a0 > > > > > "unp_mtx" thread 100122 (pid 1635, klauncher) blocked on lock > > > > > 0xc6806348 "unp_mtx" thread 100215 (pid 4283, kdeinit4) blocked on > > > > > lock 0xc64374a0 "unp_mtx" thread 100122 (pid 1635, klauncher) blocked > > > > > on lock 0xc6806348 "unp_mtx" thread 100215 (pid 4283, kdeinit4) > > > > > blocked on lock 0xc64374a0 "unp_mtx" thread 100122 (pid 1635, > > > > > klauncher) blocked on lock 0xc6806348 "unp_mtx" thread 100215 (pid > > > > > 4283, kdeinit4) blocked on lock 0xc64374a0 "unp_mtx" thread 100122 > > > > > (pid 1635, klauncher) blocked on lock 0xc6806348 "unp_mtx" thread > > > > > 100215 (pid 4283, kdeinit4) blocked on lock 0xc64374a0 "unp_mtx" > > > > > thread 100122 (pid 1635, klauncher) blocked on lock 0xc6806348 > > > > > "unp_mtx" thread 100215 (pid 4283, kdeinit4) blocked on lock > > > > > 0xc64374a0 "unp_mtx" thread 100122 (pid 1635, klauncher) blocked on > > > > > lock 0xc6806348 "unp_mtx" thread 100215 (pid 4283, kdeinit4) blocked > > > > > on lock 0xc64374a0 "unp_mtx" thread 100122 (pid 1635, klauncher) > > > > > blocked on lock 0xc6806348 "unp_mtx" thread 100215 (pid 4283, > > > > > kdeinit4) blocked on lock 0xc64374a0 "unp_mtx" thread 100122 (pid > > > > > 1635, klauncher) blocked on lock 0xc6806348 "unp_mtx" thread 100215 > > > > > (pid 4283, kdeinit4) blocked on lock 0xc64374a0 "unp_mtx" thread > > > > > 100122 (pid 1635, klauncher) blocked on lock 0xc6806348 "unp_mtx" > > > > > thread 100215 (pid 4283, kdeinit4) blocked on lock 0xc64374a0 > > > > > "unp_mtx" thread 100122 (pid 1635, klauncher) blocked on lock > > > > > 0xc6806348 "unp_mtx" DEADLOCK > > > > > > > > > > Looking through the scripts now to see how I can get more info on the > > > > > call chain and hoping I don't panic the machine ;). It is quite > > > > > random to reproduce. > > > > > > > > In kgdb you can simply do 'tid 100122' followed by 'bt' and 'tid > > > > 100215' followed by 'bt'. > > > > > > Cool, thanks for helping John. Of course it pretty much shows me what > > > procstat -k shows and can't get any info on the userland part, but I can > > > fully inspect the locks and threads. > > > > > > Both threads are in TDS_INHIBITED state, and blocked on: > > > (kgdb) frame 0 > > > #0 sched_switch (td=0xc5971240, newtd=0xc4d39900, flags=259) > > > at /usr/src/sys/kern/sched_ule.c:1864 > > > 1864 cpuid = PCPU_GET(cpuid); > > > > That doesn't really tell us anything except that it isn't running. We know > > it is actually blocked on a lock, and we need the full stack trace to see > > where the two threads were trying to acquire the locks, hence 'bt'. ' > > procstat -k' output would be fine, too. > > the respective bt's: > (kgdb) tid 100122 > at /usr/src/sys/kern/kern_mutex.c:447 > #4 0xc06a68a5 in uipc_peeraddr (so=0xc64309a8, nam=0xe79a2c70) > at /usr/src/sys/kern/uipc_usrreq.c:682 > #5 0xc06a1e71 in kern_getpeername (td=0xc56e8900, fd=12, sa=0xe79a2c70, > alen=0xe79a2c6c) > at /usr/src/sys/kern/uipc_syscalls.c:1566 > > (kgdb) tid 100215 > (kgdb) bt > at /usr/src/sys/kern/kern_mutex.c:447 > #4 0xc06a68a5 in uipc_peeraddr (so=0xc6976338, nam=0xe9ae9c70) > at /usr/src/sys/kern/uipc_usrreq.c:682 > #5 0xc06a1e71 in kern_getpeername (td=0xc5971240, fd=7, sa=0xe9ae9c70, > alen=0xe9ae9c6c) > at /usr/src/sys/kern/uipc_syscalls.c:1566 These are the key frames. It looks like uipc_peeraddr() tries to lock two unp locks w/o any protection from the global unp linkage lock. I've changed it to use the same locking as uipc_accept() where it first grabs a read lock on the linkage lock and then just locks the other end of the connection to copy out its sockaddr. --- //depot/user/jhb/socket/kern/uipc_usrreq.c +++ /home/jhb/work/p4/socket/kern/uipc_usrreq.c @@ -671,7 +671,7 @@ KASSERT(unp != NULL, ("uipc_peeraddr: unp == NULL")); *nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK); - UNP_PCB_LOCK(unp); + UNP_LINK_RLOCK(); /* * XXX: It seems that this test always fails even when connection is * established. So, this else clause is added as workaround to @@ -681,7 +681,7 @@ if (unp2 != NULL) { UNP_PCB_LOCK(unp2); if (unp2->unp_addr != NULL) - sa = (struct sockaddr *) unp->unp_conn->unp_addr; + sa = (struct sockaddr *) unp2->unp_addr; else sa = &sun_noname; bcopy(sa, *nam, sa->sa_len); @@ -690,7 +690,7 @@ sa = &sun_noname; bcopy(sa, *nam, sa->sa_len); } - UNP_PCB_UNLOCK(unp); + UNP_LINK_RUNLOCK(); return (0); } @@ -850,7 +850,7 @@ * return the slightly counter-intuitive but otherwise * correct error that the socket is not connected. * - * Locking here must be done carefully: the inkage lock + * Locking here must be done carefully: the linkage lock * prevents interconnections between unpcbs from changing, so * we can traverse from unp to unp2 without acquiring unp's * lock. Socket buffer locks follow unpcb locks, so we can -- John Baldwin