From: Robert Watson
Date: Sat, 9 Sep 2006 13:33:33 +0100 (BST)
To: Peter Holm
Cc: current@freebsd.org
Subject: Re: Page fault in uipc_usrreq.c:997
Message-ID: <20060909132826.K84834@fledge.watson.org>
In-Reply-To: <20060908072830.GA63071@peter.osted.lan>

On Fri, 8 Sep 2006, Peter Holm wrote:

> During boot of GENERIC HEAD from Sep 7 07:29 UTC I got this page
> fault:
>
> Kernel page fault with the following non-sleepable locks held:
> exclusive sleep mutex unp r = 0 (0xc0a5520c) locked @
> kern/uipc_usrreq.c:987
> KDB: stack backtrace:
> kdb_backtrace(1,c410b000,c,c3f77a20,e43f7a28,...) at
> kdb_backtrace+0x29
> witness_warn(5,0,c0941302) at witness_warn+0x192
> trap(8,28,c4190028,c413a7a8,c4195690,...) at trap+0x108
> calltrap() at calltrap+0x5
> --- trap 0xc, eip = 0xc06e01e6, esp = 0xe43f7a70, ebp = 0xe43f7bfc ---
> unp_connect(c41ce000,c3f797e0,c3f77a20,c0a5520c,0,...) at
> unp_connect+0x292
> uipc_connect(c41ce000,c3f797e0,c3f77a20) at uipc_connect+0x3e
> soconnect(c41ce000,c3f797e0,c3f77a20) at soconnect+0x4e
> kern_connect(c3f77a20,3,c3f797e0,c3f797e0,0,...) at kern_connect+0x76
> connect(c3f77a20,e43f7d04) at connect+0x30
> syscall(3b,3b,3b,1,8270000,...) at syscall+0x256
>
> http://people.freebsd.org/~pho/stress/log/cons207.html.
>
> The core file is toast and I missed a back trace of pid 678 :-(

This is likely one of the remaining race conditions in UNIX domain sockets
involving simultaneous connect and close. These races occur because locks are
dropped either for a blocking name lookup or for a recursion via the socket
layer into the protocol a second time. When the UNIX domain socket global
lock is dropped and re-acquired, the UNIX domain socket code needs to
re-evaluate its assumptions regarding any references it has to other UNIX
domain sockets, which may have "gone away" while the lock was released.
Interestingly, many of these races also existed in 4.x and before, but they
are more exposed with greater kernel parallelism. I recently closed a spate
of them, but it looks like a few remain.

In this case, the listen socket may have been closed (although possibly not)
while sonewconn() is called. It could be that a reference needs to be added
to so2 before dropping the unp lock.

I saw John's follow-up, but if ups/he don't have a fix in a few days, I can
investigate once I get back to the UK. Send me a ping next week if I appear
to forget :-).

Robert N M Watson
Computer Laboratory
University of Cambridge
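
The pattern described above (take a reference on so2 before dropping the lock, then re-check its state after re-acquiring the lock) can be sketched in userspace C. This is only an illustration under stated assumptions, not the code in uipc_usrreq.c: struct sock, connect_sketch(), run_once(), and the other names are hypothetical, a pthread mutex stands in for the unp lock, and the concurrent close is simulated deterministically by a callback rather than by a second thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a listen socket; not the kernel's struct socket. */
struct sock {
    int  refs;    /* reference count, protected by global_lock */
    bool closed;  /* set once the owner has closed the socket */
};

/* Stand-in for the global UNIX domain socket lock. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

static struct sock *sock_alloc(void)
{
    struct sock *s = calloc(1, sizeof(*s));

    if (s != NULL)
        s->refs = 1;  /* the owner's reference */
    return s;
}

/* Drop one reference with global_lock held; free on the last release. */
static void sock_rele_locked(struct sock *s)
{
    assert(s->refs > 0);
    if (--s->refs == 0)
        free(s);
}

/* Owner closes the socket: mark it closed and drop the owner's reference. */
static void sock_close(struct sock *s)
{
    pthread_mutex_lock(&global_lock);
    s->closed = true;
    sock_rele_locked(s);
    pthread_mutex_unlock(&global_lock);
}

/*
 * Connect-like path, entered and left with global_lock held.
 * blocking_step models work that must run unlocked (for example a name
 * lookup), during which so2 may be closed by someone else.
 * Returns 0 on success, -1 if so2 was closed while the lock was dropped.
 */
static int connect_sketch(struct sock *so2, void (*blocking_step)(struct sock *))
{
    int error = 0;

    so2->refs++;                        /* pin so2 before unlocking */
    pthread_mutex_unlock(&global_lock);

    blocking_step(so2);                 /* so2 may be closed in here */

    pthread_mutex_lock(&global_lock);
    if (so2->closed)                    /* re-evaluate after relocking */
        error = -1;
    sock_rele_locked(so2);              /* may free so2; do not touch it after */
    return error;
}

static void step_noop(struct sock *s)  { (void)s; }
static void step_close(struct sock *s) { sock_close(s); }

/* Run the connect path once, optionally closing so2 in the unlocked window. */
int run_once(bool close_during_window)
{
    struct sock *so2 = sock_alloc();
    int error;

    if (so2 == NULL)
        return -2;
    pthread_mutex_lock(&global_lock);
    error = connect_sketch(so2, close_during_window ? step_close : step_noop);
    pthread_mutex_unlock(&global_lock);
    if (!close_during_window)
        sock_close(so2);                /* normal teardown by the owner */
    return error;
}
```

Calling run_once(false) exercises the uncontended path, and run_once(true) exercises the case where the socket is closed while the lock is dropped. Without the extra reference taken before the unlock, the close in the second case would free the structure and the later check of so2->closed would touch freed memory, which is the kind of fault the backtrace suggests.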