From owner-freebsd-net@FreeBSD.ORG Wed Aug 13 14:01:58 2003 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 92FB837B401 for ; Wed, 13 Aug 2003 14:01:58 -0700 (PDT) Received: from mail.sandvine.com (sandvine.com [199.243.201.138]) by mx1.FreeBSD.org (Postfix) with ESMTP id C5B2543F3F for ; Wed, 13 Aug 2003 14:01:57 -0700 (PDT) (envelope-from emaste@sandvine.com) Received: by mail.sandvine.com with Internet Mail Service (5.5.2653.19) id ; Wed, 13 Aug 2003 17:01:57 -0400 Message-ID: From: Ed Maste To: 'Mike Silbersack' , Scot Loach Date: Wed, 13 Aug 2003 17:01:56 -0400 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) Content-Type: text/plain; charset="iso-8859-1" cc: "'freebsd-net@freebsd.org'" Subject: RE: TCP socket shutdown race condition X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Aug 2003 21:01:58 -0000 Mike "Silby" Silbersack wrote: >Well, as ui_ref is the best bet, redoing your tests with it expanded to >ui_int is where we need to start before looking further. :) > >I believe that a uidinfo->ui_ref over/underflow could cause random memory >corruption, so maybe the panic you're seeing comes about after a bunch of >memory has already been trashed. > >So anyway, promote ui_ref to a u_int and retest. Tell us what happens. So as Scot mentioned (http://news.gw.com/freebsd.net/10900) it doesn't look like the ui_ref is overflowing, and the panic still happens with a 32 bit ref count. I think I've found the problem. crfree() is called from a lot of places (I counted at least 20) including sodealloc() in the socket code, crcopy() etc. It's called at splnet() from sodealloc(). I'm not sure what spl (if any) it might be called at from elsewhere, but certainly not splnet(). I believe the non-atomic (on SMP) increment and decrement in crhold() and crfree() result in a race condition, with the ref count ending up less than it should be. By adding a busy wait loop to crhold() and crfree() and making them "even less atomic" I was able to reliably make the same panic occur within a minute or two of starting my test. (Running the same test, it took on the order of a day for Scot to observe a panic.) I've added an splhigh() around the code in crhold() and crfree() (with the delay left in) and haven't observed a panic yet. I'm not sure what the best way to fix this is, but the ref count inc/dec either needs to be protected or made atomic. I'm going to investigate the correct solution for this and supply a PR / patch, but for now let me know if more information is desired. -ed