From: John Baldwin
To: freebsd-stable@freebsd.org
Cc: Andreas Longwitz
Date: Tue, 6 Sep 2011 11:04:43 -0400
Subject: Re: UFS_DIRHASH panics on a dozen server within 30 hours
Message-Id: <201109061104.43409.jhb@freebsd.org>
In-Reply-To: <4E64933E.8030908@incore.de>

On Monday, September 05, 2011 5:15:42 am Andreas Longwitz wrote:
> Hi,
>
> a week ago a dozen of my FreeBSD servers crashed within a time span of
> 30 hours. The servers run very different applications; some of them
> were only on standby.
> All servers have the same kernel with FreeBSD 6 STABLE
> and there were no problems for years until the "black monday".
>
> Yes, I know that FreeBSD 6 is out of date now, but I don't like to
> change a very well-running system. Another reason is that my hardware
> needs the amr driver, and because a fix for the amr_ioctl problem
> described in kern/155658 is still outstanding, it is not possible for me
> to upgrade my production systems without changing hardware.

Hmm, the patch in that PR should still apply to newer versions. Also, you
could just change the malloc() call to always allocate the maximum size
(instead of using a static buffer) for a smaller diff. It seems, though,
that a specific command is overrunning its buffer.

> Now I have a dozen core dumps and am trying to understand what happened.
> All dumps look very similar, and the panic is always "page fault"
> in _mtx_lock_sleep called from ufsdirhash_recycle or ufsdirhash_free
> because the mtx_object in use is overwritten with zeros by someone
> before _mtx_lock_sleep is called.

I don't know of anything in particular that would explain this, especially
why you would see them all occur at the same time. Maybe look to see if
the machines were doing something unusual at that time (a cron job, etc.)?

-- 
John Baldwin