From: John Baldwin
To: freebsd-current@freebsd.org
Cc: Damian Gerow, Richard Todd
Date: Thu, 16 Apr 2009 16:24:51 -0400
Subject: Re: ZFS checksum errors on umass(4) insertion
Message-Id: <200904161624.51920.jhb@freebsd.org>
References: <49BD117B.2080706@163.com> <20090416144251.GA1605@plebeian.afflictions.org>

On Thursday 16
April 2009 2:36:48 pm Richard Todd wrote:
> Damian Gerow writes:
> > 1) Reverting the extended attribute locking change (r189967) does not change
> > the situation for me. I still experience checksum issues and data loss.
> > (Unsurprisingly.)
> >
> > 2) Without umass loaded, I have been completely unable to trigger the issue.
> >
> > 3) Once umass is loaded, and the symptoms start cropping up, unloading umass
> > does not make them go away (again, unsurprisingly). What I haven't yet
> > tested, but am currently working towards, is whether removing umass stops
> > further checksum errors from occurring.
> >
> > 4) r189967 does remove some LORs for me, even though I don't use (that I
> > know of) extended attributes.
> >
> > 5) It seems that so long as umass is used at all, the symptoms will
> > eventually show up. I've been able to trigger the symptoms by inserting
> > then removing a umass device immediately after boot, then ramping up the
> > workload.
> >
> > 6) The only difference made by vfs.zfs.debug=1 is that zfs reclaims are
> > logged.
> >
> > I'm at a bit of a loss as to what to test next, other than checking for an
> > increased number of checksum errors after unloading umass. However, I'm not
> > convinced this is going to highlight the actual problem. I'm all ears as to
> > what to test for at this point, as I'm running out of ideas.
>
> I have a question or two, and an idea.
>
> The questions:
>
> 1) How much RAM do you have, is it 4G or more? (I'm guessing the
> answer is "yes".)
>
> 2) What does "sysctl -a | grep bounced" say? Check this both before and after
> loading umass and seeing the bug triggered.
>
> My idea: I suspect a bug in the bounce-buffer code that does I/O to memory
> space beyond the area a given piece of hardware can access directly thru DMA.
> I've had some similar issues with checksum errors, and they seem to have gone
> away since lowering hw.physmem to 3400M in loader.conf, which cuts memory
> usage down below the point where anything needs to use bounce buffers.
> You might try lowering hw.physmem and see if that helps; check with the
> "sysctl -a | grep bounced" command, you should be seeing something like
>
> hw.busdma.zone0.total_bounced: 0
> hw.busdma.zone1.total_bounced: 0
> hw.busdma.zone2.total_bounced: 0
>
> if no bounce-buffer usage is going on. (The number of zones may be different
> on your system.)

Can you please try http://www.FreeBSD.org/~jhb/patches/dma_pg.patch? This lines
up with your analysis in that it fixes a problem in the bounce buffer code that
was introduced with the new USB stack (and only triggers when the USB code has
to use a bounce buffer).

-- 
John Baldwin
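
For anyone who wants to compare the "sysctl -a | grep bounced" output before and after loading umass, the check can be scripted. What follows is only a rough sketch and is not part of the patch: the function names parse_total_bounced and bounce_delta are made up for illustration, and the nonzero count in the second snapshot is invented sample data. It assumes the output lines look like the hw.busdma.zoneN.total_bounced lines quoted above.

```python
import re

# Matches lines such as "hw.busdma.zone0.total_bounced: 0".
_BOUNCED_RE = re.compile(r"^hw\.busdma\.zone(\d+)\.total_bounced:\s*(\d+)\s*$")


def parse_total_bounced(sysctl_output):
    """Return {zone number: total_bounced count} parsed from sysctl text."""
    counts = {}
    for line in sysctl_output.splitlines():
        match = _BOUNCED_RE.match(line.strip())
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


def bounce_delta(before, after):
    """Return {zone: increase} for zones whose counter grew between snapshots."""
    return {
        zone: after[zone] - before.get(zone, 0)
        for zone in after
        if after[zone] > before.get(zone, 0)
    }


if __name__ == "__main__":
    # Snapshot taken before loading umass (mirrors the all-zero output above).
    before = parse_total_bounced(
        "hw.busdma.zone0.total_bounced: 0\n"
        "hw.busdma.zone1.total_bounced: 0\n"
        "hw.busdma.zone2.total_bounced: 0\n"
    )
    # Hypothetical snapshot taken after the bug is triggered (invented value).
    after = parse_total_bounced(
        "hw.busdma.zone0.total_bounced: 0\n"
        "hw.busdma.zone1.total_bounced: 128\n"
        "hw.busdma.zone2.total_bounced: 0\n"
    )
    print(bounce_delta(before, after))
```

An empty result means no zone bounced any additional pages between the two snapshots; any entry names a zone where bounce buffers came into use in between.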