From owner-freebsd-current@FreeBSD.ORG Thu Apr 16 21:33:00 2009
Message-ID: <49E7A404.5090208@samsco.org>
Date: Thu, 16 Apr 2009 15:32:52 -0600
From: Scott Long
To: John Baldwin
Cc: Damian Gerow, freebsd-current@freebsd.org, Richard Todd
Subject: Re: ZFS checksum errors on umass(4) insertion
References: <49BD117B.2080706@163.com> <20090416144251.GA1605@plebeian.afflictions.org> <200904161624.51920.jhb@freebsd.org>
In-Reply-To: <200904161624.51920.jhb@freebsd.org>
List-Id: Discussions about the use of FreeBSD-current

John Baldwin wrote:
> On Thursday 16 April 2009 2:36:48 pm Richard Todd wrote:
>> Damian Gerow writes:
>>> 1) Reverting the extended attribute locking change (r189967) does not change
>>> the situation for me. I still experience checksum issues and data loss.
>>> (Unsurprisingly.)
>>>
>>> 2) Without umass loaded, I have been completely unable to trigger the issue.
>>>
>>> 3) Once umass is loaded, and the symptoms start cropping up, unloading umass
>>> does not make them go away (again, unsurprisingly). What I haven't yet
>>> tested, but am currently working towards, is whether removing umass stops
>>> further checksum errors from occurring.
>>>
>>> 4) r189967 does remove some LORs for me, even though I don't use (that I
>>> know of) extended attributes.
>>>
>>> 5) It seems that so long as umass is used at all, the symptoms will
>>> eventually show up. I've been able to trigger the symptoms by inserting
>>> then removing a umass device immediately after boot, then ramping up the
>>> workload.
>>>
>>> 6) The only difference made by vfs.zfs.debug=1 is that zfs reclaims are
>>> logged.
>>>
>>> I'm at a bit of a loss as to what to test next, other than checking for an
>>> increased number of checksum errors after unloading umass. However, I'm not
>>> convinced this is going to highlight the actual problem. I'm all ears as to
>>> what to test for at this point, as I'm running out of ideas.
>> I have a question or two, and an idea.
>>
>> The questions:
>>
>> 1) How much RAM do you have, is it 4G or more? (I'm guessing the
>> answer is "yes".)
>>
>> 2) What does "sysctl -a | grep bounced" say? Check this both before and after
>> loading umass and seeing the bug triggered.
>>
>> My idea: I suspect a bug in the bounce-buffer code that does I/O to memory
>> space beyond the area a given piece of hardware can access directly thru DMA.
>> I've had some similar issues with checksum errors, and they seem to have gone
>> away since lowering hw.physmem to 3400M in loader.conf, which cuts memory
>> usage down below the point where anything needs to use bounce buffers.
>> You might try lowering hw.physmem and see if that helps; check with the
>> "sysctl -a | grep bounced" command, you should be seeing something like
>>
>>   hw.busdma.zone0.total_bounced: 0
>>   hw.busdma.zone1.total_bounced: 0
>>   hw.busdma.zone2.total_bounced: 0
>>
>> if no bounce-buffer usage is going on. (The number of zones may be different
>> on your system.)
>
> Can you please try http://www.FreeBSD.org/~jhb/patches/dma_pg.patch? This
> lines up with your analysis in that it fixes a problem in the bounce buffer
> code that was introduced with the new USB stack (and only triggers when the
> USB code has to use a bounce buffer).
>

As a data point, most normal I/O is not going to trigger this bug, even
if it gets bounced. I/O using O_DIRECT can, and GEOM discovery I/O can
as well. Since memory is allocated from the top of the system, I think
that the damage gets done early during boot, and then propagates out
over time as the system becomes busier.

Scott
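
For anyone repeating the before/after check Richard describes, here is a
minimal sketch, assuming a POSIX shell with awk. The function name
sum_bounced is hypothetical and not part of FreeBSD; it only totals the
hw.busdma.zoneN.total_bounced lines from sysctl output supplied on standard
input, so a nonzero total suggests that some I/O went through bounce buffers.

```shell
# Hypothetical helper (not a FreeBSD tool): add up every
# hw.busdma.zoneN.total_bounced counter found on standard input.
sum_bounced() {
    awk -F': *' '
        # Match only the per-zone bounce counters, whatever the zone count.
        /^hw\.busdma\.zone[0-9]+\.total_bounced:/ { total += $2 }
        # Print 0 rather than an empty line when no counter was seen.
        END { print total + 0 }
    '
}

# Intended use on the affected machine (not executed here):
#   sysctl -a | sum_bounced    # right after boot, before loading umass
#   sysctl -a | sum_bounced    # again once the checksum errors show up
```

A total that stays at 0 across both runs would match the "no bounce-buffer
usage" case quoted above; it says nothing by itself about whether the patch
is needed.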