From owner-freebsd-current@FreeBSD.ORG  Thu Apr 16 21:33:00 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 654261065834;
	Thu, 16 Apr 2009 21:33:00 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.freebsd.org (Postfix) with ESMTP id 0E23D8FC0A;
	Thu, 16 Apr 2009 21:32:59 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from phobos.local (pooker.samsco.org [168.103.85.57])
	by pooker.samsco.org (8.14.2/8.14.2) with ESMTP id n3GLWqJt016570;
	Thu, 16 Apr 2009 15:32:52 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Message-ID: <49E7A404.5090208@samsco.org>
Date: Thu, 16 Apr 2009 15:32:52 -0600
From: Scott Long <scottl@samsco.org>
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US;
	rv:1.8.1.13) Gecko/20080313 SeaMonkey/1.1.9
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>
References: <49BD117B.2080706@163.com>	<20090416144251.GA1605@plebeian.afflictions.org>	<x7myagjvi7.fsf@ichotolot.servalan.com>
	<200904161624.51920.jhb@freebsd.org>
In-Reply-To: <200904161624.51920.jhb@freebsd.org>
X-Enigmail-Version: 0.95.6
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-4.4 required=3.8 tests=ALL_TRUSTED,BAYES_00
	autolearn=ham version=3.1.8
X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on pooker.samsco.org
Cc: Damian Gerow <dgerow@afflictions.org>, freebsd-current@freebsd.org,
	Richard Todd <rmtodd@ichotolot.servalan.com>
Subject: Re: ZFS checksum errors on umass(4) insertion

John Baldwin wrote:
> On Thursday 16 April 2009 2:36:48 pm Richard Todd wrote:
>> Damian Gerow <dgerow@afflictions.org> writes:
>>> 1) Reverting the extended attribute locking change (r189967) does not change
>>> the situation for me.  I still experience checksum issues and data loss.
>>> (Unsurprisingly.)
>>>
>>> 2) Without umass loaded, I have been completely unable to trigger the issue.
>>>
>>> 3) Once umass is loaded, and the symptoms start cropping up, unloading umass
>>> does not make them go away (again, unsurprisingly).  What I haven't yet
>>> tested, but am currently working towards, is whether removing umass stops
>>> further checksum errors from occurring.
>>>
>>> 4) r189967 does remove some LORs for me, even though I don't use (that I
>>> know of) extended attributes.
>>>
>>> 5) It seems that so long as umass is used at all, the symptoms will
>>> eventually show up.  I've been able to trigger the symptoms by inserting
>>> then removing a umass device immediately after boot, then ramping up the
>>> workload.
>>>
>>> 6) The only difference made by vfs.zfs.debug=1 is that zfs reclaims are
>>> logged.
>>>
>>> I'm at a bit of a loss as to what to test next, other than checking for an
>>> increased number of checksum errors after unloading umass.  However, I'm not
>>> convinced this is going to highlight the actual problem.  I'm all ears as to
>>> what to test for at this point, as I'm running out of ideas.
>> I have a question or two, and an idea.  
>>
>> The questions: 
>>
>> 1) How much RAM do you have, is it 4G or more?  (I'm guessing the
>> answer is "yes".)
>>
>> 2) What does "sysctl -a | grep bounced" say?  Check this both before and after
>> loading umass and seeing the bug triggered.
>>
>> My idea: I suspect a bug in the bounce-buffer code that does I/O to memory
>> space beyond the area a given piece of hardware can access directly through DMA.
>> I've had some similar issues with checksum errors, and they seem to have gone
>> away since lowering hw.physmem to 3400M in loader.conf, which cuts memory
>> usage down below the point where anything needs to use bounce buffers. 
>> You might try lowering hw.physmem and see if that helps; check with the
>> "sysctl -a | grep bounced" command, you should be seeing something like 
>>
>> hw.busdma.zone0.total_bounced: 0
>> hw.busdma.zone1.total_bounced: 0
>> hw.busdma.zone2.total_bounced: 0
>>
>> if no bounce-buffer usage is going on.  (The number of zones may be different
>> on your system.)
> 
> Can you please try http://www.FreeBSD.org/~jhb/patches/dma_pg.patch?  This
> lines up with your analysis in that it fixes a problem in the bounce buffer
> code that was introduced with the new USB stack (and only triggers when the
> USB code has to use a bounce buffer).
> 

As a data point, most normal I/O is not going to trigger this bug, even
if it gets bounced.  I/O using O_DIRECT can, and GEOM discovery I/O can
as well.  Since memory is allocated from the top of physical memory, I think
that the damage gets done early during boot, and then propagates out
over time as the system becomes busier.

Scott
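
The before/after check of the bounce counters suggested earlier in the thread is easy to misread by eye, so here is a minimal sketch of it in Python. It assumes the output of "sysctl -a | grep bounced" has the "name: value" form quoted above and has been saved to text before and after loading umass; the function names parse_bounced and bounced_delta are hypothetical, and the sample numbers are made up for illustration, not measurements.

```python
def parse_bounced(text):
    """Parse lines like 'hw.busdma.zone0.total_bounced: 0' into a dict."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip anything that is not a total_bounced counter line.
        if not sep or not name.endswith("total_bounced"):
            continue
        try:
            counters[name] = int(value.strip())
        except ValueError:
            continue  # ignore lines whose value is not an integer
    return counters


def bounced_delta(before, after):
    """Return the counters that grew between two snapshots, with the increase."""
    old = parse_bounced(before)
    new = parse_bounced(after)
    return {
        name: count - old.get(name, 0)
        for name, count in new.items()
        if count > old.get(name, 0)
    }


if __name__ == "__main__":
    # Illustrative snapshots only; real values come from sysctl on the machine.
    before = (
        "hw.busdma.zone0.total_bounced: 0\n"
        "hw.busdma.zone1.total_bounced: 0\n"
    )
    after = (
        "hw.busdma.zone0.total_bounced: 0\n"
        "hw.busdma.zone1.total_bounced: 128\n"
    )
    print(bounced_delta(before, after))
```

An empty result means no zone bounced any I/O between the two snapshots, which is what lowering hw.physmem is expected to produce; any non-empty result shows which zones started bouncing once umass was in use.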