From: Richard Todd
To: freebsd-current@freebsd.org, Damian Gerow
Date: Thu, 16 Apr 2009 13:36:48 -0500
Subject: Re: ZFS checksum errors on umass(4) insertion
In-Reply-To: (Damian Gerow's message of "Thu, 16 Apr 2009 10:42:51 -0400")
List-Id: Discussions about the use of FreeBSD-current

Damian Gerow writes:

> 1) Reverting the extended attribute locking change (r189967) does not change
> the situation for me. I still experience checksum issues and data loss.
> (Unsurprisingly.)
>
> 2) Without umass loaded, I have been completely unable to trigger the issue.
>
> 3) Once umass is loaded, and the symptoms start cropping up, unloading umass
> does not make them go away (again, unsurprisingly). What I haven't yet
> tested, but am currently working towards, is whether removing umass stops
> further checksum errors from occurring.
>
> 4) r189967 does remove some LORs for me, even though I don't use (that I
> know of) extended attributes.
>
> 5) It seems that so long as umass is used at all, the symptoms will
> eventually show up. I've been able to trigger the symptoms by inserting
> then removing a umass device immediately after boot, then ramping up the
> workload.
>
> 6) The only difference made by vfs.zfs.debug=1 is that zfs reclaims are
> logged.
>
> I'm at a bit of a loss as to what to test next, other than checking for an
> increased number of checksum errors after unloading umass. However, I'm not
> convinced this is going to highlight the actual problem. I'm all ears as to
> what to test for at this point, as I'm running out of ideas.

I have a question or two, and an idea. The questions:

1) How much RAM do you have? Is it 4G or more? (I'm guessing the answer is
   "yes".)

2) What does "sysctl -a | grep bounced" say? Check this both before loading
   umass and again after loading it and seeing the bug triggered.

My idea: I suspect a bug in the bounce-buffer code, which handles I/O to
memory beyond the range a given piece of hardware can access directly
through DMA. I've had some similar issues with checksum errors, and they
seem to have gone away since I lowered hw.physmem to 3400M in loader.conf,
which cuts memory usage down below the point where anything needs to use
bounce buffers.

You might try lowering hw.physmem and see if that helps. Check with the
"sysctl -a | grep bounced" command; if no bounce-buffer usage is going on,
you should see something like

    hw.busdma.zone0.total_bounced: 0
    hw.busdma.zone1.total_bounced: 0
    hw.busdma.zone2.total_bounced: 0

(The number of zones may be different on your system.)
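
The before/after comparison of the bounced counters could be done by hand, but a small script makes the difference easier to spot. What follows is only a rough sketch, assuming Python is available and that the output of "sysctl -a | grep bounced" has been captured as text twice (once before loading umass, once after the errors appear). The function names parse_bounced and bounced_delta are made up for this illustration, and the line format is taken from the sample output above.

```python
import re

# Matches lines such as "hw.busdma.zone0.total_bounced: 0"
# (format assumed from the sample sysctl output quoted in this thread).
BOUNCED_RE = re.compile(r"^(hw\.busdma\.zone\d+\.total_bounced):\s*(\d+)$")


def parse_bounced(sysctl_text):
    """Return {sysctl_name: counter} for every total_bounced line in the text."""
    counters = {}
    for line in sysctl_text.splitlines():
        match = BOUNCED_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def bounced_delta(before, after):
    """Return only the zones whose counter changed, with the amount of change."""
    delta = {}
    for name, value in after.items():
        change = value - before.get(name, 0)
        if change != 0:
            delta[name] = change
    return delta
```

With the two captures read into strings, bounced_delta(parse_bounced(before_text), parse_bounced(after_text)) returns an empty dict if no zone bounced anything in between. Any non-empty result means bounce buffers were in use during the test window, which is at least consistent with the theory above, though it does not prove it.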