Date: Thu, 16 Apr 2009 15:32:52 -0600
From: Scott Long <scottl@samsco.org>
To: John Baldwin <jhb@freebsd.org>
Cc: Damian Gerow <dgerow@afflictions.org>, freebsd-current@freebsd.org, Richard Todd <rmtodd@ichotolot.servalan.com>
Subject: Re: ZFS checksum errors on umass(4) insertion
Message-ID: <49E7A404.5090208@samsco.org>
In-Reply-To: <200904161624.51920.jhb@freebsd.org>
References: <49BD117B.2080706@163.com> <20090416144251.GA1605@plebeian.afflictions.org> <x7myagjvi7.fsf@ichotolot.servalan.com> <200904161624.51920.jhb@freebsd.org>
John Baldwin wrote:
> On Thursday 16 April 2009 2:36:48 pm Richard Todd wrote:
>> Damian Gerow <dgerow@afflictions.org> writes:
>>> 1) Reverting the extended attribute locking change (r189967) does not change
>>> the situation for me.  I still experience checksum issues and data loss.
>>> (Unsurprisingly.)
>>>
>>> 2) Without umass loaded, I have been completely unable to trigger the issue.
>>>
>>> 3) Once umass is loaded, and the symptoms start cropping up, unloading umass
>>> does not make them go away (again, unsurprisingly).  What I haven't yet
>>> tested, but am currently working towards, is whether removing umass stops
>>> further checksum errors from occurring.
>>>
>>> 4) r189967 does remove some LORs for me, even though I don't use (that I
>>> know of) extended attributes.
>>>
>>> 5) It seems that so long as umass is used at all, the symptoms will
>>> eventually show up.  I've been able to trigger the symptoms by inserting
>>> then removing a umass device immediately after boot, then ramping up the
>>> workload.
>>>
>>> 6) The only difference made by vfs.zfs.debug=1 is that zfs reclaims are
>>> logged.
>>>
>>> I'm at a bit of a loss as to what to test next, other than checking for an
>>> increased number of checksum errors after unloading umass.  However, I'm not
>>> convinced this is going to highlight the actual problem.  I'm all ears as to
>>> what to test for at this point, as I'm running out of ideas.
>>
>> I have a question or two, and an idea.
>>
>> The questions:
>>
>> 1) How much RAM do you have, is it 4G or more?  (I'm guessing the
>> answer is "yes".)
>>
>> 2) What does "sysctl -a | grep bounced" say?  Check this both before and after
>> loading umass and seeing the bug triggered.
>>
>> My idea: I suspect a bug in the bounce-buffer code that does I/O to memory
>> space beyond the area a given piece of hardware can access directly thru DMA.
>> I've had some similar issues with checksum errors, and they seem to have gone
>> away since lowering hw.physmem to 3400M in loader.conf, which cuts memory
>> usage down below the point where anything needs to use bounce buffers.
>> You might try lowering hw.physmem and see if that helps; check with the
>> "sysctl -a | grep bounced" command, you should be seeing something like
>>
>> hw.busdma.zone0.total_bounced: 0
>> hw.busdma.zone1.total_bounced: 0
>> hw.busdma.zone2.total_bounced: 0
>>
>> if no bounce-buffer usage is going on.  (The number of zones may be different
>> on your system.)
>
> Can you please try http://www.FreeBSD.org/~jhb/patches/dma_pg.patch?  This
> lines up with your analysis in that it fixes a problem in the bounce buffer
> code that was introduced with the new USB stack (and only triggers when the
> USB code has to use a bounce buffer).

As a data point, most normal I/O is not going to trigger this bug, even if it
gets bounced.  I/O using O_DIRECT can, and GEOM discovery I/O can as well.
Since memory is allocated from the top of the system, I think that the damage
gets done early during boot, and then propagates out over time as the system
becomes busier.

Scott
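A minimal sketch of the before/after diagnostic Richard describes above, assuming the hw.busdma.*.total_bounced sysctl names from his sample output; the /tmp file names and the exact zone count are only illustrative:

    # snapshot the busdma bounce counters before the umass device is inserted
    sysctl -a | grep bounced > /tmp/bounced.before

    # insert the device, run the workload that triggers the checksum errors,
    # then take a second snapshot and compare
    sysctl -a | grep bounced > /tmp/bounced.after
    diff /tmp/bounced.before /tmp/bounced.after

    # if the total_bounced counters climb, try capping physical memory so
    # nothing needs a bounce buffer: add to /boot/loader.conf and reboot
    #   hw.physmem="3400M"

If the counters stay at zero and the errors still appear, bounce buffering is probably not the cause on that machine; if they climb in step with the corruption, testing jhb's dma_pg.patch is the logical next step.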