Date: Mon, 01 Mar 1999 14:41:58 -0500
From: Andrew Heybey <ath@niksun.com>
To: Mike Smith <mike@smith.net.au>
Cc: freebsd-hackers@freebsd.org
Subject: Re: Advice wanted on tracking down bug (or hw problem?) in 3.1R
Message-ID: <199903011941.OAA28487@stiegl.niksun.com>
In-Reply-To: Your message of Fri, 26 Feb 1999 12:22:25 -0800. <199902262022.MAA09175@dingo.cdrom.com>

>> >>On Fri, 26 Feb 1999 09:52:33 -0800, Mike Smith <mike@smith.net.au> said:
>> >> I have just submitted PR kern/10243, but I thought I would ask
>> >> for some advice on hackers as well.
>> >>
>> >> The bug is that under certain loads, read(2) can return corrupted
>> >> data (ie data that are not in the file on disk). The instances I
>> >> have seen are relatively small amounts (8-64 bytes) of corrupt
>> >> data at the end of a 4k page. The corrupt data is from a file
>> >> previously read or another position in the current file. I have
>> >> also seen this problem in 3.0-RELEASE but not in 2.2.8-RELEASE.
>>
>> mike> Can you look at the corrupt data and see if you can identify
>> mike> it? In particular, look for objects that look like IP
>> mike> addresses, MAC addresses, pointers into kernel space, ascii
>> mike> text, etc. This is usually the best way to work out where the
>> mike> data is coming from.
>>
>> The data is always (in every instance that I have examined) from some
>> other part of the file currently being read or some other file in my
>> set of test files. How my test setup works is that I have 30 50MB
>> files. The files are filled with sequential integers (counting over
>> the entire 1.5GB). My test program reads from the files (in order,
>> starting over at file #0 when it reaches file #29) and compares what
>> read(2) returns to what should be there (based on file number and file
>> offset).
>>
>> One other possible clue: This morning I hooked my disks up to the
>> regular Ultra SCSI (40MB/s) port of the 7890 controller rather than
>> the Ultra/2 (80MB/s) port and I haven't seen the bug yet. I am not
>> 100% positive since I have only run it for a few hours so far, but
>> before I could almost always make the bug happen within 10-15
>> minutes.
>
>Could you try bzero'ing your buffers before every read? This sniffs
>very much like short transfers rather than sniping...
>

More information: I ran a test where I stopped all activity on the
system as soon as the first test program observed the bug. That is, I
stopped the other programs reading the disk and turned off the packet
generator that had been raising the network load. Then I read the file
with the garbage data again and it still contains the same garbage at
the same offset. If I do enough disk I/O to flush it from the cache
and then read it again it is fine.

This behavior seems to confirm that it isn't a race condition (because
then I would expect the subsequent read of the file to return the
correct data). Rather, it seems that the buffer cache has become
corrupted because of a short DMA.

Any other suggestions? Would this more likely be a driver bug or a hw
bug? It still seems to be the case that I cannot duplicate the bug
with the disk connected to the 40MB/sec SCSI bus.

andrew

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
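
The checking scheme described in this thread (files filled with
sequential integers, each read verified from file number and file
offset, and the buffer pre-filled before every read as Mike suggests)
might look roughly like the C sketch below. The original test program
is not shown in the thread, so everything here is an assumption: the
32-bit integer width in host byte order, "50MB" taken as 50 MiB, the
fill byte, and all of the function names. A non-zero fill pattern is
used in place of bzero() only so that untouched bytes cannot be
mistaken for the legitimate value 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 30 files of 50 MiB, each holding 32-bit counters. */
#define FILE_BYTES (50u * 1024u * 1024u)
/* Hypothetical fill byte; 0xA5A5A5A5 exceeds any counter in 1.5GB. */
#define FILL_BYTE  0xA5

/* Counter value expected at byte offset `off` of file `file_index`. */
static uint32_t expected_value(unsigned file_index, uint64_t off)
{
    assert(off % sizeof(uint32_t) == 0);
    uint64_t global = (uint64_t)file_index * FILE_BYTES + off;
    return (uint32_t)(global / sizeof(uint32_t));
}

/* Pre-fill the buffer so bytes read(2) never wrote stay recognizable. */
static void poison(void *buf, size_t len)
{
    memset(buf, FILL_BYTE, len);
}

/*
 * Compare `len` bytes read from `file_index` at offset `off` against
 * the expected counters. Returns the byte offset within `buf` of the
 * first bad word, or -1 if the whole buffer matches.
 */
static long first_mismatch(const void *buf, size_t len,
                           unsigned file_index, uint64_t off)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i + sizeof(uint32_t) <= len; i += sizeof(uint32_t)) {
        uint32_t got;
        memcpy(&got, p + i, sizeof got);
        if (got != expected_value(file_index, off + i))
            return (long)i;
    }
    return -1;
}

/* True if the word at `word_off` still holds the fill pattern. */
static int is_poison(const void *buf, size_t word_off)
{
    uint32_t got, pat;
    memcpy(&got, (const unsigned char *)buf + word_off, sizeof got);
    memset(&pat, FILL_BYTE, sizeof pat);
    return got == pat;
}
```

A caller would poison() the buffer, call read(2), and then run
first_mismatch() over the bytes returned. If a bad word still holds the
fill pattern, read(2) never wrote it into the user buffer. If it holds
counters belonging to another offset or file, the stale data was
already in the kernel's copy before it was copied out, which would be
consistent with the corrupted buffer cache described above.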