Date: Wed, 24 Feb 1999 15:42:21 -0500
From: Andrew Heybey <ath@niksun.com>
To: FreeBSD-gnats-submit@freebsd.org
Subject: kern/10243: read(2) returns garbage
Message-ID: <199902242042.PAA24006@stiegl.niksun.com>

>Number:         10243
>Category:       kern
>Synopsis:       Under heavy disk and network load read(2) can return garbage.
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Feb 25 10:20:02 PST 1999
>Closed-Date:
>Last-Modified:
>Originator:     Andrew Heybey
>Release:        FreeBSD 3.1-RELEASE i386
>Organization:   Niksun
>Environment:

3.1-RELEASE GENERIC kernel (+ bpf)
450Mhz P-II, 256MB memory, Asus P2B-LS motherboard.
Adaptec 7890 SCSI controller, IBM DRVS09V (10000 RPM LVD) disks
Intel EtherExpress Pro 10/100B ethernet

Full dmesg output (or any other info) available to anyone who wants
to look into this.

>Description:

The bug is that under certain loads, read(2) can return corrupted data
(ie data that are not in the file on disk). The instances I have seen
are relatively small amounts (8-64 bytes) of corrupt data at the end of
a 4k page. The corrupt data is from a file previously read or another
position in the current file.

I have also seen this problem in 3.0-RELEASE but not in 2.2.8-RELEASE.

Specifically, I can reproduce the bug under the following conditions
(I am sorry that I don't have a smaller and simpler test case):

1) Multiple processes reading a set of large files. I believe that the
   amount of data must be large enough such that the reads come from
   disk, not the cache (if I only read one 50MB file, I do not see the
   bug). (I have used 1.5GB of data files on a system with 256MB of
   physical memory.) I also believe that multiple read processes must
   be running (I have used 4 processes and found the bug, but not with
   only one process).

   The files that I have used are filled with sequential integers.
   This allows my test program to know if it gets bogus data from
   read(2), since it knows what should be there.

*AND*

2) Very high network interrupt rate. I have tested on a fast ethernet
   receiving at about 46000 packets/sec. I use bpf to get the network
   interrupt rate up that high without having to do any protocol
   processing. I don't know if the network or bpf code has anything to
   do with the bug or if it is just that the high load stimulates some
   cam/vm/ufs/bpf bug.

   I have not been able to reproduce the bug without this high load.
   Both zero pkts/sec and 3000 pkts/sec do *not* exhibit the bug (or at
   least not after running for several hours), while with the network
   load it will usually occur within 10 minutes.

>How-To-Repeat:

I have put a small suite of programs that I use to produce this bug at
http://www.niksun.com/~ath/fbsd_bug.tgz. The tar file contains a few
test programs and complete instructions on how I tickle the bug.

I have reproduced the bug on two different machines, so I don't think
that the hw is broken (though the machines have substantially the same
kind of hardware so it is conceivable that it is a HW misdesign of some
kind).

I welcome advice on how to track this down. It smells to me like an
insufficient-application-of-splfoo bug, but I'm not even sure where to
start looking. For example why would network I/O and BPF have any
effect on disk reads?

Even better, I suppose, would be someone to tell me that I'm an idiot
and my test program is broken. But it is really a very simple program
and has run for hours without a problem when there is negligible
network load.

>Fix:

I wish.

>Release-Note:
>Audit-Trail:
>Unformatted:

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message
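
The check described under >Description (data files filled with
sequential integers, so that a reader can tell when read(2) hands back
something that is not in the file) could be sketched roughly as
follows. This is not the test program from fbsd_bug.tgz. The function
names (first_mismatch, write_pattern, read_full, verify_file), the
32-bit word size, the starting value of 0, the host byte order and the
4 KB chunk size are all assumptions made for illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_WORDS 1024u   /* assumed: 1024 * 4 bytes = 4 KB per read */

/* Index of the first word that breaks the sequence starting at `first`,
 * or -1 if all `nwords` words hold the expected values. */
long first_mismatch(const uint32_t *buf, size_t nwords, uint32_t first)
{
    for (size_t i = 0; i < nwords; i++)
        if (buf[i] != first + (uint32_t)i)
            return (long)i;
    return -1;
}

/* Fill `path` with the integers 0 .. nwords-1 in host byte order.
 * Returns 0 on success, -1 on error (a short write counts as an error). */
int write_pattern(const char *path, uint32_t nwords)
{
    uint32_t buf[CHUNK_WORDS];
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    for (uint32_t next = 0; next < nwords; ) {
        uint32_t n = nwords - next < CHUNK_WORDS ? nwords - next : CHUNK_WORDS;
        for (uint32_t i = 0; i < n; i++)
            buf[i] = next + i;
        if (write(fd, buf, n * sizeof buf[0]) != (ssize_t)(n * sizeof buf[0])) {
            close(fd);
            return -1;
        }
        next += n;
    }
    return close(fd) == 0 ? 0 : -1;
}

/* Read up to `len` bytes, retrying short reads. Returns the byte count
 * (less than `len` only at end of file) or -1 on error. */
ssize_t read_full(int fd, void *dst, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, (char *)dst + got, len - got);
        if (n < 0)
            return -1;
        if (n == 0)
            break;
        got += (size_t)n;
    }
    return (ssize_t)got;
}

/* Returns -1 if every whole word in the file matches the sequence, the
 * byte offset of the first bad word otherwise, or -2 on an I/O error.
 * Trailing bytes that do not fill a whole word are ignored. */
long long verify_file(const char *path)
{
    uint32_t buf[CHUNK_WORDS];
    uint32_t expected = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -2;
    for (;;) {
        ssize_t n = read_full(fd, buf, sizeof buf);
        if (n < 0) {
            close(fd);
            return -2;
        }
        size_t nwords = (size_t)n / sizeof buf[0];
        long bad = first_mismatch(buf, nwords, expected);
        if (bad >= 0) {
            close(fd);
            return ((long long)expected + bad) * (long long)sizeof buf[0];
        }
        expected += (uint32_t)nwords;
        if ((size_t)n < sizeof buf)
            break;              /* short chunk: end of file reached */
    }
    close(fd);
    return -1;
}
```

To approximate the conditions in the report, one would presumably run
several such verifiers at once over a data set larger than physical
memory while the network load is applied. A reported offset shortly
before a 4k boundary would match the symptom described above.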