Date: Wed, 17 Jan 2001 01:21:05 -0800 (PST)
From: "Ronald F. Guilmette" <rfg@monkeys.com>
To: FreeBSD-gnats-submit@freebsd.org
Subject: kern/24401: Advansys SCSI driver crashes random userland progs w/SIGPROF
Message-ID: <200101170921.f0H9L5203676@mail.monkeys.com>
>Number:         24401
>Category:       kern
>Synopsis:       Advansys SCSI driver crashes random userland progs w/SIGPROF
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Jan 17 01:30:01 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Ronald F. Guilmette
>Release:        FreeBSD 4.2-RELEASE i386
>Organization:
Infinite Monkeys & Co.
>Environment:

System consists of:

    ASUS P5A motherboard
    256 MB SDRAM
    3 different PCI 10/100 ethernet controllers (xl0 rl0 rl1)
    ATAPI/EIDE CD ROM drive (ASUS 40x)
    Advansys model ASB3940UA Ultra/Narrow PCI SCSI controller
    3 different SCSI hard drives

>Description:

(Note: This system was crashing random userland processes under
FreeBSD 4.0, usually with spurious SIGVTALRM signals being sent to
random processes at seemingly random times. Now I suspect that at last
I know why; however, I'm not 100% sure that those incidents were
directly related to the bug that I am reporting here. THOSE problems
with FreeBSD 4.0 got so bad that I had to back out of my upgrade to 4.0
and go back to 3.3.)

Now, on to the current problem/bug...

I recently was in the process of decommissioning an old (but large)
Narrow/Ultra SCSI drive that was in the system (along with some
others), i.e. an IBM model DCAS-34330 (4.3 GB). I backed up everything
useful from that drive (a complete 4.1.1 system) onto my trusty old
HP 35470A SCSI DAT/DDS tape drive and removed the drive physically from
the system. I then installed a Quantum 4.5 GB SCSI drive (Quantum
Viking 4.5) and loaded up FreeBSD 4.2 onto it.

I then powered down, attached the old IBM SCSI drive back to the system
(in an external case this time), did a low-level format on it, made a
fresh file system on it, and then tried to restore my important stuff
from my backup tape back onto the old IBM drive (using cpio).
That restore from tape seemed to work OK until about half-way through,
when cpio crashed, apparently because it had received a totally
unexpected SIGPROF. (The console message at the time cpio crashed said
"Profiling time alarm", aka SIGPROF.)

At first, I just chalked this up to sunspots or to gremlins or to the
phase of the moon or something, and I just shrugged it off. (I didn't
really need to do this restore from tape anyway.)

A little later, I decided to sell the old IBM drive on eBay, but first
I wanted to make sure that there would not be any incriminating White
House E-mail messages left intact on the drive. :-) So I did the
following to try to erase whatever was on there formerly, and to wipe
the drive totally clean:

    dd if=/dev/zero of=/dev/da1 bs=4096

This also seemed to be working OK... for a while. But after a while,
the dd process also crashed and the console said "Profiling time alarm"
(aka SIGPROF).

I did the dd again and the exact same thing happened.

I then decided to try to see if these failures were random or if they
were always happening at the same spot on the disk. So I wrote a little
C program (attached below) which would just write 4 KB blocks of zeros
to any device it was told to write them to, printing the block numbers
as it was writing, and then I ran that against /dev/da1.

Sure enough, after 485207 4 KB blocks had been written (about half the
disk) the system locked up. The X server stopped responding to the
mouse and to keyboard input, and about a minute later the system
rebooted of its own accord.

I figured that the Advansys driver was sending the spurious SIGPROF
signals to whatever userland process happened to be running at the
unfortunate moment when it (the Advansys driver) tried to throw one of
these signals. So I decided that it would be best to try running my
little "zerodisk" test program when X was *not* running.

I then did that... several times.
In all cases (4) my little "zerodisk" program crashed unexpectedly
(console message was always "Profiling time alarm") after it had
already written several hundred thousand 4 KB blocks of zeros to
/dev/da1. Here are some of the block counts at the times of the
crashes:

    344449
    329357
    314214

As you can see, it may take a while, but with the Advansys controller
in the system, I could *always* and *repeatedly* get the driver to send
one of these spurious SIGPROF signals to some undeserving userland
process. (On an otherwise quiet system, my little "zerodisk" program
itself was the one most likely to be scheduled for execution by the
kernel at any given instant in time, so it usually received these
signals. But I believe that I have evidence that these spurious SIGPROF
signals might also get sent in some cases to other random userland
processes, depending on the exact timing of their generation within
the kernel.)

After this, I gen'd up a new kernel (with Adaptec support in it),
installed that, yanked the Advansys SCSI card out, plugged in an
Adaptec 3940AU, and re-ran my "zerodisk" test program against /dev/da1.
I did this THREE TIMES, just to be sure, and it worked flawlessly each
time, all the way to the end of the disk... over 1,000,000 4 KB block
writes in each case.

The bottom line is that if you do enough writes (several hundred
thousand, typically) using an Advansys 3940UA controller and an
ordinary Ultra/Narrow SCSI drive (note: the IBM I used does NOT support
tagged command queueing) using FreeBSD 4.2 and the Advansys driver
contained therein, then eventually you are going to work the Advansys
driver into a state where it will start throwing SIGPROF signals at
random times to random userland processes for no apparent reason.

This _does not_ occur with other SCSI controllers (e.g. AHA-3940AU) in
the exact same system/environment.
Clearly the Advansys driver has a VERY subtle, but very bad, bug which,
it appears, can only be consistently/dependably elicited via a very
intense stress test, e.g. several hundred thousand writes to disk
before you can be assured of seeing the bug.

I am filing this bug report as critical/high-priority because the
effects of this bug are so nefarious, i.e. crashing random userland
programs (maybe even init and/or the X server) at totally unpredictable
and random times. (This sort of thing could give FreeBSD a bad
reputation for unreliability!)

>How-To-Repeat:

Get yourself an Advansys ASB3940UA Ultra/Narrow PCI SCSI controller.
Put it into an otherwise unremarkable x86/PCI system. Plug in one SCSI
drive and install FreeBSD 4.2 on it. Plug in a second SCSI drive (at
least 2 GB, but 4 GB would be better) that you can afford to overwrite
entirely, and then just do:

    dd if=/dev/zero of=/dev/da1 bs=4096

(preferably on a quiet system, without any X server running) and then
just sit back and wait. After a while, the dd process will crash and
you'll get the message:

    Profiling time alarm

(I will even loan this exact IBM drive, and the controller, to anyone
who wants to work on this bug. Just ask. The controller is useless to
me now anyway... until someone fixes this bug... and I was gonna sell
the drive on eBay anyway.)

Alternatively, you can run the following simple "zerodisk" program that
I cooked up. This will give you essentially the same results, but will
show how many blocks got written before the spurious SIGPROF arrives.
(BE VERY CAREFUL USING THIS PROGRAM. It must be run as root to access
the disk device files and it can easily wipe out an entire disk
permanently. In fact that is the purpose for which it was written!)
/* zerodisk.c */

#include <stdio.h>
#include <stdlib.h>	/* exit */
#include <stdarg.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>

static char const *pname;

static void
usage (void)
{
  fprintf (stderr, "%s: Usage: `%s device'\n", pname, pname);
  exit (1);
}

/* Print "pname: <formatted message>" and a newline on stderr. */
static void
errorv (register char const *const fmt, va_list ap)
{
  fprintf (stderr, "%s: ", pname);
  vfprintf (stderr, fmt, ap);
  fputc ('\n', stderr);
}

static void
error (register char const *const fmt, ...)
{
  va_list ap;

  va_start (ap, fmt);
  errorv (fmt, ap);
  va_end (ap);
}

/* Like error (), but exit with a failure status afterwards. */
static void
fatal (register char const *const fmt, ...)
{
  va_list ap;

  va_start (ap, fmt);
  errorv (fmt, ap);
  va_end (ap);
  exit (1);
}

int
main (register int const argc, char *argv[])
{
  enum { block_size = 4096 };
  static char zeros[block_size];
  register int fd;
  register unsigned long blockno = 0;

  pname = strrchr (argv[0], '/');
  pname = pname ? pname+1 : argv[0];

  if (argc != 2)
    usage ();

  if ((fd = open (argv[1], O_WRONLY)) == -1)
    fatal ("Error opening `%s': %s", argv[1], strerror (errno));

  /* Write zero-filled blocks until an error or a short write. */
  for (;;)
    {
      register ssize_t n;

      printf ("\rWriting block %lu", ++blockno);
      fflush (stdout);
      if ((n = write (fd, zeros, block_size)) == -1)
        {
          putchar ('\n');
          fatal ("Error writing `%s': %s", argv[1], strerror (errno));
        }
      if (n < block_size)
        {
          putchar ('\n');
          error ("EOF detected on `%s'", argv[1]);
          exit (0);
        }
    }
}

>Fix:

Buy and install a non-Advansys brand of SCSI controller.

>Release-Note:
>Audit-Trail:
>Unformatted:

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message