Date: Fri, 23 Oct 1998 18:31:10 -0700 (PDT)
From: Chris Timmons <skynyrd@opus.cts.cwu.edu>
To: freebsd-scsi@FreeBSD.ORG
Subject: Thrashing CAM on SMP
Message-ID: <Pine.BSF.3.96.981023181248.28551A-100000@opus.cts.cwu.edu>
I tried recently to reproduce the problems Mark Murray has with CAM & SMP (panic with X going and lots of filesystem activity). I couldn't get a panic, but I did have the machine wedge with recurring, non-recoverable device timeouts on the system and swap disks. The machine is a server and doesn't have a workstation video card. Of course, I forgot BREAK_TO_DEBUGGER, so I couldn't get a dump.

Using an SMP -CURRENT from just before the 3.0 release, I set up three 256M bonnies on different spindles, an md5 of a 280MB file, and a 'make -j 12 buildworld', all in loops to repeat over and over. The buildworld loop also unmounted, newfs-ed and remounted /usr/obj after each turn.

The machine is a dual-PII 266 Tyan Tiger. The system lasted for a couple of days with a load average between 5 and 12. The activity lights on the 3 bonnie drives were almost always solid green and the box sounded like a popcorn popper.

<IBM DDRS-34560W S71D> at scbus0 target 0 lun 0 (pass0,da0)
<IBM DDRS-34560W S71D> at scbus0 target 1 lun 0 (pass1,da1)
<SEAGATE ST34572W 0718> at scbus1 target 0 lun 0 (pass2,da2)
<SEAGATE ST34572W 0784> at scbus1 target 1 lun 0 (pass3,da3)
<QUANTUM XP34550W LXY4> at scbus1 target 4 lun 0 (pass4,da4)

During the time it was alive, the bonnies were running on da2, da3, and da4. The only trouble I had was device timeouts on the firmware-buggy Atlas-II, and an occasional hiccup on the Seagates. I'm using 40MHz xfer rates and Adaptec cables with the active terminators, drive termination off.

midtest3:/root#> grep BDR /var/log/messages
Oct 21 04:16:13 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 05:27:29 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 14:44:18 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 17:10:31 midtest3 /kernel: (da3:ahc1:0:1:0): BDR message in message buffer
Oct 21 17:11:31 midtest3 /kernel: (da3:ahc1:0:1:0): BDR message in message buffer
Oct 21 17:12:30 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 19:47:07 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 20:04:24 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 01:38:54 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 02:51:36 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 04:10:12 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 05:51:51 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 07:41:04 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 07:47:00 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 09:22:32 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 10:50:59 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 11:06:40 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 13:34:20 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 15:12:07 midtest3 /kernel: (da2:ahc1:0:0:0): BDR message in message buffer
Oct 22 15:13:07 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 15:28:40 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 15:43:34 midtest3 /kernel: (da2:ahc1:0:0:0): BDR message in message buffer
Oct 22 15:44:34 midtest3 /kernel: (da2:ahc1:0:0:0): BDR message in message buffer
Oct 22 15:45:34 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 16:18:28 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 17:06:45 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB

When it finally died, I'd swear it was telling me that da0 and/or da1 kept
timing out (messages to the serial console, which I of course didn't trap). The machine would respond to pings and print out the BDR timeout messages, but would not do anything else, so it was apparently stuck at a fairly high spl.

I'm getting up to date now, having noticed Ken's recent mega-commit. I'll be able to break in with ddb now, and can take a dump if the situation recurs.

The system is in a mega rack-mount case with multiple cooling fans blowing directly on the drives, which were cool to the touch during the middle of the run, so I don't think we overheated.

-Chris

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message
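
For anyone comparing runs, the grep output in this message is easier to read as a per-device tally. The following is only a minimal sketch, not part of the original report: the function name tally_bdr and the regular expression are assumptions, written against the log line format shown in the message.

```python
import re
from collections import Counter

# Assumed pattern for lines such as:
#   "Oct 21 04:16:13 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB"
# Group 1 is the device name, group 2 is the message text.
BDR_LINE = re.compile(r"\((da\d+):[^)]*\): (.*BDR.*)$")


def tally_bdr(lines):
    """Count BDR log lines per (device, message) pair."""
    counts = Counter()
    for line in lines:
        match = BDR_LINE.search(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Three lines copied from the log in this message.
sample = [
    "Oct 21 17:10:31 midtest3 /kernel: (da3:ahc1:0:1:0): BDR message in message buffer",
    "Oct 21 17:11:31 midtest3 /kernel: (da3:ahc1:0:1:0): BDR message in message buffer",
    "Oct 21 17:12:30 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB",
]

for (device, message), n in sorted(tally_bdr(sample).items()):
    print(device, n, message)
```

Fed the full listing from this message, the tally should come to 21 "Queuing a BDR SCB" lines for da4, 2 "BDR message in message buffer" lines for da3, and 3 for da2.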