Date: Thu, 13 May 1999 22:06:41 +0200 (CEST)
From: Tor Egge <tegge@not.fast.no>
To: FreeBSD-gnats-submit@freebsd.org
Subject: kern/11697: Disk failure hangs system
Message-ID: <199905132006.WAA59935@not.fast.no>

>Number: 11697
>Category: kern
>Synopsis: Disk failure hangs system
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Thu May 13 13:10:02 PDT 1999
>Closed-Date:
>Last-Modified:
>Originator: Tor Egge
>Release: FreeBSD 3.1-STABLE i386
>Organization:
Fast Search & Transfer ASA
>Environment:

FreeBSD 3.1-STABLE #0: Sat May 1 19:00:19 CEST 1999
root@response.fast.no:/usr/src/sys/compile/INDEX_SMP_SERIAL_DDB i386

ahc1: <Adaptec 2940 Ultra2 SCSI adapter> rev 0x00 int a irq 17 on pci0.14.0
ahc1: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
da13 at ahc1 bus 0 target 9 lun 0
da13: <QUANTUM QM318000TD-SCA N1K0> Fixed Direct Access SCSI-2 device
da13: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da13: 17366MB (35566499 512 byte sectors: 255H 63S/T 2213C)

>Description:

----------------------
Unexpected busfree. LASTPHASE == 0x80 SEQADDR == 0x15b
(da13:ahc1:0:9:0): Invalidating pack
(da13:ahc1:0:9:0): Invalidating pack
(da13:ahc1:0:9:0): Invalidating pack
vm_fault: pager read error, pid 63486 (mkserv)
(da13:ahc1:0:9:0): Invalidating pack
Stopped at siointr1+0x6d: jmp siointr1+0x159
db> trace
siointr1(e3c8d800,e02890b0,0,f2e0da2c,e0206144) at siointr1+0x6d
siointr(0,f2e00010,0,1,e0289014) at siointr+0x1d
Xfastintr4(ebd13528,e3e12800,ebd13528,c8000040,e0e7e8c8) at Xfastintr4+0x24
biodone(ebd13528,ebd13528,ebd13528,c8000040,e3e08000) at biodone+0x2d0
dastrategy(ebd13528,200202b4,f2e0daa8,e018167d,f2e0dacc) at dastrategy+0xab
spec_strategy(f2e0dacc,f2e0dab4,e01e73a9,f2e0dacc,f2e0dad8) at spec_strategy+0x3e
spec_vnoperate(f2e0dacc,f2e0dad8,e016d46f,f2e0dacc,2000) at spec_vnoperate+0x15
ufs_vnoperatespec(f2e0dacc) at ufs_vnoperatespec+0x15
bwrite(ebd13528,f2e0daf0,e0171879,f2e0db34,f2e0dafc) at bwrite+0xaf
vop_stdbwrite(f2e0db34,f2e0dafc,e018167d,f2e0db34,f2e0db08) at vop_stdbwrite+0xe
vop_defaultop(f2e0db34,f2e0db08,e01e73a9,f2e0db34,f2e0db3c) at vop_defaultop+0x15
spec_vnoperate(f2e0db34,f2e0db3c,e016de03,f2e0db34,200) at spec_vnoperate+0x15
ufs_vnoperatespec(f2e0db34,200,ebd13528,1,0) at ufs_vnoperatespec+0x15
vfs_bio_awrite(ebd13528,200,a200a000,1,f2e00010) at vfs_bio_awrite+0x103
getnewbuf(f1cea900,d10050,0,0,2000) at getnewbuf+0x2ec
getblk(f1cea900,d10050,2000,0,0) at getblk+0x244
bread(f1cea900,d10050,2000,0,f2e0dc48) at bread+0x21
ffs_vget(e3e8c200,54ee7,f2e0dccc,f283ee40,f2e0df14) at ffs_vget+0x1bc
ufs_lookup(f2e0dd24,f2e0dd38,e017055c,f2e0dd24,f3009c47) at ufs_lookup+0x936
ufs_vnoperate(f2e0dd24,f3009c47,f283ee40,f2e0df14,0) at ufs_vnoperate+0x15
vfs_cache_lookup(f2e0dd80,f2e0dd90,e01729fd,f2e0dd80,f1c6ce00) at vfs_cache_lookup+0x248
ufs_vnoperate(f2e0dd80,f1c6ce00,f2e0df14,f2e0def0,0) at ufs_vnoperate+0x15
lookup(f2e0def0,0,f2e0df84,f2e0def0,7273752f) at lookup+0x2c1
namei(f2e0def0,0,f2e0df84,f2d5c840,286) at namei+0x133
vn_open(f2e0def0,3,584,f2d5c840,e0254064) at vn_open+0x1f6
open(f2d5c840,f2e0df84,dfbfd594,dfbfc7e0,dfbfbfe4) at open+0xad
syscall(27,27,dfbfbfe4,dfbfc7e0,dfbfc7b4) at syscall+0x187
Xint0x80_syscall() at Xint0x80_syscall+0x4c
db> panic
panic: from debugger
mp_lock = 01000002; cpuid = 1; lapic.id = 00000000
boot() called on cpu#1

syncing disks...
-------------

The SCSI bus is freed at the wrong moment, probably due to the device
resetting. Then the command is retried, but is aborted AGAIN due to a
selection timeout (indicating that the device had not completed
resetting).

This might be caused by bad firmware on the disk or by a power supply
that is too weak. I assume this is bad firmware.

Combined with the VFS code being conservative (not wanting to throw
away buffer contents on fatal write errors (which might lead to file
system corruption if this is a transient error)), this sometimes leads
to the buffer queues being filled with dirty buffers associated with
the invalidated disk pack.

Combined with what appears to be a bug in the routine waitfreebuffers,
this could lead to an infinite busy loop in the kernel inside a
splbio()-protected region of code.

>How-To-Repeat:

Use Quantum disks.

>Fix:

Index: vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.193.2.5
diff -u -r1.193.2.5 vfs_bio.c
--- vfs_bio.c 1999/04/20 19:54:20 1.193.2.5
+++ vfs_bio.c 1999/05/12 19:57:13
@@ -577,7 +577,8 @@
 	if (bp->b_flags & B_LOCKED)
 		bp->b_flags &= ~B_ERROR;
 
-	if ((bp->b_flags & (B_READ | B_ERROR)) == B_ERROR) {
+	if ((bp->b_flags & (B_READ | B_ERROR)) == B_ERROR &&
+	    bp->b_error != ENXIO) {
 		bp->b_flags &= ~B_ERROR;
 		bdirty(bp);
 	} else if ((bp->b_flags & (B_NOCACHE | B_INVAL | B_ERROR | B_FREEBUF)) ||
@@ -1219,7 +1220,7 @@
 waitfreebuffers(int slpflag, int slptimeo) {
 	while (numfreebuffers < hifreebuffers) {
 		flushdirtybuffers(slpflag, slptimeo);
-		if (numfreebuffers < hifreebuffers)
+		if (numfreebuffers >= hifreebuffers)
 			break;
 		needsbuffer |= VFS_BIO_NEED_FREE;
 		if (tsleep(&needsbuffer, (PRIBIO + 4)|slpflag, "biofre", slptimeo))

>Release-Note:
>Audit-Trail:
>Unformatted:

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message