Date: Tue, 16 Jan 2001 10:16:28 -0600
From: "Hamilton, Kent" <KHamilton@Hunter.COM>
To: "'Andrew Gordon'" <arg@arg1.demon.co.uk>
Cc: "'freebsd-stable@freebsd.org'" <freebsd-stable@freebsd.org>
Subject: RE: Vinum incidents.
Message-ID: <508F01B47A2BD411844500A0C9C83B440B6F44@mailbox.Hunter.COM>
I've seen both of these problems on my home system as well.  If someone
has a solution, I'd love to hear about it.  My system is a dual P-III
500 with two Adaptec controllers, four Seagate Barracuda 9GB disks,
Vinum RAID-5, and softupdates.  A Fujitsu 2GB standalone drive, a
CD-RW, a CD-ROM, a DVD writer, and an HP tape drive are also on the
SCSI busses.

> -----Original Message-----
> From: Andrew Gordon [mailto:arg@arg1.demon.co.uk]
> Sent: Tuesday, January 16, 2001 9:24 AM
> To: freebsd-stable@FreeBSD.ORG
> Subject: Vinum incidents.
>
>
> I have a server with 5 identical SCSI drives, arranged as a single
> RAID-5 volume using vinum (and softupdates).  This is exported with
> NFS/Samba/Netatalk/Econet to clients of various types; the root, usr,
> and var partitions are on a small IDE drive (there are no local users
> or application processes).  The machine has a serial console.
>
> This has been working reliably for a couple of months, running -stable
> from around the time of 4.2-RELEASE.  On 1st January, I took advantage
> of the low load to do an upgrade to the latest -stable.
>
> Since then, there have been two incidents (probably not in fact
> related to the upgrade) where vinum has not behaved as expected:
>
> 1) Phantom disc error
> ---------------------
>
> Vinum logged:
>
> Jan  2 01:59:26 serv20 /kernel: home.p0.s0: fatal write I/O error
> Jan  2 01:59:26 serv20 /kernel: vinum: home.p0.s0 is stale by force
> Jan  2 01:59:26 serv20 /kernel: vinum: home.p0 is degraded
>
> However, there was no evidence of any actual disc error - nothing was
> logged on the console, in dmesg, or in any other log file.  The system
> would have been substantially idle at that time of night, except that
> the daily cron jobs would have just been starting.
>
> A "vinum start home.p0.s0" some time later successfully revived the
> plex, and the system then ran uninterrupted for two weeks.
>
> Does this suggest some sort of out-of-range block number bug
> somewhere?
> 2) Recovery problems
> --------------------
>
> This morning, a technician accidentally(!) unplugged the cable between
> the SCSI card and the drive enclosure while the system was busy.  The
> console showed a series of SCSI errors, culminating in a panic.
> Although it is configured to dump to the IDE drive, no dump was saved
> (possibly due to someone locally pressing the reset button).  In any
> case, this panic was probably not very interesting.
>
> On reboot, it failed automatic fsck due to unexpected softupdates
> inconsistencies.  A manual fsck worked OK with only a modest number of
> incorrect block count/unref file errors, but a huge number of
> "allocated block/frag marked free" errors.  A second fsck produced no
> errors, so I mounted the filesystem and continued.
>
> Sometime during this, the following occurred:
>
> (da3:ahc0:0:3:0): READ(10). CDB: 28 0 0 22 1d d6 0 0 2 0
> (da3:ahc0:0:3:0): MEDIUM ERROR info:221dd6 asc:11,0
> (da3:ahc0:0:3:0): Unrecovered read error sks:80,35
> Jan 16 09:17:40 serv20 /kernel: home.p0.s3: fatal read I/O error
> Jan 16 09:17:40 serv20 /kernel: vinum: home.p0.s3 is crashed by force
> Jan 16 09:17:40 serv20 /kernel: vinum: home.p0 is degraded
> (da3:ahc0:0:3:0): READ(10). CDB: 28 0 0 22 8 3a 0 0 2 0
> (da3:ahc0:0:3:0): MEDIUM ERROR info:22083a asc:11,0
> (da3:ahc0:0:3:0): Unrecovered read error sks:80,35
> Jan 16 09:17:41 serv20 /kernel: home.p0.s3: fatal read I/O error
> Jan 16 09:17:42 serv20 /kernel: vinum: home.p0.s3 is stale by force
>
> These were real errors (reproducible by reading from da3s1a with
> 'dd'), so I fixed them by writing zeros over most of the drive, and
> verified by dd-ing /dev/da3s1a to /dev/null.  Since this now read OK,
> I tried to revive the subdisk with "vinum start home.p0.s3".  Vinum
> reported that it was reviving, then reported all the working drives
> "crashed by force", and the machine locked solid (no panic or dump;
> the reset button was required).
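[The zero-and-verify pass described above can be sketched as a short
script.  This is a minimal illustration only: it uses a scratch file in
place of /dev/da3s1a, since step 1 is destructive and should only ever
touch a real device whose contents are expendable (here the subdisk's
data would be rebuilt from RAID-5 parity anyway).  The path and block
size are stand-ins, not values from the original report.]

```shell
#!/bin/sh
# Sketch of the zero-and-verify pass: DISK is a scratch file standing
# in for /dev/da3s1a.
DISK=/tmp/fake-da3s1a
SIZE_BLOCKS=16          # 16 x 64 kB = 1 MB scratch area

# Create the scratch "disk" for the demonstration.
dd if=/dev/zero of="$DISK" bs=64k count="$SIZE_BLOCKS" 2>/dev/null

# Step 1: overwrite with zeros.  On a real SCSI drive, writing to a
# bad sector typically causes the drive to remap it to a spare.
dd if=/dev/zero of="$DISK" bs=64k count="$SIZE_BLOCKS" conv=notrunc 2>/dev/null

# Step 2: read the whole thing back to verify no unreadable blocks
# remain (this mirrors the dd-to-/dev/null check described above).
if dd if="$DISK" of=/dev/null bs=64k 2>/dev/null; then
    echo "read-back OK"
else
    echo "still has unreadable blocks" >&2
    exit 1
fi
```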
> Jan 16 09:48:28 serv20 /kernel: vinum: drive drive3 is up
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0.s0 is crashed by force
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0 is corrupt
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0.s1 is crashed by force
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0.s2 is crashed by force
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0.s4 is crashed by force
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0 is faulty
> Jan 16 09:48:46 serv20 /kernel: vinum: home is down
>
> On reboot, the vinum volume was broken:
>
> vinum: /dev is mounted read-only, not rebuilding /dev/vinum
> Warning: defective objects
>
> V home          State: down     Plexes: 1     Size: 68 GB
> P home.p0    R5 State: faulty   Subdisks: 5   Size: 68 GB
> S home.p0.s0    State: crashed  PO: 0  B      Size: 17 GB
> S home.p0.s1    State: crashed  PO: 512 kB    Size: 17 GB
> S home.p0.s2    State: crashed  PO: 1024 kB   Size: 17 GB
> S home.p0.s3    State: R 0%     PO: 1536 kB   Size: 17 GB
>   *** Start home.p0.s3 with 'start' command ***
> S home.p0.s4    State: crashed  PO: 2048 kB   Size: 17 GB
>
> I used 'vinum start' on home.p0.s[0124], and the plex came back in
> degraded mode; after fsck it mounted OK.
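[A side note on the sizes in these listings: a RAID-5 plex over n equal
subdisks exposes (n-1) times the subdisk size, since one subdisk's
worth of space is consumed by parity; that is why five 17 GB subdisks
appear as a 68 GB volume.  A trivial shell check:]

```shell
#!/bin/sh
# RAID-5 capacity check: with n equal subdisks of S GB each, one
# subdisk's worth of space holds parity, leaving (n-1)*S GB usable.
DISKS=5        # subdisks in the home.p0 plex
SIZE_GB=17     # size of each subdisk
echo "usable: $(( (DISKS - 1) * SIZE_GB )) GB"
```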
> On booting to multi-user mode, I noticed that all the drives were
> marked as 'down', even though the volume and most of the subdisks
> were 'up' (and a quick check in the console scroll-back showed that
> it was also in this state before the previous attempt to revive):
>
> vinum -> l
> 5 drives:
> D drive0   State: down   Device /dev/da0s1a   Avail: 0/17500 MB (0%)
> D drive1   State: down   Device /dev/da1s1a   Avail: 0/17500 MB (0%)
> D drive2   State: down   Device /dev/da2s1a   Avail: 0/17500 MB (0%)
> D drive3   State: down   Device /dev/da3s1a   Avail: 0/17500 MB (0%)
> D drive4   State: down   Device /dev/da4s1a   Avail: 0/17500 MB (0%)
> 1 volumes:
> V home          State: up        Plexes: 1     Size: 68 GB
> 1 plexes:
> P home.p0    R5 State: degraded  Subdisks: 5   Size: 68 GB
> 5 subdisks:
> S home.p0.s0    State: up        PO: 0  B      Size: 17 GB
> S home.p0.s1    State: up        PO: 512 kB    Size: 17 GB
> S home.p0.s2    State: up        PO: 1024 kB   Size: 17 GB
> S home.p0.s3    State: R 0%      PO: 1536 kB   Size: 17 GB
>   *** Start home.p0.s3 with 'start' command ***
> S home.p0.s4    State: up        PO: 2048 kB   Size: 17 GB
>
> This time, I used 'vinum start' on drive[0-4] before doing vinum
> start on home.p0.s3, and this time it successfully revived, taking
> 10 minutes or so.
> Some minutes later, the machine panicked (this time saving a dump):
>
> IdlePTD 3166208
> initial pcb at 282400
> panicstr: softdep_lock: locking against myself
> panic messages:
> ---
> panic: softdep_setup_inomapdep: found inode
> (kgdb) where
> #0  0xc014dd1a in dumpsys ()
> #1  0xc014db3b in boot ()
> #2  0xc014deb8 in poweroff_wait ()
> #3  0xc01e6b49 in acquire_lock ()
> #4  0xc01eae02 in softdep_fsync_mountdev ()
> #5  0xc01eef0e in ffs_fsync ()
> #6  0xc01edc16 in ffs_sync ()
> #7  0xc017b42b in sync ()
> #8  0xc014d916 in boot ()
> #9  0xc014deb8 in poweroff_wait ()
> #10 0xc01e792c in softdep_setup_inomapdep ()
> #11 0xc01e44a4 in ffs_nodealloccg ()
> #12 0xc01e352b in ffs_hashalloc ()
> #13 0xc01e3186 in ffs_valloc ()
> #14 0xc01f4e6f in ufs_makeinode ()
> #15 0xc01f2824 in ufs_create ()
> #16 0xc01f5029 in ufs_vnoperate ()
> #17 0xc01b1e43 in nfsrv_create ()
> #18 0xc01c6b2e in nfssvc_nfsd ()
> #19 0xc01c6483 in nfssvc ()
> #20 0xc022b949 in syscall2 ()
> #21 0xc02207b5 in Xint0x80_syscall ()
> #22 0x8048135 in ?? ()
>
> After this reboot (again requiring a manual fsck) the system appears
> to be working normally, but again the drives are all marked 'down':
>
> serv20[arg]% vinum l
> 5 drives:
> D drive0   State: down   Device /dev/da0s1a   Avail: 0/17500 MB (0%)
> D drive1   State: down   Device /dev/da1s1a   Avail: 0/17500 MB (0%)
> D drive2   State: down   Device /dev/da2s1a   Avail: 0/17500 MB (0%)
> D drive3   State: down   Device /dev/da3s1a   Avail: 0/17500 MB (0%)
> D drive4   State: down   Device /dev/da4s1a   Avail: 0/17500 MB (0%)
>
> 1 volumes:
> V home          State: up   Plexes: 1     Size: 68 GB
>
> 1 plexes:
> P home.p0    R5 State: up   Subdisks: 5   Size: 68 GB
>
> 5 subdisks:
> S home.p0.s0    State: up   PO: 0  B      Size: 17 GB
> S home.p0.s1    State: up   PO: 512 kB    Size: 17 GB
> S home.p0.s2    State: up   PO: 1024 kB   Size: 17 GB
> S home.p0.s3    State: up   PO: 1536 kB   Size: 17 GB
> S home.p0.s4    State: up   PO: 2048 kB   Size: 17 GB
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-stable" in the body of the message