Date: Tue, 16 Jan 2001 10:16:28 -0600
From: "Hamilton, Kent" <KHamilton@Hunter.COM>
To: "'Andrew Gordon'" <arg@arg1.demon.co.uk>
Cc: "'freebsd-stable@freebsd.org'" <freebsd-stable@freebsd.org>
Subject: RE: Vinum incidents.
Message-ID: <508F01B47A2BD411844500A0C9C83B440B6F44@mailbox.Hunter.COM>
I've seen both of these problems on my home system as well.  If someone
has a solution, I'd love to hear about it.  My system is a dual P-III
500 with two Adaptec controllers, four Seagate Barracuda 9GB disks,
Vinum RAID-5, and softupdates.  A Fujitsu 2GB standalone drive, a
CD-RW, a CD-ROM, a DVD writer, and an HP tape drive are also on the
SCSI busses.

> -----Original Message-----
> From: Andrew Gordon [mailto:arg@arg1.demon.co.uk]
> Sent: Tuesday, January 16, 2001 9:24 AM
> To: freebsd-stable@FreeBSD.ORG
> Subject: Vinum incidents.
>
>
> I have a server with 5 identical SCSI drives, arranged as a single
> RAID-5 volume using vinum (and softupdates).  This is exported with
> NFS/Samba/Netatalk/Econet to clients of various types; the root, usr,
> and var partitions are on a small IDE drive (there are no local users
> or application processes).  The machine has a serial console.
>
> This has been working reliably for a couple of months, running -stable
> from around the time of 4.2-RELEASE.  On 1st January, I took advantage
> of the low load to do an upgrade to the latest -stable.
>
> Since then, there have been two incidents (probably not in fact
> related to the upgrade) where vinum has not behaved as expected:
>
> 1) Phantom disc error
> ---------------------
>
> Vinum logged:
>
> Jan  2 01:59:26 serv20 /kernel: home.p0.s0: fatal write I/O error
> Jan  2 01:59:26 serv20 /kernel: vinum: home.p0.s0 is stale by force
> Jan  2 01:59:26 serv20 /kernel: vinum: home.p0 is degraded
>
> However, there was no evidence of any actual disc error - nothing was
> logged on the console, in dmesg, or in any other log file.  The system
> would have been substantially idle at that time of night, except that
> the daily cron jobs would have just been starting.
>
> A "vinum start home.p0.s0" some time later successfully revived the
> plex, and the system then ran uninterrupted for two weeks.
>
> Does this suggest some sort of out-of-range block number bug
> somewhere?
> 2) Recovery problems
> --------------------
>
> This morning, a technician accidentally(!) unplugged the cable between
> the SCSI card and the drive enclosure while the system was busy.  The
> console showed a series of SCSI errors, culminating in a panic.
> Although it is configured to dump to the IDE drive, no dump was saved
> (possibly due to someone locally pressing the reset button).  In any
> case, this panic was probably not very interesting.
>
> On reboot, it failed automatic fsck due to unexpected softupdates
> inconsistencies.  A manual fsck worked OK with only a modest number of
> incorrect block count/unref file errors, but a huge number of
> "allocated block/frag marked free" errors.  A second fsck produced no
> errors, so I mounted the filesystem and continued.
>
> Sometime during this, the following occurred:
>
> (da3:ahc0:0:3:0): READ(10). CDB: 28 0 0 22 1d d6 0 0 2 0
> (da3:ahc0:0:3:0): MEDIUM ERROR info:221dd6 asc:11,0
> (da3:ahc0:0:3:0): Unrecovered read error sks:80,35
> Jan 16 09:17:40 serv20 /kernel: home.p0.s3: fatal read I/O error
> Jan 16 09:17:40 serv20 /kernel: vinum: home.p0.s3 is crashed by force
> Jan 16 09:17:40 serv20 /kernel: vinum: home.p0 is degraded
> (da3:ahc0:0:3:0): READ(10). CDB: 28 0 0 22 8 3a 0 0 2 0
> (da3:ahc0:0:3:0): MEDIUM ERROR info:22083a asc:11,0
> (da3:ahc0:0:3:0): Unrecovered read error sks:80,35
> Jan 16 09:17:41 serv20 /kernel: home.p0.s3: fatal read I/O error
> Jan 16 09:17:42 serv20 /kernel: vinum: home.p0.s3 is stale by force
>
> These were real errors (reproducible by reading from da3s1a with
> 'dd'), so I fixed them by writing zeros over most of the drive, and
> verified by dd-ing /dev/da3s1a to /dev/null.  Since this now read OK,
> I tried to revive the subdisk with "vinum start home.p0.s3".  Vinum
> reported that it was reviving, then reported all the working drives
> "crashed by force", and the machine locked solid (no panic or dump;
> the reset button was required).
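[The zero-and-verify pass described above can be sketched as a short
script.  This is a minimal illustration only: it uses a scratch file in
place of /dev/da3s1a, since step 1 is destructive and should only ever
touch a real device whose contents are expendable (here the subdisk's
data would be rebuilt from RAID-5 parity anyway).  The path and block
size are stand-ins, not values from the original report.]

```shell
#!/bin/sh
# Sketch of the zero-and-verify pass: DISK is a scratch file standing
# in for /dev/da3s1a.
DISK=/tmp/fake-da3s1a
SIZE_BLOCKS=16          # 16 x 64 kB = 1 MB scratch area

# Create the scratch "disk" for the demonstration.
dd if=/dev/zero of="$DISK" bs=64k count="$SIZE_BLOCKS" 2>/dev/null

# Step 1: overwrite with zeros.  On a real SCSI drive, writing to a
# bad sector typically causes the drive to remap it to a spare.
dd if=/dev/zero of="$DISK" bs=64k count="$SIZE_BLOCKS" conv=notrunc 2>/dev/null

# Step 2: read the whole thing back to verify no unreadable blocks
# remain (this mirrors the dd-to-/dev/null check described above).
if dd if="$DISK" of=/dev/null bs=64k 2>/dev/null; then
    echo "read-back OK"
else
    echo "still has unreadable blocks" >&2
    exit 1
fi
```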
> Jan 16 09:48:28 serv20 /kernel: vinum: drive drive3 is up
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0.s0 is crashed by force
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0 is corrupt
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0.s1 is crashed by force
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0.s2 is crashed by force
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0.s4 is crashed by force
> Jan 16 09:48:46 serv20 /kernel: vinum: home.p0 is faulty
> Jan 16 09:48:46 serv20 /kernel: vinum: home is down
>
> On reboot, the vinum volume was broken:
>
> vinum: /dev is mounted read-only, not rebuilding /dev/vinum
> Warning: defective objects
>
> V home          State: down     Plexes: 1     Size: 68 GB
> P home.p0    R5 State: faulty   Subdisks: 5   Size: 68 GB
> S home.p0.s0    State: crashed  PO: 0  B      Size: 17 GB
> S home.p0.s1    State: crashed  PO: 512 kB    Size: 17 GB
> S home.p0.s2    State: crashed  PO: 1024 kB   Size: 17 GB
> S home.p0.s3    State: R 0%     PO: 1536 kB   Size: 17 GB
>   *** Start home.p0.s3 with 'start' command ***
> S home.p0.s4    State: crashed  PO: 2048 kB   Size: 17 GB
>
> I used 'vinum start' on home.p0.s[0124], and the plex came back in
> degraded mode; after fsck it mounted OK.
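[A side note on the sizes in these listings: a RAID-5 plex over n equal
subdisks exposes (n-1) times the subdisk size, since one subdisk's
worth of space is consumed by parity; that is why five 17 GB subdisks
appear as a 68 GB volume.  A trivial shell check:]

```shell
#!/bin/sh
# RAID-5 capacity check: with n equal subdisks of S GB each, one
# subdisk's worth of space holds parity, leaving (n-1)*S GB usable.
DISKS=5        # subdisks in the home.p0 plex
SIZE_GB=17     # size of each subdisk
echo "usable: $(( (DISKS - 1) * SIZE_GB )) GB"
```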
> On booting to multi-user mode, I noticed that all the drives were
> marked as 'down', even though the volume and most of the subdisks
> were 'up' (and a quick check in the console scroll-back showed that
> it was also in this state before the previous attempt to revive):
>
> vinum -> l
> 5 drives:
> D drive0   State: down   Device /dev/da0s1a   Avail: 0/17500 MB (0%)
> D drive1   State: down   Device /dev/da1s1a   Avail: 0/17500 MB (0%)
> D drive2   State: down   Device /dev/da2s1a   Avail: 0/17500 MB (0%)
> D drive3   State: down   Device /dev/da3s1a   Avail: 0/17500 MB (0%)
> D drive4   State: down   Device /dev/da4s1a   Avail: 0/17500 MB (0%)
> 1 volumes:
> V home          State: up        Plexes: 1     Size: 68 GB
> 1 plexes:
> P home.p0    R5 State: degraded  Subdisks: 5   Size: 68 GB
> 5 subdisks:
> S home.p0.s0    State: up        PO: 0  B      Size: 17 GB
> S home.p0.s1    State: up        PO: 512 kB    Size: 17 GB
> S home.p0.s2    State: up        PO: 1024 kB   Size: 17 GB
> S home.p0.s3    State: R 0%      PO: 1536 kB   Size: 17 GB
>   *** Start home.p0.s3 with 'start' command ***
> S home.p0.s4    State: up        PO: 2048 kB   Size: 17 GB
>
> This time, I used 'vinum start' on drive[0-4] before doing vinum
> start on home.p0.s3, and this time it successfully revived, taking
> 10 minutes or so.
> Some minutes later, the machine panicked (this time saving a dump):
>
> IdlePTD 3166208
> initial pcb at 282400
> panicstr: softdep_lock: locking against myself
> panic messages:
> ---
> panic: softdep_setup_inomapdep: found inode
> (kgdb) where
> #0  0xc014dd1a in dumpsys ()
> #1  0xc014db3b in boot ()
> #2  0xc014deb8 in poweroff_wait ()
> #3  0xc01e6b49 in acquire_lock ()
> #4  0xc01eae02 in softdep_fsync_mountdev ()
> #5  0xc01eef0e in ffs_fsync ()
> #6  0xc01edc16 in ffs_sync ()
> #7  0xc017b42b in sync ()
> #8  0xc014d916 in boot ()
> #9  0xc014deb8 in poweroff_wait ()
> #10 0xc01e792c in softdep_setup_inomapdep ()
> #11 0xc01e44a4 in ffs_nodealloccg ()
> #12 0xc01e352b in ffs_hashalloc ()
> #13 0xc01e3186 in ffs_valloc ()
> #14 0xc01f4e6f in ufs_makeinode ()
> #15 0xc01f2824 in ufs_create ()
> #16 0xc01f5029 in ufs_vnoperate ()
> #17 0xc01b1e43 in nfsrv_create ()
> #18 0xc01c6b2e in nfssvc_nfsd ()
> #19 0xc01c6483 in nfssvc ()
> #20 0xc022b949 in syscall2 ()
> #21 0xc02207b5 in Xint0x80_syscall ()
> #22 0x8048135 in ?? ()
>
> After this reboot (again requiring a manual fsck) the system appears
> to be working normally, but again the drives are all marked 'down':
>
> serv20[arg]% vinum l
> 5 drives:
> D drive0   State: down   Device /dev/da0s1a   Avail: 0/17500 MB (0%)
> D drive1   State: down   Device /dev/da1s1a   Avail: 0/17500 MB (0%)
> D drive2   State: down   Device /dev/da2s1a   Avail: 0/17500 MB (0%)
> D drive3   State: down   Device /dev/da3s1a   Avail: 0/17500 MB (0%)
> D drive4   State: down   Device /dev/da4s1a   Avail: 0/17500 MB (0%)
>
> 1 volumes:
> V home          State: up   Plexes: 1     Size: 68 GB
>
> 1 plexes:
> P home.p0    R5 State: up   Subdisks: 5   Size: 68 GB
>
> 5 subdisks:
> S home.p0.s0    State: up   PO: 0  B      Size: 17 GB
> S home.p0.s1    State: up   PO: 512 kB    Size: 17 GB
> S home.p0.s2    State: up   PO: 1024 kB   Size: 17 GB
> S home.p0.s3    State: up   PO: 1536 kB   Size: 17 GB
> S home.p0.s4    State: up   PO: 2048 kB   Size: 17 GB
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-stable" in the body of the message