FreeBSD Mail Archives

Date:      Thu, 7 May 1998 02:32:23 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        tom@sdf.com (Tom)
Cc:        tlambert@primenet.com, beng@lcs.mit.edu, dec@phoenix.its.rpi.edu, freebsd-hackers@FreeBSD.ORG
Subject:   Re: Network problem with 2.2.6-STABLE
Message-ID:  <199805070232.TAA19518@usr01.primenet.com>
In-Reply-To: <Pine.BSF.3.95q.980505223633.24411A-100000@misery.sdf.com> from "Tom" at May 5, 98 10:52:51 pm

>   What?  Was something about my message unclear?  restore dies with a
> "hole in map".  You can also search the PR database for that phrase to
> find an identical report to the one I filed.

You are assuming that the "hole in map" is there because dump put it
there (which is unlikely), or because there was actually a hole in
the map, faithfully copied (which is unlikely, but possible, if your
partition didn't pass fsck prior to dump).

Vs. because your tape is/went bad and/or has a bad electrical connection.

So far we have the following possibilities, which I would like to
eliminate one-by-one:

o	It may be the IDE disk (D)
o	It may be the IDE controller (D)
o	It may be the IDE controller driver (D)
o	It may be the raw disk device driver (D)
o	It may be the SCSI controller (T)
o	It may be the SCSI controller driver (T)
o	It may be the raw tape device driver (T)
o	It may be the admixture of an IDE controller that fails
	when used in combination with a SCSI controller (D)
o	It may be the tape drive's default block size (H)
o	It may be the tape drive's firmware (H)
o	It may be the media you are using (M)
o	It may be dump (S)
o	It may be restore (S)


Below I detail some steps to tell whether or not it is in the path of (T)
or whether or not it is in the path of (S).

You need to take these steps before you can point at 2 of the thirteen
possible failure spots and say with confidence "it's dump/restore".



>   The tape and drive are ok.  I tar'ed the entire filesystem up, newfs the
> filesystem, and untar the tape, and it works great (which I have done as a
> test).

Tar does not complain about bad tape blocks, because it can't consistency
check them, having no check fields.  It will happily write zeroed blocks
into your files.

Restore is more sensitive to the problem, because restore requires
that the referential integrity of the files written to disk be intact.

Did you do MD5 checksums before and after, and compare the results?
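
A minimal sketch of that before-and-after comparison, in Python.  The
function names (md5_of_file, checksum_tree, compare_trees) are made up
for illustration and are not part of any FreeBSD tool.  The idea is to
record one MD5 per file before the tar, record them again after the
untar, and list every file whose contents changed:

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each regular file's path (relative to root) to its MD5 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks, devices and FIFOs; only file contents matter here.
            if os.path.isfile(full) and not os.path.islink(full):
                sums[os.path.relpath(full, root)] = md5_of_file(full)
    return sums


def compare_trees(before, after):
    """Return (missing, extra, changed) lists of relative paths."""
    missing = sorted(set(before) - set(after))
    extra = sorted(set(after) - set(before))
    changed = sorted(p for p in set(before) & set(after)
                     if before[p] != after[p])
    return missing, extra, changed
```

Run checksum_tree() on the mounted filesystem before the tar and save
the result; run it again after the untar.  Anything in the "changed"
list is a file that tar restored with different contents.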


> > What is the controller for the tape drive?
> 
>   2940UW
> 
> > Which driver is responsible for that controller?
> 
>   ahc

With or without the CAM patches?


> > What exact model of tape drive are you using?
> 
>   Quantum DLT 4000
> 
> > What exact brand of tapes are you using?
> 
>   Quantum DLT IV

You are positive you are using the st/mt command to select a block size
for this before starting the dump, right?

DAT drives are notoriously finicky about default block size selection.


> > What is the controller for the disk showing the problem?
> 
>   EIDE

This doesn't tell me if it is a CMD640B chip, or an Intel chip, either
of which can lose their minds if you take SCSI interrupts while doing
a data transfer.

I can't rule out a controller failure without this information.


You should fsck your disk a number of times in rapid succession and see
if the cylinder group bitmaps are "corrupted".  This can happen with
IDE cables that are slightly out of spec. (generally: too long).


> > What exact model of disk drive are you using?
> 
>   Maxtor DiamondMax 8.4GB
> 
> > Are you overclocking your processor?
> 
>   No.
> 
>   You know what a much better test would be?  I can do a dump, read the
> first hundred megs or so with dd into a file, and send it to you.  Since
> "restore -t" reports the "hole in map" within seconds, it obviously hasn't
> read very far into the tape yet, so doing a restore from a disk file
> should have the same result.

Or better, you could dump to a disk instead of to a tape, then also dump
to a tape, and then do an MD5 checksum of the images and see if they match,
in order to isolate it to "tape or software" vs. "disk or software".
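
As a rough sketch of that comparison, assuming the dump written to
disk and the dump read back from tape both exist as ordinary files:
the function names are hypothetical, and the 10240-byte block size is
only a value picked for illustration.  Beyond a yes/no MD5 match, it
also reports the first block where the two images diverge, which says
how far into the tape the damage starts:

```python
import hashlib


def md5_of_image(path, block_size=10240):
    """Return the hex MD5 digest of an image file, read block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def images_match(path_a, path_b):
    """True when both images have the same MD5 digest."""
    return md5_of_image(path_a) == md5_of_image(path_b)


def first_differing_block(path_a, path_b, block_size=10240):
    """Return the index of the first differing block, or None if identical.

    A length mismatch counts as a difference at the block where the
    shorter image ends.
    """
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        index = 0
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if block_a != block_b:
                return index
            if not block_a:  # both images ended at the same point
                return None
            index += 1
```

If images_match() is true, the tape path delivered the same bytes as
the disk path.  If not, first_differing_block() gives a block number
to look at.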

Also, a partial dump should exhibit the same problems, since "it
obviously hasn't read very far into the tape yet".  Which means
you don't need to write very far into the tape to trigger the problem.
Which means you can do the experiment with a disk image without the disk
containing the image needing to be larger than the disk being dumped.

You should also simply dump through MD5 to see if the MD5 checksum changes
between dump attempts.  If it does, the problem is in dump and/or the raw
disk device driver.  If it doesn't, the problem is in the tape or the
restore.
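
A small sketch of the "dump through MD5" check, assuming dump's output
is piped into the script's standard input.  The helper names are
hypothetical.  It also assumes that two runs over an unchanged,
quiescent filesystem are expected to produce identical bytes; if the
dump stream embeds anything time-dependent, the digests will differ
for that reason alone:

```python
import hashlib


def md5_of_stream(stream, chunk_size=1 << 16):
    """Return the hex MD5 digest of everything readable from a binary stream."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def digests_agree(digests):
    """True when every recorded digest is the same (or fewer than two exist)."""
    return len(set(digests)) <= 1
```

Call md5_of_stream(sys.stdin.buffer) once per dump attempt, keep the
digests, and feed the list to digests_agree().  Disagreement points at
the read side (dump or the raw disk driver); agreement leaves the tape
and restore as suspects.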

If the image restores without the panic, then the problem is in the
tape driver, controller, drive, or media.


I'm not being a hardass here.  Software doesn't mutate, so the problem
should be capable of being isolated.  I'm just doing fault isolation via
email, and it's not very efficient.

It would help if I could repeat the problem locally, but it doesn't
repeat locally for me on my 9G IBM drive (though I have to change
volume sets 9 times to repeat on a > 4G file system, since I don't use
DAT; this should not impact it, since I don't get buffer flushes or
other code that should change the outcome).


One possible discrepancy is that my IBM 9G drive is fast SCSI II, not
EIDE.  It may be an IDE driver problem.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199805070232.TAA19518>
