Date:      Thu, 11 Jun 1998 04:41:08 +0900
From:      Tetsuro FURUYA <ht5t-fry@asahi-net.or.jp>
To:        mike@smith.net.au
Cc:        robinson@public.bta.net.cn, freebsd-stable@FreeBSD.ORG, freebsd-questions@FreeBSD.ORG, Tetsuro FURUYA <tfu@ff.iij4u.or.jp>
Subject:   Re: Bug in wd driver 
Message-ID:  <199806101941.EAA11696@dilemma.tf.or.jp>
In-Reply-To: Your message of "Thu, 28 May 1998 12:57:14 -0700"
References:  <199805281957.MAA01309@dingo.cdrom.com>



I wrote, 

In Message-Id: <199805272026.FAA16850@dilemma.tf.or.jp>
In Message-Id: <199805281508.AAA04056@dilemma.tf.or.jp>

>I have been encountered at the same defaults in using Panasonic AL-N1,
>and FreeBSD-2.2.2.

>And bad144 was hangupped.
>But I have found out how to manipulate bad144, or fsck , or badsect.

>My kernel has kernel-debugger ddb(4) installed in it.
>                              ^^^^^^
>So, listening to the hamming sound of wd0 drive, and when wd drive
>is hangupped, invoke kernel-debugger by typing ctrl-alt-ESC keys.
>                                               ^^^^^^^^^^^^
>A while after stopping of disk access, type 'c' or 'continue',
>and go back to bad144 or fsck.                      ^^^^^^^^
>Several attempts may complete the identification of bad clusters.
>As for my machine, this was worked.

And you pointed out that,
> > In Message-ID: <199805272101.OAA01902@dingo.cdrom.com>
> > Mike Smith <mike@smith.net.au> worte:

> > >fsck /usr
> > >.....
> > >wd0: interrupt timeout:
> > >wd0: status 50<rdy,seekdone> error 0
> > >wd0: interrupt timeout:
> > >wd0: status 50<rdy,seekdone> error 1<no_dam>
> > 
> > >===> hang up
> > >===> type 'cntrl-alt-esc'
> 
> This defers the interrupt timeout...
> 
> > >db>wd0s1f: hard error reading fsbn 1152850 of 1152850-1152851(wd0s1 bn
> > >1279826; cn 317 tn 26 sn 44)
> > >wd0: status 59<rdy,seekdone,drq,err> error 40<uncorr>
> 
> ... but not the interrupt, which finally arrives and contains real 
> error information.  Note that the interrupt timeouts in your case 
> *don't* have DRQ set.  Are you running in multi-block mode?
> 
> > As for wd.c source, I will try to experiment :)
> 
> Please do.  It looks like your information may lead to a result here.  

It may be too late to reply to this thread on the mailing list.
But this seems important to notebook users, so I will report the result of
my experiment with the patch to /usr/src/sys/i386/isa/wd.c
that Mr. Mike Smith suggested,

In Message-Id: <199805272101.OAA01902@dingo.cdrom.com>
Mike Smith <mike@smith.net.au> writes:

>This would tend to imply that the timeout value is too short.
>
>Can you try increasing the timeout counter and provoking your disk?
>
>In sys/i386/isa/wd.c, in this section:
>
>        /*
>         * Schedule wdtimeout() to wake up after a few seconds.  Retrying
>         * unmarked bad blocks can take 3 seconds!  Then it is not good that
>         * we retry 5 times.
>         *
>         * On the first try, we give it 10 seconds, for drives that may need
>         * to spin up.
>         *
>         * XXX wdtimeout() doesn't increment the error count so we may loop
>         * forever.  More seriously, the loop isn't forever but causes a
>         * crash.
>         *
>         * TODO fix b_resid bug elsewhere (fd.c....).  Fix short but positive
>         * counts being discarded after there is an error (in physio I
>         * think).  Discarding them would be OK if the (special) file offset
>         * was not advanced.
>         */
>        if (wdtab[ctrlr].b_errcnt == 0)
>                du->dk_timeout = 1 + 10;
>        else
>                du->dk_timeout = 1 + 3;   <---- Only this line.
>
>
>Increase the 10 and 3 values (first and subsequent timeouts).  Try 
>raising them lots, then come down slowly.

Unfortunately, my /usr/src/sys/i386/isa/wd.c differs
from the source code above.
My wd.c contains only the last line, the one you marked.

So I changed only that line, increasing the 3 to 50. (Is this OK?)
So far I have not experienced any disk crash, cannot-mount-root
problem, or anything else bad.
And the system now recovers successfully from a bad sector read.
This time, the only error messages are as follows,

>wd0s1f: hard error reading fsbn 1152850 of 1152850-1152851(wd0s1 bn
>1279826; cn 317 tn 26 sn 44)
>wd0: status 59<rdy,seekdone,drq,err> error 40<uncorr>

or,

>Jun  8 12:17:03 dilemma pccardd[37]: pccardd started
>Jun  8 12:30:59 dilemma /kernel: wd0s1f: hard error reading
> fsbn 1215577 of 1215576-1215579 (wd0s1 bn 1342553; cn 332 tn 62 sn 23)
>wd0: status 59<rdy,seekdone,drq,err> error 10<no_id>
>Jun  8 12:31:08 dilemma /kernel: wd0s1f: hard error reading
> fsbn 1215577 of 1215576-1215579 (wd0s1 bn 1342553; cn 332 tn 62 sn 23)
>wd0: status 59<rdy,seekdone,drq,err> error 10<no_id>

So the bug in the wd.c device driver seems to be removed ^^)
The other problem, a system lock-up after wd hangs, seems to be
related to an indefinite wait in swap_pager. (This is serious for X.)
But this defect does not appear when the wd device driver can recover
from the disk access error.

You wrote,
>raising them lots, then come down slowly.

Is there any disadvantage when the du->dk_timeout value is
very large?
What happens if the du->dk_timeout value is too large?
What exactly is du->dk_timeout?

I have just tried 'cd /usr; badsect BAD 1152850 1215577' and 'fsck /dev/rwd0s1f',
 but 'bad144 -s -v /dev/wd0' should also work fine.
(I often used bad144. But now wd0 has too many bad sectors
 for bad144 :( )
badsect and fsck do not take care of the swap area;
 nevertheless, they are working fine now :)

So, thank you, Mr. Mike Smith!

========================================================================
TEL: 048-852-3520    FAX: 048-858-1597
E-Mail:
     ht5t-fry@asahi-net.or.jp
     tfu@ff.iij4u.or.jp
pgp-fingerprint:
     pub  Tetsuro FURUYA <ht5t-fry@asahi-net.or.jp>
      Key fingerprint = F1 BA 5F C1 C2 48 1D C7  AE 5F 16 ED 12 17 75 38
=========================================================================

