Date: Thu, 11 Jun 1998 04:41:08 +0900
From: Tetsuro FURUYA <ht5t-fry@asahi-net.or.jp>
To: mike@smith.net.au
Cc: robinson@public.bta.net.cn, freebsd-stable@FreeBSD.ORG, freebsd-questions@FreeBSD.ORG, Tetsuro FURUYA <tfu@ff.iij4u.or.jp>
Subject: Re: Bug in wd driver
Message-ID: <199806101941.EAA11696@dilemma.tf.or.jp>
In-Reply-To: Your message of "Thu, 28 May 1998 12:57:14 -0700"
References: <199805281957.MAA01309@dingo.cdrom.com>
[-- Attachment #1 --]
I wrote,
In Message-Id: <199805272026.FAA16850@dilemma.tf.or.jp>
In Message-Id: <199805281508.AAA04056@dilemma.tf.or.jp>
>I have encountered the same fault using a Panasonic AL-N1
>and FreeBSD-2.2.2.
>bad144 hung as well.
>But I have found out how to get bad144, fsck, or badsect to run.
>My kernel has the kernel debugger ddb(4) compiled in.
>                                  ^^^^^^
>So, listen to the humming sound of the wd0 drive, and when the drive
>hangs, invoke the kernel debugger by typing ctrl-alt-ESC.
>                                            ^^^^^^^^^^^^
>A while after disk access stops, type 'c' or 'continue',
>                                              ^^^^^^^^
>and go back to bad144 or fsck.
>Several attempts may be needed to complete the identification of bad clusters.
>On my machine, this worked.
And you pointed out that,
> > In Message-ID: <199805272101.OAA01902@dingo.cdrom.com>
> > Mike Smith <mike@smith.net.au> wrote:
> > >fsck /usr
> > >.....
> > >wd0: interrupt timeout:
> > >wd0: status 50<rdy,seekdone> error 0
> > >wd0: interrupt timeout:
> > >wd0: status 50<rdy,seekdone> error 1<no_dam>
> >
> > >===> hang up
> > >===> type 'cntrl-alt-esc'
>
> This defers the interrupt timeout...
>
> > >db>wd0s1f: hard error reading fsbn 1152850 of 1152850-1152851(wd0s1 bn
> > >1279826; cn 317 tn 26 sn 44)
> > >wd0: status 59<rdy,seekdone,drq,err> error 40<uncorr>
>
> ... but not the interrupt, which finally arrives and contains real
> error information. Note that the interrupt timeouts in your case
> *don't* have DRQ set. Are you running in multi-block mode?
>
> > As for wd.c source, I will try to experiment :)
>
> Please do. It looks like your information may lead to a result here.
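The status values quoted above can be read bit by bit. Here is a minimal sketch in C, assuming the standard ATA status-register layout. status_names() and its table are hypothetical helpers for illustration, not code from wd.c, and only rdy, seekdone, drq and err are names that actually appear in the log; the other labels are my own.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, not part of wd.c: write the names of the bits set
 * in an ATA status byte into buf as a comma-separated list.
 * Bit values follow the standard ATA status register layout.
 */
static const char *
status_names(unsigned status, char *buf, size_t len)
{
	static const struct {
		unsigned bit;
		const char *name;
	} tab[] = {
		{ 0x80, "busy" },     { 0x40, "rdy" },
		{ 0x20, "wrtflt" },   { 0x10, "seekdone" },
		{ 0x08, "drq" },      { 0x04, "corr" },
		{ 0x02, "index" },    { 0x01, "err" },
	};
	size_t i;

	assert(len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(tab) / sizeof(tab[0]); i++) {
		if ((status & tab[i].bit) == 0)
			continue;
		/* Separate names with a comma, never overrunning buf. */
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, tab[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

Under these assumptions, 0x59 decodes to rdy,seekdone,drq,err, while 0x50 decodes to rdy,seekdone with no drq, which is the difference Mike points out between the timeout messages and the final error.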
It may be too late to reply to the mailing list,
but this seems important to notebook users, so I will report the result of
my experiment with the patch to /usr/src/sys/i386/isa/wd.c
that Mr. Mike Smith suggested,
In Message-Id: <199805272101.OAA01902@dingo.cdrom.com>
Mike Smith <mike@smith.net.au> writes:
>This would tend to imply that the timeout value is too short.
>
>Can you try increasing the timeout counter and provoking your disk?
>
>In sys/i386/isa/wd.c, in this section:
>
> /*
> * Schedule wdtimeout() to wake up after a few seconds. Retrying
> * unmarked bad blocks can take 3 seconds! Then it is not good that
> * we retry 5 times.
> *
> * On the first try, we give it 10 seconds, for drives that may need
> * to spin up.
> *
> * XXX wdtimeout() doesn't increment the error count so we may loop
> * forever. More seriously, the loop isn't forever but causes a
> * crash.
> *
> * TODO fix b_resid bug elsewhere (fd.c....). Fix short but positive
> * counts being discarded after there is an error (in physio I
> * think). Discarding them would be OK if the (special) file offset
> * was not advanced.
> */
> if (wdtab[ctrlr].b_errcnt == 0)
> du->dk_timeout = 1 + 10;
> else
> du->dk_timeout = 1 + 3; <---- Only this line.
>
>
>Increase the 10 and 3 values (first and subsequent timeouts). Try
>raising them lots, then come down slowly.
Unfortunately, my /usr/src/sys/i386/isa/wd.c differs
from the source code above.
It contains only the last line.
So I rewrote only that line, increasing 3 to 50. (Is this OK?)
So far I have not experienced any disk crash, cannot-mount-root
problem, or anything else bad.
And the system now recovers successfully from a bad-sector read.
This time, the only error message is as follows,
>wd0s1f: hard error reading fsbn 1152850 of 1152850-1152851(wd0s1 bn
>1279826; cn 317 tn 26 sn 44)
>wd0: status 59<rdy,seekdone,drq,err> error 40<uncorr>
or,
>Jun 8 12:17:03 dilemma pccardd[37]: pccardd started
>Jun 8 12:30:59 dilemma /kernel: wd0s1f: hard error reading
fsbn 1215577 of 1215576-1215579 (wd0s1 bn 1342553; cn 332 tn 62 sn 23)
wd0: status 59<rdy,seekdone,drq,err> error 10<no_id>
>Jun 8 12:31:08 dilemma /kernel: wd0s1f: hard error reading
fsbn 1215577 of 1215576-1215579 (wd0s1 bn 1342553; cn 332 tn 62 sn 23)
wd0: status 59<rdy,seekdone,drq,err> error 10<no_id>
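The numbers in these messages can be cross-checked against each other. Here is a minimal sketch in C, assuming a geometry of 64 heads and 63 sectors per track. That geometry is inferred from the two log entries, not read from the drive, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed geometry, inferred from the log lines (not read from the drive). */
#define HEADS             64
#define SECTORS_PER_TRACK 63

/*
 * Hypothetical helper: convert the cn/tn/sn triple printed by the driver
 * back into the slice-relative block number "bn". The sn values in the
 * log fit this formula when treated as zero-based.
 */
static long
chs_to_bn(long cn, long tn, long sn)
{
	assert(tn >= 0 && tn < HEADS);
	assert(sn >= 0 && sn < SECTORS_PER_TRACK);
	return (cn * HEADS + tn) * SECTORS_PER_TRACK + sn;
}

/* Offset of the f partition within slice wd0s1: bn minus fsbn. */
static long
partition_offset(long bn, long fsbn)
{
	return bn - fsbn;
}
```

With this geometry, cn 317 tn 26 sn 44 gives bn 1279826 and cn 332 tn 62 sn 23 gives bn 1342553, matching both messages. In both cases bn minus fsbn is 126976, so the two reports are at least consistent about where wd0s1f starts inside the slice.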
So, the bug in the wd.c device driver seems to be gone ^^)
The other problem, the system lock-up after wd hangs, seems to be
related to an indefinite wait in swap_pager. (This is serious for X.)
But this defect does not appear when the wd device driver can recover
from the disk access error.
You have written that
>raising them lots, then come down slowly.
Is there any drawback when the du->dk_timeout value is
very large?
What happens if it is too large?
And what exactly is du->dk_timeout?
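As far as I can tell, dk_timeout is a countdown in seconds: it is armed when a transfer starts, a once-per-second routine decrements it, and "interrupt timeout" is reported when it reaches zero before the interrupt arrives. The sketch below is a toy model in C under that assumption. struct fake_disk and all the function names are hypothetical, none of this is the real wd.c code, and the 9-second drive delay used in the note after it is a made-up figure.

```c
#include <assert.h>

/* Toy stand-in for the driver's per-disk state (hypothetical). */
struct fake_disk {
	int dk_timeout;	/* seconds left; 0 means no timeout armed */
	int timed_out;	/* set when the countdown expires */
};

/* Arm the countdown before starting a transfer, as "1 + secs". */
static void
arm(struct fake_disk *du, int secs)
{
	du->dk_timeout = 1 + secs;
	du->timed_out = 0;
}

/* Called once per simulated second. */
static void
tick(struct fake_disk *du)
{
	if (du->dk_timeout != 0 && --du->dk_timeout == 0)
		du->timed_out = 1;
}

/*
 * Return the second at which the timeout fires, or -1 if the drive's
 * interrupt arrives first at second intr_at (0 means it never arrives).
 */
static int
seconds_until_timeout(int secs, int intr_at)
{
	struct fake_disk du;
	int t;

	arm(&du, secs);
	for (t = 1; t <= secs + 1; t++) {
		if (t == intr_at) {
			du.dk_timeout = 0;	/* interrupt disarms the countdown */
			return -1;
		}
		tick(&du);
		if (du.timed_out)
			return t;
	}
	return -1;
}
```

In this model, a drive that needs 9 seconds to give up on a bad sector times out at second 4 with the value 3, but gets to report its real error with the value 50. The cost of a very large value would be that a drive which never answers keeps the request waiting that many seconds (51 here) before the driver reacts.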
I've just tried 'cd /usr; badsect BAD 1152850 1215577' and 'fsck /dev/rwd0s1f',
but 'bad144 -s -v /dev/wd0' should also work fine.
(I used bad144 often, but wd0 now has too many bad sectors
for bad144 :( )
badsect and fsck don't take care of the swap area,
but they are working fine now :)
So, thank you, Mr. Mike Smith!
========================================================================
TEL: 048-852-3520 FAX: 048-858-1597
E-Mail:
ht5t-fry@asahi-net.or.jp
tfu@ff.iij4u.or.jp
pgp-fingerprint:
pub Tetsuro FURUYA <ht5t-fry@asahi-net.or.jp>
Key fingerprint = F1 BA 5F C1 C2 48 1D C7 AE 5F 16 ED 12 17 75 38
=========================================================================
[-- Attachment #2 --]
-----BEGIN PGP MESSAGE-----
Version: 2.6.3i
iQCVAwUANX7hSjzkiNBZ20qpAQGRfgP/Ws9puO32Jc4cxOZTE+TXDcYnBWhJV8vV
DeOuhMrf4Pozd+Y6LPgQ1FFXJHPwdU9ZR4vxUSn1VmBN/Hps/cA/UAFu1MG9p2oB
HfQqWrYFjE0zscm1Xja569jnICj2WVl5iPhmIDAXhvaCJrhLj1FF7ctcF8ZWeX0W
Sna/x38TJ0s=
=Zczd
-----END PGP MESSAGE-----
