From owner-freebsd-mobile  Mon Oct  6 15:17:11 1997
Return-Path: <owner-freebsd-mobile>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.7/8.8.7) id PAA03590
          for mobile-outgoing; Mon, 6 Oct 1997 15:17:11 -0700 (PDT)
          (envelope-from owner-freebsd-mobile)
Received: from pinot.eecs.harvard.edu (pinot.eecs.harvard.edu [140.247.60.65])
          by hub.freebsd.org (8.8.7/8.8.7) with SMTP id PAA03563
          for <freebsd-mobile@freebsd.org>; Mon, 6 Oct 1997 15:16:28 -0700 (PDT)
          (envelope-from karp@eecs.harvard.edu)
Received: (from karp@localhost) by pinot.eecs.harvard.edu (8.6.12/8.6.12) id SAA21955 for freebsd-mobile@freebsd.org; Mon, 6 Oct 1997 18:16:18 -0400
Date: Mon, 6 Oct 1997 18:16:18 -0400
From: Brad Karp <karp@eecs.harvard.edu>
Message-Id: <199710062216.SAA21955@pinot.eecs.harvard.edu>
To: freebsd-mobile@freebsd.org
Subject: wd interrupt timeouts w/2.2.2, PAO, IBM 380
Sender: owner-freebsd-mobile@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

I'm running 2.2.2-RELEASE with the latest PAO from makefile.org on a
new IBM 380. I can find almost no reports of experience with the 380
on the web or in mailing list archives, probably because this
particular model is so new.

At any rate, I sporadically get messages like the following on the console:

wd0: interrupt timeout:
wd0: status 58<rdy,seekdone,drq> error 0
wd0: interrupt timeout:
wd0: status 58<rdy,seekdone,drq> error 1<no_dam>

When these messages occur, the system hangs while it retries a disk
operation. The retries frequently go on for up to two minutes.

After a restart, I often go through long periods without these messages.
But once they start, they occur quite frequently. What's more, they
continue to occur frequently even after a reboot.

While it sounds implausible, I find that leaving the laptop powered off (not
sleeping, but fully off) for more than 30 minutes returns it to a state where
the timeout messages vanish for a while.

I suspect an interaction between APM and the wd driver, naturally; it appears
the wd driver is intolerant of disk spin-downs.

At Poul-Henning Kamp's suggestion, I folded in the following code from
3.0-current into my 2.2.2 wd.c:wdcommand() :

        if (du->cfg_flags & WDOPT_SLEEPHACK) {
                /* OK, so the APM bios has put the disk into SLEEP mode,
                 * how can we tell ?  Uhm, we can't.  There is no 
                 * standardized way of finding out, and the only way to
                 * wake it up is to reset it.  Bummer.
                 *
                 * All the many and varied versions of the IDE/ATA standard
                 * explicitly tell us not to look at these registers if
                 * the disk is in SLEEP mode.  Well, too bad really, we
                 * have to find out if it's in sleep mode before we can 
                 * avoid reading the registers.
                 *
                 * I have reason to believe that most disks will return
                 * either 0xff or 0x00 in all but the status register 
                 * when in SLEEP mode, but I have yet to see one return 
                 * 0x00, so we don't check for that yet.
                 *
                 * The check for WDCS_BUSY is for the case where the
                 * bios spins up the disk for us, but doesn't initialize
                 * it correctly                                 /phk
                 */
                if(inb(wdc + wd_precomp) + inb(wdc + wd_cyl_lo) +
                    inb(wdc + wd_cyl_hi) + inb(wdc + wd_sdh) +
                    inb(wdc + wd_sector) + inb(wdc + wd_seccnt) == 6 * 0xff) {
                        if (bootverbose)
                                printf("wd(%d,%d): disk aSLEEP\n",
                                        du->dk_ctrlr, du->dk_unit);
                        wdunwedge(du);
                } else if(inb(wdc + wd_status) == WDCS_BUSY) {
                        if (bootverbose)
                                printf("wd(%d,%d): disk is BUSY\n",
                                        du->dk_ctrlr, du->dk_unit);
                        wdunwedge(du);
                }
        }

I also built a kernel with flag 0x4000 for my laptop's disk, so that
SLEEPHACK is active (see below). I still get the timeouts and console messages
with this code added, though. :-(

Relevant lines from my kernel config file:

options LAPTOP
options APM_PCCARD_RESUME
options PCIC_RESUME_RESET
options "APM_NOSUSPEND_IMMEDIATE=3"

controller	wdc0	at isa? port "IO_WD1" bio irq 14 vector wdintr
disk		wd0	at wdc0 drive 0 flags 0x4000

device		apm0	at isa?
#options		APM_BROKEN_STATCLOCK

When I boot, I'm told the following about my disk and the wd driver:

wdc0 at 0x1f0-0x1f7 irq 14 on isa
wdc0: unit 0 (wd0): <IBM-DMCA-21080>, sleep-hack
wd0: 1033MB (2116800 sectors), 2100 cyls, 16 heads, 63 S/T, 512 B/S

So the kernel is definitely turning on the sleep-specific code.

My question is: has anyone out there seen similar behavior on any model
of laptop? If so, what did you do to correct it? The machine is more or
less unusable when it goes away for minutes at a time in disk retries...

Many thanks,
-Brad, karp@eecs.harvard.edu