Date:      Sat, 12 Jul 1997 03:38:55 -0700 (PDT)
From:      Simon Shapiro <Shimon@i-Connect.Net>
To:        filo@yahoo.com, freebsd-SCSI@freebsd.org
Subject:   Re: problems with reboot
Message-ID:  <XFMail.970712033855.Shimon@i-Connect.Net>
In-Reply-To: <199707120421.VAA11648@ns2.yahoo.com>

Hi David Filo;  On 12-Jul-97 you wrote:

> when running the latest 1.1.7 code i noticed that the command being
> "marked" and later "destroyed" during reboot was the "remove media"
> command.  so i removed the DPT_HANDLE_TIMEOUTS option and it works
> fine now.  the umount during reboot can take > 30 seconds which is
> beyond your max timeout for scsi commands (i think).  so looks like
> you need to be careful on which commands you timeout and destroy.

Yup.  This whole timeout thing is bogus.  I trust you understand it
by now :-)  It is only necessary for hardware platforms that corrupt
DMA transfers between the DPT and the main memory.  Actually, it is
very probable that this is simply an out-of-sync delay rather than
corruption.

If you can live without DPT_HANDLE_TIMEOUTS, do so.  I recommend it,
as I run that way myself.  The DPT firmware handles timeouts much better.
There is no need for it in the kernel, except as a survival tool.
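
As an illustration only (this is not the DPT driver's code), the point
about being careful which commands get timed out and destroyed can be
sketched as a per-opcode timeout table.  The SCSI opcodes are the
standard ones; the names timeout_for_opcode, should_abort and
struct pending_cmd are hypothetical, as is the 300 second limit.  The
30 second default is the figure guessed at in the quoted text, and
treating "remove media" as PREVENT ALLOW MEDIUM REMOVAL is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard SCSI opcodes for commands that may legitimately run long. */
#define SCSI_FORMAT_UNIT            0x04
#define SCSI_START_STOP_UNIT        0x1B
#define SCSI_PREVENT_ALLOW_REMOVAL  0x1E
#define SCSI_SYNCHRONIZE_CACHE_10   0x35

/* Assumed limits, not taken from the DPT driver. */
#define DEFAULT_TIMEOUT_SEC 30u
#define SLOW_TIMEOUT_SEC    300u

/* Hypothetical bookkeeping for one outstanding command. */
struct pending_cmd {
    uint8_t  opcode;       /* first byte of the CDB */
    unsigned started_sec;  /* time the command was queued */
};

/* Pick a timeout by opcode instead of one limit for everything. */
static unsigned timeout_for_opcode(uint8_t opcode)
{
    switch (opcode) {
    case SCSI_FORMAT_UNIT:
    case SCSI_START_STOP_UNIT:
    case SCSI_PREVENT_ALLOW_REMOVAL:
    case SCSI_SYNCHRONIZE_CACHE_10:
        return SLOW_TIMEOUT_SEC;
    default:
        return DEFAULT_TIMEOUT_SEC;
    }
}

/* True only when the command has outlived the limit for its opcode. */
static bool should_abort(const struct pending_cmd *cmd, unsigned now_sec)
{
    assert(cmd != NULL);
    assert(now_sec >= cmd->started_sec);
    return now_sec - cmd->started_sec > timeout_for_opcode(cmd->opcode);
}
```

With these assumed limits, a media-removal command that has been
outstanding for 45 seconds is left alone, while an ordinary read of the
same age would be aborted.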

> the next test was to simply hit the "reset" button while the dpt was
> chugging away on lots of untars.  unfortunately the first time i did
> this, the machine got hung on reboot in the dpt bios - never got past
> "waiting for dpt" message (first led kept blinking).  hitting reset a
> second time worked and the machine booted.  of course the filesystems
> were hosed, but they fscked fine.  so this sounds like a DPT firmware
> bug.  i have yet to reproduce this one in a few tries.  do you have a
> suggestion of who to talk to at dpt about this, or should i just go
> through the normal support channel.

What you describe is sensible but not a bug.  When you forcefully reset
the machine, if you were writing to a RAID-{1,5}, it is very possible
you did so in mid-transaction.  The DPT, upon boot, will try to restore
the array to a consistent state.  This operation may take a very long
while.  Getting stuck is not correct.  Did the card emit any beeps?
These actually indicate what the problem is.
What version of the firmware is it running?  It is visible during boot,
and also in the syslog.  Upgrade to 7L0, try again (without the reset :-)
and call support.

> the next time i tried to duplicate the dpt hang (by hitting reset
> again), it came up fine (after fsck of course).  however as i started
> the multiple untars again, the machine panicked with the message
> "panic: blkfree: freeing free frag".  i was seeing this same behavior
> when the reboots weren't happening cleanly (i.e. machine comes up,
> fsck works, but then panic when accessing fs).  i would assume this is
> a 2.2 filesystem bug, but i'm not sure.  have you seen anything like
> this or have any reason to believe it's associated with the dpt
> driver?  i don't have much experience with 2.2 so i don't know if this
> is common.

Depending on how much memory you have, you can destroy up to 64MB of
disk writes.  There are very few filesystems that can survive this kind
of assault.  Even good file systems like Veritas vxfs, or (yes) NotTested
ntfs will not survive that.  Even an Oracle RDBMS, one of the most robust
filesystems ever created (they do not necessarily view their RDBMS as a
filesystem), will not survive losing 64MB of data it thinks was already
committed to disk.

Many years ago I raced cars that had turbochargers on them.  The best
way to destroy one (for something that spins at 130,000-200,000 rpm,
bolted to a car engine and sucking gasoline, they are very reliable) is
to open full throttle on a running engine and kill the power.
Why am I telling you this?  Every engineered product has a sure way
of destroying it by doing something that is doable and not clearly 
marked ``DO NOT DO THAT''.  The DPT controller assumes that normally,
computers do not push the reset button.  They are designed to resist a
single point of failure (SPOF).  What you do is MMPOF :-)  Smoke will
be emitted.

In a truly critical application, where application-side integrity
is more important than speed considerations, do the following:

* Configure the DPT for write-through caches.
* Disable the caches on ALL the disk drives.
* Pray :-)  Some disk drives will NOT disable their caches when you tell
  them to.
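
As a hedged sketch of the drive-side step: on SCSI disks the write cache
is controlled by the WCE bit (byte 2, bit 2) of the caching mode page
(page code 08h).  The code below only manipulates a page buffer that is
assumed to have been fetched with MODE SENSE already; issuing MODE SENSE
and MODE SELECT is platform specific and not shown.  The function names
are hypothetical.  Since some drives ignore the request, reading the page
back afterwards and checking the bit again is the only way to find out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHING_PAGE_CODE 0x08  /* SCSI caching mode page */
#define CACHING_WCE_BIT   0x04  /* byte 2, bit 2: write cache enable */

/* 'page' points at the page code byte, i.e. past the mode parameter
 * header and any block descriptors. */
static bool is_caching_page(const uint8_t *page, size_t len)
{
    /* Low six bits of byte 0 hold the page code; bit 7 is PS. */
    return page != NULL && len >= 3 &&
           (page[0] & 0x3F) == CACHING_PAGE_CODE;
}

/* Report whether the drive says its write cache is on. */
static bool write_cache_enabled(const uint8_t *page, size_t len)
{
    assert(is_caching_page(page, len));
    return (page[2] & CACHING_WCE_BIT) != 0;
}

/* Clear WCE in the buffer, leaving the other bits alone.
 * Returns false (buffer untouched) if this is not a caching page. */
static bool clear_wce(uint8_t *page, size_t len)
{
    if (!is_caching_page(page, len))
        return false;
    page[2] &= (uint8_t)~CACHING_WCE_BIT;
    return true;
}
```

A caller would run clear_wce() on the sensed page, send it back with
MODE SELECT, sense the page again, and treat write_cache_enabled()
still returning true as a drive that did not honor the request.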

> we have a lot more experience with 2.1 and the filesystem appears to
> be very stable.  which brings up the question: will your stuff work
> under 2.1?  if you think it's feasible i'll probably try to get it
> working under 2.1-stable to see if this filesystem problem persists.

The problem is not in the filesystem.  Put a good UPS between the CPU 
and the wall socket, cut off the reset button and it will work fine.
There is an issue with FreeBSD shutdown not waiting for the DPT to flush
caches as it should.

> finally, you've asked about posting/forwarding my questions/comments
> to other places.  no problems - do whatever you'd like with anything i
> say..

I do not know about that, but think that this particular exchange will
help others.

Simon


