Date: Wed, 1 Mar 2000 13:24:19 -0500 (EST)
From: jhood@sitaranetworks.com, cgull@owl.org
To: FreeBSD-gnats-submit@freebsd.org
Cc: grog@lemis.com
Subject: kern/17098: /boot/loader hangs on switch to second drive
Message-ID: <200003011824.NAA82716@malkovich.sitaranetworks.com>
>Number: 17098
>Category: kern
>Synopsis: /boot/loader hangs on switch to second drive
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Wed Mar 1 10:30:01 PST 2000
>Closed-Date:
>Last-Modified:
>Originator: John Hood
>Release: FreeBSD 3.2-RELEASE i386
>Organization:
Sitara Networks
>Environment:
i386 3.2-RELEASE + local mods, FreeBSD installs on two drives
>Description:
The boot loader often hangs when requested to boot a kernel from a
drive other than the one it started from.
>How-To-Repeat:
Set up a system with boot environment & kernels on two "fast" drives,
preferably of wildly different geometry and sizes. (Floppy I/O may be
slow enough to cover up this problem.) Install the sample loader.rc
on the boot drive, editing as necessary. Reboot; if you are lucky, the
loader hangs forever. The bug is a bit shy and does not show on every
boot.
Sample loader.rc:
\ Loader.rc
\ 1 trace!
set currdev=disk2s1a: \ Something other than $loaddev
\ 11000 ms \ Uncomment this to get a working load
\ show
\
\ Includes additional commands
include /boot/loader.4th
\ Reads and processes loader.rc
start
\ Unless set otherwise, autoboot is automatic at this point
>Fix:
There are two problems here: the block cache code and (presumably)
the UFS code.
The block cache, as implemented, has no mechanism for distinguishing
which device a cached block or a block request belongs to. When a
different device is selected, it may return a block from the wrong
device. Debugging this was complicated by the 2-second block discard
timeout: anything that slows the loader down enough for cached blocks
to expire hides the bug, so debugging printfs to a serial console
would make the loader work, as would executing loader commands/words
by hand :)
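To make the failure mode concrete, here is a toy sketch in C. It is not
the loader's code: every toy_ name and the fake two-unit "disk" are
hypothetical, made up for illustration. The only point it shows is that
a lookup keyed by block number alone cannot tell two units apart.

```c
#include <assert.h>
#include <string.h>

#define TOY_NBLKS 4   /* cache slots (hypothetical size) */
#define TOY_BSIZE 8   /* bytes per block (hypothetical size) */

/* One cache slot: note there is no device or unit field. */
struct toy_slot {
    int  blkno;
    char data[TOY_BSIZE];
};

static struct toy_slot toy_cache[TOY_NBLKS];

/* Fake disk: every block on unit U is filled with the byte 'A' + U. */
static void
toy_read(int unit, int blkno, char *buf)
{
    (void)blkno;
    memset(buf, 'A' + unit, TOY_BSIZE);
}

/* Mark every slot empty. */
static void
toy_cache_init(void)
{
    int i;

    for (i = 0; i < TOY_NBLKS; i++)
        toy_cache[i].blkno = -1;
}

/* Cached read whose lookup compares the block number only. */
static void
toy_cached_read(int unit, int blkno, char *buf)
{
    struct toy_slot *s;

    assert(blkno >= 0);
    s = &toy_cache[blkno % TOY_NBLKS];
    if (s->blkno != blkno) {            /* miss: go to the disk */
        toy_read(unit, blkno, s->data);
        s->blkno = blkno;
    }
    /* A hit returns whatever is cached, whichever unit it came from. */
    memcpy(buf, s->data, TOY_BSIZE);
}
```

After toy_cache_init(), reading block 5 from unit 0 and then block 5
from unit 1 hands back unit 0's data both times, which is the kind of
bogus data described above.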
Secondarily, when this happens, some other part of the loader reacts
poorly to bogus data and hangs-- I'd guess that it's the UFS code, but
I've not traced the problem.
The block-cache problem exists in any version of the loader that has
the block cache implemented.
A minimal, i386-only fix for the block-cache problem follows. The diffs
are against a locally-modified 3.2-RELEASE. The patch records the
strategy routine and unit number of the last request and flushes the
cache when either one changes. Since the loader's device
architecture does not have a globally-visible way of referring to a
specific device and unit, this appears to be the best way to pass the
necessary info into the block cache, short of wholesale
rearchitecting.
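The flush-on-change idea can be sketched in isolation as follows. This
is a simplified toy, not the patch itself: the sk_ names, the fake disk
and the slot layout are all hypothetical, and only the comparison of
strategy routine and unit against the previous request mirrors what the
diffs do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_NBLKS 4   /* cache slots (hypothetical size) */
#define SK_BSIZE 8   /* bytes per block (hypothetical size) */

typedef void (*sk_strategy_t)(int unit, int blkno, char *buf);

/* Per-request device description, loosely modelled on bcache_devdata. */
struct sk_devdata {
    sk_strategy_t strategy;
    int           unit;
};

struct sk_slot {
    int  blkno;
    char data[SK_BSIZE];
};

static struct sk_slot sk_cache[SK_NBLKS];
static sk_strategy_t  sk_last_strategy;  /* NULL until the first request */
static int            sk_last_unit;
static int            sk_flushes;        /* counter, for illustration only */

/* Fake disk: every block on unit U is filled with the byte 'A' + U. */
static void
sk_disk_read(int unit, int blkno, char *buf)
{
    (void)blkno;
    memset(buf, 'A' + unit, SK_BSIZE);
}

/* Invalidate every slot. */
static void
sk_flush(void)
{
    int i;

    for (i = 0; i < SK_NBLKS; i++)
        sk_cache[i].blkno = -1;
    sk_flushes++;
}

static void
sk_cached_read(const struct sk_devdata *dd, int blkno, char *buf)
{
    struct sk_slot *s;

    assert(dd != NULL && dd->strategy != NULL && blkno >= 0);

    /* New strategy routine or unit: the cached blocks belong elsewhere. */
    if (sk_last_strategy != dd->strategy || sk_last_unit != dd->unit) {
        sk_flush();
        sk_last_strategy = dd->strategy;
        sk_last_unit = dd->unit;
    }

    s = &sk_cache[blkno % SK_NBLKS];
    if (s->blkno != blkno) {             /* miss: go to the disk */
        dd->strategy(dd->unit, blkno, s->data);
        s->blkno = blkno;
    }
    memcpy(buf, s->data, SK_BSIZE);
}
```

Because sk_last_strategy starts out NULL, the very first request also
triggers a flush, so no separate invalidation is needed at init time.
The cost of this design is that alternating between two units throws
the whole cache away on every switch; tagging each slot with its device
would avoid that, but needs the device identity the loader does not
expose globally.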
--john hood
diff -ur /sys/boot/common/bcache.c ./common/bcache.c
--- /sys/boot/common/bcache.c Sat Feb 6 09:27:29 1999
+++ ./common/bcache.c Fri Feb 18 17:35:19 2000
@@ -23,7 +23,7 @@
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
- * $Id: bcache.c,v 1.4.2.1 1999/02/06 14:27:29 dcs Exp $
+ * $Id: bcache.c,v 1.2 2000/02/18 22:35:19 jhood Exp $
*/
/*
@@ -62,17 +62,33 @@
static int bcache_hits, bcache_misses, bcache_ops, bcache_bypasses;
static int bcache_bcount;
+static void *bcache_dkstrategy;
+static int bcache_dkunit;
+
static void bcache_insert(caddr_t buf, daddr_t blkno);
static int bcache_lookup(caddr_t buf, daddr_t blkno);
/*
+ * Invalidate the cache
+ */
+void
+bcache_flush(void)
+{
+ int i;
+
+ if (bcache_data != NULL) {
+ for (i = 0; i < bcache_nblks; i++) {
+ bcache_ctl[i].bc_count = -1;
+ bcache_ctl[i].bc_blkno = -1;
+ }
+ }
+}
+/*
* Initialise the cache for (nblks) of (bsize).
*/
int
bcache_init(int nblks, size_t bsize)
{
- int i;
-
/* discard any old contents */
if (bcache_data != NULL) {
free(bcache_data);
@@ -97,11 +113,9 @@
return(ENOMEM);
}
- /* Invalidate the cache */
- for (i = 0; i < bcache_nblks; i++) {
- bcache_ctl[i].bc_count = -1;
- bcache_ctl[i].bc_blkno = -1;
- }
+ bcache_dkstrategy = NULL;
+
+ /* bcache_flush() will happen on first call to bcache_strategy */
return(0);
}
@@ -130,6 +144,16 @@
DEBUG("bypass %d from %d", size / bcache_blksize, blk);
bcache_bypasses++;
return(dd->dv_strategy(dd->dv_devdata, rw, blk, size, buf, rsize));
+ }
+
+ /* has a new device/unit been requested? flush cache */
+ if ((bcache_dkstrategy != dd->dv_strategy) ||
+ (bcache_dkunit != dd->dv_dkunit)) {
+ DEBUG("cache flush, lastunit = %d newunit = %d",
+ bcache_dkunit, dd->dv_dkunit);
+ bcache_flush();
+ bcache_dkstrategy = dd->dv_strategy;
+ bcache_dkunit = dd->dv_dkunit;
}
nblk = size / bcache_blksize;
diff -ur /sys/boot/common/bootstrap.h ./common/bootstrap.h
--- /sys/boot/common/bootstrap.h Sat Feb 6 09:27:29 1999
+++ ./common/bootstrap.h Fri Feb 18 17:35:20 2000
@@ -23,7 +23,7 @@
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
- * $Id: bootstrap.h,v 1.18.2.1 1999/02/06 14:27:29 dcs Exp $
+ * $Id: bootstrap.h,v 1.2 2000/02/18 22:35:20 jhood Exp $
*/
#include <sys/types.h>
@@ -84,6 +84,7 @@
struct bcache_devdata
{
int (*dv_strategy)(void *devdata, int rw, daddr_t blk, size_t size, void *buf, size_t *rsize);
+ int dv_dkunit;
void *dv_devdata;
};
diff -ur /sys/boot/i386/libi386/biosdisk.c ./i386/libi386/biosdisk.c
--- /sys/boot/i386/libi386/biosdisk.c Tue Mar 16 09:58:25 1999
+++ ./i386/libi386/biosdisk.c Fri Feb 18 17:35:47 2000
@@ -23,7 +23,7 @@
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
- * $Id: biosdisk.c,v 1.20.2.4 1999/03/16 14:58:25 dcs Exp $
+ * $Id: biosdisk.c,v 1.3 2000/02/18 22:35:47 jhood Exp $
*/
/*
@@ -573,6 +575,8 @@
struct bcache_devdata bcd;
bcd.dv_strategy = bd_realstrategy;
+ bcd.dv_dkunit = ((struct open_disk *)(((struct i386_devdesc *)
+ devdata)->d_kind.biosdisk.data))->od_dkunit;
bcd.dv_devdata = devdata;
return(bcache_strategy(&bcd, rw, dblk, size, buf, rsize));
}
>Release-Note:
>Audit-Trail:
>Unformatted:
