From: Poul-Henning Kamp <phk@phk.freebsd.dk>
To: arch@freebsd.org
Date: Wed, 14 Jan 2004 00:32:37 +0100
Message-ID: <12416.1074036757@critter.freebsd.dk>
Subject: About removable disks, mountroot and sw-raid

There has been some discussion about how we handle removable disks
and mountroot, raid engines and such stuff. I think it would be a
good idea if I dump my thinking to arch@ and we can discuss it here.

First I will present the scenarios from which I have analyzed the
situation:

A: Normal boot.
   Machine boots kernel; in relatively short time all disks are found.

B: Slow boot.
   Machine boots kernel; disks dribble in at a rate of one every 20
   seconds as the cabinet powers them up.

C: Boot with failed root disk.
   Machine boots kernel; in relatively short time all disks are
   found, but the root disk is not one of them. (This is a strange
   scenario, I know, but it is important for the analysis.)

D: Machine boots, all raid disks present.

E: Machine boots, one raid disk missing.

F: Machine running. Operator plugs a complete raid-set in, one disk
   at a time.

The solution:
-------------

I want to add a counter (protected by a mutex) which the disk drivers
and GEOM will increment while they are configuring devices.

That means that as soon as the ata-disk system notices that there
_may_ be a disk on a cable, it increments this counter. If it
subsequently determines that there wasn't a disk after all, it
decrements by one again. If it finds a disk, it hands it off to
GEOM/disk_create(9) before decrementing the counter.

GEOM will similarly hold reference counts until all tasting has
settled down, so all GEOM classes have had their chance to do their
thing.

mount_root will stall while this counter is non-zero; when it goes to
zero, it tries to open the root device and fails if it is not found.
This solves scenario A.

Scenario B is only solvable with outside knowledge. I propose to add
a tunable which says either how long to wait in total or, maybe more
useful, how long to wait after the count last went to zero before we
give up looking for the root device. This means that the system will
"stick around for a while" hoping the missing disk appears, and after
the timeout it will fail. A default of 40 seconds after the last disk
appeared sounds like a good shot to me. This solves scenario B.

Provided what the user wants for scenario C is for mount_root to
fail, we have also solved that. A magic timer setting of -1 could
mean "never give up", which caters for alternative desires. That
solves scenario C.

Now about sw-RAID (and mirror, and stripe, and ...):

In general these methods must collect tributaries until they are
satisfied they can run. For non-redundant configs this is trivial:
all bits must be present. For redundant methods, the administrator
will have to set a policy, and I can imagine the following policies:

1. Run when you have all tributaries.
2. Run when you have quorum (ie: one copy of a mirror etc).
3. When you have quorum, run if no further tributaries have arrived
   in N seconds.

Again a simple tunable integer can configure this (-1, 0, >0), and
maybe for simplicity we should use the same one we use for mountroot.
This solves scenarios D, E and F, I believe.
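To make this concrete, here is a rough sketch of what the counter and
the mountroot stall could look like. Every name in it (root_hold(),
root_release(), root_mount_wait(), the vfs.root.mount_timeout tunable
and the root_device_found() probe) is made up for illustration; none
of this is an existing KPI:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/kernel.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx root_hold_mtx;
    static int root_holds;          /* outstanding config activity */

    /*
     * Tunable: seconds to keep looking for the root device after
     * the count last went to zero.  0 = fail at once, -1 = never
     * give up.
     */
    static int root_mount_timeout = 40;
    TUNABLE_INT("vfs.root.mount_timeout", &root_mount_timeout);

    MTX_SYSINIT(root_hold, &root_hold_mtx, "root hold", MTX_DEF);

    /* Hypothetical: can the configured root device be opened? */
    static int root_device_found(void);

    void
    root_hold(void)         /* "there may be a disk on this cable" */
    {

            mtx_lock(&root_hold_mtx);
            root_holds++;
            mtx_unlock(&root_hold_mtx);
    }

    void
    root_release(void)      /* probe done, or disk handed to GEOM */
    {

            mtx_lock(&root_hold_mtx);
            if (--root_holds == 0)
                    wakeup(&root_holds);
            mtx_unlock(&root_hold_mtx);
    }

    /*
     * mountroot side: stall until configuration has settled, then
     * poll for the root device until the tunable says to give up.
     */
    int
    root_mount_wait(void)
    {
            int secs;

            secs = 0;
            for (;;) {
                    mtx_lock(&root_hold_mtx);
                    while (root_holds > 0) {
                            msleep(&root_holds, &root_hold_mtx, PZERO,
                                "rootwait", hz);
                            secs = 0;   /* still configuring: reset clock */
                    }
                    mtx_unlock(&root_hold_mtx);
                    if (root_device_found())
                            return (0);
                    if (root_mount_timeout >= 0 &&
                        secs >= root_mount_timeout)
                            return (ENXIO); /* off to the root prompt */
                    tsleep(&secs, PZERO, "rootdelay", hz);
                    secs++;
            }
    }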
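The policy decision for a redundant class then becomes almost
mechanical. Again only a sketch; raid_should_run() is invented, and
the mapping of -1/0/>0 onto the three policies is my reading of the
tunable, nothing settled:

    /*
     * Decide whether a redundant method may start.  One possible
     * reading of the tunable:  -1 = wait for all tributaries,
     * 0 = run as soon as quorum is present, >0 = with quorum, run
     * once no new tributary has arrived for that many seconds.
     * Returns non-zero when the class should go ahead.
     */
    static int
    raid_should_run(int present, int total, int quorum,
        int secs_since_last_tributary, int policy)
    {

            if (present == total)
                    return (1);     /* complete set: always run */
            if (present < quorum)
                    return (0);     /* cannot run at all yet */
            if (policy < 0)
                    return (0);     /* policy 1: insist on all */
            if (policy == 0)
                    return (1);     /* policy 2: quorum suffices */
            /* policy 3: quorum plus N quiet seconds */
            return (secs_since_last_tributary >= policy);
    }

While this says "wait" with quorum already present, the class would
hold a count on the mountroot counter, exactly as in the combination
example below.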
And then the combination example: boot from a raid root with a
missing disk (lots of intermediate steps left out for clarity).

I have configured my system to wait two minutes for disks. The
kernel loads, and disks ad0, ad1, ad2 and ad3 are found in short
order. GEOM does its thing and my RAID geom class gets to taste each
of them in turn. It finds out that it has quorum, but is incomplete,
so the timer applies. It is therefore not finished autoconfiguring
and grabs a count on the "block mountroot" counter.

After two minutes the missing disk has still not arrived, so the
RAID geom class starts up in degraded mode, presents its provider to
GEOM (which increments the counter, since tasting is starting) and
then drops its own count. The tasting of the provider happens in
GEOM, possibly doing partitioning etc. Once it stops, the count goes
to zero and mountroot gets to try to open the root device.

If it cannot open the root device, it will retry for two minutes,
and then give up and go to the root-device prompt.

If I then go to scenario F and take a pee after plugging in four of
my five disks, I may curse the two-minute timer, but if I'm that
advanced I should have reset the tunable. Or should we have two such
tunables: one until root has been mounted and another afterwards?
No big deal, we can do that.

Does all this make sense to people?

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.