From: Poul-Henning Kamp <phk@phk.freebsd.dk>
To: arch@freebsd.org
Date: Wed, 14 Jan 2004 00:32:37 +0100
Message-ID: <12416.1074036757@critter.freebsd.dk>
Subject: About removable disks, mountroot and sw-raid

There has been some discussion about how we handle removable disks
and mountroot, raid engines and such stuff. I think it would be a
good idea if I dump my thinking to arch@ and we can discuss it here.

First I will present the scenarios from which I have analyzed the
situation:

A: Normal boot.
   Machine boots kernel; in relatively short time all disks are found.

B: Slow boot.
   Machine boots kernel; disks dribble in at a rate of one every 20
   seconds as the cabinet powers them up.

C: Boot with failed root disk.
   Machine boots kernel; in relatively short time all disks are
   found, but the root disk is not one of them. (This is a strange
   scenario, I know, but it is important for the analysis.)

D: Machine boots, all raid disks present.

E: Machine boots, one raid disk missing.

F: Machine running. Operator plugs a complete raid-set in, one disk
   at a time.

The solution:
-------------

I want to add a counter (protected by a mutex) which the disk drivers
and GEOM will increment while they are configuring devices.

That means that as soon as the ata-disk system notices that there
_may_ be a disk on a cable, it increments this counter. If it
subsequently determines that there wasn't a disk after all, it
decrements by one again. If it finds a disk, it hands it off to
GEOM/disk_create(9) before decrementing the counter.

GEOM will similarly hold reference counts until all tasting has
settled down, so all GEOM classes have had their chance to do their
thing.

mount_root will stall while this counter is non-zero; when it goes to
zero, it tries to open the root device and fails if it is not found.
This solves scenario A.

Scenario B is only solvable with outside knowledge. I propose to add
a tunable which says either how long to wait in total or, maybe more
useful, how long to wait after the count last went to zero before we
give up looking for the root device. This means that the system will
"stick around for a while" hoping the missing disk appears, and after
the timeout it will fail. A default of 40 seconds after the last disk
appeared sounds like a good shot to me. This solves scenario B.

Provided what the user wants for scenario C is for mount_root to
fail, we have also solved that. A magic timer setting of -1 could
mean "never give up", which caters for alternative desires. That
solves scenario C.

Now about sw-RAID (and mirror, and stripe, and ...):

In general these methods must collect tributaries until they are
satisfied they can run. For non-redundant configs this is trivial:
all bits must be present. For redundant methods, the administrator
will have to set a policy, and I can imagine the following policies:

1. Run when you have all tributaries.
2. Run when you have quorum (ie: one copy of a mirror etc).
3. When you have quorum, run if no further tributaries have arrived
   in N seconds.

Again a simple tunable integer can configure this (-1, 0, >0), and
maybe for simplicity we should use the same one we use for mountroot.
This solves scenarios D, E and F, I believe.
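To make this concrete, here is a rough sketch of what the counter and
the mountroot stall could look like. Every name in it (root_hold(),
root_release(), root_mount_wait(), the vfs.root.mount_timeout tunable
and the root_device_found() probe) is made up for illustration; none
of this is an existing KPI:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/kernel.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx root_hold_mtx;
    static int root_holds;          /* outstanding config activity */

    /*
     * Tunable: seconds to keep looking for the root device after
     * the count last went to zero.  0 = fail at once, -1 = never
     * give up.
     */
    static int root_mount_timeout = 40;
    TUNABLE_INT("vfs.root.mount_timeout", &root_mount_timeout);

    MTX_SYSINIT(root_hold, &root_hold_mtx, "root hold", MTX_DEF);

    /* Hypothetical: can the configured root device be opened? */
    static int root_device_found(void);

    void
    root_hold(void)         /* "there may be a disk on this cable" */
    {

            mtx_lock(&root_hold_mtx);
            root_holds++;
            mtx_unlock(&root_hold_mtx);
    }

    void
    root_release(void)      /* probe done, or disk handed to GEOM */
    {

            mtx_lock(&root_hold_mtx);
            if (--root_holds == 0)
                    wakeup(&root_holds);
            mtx_unlock(&root_hold_mtx);
    }

    /*
     * mountroot side: stall until configuration has settled, then
     * poll for the root device until the tunable says to give up.
     */
    int
    root_mount_wait(void)
    {
            int secs;

            secs = 0;
            for (;;) {
                    mtx_lock(&root_hold_mtx);
                    while (root_holds > 0) {
                            msleep(&root_holds, &root_hold_mtx, PZERO,
                                "rootwait", hz);
                            secs = 0;   /* still configuring: reset clock */
                    }
                    mtx_unlock(&root_hold_mtx);
                    if (root_device_found())
                            return (0);
                    if (root_mount_timeout >= 0 &&
                        secs >= root_mount_timeout)
                            return (ENXIO); /* off to the root prompt */
                    tsleep(&secs, PZERO, "rootdelay", hz);
                    secs++;
            }
    }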
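The policy decision for a redundant class then becomes almost
mechanical. Again only a sketch; raid_should_run() is invented, and
the mapping of -1/0/>0 onto the three policies is my reading of the
tunable, nothing settled:

    /*
     * Decide whether a redundant method may start.  One possible
     * reading of the tunable:  -1 = wait for all tributaries,
     * 0 = run as soon as quorum is present, >0 = with quorum, run
     * once no new tributary has arrived for that many seconds.
     * Returns non-zero when the class should go ahead.
     */
    static int
    raid_should_run(int present, int total, int quorum,
        int secs_since_last_tributary, int policy)
    {

            if (present == total)
                    return (1);     /* complete set: always run */
            if (present < quorum)
                    return (0);     /* cannot run at all yet */
            if (policy < 0)
                    return (0);     /* policy 1: insist on all */
            if (policy == 0)
                    return (1);     /* policy 2: quorum suffices */
            /* policy 3: quorum plus N quiet seconds */
            return (secs_since_last_tributary >= policy);
    }

While this says "wait" with quorum already present, the class would
hold a count on the mountroot counter, exactly as in the combination
example below.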
And then the combination example: boot from a raid root with a
missing disk (lots of intermediate steps left out for clarity).

I have configured my system to wait two minutes for disks. The
kernel loads, and disks ad0, ad1, ad2 and ad3 are found in short
order. GEOM does its thing and my RAID geom class gets to taste each
of them in turn. It finds out that it has quorum, but is incomplete,
so the timer applies. It is therefore not finished autoconfiguring
and grabs a count on the "block mountroot" counter.

After two minutes the missing disk has still not arrived, so the
RAID geom class starts up in degraded mode, presents its provider to
GEOM (which increments the counter, since tasting is starting) and
then drops its own count. The tasting of the provider happens in
GEOM, possibly doing partitioning etc. Once it stops, the count goes
to zero and mountroot gets to try to open the root device.

If it cannot open the root device, it will retry for two minutes,
and then give up and go to the root-device prompt.

If I then go to scenario F and take a pee after plugging in four of
my five disks, I may curse the two-minute timer, but if I'm that
advanced I should have reset the tunable. Or should we have two such
tunables: one until root has been mounted and another afterwards?
No big deal, we can do that.

Does all this make sense to people?

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.