Date: Sat, 11 May 2002 00:55:47 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Gordon Tetlow <gordont@gnf.org>
Cc: hackers@freebsd.org
Subject: Re: nextboot loader diff
Message-ID: <3CDCCE83.66AEF4BB@mindspring.com>
References: <Pine.LNX.4.44.0205101634570.27477-100000@smtp.gnf.org>

Gordon Tetlow wrote:

[ ... ]

You *did* ask for comments...

> > There should be a list, so that in a brown-out or whatever, you
> > don't end up toggling back to the previous version accidentally.
>
> This is not something that is meant for you to massage which root
> partition you are going to boot up off of.

I don't understand what it does, then.

The original Whistle code was intended to attempt to boot 3 times
from one partition, and then 3 times from another.

If a boot was successful, then in the last rc file before the
gettys were started, it reset the list to 3 times the current
root and 3 times the alternate root.

That way, on each success, the counter was reset, so in general,
a given root was sticky.

When a failure occurred, the alternate root was the one whose
rc files ran, and it became the sticky one.

Worst case, you could power cycle a box three times quickly to
force a switch back to an older version.

The general failure case is not an indefinite hang, but a reset
before the rc file runs. This is particularly true when you have
a hardware watchdog, where the first thing that happens is that
the watchdog is set.

Note that images are tested before they are shipped, so the worst
case failure is "out of memory" or some other installation-related
problem, and not a kernel problem, anyway.

I've personally had to solve this same problem several times now.

> > You should only ever rewrite the contents of a single file, and
> > it shouldn't be an important file.
>
> Yes, that's exactly what my patch does.

I don't understand the "YES"/"NO" thing, then. There is a one byte
difference in the file length, which I don't think can be properly
accounted for, if you do the "YES"/"NO" thing.

> > The existence/non-existence of the single file should be enough
> > to trigger/suppress the nextboot behaviour.
>
> I can't unlink files in the loader, so the presence of such a file
> wouldn't help.

The file is the nextboot.conf file.
And unlinking it is not something which you want to do, actually.

I think we are misunderstanding each other's intent here.

> > Don't assume that the nextboot file will be on the same disk and/or
> > partition as the boot and other config file code.
>
> Well, I'm assuming it's on the root partition. It would be kinda silly for
> it to be anywhere else.

Not really. Consider that if I switch root partitions, then, by
definition, I switch nextboot files.

Basically, the InterJet was laid out:

	boot code (including nextboot list)
	/ #1		<- version X of the system (read only)
	/ #2		<- version Y of the system (read only)
	swap
	/var		<- log files and /tmp
	/data		<- user data (config, user files, etc.)

The fstabs on #1 and #2 were opposite, so that you could mount
and overwrite the contents with a new release of the software.

An upgrade was:

	mount opposite "root"
	unpack new system image onto opposite root
	set up opposite root fstab
	sync
	unmount
	nextboot "opposite opposite opposite this this this"
	reboot

Each revision had data management upgrade/downgrade scripts;
these were written to /data, so that opposite versions could
downgrade.

> > Together, these things will allow the new code to solve the same
> > problem that the old code solved on the InterJet.
>
> I've never heard nor seen the old code. I don't know what it did, and I
> don't particularly care. I did this because I thought the way Wes Peters
> did his implementation was rather hackish (not saying mine is any better
> =) and suboptimal if the machine doesn't make it to multi-user. Please
> refer to the commit logs from earlier this month if you don't know of the
> commit I'm referring to.

I do. He committed some, but not all, of the code that Jon Mini
and James wrote (Jon says some of it was based on code I wrote).
The design I did at ClickArray was based on the Whistle design
from when I worked at Whistle with Julian and Archie.

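For illustration only (this is not the Whistle or ClickArray code),
the boot-list behaviour described above might be sketched in Python
roughly as follows. The function names and root labels are
hypothetical, and the sketch assumes that an entry is consumed before
each boot attempt, so that a reset before the rc files run still uses
up one try. What happens when the list runs out is not described
above and is left undefined here.

```python
ATTEMPTS = 3  # attempts per root, as in the description above


def fresh_list(current, alternate, attempts=ATTEMPTS):
    """List written on success: current root first, then the alternate."""
    return [current] * attempts + [alternate] * attempts


def next_root(boot_list):
    """Boot code: consume one entry and return the root to try.

    Assumption: the entry is consumed *before* the boot is attempted.
    """
    if not boot_list:
        return None  # exhausted list: behaviour not described, left undefined
    return boot_list.pop(0)


def simulate(boot_list, boots_ok, roots=("root1", "root2")):
    """Attempt boots until one succeeds; return (root, rewritten list)."""
    while True:
        root = next_root(boot_list)
        if root is None:
            return None, boot_list
        if boots_ok(root):
            # Last rc file before the gettys: make the booted root sticky.
            other = roots[1] if root == roots[0] else roots[0]
            return root, fresh_list(root, other)
```

An upgrade from root1 to a new image on root2 would then start from
fresh_list("root2", "root1"), which mirrors the "opposite opposite
opposite this this this" list: if root2 never reaches its rc files,
the fourth attempt lands back on root1, and root1 becomes the sticky
root again.
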
The ClickArray code, if it was intended to solve the same problem
as the code it was supposedly derived from, is for solving the
remote upgrade problem, with no local removable media that can be
used to recover from a catastrophic failure (the only recovery from
such a failure is a fallback to a working previous revision, per
the InterJet).

The code you are talking about seems limited to replacing only
the kernel. Frankly, that's recoverable via the serial console,
if you put the "-p" in the right file in /.

This isn't really sufficient for any embedded system that needs
to get at netstat, ps, or other data which involves examination
of kernel structures, which may change between kernel versions.
You pretty much have to have two system images to solve that
problem, or you'll find yourself incredibly screwed, when the web
UI, the CLI, SNMP, and the front panel LCD all start reporting
random bogus data. 8-(.

I'm not trying to dump on your code; I'm just saying that it's
not solving the problem that the original code was added to be
able to solve, and that the original nextboot itself was intended
to resolve.

You asked for comments ...those are mine.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message