Date:      Mon, 23 Aug 2010 18:14:24 -0600 (MDT)
From:      "M. Warner Losh" <imp@bsdimp.com>
To:        xcllnt@mac.com
Cc:        freebsd-arch@freebsd.org
Subject:   Re: RFC: enhancing the root mount logic
Message-ID:  <20100823.181424.646155203640260173.imp@bsdimp.com>
In-Reply-To: <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
References:  <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net> <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>

In message: <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
            Marcel Moolenaar <xcllnt@mac.com> writes:
: 
: On Aug 23, 2010, at 4:12 PM, M. Warner Losh wrote:
: 
: *snip*
: 
: > However, all this scripting sounds a bit like a very simple shell in
: > the kernel.  What advantages are there to this approach vs having the
: > ability to run a simple shell script or executable and "pivot" the root
: > to a new location?
: 
: The 2 reasons for doing this in the kernel are:
: 1.  resiliency against ABI changes.
: 2.  allowing /sbin/init to come from the actual root file system.
: 
: Both points are impossible to handle efficiently or correctly if
: you need user space support in getting to your actual root file
: system. You basically have a catch-22 or bootstrap problem, which
: a pure in-kernel solution doesn't have.

OK.  That makes sense.  Without execing the new init, which may be a
problem with the current world view of init(8) and the kernel, you'd
have to have your final init on the first level root file system.

: > And how do you emulate the mount_foo programs for
: > foo filesystems?  Some of them do weird things that might not
: > translate well into the kernel...
: 
: True. I haven't fleshed that out, but I was hoping that nmount(2)
: would have normalized most of this so that it's a non-issue, provided
: we support mount options in this scheme.
: 
: If you have a concrete example of something that's not so trivial,
: but critical to support, let me know and I'll take it into account.

mount_smbfs presently connects to the remote system to do the
authentication itself, and initializes the smb context, before
mounting the file system in the kernel.  I don't know if I'd call
this a critical-to-support feature, but it was the first "exception"
to the rule that jumped into my head, so I was curious if you'd
thought about it.

: > As you can see, I'm torn about how I feel about the idea.  For simple
: > cases, I think it is great, but as complexity builds, I become less
: > sure.  What if that iso image was compressed?
: 
: Can you elaborate how this is potentially a problem in this scheme,
: but not for "manual" mounting?

You'd need a way to stack up different modules, since you'd need
geom_uzip over md0 to make it useful to the cd9660 code.
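
To make the stacking requirement concrete, here is a minimal sketch in
Python that models each layer as a step that takes one provider name
and produces the next.  It is an illustration only: the Layer class,
the resolve function and the image path are hypothetical names, and
nothing here talks to GEOM or the kernel.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Layer:
    """One step in a hypothetical root-mount stack."""
    name: str                       # label for the step, e.g. "geom_uzip"
    produce: Callable[[str], str]   # maps an input provider to an output provider


def resolve(source: str, layers: List[Layer]) -> List[str]:
    """Return every provider name seen while walking the stack, in order."""
    chain = [source]
    for layer in layers:
        chain.append(layer.produce(chain[-1]))
    return chain


# Hypothetical stack: an image file is attached as md0, then geom_uzip
# exposes the decompressed view as md0.uzip for cd9660 to mount.
stack = [
    Layer("md", lambda backing_file: "md0"),
    Layer("geom_uzip", lambda provider: provider + ".uzip"),
]

chain = resolve("/images/root.iso.uzip", stack)
print(" -> ".join(chain))
```

The point of the sketch is that an in-kernel scheme would need some
way to express an ordered list like `stack`, not just a single
device and file system type.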

: > What if I had a
: > software RAID of disks or flash devices?
: 
: I see no problem. In fact, the idea is triggered by switching to a
: flash file system on a NAND flash.

RAID of flashes: something that would need configuration.  But you
may be correct: this level of flexibility may not be needed, and other
concerns may trump it...

: > What about crypto?
: 
: See above. Can you elaborate?

Same thing, but with a crypto key :)

: > I know I
: > can handle those cases in /bin/sh, but will each new one require more
: > code in the kernel?
: 
: The way I see it is that the approach enhances how we now mount the
: root file system. We have very limited flexibility. I do not claim
: that my idea allows every possible variation, and I think it unfair
: to expect that of the approach. If one has real complex requirements,
: one can always just mount some file system on some storage device
: and deal with the root mount in user space. I don't see how this
: prevents that.

init(8) is the show stopper for a pivot-root approach, unless you
could tell the init that's on the first level (and simple) to exec
/sbin/init to pick up the new copy, but I don't know how happy that
would make the kernel.

: >  What would df and/or mount tell you about the
: > now-hidden file systems?
: 
: Can you explain what you mean by now-hidden file systems?

OK.  Let's say we have a three level scheme:

/dev/nor0, which has the initial root on it.
Next up is foo.iso.gz, which is mounted read-only on md0.
Next up is geom_uzip, which presents the device as md0.uzip, which
finally gets mounted as root.

So would df show:

Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
/dev/nor0             4096     4096        0     110%  /
/dev/md0.uzip        16000    16000        0     110%  /

or

Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
/dev/nor0             4096     4096        0     110%  /.old_root
/dev/md0.uzip        16000    16000        0     110%  /

and if we had one more layer on nand:

Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
/dev/nor0             4096     4096        0     110%  /
/dev/md0.uzip        16000    16000        0     110%  /
/dev/nand0          320000   300000    20000      82%  /

or

Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
/dev/nor0             4096     4096        0     110%  /.old_root/.old_root
/dev/md0.uzip        16000    16000        0     110%  /.old_root
/dev/nand0          320000   300000    20000      82%  /

is the question I'm asking...
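
The two conventions can be stated precisely with a small sketch.  The
Python below is only a model of the question, not of anything the
kernel does today: mount_points is a hypothetical function, and the
/.old_root name is simply the one used in the listings above.

```python
from typing import List, Tuple


def mount_points(devices: List[str], keep_old_roots: bool) -> List[Tuple[str, str]]:
    """Pair each device with the path df would report for it.

    devices is ordered oldest root first; the last entry is the live root.
    With keep_old_roots, each earlier root is pushed down one /.old_root
    level per later root; without it, every entry is reported at "/".
    """
    rows = []
    for index, device in enumerate(devices):
        hops = len(devices) - 1 - index   # number of roots mounted after this one
        if keep_old_roots and hops:
            path = "/.old_root" * hops
        else:
            path = "/"
        rows.append((device, path))
    return rows


layers = ["/dev/nor0", "/dev/md0.uzip", "/dev/nand0"]
for device, path in mount_points(layers, keep_old_roots=True):
    print(f"{device:<15}{path}")
```

Calling it with keep_old_roots=False gives the first form, where
every layer claims to be mounted on /.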

Right now you can mostly do a pivot-root-like thing by having init do
a chroot very early, possibly after executing a simple rc script to
put the second level root system online.  init_script gets run very
early, followed by a chroot to init_chroot, followed by a mount of
devfs on /dev if necessary.  However, when you do this, you often
end up with weird-looking df output, since / isn't really / to df.

Anyway, the fact that we have a decoupled fork/exec really is what
led me to ask the question.  It is useful to run arbitrary code
between the two, even if you usually run the same code...  sometimes
you want to be different.  I was thinking that this might be the same
way here.  But, as you rightly point out, maybe there's too much
complexity in doing that and simpler is better.
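
For anyone who wants the fork/exec point spelled out, here is a
minimal sketch in Python.  The names run_with_setup and
change_directory are made up for the example; the only real
interfaces used are os.fork, os.execvp and os.waitpid.  The setup
callback runs in the child only, in the window between the two calls.

```python
import os
import sys


def run_with_setup(argv, setup):
    """Fork, run setup() in the child only, then exec argv there."""
    pid = os.fork()
    if pid == 0:
        # Child: this is the window between fork and exec.
        try:
            setup()
            os.execvp(argv[0], argv)
        finally:
            # Reached only if setup() raised or the exec failed.
            os._exit(127)
    # Parent: wait for the child and report its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


def change_directory():
    # Arbitrary per-child setup; the parent's working directory is untouched.
    os.chdir("/")


check = "import os, sys; sys.exit(0 if os.getcwd() == '/' else 1)"
code = run_with_setup([sys.executable, "-c", check], change_directory)
print(code)
```

Most callers would pass the same setup every time, but the hook is
there for the odd case that needs something different, which is the
kind of flexibility I was asking about.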

Warner


