Date:      Thu, 14 Oct 1999 20:56:11 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        phk@critter.freebsd.dk (Poul-Henning Kamp)
Cc:        freebsd-arch@freebsd.org
Subject:   Re: The eventual fate of BLOCK devices.
Message-ID:  <199910142056.NAA29867@usr08.primenet.com>
In-Reply-To: <447.939897820@critter.freebsd.dk> from "Poul-Henning Kamp" at Oct 14, 99 12:43:40 pm

First of all, thanks to everyone for such a focussed discussion.

I have some comments on Poul's comments, and would at least like
to argue for a "legacy mode", even if it is not enabled by default,
so long as it can be enabled (and implied) without a kernel recompile
(but perhaps requiring a kernel module), for standards compliance
reasons, if no other.



> SUMMARY:
> 
> So far we have identified the following two classes of software
> which access disk-like devices through cdev and bdev:
> 
>   1) Database software.
> 
>   2) Filesystem maintenance tools
> 
>   3) savecore(8)

I would add:

    4) Programs which want to treat objects in the filespace as if
       they were byte streams.

In other words, it shouldn't matter, and I should not have to
give special arguments to programs such as "tar" and "dd" and "team"
to do I/O in variable media blocking factors.

I think I would also add:

    5) Programs that have to deal with CDROMs containing multiple
       sessions.

This is an issue, since not all data is 2048 byte blocks, but
can in fact be 2352, or a physical sector size of 2048, 2336, or
2340 bytes.  This will only get more complicated as DVD and other
standards evolve and come online.

In addition, many WORM, magneto-optical, and Japanese hard
drives (such as those by default in the NEC PC-98) use 1024 byte
sectors.

I would have that complexity hidden from the user, who is most
interested in a linear array of bytes of arbitrary length, and
in seeking to non-block aligned offsets in the linear array.
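
To make the "linear array of bytes" view concrete, here is a minimal
sketch of the arithmetic that would have to be hidden from the user.
The names locate_byte() and struct sector_pos are made up for this
illustration; they are not an existing kernel or libc interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Position of one byte on a medium made of fixed-size sectors. */
struct sector_pos {
	uint64_t sector;	/* index of the sector holding the byte */
	size_t	 offset;	/* byte offset inside that sector */
};

/*
 * Map a linear byte offset onto (sector, offset-in-sector).
 * sector_size is whatever the medium uses: 512, 1024, 2048, 2352, ...
 * (hypothetical helper, for illustration only).
 */
struct sector_pos
locate_byte(uint64_t byte_off, size_t sector_size)
{
	struct sector_pos pos;

	assert(sector_size > 0);
	pos.sector = byte_off / sector_size;
	pos.offset = (size_t)(byte_off % sector_size);
	return pos;
}
```

The same byte offset lands in a different place depending on the
sector size of the medium, which is exactly the detail a program
like "tar" or "dd" should not have to be told about.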


> Database software prefer cdev semantics if at all possible, if
> running on anything but a cdev database software call fsync(2) a
> lot to make sure the writes have hit the media.

I would argue that such database software is either broken, or
it is expecting a broken kernel (one which does not do the correct
thing on block device descriptors marked O_SYNC -- such as FreeBSD's
existing block device semantics).


> Terry argues for retaining the bdev semantics rather than the cdev
> semantics, but I think we can dismiss that idea based on the above
> observation: it would penalize software which know better.  Retaining
> the bdev would in essence be emulating the mistake Linux made, and
> which they are now unmaking.

I think that for "software that knows better", i.e. software that
has called fstat(2) to get st_blksize and intentionally performs
aligned writes, it would be trivial to determine whether a write
starts on a block boundary and spans an integer number of blocks,
and therefore to avoid penalizing the smarter software.  This is
really an implementation issue, not a performance issue.
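
As a rough sketch of how cheap that test is (is_block_aligned() is a
name invented here, and blksize stands in for the st_blksize value
the program got from fstat(2)):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Return true when a transfer of len bytes at byte offset off starts
 * on a block boundary and covers a whole number of blocks.
 * Hypothetical helper; a real kernel check would live in the
 * device read/write path.
 */
bool
is_block_aligned(off_t off, size_t len, size_t blksize)
{
	assert(blksize > 0);
	if (off < 0 || len == 0)
		return false;
	return (off % (off_t)blksize) == 0 && (len % blksize) == 0;
}
```

A transfer that passes this test could bypass the buffer cache
entirely; anything else would take the buffered path.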


> The filesystem maintenance applications mentioned so far which rely
> on bdev semantics, the EXT2FS tools, can be trivially converted to
> operate on cdev semantics.  The majority of such tools already
> correctly operate on cdevs.

I believe the tools should be implemented via a different API,
since the kernel already knows about slices, partitions, etc.,
and has to have that knowledge embedded in it.  So either way,
the tools' promiscuous knowledge of stuff that they really have no
right knowing in the first place isn't an argument for getting
rid of block devices -- nor an argument in favor of keeping them.


> Savecore(8) has already been converted to operate on cdevs.

Irrelevant, I think, as well.

Clearly, we could convert the entirety of all FTP'able software
on the Internet to do its own block size determination, and
do buffering in user space thereafter.  I think this would be
wasteful.
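
For a sense of what each such program would have to carry around,
here is a minimal sketch of an unaligned read done with only
whole-block, block-aligned transfers.  read_unaligned() is a made-up
name, blksize is assumed to have been obtained elsewhere (e.g. from
fstat(2)), and a real raw device may also impose buffer alignment
rules that this sketch ignores.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes at arbitrary byte offset off from fd, issuing a
 * single whole-block, block-aligned pread(2).  Returns the number of
 * bytes copied to dst, or -1 on error.  Illustration only: a
 * production version would loop on short reads.
 */
ssize_t
read_unaligned(int fd, off_t off, void *dst, size_t len, size_t blksize)
{
	assert(blksize > 0 && off >= 0);
	if (len == 0)
		return 0;

	off_t first = off - (off % (off_t)blksize);	/* round down */
	size_t skip = (size_t)(off - first);		/* leading slop */
	size_t span = skip + len;
	/* Round the span up to a whole number of blocks. */
	size_t total = ((span + blksize - 1) / blksize) * blksize;

	char *bounce = malloc(total);
	if (bounce == NULL)
		return -1;

	ssize_t got = pread(fd, bounce, total, first);
	if (got < 0) {
		free(bounce);
		return -1;
	}

	/* Hand back only the requested bytes that were actually read. */
	size_t avail = (size_t)got > skip ? (size_t)got - skip : 0;
	size_t ncopy = avail < len ? avail : len;
	memcpy(dst, bounce + skip, ncopy);
	free(bounce);
	return (ssize_t)ncopy;
}
```

The write side is worse, since it needs a read-modify-write of the
first and last blocks.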

As Julian didn't point out, but probably meant to with his example,
multiple programs operating at a granularity of sizeof(struct foo),
where sizeof(struct foo) is not an integer multiple of the underlying
device block size, will have to have some form of promiscuous IPC
mechanism to communicate with each other.
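
A small sketch of why: with a record size that does not divide the
block size, neighbouring records end up in the same device block, so
two programs rewriting "their own" records are really rewriting the
same block.  The helper names below are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First and last device block touched by one record. */
struct block_range {
	uint64_t first;
	uint64_t last;
};

/* Blocks covered by record number idx (records are packed end to end). */
struct block_range
record_blocks(uint64_t idx, size_t recsize, size_t blksize)
{
	assert(recsize > 0 && blksize > 0);
	uint64_t start = idx * recsize;
	uint64_t end = start + recsize - 1;
	struct block_range r = { start / blksize, end / blksize };
	return r;
}

/* True when records a and b touch at least one common device block. */
bool
records_share_block(uint64_t a, uint64_t b, size_t recsize, size_t blksize)
{
	struct block_range ra = record_blocks(a, recsize, blksize);
	struct block_range rb = record_blocks(b, recsize, blksize);

	return ra.first <= rb.last && rb.first <= ra.last;
}
```

With 100 byte records on 512 byte blocks, record 5 occupies bytes
500..599 and so straddles blocks 0 and 1; it shares a block with
both record 4 and record 6.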

As has been pointed out before, advisory locking does not work on
specfs or other vnodes not accessed through the VFS interface's
struct vfsops.

Although this could be corrected by moving the advisory lock list
to the vnode, and removing all advisory locking code from every
VFS (except the NFS client VFS), this work has not yet been done.

Without buffering, a supra-record offset granularity would need
to be maintained and communicated between multiple programs that
are accessing the character device on non-block boundaries.

This is a can of worms.


> Using mmap(2) to provide a new type of buffered semantics for
> disk-like devices is interesting, but its applicability will be
> limited by the virtual address space of a process: you can't map
> a 20GB database into a 32bit address space, so a lot of mmap(2)
> calls will be needed for serious sized data.  The need for, and
> actual use of such a facility seems uncertain.

Agreed.


> There is general disagreement about how much code we save, but
> nobody disputes that we will be able to remove some amount of
> complexity from the kernel.  Most people seem to overlook the
> needlessly replicated code in a number of xxx(8) tools to DTRT with
> /dev/foo vs /dev/rfoo.

I think if these tools are written to operate on the less limited
block device, they should simply refuse to operate on the more
limited character device.  This is an elegant solution, and some
message morally equivalent to "use the block device, dummy" would
be adequate to get the user to do the right thing, rather than
making up for the inadequacies of the user (down that road lies
ruin and "undelete" and "unnewfs").


> Implementing an ioctl(2) to switch a disk-like device into bdev
> mode is relatively trivial, but there currently seems to be no
> point in doing so.

I think the point in doing this would be to ensure that code
would not be broken by the OS, and could be forced to work.

I would not object to removal of the block devices (except on
standards conformance based grounds), if it were guaranteed
that such an ioctl() would be implemented before their removal,
and that a user desiring to do so could override the "MAKEDEV"
to create "block" devices, on which this ioctl() call was implicitly
called on open.

This would certainly satisfy the "legacy/standards crowd", I
think, while still allowing the surgery you want to perform.


> There is a significant majority supporting the removal of bdev
> semantics.

Majority is not a measure of technical merit.

I think a character device that allowed block semantics, but would
discard cache buffers if accessed on block boundaries would equally
suffice to address the issue of unification of the block and character
device namespace, which I think is the real issue here.

However, an ioctl() based solution, with a compatibility mode which
is not enabled by default (but must be capable of being soft-enabled)
would suffice.


> An ioctl(2) based mode-switch will only be implemented if a
> very good reason for doing so materializes.

I think that the fact that we can't know about all software, and
that the standards specify block devices, argues for some form
of legacy support mechanism, even if it isn't enabled by default
for FreeBSD systems.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.







