Date:      Fri, 12 Mar 2004 00:04:47 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Colin Percival <colin.percival@wadham.ox.ac.uk>
Cc:        cvs-all@freebsd.org
Subject:   Re: cvs commit: src/sys/sys mdioctl.h src/sys/dev/md md.c  src/sbin/mdconfig mdconfig.8 mdconfig.c
Message-ID:  <20040311230444.G6384@gamplex.bde.org>
In-Reply-To: <6.0.1.1.1.20040311063721.03e220b8@imap.sfu.ca>
References:  <Your message of "Thu, 11 Mar 2004 06:30:28 GMT." <48348.1078986950@critter.freebsd.dk> <6.0.1.1.1.20040311063721.03e220b8@imap.sfu.ca>

On Thu, 11 Mar 2004, Colin Percival wrote:

> At 06:35 11/03/2004, Poul-Henning Kamp wrote:
> >In message <6.0.1.1.1.20040311062306.03f9ade0@imap.sfu.ca>, Colin Percival
> >writes:
> > ><kernelnewbie>
> > >   Is it really necessary for vnode-backed memory disks to be
> > >accessed through the filesystem?  Why can't md(4) hijack the
> > >disk blocks which constitute the file (telling the filesystem
> > >not to touch them, of course) and translate I/O operations
> > >directly into I/O on the underlying device?
> > ></kernelnewbie>

Script started on Thu Mar 11 23:13:06 2004
ttyp0:root@besplex:/c/tmp> dd if=/dev/zero of=zz bs=1 oseek=32767g count=1
1+0 records in
1+0 records out
1 bytes transferred in 0.000070 secs (14266 bytes/sec)
ttyp0:root@besplex:/c/tmp> du zz
448	zz
ttyp0:root@besplex:/c/tmp> exit
Script done on Thu Mar 11 23:13:47 2004

This creates a file of size 32TB-epsilon with 1 minimal block in it
(a 2K frag for ffs).  md could map this block but would have difficulty
using the other 32TB-2*epsilon bytes in the file.  It would have to
duplicate the file system's block allocator to allocate new blocks.
The block allocator is the most interesting part of a file system.
The script doesn't show mdconfig'ing this file since md has overflow
bugs at 4G sectors and can't actually handle files of this size.
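
As a rough illustration of the two points above (not part of the original
script), the sketch below creates a sparse file the same way dd does, by
seeking and writing one byte, and then works out how many sectors the 32TB
file would need. The 512-byte sector size and the idea that the count is
held in a 32-bit quantity are assumptions made for the arithmetic; the
names make_sparse, sectors and fits_in_u32 are hypothetical helpers.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumption: 512-byte sectors


def make_sparse(path, size):
    """Create a file of `size` bytes by writing a single byte at offset size-1.

    Returns (apparent size, allocated bytes).  On file systems that support
    holes the allocated figure is normally far smaller than the apparent size.
    """
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on POSIX systems.
    return st.st_size, st.st_blocks * 512


def sectors(size, sector_size=SECTOR_SIZE):
    """Number of whole sectors in a file of `size` bytes."""
    return size // sector_size


def fits_in_u32(n):
    """True if n can be stored in an unsigned 32-bit counter (assumed width)."""
    return n < 2**32


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        apparent, allocated = make_sparse(os.path.join(d, "zz"), 1 << 26)
        print("apparent:", apparent, "allocated:", allocated)

    # Size produced by: dd bs=1 oseek=32767g count=1
    size = 32767 * 2**30 + 1
    n = sectors(size)
    print("sectors:", n, "fits in 32 bits:", fits_in_u32(n))
```

The sector count comes out at roughly 2^36, well past the 4G (2^32) mark,
which is consistent with a file this size tripping such an overflow.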

Apart from this, direct access would sort of work.  Very old versions
did this.  See rev.1.1 of sys/dev/vn/vn.c.  It only uses VOP_BMAP()
to map the blocks and VOP_STRATEGY() to do i/o.  It apparently doesn't
work for writing to holes in the file.
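
The problem with holes can be sketched with a toy model.  This is not the
vn.c code and not the real VOP_BMAP() interface; ToyFile, bmap and
direct_write are hypothetical names, and the "disk" is just a dictionary.
The point is only that a pure map-then-do-i/o scheme has nowhere to put
data for a logical block that has no physical block behind it.

```python
class ToyFile:
    """Hypothetical file: maps logical block numbers to physical ones.

    Logical blocks missing from the map are holes.
    """

    def __init__(self, nblocks, mapped):
        self.nblocks = nblocks
        self.map = dict(mapped)

    def bmap(self, lbn):
        """Return the physical block for lbn, or None if lbn is a hole."""
        if not 0 <= lbn < self.nblocks:
            raise IndexError(lbn)
        return self.map.get(lbn)


def direct_write(f, lbn, disk, data):
    """Write data straight to the underlying 'disk', bypassing the file system.

    Returns False for a hole: without the file system's block allocator
    there is no physical block to write to.
    """
    pbn = f.bmap(lbn)
    if pbn is None:
        return False
    disk[pbn] = data
    return True


if __name__ == "__main__":
    # Like the dd example: only the last logical block is allocated.
    f = ToyFile(nblocks=8, mapped={7: 1234})
    disk = {}
    print(direct_write(f, 7, disk, b"x"))  # mapped block: write goes through
    print(direct_write(f, 0, disk, b"y"))  # hole: nothing to write to
```

Making the second write succeed would mean allocating a block and updating
the file's block map, which is the allocator's job.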

> >That would be a really complex solution to a problem which should not
> >exist in the first place :-)

The version in rev.1.1 of vn.c is about twice as large and more than twice
as complex as the current code.  It probably needs to be more complex
to actually work (apart from not supporting sparse files).  There were
several intermediate versions that worked better but still had deadlock
problems (IIRC, it got rewritten 3 or 4 times mainly to reduce deadlock
problems).  The version in RELENG_3 still uses VOP_BMAP/VOP_STRATEGY.
The version in RELENG_4 still claims to use VOP_BMAP/VOP_STRATEGY in
a comment, but actually uses VOP_READ/VOP_WRITE for vnodes (as explained
in another comment).  The version in md.c in -current is similar to
vn.c in RELENG_4.

>    Well... yes, but it *would* make sure that data didn't get passed
> back up to the filesystem layer.  And it would probably be faster,
> which is why I thought of it.

It might also have fewer deadlock possibilities.  VOP_READ/VOP_WRITE are
inherently more blocking than VOP_STRATEGY.

Bruce
