From owner-freebsd-hackers  Sun Oct 29 17:59:35 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from ren.sasknow.com (ren.sasknow.com [207.195.92.131])
	by hub.freebsd.org (Postfix) with ESMTP id 35ACB37B479
	for ; Sun, 29 Oct 2000 17:59:30 -0800 (PST)
Received: from localhost (ryan@localhost)
	by ren.sasknow.com (8.9.3/8.9.3) with ESMTP id UAA22197;
	Sun, 29 Oct 2000 20:03:12 -0600 (CST)
	(envelope-from ryan@sasknow.com)
Date: Sun, 29 Oct 2000 20:03:12 -0600 (CST)
From: Ryan Thompson
To: Matt Dillon
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Filesystem holes
In-Reply-To:
Message-ID:
Organization: SaskNow Technologies [www.sasknow.com]
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Ryan Thompson wrote to Matt Dillon:

> Matt Dillon wrote to Ryan Thompson:
>
> > :> :> storage is rather inefficient for our table of about 2,850,000 members
> > :> :> (~2.1 GB total storage).  There are 64M possible hash values in our
> > :> :> current implementation, and our record size is variable, but could be
> > :> :> safely fixed at about 1.5KB... So, total storage if all values were used
> > :> :> would be about 96GB.  (See where I'm going with this?)
> > :...
> > :
> > :Remember, though, I'm not proposing a hash table.  I have been lucky
> > :enough to design a "hash" function that will crunch down all possible
> > :values into about 64M unique keys--so, no collisions.  I propose to use
> > :this function with address calculation.  So, O(1) everything, for single
> > :random lookups, which is (virtually) all this data structure must deal
> > :with.  Sorting isn't necessary, and that could be accomplished in O(n)
> > :anyway.
> >
> >     Are you still talking about creating a 96GB file to manage 2.1GB worth
> >     of data?  I gotta say, that sounds kinda nuts to me!  CPU cycles are
> >     cheap; I/O and memory are not.
>
> Hi, Matt!  Thanks for the replies.  I'll try and keep you interested ;-)
>
> Hmm... Perhaps you're still missing my original point?  I'm talking about
> a file with 96GB in addressable bytes (well, probably a bunch of files,
> given logical filesize limitations, but let's say for simplicity's sake
> that we have a single file).  Its actual "size" (in terms of allocated
> blocks) will be only a bit larger than 2.1GB.  (Directly proportional to
> the used size of the dataset.  Discrepancies only come into play when
> record size != block size, but that can be worked around somewhat.)
>
> In other words, ls -ls will report the "size" as some ridiculously large
> number, but will show a much smaller block count.  So, assuming four records
> are added to the file on block boundaries, the file will actually only use
> four blocks... nowhere near 96GB!
>
> In the UNIX filesystem (ya, I know.. just pick one :-), size of file !=
> space allocated for file.  Thus, my original questions were centered
> around filesystem holes.  I.e., non-allocated chunks in the middle of a
> file.  When trying to READ from within a hole, the kernel just sends back
> a buffer of zeros... which is enough to show that the record is not
> initialized.

If you prefer to read system documentation instead of me, see lseek(2) :-)

     The lseek() function allows the file offset to be set beyond the end
     of the existing end-of-file of the file.  If data is later written at
     this point, subsequent reads of the data in the gap return bytes of
     zeros (until data is actually written into the gap).

I suppose gap == hole.  Silly semantics. :-)
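To make the hole behaviour concrete, here is a minimal stand-alone C
sketch (my illustration only; error checking omitted, and the filename
is arbitrary) that writes a single byte far past EOF and then checks
what fstat(2) and a read from the gap report:

    /*
     * sparse.c -- poke one byte out at the 1GB mark, then compare
     * st_size with st_blocks and read zeros back from the hole.
     */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        char buf[512];
        struct stat sb;
        int fd = open("sparse.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

        /* One byte at the 1GB mark; everything before it is a hole. */
        lseek(fd, (off_t)1 << 30, SEEK_SET);
        write(fd, "\n", 1);

        fstat(fd, &sb);
        printf("size = %lld bytes, allocated = %lld 512-byte blocks\n",
            (long long)sb.st_size, (long long)sb.st_blocks);

        /* A read from the middle of the gap returns zeros; the kernel
         * doesn't go to disk for unallocated blocks. */
        pread(fd, buf, sizeof(buf), (off_t)1 << 20);
        printf("byte read from the gap = %d\n", buf[0]);

        close(fd);
        return (0);
    }

Running ls -ls on sparse.dat afterwards tells the same story: a huge
"size", a handful of allocated blocks.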
> Actually, something like an "exists" function for a record
> wouldn't touch the disk at all!  When writing to a hole, the kernel simply
> allocates the necessary block(s).  This is really fast, too, for creation,
> as the empty set can be written to disk with touch(1), and uses far less
> memory than virtual initialization or memory structures ;-)
>
> As an example, try
>
>	fseek(f, 0x7FFFFFFE, SEEK_SET);	/* just shy of 2GB out */
>	fputc('\n', f);
>
> And analyze the reported filesize, as well as the reported block count of
> the file.  You should see a 2GB "size", and maybe 8K in allocated blocks
> (depends how you ran newfs ;-).  This is behaviour that has been around
> since the original UFS.
>
> >     Take the BTree example again -- if you think about it, all internal nodes
> >     in the tree will fit into memory trivially.  It is only the leaf nodes
> >     that do not.  So in regards to actual disk accesses you still wind up
> >     only doing *ONE* for the btree, even if it winds up being four or five
> >     levels deep.
>
> Right.  And, actually, (without looking a bit more closely), I wouldn't be
> surprised if you could replace the four-line address calculation I have
> with your B+Tree structure and come up with the same result.  Only
> difference would be a few hundred lines of code, much more testing, and
> quite a few megs of RAM... ;-)
>
> What you referred to as "nuts", above, is just a logical way to provide a
> huge address space for a set of data, without actually allocating blocks
> in the filesystem for the entire address space until they are used.
>
> >     The result is that you will be able to create an index to your data,
> >     which is only 2.8 million records, in around 32 bytes per record or
> >     89 Megabytes.  Not 96GB.  Since you can cache all the internal nodes
> >     of the btree you are done.  The machine will do a much better job
> >     caching an 89MB index than a 96GB index.
> >
> >     Do not make the mistake of equating CPU to I/O.  The CPU required to
> >     iterate through 4 levels in the btree (assuming 64 entries per node,
> >     it only takes 4 to cover 16 million records) is *ZERO* ... as in,
> >     probably less than a microsecond, whereas accessing the disk is going
> >     to be milliseconds.
>
> CPU time for what I'm proposing is even closer to zero than for a tree...
> But, you're right, it doesn't make any real difference when compared to
> disk I/O...  B-Trees are good for a lot of things.  Address calculation
> can be really good, too, given a finite key set, and a way to represent
> that finite key set without wasting space.
>
> > > > A B+Tree will also scale with the size of the dataset being managed,
> > > > so you do not have to preallocate or prereserve file space.
> > >
> > > So will address calculation + filesystem holes, and
> > > sufficiently large filesizes :-)
> >
> >     Uh, not with a 50:1 file size factor it won't.
>
> See my above discussion on file size != allocated blocks.
>
> And, address calculation implies that no index need be created or
> maintained (in memory, or on disk).  Memory utilization is practically
> nil, disk utilization is proportional to the actual data size, and all
> record operations are done with one seek (O(1)).
>
> Perhaps performance could be improved by keeping a small memory cache, but
> that is tangential to the actual data structure, and will increase
> performance by percentages, not orders of magnitude.
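Putting the pieces together, here is a minimal sketch of the
address-calculation scheme itself.  (Again, an illustration only:
hash_key() is just a stand-in for the collision-free 64M-key function
described above, which isn't shown here, and error handling is omitted.)

    /*
     * Sketch: O(1) record operations via address calculation over a
     * sparse data file.  The caller open(2)s the file once with
     * O_RDWR | O_CREAT and passes the descriptor in.
     */
    #include <sys/types.h>
    #include <string.h>
    #include <unistd.h>

    #define RECSIZE 1536                /* fixed record size, ~1.5KB */

    extern unsigned long hash_key(const char *member); /* 0 .. 64M-1 */

    /* The record's offset is computed directly from its key; no index
     * is consulted or maintained, in memory or on disk. */
    static off_t
    rec_offset(const char *member)
    {
        return ((off_t)hash_key(member) * RECSIZE);
    }

    static ssize_t
    rec_fetch(int fd, const char *member, char *buf)
    {
        return (pread(fd, buf, RECSIZE, rec_offset(member)));
    }

    static ssize_t
    rec_store(int fd, const char *member, const char *buf)
    {
        /* Writing into a hole is what allocates the blocks. */
        return (pwrite(fd, buf, RECSIZE, rec_offset(member)));
    }

    /* A record that was never written reads back as all zeros (the
     * kernel fills the hole), so "not initialized" is easy to spot. */
    static int
    rec_exists(int fd, const char *member)
    {
        char buf[RECSIZE], zero[RECSIZE];

        memset(zero, 0, sizeof(zero));
        if (rec_fetch(fd, member, buf) != RECSIZE)
            return (0);
        return (memcmp(buf, zero, sizeof(zero)) != 0);
    }

Since off_t is 64 bits on FreeBSD, the offset arithmetic is safe out to
the full 96GB address space (whether that lives in one file or several,
per the filesize limitations mentioned above), while the blocks actually
allocated stay proportional to the records written.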
> >
> > 						-Matt
>
> - Ryan
>

--
Ryan Thompson
Network Administrator, Accounts
Phone: +1 (306) 664-1161

SaskNow Technologies      http://www.sasknow.com
#106-380 3120 8th St E    Saskatoon, SK  S7H 0W2


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message