From owner-freebsd-hackers  Sun Oct 29 17:59:35 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from ren.sasknow.com (ren.sasknow.com [207.195.92.131])
	by hub.freebsd.org (Postfix) with ESMTP id 35ACB37B479
	for ; Sun, 29 Oct 2000 17:59:30 -0800 (PST)
Received: from localhost (ryan@localhost)
	by ren.sasknow.com (8.9.3/8.9.3) with ESMTP id UAA22197;
	Sun, 29 Oct 2000 20:03:12 -0600 (CST)
	(envelope-from ryan@sasknow.com)
Date: Sun, 29 Oct 2000 20:03:12 -0600 (CST)
From: Ryan Thompson
To: Matt Dillon
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Filesystem holes
In-Reply-To:
Message-ID:
Organization: SaskNow Technologies [www.sasknow.com]
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Ryan Thompson wrote to Matt Dillon:

> Matt Dillon wrote to Ryan Thompson:
>
> > :> :> storage is rather inefficient for our table of about 2,850,000 members
> > :> :> (~2.1 GB total storage).  There are 64M possible hash values in our
> > :> :> current implementation, and our record size is variable, but could be
> > :> :> safely fixed at about 1.5KB... So, total storage if all values were used
> > :> :> would be about 96GB.  (See where I'm going with this?)
> > :...
> > :
> > :Remember, though, I'm not proposing a hash table.  I have been lucky
> > :enough to design a "hash" function that will crunch down all possible
> > :values into about 64M unique keys--so, no collisions.  I propose to use
> > :this function with address calculation.  So, O(1) everything, for single
> > :random lookups, which is (virtually) all this data structure must deal
> > :with.  Sorting isn't necessary, and that could be accomplished in O(n)
> > :anyway.
> >
> >     Are you still talking about creating a 96GB file to manage 2.1GB worth
> >     of data?  I gotta say, that sounds kinda nuts to me!  CPU cycles are
> >     cheap; I/O and memory are not.
>
> Hi, Matt!  Thanks for the replies.  I'll try and keep you interested ;-)
>
> Hmm... Perhaps you're still missing my original point?  I'm talking about
> a file with 96GB in addressable bytes (well, probably a bunch of files,
> given logical filesize limitations, but let's say for simplicity's sake
> that we have a single file).  Its actual "size" (in terms of allocated
> blocks) will be only a bit larger than 2.1GB.  (Directly proportional to
> the used size of the dataset.  Discrepancies only come into play when
> record size != block size, but that can be worked around somewhat.)
>
> In other words, ls -ls will report the "size" as some ridiculously large
> number, but will show a much smaller block count.  So, assuming four records
> are added to the file on block boundaries, the file will actually only use
> four blocks... nowhere near 96GB!
>
> In the UNIX filesystem (ya, I know.. just pick one :-), size of file !=
> space allocated for file.  Thus, my original questions were centered
> around filesystem holes.  I.e., non-allocated chunks in the middle of a
> file.  When trying to READ from within a hole, the kernel just sends back
> a buffer of zeros... which is enough to show that the record is not
> initialized.

If you prefer to read system documentation instead of me, see lseek(2) :-)

     The lseek() function allows the file offset to be set beyond the end
     of the existing end-of-file of the file.  If data is later written at
     this point, subsequent reads of the data in the gap return bytes of
     zeros (until data is actually written into the gap).

I suppose gap == hole.  Silly semantics. :-)
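To make the hole behaviour concrete, here is a minimal stand-alone C
sketch (my illustration only; error checking omitted, and the filename
is arbitrary) that writes a single byte far past EOF and then checks
what fstat(2) and a read from the gap report:

    /*
     * sparse.c -- poke one byte out at the 1GB mark, then compare
     * st_size with st_blocks and read zeros back from the hole.
     */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        char buf[512];
        struct stat sb;
        int fd = open("sparse.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

        /* One byte at the 1GB mark; everything before it is a hole. */
        lseek(fd, (off_t)1 << 30, SEEK_SET);
        write(fd, "\n", 1);

        fstat(fd, &sb);
        printf("size = %lld bytes, allocated = %lld 512-byte blocks\n",
            (long long)sb.st_size, (long long)sb.st_blocks);

        /* A read from the middle of the gap returns zeros; the kernel
         * doesn't go to disk for unallocated blocks. */
        pread(fd, buf, sizeof(buf), (off_t)1 << 20);
        printf("byte read from the gap = %d\n", buf[0]);

        close(fd);
        return (0);
    }

Running ls -ls on sparse.dat afterwards tells the same story: a huge
"size", a handful of allocated blocks.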
> Actually, something like an "exists" function for a record
> wouldn't touch the disk at all!  When writing to a hole, the kernel simply
> allocates the necessary block(s).  This is really fast, too, for creation,
> as the empty set can be written to disk with touch(1), and uses far less
> memory than virtual initialization or memory structures ;-)
>
> As an example, try
>
>	fseek(f, 0x7FFFFFFE, SEEK_SET);	/* just shy of 2GB out */
>	fputc('\n', f);
>
> And analyze the reported filesize, as well as the reported block count of
> the file.  You should see a 2GB "size", and maybe 8K in allocated blocks
> (depends how you ran newfs ;-).  This is behaviour that has been around
> since the original UFS.
>
> >     Take the BTree example again -- if you think about it, all internal nodes
> >     in the tree will fit into memory trivially.  It is only the leaf nodes
> >     that do not.  So in regards to actual disk accesses you still wind up
> >     only doing *ONE* for the btree, even if it winds up being four or five
> >     levels deep.
>
> Right.  And, actually, (without looking a bit more closely), I wouldn't be
> surprised if you could replace the four-line address calculation I have
> with your B+Tree structure and come up with the same result.  Only
> difference would be a few hundred lines of code, much more testing, and
> quite a few megs of RAM... ;-)
>
> What you referred to as "nuts", above, is just a logical way to provide a
> huge address space for a set of data, without actually allocating blocks
> in the filesystem for the entire address space until they are used.
>
> >     The result is that you will be able to create an index to your data,
> >     which is only 2.8 million records, in around 32 bytes per record or
> >     89 Megabytes.  Not 96GB.  Since you can cache all the internal nodes
> >     of the btree you are done.  The machine will do a much better job
> >     caching an 89MB index than a 96GB index.
> >
> >     Do not make the mistake of equating CPU to I/O.  The CPU required to
> >     iterate through 4 levels in the btree (assuming 64 entries per node,
> >     it only takes 4 to cover 16 million records) is *ZERO* ... as in,
> >     probably less than a microsecond, whereas accessing the disk is going
> >     to be milliseconds.
>
> CPU time for what I'm proposing is even closer to zero than for a tree...
> But, you're right, it doesn't make any real difference when compared to
> disk I/O...  B-Trees are good for a lot of things.  Address calculation
> can be really good, too, given a finite key set, and a way to represent
> that finite key set without wasting space.
>
> > > > A B+Tree will also scale with the size of the dataset being managed,
> > > > so you do not have to preallocate or prereserve file space.
> > >
> > > So will address calculation + filesystem holes, and
> > > sufficiently large filesizes :-)
> >
> >     Uh, not with a 50:1 file size factor it won't.
>
> See my above discussion on file size != allocated blocks.
>
> And, address calculation implies that no index need be created or
> maintained (in memory, or on disk).  Memory utilization is practically
> nil, disk utilization is proportional to the actual data size, and all
> record operations are done with one seek (O(1)).
>
> Perhaps performance could be improved by keeping a small memory cache, but
> that is tangential to the actual data structure, and will increase
> performance by percentages, not orders of magnitude.
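Putting the pieces together, here is a minimal sketch of the
address-calculation scheme itself.  (Again, an illustration only:
hash_key() is just a stand-in for the collision-free 64M-key function
described above, which isn't shown here, and error handling is omitted.)

    /*
     * Sketch: O(1) record operations via address calculation over a
     * sparse data file.  The caller open(2)s the file once with
     * O_RDWR | O_CREAT and passes the descriptor in.
     */
    #include <sys/types.h>
    #include <string.h>
    #include <unistd.h>

    #define RECSIZE 1536                /* fixed record size, ~1.5KB */

    extern unsigned long hash_key(const char *member); /* 0 .. 64M-1 */

    /* The record's offset is computed directly from its key; no index
     * is consulted or maintained, in memory or on disk. */
    static off_t
    rec_offset(const char *member)
    {
        return ((off_t)hash_key(member) * RECSIZE);
    }

    static ssize_t
    rec_fetch(int fd, const char *member, char *buf)
    {
        return (pread(fd, buf, RECSIZE, rec_offset(member)));
    }

    static ssize_t
    rec_store(int fd, const char *member, const char *buf)
    {
        /* Writing into a hole is what allocates the blocks. */
        return (pwrite(fd, buf, RECSIZE, rec_offset(member)));
    }

    /* A record that was never written reads back as all zeros (the
     * kernel fills the hole), so "not initialized" is easy to spot. */
    static int
    rec_exists(int fd, const char *member)
    {
        char buf[RECSIZE], zero[RECSIZE];

        memset(zero, 0, sizeof(zero));
        if (rec_fetch(fd, member, buf) != RECSIZE)
            return (0);
        return (memcmp(buf, zero, sizeof(zero)) != 0);
    }

Since off_t is 64 bits on FreeBSD, the offset arithmetic is safe out to
the full 96GB address space (whether that lives in one file or several,
per the filesize limitations mentioned above), while the blocks actually
allocated stay proportional to the records written.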
> >
> > 						-Matt
>
> - Ryan
>

--
Ryan Thompson
Network Administrator, Accounts
Phone: +1 (306) 664-1161

SaskNow Technologies      http://www.sasknow.com
#106-380 3120 8th St E    Saskatoon, SK  S7H 0W2


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message