Date:      Sun, 29 Oct 2000 16:04:14 -0600 (CST)
From:      Ryan Thompson <ryan@sasknow.com>
To:        Matt Dillon <dillon@earth.backplane.com>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: Filesystem holes
Message-ID:  <Pine.BSF.4.21.0010291350010.27883-100000@ren.sasknow.com>
In-Reply-To: <200010291627.e9TGRMP70398@earth.backplane.com>

Matt Dillon wrote to Ryan Thompson and freebsd-hackers@FreeBSD.ORG:

> :> Hi all...
> :> 
> :> One of the tasks that I have undertaken lately is to improve the efficiency
> :> of a couple of storage facilities we use internally, here.  Basically,
> :> they are moderate-size tables currently implemented in SQL, which is OK in
> :> terms of performance, but the hash function is breaking down and the
> :> storage is rather inefficient for our table of about 2,850,000 members
> :> (~2.1 GB total storage).  There are 64M possible hash values in our
> :> current implementation, and our record size is variable, but could be
> :> safely fixed at about 1.5KB... So, total storage if all values were used
> :> would be about 96GB.  (See where I'm going with this?)
> :> 
> :> One of the options I am considering is actually using address calculation,
> :> and taking advantage of filesystem holes, to keep storage down to what is
> :> actually being used, while providing instant lookups.
> :> 
> :> The single file would be about 96G addressable bytes... But the actual
> :> block count would be much lower.  I suppose I will have to create a series
> :> of these files and divide the problem into < 4GB chunks, but one
> :> lookup/store will still be done with one calculation and one disk seek
> :> (just more filehandles open).
> :> 
> :> Deletes seem problematic.  My question is, does the operating system
> :> provide any way to free blocks used in the middle of a file?
> :> 
> :> Must I periodically re-create these files (long and slow process, but not
> :> so bad if done infrequently) to reclaim space, or is there a way to free
> :> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem?
> :> :-)
> :> 
> :> - Ryan
> :> 
> :> -- 
> :>   Ryan Thompson <ryan@sasknow.com>
> :>   Network Administrator, Accounts
> :>   Phone: +1 (306) 664-1161
> :> 
> :>   SaskNow Technologies     http://www.sasknow.com
> :>   #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2
> 
>     I would strongly recommend using a B+Tree instead of a hash table.  With
>     a hash table slightly different lookups will seek all over the place.
>     With a B+Tree lookups will stay more localized.

Right... That's a good point, but (and, no, I really didn't mention this
in my original post), "sequential" access is never important with the
system I have described.  Lookups are always random, and they are almost
always done one at a time (multiple lookups only done for data swaps...
very rare, like 1/1e7 accesses... and those are still O(1) (well,
O(2), but who's counting ;-)))...


>     For example, if you insert a (nearly) sorted dictionary of words into an
>     SQL table with a B+Tree, the memory working set required to insert
>     efficiently stays constant whether you are inserting a thousand, a million,
>     or a billion records.  That is, the memory requirement is effectively
>     O(LogN) for a disk requirement of O(N).   With a hash table, the memory
>     working set required to insert efficiently is approximately O(N) for a disk
>     requirement of O(N)... much much worse.

Remember, though, I'm not proposing a hash table.  I have been lucky
enough to design a "hash" function that will crunch down all possible
values into about 64M unique keys--so, no collisions.  I propose to use
this function with address calculation.  So, O(1) everything, for single
random lookups, which is (virtually) all this data structure must deal
with.  Sorting isn't necessary, and that could be accomplished in O(n)
anyway.
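
Concretely, a lookup here boils down to one offset calculation and one
seek.  A rough sketch (slot_of() below is only a stand-in for the real
key-cruncher, and the chunking constants are just illustrative):

#include <stdio.h>

#define RECORD_SIZE	1536		/* records fixed at ~1.5K */
#define SLOTS_PER_FILE	(1L << 20)	/* 1M slots/file => ~1.5GB/chunk, < 4GB */

/* Stand-in for the real key-cruncher, which maps every possible key
 * onto one of ~64M slots with no collisions.  This dummy exists only
 * so the sketch compiles. */
static long
slot_of(const char *key)
{
	long h = 0;

	while (*key)
		h = h * 31 + *key++;
	return h & 0x03ffffff;		/* 0 .. 64M-1 */
}

/* chunks[] holds one open FILE* per <4GB chunk of the 96GB address
 * space; rec must point at RECORD_SIZE bytes. */
int
fetch(FILE *chunks[], const char *key, char *rec)
{
	long	slot = slot_of(key);
	FILE	*f = chunks[slot / SLOTS_PER_FILE];
	long	offset = (slot % SLOTS_PER_FILE) * RECORD_SIZE;

	if (fseek(f, offset, SEEK_SET) != 0)
		return -1;
	/* A slot that was never written lies in a hole (or past EOF) and
	 * reads back as zeroes / nothing, i.e. "no such record". */
	return (fread(rec, RECORD_SIZE, 1, f) == 1) ? 0 : -1;
}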


>     A B+Tree will also scale with the size of the dataset being managed,
>     so you do not have to preallocate or prereserve file space.

So will address calculation + filesystem holes, and sufficiently large
filesizes :-)

#include <stdio.h>

int main()
{
	FILE	*f;

	f = fopen("bigfile", "w");
	if (f == NULL)
		return 1;
	/* Seek to byte 2^31-1 and write a single byte: everything before
	 * it is a hole, so almost no disk blocks are actually allocated. */
	fseek(f, 0x7fffffff, SEEK_SET);
	putc('\n', f);
	fclose(f);
	return 0;
}

$ cc -o hole hole.c
$ ./hole
$ ls -lsk bigfile
48 -rw-rw-r--  1 ryan  2147483648 Oct 29 14:09 bigfile
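
The same thing is visible from C, for what it's worth: stat() reports
the full 2GB st_size but only a tiny st_blocks (counted in 512-byte
units).  Quick sketch:

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>

int main()
{
	struct stat sb;

	if (stat("bigfile", &sb) == -1)
		return 1;
	/* st_blocks counts 512-byte blocks actually allocated; for the
	 * sparse "bigfile" it is far smaller than st_size / 512. */
	printf("%.0f bytes logical, %ld blocks (512b) allocated\n",
	    (double)sb.st_size, (long)sb.st_blocks);
	return 0;
}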


>     We are using an in-house SQL database for our product (which I can't
>     really talk about right now) and, using B+Tree's, I can consistently
>     insert 300 records/sec on a cheap desktop PC (which I use for testing)
>     with 64MB of ram (remember, insert requires an internal select to check
>     for key conflicts), even when the test database grows to several
>     gigabytes.

I haven't even begun implementation, and I haven't had a chance to greatly
experiment... So I don't know how this will perform.  It would be directly
dependent on disk seeks, though...

Well... We could try a simple test... Add the following to hole.c:

char	buf[1024];		/* 1K record structure */
int	i;			/* (rand() also wants <stdlib.h>) */

...

for (i = 0; i < 8192; i++)	/* Insert 8192 records */
{
	/* rand()/1024*1024 rounds down to a random 1K-aligned offset
	 * somewhere in the first ~2GB -- crude random access simulation */
	fseek(f, rand()/1024*1024, SEEK_SET);
	fwrite(buf, sizeof(buf), 1, f);
}

$ time ./hole

real	0m25.436s
user	0m0.180s
sys	0m4.912s

So, about 320 records/sec (8192 records in ~25.4 seconds) on the
following test machine:
	Intel Pentium 200MHz, 128MB RAM, ASUS on-board IDE controller,
	8.4GB 5400RPM Western Digital IDE HDD.  Running FreeBSD 3.4,
	no softupdates.  System typically runs with loads around 1.5-2.0.
	
The machine isn't exactly a speed demon.  Try this on a fast SCSI RAID
system ;-)

If the record accesses follow a sequential pattern, of course, this
increases to about 3500/sec....
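
(That number is just the same loop with the random offset replaced by
an in-order one, i.e. roughly:)

for (i = 0; i < 8192; i++)	/* same 8192 records, written in offset order */
{
	fseek(f, (long)i * 1024, SEEK_SET);
	fwrite(buf, sizeof(buf), 1, f);
}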

Another problem is the filesystem block size.  On the partition tested,
32K is allocated for each record.  That partition is, however, reasonably
fast compared with another filesystem on the same physical disk, tested
with the same program: there, only 8K is allocated per record (assuming
records are not "close enough" to land in the same 8K block), but the
speed drops below 275 records/sec.  I suppose that's still quite good,
considering the slow speed of these disks.

Admittedly, this is a really naive algorithm...  It could probably be
optimized a bit further (perhaps with a more expensive address-calculator
;-) to take advantage of blocksize and record location.  Really, though,
the accesses ARE pretty much random in the problem set (since this is also
a multi-process problem; multiple children are all going to be doing their
own separate thing)... So I've gone for fast single record lookups.

I suppose the allocation would fit better if I used larger record sizes
(for this problem, they could be fixed at about 1.5K)....  Even so,
storage is still a linear function of n, and allocation will get tighter
as the number of records grows (rough numbers below).
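
Rough worst-case arithmetic, assuming every stored record ends up in a
filesystem block of its own:

	2,850,000 records x 32K/block (one block per record)  ~=  93 GB
	2,850,000 records x  8K/block (one block per record)  ~=  23 GB
	2,850,000 records x 1.5K      (perfect packing)       ~= 4.4 GB

The gap between those worst cases and the ~4.4GB of real data is what
closes up as neighbouring slots fill in.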

- Ryan

-- 
  Ryan Thompson <ryan@sasknow.com>
  Network Administrator, Accounts
  Phone: +1 (306) 664-1161

  SaskNow Technologies     http://www.sasknow.com
  #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2


