From owner-freebsd-hackers  Sun Oct 29 14: 0:38 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from ren.sasknow.com (ren.sasknow.com [207.195.92.131])
	by hub.freebsd.org (Postfix) with ESMTP id 082F137B479
	for ; Sun, 29 Oct 2000 14:00:34 -0800 (PST)
Received: from localhost (ryan@localhost)
	by ren.sasknow.com (8.9.3/8.9.3) with ESMTP id QAA62388;
	Sun, 29 Oct 2000 16:04:14 -0600 (CST)
	(envelope-from ryan@sasknow.com)
Date: Sun, 29 Oct 2000 16:04:14 -0600 (CST)
From: Ryan Thompson
To: Matt Dillon
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Filesystem holes
In-Reply-To: <200010291627.e9TGRMP70398@earth.backplane.com>
Message-ID:
Organization: SaskNow Technologies [www.sasknow.com]
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Matt Dillon wrote to Ryan Thompson and freebsd-hackers@FreeBSD.ORG:

> :> Hi all...
> :>
> :> One of the tasks that I have undertaken lately is to improve the
> :> efficiency of a couple of storage facilities we use internally, here.
> :> Basically, they are moderate-size tables currently implemented in SQL,
> :> which is OK in terms of performance, but the hash function is breaking
> :> down and the storage is rather inefficient for our table of about
> :> 2,850,000 members (~2.1 GB total storage).  There are 64M possible
> :> hash values in our current implementation, and our record size is
> :> variable, but could be safely fixed at about 1.5KB...  So, total
> :> storage if all values were used would be about 96GB.  (See where I'm
> :> going with this?)
> :>
> :> One of the options I am considering is actually using address
> :> calculation, and taking advantage of filesystem holes, to keep storage
> :> down to what is actually being used, while providing instant lookups.
> :>
> :> The single file would be about 96G addressable bytes...  But the
> :> actual block count would be much lower.  I suppose I will have to
> :> create a series of these files and divide the problem into < 4GB
> :> chunks, but one lookup/store will still be done with one calculation
> :> and one disk seek (just more filehandles open).
> :>
> :> Deletes seem problematic.  My question is, does the operating system
> :> provide any way to free blocks used in the middle of a file?
> :>
> :> Must I periodically re-create these files (long and slow process, but
> :> not so bad if done infrequently) to reclaim space, or is there a way
> :> to free arbitrary blocks in a file in userland (WITHOUT trashing the
> :> filesystem? :-)
> :>
> :> - Ryan
> :>
> :> --
> :>   Ryan Thompson
> :>   Network Administrator, Accounts
> :>   Phone: +1 (306) 664-1161
> :>
> :>   SaskNow Technologies     http://www.sasknow.com
> :>   #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2
>
>     I would strongly recommend using a B+Tree instead of a hash table.
>     With a hash table slightly different lookups will seek all over the
>     place.  With a B+Tree lookups will stay more localized.

Right...  That's a good point, but (and, no, I really didn't mention this
in my original post) "sequential" access is never important with the
system I have described.  Lookups are always random, and they are almost
always done one at a time (multiple lookups are only done for data
swaps... very rare, like 1/1e7 accesses... and those are still O(1)
(well, O(2), but who's counting ;-)))...
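To make the address-calculation idea a little more concrete, here is a
minimal, untested sketch of the lookup/store path I have in mind.  The
names and numbers are placeholders, not real code: key_to_slot() just
stands in for the collision-free 64M-value hash described above, and
only a single < 2GB chunk of the full 96GB address space is shown.

/*
 * addrcalc.c -- rough sketch only.  Fixed 1.5K records, one calculation
 * plus one seek per store/lookup; slots that are never written stay as
 * filesystem holes, so disk usage tracks the number of records rather
 * than the addressable size of the file.
 */
#include <stdio.h>

#define RECSIZE 1536L                     /* fixed 1.5K record      */
#define NSLOTS  (0x7fffffffL / RECSIZE)   /* slots in one 2GB chunk */

/* Placeholder only; the real function maps every possible key to one
 * of 64M unique values with no collisions. */
static long
key_to_slot(const char *key)
{
    unsigned long h = 0;

    while (*key)
        h = h * 31 + (unsigned char)*key++;
    return (long)(h % NSLOTS);
}

static int
store(FILE *f, const char *key, const char *rec)
{
    if (fseek(f, key_to_slot(key) * RECSIZE, SEEK_SET) != 0)
        return -1;
    return fwrite(rec, RECSIZE, 1, f) == 1 ? 0 : -1;
}

static int
lookup(FILE *f, const char *key, char *rec)
{
    if (fseek(f, key_to_slot(key) * RECSIZE, SEEK_SET) != 0)
        return -1;
    return fread(rec, RECSIZE, 1, f) == 1 ? 0 : -1;
}

int
main(void)
{
    FILE *f = fopen("table.dat", "w+");
    char rec[RECSIZE] = "some record data";

    if (f == NULL)
        return 1;
    store(f, "some key", rec);
    lookup(f, "some key", rec);
    printf("%s\n", rec);
    fclose(f);

    return 0;
}

In the real thing the 96GB space would simply be split across several
such files, as I described, with the high bits of the slot number
selecting the filehandle.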
>     For example, if you insert a (nearly) sorted dictionary of words
>     into an SQL table with a B+Tree, the memory working set required to
>     insert efficiently stays constant whether you are inserting a
>     thousand, a million, or a billion records.  That is, the memory
>     requirement is effectively O(LogN) for a disk requirement of O(N).
>     With a hash table, the memory working set required to insert
>     efficiently is approximately O(N) for a disk requirement of O(N)...
>     much much worse.

Remember, though, I'm not proposing a hash table.  I have been lucky
enough to design a "hash" function that will crunch down all possible
values into about 64M unique keys--so, no collisions.  I propose to use
this function with address calculation.  So, O(1) everything, for single
random lookups, which is (virtually) all this data structure must deal
with.  Sorting isn't necessary, and that could be accomplished in O(n)
anyway.

>     A B+Tree will also scale with the size of the dataset being
>     managed, so you do not have to preallocate or prereserve file
>     space.

So will address calculation + filesystem holes, and sufficiently large
filesizes :-)

#include <stdio.h>

int main()
{
    FILE *f;

    f = fopen("bigfile", "w");
    fseek(f, 0x7fffffff, SEEK_SET);
    putc('\n', f);
    fclose(f);

    return 0;
}

$ cc -o hole hole.c
$ ./hole
$ ls -lsk bigfile
48 -rw-rw-r--  1 ryan  2147483648 Oct 29 14:09 bigfile

>     We are using an in-house SQL database for our product (which I
>     can't really talk about right now) and, using B+Tree's, I can
>     consistently insert 300 records/sec on a cheap desktop PC (which I
>     use for testing) with 64MB of ram (remember, insert requires an
>     internal select to check for key conflicts), even when the test
>     database grows to several gigabytes.

I haven't even begun implementation, and I haven't had a chance to
greatly experiment...  So I don't know how this will perform.  It would
be directly dependent on disk seeks, though...  Well...  We could try a
simple test...  Add the following to hole.c:

    char buf[1024];             /* 1K structure */
    int i;

    ...

    for (i = 0; i < 8192; i++)  /* Insert 8192 records */
    {
        fseek(f, rand()/1024*1024, SEEK_SET);  /* Random access simulation */
        fwrite(buf, sizeof(buf), 1, f);
    }

$ time ./hole

real    0m25.436s
user    0m0.180s
sys     0m4.912s

So, about 320 records/sec on the following test machine:

    Intel Pentium 200MHz, 128MB RAM, ASUS on-board IDE controller, 8.4GB
    5400RPM Western Digital IDE HDD.  Running FreeBSD 3.4, no
    softupdates.  System typically runs with loads around 1.5-2.0.

The machine isn't exactly a speed demon.  Try this on a fast SCSI RAID
system ;-)  If the record accesses follow a sequential pattern, of
course, this increases to about 3500/sec....

Another problem is the filesystem block size.  On the partition tested,
32K is allocated for each record.  This is, however, reasonably fast
when compared with another filesystem on the same physical disk, using
the same test program, where only 8K is allocated for each record
(assuming records are not "close enough" to be in the same 8K block).
On that filesystem, the speed drops below 275 records/sec...  I suppose
that's still quite good considering the slow speed of these disks.

Admittedly, this is a really naive algorithm...  It could probably be
optimized a bit further (perhaps with a more expensive
address-calculator ;-) to take advantage of blocksize and record
location; a rough sketch of that idea follows below.  Really, though,
the accesses ARE pretty much random in the problem set (since this is
also a multi-process problem; multiple children are all going to be
doing their own separate thing)...
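For what it's worth, here is a minimal, untested sketch of one such
address calculator, under assumptions of my own (8K filesystem blocks,
fixed 1.5K records, and slot numbers coming from the same collision-free
hash): pack five records per 8K block so that no record ever straddles
a block boundary.

/*
 * blockalign.c -- untested sketch only, not something I've measured.
 * Assumes 8K blocks and fixed 1.5K records; trades ~512 wasted bytes
 * per full block for the guarantee that any one record touches exactly
 * one filesystem block.
 */
#include <stdio.h>

#define BLKSIZE 8192L                   /* filesystem block size */
#define RECSIZE 1536L                   /* fixed 1.5K record     */
#define PERBLK  (BLKSIZE / RECSIZE)     /* 5 records per block   */

/* Map a slot number to a byte offset that never crosses a block. */
static long
slot_to_offset(long slot)
{
    return (slot / PERBLK) * BLKSIZE + (slot % PERBLK) * RECSIZE;
}

int
main(void)
{
    long slot;

    /* Show where the first dozen slots land. */
    for (slot = 0; slot < 12; slot++)
        printf("slot %2ld -> offset %6ld (block %ld)\n",
            slot, slot_to_offset(slot), slot_to_offset(slot) / BLKSIZE);

    return 0;
}

Each store/lookup is still one calculation and one seek; the only
difference is that a 1.5K record now lives entirely inside a single 8K
block instead of possibly spanning two.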
So I've gone for fast single-record lookups.  I suppose the allocation
would fit better if I used larger record sizes (for this problem, they
could be fixed at about 1.5K)....  Even so, storage is still a linear
function of n, and allocation will get tighter with the number of
records.

- Ryan

--
  Ryan Thompson
  Network Administrator, Accounts
  Phone: +1 (306) 664-1161

  SaskNow Technologies     http://www.sasknow.com
  #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message