Date: Thu, 09 Jun 2005 12:11:15 -0700 From: Bakul Shah <bakul@BitBlocks.com> To: des@des.no (=?iso-8859-1?q?Dag-Erling_Sm=F8rgrav?=) Cc: freebsd-fs@freebsd.org, FreeBSD Current <freebsd-current@freebsd.org> Subject: Re: you are in an fs with millions of small files Message-ID: <200506091911.j59JBGQn096385@gate.bitblocks.com> In-Reply-To: Your message of "Wed, 08 Jun 2005 09:50:44 %2B0200." <86zmu1l223.fsf@xps.des.no>
next in thread | previous in thread | raw e-mail | index | archive | help
> Giorgos Keramidas <keramida@freebsd.org> writes: > > Is there a better way to sort a linked list (not necessarily a > > singly-linked list, like the one fts_link is used for). > > Don't build a linked list to begin with. The comparison function is > known at the time the directory entries are read, so it should be a > simple matter to read them into a red-black tree instead of a singly- > linked list. I'm working on a patch. Unless the tree needs to grow dynamically, a red-black tree won't buy you anything. It is faster to use mergesort on an array than a red black tree implementation. They have the same O n log n complexity but per item overhead is higher in r-b perhaps due to more mallocs. Even a skewheap will be faster than r-b since you don't need to search, but just return items in sort order. Start with a small array. When it gets full double it (or grow by half if you want to waste less memory). In my tests, with input file constructed from multiple copies of /usr/share/dict/words rather than a directory, the mergesort+array was 3 times faster than red-black for upto at least 250K entries. At about 10M entries and 100MB of data, mergesort+array was still about twice as fast and used less memory. Using mmap the performance doubled and memory use went down! Benchmark program (but not with the mmap code) attached:-) % wc z 10000000 10000000 105733020 z % time a.out redblack < z 15.445u 1.314s 0:16.76 99.9% 10+320297k 0+0io 0pf+0w % time a.out array < z 9.524u 0.691s 0:10.21 100.0% 10+240289k 0+0io 0pf+0w % time a.out -m array < z 4.632u 0.653s 0:05.28 100.0% 10+151192k 0+0io 0pf+0w -- bakul [source compacted to fit on one page] #include <stdio.h> #include <string.h> #include <sys/tree.h> int debug; #define strndup(s,n) strncpy(malloc(n+1), s, n) int compare(const void*a, const void* b) { return strcmp(*(char**)a, *(char**)b); } do_ar() { char* line; size_t len; int c = 0, i; int s = 16; char** a = (char**)malloc(s*sizeof(char*)); if (debug) fprintf(stderr, "Using array+mergesort\n"); while (line = fgetln(stdin, &len)) { if (c >= s) a = (char**)realloc(a, (s *= 2)*sizeof(char*)); a[c++] = strndup(line, len-1); } mergesort(a, c, sizeof(char*), compare); if (debug) for (i = 0; i < c; i++) printf("%s\n", a[i]); } typedef struct node { char *name; RB_ENTRY(node) links; } Node; int cmp(Node* a, Node* b) { return strcmp(a->name, b->name); } RB_HEAD(rb_tree, node); RB_GENERATE(rb_tree, node, links, cmp); do_rb() { char* line; size_t len; struct rb_tree rb; Node* n; if (debug) fprintf(stderr, "Using red-back tree\n"); RB_INIT(&rb); while (line = fgetln(stdin, &len)) { struct node* node = malloc(sizeof *node); node->name = strndup(line, len-1); RB_INSERT(rb_tree, &rb, node); } if (debug) RB_FOREACH(n, rb_tree, &rb) printf("%s\n", n->name); } int main(int c, char** v) { if (c > 1 && strcmp(v[1], "-d") == 0) { debug = 1; c--; v++; } if (c > 1 && v[1][0] == 'r') do_rb(); else do_ar(); }
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200506091911.j59JBGQn096385>