Date:      Sun, 8 Sep 1996 19:42:00 -0400 (EDT)
From:      Bill Paul <wpaul@skynet.ctr.columbia.edu>
To:        hackers@freebsd.org
Subject:   First cut of libnisdb available
Message-ID:  <199609082342.TAA02851@skynet.ctr.columbia.edu>

Well, after a couple weeks of sleepless nights, I think I've got it.
I've uploaded copies of nisdb.tar.gz to 

ftp.ctr.columbia.edu:/pub/misc/freebsd
and
skynet.ctr.columbia.edu:/pub/freebsd/nis

I'm also going to put a copy in ~wpaul on freefall. This distribution
includes the code for libnisdb, a man page, and library binaries for
FreeBSD (static, shared, profiled and pic archive).

Following is a description of how the library evolved to this point,
how it works and roughly how it compares to the Solaris implementation.

I expect to start work on bootstrapping rpc.nisd and the client-side
library. Whee.

When last we left our hero...
-----------------------------

Last time I said that I had a working prototype libnisdb library 
that was based on Berkeley DB. Well, a couple of things have led me
to abandon the use of Berkeley DB as a back end, not the least of
which was speed (and lack thereof). Once I had the library completed,
I started running tests with the same ridiculously large sample passwd 
database that I normally use with ypserv. (This is a snapshot of the 
Columbia CUNIX cluster's /etc/passwd file containing over 30,000 
entries.) I found that stuffing the entire file into a single table
with two searchable columns (name and uid) took well over seven minutes. 
(This is on my home machine, a 486DX2/66 with 20MB of RAM and FreeBSD 2.1.0.)
This was quite a bit slower than I expected. Furthermore, the resulting 
table file was over 11MB in size (the original flat ASCII file was a
bit over 2MB). By contrast, the same table created with the Sun
library was only 9MB.

When I attempted to run a head to head test on the Solaris machine,
I discovered another disadvantage to using Berkeley DB: test programs
that worked perfectly well on my home system would die mysteriously
with SIGBUSes on the SPARC, due to what turned out to be memory alignment 
problems. Berkeley DB does not guarantee that the data it returns will be
aligned in any particular way; my scheme, however, assumed otherwise. This
just happened to work on my home system due to the behavior of FreeBSD's 
malloc(3). On Solaris/SPARC, if you need to guarantee alignment, you must 
use memalign(). The Solaris port of Berkeley DB doesn't do this, so
hijinks ensued.

Ultimately I was able to work around this by copying critical data
into aligned buffers. This allowed my test programs to work, but the 
results only showed what I expected: the Sun library was considerably 
faster than mine. (I also discovered that the SPARC 10 was considerably
faster than my 486, but that's another matter. :) Retrievals tended
to take longer, and the 'create a ridiculously large table in one
shot' test took nearly twice as long. The Sun library was able to
create the table in about 50 seconds, whereas mine took almost 2
minutes. (This may not seem like a tremendous difference, but it
was more than enough to make me cringe.)

"My lord, I have a cunning plan!"
---------------------------------

This led me to rethink my scheme a little. It seemed as though I was
doing more I/O than the Sun library, and this was what was slowing
me down. My original idea was this:

An NIS+ entry is specified as a structure containing a 'type' field
and a series of 'column' structures. For a passwd type table, there
would be seven columns, the first being the username, the second
being the password, the third the UID, and so on. For a passwd table,
only the name and uid columns would be marked as searchable, so when
a query is made, it can only be in terms of these two columns. (I can
ask for all users where name=wpaul or uid=1234, but not where gid=3
or shell=/bin/false, since I never told the database I wanted to
search by these columns.)

When an entry is added, we do the following things (a rough C sketch
follows the list):

- The entry itself is assigned a 32-bit 'tag,' which is basically just
  a shorthand way of referencing it. Each entry has a unique tag, which
  is generated by passing the entry through a hash function. (The main
  reason for using tags is to make comparing entries and making lists 
  of entries easier: it is much simpler to compare two 32-bit quantities
  than it is to do string compares on all the various parts of an
  entry.)

- The entry is stored in a Berkeley DB database using the 32-bit
  'tag' as a key. If we know the tag for a given entry, we can then read
  it back from the database.

- We construct 'relations' from the searchable columns and assign the
  entry's tag to them. For example, if the entry has name=wpaul and
  uid=1063, then we would store two additional records in the DB
  database: one with a key of 'name=wpaul' (literally, like that)
  and another with a key of 'uid=1063'. The data for both these
  records would be the tag for this entry.
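
To make this concrete, here is a rough sketch of the 'add' path as it
might have looked against the 4.4BSD db(3) interface. The function, the
pre-encoded entry blob and the 'col=value' key string are my own
illustration of the scheme, not the actual library code, and only one
relation is handled:

/*
 * Sketch only: store one entry and one of its relation records using
 * the db(3) (Berkeley DB 1.x) interface.  'blob' is assumed to be the
 * encoded entry and 'relkey' a string such as "uid=1063".
 */
#include <sys/types.h>
#include <string.h>
#include <stdlib.h>
#include <db.h>

static int
store_entry(DB *dbp, u_int32_t tag, void *blob, size_t bloblen,
    const char *relkey)
{
    DBT key, data, old;
    u_char *buf;
    int ret;

    /* Store the entry itself, keyed by its 32-bit tag. */
    key.data = &tag;
    key.size = sizeof(tag);
    data.data = blob;
    data.size = bloblen;
    if ((dbp->put)(dbp, &key, &data, 0) != 0)
        return (-1);

    /* Store (or update) the relation record, e.g. "uid=1063";
     * its data is the list of tags of all matching entries. */
    key.data = (void *)relkey;
    key.size = strlen(relkey);
    if ((dbp->get)(dbp, &key, &old, 0) == 0) {
        /* The relation already exists: append our tag to its list. */
        if ((buf = malloc(old.size + sizeof(tag))) == NULL)
            return (-1);
        memcpy(buf, old.data, old.size);
        memcpy(buf + old.size, &tag, sizeof(tag));
        data.data = buf;
        data.size = old.size + sizeof(tag);
        ret = (dbp->put)(dbp, &key, &data, 0);
        free(buf);
        return (ret);
    }
    data.data = &tag;
    data.size = sizeof(tag);
    return ((dbp->put)(dbp, &key, &data, 0));
}

Each added entry needs one such relation update per searchable column,
which is where all the extra read-modify-write traffic described below
comes from.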

To see how this works, imagine that we do a query where we say: find
me all entries where 'uid=1063' (which would be equivalent to a
getpwuid(1063)). The library would attempt to do a db->get for a
record with a key of 'uid=1063'. Assuming it finds a matching record,
it then sees that the data contains a single 32-bit tag. It then
checks for a record with a key matching this tag, and reads it back.

If we were to add a second entry where 'name=foo' and 'uid=1063,' the 
library then has a little extra work to do: there already exists a record 
with a key of 'uid=1063,' so it needs to update it to include the tag for 
the new entry. Now there are three 'relation' records: 'name=wpaul,' 
'name=foo' and 'uid=1063.' The 'name=wpaul' and 'name=foo' records only 
contain one tag each (for their respective entries), but 'uid=1063' now 
contains two. This means that if we repeat the query we did above, we
would now get back two entries from the database.
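
The lookup path is the mirror image, again sketched in db(3) terms
rather than lifted from the library. Note the copy of the tag list into
a private buffer, which doubles as the alignment workaround mentioned
earlier:

/*
 * Sketch only: fetch the tag list for a "col=value" key, then fetch
 * each matching entry by its tag.  The tag list is copied out first,
 * both for alignment and because a later get() may invalidate the
 * returned pointer.
 */
#include <sys/types.h>
#include <string.h>
#include <stdlib.h>
#include <db.h>

static int
lookup(DB *dbp, const char *relkey)
{
    DBT key, data;
    u_int32_t *tags;
    size_t i, ntags;

    key.data = (void *)relkey;          /* e.g. "uid=1063" */
    key.size = strlen(relkey);
    if ((dbp->get)(dbp, &key, &data, 0) != 0)
        return (0);                     /* no matching entries */

    ntags = data.size / sizeof(u_int32_t);
    if ((tags = malloc(data.size)) == NULL)
        return (-1);
    memcpy(tags, data.data, data.size);

    for (i = 0; i < ntags; i++) {
        key.data = &tags[i];
        key.size = sizeof(u_int32_t);
        if ((dbp->get)(dbp, &key, &data, 0) == 0) {
            /* data.data/data.size now hold the stored entry blob
             * for this tag; decode it as needed. */
        }
    }
    free(tags);
    return ((int)ntags);
}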

Queries can become a little more complicated if you specify two
attributes and both yield multiple 'tags'. For example, let's say
that the 'shell' column is also searchable, and we do a query that
says: find me all entries where 'uid=0' and 'shell=/bin/csh'. The
library would then check the database for records with keys of
'uid=0' and 'shell=/bin/csh'. Let's say that the 'uid=0' record
has 10 tags and the 'shell=/bin/csh' record has 100 tags. This means there
are 10 users with 'uid=0' and 100 with 'shell=/bin/csh'. Clearly,
not all of the users that have /bin/csh as their shell also have
a UID of 0. So what we have to do is find all the tags that appear
in both the 'uid=0' list and the 'shell=/bin/csh' list. Again, this
is where the tags come in handy: you can just do simple comparisons
of 32-bit values rather than a mess of string compares. One problem
is finding a fast way to do the search: right now I'm using a 'brute
force' algorithm that just selects the smallest tag list and does
a search through all the larger lists using the smallest list as
a base. If a tag from the smaller list does not appear in all of
the larger lists, it is thrown out. (After explaining this to a
friend of mine, he said to me: "Bill, I think what you've created
here is just a really efficient bubble sort." I think he may be right;
unfortunately, he wasn't able to provide me with a more effective
alternative. Before you jump forward with what you think is a clever
solution, remember that the NIS+ protocol definition allows for tables
with up to 64 searchable columns, which means you can end up having
to take an intersection of a large number of tag lists.)
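
For what it's worth, the brute-force intersection looks roughly like
this in C. The types and names are mine; only the approach (use the
smallest list as the base and test each of its tags against every other
list) comes from the description above:

#include <sys/types.h>

struct taglist {
    u_int32_t  *tags;
    size_t      count;
};

static int
tag_in_list(u_int32_t tag, const struct taglist *l)
{
    size_t i;

    for (i = 0; i < l->count; i++)
        if (l->tags[i] == tag)
            return (1);
    return (0);
}

/* Intersect 'nlists' lists into 'result'; result->tags must have room
 * for at least as many tags as the smallest input list. */
static void
intersect(struct taglist *lists, size_t nlists, struct taglist *result)
{
    size_t i, j, smallest = 0;

    /* Pick the smallest list as the base. */
    for (i = 1; i < nlists; i++)
        if (lists[i].count < lists[smallest].count)
            smallest = i;

    result->count = 0;
    for (i = 0; i < lists[smallest].count; i++) {
        u_int32_t tag = lists[smallest].tags[i];

        for (j = 0; j < nlists; j++)
            if (j != smallest && !tag_in_list(tag, &lists[j]))
                break;
        if (j == nlists)        /* tag appeared in every list */
            result->tags[result->count++] = tag;
    }
}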

Anyway, the problem with this approach is that it requires writing
at least three DB records for every addition: one to add the actual
entry and one each for every 'relation.' And if there are already
matching relations in the table, we have to do extra work to read
them back, update them with the tag for the new entry, and then save
them back again. This results in a fair amount of I/O and syscall
overhead, which slows things down to a crawl.

If you hash it, they will come.
-------------------------------

I decided to try fixing this by deferring the I/O until the last
minute. One of the reasons I wanted to use Berkeley DB for this was
so that I could (ab)use its internal handling of hash tables (or
B-trees, if using the btree method). Using Berkeley DB meant that
it would do all the dirty work of arranging the relation entries
in an efficient way so that I wouldn't have to. But Berkeley DB
wasn't behaving quite the way I wanted, so I took matters into my
own hands: instead of storing the 'relation' records as individual
Berkeley DB records, I stored them in memory as part of a hash
table.

In fact, there's really a hash table for each searchable column.
To store a relation, we take the data in an entry column and hash
it to a 32-bit value. This is then used as the index to the table.
At the corresponding index location, we attach the list of tags.
So for 'name=wpaul,' we hash 'wpaul' and use the resulting value
as an index into the 'name=' table, and store the entry tag there.
Then we hash '1063' and use that as an index into the 'uid=' table
and store the same entry tag there. If we have another entry where
'uid=1063,' we will hash to the same location, find that there's
already a list there and add the new tag to the list.

Note that while hash values are 32 bits in size, I'm not allocating
an array with ULONG_MAX indexes. What I actually do is allocate
four buckets (I call them pages), each with 256 slots. I then mask
off all but the lower 8 bits of the hash value so that it will map
to one of the slots on the first page. If the corresponding slot is
empty, a new list is created there. If it's occupied, I shift the
hash value by 8 bits and mask again, then use that value as an index
into the next page. If the slot on that page is occupied, I shift
again, and so on, until I run out of pages. If I run out of pages,
then I resort to chaining: the new list is linked to an existing
list on the page with the slot that has the smallest chain count. (I
try to keep the chain counts of each slot the same to even out the 
distribution.)
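
An insert into such a table might look roughly like this. The
structures, the per-slot chain counters and the treatment of a slot
occupied by a different value are all reconstructed from the
description (error handling omitted); the real code surely differs in
the details:

#include <sys/types.h>
#include <stdlib.h>

#define XFF_PAGES   4
#define XFF_SLOTS   256

struct tagnode {                    /* tag list for one hashed value */
    u_int32_t        hash;          /* full 32-bit hash of the value */
    u_int32_t       *tags;
    size_t           ntags;
    struct tagnode  *chain;         /* overflow chaining */
};

struct column_table {               /* one per searchable column */
    struct tagnode  *page[XFF_PAGES][XFF_SLOTS];
    size_t           chaincnt[XFF_PAGES][XFF_SLOTS];
};

static void
add_tag(struct column_table *t, u_int32_t hash, u_int32_t tag)
{
    int p, slot, bestp = 0, bests = hash & 0xff;
    struct tagnode *n;

    for (p = 0; p < XFF_PAGES; p++) {
        slot = (hash >> (8 * p)) & 0xff;
        n = t->page[p][slot];
        if (n == NULL) {
            /* Empty slot: start a new tag list here. */
            n = calloc(1, sizeof(*n));
            n->hash = hash;
            n->tags = malloc(sizeof(tag));
            n->tags[0] = tag;
            n->ntags = 1;
            t->page[p][slot] = n;
            return;
        }
        if (n->hash == hash) {
            /* Same value hashed here before: extend its list. */
            n->tags = realloc(n->tags, (n->ntags + 1) * sizeof(tag));
            n->tags[n->ntags++] = tag;
            return;
        }
        /* Occupied by a different value: remember the probed slot
         * with the shortest overflow chain, then try the next page. */
        if (t->chaincnt[p][slot] < t->chaincnt[bestp][bests]) {
            bestp = p;
            bests = slot;
        }
    }

    /* All pages occupied: chain onto the least-loaded probed slot. */
    n = calloc(1, sizeof(*n));
    n->hash = hash;
    n->tags = malloc(sizeof(tag));
    n->tags[0] = tag;
    n->ntags = 1;
    n->chain = t->page[bestp][bests];
    t->page[bestp][bests] = n;
    t->chaincnt[bestp][bests]++;
}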

What happens now is that when a new entry is added, it is written to
the DB database, but the 'relation' information is kept in memory.
When the database is checkpointed, the relation hash structures are
dumped to the database in one swell foop. When the database is re-opened,
the hash structures are read back into memory again.

Once I had this new scheme working, I ran more tests. There was some
improvement: on my home machine, the 'create a whopping great table
all in one shot' test completed in about 4 minutes as opposed to the
7 minutes it needed before. On the Solaris machine, the time dropped
from 1:50 to 1:15.

"You want how much disk space?!"
--------------------------------

As luck would have it, more complications arose. While Berkeley DB
can handle very large key and data items, it tends to waste disk space
under some conditions. To save the relation data, I was basically
converting the entire structure into an opaque form using the XDR
library and saving it as one huge data item (with an ASCII key called
'relation_array'). For the 30,000 entry passwd table, the total size
of all the relation data, once XDR'ed, was on the order of 2MB. This
meant that 2MB of the total table file was occupied by a single
key/data pair. Now let's say I re-open the database and add one more
entry. The relation data grows ever so slightly so that it is now just
over 2MB in size. Berkeley DB, rather than re-using the old 2MB
chunk of space and expanding it a little, allocates a whole new chunk
of space that is 2MB + <a little extra to cover the new entry> and
adds that on top of the old 2MB chunk, which is now just sitting there
taking up room.

The strategy, I think, is that the old 2MB chunk, which is now too
small to hold the new relation data 'glob,' will be placed on the 'free
list' and partitioned up to hold smaller data items that come along.
This means that as new entries are stored, they may fill in the
vacated space, but each time the relation data is expanded, it will
allocate a whole new 2MB+ chunk of space to hold it.

The result was a tremendous jump in file size every time I added
a new entry to the huge table: the first entry added 2MB to the
file size, the next entry another 2MB, and so on.

This was the last straw. I tried using both the hash and btree methods
and tweaked the various corresponding 'openinfo' parameters until I
was blue in the face: nothing helped. One potential solution that
occurred to me was to write my own file compaction routines, but this
would involve first learning how to grok Berkeley DB files, which I
didn't really want to do.

What I needed to do was create my own file storage system that would
allow the same kind of quick access to arbitrary records afforded by
Berkeley DB but that would not waste space, or which would allow me to 
reclaim space without going insane.

When the going gets weird, the weird turn pro.
----------------------------------------------

The answer to this problem came to me when I remembered something about
the Sun libnisdb library that I'd noticed while I was doing my original
tests. When I went to compile my first test program using the Solaris
library, the linker complained that it could not resolve the symbol
'_db_free_result'. All the other nis_db functions resolved just fine,
but not this one. This confused me a bit since db_free_result() was
clearly documented to exist in both the nis_db(3n) man page and the
<rpcsvc/nis_db.h> header file. To figure out what was happening, I
ran nm(1) on the Sun /usr/lib/libnisdb.a, and sure enough, I saw the
problem: db_free_result() was there, but as a mangled C++ name. I
was compiling my program with gcc. Not knowing much about C++, I
took the quick way out by adding a #define to the test program that
mapped the unmangled name to the mangled one, just to make the linker
happy so I could get on with my life.

Anyway, one of the things I saw in the output from nm(1) was a reference
to xdrstdio_create(). I couldn't figure out why they would need to use
this function, particularly since libnisdb doesn't do any RPCs itself.

Then it hit me: you can use the XDR library to encode or decode
data to or from any kind of stream, be it a socket, a pipe, a tty,
or even a _file_. (For an example of how to do this with an AF_UNIX
socket, see the rpc.yppasswdd(8) source code.)

The XDR File Format (XFF)
-------------------------

From this bit of information, I created the XDR File Format (XFF).
There can be XFF files and XFF log files. A regular XFF file contains
three things (sketched after the list):

- An XDR-encoded XFF header containing the library version, the file 
  type, and offsets to a few other key locations within the file.

- A series of XDR-encoded entries, stored back to back. Since the NIS+
  protocol definition already describes an entry_obj structure for
  holding entries, we use the entry_obj structure and associated
  XDR filter for storing entries in an XFF file.

- An XDR-encoded copy of the relation info for the table. This is
  dumped as one big glob after the last entry in the file.
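
In C terms, the picture is roughly the following. The header fields
aren't spelled out here beyond 'version, file type and a few offsets,'
so the struct below is purely illustrative:

/*
 * Illustrative layout of a regular XFF table file:
 *
 *      xff_header                  (XDR-encoded)
 *      entry_obj #0                <- each entry's tag doubles as its
 *      entry_obj #1                   byte offset within the file
 *      ...
 *      entry_obj #N-1
 *      relation data               (one XDR-encoded glob at the end)
 */
#include <sys/types.h>

struct xff_header {                 /* hypothetical field names */
    u_int32_t   xh_version;         /* library version */
    u_int32_t   xh_type;            /* table file or log file */
    u_int32_t   xh_entry_off;       /* offset of the first entry */
    u_int32_t   xh_relation_off;    /* offset of the relation data */
};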

An XFF log file contains two things (a C sketch of the log record
appears below):

- An XFF header, just like the XFF header for a standard file.

- A series of XDR-encoded log entries. A log entry is stored as a
  log_entry structure and contains an entry_obj structure inside it
  along with an action flag, a status flag, and a set of attributes,
  if needed (to delete an entry, we specify the attributes of the
  entry we want to delete rather than supplying a copy of the entry
  itself).

An XFF log file doesn't have any relation info in it: instead, the
log entries are used to update the relation data in memory.
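
A log record, then, might be declared something like this (the sort of
thing rpcgen would emit). The entry_obj and nis_attr types come from
the NIS+ protocol definition; the field names and action values are my
guesses from the description, not the library's own declarations:

#include <sys/types.h>
#include <rpcsvc/nis.h>     /* entry_obj, nis_attr (header name may vary) */

enum log_action {
    LOG_ADD = 1,            /* the enclosed entry was added */
    LOG_REMOVE = 2          /* an entry matching le_attrs was deleted */
};

struct log_entry {
    u_int32_t    le_action;         /* add or delete */
    u_int32_t    le_status;         /* committed to the table file yet? */
    struct {                        /* attributes identifying the entry */
        u_int        le_attrs_len;  /*   (only used for deletions) */
        nis_attr    *le_attrs_val;
    } le_attrs;
    entry_obj    le_object;         /* the entry itself (for additions) */
};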

With this new scheme, we have to change the nature of the 'tags'
used to identify entries. They're still unsigned 32-bit values, but
now they are computed differently and have special meaning. Each tag
is now not only a unique identifier, it's also a seek pointer into
the XFF file.

To illustrate this, let's say we do a query: find all users where
'uid=0'. Let's also say this yields a list of three tags. This means
we have three entries that satisfy the query. To read back, for example,
the entry associated with tag1, we do this:

- Use fseek() to move the file pointer 'tag1' bytes into the file.

- Use xdrstdio_create() to create an XDR stream from the file stream.

- Read back the entry into an entry_obj structure using the
  xdr_entry_obj() filter.

- Use xdr_destroy() to destroy the XDR stream.

Presto. We don't need to write any file-format-to-structure conversion
routines, since rpcgen and the XDR library take care of all that for us.
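
As a small C function, with the usual hedging: xdr_entry_obj() and
entry_obj come from the rpcgen output for the NIS+ protocol definition,
while the function name and header choice here are just illustrative:

#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <rpc/rpc.h>
#include <rpcsvc/nis.h>     /* entry_obj, xdr_entry_obj (name may vary) */

static bool_t
xff_read_entry(FILE *fp, u_int32_t tag, entry_obj *ep)
{
    XDR xdrs;
    bool_t ok;

    /* The tag doubles as a byte offset into the XFF file. */
    if (fseek(fp, (long)tag, SEEK_SET) != 0)
        return (FALSE);

    xdrstdio_create(&xdrs, fp, XDR_DECODE);
    memset(ep, 0, sizeof(*ep));     /* let the XDR filter allocate storage */
    ok = xdr_entry_obj(&xdrs, ep);
    xdr_destroy(&xdrs);

    return (ok);
}

When the caller is done with the entry, xdr_free() with the same filter
releases whatever storage the decode allocated.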

Initially however, we don't create an XFF file: we create a log. When
we add an entry, we write a log entry to the log file (also using the
XDR library) that says: 'an add operation was performed, and this is
the entry that was added.' Deletions are logged in the same way. (Note
that an update or replace is treated as an add/delete pair.) All we
have is a log file until the user calls db_checkpoint(). At that point,
all the log entries are played back and used to update the real XFF
table file. Once all the log entries are committed, the log is removed.
If new changes are made, the library will create a new log file and
start the process over again.

But wait, there's more
----------------------

This leads to a couple of complications: we may, at a particular time,
have both a committed XFF table file and a log. As far as the user is
concerned, these must both appear as a single cohesive NIS+ table.
The user should not know that the data is split among two files. This
requires a bit of behind-the-scenes sleight of hand, but it is not
that hard to achieve.

The relation tables have to contain tags relating to both the real
file and the log file if we want lookups to work right. The question
is: how can we tell when a tag refers to an offset in the real file,
and how can we tell when it refers to an offset in the log file? The
answer is that when we store a tag relating to the log file, we bias
it by adding to it the total size of the real file. If we then do a
lookup and realize that the tag value is too large to possibly be an
offset into the real XFF file, we use it as an offset into the XFF log
file instead. (If that doesn't work, we signal an error.)
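
The dispatch this implies looks something like the sketch below, which
borrows the xff_get_entry() name mentioned further on and continues
with the headers and the xff_read_entry() sketch from above. The
structure and the xff_read_log_entry() helper are hypothetical:

struct xff_table {
    FILE        *xt_fp;         /* committed XFF table file */
    FILE        *xt_logfp;      /* XFF log file, or NULL */
    u_int32_t    xt_size;       /* size of the committed file */
};

static bool_t
xff_get_entry(struct xff_table *t, u_int32_t tag, entry_obj *ep)
{
    /* Small tags are offsets into the committed table file. */
    if (tag < t->xt_size)
        return (xff_read_entry(t->xt_fp, tag, ep));

    /* Anything larger was biased by the table size when it was
     * stored, so un-bias it and read from the log instead.
     * (xff_read_log_entry() is a hypothetical helper that decodes
     * a log record and hands back the entry inside it.) */
    if (t->xt_logfp != NULL)
        return (xff_read_log_entry(t->xt_logfp, tag - t->xt_size, ep));

    return (FALSE);             /* too large and no log: error */
}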

We have to do similar magic when handling db_first_entry()/db_next_entry()
enumeration-type queries. When db_next_entry() reaches the end of
the real file, it has to check to see if a log file exists and make
proper use of it if it does.

Both these cases are handled internally by the xff_get_entry(),
xff_first_entry() and xff_next_entry() functions. This avoids complicating
the top level library functions.

The last complication has to do with deletions. If a user deletes an
entry in an already committed XFF file, we have to do three things:

- Log the deletion action.

- Update the relation tables.

- Mark the entry in the original table as deleted.

This requires us to touch the original table before the log is committed, 
but this is unavoidable: if we didn't do this, db_first_entry() and
db_next_entry() would not know to skip the entry during an enumeration
query. One possible alternative would be to keep a list of 'losing'
tags and have xff_first_entry()/xff_next_entry() check their current
file positions against the list, but this would require extra processing
at each instance of xff_next_entry() and would require extra memory.

Regardless, deletions offer one last problem, which is that we need to
go through some complex gymnastics to reclaim space that is left in
committed table files as the result of a deletion. We can't overwrite
the deleted entry with another entry since entries are not of fixed
size (which means we may clobber the entry following the deleted one).
And we can't just shift all the subsequent entries up a slot since that
would throw them out of sync with their tags in the relation tables.

Unfortunately, there is no simple way around this: if an entry is
deleted in the original file, we just have to compact it. If, after we
commit the XFF log, we notice that there were deletions, we compact
the file to reclaim space. This involves creating a new file containing
all of the still valid entries from the old file: entries marked as
deleted are skipped (but we make note of their offsets). Once the new
file is created, we run a special fixup routine on the relation tables 
which basically visits all of the tags in the tables and, if needed, 
fixes them up so that they stay in sync with the actual entries in the
file.

And the survey says...
----------------------

Initial tests of the first version of XFF -- which did not include 
logging -- were very encouraging. The 'create a big fat table in
one shot' test would now run to completion on my home machine in
about 1 minute, 30 seconds. On the SPARC, it took about 46 seconds,
which was four seconds faster than the Sun library. But then I started 
to think more about what I was doing, and I realized that this was
really just a temporary situation. Without logging, my library could
easily fall apart.

Suppose I get halfway through the test and then crash. I will
have saved 15,000 entries to the XFF file, which is good. But I
don't save the relation tables until a checkpoint, and since I crashed
before I could do the checkpoint, the relation data would not be
saved, which is bad.

With the log, I can just replay all the log entries at startup and
rebuild the relation tables in memory.

But this raised a few questions since I noticed that the Sun library
and my library didn't always behave the same way. These questions are
not easy to answer until you understand how the Sun library works.

Things you were never meant to know
-----------------------------------

I will now totally dispel the mystery surrounding the implementation
of the Sun libnisdb library and tell you how it works. I'm sure Sun
would rather I didn't do this, but I don't care. I don't like the way
Sun's library works, and I want people to understand why.

The nis_db(3n) man page says that the Sun database implementation is
'memory based.' This is an understatement of epic proportions. The Sun 
library maintains a structure or set of structures describing the 
relations between entries, just as my library does. It also keeps
this structure or set of structures in memory in order to speed up
queries, just as my library does. What it also does, which my library
doesn't do, is maintain the _entire_ contents of a table in memory
along with the relation data. This means that if you have an NIS+ table
with 10,000 entries, the library will keep all 10,000 entries in memory.

When you add an entry, the Sun library does two things: it writes the
addition to a log, and it adds the entry to its in-core image of the
table. If you do a deletion, it writes the deletion to the log and
removes the entry from its in-core image of the table. If you crash
prior to doing a checkpoint, the in-core table is lost, but it is
reconstructed at restart using the log.

When you checkpoint, the Sun library essentially dumps both its in-core
relation data and all of the entries into a single file. This is where
it uses the XDR library: it can dump its data structures straight to
disk using its own XDR filters. Once the new file has been successfully
written, the log is removed.

By keeping the table entirely in-core, the Sun library avoids a few
of the issues that I had to contend with:

- You don't have to do any I/O to do a lookup: everything is in
  memory, so you just crunch a few numbers and out pops a pointer
  to the desired entry.

- Reclaiming space freed by deletions is easy: you just free() the
  memory occupied by the deleted entry and expunge its relation data
  from the tables.

- To 'commit' changes in the log, you just dump your in-core table
  image to disk. When the Sun library starts up, it reads the existing
  checkpointed table image from disk (if it exists) and then replays the
  log, thereby modifying its in-core image of the table to reflect the
  changes in the log. The image on disk remains unchanged. This reduces 
  the checkpoint operation to one big save operation where the in-core 
  image of the table is written out and ultimately replaces the old one.

- You don't have any file descriptors to juggle unless there's a log
  present. In my library, I go through a fair amount of trouble to
  manage file descriptors (actually FILE * pointers): you need one for
  each open table file, one for each open log file, and one more for each 
  db_first_entry()/db_next_entry() enumeration query that's currently in 
  progress.

But, as with most things, there's a tradeoff: in exchange for all
these simplifications, the Sun library uses a hell of a lot more memory
than mine. When running tests on the SPARC 10, my library, with the
30,000 entry table open, would consume about 1260 pages of virtual
memory (according to ps -elf). By contrast, the Sun library, with
the same table, consumed over 4400 pages. At 4K per page, that's
about 5MB for my library versus 18MB for the Sun library.

Some may view Sun's approach as reasonable since the VM system will
probably page out a lot of the extra pages anyway, thereby (ab)using
the VM system as part of the database. I don't think relying on the
VM system is such a good idea: what happens if you port the code to
another platform where the VM system behaves differently?

I'm sure Sun likes the implementation just fine: even with lots of
paging space, there's no arguing that the performance of the library
will improve if you throw more RAM at it, particularly with large
tables. This means that if you're looking to buy machines to use as
NIS+ servers in an environment with many thousands of users, you'll
have to buy lots of RAM, and Sun will be only too happy to sell it
to you.

There's one other difference between Sun's library and mine that
I can't quite explain: on the SPARC, a fully populated NIS+ table
file containing 30,000+ entries takes up about 9.6MB of disk space
when constructed with the Sun libnisdb. With my libnisdb, the
exact same table consumes only 7.5MB. I suspect this is because
Sun's in-core table image has the entry_obj structure for each entry
encapsulated in some larger structure (possibly with forward/backward
pointers to link them together), and all these larger 'encapsulation'
structures are also being dumped to disk when the table is checkpointed.

And so, in closing...
---------------------

The final version of my libnisdb is still not quite as fast as I
would like it to be, mainly because of the less memory-intensive
approach I used. For me, a checkpoint is largely a copy operation
whereas with Sun's library, it's just a store. At the moment, creating
a 30,000 entry table on my home machine takes about 2:45 (as opposed
to the 7+ minutes it took with the prototype), and about 1:12 on the
SPARC. If I were to use the same memory-based approach as Sun, this
time would be reduced by about a third, which would bring me right
in line with Sun's times (maybe a little faster since I didn't
use C++ :).

Also, a single lookup takes a little longer for my library since I need 
to do I/O to read entries from disk. However, overall startup time for my 
library is shorter because I don't have to load the entire table into 
memory: instead, I load only the relation table. And my table files are
smaller no matter what.

On the whole, I'd rather be in Philadelphia.

So what do you think, sirs?

-Bill

-- 
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home:  wpaul@skynet.ctr.columbia.edu | Columbia University, New York City
=============================================================================
 "If you're ever in trouble, go to the CTR. Ask for Bill. He will help you."
=============================================================================


