From owner-freebsd-hackers Sun Jul 28 13:09:42 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id NAA07833 for hackers-outgoing; Sun, 28 Jul 1996 13:09:42 -0700 (PDT)
Received: from skynet.ctr.columbia.edu (skynet.ctr.columbia.edu [128.59.64.70]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id NAA07826 for ; Sun, 28 Jul 1996 13:09:29 -0700 (PDT)
Received: (from wpaul@localhost) by skynet.ctr.columbia.edu (8.6.12/8.6.9) id QAA28027 for hackers@freebsd.org; Sun, 28 Jul 1996 16:09:23 -0400
From: Bill Paul
Message-Id: <199607282009.QAA28027@skynet.ctr.columbia.edu>
Subject: Request for comments: NIS+ database design
To: hackers@freebsd.org
Date: Sun, 28 Jul 1996 16:09:22 -0400 (EDT)
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Greetings:

Lately I've been eroding my sanity by trying to put together an NIS+ implementation for FreeBSD. After going over the man pages, RPC protocol definitions and other available headers a few times, I decided to start at the bottom by building an NIS+ database backend library.

Briefly, with NIS+, databases are managed entirely by the server, rpc.nisd. This includes creating, destroying and populating tables: all the NIS+ tools send requests to rpc.nisd, which in turn updates the database files and keeps a log of changes. The log is used by slaves (or replicas, as Sun now calls them) to stay in sync with the master server by carrying out the changes as specified in the log rather than snarfing an entire copy of the updated database, which is what NIS v2 does. Furthermore, NIS+ uses a relational database system rather than the (relatively) simple key/value database system used by NIS v2.
With NIS+, you're allowed to make queries such as: "find me all the entries in the passwd table where uid=0 and shell=/bin/csh", whereas with NIS v2 you're only allowed to say: "give me the data associated with the key 'foouser' in the map 'passwd.byname'." (Note that this assumes that the 'passwd' table has been created with the uid and shell columns marked as searchable. Normally this is not the case: only the name and uid columns are searchable.)

This (IMHO) is where most of the complexity in NIS+ comes from. With NIS v2, it's fairly easy to just bolt on an interface to Berkeley DB, ndbm, GDBM or whatever and leave it at that. With NIS+, you have to be more creative.

Anyway. I managed to concoct a workable libnisdb prototype, but not being a database expert I'm not terribly confident in its design. I want to briefly explain how it works and then ask for comments, criticisms, screams of terror or any other input anyone would care to offer. I apologize if this seems a bit too far off topic for this list. I don't know how many of you actually care about this stuff, but since it is likely to end up in the tree eventually, it deserves some discussion somewhere.

The NIS+ database backend consists of a handful of public functions such as db_create_table(), db_destroy_table(), db_add_entry(), db_remove_entry(), db_list_entries(), db_next_entry(), db_reset_next_entry(), db_checkpoint(), db_standby(), db_unload_table() and db_free_result(). Some of these return just a single enum value (status code) while others return a db_result structure containing the results of queries. db_free_result() returns void. (Note that a few extra functions were added between Solaris 2.3, when the NIS+ and nis_db.h specifications were released, and Solaris 2.5. I decided to implement the functions outlined as of Solaris 2.5.)

The Sun implementation uses something described in the nis_db(3n) man page as the 'Structured Storage Manager (SSM).'
I also know that the Sun libnisdb is written, at least in part, in C++ (in fact, I think a lot of Sun's NIS+ implementation is written in C++), and I believe it handles some of the transaction logging. My library doesn't do logging, but when the time comes to do rpc.nisd, I may decide to merge it in if it seems practical.

First of all, I decided to use the Berkeley DB btree method as the underlying database format. With ypserv I used the hash method. The one downside to the hash method is that the R_CURSOR flag doesn't work, which makes it hard to arbitrarily set a 'pointer' to a given place in the database. I worked around this with ypserv, but I decided I needed it to work correctly this time, so I switched to btree.

Second, the actual 'database' is stored in two files. One contains the actual entries, and the other contains the relations that describe the entries. When an entry is added, it is first XDR-encoded into a buffer using xdrmem_create() in order to transform it into an opaque form. This buffer is then passed to a hash function which generates a 32-bit hash value that describes the entry. This hash value is used as the 'key' in the 'entries' database, and the XDR'ed buffer is the data. I call these 32-bit hash values 'tags.' Assuming one knows the 32-bit 'tag' for a particular entry, one can just do a (db->get) on the database using the tag as the key and get back the XDR'ed buffer. The original entry_obj structure can then be recreated just by passing the buffer through the XDR filter again.

When the (db->put) operation is performed, I can tell immediately if I've been handed a duplicate entry, since Berkeley DB will notice the key collision and return an error (if you use the R_NOOVERWRITE flag, which I do). If this happens, the caller gets back a DB_NOTUNIQUE error.

Next, a set of relations is derived from the entry, and the 32-bit tag is associated with each relation.
For example, if I have a 'passwd' table with seven total columns, two of which are searchable (name and uid), I would have two relations for each entry: name= and uid=. These would each be stored in the second database file as keys, with the 32-bit tag derived from the entry as data.

With passwd databases, the name field is meant to be unique, therefore each name= relation will likely only be related to a single tag. However, uids are often not unique, which means that a single uid= relation may yield several tags. If I find that a uid= key already exists, I update it by adding the new tag to the list of tags already stored. So as more entries are added that have the same uid= relation, the list of tags gets bigger.

Let's say I do a simple query: find me all entries where name=wpaul. I would use db_list_entries() for this. First, I would use [name=wpaul] as a key for searching the 'relations' database for the passwd table. Assuming there is an entry that matches this criterion, I'll get back a datum that contains the 32-bit tag for the desired entry. I then use this tag as a key for searching the 'entries' database for the passwd table and get back the entry buffer, which I convert back into an entry_obj structure and pass back to the caller.

Now let's say I do a more complex query that yields several entries as a result: find me all entries where uid=0 and shell=/bin/csh. Let's say there are three entries where uid=0 but only two where shell=/bin/csh. Doing a (db->get) from the 'relations' database with [uid=0] as the key yields three tags, but [shell=/bin/csh] yields only two. In this case, we take the smaller of the two tag lists (shell=/bin/csh) and check to see if each of its tags exists in the other relation's list. If we find that one of the tags in the smaller list doesn't appear in the larger list, it's eliminated from the query. (When there are more than two attributes, a tag is eliminated if it fails to appear in any of the larger lists.)
Once we know which tags appear in both sets of lists, we retrieve the corresponding entries and pass them back to the caller.

Lastly, the 'entries' database has one special record in it which actually contains an XDR'ed table_obj that describes the table. When the database is initialized, each table is opened and this record is read into a cache which the various functions can then use for sanity checking. This record also describes the access rights associated with the various columns in the table. (Note that the database itself does not care about the access rights; only the server (rpc.nisd) concerns itself with them.)

Along with all this, there are two circular queues of table handles. One of them contains a single entry for each known table and is loaded when the database library is initialized. The second queue is used to hold special descriptors used by the db_first_entry() and db_next_entry() functions. These database handles are positioned to particular locations within the btree databases using the (db->seq) function in Berkeley DB. This is somewhat analogous to the database handle cache in ypserv. The db_reset_next_entry() function is used to close down one of these table handles when the caller is finished with it.

The reason I decided to store the 'entries' and 'relations' separately was to save a little time in db_next_entry(). If both types of keys/data were stored in one database, I would have to check the type of data retrieved on each (db->seq) to make sure it was of the correct type and do a second retrieval if it wasn't (or a third, or a fourth, if I encountered a series of the wrong type of data). By putting the relations in a separate database, I can step through the entries quickly without worrying about encountering uninteresting records.

That's about it really.
The idea here was to make the retrieval process quick in exchange for making the storage process more complex, since the overwhelming majority of queries received by the server are likely to be for lookups.

At the moment I have a completed library with all the functions from the Solaris nis_db(3n) man page and header implemented. I still need to make a couple of cleanup passes and write my own man page (not to mention run some comparison tests against the Solaris libnisdb), after which I'll put the whole mess up for FTP somewhere so people can look at it, point their fingers and laugh.

-Bill

--
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home: wpaul@skynet.ctr.columbia.edu  | Columbia University, New York City
=============================================================================
 "If you're ever in trouble, go to the CTR. Ask for Bill. He will help you."
=============================================================================