From owner-freebsd-hackers Sun Jul 28 13:09:42 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id NAA07833 for hackers-outgoing; Sun, 28 Jul 1996 13:09:42 -0700 (PDT)
Received: from skynet.ctr.columbia.edu (skynet.ctr.columbia.edu [128.59.64.70]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id NAA07826 for ; Sun, 28 Jul 1996 13:09:29 -0700 (PDT)
Received: (from wpaul@localhost) by skynet.ctr.columbia.edu (8.6.12/8.6.9) id QAA28027 for hackers@freebsd.org; Sun, 28 Jul 1996 16:09:23 -0400
From: Bill Paul
Message-Id: <199607282009.QAA28027@skynet.ctr.columbia.edu>
Subject: Request for comments: NIS+ database design
To: hackers@freebsd.org
Date: Sun, 28 Jul 1996 16:09:22 -0400 (EDT)
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Greetings:

Lately I've been eroding my sanity by trying to put together an NIS+ implementation for FreeBSD. After going over the man pages, RPC protocol definitions and other available headers a few times, I decided to start at the bottom by building an NIS+ database backend library.

Briefly, with NIS+, databases are managed entirely by the server, rpc.nisd. This includes creating, destroying and populating tables: all the NIS+ tools send requests to rpc.nisd, which in turn updates the database files and keeps a log of changes. The log is used by slaves (or replicas, as Sun now calls them) to stay in sync with the master server by carrying out the changes as specified in the log rather than snarfing an entire copy of the updated database, which is what NIS v2 does. Furthermore, NIS+ uses a relational database system rather than the (relatively) simple key/value database system used by NIS v2.
With NIS+, you're allowed to make queries such as: "find me all the entries in the passwd table where uid=0 and shell=/bin/csh", whereas with NIS v2 you're only allowed to say: "give me the data associated with the key 'foouser' in the map 'passwd.byname'." (Note that this assumes that the 'passwd' table has been created with the uid and shell columns marked as searchable. Normally this is not the case: only the name and uid columns are searchable.)

This (IMHO) is where most of the complexity in NIS+ comes from. With NIS v2, it's fairly easy to just bolt on an interface to Berkeley DB, ndbm, GDBM or whatever and leave it at that. With NIS+, you have to be more creative.

Anyway. I managed to concoct a workable libnisdb prototype, but not being a database expert I'm not terribly confident in its design. I want to briefly explain how it works and then ask for comments, criticisms, screams of terror or any other input anyone would care to offer. I apologize if this seems a bit too far off topic for this list. I don't know how many of you actually care about this stuff, but since it is likely to end up in the tree eventually, it deserves some discussion somewhere.

The NIS+ database backend consists of a handful of public functions such as db_create_table(), db_destroy_table(), db_add_entry(), db_remove_entry(), db_list_entries(), db_next_entry(), db_reset_next_entry(), db_checkpoint(), db_standby(), db_unload_table() and db_free_result(). Some of these return just a single enum value (status code) while others return a db_result structure containing the results of queries. db_free_result() returns void. (Note that a few extra functions were added between Solaris 2.3, when the NIS+ and nis_db.h specifications were released, and Solaris 2.5. I decided to implement the functions outlined as of Solaris 2.5.)

The Sun implementation uses something described in the nis_db(3n) man page as the 'Structured Storage Manager (SSM).'
I also know that the Sun libnisdb is written, at least in part, in C++ (in fact, I think a lot of Sun's NIS+ implementation is written in C++), and I believe it handles some of the transaction logging. My library doesn't do logging, but when the time comes to do rpc.nisd, I may decide to merge it in if it seems practical.

First of all, I decided to use the Berkeley DB btree method as the underlying database format. With ypserv I used the hash method. The one downside to the hash method is that the R_CURSOR flag doesn't work, which makes it hard to arbitrarily set a 'pointer' to a given place in the database. I worked around this with ypserv, but I decided I needed it to work correctly this time, so I switched to btree.

Second, the actual 'database' is stored in two files. One contains the actual entries, and the other contains the relations that describe the entries. When an entry is added, it is first XDR-encoded into a buffer using xdrmem_create() in order to transform it into an opaque form. This buffer is then passed to a hash function which generates a 32-bit hash value that describes the entry. This hash value is used as the 'key' in the 'entries' database, and the XDR'ed buffer is the data. I call these 32-bit hash values 'tags.' Assuming one knows the 32-bit 'tag' for a particular entry, one can just do a (db->get) on the database using the tag as the key and get back the XDR'ed buffer. The original entry_obj structure can then be recreated just by passing the buffer through the XDR filter again.

When the (db->put) operation is performed, I can tell immediately if I've been handed a duplicate entry, since Berkeley DB will notice the key collision and return an error (if you use the R_NOOVERWRITE flag, which I do). If this happens, the caller gets back a DB_NOTUNIQUE error.

Next, a set of relations is derived from the entry, and the 32-bit tag is associated with each relation.
For example, if I have a 'passwd' table with seven total columns, two of which are searchable (name and uid), I would have two relations for each entry: name= and uid=. These would each be stored in the second database file as keys, with the 32-bit tag derived from the entry as data.

With passwd databases, the name field is meant to be unique, therefore each name= relation will likely only be related to a single tag. However, uids are often not unique, which means that a single uid= relation may yield several tags. If I find that a uid= key already exists, I update it by adding the new tag to the list of tags already stored. So as more entries are added that have the same uid= relation, the list of tags gets bigger.

Let's say I do a simple query: find me all entries where name=wpaul. I would use db_list_entries() for this. First, I would use [name=wpaul] as a key for searching the 'relations' database for the passwd table. Assuming there is an entry that matches this criterion, I'll get back a datum that contains the 32-bit tag for the desired entry. I then use this tag as a key for searching the 'entries' database for the passwd table and get back the entry buffer, which I convert back into an entry_obj structure and pass back to the caller.

Now let's say I do a more complex query that yields several entries as a result: find me all entries where uid=0 and shell=/bin/csh. Let's say there are three entries where uid=0 but only two where shell=/bin/csh. Doing a (db->get) from the 'relations' database with [uid=0] as the key yields three tags, but [shell=/bin/csh] yields only two. In this case, we take the smaller of the two tag lists (shell=/bin/csh) and check to see if each of its tags exists in the other relation's list. If we find that one of the tags in the smaller list doesn't appear in the larger list, it's eliminated from the query. (When there are more than two attributes, a tag is eliminated if it fails to appear in any of the larger lists.)
Once we know which tags appear in both sets of lists, we retrieve the corresponding entries and pass them back to the caller.

Lastly, the 'entries' database has one special record in it which actually contains an XDR'ed table_obj that describes the table. When the database is initialized, each table is opened and this record is read into a cache which the various functions can then use for sanity checking. This record also describes the access rights associated with the various columns in the table. (Note that the database itself does not care about the access rights; only the server (rpc.nisd) concerns itself with them.)

Along with all this, there are two circular queues of table handles. One of them contains a single entry for each known table and is loaded when the database library is initialized. The second queue is used to hold special descriptors used by the db_first_entry() and db_next_entry() functions. These database handles are positioned to particular locations within the btree databases using the (db->seq) function in Berkeley DB. This is somewhat analogous to the database handle cache in ypserv. The db_reset_next_entry() function is used to close down one of these table handles when the caller is finished with it.

The reason I decided to store the 'entries' and 'relations' separately was to save a little time in db_next_entry(). If both types of keys/data were stored in one database, I would have to check the type of data retrieved on each (db->seq) to make sure it was of the correct type and do a second retrieval if it wasn't (or a third, or a fourth, if I encountered a series of the wrong type of data). By putting the relations in a separate database, I can step through the entries quickly without worrying about encountering uninteresting records.

That's about it really.
The idea here was to make the retrieval process quick in exchange for making the storage process more complex, since the overwhelming majority of queries received by the server are likely to be for lookups.

At the moment I have a completed library with all the functions from the Solaris nis_db(3n) man page and header implemented. I still need to make a couple of cleanup passes and write my own man page (not to mention run some comparison tests against the Solaris libnisdb), after which I'll put the whole mess up for FTP somewhere so people can look at it, point their fingers and laugh.

-Bill

--
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home: wpaul@skynet.ctr.columbia.edu  | Columbia University, New York City
=============================================================================
 "If you're ever in trouble, go to the CTR. Ask for Bill. He will help you."
=============================================================================