From owner-freebsd-hackers Mon Dec 17 6:35:36 2001
Message-ID: <3C1E02A1.98BFFE5@mindspring.com>
Date: Mon, 17 Dec 2001 06:35:14 -0800
From: Terry Lambert
To: Lamont Granquist
Cc: freebsd-hackers@FreeBSD.org
Subject: Re: What a FBSD FS needs to do?
References: <20011217014953.G15950-100000@coredump.scriptkiddie.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------02B49AC6CDE37E406593A945"
Sender: owner-freebsd-hackers@FreeBSD.ORG

This is a multi-part message in MIME format.
--------------02B49AC6CDE37E406593A945
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Lamont Granquist wrote:
>
> Can anyone give a brief overview (or point to one) of what a FS in FreeBSD
> needs to do to interact with the rest of the OS? The general picture I've
> got is of some code which interacts with the VFS layer above it and the
> block I/O layer down below it. It is this correct? And what are the APIs
> in those layers? (and how does the FS interact with the VM?)

Briefly, there are ~185 kernel entry points which are consumed by the FFS code.
To see these, go into the directory where you build your kernel and have object files lying around, and count them, e.g.:

    # cd /sys/compile/GENERIC
    # sh
    # ld -o /tmp/ffsobj ffs* ufs* >/tmp/ffs.link 2>&1
    # cd /tmp
    # vi ffs.link
    :1,$g/:$/d
    :1,$g/more undefined/d
    :1,$s/'$//
    :1,$s/^.*`//
    :x
    # sort -o ffs.sort < ffs.link
    # uniq < ffs.sort > ffs.uniq
    # wc -l ffs.uniq
    185

I have attached an example of the result, from my older 4.x-based system, to this email.

If you look at these, you will see 5 broad categories:

1) Kernel support services. These are things like bzero, copyin, printf, uiomove, timeout, tsleep, untimeout, etc., and are required support functions that aren't really FS specific. Another OS would call them "generic kernel services", but wouldn't have the whole story.

2) VFS services. These are things like vfs_add_vnodeops, vfs_export, vfs_timestamp, etc., and are required for registration and recognition of the FS as a VFS. There are also services for manipulation of VFS specific kernel resources in this category.

3) Vnode services. These are things like all of the vop_* operations, vget, vgone, NDFREE, and so on. These services represent both VFS services, which the VFS can call for stacking reasons (it calls these services, rather than calling the VFS specific routines it defines, in order to abstract the VFS so that you can do VFS stacking, and things won't break), and VFS specific resources that are managed by the OS (such as vnodes, etc.).

   Note: The NDFREE reference is actually an implementation error, since it breaks the "caller allocates/caller frees" paradigm; this is a long-standing layering issue.

4) Virtual memory and I/O services. These are things like malloc, free, cache_enter, getblk, bread, vm_object_deallocate, vinvalbuf, etc. These services represent the VFS' interaction with the VM system, and, as a result, the buffer cache.
   The spl* functions, which are used for concurrency control, as well as the locking primitives, fall into this category.

   It's important to note that most of these operations only exist in "local media" FSs... if your VFS were implementing a stacking layer, it would use almost none of these, since the services consumed would be pretty much covered in #3, above.

5) Miscellaneous functions. Into this category, I lump all the inconvenient-to-explain functions, like the spec_* functions, which implement the special device operations exported by the VFS (when you look up a device, you actually get a specfs vnode back, instead of an FFS vnode, but since the backing object is an FFS object, you have to reference it through the FFS), and, similarly, the fifo_* operations (which are used to manage named pipes -- FIFO objects -- in the same way). You would also see "__divdi3" here, as well as other synthetic functions which are, in reality, an artifact of the compiler.

Practically, nearly half of these undefined symbols could be made to go away, with little to no effect on performance. In particular, the descriptor references could be factored out at FS instance time, when the mount takes place, and a stack is "frozen" as a mounted FS instance. The way you would do this is to sort the VOP and VFSOP lists, respectively, and then build direct references, rather than descriptor references, and access them by index, rather than descriptor (this would be slightly faster, too). Other references could additionally be eliminated, as they are really the result of sloppy references (e.g. the spec_* and fifo_* entries: the first by mount-based externalization and inheritance, and the second by pure inheritance, enforced at instance time).
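
The index-versus-descriptor idea above can be sketched in user-space C. This is a toy model only: op_desc, op_entry, frozen_ops, freeze_ops, call_by_desc, call_by_index and the demo_* functions are hypothetical names invented for illustration, not FreeBSD kernel interfaces, and the linear search in call_by_desc is merely a stand-in for whatever a descriptor-based lookup costs in a real kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are real FreeBSD kernel types. */
typedef int (*op_fn)(void *args);

struct op_desc {                /* plays the role of a vop_*_desc descriptor */
    const char *name;
};

struct op_entry {               /* one (descriptor, implementation) pair declared by an FS */
    const struct op_desc *desc;
    op_fn fn;
};

/* Canonical list of operations, kept in sorted order so position == index. */
enum { OP_CLOSE, OP_OPEN, OP_READ, OP_WRITE, OP_COUNT };

static const struct op_desc op_descs[OP_COUNT] = {
    [OP_CLOSE] = { "close" },
    [OP_OPEN]  = { "open"  },
    [OP_READ]  = { "read"  },
    [OP_WRITE] = { "write" },
};

/* Dense vector built once, when the FS instance is "frozen" at mount time. */
struct frozen_ops {
    op_fn vec[OP_COUNT];
};

/* Fallback for operations the FS does not implement. */
static int op_default(void *args) { (void)args; return -1; }

/* Descriptor-style dispatch: search the FS table on every call. */
static int call_by_desc(const struct op_entry *tbl, size_t n,
                        const struct op_desc *d, void *args)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].desc == d)
            return tbl[i].fn(args);
    return op_default(args);
}

/* Freeze: turn descriptor references into direct, index-addressed slots. */
static void freeze_ops(struct frozen_ops *out,
                       const struct op_entry *tbl, size_t n)
{
    for (int i = 0; i < OP_COUNT; i++)
        out->vec[i] = op_default;
    for (size_t i = 0; i < n; i++) {
        /* Each descriptor must point into op_descs for this to be valid. */
        ptrdiff_t idx = tbl[i].desc - op_descs;
        assert(idx >= 0 && idx < OP_COUNT);
        out->vec[idx] = tbl[i].fn;
    }
}

/* Index-style dispatch: one array load and one indirect call. */
static int call_by_index(const struct frozen_ops *ops, int idx, void *args)
{
    return ops->vec[idx](args);
}

/* A made-up FS that implements only read and write, declared out of order. */
static int demo_read(void *args)  { (void)args; return 1; }
static int demo_write(void *args) { (void)args; return 2; }

static const struct op_entry demo_fs_ops[] = {
    { &op_descs[OP_WRITE], demo_write },
    { &op_descs[OP_READ],  demo_read  },
};
```

In this sketch the per-call search disappears once freeze_ops has run, and callers need only the small integer index rather than a reference to each descriptor symbol, which is roughly the kind of reduction in external references being argued for here.
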

A lot of the b* buffer cache operations should probably be via an ops structure dereference; this means an additional pointer dereference at runtime, so some of the wins you got by sorting the VOP list and using an index, instead of a reverse lookup of the descriptor reference, get paid back at that time, but overall you are still better off. (If you have ever programmed an IFS under Windows, you are familiar with the concept of function table reference definition at IFS registration time: this is basically the same approach as there.)

The total external exposure could therefore be dropped to perhaps 30 or more symbols, which would make understanding things a whole lot easier.

As far as the externally exposed symbols are concerned, the place to look for these is in the VFSOPS and VOPS tables; these are contained in /sys/ufs/ffs/ffs_vfsops.c and /sys/ufs/ffs/ffs_vnops.c, in descriptor tables. These tables define the VFS consumer interface used to talk to an FS by any VFS layer consumer.

There are three consumers of the VFS layer at present: the system calls, the static references to things like the ufs_*, fifo_*, and spec_* operations by the ffs_* code, and the NFS server code. Putatively, there are also VFS stacking modules, but since only trivial versions of those actually work (they have overly complex interaction with the VM system, in particular, the cache coherency), they don't really count as something you have to worry about supporting at this point, at least not any more than any other VFS supports them directly.

This would all be significantly better handled in the context of a journal of a new, independent FS port to FreeBSD, since it would be possible to address the more arcane issues and the issues of ideal vs. practical kernel interaction, at a much more abstract (and thus useful to future FS writers) level.
Using the FFS as your example is not generally a good idea, particularly since it has some additional complexity for things like Soft Updates and legacy stuff that make it a really bad example of "how to do things the right way when you are starting from scratch". I'm pretty sure Kirk and others would agree with this assessment.

In any case, there's your "brief" overview. You would do well to read John Heidemann's thesis, and the documentation for the FICUS framework out of UCLA, on which the stacking code is based, as well as Matt Dillon's small articles that give a brief overview of the FreeBSD unified VM and buffer cache system.

See:

    ftp://ftp.cs.ucla.edu/pub/ficus/
    http://www.daemonnews.org/

-- Terry
--------------02B49AC6CDE37E406593A945
Content-Type: text/plain; charset=us-ascii; name="ffs.uniq"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="ffs.uniq"

M_TEMP
NDFREE
__divdi3
__moddi3
addaliasu
addlog
allocbuf
bawrite
bcmp
bcopy
bdevvp
bdirty
bdwrite
biodone
biowait
bowrite
bqrelse
bread
breadn
brelse
bremfree
buf_wmesg
bwillwrite
bwrite
bzero
cache_enter
cache_purge
cluster_read
cluster_write
copyin
copyinstr
copyout
copystr
crfree
curproc
desiredvnodes
dev2udev
devsw
devtoname
dsname
fifo_printinfo
fifo_vnodeop_p
fifo_vnoperate
free
getblk
geteblk
getmicrouptime
getnewvnode
groupmember
hashinit
iftovt_tab
incore
knote
lbolt
lf_advlock
lockinit
lockmgr
lockmgr_printinfo
log
major
makedev
malloc
malloc_init
malloc_uninit
minor
mntvnode_slock
module_register_init
mountlist
namei
nchstats
panic
pmap_zero_page
printf
psignal
random
relookup
rootdev
rootvp
scanc
securelevel
skpc
spec_vnodeop_p
spec_vnoperate
speedup_syncer
splbio
splx
suser_xxx
sysctl__debug_children
sysctl__vfs_children
sysctl_handle_int
tablefull
time_second
timeout
tsleep
uiomove
untimeout
uprintf
vcount
vflush
vfs_add_vnodeops
vfs_bio_awrite
vfs_bio_clrbuf
vfs_busy_pages
vfs_cache_lookup
vfs_export
vfs_export_lookup
vfs_getnewfsid
vfs_getvfs
vfs_modevent
vfs_mountedon
vfs_object_create
vfs_rm_vnodeops
vfs_stdextattrctl
vfs_stduninit
vfs_timestamp
vget
vgone
vinvalbuf
vm_freeze_copyopts
vm_object_reference
vm_object_vndeallocate
vm_page_free_toq
vm_page_zero_invalid
vn_close
vn_isdisk
vn_lock
vn_open
vn_rdwr
vnode_pager_generic_getpages
vnode_pager_generic_putpages
vnode_pager_setsize
vop_access_desc
vop_advlock_desc
vop_balloc_desc
vop_bmap_desc
vop_bwrite_desc
vop_cachedlookup_desc
vop_close_desc
vop_create_desc
vop_default_desc
vop_defaultop
vop_freeblks_desc
vop_fsync_desc
vop_getattr_desc
vop_getpages_desc
vop_inactive_desc
vop_ioctl_desc
vop_islocked_desc
vop_link_desc
vop_lock_desc
vop_lookup_desc
vop_mkdir_desc
vop_mknod_desc
vop_mmap_desc
vop_open_desc
vop_pathconf_desc
vop_poll_desc
vop_print_desc
vop_putpages_desc
vop_read_desc
vop_readdir_desc
vop_readlink_desc
vop_reallocblks_desc
vop_reclaim_desc
vop_remove_desc
vop_rename_desc
vop_rmdir_desc
vop_setattr_desc
vop_stdislocked
vop_stdlock
vop_stdpoll
vop_stdunlock
vop_strategy_desc
vop_symlink_desc
vop_unlock_desc
vop_whiteout_desc
vop_write_desc
vprint
vput
vrecycle
vref
vrele
vtruncbuf
vttoif_tab
wakeup

--------------02B49AC6CDE37E406593A945--

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message