Date:      Tue, 2 Sep 2008 23:31:18 +0200
From:      Luigi Rizzo <rizzo@iet.unipi.it>
To:        Robert Watson <rwatson@freebsd.org>
Cc:        FreeBSD networking and TCP/IP list <freebsd-net@freebsd.org>
Subject:   Re: how to read dynamic data structures from the kernel (was Re: reading routing table)
Message-ID:  <20080902213118.GB28398@onelab2.iet.unipi.it>
In-Reply-To: <alpine.BSF.1.10.0809022157340.84006@fledge.watson.org>
References:  <3170f42f0809010507q6c37a9d5q19649bc261d7656d@mail.gmail.com> <48BBE7B2.4050409@FreeBSD.org> <48BCE4AA.6050807@elischer.org> <3170f42f0809020017k643180efte155a5b5701a40cf@mail.gmail.com> <alpine.BSF.1.10.0809021017500.1150@fledge.watson.org> <20080902105124.GA22832@onelab2.iet.unipi.it> <alpine.BSF.1.10.0809022157340.84006@fledge.watson.org>

On Tue, Sep 02, 2008 at 10:02:10PM +0100, Robert Watson wrote:
> 
> On Tue, 2 Sep 2008, Luigi Rizzo wrote:
> 
> >The real problem is that these data structures are dynamic and potentially 
> >large, so the following approach (used e.g. in ipfw)
> >
> >	enter kernel;
> >	get shared lock on the structure;
> >	navigate through the structure and make a linearized copy;
> >	unlock;
> >	copyout the linearized copy;
> >
> >is extremely expensive and has the potential to block other activities for 
> >a long time.
> 
> Sockets, sysctl, kmem, etc, are all really just I/O mechanisms, with 
> varying levels of abstraction, for pushing data, and all fundamentally 
> suffer from the problem of a lack of general export abstraction.

Yes, this is why I said we should not bother about which one is used.

> >What we'd need is some internal representation of the data structure that 
> >could give us individual entries of the data structure on each call, 
> >together with extra info (a pointer if we can guarantee that it doesn't 
> >get stale, something more if we cannot make the guarantee) to allow the 
> >navigation to occur.
> 
> I think there's necessarily implementation-specific details to all of these 
> steps for any given kernel subsystem -- we have data structures, 
> synchronization models, etc, that are all tuned to their common use 
> requirements, and monitoring is very much an edge case.  I don't think this 
> is bad: this is an OS kernel, after all, but it does make things a bit more 
> tricky.  Even if we can't share code, sharing approaches across subsystems 
> is a good idea.
> 
> For an example of what you have in mind, take a look at the sysctl 
> monitoring for UNIX domain sockets.  First, we allocate an array of 
> pointers sized to the number of unpcb's we have.  Then we walk the list, 
> bumping the references and adding pointers to the array.  Then we release 
> the global locks, and proceed to lock, externalize, unlock, and copyout 
> each individual entry, using a generation number to manage staleness.  
> Finally, we walk the list dropping the refcounts and free the array.  
> This avoids holding global locks for a long time, as well as the stale 
> data issue.  It's not ideal in other ways -- long lists, reference 
> count complexity, etc., 
> but as I mentioned, it is very much an edge case, and much of that 
> mechanism (especially refcounts) is something we need anyway for any 
> moderately complex kernel data structure.

What you describe above is more efficient, but not that different
from what I described. The thing is, I always forget that in many
cases an iterator doesn't care about the order in which elements
are generated, so you could use a solution like the one below,
by doing just a tiny bit of work while modifying the main
data structure
(this might well be a known solution, since it is so trivial...)

[i already emailed the following to BMS, so apologies for the duplicate]

If all you care about is iterating over the whole data structure,
without a particular order, you could maintain an additional array
of pointers to all the objects in the data structure (the array
should be implemented as a sparse, resizable array, but that's a
separate issue, and probably a relatively trivial one -- I am
googling for it...).

Navigation and iterators are simple:

+ When inserting a new element, append an entry to the array, and make
  it point to the newly added record. Each entry gets a fresh sequence
  number, and one should make sure that sequence numbers are not
  recycled (64 bits should suffice?).

+ When deleting an element, logically remove the entry from the array.

+ The iterator returns a copy of the object, and its sequence number.

+ getnext returns the existing element following the 'current' one
  in the sparse array.

Complexity for most ops (data insertion, removal, lookup) would be
O(1), plus whatever is needed to do housekeeping on the sparse
array; this depends on the usage of the main data structure and on
how much we worry about expensive 'getnext' ops.
Sure, you need a read lock on the main struct while you look up
the next element in the sparse array, but the nice thing is that
you can release the lock at each step, even if you have a poorly
implemented sparse array.

Makes sense?

	cheers
	luigi


