Date:      Tue, 02 Sep 2008 15:10:24 -0700
From:      Julian Elischer <julian@elischer.org>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        FreeBSD networking and TCP/IP list <freebsd-net@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>
Subject:   Re: how to read dynamic data structures from the kernel (was Re: reading routing table)
Message-ID:  <48BDB9D0.9030108@elischer.org>
In-Reply-To: <alpine.BSF.1.10.0809022157340.84006@fledge.watson.org>
References:  <3170f42f0809010507q6c37a9d5q19649bc261d7656d@mail.gmail.com>	<48BBE7B2.4050409@FreeBSD.org> <48BCE4AA.6050807@elischer.org>	<3170f42f0809020017k643180efte155a5b5701a40cf@mail.gmail.com>	<alpine.BSF.1.10.0809021017500.1150@fledge.watson.org>	<20080902105124.GA22832@onelab2.iet.unipi.it> <alpine.BSF.1.10.0809022157340.84006@fledge.watson.org>

Robert Watson wrote:
> 
> On Tue, 2 Sep 2008, Luigi Rizzo wrote:
> 
>> The real problem is that these data structures are dynamic and 
>> potentially large, so the following approach (used e.g. in ipfw)
>>
>>     enter kernel;
>>     get shared lock on the structure;
>>     navigate through the structure and make a linearized copy;
>>     unlock;
>>     copyout the linearized copy;
>>
>> is extremely expensive and has the potential to block other activities 
>> for a long time.
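
For concreteness, a minimal sketch of that pattern as a sysctl handler
(hypothetical names throughout; the real ipfw code differs in detail):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>
    #include <sys/queue.h>
    #include <sys/sysctl.h>

    static int
    mytable_export(struct sysctl_req *req)
    {
            struct my_entry *e, *buf, *bp;
            int error, n;

            MYTABLE_RLOCK();                        /* shared lock */
            n = mytable_count;
            buf = malloc(n * sizeof(*buf), M_TEMP, M_NOWAIT);
            if (buf == NULL) {
                    MYTABLE_RUNLOCK();
                    return (ENOMEM);
            }
            bp = buf;
            LIST_FOREACH(e, &mytable, my_link)      /* linearized copy */
                    *bp++ = *e;
            MYTABLE_RUNLOCK();                      /* unlock first... */

            error = SYSCTL_OUT(req, buf, n * sizeof(*buf)); /* ...copyout after */
            free(buf, M_TEMP);
            return (error);
    }

The whole walk-and-copy runs with the lock held, so the hold time (and
the blocking described above) grows linearly with the size of the table.
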
> 
> Sockets, sysctl, kmem, etc., are all really just I/O mechanisms, with 
> varying levels of abstraction, for pushing data, and all fundamentally 
> suffer from the same problem: the lack of a general export abstraction.
> 
>> What we'd need is some internal representation of the data structure 
>> that could give us individual entries of the data structure on each 
>> call, together with extra info (a pointer if we can guarantee that it 
>> doesn't get stale, something more if we cannot make the guarantee) to 
>> allow the navigation to occur.
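
That is essentially a cursor/iterator interface. A rough sketch of what
such an interface could look like (hypothetical, not an existing API):

    #include <sys/types.h>

    /*
     * Opaque resume point handed back to the caller between calls.
     * A key is used instead of a raw pointer, so the cursor stays
     * valid even if the entry it pointed at is freed meanwhile.
     */
    struct export_cursor {
            uint64_t ec_gen;        /* structure generation at start */
            uint64_t ec_lastkey;    /* resume strictly after this key */
    };

    /*
     * Return one entry per call: 0 on success, ENOENT when the walk
     * is complete, ESTALE if the structure changed incompatibly and
     * the caller must restart with a fresh cursor.
     */
    int mytable_export_next(struct export_cursor *cur,
            struct my_entry *entry);

Using a key (or path) as the resume point is the "something more" above:
unlike a pointer, it cannot go stale, at the cost of a lookup per call.
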
> 

In some code I have seen (and some I have written) there are two
levels of storage in some modules: one that holds the majority of the
information, and one that holds "changes that occurred since the main
container was locked".

So, for example, the routing tables might be locked, and any routing
change requested thereafter gets stored in transactional form on the
side structure. A routing lookup made while the structure is locked
(if it's a read lock) simply goes ahead, and on completion checks
whether there is a newer answer on the waiting list. A write request is
stored as a transaction request on the waiting list. I'm not saying this
works for everything, but if we had a kernel written in a high enough
level language, such methods could be used broadly.. oh well.
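
A rough sketch of that two-level arrangement (hypothetical names
throughout, not code from any existing subsystem; a real version would
also need a cheap lock, or atomics, around the side list itself):

    #include <sys/param.h>
    #include <sys/queue.h>

    /* A change recorded while the main table is locked. */
    struct pending_change {
            TAILQ_ENTRY(pending_change) pc_link;
            int     pc_key;
            int     pc_newval;              /* requested new value */
    };

    static TAILQ_HEAD(, pending_change) pending =
        TAILQ_HEAD_INITIALIZER(pending);

    /*
     * Lookups go ahead against the main table, then check whether
     * the side list holds a newer answer for the same key.
     */
    static int
    my_lookup(int key)
    {
            struct pending_change *pc;
            int val;

            val = main_table_lookup(key);   /* hypothetical */
            TAILQ_FOREACH(pc, &pending, pc_link)
                    if (pc->pc_key == key)
                            val = pc->pc_newval;    /* later entry wins */
            return (val);
    }

    /*
     * Writers arriving while the main table is locked just queue the
     * transaction; it is applied when the table is unlocked again.
     */
    static void
    my_update_deferred(struct pending_change *pc)
    {
            TAILQ_INSERT_TAIL(&pending, pc, pc_link);
    }
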

Using reader-writer locking mitigates a lot of this..
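
For instance, with rwlock(9) (sketch; mytable_lock and the two
functions are hypothetical names):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rwlock.h>

    static struct rwlock mytable_lock;  /* rw_init(&mytable_lock, "mytable") at setup */

    static void
    mytable_read(void)
    {
            rw_rlock(&mytable_lock);    /* shared: readers run concurrently */
            /* ... walk the table ... */
            rw_runlock(&mytable_lock);
    }

    static void
    mytable_write(void)
    {
            rw_wlock(&mytable_lock);    /* exclusive: excludes all readers */
            /* ... modify the table ... */
            rw_wunlock(&mytable_lock);
    }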


> I think there are necessarily implementation-specific details to all of 
> these steps for any given kernel subsystem -- we have data structures, 
> synchronization models, etc., that are all tuned to their common use 
> requirements, and monitoring is very much an edge case.  I don't think 
> this is bad: this is an OS kernel, after all, but it does make things a 
> bit more tricky.  Even if we can't share code, sharing approaches across 
> subsystems is a good idea.
> 
> For an example of what you have in mind, take a look at the sysctl 
> monitoring for UNIX domain sockets.  First, we allocate an array of 
> pointers sized to the number of unpcb's we have.  Then we walk the list, 
> bumping the references and adding pointers to the array.  Then we 
> release the global locks, and proceed to lock, externalize, unlock, and 
> copy out each individual entry, using a generation number to manage 
> staleness.  Finally, we walk the list dropping the refcounts and free 
> the array.  This avoids holding global locks for a long time, as well as 
> the stale data issue.  It's not ideal in other ways -- long lists, 
> reference count complexity, etc. -- but as I mentioned, it is very much an 
> edge case, and much of that mechanism (especially refcounts) is 
> something we need anyway for any moderately complex kernel data structure.
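
That sequence, as a sketch (hypothetical names; the real handler for
unpcbs lives in sys/kern/uipc_usrreq.c and differs in detail):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/queue.h>
    #include <sys/refcount.h>
    #include <sys/sysctl.h>

    static int
    sysctl_mypcblist(SYSCTL_HANDLER_ARGS)
    {
            struct my_pcb **arr, *p;
            struct my_xpcb x;               /* externalized form */
            int error, i, n;

            /* 1. Allocate the pointer array before taking the lock. */
            n = my_pcb_count;
            arr = malloc(n * sizeof(*arr), M_TEMP, M_WAITOK);

            /* 2. Walk the global list, bumping each entry's refcount. */
            MY_LIST_RLOCK();
            i = 0;
            LIST_FOREACH(p, &my_pcb_list, my_link) {
                    if (i == n)
                            break;
                    refcount_acquire(&p->my_refcount);
                    arr[i++] = p;
            }
            MY_LIST_RUNLOCK();
            n = i;

            /*
             * 3. Global lock dropped: lock, externalize, unlock, and
             *    copy out each entry.  A generation number (not shown)
             *    lets consumers detect snapshots gone stale.
             */
            error = 0;
            for (i = 0; i < n && error == 0; i++) {
                    p = arr[i];
                    MY_PCB_LOCK(p);
                    my_pcb_externalize(p, &x);      /* hypothetical */
                    MY_PCB_UNLOCK(p);
                    error = SYSCTL_OUT(req, &x, sizeof(x));
            }

            /* 4. Drop the references and free the array. */
            for (i = 0; i < n; i++)
                    my_pcb_release(arr[i]); /* refcount_release + destroy on last */
            free(arr, M_TEMP);
            return (error);
    }
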

Refcounts are, unfortunately, a really bad thing for multiprocessors.
If a refcount is actually incremented now and then, its cache line is
usually not held by the CPU doing the increment, forcing lots of
cache-line transfers and real reads from memory. There are some
elaborate MP-refcount schemes we really should look at, but most
require a lot of memory.
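
One shape such a scheme can take: give every counted object one counter
slot per CPU, each padded to its own cache line, so the hot
acquire/release path only ever touches CPU-local memory and the total
is computed only on the slow (teardown) path. The memory cost mentioned
above is visible directly: one cache line per CPU per object.
Hypothetical code, not an existing kernel API:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/pcpu.h>
    #include <sys/smp.h>

    struct pcpu_ref_slot {
            long    prs_count;
            char    prs_pad[CACHE_LINE_SIZE - sizeof(long)];
    } __aligned(CACHE_LINE_SIZE);

    struct pcpu_refcount {
            struct pcpu_ref_slot prc_slot[MAXCPU];
    };

    static __inline void
    pcpu_ref_acquire(struct pcpu_refcount *rc)
    {
            critical_enter();               /* pin to this CPU */
            rc->prc_slot[curcpu].prs_count++;
            critical_exit();
    }

    static __inline void
    pcpu_ref_release(struct pcpu_refcount *rc)
    {
            critical_enter();
            rc->prc_slot[curcpu].prs_count--;
            critical_exit();
    }

    /*
     * Slow path: sum all slots.  Only meaningful once new references
     * can no longer be taken (e.g. the object is unlinked), since the
     * sum is not an atomic snapshot.
     */
    static long
    pcpu_ref_total(struct pcpu_refcount *rc)
    {
            long sum = 0;
            u_int i;

            for (i = 0; i <= mp_maxid; i++)
                    sum += rc->prc_slot[i].prs_count;
            return (sum);
    }
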

