Date:      Tue, 19 Jun 2001 17:16:47 -0400
From:      Bosko Milekic <bmilekic@technokratis.com>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        Rik van Riel <riel@conectiva.com.br>, Matt Dillon <dillon@earth.backplane.com>, Matthew Hagerty <mhagerty@voyager.net>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: Article: Network performance by OS
Message-ID:  <20010619171647.A8322@technokratis.com>
In-Reply-To: <3B2FA26A.EA68DDCA@mindspring.com>; from tlambert2@mindspring.com on Tue, Jun 19, 2001 at 12:05:14PM -0700
References:  <Pine.LNX.4.21.0106161712060.2056-100000@imladris.rielhome.conectiva> <3B2FA26A.EA68DDCA@mindspring.com>


On Tue, Jun 19, 2001 at 12:05:14PM -0700, Terry Lambert wrote:
> Use of zalloci() permits allocations to occur at interrupt,
> such as allocations for replacement mbuf's in receive rings.
> 
> It would be very difficult to maintain FreeBSD's GigaBit
> ethernet performance without this type of thing.

	Actually, mbuf and cluster allocations are not done with the zone
allocator. Similarly to the zone allocator, though, all of the KVA space is
present at initialization time: allocations are done from a dedicated submap
of kmem_map, which ensures that we have the required address space. The
allocations themselves go through kmem_malloc() on that mbuf map; if the
address space has been consumed, we return NULL immediately and make do with
the mbufs and clusters already allocated and in circulation, all via the
cache lists.
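	Roughly, the path from the map looks like this (just a sketch from
memory, with a made-up function name, and assuming the kmem_malloc(map,
size, flags) interface; the real code lives in the mbuf allocator):

#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/vm.h>
#include <vm/vm_kern.h>
#include <vm/vm_extern.h>

/*
 * Sketch: map one page's worth of objects from the dedicated mbuf
 * submap.  Because the submap's address space was reserved at init
 * time, kmem_malloc() either maps in a fresh page or tells us the
 * map is exhausted; in the latter case we return NULL right away
 * and live off whatever is already on the cache lists.
 */
static caddr_t
mb_map_grab_page(vm_map_t mbufmap, int how)
{
        caddr_t p;

        p = (caddr_t)kmem_malloc(mbufmap, PAGE_SIZE,
            how == M_WAIT ? M_WAITOK : M_NOWAIT);
        return (p);             /* NULL == address space consumed */
}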
	The reason mbuf and cluster allocations are fast, even at interrupt
time, is not only that the map allocations have all the KVA they need, but
mainly the use of the cache lists. The new allocator takes this a step
further by introducing per-CPU cache lists.
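	The fast path is basically this (all names invented for the example,
array size illustrative, and locking omitted; the real lists are protected
by mutexes):

/*
 * Sketch of the allocation fast path: try the current CPU's cache
 * list first, then the general list, and only then fall back to
 * mapping fresh pages from the mbuf submap.  Frees just push the
 * object back onto the per-CPU list, which is what keeps both
 * directions cheap even at interrupt time.
 */
struct mb_obj {
        struct mb_obj *next;
};

static struct mb_obj *mb_pcpu_list[32];         /* per-CPU caches */
static struct mb_obj *mb_gen_list;              /* general cache  */

static void *
mb_cache_alloc(int cpu)
{
        struct mb_obj *o;

        if ((o = mb_pcpu_list[cpu]) != NULL) {  /* per-CPU hit */
                mb_pcpu_list[cpu] = o->next;
                return (o);
        }
        if ((o = mb_gen_list) != NULL) {        /* general list hit */
                mb_gen_list = o->next;
                return (o);
        }
        return (NULL);          /* caller goes back to the map */
}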
	As for the GigaBit performance, well, it has even less to do with
the mbuf and cluster allocation code, as I'm sure Bill Paul will proudly
point out. :-) The gigabit drivers at this time all do their own jumbo
buffer allocations [they need physically contiguous multi-page buffers] and,
from what I've seen, they mostly use contigmalloc() [ick!]. Some of them
pre-allocate the large buffers with contigmalloc() so as to avoid running
into more and more allocation failures as memory becomes fragmented.
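	For example, what the pre-allocating drivers do at attach time looks
roughly like this (a sketch only: the buffer count and pool layout are made
up, and I'm writing the contigmalloc() arguments from memory):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/errno.h>

#define JUMBO_BUFSZ     9216    /* one 9K jumbo receive buffer        */
#define JUMBO_NBUFS     256     /* how many to set aside at attach    */

/*
 * Sketch: grab all the physically contiguous jumbo buffers once, up
 * front, so that we never have to call contigmalloc() later when
 * physical memory has already become fragmented.
 */
static int
jumbo_alloc_pool(caddr_t *pool)
{
        int i;

        for (i = 0; i < JUMBO_NBUFS; i++) {
                pool[i] = contigmalloc(JUMBO_BUFSZ, M_DEVBUF, M_NOWAIT,
                    0, 0xffffffff,      /* acceptable physical range  */
                    PAGE_SIZE,          /* alignment                  */
                    0);                 /* no boundary restriction    */
                if (pool[i] == NULL)
                        return (ENOMEM);
        }
        return (0);
}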

> One of the things that worries me about the new mbuf
> allocator is how it behaves with regard to lock inversion
> in a driver lock at interrupt time.  I'm not saying there
> is definitely a problem, but this is really tricky code,
> and the lock manager has poor deadlock avoidance
> characteristics when it comes to inversion, since it does
> not allocate locks onto a DAG arc that would permit cycle
> detection among N processes with N+1 (or more) locks.

	Have you seen the WITNESS code in -CURRENT?
 
> Because the allocations as a result of a zalloci() zone
> occur through the use of a page fault against a preallocated
> and contiguous KVA range, there's really very little, short
> of a full rewrite, which would permit allocations to still
> occur at interrupt, while at the same time ensuring that
> the zone remained recoverable.

	Well, as I said, we don't use zalloci() for mbufs and clusters (and
never have, in fact), but we still do have the contiguous KVA range.

> Frankly, with a number of minor modifications, and a bunch
> more INVARIANTS code to guard against inversion, we could
> allocate KVA space for mbufs, sockets, tcpcb's, and inpcb's
> (and udpcb's, though they are not as important to me), and
> have some possibility of recovering them to the system.

	Sockets do use the zone allocator, and the KVA space is
pre-allocated, as you say.
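	If I remember the vm_zone interface correctly, the socket zone is set
up along these lines (a sketch, not a verbatim quote of the source; the
ZONE_INTERRUPT flag is what gets the whole zone's KVA reserved up front so
that zalloci() can fault pages in even at interrupt time):

#include <sys/param.h>
#include <sys/socketvar.h>
#include <vm/vm_zone.h>

static vm_zone_t socket_zone;
extern int maxsockets;

static void
socket_zone_init(void)
{
        /* Reserve KVA for up to maxsockets sockets at boot time. */
        socket_zone = zinit("socket", sizeof(struct socket),
            maxsockets, ZONE_INTERRUPT, 0);
}

static struct socket *
socket_zone_alloc(void)
{
        /* Safe at interrupt time: the page faults into reserved KVA. */
        return (zalloci(socket_zone));
}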

> This would have the effect of rendering the memory no
> longer type stable, but if it meant we could continue to
> allocate at interrupt context, it would be worth having
> a cleaner going behind, emptying full buckets back to the
> system.

	As I previously mentioned, I shortly plan to introduce a kproc for
the mb_alloc system which, once the general cache list holds more than X
objects, will go ahead and free up Y pages' worth of objects from the
general list. This would allow the wired-down pages to be unwired (so we
would actually reclaim memory) while still keeping the pre-allocated KVA
space and without hampering the speed of mbuf and cluster allocations and
deallocations (in fact, since the kproc would only touch the general list,
both interrupt and non-interrupt allocations would likely still be occurring
even while the cleanup is being done).
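	In outline, the kproc would look something like this (purely a
sketch: mb_gen_count, the watermarks, and mb_free_to_map() are names I am
making up here, and the locking of the general list is left out):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>

extern int mb_gen_count;                /* objects on the general list  */
extern void mb_free_to_map(int pages);  /* unwire pages, give them back */

static int mb_reclaim_hiwat = 512;      /* "X": start reclaiming here   */
static int mb_reclaim_pages = 32;       /* "Y": pages freed per pass    */

/*
 * Sketch of the reclaim kproc: it only ever touches the general cache
 * list, so per-CPU allocations and frees keep running while it unwires
 * pages and hands them back to the VM system.
 */
static void
mb_reclaim_kproc(void *arg)
{
        for (;;) {
                /* Sleep until an allocator notices the list is large. */
                tsleep(&mb_gen_count, PVM, "mbreclaim", hz * 10);

                while (mb_gen_count > mb_reclaim_hiwat)
                        mb_free_to_map(mb_reclaim_pages);
        }
}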
 
> Many of these default "limitations" are intentional, both

	I agree.

> in terms of running out of KVA space (personally, I run
> with a 3G KVA space, which also limits user processes to
> 1G of space, which is opposite of the normal arrangement),
> and in terms of administration.
>
> Burning this space for zone allocations, you still need
> to come to a decision about what size to make each zone,
> given the limitations of zones, and the interrupt allocation
> requirement discussed above.

	Again, I agree. :-)

> From an administrative perspective, you have to make a
> trade-off on whether or not you can weather a denial of
> service attack which exploits a vulnerability, such as
> no default limitation on the number of sockets or open
> file descriptors a process is permitted to consume.  In
> having no limitations on this, you open yourself to
> failure under what, under ordinary circumstances, would
> have to be considered grossly abnormal loads.
> 
> I have done a number of Windows installs, and among other
> things, it will ask you to characterize the load you
> expect, which I am sure results in some non-defaults for
> a number of tuning parameters.
> 
> Similarly, it has opportunity to notice the network
> hardware installed: if you install a GigaBit Ethernet
> card, it's probably a good bet that you will be running
> heavy network services off the machine.  If you install
> SCSI disks, it's a pretty good bet you will be serving
> static content, either as a file server, or as an FTP
> or web server.
> 
> Tuning for mail services is different; the hardware
> doesn't really tell you that's the use to which you will
> put the box.
> 
> On the other hand, some of the tuning was front-loaded
> by the architecture of the software being better suited
> to heavy-weight threads implementations.  Contrary to
> their design claims, they are effectively running in a
> bunch of different processes.  Linux would potentially
> beat NT on this mix, simply because NT has more things
> running in the background to cause context switches to
> the non-shared address spaces of other tasks.  Put the
> same test to a 4 processor box with 4 NIC cards, and I
> have no doubt that an identically configured NT box will
> beat the Linux box hands down.
> 
> 
> A common thread in these complaints, that the results
> were somehow "FreeBSD's fault" rather than the fault of
> the tuning and architecture of the application being run,
> is, frankly, ridiculous.

	I completely agree. :-)))
 
> -- Terry

Cheers,
-- 
 Bosko Milekic
 bmilekic@technokratis.com





