Date: Mon, 25 Nov 2002 11:31:39 -0500 (EST)
From: Robert Watson
To: Andrew Gallatin
Cc: Luigi Rizzo, current@freebsd.org
Subject: Re: mbuf header bloat ?
In-Reply-To: <15840.8629.324788.887872@grasshopper.cs.duke.edu>

On Sat, 23 Nov 2002, Andrew Gallatin wrote:

> I propose that we make struct label portion of the pkthdr compile-time
> conditional on MAC. The assumption is that you will move the MAC label
> to an m_tag sometime after 5.0-RELEASE.

This weekend I spent about six hours looking at what it would take to move MAC label data into m_tags. While in theory it is a workable idea, it turns out our m_tag implementation is fairly far from being ready to handle something like this. I ran into the following immediate problems:

(1) When packet headers are copied using m_copy_pkthdr(), different consumers have different expectations for what the resulting semantics are for m_tag data -- some want it duplicated, others want it moved.
In practice, it is only ever moved, so consumers that expect duplication are in for a surprise. We need to re-implement the packet header copying code so that it can generate a failure (because it involves allocation), and separate the duplicate and move abstractions to get clean semantics. I exchanged some e-mail with Sam Leffler on the topic, and apparently OpenBSD has already made these changes, or similar ones, and we should do the same for 5.1.

(2) m_tags don't have a notion that the data carried in a tag is multi-dimensional and may require special destructor behavior. While the m_tag code does centralize copying and freeing of data, it handles this purely with bcopy, malloc, and free, which is not appropriate for use with MAC labels, since they may contain a variety of per-policy data that may require special handling (reference count management, etc). I tried putting tag-specific release/free/... code in the m_tag central routines, and it looks like that would work, although it eventually would lead to a lot of junk in the m_tag code. We might want to consider m_tag entry free/copy pointers of some sort, but I'm not sure if we want to go there. Adding the MAC stuff to the m_tag_{free,copy,...} calls won't break the ABI, whereas adding free and copy pointers to the tags themselves would.

(3) Not all code generating packets properly initializes the m_tag field. The one example I've come across so far is the use of m_getcl() to grab mbufs with an attached cluster -- it never initializes the SLIST properly, as far as I can tell. Right now that's used in device drivers, and also in the BPF packet generation code. If the header is zeroed, this may be OK due to an implicit proper initialization, but this is concerning. We need to do more work to normalize packet handling.

(4) Code still exists that improperly hand-copies mbuf packet header data. if_loop.c contains some particularly bogus code that also triggers a panic in the MAC code for the same reason.
If m_tag data ever passes through if_loop and hits the re-alignment case introduced by KAME, the system will panic when the tag data is freed. This code all needs to be normalized, and proper semantics must be enforced.

> This will immediately reduce the size of mbufs for the vast majority of
> users, and will prevent a 4.1.1 like flag-day for 3rd party network
> driver vendors. The only downside is that the few MAC users will not be
> able to use 3rd party binary network drivers until the MAC label is put
> into an m_tag. This seems fair, as the only people inconvienced are the
> people who want the labels and they are motivated to move them to an
> m_tag. But that's easy for me to say, since I don't run MAC, and I may
> be missing something big.

In the past I have looked at adding conditionally-defined components to struct mbuf and other key kernel data structures. While the condition of the tree is improving from this perspective due to better isolation of user and kernel data structures, the result is still incredibly messy, especially if you key the conditionally defined sections on a kernel option. mbuf.h is included in a number of userland applications -- some expected, such as the ipfilter test framework, and others less expected, such as BIND. I'm very wary of the notion of adding conditionally defined portions of struct mbuf on this basis (or any other). I'll take a look at whether many of the obvious foot-shooting scenarios still exist since I last tried it.

Moving to m_tag looks like a reasonable long-term strategy, but until the m_tag code is substantially more mature, it isn't realistic. Otherwise, I might have attempted to push through a change to it now before RC1.

BTW, do you have any recent large-scale measurements of packet size distribution? In local tests and measurements, the additional 20 bytes on i386 didn't bump the remaining mbuf data space sufficiently low to substantially change the behavior of the stack.
However, I haven't done measurements against the 64-bit variation. In practice, a number of network interfaces now seem to use clustered mbufs and not attempt to use the in-mbuf storage space... All my packet distribution measurements come from a typical ISP environment, but may not match what is seen in large-scale backbone environments.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories
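
To make points (1) and (2) above concrete, here is a minimal userland sketch of what separate "move" and "duplicate" operations with optional per-tag copy/free hooks might look like. It is not the FreeBSD m_tag code: struct tag, struct hdr, hdr_move_tags(), hdr_dup_tags(), and the ref_copy()/ref_dtor() hooks are all hypothetical names invented for illustration, and a hand-rolled list stands in for the SLIST in the real packet header. The point is only that move cannot fail, while duplicate allocates and so has to be able to report failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tag; NOT the real struct m_tag layout. */
struct tag {
    struct tag *next;
    size_t len;
    void *data;
    /* Optional hooks; NULL means plain memcpy()/free() is sufficient. */
    int (*copy)(void *dst, const void *src, size_t len);
    void (*dtor)(void *data, size_t len);
};

/* Hypothetical packet header; only the tag list matters here. */
struct hdr { struct tag *tags; };

/* Every header must start with an explicitly empty list (cf. point 3). */
void hdr_init(struct hdr *h) { h->tags = NULL; }

struct tag *tag_alloc(size_t len)
{
    struct tag *t = calloc(1, sizeof(*t));
    if (t == NULL)
        return NULL;
    t->data = calloc(1, len ? len : 1);
    if (t->data == NULL) { free(t); return NULL; }
    t->len = len;
    return t;
}

void tag_prepend(struct hdr *h, struct tag *t) { t->next = h->tags; h->tags = t; }

void hdr_free_tags(struct hdr *h)
{
    struct tag *t;
    while ((t = h->tags) != NULL) {
        h->tags = t->next;
        if (t->dtor != NULL)
            t->dtor(t->data, t->len);  /* e.g. drop a reference */
        free(t->data);
        free(t);
    }
}

/* Move: no allocation, cannot fail; the source ends up with no tags. */
void hdr_move_tags(struct hdr *to, struct hdr *from)
{
    assert(to != from);
    hdr_free_tags(to);
    to->tags = from->tags;
    from->tags = NULL;
}

/* Duplicate: allocates, so it may fail. Returns 0, or -1 with 'to' empty. */
int hdr_dup_tags(struct hdr *to, const struct hdr *from)
{
    struct tag **tail;
    const struct tag *s;

    assert(to != from);
    hdr_free_tags(to);
    tail = &to->tags;
    for (s = from->tags; s != NULL; s = s->next) {
        struct tag *d = tag_alloc(s->len);
        if (d == NULL)
            goto fail;
        d->copy = s->copy;
        d->dtor = s->dtor;
        if (s->copy == NULL) {
            memcpy(d->data, s->data, s->len);
        } else if (s->copy(d->data, s->data, s->len) != 0) {
            free(d->data);             /* hook failed: nothing to release */
            free(d);
            goto fail;
        }
        *tail = d;                     /* append, preserving tag order */
        tail = &d->next;
    }
    return 0;
fail:
    hdr_free_tags(to);                 /* undo the partial duplicate */
    return -1;
}

/* Example hooks: the payload is an int * pointing at a shared refcount. */
int ref_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    (**(int **)dst)++;                 /* the copy holds its own reference */
    return 0;
}

void ref_dtor(void *data, size_t len)
{
    (void)len;
    (**(int **)data)--;
}
```

With hooks like these, duplicating a header takes a second reference on the shared object and freeing either copy drops exactly one, whereas moving a header leaves the count unchanged. Whether the hooks live in each tag (as here) or in a central switch inside the tag routines is the ABI trade-off raised in (2); the sketch takes the per-tag route only because it is shorter to show.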