From owner-freebsd-net Tue Oct 23 11: 6:31 2001 Delivered-To: freebsd-net@freebsd.org Received: from iguana.aciri.org (iguana.aciri.org [192.150.187.36]) by hub.freebsd.org (Postfix) with ESMTP id 6903437B406; Tue, 23 Oct 2001 11:06:26 -0700 (PDT) Received: (from rizzo@localhost) by iguana.aciri.org (8.11.3/8.11.1) id f9NI37H34813; Tue, 23 Oct 2001 11:03:07 -0700 (PDT) (envelope-from rizzo) Date: Tue, 23 Oct 2001 11:03:07 -0700 From: Luigi Rizzo To: net@freebsd.org Subject: performance issues with M_PREPEND on clusters Message-ID: <20011023110307.A34494@iguana.aciri.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.23i Sender: owner-freebsd-net@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Hi, don't know what is the right forum to discuss this, but this is certainly one. I have also Bcc-ed some people who might need to be involved. In the mbuf code, M_LEADINGSPACE always returns 0 when M_EXT is set, instead of calling the second part of M_WRITABLE to check whether there is a chance of writing into the cluster. This means that both M_PREPEND() and m_pullup() will have to allocate a second mbuf when they are invoked on a cluster, which these days is basically true for most of the incoming traffic (several network drivers pass up the full packet in a cluster). Why do we care ? Well, the additional work is not much (~2us per packet on a 750MHz box, say 10us on a slow box), but on boxes acting as routers this might in turn trigger further slowdown with various NICs which have performance or functional problems in doing scatter-gather I/O (the 21143 is one of those, and some of the NICs supported by the "dc" driver also require an additional copy when the packet being output is split across multiple buffers). A quick fix to this problem is to change M_LEADINGSPACE as shown in the attached patch (for -stable), so that M_PREPEND can write into the cluster if this is not shared. This change alone improved the peak forwarding rate (in packets per second) by 15-20% on a system we have here using the "dc" driver. Similar things could be done in m_pullup() to avoid the extra allocation. I have quickly discussed this with Bosko, and he suggests that M_LEADINGSPACE should not care if the buffer is writable or not, and only report how much room is available in front of the buffer, doing somewhere else the test for writability. While I agree with this view, it is also true that as it is now, M_LEADINGSPACE always returns 0 if the buffer is not writable, and some of the places where it is used (e.g. M_PREPEND) depend on this feature. So i would feel a bit uncomfortable in removing the writability test from M_LEADINGSPACE, whereas the change I suggest below seems to be reasonably safe. Any objections in importing this in -stable after adequate testing by some volunteers ? I do not think that testing in current would help for this change, because the mbuf handling code is slightly different there (as a matter of fact, the mbuf code in -stable is quite ugly, e.g. there is no locking when accessing the refcounts for mbufs). cheers luigi Index: sys/mbuf.h =================================================================== RCS file: /home/xorpc/u2/freebsd/src/sys/sys/mbuf.h,v retrieving revision 1.44.2.10 diff -u -r1.44.2.10 mbuf.h --- sys/mbuf.h 2001/07/03 11:02:01 1.44.2.10 +++ sys/mbuf.h 2001/10/20 21:18:04 @@ -473,8 +473,11 @@ /* * Check if we can write to an mbuf. */ +#define M_EXT_WRITABLE(m) \ + ((m)->m_ext.ext_free == NULL && mclrefcnt[mtocl((m)->m_ext.ext_buf)] == 1) + #define M_WRITABLE(m) (!((m)->m_flags & M_EXT) || \ - ((m)->m_ext.ext_free == NULL && mclrefcnt[mtocl((m)->m_ext.ext_buf)] == 1)) + M_EXT_WRITABLE(m) ) /* * Compute the amount of space available @@ -482,7 +485,7 @@ */ #define M_LEADINGSPACE(m) \ ((m)->m_flags & M_EXT ? \ - /* (m)->m_data - (m)->m_ext.ext_buf */ 0 : \ + (M_EXT_WRITABLE(m) ? (m)->m_data - (m)->m_ext.ext_buf : 0): \ (m)->m_flags & M_PKTHDR ? (m)->m_data - (m)->m_pktdat : \ (m)->m_data - (m)->m_dat) =================================================================== ----------------------------------+----------------------------------------- Luigi RIZZO, luigi@iet.unipi.it . ACIRI/ICSI (on leave from Univ. di Pisa) http://www.iet.unipi.it/~luigi/ . 1947 Center St, Berkeley CA 94704 Phone: (510) 666 2927 ----------------------------------+----------------------------------------- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-net" in the body of the message