From: Kip Macy
To: Andre Oppermann
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Robert Watson
Date: Sun, 19 Apr 2009 10:13:37 -0700
Subject: Re: svn commit: r191259 - head/sys/netinet

On Sun, Apr 19, 2009 at 3:18 AM, Andre Oppermann wrote:
> Robert Watson wrote:
>>
>> On Sun, 19 Apr 2009, Kip Macy wrote:
>>
>>> Author: kmacy
>>> Date: Sun Apr 19 04:44:05 2009
>>> New Revision: 191259
>>> URL: http://svn.freebsd.org/changeset/base/191259
>
> I have another question on the flowtable: What is the purpose of it?
> All router vendors have learned a long time ago that route caching
> (aka flow caching) doesn't work out on a router that carries the DFZ
> (default free zone, currently ~280k prefixes). The overhead of managing
> the flow table and the high churn rate make it much more expensive than
> a direct and already very efficient radix trie lookup. Additionally a
> well connected DFZ router has some 1k prefix updates per second. More
> information can be found for example at Cisco here:
>  http://www.cisco.com/en/US/tech/tk827/tk831/technologies_white_paper09186a00800a62d9.shtml
> The same findings are also available from all other major router vendors
> like Juniper, Foundry, etc.
>
> Let's examine the situations:
>  a) internal router with only a few routes; The routing and ARP table
>    are small, lookups are very fast and everything is hot in the CPU
>    caches anyway.
>  b) DFZ router with 280k routes; A small flow table has constant thrashing
>    becoming negative overhead only.
>    A large flow table has a high maintenance
>    overhead, higher lookup times and still a significant amount of thrashing.
>    The overhead of the flow table is equal or higher than a direct routing
>    table lookup.
> Concluding that a flow table is never a win but a liability in any realistic
> setting.

You're assuming that a Cisco- / Juniper-class workload is representative
of where FreeBSD is deployed. I agree that FreeBSD is sub-optimal for
large routing environments for a whole host of other reasons. A better
question is what "typical" FreeBSD deployments are, and how well it
would work there. The flowtable needs to be sized to correspond to the
number of flows; its utility rapidly diminishes as the number of
collisions per bucket increases. The number of routes isn't the key
metric; it is the number of flows active within a 30 second period. On
current hardware we probably could not handle more than a couple of
million concurrent flows (with a 4 million entry hash table).

> Now I don't have benchmark numbers to back up the theory I put forth here.
> However I want to bring up the rationale for why nobody else is doing it.
> A statistical analysis easily shows that flow caching has only a few small
> spots where it may offer some advantage over direct routing table lookups;
> none of them are where it matters in real work situations.

I can't argue with you, because you have not adequately characterized
"real" work situations. I know that it is useful for the commercial
environments with which I am familiar.

> As our kernel currently stands an advantage of the flow table can certainly
> be demonstrated for a small routing table and a small number of flows. This
> is due to a very sub-optimal routing table implementation we have. The flow
> table approach short-cuts a significant number of locking operations
> (routing table, routing entries, ARP table and possibly some more).
> On the other hand this caching of flows and pointers to routing entries
> and ARP entries complicates updates to these tables and potentially makes
> them very expensive.

Incorrect. The implementations of the routing and ARP tables are
unchanged by the inclusion of the flowtable. Any complexity in their
implementations is completely decoupled from the flowtable. If their
implementations change, the flowtable will follow suit.

> Additionally it creates a "tangled mess" again complicating future changes
> and advances in those areas (unless the flow table were simply removed
> again at that point).

The two will remain separate; please do not confuse matters.

> I argue that instead of kludging around (the flow table) a sub-optimal part
> of the network stack (the current incarnation of the routing table) time
> could be equally spent wiser on fixing the problems in the first place. I've
> outlined a few approaches a couple of times before on the mailing lists. If
> the routing table would no longer support direct pointers to entries the
> locking could be significantly simplified and the ARP table could use rmlocks
> (read-mostly locks) as it is changed only very infrequently. It's all about
> the number of locks that have to be acquired per packet/lookup. It also has
> the benefit of an order of magnitude less complexity (and hard to debug
> edge cases, which cannot be underestimated).

In principle the ARP table could use rmlocks now. For the routing
table, you think we should copy the rtentry out? I agree that the locked
ref counting of rtentrys has ridiculously high overhead and would love
to see that go away. The one major concern that I had when looking at
doing that was the need to ensure continued liveness of the structures
pointed to by the rtentry.

-Kip
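
The sizing point in the reply (utility drops as collisions per bucket rise, and what matters is the number of concurrently active flows, not routes) can be made concrete with a little arithmetic. What follows is only a back-of-the-envelope sketch, not the flowtable code: it assumes a chained hash table in which every flow hashes to a uniformly random bucket, and the function names are made up for illustration.

```python
def load_factor(flows: int, buckets: int) -> float:
    """Average number of active flows per hash bucket."""
    return flows / buckets


def colliding_fraction(flows: int, buckets: int) -> float:
    """Probability that a given flow shares its bucket with at least one
    other flow, assuming uniformly random hashing (an idealization)."""
    if flows < 1:
        return 0.0
    return 1.0 - (1.0 - 1.0 / buckets) ** (flows - 1)


def expected_probes_on_hit(flows: int, buckets: int) -> float:
    """Expected number of entries examined to find a cached flow in a
    chained bucket: the entry itself plus, on average, half of the other
    entries that landed in the same bucket."""
    if flows < 1:
        return 0.0
    return 1.0 + (flows - 1) / (2.0 * buckets)


if __name__ == "__main__":
    buckets = 4_000_000  # the "4 million entry hash table" from the mail
    for flows in (500_000, 2_000_000, 8_000_000):
        print(
            flows,
            round(load_factor(flows, buckets), 3),
            round(colliding_fraction(flows, buckets), 3),
            round(expected_probes_on_hit(flows, buckets), 3),
        )
```

Under these assumptions, 2 million flows in 4 million buckets give a load factor of 0.5, roughly 39% of flows share a bucket with another flow, and a hit examines about 1.25 entries on average. Pushing the same table to 8 million flows doubles the expected probes, which is the "rapidly diminishes" effect described above.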