From owner-freebsd-hackers Fri Apr 11 02:15:43 1997
Date: Fri, 11 Apr 1997 04:45:46 -0400
Message-Id: <199704110845.EAA00936@jenolan.caipgeneral>
From: "David S. Miller"
To: netdev@roxanne.nuclecu.unam.mx
Subject: while I'm nursing a kernel compile or two...
Sender: owner-hackers@FreeBSD.ORG

I still think socket demultiplexing can be done much better. Here are
some things in my head:

1) Dynamic hash table growth, at least for TCP. This is an old trick.
   It would require a tcp_htbl_size to keep track of how big we are,
   and a function to rehash into the new table. The table only grows
   and never shrinks, and it only grows by powers of two. The limit
   on its max size is based upon the amount of RAM in the machine.
   All straightforward stuff.

2) Slightly more intricate. The main bound hash is two level. Second
   level hashes are only allocated and hooked into the top level when
   they are first needed, but are never destroyed. Destruction would
   require timers and a garbage collector, which does not seem to be
   worth it.

3) A bit more hairy. View the socket identity as a 96 bit key (as it
   really is). Use this and a DP trie and digital searches to look up
   sockets. The main reason this approach turns me on is that DP
   tries are specifically designed for longest matching prefix
   searches. In particular we can arrange the bits in the "big key"
   to have local port and local addr at the front, which makes
   listening sock lookup (which looks in this case like a default
   route ;-) much quicker even with huge numbers of connections.

   This scheme also can be proven to have a given response time with
   a given set of sockets. DP tries are specifically designed such
   that the layout of the tree is dependent upon "what's" in the tree,
   not "how it got there". That is, no matter the order of insertions
   and deletions, for that given set of sockets the trie will always
   be the same.

   One lose is that to do a lookup you have to put the socket
   identity elements from the header on the local stack, to arrange
   the bits correctly next to each other for the search.
   Also, although insertion and deletion are done in constant time,
   I am still not certain this is not a "considerable" amount of
   constant time (i.e. too slow ;-)

4) Really far out (this is an ANK original ;-) Stick small hashes
   (perhaps 2 level) into the destination cache entries. We need to
   get at the dst cache entry for all outgoing and incoming packets
   we care about anyways, thus it is very cheap to sprinkle the mini
   hash tables into there and look them up this way.

   Off the top of my head there is only one bad case, and
   unfortunately this is the case that the benchmarks tend to test:
   a few machines (thus a small number of dst cache entries) making
   thousands upon thousands of connections to us. This is bad because
   the mini hashes in the dst cache entries would be overloaded
   entirely. It might be offset entirely if we used a two level
   scheme in the mini-hashes (8 entries top level, perhaps 16 in the
   second level).

   The only other problem this one might present is that a few extra
   dst cache lookups will occur when we need to do a socket lookup
   and we have no other reason (currently) to have the dst entry
   handy already.

For 3 and 4 the TCP bound hash would need to remain, as I cannot think
of any other way to perform those operations more efficiently.

B-trees look interesting for this application, but I still have to
study some of the worst/average case analysis for those. The problem
with socket lookups is that you hit the worst case during these high
stress situations. This is why hashes are so appealing for socket
demultiplexing: "are the chains getting too long? just make the table
bigger" etc.

Just a brain dump, continue hacking...

---------------------------------------------////
Yow! 11.26 MB/s remote host TCP bandwidth & ////
199 usec remote TCP latency over 100Mb/s   ////
ethernet.  Beat that!                     ////
-----------------------------------------////__________  o
David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><
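
For reference, idea 1 above (a grow-only, power-of-two hash with a
rehash function) could look roughly like the sketch below. This is
only an illustration, not code from any kernel tree: the struct and
function names (sock_key, sock_htbl, htbl_grow and friends), the bit
mixing hash, the "grow once the average chain is longer than one"
trigger, and the explicit max_size argument standing in for the
RAM-based limit are all assumptions. Locking is left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 96-bit socket identity (addresses and ports). */
struct sock_key {
    uint32_t laddr, raddr;
    uint16_t lport, rport;
};

struct sock_ent {
    struct sock_key key;
    struct sock_ent *next;          /* hash chain link */
};

struct sock_htbl {
    struct sock_ent **bucket;
    size_t size;                    /* always a power of two */
    size_t count;                   /* entries currently hashed */
    size_t max_size;                /* assumed cap, e.g. derived from RAM */
};

/* Arbitrary mixing function, chosen only for the example. */
static uint32_t sock_hash(const struct sock_key *k)
{
    uint32_t h = k->laddr ^ k->raddr ^ ((uint32_t)k->lport << 16) ^ k->rport;
    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h;
}

static int htbl_init(struct sock_htbl *t, size_t size, size_t max_size)
{
    assert(size != 0 && (size & (size - 1)) == 0);  /* power of two */
    t->bucket = calloc(size, sizeof(*t->bucket));
    if (!t->bucket)
        return -1;
    t->size = size;
    t->count = 0;
    t->max_size = max_size;
    return 0;
}

/* Double the table and rehash every chain into it. Never shrinks. */
static int htbl_grow(struct sock_htbl *t)
{
    size_t nsize = t->size * 2;
    struct sock_ent **nb;

    if (nsize > t->max_size)
        return -1;
    nb = calloc(nsize, sizeof(*nb));
    if (!nb)
        return -1;
    for (size_t i = 0; i < t->size; i++) {
        struct sock_ent *e = t->bucket[i];
        while (e) {
            struct sock_ent *next = e->next;
            size_t b = sock_hash(&e->key) & (nsize - 1);
            e->next = nb[b];
            nb[b] = e;
            e = next;
        }
    }
    free(t->bucket);
    t->bucket = nb;
    t->size = nsize;
    return 0;
}

static void htbl_insert(struct sock_htbl *t, struct sock_ent *e)
{
    size_t b;

    /* Assumed trigger: grow once the average chain would exceed one. */
    if (t->count >= t->size)
        (void)htbl_grow(t);         /* on failure we just keep longer chains */
    b = sock_hash(&e->key) & (t->size - 1);
    e->next = t->bucket[b];
    t->bucket[b] = e;
    t->count++;
}

static struct sock_ent *htbl_lookup(const struct sock_htbl *t,
                                    const struct sock_key *k)
{
    struct sock_ent *e = t->bucket[sock_hash(k) & (t->size - 1)];

    for (; e; e = e->next)
        if (e->key.laddr == k->laddr && e->key.raddr == k->raddr &&
            e->key.lport == k->lport && e->key.rport == k->rport)
            return e;
    return NULL;
}
```

Because the size is always a power of two, the bucket index is a plain
mask of the hash, and growing is "allocate double, walk the old chains,
relink". If the grow fails (cap reached or no memory) the insert still
succeeds and the chains simply get longer.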