Date: Fri, 28 Feb 1997 21:26:57 -0500
From: "David S. Miller" <davem@jenolan.rutgers.edu>
To: netdev@roxanne.nuclecu.unam.mx
Subject: overview of socket hashing changes, plus new patch
Message-ID: <199703010226.VAA08431@jenolan.caipgeneral>
Eric Schenk pointed out some problems to me that caused IP transparent proxy to not even compile; this is believed to be fixed and it should actually work.  Fetch the new patch from:

	vger.rutgers.edu:/pub/linux/Net/patches/fordyson2.diff.gz

A brief overview of the changes, with some insight into my design decisions where applicable:

1) Anything referring to SOCK_ARRAY_SIZE, and the code which used it, was essentially just deleted.

2) Socket demultiplexing, I determined, is a per-protocol problem.  Therefore I added the following methods to struct proto:

	void hash(struct sock *sk)

		This is where a socket is added to a protocol's lookup tables.  The method by which this is done, and the tables used, are completely opaque to the af_inet/af_inet6 code; it really doesn't care how you arrange things.  For example, if you wanted to use a DP trie for TCP sockets (I considered this approach, but decided against it, see below) the callers simply would not care.

	void unhash(struct sock *sk)

		This removes a sock from the protocol's lookup tables.

	void rehash(struct sock *sk)

		This gets called when the "identity" of a sock has changed in some way, yet it still exists and lookups should find it just fine.  The theory here is that a quick routine needs to be present for when we just change the address/port elements of a sock, since the hash is probably different afterwards.

	unsigned short good_socknum(unsigned short base)

		The idea here is that a protocol knows best how to choose the best possible port number.  For example, it can exploit some properties of its lookup tables and thus choose a port which will seed better in those tables.

	int verify_bind(struct sock *sk, unsigned short snum)

		This just verifies that attaching SK to the passed local port is "ok" and doesn't conflict with any other existing connection in that protocol.
To aid in the support of these things, two elements were added to struct sock:

	struct sock **hashtable;
	int hashent;

This allows the af_inet code to do things more freely, and only forces it to call the rehash routine after all of its identity changes have been completed.  This way the per-protocol hashing code, during a rehash, can easily find where the sock was beforehand even though the new hash computation might be different.

Alas, it was found that there still needed to exist a way to walk the entire socket list.  But happily, this is only needed in one place, the procfs stuff.  So who cares ;-)  This was implemented by adding:

	struct sock *sklist_next, *sklist_prev;

to the front of both struct sock and struct proto.

3) Socket hash table protection.  It is very simple: when things need to be changed, a BH atomic section is entered.  For the lookup code, during incoming packet processing, since you know you are running within a BH you can safely avoid even doing this and run full speed with no locking.  Eric Schenk has informed me that he thinks this can be made to scream for fine-grained SMP as well.  This is good.

4) The af_inet code was fast-pathed, and some special cases were completely removed.  The code is generally much cleaner now and easier to understand.  In particular, the RAW protocols for v4 and v6 were given their own bind method, eliminating a lot of excess comparisons in the af_inet bind code.

The implementations:

SOCK_RAW

	They did not need the full 256-entry hash tables they previously had.  The size of the hash was chosen to be MAX_INET_PROTOS; this simplified and sped up raw packet delivery.  The hash is on sk->num.  Simple, stupid, nothing fancy necessary at all.

SOCK_PACKET

	No lookups, in the sense we are discussing here, are ever necessary.  Therefore this protocol lacks a table, and also lacks the hash methods etc.

SOCK_UDP

	The algorithm is the same as before: hash on the local port number.
For UDP the one-behind sock cache was retained, since for UDP there are two things to exploit:

1) Nearly all UDP sockets on a system are wildcarded out the wazoo, so hashing on anything other than the local port will only make it more complex and cause the lookups to take longer.

2) Numerous research papers and studies have shown that the one-behind UDP socket cache can have hit rates approaching 80%; this is the main reason it was in fact retained.

SOCK_TCP

	Ok, this one is a doozy, and it was the incentive behind all of this work I did, hold on to your seats.  There are some nice properties about TCP which you can exploit, in particular:

	1) The overwhelming majority of TCP sockets have a full identity, that is, they are completely bound to a unique [local address, local port], [foreign address, foreign port] identity.  Furthermore, the only TCP sockets which possess wildcards and can get hit during a demultiplex for an incoming packet are those in the listening state; we exploit this heavily.

	2) And for these listening connections, due to limitations in the BSD sockets API, only the local address and port can be non-wildcard.  The foreign port and address can never be bound to anything for a listening TCP socket, so don't even bother checking for them.  Note the protocol does allow for a listening socket to be fully bound, yet as stated the API lacks any way in which you can actually do this.

With that in mind, there are three hash tables for TCP sockets: one for LISTENING sockets, one for sockets with a full identity yet not dead/TCP_CLOSE, and one further hash whose purpose will be described shortly.

Sockets in the full-identity hash are hashed based upon all four of laddr, lport, faddr, fport.  This seems to hash the tables rather nicely.  Listening sockets are hashed based only upon the local port, since this is the only thing we can guarantee is not wildcarded; in fact, as just stated, we know that the foreign port and address are wildcarded.
The algorithm for incoming socket lookup is essentially:

1) Try to find a perfect match in the established hash table.

2) If nothing is found, look in the listening table for the closest match.

Step (1) need only check for perfect matches (that is all it could ever find in there!), and step (2) need only prioritize a lookup by "local port and local address match" over "local port match, yet local address wildcarded".  And this is all you need to check.

Possible improvement for the listening lookups: when we add a socket to the listening hash (a rare occurrence, it does not happen all that often, except for ftp clients in non-passive mode) we look for a "perfect listener" on our selected port.  If one is found, we place ourselves after him.  The lookup code for the listening case can thus be simplified to return the first socket where the local port matches, and this will always be correct due to the way in which we have laid things out.

Ok, that made incoming packet demultiplexing scale extremely well.  Let's see how I covered some other issues where performance sucked previously.

TCP timers: one was tricky to solve, the other was simple.  Every 75 seconds the keepalive timer goes off and previously (FreeBSD does this as well, but much worse, I'll describe this in massive detail later) the entire socket list was walked to find TCPs which need a keepalive probe sent.  The fix here is to walk the established hash chains 4 times every 75 seconds, each time only examining 1/4 of the chains.  1/4 was chosen because, for a HZ of at least 100, this has no loss of accuracy whatsoever.  Also note how only the established hash is examined; listening sockets do not have keepalives go off, which saves a bunch of stupid checks.

Second issue, the SYN reception timeout.  Again, we previously walked the entire socket list when this fired off.  No longer!
Only listening sockets need to be checked (and as I mentioned, this is an extremely small portion of the TCP socket space on a machine), so we only walk the TCP listening hash table.  This case was a piece of cake to deal with. ;-)

I will describe the good_socknum and bind verification techniques at some other point in time.  It is quick, but there are some bad corner cases I'd like to eliminate before I explain my intentions ;-)  Suffice to say, though, that the third hash not yet described helps make these operations go like smoke.

Comparison with FreeBSD for TCP (I love this, in fact once you read this you will wonder where their claims of scalability even come from):

1) Socket demultiplexing:

	FreeBSD - A single hash for the entire TCP socket space.  Listening sockets pollute the established connections.  The hash function involves a mod instruction because the table size is a prime.  There is extraneous testing in the hash lookup because this code, in in_pcb.c, must be generic and cannot exploit the nice properties present in TCP.

	Linux - A dual hash; listening sockets do not pollute the perfectly unique sockets.  For the listening case, only the necessary pieces of the socket identity are examined to determine a match.  The hash table size is a power of 2, eliminating the need for a costly multi-cycle modulo instruction.

(Some minutiae: I have approximately 50 or so samples from some of the largest ftp and web servers that I know about (some of these samples have 5,000 or more connections), listing the socket connections during high load.
A comparison of the current bound-hash function under Linux and the one under FreeBSD shows them to distribute about the same, if not better for Linux (in particular, Linux is better on machines which run many clients as well as large single-port servers).  The prime size of the hash does not seem to help FreeBSD, and they eat the multi-cycle modulo instruction on top of it all, so stupid.)

2) Timers:

	FreeBSD - Every 500ms, and every 200ms, the _ENTIRE_ TCP socket list is walked, yes, in its entirety.  How the fuck does this scale?

	Linux - Most TCP timers go off with the socket in question as an argument to the timer.  And due to Finne Gangstad's scalable timer implementation, the setting and deletion of these timers are near zero overhead.  This also has the nice effect that our timers are much more accurate than BSD's.  (This avoids the retransmit "explosions of death" every 200ms as seen on large BSD-derived servers.)  For the timers which need not be that accurate, we either walk the small listener hash chains every half second, or every (75 / 4) seconds we walk (128 / 4) of the established hash chains.  Now those are scalable TCP timers, beat that. ;-)

3) Port selection, and bind verification.

	I will discuss this issue at a later time; suffice to say that for FreeBSD they essentially walk the entire socket list for both operations, looking for a match/miss.

---------------------------------------------////
Yow! 11.26 MB/s remote host TCP bandwidth & ////
199 usec remote TCP latency over 100Mb/s   ////
ethernet. Beat that!                      ////
-----------------------------------------////__________ o
David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><