From owner-freebsd-net Fri May 1 21:33:14 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id VAA03251 for freebsd-net-outgoing; Fri, 1 May 1998 21:33:14 -0700 (PDT) (envelope-from owner-freebsd-net@FreeBSD.ORG)
Received: from khavrinen.lcs.mit.edu (khavrinen.lcs.mit.edu [18.24.4.193]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id VAA03244 for ; Fri, 1 May 1998 21:33:06 -0700 (PDT) (envelope-from wollman@khavrinen.lcs.mit.edu)
Received: (from wollman@localhost) by khavrinen.lcs.mit.edu (8.8.8/8.8.8) id AAA10671; Sat, 2 May 1998 00:32:47 -0400 (EDT) (envelope-from wollman)
Date: Sat, 2 May 1998 00:32:47 -0400 (EDT)
From: Garrett Wollman
Message-Id: <199805020432.AAA10671@khavrinen.lcs.mit.edu>
To: Chris Csanady
Cc: freebsd-net@FreeBSD.ORG, jtw@lcs.mit.edu
Subject: Fast IP forwarding
In-Reply-To: <199805020229.VAA04136@friley585.res.iastate.edu>
References: <199805012043.QAA09515@khavrinen.lcs.mit.edu> <199805020229.VAA04136@friley585.res.iastate.edu>
Sender: owner-freebsd-net@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

< said:

> I have included the ip_input() function from Van Jacobsons experimental
> stack as a reference. It uses an extremely simple cache, that requires
> little memory and only a handful of instructions that manage it. This
> code should actually have *very* low overhead even in the common case.

Wow, that's a bit more complicated than I would have expected from
Van. Here's a slightly different take on the same concept... with
some comments from eighteen months after the fact. (Some wild
mutation of this may have eventually made it into the CAIRN group's
sources---JTW certainly did have a copy of it when I left the fifth
floor.)

Note that, in a real router, you probably don't want to use a cache at
all, but rather a specialized data structure that can compactly store
the entire routing table and access it directly with a minimum of
instructions. Or at least that's what I last heard, anyway.
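To make the contrast with a per-destination cache concrete, here is a
minimal sketch of the "keep the whole table and look it up directly"
idea, using a plain uncompressed binary trie for longest-prefix match.
It is only an illustration: it is neither compact nor especially fast,
it is not any particular published structure, and nothing in it comes
from the FreeBSD sources. The names lpm_node, lpm_new, lpm_insert and
lpm_lookup are hypothetical, addresses are assumed to be in host byte
order, and next hops are plain integers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types and names; nothing here is from the FreeBSD tree. */
struct lpm_node {
    struct lpm_node *child[2];  /* indexed by the next address bit */
    int has_route;              /* nonzero if a prefix ends at this node */
    int nexthop;                /* illustrative next-hop identifier */
};

/* Allocate a zeroed node: no children, no route. */
static struct lpm_node *
lpm_new(void)
{
    return calloc(1, sizeof(struct lpm_node));
}

/*
 * Install prefix/len -> nexthop.  The prefix is in host byte order.
 * Returns 0 on success, -1 if a node could not be allocated.
 */
static int
lpm_insert(struct lpm_node *root, uint32_t prefix, int len, int nexthop)
{
    struct lpm_node *n = root;
    int i;

    assert(len >= 0 && len <= 32);
    for (i = 0; i < len; i++) {
        int bit = (prefix >> (31 - i)) & 1;

        if (n->child[bit] == NULL &&
            (n->child[bit] = lpm_new()) == NULL)
            return -1;
        n = n->child[bit];
    }
    n->has_route = 1;
    n->nexthop = nexthop;
    return 0;
}

/*
 * Longest-prefix match: walk the address bits from the top, remembering
 * the last node that carried a route.  Returns -1 if nothing matches.
 */
static int
lpm_lookup(const struct lpm_node *root, uint32_t addr)
{
    const struct lpm_node *n = root;
    int best = -1;
    int i;

    for (i = 0; n != NULL; i++) {
        if (n->has_route)
            best = n->nexthop;
        if (i == 32)
            break;              /* all 32 bits consumed */
        n = n->child[(addr >> (31 - i)) & 1];
    }
    return best;
}
```

With 10.0.0.0/8 -> 1 and 10.1.0.0/16 -> 2 installed, a lookup of
10.1.2.3 returns 2 and a lookup of 10.2.3.4 returns 1. Unlike a cache,
there is no miss path: every lookup walks the full table, so the cost is
bounded by the address length rather than by the traffic pattern.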
A couple of papers on such data structures were presented at the last
SIGCOMM.

/*
 * Ip input routine.  Checksum and byte swap header.  If fragmented
 * try to reassemble.  Process options.  Pass to next level.
 */
void
ip_input(struct mbuf *m)
{
    struct ip *ip;
    struct ipq *fp;
    struct in_ifaddr *ia;
    int hlen;
    int len, err;
    struct in_addr dst;
    struct ipfwd_cache *ipfc;
    struct rtentry *rt;
    struct ifnet *ifp;

// note that we're trying to do both fast input and fast forwarding
// here, depending on what sort of mood the sysadmin is in...
#ifndef IPFP_ROUTER
    /* XXX - should do multicast faster */
    /* XXX - GCC extension (pointers to labels) */
    static const void *actions[] = {
        &&dopanic, &&fast_input, &&slow, &&fast_fwd, &&cantforward
    };
#endif

#ifdef DIAGNOSTIC
    if ((m->m_flags & M_PKTHDR) == 0)
        panic("ip_input no HDR");
#endif
    /*
     * If no IP addresses have been set yet but the interfaces
     * are receiving, can't do anything with incoming packets yet.
     */
// this is bogus... we should still be able to receive 255.255.255.255
// broadcasts regardless of our configuration state if the netif is up.
    if (in_ifaddr == NULL)
        goto bad;
    ipstat.ips_total++;

    /*
     * Fast Path begins here.
     * Still need to worry about:
     *     Multicast
     */
// note that all the special cases are handled elsewhere.
// this supposedly helps branch prediction on Pentiums
// and in all cases reduces the cache footprint of the
// common case.
    if (m->m_pkthdr.len < sizeof(struct ip))
        goto tooshort;

// of course, with better buffering we wouldn't be playing with mbufs
// so much, which would save more time and code space.
#ifdef DIAGNOSTIC
    if (m->m_len < sizeof(struct ip))
        panic("ipintr mbuf too short");
#endif
    ip = mtod(m, struct ip *);

// IP_VHL_BORING is defined to be the right value for v4.
// symbolic names are better than magic pointer dereferences...
    if (ip->ip_vhl != IP_VHL_BORING) {
        if (IP_VHL_V(ip->ip_vhl) != IPVERSION)
            goto badvers;
        hlen = IP_VHL_HL(ip->ip_vhl) << 2;
        if (hlen > m->m_pkthdr.len || hlen < sizeof(struct ip))
            goto badhlen;
// I came up with this idea independently.  The experience of
// source-routed multicast tunnels taught everybody that you can't
// single out packets with options for really abysmal service.
        /* otherwise, hlen > minimum, so do options */
        if (ip_fast_options(m->m_pkthdr.rcvif, m, ip, &dst))
            /* can't handle these options quickly */
            goto slow;
    } else {
        dst = ip->ip_dst;
    }

    len = ntohs(ip->ip_len);
// in retrospect, this should probably have been handled with a `goto slow'.
// only problem is short packets on Ethernets which got padded up to 64.
    if (m->m_pkthdr.len != len) {
        if (m->m_pkthdr.len < len)
            goto toosmall;
        if (m->m_len == m->m_pkthdr.len) {
            m->m_len = m->m_pkthdr.len = len;
        } else {
            m_adj(m, len - m->m_pkthdr.len);
        }
    }

// now you see why I introduced this wrinkle two years ago...
#ifdef COMPAT_IPFW
    if (ip_fw_chk_ptr)
        goto slow;
#endif

    ipfc = &ipfwd_cache[ipf_hash(dst)];
// we don't bother filling the cache in the fast path -- just kick
// over to the slow path and it will do it for us as a documented
// side-effect.  if we're playing router, this will also allow us
// to input those packets which are actually addressed to us (since
// the action member doesn't exist in that case)
    if (ipfc->dst.s_addr != dst.s_addr
        || !(rt = ipfc->rt)
        || !(rt->rt_flags & RTF_UP))
        goto slow;

// if we're not playing router, do whatever action is suggested by the
// cache....
#ifndef IPFP_ROUTER
    if (ipfc->action > ipfp_cantforward)
        goto dopanic;
    goto *actions[ipfc->action];    /* XXX GCC extension */
#endif

fast_fwd:
    ifp = rt->rt_ifp;
    if (m->m_pkthdr.rcvif == ifp)
        goto slow;    /* maybe we should do fast redirects? */

    /*
     * RSVP requires that we intercept all RSVP packets
     * passing through and feed them to the local daemon.
     * This will go away when we have router alert.
     */
    if (rsvp_on && ip->ip_p == IPPROTO_RSVP)
#ifdef IPFP_ROUTER
        goto slow;
#else
        goto fast_input;
#endif

    if (ifp->if_mtu <= len)
        goto slow;    /* oops, need fragmentation */

// we make sure to keep the technically-correct behavior as an option
// (but everybody knows that no routers do ip checksums on in-transit
// packets these days).
#if IP_FWD_CHECKSUM
    if (ip->ip_vhl == IP_VHL_BORING)
        tmpsum = ip_cksum_hdr(ip);
    else
        tmpsum = in_cksum(m, IP_VHL_HL(ip->ip_vhl) << 2);
    if (tmpsum)
        goto badsum;
#endif

    if (--ip->ip_ttl == 0)
        goto send_time_exceeded;

// that's what this macro (defined in ) is for...
    in_cksum_update(ip);

// if I had done the right thing with the routing code, this would
// go through the next-hop table instead and would be a trivial
// `blat the header on front and send' operation most of the time.
    rt->rt_use++;
    if (rt->rt_flags & RTF_GATEWAY) {
        err = ifp->if_output(ifp, m, rt->rt_gateway, rt);
    } else {
        ipaddr.sin_addr = dst;
        err = ifp->if_output(ifp, m, (struct sockaddr *)&ipaddr, rt);
    }

    /* XXX other errors? */
    if (err == EHOSTUNREACH) {
        /* XXX should count this */
        return;
    }
    return;

// I've left out the other obvious parts...

-GAWollman

--
Garrett A. Wollman   | O Siem / We are all family / O Siem / We're all the same
wollman@lcs.mit.edu  | O Siem / The fires of freedom
Opinions not those of| Dance in the burning flame
MIT, LCS, CRS, or NSA| - Susan Aglukark and Chad Irschick

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-net" in the body of the message
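The in_cksum_update(ip) step in the code above patches the header
checksum after the TTL decrement rather than recomputing it over the
whole header. The macro itself is not shown in the message, so what
follows is only a sketch of the general technique (the incremental
update from RFC 1624, equation 3), not the FreeBSD macro. It works on a
raw header held as bytes in network order; hdr_cksum and ttl_dec_update
are hypothetical names, and unlike the code above the helper does the
decrement itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement checksum over hlen bytes of header (hlen even).
 * Over a header whose checksum field is already filled in, a result
 * of 0 means the header verifies.
 */
static uint16_t
hdr_cksum(const uint8_t *h, size_t hlen)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < hlen; i += 2)
        sum += ((uint32_t)h[i] << 8) | h[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Decrement the TTL (byte 8) and patch the checksum (bytes 10-11):
 *     HC' = ~(~HC + ~m + m')
 * where m and m' are the old and new values of the 16-bit word that
 * holds the TTL and protocol fields.  The caller must ensure TTL > 0.
 */
static void
ttl_dec_update(uint8_t *h)
{
    uint16_t hc = (uint16_t)((h[10] << 8) | h[11]);
    uint16_t m_old = (uint16_t)((h[8] << 8) | h[9]);
    uint16_t m_new;
    uint32_t sum;

    assert(h[8] > 0);
    h[8]--;
    m_new = (uint16_t)((h[8] << 8) | h[9]);

    sum = (uint32_t)(uint16_t)~hc + (uint16_t)~m_old + m_new;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    hc = (uint16_t)~sum;
    h[10] = (uint8_t)(hc >> 8);
    h[11] = (uint8_t)(hc & 0xff);
}
```

After ttl_dec_update() on a valid header, hdr_cksum() over the full
header should again return 0, which is the same check a receiver makes.
The point of the incremental form is that forwarding touches three
16-bit words instead of summing the whole header for every packet.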