Date: Sat, 2 May 1998 00:32:47 -0400 (EDT)
From: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
Message-Id: <199805020432.AAA10671@khavrinen.lcs.mit.edu>
To: Chris Csanady <ccsanady@friley585.res.iastate.edu>
Cc: freebsd-net@FreeBSD.ORG, jtw@lcs.mit.edu
Subject: Fast IP forwarding
In-Reply-To: <199805020229.VAA04136@friley585.res.iastate.edu>
References: <199805012043.QAA09515@khavrinen.lcs.mit.edu>
	<199805020229.VAA04136@friley585.res.iastate.edu>

<<On Fri, 01 May 1998 21:29:37 -0500, Chris Csanady <ccsanady@friley585.res.iastate.edu> said:

> I have included the ip_input() function from Van Jacobson's experimental
> stack as a reference.  It uses an extremely simple cache that requires
> little memory and only a handful of instructions to manage it.  This
> code should actually have *very* low overhead even in the common case.

Wow, that's a bit more complicated than I would have expected from
Van.  Here's a slightly different take on the same concept... with
some comments from eighteen months after the fact.  (Some wild
mutation of this may have eventually made it into the CAIRN group's
sources---JTW certainly did have a copy of it when I left the fifth
floor.)  Note that, in a real router, you probably don't want to use a
cache at all, but rather a specialized data structure that can
compactly store the entire routing table and access it directly with a
minimum of instructions.  Or at least that's what I last heard,
anyway.  A couple of papers on such data structures were presented at
the last SIGCOMM.

/*
 * Ip input routine.  Checksum and byte swap header.  If fragmented
 * try to reassemble.  Process options.  Pass to next level.
 */
void
ip_input(struct mbuf *m)
{
	struct ip *ip;
	struct ipq *fp;
	struct in_ifaddr *ia;
	int hlen;
	int len, err;
	struct in_addr dst;
	struct ipfwd_cache *ipfc;
	struct rtentry *rt;
	struct ifnet *ifp;
#if IP_FWD_CHECKSUM
	int tmpsum;	/* result of the optional header checksum below */
#endif
// note that we're trying to do both fast input and fast forwarding
// here, depending on what sort of mood the sysadmin is in...
#ifndef	IPFP_ROUTER
	/* XXX - should do multicast faster */
	/* XXX - GCC extension (pointers to labels) */
	static const void *actions[] = { &&dopanic, &&fast_input, &&slow, 
					 &&fast_fwd, &&cantforward };
#endif

#ifdef	DIAGNOSTIC
	if ((m->m_flags & M_PKTHDR) == 0)
		panic("ip_input no HDR");
#endif
	/*
	 * If no IP addresses have been set yet but the interfaces
	 * are receiving, can't do anything with incoming packets yet.
	 */
// this is bogus... we should still be able to receive 255.255.255.255
// broadcasts regardless of our configuration state if the netif is up.
	if (in_ifaddr == NULL)
		goto bad;
	ipstat.ips_total++;

	/*
	 * Fast Path begins here.
	 * Still need to worry about:
	 *	Multicast
	 */
// note that all the special cases are handled elsewhere.
// this supposedly helps branch prediction on Pentiums
// and in all cases reduces the cache footprint of the
// common case.
	if (m->m_pkthdr.len < sizeof(struct ip))
		goto tooshort;

// of course, with better buffering we wouldn't be playing with mbufs
// so much, which would save more time and code space.
#ifdef	DIAGNOSTIC
	if (m->m_len < sizeof(struct ip))
		panic("ipintr mbuf too short");
#endif
	ip = mtod(m, struct ip *);

// IP_VHL_BORING is defined to be the right value for v4.
// symbolic names are better than magic pointer dereferences...
	if (ip->ip_vhl != IP_VHL_BORING) {
		if (IP_VHL_V(ip->ip_vhl) != IPVERSION)
			goto badvers;

		hlen = IP_VHL_HL(ip->ip_vhl) << 2;
		if (hlen > m->m_pkthdr.len || hlen < sizeof(struct ip))
			goto badhlen;

// I came up with this idea independently.  The experience of
// source-routed multicast tunnels taught everybody that you can't
// single out packets with options for really abysmal service.
		/* otherwise, hlen > minimum, so do options */
		if (ip_fast_options(m->m_pkthdr.rcvif, m, ip, &dst))
			/* can't handle these options quickly */
			goto slow;
	} else {
		dst = ip->ip_dst;
	}
	
	len = ntohs(ip->ip_len);
// in retrospect, this should probably have been handled with a `goto slow'.
// only problem is short packets on Ethernets which got padded up to 64.
	if (m->m_pkthdr.len != len) {
		if (m->m_pkthdr.len < len)
			goto toosmall;

		if (m->m_len == m->m_pkthdr.len) {
			m->m_len = m->m_pkthdr.len = len;
		} else {
			m_adj(m, len - m->m_pkthdr.len);
		}
	}

// now you see why I introduced this wrinkle two years ago...
#ifdef COMPAT_IPFW
	if (ip_fw_chk_ptr)
		goto slow;
#endif

	ipfc = &ipfwd_cache[ipf_hash(dst)];
// we don't bother filling the cache in the fast path -- just kick
// over to the slow path and it will do it for us as a documented
// side-effect.  if we're playing router, this will also allow us
// to input those packets which are actually addressed to us (since
// the action member doesn't exist in that case)
	if (ipfc->dst.s_addr != dst.s_addr
	    || !(rt = ipfc->rt)
	    || !(rt->rt_flags & RTF_UP))
		goto slow;

// if we're not playing router, do whatever action is suggested by the
// cache....
#ifndef IPFP_ROUTER
	if (ipfc->action > ipfp_cantforward)
		goto dopanic;
	goto *actions[ipfc->action]; /* XXX GCC extension */
#endif

fast_fwd:
	ifp = rt->rt_ifp;
	if (m->m_pkthdr.rcvif == ifp)
		goto slow; /* maybe we should do fast redirects? */

	/*
	 * RSVP requires that we intercept all RSVP packets
	 * passing through and feed them to the local daemon.
	 * This will go away when we have router alert.
	 */
	if (rsvp_on && ip->ip_p == IPPROTO_RSVP)
#ifdef IPFP_ROUTER
		goto slow;
#else
		goto fast_input;
#endif

	if (ifp->if_mtu < len)
		goto slow; /* oops, need fragmentation */

// we make sure to keep the technically-correct behavior as an option
// (but everybody knows that no routers do ip checksums on in-transit
// packets these days).
#if IP_FWD_CHECKSUM
	if (ip->ip_vhl == IP_VHL_BORING)
		tmpsum = ip_cksum_hdr(ip);
	else
		tmpsum = in_cksum(m, IP_VHL_HL(ip->ip_vhl) << 2);
	if (tmpsum)
		goto badsum;
#endif
	if (--ip->ip_ttl == 0)
		goto send_time_exceeded;

// that's what this macro (defined in <machine/in_cksum.h>) is for...
	in_cksum_update(ip);

// if I had done the right thing with the routing code, this would
// go through the next-hop table instead and would be a trivial
// `blat the header on front and send' operation most of the time.
	rt->rt_use++;
	if (rt->rt_flags & RTF_GATEWAY) {
		err = ifp->if_output(ifp, m, rt->rt_gateway, rt);
	} else {
		ipaddr.sin_addr = dst;
		err = ifp->if_output(ifp, m, 
				     (struct sockaddr *)&ipaddr, rt);
	}

	/* XXX other errors? */
	if (err == EHOSTUNREACH) {
		/* XXX should count this */
		return;
	}
	return;

// I've left out the other obvious parts...
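
For readers who want to see the cache probe in isolation, here is a minimal
sketch of a direct-mapped forwarding cache of the kind the fast path above
consults.  Everything in it is an assumption made for illustration: the names
fwd_cache, fwd_hash, fwd_cache_lookup, and fwd_cache_fill, the table size,
the hash function, and the stand-in struct fwd_route are all hypothetical and
are not the ipfwd_cache, ipf_hash(), or struct rtentry definitions from the
code above, which were not posted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FWD_CACHE_SIZE	256	/* hypothetical size; must be a power of two */

struct fwd_route {		/* hypothetical stand-in for struct rtentry */
	int up;			/* plays the role of the RTF_UP flag */
	int ifindex;		/* outgoing interface, illustrative only */
};

struct fwd_cache_entry {
	uint32_t dst;		/* destination address this slot describes */
	struct fwd_route *rt;	/* cached route, NULL while the slot is empty */
};

static struct fwd_cache_entry fwd_cache[FWD_CACHE_SIZE];

/* Fold the address bytes into a slot index (an assumed hash function). */
unsigned
fwd_hash(uint32_t dst)
{
	dst ^= dst >> 16;
	dst ^= dst >> 8;
	return (dst & (FWD_CACHE_SIZE - 1));
}

/*
 * Fast-path probe: return the cached route, or NULL to mean "take the
 * slow path" when the slot is empty, holds another destination, or
 * points at a route that is no longer up.
 */
struct fwd_route *
fwd_cache_lookup(uint32_t dst)
{
	struct fwd_cache_entry *e = &fwd_cache[fwd_hash(dst)];

	if (e->dst != dst || e->rt == NULL || !e->rt->up)
		return (NULL);
	return (e->rt);
}

/* Slow-path side effect: overwrite the slot after a full route lookup. */
void
fwd_cache_fill(uint32_t dst, struct fwd_route *rt)
{
	struct fwd_cache_entry *e = &fwd_cache[fwd_hash(dst)];

	assert(rt != NULL);
	e->dst = dst;
	e->rt = rt;
}
```

As in the code above, the probe never fills the cache itself; a miss simply
falls through to the slow path, which would call fwd_cache_fill() once it has
done the real lookup.  A colliding destination just evicts the previous
occupant of the slot.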

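The TTL decrement followed by in_cksum_update(ip) relies on patching the
header checksum incrementally instead of recomputing it.  The macro itself
lives in <machine/in_cksum.h> and is not reproduced here; what follows is
only a portable sketch of the same idea using the one's-complement update
from RFC 1624, operating on a raw header byte array.  The function names
hdr_cksum and ttl_dec_update are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement checksum over len bytes, taken as big-endian 16-bit
 * words.  Run over a header whose checksum field is filled in, it
 * returns 0 when the header is valid.
 */
uint16_t
hdr_cksum(const uint8_t *h, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2)
		sum += (uint32_t)(h[i] << 8 | h[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}

/*
 * Decrement the TTL (header byte 8) and patch the checksum (bytes 10-11)
 * with HC' = ~(~HC + ~m + m'), where m and m' are the old and new values
 * of the 16-bit word holding TTL and protocol.
 */
void
ttl_dec_update(uint8_t *h)
{
	uint16_t old_word = (uint16_t)(h[8] << 8 | h[9]);
	uint16_t hc = (uint16_t)(h[10] << 8 | h[11]);
	uint16_t new_word;
	uint32_t sum;

	h[8]--;
	new_word = (uint16_t)(h[8] << 8 | h[9]);

	sum = (uint32_t)(uint16_t)~hc + (uint32_t)(uint16_t)~old_word + new_word;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	hc = (uint16_t)~sum;

	h[10] = (uint8_t)(hc >> 8);
	h[11] = (uint8_t)(hc & 0xff);
}
```

The update touches three 16-bit quantities regardless of header length, which
is why it suits a forwarding fast path; calling hdr_cksum() over the patched
header is a cheap way to check the result while debugging.
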
-GAWollman

--
Garrett A. Wollman   | O Siem / We are all family / O Siem / We're all the same
wollman@lcs.mit.edu  | O Siem / The fires of freedom 
Opinions not those of| Dance in the burning flame
MIT, LCS, CRS, or NSA|                     - Susan Aglukark and Chad Irschick
