Date:      Tue, 21 Sep 2004 01:57:54 +0200
From:      Andre Oppermann <andre@freebsd.org>
To:        John-Mark Gurney <gurney_j@resnet.uoregon.edu>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: better MTU support...
Message-ID:  <414F6E82.59E5A16@freebsd.org>
References:  <20040906050435.GA72089@funkthat.com> <41408D4C.E33B6F98@freebsd.org> <20040918231719.GV72089@funkthat.com>

John-Mark Gurney wrote:
> 
> Andre Oppermann wrote this message on Thu, Sep 09, 2004 at 19:05 +0200:
> 
> Ok, finally got a switch (and gige cards, if_re needs work) capable of
> jumbo frames..
> 
> > John-Mark Gurney wrote:
> > > In a recent experiment w/ Jumbo frames, I found out that sending ip
> > > frames completely ignores the MTU set on host routes.  This makes it
> > > difficult (or next to impossible) to support a network that has both
> > > regular and jumbo frames on it as you can't restrict some hosts to the
> > > smaller frames.
> >
> > What you should do instead is to set the MTU on the interface to 9018
> > or so and then have a default route with MTU 1500 for everything else.
> > Now you can specify larger MTUs for hosts that support it.
> >
> > Otherwise you are opening a can of worms...
> 
> This doesn't fix it, since the output still doesn't honor the mtu on
> the route..  Note, I'm not testing tcp, only udp and icmp since I've
> seen that TCP already works fine...
> # netstat -rnWfinet
> Routing tables
> 
> Internet:
> Destination        Gateway            Flags    Refs      Use    Mtu    Netif Expire
> default            192.168.0.14       UGS         0       11   1500      em0
> 127.0.0.1          127.0.0.1          UH          0       40  16384      lo0
> 192.168.0          link#5             UC          0        0   9000      em0
> 192.168.0.1        00:a0:c9:59:8b:6c  UHLW        0       33   1500      em0    175
> 192.168.0.3        00:0a:95:9e:8b:88  UHLW        0     1988   9000      em0    374
> 192.168.0.14       00:a0:c9:31:30:5e  UHLW        1        8   1500      em0    955
> 192.168.0.20       00:07:e9:0d:aa:ca  UHLW        0       18   9000      em0    187
> 192.168.0.21       00:07:e9:0d:ad:06  UHLW        0        2   9000      lo0
> 
> tcpdump output:
> 16:02:14.311079 IP 192.168.0.21 > 192.168.0.1: icmp 5008: echo request seq 14
> 16:02:15.320981 IP 192.168.0.21 > 192.168.0.1: icmp 5008: echo request seq 15
> 16:04:54.720890 IP 192.168.0.21 > 128.223.122.47: icmp 5008: echo request seq 0
> 16:04:55.727148 IP 192.168.0.21 > 128.223.122.47: icmp 5008: echo request seq 1
> 16:05:02.288989 IP 192.168.0.21 > 192.168.0.20: icmp 5008: echo request seq 0
> 16:05:02.289856 IP 192.168.0.20 > 192.168.0.21: icmp 5008: echo reply seq 0
> 16:05:03.296481 IP 192.168.0.21 > 192.168.0.20: icmp 5008: echo request seq 1
> 16:05:03.297282 IP 192.168.0.20 > 192.168.0.21: icmp 5008: echo reply seq 1
> 
> So, as you can see, it's broken...
> 
> with my patch, ip properly fragments the packets to machines with
> smaller mtu...
> 
> > > I now have a patch to ip_output that makes it obey the MTU set on the
> > > route instead of that of the interface.
> >
> > Your patch corrects a problem in ip_output where a smaller MTU on an
> > rtentry was ignored but that is only for the non-TCP cases.  When you
> > open a TCP session the MTU will be honored (see tcp_subr.c:tcp_maxmtu).
> > If not it would be a bug.
> >
> > Could you try your large MTU setup again using the procedure I described
> > above?
> >
> > That should solve your immediate problem.
> 
> Nope, it doesn't...
> 
> > For the general 'bug' in ip_output that it doesn't honour a smaller MTU
> > on a route I'd like to do a more thorough fix.  Routes should be
> > created with MTU 0 if the MTU is not different from the if_mtu.  Only
> > in those cases where you want to have a lower MTU you set it.  For cloned
> > routes the MTU would be cloned from the parent.  This range of changes is
> > more intrusive.  On top of that comes the new ARP code which will have a
> > MTU field as well.  This one is supposed to store different MTUs for mixed
> > MTU L2 networks.  How to transport the MTU information is a separate
> > discussion.
> >
> > If the fix above works for you I'd like to do the real fix later (< end
> > of year) and not change the current behaviour in ip_output at the moment.
> 
> It wouldn't be hard to add to my patch the check to see if the route's
> mtu is 0 and just use the if mtu... which then solves the ip part of
> your more complete fix...  Then when you finally fix the route/arp stuff
> nothing else should be necessary...
> 
> Sound good?

Moving the check upwards as you have done in ip_output() works in your
case but is not a real, clean fix.  Ideally, routes should never have an
MTU assigned to them unless someone explicitly sets it.  So the MTU on a
route should normally be zero, and a zero MTU should be ignored: in that
case only the link MTU is used.  If there is an MTU on a route, it should
be observed not only for host routes (as you do in your patch) but also
for network routes.  Getting this right requires disabling the copying
of the MTU when a route is cloned or created.  We also have to check that
all consumers of the MTU field in the kernel and userland can cope with
a zero MTU and these semantics (ignoring it).
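
To make the intended semantics concrete, here is a rough sketch of the
selection rule described above.  It is not the actual ip_output() code:
the struct and function names (sketch_route, sketch_ifnet,
effective_mtu) are made up for illustration, and clamping a route MTU
that is larger than the link MTU down to the link MTU is an assumption
of mine, not something decided in this thread.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's route and interface records. */
struct sketch_route {
	unsigned long rt_mtu;	/* 0 means "not explicitly set" */
};

struct sketch_ifnet {
	unsigned long if_mtu;	/* link MTU of the outgoing interface */
};

/*
 * Pick the MTU to use for an outgoing packet.
 * - No route, or a route MTU of zero: ignore the route, use the link MTU.
 * - Otherwise: honour the route MTU, for host and network routes alike,
 *   but never exceed the link MTU (assumption, see the text above).
 */
static unsigned long
effective_mtu(const struct sketch_route *rt, const struct sketch_ifnet *ifp)
{
	if (rt == NULL || rt->rt_mtu == 0)
		return (ifp->if_mtu);
	return (rt->rt_mtu < ifp->if_mtu ? rt->rt_mtu : ifp->if_mtu);
}
```

With an interface MTU of 9000, as in the routing table quoted above, a
route that carries no MTU would yield 9000, while a route explicitly set
to 1500 would yield 1500 and so force fragmentation of larger packets.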

I'll get to that by the end of the week.  If you get to some of it earlier,
please send me the patches so we don't duplicate work.  Then we should have
something ready to commit to 6-current next week.

-- 
Andre


