From: Matthew Dillon <dillon@apollo.backplane.com>
Date: Sun, 21 Mar 1999 11:51:28 -0800 (PST)
Message-Id: <199903211951.LAA14338@apollo.backplane.com>
To: Terry Lambert
Cc: hasty@rah.star-gate.com, wes@softweyr.com, ckempf@enigami.com,
    wpaul@skynet.ctr.columbia.edu, freebsd-hackers@FreeBSD.ORG
Subject: Re: Gigabit ethernet -- what am I doing wrong?
References: <199903211907.MAA16206@usr06.primenet.com>

:
:If you really have a problem with the idea that packets which collide
:should be dropped, well, then stop thinking IP and start thinking ATM
:instead (personally, I hate ATM, but if the tool fits...).

    Terry, this may sound great on paper, but nobody in their right mind
    drops a packet inside a router unless they absolutely have to.  This
    lesson was learned long ago, and has only become more important as the
    number of hops increases.  There is no simple solution that doesn't
    have terrible boundary conditions on the load curve.

:> of up to 20 ms, each port would require 8 MBytes of buffer space and
:> you *still* don't solve the problem that occurs if one port backs up,
:> short of throwing away the packets destined to other ports even if the
:> other ports are idle.
:
:Yes, that's true.  This is the "minimum pool retention time" problem,
:which is the maximum allowable latency before a discard occurs.

    No, it's worse than that.  Supplying sufficient buffer space only
    partially solves the problem.  If the buffer space is not properly
    scheduled, an overload on one port, or even just a statistically bad
    ordering of packets to different destinations, can multiply the
    latency seen by other ports.  Adding even more buffer space to try to
    brute-force a solution doesn't work well (and is an extremely
    expensive proposition to boot)... which is why high-end router and
    switch makers spend a lot of time on scheduling.

:See above.  This was a management-budget-driven topology error, having
:really nothing at all to do with the capabilities of the hardware.  It
:was the fact that in a topology having more than one gigaswitch, each
:gigaswitch must dedicate 50% of its capability to talking to the other
:gigaswitches.  This means you must increase the number of switches in
:a Fibonacci sequence, not a linear additive sequence.

    It had nothing to do with any of that.  I *LIVED* the crap at MAE-WEST
    because BEST had a T3 there and we had to deal with it every day, even
    though we were only using 5 MBits out of our 45 MBit T3.  The problem
    was that a number of small 'backbones' were selling transit bandwidth
    at MAE-WEST and overcommitting their bandwidth.  The moment any one of
    their ports exceeded 45 MBits, the Gigaswitch went poof due to
    head-of-queue blocking, a combination of software on the gigaswitch
    and the way the switch's hardware queues packets for transfer between
    cards.
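    To make the head-of-queue failure concrete, here is a toy C sketch.
    It is purely illustrative (it is not Gigaswitch firmware, and the
    rates are made up): one shared input FIFO feeds two output ports,
    port 0 is congested, port 1 is idle, and the packets for the idle
    port rot behind a stuck packet at the head of the queue.

/*
 * Toy head-of-line blocking model: one shared input FIFO feeding two
 * output ports.  Port 0 accepts one packet every 4 ticks; port 1 could
 * accept one every tick, but a stuck port-0 packet at the head of the
 * FIFO blocks it anyway.
 */
#include <stdio.h>

#define QLEN 8

int
main(void)
{
	int fifo[QLEN] = { 0, 1, 0, 1, 0, 1, 0, 1 }; /* destination ports */
	int head = 0;
	int t;

	for (t = 0; head < QLEN && t < 32; t++) {
		int port0_ready = (t % 4) == 0;	/* congested output */

		if (fifo[head] == 0 && !port0_ready) {
			printf("t=%2d: head pkt stuck for port 0; port 1 idles\n", t);
			continue;
		}
		printf("t=%2d: deliver to port %d\n", t, fifo[head]);
		head++;
	}
	return (0);
}

    The four packets for the idle port take 14 ticks to drain instead of
    4.  Multiply that across every port and peer and you get the latency
    blow-up described above.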
:Consider that if I have one gigaswitch, and all its ports are in use,
:if I dedicate 50% of its ports to inter-switch communication, and
:then add only one other switch where I do the same, then I end up
:with exactly the same number of ports.
:
:The MAE-West problem was that (much!) fewer than 50% of the ports on
:the two switches were dedicated to interconnect.

    The problem at MAE-WEST had absolutely nothing to do with this.  The
    problem occurred *inside* a *single* switch.  If one port on that
    switch overloaded, all the ports started having problems due to
    head-of-line blocking.  BEST had to pull peering with a number of
    overcommitted nodes for precisely this reason, but that didn't help
    the nodes talking to us which *were* still peering with the
    overcommitted nodes.  The moment any of these unloaded nodes tried to
    send a packet to an overcommitted node, it blocked the head of the
    queue and created massive latencies for packets destined to other
    unloaded nodes.  Eventually enough pressure was placed on the idiot
    backbones to clean up their act, but it took 2+ years for them to get
    to that point.

:The solution to this is to have geographically separate clusters
:of the things, and ensure that the interlinks between them are
:significantly faster than their interlinks to equipment that isn't
:them.

    The solution to this at MAE-WEST was to clamp down on the idiots who
    were selling transit at MAE-WEST and overcommitting their ports, plus
    numerous software upgrades, none of which really solved the problem
    completely.

:With respect, I think that you are barking up the wrong tree.  If the
:latency is such that the pool retention time becomes infinite for any
:packet, then you are screwed, blued, and tattooed.  You *must* never
:fill the pool faster than you can drain it... period.  The retention
:time is dictated not by the traffic rate itself, but by the relative
:rate between the traffic and the rate at which you can make delivery
:decisions based on traffic content.

    'You must never fill the pool faster than you can drain it' is a
    cop-out.  In the best-case scenario, that is exactly what happens.
    Unfortunately, the best case requires a level of sophistication in
    scheduling that only a few people have gotten right.  Even Cisco has
    blown it numerous times.

:Increasing the pool size can only compensate for a slow decision process,
:not resource overcommit by a budget-happy management.  The laws of physics
:and mathematics bend to the will of no middle-manager.

    This is not a resource overcommit issue.  This is a resource
    scheduling issue.  You can always overcommit a resource -- the router
    must deal with that situation no matter what your topology.  It is
    *HOW* you deal with the overcommit that matters (see the sketch in
    the P.S. below).  It is not possible to avoid a resource overcommit
    in a router or a switch, because ports, even ports with the same
    physical speed, have mismatched loads.

						-Matt
						Matthew Dillon
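    P.S. Here is the flip side of the earlier sketch, equally
    hypothetical: give each output its own queue (what the literature
    calls virtual output queueing) and service the queues independently.
    The overcommitted port still backs up, but it no longer stalls
    traffic for the idle ports -- that's the difference between handling
    an overcommit and letting it poison the whole switch.

/*
 * Toy per-output-queue model: same traffic as before, but each output
 * port has its own queue.  Port 0 is still overcommitted and drains
 * slowly; port 1 drains at full speed regardless.
 */
#include <stdio.h>

#define NPORTS 2

int
main(void)
{
	int voq[NPORTS] = { 4, 4 };	/* packets queued per output port */
	int t;

	for (t = 0; t < 16; t++) {
		int p;

		for (p = 0; p < NPORTS; p++) {
			/* port 0 drains 1 pkt per 4 ticks; port 1, 1 per tick */
			int ready = (p == 0) ? (t % 4) == 0 : 1;

			if (voq[p] > 0 && ready) {
				voq[p]--;
				printf("t=%2d: port %d sends (%d left)\n", t, p, voq[p]);
			}
		}
	}
	return (0);
}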