Date: Sun, 21 Mar 1999 11:51:28 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Terry Lambert <tlambert@primenet.com>
Cc: hasty@rah.star-gate.com, wes@softweyr.com, ckempf@enigami.com,
    wpaul@skynet.ctr.columbia.edu, freebsd-hackers@FreeBSD.ORG
Subject: Re: Gigabit ethernet -- what am I doing wrong?
Message-ID: <199903211951.LAA14338@apollo.backplane.com>
References: <199903211907.MAA16206@usr06.primenet.com>
:
:If you really have a problem with the idea that packets which collide
:should be dropped, well, then stop thinking IP and start thinking ATM
:instead (personally, I hate ATM, but if the tool fits...).
Terry, this may sound great on paper, but nobody in their right mind
drops a packet inside a router unless they absolutely have to. This
lesson was learned long ago, and it has only become more important as
the number of hops increases. There is no simple solution that doesn't
have terrible boundary conditions on the load curve.
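To give a feel for how much care even the "simple" act of dropping
takes, here is a minimal sketch of RED-style early drop (Floyd and
Jacobson's Random Early Detection), probably the best-known attempt to
keep drop behavior smooth across the load curve instead of falling off
a cliff at 100% occupancy. Every constant below is invented for
illustration, and real RED also spaces drops apart by counting packets
since the last drop, which I've left out:

/*
 * Sketch of RED-style early drop.  All constants are invented for
 * illustration; real RED also spaces drops out by counting packets
 * since the last drop, omitted here.
 */
#include <stdlib.h>

#define MIN_TH  50              /* avg queue len where dropping starts */
#define MAX_TH  150             /* avg queue len where we drop everything */
#define W_Q     0.002           /* EWMA weight for the average */
#define MAX_P   0.10            /* drop probability just below MAX_TH */

static double avg_qlen;         /* exponentially weighted average */

/* returns 1 to drop the arriving packet, 0 to enqueue it */
int
red_should_drop(int instant_qlen)
{
    double p;

    avg_qlen = (1.0 - W_Q) * avg_qlen + W_Q * instant_qlen;

    if (avg_qlen < MIN_TH)
        return (0);
    if (avg_qlen >= MAX_TH)
        return (1);

    /* between the thresholds the drop probability ramps linearly */
    p = MAX_P * (avg_qlen - MIN_TH) / (MAX_TH - MIN_TH);
    return ((double)rand() / RAND_MAX < p);
}

Even this, about the simplest defensible drop policy there is, needs
careful tuning -- pick the thresholds badly and you are right back to
terrible boundary conditions.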
:> of up to 20mS, each port would require 8 MBytes of buffer space and you
:> *still* don't solve the problem that occurs if one port backs up, short
:> of throwing away the packets destined to other ports even if the other
:> ports are idle.
:
:Yes, that's true. This is the "minimum pool retention time" problem,
:which is the maximum allowable latency before a discard occurs.
No, it's worse than that. Supplying sufficient buffer space only
partially solves the problem. If the buffer space is not properly
scheduled, an overload on one port -- or even just an unlucky ordering
of packets to different destinations -- can multiply the latency seen
on the other ports. Adding even more buffer space to try to
brute-force a solution doesn't work well (and is an extremely
expensive proposition to boot)... which is why high-end router and
switch makers spend a lot of time on scheduling.
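You can see the effect without buying a switch. Here is a toy
simulation of a single shared FIFO feeding several output ports, with
every number invented for illustration. Jam one port and the average
delay for packets destined to the *idle* ports explodes; a bigger
buffer just lets more packets pile up behind the stalled head:

/*
 * Toy head-of-line blocking demo: one shared FIFO, NPORTS outputs.
 * Service is strictly FIFO, so a single packet for the jammed port
 * at the head of the queue stalls delivery to every other port.
 */
#include <stdio.h>

#define NPORTS  8
#define NPKTS   64
#define STALL   100             /* ticks before port 0 unjams */

int
main(void)
{
    int dest[NPKTS];
    int head, i;
    long tick, delay_sum = 0;

    /* one burst, arriving at tick 0; every 8th packet is for port 0 */
    for (i = 0; i < NPKTS; i++)
        dest[i] = i % NPORTS;

    head = 0;
    for (tick = 0; head < NPKTS; tick++) {
        if (dest[head] == 0 && tick < STALL)
            continue;           /* jammed head; everyone behind waits */
        delay_sum += tick;      /* all packets arrived at tick 0 */
        head++;
    }
    printf("strict FIFO: avg delay %ld ticks\n", delay_sum / NPKTS);
    printf("per-port queues would have averaged ~%d ticks\n",
        NPKTS / NPORTS / 2);
    return (0);
}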
:See above. This was a management budget driven topology error, having
:really nothing at all to do with the capabilities of the hardware. It
:was the fact that in a topology having more than one gigaswitch, each
:gigaswitch must dedicate 50% of its capability to talking to the other
:gigaswitches. This means you must increase the number of switches in
:a Fibonacci sequence, not a linear additive sequence.
It had nothing to do with any of that. I *LIVED* that crap at MAE-WEST
because BEST had a T3 there and we had to deal with it every day, even
though we were only using 5 MBits out of our 45 MBit T3. The problem
was that a number of small 'backbones' were selling transit bandwidth
at MAE-WEST and overcommitting their bandwidth. The moment any one of
their ports exceeded 45 MBits, the Gigaswitch went poof due to
head-of-line blocking -- a combination of the Gigaswitch's software
and the way its hardware queues packets for transfer between cards.
:Consider that if I have one gigaswitch, and all its ports are in use,
:if I dedicate 50% of its ports to inter-switch communication, and
:then add only one other switch where I do the same, then I end up
:with exactly the same number of ports.
:
:The MAE-West problem was that (much!) fewer than 50% of the ports on
:the two switches were dedicated to interconnect.
The problem at MAE-WEST had absolutely nothing to do with this. The
problem occurred *inside* a *single* switch. If one port overloaded on
that switch, all the ports started having problems due to head-of-line
blocking.
BEST had to pull peering with a number of overcommitted nodes for
precisely this reason, but that didn't help with the nodes talking to
us which *were* still peering with the overcommitted nodes. The moment
any of these unloaded nodes tried to send a packet to an overcommitted
node, it blocked the head of the queue and created massive latencies
for packets destined to other unloaded nodes. Eventually enough
pressure was placed on the idiot backbones to clean up their act, but
it took 2+ years for them to get to that point.
:The solution to this is to have geographically separate clusters
:of the things, and ensure that the interlinks between them are
:significantly faster than their interlinks to equipment that isn't
:them.
The solution to this at MAE-WEST was to clamp down on the idiots who
were selling transit at MAE-WEST and overcommitting their ports, plus
numerous software upgrades, none of which really solved the problem
completely.
:With respect, I think that you are barking up the wrong tree. If the
:latency is such that the pool retention time becomes infinite for any
:packet, then you are screwed, blued, and tattooed. You *must* never
:fill the pool faster than you can drain it... period. The retention
:time is dictated not by the traffic rate itself, but by the relative
:rate between the traffic and the rate at which you can make delivery
:decisions based on traffic content.
'You must never fill the pool faster than you can drain it' is a
cop-out. In the best-case scenario, that is exactly what happens.
Unfortunately, the best case requires a level of sophistication in
scheduling that only a few people have gotten right. Even Cisco has
blown it numerous times.
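By 'gotten right' I mean something like real fair queueing in the
scheduler. Here is a minimal sketch of Deficit Round Robin (Shreedhar
and Varghese's scheme), one discipline that drains competing queues
without letting a hog starve the rest. The quantum, queue sizes, and
traffic are invented for illustration, not anybody's shipping
firmware:

/*
 * Minimal Deficit Round Robin sketch.  Each queue is credited a
 * quantum of bytes per round and may transmit while its accumulated
 * deficit covers the packet at its head.
 */
#include <stdio.h>

#define NQUEUES  4
#define QUANTUM  1500           /* bytes credited per round */

struct queue {
    int len[16];                /* packet sizes; head..tail are live */
    int head, tail;
    int deficit;
};

static int
q_empty(struct queue *q)
{
    return (q->head == q->tail);
}

/* one full DRR round over all queues; returns bytes sent */
static int
drr_round(struct queue *qs)
{
    int i, sent = 0;

    for (i = 0; i < NQUEUES; i++) {
        struct queue *q = &qs[i];

        if (q_empty(q)) {
            q->deficit = 0;     /* idle queues don't bank credit */
            continue;
        }
        q->deficit += QUANTUM;
        while (!q_empty(q) && q->len[q->head] <= q->deficit) {
            q->deficit -= q->len[q->head];
            sent += q->len[q->head];
            q->head++;          /* "transmit" the head packet */
        }
    }
    return (sent);
}

int
main(void)
{
    static struct queue qs[NQUEUES];
    int i;

    /* queue 0 holds jumbo packets, queue 1 small ones */
    for (i = 0; i < 8; i++) {
        qs[0].len[qs[0].tail++] = 4000;
        qs[1].len[qs[1].tail++] = 100;
    }
    for (i = 0; i < 6; i++)
        printf("round %d: sent %d bytes\n", i, drr_round(qs));
    return (0);
}

The trick is that an oversized packet simply waits a few rounds for
its queue's deficit to accumulate; nothing else stalls behind it.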
:Increasing the pool size can only compensate for a slow decision process,
:not resource overcommit by a budget-happy management. The laws of physics
:and mathematics bend to the will of no middle-manager.
This is not a resource overcommit issue. This is a resource scheduling
issue. You can always overcommit a resource -- the router must deal
with that situation no matter what your topology. It is *HOW* you deal
with the overcommit that matters. It is not possible to avoid a resource
overcommit in a router or a switch because ports, even ports with the
same physical speed, have mismatched loads.
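To make 'how you deal with it' concrete: the textbook cure for the
Gigaswitch failure mode is virtual output queueing -- each input keeps
a separate queue per output port, so an overcommitted output can only
back up, and eventually drop from, its own queue. A sketch, with
invented depths and a deliberately dumb tail-drop policy:

/*
 * Virtual output queueing sketch: one queue per output port, so an
 * overcommitted output can only fill (and tail-drop from) its own
 * queue.  Depths and drop policy are invented for illustration.
 */
#include <stdio.h>

#define NPORTS  8
#define QDEPTH  128

struct voq {
    int  depth[NPORTS];         /* packets queued per output port */
    long drops[NPORTS];
};

/* classify the arriving packet by output port and enqueue it */
static void
voq_enqueue(struct voq *v, int outport)
{
    if (v->depth[outport] >= QDEPTH) {
        v->drops[outport]++;    /* the other ports never notice */
        return;
    }
    v->depth[outport]++;
}

/* one tick: every output that isn't jammed drains one packet */
static void
voq_drain(struct voq *v, int jammed_port)
{
    int i;

    for (i = 0; i < NPORTS; i++)
        if (i != jammed_port && v->depth[i] > 0)
            v->depth[i]--;
}

int
main(void)
{
    static struct voq v;
    int tick, p;

    for (tick = 0; tick < 1000; tick++) {
        for (p = 0; p < NPORTS; p++)
            voq_enqueue(&v, p);         /* uniform offered load */
        voq_drain(&v, 0);               /* port 0 is jammed */
    }
    printf("jammed port 0:  depth %d, drops %ld\n",
        v.depth[0], v.drops[0]);
    printf("healthy port 1: depth %d, drops %ld\n",
        v.depth[1], v.drops[1]);
    return (0);
}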
-Matt
Matthew Dillon
<dillon@backplane.com>
