From: Matthew Dillon <dillon@apollo.backplane.com>
Date: Sun, 21 Mar 1999 11:51:28 -0800 (PST)
Message-Id: <199903211951.LAA14338@apollo.backplane.com>
To: Terry Lambert
Cc: hasty@rah.star-gate.com, wes@softweyr.com, ckempf@enigami.com,
    wpaul@skynet.ctr.columbia.edu, freebsd-hackers@FreeBSD.ORG
Subject: Re: Gigabit ethernet -- what am I doing wrong?
References: <199903211907.MAA16206@usr06.primenet.com>

:
:If you really have a problem with the idea that packets which collide
:should be dropped, well, then stop thinking IP and start thinking ATM
:instead (personally, I hate ATM, but if the tool fits...).

    Terry, this may sound great on paper, but nobody in their right mind
    drops a packet inside a router unless they absolutely have to.  This
    lesson was learned long ago, and has only become more important as the
    number of hops increases.  There is no simple solution that doesn't
    have terrible boundary conditions on the load curve.

:> of up to 20 ms, each port would require 8 MBytes of buffer space and
:> you *still* don't solve the problem that occurs if one port backs up,
:> short of throwing away the packets destined to other ports even if the
:> other ports are idle.
:
:Yes, that's true.  This is the "minimum pool retention time" problem,
:which is the maximum allowable latency before a discard occurs.

    No, it's worse than that.  Supplying sufficient buffer space only
    partially solves the problem.  If the buffer space is not properly
    scheduled, an overload on one port, or even just a statistically bad
    ordering of packets to different destinations, can multiply the
    latency seen by other ports.  Adding even more buffer space to try to
    brute-force a solution doesn't work well (and is an extremely
    expensive proposition to boot)... which is why high-end router and
    switch makers spend a lot of time on scheduling.

:See above.  This was a management-budget-driven topology error, having
:really nothing at all to do with the capabilities of the hardware.  It
:was the fact that in a topology having more than one gigaswitch, each
:gigaswitch must dedicate 50% of its capability to talking to the other
:gigaswitches.  This means you must increase the number of switches in
:a Fibonacci sequence, not a linear additive sequence.

    It had nothing to do with any of that.  I *LIVED* the crap at MAE-WEST
    because BEST had a T3 there and we had to deal with it every day, even
    though we were only using 5 MBits out of our 45 MBit T3.  The problem
    was that a number of small 'backbones' were selling transit bandwidth
    at MAE-WEST and overcommitting their bandwidth.  The moment any one of
    their ports exceeded 45 MBits, the Gigaswitch went poof due to
    head-of-queue blocking, a combination of software on the gigaswitch
    and the way the switch's hardware queues packets for transfer between
    cards.
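    To make the head-of-queue failure concrete, here is a toy C sketch.
    It is purely illustrative (it is not Gigaswitch firmware, and the
    rates are made up): one shared input FIFO feeds two output ports,
    port 0 is congested, port 1 is idle, and the packets for the idle
    port rot behind a stuck packet at the head of the queue.

/*
 * Toy head-of-line blocking model: one shared input FIFO feeding two
 * output ports.  Port 0 accepts one packet every 4 ticks; port 1 could
 * accept one every tick, but a stuck port-0 packet at the head of the
 * FIFO blocks it anyway.
 */
#include <stdio.h>

#define QLEN 8

int
main(void)
{
	int fifo[QLEN] = { 0, 1, 0, 1, 0, 1, 0, 1 }; /* destination ports */
	int head = 0;
	int t;

	for (t = 0; head < QLEN && t < 32; t++) {
		int port0_ready = (t % 4) == 0;	/* congested output */

		if (fifo[head] == 0 && !port0_ready) {
			printf("t=%2d: head pkt stuck for port 0; port 1 idles\n", t);
			continue;
		}
		printf("t=%2d: deliver to port %d\n", t, fifo[head]);
		head++;
	}
	return (0);
}

    The four packets for the idle port take 14 ticks to drain instead of
    4.  Multiply that across every port and peer and you get the latency
    blow-up described above.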
:Consider that if I have one gigaswitch, and all its ports are in use,
:if I dedicate 50% of its ports to inter-switch communication, and
:then add only one other switch where I do the same, then I end up
:with exactly the same number of ports.
:
:The MAE-West problem was that (much!) fewer than 50% of the ports on
:the two switches were dedicated to interconnect.

    The problem at MAE-WEST had absolutely nothing to do with this.  The
    problem occurred *inside* a *single* switch.  If one port on that
    switch overloaded, all the ports started having problems due to
    head-of-line blocking.  BEST had to pull peering with a number of
    overcommitted nodes for precisely this reason, but that didn't help
    the nodes talking to us which *were* still peering with the
    overcommitted nodes.  The moment any of these unloaded nodes tried to
    send a packet to an overcommitted node, it blocked the head of the
    queue and created massive latencies for packets destined to other
    unloaded nodes.  Eventually enough pressure was placed on the idiot
    backbones to clean up their act, but it took 2+ years for them to get
    to that point.

:The solution to this is to have geographically separate clusters
:of the things, and ensure that the interlinks between them are
:significantly faster than their interlinks to equipment that isn't
:them.

    The solution to this at MAE-WEST was to clamp down on the idiots who
    were selling transit at MAE-WEST and overcommitting their ports, plus
    numerous software upgrades, none of which really solved the problem
    completely.

:With respect, I think that you are barking up the wrong tree.  If the
:latency is such that the pool retention time becomes infinite for any
:packet, then you are screwed, blued, and tattooed.  You *must* never
:fill the pool faster than you can drain it... period.  The retention
:time is dictated not by the traffic rate itself, but by the relative
:rate between the traffic and the rate at which you can make delivery
:decisions based on traffic content.

    'You must never fill the pool faster than you can drain it' is a
    cop-out.  In the best-case scenario, that is exactly what happens.
    Unfortunately, the best case requires a level of sophistication in
    scheduling that only a few people have gotten right.  Even Cisco has
    blown it numerous times.

:Increasing the pool size can only compensate for a slow decision process,
:not resource overcommit by a budget-happy management.  The laws of physics
:and mathematics bend to the will of no middle-manager.

    This is not a resource overcommit issue.  This is a resource
    scheduling issue.  You can always overcommit a resource -- the router
    must deal with that situation no matter what your topology.  It is
    *HOW* you deal with the overcommit that matters (see the sketch in
    the P.S. below).  It is not possible to avoid a resource overcommit
    in a router or a switch, because ports, even ports with the same
    physical speed, have mismatched loads.

						-Matt
						Matthew Dillon
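    P.S. Here is the flip side of the earlier sketch, equally
    hypothetical: give each output its own queue (what the literature
    calls virtual output queueing) and service the queues independently.
    The overcommitted port still backs up, but it no longer stalls
    traffic for the idle ports -- that's the difference between handling
    an overcommit and letting it poison the whole switch.

/*
 * Toy per-output-queue model: same traffic as before, but each output
 * port has its own queue.  Port 0 is still overcommitted and drains
 * slowly; port 1 drains at full speed regardless.
 */
#include <stdio.h>

#define NPORTS 2

int
main(void)
{
	int voq[NPORTS] = { 4, 4 };	/* packets queued per output port */
	int t;

	for (t = 0; t < 16; t++) {
		int p;

		for (p = 0; p < NPORTS; p++) {
			/* port 0 drains 1 pkt per 4 ticks; port 1, 1 per tick */
			int ready = (p == 0) ? (t % 4) == 0 : 1;

			if (voq[p] > 0 && ready) {
				voq[p]--;
				printf("t=%2d: port %d sends (%d left)\n", t, p, voq[p]);
			}
		}
	}
	return (0);
}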