Date: Sun, 21 Mar 1999 19:07:52 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: dillon@apollo.backplane.com (Matthew Dillon)
Cc: tlambert@primenet.com, hasty@rah.star-gate.com, wes@softweyr.com,
    ckempf@enigami.com, wpaul@skynet.ctr.columbia.edu,
    freebsd-hackers@FreeBSD.ORG
Subject: Re: Gigabit ethernet -- what am I doing wrong?
Message-ID: <199903211907.MAA16206@usr06.primenet.com>
In-Reply-To: <199903211835.KAA13904@apollo.backplane.com> from "Matthew Dillon" at Mar 21, 99 10:35:14 am
> :You mean "most recent network cards".  Modern network cards have
> :memory that can be DMA'ed into by other modern network cards.
> :
> :Moral: being of later manufacture makes you more recent, but being
> :capable of higher data rates is what makes you modern.
>
>     It's a nice idea, but there are lots of problems with card-to-card
>     DMA.  If you have only two network ports in your system (note: I
>     said ports, not cards), I suppose you could get away with it.
>     Otherwise you need something significantly more sophisticated.
>
>     The problem is that you hit one of the most common situations that
>     occurred in early routers: DMA blockages to one destination
>     screwing over others.
>
>     For example, say you have four network ports and you are receiving
>     packets which must be distributed to the other ports.  Let's say
>     network port #1 receives packets A, B, C, D, and E.  Packet A must
>     be distributed to port #2, packets B-D must be distributed to
>     port #3, and packet E must be distributed to port #4.
>
>     What happens when the DMA to port #2 blocks due to a temporary
>     overcommit of packets being sent to port #2?  Or due to a
>     collision/retry situation occurring on port #1?  What happens is
>     that packets B-E stick around in port #1's input queue and don't
>     get sent to ports 3 and 4, even if ports 3 and 4 are idle.

My argument would be that you should be looking at this as an IP switch
level thing, on the order of a Kalpana Etherswitch, or whatever the
modern Gigabit ethernet equivalent manufacturer par excellence du jour
happens to be.

Personally, I have no problem whatsoever with the MAE-West trick of
invoking the "leaky bucket" algorithm.  The problem there was most
noticeable when they bought only one, not two, gigaswitches, and
failed to dedicate 50% of the ports to communication with other
gigaswitches, which is what the Fibonacci sequence would suggest is
the correct mechanism for node expansion of that type of topology.
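The input-queue blockage Matthew describes can be sketched in a few
lines (a hypothetical Python toy of my own, nothing resembling real
driver code): if port #1's input queue is a strict FIFO, a stalled
head packet pins everything behind it.

```python
from collections import deque

def drain_fifo(queue, blocked_ports):
    """Drain a strict FIFO input queue.  A head packet whose
    destination is blocked stalls everything behind it
    (head-of-line blocking)."""
    delivered = []
    while queue:
        pkt, dst = queue[0]
        if dst in blocked_ports:
            break                     # head can't go; B-E wait behind A
        delivered.append(queue.popleft())
    return delivered

# Port #1's input queue: A -> port 2, B..D -> port 3, E -> port 4.
q = deque([("A", 2), ("B", 3), ("C", 3), ("D", 3), ("E", 4)])

# Port #2 is temporarily overcommitted; ports 3 and 4 are idle.
sent = drain_fifo(q, blocked_ports={2})
print(sent)   # [] -- nothing moves, even though ports 3 and 4 are idle
```

The point is that the queue discipline, not the link speed, is what
strands packets B-E.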
> Even worse, what happens to poor packet E, which can't be sent to
> port 4 until all the mess from packets A-D is dealt with?  Major
> latency occurs at best; packet loss occurs at worst.

The way to deal with the "mess" is to allow the origin to retransmit,
instead of taking it upon yourself to ensure reliable end-to-end
delivery of IP datagrams.  Why do it for IP if you aren't going to do
it for UDP/IP?

If you really have a problem with the idea that packets which collide
should be dropped, well, then stop thinking IP and start thinking ATM
instead (personally, I hate ATM, but if the tool fits...).

> For each port in your system, you need a *minimum* per-port buffer
> size to handle the maximum latency you wish to allow, times the
> number of ports in the router.  If you have 4 1-Gigabit ports and
> wish to allow latencies of up to 20mS, each port would require 8
> MBytes of buffer space, and you *still* don't solve the problem that
> occurs if one port backs up, short of throwing away the packets
> destined to other ports even if the other ports are idle.

Yes, that's true.  This is the "minimum pool retention time" problem,
which is the maximum allowable latency before a discard occurs.

> Backups can also introduce additional latencies that are not the
> fault of the destination port.

It doesn't matter whose fault it is.

> DEC Gigaswitch switches suffered from exactly this problem -- MAE-WEST
> had serious problems for several years, in fact, due to overcommits on
> a single port out of dozens.

See above.  This was a management budget driven topology error, having
really nothing at all to do with the capabilities of the hardware.  It
was the fact that in a topology having more than one gigaswitch, each
gigaswitch must dedicate 50% of its capability to talking to the other
gigaswitches.  This means you must increase the number of switches in a
Fibonacci sequence, not a linear additive sequence.
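For what it's worth, the 8 MByte figure above is straight
bandwidth-delay arithmetic; here is a quick Python check (my reading
of his numbers is that the per-port worst case is 20mS of traffic
arriving from every other port at once, which is an assumption on my
part):

```python
LINK_RATE = 1_000_000_000   # bits per second, one gigabit port
LATENCY   = 0.020           # 20 mS of allowed queueing
PORTS     = 4

# One link's worth of traffic over the latency window: 2.5 MBytes.
bytes_per_window = LINK_RATE / 8 * LATENCY

# Worst case for one port: the three other ports all dump on it.
buffer_per_port = bytes_per_window * (PORTS - 1)

print(buffer_per_port / 1e6)   # ~7.5 MBytes, i.e. "8 MBytes" in round numbers
```

And, as he says, no amount of this buffering fixes the head-of-line
problem; it only bounds the latency before a discard.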
Consider that if I have one gigaswitch, and all its ports are in use,
if I dedicate 50% of its ports to inter-switch communication, and then
add only one other switch where I do the same, then I end up with
exactly the same number of usable ports.  The MAE-West problem was
that (much!) fewer than 50% of the ports on the two switches were
dedicated to interconnect.

For topologies of 5 or more switches, the number of ports which must
be dedicated to direct interconnect goes up again... one progression
is Fibonacci, the other geometric; you have diminishing returns for
the next X switches in the progression, but the returns are always
positive; you merely have to pay progressively more for each node in
the progression.

The solution to this is to have geographically separate clusters of
the things, and to ensure that the links between clusters are
significantly faster than the links to equipment outside them.

> There are solutions to this sort of problem, but all such solutions
> require truly significant on-card buffer memory... 8 MBytes minimum
> with my example above.  In order to handle card-to-card DMA, cards
> must be able to handle sophisticated DMA scheduling to prevent
> blockages from interfering with other cards.

With respect, I think that you are barking up the wrong tree.  If the
latency is such that the pool retention time becomes infinite for any
packet, then you are screwed, blued, and tattooed.  You *must* never
fill the pool faster than you can drain it... period.

The retention time is dictated not by the traffic rate itself, but by
the relative rate between the traffic and the rate at which you can
make delivery decisions based on traffic content.  Increasing the pool
size can only compensate for a slow decision process, not resource
overcommit by a budget-happy management.  The laws of physics and
mathematics bend to the will of no middle-manager.
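To make the fill-versus-drain point concrete, here is a toy
leaky-bucket sketch (hypothetical Python, with made-up rates and a
made-up pool size; the real MAE-West behavior was of course nothing
this tidy): once arrivals exceed the drain rate, no finite pool
prevents discards, it only delays them.

```python
def leaky_bucket(arrivals, drain_rate, capacity):
    """Leaky-bucket discard policy.  The pool drains at a fixed rate
    each step; packets arriving while the pool is full are dropped,
    bounding retention time at capacity / drain_rate."""
    level = 0.0
    dropped = 0
    for pkts in arrivals:                       # packets per time step
        level = max(0.0, level - drain_rate)    # drain first
        accepted = min(pkts, capacity - level)  # then fill what fits
        dropped += pkts - accepted
        level += accepted
    return dropped

# Fill faster than you drain: 10 in, 4 out per step, pool of 12.
print(leaky_bucket([10] * 5, drain_rate=4, capacity=12))   # 22 dropped

# Fill slower than you drain: nothing is ever dropped.
print(leaky_bucket([3] * 5, drain_rate=4, capacity=12))    # 0 dropped
```

At steady state the first case sheds 6 packets per step, which is
exactly the overcommit, regardless of how big you make the pool.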
> Industrial strength routers that implement cross bars or other high
> speed switch matrices have to solve the ripple-effect-blockage
> problem.  It is not a trivial problem to solve.  It *IS* a necessary
> problem to solve, since direct card-to-card transfers are much more
> efficient than transfers to common shared-memory stores.

It seems to me that this is a wholly artificial problem based upon an
unreasonable desire to overcommit resources, said desire resulting in
an asymmetry in the routing capability between supposed peers, such
that the interaction of two peers can negatively impact a third.  I
blame this on the phone company's historical use of circuit switching
technologies, and the baggage that comes with that mindset.  Asymmetry
is bad.

All of the memory in the world would not have fixed the MAE-West
topology problem (well, OK, but only if they got the memory from the
machines that were trying to send packets through MAE-West, such that
they were incapable of generating traffic for lack of the memory
needed to boot and run ;-)).

> It is *NOT* a problem that PC architectures can deal with well,
> though.

On this, we heartily agree!

> It is definitely *NOT* a problem that PCI cards are usually able
> to deal with, due to the lack of DMA channels and the lack of a
> system-wide scheduling protocol.

I still think that it's very interesting to ask "what's the absolute,
total balls-to-the-wall best that such limited hardware can do?", not
to mention fun.  8-).

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message