Date: Mon, 15 Oct 2001 11:35:51 -0700
From: Terry Lambert
Reply-To: tlambert2@mindspring.com
To: "Kenneth D. Merry"
Cc: current@FreeBSD.ORG
Subject: Re: Why do soft interrupt coelescing?

"Kenneth D. Merry" wrote:
> Dropping packets before they get into card memory would only be possible
> with some sort of traffic shaper/dropping mechanism on the wire to drop
> things before they get to the card at all.

Actually, DEC had a congestion control mechanism that worked by
marking all packets over a certain level of congestion (this was
sometimes called the "DECbit" approach).  You can do the same
thing with any intermediate hop router, so long as it is better
at moving packets than your destination host.

It turns out that even if the intermediate hop and the host at
the destination are running the same hardware and OS, the cost
is going to be higher to do the terminal processing than it is
to do the network processing, so you are able to use the tagging
to indicate to the terminal hop which flows to drop packets out
of before processing.

Cisco routers can do this (using the CMU firmware) on a per-flow
basis, leaving policy up to the end node.  Very neat.

[ ... per connection overhead, and overcommit ... ]

> You could always just put 4G of RAM in the machine, since memory is so
> cheap now. :)
>
> At some point you'll hit a limit in the number of connections the
> processor can actually handle.

In practice, particularly for HTTP or FTP flows, you can halve
the amount of memory expected to be used.  This is because the
vast majority of the data is generally pushed in only one
direction.

For HTTP 1.1 persistent connections, you can, for the most part,
also assume that the connections are bursty -- that is, that
there is a human attached to the other end, who will spend some
time examining the content before making another request (you
can assume the same thing for 1.0, but that doesn't count
against the persistent connection count, unless you also include
time spent in TIME_WAIT).

So overcommit turns out to be O.K. -- which is what I was trying
to say in a back-handed way in the last post.  If you include
window control (i.e. you care about overall throughput, and not
about individual connections), then you can safely service
1,000,000 connections with 4G on a FreeBSD box.
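As a rough sanity check on that number (the per-connection
figures here are only illustrative, and will vary with how the
kernel is tuned):

    4G / 1,000,000 connections  ~=  4K per connection
    4K per connection           ~=  one 2K mbuf cluster of send buffer
                                    + ~2K of socket, inpcb and tcpcb state

which only works out if the receive side is mostly idle and the
send windows are clamped down -- i.e. exactly the overcommit and
window control assumptions above.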
> > This is actually very bad: you want to drop packets before you
> > insert them into the queue, rather than after they are in the
> > queue.  This is because you want the probability of the drop
> > (assuming the queue is not maxed out: otherwise, the probability
> > should be 100%) to be proportional to the exponential moving
> > average of the queue depth, after that depth exceeds a drop
> > threshold.  In other words, you want to use RED.
>
> Which queue?  The packets are dropped before they get to ether_input().

The easy answer is "any queue", since what you are really
concerned with is pool retention time: you want to throw away
packets before a queue overflow condition means you are taking
in more than you can actually process.

> Dropping random packets would be difficult.

The "R" in "RED" is "Random", for "Random Early Detection" (or
"Random Early Drop", in a minority of the literature), true.
But the randomness involved is whether you drop vs. insert a
given packet, not whether you drop a random packet from the
queue contents.

Really dropping random queue elements would be incredibly bad.
The problem is that, during an attack, the number of packets you
get is proportionally huge, compared to the non-attack packets
(the ones you want to let through).  A RED approach prevents new
packets from being enqueued: it protects the host system's
ability to continue degraded processing, by making the link
appear "lossy" -- the closer the queue is to full, the more
lossy the link.

If you were to drop random packets already in the queue, then
the probability of dumping good packets grows with the queue
depth times the fraction of bad packets in the traffic.  In
other words, a continuous attack will almost certainly push all
good packets out of the queue before they reach the head.

Dropping packets prior to insertion maintains the ratio of bad
and good packets, so it doesn't inflate the danger to the good
packets by the relative queue depth: thus dropping before
insertion is a significantly better strategy than dropping after
insertion, for any queue depth over 1.

> > Maybe I'm being harsh in calling it "spam'ming".  It does the
> > wrong thing, by dropping the oldest unprocessed packets first.
> > A FIFO drop is absolutely the wrong thing to do in an attack
> > or overload case, when you want to shed load.  I consider the
> > packet that is being dropped to have been "spam'med" by the
> > card replacing it with another packet, rather than dropping
> > the replacement packet instead.
> >
> > The real place for this drop is "before it gets to card memory",
> > not "after it is in host memory"; Floyd, Jacobson, Mogul, etc.,
> > all agree on that.
>
> As I mentioned above, how would you do that without some sort of traffic
> shaper on the wire?

The easiest answer is to RED queue in the card firmware.
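To make "drop with a probability proportional to the moving
average of the queue depth" concrete, here is a minimal sketch
of just the drop decision.  The names and constants are made up
for illustration; the full algorithm in the Floyd and Jacobson
paper also scales the probability by the count of packets
accepted since the last drop, and handles idle periods, both of
which I have left out:

#include <stdlib.h>

#define RED_WQ_SHIFT	9	/* EWMA weight w = 1/512 */
#define RED_MIN_TH	5	/* start dropping above this average depth */
#define RED_MAX_TH	15	/* drop everything above this average depth */
#define RED_MAX_P_NUM	1	/* maximum drop probability is 1/10, */
#define RED_MAX_P_DEN	10	/* reached as the average hits max_th */

struct red_state {
	long	rs_avg_fp;	/* EWMA of queue depth, 16.16 fixed point */
};

/*
 * Update the moving average and decide the fate of one incoming
 * packet: returns nonzero if it should be dropped rather than
 * enqueued.  "qlen" is the instantaneous queue depth, "qlimit"
 * the hard limit on the queue.
 */
static int
red_should_drop(struct red_state *rs, int qlen, int qlimit)
{
	long avg, diff;

	/* avg += w * (qlen - avg), with w = 2^-RED_WQ_SHIFT */
	diff = ((long)qlen << 16) - rs->rs_avg_fp;
	rs->rs_avg_fp += diff / (1L << RED_WQ_SHIFT);
	avg = rs->rs_avg_fp >> 16;

	if (qlen >= qlimit || avg >= RED_MAX_TH)
		return (1);	/* (nearly) full: drop everything */
	if (avg < RED_MIN_TH)
		return (0);	/* below the threshold: accept everything */

	/*
	 * In between, drop with a probability that rises linearly
	 * from 0 at min_th to max_p at max_th.
	 */
	return (random() % ((RED_MAX_TH - RED_MIN_TH) * RED_MAX_P_DEN) <
	    (avg - RED_MIN_TH) * RED_MAX_P_NUM);
}

The point is that the decision is made before the packet is ever
inserted -- whether by the driver before it sets up the receive
DMA, or by the card firmware before it touches its ring -- so the
good/bad ratio of what is already queued never gets worse.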
> My focus with gigabit ethernet was to get maximal throughput out of a
> small number of connections.  Dealing with a large number of connections
> is a different problem, and I'm sure it uncovers lots of nifty bugs. 8-).

I guess that you are more interested in intermediate hops and
site to site VPN, while I'm more interested in connection
termination (big servers, SSL termination, and single client
VPN).

> > I'd actually prefer to avoid the other DMA; I'd also like
> > to avoid the packet receipt order change that results from
> > DMA'ing over the previous contents, in the case that an mbuf
> > can't be allocated.  I'd rather just let good packets in with
> > a low (but non-zero) statistical probability, relative to a
> > slew of bad packets, rather than letting a lot of bad packets
> > from a persistent attacker push my good data out with the bad.
>
> Easier said than done -- dumping random packets would be difficult with a
> ring-based structure.  Probably what you'd have to do is have an extra
> pool of mbufs lying around that would get thrown in at random times when
> mbufs run out, to allow some packets to get through.
>
> The problem is, once you exhaust that pool, you're back to the same old
> problem if you're completely out of mbufs.
>
> You could probably also start shrinking the number of buffers in the
> ring, but as I said before, you'd need a method for the kernel to notify
> the driver that more mbufs are available.

You'd be better off shrinking the window size across all the
connections, I think.

As to it being difficult to do: I actually have RED queue code,
which I adapted from the formula in a paper, and I have no
problem giving that code out.

The real issue is that the BSD queue macros involved in the
queues really need to be modified to include an "elements on
queue" count, for the calculation of the moving average.
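To make the "elements on queue" point concrete -- this is not
the code I mentioned above, just a minimal illustration with
made-up names -- a thin wrapper around the stock <sys/queue.h>
TAILQ macros can keep the length that the moving average
calculation needs to read cheaply:

#include <sys/queue.h>
#include <stddef.h>

struct pkt {
	TAILQ_ENTRY(pkt) p_link;
	/* mbuf pointer, lengths, etc. would go here */
};

struct pktq {
	TAILQ_HEAD(, pkt) q_head;
	int	q_len;		/* elements currently on the queue */
	int	q_limit;	/* hard cap on the queue depth */
};

static void
pktq_init(struct pktq *q, int limit)
{
	TAILQ_INIT(&q->q_head);
	q->q_len = 0;
	q->q_limit = limit;
}

/* Returns 0 on success, -1 if the hard limit has been reached. */
static int
pktq_enqueue(struct pktq *q, struct pkt *p)
{
	if (q->q_len >= q->q_limit)
		return (-1);
	TAILQ_INSERT_TAIL(&q->q_head, p, p_link);
	q->q_len++;
	return (0);
}

static struct pkt *
pktq_dequeue(struct pktq *q)
{
	struct pkt *p;

	p = TAILQ_FIRST(&q->q_head);
	if (p != NULL) {
		TAILQ_REMOVE(&q->q_head, p, p_link);
		q->q_len--;
	}
	return (p);
}

The receive path then asks red_should_drop(&rs, q->q_len,
q->q_limit) first, and only calls pktq_enqueue() when the answer
is no; otherwise it frees the packet on the spot.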
> > If you are sharing interrupts at this network load, then
> > you are doing the wrong thing in your system setup.  If
> > you don't have such a high load, then it's not a problem;
> > either way, you can avoid the problem, for the most part.
>
> It's easier with an SMP system (i.e. you've got an APIC.)

Virtual wire mode is actually bad.  It's better to do asymmetric
interrupt handling, if you have a number of identical or even
similar interrupt sources (e.g. two gigabit cards in a box).

> > OK, I will rediff and generate context diffs; expect them to
> > be sent in 24 hours or so from now.
>
> It's been longer than that...

Sorry; I've been doing a lot this weekend.  I will redo them at
work today, and resend them tonight... definitely.

> > > Generally the ICMP response tells you how big the maximum MTU is,
> > > so you don't have to guess.
> >
> > Maybe it's the ICMP response; I still haven't had a chance to
> > hold Michael down and drag the information out of him. 8-).
>
> Maybe what's the ICMP response?

The difference between working and not working.

> > Cisco boxes detect "black hole" routes; I'd have to read the
> > white paper, rather than just its abstract, to tell you how,
> > though...
>
> It depends on what they're trying to do with the information.  If they're
> just trying to route around a problem, that's one thing.  If they're
> trying to diagnose MTU problems, that's quite another.
>
> In general, it's going to be pretty easy for routers to detect when a
> packet exceeds the MTU for one of their interfaces and send back an ICMP
> packet.

A "black hole" route doesn't ICMP back, either because some
idiot has blocked ICMP, or because it's just too dumb...

> > Not for the user.
>
> Configuring the MTU is a standard part of configuring IP networks.
> If your users aren't smart enough to do it, you'll pretty much have
> to default to 1500 bytes for ethernet.

Or figure out how to negotiate higher...

> You can let the more clueful users increase the MTU.

That doesn't improve performance, and so "default configuration"
benchmarks like "Polygraph" really suffer as a result.

> If you're supplying enough of the equipment, you can make some
> assumptions about the equipment setup.  This was the case with my former
> employer -- in many cases we supplied the switch as well as the machines
> to go onto the network, so we knew ahead of time that jumbo frames would
> work.  Otherwise, we'd work with the customer to set things up with
> standard or jumbo frames depending on their network architecture.

This approach only works if you're Cisco or another big iron
vendor, in an established niche.

[ ... more on MTU negotiation for jumbograms ... ]

> > In any case, Intel cards appear to do it, and so do Tigon III's.
>
> That's nice, but there's no reason a card has to accept packets with a
> higher MTU than it is configured for.

True, but then there's no reason I have to buy the cards that
choose not to do this. 8-).

--
Terry