Date:      Wed, 10 Oct 2001 01:59:48 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        "Kenneth D. Merry" <ken@kdm.org>
Cc:        current@FreeBSD.ORG
Subject:   Re: Why do soft interrupt coelescing?
Message-ID:  <3BC40E04.D89ECB05@mindspring.com>
References:  <3BBF5E49.65AF9D8E@mindspring.com> <20011006144418.A6779@panzer.kdm.org> <3BC00ABC.20ECAAD8@mindspring.com> <20011008231046.A10472@panzer.kdm.org> <3BC34FC2.6AF8C872@mindspring.com> <20011010000604.A19388@panzer.kdm.org>
"Kenneth D. Merry" wrote: > eh? The card won't write past the point that has been acked by the kernel. > If the kernel hasn't acked the packets and one of the receive rings fills > up, the card will hold off on sending packets up to the kernel. Uh, eh? You mean the card will hold off on DMA and interrupts? This has not been my experience. Is this with firmware other than the default and/or the 4.3-RELEASE FreeBSD driver? > I agree that you can end up spending large portions of your time doing > interrupt processing, but I haven't seen instances of "buffer overrun", at > least not in the kernel. The case where you'll see a "buffer overrun", at > least with the ti(4) driver, is when you have a sender that's faster than > the receiver. > > So the receiver can't process the data in time and the card just drops > packets. OK, assuming you meant that the copies would stall, and the data not be copied (which is technically the right thing to do, assuming a source quench style livelock avoidance, which doesn't currently exist)... The problem is still that you end up doing interrupt processing until you run out of mbufs, and then you have the problem of not being able to transmit responses, for lack of mbufs. In the ti driver case, the inability to get another mbuf to replace the one that will be taken out of the ring means that the mbuf gets reused for more data -- NOT that the data flow in the form of DMA from the card ends up being halted until mbufs become available. The real problem here is that most received packets want a response; for things like web servers, where the average request is ~.5k and the average response is ~11k, this means that you would need to establish use-based watermarking, to seperate the mbuff pool into transmit and receive resources; in practice, this doesn't really work, if you are getting your content from a seperate server (e.g. an NFS server that provides content for a web farm, etc.). > That's a different situation from the card spamming the receive > ring over and over again, which is what you're describing. I've > never seen that happen, and if it does actually happen, I'd be > interested in seeing evidence. Please look at what happens in the case of an allocation failure, for any driver that does not allow shrinking the ring of receive mbufs (the ti is one example). > > Without hacking firmware, the best you can do is to ensure > > that you process as much of all the traffic as you possibly > > can, and that means avoiding livelock. > > Uhh, the Tigon firmware *does* drop packets when there is no > more room in the proper receive ring on the host side. It > doesn't spam things. > > What gives you that idea? You've really got some strange ideas > about what goes on with that board. Why would someone design > firmware so obviously broken? The driver does it on purpose, by not giving away the mbuf in the receive ring, until it has an mbuf to replace it. Maybe this should be rewritten to not mark the packet as received, and thus allow the card to overwrite it. There are two problems with that approach however. The first is what happens when you reach mbuf exhaustion, and the only way you can clear out received mbufs is to process the data in a user space appication which never gets to run, and when it does get to run, can't write a response for a request and response protocol, such that it can't free up any mbufs? 
The second is that, in the face of a denial of service attack, the
correct approach (according to Van Jacobson) is to do a random drop,
and rely on the fact that the attack packets, being proportionally
more of the queue contents, get dropped with a higher probability...
so while you _can_ do this, it is really a bad idea if you are
trying to make your stack robust against attacks.

The other thing that you appear to be missing is that the most
common case is that you have plenty of mbufs, and you keep getting
interrupts, replacing the mbufs in the receive ring, and pushing
the data into the ether input by giving away the full mbufs.

The problem occurs when you are receiving at such a high rate that
you don't have any free cycles to run NETISR, and thus you cannot
empty the queue from which ipintr is called with data.

In other words, it's not really the card's fault that the OS didn't
run the stack at hardware interrupt.

> > What this means is that you get more benefit in the soft
> > interrupt coalescing than you otherwise would get, when
> > you are doing LRP.
> >
> > But, you do get *some* benefit from doing it anyway, even
> > if your ether input processing is light: so long as it is
> > non-zero, you get benefit.
> >
> > Note that LRP itself is not a panacea for livelock, since
> > it just moves the scheduling problem from the IRQ<->NETISR
> > scheduling into the NETISR<->process scheduling.  You end
> > up needing to implement weighted fair share or other code
> > to ensure that the user space process is permitted to run,
> > so you end up monitoring queue depth or something else,
> > and deciding not to reenable interrupts on the card until
> > you hit a low water mark, indicating processing has taken
> > place (see the papers by Druschel et al. and Floyd et al.).
>
> It sounds like it might be better handled with better scheduling in
> combination with interrupt coalescing in the hardware.

Hardware interrupt coalescing stalls your processing until the
coalescing window becomes full, or the coalescing idle timer fires.

Really, the best way to handle it is polling, but when you do that,
there is nothing left over for useful work other than the work of
moving packets (e.g. it's OK for L2 or L4 switching, but can't
handle L7 switching, and can't handle user processes not involved
with actual networking).

I've seen this same hardware, with rewritten firmware, do about
100,000 connections a second with polling, but that number at that
point is no longer a figure of merit, since in order to actually
have a product, you have to be able to get real work done (Alteon
quotes ~128,000/second for their most recent product offering).

> Is burdening the hardware interrupt handler even more the
> best solution?

I don't really understand the question: you have to do the work
sometime, in response to an external, real-world event.

Polling reduces the latency of handling the event, bringing it
closer to when it actually occurs, but means that you can only do a
limited set of work.  Handling the event in response to a signal
that it has taken place (i.e. an interrupt) is the best approach to
making it so that you can do other work until your attention is
required.  Polling multiple times until there is no more work (e.g.
coalescing the work into a single interrupt) is as close as you'll
get to pure polling without actually doing pure polling.

I don't really see why doing as much work as possible by directly
coupling to the "work to do" event -- the interrupt -- is being
considered a bad thing.
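For what it's worth, here is roughly what I mean by soft interrupt
coalescing, as a sketch (the xx_ helpers and the budget value are
made up, not real driver entry points): the handler keeps re-checking
for completed work before re-enabling interrupts, so several packets
get processed per hardware interrupt, without the added latency of
waiting for a card-side coalescing window or idle timer:

static void
xx_intr(void *arg)
{
	struct xx_softc *sc = arg;
	int budget = 64;	/* arbitrary cap, so this loop alone
				 * cannot livelock the machine */
	int more;

	xx_disable_intr(sc);	/* mask the card while we work */

	do {
		more = 0;
		if (xx_rx_ring_has_work(sc)) {
			xx_rxeof(sc);	/* drain completed receives */
			more = 1;
		}
		if (xx_tx_ring_has_work(sc)) {
			xx_txeof(sc);	/* reclaim completed transmits */
			more = 1;
		}
		/*
		 * Anything that arrived while we were processing is
		 * handled now, in this same interrupt, instead of
		 * costing another interrupt later.
		 */
	} while (more && --budget > 0);

	xx_enable_intr(sc);
}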
It goes against all the high speed networking research of the last
decade, so with no contrary evidence, it's really hard to support
that view.

> > I was chewed on before for context diffs.  As I said before,
> > I can provide them, if that's the current coin of the realm; it
> > doesn't matter to me.
>
> That should be the default mode -- as Peter pointed out, non-context
> diffs will work far less often and are far more difficult to decipher
> than context diffs.

Is this a request for me to resend the diffs?

> > The problem I have is that you simply can't use jumbograms
> > in a commercial product, if they can't be autonegotiated,
> > or you will burn all your profit in technical support calls
> > very quickly.
>
> The MTU isn't something that gets negotiated -- it's something that
> is set per-network-segment.
>
> Properly functioning routers will either fragment packets (that don't
> have the DF bit set) or send back ICMP responses telling the sender
> their packet should be smaller.

Sure.  So you set the DF bit, and then start with honking big
packets, sending progressively smaller ones, until you get a
response.

> > Intel can autonegotiate with several manufacturers, and the
> > Tigon III.  It can interoperate with the Tigon II, if you
> > statically configure it for jumbograms.
>
> This sounds screwed up; you don't "autonegotiate" jumbo frames, at
> least on the NIC/ethernet level.  You set the MTU to whatever the
> maximum is for that network, and then you let your routing equipment
> handle the rest.
>
> If you're complaining that the Intel board won't accept jumbo frames
> from the Alteon board by default (i.e. without setting the MTU to
> 9000) that's no surprise at all.  Why should a board accept a large
> packet when its MTU is set to something smaller?

It's the Alteon board that craps out.  Intel to Intel works fine,
when you "just start using them" on one end.

> I don't think there's any requirement that a card accept packets
> larger than its configured MTU.

You're right that it's not a requirement: it's just a customer
expectation that your hardware will try to operate in the most
efficient mode possible, and all other modes will be fallback
positions (which you will then complain about in your logs).

> > Another interesting thing is that it is often a much better
> > idea to negotiate an 8k MTU for jumbograms.  The reason for
> > this is that it fits evenly into 4 mbuf clusters.
>
> Yeah, that's nice.  FreeBSD's TCP stack will send 8KB payloads
> automatically if your MTU is configured for something more than that.
>
> That comes in handy for zero copy implementations.

Ugh.  I was unaware that FreeBSD's stack would not honor a 9k MTU
(see the P.S. below for the 8k-vs-9k cluster arithmetic).

> > There are actually some good arguments in there for having
> > non-fixed-sized mbufs...
>
> Perhaps, but not that many.

1)	mbuf cluster headers waste most of their space.

2)	max-MTU-sized mbufs have no wasted space.

3)	(obsolete, sadly) the tcptempl structure is 60 bytes, not 256,
	but uses a full mbuf, for more than 2/3rds space wastage.

4)	9k packets

-- Terry
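P.S.: To put numbers on the 8k-vs-9k point, here is the arithmetic
as a quick illustration of my own, assuming the stock 2048-byte
MCLBYTES cluster size:

#include <stdio.h>

int
main(void)
{
	const int mclbytes = 2048;	/* standard mbuf cluster size */
	const int mtus[] = { 8192, 9000 };
	int i;

	for (i = 0; i < 2; i++) {
		int clusters = (mtus[i] + mclbytes - 1) / mclbytes;
		int waste = clusters * mclbytes - mtus[i];
		printf("MTU %5d -> %d clusters, %4d bytes wasted\n",
		    mtus[i], clusters, waste);
	}
	return (0);
}

An 8192-byte frame packs into exactly 4 clusters with nothing
wasted; a 9000-byte frame needs 5 clusters and strands 1240 bytes
per packet.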