Date: Thu, 09 Aug 2001 01:39:57 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Weiguang SHI <weiguang_shi@hotmail.com>
Cc: grog@FreeBSD.org, bmilekic@technokratis.com, dillon@earth.backplane.com, zzhang@cs.binghamton.edu, freebsd-hackers@FreeBSD.org
Subject: Re: Allocate a page at interrupt time
Message-ID: <3B724C5D.6CDD0063@mindspring.com>
References: <F122JziWwSoRKbpG1Ki00002460@hotmail.com>
Weiguang SHI wrote:
> I found an article on livelock at
>
> http://www.research.compaq.com/wrl/people/mogul/mogulpubsextern.html
>
> Just go there and search for "livelock".
>
> But I don't agree with Terry about the interrupt-thread-is-bad
> thing, because, if I read it correctly, the authors themselves
> implemented their ideas in the interrupt thread of Digital Unix.

Not quite. These days, we are not necessarily talking about just interrupt load limitations.

Feel free to take the following with a grain of salt; but realize that I have personally achieved more simultaneous connections on a FreeBSD box than anyone else out there without my code in hand, that this was using gigabit ethernet controllers on modern hardware, and that this code is in a shipping product today.

--

The number one way of dealing with excess load is to shed it before it causes problems (a rough sketch of what I mean follows below). In an interrupt threads implementation, you can't really do this, since the only option you have is when to schedule a polling operation. This leads to several inefficiencies, all of which negatively impact the top-end performance you are going to be able to achieve.

Use of interrupt threads suffers from a drastically increased latency in reenabling interrupts, and can generally only perform a single polling cycle without running into the problem of not making forward progress at the application level (they run at IPL 0, which is effectively the same time at which NETISR is currently run). This leads to a tradeoff between increased interrupt handling latency (e.g. the Tigon II gigabit ethernet driver in FreeBSD sets the Tigon II card firmware to coalesce at most 32 interrupts) and the transmit starvation problem noted in section 4.4 of the paper.

It should also be noted that, even if you have not reenabled interrupts, the DMA engine on the card will still be DMA'ing data into your receiver ring buffer. The burst data rate on a 66MHz, 64 bit PCI bus is just over 4 Gbit/s (64 bits x 66 MHz is about 4.2 Gbit/s), and the sustainable data rate is much lower than that. This means a machine acting as a switch or firewall with two of these cards on board will not really have much time for doing anything at all except DMA transfers, if they are run at full burst speed all the time (not possible). Running an application which requires disk activity will further eat into the available bandwidth.

So this raises the spectre of DMA-based bus transfer livelock, not just interrupt-based livelock, if one is scheduling interrupt threads to do event polling instead of using one of the other approaches outlined in the paper.

In the DEC UNIX case, they mitigated the problem by getting rid of the IP input queue and getting rid of NETISR (I agree that these are required of any code with these goals). The use of the polling thread is really just their way of implementing the polling approach from section 5.3. This does not address the problems I noted above, and in particular does not address the latency vs. bus livelock tradeoff problem with modern hardware (they were using an AMD LANCE ethernet chip; this was a 10Mb chip, and it doesn't support interrupt coalescing).

They also assumed the use of a user space forwarding agent ("screend"): a single process. Further, I think that the feedback mechanism selected is not really workable without rewriting the card firmware and having a significant memory buffer on the card, something which is not available on the market yet today.
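To make the load-shedding point concrete, here is a minimal sketch in plain C with invented names (rx_ring, rx_poll, POLL_BUDGET and so on are mine; this is not FreeBSD driver code, just an illustration): the polling pass processes at most a fixed budget of frames per call, and anything beyond that is dropped at the ring rather than queued, so the work per pass stays bounded no matter how fast the card DMAs new frames in.

/*
 * Hypothetical sketch: bounded-work polling of a receive ring.
 * Names and structures are invented for illustration of the
 * "shed load before it causes problems" idea; not driver code.
 */
#include <stdio.h>

#define RING_SIZE   256     /* slots the card DMAs frames into        */
#define POLL_BUDGET 32      /* max frames processed per polling pass  */

struct rx_slot {
    int  ready;             /* card has DMA'd a frame into this slot  */
    char data[1518];        /* worst-case ethernet frame              */
};

static struct rx_slot rx_ring[RING_SIZE];
static unsigned       rx_head;           /* next slot to look at      */
static unsigned long  frames_handled, frames_shed;

/* Stand-in for real protocol processing. */
static void process_frame(struct rx_slot *slot) { (void)slot; frames_handled++; }

/*
 * One polling pass: handle at most POLL_BUDGET frames, then drop
 * (shed) whatever else is already in the ring rather than deferring
 * it, so the pass always completes in bounded time.
 */
static void rx_poll(void)
{
    int budget = POLL_BUDGET;

    while (budget-- > 0 && rx_ring[rx_head].ready) {
        process_frame(&rx_ring[rx_head]);
        rx_ring[rx_head].ready = 0;
        rx_head = (rx_head + 1) % RING_SIZE;
    }

    /* Budget exhausted: shed the backlog instead of queueing it. */
    while (rx_ring[rx_head].ready) {
        rx_ring[rx_head].ready = 0;        /* drop at the ring */
        rx_head = (rx_head + 1) % RING_SIZE;
        frames_shed++;
    }
}

int main(void)
{
    /* Simulate a burst of 200 frames arriving between polls. */
    for (unsigned i = 0; i < 200; i++)
        rx_ring[(rx_head + i) % RING_SIZE].ready = 1;

    rx_poll();
    printf("handled %lu, shed %lu\n", frames_handled, frames_shed);
    return 0;
}

A real driver would obviously count the shed frames and could feed that number back into the card's coalescing settings; the only point here is that the pass completes in bounded time, which is what lets the application level keep making forward progress.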
The feedback mechanism is unworkable because, in practice, you can't stop all incoming packet processing just because one user space program out of dozens has a full input queue that it has not drained yet. It's not reasonable to ignore new incoming requests to a web server, or to disable card interrupts, or to (for example) drop all ARP packets until TCP processing for that one application is complete. Their basic assumption, which they admit in section 6.6.1, is that screend is the only application running on the system.

This is simply not the case with a high traffic web server, a database system, or any other work-to-do engine model of several processes (or threads) with identical capability to service the incoming requests. Further, these applications use TCP, and thus have explicitly application-bound socket endpoints, and there is no way to guarantee client load.

We could trivially DOS attack an Apache server running SSL via mod_proxy, for example, by sending a flood of intentionally bad packets. The computational expense would keep its input queue full, and the feedback mechanism noted would therefore starve the other Apache processes of legitimate input. There are other obvious attacks, no less damaging in their results, against other points in the assumption of a single-process queue feedback mechanism.

Their scheduler in section 7 is in effect identical to the "fixed" scheduling class in SVR4. (That class was used by USL to avoid the "move mouse, wiggle cursor" problem when using the UnixWare linker, which mmap'ed all object files and then seeked all over in them, thrashing all other pages out of the buffer cache; giving the X server a fixed portion of the available CPU to thrash those pages back in was a highly non-optimal fix.) It really only addresses CPU contention, at the cost of pessimizing the amount of CPU available for packet processing when that's all the work there is to do, and at the expense of "chatty" network applications: precisely the sort of thing FreeBSD is being used for these days, such as NFS servers and HTTP servers. Clearly, the quota was set from the perspective of the need to support an interactive user load, which is neither common nor expected in the applications where FreeBSD is commonly used these days. Section 7.1 notes this tangentially, in a discussion of NFS.

Section 10 notes the differential rates of memory vs. CPU speed increase, and that their approach may have some problems. My opinion is that the polling approach is correct, that there is probably room for something like a weighted fair share method for doing the scheduling (a rough sketch of the sort of thing I mean is appended below), and that using the 5.1 or 5.2 method, rather than the interrupt thread method in 5.3, would yield significantly better results on SMP and high speed interface systems that are limited not by interfaces or interrupts (per se), but by bus bandwidth (recent work at various universities seems to confirm this opinion, FWIW).

Overall, this is a seminal paper, and it identifies almost every one of the important issues; it also admits where it may be wrong if things change in the future -- I would argue that things have changed.

-- Terry
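To be a bit more concrete about what I mean by "weighted fair share" (this is purely my own back-of-the-envelope sketch with invented names, not something from the paper or from any shipping system): give each consumer of packet-processing work a weight, split a fixed per-pass budget in proportion to those weights, and carry unused credit forward, deficit-round-robin style, so a busy but lightly weighted consumer can't starve the others.

/*
 * Hypothetical sketch of a weighted fair share split of a per-pass
 * packet budget across several consumers (e.g. per-interface or
 * per-application receive queues).  Invented names; unused credit
 * is carried forward so shares stay fair over time.
 */
#include <stdio.h>

#define NCONSUMERS  3
#define PASS_BUDGET 32              /* total packets processed per pass */

struct consumer {
    const char *name;
    int         weight;             /* relative share of the budget     */
    int         backlog;            /* packets waiting in its queue     */
    int         deficit;            /* unused credit carried forward    */
};

/* One pass: hand each consumer a slice proportional to its weight. */
static void fair_pass(struct consumer *c, int n)
{
    int total_weight = 0;
    for (int i = 0; i < n; i++)
        total_weight += c[i].weight;

    for (int i = 0; i < n; i++) {
        /* credit for this pass, plus anything carried over */
        c[i].deficit += PASS_BUDGET * c[i].weight / total_weight;

        int served = c[i].backlog < c[i].deficit ? c[i].backlog
                                                 : c[i].deficit;
        c[i].backlog -= served;
        c[i].deficit -= served;
        printf("%-8s served %2d, backlog now %3d\n",
               c[i].name, served, c[i].backlog);
    }
}

int main(void)
{
    struct consumer c[NCONSUMERS] = {
        { "httpd",   4, 100, 0 },   /* busy, heavily weighted  */
        { "nfsd",    2,  10, 0 },
        { "screend", 1, 100, 0 },   /* busy, lightly weighted  */
    };

    for (int pass = 0; pass < 3; pass++) {
        printf("-- pass %d --\n", pass);
        fair_pass(c, NCONSUMERS);
    }
    return 0;
}

The integer split plus the carried deficit is just the simplest way to show the shape of it; where the weights come from, and how drops get charged back to a consumer, are exactly the open questions.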