Date: Sat, 13 Oct 2001 23:30:41 -0700
From: Terry Lambert
Reply-To: tlambert2@mindspring.com
To: Mike Silbersack
Cc: freebsd-current@freebsd.org
Subject: Re: Some interrupt coalescing tests

Mike Silbersack wrote:

> Well, I've been watching everyone argue about the value of interrupt
> coalescing in the net drivers, so I decided to port Terry's patch to
> 4.4 & -current to see what the results are.

Thanks!

> The network is 100 Mbps, switched. To simulate load, I used a SYN
> flooder aimed at an unused port. ICMP/RST response limiting was
> enabled.
>
> With the -current box attacking the -stable box, I was able to
> notice a slight drop in interrupts/second with the patch applied.
> The number of packets was ~57000/second.
>
> Before: ~46000 ints/sec, 57-63% processor usage due to interrupts.
> After:  ~38000 ints/sec, 50-60% processor usage due to interrupts.
>
> In both cases, the box felt responsive.

One issue to be careful of here is that the removal of the tcptmpl
actually causes a performance hit that wasn't there in the 4.3 code.
My original complaint about tcptmpl taking up 256 bytes instead of 60
stands, but I'm more than half convinced that making it take up 60
bytes is OK... or at least more OK than allocating and deallocating
it each time, and I don't yet have a better answer to the problem.
4.3 doesn't have this change, but 4.4 does.

> With the -stable box attacking the -current box, the patch made no
> difference. The box bogged down at only ~25000 ints/sec, and
> response limiting reported the number of packets to be
> ~44000/second.
>
> I'm not sure if the number was lower because the Celeron couldn't
> run the flooder as quickly, or if the -current box was dropping
> packets. I suspect the latter, as the -current box was NOTICEABLY
> slowed down; I could watch systat refresh the screen.

This is unfortunate; it's an effect I expected with the -current
code, because of the change to the interrupt processing path. To
clarify here: the slowdown occurred both with and without the patch,
right?

The problem is that once you hit livelock (full CPU utilization), you
are pretty much unable to do anything at all, unless the code path
goes all the way to the top of the stack.

> The conclusion? I think that the dc driver does a good enough job
> of grabbing multiple packets at once, and won't be helped by
> Terry's patch except in a very few cases.

10% is a good improvement; my gut feeling was that it would have been
less than that.
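For anyone who hasn't read the driver: the reason dc already does a
decent job here is that its ISR loops on the chip's status register,
so a single interrupt can drain several packets' worth of work.
Schematically (simplified from the 4.x if_dc code, with the error and
mode handling omitted; treat this as a sketch, not the driver
verbatim):

	static void
	dc_intr(void *arg)
	{
		struct dc_softc *sc = arg;
		u_int32_t status;

		/* Mask further interrupts while servicing this one. */
		CSR_WRITE_4(sc, DC_IMR, 0x00000000);

		/*
		 * Loop as long as the chip reports pending events;
		 * this is what coalesces several packets' worth of
		 * work into a single interrupt.
		 */
		while ((status = CSR_READ_4(sc, DC_ISR)) & DC_INTRS) {
			/* Ack the events we are about to service. */
			CSR_WRITE_4(sc, DC_ISR, status);

			if (status & DC_ISR_RX_OK)
				dc_rxeof(sc);	/* drain the rx ring */
			if (status & (DC_ISR_TX_OK | DC_ISR_TX_NOBUF))
				dc_txeof(sc);	/* reap tx descriptors */
		}

		/* Unmask interrupts on the way out. */
		CSR_WRITE_4(sc, DC_IMR, DC_INTRS);
	}

A driver whose ISR is already shaped like this is already coalescing,
which is consistent with the patch only buying ~10% on dc.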
This is actually good news for me, since it means that my 30% number
is bounded by the user space program not being run (in other words, I
should be able to get considerably better performance using a
weighted fair share scheduler). As long as it doesn't damage
performance, I think the patch has proven itself.

> In fact, I have a sneaky suspicion that Terry's patch may
> increase bus traffic slightly. I'm not sure how much of
> an issue this is; perhaps Bill or Luigi could comment.

This would be interesting to me as well. I gave Luigi an early copy
of the patch to play with a while ago, and also copied Bill.

I'm interested in how you think it could increase traffic; the only
credible reason I've been able to come up with is the ability to push
more packets through when they would otherwise end up being dropped
because of a queue-full condition -- and in that case, the bus
traffic is real work, not additional overhead.

If you weren't getting any packets, or had a very slow packet rate,
it might increase bus traffic, in that the extra check might always
come back negative. (In the test case in question, that's not true,
since the patched driver does no more work than it would under the
same load using interrupts to trigger the same bus traffic.) Note
that this is only a consideration if polling an empty ring -- to see
whether DMA has completed to a particular mbuf or cluster -- itself
generates bus traffic, so it takes an odd card for it to be a
problem.

> In short, if we're going to try to tackle high interrupt load,
> it should be done by disabling interrupts and going to polling
> under high load;

I would agree with this, except that it's only really a useful
observation if FreeBSD is being used purely as a network processor.
Without interrupts, the polling will take a significant portion of
the available CPU, and you can't burn that CPU if, for example, you
have an SSL card that does your handshakes but you need to run the
SSL sessions themselves up in user space.

For example, the current ClickArray "Array 1000" product does around
700 1024-bit SSL connection setups a second, and, since it uses a
Broadcom card, the card is only doing the handshaking, not the rest
of the crypto processing. The crypto stream processing has to be done
in user space, in the SSL proxy code living there, and as such would
suffer under pure polling.

> the patch proposed here isn't worth the extra complexity.

I'd argue that the complexity is coming, no matter what. If you
separate out the tx_eof and rx_eof entry points, and externalize them
into the Ethernet driver interface in order to enable polling, you
are going to need a return value on them as well. To implement
scheduling, that return value is going to need to give a packet
count, so that you can forego polling once N packets have been
processed (N > fair share threshold), or else you will not be able to
do any additional processing.

The if_dc driver is problematic because of its organization; if you
look at the if_ti driver, or try to apply the same idea to the if_tg
driver, it becomes 16 lines of code. To externalize the interfaces
and make the necessary changes, without adding the while loop into
the ISR, you are talking 60 lines of code (including structure
changes to support the new entry points, and excluding code
reorganization for the other cards).
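To make the shape of that interface concrete, here is roughly what I
mean; the names (ifpoll_ops, ifpoll_run, the budget argument) are
invented for illustration, and this is not code from the patch:

	/*
	 * Hypothetical externalized entry points: each returns the
	 * number of packets it processed, so the caller can account
	 * for the work done.
	 */
	struct ifpoll_ops {
		int	(*ip_rxeof)(void *sc);	/* packets received */
		int	(*ip_txeof)(void *sc);	/* tx descriptors reaped */
	};

	/*
	 * Poll one interface until it either runs dry or exhausts
	 * its fair share budget; the return value lets the scheduler
	 * charge the interface for the packets it consumed.
	 */
	static int
	ifpoll_run(struct ifpoll_ops *ops, void *sc, int budget)
	{
		int done = 0, n;

		while (done < budget) {
			n = ops->ip_rxeof(sc);
			n += ops->ip_txeof(sc);
			if (n == 0)
				break;	/* nothing pending; stop polling */
			done += n;
		}
		return (done);	/* done >= budget: fair share cap hit */
	}

With a packet count coming back, a weighted fair share scheduler can
cut an interface off once its budget is consumed and give the CPU
back to user space work, instead of polling its way into livelock.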
Also, there are a number of cards that will not transfer additional
data until the interrupt is acknowledged. On those cards you would
need to go to pure polling, or not do polling at all.

NB: If you are interested in pure connection rate, and you want to
protect against SYN floods, then your best bet is actually to put a
SYN-cookie implementation into the firmware on the card, and deal
with connection setup that way. With that approach, you should be
able to easily support a quarter million connections a second.

> I suppose this would all change if we were using LRP and doing lots
> of processing in the interrupt handler... but we aren't.

This is probably a back-handed poke at me for not making the code
available. I have to either clear it with my employer, or reimplement
it from scratch (not hard; I could probably improve it significantly
were I to do this, knowing what I now know). I'm in the process of
getting an approval list together.

Even so, 10% is nothing to sneeze at...

-- Terry