Date: Fri, 18 Jan 2002 00:49:55 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Michal Mertl <mime@traveller.cz>
Cc: arch@FreeBSD.ORG
Subject: Re: 64 bit counters again
Message-ID: <3C47E1B2.6938136@mindspring.com>
References: <Pine.BSF.4.41.0201180033210.82507-100000@prg.traveller.cz>
Michal Mertl wrote:
> > 4) Measure CPU overhead as well as I/O overhead.
>
> I don't know what you mean by I/O overhead here.

Say you could flood a gigabit interface, and it took 6% of the CPU
on average.  Now suppose that, after your patches, it takes 10% of
the CPU.  The limiting factor is the interface... but that's only
for your application, which is not doing CPU intensive processing.
Something that did a lot of CPU work (like SSL) would have a
different profile, and you would be limiting the application by
causing it to become CPU bound earlier.

> > 6) Use an SMP system, make sure that you have a sender
> >    on both CPUs, and measure TLB shootdown and page
> >    mapping turnover to ensure you get that overhead in
> >    there, too (plus the lock overhead).
>
> I'm afraid I don't understand.  I don't see that deep into the
> kernel, unfortunately.  If you tell me what to look at and how...

The additional locks required for i386 64 bit atomicity will, if
the counter is accessed by more than one CPU, result in bus
contention for inter-CPU coherency.

> > 7) Make sure you are sending data already in the kernel,
> >    so you aren't including copy overhead in the CPU cost,
> >    since practically no one implements servers with copy
> >    overhead these days.
>
> What do you mean by that?  Zero-copy operation?  Like sendfile?
> Is Apache 1.x zero-copy?

Yes, zero copy.  Sendfile isn't ideal, but it works.  Apache is not
zero copy.  The idea is to avoid charging a lot of CPU work to
copies between user space and the kernel, copies which aren't going
to happen in an extremely optimized application.
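To illustrate, here is a minimal sketch of the zero copy path using
FreeBSD's sendfile(2).  The function name and error handling are
mine, purely for illustration:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    /*
     * Send an entire file down a connected socket without copying
     * it through a user space buffer; the pages go from the buffer
     * cache to the protocol directly.
     */
    static void
    send_whole_file(int sock, const char *path)
    {
            off_t sbytes = 0;
            int fd;

            if ((fd = open(path, O_RDONLY)) == -1)
                    err(1, "open");

            /* An nbytes of 0 means "send until end of file". */
            if (sendfile(fd, sock, 0, 0, NULL, &sbytes, 0) == -1)
                    err(1, "sendfile");

            close(fd);
    }

You would call this with an already connected socket; the point is
that the benchmark loop spends no cycles on read()/write() copies.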
> > If you push data at 100Mbit, and not even at full throttle at
> > that, you can't reasonably expect to see a slowdown when you
> > have other bottlenecks between you and the changes.
> >
> > In particular, you're not going to see things like the pool
> > size go up because of increased pool retention time, etc.,
> > due to the overhead of doing the calculations.
>
> That's probably correct, even though I again don't fully
> understand what you're talking about :-).

Look at the max number of mbufs allocated.  They form a pool of
type stable memory from which mbufs are allocated (things that get
allocated get freed to the pool instead of freed to the system).
You can see this in the zone counts by dumping the zone information
with vmstat, and in the mbuf counts in the netstat -m case.

Basically, if you run without the 64 bit stuff and get one number,
and then run with it and get a larger number, this means that the
time you are spending doing the stats is increasing the time spent
in the code path, and so the mbufs don't get processed out as
quickly.

The implication, IFF this is the case, is that the additional
processing overhead has increased the amount of time a buffer
remains in transit -- the pool retention time -- and thus it
increases the overall total pool size for a given throughput.

The upshot of this happening is that you now require more memory
for the same amount of work, or, if your machine is "maxed out",
the high end amount of work you can do is reduced by the changes.

> > Also, realize that even though simply pushing data doesn't
> > use up a lot of CPU on FreeBSD if you do it right, even a 2%
> > or 4% increase in CPU overhead overall is enough to cause
> > problems for already CPU-bound applications (i.e. that's
> > ~40 fewer SSL connections per server).
>
> You're right with that too.  Of course I know that at full CPU
> load the cycles will be missing, and maybe other things (memory
> bandwidth with locked operations?) will suffer.

Yes.  It's important to know whether it is significant for the
bottleneck figure of merit for a particular application.  For SSL,
this is CPU cycles.  For an NFS server, it is how much data it can
push in a given period of time (overall throughput).  For some
other application, it's some other number.

For example, the thing that limits the top end speed of SQUID is
how fast it can log, and the number one factor there is actually
the rate at which gettimeofday() can be called while still
maintaining the exhaustive log records that users have come to
expect (these are basically UI "eye candy", except when the logs
are digested and used for billing purposes, at which point they
become absolutely critical).

Because network processing for almost all packets, in or out,
occurs at NETISR in current FreeBSD, this basically means that the
closer the processing time gets to a full quantum, the closer you
are to a condition called "receiver livelock".  This can drop your
top end by up to 15%, and can stop your server in its tracks if you
aren't very careful (RED queueing, weighted fair share queue
scheduling to ensure you don't spend all your time in the kernel
and none in user space processing requests, etc.).

> > But we can wait for your effects on the mbuf count high
> > watermark and CPU utilization values before jumping to any
> > conclusions...
>
> I'm afraid I can't provide any measurements with faster
> interfaces.  I can try to use a real server to send me some data
> so it's executing on both processors, but I would probably become
> limited by the 100Mbit link sooner than I'd notice the processors
> have less time to do their job :(.

Well, you probably should collect *all* the statistics you can, in
the most "this is the only thing I'm doing with the box" way you
can, before and after the code change, and then plot the ones that
get worse (or better) as a result of the change.

[ ... ]

> THE MOST IMPORTANT QUESTION, to which lots of you probably know
> the answer, is: DO WE NEED ATOMIC OPERATIONS FOR ACCESSING
> DIFFERENT COUNTERS (e.g. network-device (modified in ISR? -
> YES/NO) or network-protocol or filesystem ...)?  NO MATTER WHAT
> THE SIZE OF THE COUNTER IS.
>
> If we need atomic, we need atomic 32 bit as much as 64 bit.  If
> we don't, we can have cheaper 64 bit counters.  My API allows for
> different treatment of different classes of counters (if a simple
> answer to my question exists) or places in the kernel (you know
> whether you're calling where an interrupt can occur, whether
> another CPU may modify the same counter...).  I ran the SMP
> kernel through the same test with a "simple 64 bit add
> (addl,adcl)" without noticing that anything went wrong, and that
> sure isn't anywhere near as expensive as lock;cmpxchg8b.

I think the answer is "yes, we need atomic counters".  Whether they
need to be 64 bit or just 32 bit is really application dependent
(we have all agreed to that, I think).

See Bruce's posting about atomicity; I think it speaks very
eloquently on the issue (much more briefly than what I'd write to
say the same thing ;^)).
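To make the cost comparison in that last exchange concrete, here is
a sketch of the two i386 approaches side by side (GCC inline
assembler; the function names and constraints are mine, for
illustration only -- this is not the proposed API):

    #include <sys/types.h>

    /*
     * Plain 64 bit increment: two instructions, no bus lock.  The
     * carry propagates from the low word to the high word, but
     * nothing stops another CPU from seeing or modifying a
     * half-updated value between the two instructions.
     */
    static __inline void
    counter_inc64(volatile u_int64_t *p)
    {
            __asm __volatile(
                    "addl   $1,(%0)\n\t"
                    "adcl   $0,4(%0)"
                    : : "r" (p) : "memory", "cc");
    }

    /*
     * Atomic 64 bit increment: read the old value, compute the new
     * one, and retry with lock;cmpxchg8b until no other CPU has
     * changed the counter in between.  On failure, cmpxchg8b
     * reloads EDX:EAX with the current value, so the loop just
     * recomputes and retries.  The lock prefix is what causes the
     * bus contention discussed above.
     */
    static __inline void
    counter_inc64_atomic(volatile u_int64_t *p)
    {
            __asm __volatile(
                    "movl   (%0),%%eax\n\t"
                    "movl   4(%0),%%edx\n"
            "1:\n\t"
                    "movl   %%eax,%%ebx\n\t"
                    "movl   %%edx,%%ecx\n\t"
                    "addl   $1,%%ebx\n\t"
                    "adcl   $0,%%ecx\n\t"
                    "lock; cmpxchg8b (%0)\n\t"
                    "jnz    1b"
                    : : "D" (p)
                    : "eax", "ebx", "ecx", "edx", "memory", "cc");
    }

Note that even on a UP machine the plain version is not safe
against an interrupt handler that runs between the addl and the
adcl and touches the same counter, which is exactly the "modified
in ISR? - YES/NO" question raised above.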
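As an aside, the gettimeofday() rate that the SQUID logging example
above turns on is easy to measure from user space.  A rough sketch
(the five second window and the output format are arbitrary choices
of mine, not a rigorous benchmark):

    #include <sys/time.h>
    #include <stdio.h>

    /*
     * Spin calling gettimeofday() for about five seconds and
     * report the achieved call rate.
     */
    int
    main(void)
    {
            struct timeval start, now;
            long calls = 0;

            gettimeofday(&start, NULL);
            for (;;) {
                    gettimeofday(&now, NULL);
                    calls++;
                    if (now.tv_sec - start.tv_sec >= 5)
                            break;
            }
            printf("%ld calls in ~5 seconds (%ld/sec)\n",
                calls, calls / 5);
            return (0);
    }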
--
Terry