Date: Thu, 20 Oct 2005 15:45:21 +1000 (EST)
From: Bruce Evans
To: Scott Long
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org, Andrew Gallatin,
    cvs-all@FreeBSD.org, David Xu
Subject: Re: cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c
Message-ID: <20051020145234.H99720@delplex.bde.org>
In-Reply-To: <4355080C.302@samsco.org>

On Tue, 18 Oct 2005, Scott Long wrote:

[Excessive quoting retained since I want to comment on separate points.]

> Andrew Gallatin wrote:
>> Scott Long writes:
>> > Andrew Gallatin wrote:
>> > > David Xu [davidxu@FreeBSD.org] wrote:
>> > >
>> > >> davidxu 2005-10-17 23:10:31 UTC
>> > >>
>> > >> FreeBSD src repository
>> > >>
>> > >> Modified files:
>> > >>   sys/amd64/amd64 cpu_switch.S machdep.c
>> > >> Log:
>> > >> Micro optimization for context switch. Eliminate code for saving
>> > >> gs.base and fs.base. We always update pcb.pcb_gsbase and
>> > >> pcb.pcb_fsbase when user wants to set them, in context switch
>> > >> routine, we only need to write them into registers, we never have
>> > >> to read them out from registers when thread is switched away.
>> > >> Since rdmsr is a serialization instruction, micro benchmark shows
>> > >> it is worthy to do.
>> > >
>> > > Nice. This reduces lmbench context switch latency by about 0.4us
>> > > (7.2 -> 6.8us), and reduces TCP loopback latency by about 0.9us
>> > > (36.1 -> 35.2) on my dual core 3800+

I wonder if this reduces the context switch latency from about 1.320 usec
to 0.900 usec on my A64-3000. The latency is only .520 usec in i386 mode.
I use a TSC timecounter of course.

The fastest loopback latency that I've seen is 5.638 usec under
Linux-2.2.9 on the same machine. In Linux-2.6.10, it has regressed to
17.1 usec. In FreeBSD last year, it was 10.8 usec on the same machine in
i386 mode and 19.0 in amd64 mode. So the A64 can almost keep up with an
AXP-1400 running a pre-SMPng version of FreeBSD where it was 9.94 usec.

[... Nonsense by phk already snipped]

The timecounter is not used by schedulers, so the inefficiency of non-TSC
timecounters and its effect on context switching has nothing to do with
schedulers. Schedulers use mainly tick counts, and intentionally don't
try hard to keep track of interrupt times because the fine-grained
timekeeping needed to keep track of interrupts would be too expensive.
It is still too expensive, but is now done (except for fast interrupts),
but is not used by schedulers.
The timestamps taken by mi_switch() are used mainly by userland statistics
utilities. They are very useful for debugging and for otherwise
understanding system behaviour, but are sometimes too inefficient.

>> > > It is a shame we can't find a way to use the TSC as a timecounter on
>> > > SMP systems. It seems that about 40% of the context switch time is
>> > > spent just waiting for the PIO read of the ACPI-fast or i8254 to
>> > > return.

It seems to be more like 95% in your case.

>> > >
>> > > Drew
>> >
>> > The TSC represents the clock rate of the CPU, and thus can vary wildly
>> > when thermal and power management controls kick in, and there is no way
>> > to know when it changes. Because of this, I think that it's
>> > practically useless on Pentium-Mobile and Pentium-M chips, among many
>> > others. There is also the issue of multiple CPUs having to keep their
>> > TSC's somewhat in sync in order to get consistent counting in the
>> > system. The best that you can do is to periodically read a stable
>> > counter and try to recalibrate, but then you'll likely start getting
>> > wild operational variances.

I agree that it's too hard to sync the TSC on systems with power
management. It would be easy enough to sync with the i8254 every HZ, but
even that would give extreme nonlinearities when the TSC frequency jumps
up or down. Jumping up is the worst case. E.g., if the TSC frequency
starts at 1GHz and HZ is 1000, expect the TSC count to increment by 10^6
in the next msec. If the TSC frequency jumps up to 2GHz, then the TSC
count will actually increment by 2*10^6. I see nothing better than
recalibrating half way into the next msec (when the TSC count reaches
10^6) and then wildly slewing the TSC clock to prevent the 10^6 increment
in the count expected in the next half a msec from causing another
half-msec error.

>> As I pointed out in another thread, both linux and solaris do it.
>> Solaris seems to have a nice algorithm for keeping things in sync, and
>> accounting for the TSC getting cleared after suspend/resume etc. At
>> my level of understanding, this argument is nothing more than "but
>> Mom, all the other kids are doing it". I was just hoping that
>> somebody with real understanding could pick up on it.
>
> Steering multiple TSC's together isn't that hard and there are plenty of
> examples, as you point out. Accounting for the changes due to thermal
> and power management (note that this isn't the same problem as suspend
> and resume) is what worries me.

Possibly the systems with power management don't matter here. Power
management is currently only essential for portable machines, and the
portable machines won't have multi-Gb/s networks to keep up with and
might not have such strict real time requirements.

>> > It's a shame that a PIO read is still so
>> > expensive. I'd hate to see just how bad your benchmark becomes when
>> > ACPI-slow is used instead of ACPI-fast.
>>
>> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
>> switch is otherwise 4us, it adds up. i8254 is much worse on this
>> system (6.5us).

I don't know why your system is so slow. I get ~50 nsec for TSC,
~1000 nsec for ACPI-fast, ~3000 nsec for ACPI-slow and ~4000 nsec for
i8254. But PIO keeps getting slower even in absolute terms. My (nearly)
newest system (nForce2) has ISA PIO times of 1133 nsec for the i8254
registers where my first PCI system (with an early Intel chipset) has a
read time of 703 nsec and a write time of 1180 nsec. The nForce2 system
also has a PCI PIO read time of 290 nsec for the same PCI card that can
be read in 125 nsec (overclocked) or 150 nsec (not overclocked) on a
KT266A system.

>> > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
>> > of an idea.
>> > Having preemption in the kernel means that ithreads can run
>> > right away instead of having to wait for a tick, and various fixes to
>> > 4BSD in the past year have eliminated bugs that would make the CPU wait
>> > for up to a tick to schedule a thread. So all we're getting now is a
>> > 10x increase in scheduler overhead, including reading the timecounters.
>>
>> Yeah. I moved my back to hz=1000 when I noticed 4000 interrupts/sec
>> on an idle system.
>
> Do you mean 1000 or 100 here? Anyways, the high clock interrupt rate is
> so that we can use the local apic clock to get the various system ticks
> that we have instead of continuing to fight motherboards that no longer
> hook up the 8259 in a sane way. This is why 5.x doesn't work well on a
> number of new motherboards (nvidia ones especially) but 6.x works just
> fine.

[Dan actually meant 100.]

I use 100 and never downgraded to use 1000 except for testing how bad it
is. The default number is now up to * 2 * HZ. E.g., it is 4000 on
sledge.freebsd.org. While 4000 interrupts/sec can be handled easily by
any new machine, 4000 is a disgustingly large number to use for clock
interrupts. Have a look at vmstat -i output on almost any machine. On
most machines in the freebsd cluster, the total number of interrupts is
dominated by clock interrupts even with HZ = 100. The main use for a
large HZ is for low-quality hardware and applications that need or want
to poll very often.

Bruce
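
The TSC recalibration arithmetic discussed earlier in this message (1GHz
jumping to 2GHz with HZ = 1000) can be checked with a small sketch. This
is only an illustration in Python, not FreeBSD code: the function names
are made up for this example, and it assumes the clock keeps converting
TSC counts to time with the stale (pre-jump) frequency for a whole tick.

```python
NS_PER_SEC = 10**9


def counts_per_tick(tsc_hz, tick_hz):
    """TSC counts that elapse during one clock tick at frequency tsc_hz."""
    return tsc_hz // tick_hz


def ns_until_expected_count(nominal_hz, actual_hz, tick_hz):
    """Real time (ns) until the TSC has advanced by the expected per-tick count."""
    return counts_per_tick(nominal_hz, tick_hz) * NS_PER_SEC // actual_hz


def uncorrected_error_ns(nominal_hz, actual_hz, tick_hz):
    """Clock error (ns) after one real tick if counts are still scaled by nominal_hz."""
    tick_ns = NS_PER_SEC // tick_hz
    reported_ns = counts_per_tick(actual_hz, tick_hz) * NS_PER_SEC // nominal_hz
    return reported_ns - tick_ns


if __name__ == "__main__":
    # Hypothetical inputs taken from the example in the text.
    nominal, actual, hz = 10**9, 2 * 10**9, 1000
    print(counts_per_tick(nominal, hz))                  # 1000000
    print(counts_per_tick(actual, hz))                   # 2000000
    print(ns_until_expected_count(nominal, actual, hz))  # 500000
    print(uncorrected_error_ns(nominal, actual, hz))     # 1000000
```

With these numbers the count reaches the expected 10^6 after only half a
msec of real time, and a tick left uncorrected would be accounted as
2 msec, i.e. a 1 msec error, which is why recalibrating half way still
leaves a half-msec error to slew away.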