Date: Sat, 12 Jan 2002 12:34:29 -0800
From: Peter Wemm <peter@wemm.org>
To: Bruce Evans <bde@zeta.org.au>
Cc: Terry Lambert <tlambert2@mindspring.com>, Alfred Perlstein <bright@mu.org>,
    Kelly Yancey <kbyanc@posi.net>, Nate Williams <nate@yogotech.com>,
    Daniel Eischen <eischen@pcnet1.pcnet.com>, Dan Eischen <eischen@vigrid.com>,
    Archie Cobbs <archie@dellroad.org>, arch@FreeBSD.ORG
Subject: Re: Request for review: getcontext, setcontext, etc
Message-ID: <20020112203429.EE98738CC@overcee.netplex.com.au>
In-Reply-To: <20020112205919.E5372-100000@gamplex.bde.org>
Bruce Evans wrote:
> On Sat, 12 Jan 2002, Terry Lambert wrote:
>
> > Bruce Evans wrote:
> > > (*) It may not be all that good.  It was good on old machines when 108
> > > bytes was a lot of memory and moving the state in and out of the FPU
> > > was slow too.  It is possible that the logic to avoid doing the switch
> > > takes longer than always doing it, but not all that likely, because logic
> > > speed is increasing faster than memory speed and new machines have more
> > > state to save (512 (?) bytes for SSE).
> >
> > Correct me if my math is wrong, but let's run with this...
> >
> > If I have a 2GHz CPU and 133MHz memory, then we are talking a 16:1
> > slowdown for a transfer of 512 bytes from register to L1 to L2
> > to main memory for an FPU state spill.
> >
> > Assuming a 64 bit data path, then we are talking a minimum of
> > 3 * 512/(64/8) * (16:1) or 3k (3072) clocks to save the damn FPU
> > state off to main memory (a store in a loop is 3 clocks, ignoring
> > the setup and crap, right?).  Add another 3k clocks to bring it
> > back.
> >
> > Best case, God loves us, and we spill and restore from L1
> > without an IPI or an invalidation, and without starting the
> > thread on a CPU other than the one where it was suspended, and
> > all spills are to cacheable write-through pages.  That's a 16
> > times speed increase because we get to ignore the bus speed
> > differential, or 3 * 512/(64/8) * 2 = (6k/16) = 384 clocks.
>
> This seems to be off by a bit.  Actual timing on an Athlon 1600
> overclocked a little gives the following times for some critical
> parts of context switching for each iteration of instructions in
> a loop (not counting 2 cycles of loop overhead):
>
>	pushal; popal:        9 cycles
>	pushl %ds; popl %ds:  21 cycles
>	fxsave; fxrstor:      105 cycles
>	fnsave; frstor:       264 cycles
>
> This certainly hits the L1 cache almost every time.  So the 512-byte L1
> case "only" takes 105 cycles, not 384, but the 108-byte L1 case takes
> much longer.
> fxsave/fxrstor is so fast that I don't quite believe the
> times -- it saves 16 times as much state as pushal/popal in less than
> 12 times as much time.

Well, fxsave/fxrstor were specifically designed so that this could all
be done with burst transfers.  fxsave/fxrstor are possibly doing 256-bit
wide transfers to/from the L1 cache.  Also don't forget that the fast
save/restore operations were designed with strict alignment requirements,
so that a whole bunch of checks that fnsave/frstor still have to deal
with can be skipped at runtime.

> > So it seems to me that it is *incredibly* expensive to do the
> > FPU save and restore, considering what *else* I could be doing
> > with those clock cycles.
>
> I agree that fnsave/frstor are still incredibly expensive if the
> above times are correct.  fxsave/fxrstor is only credibly expensive.
> However, the overheads for fnsave/frstor are small compared with
> the overheads for the !*#*$% segment registers.  We switch 3 segment
> registers explicitly and 2 implicitly on every switch to the kernel.
> According to the above, this has the same overhead as 1 fxsave/frstor.
> It gets done much more often than context switches.  I hoped to get
> rid of the 2 explicit segment register switches, but couldn't keep
> up with the forces of bloat that added a 3rd.  Now I don't notice
> this bloat unless I count cycles and forget that a billion of them
> is a lot :-).

Heh.  That reminds me, I need to talk over some IPI vector tweaks with
you.

I had forgotten that segment register operations were so bad.  Hmm.
What are they again?  I see %ds, %es and %fs.  I assume the two
implicit ones were %cs and %ss.  Which had you hoped to remove?  What
*is* %es used for anyway?

> > With an average instruction time of 6 clocks (erring on the
> > side of caution), the question is "can we perform the logic
> > for the avoidance in 64 or less instructions?"
> > I think the
> > answer is "yes", even if we throw in half a dozen uncached
> > memory references to main memory as part of the process and
> > take the 16:1 hit on each of them (that would be 96 clocks
> > in memory references, leaving us 288/6 = 38 instructions to
> > massage whatever we got back from those references).
>
> The Xdna trap to load the state, if we guessed wrong about the
> next timeslice not using the FPU, takes about 200 instructions,
> including several slow ones like iret, so we don't get near 38
> instructions in all cases, although we could (Xdna can be written
> in about 10 instructions if it doesn't go through trap() and
> other general routines).

Hmm, that is good to know too.

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5