Date: Thu, 31 Mar 2005 19:15:16 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: bde@freebsd.org
Subject: Re: Fwd: 5-STABLE kernel build with icc broken
Message-ID: <200504010315.j313FGLn056122@apollo.backplane.com>
References: <423C15C5.6040902@fsn.hu> <20050327133059.3d68a78c@Magellan.Leidinger.net> <20050327162839.2fafa6aa@Magellan.Leidinger.net> <5bbfe7d405032823232103d537@mail.gmail.com> <20050331155418.F20400@delplex.bde.org>

All I really did was implement a comment that DG had made many years ago in the PCB structure about making the FPU save area a pointer rather than hardwiring it into the PCB. This greatly reduces the complexity of the work required to allow the kernel to 'borrow' the FPU. It basically allows the kernel to 'stack' save contexts rather than swap out save contexts. The result is that the cross-over point for the copy size where the FPU becomes economical is a much lower value (~2K rather than ~4-8K).

The FPU overhead differences between DFly and FreeBSD for bcopy only matter for buffers between 2K and 16K in size. After that the copy itself overshadows the FPU setup overhead.

In DFly the kernel must still check to see whether userland has used the FPU and save the state before it reuses the FPU in the kernel. We don't bother to restore the state, we simply allow userland to take another fault (the idea being that if userland is making several I/O calls into the kernel in a batch, the FPU state is only saved once).

Once the kernel has done this and adjusted the FPU save area it can use the FPU at a whim, even through blocking conditions, and then just throw away the FPU context when it is done.

We could theoretically stack multiple kernel FPU contexts through this mechanism but I don't see much advantage to it so I don't... I have a lockout bit so if the kernel is already using the FPU and takes e.g. a preemptive interrupt, it doesn't go and use the FPU within that preemption.

The use of the XMM registers is a CPU optimization. Modern CPUs, especially AMD Athlons and Opterons, are more efficient with 128 bit moves than with 64 bit moves. I experimented with all sorts of configurations, including the use of special data caching instructions, but they had so many special cases and degenerate conditions that I found that simply using straight XMM instructions, reading as big a glob as possible, then writing the glob, was by far the best solution.
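
To make the "read a big glob, then write the glob" idea concrete, here is a rough userland sketch in C using SSE2 intrinsics. It is not the DFly kernel routine: the name xmm_block_copy is made up for illustration, it uses unaligned loads and stores so that it is safe on any buffer, it assumes the two buffers do not overlap, and it simply falls back to memcpy() when SSE2 is not available. A real kernel version would also have to go through the FPU save/lockout steps described above before touching the XMM registers.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical illustration only, not the DragonFly routine.
 * Copies len bytes from src to dst (buffers must not overlap).
 * Each pass loads 128 bytes into eight 128-bit values before
 * storing any of them, and the loop itself stores only to dst.
 */
static void
xmm_block_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE2__)
    while (len >= 128) {
        /* Read as big a glob as possible... */
        __m128i r0 = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i r1 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i r2 = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i r3 = _mm_loadu_si128((const __m128i *)(s + 48));
        __m128i r4 = _mm_loadu_si128((const __m128i *)(s + 64));
        __m128i r5 = _mm_loadu_si128((const __m128i *)(s + 80));
        __m128i r6 = _mm_loadu_si128((const __m128i *)(s + 96));
        __m128i r7 = _mm_loadu_si128((const __m128i *)(s + 112));

        /* ...then write the glob. */
        _mm_storeu_si128((__m128i *)(d + 0), r0);
        _mm_storeu_si128((__m128i *)(d + 16), r1);
        _mm_storeu_si128((__m128i *)(d + 32), r2);
        _mm_storeu_si128((__m128i *)(d + 48), r3);
        _mm_storeu_si128((__m128i *)(d + 64), r4);
        _mm_storeu_si128((__m128i *)(d + 80), r5);
        _mm_storeu_si128((__m128i *)(d + 96), r6);
        _mm_storeu_si128((__m128i *)(d + 112), r7);

        s += 128;
        d += 128;
        len -= 128;
    }
#endif
    /* Tail bytes, or the whole copy on builds without SSE2. */
    if (len > 0)
        memcpy(d, s, len);
}
```

Calling xmm_block_copy(dst, src, nbytes) on a multi-page buffer runs the whole copy in one loop. The compiler is free to reorder the loads and stores, so this only shows the shape of the loop and says nothing about how it would actually perform.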

The key for fast block copying is to not issue any memory writes other than those related directly to the data being copied. This avoids unnecessary RAS cycles which would otherwise kill copying performance. In tests I found that copying multi-page blocks in a single loop was far more efficient than copying data page-by-page, precisely because page-by-page copying was too complex to be able to avoid extraneous writes to memory unrelated to the target buffer in between each page copy.

-Matt