Date:      Fri, 25 Oct 2024 23:36:29 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        gnn <gnn@freebsd.org>, Arch <freebsd-arch@freebsd.org>
Subject:   Re: Building kernels with FPU support?
Message-ID:  <ZxwBTcNyqaKhx0ri@kib.kiev.ua>
In-Reply-To: <b0c5f4b4-46a2-4601-aac8-4775ae29b4f9@FreeBSD.org>
References:  <E37A7972-7FBE-4428-81E1-8DC6D0F67726@freebsd.org> <b0c5f4b4-46a2-4601-aac8-4775ae29b4f9@FreeBSD.org>

On Fri, Oct 25, 2024 at 03:54:32PM -0400, John Baldwin wrote:
> On 10/23/24 10:38, gnn wrote:
> > Howdy,
> > 
> > I am wondering if anyone has tried, lately, to see what effect building with FPU support has on overall system performance.  I've been working with a kernel module that needs this (for reasons I'll not go into now) and it occurred to me that the perceived performance overhead that caused us to only do fixed point in the kernel may no longer be significant.  I note that Linux has an option to build their kernel with FPU support.
> > 
> > And yes, I know that we have the ability to selectively deal with the FPU, from the calls outlined in Section 9 for fpu, but I'm asking the more general question of "does it matter?" and "if so, how much?"
> 
> To enable vector instructions "in general" in the kernel means that every trap would
> need to save the floating point state.  Basically, struct trapframe would need to save
> all the vector/FP register state in addition to GPRs.  You would also need to save/restore
> it when switching threads in the kernel.  In essence, the current per-pcb state we have
> now would stay, but would hold in-kernel state, and userspace state would end up in the
> trapframe from userspace.
> 
> This would probably be quite expensive.  Saving and restoring FPU state is not cheap
> and we would now be doing that on every entry/exit into the kernel (so extra overhead
> on system calls, faults, and interrupts).  It would also probably blow out kernel
> stack usage quite a bit.  The XSAVE region on modern x86 processors is already close
> to 2k and is only growing.  That would be a substantially larger trapframe and require
> larger kstacks as a result.

On amd64, td_pcb is part of struct thread, and the user FPU save area is
allocated from a zone.  But indeed, saving/restoring that even only on
syscalls and page faults (which are morally syscalls) is extremely expensive.
It is several kB of writes and then reads.  This would also increase the
cache footprint of each thread.
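
To put rough numbers on "several kB", here is a minimal userland C sketch.
It assumes GCC or Clang on x86 (for <cpuid.h>), and the cost model is a
deliberate simplification: one full save on kernel entry and one full restore
on exit.  The function names are made up for illustration and are not kernel
APIs.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Size of the legacy FXSAVE image (x87 + SSE state); fixed by the ISA. */
#define FXSAVE_AREA_BYTES 512u

/*
 * Bytes XSAVE needs for the features currently enabled in XCR0
 * (CPUID leaf 0xD, sub-leaf 0, EBX).  Returns 0 if the leaf is not
 * available or the code is not built for x86.
 */
static uint32_t
xsave_area_bytes(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
		return (ebx);
#endif
	return (0);
}

/*
 * Hypothetical cost model: the save area is written once on kernel
 * entry and read back once on exit, so each entry moves it twice.
 */
static uint64_t
fpu_bytes_per_entry(uint32_t save_area_bytes)
{
	return (2 * (uint64_t)save_area_bytes);
}
```

Under this model a 2 kB save area costs 4 kB of memory traffic per kernel
entry, against 1 kB for a plain FXSAVE image.  Calling
fpu_bytes_per_entry(xsave_area_bytes()) gives the figure for the machine at
hand.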

> 
> To mitigate the latter you could perhaps try to only use FP in the kernel "top-half"
> and not use it in bottom-half interrupt code.  I worry a bit about clearly demarking
> bottom-half code to still compile without FP, but as long as you disable FP access
> for nested faults you'd find any inconsistencies there rather quickly in the form
> of panics.

In practice you cannot save XMM without saving the x87 state.  Other parts of
the state can be skipped, but you need to know this in advance, because the
save has to occur on entry and the usage happens some time later.  If somebody
decides to only save 'if needed', then the code would be no different from
the current proposal.

> 
> Certainly it would be a fair bit of work to prototype to see what happens.  Some other
> things you could try are to only save a subset of register state for traps (e.g.
> just FXSAVE on x86 would mean you can use SSE and FP, but not AVX which might be
> enough for the many use cases in the kernel while not blowing out quite as much stack
> space).

Indeed, although the existing infrastructure is already sufficient that this
would not require too much code.

Still, I do not think this is useful, other than as an experiment, and then
it might be a kernel option.


