Date:      Sat, 20 Oct 2012 10:11:24 -0700
From:      John-Mark Gurney <jmg@funkthat.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: using SSE2 in kernel C code (improving AES-NI module)
Message-ID:  <20121020171124.GU1967@funkthat.com>
In-Reply-To: <20121020054847.GB35915@deviant.kiev.zoral.com.ua>
References:  <20121019233833.GS1967@funkthat.com> <20121020054847.GB35915@deviant.kiev.zoral.com.ua>

Konstantin Belousov wrote this message on Sat, Oct 20, 2012 at 08:48 +0300:
> On Fri, Oct 19, 2012 at 04:38:33PM -0700, John-Mark Gurney wrote:
> > So, the AES-NI module already uses SSE2 instructions, but it does so
> > only in assembly.  I have improved the performance of the AES-NI
> > module's implementation, but this involves me using additional SSE2
> > instructions.
> > 
> > In order to keep my sanity, I did part of the new code in C using
> > gcc native types and xmmintrin.h, but we do not support this header in
> > the kernel..  This means we cannot simply add the new code to the
> > kernel...
> > 
> > Any good ideas on how to integrate this code into the kernel build?

[...]

> 
> The current structure of the aes-ni driver is partly enforced by the
> issue you noted. We cannot use SSE intrinsics in the kernel, and
> huge inline assembler fragments are hard to write.
> 
> I prefer to have the separate .S files with the optimized code,
> hand-written. If needed, I can offer you help with the transition. I would
> need a full patch to rewrite the code.

Are you sure you want to do this?  It'll involve writing around 500
lines of assembly besides the constants...  And it isn't simple like
aesni_enc, where we have a single loop for the rounds...  I've
posted a tar.gz to overlay onto sys/crypto/aesni at:
https://www.funkthat.com/~jmg/aesni.repfile.tar.gz

It doesn't yet have the build infrastructure to build _wrap2.c into
assembly and build a kernel/module with it, hence my original email...

I'd prefer to keep the C file as it is MUCH easier to understand what
is happening...  It was also much easier to write and try different
optimization strategies...

A brief overview of the code...  It turns out that the AES instructions
have a throughput of 1 per clock, but a latency of 8 clocks on most
processors...  This means that if we pipeline the work by doing 8 ECB
blocks at once, we can significantly cut the clocks down...  The other
part is to reduce the time it takes to calculate the tweak factor...  I
unrolled this calculation 8 times so that we can keep the results in
registers to pass into the 8 block ECB function...  This last part does
make a difference...
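
As a rough illustration of the unrolled tweak calculation described
above, here is a minimal sketch in portable C.  It assumes the tweak is
the XTS one (multiply by x in GF(2^128), reduction constant 0x87), which
this message does not state explicitly.  The names struct tweak128,
tweak_next and tweak_fill8 are hypothetical and are not taken from the
tarball, and the sketch uses two 64-bit words where the real code would
presumably use SSE2 types...

```c
#include <stdint.h>

/*
 * Hypothetical 128-bit tweak, stored little-endian:
 * lo holds bits 0..63, hi holds bits 64..127.
 */
struct tweak128 {
	uint64_t lo;
	uint64_t hi;
};

/*
 * Assumed XTS-style update: shift the 128-bit value left by one bit
 * and, if a bit falls off the top, fold it back in by XORing 0x87
 * into the low byte.
 */
static struct tweak128
tweak_next(struct tweak128 t)
{
	struct tweak128 r;
	uint64_t carry = t.hi >> 63;

	r.hi = (t.hi << 1) | (t.lo >> 63);
	r.lo = (t.lo << 1) ^ (carry * 0x87);
	return (r);
}

/*
 * Unrolled 8 times: produce the tweaks for 8 consecutive blocks so
 * they can be handed to an 8-block routine in one go.  Returns the
 * tweak for the block that follows those 8.
 */
static struct tweak128
tweak_fill8(struct tweak128 t, struct tweak128 out[8])
{
	out[0] = t;
	out[1] = tweak_next(out[0]);
	out[2] = tweak_next(out[1]);
	out[3] = tweak_next(out[2]);
	out[4] = tweak_next(out[3]);
	out[5] = tweak_next(out[4]);
	out[6] = tweak_next(out[5]);
	out[7] = tweak_next(out[6]);
	return (tweak_next(out[7]));
}
```

A caller would invoke tweak_fill8() once per group of 8 blocks and feed
the returned value into the next call.  Whether the compiler actually
keeps all 8 results in registers depends on the target and on the types
used, so treat this only as a picture of the data flow...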

Thanks for taking a look at it...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."


