From: John-Mark Gurney
Date: Sat, 20 Oct 2012 10:11:24 -0700
To: Konstantin Belousov
Subject: Re: using SSE2 in kernel C code (improving AES-NI module)
Message-ID: <20121020171124.GU1967@funkthat.com>
References: <20121019233833.GS1967@funkthat.com> <20121020054847.GB35915@deviant.kiev.zoral.com.ua>
In-Reply-To: <20121020054847.GB35915@deviant.kiev.zoral.com.ua>
Cc: freebsd-arch@freebsd.org

Konstantin Belousov wrote this message on Sat, Oct 20, 2012 at 08:48 +0300:
> On Fri, Oct 19, 2012 at 04:38:33PM -0700, John-Mark Gurney wrote:
> > So, the AES-NI module already uses SSE2 instructions, but it does so
> > only in assembly.  I have improved the performance of the AES-NI
> > modules implementation, but this involves me using additional SSE2
> > instructions.
> >
> > In order to keep my sanity, I did part of the new code in C using
> > gcc native types and xmmintrin.h, but we do not support this header in
> > the kernel..  This means we cannot simply add the new code to the
> > kernel...
> >
> > Any good ideas on how to integrate this code into the kernel build?
[...]
>
> The current structure of the aes-ni driver is partly enforced by the
> issue you noted. We cannot use sse intristics in the kernel, and
> huge inline assembler fragments are hard to write.
>
> I prefer to have the separate .S files with the optimized code,
> hand-written. If needed, I offer you a help with transition. I would
> need a full patch to rewrite the code.

Are you sure you want to do this?  It'll involve writing around 500
lines of assembly, not counting the constants...  And it isn't as simple
as aesni_enc, where we have a single loop for the rounds...

I've posted a tar.gz to overlay onto sys/crypto/aesni at:
https://www.funkthat.com/~jmg/aesni.repfile.tar.gz

It doesn't yet have the build infrastructure to compile _wrap2.c into
assembly and build a kernel/module with it, hence my original email...

I'd prefer to keep the C file, as it is MUCH easier to understand what
is happening...  It was also much easier to write and to try different
optimization strategies...

A brief overview of the code...  It turns out that the AES instructions
have a throughput of 1 per clock, but a latency of 8 clocks on most
processors...  This means that if we pipeline the work and do 8 ECB
blocks at once, we can significantly cut down the number of clocks...

The other part is reducing the time it takes to calculate the tweak
factor...  I unrolled this calculation 8 times so that we can keep the
results in registers and pass them into the 8 block ECB function...
This last part does make a difference...

Thanks for taking a look at it...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."
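
The unrolled tweak calculation described above can be sketched in
portable C.  This is only an illustration, not the code from the
tarball: it assumes the tweak is the AES-XTS one (multiplication by x in
GF(2^128) with reduction constant 0x87), and the names struct tweak,
tweak_double and tweak_batch8 are hypothetical.  The real code would
presumably keep the eight values in SSE2 registers rather than in an
array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed: XTS reduction constant for x^128 + x^7 + x^2 + x + 1. */
#define XTS_ALPHA	0x87

/* A 128-bit tweak as two 64-bit halves; lo holds the low-order bits. */
struct tweak {
	uint64_t lo;
	uint64_t hi;
};

/*
 * Multiply the tweak by x in GF(2^128): shift left by one bit and,
 * if a bit fell off the top, fold it back in with XTS_ALPHA.
 */
static struct tweak
tweak_double(struct tweak t)
{
	struct tweak r;
	uint64_t carry = t.hi >> 63;

	r.hi = (t.hi << 1) | (t.lo >> 63);
	r.lo = (t.lo << 1) ^ (carry * XTS_ALPHA);
	return (r);
}

/*
 * Unrolled 8 times: out[0..7] receive the current tweak and the next
 * seven, ready to hand to an 8-block routine.  The return value is the
 * tweak that starts the following batch.
 */
static struct tweak
tweak_batch8(struct tweak t, struct tweak out[8])
{
	assert(out != NULL);

	out[0] = t; t = tweak_double(t);
	out[1] = t; t = tweak_double(t);
	out[2] = t; t = tweak_double(t);
	out[3] = t; t = tweak_double(t);
	out[4] = t; t = tweak_double(t);
	out[5] = t; t = tweak_double(t);
	out[6] = t; t = tweak_double(t);
	out[7] = t; t = tweak_double(t);
	return (t);
}
```

Computing all eight tweaks up front means the 8 block function never has
to wait on a tweak update between blocks, so the AES instructions for
the eight blocks can be issued back to back.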