Date: Wed, 10 Jan 2007 00:00:34 GMT
From: Mikhail Teterin <mi+kde@aldan.algebra.com>
To: freebsd-bugs@FreeBSD.org
Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2
Message-ID: <200701100000.l0A00Y9S097998@freefall.freebsd.org>

The following reply was made to PR bin/106734; it has been noted by GNATS.

From: Mikhail Teterin <mi+kde@aldan.algebra.com>
To: Julian Seward <jseward@acm.org>
Cc: bug-followup@freebsd.org
Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2
Date: Tue, 9 Jan 2007 18:34:36 -0500

On Sunday 07 January 2007 00:08, Julian Seward wrote:
= >     /* Load the bytes: */
= >     n1 = (__m128i)_mm_loadu_pd((double *)(block + i1));
= >     n2 = (__m128i)_mm_loadu_pd((double *)(block + i2));
= > read beyond the end of the defined area of block. block is
= > defined for [0 .. nblock + BZ_N_OVERSHOOT - 1], but I think
= > you are doing a SSE load at &block[nblock + BZ_N_OVERSHOOT - 2],
= > hence loading 15 bytes of garbage.

I don't think that's quite right... Instead of processing 8 bytes at a
time, as the non-SSE code does, I'm comparing 16 at a time. Thus it is
possible for me to be over by exactly 8 sometimes...

Anyway, the problem stemmed from my bumping i1 and i2 by 16 instead of 8
after the _initial check_ (which, in the quadrant-less case, should not
need to be separate at all, actually). Sometimes _that_ would bring them
over...

I think the solution is to either bump up BZ_N_OVERSHOOT even further or
check and adjust i1 and i2:

	if (i1 >= nblock) i1 -= nblock;
	if (i2 >= nblock) i2 -= nblock;

at the beginning, rather than the end, of the loop. Having done that, I no
longer peek beyond the end of the block (according to gdb's conditional
breakpoints, at least).

Please check the new

	http://aldan.algebra.com/~mi/bz/blocksort-SSE2-patch-2

Yours,

	-mi

P.S. The following gdb-script is what I used. Run as:

	gdb -x x.txt bzip2

x.txt:

	break blocksort.c:516
	cond 1 (i1 > nblock) || (i2 > nblock)
	run -9 < /tmp/PLIST > /dev/null

and adjust the compression level and the input's location, and be sure to
have blocksort.o compiled with debug information, of course...
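
The wrap-at-the-top adjustment described above can be sketched in plain C.
This is only an illustration under stated assumptions, not the code from the
patch: memcmp() stands in for the two 16-byte SSE2 loads, the names
rotation_gt, CHUNK and max_read are hypothetical, and the overshoot area is
assumed to mirror the first bytes of the block (block[nblock + k] ==
block[k]). Because i1 and i2 are wrapped before each load, every load starts
below nblock and ends no later than index nblock + CHUNK - 2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constant: width of one comparison step, mirroring a
 * 16-byte SSE2 load. */
#define CHUNK 16

/*
 * Compare the rotations of block that start at i1 and i2, CHUNK bytes
 * at a time. Assumes block holds nblock + CHUNK bytes and that
 * block[nblock + k] == block[k] for k < CHUNK (the overshoot area).
 * Returns 1 if the rotation at i1 sorts after the one at i2, else 0.
 * If max_read is not NULL, it records the highest index touched.
 */
static int rotation_gt(const unsigned char *block, size_t nblock,
                       size_t i1, size_t i2, size_t *max_read)
{
    size_t compared = 0;

    /* A single subtraction only suffices when nblock >= CHUNK. */
    assert(nblock >= CHUNK && i1 < nblock && i2 < nblock);

    while (compared < nblock) {
        /* Wrap at the top of the loop, so each load starts in [0, nblock). */
        if (i1 >= nblock) i1 -= nblock;
        if (i2 >= nblock) i2 -= nblock;

        if (max_read != NULL) {
            size_t hi = (i1 > i2 ? i1 : i2) + CHUNK - 1;
            if (hi > *max_read) *max_read = hi;
        }

        /* memcmp compares as unsigned char, like the byte-wise sort order. */
        int c = memcmp(block + i1, block + i2, CHUNK);
        if (c != 0)
            return c > 0;

        i1 += CHUNK;
        i2 += CHUNK;
        compared += CHUNK;
    }
    return 0; /* rotations are identical */
}
```

Passing a non-NULL max_read plays the same role as the conditional
breakpoint in the gdb script: it lets a caller confirm that no index at or
beyond nblock + CHUNK was ever read.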