Date: Mon, 22 Feb 2021 20:20:43 +0100 From: Mateusz Guzik <mjguzik@gmail.com> To: freebsd-arm@freebsd.org Subject: state of kernel core primitives in aarch64 Message-ID: <CAGudoHFtLjfWVynZFff%2BiNhNG=CChAwgi9ftEsd3=UtdWWX-6A@mail.gmail.com>
I took a quick look and it seems there is performance left on the table. Similar remarks probably apply to userspace.

First, some cleanup: bzero, bcmp and bcopy are all defined as builtins mapping to memset, memcmp and memmove, but the kernel still provides them. Arguably both bzero and bcmp can be a little faster than memset (by knowing upfront that the target is to be zeroed) and memcmp (by only having to detect a difference instead of computing what it is). If such optimizations are significant on arm, the builtins should be changed at least on that arch. As it happens, clang provides __builtin_bzero, which resorts to calling the relevant routine if necessary.

Regardless of the above, all of the routines seem to be slower than they need to be, at least when I compare them to the non-SIMD code in
https://github.com/ARM-software/optimized-routines/tree/master/string/aarch64

As a simple test I ran a loop calling access(2) on
/usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c, running on an ARM
Neoverse-N1 r3p1. This copies the string from userspace using copyinstr and
compares each path component (usr, obj and so on) using memcmp. According to
dtrace [1], both the copying and the comparing are at the top of the profile.

You can prod me on irc regarding hardware and benchmark code.

[1] dtrace seems to return a bogus result where sampling on instructions
reports the return address instead; the conclusion above was made with that
in mind.

-- 
Mateusz Guzik <mjguzik gmail.com>