From owner-svn-src-head@freebsd.org Tue May 8 01:37:44 2018 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B03C5FC5C75; Tue, 8 May 2018 01:37:44 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id 8966C6FFDD; Tue, 8 May 2018 01:37:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from [192.168.0.102] (c110-21-101-228.carlnfd1.nsw.optusnet.com.au [110.21.101.228]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 3DEE8D68CB1; Tue, 8 May 2018 11:37:31 +1000 (AEST) Date: Tue, 8 May 2018 11:37:30 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Mateusz Guzik , src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r333324 - in head/sys: amd64/amd64 conf In-Reply-To: <20180508072206.B888@besplex.bde.org> Message-ID: <20180508101158.J1350@besplex.bde.org> References: <201805071507.w47F7SOs035073@repo.freebsd.org> <20180508072206.B888@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=PalzARQSbocsUSjMRkwAPg==:117 a=PalzARQSbocsUSjMRkwAPg==:17 a=kj9zAlcOel0A:10 a=BVA95AcoyG47zic6ybEA:9 a=CjuIK1q_8ugA:10 X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 May 2018 01:37:44 -0000 On Tue, 8 May 2018, Bruce Evans wrote: > On Mon, 7 May 2018, Mateusz Guzik wrote: > >> Log: >> amd64: replace libkern's memset and memmove with assembly variants >> ... >> memset is repurposed memcpy. The librkern variant is doing fishy stuff, >> including branching on 0 and calling bzero. > > You mean the old libkern variant. It just duplicates the inline > implementation in a stupid way. Another style bug in this is that the > memset() (and also index() and rindex()) is implemented as an inline in > , when newer mem*() are implemented as macros in > . The inlines are harder to use and don't even support > the hack of a special case if the size is constant and small. > > libkern/memcmp.c is almost never used, and neither is your asm version, > so it shouldn't be optimized and your asm version shouldn't exist, and > the bugs in the asm version rarely matter. It is just a fallback for > cases where inlining doesn't work and for use in function pointers. I > don't see how the inlining can ever not work, but there is an ifdef > tangle that makes this non-obvious. I checked in an old i386 kernel that the extern memset() is not called even once during a makeworld. But several object files reference it. For a newer amd64 kernel, these files are: - huf_decompress.o - iflib.o - kern_event.o - memset.o - nfs_clvsops.o - ppp_msq.o - vga.o - zstd_opt.o (this uses a private static version) Bah, my test for this was broken because the naming error of having a private copy of memset broke debugging -- ddb put the breakpoint at the version in zstd_opt.o. Repeating the makeworld with breakpoints at both versions still showed no calls. Debugging this in the preprocessed sources for vga.c shows that memset() is never called there. However, bzero() is called a lot, and it expands to __builtin_memset() for the case that is supposed to be inlining (constant size <= 64), but the compiler (gcc) generates code that calls memset() anyway, and the extern memset() is needed for this fallback. But apparently, the inlining is not that bad, so the memset() is never actually called. Something is wrong, since the calls to memset() are all with size 384 Oops. The calls are actually to initialize local variables. Some code has the good pessimization of large local variables with auto-initiazation to 0 on every entry to the function. At least gcc apparently uses memset() to initialize such variables, although this is invalid in freestanding mode. So the extern memset() is not just a fallback for direct pessimizations. The limited size of the kernel stack makes this pessimization relatively rare. Further analysis of calls to memset() in a full amd64 kernel: - only 20 calls total - the sizes are variable(perhaps for a VLA), 0x10, 0x10, 0x10, 0x10, 0x2004, 0x4004, 0x4004, 0x2018, 0x60, 0xc0, 0x2a8, 0xf8, 0x100, 0x180, 0x180, 0x180, 0x180, 0x180, 0x180 (the last 6 all for vga.c where the pessimization is in the slow path). So almost no problems with large zero-initialized variables on the stack either. clang does excessive inlining, so of course it inlines most of these initializations. It actually inlines all except 6 (ones with size 0x2004, 0x2004, 0x4004, 0x4004, 0xf8, 0x100). For the size 0x180 in vga.c, clang generates only memcpy() where gcc generates memset() and then load-stores and/or memcpy(). Reducing the slowness of large auto-initialized local variables is an interesting problem. memcpy() from static data is simplest, but for sparse data memset() followed by a few loads and stores is probably better. I would have expected clang to do the fancier optimization with memset(). I use -Os to limit such optimizations (they are especially useless in the initialization code in vga.c) but forget to use it for clang in this test. So it is even more surprising that gcc apparently does the fancier optimization with memset(). memcpy() from static data would give smaller code though larger data, and I think -Os is mostly to optimized code for space. Bruce