Date: Sun, 10 Dec 2017 19:31:20 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Conrad Meyer
cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r326736 - head/usr.bin/wc
In-Reply-To: <201712092155.vB9LtJFY053385@repo.freebsd.org>
Message-ID: <20171210160344.X1124@besplex.bde.org>
References: <201712092155.vB9LtJFY053385@repo.freebsd.org>

On Sat, 9 Dec 2017, Conrad Meyer wrote:

> Log:
>   wc(1): Extend non-controversial optimizations to '-c' mode
>
>   wc(1)'s slow path for counting words or multibyte characters requires
>   conversion of the 8-bit input stream to wide characters.  However, a
>   faster path can be used for counting only lines ('-l' -- newlines have
>   the same representation in all supported encodings) or bytes ('-c').
>
>   The existing line count optimization was not used if the input was the
>   implicit stdin.  Additionally, it wasn't used if only byte counting was
>   requested.  This change expands the fast path to both of these scenarios.

This seems reasonable, but I don't like the complicated logic for handling
combinations of lines and bytes in the same loop.  Bytes alone just require
adding up the byte counts from read() (in the slow case where st_size
cannot be used).  This is obviously faster than using cat(1) (modulo block
size pessimizations), since cat has to do output.

PS: actually, it pessimizes the fastest case of wc -c on a regular file.
See below.  (The wc -l code is before the fastest case.  That was not a
very good organization, but it worked since wc -l can never use the fastest
case.)  Your change seems to work almost right for wc -c but not for wc -m.

On freefall, dd bs=1m goes at 32GB/sec.  (I think this depends on some vm
magic to reduce copies.  1m is large and asks for cache busting, but it
should only use 4K of L1 in the kernel and 1 or 2 times 1m in dd.
Apparently dd is avoiding the output.  My throughput utility, which does
avoid the output, is only 1GB/sec faster.)  wc -c is now mediocre.  It goes
at 5.5GB/sec for a pipe from dd bs=1m.  5-6 times slower is too many, but
it is better than cat, which is 11.5 times slower (3GB/sec).  1m is too
large a block size for me, and dd bs=64k is only slightly slower
(28GB/sec).  I blame pipes and slow syscalls for the slowness of dd bs=1m.
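For reference, the bytes-only slow case needs nothing more than summing the
read() return values.  A minimal sketch (not wc.c's actual code; the buffer
size and error handling here are only illustrative):

        #include <err.h>
        #include <fcntl.h>
        #include <stdint.h>
        #include <unistd.h>

        /* Count bytes when fstat()'s st_size cannot be used. */
        static uintmax_t
        count_bytes(const char *file)
        {
                static char buf[65536];        /* 64k, like wc's MAXBSIZE buffer */
                uintmax_t charct = 0;
                ssize_t len;
                int fd;

                if ((fd = open(file, O_RDONLY)) == -1)
                        err(1, "%s: open", file);
                while ((len = read(fd, buf, sizeof(buf))) > 0)
                        charct += len;        /* just sum the byte counts */
                if (len == -1)
                        err(1, "%s: read", file);
                (void)close(fd);
                return (charct);
        }

Everything else in this case is error checks and the siginfo handling
discussed below.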
> Expanding the buffer size from 64 kB helps reduce the number of read(2)
> calls needed, but exactly what impact that change has and what size to
> expand the buffer to are still under discussion.

It would be wrong.  I didn't see any discussion of this.

Other utilities have different bugs in this area.  The worst are all
utilities that use stdio.  Just using stdio asks for slowness, but fread()
and fwrite() are not too bad for efficiency provided stdio's block sizing
is not trusted and setvbuf() is used to override it.  stdio naively trusts
stat() to return a usable block size in st_blksize, but stat() rarely does
so.  (This doesn't matter much for buffered i/o, unless you want to get
near 32GB/sec.)

dd bs=4k goes at 10GB/sec on freefall.  This wastes some CPU, but it is
hard to produce i/o at even 1/10 of that speed.  Just don't go 8 times
slower than that by using a block size of 512.

One of the kernel bugs is using PAGE_SIZE as a default for st_blksize.  I
notice this often when I downgrade to versions of FreeBSD with this bug and
do silly things like wc on disks and not so silly things like cmp on disks.
I work around this using dd to change the block size.  This uses more CPU,
but not enough to matter for disks (except SSD ones).

Ugh, -current has broken st_blksize for pipes.  It is now hard-coded as
PAGE_SIZE.  In old versions, it was pipe->pipe_buffer_size.  Pipe code has
internal problems configuring the latter.  It uses bogusly ifdefed
constants PIPE_SIZE = 16K and BIG_PIPE_SIZE = 64K, and an obscure method to
select between these.

Only socket code handles st_blksize perfectly.  It sets st_blksize to
sb_hiwat.  The user has complete control over socket buffer sizes and
watermarks, and the defaults are better.

> Modified: head/usr.bin/wc/wc.c
> ==============================================================================
> --- head/usr.bin/wc/wc.c        Sat Dec  9 21:04:56 2017        (r326735)
> +++ head/usr.bin/wc/wc.c        Sat Dec  9 21:55:19 2017        (r326736)
> @@ -206,30 +206,30 @@ cnt(const char *file)
> ...
> +        else if ((fd = open(file, O_RDONLY, 0)) < 0) {
> +                xo_warn("%s: open", file);
> +                return (1);
> +        }
> +        if (doword || (domulti && MB_CUR_MAX != 1))
> +                goto word;

Checks of MB_CUR_MAX like this are supposed to make wc -m do the same as
wc -c when the multi-byte width is 1.  Is this really valid?  Even with
ASCII, is it reasonable to treat bytes with no encoding as not there at
all, or as an encoding error?  I don't know if mb conversion functions are
capable of, or allowed to, ignore bytes with no encoding, especially on
exotic machines with 32-bit chars (now allowed by POSIX).

This check seemed to work before for making wc -c as slow as wc -m for
(domulti && MB_CUR_MAX != 1) (just doing this check in inner loops is
pessimal).  I can't see why the above doesn't make wc -m as fast as wc -c
in this case.
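The dispatch that this check sets up is roughly the following (a sketch
only, not wc.c's code; the enum and function are made up, only the flag
names are wc.c's):

        #include <stdlib.h>        /* MB_CUR_MAX */

        enum path { SLOW_MB, FAST_LINES, FAST_BYTES };

        /*
         * Decide once per file which path to take, instead of testing
         * MB_CUR_MAX in inner loops.
         */
        static enum path
        choose_path(int doword, int domulti, int doline)
        {
                if (doword || (domulti && MB_CUR_MAX != 1))
                        return (SLOW_MB);        /* mbrtowc() conversion needed */
                if (doline)
                        return (FAST_LINES);     /* read() loop scanning for '\n' */
                /* only -c, or -m in a single-byte locale: bytes are characters */
                return (FAST_BYTES);             /* fstat() st_size, or sum read() */
        }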
> +        /*
> +         * Line counting is split out because it's a lot faster to get
> +         * lines than to get words, since the word count requires some
> +         * logic.
> +         */
> +        if (doline || dochar) {
> +                while ((len = read(fd, buf, MAXBSIZE))) {
> +                        if (len == -1) {
> +                                xo_warn("%s: read", file);
> +                                (void)close(fd);
> +                                return (1);
> +                        }
> +                        if (siginfo) {
> +                                show_cnt(file, linect, wordct, charct,
> +                                    llct);
> +                        }

Style bugs:
- excessive line splitting
- excessive parentheses

The second one was in the old code.  The first one was not in the old code;
the splitting is only needed now because the indentation is larger.

The siginfo checks in the inner loop of the slow case help make it slower.
Here there is only 1 per read, so the pessimization is small.

> +                        charct += len;
> +                        if (doline) {
>                                  for (p = buf; len--; ++p)
>                                          if (*p == '\n') {
>                                                  if (tmpll > llct)

If doing only chars, then this just adds up the byte counts, except for
hair like the error checks (needed) and siginfo tests (hair).

Old versions would have been much simpler using unsafe signal handlers.
That doesn't require adding siginfo checks in all i/o loops, but just
prints immediately and usually unsafely in the signal handler.  I think the
"safe" method doesn't even work for large i/o's or small blocked i/o's.
Unsafe siginfo prints in the middle of the i/o.  Most BSD programs can't
handle EINTR in i/o loops, so most siginfo setups preserve the default
restarting of syscalls, although not restarting is required to get back to
the i/o loops to print.

> +        /*
> +         * If all we need is the number of characters and it's a
> +         * regular file, just stat the puppy.
> +         */

The comment is still precious, and is now misformatted by not adjusting for
the new indentation.

> +        if (dochar || domulti) {
> +                if (fstat(fd, &sb)) {
> +                        xo_warn("%s: fstat", file);
> +                        (void)close(fd);
> +                        return (1);
> +                }
> +                if (S_ISREG(sb.st_mode)) {
> +                        reset_siginfo();
> +                        charct = sb.st_size;
>                          show_cnt(file, linect, wordct, charct, llct);
> +                        tcharct += charct;
>                          (void)close(fd);
>                          return (0);
> -        }

This naively trusts S_ISREG() to work, so it never worked on FreeBSD's
irregular "regular" files in procfs and perhaps other places.  Similar
checks come close to breaking cp and cmp on these files.  But cp and cmp do
dubious mmap()ing and differently bad fallbacks to a slow case.  mmap()
fails for most irregular regular files, so the slow case is reached.

wc could also use mmap() if it cared about efficiency, but this is hard to
do right.  No FreeBSD utilities understand buffer sizing even for read()
and write(), and mmap() is harder.  The slow case in wc is too slow for
better i/o to make much difference.  wc -c on a regular file doesn't do
i/o.  This leaves only wc -l as possibly benefiting from mmap().

cmp is too aggressive using mmap().  For regular files, it tries to map the
whole file, and falls back to a very slow method using getc() if mmap()
fails.  For non-regular files, it always uses the slow method.  This method
wastes a lot of CPU for getc() and wastes a lot of bandwidth for some file
types due to the st_blksize pessimizations.  Mapping large files busts
caches just to allow a simpler terminating condition, which saves a few
cycles per char.  Comparison uses a slow per-char method in all cases to
simplify printing differences for the usual case.

cp still uses the 25+ year old option VM_AND_BUFFER_CACHE_SYNCHRONIZED in
its Makefile, in case vm is still 25 years old.  Other utilities just
assume that mmap() works and that using it is actually an optimization.  cp
still uses the 25+ year old hard-coded limit of 8MB on mmap()ing.  That
used to be too large, but is now a bit small.  However, with more caching
everywhere it is hard to know the best size and method.

I don't like the mmap() method, but it works best for my usual case of lots
of small files in the buffer cache.  All methods are fast then.  For large
files only on disks, everything is slow and mmap() makes little difference
to the speed of the copy.  It might use less CPU but waste more memory.
For very large files, it is clearly worst to map whole files like cmp does.
This wastes memory in the application and gives vm the wrong hint that the
whole files will be used again soon.  cp's window of 8MB is not too bad.
Perhaps smaller is better for data that won't be used again soon.  Then we
only want the features of mmap() that allow it to reduce copying and
buffering.  cp sometimes knows when the data won't be used again soon
because it is copying large data (large files or many small files).
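For comparison, a windowed mmap() loop looks roughly like this (a sketch
only, not cp's or cmp's code; the 8MB window just copies cp's limit, and
the madvise() hint is a guess at what is wanted):

        #include <sys/mman.h>
        #include <sys/types.h>
        #include <err.h>

        #define WINDOW        (8 * 1024 * 1024)        /* cp's old limit */

        /*
         * Process a regular file through a sliding mmap() window instead
         * of mapping the whole file.  The scan itself is elided.
         */
        static void
        process_mapped(int fd, off_t size)
        {
                char *p;
                off_t off;
                size_t chunk;

                for (off = 0; off < size; off += chunk) {
                        chunk = (size - off > WINDOW) ? WINDOW : (size_t)(size - off);
                        p = mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, off);
                        if (p == MAP_FAILED)
                                err(1, "mmap");        /* real code would fall back to read() */
                        /* Hint: one sequential pass, no reuse of the pages. */
                        (void)posix_madvise(p, chunk, POSIX_MADV_SEQUENTIAL);
                        /* ... scan p[0..chunk-1] here, e.g. count '\n' ... */
                        (void)munmap(p, chunk);
                }
        }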
I debugged why wc -m doesn't work as documented, or as the code seems to
intend, for locales that don't support multibyte characters.  It is just
that -m turns off dochar.  This used to have no effect, since wc -c used
the slow case like wc -m.  Now the complicated logic is broken.  The
fastest case of wc -c (where it is supposed to just stat() the file) is
also broken by essentially the same error.  Now this case uses the new
second-fastest read() method for wc -c.  Bizarrely, wc -m still works as
intended by using the fastest case.

Code with broken logic:

X         if (doword || (domulti && MB_CUR_MAX != 1))
X                 goto word;

The point of this MB_CUR_MAX != 1 check is to ensure reaching the slow case
for multi-byte characters.  Otherwise, we would fall through to test for
the fastest case.  The test there (and others before there) depends on
testing MB_CUR_MAX here.

X         /*
X          * Line counting is split out because it's a lot faster to get
X          * lines than to get words, since the word count requires some
X          * logic.
X          */
X         if (doline || dochar) {

This used to only test doline (but handle dochar).  Testing dochar here
breaks the fastest case for dochar (but not for domulti) by handling dochar
here in a slower way.

X         [... rest of second-fastest case]
X         }

X         /*
X          * If all we need is the number of characters and it's a
X          * regular file, just stat the puppy.
X          */
X         if (dochar || domulti) {

This used to handle dochar and the width 1 case of domulti in the fastest
way.  It still works for domulti, but is unreachable for dochar, since
dochar was eaten above.

X         [... fastest case using fstat()]
X         }

To fix this, first move the fstat() case earlier.  The condition for using
this case can be written as (doline == 0).  This is clearer than testing
all the flags again, and much clearer than testing only half the flags
again as is done now.  Then, after handling the fastest case, don't test
any more, and reduce the indentation.  The test in the current code is the
same as a null test, except it has different bugs than a null test.  We
want to treat -m as -c, so the test should be of
(dochar || domulti || doline), but one of these flags is known to be set
since doword is known to be clear.  The current test just breaks the
fastest case for -c and the second-fastest case for -m, but avoids breaking
the fastest case for -m.  Actually, the broken cases for -m are reversed
when -l is used.  Then -m behaves the same as -c for width 1.

Perhaps some more reorganization can be used to reduce the indentation.
Gotos are already used to do not much more than reduce indentation.

Bruce