From: Bruce Evans <brde@optusnet.com.au>
Date: Tue, 10 Jan 2012 03:39:35 +1100 (EST)
To: Dieter BSD
Cc: freebsd-performance@freebsd.org
Subject: Re: cmp(1) has a bottleneck, but where?

On Thu, 5 Jan 2012, Dieter BSD wrote:

>> Something/somehow it's issuing smaller IOs when using mmap?
>
> On my box, 64K reads.  Using the '-' to avoid mmap it uses
> 128K.
>
> The big difference I found was that the default mmap case isn't
> using read-ahead.  So it has to wait on the disk every time.  :-(

The hard \xc2\xa0 certainly deserves a :-(.

This may indicate a general problem with read-ahead, but for cmp the
basic problem is that it is unsuited to comparing large or binary files
for equality.  Without -l or -x, it is broken as designed for this,
since it is required to keep track of lines.  With -l or -x, it doesn't
need to keep track of lines, but still does.  Its inner loop is:

%         for (byte = line = 1; length--; ++byte) {
%                 if ((ch = *p1) != *p2) {

This type of comparison is the slowest way to write memcmp() other than
ones that are intentionally slow.

%                         if (xflag) {
%                                 dfound = 1;
%                                 (void)printf("%08llx %02x %02x\n",
%                                     (long long)byte - 1, ch, *p2);
%                         } else if (lflag) {
%                                 dfound = 1;
%                                 (void)printf("%6lld %3o %3o\n",
%                                     (long long)byte, ch, *p2);
%                         } else

With -l or -x, it could just use memcmp() and blame it for any slowness.

%                                 diffmsg(file1, file2, byte, line);

Without -l or -x, it must print the line number here.

%                                 /* NOTREACHED */
%                 }
%                 if (ch == '\n')
%                         ++line;

It keeps track of the line number here.  This statement by itself
doesn't add much overhead, but looking at every byte to get to it does.

%                 if (++p1 == e1) {

A fairly slow way to advance the pointer and check for the end.
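For comparison, here is a minimal sketch of the chunk-level approach
hinted at above (this is not cmp's code; the function name, the
need_lines flag and the caller's chunking are illustrative assumptions):
compare each pair of chunks with memcmp(), fall back to a byte scan only
once a difference is known to exist, and count newlines with memchr()
instead of testing every byte in C.

    #include <string.h>

    /*
     * Hypothetical helper, not from cmp.c: compare one pair of chunks
     * that the caller has already mapped or read.  Returns the offset
     * of the first difference within the chunk, or (size_t)-1 if the
     * chunks are equal.  When need_lines is set (the no-flag case),
     * newlines are counted with memchr() up to the difference, or over
     * the whole chunk when there is none.
     */
    static size_t
    chunkcmp(const char *p1, const char *p2, size_t len, int need_lines,
        unsigned long long *line)
    {
            const char *nl, *end;
            size_t diff;

            if (memcmp(p1, p2, len) == 0)
                    diff = (size_t)-1;      /* chunks are identical */
            else
                    for (diff = 0; p1[diff] == p2[diff]; diff++)
                            continue;       /* locate the first difference */

            if (need_lines) {
                    end = p1 + (diff == (size_t)-1 ? len : diff);
                    for (nl = p1; (nl = memchr(nl, '\n', end - nl)) != NULL; nl++)
                            (*line)++;
            }
            return (diff);
    }

memcmp() and memchr() typically process a word or more per iteration,
so even the line-counting (no-flag) case avoids a per-byte
load/compare/branch written in C.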
%                         off1 += MMAP_CHUNK;
%                         if ((p1 = m1 = remmap(m1, fd1, off1)) == NULL) {
%                                 munmap(m2, MMAP_CHUNK);
%                                 err(ERR_EXIT, "remmap %s", file1);
%                         }
%                         e1 = m1 + MMAP_CHUNK;
%                 }
%                 if (++p2 == e2) {

Even more slowness.  The chunk size is the same for each file, or
should be, so remapping occurs at the same point for each.

%                         off2 += MMAP_CHUNK;
%                         if ((p2 = m2 = remmap(m2, fd2, off2)) == NULL) {
%                                 munmap(m1, MMAP_CHUNK);
%                                 err(ERR_EXIT, "remmap %s", file2);
%                         }
%                         e2 = m2 + MMAP_CHUNK;
%                 }
%         }

Looking at every character like this is a good pessimization technique.
The best example of this that I know of is wc(1).  On a 100MB file zz
in the buffer cache, on ref9-i386.freebsd.org:

    "wc zz zz"             1.75 seconds
    "cmp zz zz"            0.87
    "cmp - zz /dev/null"   0.10

Well, wc is only about twice as slow as cmp, and its main loop is not
obviously much worse than the above; it counts characters and words in
addition.  Somehow it is even slower than cmp with "-", although "-"
makes cmp use slow stdio APIs and a more similar i/o method (wc uses
read() with a block size of MAXBSIZE, while "cmp -" uses getc(), with
stdio doing all the buffering and choosing whatever block size it
wants).  In practice, stdio is still "smart" about block sizes: it uses
16K for the ffs file system where this was tested first, but 4K for
nfs.  These block sizes are determined as follows: stdio naively
believes that st_blksize is a good i/o size, and uses it unless it is
smaller than BUFSIZ.  BUFSIZ is stdio's historical mistake in this
area; it is still 1024.

With buffering, the block size doesn't matter much.  In fact, I've
noticed a block size of 512 sometimes working better for nfs, since it
allows better parallelism in simple read/write loops.  The small sizes
are only possibly better if they can still saturate the hardware part
of the i/o, but this is normal provided that reblocking hides the
small block sizes from the hardware.  The 16K-blocks used by "cmp -"
are somehow faster than the 64K-blocks used by wc.

> Using the '-' to avoid mmap it benefits from read-ahead, but the
> default of 8 isn't large enough.  Crank up vfs.read_max and it
> becomes cpu bound.  (assuming using 2 disks and not limited by
> both disks being on the same wimpy controller)
>
> A) Should the default vfs.read_max be increased?

Maybe, but I don't buy most claims that larger block sizes are better.
In -current, vfs.read_max is 64 on at least i386.  I think that is in
fs-blocks, so it is 1MB for the test ffs file system.  It is far too
small for file systems with small block sizes.  OTOH, it is far too
large for the people trying to boot FreeBSD on ~1MB memory sticks.  In
my version of FreeBSD-5, it is 256, but it is in KB.  In FreeBSD-9 on
sparc64, it is 8.  That is only 128K with 16K-blocks.  With 512-blocks,
it is a whole 4K.

> B) Can the mmap case be fixed?  What is the alleged benefit of
> using mmap anyway?  All I've ever seen are problems.

It is much faster for cases where the file is already in memory.  It is
unclear whether this case is common enough to matter; I guess it isn't.
One of my tests was the silly "cmp zz zz".  This is an obviously good
case for mmap(), but mmap() seemed to do even better on it than
expected.  I was trying to use a file larger than main memory, but not
much larger, because the tests would take longer and I was short of
disk space.
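To make the alleged benefit concrete, here is a minimal sketch of an
mmap()-based comparison (error handling, size checks and cmp's
MMAP_CHUNK windowing are omitted; whether the MADV_SEQUENTIAL hint
actually restores read-ahead for the mmap path on a given FreeBSD
version is an assumption to be verified, not something this thread
establishes):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Compare two equally-sized files; returns 0 if their contents match. */
    static int
    mmapcmp(const char *file1, const char *file2)
    {
            struct stat sb;
            char *m1, *m2;
            int fd1, fd2, r;

            fd1 = open(file1, O_RDONLY);
            fd2 = open(file2, O_RDONLY);
            fstat(fd1, &sb);        /* both files assumed sb.st_size bytes long */

            m1 = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_SHARED, fd1, 0);
            m2 = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_SHARED, fd2, 0);

            /* Hint sequential access, intended to encourage read-ahead. */
            (void)madvise(m1, (size_t)sb.st_size, MADV_SEQUENTIAL);
            (void)madvise(m2, (size_t)sb.st_size, MADV_SEQUENTIAL);

            /*
             * If the pages are already resident, this walks both mappings
             * with no read() calls and no copies into a user buffer.
             */
            r = memcmp(m1, m2, (size_t)sb.st_size);

            munmap(m1, (size_t)sb.st_size);
            munmap(m2, (size_t)sb.st_size);
            close(fd1);
            close(fd2);
            return (r);
    }

The real cmp maps MMAP_CHUNK-sized windows instead of whole files (so
it also works on files that don't fit in the address space); the
windowing itself is cheap, and the expensive part is the per-byte loop
inside each window.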
"cmp zz zz" seemed to do an especially good job of not thrashing the caches when the file was a little larger than main memory -- it seemed to keep the whole file cached, while "cmp - zz /dev/null" 0.10 "sh -c 'tar cf - zz | tar df -' 0.33 "time tar cf - zz | tar df -" 0.01 user 0.09 sys (first tar is as fast as cat= ) "tar cf - zz | time tar df -" 0.10 user 0.10 sys (second tar is twice as slow= ) 0.33 seconds is about what the cmp -x operation should take with read(): the i/o for each file goes at 1GB/S for cat, and we can't go much faster than that using read(). For different files, we have to read() twice. The pipeline between the tars is apparently very efficient (almost zero-copy?) so it doesn't take much longer. Then it is to be expected that the comparator takes at least as long as 1 read() (same number of memory accesses). But using memmap(), we should be able to go almost 3 times faster, by not doing any read()s, but only if everything is cached. When everything is on disk, we should be limited only by disk bandwidth, after reducing the CPU and possibly fixing the read-ahead. Further info: - the tars in the above use the silly historical size of 10K. For copying large trees, I use a shell script with a block size of 2048K for the reader and 1024K for the writer. And though I don't like bsdtar, I use it for the writer in this script since it preserves times better - the main memory speed on the test system is about 3.2GB/S (PC3200 memory overclocked). x86 memory has cache effects and asymmetries and so that this speed is rarely achieved for copying and even more rarely achieved for reading. - the best block size for reading from the buffer cache is 16K. This gives 1.66GB/S. Main memory can't go that fast, but the L2 cache can. With a block size of 64K, the speed drops to 1.51GB/S. tar's silly default of 10K gives 1.43GB/S. I'm not sure if the magic 16K is related to the L1 cache size or the filesystem block size. If it is the latter, then stdio's using st_blksize is right for not completely accidental reasons. - 1GB/S is still quite fast. You are lucky if disk i/o runs 1% as fast as that on average. But for large files it should be 5-10% as fast, and the memory comparison part should run about 100% as fast, so that cmp -x takes insignificant CPU like tar d does. The comparator in tar d is simply memcmp() (with a rather obscure loop around it to fit in with tar's generality). Bruce --0-1583360631-1326127175=:2530--