From owner-freebsd-performance@FreeBSD.ORG  Mon Jan  9 14:43:35 2012
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2DFC7106566B
	for <freebsd-performance@freebsd.org>;
	Mon,  9 Jan 2012 14:43:35 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au
	[211.29.132.183])
	by mx1.freebsd.org (Postfix) with ESMTP id A29A38FC16
	for <freebsd-performance@freebsd.org>;
	Mon,  9 Jan 2012 14:43:34 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q09EhSS2002916
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 10 Jan 2012 01:43:29 +1100
Date: Tue, 10 Jan 2012 01:43:28 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20120104000111.K6684@besplex.bde.org>
Message-ID: <20120110013455.D2530@besplex.bde.org>
References: <20120103073736.218240@gmx.com>
	<CAGH67wQXuMasyc9BE8M9fHsQv6d2zdRxDQ2ekX4whjHJFyqZyg@mail.gmail.com>
	<20120103083454.GA22673@zlo.nu> <20120104000111.K6684@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Marc Olzheim <marcolz@stack.nl>, Garrett Cooper <yanegomi@gmail.com>,
	freebsd-performance@freebsd.org, Dieter BSD <dieterbsd@engineer.com>
Subject: Re: cmp(1) has a bottleneck, but where?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Jan 2012 14:43:35 -0000

On Wed, 4 Jan 2012, Bruce Evans wrote:

> On Tue, 3 Jan 2012, Marc Olzheim wrote:
>
>> On Tue, Jan 03, 2012 at 12:21:10AM -0800, Garrett Cooper wrote:
>>>     The file is 3.0GB in size. Look at all those page faults though!
>>> Thanks!
>>> -Garrett
>> 
>> From usr.bin/cmp/c_regular.c:
>> 
>> #define MMAP_CHUNK (8*1024*1024)
>> ...
>> for (..) {
>> 	mmap() chunk of size MMAP_CHUNK.
>> 	compare
>> 	munmap()k
>> }
>> 
>> That 8 MB chunk size sounds like a bad plan to me. I can imagine
>> something needed to be done to compare files larger than X GB on a 32bit
>> system, but 8MB is pretty small...
>
> 8MB is more than large enough.  It works at disk speed in my tests.  cp
> still uses this value.  Old versions of cmp used the bogus value of
> ...
> In my tests, using "-" for one of the files mainly takes lots more user
> time.  It only reduces the real time by 25%.  This is on a core2.  On
> a system with a slow CPU, it is easy for getc() to be much slower than
> the disk.

More careful tests showed serious slowness when the combined file sizes
exceeded the cache size.  cmp takes an enormous amount of CPU (see another
reply), and this seems to be done mostly in series with i/o, so the total
time increases too much.  A smaller mmap() size or not using mmap() at
all might improve paralellism.

Bruce

From owner-freebsd-performance@FreeBSD.ORG  Mon Jan  9 16:39:40 2012
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D8CD3106564A
	for <freebsd-performance@freebsd.org>;
	Mon,  9 Jan 2012 16:39:40 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au
	[211.29.132.185])
	by mx1.freebsd.org (Postfix) with ESMTP id 60BE88FC13
	for <freebsd-performance@freebsd.org>;
	Mon,  9 Jan 2012 16:39:40 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q09GdZu0005748
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 10 Jan 2012 03:39:35 +1100
Date: Tue, 10 Jan 2012 03:39:35 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Dieter BSD <dieterbsd@engineer.com>
In-Reply-To: <20120105174626.218240@gmx.com>
Message-ID: <20120110014339.L2530@besplex.bde.org>
References: <20120105174626.218240@gmx.com>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-1583360631-1326127175=:2530"
Cc: freebsd-performance@freebsd.org
Subject: Re: cmp(1) has a bottleneck, but where?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Jan 2012 16:39:40 -0000

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--0-1583360631-1326127175=:2530
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Thu, 5 Jan 2012, Dieter BSD wrote:

>> Something/somehow it's issuing smaller IOs when using mmap?
>
> On my box, 64K reads. =C2=A0Using the '-' to avoid mmap it uses
> 128K.
>
> The big difference I found was that the default mmap case isn't
> using read-ahead. So it has to wait on the disk every time. =C2=A0:-(

The hard \xc2\xa0 certainly deserves a :-(.

This may indicate a general problem with read-ahead, but for cmp the
basic problem is that it is unsuited to comparing large or binary
files for equality.  Without -l or -x, it is broken as designed for
this, since it is required to keep track of lines.  Without -l or
-x, it doesn't need to keep track of lines, but still does.  Its
inner loop is:

% =09for (byte =3D line =3D 1; length--; ++byte) {
% =09=09if ((ch =3D *p1) !=3D *p2) {

This type of comparison is the slowest way to write memcmp() other than
ones that are intentionally slow.

% =09=09=09if (xflag) {
% =09=09=09=09dfound =3D 1;
% =09=09=09=09(void)printf("%08llx %02x %02x\n",
% =09=09=09=09    (long long)byte - 1, ch, *p2);
% =09=09=09} else if (lflag) {
% =09=09=09=09dfound =3D 1;
% =09=09=09=09(void)printf("%6lld %3o %3o\n",
% =09=09=09=09    (long long)byte, ch, *p2);
% =09=09=09} else

With -l or -x, it could just use memcmp() and blame it for any slowness.

% =09=09=09=09diffmsg(file1, file2, byte, line);

Without -l or -x, it must print the line number here.

% =09=09=09=09/* NOTREACHED */
% =09=09}
% =09=09if (ch =3D=3D '\n')
% =09=09=09++line;

It keeps track of the line number here.  This statement by itself doesn't
add much overhead, but looking at every byte to get to it does.

% =09=09if (++p1 =3D=3D e1) {

A fairly slow method to check advance the pointer and check for the end.

% =09=09=09off1 +=3D MMAP_CHUNK;
% =09=09=09if ((p1 =3D m1 =3D remmap(m1, fd1, off1)) =3D=3D NULL) {
% =09=09=09=09munmap(m2, MMAP_CHUNK);
% =09=09=09=09err(ERR_EXIT, "remmap %s", file1);
% =09=09=09}
% =09=09=09e1 =3D m1 + MMAP_CHUNK;
% =09=09}
% =09=09if (++p2 =3D=3D e2) {

Even more slowness.  The chunk size is the same for each file, or should
be, so that remapping occurs at the same point for each.

% =09=09=09off2 +=3D MMAP_CHUNK;
% =09=09=09if ((p2 =3D m2 =3D remmap(m2, fd2, off2)) =3D=3D NULL) {
% =09=09=09=09munmap(m1, MMAP_CHUNK);
% =09=09=09=09err(ERR_EXIT, "remmap %s", file2);
% =09=09=09}
% =09=09=09e2 =3D m2 + MMAP_CHUNK;
% =09=09}
% =09}

Looking at every character like this is a good pessimization technique.
The best example of this that I know of is wc(1).  On a 100MB file zz
in the buffer cache, on ref9-i386.freebsd.org:

     "wc zz zz"      takes 1.75 seconds
     "cmp zz zz"           0.87
     "cmp - zz <zz"        1.50
     "cat zz >/dev/null"   0.10

Well, wc is only about twice as slow as cmp.  Its main loop is not
obviously much worse than the above.  It counts characters and words
in addition.  Somehow it is even slower than cmp with "-" to make it
use slow stdio APIs and a more similar i/o method
   (wc uses read() with a block size of MAXBSIZE, while "cmp -" uses
   getc() with stdio doing all the buffering and whatever block size
   it wants
     (in practice, stdio is still "smart" about block sizes, so it
     uses one of 16K for the ffs file system where this was tested
     first, but it uses one of 4K for nfs.  These block sizes are
     determined as follows: stdio naively believes that st_blksize
     is a good i/o size for stdio, and uses it unless it is smaller
     than BUFSIZE.  BUFSIZE is stdio's historical mistake in this
     area.  It is still 1024).
   With buffering, the block size doesn't matter much.  In fact, I've
   noticed a block size of 512 sometimes working better for nfs since
   it allows better parallelism in simple read/write loops.  The small
   sizes are only possibly better if they can saturate the hardware
   part of the i/o, but this is normal provided that reblocking hides
   the small block sizes from the hardware.  The 16K-blocks used by
   "cmp -" are somehow faster than the 64K-blocks used by wc.

> Using the '-' to avoid mmap it benefits from read-ahead, but the
> default of 8 isn't large enough. =C2=A0Crank up vfs.read_max and it
> becomes cpu bound. =C2=A0(assuming using 2 disks and not limited by
> both disks being on the same wimpy controller)
>
> A) Should the default vfs.read_max be increased?

Maybe, but I don't buy most claims that larger block sizes are better.
In -current, vfs.read_max is 64 on at least i386.  I think that is in
fs-blocks, so it is 1MB for the test ffs file system.  It is far too
small for file systems with small block sizes.  OTOH, it is far too
large for the people trying to boot FreeBSD on ~1MB memory sticks.
In my version of FreeBSD-5, it is 256, but it is in KB.  In FreeBSD-9
on spar64, it is 8.  That is only 128K with 16K-blocks.  With 512-blocks,
it is a whole 4K.

> B) Can the mmap case be fixed? =C2=A0What is the aledged benefit of
> using mmap anyway? =C2=A0All I've even seen are problems.

It is much faster for cases where the file is already in memory.  It
is unclear whether this case is common enough to matter.  I guess it
isn't.

One of my tests was the silly "cmp zz zz".  This is an obviously good
case for mmap(), but mmap() seemed to do even better on it than expected.
I was trying to use file a larger than main memory, but not much larger
because the tests would take longer and I was short of disk space.
"cmp zz zz" seemed to do an especially good job of not thrashing the
caches when the file was a little larger than main memory -- it seemed
to keep the whole file cached, while "cmp - zz <zz" went to disk.  This
is hard to explain, since the second method only needs about 2*64K of
memory.

To compare lots of files, I find cmp unusable for other reasons.  I
sometimes use recursive diff, but this is slow too (slower?) and has
other problems, like following symlinks.  So I usually use tar d for
this.  tar d has a much better comparator so it is faster even for
large single files though it needs an extra process and a pipeline
which touches the data several times more.  The d option is broken
(nonexistent) in BSD tar so I don't use it (BSD tar that is).  On
another system with a non-broken tar: on the 100MB file:

     "wc zz zz"      takes 2.20 seconds
     "cmp zz zz"           0.78  (0.72 user 0.05 sys)
     "cmp - zz <zz"        1.20
     "cat zz >/dev/null"   0.10
     "sh -c 'tar cf - zz | tar df -'
                           0.33
     "time tar cf - zz | tar df -"
                           0.01 user 0.09 sys  (first tar is as fast as cat=
)
     "tar cf - zz | time tar df -"
                           0.10 user 0.10 sys  (second tar is twice as slow=
)

0.33 seconds is about what the cmp -x operation should take with read():
the i/o for each file goes at 1GB/S for cat, and we can't go much
faster than that using read().  For different files, we have to read()
twice.  The pipeline between the tars is apparently very efficient
(almost zero-copy?) so it doesn't take much longer.  Then it is to be
expected that the comparator takes at least as long as 1 read() (same
number of memory accesses).  But using memmap(), we should be able to
go almost 3 times faster, by not doing any read()s, but only if everything
is cached.  When everything is on disk, we should be limited only by
disk bandwidth, after reducing the CPU and possibly fixing the read-ahead.

Further info:
- the tars in the above use the silly historical size of 10K.  For copying
   large trees, I use a shell script with a block size of 2048K for the
   reader and 1024K for the writer.  And though I don't like bsdtar, I use
   it for the writer in this script since it preserves times better
- the main memory speed on the test system is about 3.2GB/S (PC3200 memory
   overclocked).  x86 memory has cache effects and asymmetries and so that
   this speed is rarely achieved for copying and even more rarely achieved
   for reading.
- the best block size for reading from the buffer cache is 16K.  This gives
   1.66GB/S.  Main memory can't go that fast, but the L2 cache can.  With
   a block size of 64K, the speed drops to 1.51GB/S.  tar's silly default
   of 10K gives 1.43GB/S.  I'm not sure if the magic 16K is related to the
   L1 cache size or the filesystem block size.  If it is the latter, then
   stdio's using st_blksize is right for not completely accidental reasons.
- 1GB/S is still quite fast.  You are lucky if disk i/o runs 1% as fast
   as that on average.  But for large files it should be 5-10% as fast,
   and the memory comparison part should run about 100% as fast, so that
   cmp -x takes insignificant CPU like tar d does.  The comparator in
   tar d is simply memcmp() (with a rather obscure loop around it to fit
   in with tar's generality).

Bruce
--0-1583360631-1326127175=:2530--

From owner-freebsd-performance@FreeBSD.ORG  Thu Jan 12 19:31:46 2012
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1A7A5106566C
	for <freebsd-performance@freebsd.org>;
	Thu, 12 Jan 2012 19:31:46 +0000 (UTC)
	(envelope-from dieterbsd@engineer.com)
Received: from mailout-us.mail.com (mailout-us.gmx.com [74.208.5.67])
	by mx1.freebsd.org (Postfix) with SMTP id B73BC8FC18
	for <freebsd-performance@freebsd.org>;
	Thu, 12 Jan 2012 19:31:45 +0000 (UTC)
Received: (qmail 25510 invoked by uid 0); 12 Jan 2012 19:31:44 -0000
Received: from 67.206.186.17 by rms-us004.v300.gmx.net with HTTP
Content-Type: text/plain; charset="utf-8"
Date: Thu, 12 Jan 2012 14:31:41 -0500
From: "Dieter BSD" <dieterbsd@engineer.com>
Message-ID: <20120112193142.218240@gmx.com>
MIME-Version: 1.0
To: freebsd-performance@freebsd.org
X-Authenticated: #74169980
X-Flags: 0001
X-Mailer: GMX.com Web Mailer
x-registered: 0
Content-Transfer-Encoding: 8bit
X-GMX-UID: SVB9byQ03zOlNR3dAHAh+ot+IGRvbwAL
Subject: Re: cmp(1) has a bottleneck, but where?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 12 Jan 2012 19:31:46 -0000

> The hard \xc2\xa0 certainly deserves a :-(.

Agreed. Brain damaged guity-until-proven-innocent anti-spam measures
force the use of webmail for outgoing email. Which amoung other problems
inserts garbage. Sorry.

>> A) Should the default vfs.read_max be increased?
>
> Maybe, but I don't buy most claims that larger block sizes are better.

I didn't say anything about block sizes. There needs to be enough
data in memory so that the CPU doesn't run out while the disk is
seeking.

>> B) Can the mmap case be fixed? What is the aledged benefit of
>> using mmap anyway? All I've even seen are problems.
>
> It is much faster for cases where the file is already in memory. It
> is unclear whether this case is common enough to matter. I guess it
> isn't.

Is there a reasonably efficient way to tell if a file is already
in memory or not? If not, then we have to guess.
If the file is larger than memory it cannot already be in memory.
For real world uses, there are 2 files, and not all memory can be
used for buffering files. So cmp could check the file sizes and
if larger than x% of main memory then assume not in memory.
There could be a command line argument specifying which method to
use, or providing a guess whether the files are in memory or not.

I wrote a prototype no-features cmp using read(2) and memcmp(3).
For large files it is faster than the base cmp and uses less cpu.
It is I/O bound rather than CPU bound.

So perhaps use memcmp when possible and decide between read and mmap
based on (something)?

Assuming the added performance justifies the added complexity?

From owner-freebsd-performance@FreeBSD.ORG  Sat Jan 14 11:34:58 2012
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
Delivered-To: freebsd-performance@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CB2321065670
	for <freebsd-performance@FreeBSD.org>;
	Sat, 14 Jan 2012 11:34:58 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au
	[211.29.132.183])
	by mx1.freebsd.org (Postfix) with ESMTP id 3DD628FC0A
	for <freebsd-performance@FreeBSD.org>;
	Sat, 14 Jan 2012 11:34:57 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q0EBYpTW011928
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 14 Jan 2012 22:34:52 +1100
Date: Sat, 14 Jan 2012 22:34:51 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Dieter BSD <dieterbsd@engineer.com>
In-Reply-To: <20120112193142.218240@gmx.com>
Message-ID: <20120114220001.R1458@besplex.bde.org>
References: <20120112193142.218240@gmx.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-performance@FreeBSD.org
Subject: Re: cmp(1) has a bottleneck, but where?
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 14 Jan 2012 11:34:59 -0000

On Thu, 12 Jan 2012, Dieter BSD wrote:

>>> A) Should the default vfs.read_max be increased?
>>
>> Maybe, but I don't buy most claims that larger block sizes are better.
>
> I didn't say anything about block sizes. There needs to be enough
> data in memory so that the CPU doesn't run out while the disk is
> seeking.

Oops.  I was thinking of read-ahead essentially extending the block
size.  It (or rather clustering) does exactly that for file systems
with small block sizes, provided the blocks are contiguous.  But
too much of it gives latency and resource wastage problems.  Reads
by other processes may be queued behind read-ahead that is never
used.

>>> B) Can the mmap case be fixed? What is the aledged benefit of
>>> using mmap anyway? All I've even seen are problems.
>>
>> It is much faster for cases where the file is already in memory. It
>> is unclear whether this case is common enough to matter. I guess it
>> isn't.
>
> Is there a reasonably efficient way to tell if a file is already
> in memory or not? If not, then we have to guess.

Not that I know of.  You would want to know how much of it is in
memory.

> If the file is larger than memory it cannot already be in memory.
> For real world uses, there are 2 files, and not all memory can be
> used for buffering files. So cmp could check the file sizes and
> if larger than x% of main memory then assume not in memory.
> There could be a command line argument specifying which method to
> use, or providing a guess whether the files are in memory or not.

I think the 8MB value does that well enough, especially now that
everyone has a GB or 16 of memory.

posix_fadvise() should probably be used for large files to tell the
system not to cache the data.  Its man page reminded me of the O_DIRECT
flag.  Certainly if the combined size exceeds the size of main memory,
O_DIRECT would be good (even for benchmarks that cmp the same files :-).
But cmp and cp are too old to use it.

> I wrote a prototype no-features cmp using read(2) and memcmp(3).
> For large files it is faster than the base cmp and uses less cpu.
> It is I/O bound rather than CPU bound.

What about using mmap() and memcmp()?  mmap() shouldn't be inherently
much worse than read().  I think it shouldn't and doesn't not read
ahead the whole mmap()ed size (8MB here), since that would be bad for
latency.  So it must page it in when it is accessed, and read ahead
for that.

there is another thread about how bad mmap() and sendfile() are with
zfs, because zfs is not merged with the buffer cache so using mmap()
with it wastes about a factor of 2 of memory; sendfile() uses mmap()
so using it with zfs is bad too.  Apparently no one uses cp or cmp
with zfs :-), or they would notice its slowness there too.

> So perhaps use memcmp when possible and decide between read and mmap
> based on (something)?
>
> Assuming the added performance justifies the added complexity?

I think memcmp() instead of byte comparision for cmp -lx is not very
complex.  More interesting is memcmp() for the general case.  For
small files (<= mmap()ed size), mmap() followed by memcmp(), then
go back to a byte comp to count the line number when memcmp() fails
seems good.  Going back is messier and slower for large files.  In
the worst case of files larger than memory with a difference at the
end, it involves reading everything twice, so it is twice as slow
if it is i/o bound.

Bruce