From owner-freebsd-current  Fri Aug  9 19:00:25 1996
Return-Path: owner-current
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id TAA24077
          for current-outgoing; Fri, 9 Aug 1996 19:00:25 -0700 (PDT)
Received: from austin.polstra.com (austin.polstra.com [206.213.73.10])
          by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id TAA24072
          for <current@FreeBSD.org>; Fri, 9 Aug 1996 19:00:21 -0700 (PDT)
Received: from austin.polstra.com (jdp@localhost) by austin.polstra.com (8.7.5/8.7.3) with ESMTP id SAA24389; Fri, 9 Aug 1996 18:59:59 -0700 (PDT)
Message-Id: <199608100159.SAA24389@austin.polstra.com>
To: "Boyd R. Faulkner" <faulkner@asgard.bga.com>
cc: current@FreeBSD.org
Subject: Re: Praise for CVSup 
In-reply-to: Your message of "Fri, 09 Aug 1996 20:12:05 -0459."
             <199608100111.UAA14116@utgard.bga.com> 
Date: Fri, 09 Aug 1996 18:59:59 -0700
From: John Polstra <jdp@polstra.com>
Sender: owner-current@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

> CVSup does more file checking than sup does.  You can end up with
> files with the right date and size but not the right contents and,
> while I may be wrong, sup will not detect this.  Since CVSup uses
> MD5 (yes?) to ID the files, you are guaranteed the correct contents.

Well ... yes and no.  It depends on the situation.  In general, CVSup
does _not_ ID the files via MD5 checksums.  It compares the time stamps
between the client and the server, and if they are identical, it assumes
that the files are identical too.  In that case, it doesn't examine the
files further.  (There is an exception which, I conjecture, applies to
your particular case.  I'll explain that in a minute.)

The reason it doesn't compare MD5 checksums for every file on the client
and the server is that it would be too slow, too compute intensive, and
too disk intensive.  No real-time network file update package could do
that without bringing the server to its knees.  It has to cull the
unchanged files from the list using just the information that it can get
from a call to stat().
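
To make the culling step concrete, here is a minimal sketch in Python.
It is an illustration only, not CVSup's code.  The function name
needs_examination and the assumption that the server's time stamp
reaches the client as a whole number of seconds are both made up for
the example.

```python
import os

def needs_examination(client_path, server_mtime):
    """Return True if the file must be looked at more closely.

    server_mtime is the modification time (whole seconds) reported by
    the server.  How it travels over the network is outside this sketch.
    """
    try:
        st = os.stat(client_path)
    except FileNotFoundError:
        return True  # the client has no copy at all
    # Identical time stamps: assume identical files and skip them.
    return int(st.st_mtime) != server_mtime
```

Only the files for which this returns True would need any further
work, so the common case costs one stat() call per file and no reads.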

The exception is when you are using CVSup's checkout mode for the very
first time.  In that case, CVSup cannot ID your existing checked-out
files via the time stamps, because the time stamps of the checked-out
files are not the same as the time stamps of the corresponding RCS
files on the server machine.  So it really has no choice.  On the
client, it checksums each file.  On the server, it parses each RCS
file, and checksums each revision on the selected branch, from most
recent to least recent.  This is the worst situation, in terms of
server load, but it's not as bad as it sounds.  First, it's computing
the checksums on the fly as it generates revisions -- not doing
some gross thing like calling "co" to emit them to temporary files.
So its main activity involves crunching through a memory-mapped
RCS file, computing the checksums as it goes.  Second, if the client
already has files, they're probably fairly recent.  So the server
won't have to checksum very many revisions before it finds the
right one.  Third, this situation only happens the first time a
given client uses CVSup in checkout mode.  After that, the so-called
"list files" remember which revisions the client possesses.
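
The first-time search described above can be sketched roughly as
follows.  This is a simplified Python illustration under stated
assumptions, not the real implementation: the revisions are assumed to
arrive as (revision number, contents) pairs from some generator that
produces them newest first, and the names md5_hex and
find_client_revision are hypothetical.  No RCS parsing is shown.

```python
import hashlib

def md5_hex(data):
    """MD5 digest of a bytes object, as a hex string."""
    return hashlib.md5(data).hexdigest()

def find_client_revision(client_bytes, revisions_newest_first):
    """Return the revision number whose contents match the client's file.

    revisions_newest_first yields (revision_number, content_bytes)
    pairs, most recent first.  Because it is consumed lazily, older
    revisions are never generated once a match has been found.
    """
    wanted = md5_hex(client_bytes)
    for rev, content in revisions_newest_first:
        if md5_hex(content) == wanted:
            return rev
    return None  # no revision on this branch matches the client's file
```

If the client's file is recent, the loop stops after one or two
revisions, which is why the server load is lower than it first sounds.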

The other place where MD5 checksums are used is to verify each file
that CVSup has updated by editing in new deltas and so forth.  That
was inspired by CTM, with a few gentle prods from Justin Gibbs,
and it has turned out to be a really good thing.  Besides instilling
confidence in CVSup, it permits it to be imperfect and incomplete
in the way it deals with RCS files.  I learned during the alpha
test period that there is an enormous variety of truly sick things
that people can and _will_ do to the RCS files in a CVS repository.
If CVSup had to be perfect in anticipating every one of them, well,
I wouldn't trust it myself.

But with the checksum verification, it doesn't even have to handle the
rarest kinds of changes properly at all.  When those kinds of changes
happen, it edits the file incorrectly, but finds out about it when it
verifies the checksums.  Then it says, "Checksum mismatch for foo --
will transfer entire file".  At that point it leaves the file untouched,
and it arranges to transfer the whole thing at the end of the run.  It
works well, and it helps keep people from getting mad at me.
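
The verify-then-fall-back behavior can be sketched like this in
Python.  It is a minimal illustration, not CVSup's code: the function
apply_update, the edit callback standing in for "editing in new
deltas", the refetch_queue list, and the ".new" temporary file name
are all assumptions made for the example.

```python
import hashlib
import os

def apply_update(path, edit, expected_md5, refetch_queue):
    """Edit a file, but keep the result only if its checksum is right.

    edit is a function from the old bytes to the new bytes.
    expected_md5 is the hex digest the server says the result must have.
    On a mismatch the original file is left untouched and the path is
    queued so the whole file can be transferred at the end of the run.
    """
    with open(path, "rb") as f:
        original = f.read()
    edited = edit(original)
    if hashlib.md5(edited).hexdigest() != expected_md5:
        print(f"Checksum mismatch for {path} -- will transfer entire file")
        refetch_queue.append(path)
        return False
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(edited)
    os.replace(tmp, path)  # swap in the verified result
    return True
```

The point of the design is that the editing step is allowed to be
wrong: a bad edit is caught by the checksum and costs only one full
file transfer.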

You may even see this happen the next time you run CVSup.  Today
I (needlessly, it turns out) changed the default RCS keyword
expansion on one of the repository files for the "net/cvsup" port.
That is one of the two or three kinds of very rare changes that
CVSup's RCS file analysis currently does not cover.  (The other
one that comes to mind is changes in the list of locked revisions.
Since CVS never locks its RCS files, it's not much of an issue.)

-- John