From owner-freebsd-questions@FreeBSD.ORG  Sun Nov  6 13:39:13 2005
Message-ID: <cb5206420511060539qe4d7c40i198e806950c60482@mail.gmail.com>
Date: Sun, 6 Nov 2005 16:39:12 +0300
From: "Andrew P." <infofarmer@gmail.com>
To: Kirk Strauser <kirk@strauser.com>
In-Reply-To: <200511060657.39674.kirk@strauser.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <200511040956.19087.kirk@strauser.com>
	<200511041129.17912.kirk@strauser.com>
	<cb5206420511041204y6a4120eq5198f4f1fd4426de@mail.gmail.com>
	<200511060657.39674.kirk@strauser.com>
Cc: freebsd-questions@freebsd.org
Subject: Re: Fast diff command for large files?

On 11/6/05, Kirk Strauser <kirk@strauser.com> wrote:
> On Friday 04 November 2005 02:04 pm, you wrote:
>
> > Does the overall order of lines change every time you dump the tables?
>
> No, although an arbitrary number of lines might get deleted.
>
> > If it does/can, then there's a trivial solution (a few lines in perl, or a
> > hundred lines in C) that'll make the speed roughly similar to that of I/O.
>
> Could you elaborate?  That's been bugging me all weekend.  I know I should
> know this, but I can't quite put my finger on it.
> --
> Kirk Strauser
>
>
>

while (there are more records in either file) {
 a = read (line from old file)
 b = read (line from new file)
 if (a == b) then next
 # a and b differ: handle each one on its own, so neither
 # gets dropped (skip a side whose file has hit EOF)
 if (a in new_records) {
  get a out of new_records
 } else {
  put a in old_records
 }
 if (b in old_records) {
  get b out of old_records
 } else {
  put b in new_records
 }
}

after that old_records will contain records present in old
file, but not in new file, and new_records will contain
records present in new file, but not old one.

Note that the difference must be kept in RAM, so it
won't work if there are multi-gig diffs. It will work
very fast if the diffs are only 10-100 MB, and at
close to I/O speed if the diff is under 10 MB.
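
For what it's worth, here is a rough Python sketch of the
same idea. The function name unordered_diff is made up for
illustration, and it assumes each record appears at most
once per file (it uses plain sets, so repeated lines are
not counted). Treat it as a sketch, not a tested tool.

```python
from itertools import zip_longest


def unordered_diff(old_lines, new_lines):
    """Return (only_in_old, only_in_new) as two sets of lines.

    Assumption: each line is unique within its own file.
    """
    old_records = set()  # seen in old, not (yet) seen in new
    new_records = set()  # seen in new, not (yet) seen in old

    # zip_longest pads the shorter input with None, which stands in for EOF.
    for a, b in zip_longest(old_lines, new_lines):
        if a == b:
            continue
        if a is not None:
            if a in new_records:
                new_records.discard(a)  # a showed up earlier in new
            else:
                old_records.add(a)
        if b is not None:
            if b in old_records:
                old_records.discard(b)  # b showed up earlier in old
            else:
                new_records.add(b)
    return old_records, new_records
```

Open file objects iterate line by line, so something like
unordered_diff(open("old.dump"), open("new.dump")) should
work without loading either file whole (the file names
here are placeholders). Only the two sets live in memory.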

If the records can be ordered in a known order (e.g.
alphabetically), then we don't need to keep anything
in RAM or do any lookups at all. Let's
assume an ascending order (1-2-5-7-31-...):

while (there are more records) {
 a = read (line from old file)
 b = read (line from new file)
 while (a <> b) {
  if (a < b) then {
   write a to old_records
   read next a
  }
  if (a > b) then {
   write b to new_records
   read next b
  }
 }
}

Of course, you've got to add some checks to
deal with EOF correctly.
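
A possible Python sketch of the sorted variant with the
EOF checks filled in. The name sorted_diff and the "-"/"+"
tags are my own choices for illustration. It assumes both
inputs are sorted ascending under the same comparison that
Python's < uses on the records (for text lines that means
string order, not numeric order).

```python
def sorted_diff(old_lines, new_lines):
    """Yield ("-", rec) for records only in old, ("+", rec) for only in new.

    Assumption: both inputs are sorted ascending in the same order.
    """
    old_iter, new_iter = iter(old_lines), iter(new_lines)
    a = next(old_iter, None)  # None marks EOF on that side
    b = next(new_iter, None)

    while a is not None or b is not None:
        if b is None or (a is not None and a < b):
            # a cannot appear later in new: it was deleted
            yield ("-", a)
            a = next(old_iter, None)
        elif a is None or b < a:
            # b cannot appear later in old: it was added
            yield ("+", b)
            b = next(new_iter, None)
        else:
            # equal records: advance both sides
            a = next(old_iter, None)
            b = next(new_iter, None)
```

Since it is a generator, nothing is accumulated in RAM;
each difference can be written out as soon as it is found.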

Hope this gives you some idea.