From owner-freebsd-questions@FreeBSD.ORG Sun Nov 6 13:39:13 2005
Date: Sun, 6 Nov 2005 16:39:12 +0300
From: "Andrew P."
To: Kirk Strauser
Cc: freebsd-questions@freebsd.org
In-Reply-To: <200511060657.39674.kirk@strauser.com>
References: <200511040956.19087.kirk@strauser.com> <200511041129.17912.kirk@strauser.com> <200511060657.39674.kirk@strauser.com>
Subject: Re: Fast diff command for large files?

On 11/6/05, Kirk Strauser wrote:
> On Friday 04 November 2005 02:04 pm, you wrote:
>
> > Does the overall order of lines change every time you dump the tables?
>
> No, although an arbitrary number of lines might get deleted.
>
> > If it does/can, then there's a trivial solution (a few lines in perl, or a
> > hundred lines in C) that'll make the speed roughly similar to that of I/O.
>
> Could you elaborate?  That's been bugging me all weekend.  I know I should
> know this, but I can't quite put my finger on it.
> --
> Kirk Strauser
>
>

while (there are more records) {
    a = read (line from old file)
    b = read (line from new file)
    if (a == b) then next
    if (a <> b) {
        if (a in new_records) {
            get a out of new_records
        } else {
            put a in old_records
        }
        if (b in old_records) {
            get b out of old_records
        } else {
            put b in new_records
        }
    }
}

After that, old_records will contain the records present in the old
file but not in the new one, and new_records will contain the records
present in the new file but not in the old one.

Note that the difference must be kept in RAM, so it won't work if
there are multi-gig diffs. It will work very fast if the diffs are
only 10-100 MB, and at close to I/O speed if the diff is under 10 MB.

If the records can be put in a known order (e.g. alphabetically), we
don't need to keep anything in RAM or do any membership checks at
all. Let's assume an ascending order (1-2-5-7-31-...):

while (there are more records) {
    a = read (line from old file)
    b = read (line from new file)
    while (a <> b) {
        if (a < b) then {
            write a to old_records
            read next a
        }
        if (a > b) then {
            write b to new_records
            read next b
        }
    }
}

Of course, you've got to add some checks to deal with EOF correctly.

Hope this gives you some idea.
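
For what it's worth, the two loops above could be sketched in Python roughly like this. It is only an illustration under a few assumptions, not a tested tool: the function names diff_unordered and diff_sorted are made up for this example, records are assumed to be unique within each input (as with dumps keyed on a primary key), and the sorted variant assumes both inputs are already in the same ascending order that Python's < uses. The first function treats a and b independently on a mismatch, and both functions include the EOF handling mentioned above.

```python
from itertools import zip_longest


def diff_unordered(old, new):
    """Read both inputs in lockstep; only unmatched records stay in memory.

    Assumes records are unique within each input (hypothetical constraint).
    Returns (old_only, new_only) as sets.
    """
    old_only, new_only = set(), set()
    # zip_longest pads the shorter input with None, which marks EOF here.
    for a, b in zip_longest(old, new):
        if a == b:
            continue
        if a is not None:
            if a in new_only:
                new_only.remove(a)   # a was seen earlier in the new input
            else:
                old_only.add(a)
        if b is not None:
            if b in old_only:
                old_only.remove(b)   # b was seen earlier in the old input
            else:
                new_only.add(b)
    return old_only, new_only


def diff_sorted(old, new):
    """Merge-style walk over two ascending inputs, using constant memory.

    Yields ("-", record) for records only in old and ("+", record) for
    records only in new.
    """
    old_it, new_it = iter(old), iter(new)
    a, b = next(old_it, None), next(new_it, None)
    while a is not None and b is not None:
        if a == b:
            a, b = next(old_it, None), next(new_it, None)
        elif a < b:
            yield "-", a
            a = next(old_it, None)
        else:
            yield "+", b
            b = next(new_it, None)
    # One input hit EOF: whatever remains in the other is unmatched.
    while a is not None:
        yield "-", a
        a = next(old_it, None)
    while b is not None:
        yield "+", b
        b = next(new_it, None)
```

Both functions accept any iterable of records, so two open file objects can be passed directly and the lines are compared as they are read. For the sorted variant on text dumps, the files would have to be sorted in plain codepoint order rather than a locale-specific one for the < comparison to agree with the file order.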