From: Kirk Strauser
To: freebsd-questions@freebsd.org
Date: Mon, 7 Nov 2005 09:48:22 -0600
Subject: Re: Fast diff command for large files?

On Sunday 06 November 2005 07:39, Andrew P. wrote:

> Note, that the difference must be kept in RAM, so it won't work if there
> are multi-gig diffs, but it will work very fast if the diffs are only
> 10-100Mb, it will work at close to I/O speed if the diff is under 10Mb.

Thanks, Andrew!  My Python script runs that algorithm in 17 seconds on a
400MB file with 10% CPU.

For anyone interested, here's my implementation.  Note that the readline()
method in Python always returns something, even at EOF (at which point you
get an empty string).  Also, empty strings evaluate as "false", which is
why the "if not (oldline or newline): break" code exits at the end.

    old_records = []
    new_records = []

    while 1:
        oldline, newline = oldfile.readline(), newfile.readline()
        if not (oldline or newline):
            break
        if oldline == newline:
            continue
        try:
            new_records.remove(oldline)
        except ValueError:
            if oldline:
                old_records.append(oldline)
        try:
            old_records.remove(newline)
        except ValueError:
            if newline:
                new_records.append(newline)

> Hope this gives you some idea.

It did.  It must've been a long work week, because that all seems so obvious
in retrospect but was completely opaque at the time.  Thanks again!

-- 
Kirk Strauser
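
For anyone who wants to try the loop without setting up two large files, here is a self-contained sketch of the same lockstep comparison. The function name diff_records and the in-memory sample data are assumptions added for illustration; they are not part of the original script. It also assumes both inputs are opened in text mode and that changed records sit reasonably close together, since list.remove() is a linear scan over whatever is currently pending.

```python
import io


def diff_records(oldfile, newfile):
    """Read two line-oriented files in lockstep.

    Returns (old_records, new_records): lines found only in oldfile and
    lines found only in newfile. Only unmatched lines are held in memory.
    """
    old_records = []
    new_records = []
    while True:
        # readline() returns '' at EOF, so a shorter file just yields ''.
        oldline, newline = oldfile.readline(), newfile.readline()
        if not (oldline or newline):
            break  # both files are exhausted
        if oldline == newline:
            continue  # identical at this position, nothing to record
        # If the old line was already seen as "new", the two cancel out.
        try:
            new_records.remove(oldline)
        except ValueError:
            if oldline:
                old_records.append(oldline)
        # Likewise for the new line against the pending "old" lines.
        try:
            old_records.remove(newline)
        except ValueError:
            if newline:
                new_records.append(newline)
    return old_records, new_records


# Hypothetical sample data: "b" was deleted and "e" was appended.
old = io.StringIO("a\nb\nc\nd\n")
new = io.StringIO("a\nc\nd\ne\n")
removed, added = diff_records(old, new)
print(removed, added)  # ['b\n'] ['e\n']
```

Note that a deletion shifts every later line by one position, yet the pending lists stay small because each shifted line cancels against its counterpart on the next iteration.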