Date: Sun, 6 Nov 2005 16:39:12 +0300 From: "Andrew P." <infofarmer@gmail.com> To: Kirk Strauser <kirk@strauser.com> Cc: freebsd-questions@freebsd.org Subject: Re: Fast diff command for large files? Message-ID: <cb5206420511060539qe4d7c40i198e806950c60482@mail.gmail.com> In-Reply-To: <200511060657.39674.kirk@strauser.com> References: <200511040956.19087.kirk@strauser.com> <200511041129.17912.kirk@strauser.com> <cb5206420511041204y6a4120eq5198f4f1fd4426de@mail.gmail.com> <200511060657.39674.kirk@strauser.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On 11/6/05, Kirk Strauser <kirk@strauser.com> wrote: > On Friday 04 November 2005 02:04 pm, you wrote: > > > Does the overall order of lines change every time you dump the tables? > > No, although an arbitrary number of lines might get deleted. > > > If it does/can, then there's a trivial solution (a few lines in perl, o= r a > > hundred lines in C) that'll make the speed roughly similar to that of I= /O. > > Could you elaborate? That's been bugging me all weekend. I know I shoul= d > know this, but I can't quite put my finger on it. > -- > Kirk Strauser > > > while (there are more records) { a =3D read (line from old file) b =3D read (line from new file) if (a =3D=3D b) then next if (a <> b) { if (a in new_records) { get a out of new_records next } if (b in old_records) { get b out of old_records next } put a in old_records put b in new_records } after that old_records will contain records present in old file, but not in new file, and new_records will contain records present in new file, but not old one. Note, that the difference must be kept in RAM, so it won't work if there are multi-gig diffs, but it will work very fast if the diffs are only 10-100Mb, it will work at close to I/O speed if the diff is under 10Mb. If the records can be ordered in a known order (e.g. alphabetically), we don't need to keep anything in RAM then and make any checks at all. Let's assume an ascending order (1-2-5-7-31-...): while (there are more records) { a =3D read (line from old file) b =3D read (line from new file) while (a <> b) { if (a < b) then { write a to old_records read next a } if (a > b) then { write b to new_records read next b } } } If course, you've got to add some checks to deal with EOF correctly. Hope this gives you some idea.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?cb5206420511060539qe4d7c40i198e806950c60482>