From: Waitman Gobble
To: freebsd-questions@freebsd.org
Date: Thu, 13 Sep 2012 22:45:42 -0700
Subject: Re: cksum entire dir??
In-Reply-To: <20120914033522.GA95427@neutralgood.org>
References: <20120913191844.1C07FBEA7@kev.msw.wpafb.af.mil> <20120914033522.GA95427@neutralgood.org>

On Thu, Sep 13, 2012 at 8:35 PM, Kevin P. Neal wrote:

> On Thu, Sep 13, 2012 at 03:18:43PM -0400, Karl Vogel wrote:
> > Here's a simple, system-independent way to find duplicate files.  All you
> > need is something to generate a digest you trust (MD5, SHA1, whatever) plus
> > normal Unix stuff: awk, expand, grep, join, sort, and uniq.
> >
> > Generate the signatures:
> >
> >   me% cd ~/bin
> >   me% find . -type f -print0 | xargs -0 md5 -r | sort > /tmp/sig1
> >
> >   me% cat /tmp/sig1
> >   0287839688bd660676582266685b05bd ./mkrcs
> >   0b97494883c76da546e3603d1b65e7b2 ./pwgen
> >   ddbed53e795724e4a6683e7b0987284c ./authlog
> >   ddbed53e795724e4a6683e7b0987284c ./cmdlog
> >   fdff1fd84d47f76dbd4954c607d66714 ./dbrun
> >   ff5e24efec5cf1e17cf32c58e9c4b317 ./tr0
> >
> > Find duplicate signatures:
> >
> >   me% awk '{print $1}' /tmp/sig1 | uniq -c | expand | grep -v "^ *1 "
> >   2 ddbed53e795724e4a6683e7b0987284c
>
>   you% awk '{print $1}' /tmp/sig1 | uniq -d
>
> But in both your and my code the uniq will frequently fail because the
> input is not sorted. The uniq command only works when the lines to compare
> are adjacent. So...
>
>   you% awk '{print $1}' /tmp/sig1 | sort | uniq -d
> --
> Kevin P. Neal                                http://www.pobox.com/~kpn/
>
> "I like being on The Daily Show." - Kermit the Frog, Feb 13 2001

Hi,

But what happens when, as in my 'md5 file' tinkering example above, there
are one or more identical files along the path which may or may not exist
in both hierarchies? For example, the BSD License file.

In my previous message I purposely made a 'testdir' and copied a file into
that dir... the two copies have the same hash.

Anyway, had I proceeded with the tinker example, I was thinking of using
sys/tree.h to build an associative array, with the md5 hash plus the
relative path and filename as the key. So the keys would look like

[8d3986a5e8747ae89b3c5f82f22bc402 ./find.c]
[8d3986a5e8747ae89b3c5f82f22bc402 ./testdir/find.c]

Then you'd have path A and path B to compare. I think you'd basically add 1
to a key's value when it is found in A and 2 when it is found in B, so a "1"
means the key is in A only, a "2" means it is in B only, and a "3" means it
is in both A and B.

[e406e4422cf29f3b42484596524b71c1 ./find] => 1 //A only
[e3ea95347aa5efd7030103536c23a8d3 ./find.1.gz] => 3 //OK
[4b1fd4eb69577f53bd97d8cd2159c8eb ./md5find] => 3 //OK
[03d161fcb84fb38aad6ccd8ce0cafeaf ./testdir] => 2 //B only

But again I have to say that mtree already does this very well...

Here's an example of using mtree to compare two paths, for Gary; hopefully
it's helpful.

Set up two things to compare, A and B:

# mkdir A B
# touch A/1 A/2 A/3 A/4
# find A
A
A/1
A/2
A/3
A/4
# rsync -av A B
sending incremental file list
A/
A/1
A/2
A/3
A/4

sent 236 bytes  received 92 bytes  656.00 bytes/sec
total size is 0  speedup is 0.00
# find B
B
B/A
B/A/1
B/A/2
B/A/3
B/A/4

Compare with mtree:

# mtree -K sha256digest,uname,gname -c -p A | mtree -p B/A

{no output = OK, they match; by default mtree only reports differences}

Now mess up B:

# rm B/A/3
# touch B/A/2
# touch B/A/extrabonusfile

Compare again:

# mtree -K sha256digest,uname,gname -c -p A | mtree -p B/A
. changed
        modification time expected Thu Sep 13 22:33:02 2012 found Thu Sep 13 22:43:46 2012
2 changed
        modification time expected Thu Sep 13 22:33:02 2012 found Thu Sep 13 22:38:01 2012
extrabonusfile extra
./3 missing

Waitman Gobble
San Jose California
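
P.S. For what it's worth, here is a rough shell sketch of the 1/2/3 tagging
idea above. It is not the sys/tree.h version; an awk array stands in for the
tree. The function names digest_tree and compare_trees are made up for this
example, and the fallback from md5sum to 'md5 -r' is my assumption about
which digest tool is installed. Filenames containing newlines are not
handled.

```sh
#!/bin/sh
# Sketch only: tag each "digest relative-path" key with
#   1 = found in tree A only, 2 = found in tree B only, 3 = found in both.

# Print one "digest ./relative/path" line per regular file under $1.
# Assumption: GNU md5sum or BSD 'md5 -r' is available; both print the
# digest first and the path second.
digest_tree() {
    if command -v md5sum >/dev/null 2>&1; then
        (cd "$1" && find . -type f -exec md5sum {} +)
    else
        (cd "$1" && find . -type f -exec md5 -r {} +)
    fi | sed 's/  */ /'    # md5sum separates with two spaces; squeeze to one
}

# Compare tree A ($1) with tree B ($2) and print "[key] => tag" lines.
compare_trees() {
    {
        digest_tree "$1" | sed 's/^/1 /'    # lines from A carry the value 1
        digest_tree "$2" | sed 's/^/2 /'    # lines from B carry the value 2
    } | awk '{
        tag = $1                 # 1 for tree A, 2 for tree B
        sub(/^[12] /, "")        # the rest of the line is the key
        seen[$0] += tag          # a key seen in both trees sums to 3
    }
    END {
        for (key in seen) print "[" key "] => " seen[key]
    }' | sort
}
```

With the directories from the mtree example, 'compare_trees A B/A' would
print one "[digest ./path] => N" line per key. Because the key includes the
path, a file copied into a subdirectory gets its own key, and a file whose
contents differ between A and B shows up twice, once tagged 1 and once
tagged 2. Unlike the mtree run, this ignores modification times and
ownership.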