Date:      Thu, 13 Sep 2012 22:45:42 -0700
From:      Waitman Gobble <gobble.wa@gmail.com>
To:        freebsd-questions@freebsd.org
Subject:   Re: cksum entire dir??
Message-ID:  <CAFuo_fzOe0RqLUN42yut-HLN4DVcsZTOAY%2Bu2KwV=jgjGzqGTg@mail.gmail.com>
In-Reply-To: <20120914033522.GA95427@neutralgood.org>
References:  <CAFuo_fyaToC_0NcvD6jobOK3qWm2D8CXzUn6Drxzr_tEkEL6dQ@mail.gmail.com> <20120913191844.1C07FBEA7@kev.msw.wpafb.af.mil> <20120914033522.GA95427@neutralgood.org>

On Thu, Sep 13, 2012 at 8:35 PM, <kpneal@pobox.com> wrote:

> On Thu, Sep 13, 2012 at 03:18:43PM -0400, Karl Vogel wrote:
> > Here's a simple, system-independent way to find duplicate files.  All you
> > need is something to generate a digest you trust (MD5, SHA1, whatever) plus
> > normal Unix stuff: awk, expand, grep, join, sort, and uniq.
> >
> > Generate the signatures:
> >
> >   me% cd ~/bin
> >   me% find . -type f -print0 | xargs -0 md5 -r | sort > /tmp/sig1
> >
> >   me% cat /tmp/sig1
> >   0287839688bd660676582266685b05bd ./mkrcs
> >   0b97494883c76da546e3603d1b65e7b2 ./pwgen
> >   ddbed53e795724e4a6683e7b0987284c ./authlog
> >   ddbed53e795724e4a6683e7b0987284c ./cmdlog
> >   fdff1fd84d47f76dbd4954c607d66714 ./dbrun
> >   ff5e24efec5cf1e17cf32c58e9c4b317 ./tr0
> >
> > Find duplicate signatures:
> >
> >   me% awk '{print $1}' /tmp/sig1 | uniq -c | expand | grep -v "^  *1 "
> >         2 ddbed53e795724e4a6683e7b0987284c
>
> you% awk '{print $1}' /tmp/sig1 | uniq -d
>
> But in both your and my code the uniq will frequently fail because the
> input is not sorted. The uniq command only works when the lines to compare
> are adjacent. So...
>
> you% awk '{print $1}' /tmp/sig1 | sort | uniq -d
> --
> Kevin P. Neal                                http://www.pobox.com/~kpn/
>
>    "I like being on The Daily Show." - Kermit the Frog, Feb 13 2001

Hi,

But what happens when, as in my 'md5 file' tinkering example earlier in the
thread, one or more identical files sit at different places along the path
and may or may not exist in both hierarchies? The BSD License file is a
typical case. In my previous message I purposely made a 'testdir' and copied
a file into it, and both copies have the same hash.

Anyway, had I continued with the tinkering example, I was thinking of using
sys/tree.h to build an associative array keyed on the md5 hash together with
the relative path and filename. So the keys would look like

[8d3986a5e8747ae89b3c5f82f22bc402 ./find.c]
[8d3986a5e8747ae89b3c5f82f22bc402 ./testdir/find.c]

Then you'd have path A and path B to compare. I think you basically add 1 to
a key's value when it is found in A and 2 when it is found in B, so "1" means
A only, "2" means B only, and "3" means both A and B.

[e406e4422cf29f3b42484596524b71c1 ./find] => 1 //A only
[e3ea95347aa5efd7030103536c23a8d3 ./find.1.gz] => 3 //OK
[4b1fd4eb69577f53bd97d8cd2159c8eb ./md5find] => 3 //OK
[03d161fcb84fb38aad6ccd8ce0cafeaf ./testdir] => 2 //B only
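
A rough shell sketch of that tally, using awk's associative arrays in place of
sys/tree.h. The function names (digest, list_tree, compare_trees) are made up
for illustration, and it assumes either md5 (FreeBSD) or md5sum is installed
and that no path contains a tab or newline.

```shell
#!/bin/sh
# Sketch only: tally which tree(s) each "hash ./relative/path" key appears in.
# digest, list_tree and compare_trees are illustrative names, not from the thread.

# Print the MD5 of one file with whichever tool is installed
# (md5 on FreeBSD, md5sum on most Linux systems).
digest() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"
    else
        md5sum < "$1" | awk '{print $1}'
    fi
}

# list_tree DIR FLAG: print "hash ./relpath<TAB>FLAG" for every regular file.
list_tree() {
    (cd "$1" && find . -type f | sort | while IFS= read -r f; do
        printf '%s %s\t%s\n' "$(digest "$f")" "$f" "$2"
    done)
}

# compare_trees A B: add 1 for keys seen in A and 2 for keys seen in B,
# so 1 = A only, 2 = B only, 3 = same content at the same path in both.
compare_trees() {
    { list_tree "$1" 1; list_tree "$2" 2; } |
    awk -F'\t' '{ seen[$1] += $2 }
                END { for (k in seen) print "[" k "] => " seen[k] }' |
    sort
}
```

Calling it as "compare_trees A B/A" would print one line per key. A file whose
content differs between the two trees shows up twice, once with 1 and once
with 2, because the hash is part of the key.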



But again I have to say that mtree already does this very well...

Here's an example of mtree for Gary to compare two paths, hopefully helpful.


set up two things to compare, A and B

# mkdir A B
# touch A/1 A/2 A/3 A/4
# find A
A
A/1
A/2
A/3
A/4

# rsync -av A B
sending incremental file list
A/
A/1
A/2
A/3
A/4

sent 236 bytes  received 92 bytes  656.00 bytes/sec
total size is 0  speedup is 0.00

# find B
B
B/A
B/A/1
B/A/2
B/A/3
B/A/4


compare with mtree

# mtree -K sha256digest,uname,gname -c -p A | mtree -p B/A

{no output = OK, they match; by default mtree reports only differences}


now mess up B

# rm B/A/3
# touch B/A/2
# touch B/A/extrabonusfile


compare again

# mtree -K sha256digest,uname,gname -c -p A | mtree -p B/A

. changed
    modification time expected Thu Sep 13 22:33:02 2012 found Thu Sep 13 22:43:46 2012
2 changed
    modification time expected Thu Sep 13 22:33:02 2012 found Thu Sep 13 22:38:01 2012
extrabonusfile extra
./3 missing
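
The modification-time lines in that output come from the touch, not from any
change in file content. If only content matters, one possible variation is to
build the spec with -k instead of -K, so that it carries just the type and
sha256digest keywords. This is a sketch, not something I have checked on every
mtree version; trees_match is a made-up helper name, and relying on the exit
status assumes the behavior described in the mtree man page (nonzero when the
hierarchy does not match the specification).

```shell
#!/bin/sh
# Sketch only: compare two trees by file type and SHA-256 content, ignoring
# times, owners and modes. trees_match is an illustrative name.
#
# -c  write a specification for the tree given by -p to stdout
# -k  use only "type" plus the listed keywords (here sha256digest)
# The second mtree reads that specification on stdin and checks DST against it.
trees_match() {
    mtree -c -k sha256digest -p "$1" | mtree -p "$2"
}
```

For the directories above it would be called as "trees_match A B/A". Any
differences are printed the same way as before, and the function's exit status
is that of the second mtree.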


Waitman Gobble
San Jose California


