Date:      Thu, 13 Sep 2012 15:18:43 -0400 (EDT)
From:      vogelke+freebsd@pobox.com (Karl Vogel)
To:        freebsd-questions@freebsd.org
Subject:   Re: cksum entire dir??
Message-ID:  <20120913191844.1C07FBEA7@kev.msw.wpafb.af.mil>
In-Reply-To: <CAFuo_fyaToC_0NcvD6jobOK3qWm2D8CXzUn6Drxzr_tEkEL6dQ@mail.gmail.com> (message from Waitman Gobble on Wed, 12 Sep 2012 22:52:22 -0700)

Here's a simple, system-independent way to find duplicate files.  All you
need is something to generate a digest you trust (MD5, SHA1, whatever) plus
normal Unix stuff: awk, expand, find, grep, join, sort, uniq, and xargs.

Generate the signatures:

  me% cd ~/bin
  me% find . -type f -print0 | xargs -0 md5 -r | sort > /tmp/sig1

  me% cat /tmp/sig1
  0287839688bd660676582266685b05bd ./mkrcs
  0b97494883c76da546e3603d1b65e7b2 ./pwgen
  ddbed53e795724e4a6683e7b0987284c ./authlog
  ddbed53e795724e4a6683e7b0987284c ./cmdlog
  fdff1fd84d47f76dbd4954c607d66714 ./dbrun
  ff5e24efec5cf1e17cf32c58e9c4b317 ./tr0
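
The "md5 -r" above is the BSD spelling; GNU systems usually ship "md5sum"
instead. Here is a minimal sketch of a wrapper that uses whichever one is
installed. The function name "sigs" is hypothetical, and it assumes one of
the two commands is on the PATH. md5sum puts two characters between the
digest and the name rather than one, but awk '{print $1}' and join both
treat a run of blanks as a single separator, so the later steps should not
care.

```shell
# sigs DIR: print one "digest path" line per regular file under DIR, sorted.
# Hypothetical helper; assumes either BSD md5 or GNU md5sum is available.
sigs() {
    (
        # Work in a subshell so the caller's directory is left alone.
        cd "$1" || exit 1
        if command -v md5 >/dev/null 2>&1; then
            # BSD: -r prints the digest first, then the file name.
            find . -type f -print0 | xargs -0 md5 -r
        else
            # GNU: md5sum already prints the digest first.
            find . -type f -print0 | xargs -0 md5sum
        fi | sort
    )
}

# Usage, mirroring the step above:
#   sigs ~/bin > /tmp/sig1
```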

Find duplicate signatures:

  me% awk '{print $1}' /tmp/sig1 | uniq -c | expand | grep -v "^  *1 "
        2 ddbed53e795724e4a6683e7b0987284c

  me% awk '{print $1}' /tmp/sig1 | uniq -c | expand | grep -v "^  *1 " |
      awk '{print $2}' > /tmp/sig2
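
"uniq -d" prints one copy of each repeated line, so the counting, expand,
and grep stages can be collapsed into a single step. A minimal sketch,
wrapped in a hypothetical "dup_sigs" function so it can be tried on any
sorted signature file:

```shell
# dup_sigs FILE: print each digest that occurs more than once in FILE.
# FILE must already be sorted by digest, as /tmp/sig1 is above,
# because uniq only compares adjacent lines.
dup_sigs() {
    awk '{print $1}' "$1" | uniq -d
}

# Usage, mirroring the step above:
#   dup_sigs /tmp/sig1 > /tmp/sig2
```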

Associate the duplicates with files:

  me% join /tmp/sig[12]
  ddbed53e795724e4a6683e7b0987284c ./authlog
  ddbed53e795724e4a6683e7b0987284c ./cmdlog

If your filenames contain whitespace, you can URL-encode them, play some
games with awk, or use perl.
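
One of the awk games might look like the sketch below: use only the first
field as the digest and carry the rest of the line along untouched, so the
file name is never split into fields and never passes through join. The
function name "dup_files" is hypothetical, and names containing newlines
are still not handled.

```shell
# dup_files FILE: print every "digest path" line whose digest appears
# more than once in FILE. Lines are printed whole, so spaces embedded
# in a path survive. Input order is preserved and need not be sorted.
dup_files() {
    awk '
        # Remember each line and its digest, and count the digests.
        { count[$1]++; lines[NR] = $0; key[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++)
                if (count[key[i]] > 1)
                    print lines[i]
        }
    ' "$1"
}

# Usage:
#   dup_files /tmp/sig1
```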

-- 
Karl Vogel                      I don't speak for the USAF or my company

This is really a lovely horse, I once rode her mother.
                                       --Ted Walsh, Horse Racing Commentator


