From: vogelke+freebsd@pobox.com (Karl Vogel)
To: freebsd-questions@freebsd.org
Subject: Re: cksum entire dir??
Date: Thu, 13 Sep 2012 15:18:43 -0400 (EDT)
In-reply-to: (message from Waitman Gobble on Wed, 12 Sep 2012 22:52:22 -0700)
Organization: Array Infotech
Message-Id: <20120913191844.1C07FBEA7@kev.msw.wpafb.af.mil>

Here's a simple, system-independent way to find duplicate files.  All you
need is something to generate a digest you trust (MD5, SHA1, whatever) plus
normal Unix stuff: awk, expand, grep, join, sort, and uniq.

Generate the signatures:

    me% cd ~/bin
    me% find . -type f -print0 | xargs -0 md5 -r | sort > /tmp/sig1

    me% cat /tmp/sig1
    0287839688bd660676582266685b05bd ./mkrcs
    0b97494883c76da546e3603d1b65e7b2 ./pwgen
    ddbed53e795724e4a6683e7b0987284c ./authlog
    ddbed53e795724e4a6683e7b0987284c ./cmdlog
    fdff1fd84d47f76dbd4954c607d66714 ./dbrun
    ff5e24efec5cf1e17cf32c58e9c4b317 ./tr0

Find duplicate signatures:

    me% awk '{print $1}' /tmp/sig1 | uniq -c | expand | grep -v "^ *1 "
       2 ddbed53e795724e4a6683e7b0987284c

    me% awk '{print $1}' /tmp/sig1 | uniq -c | expand | grep -v "^ *1 " |
          awk '{print $2}' > /tmp/sig2

Associate the duplicates with files:

    me% join /tmp/sig[12]
    ddbed53e795724e4a6683e7b0987284c ./authlog
    ddbed53e795724e4a6683e7b0987284c ./cmdlog

If your filenames contain whitespace, you can URL-encode them, play some
games with awk, or use perl.

-- 
Karl Vogel                      I don't speak for the USAF or my company

This is really a lovely horse, I once rode her mother.
                               --Ted Walsh, Horse Racing Commentator
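
P.S. For the whitespace case mentioned above, here is one way the "games
with awk" might look. This is only a rough sketch, not a tested recipe for
every system. The function name find_dupes is made up for illustration,
and the fallback to md5sum on systems without md5 is an assumption on my
part. It also assumes filenames contain no newlines or backslashes and do
not begin with a space. Instead of join, it treats everything after the
digest as the filename, so embedded spaces survive.

```shell
#!/bin/sh
# find_dupes DIR: print "digest<TAB>filename" for every file under DIR
# whose digest is shared with at least one other file.
# (find_dupes is a hypothetical helper name, not from the original post.)
find_dupes() {
    dir=${1:-.}

    # Assumption: use BSD "md5 -r" if present, otherwise GNU "md5sum".
    # Both print the digest first, followed by the filename.
    if command -v md5 >/dev/null 2>&1; then
        sum="md5 -r"
    else
        sum="md5sum"
    fi

    # $sum is left unquoted on purpose so "md5 -r" splits into two words.
    find "$dir" -type f -exec $sum {} + |
    LC_ALL=C sort |
    awk '
    {
        hash = $1
        name = $0
        sub(/^[^ ]+ +/, "", name)   # the rest of the line is the filename
        if (hash == prev) {
            # Print the first member of a group once, then each later one.
            if (!shown) print prev "\t" prevname
            print hash "\t" name
            shown = 1
        } else {
            shown = 0
        }
        prev = hash
        prevname = name
    }'
}
```

Something like "find_dupes ~/bin" should then list the same authlog/cmdlog
pair as the join step, one file per line, with a tab between the digest
and the name.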