From: Sysadmin Lists
Date: Tue, 16 May 2023 00:26:07 +0200 (CEST)
To: questions@freebsd.org
Subject: Re: Tool to compare directories and delete duplicate files from one directory
Message-ID: <941908372.622746.1684189567246@ichabod.co-bxl>
In-Reply-To: <818813a2-8ab0-df54-3c59-0e1ba9ce743d@holgerdanske.com>
List-Archive: https://lists.freebsd.org/archives/freebsd-questions

> ----------------------------------------
> From: David Christensen
> Date: May 15, 2023, 1:43:38 AM
> To:
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> It looks like your script only finds duplicates when the subpath is
> identical (?):
>

Yeah. Wasn't that the original problem description? I went off the
example given by Paul earlier in this thread, and it looked like only
files with matching subpaths were being considered (because the OP
accidentally rsync'd files from a source to a bunch of destination
dirs). If we're simply looking for files that have the same name
anywhere in the set of dirs, and then comparing their sizes to decide
whether they're assumed (!) duplicates or differ in size, that's way
easier to program.

As a side note on performance, I ran the program on a set of 8 dirs
containing over 750,000 files and 300G of data. Here are the results:

real    0m10.791s
user    0m5.361s
sys     0m5.928s

And here are the results for counting the files in the dirs using `wc':

real    0m12.464s
user    0m0.834s
sys     0m11.671s

That means the program processed the list of files quicker than `wc'
could count them, which is wild. Obviously, as the number of apparent
duplicates increases, the number of `stat' calls increases, and the
run-time will, too. But this shows how efficient awk is at comparing
strings.

> 2023-05-15 01:38:20 dpchrist@vf1 /vf1zpool1/dpchrist
> $ cp -Ra foo bar
>
> 2023-05-15 01:39:18 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> duplicates: bar/1/i-j foo/1/i-j
> duplicates: bar/1/2/e foo/1/2/e
> duplicates: bar/1/a-b foo/1/a-b
> duplicates: bar/1/g foo/1/g
> duplicates: bar/1/2/i foo/1/2/i
> duplicates: bar/q-r foo/q-r
> duplicates: bar/m-n foo/m-n
> duplicates: bar/1/2/m foo/1/2/m
> duplicates: bar/c foo/c
> duplicates: bar/e-f foo/e-f
> duplicates: bar/1/s foo/1/s
> duplicates: bar/k foo/k
> duplicates: bar/o foo/o
> duplicates: bar/q foo/q
> duplicates: bar/1/c-d foo/1/c-d
> duplicates: bar/1/2/s-t foo/1/2/s-t
> duplicates: bar/1/2/o-p foo/1/2/o-p
> duplicates: bar/1/2/k-l foo/1/2/k-l
> duplicates: bar/g-h foo/g-h
>
> 2023-05-15 01:39:41 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
>       26      24      82
>
> 2023-05-15 01:39:44 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 bar | wc
>       26      24      82
>
> 2023-05-15 01:40:10 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> duplicates: bar/1/i-j foo/1/i-j
> duplicates: bar/1/2/e foo/1/2/e
> duplicates: bar/1/a-b foo/1/a-b
> duplicates: bar/1/g foo/1/g
> duplicates: bar/1/2/i foo/1/2/i
> duplicates: bar/q-r foo/q-r
> duplicates: bar/m-n foo/m-n
> duplicates: bar/1/2/m foo/1/2/m
> duplicates: bar/c foo/c
> duplicates: bar/e-f foo/e-f
> duplicates: bar/1/s foo/1/s
> duplicates: bar/k foo/k
> duplicates: bar/o foo/o
> duplicates: bar/q foo/q
> duplicates: bar/1/c-d foo/1/c-d
> duplicates: bar/1/2/s-t foo/1/2/s-t
> duplicates: bar/1/2/o-p foo/1/2/o-p
> duplicates: bar/1/2/k-l foo/1/2/k-l
> duplicates: bar/g-h foo/g-h
>
> 2023-05-15 01:40:22 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
>       26      24      82
>
> 2023-05-15 01:40:29 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 bar | wc
>       26      24      82
>
> 2023-05-15 01:40:34 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> remove bar/1/2/a? n
> duplicates: bar/1/i-j foo/1/i-j
> remove bar/1/i-j? n
> duplicates: bar/1/2/e foo/1/2/e
> remove bar/1/2/e? n
> duplicates: bar/1/a-b foo/1/a-b
> remove bar/1/a-b? n
> duplicates: bar/1/g foo/1/g
> remove bar/1/g? n
> duplicates: bar/1/2/i foo/1/2/i
> remove bar/1/2/i? n
> duplicates: bar/q-r foo/q-r
> remove bar/q-r? n
> duplicates: bar/m-n foo/m-n
> remove bar/m-n? n
> duplicates: bar/1/2/m foo/1/2/m
> remove bar/1/2/m? n
> duplicates: bar/c foo/c
> remove bar/c? n
> duplicates: bar/e-f foo/e-f
> remove bar/e-f? n
> duplicates: bar/1/s foo/1/s
> remove bar/1/s? n
> duplicates: bar/k foo/k
> remove bar/k? n
> duplicates: bar/o foo/o
> remove bar/o? n
> duplicates: bar/q foo/q
> remove bar/q? n
> duplicates: bar/1/c-d foo/1/c-d
> remove bar/1/c-d? n
> duplicates: bar/1/2/s-t foo/1/2/s-t
> remove bar/1/2/s-t? n
> duplicates: bar/1/2/o-p foo/1/2/o-p
> remove bar/1/2/o-p? n
> duplicates: bar/1/2/k-l foo/1/2/k-l
> remove bar/1/2/k-l? n
> duplicates: bar/g-h foo/g-h
> remove bar/g-h? n
>
>
> David
>

Thanks for running that test. It's working as designed. However, it
doesn't check if the apparent duplicate is literally the same file
(same inode) encountered through an overlapping directory, or a
hard-link. This one does (although it might be a moot point if I
misunderstood the original problem).

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
if [ "X$1" = "X-n" ]; then n=1; shift; fi

echo "Building files list from: ${@}"

find "${@}" -xdev -type f |
awk -d1 -v n=$n 'BEGIN { cmd = "stat -f \"%i %z\" "
     for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
     { files[$0] = match($0, "(" args ")/?") + RLENGTH }
END  { for (i in ARGV) sub("/*$", "/", ARGV[i])
       print "Comparing files ..."
       for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
            for (j = i +1; j < x; j++)
                 if (ARGV[j] substr(file, files[file]) in files) {
                      dup = ARGV[j] substr(file, files[file])
                      cmd "\"" file "\"" | getline; close(cmd "\"" file "\"")
                      fil_i = $1; fil_s = $2
                      cmd "\"" dup "\"" | getline; close(cmd "\"" dup "\"")
                      dup_i = $1; dup_s = $2
                      if (fil_i == dup_i) continue
                      if (fil_s == dup_s) { act("dup") } else act("diff")
                 }
            delete files[file]
       }
}

function act(message) {
     print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
     if (!n) system("rm -vi \"" dup "\"