Date: Sat, 20 May 2023 23:59:41 +0200 (CEST)
From: Sysadmin Lists <sysadmin.lists@mailfence.com>
To: questions@freebsd.org
Subject: Re: Tool to compare directories and delete duplicate files from one directory
Message-ID: <1554097298.1191790.1684619981686@ichabod.co-bxl>
In-Reply-To: <2055648982.2509909.1684516773170@fidget.co-bxl>
References: <9887a438-95e7-87cc-a162-4ad7a70d744f@optiplex-networks.com>
 <344b29c6-3d69-543d-678d-c2433dbf7152@optiplex-networks.com>
 <CAFbbPuiNqYLLg8wcg8S_3=y46osb06%2BduHqY9f0n=OuRgGVY=w@mail.gmail.com>
 <ef0328b0-caab-b6a2-5b33-1ab069a07f80@optiplex-networks.com>
 <CAFbbPujUALOS%2BsUxsp=54vxVAHe_jkvi3d-CksK78c7rxAVoNg@mail.gmail.com>
 <7747f587-f33e-f39c-ac97-fe4fe19e0b76@optiplex-networks.com>
 <CAFbbPuhoMOM=wp26yZ42e9xnRP%2BtJ6B30y8%2BBa3eCBh2v66Grw@mail.gmail.com>
 <fd9aa7d3-f6a7-2274-f970-d4421d187855@optiplex-networks.com>
 <CAFbbPujpPPrm-axMC9S5OnOiYn2oPuQbkRjnQY4tp=5L7TiVSg@mail.gmail.com>
 <eda13374-48c1-1749-3a73-530370934eff@optiplex-networks.com>
 <CAFbbPujbyPHm2GO%2BFnR0G8rnsmpA3AxY2NzYOAAXetApiF8HVg@mail.gmail.com>
 <b4ac4aea-a051-fbfe-f860-cd7836e5a1bb@optiplex-networks.com>
 <7c2429c5-55d0-1649-a442-ce543f2d46c2@holgerdanske.com>
 <6a0aba81-485a-8985-d20d-6da58e9b5580@optiplex-networks.com>
 <347612746.1721811.1683912265841@fidget.co-bxl>
 <08804029-03de-e856-568b-74494dfc81cf@holgerdanske.com>
 <126434505.494354.1684104532813@ichabod.co-bxl>
 <2055648982.2509909.1684516773170@fidget.co-bxl>
> ----------------------------------------
> From: Sysadmin Lists <sysadmin.lists@mailfence.com>
> Date: May 19, 2023, 10:19:33 AM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> Performance is pretty good:
> $ time dedup_multidirs.sh -V dedup{1..13}
> DEBUG: 313087 files, 3497 duplicates, 309590 unique, 42848 stat calls
> # 773723 differences: same filenames, different sizes or hashes
>
> real 1m32.719s
> user 0m50.671s
> sys 0m44.054s
>
> $ du -xs dedup{1..13} | awk '{ sum = sum + $1 } END { print sum }'
> 219195746 # 200G+ of data
Found a bug; fixing it shaved about 10 seconds off the run time:
--------------------------------------------------------------------------------
diff --git a/dedup_multidirs.sh b/dedup_multidirs.sh
index 8563d49..86c5f07 100755
--- a/dedup_multidirs.sh
+++ b/dedup_multidirs.sh
@@ -48,8 +48,8 @@ END { for (i in ARGV) sub("/*$", "/", ARGV[i])
processed[d]
hits++ }
else act("diff")
- if (c++ == hasf[ARGV[k], file])
- break
+ if (++c == hasf[ARGV[k], file])
+ { c = 0; break }
} } } }
if (e) debug(3)
processed[dups[file, j]]; delete dups[file, j]
--------------------------------------------------------------------------------
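The change in the diff comes down to post- versus pre-increment: `c++ == n` compares the value of `c` before it is incremented, so the break fires one iteration later than `++c == n` does, and without the `c = 0` reset the counter would carry over into the next pass. A minimal sketch of the difference follows; the function name `count_iterations` and the limit of 3 are illustrative only and do not come from dedup_multidirs.sh.

```sh
# Hypothetical helper (not part of dedup_multidirs.sh): count how many loop
# iterations run before the break fires, once with a post-increment test
# and once with a pre-increment test against the same limit n.
count_iterations() {
    awk -v n="$1" 'BEGIN {
        c = 0; post = 0
        for (i = 1; i <= 10; i++) { post++; if (c++ == n) break }
        c = 0; pre = 0
        for (i = 1; i <= 10; i++) { pre++;  if (++c == n) break }
        print "post=" post, "pre=" pre
    }'
}

count_iterations 3    # prints: post=4 pre=3
```

With a limit of 3, the post-increment form does four passes and the pre-increment form does three, which is presumably the kind of wasted work the fix removes.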
As a sanity check, I measured how long it takes merely to store every
encountered file, grouped by filename. It is very slow:
total files: 14347
real 1m37.176s
user 1m36.823s
sys 0m0.212s
--------------------------------------------------------------------------------
{ files[$0] = substr($0, match($0, /[^\/]+$/)); tfiles++ }
END { for (f in files)
          if (f in processed == 0) {
              processed[f]; dups[f]; hits[files[f]]++
              for (s in files) {
                  if (f != s && s in processed == 0)
                      if (s ~ "/" files[f] "$") {
                          processed[s]; dups[s]; hits[files[f]]++
                      }
              }
              compare(dups)
              for (f in dups) { delete dups[f]; delete files[f] }
          }
      for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
      close("sort")
      print "total files:", tfiles
}
function compare(array,    f) {
    for (f in array) { } # do nothing
}
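For comparison, the nested `for (s in files)` scan above tests every path against every other path with a dynamic regex, which is probably where most of the time goes. If only the per-filename counts are needed, they can be collected in a single pass by indexing on the basename as each line is read. This is a rough sketch under that assumption, not the approach dedup_multidirs.sh takes; the function name `count_by_basename` and the sample paths are hypothetical.

```sh
# Hypothetical sketch: count paths per basename in one pass.
# Reads one path per line on stdin; prints "<count> <basename>" lines
# (sorted), followed by the total number of paths read.
count_by_basename() {
    awk '{
        name = $0
        sub(/.*\//, "", name)      # strip everything up to the last slash
        hits[name]++
        tfiles++
    }
    END {
        for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
        close("sort")
        print "total files:", tfiles
    }'
}

printf '%s\n' dedup1/a.txt dedup2/a.txt dedup2/sub/b.txt | count_by_basename
```

Unlike the regex match, this treats the filename as a literal string, so names containing characters such as `.` or `+` are not misread as patterns. It does not keep the list of full paths per name; storing those would need an extra array keyed on name and index.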
