Date: Sat, 20 May 2023 23:59:41 +0200 (CEST)
From: Sysadmin Lists <sysadmin.lists@mailfence.com>
To: questions@freebsd.org
Subject: Re: Tool to compare directories and delete duplicate files from one directory
Message-ID: <1554097298.1191790.1684619981686@ichabod.co-bxl>
In-Reply-To: <2055648982.2509909.1684516773170@fidget.co-bxl>
References: <9887a438-95e7-87cc-a162-4ad7a70d744f@optiplex-networks.com>
 <344b29c6-3d69-543d-678d-c2433dbf7152@optiplex-networks.com>
 <CAFbbPuiNqYLLg8wcg8S_3=y46osb06%2BduHqY9f0n=OuRgGVY=w@mail.gmail.com>
 <ef0328b0-caab-b6a2-5b33-1ab069a07f80@optiplex-networks.com>
 <CAFbbPujUALOS%2BsUxsp=54vxVAHe_jkvi3d-CksK78c7rxAVoNg@mail.gmail.com>
 <7747f587-f33e-f39c-ac97-fe4fe19e0b76@optiplex-networks.com>
 <CAFbbPuhoMOM=wp26yZ42e9xnRP%2BtJ6B30y8%2BBa3eCBh2v66Grw@mail.gmail.com>
 <fd9aa7d3-f6a7-2274-f970-d4421d187855@optiplex-networks.com>
 <CAFbbPujpPPrm-axMC9S5OnOiYn2oPuQbkRjnQY4tp=5L7TiVSg@mail.gmail.com>
 <eda13374-48c1-1749-3a73-530370934eff@optiplex-networks.com>
 <CAFbbPujbyPHm2GO%2BFnR0G8rnsmpA3AxY2NzYOAAXetApiF8HVg@mail.gmail.com>
 <b4ac4aea-a051-fbfe-f860-cd7836e5a1bb@optiplex-networks.com>
 <7c2429c5-55d0-1649-a442-ce543f2d46c2@holgerdanske.com>
 <6a0aba81-485a-8985-d20d-6da58e9b5580@optiplex-networks.com>
 <347612746.1721811.1683912265841@fidget.co-bxl>
 <08804029-03de-e856-568b-74494dfc81cf@holgerdanske.com>
 <126434505.494354.1684104532813@ichabod.co-bxl>
 <2055648982.2509909.1684516773170@fidget.co-bxl>
> ----------------------------------------
> From: Sysadmin Lists <sysadmin.lists@mailfence.com>
> Date: May 19, 2023, 10:19:33 AM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> Performance is pretty good:
> $ time dedup_multidirs.sh -V dedup{1..13}
> DEBUG: 313087 files, 3497 duplicates, 309590 unique, 42848 stat calls
> # 773723 differences: same filenames, different sizes or hashes
>
> real    1m32.719s
> user    0m50.671s
> sys     0m44.054s
>
> $ du -xs dedup{1..13} | awk '{ sum = sum + $1 } END { print sum }'
> 219195746 # 200G+ of data

Found a bug; fixing it shaved 10 seconds off the run:

--------------------------------------------------------------------------------
diff --git a/dedup_multidirs.sh b/dedup_multidirs.sh
index 8563d49..86c5f07 100755
--- a/dedup_multidirs.sh
+++ b/dedup_multidirs.sh
@@ -48,8 +48,8 @@ END { for (i in ARGV) sub("/*$", "/", ARGV[i])
 processed[d]
 hits++
 } else act("diff")
-if (c++ == hasf[ARGV[k], file])
-  break
+if (++c == hasf[ARGV[k], file])
+  { c = 0; break }
 } } } }
 if (e) debug(3)
 processed[dups[file, j]]; delete dups[file, j]
--------------------------------------------------------------------------------

As a sanity check, I measured how long it takes merely to store every
encountered file, grouped by filename.
It's so slow:

total files: 14347

real    1m37.176s
user    1m36.823s
sys     0m0.212s

--------------------------------------------------------------------------------
# Map each full path to its basename and count the input lines.
{ files[$0] = substr($0, match($0, /[^\/]+$/)); tfiles++ }

END {
  for (f in files)
    if (!(f in processed)) {
      processed[f]; dups[f]; hits[files[f]]++
      # Scan every other path for the same basename (quadratic in file count).
      for (s in files) {
        if (f != s && !(s in processed))
          # The basename is used as a regex here, so any regex
          # metacharacters in a filename are interpreted, not matched literally.
          if (s ~ "/" files[f] "$") {
            processed[s]; dups[s]; hits[files[f]]++
          }
      }
      compare(dups)
      for (f in dups) { delete dups[f]; delete files[f] }
    }
  for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
  close("sort")
  print "total files:", tfiles
}

function compare(array,    f) {
  for (f in array) { } # do nothing
}

-- 
Sent with https://mailfence.com
Secure and private email
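
For comparison, here is a minimal sketch of the same grouping done in a single pass. It assumes, like the script above, that pathnames arrive one per line on standard input. The function name group_by_basename is hypothetical and not part of dedup_multidirs.sh. Each path is counted directly under its basename as an array key, so there is no inner scan over all files and no regex match against the filename. It only counts; it does not keep the per-group path lists that compare() would need.

```shell
#!/bin/sh
# Hypothetical helper: reads one pathname per line on stdin and prints
# "count basename" for every basename (sorted), then the total path count.
group_by_basename() {
  awk '
    {
      # Strip everything up to and including the last slash.
      base = $0
      sub(/.*\//, "", base)
      hits[base]++      # one array lookup per path, no nested loop
      tfiles++
    }
    END {
      for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
      close("sort")     # wait for sort to finish before printing the total
      print "total files:", tfiles
    }
  '
}
```

Usage would be something like `find dedup1 dedup2 -type f | group_by_basename`. To keep the paths as well as the counts, one option is to append each path to a per-basename string inside the main rule, at the cost of more memory.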