Date: Sat, 20 May 2023 23:59:41 +0200 (CEST)
From: Sysadmin Lists
To: questions@freebsd.org
Message-ID: <1554097298.1191790.1684619981686@ichabod.co-bxl>
In-Reply-To: <2055648982.2509909.1684516773170@fidget.co-bxl>
Subject: Re: Tool to compare directories and delete duplicate files from one directory
List-Archive: https://lists.freebsd.org/archives/freebsd-questions

> ----------------------------------------
> From: Sysadmin Lists
> Date: May 19, 2023, 10:19:33 AM
> To:
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> Performance is pretty good:
> $ time dedup_multidirs.sh -V dedup{1..13}
> DEBUG: 313087 files, 3497 duplicates, 309590 unique, 42848 stat calls
> # 773723 differences: same filenames, different sizes or hashes
>
> real 1m32.719s
> user 0m50.671s
> sys 0m44.054s
>
> $ du -xs dedup{1..13} | awk '{ sum = sum + $1 } END { print sum }'
> 219195746 # 200G+ of data

Found a bug; fixing it shaved 10 seconds:

--------------------------------------------------------------------------------
diff --git a/dedup_multidirs.sh b/dedup_multidirs.sh
index 8563d49..86c5f07 100755
--- a/dedup_multidirs.sh
+++ b/dedup_multidirs.sh
@@ -48,8 +48,8 @@ END { for (i in ARGV) sub("/*$", "/", ARGV[i])
                         processed[d]
                         hits++
                     } else act("diff")
-                    if (c++ == hasf[ARGV[k], file])
-                        break
+                    if (++c == hasf[ARGV[k], file])
+                        { c = 0; break }
 } } } }
         if (e) debug(3)
         processed[dups[file, j]]; delete dups[file, j]
--------------------------------------------------------------------------------

As a sanity check, I measured how long it takes merely to store every
encountered file, grouped by filename.
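It's so slow:

total files: 14347

real 1m37.176s
user 1m36.823s
sys 0m0.212s

--------------------------------------------------------------------------------
# Map each full path to its basename and count the input lines.
{ files[$0] = substr($0, match($0, /[^\/]+$/)); tfiles++ }

END {
    for (f in files)
        if (!(f in processed)) {
            processed[f]; dups[f]; hits[files[f]]++
            # Scan every other path for the same basename.
            # Note: files[f] is used as a regex here, so "." and other
            # metacharacters in a filename are not matched literally.
            for (s in files) {
                if (f != s && !(s in processed))
                    if (s ~ "/" files[f] "$") {
                        processed[s]; dups[s]; hits[files[f]]++
                    }
            }
            compare(dups)
            # Drop the finished group so later scans have less to check.
            for (d in dups) { delete dups[d]; delete files[d] }
        }
    for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
    close("sort")
    print "total files:", tfiles
}

# Placeholder for the real comparison; f is a local variable.
function compare(array,    f) {
    for (f in array) { } # do nothing
}
--------------------------------------------------------------------------------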
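The nested loop above tests each path against every remaining path with a
regex match, so the work grows roughly quadratically with the number of
files, which is likely where the user time goes. For comparison, here is a
minimal sketch of the same grouping done in one pass, keyed on the basename.
It is not taken from dedup_multidirs.sh: the function name count_by_basename
and the sample paths are made up for illustration, and it only counts names;
it keeps no per-group path list and does no compare() step.

```shell
#!/bin/sh
# Hypothetical helper, not part of dedup_multidirs.sh.
# Reads one path per line on stdin and counts paths per basename.
count_by_basename() {
    awk '
    {
        # Remove everything up to and including the last "/".
        name = $0
        sub(/.*\//, "", name)
        hits[name]++        # one array update per path, no inner loop
        total++
    }
    END {
        for (name in hits) printf("%6d %s\n", hits[name], name) | "sort"
        close("sort")       # wait for sort before printing the total
        print "total files:", total + 0
    }'
}

# Small demonstration with made-up paths.
printf '%s\n' a/x.txt b/x.txt a/y.txt | count_by_basename
```

On a real tree this would be fed from something like
"find dedup1 dedup2 -type f | count_by_basename". The basename is used as a
literal array key, so filenames containing regex metacharacters need no
special handling.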