From: David Christensen
Date: Sat, 13 May 2023 18:55:26 -0700
To: questions@freebsd.org
Subject: Re: Tool to compare directories and delete duplicate files from one directory
Message-ID: <08804029-03de-e856-568b-74494dfc81cf@holgerdanske.com>
In-Reply-To: <347612746.1721811.1683912265841@fidget.co-bxl>
List-Archive: https://lists.freebsd.org/archives/freebsd-questions

On 5/12/23 10:24, Sysadmin Lists wrote:
> Curiosity got the better of me. I've been searching for a project that requires
> the use of multi-dimensional arrays in BSD-awk (not explicitly supported). But
> after writing it, I realized there was a more efficient way without them (only
> run `stat' on files with matching paths plus names) [nonplussed].
> Here's that one.
>
> #!/bin/sh -e
> # remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
> if [ "X$1" = "X-n" ]; then n=1; shift; fi
>
> echo "Building files list from ... ${@}"
>
> find "${@}" -xdev -type f |
> awk -v n=$n 'BEGIN { cmd = "stat -f %z "
> for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
>      { files[$0] = match($0, "(" args ")/?") + RLENGTH }  # index of filename
> END  { for (i in ARGV) sub("/+$", "", ARGV[i])            # remove trailing-/s
>        print "Comparing files ..."
>        for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
>           for (j = i +1; j < x; j++)
>                if (ARGV[j] "/" substr(file, files[file]) in files) {
>                    dup = ARGV[j] "/" substr(file, files[file])
>                    cmd file | getline fil_s; close(cmd file)
>                    cmd dup  | getline dup_s; close(cmd dup)
>                    if (dup_s == fil_s) act(file, dup, "dup")
>                    else act(file, dup, "diff") }
>           delete files[file]
>      } }
>
> function act(file, dup, message) {
>     print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
>     if (!n) system("rm -vi " dup " </dev/tty")
> }' "${@}"
>
> Priority is given by the order of the arguments (first highest, last lowest).
> The user is prompted to delete lower-priority dupes encountered if '-n' isn't
> given, otherwise it just reports what it finds. Comparing by size and name only
> seems odd (a simple `diff' would be easier). Surprisingly, accounting for a
> mixture of dirnames with and w/o trailing-slashes was a bit tricky (dir1 dir2/).
>
> Fun challenge. Learned a lot about awk.

I wrestled with a Perl script years ago, when I did not know of
fdupes(1), jdupes(1), etc. Brute-force O(N^2) comparison worked for toy
datasets, but was impractical when I applied it to a directory
containing thousands of files and hundreds of gigabytes. (The OP
mentioned 12 TB.)

Practical considerations of run time, memory usage, disk I/O, etc.,
drove me to find the kinds of optimizations fdupes(1) and jdupes(1)
mention.

I do not know Awk, so it is hard to comment on your script. I suggest
commenting out any create/update/delete code, running the script
against larger and larger datasets, and seeing what optimizations you
can add.

David
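
As a rough illustration of the kind of optimization mentioned above,
here is a report-only sketch that avoids pairwise comparison: it groups
files by size first, then checksums only the files whose size is shared
with at least one other file. The function name find_dupes is
hypothetical, the sketch assumes file names contain no tabs or
newlines, and it is not how fdupes(1) or jdupes(1) are implemented. A
matching cksum(1) value is only a strong hint, so a byte-for-byte
cmp(1) would still be the step to run before deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: report groups of files that share size and checksum.
# Assumption: file names contain no tabs or newlines.
find_dupes() {
    # Stage 1: emit "size<TAB>path" for every regular file.
    # wc -c is portable; it typically takes the size from file metadata.
    find "$@" -xdev -type f -exec sh -c '
        for f do
            printf "%s\t%s\n" "$(wc -c < "$f" | tr -d " ")" "$f"
        done' sh {} + |
    LC_ALL=C sort -n |
    # Stage 2: keep only files whose size occurs at least twice.
    awk -F '\t' '
        NR > 1 && $1 == prev_size {
            if (held != "") { print held; held = "" }
            print
            next
        }
        { prev_size = $1; held = $0 }' |
    # Stage 3: checksum the remaining candidates only.
    while IFS="$(printf '\t')" read -r size path; do
        sum=$(cksum < "$path" | cut -d " " -f 1)
        printf '%s\t%s\t%s\n' "$size" "$sum" "$path"
    done |
    LC_ALL=C sort |
    # Stage 4: print each group with identical size and checksum.
    awk -F '\t' '
        { key = $1 FS $2 }
        key == prev {
            if (held != "") { print "duplicates:"; print "  " held; held = "" }
            print "  " $3
            next
        }
        { prev = key; held = $3 }'
}
```

Called as `find_dupes dir1 dir2`, it prints each group under a
"duplicates:" line and deletes nothing. Sorting replaces the nested
loops, so the cost is dominated by one sort of the file list plus
reading only the files that have a same-size partner.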