Date:      Sat, 13 May 2023 18:55:26 -0700
From:      David Christensen <dpchrist@holgerdanske.com>
To:        questions@freebsd.org
Subject:   Re: Tool to compare directories and delete duplicate files from one directory
Message-ID:  <08804029-03de-e856-568b-74494dfc81cf@holgerdanske.com>
In-Reply-To: <347612746.1721811.1683912265841@fidget.co-bxl>
References:  <9887a438-95e7-87cc-a162-4ad7a70d744f@optiplex-networks.com> <344b29c6-3d69-543d-678d-c2433dbf7152@optiplex-networks.com> <CAFbbPuiNqYLLg8wcg8S_3=y46osb06+duHqY9f0n=OuRgGVY=w@mail.gmail.com> <ef0328b0-caab-b6a2-5b33-1ab069a07f80@optiplex-networks.com> <CAFbbPujUALOS+sUxsp=54vxVAHe_jkvi3d-CksK78c7rxAVoNg@mail.gmail.com> <7747f587-f33e-f39c-ac97-fe4fe19e0b76@optiplex-networks.com> <CAFbbPuhoMOM=wp26yZ42e9xnRP+tJ6B30y8+Ba3eCBh2v66Grw@mail.gmail.com> <fd9aa7d3-f6a7-2274-f970-d4421d187855@optiplex-networks.com> <CAFbbPujpPPrm-axMC9S5OnOiYn2oPuQbkRjnQY4tp=5L7TiVSg@mail.gmail.com> <eda13374-48c1-1749-3a73-530370934eff@optiplex-networks.com> <CAFbbPujbyPHm2GO+FnR0G8rnsmpA3AxY2NzYOAAXetApiF8HVg@mail.gmail.com> <b4ac4aea-a051-fbfe-f860-cd7836e5a1bb@optiplex-networks.com> <7c2429c5-55d0-1649-a442-ce543f2d46c2@holgerdanske.com> <6a0aba81-485a-8985-d20d-6da58e9b5580@optiplex-networks.com> <347612746.1721811.1683912265841@fidget.co-bxl>

On 5/12/23 10:24, Sysadmin Lists wrote:

> Curiosity got the better of me. I've been searching for a project that requires
> the use of multi-dimensional arrays in BSD-awk (not explicitly supported). But
> after writing it, I realized there was a more efficient way without them (only
> run `stat' on files with matching paths plus names) [nonplussed].
> Here's that one.
> 
> #!/bin/sh -e
> # remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
> if [ "X$1" = "X-n" ]; then n=1; shift; fi
> 
> echo "Building files list from ... ${@}"
> 
> find "${@}" -xdev -type f |
> awk -v n=$n 'BEGIN { cmd = "stat -f %z "
> for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
>       { files[$0] = match($0, "(" args ")/?") + RLENGTH }  # index of filename
> END  { for (i in ARGV) sub("/+$", "", ARGV[i])            # remove trailing-/s
>         print "Comparing files ..."
>         for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
>              for (j = i +1; j < x; j++)
>                   if (ARGV[j] "/" substr(file, files[file]) in files) {
>                       dup = ARGV[j] "/" substr(file, files[file])
>                       cmd file | getline fil_s; close(cmd file)
>                       cmd dup  | getline dup_s; close(cmd dup)
>                       if (dup_s == fil_s) act(file, dup, "dup")
>                       else act(file, dup, "diff") }
>              delete files[file]
>       } }
> 
> function act(file, dup, message) {
>      print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
>      if (!n) system("rm -vi " dup "</dev/tty")
> }' "${@}"
> 
> Priority is given by the order of the arguments (first highest, last lowest).
> The user is prompted to delete lower-priority dupes encountered if '-n' isn't
> given, otherwise it just reports what it finds. Comparing by size and name only
> seems odd (a simple `diff' would be easier). Surprisingly, accounting for a
> mixture of dirnames with and w/o trailing-slashes was a bit tricky (dir1 dir2/).
> 
> Fun challenge. Learned a lot about awk.


I wrestled with a Perl script years ago, before I knew of fdupes(1), 
jdupes(1), etc.  Brute-force O(N^2) comparison worked for toy 
datasets, but was impractical when I applied it to a directory 
containing thousands of files and hundreds of gigabytes.  (The OP 
mentioned 12 TB.)  Practical considerations of run time, memory 
usage, disk I/O, etc. drove me to find the kinds of optimizations 
that fdupes(1) and jdupes(1) mention.
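
As I understand it, the key optimization those tools describe is to 
avoid comparing every pair of files: prune by size first, and only 
checksum (or byte-compare) the files whose sizes collide.  A rough, 
untested sh(1) sketch of that idea (my guess at the general approach, 
not how fdupes(1) or jdupes(1) actually work internally):

#!/bin/sh
# Sketch only: checksum just the files whose sizes collide, instead of
# comparing every pair.  Assumes FreeBSD stat(1) and md5(1), and
# filenames free of newlines.
list=$(mktemp) || exit 1
find "$@" -xdev -type f -exec stat -f '%z %N' {} + | sort -n > "$list"

# Only sizes that occur more than once can hide duplicates.
cut -d ' ' -f 1 "$list" | uniq -d |
while read -r size; do
    grep "^$size " "$list" | cut -d ' ' -f 2- |
    while IFS= read -r f; do
        md5 -r "$f"             # checksum only the collision candidates
    done
done | sort                     # repeated leading checksums = likely dups

rm -f "$list"

Most files never get read at all, which is what makes the difference 
on large trees.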


I do not know Awk, so it is hard to comment on your script.  I suggest 
commenting out any create/update/delete code, running the script against 
larger and larger datasets, and seeing what optimizations you can add.
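
For example, something like the following (hypothetical; "dedup.sh" 
stands in for your script, and -n keeps it report-only so nothing is 
deleted) would let you watch the run time grow with the dataset:

#!/bin/sh
# Throw-away test tree: every file in dir2 duplicates dir1.  Raise the
# file count and block size to scale the dataset up.
testdir=$(mktemp -d) || exit 1
mkdir "$testdir/dir1" "$testdir/dir2"
i=1
while [ "$i" -le 1000 ]; do
    dd if=/dev/urandom of="$testdir/dir1/file$i" bs=4k count=1 2>/dev/null
    i=$((i + 1))
done
cp "$testdir/dir1/"* "$testdir/dir2/"
time ./dedup.sh -n "$testdir/dir1" "$testdir/dir2"
rm -rf "$testdir"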


David



