Date:      Sat, 20 May 2023 23:59:41 +0200 (CEST)
From:      Sysadmin Lists <sysadmin.lists@mailfence.com>
To:        questions@freebsd.org
Subject:   Re: Tool to compare directories and delete duplicate files from one directory
Message-ID:  <1554097298.1191790.1684619981686@ichabod.co-bxl>
In-Reply-To: <2055648982.2509909.1684516773170@fidget.co-bxl>
References:  <9887a438-95e7-87cc-a162-4ad7a70d744f@optiplex-networks.com> <344b29c6-3d69-543d-678d-c2433dbf7152@optiplex-networks.com> <CAFbbPuiNqYLLg8wcg8S_3=y46osb06%2BduHqY9f0n=OuRgGVY=w@mail.gmail.com> <ef0328b0-caab-b6a2-5b33-1ab069a07f80@optiplex-networks.com> <CAFbbPujUALOS%2BsUxsp=54vxVAHe_jkvi3d-CksK78c7rxAVoNg@mail.gmail.com> <7747f587-f33e-f39c-ac97-fe4fe19e0b76@optiplex-networks.com> <CAFbbPuhoMOM=wp26yZ42e9xnRP%2BtJ6B30y8%2BBa3eCBh2v66Grw@mail.gmail.com> <fd9aa7d3-f6a7-2274-f970-d4421d187855@optiplex-networks.com> <CAFbbPujpPPrm-axMC9S5OnOiYn2oPuQbkRjnQY4tp=5L7TiVSg@mail.gmail.com> <eda13374-48c1-1749-3a73-530370934eff@optiplex-networks.com> <CAFbbPujbyPHm2GO%2BFnR0G8rnsmpA3AxY2NzYOAAXetApiF8HVg@mail.gmail.com> <b4ac4aea-a051-fbfe-f860-cd7836e5a1bb@optiplex-networks.com> <7c2429c5-55d0-1649-a442-ce543f2d46c2@holgerdanske.com> <6a0aba81-485a-8985-d20d-6da58e9b5580@optiplex-networks.com> <347612746.1721811.1683912265841@fidget.co-bxl> <08804029-03de-e856-568b-74494dfc81cf@holgerdanske.com> <126434505.494354.1684104532813@ichabod.co-bxl> <2055648982.2509909.1684516773170@fidget.co-bxl>

> ----------------------------------------
> From: Sysadmin Lists <sysadmin.lists@mailfence.com>
> Date: May 19, 2023, 10:19:33 AM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
> 
> 
> Performance is pretty good:
> $ time dedup_multidirs.sh -V dedup{1..13}
> DEBUG: 313087 files, 3497 duplicates, 309590 unique, 42848 stat calls
>      # 773723 differences: same filenames, different sizes or hashes
> 
> real    1m32.719s
> user    0m50.671s
> sys     0m44.054s
> 
> $ du -xs dedup{1..13} | awk '{ sum = sum + $1 } END { print sum }'
> 219195746  # 200G+ of data

Found a bug; fixing it shaved 10 seconds off the run time:

--------------------------------------------------------------------------------

diff --git a/dedup_multidirs.sh b/dedup_multidirs.sh
index 8563d49..86c5f07 100755
--- a/dedup_multidirs.sh
+++ b/dedup_multidirs.sh
@@ -48,8 +48,8 @@ END   { for (i in ARGV) sub("/*$", "/", ARGV[i])
                                                      processed[d]
                                                      hits++ }
                                                  else act("diff")
-                                                 if (c++ == hasf[ARGV[k], file])
-                                                     break
+                                                 if (++c == hasf[ARGV[k], file])
+                                                     { c = 0; break }
                                  }   }   }   }
                                  if (e) debug(3)
                                  processed[dups[file, j]]; delete dups[file, j]

--------------------------------------------------------------------------------
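To see why the one-character change matters: c++ == n compares the value c had before the increment, so the loop makes one extra pass before it breaks, while ++c == n breaks on the n-th pass. The following is only a minimal sketch of that difference, not code from dedup_multidirs.sh. The count_passes function, the limit variable (standing in for hasf[ARGV[k], file]) and the passes counter (standing in for the comparison work) are all hypothetical names made up for the illustration.

```sh
#!/bin/sh
# Hypothetical illustration; none of these names come from dedup_multidirs.sh.
# 'limit' stands in for hasf[ARGV[k], file]; each loop pass stands in for
# one candidate comparison.
count_passes() {
    awk -v limit="$1" -v mode="$2" 'BEGIN {
        c = 0; passes = 0
        for (i = 1; i <= 10; i++) {
            passes++
            if (mode == "post") {
                # old form: compares the value of c from before the increment
                if (c++ == limit) break
            } else {
                # patched form: compares the incremented value, then resets c
                if (++c == limit) { c = 0; break }
            }
        }
        print passes
    }'
}

echo "post-increment passes: $(count_passes 3 post)"
echo "pre-increment passes:  $(count_passes 3 pre)"
```

With a limit of 3, the post-increment form makes 4 passes and the pre-increment form makes 3, i.e. one wasted pass per inner loop in this toy setup.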

As a sanity check, I timed how long it takes merely to store every encountered
file, grouped by filename, using the script below. It's very slow, even for
only 14347 files:

total files: 14347

real    1m37.176s
user    1m36.823s
sys     0m0.212s

--------------------------------------------------------------------------------

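# Sanity check: group every path read from input by its basename, using a
# brute-force nested loop over all stored paths.
    { files[$0] = substr($0, match($0, /[^\/]+$/)); tfiles++ }  # path -> basename
END { for (f in files)
          if (!(f in processed)) {
              processed[f]; dups[f]; hits[files[f]]++
              for (s in files) {
                  if (f != s && !(s in processed))
                      # NB: the basename is used as a regex here, so "." and
                      # similar characters act as metacharacters
                      if (s ~ "/" files[f] "$") {
                          processed[s]; dups[s]; hits[files[f]]++
                      }
              }
              compare(dups)
              # separate loop variable so the outer f is not clobbered
              for (d in dups) { delete dups[d]; delete files[d] }
          }
          for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
          close("sort")
          print "total files:", tfiles
    }
# Placeholder for the real comparison; it only walks the group.
function compare(array,  f) {
        for (f in array) { } # do nothing
}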
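
For contrast, the per-filename counts alone can be collected in a single pass by keying an array on the basename, so each path costs one array lookup instead of a regex match against every other path. This is only a sketch under assumptions, not what dedup_multidirs.sh does: group_by_basename is a hypothetical name, it reads one path per line on stdin, and it only counts names without keeping the per-group path lists that compare() would need.

```sh
#!/bin/sh
# Sketch only (hypothetical helper, not part of dedup_multidirs.sh):
# count paths per basename in one pass over stdin.
group_by_basename() {
    awk '
        {
            name = $0
            sub(/.*\//, "", name)   # strip everything up to the last "/"
            hits[name]++            # one array lookup per path, no inner loop
            tfiles++
        }
        END {
            for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
            close("sort")
            print "total files:", tfiles
        }'
}

# Example input: three made-up paths, two of which share a basename.
printf '%s\n' dedup1/a.txt dedup2/a.txt dedup2/sub/b.txt | group_by_basename
```

Because the basename is used as an array key rather than as a regex, characters such as "." in filenames are compared literally.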





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1554097298.1191790.1684619981686>