Date:      Thu, 4 May 2023 23:36:19 -0400
From:      Paul Procacci <pprocacci@gmail.com>
To:        Kaya Saman <kayasaman@optiplex-networks.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: Tool to compare directories and delete duplicate files from one directory
Message-ID:  <CAFbbPujbyPHm2GO%2BFnR0G8rnsmpA3AxY2NzYOAAXetApiF8HVg@mail.gmail.com>
In-Reply-To: <eda13374-48c1-1749-3a73-530370934eff@optiplex-networks.com>
References:  <9887a438-95e7-87cc-a162-4ad7a70d744f@optiplex-networks.com> <CAFbbPugfhXGPfscKpx6B0ue=DcF_qssL6P-0GgB1CWKtm3U=tQ@mail.gmail.com> <344b29c6-3d69-543d-678d-c2433dbf7152@optiplex-networks.com> <CAFbbPuiNqYLLg8wcg8S_3=y46osb06%2BduHqY9f0n=OuRgGVY=w@mail.gmail.com> <ef0328b0-caab-b6a2-5b33-1ab069a07f80@optiplex-networks.com> <CAFbbPujUALOS%2BsUxsp=54vxVAHe_jkvi3d-CksK78c7rxAVoNg@mail.gmail.com> <7747f587-f33e-f39c-ac97-fe4fe19e0b76@optiplex-networks.com> <CAFbbPuhoMOM=wp26yZ42e9xnRP%2BtJ6B30y8%2BBa3eCBh2v66Grw@mail.gmail.com> <fd9aa7d3-f6a7-2274-f970-d4421d187855@optiplex-networks.com> <CAFbbPujpPPrm-axMC9S5OnOiYn2oPuQbkRjnQY4tp=5L7TiVSg@mail.gmail.com> <eda13374-48c1-1749-3a73-530370934eff@optiplex-networks.com>

On Thu, May 4, 2023 at 11:20 PM Kaya Saman <kayasaman@optiplex-networks.com> wrote:

>
> On 5/5/23 04:01, Paul Procacci wrote:
>
> On Thu, May 4, 2023 at 10:30 PM Kaya Saman <kayasaman@optiplex-networks.com> wrote:
>
>>
>> On 5/5/23 03:08, Paul Procacci wrote:
>>
>> There are multiple reasons why it may not work.  My guess is that it's the
>> characters that could be showing up within the filenames and whatnot.
>>
>> This can be solved with an interpreted language that's a bit more
>> forgiving.
>> Take the following perl script.  It does the same thing as the shell
>> script (almost).  It renames the source file instead of making a copy of it.
>>
>> run as:  ./test.pl /absolute/path/to/master_dir /absolute_path_to_dir_x
>>
>> ###################################################################################
>>
>> #!/usr/bin/env perl
>>
>> use strict;
>> use warnings;
>>
>> sub msgDie
>> {
>>   my ($ret) = shift;
>>   my ($msg) = shift // "$0 dir_base dir\n";
>>   print $msg;
>>   exit($ret);
>> }
>>
>> msgDie(1) unless(scalar @ARGV eq 2);
>>
>> my $base = $ARGV[0];
>> my $dir  = $ARGV[1];
>>
>> msgDie(1, "base directory doesn't exist\n") unless -d $base;
>> msgDie(1, "source directory doesn't exist\n") unless -d $dir;
>>
>> opendir(my $dh, $dir) or msgDie("Unable to open directory: $dir\n");
>> while(readdir $dh)
>> {
>>   next if($_ eq '.' || $_ eq '..');
>>   if( ! -f "$base/$_" ){
>>     rename("$dir/$_", "$base/$_");
>>     next;
>>   }
>>
>>   my ($ref) = (stat("$base/$_"))[7];
>>   my ($src) = (stat("$dir/$_"))[7];
>>   unlink("$dir/$_") if($ref == $src);
>> }
>>
>> ###################################################################################
>>
>> ~Paul
>>
>>
>>
>> This didn't seem to work :-(
>>
>>
>> What exactly happened is this:
>>
>>
>> I created a set of test directories in /tmp
>>
>>
>> So, I have /tmp/test1 and /tmp/test2
>>
>>
>> To mimic the structure of the directories I intend to run this thing on,
>> I did this:
>>
>>
>> create a subdir called: dupdir in /tmp/test1 and /tmp/test2
>>
>>
>> /tmp/test2/dupdir contains these files: dup and dup1
>>
>>
>> /tmp/test1/dupdir contains a modified 'dup' file but copied dup1 file.
>>
>>
>> However*, now things get interesting as dup from test1 contains "1234567"
>> and dup from test2 contains "111" <- this is to simulate the file size
>> difference.
>>
>>
>>
>>
>>
>>
> Worked for me!  Regardless.  Use rsync then.
>
> rsync --ignore-existing --remove-source-files  /src /dest
>
> This would at the very least move files that don't yet exist in the dest from the source over to the dest AND remove those source files AFTER the transfer happens.
>
> You'll be 1/2 way there doing that.  What you'll be left with are files that exist in BOTH src AND DEST.
>
>
> ~Paul
>
>
> Paul, I think we've got wires crossed....
>
>
> I *have* already performed the rsync. Apologies if I wasn't clear!
>
>
> The problem I am faced with is that the destination directory is already
> populated with the information from 3 source directories.
>
>
> I need to remove the sync'ed files in the source directories and leave
> files that match in name but are of different sizes.
>
>
> The problem is I can't use rsync again for this as there aren't any
> options to simply compare files based on size. I can't use the --existing
> option as the files exist in both directories....
>
>
> This is the dilemma I am facing:
>
>
> ls -l /merged_dir/folder/
>
> 234904506 - file 'a'
>
>
> ls -l /source_dir/folder/
>
> 1080918146 - file 'a'
>
>
> so in this case file 'a' is in both directories with the same name but
> different size. I need to keep both versions. However, *if* they were the
> same size then remove the file in the source_dir.....
>
>
> That's all.. I don't need to transfer anything or copy anything at all...
> just compare and remove files of same name and size.
>
>
> Hopefully I am explaining better and things are more clear? Again I
> apologize for the confusion  :-(
>

You're at least partially right that I was confused, because comparing by
name and by size alone makes little sense to me.  Change a single byte in one
copy and the two files still have the same name and the same size, yet they
differ!  ;)
Is the output below what you're expecting to happen?

% mkdir a b
% echo 1111 > a/test.txt
% echo 1111 > b/test.txt
% ./test.pl a b
% ls -l a b
a:
total 5
-rw-r--r--  1 pprocacci  pprocacci  5 May  5 03:26 test.txt

b:
total 0

----------
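
If you'd rather check expectations before anything gets deleted, here is a
rough dry-run sketch in plain sh.  It is not part of test.pl; the function
name classify_dir and the labels it prints are made up for illustration.  It
only reports which of the three cases each file in the source directory falls
into.  It uses wc -c for the size so it doesn't depend on stat(1) flags, which
differ between systems, and it assumes only regular, non-hidden files matter.

```shell
#!/bin/sh
# classify_dir BASE DIR  (hypothetical helper, report only, deletes nothing)
# Prints one line per regular file found in DIR.
classify_dir() {
  base=$1
  dir=$2
  for src in "$dir"/*; do
    # Skips an unmatched glob, subdirectories and other non-regular files.
    [ -f "$src" ] || continue
    name=${src##*/}
    ref=$base/$name
    if [ ! -f "$ref" ]; then
      echo "only-in-source: $name"
    # $(( ... )) strips the padding some wc implementations print.
    elif [ "$(($(wc -c < "$ref")))" -eq "$(($(wc -c < "$src")))" ]; then
      echo "same-size: $name"
    else
      echo "different-size: $name"
    fi
  done
}
```

Called as "classify_dir a b", the files reported as same-size are the ones a
size-based cleanup would remove from "b".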

The perl script below is what was run above.  1) Find a file in directory
"b".  2) Go to the top of the loop if the file doesn't exist in directory
"a".  3) Go to the top of the loop if the file sizes do not match.  4)
Unlink the file if conditions 2 and 3 fall through.

#################################################
#!/usr/bin/env perl

use strict;
use warnings;

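sub msgDie
{
  my ($ret) = shift;
  my ($msg) = shift // "$0 dir_base dir\n";
  print $msg;
  exit($ret);
}

# Numeric comparison: exactly two arguments are required.
msgDie(1) unless(scalar @ARGV == 2);

my $base = $ARGV[0];
my $dir  = $ARGV[1];

msgDie(1, "base directory doesn't exist\n") unless -d $base;
msgDie(1, "source directory doesn't exist\n") unless -d $dir;

# msgDie takes the exit code first, then the message.
opendir(my $dh, $dir) or msgDie(1, "Unable to open directory: $dir\n");
while(defined(my $name = readdir $dh))
{
  next if($name eq '.' || $name eq '..');
  # Skip files that have no counterpart in the base directory.
  next if(! -f "$base/$name");

  # Element 7 of stat() is the file size in bytes.
  my ($ref) = (stat("$base/$name"))[7];
  my ($src) = (stat("$dir/$name"))[7];
  # Remove the source copy only when both sizes match.
  unlink("$dir/$name") if(defined $src && $ref == $src);
}
closedir($dh);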
#################################################
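
If the same-name-same-size caveat above matters, one possible variation is to
compare contents instead of sizes.  The following is a minimal sh sketch, not
the script that was run above.  prune_identical is a hypothetical name, and
the sketch assumes the two directories are distinct and that only regular,
non-hidden files need handling.  It relies on cmp -s, which exits 0 only when
the two files are byte-for-byte identical.

```shell
#!/bin/sh
# prune_identical BASE DIR  (hypothetical helper)
# Removes a file from DIR only when BASE holds a file of the same name
# with identical contents.  Everything else in DIR is left alone.
prune_identical() {
  base=$1
  dir=$2
  for src in "$dir"/*; do
    # Skips an unmatched glob, subdirectories and other non-regular files.
    [ -f "$src" ] || continue
    ref=$base/${src##*/}
    # No counterpart in BASE: keep the source file.
    [ -f "$ref" ] || continue
    # cmp -s prints nothing; exit status 0 means the files are identical.
    if cmp -s "$ref" "$src"; then
      rm -- "$src"
    fi
  done
}
```

Run as "prune_identical a b".  Files that match in name and size but differ
in content stay in "b", which the size-only check cannot guarantee.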

-- 
__________________

:(){ :|:& };:
