Date:      Fri, 5 May 2023 03:30:14 +0100
From:      Kaya Saman <kayasaman@optiplex-networks.com>
To:        Paul Procacci <pprocacci@gmail.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: Tool to compare directories and delete duplicate files from one directory
Message-ID:  <fd9aa7d3-f6a7-2274-f970-d4421d187855@optiplex-networks.com>
In-Reply-To: <CAFbbPuhoMOM=wp26yZ42e9xnRP%2BtJ6B30y8%2BBa3eCBh2v66Grw@mail.gmail.com>
References:  <9887a438-95e7-87cc-a162-4ad7a70d744f@optiplex-networks.com> <CAFbbPugfhXGPfscKpx6B0ue=DcF_qssL6P-0GgB1CWKtm3U=tQ@mail.gmail.com> <344b29c6-3d69-543d-678d-c2433dbf7152@optiplex-networks.com> <CAFbbPuiNqYLLg8wcg8S_3=y46osb06%2BduHqY9f0n=OuRgGVY=w@mail.gmail.com> <ef0328b0-caab-b6a2-5b33-1ab069a07f80@optiplex-networks.com> <CAFbbPujUALOS%2BsUxsp=54vxVAHe_jkvi3d-CksK78c7rxAVoNg@mail.gmail.com> <7747f587-f33e-f39c-ac97-fe4fe19e0b76@optiplex-networks.com> <CAFbbPuhoMOM=wp26yZ42e9xnRP%2BtJ6B30y8%2BBa3eCBh2v66Grw@mail.gmail.com>

On 5/5/23 03:08, Paul Procacci wrote:
> There are multiple reasons why it may not work.  My guess is because
> of the potential for characters that could be showing up within the
> filenames and whatnot.
>
> This can be solved with an interpreted language that's a bit more
> forgiving.
> Take the following perl script.  It does the same thing as the shell
> script (almost).  It renames the source file instead of making a copy
> of it.
>
> run as:  ./test.pl /absolute/path/to/master_dir /absolute_path_to_dir_x
>
> ###################################################################################
>
> #!/usr/bin/env perl
>
> use strict;
> use warnings;
>
> # Print a message (the usage line by default) and exit with status $ret.
> sub msgDie
> {
>   my ($ret) = shift;
>   my ($msg) = shift // "$0 dir_base dir\n";
>   print $msg;
>   exit($ret);
> }
>
> msgDie(1) unless(scalar @ARGV == 2);
>
> my $base = $ARGV[0];
> my $dir  = $ARGV[1];
>
> msgDie(1, "base directory doesn't exist\n") unless -d $base;
> msgDie(1, "source directory doesn't exist\n") unless -d $dir;
>
> opendir(my $dh, $dir) or msgDie(1, "Unable to open directory: $dir\n");
> # Only the top level of $dir is read; subdirectories are not descended into.
> while(readdir $dh)
> {
>   next if($_ eq '.' || $_ eq '..');
>   # No regular file of that name under $base: move the entry across.
>   if( ! -f "$base/$_" ){
>     rename("$dir/$_", "$base/$_");
>     next;
>   }
>
>   # Element 7 of stat() is the size in bytes; equal sizes count as a duplicate.
>   my ($ref) = (stat("$base/$_"))[7];
>   my ($src) = (stat("$dir/$_"))[7];
>   unlink("$dir/$_") if($ref == $src);
> }
> ###################################################################################
>
> ~Paul
>
>

This didn't seem to work :-(

Here is exactly what I did:

I created a pair of test directories in /tmp, so I have /tmp/test1 and
/tmp/test2.

To mimic the structure of the directories I intend to run this on, I
created a subdirectory called dupdir in both /tmp/test1 and /tmp/test2.

/tmp/test2/dupdir contains these files: dup and dup1

/tmp/test1/dupdir contains a modified 'dup' file and an unmodified copy
of the dup1 file.

Now things get interesting: dup in test1 contains "1234567" and dup in
test2 contains "111". This is to simulate the file size difference.
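
The layout described above can be rebuilt with a few shell commands.
This is only a minimal sketch: ROOT is a hypothetical variable
(defaulting to /tmp, as in the message), and the contents of dup1 are an
assumption, since only the contents of the two dup files are given.

```shell
#!/bin/sh
# Rebuild the test layout described above.
# ROOT is a hypothetical override; it defaults to /tmp as in the message.
root=${ROOT:-/tmp}

mkdir -p "$root/test1/dupdir" "$root/test2/dupdir"

# 'dup' differs in size between the two trees.
printf '1234567\n' > "$root/test1/dupdir/dup"
printf '111\n'     > "$root/test2/dupdir/dup"

# 'dup1' is an identical copy in both trees (its contents are an assumption).
printf 'same\n' > "$root/test2/dupdir/dup1"
cp "$root/test2/dupdir/dup1" "$root/test1/dupdir/dup1"

# List what was created.
find "$root/test1" "$root/test2" -type f | sort
```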


I then ran: ./test.pl /tmp/test1 /tmp/test2

The expected behavior is that the file 'dup' in test1 is retained while
'dup1' is removed.

In my actual file system I have many of these subdirs, so a fairer test
would probably be to create:

/tmp/test1/dupdir1
/tmp/test2/dupdir1
/tmp/test1/dupdir2
/tmp/test2/dupdir2

and then put the file dup into dupdir1 and dup1 into dupdir2.
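
The quoted script reads only the top level of its second argument, so
nested subdirectories like these seem to need a recursive walk. One
possible sketch with find(1) follows. It is not the script from the
thread: dedupe_tree is a hypothetical helper, it compares files
byte-for-byte with cmp(1) instead of by size, it only deletes (nothing
is moved into the base tree), and it assumes the base path is absolute.
The demo runs on a throwaway directory from mktemp rather than on real
data.

```shell
#!/bin/sh
# dedupe_tree BASE DIR  (hypothetical helper, not from the thread)
# For every regular file below DIR, remove it from DIR when BASE holds a
# byte-identical file at the same relative path. BASE must be absolute.
dedupe_tree() {
    base=$1
    dir=$2
    # Run in a subshell so the caller's working directory is untouched.
    (
        cd "$dir" || exit 1
        # find hands the names to the inner sh as arguments, so spaces
        # and other odd characters in file names are preserved.
        find . -type f -exec sh -c '
            base=$1
            shift
            for f in "$@"; do
                if [ -f "$base/$f" ] && cmp -s "$base/$f" "$f"; then
                    rm -- "$f"
                fi
            done
        ' sh "$base" {} +
    )
}

# Demo on a throwaway tree that mirrors the layout above.
demo=$(mktemp -d)
mkdir -p "$demo/test1/dupdir1" "$demo/test1/dupdir2" \
         "$demo/test2/dupdir1" "$demo/test2/dupdir2"
printf '1234567\n' > "$demo/test1/dupdir1/dup"
printf '111\n'     > "$demo/test2/dupdir1/dup"
printf 'same\n'    > "$demo/test1/dupdir2/dup1"
printf 'same\n'    > "$demo/test2/dupdir2/dup1"

dedupe_tree "$demo/test1" "$demo/test2"

# List what is left under test2: the differing 'dup' should survive,
# the identical 'dup1' should be gone.
(cd "$demo/test2" && find . -type f)
```

Comparing contents is stricter than comparing sizes, so two different
files that happen to have the same length are both kept.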


I guess my issue is complex? If only I had used the
--remove-source-files option during my initial rsync, I wouldn't have
had to worry about any of this: I was already using --ignore-existing,
so that would have done the trick. Instead I decided to play it safe,
and I've ended up with a slight headache on my hands.



