Date: Thu, 4 May 2023 20:13:02 -0400
From: Paul Procacci <pprocacci@gmail.com>
To: Kaya Saman <kayasaman@optiplex-networks.com>
Cc: freebsd-questions@freebsd.org
Subject: Re: Tool to compare directories and delete duplicate files from one directory
Message-ID: <CAFbbPujUALOS%2BsUxsp=54vxVAHe_jkvi3d-CksK78c7rxAVoNg@mail.gmail.com>
In-Reply-To: <ef0328b0-caab-b6a2-5b33-1ab069a07f80@optiplex-networks.com>
References: <9887a438-95e7-87cc-a162-4ad7a70d744f@optiplex-networks.com> <CAFbbPugfhXGPfscKpx6B0ue=DcF_qssL6P-0GgB1CWKtm3U=tQ@mail.gmail.com> <344b29c6-3d69-543d-678d-c2433dbf7152@optiplex-networks.com> <CAFbbPuiNqYLLg8wcg8S_3=y46osb06%2BduHqY9f0n=OuRgGVY=w@mail.gmail.com> <ef0328b0-caab-b6a2-5b33-1ab069a07f80@optiplex-networks.com>

On Thu, May 4, 2023 at 7:53 PM Kaya Saman <kayasaman@optiplex-networks.com> wrote:

>
> On 5/4/23 23:32, Paul Procacci wrote:
>
> On Thu, May 4, 2023 at 5:47 PM Kaya Saman <kayasaman@optiplex-networks.com>
> wrote:
>
>>
>> On 5/4/23 17:29, Paul Procacci wrote:
>>
>> On Thu, May 4, 2023 at 11:53 AM Kaya Saman <
>> kayasaman@optiplex-networks.com> wrote:
>>
>>> Hi,
>>>
>>> I'm wondering if anyone knows of a tool like diff or so that can also
>>> delete files based on name and size from either left/right or
>>> source/destination directory?
>>>
>>> Basically what I have done is performed an rsync without using the
>>> --remove-source-files option onto a newly bought and created disk pool
>>> (yes zpool) that i am trying to consolidate my data - as it's currently
>>> spread out over multiple pools with the same folder name.
>>>
>>> The issue I am facing mainly is that I perform another rsync and use the
>>> --remove-source-files option, rsync will delete files based on name
>>> while there are some files that have the same name but not same size and
>>> I would like to retain these files.
>>>
>>> Right now I have looked at many different options in both rsync and
>>> other tools but found nothing suitable. I even tested using a few test
>>> dirs and files that I put into /tmp and whatever I tried, the files of
>>> different size either got transferred or deleted.
>>>
>>> How would be a good way to approach this problem?
>>>
>>> Even if I create some kind of shell script and use diff, I think it will
>>> only compare names and not file sizes.
>>>
>>> I'm really lost here....
>>>
>>> Regards,
>>>
>>> Kaya
>>>
>> It sounds like you want fdupes. It's in the ports tree.
>>
>> ~Paul
>>
>> --
>> __________________
>>
>> :(){ :|:& };:
>>
>> I tried fdupes and installed it a while back. For me it felt like it only
>> works on a single directory.
>>
>> My dir structure is that I have"
>>
>> /dir <- main directory where everything has now been rsync'ed to
>>
>> /dir_1 <- old directory with partial content
>>
>> /dir_2 <- more partial content
>>
>> /dir_3 <- more partial content
>>
>> The key thing here is that I need to compare:
>>
>> /dir_(x) with /dir
>>
>> if the files are different sizes in /dir_(x) then leave them, otherwise
>> delete if both name and file size are the same.
>>
>
> Then a tiny shell script does the job assuming your files don't have any
> spaces and no weird characters exist:
>
> #!/bin/sh
>
> for i in b c d;
> do
>   ls $i/ | while read file;
>   do
>     [ ! -f a/$file ] && cp $i/$file a/$file && continue
>
>     ref=`stat -f '%z' a/$file`
>     src=`stat -f '%z' %i/$file`
>     [ $ref -eq $src ] && rm -f $i/file
>
>   done
> done
>
> Change paths accordingly and backup your stuff. ;)
>
> ~Paul
>
> --
> __________________
>
> :(){ :|:& };:
>
> Thanks Paul,
>
> I should be able to work with this. There are actually spaces and weird
> characters in the file names so I assume doing something like "file" should
> allow for that?
>
> I don't think I need the line after the 'do' statement do I? From what I
> understand it copies the file from directory i to directory a? As I
> explained initially, the files have already been rsync'ed so I just need to
> compare and delete accordingly.
>
> When I performed the rsync it took around a week to complete per run,
> currently zfs list shows around 12TB usage for my /dir but that's with
> compression enabled, of the merged directory.
>
> A quick Google shows that I can use something like this:
>
> search_dir=/the/path/to/base/dir
> for entry in "$search_dir"/*
> do
>   echo "$entry"
> done
>
> To list the files in the directory though this might be Bash and not Csh
>
> Otherwise clunkily (my scripting style is pretty rubbish and non
> efficient), I could do something like (it probably won't work!):
>
> #!/bin/sh
>
> #fb = file base
>
> #fm - file merge - file that has already been merged using rsync unless
> size was different
>
> dir_base=/dir
> for fb in "$dir_base"/*
> do
>   echo "$fs"
> done
>
> dir_merge=/dir_1
> for fm in "$dir_merge"/*
> do
>   echo "$fm"
> done
>
>   do
>
>     ref=`stat -f '%z' $dir_base/$fb`
>     src=`stat -f '%z' %i$dir_merge/$fm`
>     [ $ref -eq $src ] && rm -f $dir_merge/$fm
>
>   done
>
> Regards,
>
> Kaya
>

What I provided is exactly what you need, as it loops through all the
directories. You just have to provide the list of source directories on
that first for loop.
You can alter it by removing the first for loop, but then you'll need to
run it once for each directory you want to apply the checks to.

Enclosing the variables in double quotes takes care of spaces, but it may
not cover every case: the `ls | while read` loop can still trip over
unusual names, for example ones containing newlines.
If you're reasonably sure your filenames contain nothing like that, you
have a better chance of it working.

Worst comes to worst, you'll need to:

  find /path -print0 | xargs -0 -n 1 <args>

to overcome weird characters in filenames.

Either way, since you know you have at least spaces and some special
characters, adding quotes is probably the correct course of action.

As an aside, I don't use this syntax:    for entry in "$search_dir"/*
You're certainly free to do so, but I personally avoid globs when possible.
Maybe not so much in scripts like this, but on the command line a glob can
expand to more than the maximum size allowed for command line arguments.

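For illustration, here is a minimal sketch of the find/xargs approach mentioned above. It is not from the thread: the function name `same_size_twins` is hypothetical, `wc -c` is an assumption swapped in for `stat -f '%z'` so the sketch is not tied to FreeBSD's stat, and `-maxdepth` assumes a find that supports it (FreeBSD and GNU find both do). It only prints candidates and deletes nothing.

```shell
#!/bin/sh
# Sketch only: same_size_twins is a hypothetical helper, not from the thread.
# It prints each regular file directly under SRC that has a twin of the same
# name and the same size directly under BASE. Nothing is deleted.
# Usage: same_size_twins SRC BASE
same_size_twins() {
    # find emits NUL-terminated paths; xargs -0 splits on NUL only, so
    # spaces, quotes, and newlines in names arrive intact as one argument.
    find "$1" -maxdepth 1 -type f -print0 |
        xargs -0 -n 1 sh -c '
            base=$1 file=$2
            [ -n "$file" ] || exit 0     # some xargs run once on empty input
            twin=$base/${file##*/}
            [ -f "$twin" ] || exit 0     # no twin in BASE: nothing to report
            # wc -c stands in for the FreeBSD-specific stat -f %z
            a=$(( $(wc -c < "$file") ))
            b=$(( $(wc -c < "$twin") ))
            [ "$a" -eq "$b" ] && printf "%s\n" "$file"
            exit 0
        ' sh "$2"
}
```

Something like `same_size_twins /dir_1 /dir` would then list the files in /dir_1 that look safe to remove, so the list can be reviewed before anything is deleted.
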
Revised script adding comments:
-----------------------------------------------------
#!/bin/sh

#
# dir_1, dir_2, and dir_3 are the directories I want to search through.
for i in dir_1 dir_2 dir_3;
do
  # Retrieve the filenames within each of those directories
  ls "$i"/ | while IFS= read -r file;
  do
    # If the file doesn't exist in the base dir, copy it and continue
    # with the top of the loop.
    [ ! -f "dir_base/$file" ] && cp "$i/$file" dir_base/ && continue

    #
    # Getting to this point means the file exists in both locations.
    #

    # Get the file size as it is in the dir_base
    ref=`stat -f '%z' "dir_base/$file"`

    # Get the file size as it is in $i
    src=`stat -f '%z' "$i/$file"`

    # If the sizes are the same, remove the file from the source directory
    [ "$ref" -eq "$src" ] && rm -f "$i/$file"

  done
done

-- 
__________________

:(){ :|:& };:
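
Since the files have already been rsync'ed, a delete-only variant with a dry run may be closer to what was asked for. The following is a hedged sketch rather than anything posted in the thread: `prune_same_size` and the `DRY_RUN` variable are hypothetical names, and `wc -c` is an assumption used in place of FreeBSD's `stat -f '%z'` for portability. It uses a glob inside a for loop, which the shell expands itself without passing the names to a command, so the command line argument limit should not apply there. Note that `*` does not match dotfiles.

```shell
#!/bin/sh
# Sketch only: prune_same_size and DRY_RUN are hypothetical, not from the thread.
# Usage: prune_same_size BASE SRC [SRC...]
# By default it only prints what it would remove; set DRY_RUN=0 to delete.
prune_same_size() {
    base=$1
    shift
    for dir in "$@"; do
        for path in "$dir"/*; do
            [ -f "$path" ] || continue   # skip subdirectories and an unmatched glob
            twin=$base/${path##*/}
            [ -f "$twin" ] || continue   # no twin in BASE: keep the source file
            # wc -c stands in for the FreeBSD-specific stat -f '%z'
            if [ "$(( $(wc -c < "$path") ))" -eq "$(( $(wc -c < "$twin") ))" ]; then
                if [ "${DRY_RUN:-1}" = 0 ]; then
                    rm -f -- "$path"
                else
                    printf 'would remove: %s\n' "$path"
                fi
            fi
        done
    done
}
```

Running `prune_same_size /dir /dir_1 /dir_2 /dir_3` first and reading the output, then repeating with `DRY_RUN=0` set, keeps the destructive step separate from the check. Files whose sizes differ are left alone in both modes.
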
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAFbbPujUALOS%2BsUxsp=54vxVAHe_jkvi3d-CksK78c7rxAVoNg>