From owner-freebsd-questions@FreeBSD.ORG Sun Jul 23 17:28:04 2006 Return-Path: X-Original-To: freebsd-questions@freebsd.org Delivered-To: freebsd-questions@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 15EAD16A4DD for ; Sun, 23 Jul 2006 17:28:04 +0000 (UTC) (envelope-from parv@pair.com) Received: from mta13.adelphia.net (mta13.mail.adelphia.net [68.168.78.44]) by mx1.FreeBSD.org (Postfix) with ESMTP id 87EF843D46 for ; Sun, 23 Jul 2006 17:28:03 +0000 (GMT) (envelope-from parv@pair.com) Received: from default.chvlva.adelphia.net ([69.160.66.115]) by mta13.adelphia.net (InterMail vM.6.01.05.02 201-2131-123-102-20050715) with ESMTP id <20060723172802.RELQ10992.mta13.adelphia.net@default.chvlva.adelphia.net>; Sun, 23 Jul 2006 13:28:02 -0400 Received: by default.chvlva.adelphia.net (Postfix, from userid 1000) id 3526EBB54; Sun, 23 Jul 2006 13:28:14 -0400 (EDT) Date: Sun, 23 Jul 2006 13:28:14 -0400 From: Parv To: Joshua Lewis Message-ID: <20060723172814.GB960@holestein.holy.cow> Mail-Followup-To: Joshua Lewis , freebsd-questions@freebsd.org References: <336A5DA6-5A43-44C0-8961-139C81702AB3@familyfunzone.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <336A5DA6-5A43-44C0-8961-139C81702AB3@familyfunzone.net> Cc: freebsd-questions@freebsd.org Subject: Re: Searching a drive and copying files X-BeenThere: freebsd-questions@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: User questions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Jul 2006 17:28:04 -0000 in message <336A5DA6-5A43-44C0-8961-139C81702AB3@familyfunzone.net>, wrote Joshua Lewis thusly... > > I need to search my drive for all pictures on my system and copy > them to a networked system using sftp or ssh or what not. There > will be duplicate names on the drive so I was hoping to have dups > placed in a separate folder. Unison, net/unison port, should be able to handle the duplicates based on file checksum. (I personally have not used it much, so i cannot answer any other queried about it; refer to its fine man page.) > Due to my for lack of a better term stupidity when I first got > my camera I will probably have instances when there will be three > or four duplicates. can help me out with that it would be great. ... > My goal is to find all my pictures and compare them then delete > the dups that don't look that good. A daunting task as I have 20 > GB of data. I bet 10 GB are dups. A checksum-based management of duplicates will help with the files with identical contents, but not with files that differ even a bit. Perl program below -- a modified version of Randal Schwartz's version[0] -- uses md5(1) to identify duplicates (as in identical files), failing that, Image::Magick based on fuzz factor. When it finds duplicates, it asks to enter the item number from the file list to be deleted. [0] Article "Finding similar images", http://www.stonehenge.com/merlyn/LinuxMag/col50.html To be able to run, it needs Image::Magick (graphics/ImageMagick port), Cache::FileCache (devel/p5-Cache-Cache), List::Util (lang/p5-Scalar-List-Utils), File::Copy & File::Path. Mind that it, rather Image::Magick, may consume all of your memory and/or temporary fs if you run it on all the files at once. If you are good in Perl, you could modify the program to move the duplicates in a directory (instead of deleting), and possibly not to ask to take the particular action (if as you say you would have a boat load of duplicates). Without further interruptions, program follows ... #!perl # This is a modified version of Randal Schwartz's ... # # http://www.stonehenge.com/merlyn/LinuxMag/col50.html # # ... as it uses checksum (MD5 for now) to detect identical files, failing that # uses Image::Magick. use warnings; use strict; $|++; use Image::Magick; use Cache::FileCache; use File::Copy qw( move ); use File::Path qw( mkpath ); use List::Util qw( reduce ); use Carp qw(carp); use Getopt::Long qw( :config gnu_compat no_ignore_case no_debug ); # User option; permitted average deviation in the vector elements. my $fuzz = 15; # User option; if defined, rename corrupt images into this dir. my $corrupt_dir = "CORRUPT"; { my $usage; GetOptions ( 'h|usage|help' => \$usage , 'f|fuzz=i' => \$fuzz , 'c|corrupt=s' => \$corrupt_dir , 'nc|nocorrupt' => sub { undef $corrupt_dir; } ) or usage( 1 ); usage( 0 ) if $usage; # Check if any arguments remain which will be file names usage( 1, "No file(s) or directory(ies) given." ) unless scalar @ARGV; } sub warnif; my $cache = Cache::FileCache->new ( { namespace => 'image.cache' , cache_root => ( glob( "~/log/misc" ) )[ 0 ] } ); my @buckets; FILE: while ( @ARGV ) { my $file = shift; next FILE if -l $file; if ( -d $file ) { opendir DIR, $file or next FILE; unshift @ARGV, map { m/^\./ ? () : "$file/$_"; } sort readdir DIR; next FILE; } next FILE unless -f _ or -d _; my ( @stat ) = stat _ or die "should not happen: $!"; # dev/ino/mtime my $key = "@stat[ 0, 1, 9 ]"; my @vector; #print "$file "; if ( my $data = $cache->get( $key ) ) { #print "... is cached\n"; @vector = @$data; } else { my $image = Image::Magick->new; if ( my $x = $image->Read( $file ) ) { if ( defined $corrupt_dir and $x =~ m/corrupt|unexpected end-of-file/i ) { print "$file "; print "... renaming into $corrupt_dir\n"; -d $corrupt_dir or mkpath $corrupt_dir, 0, 0700 or die "Cannot mkpath $corrupt_dir: $!"; move $file, $corrupt_dir or warn "Cannot rename: $!"; } else { print "$file "; print "... skipping ( $x )\n"; } next FILE; } #print "is ", join( "x", $image->Get( 'width', 'height' ) ), "\n"; warnif $image->Normalize(); warnif $image->Resize( geometry => '4x4!' ); warnif $image->Set( magick => 'rgb' ); @vector = unpack "C*", $image->ImageToBlob(); $cache->set( $key, [ @vector ] ); } BUCKET: for my $bucket ( @buckets ) { my $error = 0; INDEX: for my $index ( 0 .. $#vector ) { $error += abs( $bucket->[ 0 ][ $index ] - $vector[ $index ] ); next BUCKET if $error > $fuzz * @vector; } push @$bucket, $file; #print "linked ", join( ", ", @$bucket[ 1 .. $#$bucket ] ), "\n"; next FILE; } push @buckets, [ [ @vector ], $file ]; } # Connect images only, no interactive process #exit; for my $bucket ( @buckets ) { my @names = @$bucket; shift @names; # first element is vector next unless @names > 1; # skip unique images my $images = Image::Magick->new; $images->Read( @names ); compare_as_text( $images ); my $sums = collect_md5sum( $images ); { # Silence warning about single use of $b. no warnings 'once'; compare_as_image( $images ) unless reduce { $a eq $b ? $a : 0 } @$sums; } print "Delete? [picture number] "; my $img_count = scalar @{ $images }; my @dead; chomp( my $dead = ); @dead = $dead =~ m/^ \s* [*+] $/x ? ( 1 .. $img_count ) : $dead =~ m/^ \s* - \d+ $/x ? ( $img_count + $dead + 1 .. $img_count ) : grep { $_ >= 1 and $_ <= $img_count } $dead =~ /(\d+)/g; for ( @dead ) { my $dead_name = $images->[ $_ - 1 ]->Get( 'base-filename' ); warn "rm $dead_name\n"; unlink $dead_name or warn "Cannot rm $dead_name: $!"; warn "\n"; } } sub compare_as_text { my $images = shift; my $frmt = "%d: %s\n -- %dx%d %0.3f kB\n"; foreach my $img ( 0 .. scalar @$images - 1 ) { printf $frmt , ( $img + 1 ), $images->[ $img ]->Get( 'base-filename' ) , $images->[ $img ]->Get( 'width' ), $images->[ $img ]->Get( 'height' ) , ( $images->[ $img ]->Get( 'filesize' ) / 1024 ) ; } } sub collect_md5sum { my $images = shift; my @md5; foreach ( 0 .. scalar @$images - 1 ) { my $name = $images->[ $_ ]->Get( 'base-filename' ); push @md5, ( split ' ', qx/ md5 $name / )[ 3 ]; } return [ @md5 ]; } sub compare_as_image { my $images = shift; my $montage = $images->Montage ( geometry => '370x500' , tile => '2x2' , label => "[%p] %i %wx%h %b" ); print "processing...\n"; $montage->Display(); } sub warnif { my $value = shift; carp $value if $value; } sub usage { my ( $exit, $message ) = @_; print STDERR $message, "\n" if $exit && $message; my $old_fd = select( $exit == 0 ? \*STDOUT : \*STDERR ); print <<"_USAGE_"; similar-image - Keep|Delete similar looking images similar-image [ -fuzz ] [ -corrupt | -nocorrupt ] < files directories > This program takes the following options ... -f | -fuzz Permitted average deviation in the vector elements; (set value: $fuzz). -c | -corrupt Move corrupt images into this directory; (set value: $corrupt_dir). -nc | -nocorrupt Do not define a corrupted-image directory (so that files are not moved). _USAGE_ select $old_fd; exit( $exit ); } __END__ - Parv --