From owner-freebsd-questions@FreeBSD.ORG  Sun Jul 23 17:28:04 2006
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
X-Original-To: freebsd-questions@freebsd.org
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 15EAD16A4DD
	for <freebsd-questions@freebsd.org>;
	Sun, 23 Jul 2006 17:28:04 +0000 (UTC) (envelope-from parv@pair.com)
Received: from mta13.adelphia.net (mta13.mail.adelphia.net [68.168.78.44])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 87EF843D46
	for <freebsd-questions@freebsd.org>;
	Sun, 23 Jul 2006 17:28:03 +0000 (GMT) (envelope-from parv@pair.com)
Received: from default.chvlva.adelphia.net ([69.160.66.115])
	by mta13.adelphia.net
	(InterMail vM.6.01.05.02 201-2131-123-102-20050715) with ESMTP id
	<20060723172802.RELQ10992.mta13.adelphia.net@default.chvlva.adelphia.net>;
	Sun, 23 Jul 2006 13:28:02 -0400
Received: by default.chvlva.adelphia.net (Postfix, from userid 1000)
	id 3526EBB54; Sun, 23 Jul 2006 13:28:14 -0400 (EDT)
Date: Sun, 23 Jul 2006 13:28:14 -0400
From: Parv <parv@pair.com>
To: Joshua Lewis <joshua.lewis@familyfunzone.net>
Message-ID: <20060723172814.GB960@holestein.holy.cow>
Mail-Followup-To: Joshua Lewis <joshua.lewis@familyfunzone.net>,
	freebsd-questions@freebsd.org
References: <336A5DA6-5A43-44C0-8961-139C81702AB3@familyfunzone.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <336A5DA6-5A43-44C0-8961-139C81702AB3@familyfunzone.net>
Cc: freebsd-questions@freebsd.org
Subject: Re: Searching a drive and copying files
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Jul 2006 17:28:04 -0000

in message <336A5DA6-5A43-44C0-8961-139C81702AB3@familyfunzone.net>,
wrote Joshua Lewis thusly...
>
> I need to search my drive for all pictures on my system and copy
> them  to a networked system using sftp or ssh or what not. There
> will be  duplicate names on the drive so I was hoping to have dups
> placed in a  separate folder.

Unison, net/unison port, should be able to handle the duplicates
based on file checksum.  (I personally have not used it much, so i
cannot answer any other queried about it; refer to its fine man
page.)


> Due to my for lack of a better term stupidity when I  first got
> my camera I will probably have instances when there will be  three
> or four duplicates.  can help me out with that it  would be great.
...
> My goal is to find all my pictures and compare them then delete
> the  dups that don't look that good. A daunting task as I have 20
> GB of  data. I bet 10 GB are dups.

A checksum-based management of duplicates will help with the files
with identical contents, but not with files that differ even a bit.

Perl program below -- a modified version of Randal Schwartz's
version[0] -- uses md5(1) to identify duplicates (as in identical
files), failing that, Image::Magick based on fuzz factor.  When it
finds duplicates, it asks to enter the item number from the file
list to be deleted.

  [0] Article "Finding similar images",
      http://www.stonehenge.com/merlyn/LinuxMag/col50.html


To be able to run, it needs Image::Magick (graphics/ImageMagick
port), Cache::FileCache (devel/p5-Cache-Cache), List::Util
(lang/p5-Scalar-List-Utils), File::Copy & File::Path.

Mind that it, rather Image::Magick, may consume all of your memory
and/or temporary fs if you run it on all the files at once.

If you are good in Perl, you could modify the program to move the
duplicates in a directory (instead of deleting), and possibly not to
ask to take the particular action (if as you say you would have a
boat load of duplicates).

Without further interruptions, program follows ...

  #!perl

  #  This is a modified version of Randal Schwartz's ...
  #
  #    http://www.stonehenge.com/merlyn/LinuxMag/col50.html
  #
  #  ... as it uses checksum (MD5 for now) to detect identical files, failing that
  #  uses Image::Magick.

  use warnings; use strict;

  $|++;

  use Image::Magick;
  use Cache::FileCache;
  use File::Copy qw( move );
  use File::Path qw( mkpath );
  use List::Util qw( reduce );

  use Carp qw(carp);

  use Getopt::Long qw( :config gnu_compat no_ignore_case no_debug );

  #  User option; permitted average deviation in the vector elements.
  my $fuzz = 15;

  #  User option; if defined, rename corrupt images into this dir.
  my $corrupt_dir = "CORRUPT";
  {
    my $usage;
    GetOptions
    (
      'h|usage|help' => \$usage
    , 'f|fuzz=i' => \$fuzz
    , 'c|corrupt=s' => \$corrupt_dir
    , 'nc|nocorrupt' => sub { undef $corrupt_dir; }
    )
    or usage( 1 );

    usage( 0 ) if $usage;

    #  Check if any arguments remain which will be file names
    usage( 1, "No file(s) or directory(ies) given." ) unless scalar @ARGV;
  }

  sub warnif;

  my $cache = Cache::FileCache->new
    ( {
        namespace => 'image.cache'
      , cache_root => ( glob( "~/log/misc" ) )[ 0 ]
      }
    );

  my @buckets;

  FILE: while ( @ARGV )
  {
    my $file = shift;
    next FILE if -l $file;
    if ( -d $file )
    {
      opendir DIR, $file or next FILE;
      unshift @ARGV, map { m/^\./ ? () : "$file/$_"; } sort readdir DIR;
      next FILE;
    }

    next FILE unless -f _ or -d _;

    my ( @stat ) = stat _ or die "should not happen: $!";

    #  dev/ino/mtime
    my $key = "@stat[ 0, 1, 9 ]";

    my @vector;

    #print "$file ";
    if ( my $data = $cache->get( $key ) )
    {
      #print "... is cached\n";
      @vector = @$data;
    }
    else
    {
      my $image = Image::Magick->new;
      if ( my $x = $image->Read( $file ) )
      {
        if ( defined $corrupt_dir and $x =~ m/corrupt|unexpected end-of-file/i )
        {
          print "$file ";
          print "... renaming into $corrupt_dir\n";

          -d $corrupt_dir
            or mkpath $corrupt_dir, 0, 0700
            or die "Cannot mkpath $corrupt_dir: $!";

          move $file, $corrupt_dir or warn "Cannot rename: $!";
        }
        else
        {
          print "$file ";
          print "... skipping ( $x )\n";
        }
        next FILE;
      }

      #print "is ", join( "x", $image->Get( 'width', 'height' ) ), "\n";
      warnif $image->Normalize();
      warnif $image->Resize( geometry => '4x4!' );
      warnif $image->Set( magick => 'rgb' );
      @vector = unpack "C*", $image->ImageToBlob();
      $cache->set( $key, [ @vector ] );
    }
    BUCKET: for my $bucket ( @buckets )
    {
      my $error = 0;
      INDEX: for my $index ( 0 .. $#vector )
      {
        $error += abs( $bucket->[ 0 ][ $index ] - $vector[ $index ] );
        next BUCKET if $error > $fuzz * @vector;
      }
      push @$bucket, $file;

      #print "linked ", join( ", ", @$bucket[ 1 .. $#$bucket ] ), "\n";
      next FILE;
    }
    push @buckets, [ [ @vector ], $file ];
  }

  #  Connect images only, no interactive process
  #exit;

  for my $bucket ( @buckets )
  {
    my @names = @$bucket;
    shift @names;  # first element is vector
    next unless @names > 1;  # skip unique images
    my $images = Image::Magick->new;
    $images->Read( @names );

    compare_as_text( $images );

    my $sums = collect_md5sum( $images );
    {
      #  Silence warning about single use of $b.
      no warnings 'once';
      compare_as_image( $images )
        unless reduce { $a eq $b ? $a : 0 } @$sums;
    }

    print "Delete? [picture number] ";

    my $img_count = scalar @{ $images };
    my @dead;
    chomp( my $dead = <STDIN> );
    @dead =
        $dead =~ m/^ \s* [*+] $/x  ? ( 1 .. $img_count )
      : $dead =~ m/^ \s* - \d+ $/x ? ( $img_count + $dead + 1 .. $img_count )
      : grep { $_ >= 1 and $_ <= $img_count } $dead =~ /(\d+)/g;

    for ( @dead )
    {
      my $dead_name = $images->[ $_ - 1 ]->Get( 'base-filename' );

      warn "rm $dead_name\n";
      unlink $dead_name or warn "Cannot rm $dead_name: $!";
      warn "\n";
    }
  }

  sub compare_as_text
  {
    my $images = shift;

    my $frmt = "%d: %s\n -- %dx%d  %0.3f kB\n";

    foreach my $img ( 0 .. scalar @$images - 1 )
    {
      printf $frmt , ( $img + 1 ), $images->[ $img ]->Get( 'base-filename' )
      , $images->[ $img ]->Get( 'width' ), $images->[ $img ]->Get( 'height' )
      , ( $images->[ $img ]->Get( 'filesize' ) / 1024 )
      ;
    }
  }

  sub collect_md5sum
  {
    my $images = shift;
    my @md5;
    foreach ( 0 .. scalar @$images - 1 )
    {
      my $name = $images->[ $_ ]->Get( 'base-filename' );
      push @md5, ( split ' ', qx/ md5 $name / )[ 3 ];
    }
    return [ @md5 ];
  }

  sub compare_as_image
  {
    my $images  = shift;
    my $montage =
      $images->Montage
      ( geometry => '370x500' , tile => '2x2' , label => "[%p] %i %wx%h %b" );

    print "processing...\n";
    $montage->Display();
  }

  sub warnif
  {
    my $value = shift;
    carp $value if $value;
  }

  sub usage
  {
    my ( $exit, $message ) = @_;

    print STDERR $message, "\n" if $exit && $message;

    my $old_fd = select( $exit == 0 ? \*STDOUT : \*STDERR );
    print <<"_USAGE_";
  similar-image - Keep|Delete similar looking images

  similar-image [ -fuzz <avg dev> ]
                [ -corrupt <directory> | -nocorrupt ]
                < files directories >

  This program takes the following options ...

    -f | -fuzz        Permitted average deviation in the vector
                      elements; (set value: $fuzz).

    -c | -corrupt     Move corrupt images into this directory; (set
                      value: $corrupt_dir).

    -nc | -nocorrupt  Do not define a corrupted-image directory (so that
                      files are not moved).
  _USAGE_

    select $old_fd;
    exit( $exit );
  }
  __END__


  - Parv

--