From owner-freebsd-questions@FreeBSD.ORG  Sat Oct 13 20:44:58 2012
Date: Sat, 13 Oct 2012 13:38:16 -0700
From: Gary Kline <kline@thought.org>
To: Polytropon <freebsd@edvax.de>
Subject: Re: editing pdf files
Message-ID: <20121013203816.GD14155@ethic.thought.org>
References: <5074A6B9.8040209@dreamchaser.org> <5078641D.4050905@passap.ru>
 <20121012234628.GA11112@ethic.thought.org>
 <20121013131907.c666bfc2.freebsd@edvax.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121013131907.c666bfc2.freebsd@edvax.de>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: FreeBSD Mailing List <freebsd-questions@freebsd.org>

On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> > 	I've got a question that fits in here, hopefully.
> > 
> > 	last week  I found a book from 1901 that google had scanned and listed
> > 	as a pdf file.  it was text plus photos of the rich/famous of the 
> > 	1800s.  somehow, google found the exact string that matched my great
> > 	grandfather [from the civil war].  I d'loaded the file (maybe 2mbytes)
> > 	and searched using acroread.  nada.  I used the pdftotext utility.
> > 	same: nothing but  some 600 page numbers.
> > 
> > 	my guess is that google just took photos of the book and used other
> > 	tools to create a pdf file.  I am not =that= serious  about genealogy,
> > 	but I would like to know if there are any tools to edit this kind of
> > 	pdf file.
> 
> In case the PDF is nothing more than a compilation of images,
> there's a way to deal with it for editing:


	the images in this book aren't what I am interested in.
	just text.

> 
> step 1: disassemble
> step 2: edit images
> step 3: reassemble
> 
> The disassembling can be done with 
> 
> 	% pdfimages source.pdf .
> 
> Then the files can be edited with whatever tool you like, e.g. Gimp.
> They often come out in PBM format.
> 
> Finally, the images can be converted back to PDF and combined into
> one PDF file:
> 
> 	for IMG in .*.pbm; do
> 		convert "${IMG}" "${IMG}.pdf"
> 	done
> 	pdftk .*.pdf output target.pdf
> 
> Note the ".*" prefix for the file specification: The images extracted
> by pdfimages match that pattern (at least in the case I tested it for).
> If they get other names than .0000001.pbm, change the approach
> accordingly.
> 

	turns out that the first roughly 580 pages are of no interest.
	I'll see if tesseract-ocr can get rid of most of the data.
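
	a minimal sketch of that idea, assuming pdfimages and tesseract
	are both installed: pull the images out of only the wanted page
	range and run OCR on each one.  the function name ocr_range, the
	"page" file prefix and the all.txt output name are made up for
	illustration, and whether tesseract reads PBM directly depends on
	the build, so a conversion step may be needed first.

```shell
#!/bin/sh
# Sketch only: OCR a page range of an image-only PDF.
# Assumes pdfimages and tesseract are on PATH.
# ocr_range, the "page" prefix and all.txt are hypothetical names.

ocr_range() {
    pdf=$1      # input PDF
    first=$2    # first page to extract
    last=$3     # last page to extract
    outdir=$4   # directory for extracted images and OCR text

    mkdir -p "$outdir" || return 1

    # -f / -l limit extraction to the given pages, so the
    # uninteresting leading pages are never written to disk.
    pdfimages -f "$first" -l "$last" "$pdf" "$outdir/page" || return 1

    for img in "$outdir"/page-*; do
        [ -f "$img" ] || continue
        # Skip text output left over from an earlier run.
        case $img in *.txt) continue ;; esac
        # tesseract appends ".txt" to the output base name itself.
        # Assumption: this tesseract build can read the image format
        # pdfimages produced; convert the images first if it cannot.
        tesseract "$img" "${img%.*}" || return 1
    done

    # Join the per-image text in file-name order.
    cat "$outdir"/page-*.txt > "$outdir/all.txt"
}
```

	something like  ocr_range book.pdf 581 600 ocr-out  would then
	leave the recognized text in ocr-out/all.txt (the page numbers
	here are only an example).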

	what fmt works best with the ocr suites?  or are they about the 
	same?  for the section I got in that 1901 book on my g-grandfather,
	it was only about 1.5 pages.  there was no photo, just his name 
	and some bio.  Still, things I had no knowledge of.  I'm sure 
	that my father didn't know either!

	gary

> 
> 
> -- 
> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...

-- 
 Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
              Twenty-six years of service to the Unix community.