Date: Sat, 13 Oct 2012 13:38:16 -0700
From: Gary Kline
To: Polytropon
Cc: FreeBSD Mailing List
Subject: Re: editing pdf files
Message-ID: <20121013203816.GD14155@ethic.thought.org>
In-Reply-To: <20121013131907.c666bfc2.freebsd@edvax.de>

On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> >     ive got a question that fits in here.  hopefully.
> >
> >     last week I found a book from 1901 that google had scanned and listed
> >     as a pdf file.  it was text plus photos of the rich/famous of the
> >     1800s.  somehow, google found the exact string that matched my great
> >     grandfather [from the civil war].  I d'loaded the file (maybe 2mbytes)
> >     and searched using acroread.
> >     nada.  I used the pdftotext utility.
> >     same: nothing but some 600 page numbers.
> >
> >     my guess is that google just took photos of the book and used other
> >     tools to create a pdf file.  I am not =that= serious about genealogy,
> >     but I would like to know if there are any tools to edit this kind of
> >     pdf file.
>
> In case the PDF is nothing more than a compilation of images,
> there's a way to deal with it for editing:

        the images in this book aren't what I am interested in.  just text.

>
>       step 1: disassemble
>       step 2: edit images
>       step 3: reassemble
>
> The disassembling can be done with
>
>       % pdfimages source.pdf .
>
> Then the files can be edited with whatever tool you like, e.g. Gimp.
> They often come out in PBM format.
>
> Finally the images can be re-converted to PDF and combined into one
> PDF file:
>
>       for IMG in .*.pbm; do
>               convert "${IMG}" "${IMG}.pdf"
>       done
>       pdftk .*.pdf output target.pdf
>
> Note the ".*" prefix for the file specification: The images extracted
> by pdfimages match that pattern (at least in the case I tested it for).
> If they get other names than .0000001.pbm, change the approach
> accordingly.
>

        turns out that the first roughly 580 pages are of no interest.
        I'll see if tesseract-ocr can get rid of most of the data.  what
        format works best with the ocr suites?  or are they about the same?

        for the section I got in that 1901 book on my g-grandfather, it
        was only about 1.5 pages.  there was no photo, just his name and
        some bio.  Still, things I had no knowledge of.  I'm sure that my
        father didn't know either!

        gary

>
>
> --
> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...

--
 Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
           Twenty-six years of service to the Unix community.
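
[Editor's sketch] Since only a page or two near the end of the book matters, one way to skip the first ~580 pages is to extract just that page range and OCR the resulting images. The following is a minimal sketch, not a tested recipe for this particular PDF. It assumes pdfimages (with its -f/-l first/last page options) and the tesseract command-line tool are installed; the function names, the file name book.pdf, the prefix "page", and the page numbers 581-582 are hypothetical placeholders. Whether tesseract reads the PBM/PPM files directly depends on how it was built; if it does not, converting them to TIFF or PNG first (e.g. with convert) is a common workaround.

```shell
#!/bin/sh
# Sketch: OCR only a page range of an image-only PDF.
# Assumes pdfimages and tesseract are in PATH; all file names are placeholders.

# run CMD ARGS...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ocr_range FIRST LAST PDF PREFIX
# Extract the images of pages FIRST..LAST, then OCR each extracted image.
ocr_range() {
    first=$1 last=$2 pdf=$3 prefix=$4

    # -f / -l restrict extraction to the wanted pages; output files are
    # named PREFIX-NNN.pbm (or .ppm), one per embedded image.
    run pdfimages -f "$first" -l "$last" "$pdf" "$prefix" || return 1

    for img in "$prefix"-*; do
        [ -e "$img" ] || continue       # glob matched nothing
        # tesseract appends ".txt" to the output base name itself.
        run tesseract "$img" "${img%.*}"
    done
}
```

For example, `ocr_range 581 582 book.pdf page` would leave page-000.txt and so on next to the extracted images. Setting DRY_RUN=1 first only prints the commands, which is handy for checking the page range before running OCR on a 600-page scan.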