FreeBSD Mail Archives

Date:      Sat, 13 Oct 2012 13:04:27 -0700
From:      Gary Kline <kline@thought.org>
To:        "C. P. Ghost" <cpghost@cordula.ws>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: editing pdf files
Message-ID:  <20121013200427.GC14155@ethic.thought.org>
In-Reply-To: <CADGWnjU6xTEXFBS7v1hqLg15OSN=deop=fH5A28dbDmxLLYiXg@mail.gmail.com>
References:  <5074A6B9.8040209@dreamchaser.org> <5078641D.4050905@passap.ru> <20121012234628.GA11112@ethic.thought.org> <CADGWnjU6xTEXFBS7v1hqLg15OSN=deop=fH5A28dbDmxLLYiXg@mail.gmail.com>



On Sat, Oct 13, 2012 at 04:40:23AM +0200, C. P. Ghost wrote:
> On Sat, Oct 13, 2012 at 1:46 AM, Gary Kline <kline@thought.org> wrote:
> > On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
> >> 10.10.2012 02:35, Gary Aitken wrote:
> >>
> >> > Can someone give me advice on editing pdf files?
> >>
> >> Take a look at graphics/inkscape.
> >>
> >> --
> >> WBR, Boris Samorodov (bsam)
> >> FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
> >
> >
> >         ive got a question that fits in here.  hopefully.
> >
> >         last week  I found a book from 1901 that google had scanned and listed
> >         as a pdf file.  it was text plus photos of the rich/famous of the
> >         1800s.  somehow, google found the exact string that matched my great
> >         grandfather [from the civil war].  I d'loaded the file (maybe 2mbytes)
> >         and searched using acroread.  nada.  I used the pdftotext utility.
> >         same: nothing but  some 600 page numbers.
> >
> >         my guess is that google just took photos of the book and used other
> >         tools to create a pdf file.  I am not =that= serious  about genealogy,
> >         but I would like to know if there are any tools to edit this kind of
> >         pdf file.
> 
> I suspect the following: they scanned the book and put all the images
> into the PDF. The PDF itself is merely a container for scanned pages;
> it thus contains no text (save for the page numbers).
> 
> That Google was able to search in this file is probably due to them running
> some OCR program on the image files, and then indexing the (approximate)
> text that the OCR program generated. Probably they used something like
> tesseract-ocr from ports graphics/tesseract:
>   http://code.google.com/p/tesseract-ocr/
> 

	in more recent google stuff (text, sci-tech zines, or whatever) it
	seems like they have used some very high-end ocr programs and
	=then= turned the file into pdf.  I have been able to get very
	good textfiles from a small sample of google's work.

	a few years ago I tried the ocr ports we have.  very poor results.
	it may be time to see if the newer versions give me better results.

	gary

	ps: tesseract was one I tried [circa '10] ...  time to look at the
	actual Code!
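
	a minimal sketch of the check described above (pdftotext giving
	nothing but page numbers), assuming Python 3 and that pdftotext
	from poppler is on the PATH.  the names extract_text and
	looks_image_only are made up for illustration, and the all-digits
	test is only a heuristic, not a definitive way to detect a
	scanned pdf:

```python
import subprocess


def extract_text(pdf_path):
    """Return the text layer of a PDF by running pdftotext.

    Assumes the pdftotext binary is installed; the trailing "-"
    tells it to write the text to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def looks_image_only(text):
    """Heuristic: True if the extracted text is empty or only page numbers.

    A PDF that is merely a container for scanned page images usually
    yields no text at all, or just one number per page.
    """
    # splitlines() also splits on the form feeds pdftotext puts between pages.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return all(ln.isdigit() for ln in lines)
```

	if looks_image_only(extract_text("book.pdf")) comes back True
	(book.pdf being a placeholder name), the file most likely needs
	an ocr pass before any text search will work.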


					
> 
> -cpghost.
> 
> -- 
> Cordula's Web. http://www.cordula.ws/

-- 
 Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
              Twenty-six years of service to the Unix community.

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20121013200427.GC14155>
