Date:      Sat, 13 Oct 2012 04:40:23 +0200
From:      "C. P. Ghost" <cpghost@cordula.ws>
To:        Gary Kline <kline@thought.org>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: editing pdf files
Message-ID:  <CADGWnjU6xTEXFBS7v1hqLg15OSN=deop=fH5A28dbDmxLLYiXg@mail.gmail.com>
In-Reply-To: <20121012234628.GA11112@ethic.thought.org>
References:  <5074A6B9.8040209@dreamchaser.org> <5078641D.4050905@passap.ru> <20121012234628.GA11112@ethic.thought.org>

On Sat, Oct 13, 2012 at 1:46 AM, Gary Kline <kline@thought.org> wrote:
> On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
>> 10.10.2012 02:35, Gary Aitken writes:
>>
>> > Can someone give me advice on editing pdf files?
>>
>> Take a look at graphics/inkscape.
>>
>> --
>> WBR, Boris Samorodov (bsam)
>> FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
>
>
>         ive got a question that fits in here.  hopefully.
>
>         last week  I found a book from 1901 that google had scanned and listed
>         as a pdf file.  it was text plus photos of the rich/famous of the
>         1800s.  somehow, google found the exact string that matched my great
>         grandfather [from the civil war].  I d'loaded the file (maybe 2mbytes)
>         and searched using acroread.  nada.  I used the pdftotext utility.
>         same: nothing but  some 600 page numbers.
>
>         my guess is that google just took photos of the book and used other
>         tools to create a pdf file.  I am not =that= serious  about genealogy,
>         but I would like to know if there are any tools to edit this kind of
>         pdf file.

I suspect the following: they scanned the book and put all the images
into the PDF. The PDF itself is merely a container for scanned pages;
it thus contains no text (save for the page numbers).

That Google was able to search in this file is probably due to them running
some OCR program on the image files, and then indexing the (approximate)
text that the OCR program generated. Probably they used something like
tesseract-ocr from ports graphics/tesseract:
  http://code.google.com/p/tesseract-ocr/
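
To make the "OCR the page images, then index the text" idea concrete, here
is a rough Python sketch. It is only my guess at the general approach, not
what Google actually runs. It assumes pdftoppm (from the same poppler tools
as pdftotext) and the tesseract command-line program are installed; the
function names and the "page" file prefix are made up for illustration.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path


def render_cmd(pdf, prefix, dpi=300):
    # pdftoppm renders every page of the PDF to <prefix>-<n>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, out_base, lang="eng"):
    # tesseract writes the recognized text to <out_base>.txt
    return ["tesseract", str(image), str(out_base), "-l", lang]


def build_index(pages):
    """Map each lower-cased word to the set of page numbers it occurs on."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return dict(index)


def page_number(image):
    # "page-012.png" -> 12 (the numeric suffix that pdftoppm appends)
    return int(image.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf, workdir):
    """Render and OCR a scanned PDF; returns {page_number: text}.

    Runs the external tools, so both must be on PATH.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_cmd(pdf, workdir / "page"), check=True)
    pages = {}
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        subprocess.run(ocr_cmd(image, image.with_suffix("")), check=True)
        text_file = image.with_suffix(".txt")
        pages[page_number(image)] = text_file.read_text(
            encoding="utf-8", errors="replace")
    return pages
```

Something like build_index(ocr_pdf("book.pdf", "/tmp/book")) would then give
a word-to-pages lookup. Expect OCR errors on a 1901 typeface, so a name may
only match approximately.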

>         tia guys,
>
>         gary
>
>
> --
>  Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
>               Twenty-six years of service to the Unix community.

-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/


