Date: Sat, 13 Oct 2012 13:04:27 -0700
From: Gary Kline <kline@thought.org>
To: "C. P. Ghost" <cpghost@cordula.ws>
Cc: freebsd-questions@freebsd.org
Subject: Re: editing pdf files
Message-ID: <20121013200427.GC14155@ethic.thought.org>
In-Reply-To: <CADGWnjU6xTEXFBS7v1hqLg15OSN=deop=fH5A28dbDmxLLYiXg@mail.gmail.com>
References: <5074A6B9.8040209@dreamchaser.org> <5078641D.4050905@passap.ru> <20121012234628.GA11112@ethic.thought.org> <CADGWnjU6xTEXFBS7v1hqLg15OSN=deop=fH5A28dbDmxLLYiXg@mail.gmail.com>
On Sat, Oct 13, 2012 at 04:40:23AM +0200, C. P. Ghost wrote:
> On Sat, Oct 13, 2012 at 1:46 AM, Gary Kline <kline@thought.org> wrote:
> > On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
> >> 10.10.2012 02:35, Gary Aitken wrote:
> >>
> >> > Can someone give me advice on editing pdf files?
> >>
> >> Take a look at graphics/inkscape.
> >>
> >> --
> >> WBR, Boris Samorodov (bsam)
> >> FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
> >
> >
> > ive got a question that fits in here. hopefully.
> >
> > last week I found a book from 1901 that google had scanned and listed
> > as a pdf file. it was text plus photos of the rich/famous of the
> > 1800s. somehow, google found the exact string that matched my great
> > grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
> > and searched using acroread. nada. I used the pdftotext utility.
> > same: nothing but some 600 page numbers.
> >
> > my guess is that google just took photos of the book and used other
> > tools to create a pdf file. I am not =that= serious about genealogy,
> > but I would like to know if there are any tools to edit this kind of
> > pdf file.
>
> I suspect the following: they scanned the book and put all the images
> into the PDF. The PDF itself is merely a container for scanned pages;
> it thus contains no text (save for the page numbers).
>
> That Google was able to search in this file is probably due to them running
> some OCR program on the image files, and then indexing the (approximate)
> text that the OCR program generated. Probably they used something like
> tesseract-ocr from ports graphics/tesseract:
> http://code.google.com/p/tesseract-ocr/
>

in more recent google stuff--text--sci-tech zines or whatever--it
seems like they have used some very high-end ocr programs and =then=
turned the file into pdf. I have been able to get very good
textfiles from a small sample of google's work. a few years ago I
tried the ocr ports we have. very poor results.

it may be time to see if the newer versions give me better results.

gary

ps: tesseract was one I tried [circa '10] ... time to look at the
actual Code!

>
> -cpghost.
>
> --
> Cordula's Web. http://www.cordula.ws/

--
Gary Kline kline@thought.org http://www.thought.org Public Service Unix
Twenty-six years of service to the Unix community.
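
The workflow guessed at in the quoted reply (treat the PDF as a container of page scans, OCR each image, keep the approximate text) can be sketched roughly as below. This is only an illustration under assumptions, not what Google actually ran: it assumes the pdftoppm tool from poppler is installed to render the pages (that tool is not mentioned in the thread), it uses tesseract as named above, and the function names are made up for this sketch.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """argv for pdftoppm: render each page of `pdf` to <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_command(image: Path, lang: str = "eng") -> list[str]:
    """argv for tesseract: OCR `image`, writing <image without .png>.txt."""
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf: Path, lang: str = "eng") -> str:
    """Render a scanned PDF to page images, OCR them, return the joined text."""
    # Both external tools are assumptions; fail early if either is missing.
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found in PATH")
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_command(pdf, prefix), check=True)
        texts = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            subprocess.run(ocr_command(image, lang), check=True)
            texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    # Separate pages with a form feed, as pdftotext does.
    return "\f".join(texts)
```

Something like ocr_pdf(Path("book.pdf")) would then give text that can be searched with grep, which pdftotext alone cannot produce from an image-only file. How good that text is depends on the scan and on the tesseract version, as noted in the message above.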