Date: Tue, 2 Dec 2008 16:07:41 -0800
From: Gary Kline <kline@thought.org>
To: Chris Shenton <chris@shenton.org>
Cc: FreeBSD Mailing List <freebsd-questions@FreeBSD.ORG>
Subject: Re: any way to turn a pdf file into something OCR-able?
Message-ID: <20081203000741.GC63279@thought.org>
In-Reply-To: <86ej0qjsb0.fsf@Boqueria.shenton.org>
References: <20081201231440.GA30682@thought.org> <86ej0qjsb0.fsf@Boqueria.shenton.org>

On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline <kline@thought.org> writes:
>
> > pdftotext fails on the large [32MB] file I've got.  Is there any other way I
> > can translate this huge textfile to ascii or html or text?
>
> I wrote some code using Python PDF library 'pypdf' to split a multipage
> PDF scan into individual pages, then used the tesseract OCR to convert
> to text.  Not 100% of course, and it really got confused by pages that
> were not right-side-up, but not a bad start for pages that are really
> scans -- images -- rather than PDF representation of text.
>
> Sadly, I haven't gotten it into a suitable state to release.

Well, sounds hopeful for when I scan around 200 pages of pre-1923
journal articles.  These are in columnar form, IIRC.  It would be
WONDERFUL if there were some kind of hardware to translate old books
and journals automagically.

gary

-- 
Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org   http://transfinite.thought.org
Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php

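
For anyone trying the split-then-OCR approach described in the quoted text, here is a rough sketch of how it might look. It is not Chris's unreleased code. It assumes a current pypdf release (PdfReader and page.images, which needs Pillow installed) and a tesseract binary on the PATH that accepts "stdout" as its output base. The helper names, the "pages" directory, and the file naming scheme are made up for illustration, and some embedded image formats may still need converting before tesseract will read them.

```python
import subprocess
from pathlib import Path


def page_image_name(pdf_path, page_number, image_name):
    """Build a per-page file name, e.g. 'scan-p0003-Im0.png' (hypothetical scheme)."""
    stem = Path(pdf_path).stem
    return f"{stem}-p{page_number:04d}-{Path(image_name).name}"


def tesseract_command(image_path, lang="eng"):
    """Command line asking tesseract to print the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def extract_page_images(pdf_path, out_dir):
    """Write every image embedded in every page to out_dir, in page order."""
    # Third-party dependency, imported lazily: pip install pypdf pillow
    from pypdf import PdfReader

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    reader = PdfReader(str(pdf_path))
    for number, page in enumerate(reader.pages, start=1):
        # A scanned page is usually one big embedded image.
        for image in page.images:
            target = out_dir / page_image_name(pdf_path, number, image.name)
            target.write_bytes(image.data)
            written.append(target)
    return written


def ocr_pdf(pdf_path, out_dir="pages", lang="eng"):
    """Run tesseract over each extracted page image and join the text."""
    texts = []
    for image_path in extract_page_images(pdf_path, out_dir):
        result = subprocess.run(
            tesseract_command(image_path, lang),
            capture_output=True, text=True, check=True,
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

Calling ocr_pdf("journal.pdf") would then return the text of the whole scan as one string. Nothing here deals with upside-down pages or multi-column layouts, which are the cases mentioned above as troublesome.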
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20081203000741.GC63279>