FreeBSD Mail Archives

Date:      Tue, 2 Dec 2008 16:07:41 -0800
From:      Gary Kline <kline@thought.org>
To:        Chris Shenton <chris@shenton.org>
Cc:        FreeBSD Mailing List <freebsd-questions@FreeBSD.ORG>
Subject:   Re: any way to turn a pdf file into something OCR-able?
Message-ID:  <20081203000741.GC63279@thought.org>
In-Reply-To: <86ej0qjsb0.fsf@Boqueria.shenton.org>
References:  <20081201231440.GA30682@thought.org> <86ej0qjsb0.fsf@Boqueria.shenton.org>

On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline <kline@thought.org> writes:
> 
> > 	pdftotext fails on the large [32MB] file I've got.  Is there any other way I
> > 	can translate this huge file to ASCII or HTML or text?
> 
> I wrote some code using Python PDF library 'pypdf' to split a multipage
> PDF scan into individual pages, then used the tesseract OCR to convert
> to text.  Not 100% of course, and it really got confused by pages that
> were not right-side-up, but not a bad start for pages that are really
> scans -- images -- rather than PDF representation of text. 
> 
> Sadly, I haven't gotten it into a suitable state to release. 
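
For readers who want to try the approach described above, here is a minimal
sketch in Python. It is not Chris's unreleased code. It swaps the pypdf
splitting step for poppler's pdftoppm, which renders one page at a time to a
PNG that tesseract can read. The function names, the "ocr-out" directory, the
300 dpi setting, and the "eng" language are all assumptions chosen for
illustration, and the caller has to supply the page count.

```python
import subprocess
from pathlib import Path


def page_commands(pdf, page, workdir, dpi=300, lang="eng"):
    """Return the (render, ocr) command lines for one 1-based page number.

    Hypothetical helper: assumes pdftoppm (poppler) and tesseract are on PATH.
    """
    base = Path(workdir) / f"page-{page:04d}"
    # -singlefile makes pdftoppm write exactly "<base>.png" for the one page.
    render = ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
              "-f", str(page), "-l", str(page), str(pdf), str(base)]
    # tesseract appends ".txt" to the output base on its own.
    ocr = ["tesseract", f"{base}.png", str(base), "-l", lang]
    return render, ocr


def ocr_pdf(pdf, n_pages, workdir="ocr-out"):
    """Render and OCR pages 1..n_pages, returning the concatenated text."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    chunks = []
    for page in range(1, n_pages + 1):
        render, ocr = page_commands(pdf, page, workdir)
        subprocess.run(render, check=True)
        subprocess.run(ocr, check=True)
        text_file = Path(workdir) / f"page-{page:04d}.txt"
        chunks.append(text_file.read_text(encoding="utf-8"))
    return "\n".join(chunks)
```

Calling ocr_pdf("scan.pdf", 200) would leave one PNG and one text file per
page in the work directory. Going page by page may also help with very large
scans, since only one page image is handled at a time. Pages that are not
right-side-up would still need to be rotated before OCR.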


	Well, that sounds hopeful for when I scan around 200 pages of pre-1923 journal
	articles.  These are in columnar form, IIRC.

	--It would be WONDERFUL if there were some kind of hardware to translate old
	books and journals automagically.

	gary



-- 
 Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
        http://jottings.thought.org   http://transfinite.thought.org
 Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20081203000741.GC63279>
