From owner-freebsd-questions@FreeBSD.ORG Wed Dec 3 00:07:46 2008
Date: Tue, 2 Dec 2008 16:07:41 -0800
From: Gary Kline
To: Chris Shenton
Cc: FreeBSD Mailing List
Subject: Re: any way to turn a pdf file into something OCR-able?
Message-ID: <20081203000741.GC63279@thought.org>
References: <20081201231440.GA30682@thought.org> <86ej0qjsb0.fsf@Boqueria.shenton.org>
In-Reply-To: <86ej0qjsb0.fsf@Boqueria.shenton.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.4.2.3i

On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline writes:
>
> > pdftotext fails on the large [32MB] file I've got. Is there any other way I
> > can translate this huge textfile to ascii or html or text?
>
> I wrote some code using Python PDF library 'pypdf' to split a multipage
> PDF scan into individual pages, then used the tesseract OCR to convert
> to text. Not 100% of course, and it really got confused by pages that
> were not right-side-up, but not a bad start for pages that are really
> scans -- images -- rather than PDF representation of text.
>
> Sadly, I haven't gotten it into a suitable state to release.

	Well, that sounds hopeful for when I scan around 200 pages of
	pre-1923 journal articles. These are in columnar form, IIRC.
	It would be WONDERFUL if there were some kind of hardware to
	translate old books and journals automagically.

	gary

-- 
Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org  http://transfinite.thought.org
Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php
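
The split-then-OCR approach Chris describes in the quoted text can be
sketched roughly as below. This is not his unreleased code. It is a minimal
sketch that swaps pypdf for poppler's pdftoppm to rasterize one page at a
time, then hands each page image to the tesseract command-line program. The
function names (page_commands, ocr_pdf), the output directory, the 300 dpi
setting, and the "eng" language code are all assumptions chosen for
illustration, and the caller is assumed to know the page count already.

```python
import shutil
import subprocess
from pathlib import Path


def page_commands(pdf, page, workdir, dpi=300, lang="eng"):
    """Build the two commands for one 1-based page: rasterize, then OCR.

    Hypothetical helper; it only builds argument lists and runs nothing.
    """
    stem = str(Path(workdir) / f"page-{page:04d}")
    # pdftoppm: render only this page to <stem>.png at the given resolution.
    rasterize = [
        "pdftoppm", "-r", str(dpi),
        "-f", str(page), "-l", str(page),
        "-singlefile", "-png",
        str(pdf), stem,
    ]
    # tesseract: read <stem>.png and write the recognized text to <stem>.txt.
    ocr = ["tesseract", stem + ".png", stem, "-l", lang]
    return rasterize, ocr


def ocr_pdf(pdf, num_pages, workdir="ocr-out"):
    """OCR pages 1..num_pages of a scanned PDF and return the joined text."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    Path(workdir).mkdir(parents=True, exist_ok=True)

    texts = []
    for page in range(1, num_pages + 1):
        rasterize, ocr = page_commands(pdf, page, workdir)
        subprocess.run(rasterize, check=True)
        subprocess.run(ocr, check=True)
        # ocr[2] is the output base name; tesseract appends ".txt" to it.
        txt = Path(ocr[2] + ".txt")
        texts.append(txt.read_text(encoding="utf-8", errors="replace"))
    # Form feeds mark page boundaries in the combined text.
    return "\f".join(texts)
```

Working one page at a time keeps memory use flat for a large scan such as
the 32MB file, and a page that OCRs badly (upside-down, say) can be
re-scanned and rerun on its own. Multi-column journal pages may still come
out with the columns interleaved and need checking by hand.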