From owner-freebsd-questions@FreeBSD.ORG  Wed Dec  3 00:07:46 2008
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 16883106564A
	for <freebsd-questions@FreeBSD.ORG>;
	Wed,  3 Dec 2008 00:07:46 +0000 (UTC)
	(envelope-from kline@thought.org)
Received: from aristotle.thought.org (aristotle.thought.org [209.180.213.210])
	by mx1.freebsd.org (Postfix) with ESMTP id C86468FC18
	for <freebsd-questions@FreeBSD.ORG>;
	Wed,  3 Dec 2008 00:07:45 +0000 (UTC)
	(envelope-from kline@thought.org)
Received: from thought.org (tao.thought.org [10.47.0.250])
	(authenticated bits=0)
	by aristotle.thought.org (8.14.2/8.14.2) with ESMTP id mB308ETx067784; 
	Tue, 2 Dec 2008 16:08:14 -0800 (PST)
	(envelope-from kline@thought.org)
Received: by thought.org (nbSMTP-1.00) for uid 1002
	kline@thought.org; Tue,  2 Dec 2008 16:07:41 -0800 (PST)
Date: Tue, 2 Dec 2008 16:07:41 -0800
From: Gary Kline <kline@thought.org>
To: Chris Shenton <chris@shenton.org>
Message-ID: <20081203000741.GC63279@thought.org>
References: <20081201231440.GA30682@thought.org>
	<86ej0qjsb0.fsf@Boqueria.shenton.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <86ej0qjsb0.fsf@Boqueria.shenton.org>
User-Agent: Mutt/1.4.2.3i
X-Organization: Thought Unlimited. Public service Unix since 1986.
X-Of_Interest: With 22 years  of service to the Unix community.
Cc: FreeBSD Mailing List <freebsd-questions@FreeBSD.ORG>
Subject: Re: any way to turn a pdf file into something OCR-able?

On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline <kline@thought.org> writes:
> 
> > 	pdftotext fails on the large [32MB] file I've got.  Is there any other way I
> > 	can translate this huge textfile to ascii or html or text?
> 
> I wrote some code using Python PDF library 'pypdf' to split a multipage
> PDF scan into individual pages, then used the tesseract OCR to convert
> to text.  Not 100% of course, and it really got confused by pages that
> were not right-side-up, but not a bad start for pages that are really
> scans -- images -- rather than PDF representation of text. 
> 
> Sadly, I haven't gotten it into a suitable state to release. 
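
[Illustration only: this is not Chris's unreleased code.  A minimal sketch
of the split-then-OCR idea described above, assuming the current pypdf API
(PdfReader / PdfWriter) and that Ghostscript ("gs") and "tesseract" are on
the PATH.  All function names and the file-naming scheme are made up for
this example.  Because tesseract reads images rather than PDFs, each
one-page PDF is first rendered to a grayscale PNG.]

```python
import subprocess
from pathlib import Path


def page_path(out_dir, stem, index):
    """One-page PDF name for 0-based page `index`; zero-padded so files sort in page order."""
    return Path(out_dir) / f"{stem}-p{index + 1:04d}.pdf"


def split_pdf(src, out_dir):
    """Write every page of `src` to its own one-page PDF and return the paths."""
    from pypdf import PdfReader, PdfWriter  # third-party, imported only when needed

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, page in enumerate(PdfReader(str(src)).pages):
        writer = PdfWriter()
        writer.add_page(page)
        target = page_path(out_dir, Path(src).stem, i)
        with open(target, "wb") as fh:
            writer.write(fh)
        paths.append(target)
    return paths


def rasterize_command(page_pdf, image, dpi=300):
    """Ghostscript command that renders a one-page PDF to a grayscale PNG."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pnggray",
            f"-r{dpi}", f"-sOutputFile={image}", str(page_pdf)]


def tesseract_command(image, lang="eng"):
    """tesseract command for one image; output goes to <image without suffix>.txt."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(src, out_dir, lang="eng"):
    """Split, rasterize, and OCR `src` page by page; return the text, pages joined by form feeds."""
    texts = []
    for page_pdf in split_pdf(src, out_dir):
        image = page_pdf.with_suffix(".png")
        subprocess.run(rasterize_command(page_pdf, image), check=True)
        subprocess.run(tesseract_command(image, lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n\f\n".join(texts)
```

[Working one page at a time means no single step has to handle the whole
file at once, which may help with a scan as large as the 32MB one mentioned
above.  Something like ocr_pdf("journal.pdf", "pages") would be the entry
point; both names are placeholders.]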


	Well, that sounds hopeful for when I scan around 200 pages of pre-1923
	journal articles.  These are in columnar form, IIRC.

	--Be WONDERFUL if there were some kind of hardware to translate old books
	and journals automagically.  ...

	gary


-- 
 Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
        http://jottings.thought.org   http://transfinite.thought.org
 Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php