From: Roland Smith
To: Robert Huff
Cc: FreeBSD Mailing List
Date: Tue, 2 Dec 2008 19:56:19 +0100
Subject: Re: any way to turn a pdf file into something OCR-able?

On Mon, Dec 01, 2008 at 08:23:09PM -0500, Robert Huff wrote:
>
> Roland Smith writes:
>
> > > pdftotext fail on the large [32MB] file I've got. Is there any
> > > other way I can translate this huge textfile to ascii or html or
> > > text?
> >
> > Please define "fail" in this context? I've used pdftotext on
> > documents exceeding 40MB. However there are of course things that
> > don't work;
> >
> > 1) Some PDFs are just wrappers around JPEG images. In this case
> > there is no text for pdftotext to convert => epic fail.
>
> In this case "convert" from the ImageMagick port will get you a
> series of .jpg/.gif/.. Read the manual carefully before
> attempting; also note this can be a slow process.

That still doesn't give you plain text; it only gives you images. To get
text out of those you need an OCR application. There is a new one
available in ports called cuneiform. It is supposed to be quite good,
but I haven't had the need to try it yet.

I've tried gocr and tesseract in the past but was not really impressed
with them. For short documents it's easier to do the OCR with the Mk I
eyeball & brain. :-) You'll have to proofread an OCR-ed document
completely for errors anyway.
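
For what it's worth, the whole chain (try pdftotext first, and only if nothing usable comes out render the pages to images and run OCR on them) could be scripted roughly as below. This is a minimal sketch, not something I have run on the 32MB file in question. The assumptions are mine: Python as the glue language, tesseract as the OCR step (cuneiform or gocr could be swapped in), ImageMagick still being installed under the name "convert", and the helper names (has_text_layer, pdf_to_text and so on), the 20-character threshold, and the 300 dpi default are all made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): an image-only PDF yields almost no letters/digits."""
    return sum(1 for c in extracted if c.isalnum()) >= min_chars


def pdftotext_cmd(pdf: str) -> list[str]:
    # "-" as the output name makes pdftotext write to standard output.
    return ["pdftotext", pdf, "-"]


def rasterize_cmd(pdf: str, out_dir: str, dpi: int = 300) -> list[str]:
    # ImageMagick: -density sets the rendering resolution; %04d numbers the pages.
    return ["convert", "-density", str(dpi), pdf, str(Path(out_dir) / "page-%04d.png")]


def ocr_cmd(image: str, out_base: str) -> list[str]:
    # tesseract writes its result to <out_base>.txt
    return ["tesseract", image, out_base]


def pdf_to_text(pdf: str, work_dir: str) -> str:
    """Use the PDF's own text if there is any, otherwise rasterize and OCR."""
    for tool in ("pdftotext", "convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found in PATH")
    text = subprocess.run(
        pdftotext_cmd(pdf), capture_output=True, text=True, check=True
    ).stdout
    if has_text_layer(text):
        return text
    # No usable text layer: render every page to an image, then OCR each one.
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, work_dir), check=True)
    pages = []
    for image in sorted(Path(work_dir).glob("page-*.png")):
        base = image.with_suffix("")
        subprocess.run(ocr_cmd(str(image), str(base)), check=True)
        pages.append(Path(f"{base}.txt").read_text(errors="replace"))
    return "\n".join(pages)
```

Calling pdf_to_text("scan.pdf", "/tmp/ocr-pages") (both names hypothetical) would return the text as one string. Rendering a large PDF at 300 dpi is slow and needs a lot of disk space, and the result still has to be checked by hand.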

Roland
-- 
R.F.Smith                                   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)