Date:      Tue, 2 Dec 2008 19:56:19 +0100
From:      Roland Smith <rsmith@xs4all.nl>
To:        Robert Huff <roberthuff@rcn.com>
Cc:        FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject:   Re: any way to turn a pdf file into something OCR-able?
Message-ID:  <20081202185619.GA44591@slackbox.xs4all.nl>
In-Reply-To: <18740.36349.523718.591189@jerusalem.litteratus.org>
References:  <20081201231440.GA30682@thought.org> <20081202010730.GA15970@slackbox.xs4all.nl> <18740.36349.523718.591189@jerusalem.litteratus.org>

On Mon, Dec 01, 2008 at 08:23:09PM -0500, Robert Huff wrote:
>
> Roland Smith writes:
>
> >  > 	pdftotext fail on the large [32MB] file I've got.  Is there any
> >  > 	other way I can translate this huge textfile to ascii or html or
> >  > 	text?
> >
>
> >  Please define "fail" in this context? I've used pdftotext on
> >  documents exceeding 40MB. However there are of course things that
> >  don't work;
> >
> >  1) Some PDFs are just wrappers around JPEG images. In this case
> >  there is no text for pdftotext to convert => epic fail.
>
> 	In this case "convert" from the ImageMagick port will get you a
> series of .jpg/.gif/.<whatever>.  Read the manual carefully before
> attempting; also note this can be a slow process.

That still doesn't give plain text. To get text out of the images one
would need an OCR app.

There is a new one available in ports called cuneiform. It is supposed
to be quite good, but I haven't had the need to try it yet.

I've tried gocr and tesseract in the past but was not really impressed
with them. For short documents it's easier to do the OCR with the Mk I
eyeball & brain. :-) You'll have to completely check an OCR-ed document
for errors anyway.
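
The steps discussed in this thread (try pdftotext first, and if the PDF
turns out to be a wrapper around images, render the pages with
ImageMagick's "convert" and feed them to an OCR program) could be strung
together roughly as below. This is only a sketch under assumptions: it
uses tesseract as the OCR step purely as an example, it assumes a
tesseract build that reads PNG directly, and the function names, the
20-character threshold, the 300 dpi setting and the page-%05d.png naming
are all made up for illustration. None of it has been tuned for a 32MB
document, where the rendering step in particular can be slow.

```python
import subprocess
from pathlib import Path


def has_extractable_text(text, min_chars=20):
    """Return True if pdftotext output holds a meaningful amount of text.

    Image-only PDFs tend to yield little more than whitespace and form
    feeds. min_chars is an arbitrary threshold chosen for this sketch.
    """
    visible = "".join(text.split())  # drop all whitespace, incl. form feeds
    return len(visible) >= min_chars


def render_command(pdf, out_dir, dpi=300):
    """Build the ImageMagick command that renders each page to a PNG."""
    pattern = str(Path(out_dir) / "page-%05d.png")
    return ["convert", "-density", str(dpi), str(pdf), pattern]


def ocr_command(image):
    """Build the tesseract command; output goes to <image stem>.txt."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix(""))]


def pdf_to_text(pdf, out_dir):
    """Try pdftotext; fall back to rendering pages and running OCR."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],  # "-" sends the text to stdout
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    if has_extractable_text(result.stdout):
        return result.stdout

    # No usable text layer: render the pages, then OCR them one by one.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf, out_dir), check=True)
    pages = []
    for image in sorted(Path(out_dir).glob("page-*.png")):
        subprocess.run(ocr_command(image), check=True)
        text_file = image.with_suffix(".txt")
        pages.append(text_file.read_text(encoding="utf-8", errors="replace"))
    return "\n".join(pages)
```

Calling pdf_to_text("big.pdf", "ocr-pages") (both names are
placeholders) returns the text as one string. The OCR-ed result still
needs to be checked for errors by hand.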

Roland
-- 
R.F.Smith                                   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)



