Date: Tue, 2 Dec 2008 02:07:30 +0100
From: Roland Smith <rsmith@xs4all.nl>
To: Gary Kline <kline@thought.org>
Cc: FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: any way to turn a pdf file into something OCR-able?
Message-ID: <20081202010730.GA15970@slackbox.xs4all.nl>
In-Reply-To: <20081201231440.GA30682@thought.org>
References: <20081201231440.GA30682@thought.org>

On Mon, Dec 01, 2008 at 03:14:43PM -0800, Gary Kline wrote:
> pdftotext fail on the large [32MB] file I've got. Is there any
> other way I can translate this huge textfile to ascii or html or
> text?

Please define "fail" in this context. I've used pdftotext on documents
exceeding 40MB. However, there are of course things that don't work:

1) Some PDFs are just wrappers around JPEG images. In this case there is
   no text for pdftotext to convert => epic fail.
2) If the text contains ligatures etc., you should use an encoding that
   contains such characters (e.g. '-enc UTF-8') or you will lose them.
3) Things like equations will not render well, if at all. This also
   depends on the encoding.

Roland
-- 
R.F.Smith   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)
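
To tell case 1 (image-only PDF) apart from an ordinary conversion, one option is to run pdftotext with '-enc UTF-8' and look at how much text actually comes back. The following is only a rough sketch, not something from the original exchange: the helper names (build_pdftotext_cmd, looks_image_only, extract_text) and the 20-character threshold are assumptions made for illustration, and it presumes pdftotext is installed and on PATH.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, encoding="UTF-8"):
    """Return the argv for pdftotext; '-' as output file means stdout."""
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def looks_image_only(extracted_text, min_chars=20):
    """Heuristic (assumed threshold): almost no visible characters
    suggests the pages are scanned images without a text layer."""
    # str.split() drops all whitespace, including the form feeds
    # that pdftotext emits between pages.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def extract_text(pdf_path):
    """Run pdftotext with UTF-8 output and return the decoded text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    import sys

    text = extract_text(sys.argv[1])
    if looks_image_only(text):
        print("Little or no text layer found; the file probably needs OCR.")
    else:
        print(f"Extracted {len(text)} characters of text.")
```

If the script reports little or no text, the PDF is likely a wrapper around page images and pdftotext has nothing to convert; a non-zero exit status, by contrast, points to a genuine failure worth reporting in detail.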