Date:      Tue, 9 Jun 2009 11:18:49 +0200
From:      Polytropon <freebsd@edvax.de>
To:        Daniel Underwood <djuatdelta@gmail.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: PDF inventory software
Message-ID:  <20090609111849.f0d38651.freebsd@edvax.de>
In-Reply-To: <b6c05a470906082011i75fe455cg97d237b2bb9b47a8@mail.gmail.com>
References:  <b6c05a470906081417x370edb66yb86fac71b462eab8@mail.gmail.com> <20090609023702.EF4D2BED8@kev.msw.wpafb.af.mil> <b6c05a470906082011i75fe455cg97d237b2bb9b47a8@mail.gmail.com>

On Mon, 8 Jun 2009 23:11:50 -0400, Daniel Underwood <djuatdelta@gmail.com> wrote:
> Since all the PDFs contain text (none are scanned "images"), can I
> simply use some command like grep to search for text within the
> collection?  If so, how would I do this?  Can grep read text from
> within PDFs?

Not reliably, because PDF is largely a binary format, and the text
in it is usually stored in compressed streams that grep cannot read.
There are two ways around this. The first is the "strings" program,
which isolates printable strings from binary files. The second - the
much better way - is to use "pdftotext" to turn the PDF files into
plain text, which grep can then search. You could write a kind of
"pdfgrep" tool that acts as a wrapper around pdftotext, grep, and
your PDF file collection.
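
A rough sketch of such a wrapper, in Python rather than shell, could
look like the following. It assumes pdftotext is installed and in
your PATH; the script name "pdfgrep.py" and the function names are
only made up for illustration, and Python's re module stands in for
grep here.

```python
#!/usr/bin/env python3
"""Hypothetical "pdfgrep" sketch: run pdftotext, then search the text."""
import re
import subprocess
import sys


def pdf_to_text(path):
    # "pdftotext FILE -" writes the extracted text to standard output.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def grep_text(text, pattern):
    # Return (line_number, line) pairs for every line matching the regex.
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


def main(argv):
    if len(argv) < 3:
        print("usage: pdfgrep.py PATTERN FILE.pdf [FILE.pdf ...]",
              file=sys.stderr)
        return 2
    pattern, paths = argv[1], argv[2:]
    found = False
    for path in paths:
        try:
            text = pdf_to_text(path)
        except (OSError, subprocess.CalledProcessError) as error:
            # Unreadable or damaged PDF: report it and keep going.
            print(f"{path}: {error}", file=sys.stderr)
            continue
        for number, line in grep_text(text, pattern):
            found = True
            print(f"{path}:{number}:{line}")
    # Same convention as grep: 0 if something matched, 1 otherwise.
    return 0 if found else 1


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Called as "python3 pdfgrep.py 'some phrase' *.pdf", it would print
matches as file:line:text, similar to grep -n. The line numbers refer
to the extracted text, not to anything visible in the PDF itself.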

In any case, it would surely help if your files have meaningful
filenames, so they can easily be identified.


-- 
Polytropon
From Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...