From: Polytropon
To: Daniel Underwood
Cc: freebsd-questions@freebsd.org
Date: Tue, 9 Jun 2009 11:18:49 +0200
Subject: Re: PDF inventory software
Message-Id: <20090609111849.f0d38651.freebsd@edvax.de>

On Mon, 8 Jun 2009 23:11:50 -0400, Daniel Underwood wrote:
> Since all the PDFs contain text (none are scanned "images"), can I
> simply use some command like grep to search for text within the
> collection? If so, how would I do this? Can grep read text from
> within PDFs?

Not directly, I think: PDF is a binary format, and the text inside is usually stored compressed, so grep on the raw files will find little. There are two ways.
The first is to use the "strings" program, which extracts printable strings from binary files; it tends to miss text that the PDF stores compressed. The second - the much better way - is to use "pdftotext" to convert the PDF files into plain text, which is then greppable.

You could write a kind of "pdfgrep" tool that acts as a wrapper around pdftotext, grep, and your PDF file collection.

In any case, it would surely help if your files have meaningful filenames, so they can easily be identified.

-- 
Polytropon
From Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
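
The "pdfgrep" wrapper idea from the message could be sketched roughly like this. It is only a minimal illustration, not an existing tool: the function names `pdftotext_extract` and `grep_pdfs` are made up for this example, Python's `re` module stands in for grep, and it assumes the `pdftotext` program is installed and in PATH (called as `pdftotext FILE -`, which writes the text to standard output).

```python
import re
import subprocess
import sys


def pdftotext_extract(path):
    """Return the text of one PDF by running `pdftotext <path> -`.

    Assumption: pdftotext is installed; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def grep_pdfs(paths, pattern, extract=pdftotext_extract, ignore_case=False):
    """Yield (path, line_number, line) for each extracted line matching pattern.

    `extract` is a hypothetical hook: any callable mapping a path to text,
    so the search logic can be tried without real PDF files.
    """
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    for path in paths:
        try:
            text = extract(path)
        except (OSError, subprocess.CalledProcessError) as exc:
            # Unreadable file or missing pdftotext: report it and move on.
            print(f"{path}: skipped ({exc})", file=sys.stderr)
            continue
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path, number, line
```

To search a directory of PDFs you might loop over `grep_pdfs(glob.glob("*.pdf"), "inventory", ignore_case=True)` and print each hit as `path:line_number:line`, much like grep -n does. Line numbers here refer to the extracted text, not to pages in the PDF.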