From owner-freebsd-questions Fri Jan 5 8:37:14 2001
From owner-freebsd-questions@FreeBSD.ORG Fri Jan 5 08:37:10 2001
Return-Path:
Delivered-To: freebsd-questions@freebsd.org
Received: from lila.inti.gov.ar (lila.inti.gov.ar [200.10.161.32])
	by hub.freebsd.org (Postfix) with ESMTP id 7EAD537B402
	for ; Fri, 5 Jan 2001 08:37:05 -0800 (PST)
Received: from iib005.iib.unsam.edu.ar ([200.3.113.15] helo=mail.inti.gov.ar ident=fernan)
	by lila.inti.gov.ar with smtp (Exim 3.02 #1)
	id 14EZjJ-0005Gg-00; Fri, 05 Jan 2001 13:28:37 -0300
Date: Fri, 5 Jan 2001 13:32:04 -0300
From: Fernan Aguero
To: Robert Badaracco
Cc: freebsd-questions@freebsd.org
Subject: Re: Parsing PDF files...
Message-ID: <20010105133204.S890@iib005.iib.unsam.edu.ar>
Reply-To: fernan@iib.unsam.edu.ar
References: <3A55E9D5.D8F4EE33@typeline.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
In-Reply-To: <3A55E9D5.D8F4EE33@typeline.com>; from rjb@typeline.com on Fri, Jan 05, 2001 at 12:35:49 -0300
X-Mailer: Balsa 1.0.1
Content-Length: 4206
Lines: 121
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Bob,

I don't know if it would be of help (didn't try it myself) but here you
have an excerpt from the ht://dig FAQ (the text contains links to other
sites so you'd better read the FAQ online, see the URL at the bottom).

From the FAQ,
............
4.9. How do I index PDF files?

This too can be done with an external parser or converter, in
combination with the pdftotext program that is part of the xpdf 0.90
package. A sample of such a parser is the contrib/parse_doc.pl Perl
script. It uses pdftotext to parse PDF documents, then processes the
text into external parser records. The most recent version of
parse_doc.pl is available on our web site.
For example, you could put this in your configuration file:

    external_parsers: application/msword /usr/local/bin/parse_doc.pl \
                      application/postscript /usr/local/bin/parse_doc.pl \
                      application/pdf /usr/local/bin/parse_doc.pl

You would also need to configure the script to indicate where all of the
document to text converters are installed.

As of htdig version 3.1.4, you can use an external converter, such as
contrib/conv_doc.pl or the newer and more complete doc2html.pl Perl
script, also available on our web site, instead of an external parser.
These scripts are simpler, and offer more consistent parsing, because
the final work is done by htdig's internal parsers. See the comments
inside these scripts for an example of their usage.

Whether you use this external parser or converter, or acroread with the
pdf_parser attribute, to successfully index PDF files be sure to set the
max_doc_size attribute to a value larger than the size of your largest
PDF file. PDF documents can not be parsed if they are truncated.

This also raises the questions of why two different methods of indexing
PDFs are supported, and which method is preferred. The built-in PDF
support, which uses acroread to convert the PDF to PostScript, was the
first method which was provided. It had a few problems with it: acroread
is not open source, it is not supported on all systems on which ht://Dig
can run, and for some PDFs, the PostScript that acroread generated was
very difficult to parse into indexable text. Also, the built-in PDF
support expected PDF documents to use the same character encoding as is
defined in your current locale, which isn't always the case.

The external parser, which uses pdftotext, was developed to overcome
these problems. xpdf 0.90 is free software, and its pdftotext utility
works very well as an indexing tool. It also converts various PDF
encodings to the Latin 1 set. It is the opinion of the developers that
this is the preferred method.
However, some users still prefer to stick with acroread, as it works
well for them, and is a little easier to set up if you've already
installed Acrobat. Also, pdftotext still has some difficulty handling
text in landscape orientation, even with its new -raw option in 0.90, so
if you need to index such text in PDFs, you may still get better results
with acroread.

See also question 5.2 below and question 1.13 above.
...................

The FAQ is at http://www.htdig.org/FAQ.html

On Fri, 05 Jan 2001 12:35:49 Robert Badaracco wrote:
> Hi,
>
> I'm not sure if this is the correct address to post such a question,
> so here goes...
>
> Does anyone know if there's a C lib that contains routines for parsing
> Adobe PDF files?
> Unlike Postscript file, PDF's are encoded. All I need to do is decode
> and parse the comments.
>
> Thanks,
> Bob
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-questions" in the body of the message
>

-- 
# --------------------------------------------------------- #
#                                     _                     #
#  Fernan Aguero            |        / \                    #
#  Bioinformatics           | ASCII  \ /  against           #
#  IIB-UNSAM                | ribbon  /   HTML              #
#  fernan@iib.unsam.edu.ar  | campaign / \ email            #
#  ICQ 100325972            |        /   \                  #
#                                                           #
# --------------------------------------------------------- #
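
For anyone who only needs the text out of a PDF rather than a full
ht://Dig setup, the pdftotext route described in the FAQ excerpt above
can be sketched in a few lines of Python. This is only an illustration
under assumptions, not the parse_doc.pl logic: the function names
(build_pdftotext_cmd, exceeds_max_doc_size, pdf_to_text) are made up for
this example, pdftotext is assumed to be on the PATH, and the Latin 1
default follows the FAQ's remark about xpdf 0.90 (newer pdftotext builds
may emit a different encoding, so it is left as a parameter).

```python
import subprocess
import sys
from pathlib import Path


def build_pdftotext_cmd(pdf_path, raw=False):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if raw:
        # -raw is the option the FAQ mentions; it may or may not help
        # with unusual layouts.
        cmd.append("-raw")
    cmd += [str(pdf_path), "-"]
    return cmd


def exceeds_max_doc_size(pdf_path, max_doc_size):
    """Return True if the file is larger than max_doc_size bytes.

    Mirrors the FAQ's warning: a PDF truncated at max_doc_size
    cannot be parsed.
    """
    return Path(pdf_path).stat().st_size > max_doc_size


def pdf_to_text(pdf_path, raw=False, encoding="latin-1"):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, raw),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumed output encoding; replace undecodable bytes instead of failing.
    return result.stdout.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Usage: python pdf_text.py some.pdf
    print(pdf_to_text(sys.argv[1]))
```

Keeping the command construction separate from the subprocess call makes
it easy to check what would be run before pdftotext is actually
installed.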