FreeBSD Mail Archives

Date:      Fri, 5 Jan 2001 13:32:04 -0300
From:      Fernan Aguero <fernan@iib.unsam.edu.ar>
To:        Robert Badaracco <rjb@typeline.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: Parsing PDF files...
Message-ID:  <20010105133204.S890@iib005.iib.unsam.edu.ar>
In-Reply-To: <3A55E9D5.D8F4EE33@typeline.com>; from rjb@typeline.com on Fri, Jan 05, 2001 at 12:35:49 -0300
References:  <3A55E9D5.D8F4EE33@typeline.com>

Bob, I don't know if this will help (I haven't tried it myself), but
here is an excerpt from the ht://Dig FAQ. The text contains links to
other sites, so you'd be better off reading the FAQ online; see the URL
at the bottom.

From the FAQ,

............

4.9. How do I index PDF files?

This too can be done with an external parser or converter, in
combination with the pdftotext program that is part of the xpdf 0.90
package. A sample of such a parser is the contrib/parse_doc.pl Perl
script. It uses pdftotext to parse PDF documents, then processes the
text into external parser records. The most recent version of
parse_doc.pl is available on our web site.

For example, you could put this in your configuration file:

external_parsers: application/msword /usr/local/bin/parse_doc.pl \
                  application/postscript /usr/local/bin/parse_doc.pl \
                  application/pdf /usr/local/bin/parse_doc.pl

You would also need to configure the script to indicate where all of the
document to text converters are installed.

As of htdig version 3.1.4, you can use an external converter, such as
contrib/conv_doc.pl or the newer and more complete doc2html.pl Perl
script, also available on our web site, instead of an external parser.
These scripts are simpler, and offer more consistent parsing, because
the final work is done by htdig's internal parsers. See the comments
inside these scripts for an example of their usage.
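
A minimal sketch of what the converter form of that setting might look
like, assuming conv_doc.pl is installed in /usr/local/bin (the path is
simply carried over from the parser example above). The "->text/html"
target type is my reading of the converter syntax, not something stated
in this excerpt, so check the comments in the script itself before
relying on it:

external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
                  application/postscript->text/html /usr/local/bin/conv_doc.pl \
                  application/pdf->text/html /usr/local/bin/conv_doc.pl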

Whether you use this external parser or converter, or acroread with the
pdf_parser attribute, to successfully index PDF files be sure to set the
max_doc_size attribute to a value larger than the size of your largest
PDF file. PDF documents cannot be parsed if they are truncated.
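
As a rough illustration, the setting might look like this in the
configuration file. The value is in bytes, and the figure here is only
an assumed example, not a recommendation; pick something above the size
of the largest PDF you actually need to index:

max_doc_size: 5000000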

This also raises the questions of why two different methods of indexing
PDFs are supported, and which method is preferred. The built-in PDF
support, which uses acroread to convert the PDF to PostScript, was the
first method which was provided. It had a few problems with it: acroread
is not open source, it is not supported on all systems on which ht://Dig
can run, and for some PDFs, the PostScript that acroread generated was
very difficult to parse into indexable text. Also, the built-in PDF
support expected PDF documents to use the same character encoding as is
defined in your current locale, which isn't always the case. The
external parser, which uses pdftotext, was developed to overcome these
problems. xpdf 0.90 is free software, and its pdftotext utility works
very well as an indexing tool. It also converts various PDF encodings to
the Latin 1 set. It is the opinion of the developers that this is the
preferred method. However, some users still prefer to stick with
acroread, as it works well for them, and is a little easier to set up if
you've already installed Acrobat.

Also, pdftotext still has some difficulty handling text in landscape
orientation, even with its new -raw option in 0.90, so if you need to
index such text in PDFs, you may still get better results with acroread.

See also question 5.2 below and question 1.13 above.

...................

The FAQ is at http://www.htdig.org/FAQ.html

On Fri, 05 Jan 2001 12:35:49 Robert Badaracco wrote:
> Hi,
> 
> I'm not sure if this is the correct address to post such a question,
> so here goes...
> 
> Does anyone know if there's a C lib that contains routines for parsing
> Adobe PDF files? Unlike Postscript file, PDF's are encoded. All I need
> to do is decode and parse the comments.
> 
> Thanks,
> Bob
> 
> 
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-questions" in the body of the message
> 

-- 

# --------------------------------------------------------- #
#                                            _              #
#   Fernan Aguero            |              / \             #
#   Bioinformatics           |       ASCII  \ /  against    #
#   IIB-UNSAM                |      ribbon   /   HTML       #
#   fernan@iib.unsam.edu.ar  |    campaign  / \  email      #
#   ICQ 100325972            |             /   \            #
#                                                           #
# --------------------------------------------------------- #

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-questions" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20010105133204.S890>
