From owner-freebsd-questions  Wed May 16 12:14:52 2001
Delivered-To: freebsd-questions@freebsd.org
Received: from relay3.inwind.it (relay3.inwind.it [212.141.53.74])
	by hub.freebsd.org (Postfix) with ESMTP id EAF8C37B423
	for <freebsd-questions@FreeBSD.ORG>; Wed, 16 May 2001 12:14:42 -0700 (PDT)
	(envelope-from bartequi@inwind.it)
Received: from bartequi.ottodomain.org (62.98.162.51) by relay3.inwind.it (5.5.029)
        id 3AE401CE00595598 for freebsd-questions@FreeBSD.ORG; Wed, 16 May 2001 21:14:36 +0200
From: Salvo Bartolotta <bartequi@inwind.it>
Date: Wed, 16 May 2001 19:16:55 GMT
Message-ID: <20010516.19165500@bartequi.ottodomain.org>
Subject: Re: Manipulating pdf/ps files -- closer to a solution
To: freebsd-questions@FreeBSD.ORG
References: <20010513.18294500@bartequi.ottodomain.org> <20010515.1075700@bartequi.ottodomain.org>
X-Mailer: SuperCalifragilis
X-Priority: 3 (Normal)
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<

On 5/15/01, 3:07:57 AM, Salvo Bartolotta <bartequi@inwind.it> wrote
regarding Re: Manipulating pdf/ps files:


> >>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<

> On 5/13/01, 8:29:45 PM, Salvo Bartolotta <bartequi@inwind.it> wrote
> regarding Manipulating pdf/ps files:


> > Dear FreeBSD'ers,

> > I would like to perform such operations as the following:

> > -- merge PDF/ps files
> > -- modify PDF/ps files in a more or less "graphical" (read:
> > human-understandable) fashion
> > -- convert PDF/ps files to other formats (eg text).

> > Browsing the archives, I learnt about pdf2ps, ps2pdf, pstotext and
> > psutils (both in the ports). I had also browsed the ports tree as well as
> > the Doc-primer, but I am probably missing something trivial here.

> > I have found some difficulties: eg, psmerge seems not to work on a few ps
> > files, which files I downloaded (originally as PDF files) from a www
> > site. I have reason to believe those files were generated from one main
> > file (containing data arranged in a table) split into several pieces,
> > BTW. I couldn't convert the ps files to txt, either: pstotext generated
> > strings of hashes (the "#" character).


> I meet with problems when trying to convert PDF/ps files containing data
> arranged in a table, each row of data being preceded as well as followed
> by a (continuous) horizontal line like this (the data were probably
> formatted with M$ excel):

> -------------------------------------------
> data data data...
> -------------------------------------------
> data data data...
> -------------------------------------------


> For example, running pdfinfo on one of the files spits out:

> Creator:      Windows NT 4.0
> Producer:     Acrobat Distiller 4.0 for Windows
> CreationDate: 20010511130351
> ModDate:      20010511130351+02'00'
> Pages:        60
> Encrypted:    no
> Linearized:   yes


> I tried xpdf (in the ports), namely pdftotext, but it didn't work.

> Summing up: I can convert those PDF files into ps, the information in the
> ps files IS displayed correctly, but I have managed to convert neither
> the above-mentioned PDF nor ps files into plain text. There is a txt2pdf
> utility on the Net, but I can't seem to find a **working** pdf2txt or
> ps2txt one. BTW, the "clipboard" (ie the mouse middle button) DOES copy
> from Acrobat Reader (running in linux comp. layer) to other text editors
> within X, but it copies (raw) PDF data.


To whomever it may concern,

I keep replying to myself, but I seem to have made some progress.

I had successfully converted the PDF files into ps ones. The reason why
pstotext didn't work is probably that such files (eg a 60-page PDF file)
are **images** (in the preceding example: a collection of 60 images, one
per page, as was pointed out by ImageMagick). Which is also the reason
why pdftotext didn't work, BTW.
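
A rough way to test that guess from the shell, as a sketch only: send
the output of pdftotext (from the xpdf port; "-" as the output file name
writes the text to stdout) through a filter that succeeds only if
something other than whitespace comes out. The function name
has_text_layer and the file name table.pdf are made up for illustration,
and the heuristic would not catch a tool that emits filler such as "#"
characters instead of nothing.

```shell
#!/bin/sh
# has_text_layer: hypothetical helper, reads extracted text on stdin.
# Exit status 0 if any non-whitespace character is present, 1 otherwise.
has_text_layer() {
    # Lines holding only blanks or the page-break form feeds that
    # pdftotext emits do not match, so an image-only PDF yields status 1.
    grep -q '[^[:space:]]'
}

# Usage (assumes pdftotext is installed; table.pdf is a placeholder):
#   if pdftotext table.pdf - | has_text_layer; then
#       echo "text layer found"
#   else
#       echo "no extractable text: probably page images"
#   fi
```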

Since I had to deal with PDF "images", not "text" PDF files... I asked
(wait for it) ImageMagick for help :-)

convert <name_of_PDF_file_of_type_"image">  <name...jpg> DID work, and
created a collection of jpeg images (one per page). Thus, I can convert
PDF "images", or data acquired/manipulated/treated as such --
specifically, a M$ Excel table -- into other image formats.
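
For the record, the conversion above could be wrapped roughly as
follows. This is a sketch under assumptions, not the exact command I
ran: the function name pdf_to_jpegs, the DRY_RUN variable, the file name
table.pdf and the 150 dpi density are all illustrative choices. Only
convert, its -density option and the %03d numbering in the output file
name come from ImageMagick itself.

```shell
#!/bin/sh
# pdf_to_jpegs: hypothetical wrapper around ImageMagick's convert.
# Usage: pdf_to_jpegs input.pdf [output_prefix]
# Set DRY_RUN=1 (an illustrative variable, not an ImageMagick feature)
# to print the command instead of running it.
pdf_to_jpegs() {
    src=$1
    prefix=${2:-${src%.pdf}}   # default prefix: input name minus ".pdf"
    # -density 150 requests 150 dpi rasterization (an arbitrary choice);
    # %03d in the output name numbers the files, one per page.
    ${DRY_RUN:+echo} convert -density 150 "$src" "${prefix}-%03d.jpg"
}

# Usage: pdf_to_jpegs table.pdf
```

A higher density gives larger but sharper page images, which presumably
matters for whatever has to read the text back out of them.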

<aside>pdftoimages does NOT seem to work, however</aside>

<question type="dumb">
AAARGH! I am only missing the last step: how to recover text from eg such
jpeg images; and/or... which image format to choose in order to be able
to extract text from the images.
</question>

<advocacy>
Once again, I would very much like to work under FreeBSD, and NOT make
use of any M$-related product; the negation "NOT" extending from the
coasts of Western Europe to the Pacific coast of the USA -- just to make
sure that M$ is within the scope of negation :-)))
</advocacy>

MTIA,
Salvo (with apologies for the dumb question)
