Date:      Sat, 23 Jan 2021 12:58:34 +0300
From:      Odhiambo Washington <odhiambo@gmail.com>
To:        Polytropon <freebsd@edvax.de>
Cc:        User Questions <freebsd-questions@freebsd.org>
Subject:   Re: Convert PDF to Excel
Message-ID:  <CAAdA2WPNvLZ5u5fTGH06unm9vuGR0EjT2mziov-8fPNgUpamqg@mail.gmail.com>
In-Reply-To: <20210123094041.f932fd4c.freebsd@edvax.de>
References:  <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu+exw@mail.gmail.com> <20210123054209.f03ac420.freebsd@edvax.de> <CAAdA2WP+Ah6-9pFdB4VJg5asxqHKpEUNOrtxY0TsT9PVpWu26w@mail.gmail.com> <20210123094041.f932fd4c.freebsd@edvax.de>

On Sat, 23 Jan 2021 at 11:40, Polytropon <freebsd@edvax.de> wrote:

> On Sat, 23 Jan 2021 10:36:21 +0300, Odhiambo Washington wrote:
> > On Sat, 23 Jan 2021 at 07:42, Polytropon <freebsd@edvax.de> wrote:
> >
> > > On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> > > > I have a situation where I'd like to convert PDF to XLSX.
> > > > The documents are 35MB and 105MB but contain several thousand pages.
> > > >
> > > > Does anyone know a good tool that can handle this?
> > >
> > > Depends on what is in the PDFs.
> > >
> > > If this is rendered text, you can maybe extract the text with
> > > the tool pdftotext and convert it to CSV, then import the CSV
> > > into "Excel".
> > >
> > > But if it's images of text, use the tool pdfimages to extract the
> > > images, and then an OCR tool (maybe tesseract) to obtain the data.
> > >
> > > It might be worth checking if LibreOffice can open a PDF file and
> > > export (or save as) directly to an "Excel"-compatible file, either
> > > CSV or one of the binary formats (XLS, XLSX).
> > >
> > > Restructuring with some sed / awk / perl might be needed, though.
> > > Keep in mind those steps can be automated, so if you have lots of
> > > PDF files, write a simple shell wrapper that processes all of them,
> > > so you get a bunch of result files without further handholding. :-)
> > >
> > >
> > To make the story short, I need to do some manipulation on the two
> > documents in this link:
> >
> > https://bit.ly/2KEvCwr
> >
> > I thought they were simple PDFs, but now I am not sure what the
> > creators did or how they made them.
>
> They contain text, so the OCR problem is out of the way.
> Sadly, the text is re-arranged so the optimal solution (one
> line in a table equals one line of text, with the columns
> being separated by whitespace) does not appear, instead it
> is the other way round: one line equals one column.
>
>
>
> > I just need to count how many duplicate records are in these.
>
> Define "duplicate". :-)
>

There are hundreds, possibly thousands, of duplicate records in those files.
By duplicate, I mean records whose columns are all the same except one:
the name. A slight change was made deliberately in column 2 to make the
two records look different.
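
Something like this could count them once the text has been reshaped
into one tab-separated record per line; the sample data and the idea of
cutting out column 2 (the name) before comparing are my assumptions:

```shell
# Hypothetical: three records, where the first two differ only in
# column 2 (the name). Dropping that column exposes the duplicate.
printf '1\tAlice\t100\tKE\n1\tAlicia\t100\tKE\n2\tBob\t200\tUG\n' > records.tsv
cut -f1,3- records.tsv | sort | uniq -d | wc -l    # -> 1
```

Note that uniq -d prints each repeated line once, so this counts the
number of distinct duplicated records, not the total number of extra
copies.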


> > Any script guru to assist?? :-)
>
> I'd suggest something like this:
>
>         pdftotext <file> | ... | paste | ... | sort | uniq -d | wc -l
>
> This will probably almost do what you need, given
> sufficient assumptions... :-)


Yeah. That really helps :-)

I know there are great guys on this list who can do this in a few minutes
because they've mastered scripting and use it efficiently :-)
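
For what it's worth, here is a concrete variant of that suggested
pipeline, under the assumption that pdftotext emits one column value
per line and each record spans exactly four lines (the real column
count would have to come from the PDFs themselves):

```shell
# Hypothetical: pdftotext prints one column value per line, so
# 'paste - - - -' folds every four lines into one tab-separated record:
#   pdftotext file.pdf - | paste - - - - | sort | uniq -d | wc -l
# The same folding, demonstrated on stand-in data:
printf 'a\nb\nc\nd\na\nb\nc\nd\n' | paste - - - - | sort | uniq -d | wc -l   # -> 1
```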




-- 
Best regards,
Odhiambo WASHINGTON,
Nairobi,KE
+254 7 3200 0004/+254 7 2274 3223
"Oh, the cruft.", grep ^[^#] :-)


