Date: Sat, 23 Jan 2021 12:58:34 +0300
From: Odhiambo Washington <odhiambo@gmail.com>
To: Polytropon <freebsd@edvax.de>
Cc: User Questions <freebsd-questions@freebsd.org>
Subject: Re: Convert PDF to Excel
Message-ID: <CAAdA2WPNvLZ5u5fTGH06unm9vuGR0EjT2mziov-8fPNgUpamqg@mail.gmail.com>
In-Reply-To: <20210123094041.f932fd4c.freebsd@edvax.de>
References: <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu+exw@mail.gmail.com>
 <20210123054209.f03ac420.freebsd@edvax.de>
 <CAAdA2WP+Ah6-9pFdB4VJg5asxqHKpEUNOrtxY0TsT9PVpWu26w@mail.gmail.com>
 <20210123094041.f932fd4c.freebsd@edvax.de>
On Sat, 23 Jan 2021 at 11:40, Polytropon <freebsd@edvax.de> wrote:

> On Sat, 23 Jan 2021 10:36:21 +0300, Odhiambo Washington wrote:
> > On Sat, 23 Jan 2021 at 07:42, Polytropon <freebsd@edvax.de> wrote:
> >
> > > On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> > > > I have a situation where I'd like to convert PDF to XLSX.
> > > > The documents are 35MB and 105MB but contain several thousand
> > > > pages.
> > > >
> > > > Does anyone know a good tool that can handle this?
> > >
> > > Depends on what is in the PDFs.
> > >
> > > If this is rendered text, you can maybe extract the text with
> > > the tool pdftotext and convert it to CSV, then import the CSV
> > > into "Excel".
> > >
> > > But if it's images of text, use the tool pdfimages to extract the
> > > images, and then an OCR tool (maybe tesseract) to obtain the data.
> > >
> > > It might be worth checking if LibreOffice can open a PDF file and
> > > export to (or save as) an "Excel"-compatible file directly, either
> > > CSV or one of the binary formats (XLS, XLSX).
> > >
> > > Restructuring with some sed / awk / perl might be needed, though.
> > > Keep in mind those steps can be automated, so if you have lots of
> > > PDF files, write a simple shell wrapper that processes all of them,
> > > so you get a bunch of result files without further handholding. :-)
> >
> > To make the story short, I need to do some manipulation on the two
> > documents in this link:
> >
> > https://bit.ly/2KEvCwr
> >
> > I thought they were simple PDFs, but now I am not sure what/how the
> > creators did it.
>
> They contain text, so the OCR problem is out of the way.
> Sadly, the text is re-arranged so the optimal solution (one
> line in a table equals one line of text, with the columns
> being separated by whitespace) does not appear; instead it
> is the other way round: one line equals one column.
>
> > I just need to count how many duplicate records are in these.
>
> Define "duplicate". :-)

There are hundreds, possibly thousands, of duplicate records in those
files. By "duplicate", I mean records with almost all columns identical
except one - the name. Only a slight change in column 2, deliberately
made so that the two records look different.

> > Any script guru to assist?? :-)
>
> I'd suggest something like this:
>
>   pdftotext <file> | ... | paste | ... | sort | uniq -d | wc -l
>
> This will probably almost do what you need, given
> sufficient assumptions... :-)

Yeah. That really helps :-)

I know there are great guys on this list who can do this in a few
minutes, because they've mastered scripts and their efficient usage :-)

--
Best regards,
Odhiambo WASHINGTON,
Nairobi, KE
+254 7 3200 0004 / +254 7 2274 3223
"Oh, the cruft.", grep ^[^#] :-)
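To make the suggested pipeline concrete, here is a minimal sketch of
how its elided middle stages might look. It assumes pdftotext emits
each record as a fixed number of consecutive lines (five here, one
line per column, matching Polytropon's observation) and that the
name that varies between duplicates becomes field 2 after re-joining.
Both the record length and the field numbers are hypothetical and
would have to be checked against the real pdftotext output first:

    #!/bin/sh
    # Count duplicate records in a PDF, ignoring the name column.
    # ASSUMPTIONS (verify against the actual output before trusting):
    #   - pdftotext prints one column per line, 5 lines per record
    #   - after re-joining, the deliberately changed name is field 2
    pdftotext "$1" - |      # extract text to stdout ("-")
      paste - - - - - |     # glue every 5 lines into one tab-separated row
      cut -f 1,3-5 |        # drop field 2 (the varying name)
      sort |                # group identical rows together
      uniq -d |             # keep one copy of each row that repeats
      wc -l                 # number of distinct duplicated records

Run as e.g. "sh countdups.sh report.pdf" (script and file names are
placeholders). Replacing "uniq -d" with "uniq -dc" would additionally
show how many times each duplicated record occurs.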