Date: Sat, 23 Jan 2021 09:40:41 +0100
From: Polytropon <freebsd@edvax.de>
To: Odhiambo Washington <odhiambo@gmail.com>
Cc: User Questions <freebsd-questions@freebsd.org>
Subject: Re: Convert PDF to Excel
Message-ID: <20210123094041.f932fd4c.freebsd@edvax.de>
In-Reply-To: <CAAdA2WP%2BAh6-9pFdB4VJg5asxqHKpEUNOrtxY0TsT9PVpWu26w@mail.gmail.com>
References: <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu%2Bexw@mail.gmail.com> <20210123054209.f03ac420.freebsd@edvax.de> <CAAdA2WP%2BAh6-9pFdB4VJg5asxqHKpEUNOrtxY0TsT9PVpWu26w@mail.gmail.com>

On Sat, 23 Jan 2021 10:36:21 +0300, Odhiambo Washington wrote:
> On Sat, 23 Jan 2021 at 07:42, Polytropon <freebsd@edvax.de> wrote:
>
> > On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> > > I have a situation where I'd like to convert PDF to XLSX.
> > > The documents are 35MB and 105MB but contain several thousand pages.
> > >
> > > Does anyone know a good tool that can handle this?
> >
> > Depends on what is in the PDFs.
> >
> > If this is rendered text, you can maybe extract the text with
> > the tool pdftotext and convert it to CSV, then import the CSV
> > in "Excel".
> >
> > But if it's images of text, use the tool pdfimages to extract the
> > images, and then an OCR tool (maybe tesseract) to obtain the data.
> >
> > It might be worth checking if LibreOffice can open a PDF file and
> > export to (or save as) directly an "Excel"-compatible file, either
> > CSV or one of the binary formats (XLS, XLSX).
> >
> > Restructuring with some sed / awk / perl might be needed, though.
> > Keep in mind those steps can be automated, so if you have lots of
> > PDF files, write a simple shell wrapper that processes all of them,
> > so you get a bunch of result files without further handholding. :-)
> >
>
>
> To make the story short, I need to do some manipulation on the two
> documents in this link:
>
> https://bit.ly/2KEvCwr
>
> I thought they are simple PDFs, but now I am not sure what/how the creators
> did.

They contain text, so the OCR problem is out of the way. Sadly,
the text is re-arranged, so the optimal layout (one line in a
table equals one line of text, with the columns being separated
by whitespace) does not appear. Instead it is the other way
round: one line of text equals one column.

> I just need to count how many duplicate records are in these.

Define "duplicate". :-)

> Any script guru to assist?? :-)

I'd suggest something like this:

	pdftotext <file> | ... | paste | ... | sort | uniq -d | wc -l

This will probably almost do what you need, given sufficient
assumptions... :-)

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
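
One possible way to fill in the "..." parts of the pipeline above, as a
rough sketch only. It assumes that the extracted text holds exactly one
field per line, that every record has the same number of fields, and
that "duplicate" means "identical after the fields are joined". None of
this has been checked against the actual PDFs. The function name
count_duplicates and its argument are made up for illustration.

```shell
# count_duplicates NFIELDS
# Reads one-field-per-line text on stdin, joins every NFIELDS lines
# into one tab-separated record, and prints how many distinct records
# occur more than once.
count_duplicates() {
    nfields=$1

    # Build "- - ... -" with one dash per field; paste then takes
    # NFIELDS consecutive stdin lines for each output line.
    dashes=""
    i=0
    while [ "$i" -lt "$nfields" ]; do
        dashes="$dashes -"
        i=$((i + 1))
    done

    # Drop form feeds (page breaks) and blank lines, join the fields
    # into records, then count the records that sort next to a copy.
    # $dashes is left unquoted on purpose so it splits into arguments.
    tr -d '\f' |
        grep -v '^[[:space:]]*$' |
        paste $dashes |
        sort |
        uniq -d |
        wc -l |
        tr -d ' '
}
```

With a hypothetical file name and three fields per record, this could be
called as "pdftotext input.pdf - | count_duplicates 3" (the trailing "-"
makes pdftotext write to standard output). Note that uniq -d prints each
duplicated record once, so a record that appears five times still counts
as one. If you need the number of surplus copies instead, the counting
step has to change.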