Date: Sat, 23 Jan 2021 10:36:21 +0300
From: Odhiambo Washington <odhiambo@gmail.com>
To: Polytropon <freebsd@edvax.de>
Cc: User Questions <freebsd-questions@freebsd.org>
Subject: Re: Convert PDF to Excel
Message-ID: <CAAdA2WP%2BAh6-9pFdB4VJg5asxqHKpEUNOrtxY0TsT9PVpWu26w@mail.gmail.com>
In-Reply-To: <20210123054209.f03ac420.freebsd@edvax.de>
References: <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu%2Bexw@mail.gmail.com> <20210123054209.f03ac420.freebsd@edvax.de>

On Sat, 23 Jan 2021 at 07:42, Polytropon <freebsd@edvax.de> wrote:

> On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> > I have a situation where I'd like to convert PDF to XLSX.
> > The documents are 35MB and 105MB but contain several thousand pages.
> >
> > Does anyone know a good tool that can handle this?
>
> Depends on what is in the PDFs.
>
> If this is rendered text, you can maybe extract the text with
> the tool pdftotext and convert it to CSV, then import the CSV
> in "Excel".
>
> But if it's images of text, use the tool pdfimages to extract the
> images, and then an OCR tool (maybe tesseract) to obtain the data.
>
> It might be worth checking if LibreOffice can open a PDF file and
> export directly to (or save as) an "Excel"-compatible file, either
> CSV or one of the binary formats (XLS, XLSX).
>
> Restructuring with some sed / awk / perl might be needed, though.
> Keep in mind those steps can be automated, so if you have lots of
> PDF files, write a simple shell wrapper that processes all of them,
> so you get a bunch of result files without further handholding. :-)
>

To make the story short, I need to do some manipulation on the two
documents in this link: https://bit.ly/2KEvCwr

I thought they were simple PDFs, but now I am not sure how the creators
produced them. I just need to count how many duplicate records are in these.

Any script guru to assist?? :-)

-- 
Best regards,
Odhiambo WASHINGTON,
Nairobi,KE
+254 7 3200 0004/+254 7 2274 3223
"Oh, the cruft.", grep ^[^#] :-)
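
For the duplicate count, here is a rough sketch of the pdftotext route suggested above, in Python. It assumes the PDFs contain rendered text (not scanned images) and that each record ends up on one line of the extracted text. Neither has been checked against the actual documents. The function names (normalize, count_duplicates, extract_text) are made up for illustration.

```python
import subprocess
from collections import Counter


def normalize(line):
    """Collapse runs of whitespace so layout differences do not hide duplicates."""
    return " ".join(line.split())


def count_duplicates(lines):
    """Return {record: occurrences} for every record seen more than once.

    Assumption: one record per non-empty line of extracted text.
    """
    counts = Counter(normalize(line) for line in lines if line.strip())
    return {record: n for record, n in counts.items() if n > 1}


def extract_text(pdf_path):
    """Run pdftotext (from poppler) on pdf_path and return the extracted lines.

    '-layout' tries to preserve the column layout; '-' writes to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

Something like count_duplicates(extract_text("file.pdf")) would then give the repeated lines and how often each occurs. Repeated page headers and footers will also show up as "duplicates", so they probably need to be filtered out first, and if a record spans several lines the grouping has to be adapted.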