Date: Sat, 23 Jan 2021 05:42:09 +0100
From: Polytropon <freebsd@edvax.de>
To: Odhiambo Washington <odhiambo@gmail.com>
Cc: User Questions <freebsd-questions@freebsd.org>
Subject: Re: Convert PDF to Excel
Message-ID: <20210123054209.f03ac420.freebsd@edvax.de>
In-Reply-To: <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu%2Bexw@mail.gmail.com>
References: <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu%2Bexw@mail.gmail.com>
On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> I have a situation where I'd like to convert PDF to XLSX.
> The documents are 35MB and 105MB but contain several thousand pages.
>
> Does anyone know a good tool that can handle this?

Depends on what is in the PDFs. If this is rendered text, you can
maybe extract the text with the tool pdftotext and convert it to CSV,
then import the CSV in "Excel". But if it's images of text, use the
tool pdfimages to extract the images, and then an OCR tool (maybe
tesseract) to obtain the data.

It might be worth checking if LibreOffice can open a PDF file and
export to (or save as) an "Excel"-compatible file directly, either
CSV or one of the binary formats (XLS, XLSX). Restructuring with
some sed / awk / perl might be needed, though.

Keep in mind those steps can be automated, so if you have lots of
PDF files, write a simple shell wrapper that processes all of them,
so you get a bunch of result files without further handholding. :-)

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
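
A minimal sketch of the kind of shell wrapper suggested above, for the
rendered-text case only (no OCR). It is not a tested recipe for any
particular document: it assumes pdftotext (from poppler or xpdf) is
installed, that its -layout output separates table columns by runs of
two or more spaces, and that every cell can simply be quoted. The
function name layout_to_csv and the foo.pdf -> foo.csv naming are
made up for illustration.

```sh
#!/bin/sh
# Sketch: convert text-based PDFs to CSV via "pdftotext -layout".
# Assumption: columns in the layout-preserved text are separated by
# runs of two or more spaces; real documents will likely need extra
# cleanup with sed / awk / perl.

# layout_to_csv (hypothetical helper): layout text on stdin, CSV on stdout.
layout_to_csv() {
    awk '
    {
        gsub(/\f/, "")                  # drop page-break form feeds
        sub(/^ +/, ""); sub(/ +$/, "")  # trim leading/trailing spaces
        if ($0 == "") next              # skip empty lines
        n = split($0, f, /  +/)         # split on 2+ consecutive spaces
        line = ""
        for (i = 1; i <= n; i++) {
            gsub(/"/, "\"\"", f[i])     # CSV-escape embedded quotes
            line = line (i > 1 ? "," : "") "\"" f[i] "\""
        }
        print line
    }'
}

# Process every PDF named on the command line: foo.pdf -> foo.csv
for pdf in "$@"; do
    pdftotext -layout "$pdf" - | layout_to_csv > "${pdf%.pdf}.csv"
done
```

Saved as, say, pdf2csv.sh, it could be run as "sh pdf2csv.sh *.pdf";
the resulting CSV files can then be imported into "Excel" or
LibreOffice Calc. Multi-line cells, headers repeated on every page and
similar quirks are not handled here.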