Date: Sat, 23 Jan 2021 05:42:09 +0100
From: Polytropon
To: Odhiambo Washington
Cc: User Questions
Subject: Re: Convert PDF to Excel
Message-Id: <20210123054209.f03ac420.freebsd@edvax.de>
Organization: EDVAX

On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> I have a situation where I'd like to convert PDF to XLSX.
> The documents are 35MB and 105MB but contain several thousand pages.
>
> Does anyone know a good tool that can handle this?

Depends on what is in the PDFs. If this is rendered text, you can
maybe extract the text with the tool pdftotext and convert it to
CSV, then import the CSV in "Excel". But if it's images of text,
use the tool pdfimages to extract the images, and then an OCR tool
(maybe tesseract) to obtain the data.

It might be worth checking if LibreOffice can open a PDF file and
export it directly to (or save it as) an "Excel"-compatible file,
either CSV or one of the binary formats (XLS, XLSX). Restructuring
with some sed / awk / perl might be needed, though.

Keep in mind those steps can be automated, so if you have lots
of PDF files, write a simple shell wrapper that processes all of
them, so you get a bunch of result files without further
handholding. :-)

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
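
To illustrate the "simple shell wrapper" idea, here is a rough sketch
for the rendered-text case only (no OCR). It assumes that the columns
in the output of "pdftotext -layout" are separated by runs of two or
more spaces, which depends entirely on how the actual PDFs are laid
out; real documents will probably need extra sed / awk massaging. The
function names layout_to_csv and convert_all are made up for this
example.

```shell
#!/bin/sh
# Sketch only: turn every PDF in a directory into a CSV file.
# ASSUMPTION: columns in the "pdftotext -layout" output are separated
# by two or more spaces. Adjust the split pattern to your documents.

# layout_to_csv (hypothetical helper): layout text on stdin, CSV on stdout.
layout_to_csv() {
    awk '
        /^[ \t\r\f]*$/ { next }                 # skip blank lines and bare page breaks
        {
            sub(/^[ \t\r\f]+/, "")              # trim leading whitespace
            sub(/[ \t\r\f]+$/, "")              # trim trailing whitespace
            n = split($0, field, /[ ][ ]+/)     # split on 2+ spaces
            line = ""
            for (i = 1; i <= n; i++) {
                gsub(/"/, "\"\"", field[i])     # CSV-escape embedded quotes
                line = line (i > 1 ? "," : "") "\"" field[i] "\""
            }
            print line
        }'
}

# convert_all (hypothetical helper): process each *.pdf in a directory
# (default: current directory), writing NAME.txt and NAME.csv beside it.
convert_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue               # glob matched nothing
        txt=${pdf%.pdf}.txt
        csv=${pdf%.pdf}.csv
        if ! pdftotext -layout "$pdf" "$txt"; then
            echo "failed: $pdf" >&2
            continue
        fi
        layout_to_csv < "$txt" > "$csv"
        echo "wrote $csv"
    done
}
```

Source the file and run "convert_all /path/to/pdfs". The intermediate
.txt files are kept on purpose, so you can look at what pdftotext
produced and tune the splitting before importing the CSV files.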