Date: Fri, 22 Jan 2021 20:58:38 -0800
From: Weaver
To: freebsd-questions@freebsd.org
Subject: Re: Convert PDF to Excel
In-Reply-To: <20210123054209.f03ac420.freebsd@edvax.de>
Message-ID: <8236dabc52a7801b0cb7edce8c954623@riseup.net>

On 23-01-2021 14:42, Polytropon wrote:
> On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
>> I have a situation where I'd like to convert PDF to XLSX.
>> The documents are 35MB and 105MB but contain several thousand pages.
>>
>> Does anyone know a good tool that can handle this?
>
> Depends on what is in the PDFs.
>
> If this is rendered text, you can maybe extract the text with
> the tool pdftotext and convert it to CSV, then import the CSV
> in "Excel".

Or, AbiWord has a .pdf import plugin. These are never perfect, however,
and some extensive editing may be necessary, depending on the document.
Then you could import the result into Gnumeric and use ssconvert to
convert it into any of four Excel formats.

> But if it's images of text, use the tool pdfimages to extract the
> images, and then an OCR tool (maybe tesseract) to obtain the data.
>
> It might be worth checking if LibreOffice can open a PDF file and
> export to (or save as) directly an "Excel"-compatible file, either
> CSV or one of the binary formats (XLS, XLSX).
>
> Restructuring with some sed / awk / perl might be needed, though.
> Keep in mind those steps can be automated, so if you have lots of
> PDF files, write a simple shell wrapper that processes all of them,
> so you get a bunch of result files without further handholding. :-)

-- 
`The greatest obstacle to discovery is not ignorance; it is the illusion
of knowledge'.
--Daniel J. Boorstein
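
A minimal sketch of the pdftotext-to-CSV step and the wrapper idea quoted
above, written in Python rather than shell. It assumes the PDFs contain
rendered text, that pdftotext (from poppler or xpdf) is on the PATH, and
that table columns are separated by at least two spaces in the -layout
output. That last point is only a guess about the documents and will need
tuning. The function names and the script name pdf2csv.py are made up for
illustration.

```python
import csv
import re
import subprocess
import sys
from pathlib import Path

# Assumption: a run of two or more spaces marks a column gap.
# Real documents may need a different rule (fixed offsets, tabs, ...).
COLUMN_GAP = re.compile(r" {2,}")


def layout_text_to_rows(text):
    """Split layout-preserving text into rows of cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines and page-break padding
        rows.append(COLUMN_GAP.split(line))
    return rows


def pdf_to_csv(pdf_path):
    """Run pdftotext on one PDF and write a CSV file next to it."""
    pdf_path = Path(pdf_path)
    # "-layout" keeps the physical column alignment; "-" writes to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    csv_path = pdf_path.with_suffix(".csv")
    with open(csv_path, "w", newline="") as out:
        csv.writer(out).writerows(layout_text_to_rows(result.stdout))
    return csv_path


if __name__ == "__main__":
    # Process every PDF named on the command line,
    # e.g.: python pdf2csv.py *.pdf
    for name in sys.argv[1:]:
        print(pdf_to_csv(name))
```

The resulting .csv files can be imported into a spreadsheet directly, or
passed to ssconvert (for example "ssconvert report.csv report.xlsx") as
suggested above. Expect to adjust the splitting rule, or to clean up with
sed / awk / perl, once you see what the real output looks like.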