From: Polytropon
To: Odhiambo Washington
Cc: User Questions
Date: Sat, 23 Jan 2021 09:40:41 +0100
Subject: Re: Convert PDF to Excel
Message-Id: <20210123094041.f932fd4c.freebsd@edvax.de>
References: <20210123054209.f03ac420.freebsd@edvax.de>
Organization: EDVAX

On Sat, 23 Jan 2021 10:36:21 +0300, Odhiambo Washington wrote:
> On Sat, 23 Jan 2021 at 07:42, Polytropon wrote:
>
> > On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington
> > wrote:
> > > I have a situation where I'd like to convert PDF to XLSX.
> > > The documents are 35MB and 105MB but contain several thousand pages.
> > >
> > > Does anyone know a good tool that can handle this?
> >
> > Depends on what is in the PDFs.
> >
> > If this is rendered text, you can maybe extract the text with
> > the tool pdftotext and convert it to CSV, then import the CSV
> > in "Excel".
> >
> > But if it's images of text, use the tool pdfimages to extract the
> > images, and then an OCR tool (maybe tesseract) to obtain the data.
> >
> > It might be worth checking if LibreOffice can open a PDF file and
> > export to (or save as) directly an "Excel"-compatible file, either
> > CSV or one of the binary formats (XLS, XLSX).
> >
> > Restructuring with some sed / awk / perl might be needed, though.
> > Keep in mind those steps can be automated, so if you have lots of
> > PDF files, write a simple shell wrapper that processes all of them,
> > so you get a bunch of result files without further handholding. :-)
>
> To make the story short, I need to do some manipulation on the two
> documents in this link:
>
> https://bit.ly/2KEvCwr
>
> I thought they were simple PDFs, but now I am not sure what the
> creators did, or how.

They contain text, so the OCR problem is out of the way. Sadly,
the text is re-arranged, so the ideal layout (one row of the table
equals one line of text, with the columns separated by whitespace)
does not appear. Instead it is the other way round: one line
equals one column.

> I just need to count how many duplicate records are in these.

Define "duplicate". :-)

> Any script guru to assist?? :-)

I'd suggest something like this:

	pdftotext | ... | paste | ... | sort | uniq -d | wc -l

This will probably almost do what you need, given sufficient
assumptions... :-)

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
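
The pipeline above leaves its middle steps open. As a minimal sketch,
assuming that pdftotext emits one table cell per line and that every
record spans the same number of consecutive lines (an assumption about
these PDFs, not something verified here), the cells can be glued back
into records with paste before sorting and counting. The function name
count_duplicates and its column-count argument are hypothetical:

```shell
#!/bin/sh
# count_duplicates NCOLS
# Hypothetical helper. Reads stdin and ASSUMES one table cell per line,
# with NCOLS consecutive lines forming one record.
# Intended use (input.pdf is a placeholder):
#   pdftotext input.pdf - | count_duplicates 5
count_duplicates() {
    ncols=$1
    # Build "- - - ..." so that paste reads stdin NCOLS times per output
    # line, joining NCOLS consecutive input lines with tabs.
    dashes=$(awk -v n="$ncols" 'BEGIN { for (i = 0; i < n; i++) printf "- " }')
    # Drop blank lines, rebuild records, then count the distinct records
    # that occur more than once. $dashes is unquoted on purpose so it
    # splits into separate "-" arguments.
    grep -v '^[[:space:]]*$' \
        | paste $dashes \
        | sort \
        | uniq -d \
        | wc -l \
        | tr -d ' '
}
```

Here "duplicate" means a record that occurs more than once, counted once
however often it repeats. Replacing "uniq -d | wc -l" with "uniq -c"
shows how often each record occurs instead. If the PDFs lay their columns
out differently, the paste step has to change accordingly.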