From: Odhiambo Washington
Date: Sat, 23 Jan 2021 10:36:21 +0300
Subject: Re: Convert PDF to Excel
To: Polytropon
Cc: User Questions

On Sat, 23 Jan 2021 at 07:42, Polytropon wrote:

> On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> > I have a situation where I'd like to convert PDF to XLSX.
> > The documents are 35MB and 105MB but contain several thousand pages.
> >
> > Does anyone know a good tool that can handle this?
>
> Depends on what is in the PDFs.
>
> If this is rendered text, you can maybe extract the text with
> the tool pdftotext and convert it to CSV, then import the CSV
> in "Excel".
>
> But if it's images of text, use the tool pdfimages to extract the
> images, and then an OCR tool (maybe tesseract) to obtain the data.
>
> It might be worth checking if LibreOffice can open a PDF file and
> export directly to (or save as) an "Excel"-compatible file, either
> CSV or one of the binary formats (XLS, XLSX).
>
> Restructuring with some sed / awk / perl might be needed, though.
> Keep in mind those steps can be automated, so if you have lots of
> PDF files, write a simple shell wrapper that processes all of them,
> so you get a bunch of result files without further handholding. :-)

To make the story short, I need to do some manipulation on the two
documents in this link: https://bit.ly/2KEvCwr

I thought they were simple PDFs, but now I am not sure what the
creators did, or how. I just need to count how many duplicate records
are in these.

Any script guru to assist?? :-)

-- 
Best regards,
Odhiambo WASHINGTON,
Nairobi,KE
+254 7 3200 0004/+254 7 2274 3223
"Oh, the cruft.", grep ^[^#] :-)
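
A minimal sketch of the duplicate count asked about above, assuming the PDFs contain rendered text, that something like `pdftotext -layout` has already written them to a plain-text file, and that each record sits on one line. None of these assumptions is confirmed by the thread. The function names are made up for illustration.

```python
from collections import Counter


def normalize(line):
    """Collapse runs of whitespace so layout differences do not hide duplicates."""
    return " ".join(line.split())


def count_duplicates(lines):
    """Return (repeated, extra) for an iterable of text lines.

    repeated maps each record seen more than once to its number of occurrences;
    extra is the number of surplus copies (occurrences beyond the first).
    Assumption: one record per line; blank lines are ignored.
    """
    counts = Counter(normalize(line) for line in lines if line.strip())
    repeated = {record: n for record, n in counts.items() if n > 1}
    extra = sum(n - 1 for n in repeated.values())
    return repeated, extra


def count_duplicates_in_file(path):
    """Run count_duplicates over a text file, e.g. one produced by pdftotext."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_duplicates(handle)
```

With a hypothetical `records.txt` produced from one of the PDFs, `count_duplicates_in_file("records.txt")` would return the repeated lines and the number of surplus copies. Page headers and footers repeated on every page would show up as duplicates too, so they probably need filtering first, and records that wrap across lines would need joining before counting.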