From owner-freebsd-questions@freebsd.org Mon Jul 12 06:44:34 2021
Date: Mon, 12 Jul 2021 16:44:20 +1000
From: Korolev Sergey <serejk@febras.net>
To: Paul Procacci
Cc: KK CHN, freebsd-questions
Subject: Re: Analyzing Log files of very large size
Message-ID: <64d626beea2becc5191f0a886e0291b3@febras.net>

Yes, Perl is perhaps the best solution for this job, but only if you don't
have to learn it from scratch; it can take some time to get into the basics.
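Still, the basic streaming pattern in Perl is short. Something along these
lines (I am not a Perl person myself, so treat it only as an illustration of
reading line by line instead of loading the whole file; the path and the
pattern are made up):

  #!/usr/bin/perl
  # Sketch only: stream a huge log line by line instead of slurping it.
  use strict;
  use warnings;

  my $file = '/var/log/maillog';                 # example path
  open my $fh, '<', $file or die "open $file: $!";

  my $deferred = 0;
  while (my $line = <$fh>) {
      $deferred++ if $line =~ /status=deferred/; # example pattern
  }
  close $fh;

  print "deferred deliveries: $deferred\n";

Memory use stays constant no matter how big the file is, which is the whole
point for files of that size.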
About sizes: I have processed a 60 GB file with shell utilities, using a long
pipeline, in a reasonable time (several hours) on a 10K rpm HDD, not an SSD.
Of course, one should understand what the pipeline is doing and optimize the
data-processing steps.

About the indexing approach: again, I don't know what exactly needs to be
extracted from the file, but if the result is, for example, a table of
aggregated values, then indexing may be overkill (a rough sketch of what I
mean is at the end of this message).

To the topic author: if you show a piece of the file and explain the desired
result, the advice could be more precise.

On Mon, 12 Jul 2021 02:20:58 -0400, Paul Procacci wrote:

> On Mon, Jul 12, 2021 at 1:44 AM Korolev Sergey wrote:
>
>> I think that the proper tools usually depend heavily on the desired
>> result, so my reasoning is quite general. People here advise using Perl
>> and also splitting one large file into manageable pieces - all of that
>> is very good, and I vote for it. But I don't know Perl at all, so I
>> usually get along with the standard shell utilities: grep, tr, awk, sed,
>> etc. I have parsed big maillogs with them successfully.
>
> Most standard shell utilities can certainly get the job done if the file
> sizes are of a size that's manageable. That is most likely the vast
> majority of cases. No question about that.
>
> There's certainly a point however when the sizes become so unmanageable
> that their completion will be on your 150th birthday. ;) An exaggeration
> undoubtedly.
>
> There's obviously options for this, but you'll seldom find the answer in
> any standard install of any userland. Sometimes you can get away with
> xargs, depending on what the data is that you're working with, but
> that's all that comes to mind.
>
> The "promotion" from there in my mind is going the perl route (or any
> other interpreted language) capable of threading ... and from there as
> necessary ... C (or other compiled language).
>
> Someone made mention of Elasticsearch and that's a good option too. All
> the work of indexing the data has already been done for you. You just
> don't have to mind paying for it. ;)
>
> Hell, I've used postgresql with their fulltext search for similar things
> as well and I'd argue if that's already in your stack, to at the very
> least try that first. You'd be surprised at how darn well it does.
>
> Goodnight!
>
> ~Paul
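P.S. To make the "aggregated results" point above a bit more concrete, here
is a rough one-pass counting sketch, again only an illustration (the same
thing can be done with awk or a shell pipeline). The from=<...> pattern is
only an assumption about a postfix-style maillog; adjust it to the real log
format:

  #!/usr/bin/perl
  # Sketch only: one pass over the log(s) named on the command line,
  # counting lines per sender and printing a small summary table.
  use strict;
  use warnings;

  my %count;
  while (my $line = <>) {
      $count{$1}++ if $line =~ /from=<([^>]*)>/;   # assumed field format
  }

  my @senders = sort { $count{$b} <=> $count{$a} } keys %count;
  splice @senders, 10 if @senders > 10;            # keep the top ten
  printf "%8d  %s\n", $count{$_}, $_ for @senders;

If a table like that is all that is needed, one sequential pass over the file
(or over its split pieces) is enough, and an index would not buy you much.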