From owner-freebsd-questions@freebsd.org Mon Jul 12 06:44:34 2021
Date: Mon, 12 Jul 2021 16:44:20 +1000
From: Korolev Sergey <serejk@febras.net>
To: Paul Procacci
Cc: KK CHN, freebsd-questions
Subject: Re: Analyzing Log files of very large size
Message-ID: <64d626beea2becc5191f0a886e0291b3@febras.net>

Yes, Perl is perhaps the best solution for this job, but only if you don't
have to learn it from scratch; it can take some time to get into the basics.
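Still, the basic streaming pattern in Perl is short. Something along these
lines (I am not a Perl person myself, so treat it only as an illustration of
reading line by line instead of loading the whole file; the path and the
pattern are made up):

  #!/usr/bin/perl
  # Sketch only: stream a huge log line by line instead of slurping it.
  use strict;
  use warnings;

  my $file = '/var/log/maillog';                 # example path
  open my $fh, '<', $file or die "open $file: $!";

  my $deferred = 0;
  while (my $line = <$fh>) {
      $deferred++ if $line =~ /status=deferred/; # example pattern
  }
  close $fh;

  print "deferred deliveries: $deferred\n";

Memory use stays constant no matter how big the file is, which is the whole
point for files of that size.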
About sizes: I have processed a 60 GB file with shell utilities, using a long
pipeline, in a reasonable time (several hours) on a 10K rpm HDD, not an SSD.
Of course, one should understand what the pipeline is doing and optimize the
data-processing steps.

About the indexing approach: again, I don't know what exactly needs to be
extracted from the file, but if the result is, for example, a table of
aggregated values, then indexing may be overkill (a rough sketch of what I
mean is at the end of this message).

To the topic author: if you show a piece of the file and explain the desired
result, the advice could be more precise.

On Mon, 12 Jul 2021 02:20:58 -0400, Paul Procacci wrote:

> On Mon, Jul 12, 2021 at 1:44 AM Korolev Sergey wrote:
>
>> I think that the proper tools usually depend heavily on the desired
>> result, so my reasoning is quite general. People here advise using Perl
>> and also splitting one large file into manageable pieces - all of that
>> is very good, and I vote for it. But I don't know Perl at all, so I
>> usually get along with the standard shell utilities: grep, tr, awk, sed,
>> etc. I have parsed big maillogs with them successfully.
>
> Most standard shell utilities can certainly get the job done if the file
> sizes are of a size that's manageable. That is most likely the vast
> majority of cases. No question about that.
>
> There's certainly a point however when the sizes become so unmanageable
> that their completion will be on your 150th birthday. ;) An exaggeration
> undoubtedly.
>
> There's obviously options for this, but you'll seldom find the answer in
> any standard install of any userland. Sometimes you can get away with
> xargs, depending on what the data is that you're working with, but
> that's all that comes to mind.
>
> The "promotion" from there in my mind is going the perl route (or any
> other interpreted language) capable of threading ... and from there as
> necessary ... C (or other compiled language).
>
> Someone made mention of Elasticsearch and that's a good option too. All
> the work of indexing the data has already been done for you. You just
> don't have to mind paying for it. ;)
>
> Hell, I've used postgresql with their fulltext search for similar things
> as well and I'd argue if that's already in your stack, to at the very
> least try that first. You'd be surprised at how darn well it does.
>
> Goodnight!
>
> ~Paul
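P.S. To make the "aggregated results" point above a bit more concrete, here
is a rough one-pass counting sketch, again only an illustration (the same
thing can be done with awk or a shell pipeline). The from=<...> pattern is
only an assumption about a postfix-style maillog; adjust it to the real log
format:

  #!/usr/bin/perl
  # Sketch only: one pass over the log(s) named on the command line,
  # counting lines per sender and printing a small summary table.
  use strict;
  use warnings;

  my %count;
  while (my $line = <>) {
      $count{$1}++ if $line =~ /from=<([^>]*)>/;   # assumed field format
  }

  my @senders = sort { $count{$b} <=> $count{$a} } keys %count;
  splice @senders, 10 if @senders > 10;            # keep the top ten
  printf "%8d  %s\n", $count{$_}, $_ for @senders;

If a table like that is all that is needed, one sequential pass over the file
(or over its split pieces) is enough, and an index would not buy you much.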