Date: Mon, 12 Jul 2021 16:44:20 +1000
From: Korolev Sergey <serejk@febras.net>
To: Paul Procacci <pprocacci@gmail.com>
Cc: KK CHN <kkchn.in@gmail.com>, freebsd-questions <freebsd-questions@freebsd.org>
Subject: Re: Analyzing Log files of very large size
Message-ID: <64d626beea2becc5191f0a886e0291b3@febras.net>
In-Reply-To: <CAFbbPugNamorCpL1+bkao06iWSUJkPS5V3KORs3SCUUChbBU5Q@mail.gmail.com>
References: <CAKgGyB_TJrLWSjcnc9491Gg0Q5CLqLdmWx2yga_Ez7-gE6YcKQ@mail.gmail.com>
 <E9C00664-DAC7-4F58-BCCA-CDD2654C9325@febras.net>
 <CAKgGyB_reF4eqz4pvQj7tFsOQEEB3WrFZa-91L+NChm=85h0-A@mail.gmail.com>
 <d0ebe655c44cd2b5a70bbac4dcdddcc3@febras.net>
 <CAFbbPugNamorCpL1+bkao06iWSUJkPS5V3KORs3SCUUChbBU5Q@mail.gmail.com>
Yes, Perl is perhaps the best solution for this job, but only if you don't have to start from zero: it can take some time to get into the basics.

About sizes: I have processed a 60 GB file with a pipeline of standard shell utilities in reasonable time (a few hours), on a 10K rpm HDD, not an SSD. Of course, one has to be aware of what one is doing and optimize the processing pipeline.

About the indexing approach: again, I don't know exactly what needs to be extracted from the file, but if the desired output is, for example, a table of aggregated results, then indexing may be overkill.

To the topic author: if you show a piece of the file and describe the desired result, the advice could be more precise. For illustration, I have put a few rough sketches at the bottom of this message, below Paul's quoted reply.

On Mon, 12 Jul 2021 02:20:58 -0400, Paul Procacci wrote:

> On Mon, Jul 12, 2021 at 1:44 AM Korolev Sergey wrote:
>
>> I think that the proper tools usually depend heavily on the desired result, so my reasoning is quite general. People here advise using Perl and also splitting one large file into manageable pieces; all of that is very good, and I vote for it. But I don't know Perl at all, so I usually get along with the standard shell utilities: grep, tr, awk, sed, etc. I used to parse big maillogs with them successfully.
>
> Most standard shell utilities can certainly get the job done if the file
> sizes are manageable. That is most likely the vast majority of cases. No
> question about that.
>
> There's certainly a point, however, when the sizes become so unmanageable
> that the job will finish on your 150th birthday. ;) An exaggeration, undoubtedly.
>
> There are options for this, but you'll seldom find the answer in any
> standard install of any userland. Sometimes you can get away with xargs,
> depending on the data you're working with, but that's all that comes to mind.
>
> The "promotion" from there, in my mind, is going the Perl route (or any
> other interpreted language capable of threading), and from there, as
> necessary, to C (or another compiled language).
>
> Someone mentioned Elasticsearch, and that's a good option too. All the work
> of indexing the data has already been done for you; you just have to not
> mind paying for it. ;)
>
> Hell, I've used PostgreSQL with its full-text search for similar things as
> well, and I'd argue that if it's already in your stack, you should at the
> very least try it first. You'd be surprised how darn well it does.
>
> Goodnight!
>
> ~Paul
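
The sketches mentioned above follow. First, the kind of shell pipeline I mean: counting delivered messages per relay in a Postfix-style maillog. The "status=sent" and "relay=" fields and the file path are only assumptions about the format, not your actual data; adjust them as needed:

    grep ' status=sent ' /var/log/maillog \
      | sed -E 's/.*relay=([^[,]+).*/\1/' \
      | sort | uniq -c | sort -rn | head -20

Every stage streams, so memory use stays flat no matter how big the file is; only sort spills to temporary files on disk.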
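
Second, the xargs trick Paul mentions, to spread the work over several CPUs once the file has been split. This is only a sketch, assuming the chunks can be processed independently (the names "huge.log" and "chunk.*" are made up):

    split -l 10000000 huge.log chunk.
    ls chunk.* | xargs -n1 -P4 sh -c 'grep -c " status=sent " "$0"' \
      | awk '{ total += $1 } END { print total }'

Here -P4 runs four greps at once and the final awk just sums the per-chunk counts; this only helps if the disk can actually feed four readers in parallel.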
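
Third, on Paul's PostgreSQL note: loading the raw lines into a table once and putting a full-text index on them is also easy to try. A sketch, assuming a database "logdb" and a made-up one-column table (note that \copy uses the text COPY format, so lines containing backslashes or tabs may need escaping first):

    psql logdb -c "CREATE TABLE rawlog (line text);"
    psql logdb -c "\copy rawlog (line) FROM 'huge.log'"
    psql logdb -c "CREATE INDEX rawlog_fts ON rawlog USING gin (to_tsvector('simple', line));"
    psql logdb -c "SELECT count(*) FROM rawlog WHERE to_tsvector('simple', line) @@ to_tsquery('simple', 'sent');"

After the one-time load and index build, repeated ad-hoc searches come back quickly instead of costing another full scan of the file.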
