Date:      Mon, 12 Jul 2021 00:06:04 -0400
From:      Paul Procacci <pprocacci@gmail.com>
To:        John Levine <johnl@iecc.com>
Cc:        FreeBSD Questions <freebsd-questions@freebsd.org>, dpchrist@holgerdanske.com
Subject:   Re: Analyzing Log files of very large size
Message-ID:  <CAFbbPujtM-yzk0GbKLaKr7=OrCA3rdBzQ6T+B8KaB8wSK0Xz2w@mail.gmail.com>
In-Reply-To: <20210711201136.B3271205F2CA@ary.qy>
References:  <e797b547-4084-351d-08a9-31784b10fecd@holgerdanske.com> <20210711201136.B3271205F2CA@ary.qy>



> >> I am in a requirement to analyze large log files of sonic wall firewall
> >> around 50 GB. for a suspect attack. ...
> >
> > But if this project is for an employer or client, I would recommend
> > starting with the commercial-off-the-shelf (COTS) log analysis tool made
> > by the hardware vendor.  Train up on it.  Buy a support contract:
> >
> > https://www.sonicwall.com/wp-content/uploads/2019/01/sonicwall-analyzer.pdf
>
> This is reasonable advice if you plan to be doing these analyses on a
> regular basis, but it's overkill if you only expect to do it once.
>
> I have found that some of the text processing utilities that come with BSD
> are a lot faster than others.  The regex matching in perl is a lot faster
> than python, sometimes by an order of magnitude.  My tool of choice is
> mawk, an implementation of the funky but very useful awk language that is
> amazingly fast.  grep is OK, sed is too slow for anything other than tiny
> jobs.
>
> I'd suggest first dividing up the logs into manageable chunks, perhaps
> using split or csplit, or it would be a good first project in mawk, using
> patterns to divide the files into chunks that represent an hour or a day.
>
> Then you can start looking for interesting patterns, perhaps with grep if
> they are simple enough, or more likely with some short mawk scripts.
>
> R's,
> John
>
>
This advice is sound.
I'd personally do the same, leaning on either awk or perl myself.
It just depends on what you're after in the long run.
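
For the chunking step John describes, a minimal sketch of the mawk route
might look like this -- it assumes syslog-style lines whose first two
fields are a month and a day (e.g. "Jul 11"); adjust the fields to
whatever your SonicWall export actually emits:

  # split.awk -- route each line to a per-day chunk file
  # usage: mawk -f split.awk firewall.log
  {
      out = "chunk." $1 "-" $2      # e.g. chunk.Jul-11
      print > out
  }

split(1) or csplit(1) will do the same job if fixed-size chunks are good
enough; the awk version just keeps each day self-contained.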

Another note: I've done something similar before, where awk/perl simply
weren't enough for the 50+ TB of logs being consumed daily, so I had to
roll my own in C using qp-tries[1].  Again, if not only your volume is
high but you also process the data frequently, you should consider a more
custom solution if one doesn't already exist.
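
To make that concrete: the kind of short mawk script John means for the
pattern-hunting step is usually just an associative-array tally, roughly
like the sketch below (the field number and the "drop" pattern are
assumptions -- check what your SonicWall format actually logs).  It's
also exactly this in-memory table that stops scaling once you're into
tens of terabytes a day, which is where the C/qp-trie route comes in:

  # count.awk -- tally dropped connections per source address
  # usage: mawk -f count.awk chunk.Jul-11
  /drop/ {
      ip = $5                  # assumes field 5 looks like src=1.2.3.4
      sub(/^src=/, "", ip)     # strip the prefix, keep the address
      hits[ip]++
  }
  END {
      for (ip in hits)
          print hits[ip], ip | "sort -rn"
  }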

Note: Another poster mentioned AVL trees as well.  That's fine too.  I
just prefer qp-tries.

[1] https://dotat.at/prog/qp/README.html
-- 
__________________

:(){ :|:& };:


