Date: Thu, 23 Jan 2014 14:45:16 -0600
From: Paul Schmehl <pschmehl_lists@tx.rr.com>
To: dteske@FreeBSD.org, 'RW' <rwmaillists@googlemail.com>, freebsd-questions@freebsd.org
Subject: RE: awk programming question
Message-ID: <DB7199C9F15E1814EBB721FE@localhost>
In-Reply-To: <04a201cf1878$8ebce540$ac36afc0$@FreeBSD.org>
References: <F01EB9CE742DEB17DB6B51C7@localhost> <alpine.BSF.2.00.1401230900270.76961@wonkity.com> <20140123185604.4cbd7611@gumby.homeunix.com> <04a201cf1878$8ebce540$ac36afc0$@FreeBSD.org>
--On January 23, 2014 at 12:20:26 PM -0800 dteske@FreeBSD.org wrote:

>
>
>> -----Original Message-----
>> From: RW [mailto:rwmaillists@googlemail.com]
>> Sent: Thursday, January 23, 2014 10:56 AM
>> To: freebsd-questions@freebsd.org
>> Subject: Re: awk programming question
>>
>> On Thu, 23 Jan 2014 09:30:35 -0700 (MST) Warren Block wrote:
>>
>> > On Thu, 23 Jan 2014, Paul Schmehl wrote:
>> >
>> > > I'm kind of stubborn. There's lots of different ways to skin a cat,
>> > > but I like to force myself to use the built-in utilities to do
>> > > things so I can learn more about them and better understand how they
>> > > work.
>> > >
>> > > So, I'm trying to parse a file of snort rules, extract two string
>> > > values and insert a double pipe between them to create a sig-msg.map
>> > > file
>> > >
>> > > Here's a typical rule:
>> > >
>> > > alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] ET POLICY
>> > > Outbound TFTP Read Request"; content:"|00 01|"; depth:2;
>> > > classtype:bad-unknown; sid:2008120; rev:1;)
>> > >
>> > > Here's a typical sig-msg.map file entry:
>> > >
>> > > 9624 || RPC UNIX authentication machinename string overflow attempt
>> > > UDP
>> > >
>> > > So, from the above rule I would want to create a single line like
>> > > this:
>> > >
>> > > 2008120 || E3[rb] ET POLICY Outbound TFTP Read Request
>> > >
>> > > There are several ways I can extract one or the other value, and
>> > > I've figured out how to extract the sid and add the double pipe, but
>> > > for the life of me I can't figure out how to extract and print out
>> > > sid || msg.
>> > >
>> > > This prints out the sid and the double pipe:
>> > >
>> > > echo `awk 'match($0,/sid:[0-9]*;/) {print substr($0,RSTART,RLENGTH)"
>> > > || "}' /tmp/mtc.rules | tr -d ";sid"
>> > >
>> > > It seems I could put the results into a variable rather than
>> > > printing them out, and then print var1 || var2, but my google foo
>> > > hasn't found a useful example.
>> > >
>> > > Surely there's a way to do this using awk? I can use tr for
>> > > cleanup. I just need to get close to the right result.
>> > >
>> > > How about it awk experts? What's the cleanest way to get this done?
>> >
>> > Not an awk expert, but you can do math on the start and length
>> > variables to get just the date part:
>> >
>> > echo "sid:2008120;" \
>> >   | awk '{ match($0, /sid:[0-9]*;/) ; \
>> >     ymd=substr($0, RSTART+4, RLENGTH-5) ; print ymd }'
>> >
>> > Closer to what you want:
>> >
>> > echo 'msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; sid:2008120;' \
>> >   | awk '{ match($0, /sid:[0-9]*;/) ; \
>> >     ymd=substr($0, RSTART+4, RLENGTH-5) ; \
>> >     match($0, /msg:.*;/) ; \
>> >     msg = substr($0, RSTART+4, RLENGTH-5) ; \
>> >     print ymd, "||", msg }'
>> >
>> > Note the error that the too-greedy regex creates, and the inability of
>> > awk to capture regex sub-expressions. awk does not have a way to
>> > reduce the greediness, at least that I'm aware. You may be able to
>> > work around that, like if the message is always the same length.
>>
>>
>> $ echo 'msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; sid:2008120;' | \
>> awk '{ match($0, /sid:[0-9]+;/) ; ymd=substr($0, RSTART+4, RLENGTH-5) ; \
>> match($0, /msg:[^;]+;/) ; msg = substr($0, RSTART+4, RLENGTH-5) ; \
>> print ymd, "||", msg }'
>>
>> 2008120 || "E3[rb] ET POLICY Outbound TFTP Read Request"
>>
>> Note that awk supports +, but not newfangled things like *.
>
> With respect to regex, what awk really needs is the quantifier syntax...
>
> * = {0,} = zero or more
> + = {1,} = one or more
> {x,y} = any quantity from x inclusively up to y
> {x,} = any quantity from x or more
>
> sed supports it -- e.g., echo "aaa" | sed -e 's/a\{1,2\}//' # produces "a"
> sed -E (aka sed -r) supports it -- e.g., echo "aaa" | sed -E 's/a{1,2}//' # produces "a"
> grep supports it -- e.g., echo "aaa" | grep 'a\{2,\}' # match printed
> grep -E (aka egrep) supports it -- e.g., echo "aaa" | grep -E 'a{2,}' # match printed
> perl supports it -- obviously (in the modern regex form, lacking backslash)
> nvi supports it -- e.g., :%s/a\{1,2\}//
> vim supports it -- obviously (and uses the backslash form; even with noncompatible set)
>
> onetrueawk however does NOT support it -- example given...
> echo aaa | awk '/a{2,}/{print}' # no match printed
> echo aaa | awk '/a\{2,\}/{print}' # no match printed
>
> There's a couple of other nits here...
>
> 1. sig-msg.map file according to OP shouldn't have the quotes that are
>    present from the snort rule input
> 2. Doesn't ignore lines of disinterest (See http://oreilly.com/pub/h/1393)
>
> NB: The result code of match() is ignored; I don't think the program
> should output known bad sig-msg.map lines (where an sid is not given,
> for example; which appears to be the key for the sig-msg.map file).
>
> I gather that a more complete solution would be as follows:
>
> awk '!/^[[:space:]]*(#|$)/{if (!match($0, /[[:space:](;]sid:[[:space:]]*[0-9]/)) next; sid = substr($0, RSTART + RLENGTH - 1); sub(/[^0-9].*/, "", sid); if (!match($0, /[[:space:](;]msg:[[:space:]]*/)) next; buf = substr($0, RSTART + RLENGTH); quoted = substr(buf, 0, 1) == "\""; split(buf, msg, quoted ? "\"" : FS); print sid, "||", msg[quoted ? 2 : 1]}' rules_file
>
> Where "rules_file" is the name of the file you want to parse.
>
> Putting this into a script, we can clean it up so that it's readable...
>
> #!/bin/sh
> awk '
> !/^[[:space:]]*(#|$)/ {
>     if (!match($0, /[[:space:](;]sid:[[:space:]]*[0-9]/)) next
>     sid = substr($0, RSTART + RLENGTH - 1)
>     sub(/[^0-9].*/, "", sid)
>     if (!match($0, /[[:space:](;]msg:[[:space:]]*/)) next
>     buf = substr($0, RSTART + RLENGTH)
>     quoted = substr(buf, 0, 1) == "\""
>     split(buf, msg, quoted ? "\"" : FS)
>     print sid, "||", msg[quoted ? 2 : 1]
> }' "$@"

Thanks so much! In the end I opted to use perl, because I had more pressing matters to attend to, but I'm pleased to know that it's doable with awk, and I will test your script (and endeavor to more fully understand it) when I have the time to do so.

-- 
Paul Schmehl, Senior Infosec Analyst
As if it wasn't already obvious, my opinions
are my own and not those of my employer.
*******************************************
"It is as useless to argue with those
who have renounced the use of reason
as to administer medication to the dead." Thomas Jefferson
"There are some ideas so wrong that only a very
intelligent person could believe in them." George Orwell
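
For anyone who wants to try the quoted script without a rules file to hand, here is a minimal sketch that wraps the same awk program in a shell function and feeds it the sample rule from the original question. The function name sid_msg_map and the extra input lines (a comment, a blank line, and a rule with no sid) are made up for illustration. The one deliberate change is substr(buf, 1, 1) in place of substr(buf, 0, 1), on the assumption that a start index of 1 is the safer choice because awk numbers string positions from 1.

```shell
#!/bin/sh
# Sketch only: sid_msg_map is a hypothetical wrapper name, not from the thread.
sid_msg_map() {
    awk '
    # Skip blank lines and comment lines.
    !/^[[:space:]]*(#|$)/ {
        # Find "sid:" followed by a digit; skip the rule if there is none.
        if (!match($0, /[[:space:](;]sid:[[:space:]]*[0-9]/)) next
        sid = substr($0, RSTART + RLENGTH - 1)   # starts at the first digit
        sub(/[^0-9].*/, "", sid)                 # drop everything after the digits
        # Find "msg:"; skip the rule if there is none.
        if (!match($0, /[[:space:](;]msg:[[:space:]]*/)) next
        buf = substr($0, RSTART + RLENGTH)       # text right after "msg:"
        # Assumption: index 1 instead of 0, since awk positions start at 1.
        quoted = substr(buf, 1, 1) == "\""
        # Quoted message: take the text between the first pair of quotes.
        # Unquoted message: take the first whitespace-separated word.
        split(buf, msg, quoted ? "\"" : FS)
        print sid, "||", msg[quoted ? 2 : 1]
    }' "$@"
}

# Illustrative input: the rule from the question plus lines that should be skipped.
out=$(sid_msg_map <<'EOF'
# a comment line

alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; content:"|00 01|"; depth:2; classtype:bad-unknown; sid:2008120; rev:1;)
alert tcp any any -> any any (msg:"rule without an id";)
EOF
)
printf '%s\n' "$out"
```

With this input the function should print the single line "2008120 || E3[rb] ET POLICY Outbound TFTP Read Request". The comment, the blank line, and the rule without a sid produce no output. Passing file names instead of standard input works the same way, e.g. sid_msg_map /tmp/mtc.rules.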