From: D J Hawkey Jr
To: questions at FreeBSD
Date: Tue, 15 Jul 2003 11:22:28 -0500
Subject: Re: sed(1) regular expression gurus - SOLUTION
Message-ID: <20030715162228.GB33592@sheol.localdomain>
Reply-To: hawkeyd@visi.com

First off, thanks to all of you who scratched your heads over this puzzle. All of you had the right idea to some extent or another.
Based in part on the replies, and my own work, here's the final result:

    FOLDER="$HOME/Mail/spam"
    NAME_RE="[[:alnum:]_.-]+"
    ADDY_RE="([0-9]{1,3}\.){3}[0-9]{1,3}"

    cat "$FOLDER" \
      |grep -A 5 "^Received:" \
      |egrep "^(Received:|	)" \
      |sed -E \
        -e "s/(^Received:|by|from)[[:space:]]+//g" \
        -e "s/\([HELO]{4}[[:space:]]+($NAME_RE)\)/\1/" \
        -e "s/\(($NAME_RE)[[:space:]]+\[($ADDY_RE)\]\)/\1 \2/g" \
        -e "s/(\(\[?|\[)($ADDY_RE)(\]|\]?\))/\2/g" \
        -e "s/[[:space:]]*(\(|id|via|with|E?SMTP|;).*//" \
        -e "s/(\(envelope-|for|Sun|Mon|Tue|Wed|Thu|Fri|Sat).*//" \
        -e "s/[][(){}<>]//g"

Note that the whitespace in the second pipe is one tab character.

The first two pipes isolate the multi-line headers.

The first sed command strips "keywords" and any following whitespace.

The second sed command returns the name in a parenthetical HELO or EHLO.

The third sed command returns the name and address in a "(... [...])".

The fourth sed command - the one I inquired about - returns the address in any of "(...)", "([...])", or "[...]".

The fifth sed command strips possible whitespace, "keywords" or an opening parenthesis (now that it's of no consequence), and anything after them.

The sixth sed command strips more "keywords" and anything after them (it might be merged into the fifth; what it strips is often on another line).

Finally, the last sed command strips any errant delimiters; strictly speaking, it's redundant, but when I ran a spam file (~12.3MB) through this, some delimiters did leak through.

Just thought those who replied to my plea might like to see this, and perhaps somebody else will find it useful.

No, I'm not telling what it's for. ;-,

Dave

-- 
  ______________________         ______________________
  \__________________   \    D. J. HAWKEY JR.    /   __________________/
     \________________/\     hawkeyd@visi.com    /\________________/
                      http://www.visi.com/~hawkeyd/
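
For anyone who wants to try the fourth sed expression on its own, here is a minimal sketch. It assumes a sed that accepts -E, as the script above does. The helper name strip_addr_delims and the sample lines (which use documentation-range addresses) are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Same address pattern as in the script: four dot-separated groups of 1-3 digits.
ADDY_RE='([0-9]{1,3}\.){3}[0-9]{1,3}'

# Hypothetical helper (not in the original script): applies only the
# fourth sed expression to standard input, leaving everything else alone.
strip_addr_delims() {
    sed -E -e "s/(\(\[?|\[)($ADDY_RE)(\]|\]?\))/\2/g"
}

# Made-up sample fragments, one for each delimiter form the expression handles.
printf '%s\n' \
    'relay (192.0.2.1)' \
    'relay ([192.0.2.2])' \
    'relay [192.0.2.3]' \
  | strip_addr_delims
```

With a POSIX-style (leftmost-longest) regex engine, each sample line should come out as "relay" followed by the bare address. A line with no bracketed or parenthesized address passes through unchanged.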