From: Matthew Seaman
To: Gary Kline
Cc: FreeBSD Mailing List
Date: Sun, 11 Jan 2004 11:52:37 +0000
Subject: Re: perl script question.
Message-ID: <20040111115237.GA10388@happy-idiot-talk.infracaninophile.co.uk>
In-Reply-To: <20040111013434.GC44177@tao.thought.org>

On Sat, Jan 10, 2004 at 05:34:34PM -0800, Gary Kline wrote:
> On Sat, Jan 10, 2004 at 11:02:18PM +0000, Matthew Seaman wrote:

> >     perl -pi.bak -e 's/\s*\w+_\w+\.?//g;' filename

>       The lines do indeed wrap so this does the job on a test file.
>       I do have the re-exp book but this one is far over my head.
>       What does the "\s*" mean, and also the "\.?/" ?

OK.  Time to dissect a regular expression.  Let's just isolate the RE
bits from the surrounding stuff:

    \s*\w+_\w+\.?

There are 5 parts to this:

    1  \s*
    2  \w+
    3  _
    4  \w+
    5  \.?

 1) \s* -- '\s' is a metacharacter for matching whitespace: it's
    equivalent to saying [ \t\n\r\f].  The '*' operator says "any
    number of these, including zero".

 2) \w+ -- '\w' is a metacharacter for matching 'word' characters.
    What it means is locale dependent, but if you're using the ASCII
    locale it corresponds to [a-zA-Z_0-9].  The '+' operator means
    "one or more of these".  Note that while \w+ matches character
    sequences containing _, it will also match words that don't:
    hence

 3) _ -- match a literal '_' character, i.e. this forces the matched
    text to contain at least one underscore.

 4) \w+ -- as in (2), matches the rest of the
    stuff_separated_by_underscores after the underscore we've forced
    a match on [1].

 5) \.? -- \. matches a literal '.'.  It has to be escaped (with a \)
    because a plain '.' on its own is used as the wildcard to match
    any character.  The '?' operator means "optional", or more
    precisely, either zero or one of those.

Now, the whole command (with ${re} standing for the RE above):

    perl -pi.bak -e 's/${re}//g;' filename

scans through the file line_by_line, matching strings_connected_with
underscores on each line.
Björn Andersson noticed that you would need the 'g' option to the
s/// substitution command, which means "repeat this substitution more
than once, if necessary".  Like in the first
line_of_this_paragraph.

Then I realised that there were situations, like the last line of the
previous paragraph, where there wouldn't be any leading whitespace to
match.

Of course, this all depends on the sequences of words_connected_with_
underscores not wrapping around onto more than one line, as in this
contrived example, where the word 'underscores' on the second line of
this paragraph wouldn't be deleted.  There are several other edge
cases like that, if word-wrap is permitted.  But it was never
specified if that was the case or not and I've assumed not because
coping with that sort of thing is a bit trickier.

	Cheers,

	Matthew

[1] In fact, due to the way regular expressions work, the literal
underscore (3) will actually match at the last underscore out of all
the stuff we're matching, and the stuff matched by chunk (4) won't
contain any underscores.

-- 
Dr Matthew J Seaman MA, D.Phil.                       26 The Paddocks
                                                      Savill Way
PGP: http://www.infracaninophile.co.uk/pgpkey         Marlow
Tel: +44 1628 476614                                  Bucks., SL7 1TH UK
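
The points made in the message (the 'g' option, the optional leading
whitespace, the wrapped-line edge case and footnote [1]) can be checked
with a small sketch.  It is written in Python rather than Perl, purely so
the behaviour can be asserted; for this particular pattern Python's re
module should behave the same way as Perl's engine.  The helper name
strip_underscored, its every_match parameter and the sample strings are
illustrative assumptions, not part of the original one-liner.

```python
import re

# Same pattern as in the message. re.ASCII pins \w to [a-zA-Z0-9_] and \s to
# ASCII whitespace, i.e. the "ASCII locale" reading described above.
PATTERN = re.compile(r"\s*\w+_\w+\.?", re.ASCII)


def strip_underscored(line, every_match=True):
    """Delete underscore-joined words from one line (hypothetical helper).

    every_match=True behaves like the 'g' option to s///;
    every_match=False stops after the first match on the line.
    """
    return PATTERN.sub("", line, count=0 if every_match else 1)


# Two matches on one line: without 'g' the second one survives.
line = "scans through the file line_by_line, matching strings_connected_with"
print(strip_underscored(line))
# scans through the file, matching
print(strip_underscored(line, every_match=False))
# scans through the file, matching strings_connected_with

# No leading whitespace: \s* (rather than \s+) still lets the match start
# at the beginning of the line, and \.? takes the trailing full stop.
print(repr(strip_underscored("line_of_this_paragraph.")))  # ''

# A phrase wrapped across two lines: each line is processed on its own,
# so the continuation word on the second line is left alone.
print(strip_underscored("the sequences of words_connected_with_"))  # the sequences of
print(strip_underscored("underscores not wrapping"))  # underscores not wrapping

# Footnote [1]: the greedy first \w+ backs off only as far as the last
# underscore, so the second \w+ captures no underscores.
m = re.fullmatch(r"(\w+)_(\w+)", "stuff_separated_by_underscores", re.ASCII)
print(m.groups())  # ('stuff_separated_by', 'underscores')
```

The function takes one line at a time, which mirrors how the one-liner
sees its input; coping with words wrapped across lines would need the
text to be handled in larger chunks, which this sketch does not attempt.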