Date: Sun, 11 Jan 2004 11:52:37 +0000
From: Matthew Seaman <m.seaman@infracaninophile.co.uk>
To: Gary Kline <kline@thought.org>
Cc: FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: perl script question.
Message-ID: <20040111115237.GA10388@happy-idiot-talk.infracaninophile.co.uk>
In-Reply-To: <20040111013434.GC44177@tao.thought.org>
References: <20040110221036.GA44130@tao.thought.org>
    <20040110223308.GA4881@happy-idiot-talk.infracaninophile.co.uk>
    <20040110223907.GA16659@Uruk-Hai.Sanitarium.mine.nu>
    <20040110230218.GA5347@happy-idiot-talk.infracaninophile.co.uk>
    <20040111013434.GC44177@tao.thought.org>

On Sat, Jan 10, 2004 at 05:34:34PM -0800, Gary Kline wrote:
> On Sat, Jan 10, 2004 at 11:02:18PM +0000, Matthew Seaman wrote:

> > perl -pi.bak -e 's/\s*\w+_\w+\.?//g;' filename
>
> The lines do indeed wrap so this does the job on a test file.
> I do have the re-exp book but this one is far over my head.
> What do the "\s*" mean, and also the "\.?/" ?

OK.  Time to dissect a regular expression.  Let's just isolate the RE
bits from the surrounding stuff:

    \s*\w+_\w+\.?

There are 5 parts to this:

    1  \s*
    2  \w+
    3  _
    4  \w+
    5  \.?

1) \s* -- '\s' is a metacharacter for matching whitespace: it's
   equivalent to saying [ \t\n\r\f].  The '*' operator says "any
   number of these, including zero".

2) \w+ -- '\w' is a metacharacter for matching 'word' characters.
   What it means is locale dependent, but if you're using the ASCII
   locale it corresponds to [a-zA-Z_0-9].  The '+' operator means "one
   or more of these".  Note that while \w+ matches character sequences
   containing _, it will also match words that don't: hence

3) _ -- match a literal '_' character, i.e. this forces the matched
   text to contain at least one underscore.

4) \w+ -- as (2): matches the rest of the
   stuff_separated_by_underscores after the underscore we've forced a
   match to [1].

5) \.? -- \. matches a literal '.'  It has to be escaped (with a \)
   because plain '.' on its own is used as the wildcard to match any
   character.  The '?' operator means "optional", or more precisely,
   either zero or one of those.

Now, the whole command:

    perl -pi.bak -e 's/${re}//g;' filename

scans through the file line_by_line, matching strings_connected_with
underscores on each line.  Björn Andersson noticed that you would need
the 'g' option to the s/// substitution command, which means "repeat
this substitution more than once, if necessary".  Like in the first
line_of_this_paragraph.

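For anyone who wants to poke at the pattern without touching a real
file, here is a minimal sketch.  It is in Python rather than Perl,
which is a substitution made purely for illustration and not part of
the original one-liner.  It assumes plain ASCII text, where Python's
re module treats \s, \w, *, + and ? the same way Perl does for this
pattern.  The names UNDERSCORE_WORD, strip_all and strip_first are
hypothetical.

```python
import re

# The same five-part pattern as in the Perl one-liner:
#   \s*  optional leading whitespace
#   \w+  one or more word characters
#   _    a literal underscore
#   \w+  one or more word characters
#   \.?  an optional literal full stop
UNDERSCORE_WORD = re.compile(r"\s*\w+_\w+\.?")


def strip_all(line: str) -> str:
    """Roughly what s/...//g does: delete every match on the line."""
    return UNDERSCORE_WORD.sub("", line)


def strip_first(line: str) -> str:
    """Roughly what s/...// without 'g' does: delete only the first match."""
    return UNDERSCORE_WORD.sub("", line, count=1)


sample = "keep foo_bar and baz_qux_quux. also keep"

print(strip_all(sample))    # keep and also keep
print(strip_first(sample))  # keep and baz_qux_quux. also keep

# A match at the very start of a line has no leading whitespace,
# which \s* permits because it also accepts zero characters.
print(repr(strip_all("line_of_this_paragraph. Then")))  # ' Then'
```

The second print shows why the 'g' option matters: without it, any
second underscore word on the same line is left behind.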
Then I realised that there were situations, like the last line of the
previous paragraph, where there wouldn't be any leading whitespace to
match, which is why the pattern starts with \s* and so allows zero
whitespace characters.

Of course, this all depends on the sequences of words_connected_with_
underscores not wrapping around onto more than one line, as in this
contrived example, where the word 'underscores' on the second line of
this paragraph wouldn't be deleted.  There are several other edge
cases like that, if word-wrap is permitted.  But it was never
specified whether that was the case, and I've assumed not, because
coping with that sort of thing is a bit trickier.

	Cheers,

	Matthew

[1] In fact, due to the way regular expressions work, the literal
underscore (3) will actually match at the last underscore out of all
the stuff we're matching (as long as at least one word character
follows it), and the stuff matched by chunk (4) won't contain any
underscores.

-- 
Dr Matthew J Seaman MA, D.Phil.                       26 The Paddocks
                                                      Savill Way
PGP: http://www.infracaninophile.co.uk/pgpkey         Marlow
Tel: +44 1628 476614                                  Bucks., SL7 1TH UK