Date:      Sun, 11 Jan 2004 11:52:37 +0000
From:      Matthew Seaman <m.seaman@infracaninophile.co.uk>
To:        Gary Kline <kline@thought.org>
Cc:        FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject:   Re: perl script question.
Message-ID:  <20040111115237.GA10388@happy-idiot-talk.infracaninophile.co.uk>
In-Reply-To: <20040111013434.GC44177@tao.thought.org>
References:  <20040110221036.GA44130@tao.thought.org> <20040110223308.GA4881@happy-idiot-talk.infracaninophile.co.uk> <20040110223907.GA16659@Uruk-Hai.Sanitarium.mine.nu> <20040110230218.GA5347@happy-idiot-talk.infracaninophile.co.uk> <20040111013434.GC44177@tao.thought.org>

On Sat, Jan 10, 2004 at 05:34:34PM -0800, Gary Kline wrote:
> On Sat, Jan 10, 2004 at 11:02:18PM +0000, Matthew Seaman wrote:

> >     perl -pi.bak -e 's/\s*\w+_\w+\.?//g;' filename

> 	The lines do indeed wrap so this does the job on a test file.
> 	I do have the re-exp book but this one is far over my head.
> 	What do the "\s*" mean, and also the "\.?/" ?

OK.  Time to dissect a regular expression.  Let's just isolate the RE
bits from the surrounding stuff:

    \s*\w+_\w+\.?

There are 5 parts to this:

   1 \s*
   2    \w+
   3       _
   4        \w+
   5           \.?

1) \s* -- '\s' is a metacharacter for matching whitespace: it's equivalent
   to saying [ \t\n\r\f].  The '*' operator says "any number of these,
   including zero".

2) \w+ -- '\w' is a metacharacter for matching 'word' characters.
   What it means is locale dependent, but if you're using the ASCII
   locale it corresponds to [a-zA-Z_0-9].  The '+' operator means "one
   or more of these".  Note that while \w+ matches character sequences
   containing _, it will also match words that don't: hence

3) _ -- match a literal '_' character.  i.e. this forces the matched
   text to contain at least one underscore.

4) \w+ -- as (2) matches the rest of the stuff_separated_by_underscores
   after the underscore we've forced a match to[1].

5) \.? -- \. matches a literal '.'  It has to be escaped (with a \)
   because a plain '.' on its own is used as the wildcard to match
   any character (other than a newline, by default).  The '?' operator
   means "optional", or more precisely, either zero or one of those.
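
The five parts above can be tried out interactively.  What follows is
only a sketch: it uses Python's re module rather than Perl (purely so
it is easy to run and check), relying on the fact that \s, \w, *, +
and ? mean the same thing there for this pattern.  The function name
strip_underscored is made up for the illustration.

```python
import re

# The same five parts, written with re.VERBOSE so each one can carry a comment.
PATTERN = re.compile(r"""
    \s*     # 1: leading whitespace, any amount including none
    \w+     # 2: one or more word characters
    _       # 3: a literal underscore
    \w+     # 4: one or more word characters
    \.?     # 5: an optional literal full stop
""", re.VERBOSE)


def strip_underscored(line):
    """Delete every underscore-joined word from one line (hypothetical helper)."""
    return PATTERN.sub("", line)


print(strip_underscored("keep this foo_bar and that baz_qux. ok"))
# -> keep this and that ok
```

Note that the whitespace in front of each match and the full stop
after baz_qux disappear along with the word itself, which is what
parts (1) and (5) are there for.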

Now, the whole command:

   perl -pi.bak -e 's/${re}//g;' filename

scans through the file line_by_line, matching strings_connected_with
underscores on each line.  Björn Andersson noticed that you would need
the 'g' option to the s/// substitution command which means "repeat
this substitution more than once, if necessary".  Like in the first
line_of_this_paragraph.
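
To see what the 'g' option buys you, here is a small sketch, again in
Python's re module as a stand-in for Perl.  One thing to be aware of:
Python's sub() replaces every match by default, so the behaviour of a
plain s/// without 'g' is imitated here with count=1.

```python
import re

PATTERN = re.compile(r"\s*\w+_\w+\.?")

line = "scans through the file line_by_line, matching strings_connected_with"

# Like s/...// without 'g': only the first match on the line is removed.
once = PATTERN.sub("", line, count=1)

# Like s/...//g: every match on the line is removed.
every = PATTERN.sub("", line)

print(once)   # -> scans through the file, matching strings_connected_with
print(every)  # -> scans through the file, matching
```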

Then I realised that there were situations, like the last line of the
previous paragraph, where there wouldn't be any leading whitespace to
match.  The '*' in \s* covers that case, since it also accepts zero
whitespace characters.

Of course, this all depends on the sequences of words_connected_with_
underscores not wrapping around onto more than one line, as in this
contrived example, where the word 'underscores' on the second line of
this paragraph wouldn't be deleted.  There are several other edge
cases like that, if word-wrap is permitted. But it was never specified
if that was the case or not and I've assumed not because coping with
that sort of thing is a bit trickier.
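
The wrapping problem is easy to reproduce.  This sketch (Python's re
module standing in for Perl, with the two lines fed through one at a
time the way perl -p would feed them) shows the first half of the
wrapped sequence being removed and the second half surviving:

```python
import re

PATTERN = re.compile(r"\s*\w+_\w+\.?")

# Two consecutive lines of a file, with the underscored sequence
# broken across the line ending.
lines = [
    "depends on the sequences of words_connected_with_\n",
    "underscores not wrapping\n",
]

# Each line is processed independently, so the pattern never sees
# the two halves together.
result = [PATTERN.sub("", text) for text in lines]

print(repr(result[0]))  # -> 'depends on the sequences of\n'
print(repr(result[1]))  # -> 'underscores not wrapping\n'
```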

	Cheers,

	Matthew

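[1] In fact, due to the way regular expressions work (the first \w+ is
greedy and only gives back as much as it must), the literal underscore
(3) will actually match at the last underscore that still leaves
something for chunk (4) to match.  So the stuff matched by chunk (4)
won't contain any underscores, unless the text ends in an underscore,
in which case chunk (4) picks up the trailing underscore(s).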
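
A quick way to check the footnote is to put capturing parentheses
around chunks (2) and (4) and look at what each one grabbed.  This is
a sketch in Python's re module, not Perl, and the parentheses are an
addition for the demonstration only.  The second example shows the one
case where chunk (4) does contain an underscore: text that ends in
one.

```python
import re

# Same pattern, with chunks (2) and (4) captured as groups 1 and 2.
GROUPED = re.compile(r"\s*(\w+)_(\w+)\.?")

m = GROUPED.search("see stuff_separated_by_underscores here")
print(m.group(1))  # -> stuff_separated_by
print(m.group(2))  # -> underscores

# Trailing underscore: the literal '_' has to settle one underscore
# earlier so that chunk (4) still has something to match.
t = GROUPED.search("words_connected_with_")
print(t.group(1))  # -> words_connected
print(t.group(2))  # -> with_
```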

-- 
Dr Matthew J Seaman MA, D.Phil.                       26 The Paddocks
                                                      Savill Way
PGP: http://www.infracaninophile.co.uk/pgpkey         Marlow
Tel: +44 1628 476614                                  Bucks., SL7 1TH UK