Date:      Sun, 26 Jun 2016 16:41:47 +0200
From:      Eir Nym <eirnym@gmail.com>
To:        Polytropon <freebsd@edvax.de>
Cc:        Daniël de Kok <me@danieldk.eu>, freebsd-questions@freebsd.org
Subject:   Re: grep and anchoring
Message-ID:  <C23FABAB-1F4F-4654-917C-1E5A50E0E257@gmail.com>
In-Reply-To: <20160626163411.d05f863e.freebsd@edvax.de>
References:  <20232C89-B821-41EC-9188-C2A19C679BD8@danieldk.eu> <20160626163411.d05f863e.freebsd@edvax.de>

> On 26 Jun 2016, at 16:34, Polytropon <freebsd@edvax.de> wrote:
>
> On Sun, 26 Jun 2016 15:10:57 +0200, Daniël de Kok wrote:
>> Dear all,
>>
>> After a BSD hiatus of many years, I am tinkering with FreeBSD again.
>> I’ve run into some strange issue with grep and beginning of line (^)
>> anchoring:
>>
>> —
>> % echo "1234 1234 1234" | egrep -o '^….'
>> 1234
>> 123
>> 4 12
>> % echo "123412341234" | egrep -o '^....'
>> 1234
>> 1234
>> 1234
>> —
>>
>> Any idea what is going on here?
>
> I think what you see here is a typical "UTF-8 fsck-up".
> The first search pattern contains an ellipsis ("…",
> 3 bytes long in UTF-8, one character standing in for
> three dots), and a single
> dot (".", one byte long, 1 character); the second pattern
> contains four dots (4 x ".", 1 byte long, 1 character).
> Of course grep interprets "…" and "..." differently.
> In my mailer, I can see the difference clearly as the
> ellipsis … is displayed in monospace font as a _one_
> character wide symbol on the screen.
>

I think this was automatic spell correction and he meant 4 dot
symbols (.), not a ‘…’ and a ‘.’.
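
A quick way to check which of the two actually ended up in a pattern is
to dump its bytes. This is only a sketch: `hexbytes` is a helper name I
made up, and it assumes a POSIX sh with printf, od and tr. The ellipsis
is written as octal escapes so the result does not depend on how this
mail gets re-encoded.

```shell
# Hypothetical helper: print the bytes of its first argument as hex.
hexbytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# U+2026 HORIZONTAL ELLIPSIS, given as its three UTF-8 bytes in octal.
hexbytes "$(printf '\342\200\246')"; echo    # e280a6

# Three ASCII full stops.
hexbytes '...'; echo                         # 2e2e2e
```

Both come out as three bytes, so counting bytes alone would not tell
them apart; the hex values do.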

> Or is this just an "enrichment" your MUA added? :-)
>
> I'm quite sure you run into similar problems when you
> include ligatures (like st, ft, ffi, ck or the like)
> or one of the many different hyphens and spaces in a
> search pattern. :-)
>
> Otherwise, your example seems to show the expected
> behaviour.
>
> 	% echo "1234 1234 1234" | egrep -o '^....'
> 	1234
> 	 123
> 	4 12
>
> 	% echo "123412341234" | egrep -o '^....'
> 	1234
> 	1234
> 	1234
>
> First 4-character pattern is "1234", next is " 123",
> and last is "4 12" (each 4 characters wide, as the
> space character " " is also "any character" that matches
> the . pattern). In the second example, the groups match
> 4 characters each ("1234" x 3).
>
> What different results did you expect? Or am I misinterpreting
> your question?
>
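
For what it's worth, the chunking described above can be reproduced
without the anchor. A small sketch, assuming a grep that supports -o
and -E (`chunks` is just a name for illustration):

```shell
# Hypothetical helper: print every successive, non-overlapping
# 4-character run of its argument on a line of its own.
# Note that the pattern is NOT anchored with ^ here.
chunks() {
    printf '%s\n' "$1" | grep -oE '....'
}

chunks '1234 1234 1234'    # three runs: "1234", " 123", "4 12"
chunks '123412341234'      # three runs: "1234" each
```

Whether the pattern with ^ should print the same lines is a separate
matter; I have left that output out, since it may depend on the grep
implementation.
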
>
> -- 
> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...
> _______________________________________________
> freebsd-questions@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to "freebsd-questions-unsubscribe@freebsd.org"



