Date: Sun, 26 Jun 2016 16:41:47 +0200
From: Eir Nym <eirnym@gmail.com>
To: Polytropon <freebsd@edvax.de>
Cc: Daniël de Kok <me@danieldk.eu>, freebsd-questions@freebsd.org
Subject: Re: grep and anchoring
Message-ID: <C23FABAB-1F4F-4654-917C-1E5A50E0E257@gmail.com>
In-Reply-To: <20160626163411.d05f863e.freebsd@edvax.de>
References: <20232C89-B821-41EC-9188-C2A19C679BD8@danieldk.eu> <20160626163411.d05f863e.freebsd@edvax.de>

> On 26 Jun 2016, at 16:34, Polytropon <freebsd@edvax.de> wrote:
>
> On Sun, 26 Jun 2016 15:10:57 +0200, Daniël de Kok wrote:
>> Dear all,
>>
>> After a BSD hiatus of many years, I am tinkering with FreeBSD again.
>> I've run into a strange issue with grep and beginning of line (^)
>> anchoring:
>>
>> —
>> % echo "1234 1234 1234" | egrep -o '^….'
>> 1234
>>  123
>> 4 12
>> % echo "123412341234" | egrep -o '^....'
>> 1234
>> 1234
>> 1234
>> —
>>
>> Any idea what is going on here?
>
> I think what you see here is a typical "UTF-8 fsck-up".
> The first search pattern contains an ellipsis ("…",
> 3 bytes long in UTF-8, one character standing for three dots),
> and a single dot (".", one byte long, 1 character); the second
> pattern contains four dots (4 x ".", each 1 byte long, 1 character).
> Of course grep interprets "…" and "..." differently.
> In my mailer, I can see the difference clearly as the
> ellipsis … is displayed in monospace font as a _one_
> character wide symbol on the screen.
>

I think this was automatic spelling correction: he meant four dot
characters (....), not a '…' followed by a '.'.

> Or is this just an "enrichment" your MUA added? :-)
>
> I'm quite sure you run into similar problems when you
> include ligatures (like st, ft, ffi, ck or the like)
> or one of the many different hyphens and spaces in a
> search pattern. :-)
>
> Otherwise, your example seems to show the expected
> behaviour.
>
> % echo "1234 1234 1234" | egrep -o '^....'
> 1234
>  123
> 4 12
>
> % echo "123412341234" | egrep -o '^....'
> 1234
> 1234
> 1234
>
> The first 4-character match is "1234", the next is " 123",
> and the last is "4 12" (each 4 characters wide, as the
> space character " " is also "any character" that matches
> the . pattern). In the second example, the groups match
> 4 characters each ("1234" x 3).
>
> What different results did you expect? Or am I misinterpreting
> your question?
>
>
> --
> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...
> _______________________________________________
> freebsd-questions@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to "freebsd-questions-unsubscribe@freebsd.org"
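
For anyone who wants to check the ellipsis-versus-dots point without
depending on a particular grep, here is a minimal sketch in Python. It
is only an illustration under stated assumptions: the variable names
are made up for this example, and Python's re module is not grep, so
it says nothing about how egrep -o treats the ^ anchor. It shows that
"…" is a single literal character (three bytes in UTF-8) rather than
three "any character" dots, and that cutting the sample line into
consecutive 4-character groups gives the three strings quoted above.

```python
import re

ELLIPSIS = "\u2026"  # the single character "…" (U+2026)

# The ellipsis is one character but three bytes in UTF-8;
# three ASCII dots are three characters and three bytes.
ellipsis_chars = len(ELLIPSIS)
ellipsis_bytes = len(ELLIPSIS.encode("utf-8"))
dots_chars = len("...")
dots_bytes = len("...".encode("utf-8"))

line = "1234 1234 1234"  # sample input from the thread

# '^....' : four "any character" dots, anchored at the start of the line.
four_dots = re.match(r"^....", line)

# '^….' : a literal ellipsis followed by one "any character" dot.
# The sample line contains no ellipsis, so this cannot match.
ellipsis_dot = re.match("^" + ELLIPSIS + ".", line)

# Unanchored, non-overlapping 4-character groups (note: no ^ here).
groups = re.findall(r"....", line)

print(ellipsis_chars, ellipsis_bytes, dots_chars, dots_bytes)
print(four_dots.group(0), ellipsis_dot, groups)
```

If the pattern really had contained the ellipsis character, the first
command could not have printed anything for that input, which fits the
guess that the "…" was introduced by the mail client and not typed at
the shell.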