From: Polytropon
To: Daniël de Kok
Cc: freebsd-questions@freebsd.org
Date: Sun, 26 Jun 2016 16:34:11 +0200
Subject: Re: grep and anchoring
Message-Id: <20160626163411.d05f863e.freebsd@edvax.de>
In-Reply-To: <20232C89-B821-41EC-9188-C2A19C679BD8@danieldk.eu>

On Sun, 26 Jun 2016 15:10:57 +0200, Daniël de Kok wrote:
> Dear all,
>
> After a BSD hiatus of many years, I am tinkering with
> FreeBSD again.
> I’ve run into some strange issue with grep and beginning of line (^)
> anchoring:
>
> —
> % echo "1234 1234 1234" | egrep -o '^….'
> 1234
>  123
> 4 12
> % echo "123412341234" | egrep -o '^....'
> 1234
> 1234
> 1234
> —
>
> Any idea what is going on here?

I think what you see here is a typical "UTF-8 fsck-up". The first
search pattern contains an ellipsis ("…", a single character that is
3 bytes long in UTF-8 and merely looks like three dots) followed by
a single dot (".", one byte, one character); the second pattern
contains four dots (4 x ".", each one byte and one character). Of
course grep interprets "…" and "..." differently: the ellipsis is a
literal character, whereas each dot means "any character".

In my mailer, I can see the difference clearly, as the ellipsis …
is displayed in a monospace font as a symbol only _one_ character
wide. Or is this just an "enrichment" your MUA added? :-)

I'm quite sure you will run into similar problems when you include
ligatures (like st, ft, ffi, ck or the like) or one of the many
different hyphens and spaces in a search pattern. :-)

Otherwise, your example seems to show the expected behaviour.

	% echo "1234 1234 1234" | egrep -o '^....'
	1234
	 123
	4 12
	% echo "123412341234" | egrep -o '^....'
	1234
	1234
	1234

The first 4-character match is "1234", the next is " 123", and the
last is "4 12" (each 4 characters wide, as the space character " "
is also "any character" that matches the . pattern). In the second
example, the groups match 4 characters each ("1234" x 3).

What different results did you expect? Or am I misinterpreting
your question?

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
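
P.S. The difference between "…" and "..." can be checked outside of grep. What follows is only a minimal sketch in Python, not a statement about how any particular grep is implemented. The variable names are made up for illustration, and the ellipsis is assumed to be U+2026 (HORIZONTAL ELLIPSIS). The last step drops the ^ anchor on purpose, because Python's re.findall reports only the first group when the pattern is anchored.

```python
import re

ellipsis = "\u2026"   # assumed: U+2026 HORIZONTAL ELLIPSIS, one character
three_dots = "..."    # three separate U+002E FULL STOP characters

# Character count versus UTF-8 byte count for both spellings.
ellipsis_chars = len(ellipsis)
ellipsis_bytes = ellipsis.encode("utf-8")
dots_chars = len(three_dots)
dots_bytes = three_dots.encode("utf-8")

print(ellipsis_chars, ellipsis_bytes.hex(" "))  # 1 e2 80 a6
print(dots_chars, dots_bytes.hex(" "))          # 3 2e 2e 2e

line = "1234 1234 1234"

# "…" is a literal character in a regular expression, so '^….' asks for
# a line that starts with an ellipsis followed by any one character.
pattern_with_ellipsis = re.compile("^\u2026.")
print(pattern_with_ellipsis.match(line))        # None

# Four dots mean "any four characters"; a space counts as a character.
# Unanchored, non-overlapping matches give the 4-character groups.
groups = re.findall("....", line)
print(groups)                                   # ['1234', ' 123', '4 12']
```

Both spellings happen to occupy three bytes in UTF-8, so counting bytes alone does not tell them apart; the character count and the byte values do.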