Message-ID: <5224C08E.1070404@FreeBSD.org>
Date: Mon, 02 Sep 2013 19:45:02 +0300
From: Andriy Gapon
To: FreeBSD Current
Subject: Re: bug with special bracket expressions in regular expressions
References: <5224A693.3000904@FreeBSD.org>
In-Reply-To: <5224A693.3000904@FreeBSD.org>

on 02/09/2013 17:54 Andriy Gapon said the following:
>
> re_format(7) says:
>      There are two special cases‡ of bracket expressions: the bracket
>      expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>      beginning and end of a word respectively.  A word is defined as a
>      sequence of word characters which is neither preceded nor followed by
>      word characters.  A word character is an alnum character (as defined
>      by ctype(3)) or an underscore.  This is an extension, compatible with
>      but not specified by IEEE Std 1003.2 (“POSIX.2”), and should be used
>      with caution in software intended to be portable to other systems.
>
> However I observe the following:
> $ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
> xx
> $ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
> cd1 xx
>
> In my opinion '[[:<:]]' should not affect how the pattern is matched in this case.

It seems that the code works like this:
- first it matches "cd0 " and "removes" it
- then it passes "cd1 xx" for matching with a flag that tells that this is not
  a real start of the string
- thus the matching code
  o  knows that this is not a real line start, so it can't match [[:<:]] just
     for that reason
  o  it does _not_ know what was the character before the start of the given
     substring, so it can not know if it could match [[:<:]]

So matching fails.
Not sure if this is an internal problem of regex(3) or a problem of how sed(1)
uses regex(3).

-- 
Andriy Gapon
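
The sequence described above can be modelled with a small Python sketch. This
is only a model of the suspected behavior, not the actual regex(3) or sed(1)
code: the names find_match and strip_all, and the pass_context switch, are made
up for illustration. The word-start test is done by hand, because Python's re
module has no '[[:<:]]'.

```python
import re

# The pattern from the report, minus the leading [[:<:]] (checked by hand below).
BODY = re.compile(r"cd[0-9][^ ]* *")


def is_word_char(c):
    # re_format(7): a word character is an alnum character or an underscore.
    return c.isalnum() or c == "_"


def find_match(sub, notbol, prev_char=None):
    """Find the first match of '[[:<:]]' + BODY in sub (hypothetical model).

    notbol    -- True if sub is not the real start of the line
    prev_char -- the character before sub, if the caller passes it along
    """
    for i in range(len(sub)):
        m = BODY.match(sub, i)
        if m is None or not is_word_char(sub[i]):
            continue
        if i > 0:
            at_word_start = not is_word_char(sub[i - 1])
        elif not notbol:
            at_word_start = True            # real start of the line
        elif prev_char is not None:
            at_word_start = not is_word_char(prev_char)
        else:
            at_word_start = False           # no context: refuse to match
        if at_word_start:
            return m
    return None


def strip_all(line, pass_context):
    """Model of s/[[:<:]]cd[0-9][^ ]* *//g applied to one line."""
    out, rest, notbol, prev = [], line, False, None
    while rest:
        m = find_match(rest, notbol, prev if pass_context else None)
        if m is None:
            out.append(rest)
            break
        out.append(rest[:m.start()])
        prev = rest[m.end() - 1]            # last character consumed
        rest = rest[m.end():]               # only the remainder is passed on
        notbol = True
    return "".join(out)


print(strip_all("cd0 cd1 xx", pass_context=False))  # the observed result
print(strip_all("cd0 cd1 xx", pass_context=True))   # the expected result
```

With pass_context=False the model reproduces the "cd1 xx" output from the
report; with the preceding character passed along it produces "xx". Whether the
real fix belongs in regex(3) or in how sed(1) calls it is still the open
question.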