From: Kyle Evans
Date: Sun, 7 Jun 2020 09:04:02 -0500
Subject: Re: svn commit: r361884 - in head/usr.bin/sed: . tests
To: "Rodney W. Grimes"
Cc: src-committers, svn-src-all, svn-src-head

On Sun, Jun 7, 2020 at 8:31 AM Rodney W. Grimes wrote:
>
> > Author: kevans
> > Date: Sun Jun 7 04:32:38 2020
> > New Revision: 361884
> > URL: https://svnweb.freebsd.org/changeset/base/361884
> >
> > Log:
> >   sed: attempt to learn about hex escapes (e.g. \x27)
> >
> >   Somewhat predictably, software often wants to use \x27/\x24 among others so
> >   that they can decline worrying about ugly escaping, if said escaping is even
> >   possible. Right now, this software is using these and getting the wrong
> >   results, as we'll interpret those as x27 and x24 respectively. Some examples
> >   of this, when an exp-run was ran, were science/octopus and misc/vifm.
> >
> >   Go ahead and process these at all times. We allow either one or two digits,
> >   and the tests account for both. If extra digits are specified, e.g. \x2727,
> >   then the third and fourth digits are interpreted literally as one might
> >   expect.
>
> Does it work to do \\x27, ie I want it to NOT do \x27 so I can sed
> on files that contain sequences of escapes.

I'm so glad you asked this. :-)

For your immediate answer: yes, the semantics there work as you expect.

For the long answer, that's actually what you should have been doing
all along; raising awareness of that fact is what PR 229925 aims to
do, by switching our interpretation of the UB for escaping ordinary
characters to make them an error when they are not specially
interpreted.

Prior to this change, if you had:

    printf "\\\\x27\n" | sed -e 's/\x27//'

What you end up with is actually *not* an empty string with a newline,
but just a single backslash!
\x27 in the pattern gets passed through to the underlying regex(3)
implementation, which then happily interprets \x => x and replaces the
literal 'x27', leaving \ -- which is perhaps not what you might have
expected if \x27 didn't have special meaning and it almost certainly
isn't what you wanted. With the new sed, you can change 'x27' to 'b27'
in both strings above to see what I mean.

In the New World Order, all regex(3) users will be forced to be
precise here so that we don't get it wrong. This is especially
important when I add GNU extensions to libregex, because some of those
escaped-ordinaries will now be granted special meaning, so \s will no
longer match a literal s but instead [[:space:]]; using the
unadulterated libc regex(3) interface instead will give you an error
and allow you to detect whether you're accidentally using libc
regex(3) rather than the GNU-extended libregex.

This is going to be a large and potentially world-breaking change for
many, but I think we'll all be better for it in the end. The symbol
version of regcomp will get bumped, so that older binaries will
continue to operate with the old escaping behavior in case that was
actually pertinent to their functionality.

Thanks,

Kyle Evans
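
The escaping semantics discussed above can be sketched as follows. This
is only an illustration, not taken from the commit's tests: it assumes a
sed that understands hex escapes (FreeBSD sed after r361884, or GNU
sed), and the variable names are hypothetical.

```shell
#!/bin/sh
# Assumption: sed interprets \xHH in patterns (post-r361884 FreeBSD sed or GNU sed).

# \x27 in the pattern stands for a single quote, so the apostrophe is removed.
hex_result=$(printf '%s\n' "it's" | sed -e 's/\x27//')

# \\x27 in the pattern is an escaped backslash followed by the literal text x27,
# so it matches the four characters \x27 in the input rather than a quote.
literal_result=$(printf '%s\n' '\x27' | sed -e 's/\\x27//')

printf 'hex escape:     [%s]\n' "$hex_result"      # its
printf 'escaped escape: [%s]\n' "$literal_result"  # empty
```

Passing the input through printf '%s\n' keeps printf itself from
interpreting the backslash sequence, so only sed's handling is on
display.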