From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 243229] awk in base system does not work with UTF-8 strings correctly
Date: Fri, 10 Jan 2020 01:47:07 +0000
X-Bugzilla-Product: Base System
X-Bugzilla-Component: misc
X-Bugzilla-Version: 12.1-RELEASE
X-Bugzilla-Who: cem@freebsd.org
X-Bugzilla-Status: New

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243229

--- Comment #1 from Conrad Meyer ---
I'm not sure it makes sense to compute length() on UTF-8 strings as Unicode
codepoints.

POSIX awk is somewhat clear that you're correct:

> LC_CTYPE
>   Determine the locale for the interpretation of sequences of bytes of text
>   data as characters (for example, single-byte as opposed to multi-byte
>   characters in arguments and input files), the behavior of character classes
>   within regular expressions, the identification of characters as letters, and
>   the mapping of uppercase and lowercase characters for the toupper and
>   tolower functions.

However, the resulting behavior around indexing is nutty: this implies that
index(), match(), etc., are measured in *characters*.  To do this efficiently
one probably has to convert non-ASCII strings to wchar_t and operate on those.
As you could imagine, that would immensely slow down awk as a fast stream
processing utility.

POSIX is more explicit about toupper() and tolower(), where taking locale into
consideration is easier.

I guess I'm not clear on what value there is in a length() function that
operates on codepoints rather than bytes.
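
To make the bytes-versus-codepoints distinction concrete, here is a minimal
sketch in C. It is not taken from the awk sources; the function names
byte_length and utf8_codepoints are hypothetical, and the codepoint counter
assumes well-formed UTF-8 input and performs no validation. It only
illustrates the two possible answers for length(); it says nothing about how
index() or match() would have to be handled.

```c
#include <stddef.h>
#include <string.h>

/* Byte length: the answer a byte-oriented length() would give. */
size_t byte_length(const char *s)
{
    return strlen(s);
}

/*
 * Codepoint count for well-formed UTF-8 (assumption: input is valid).
 * Every codepoint starts with exactly one byte that is NOT a
 * continuation byte (10xxxxxx), so counting those bytes counts codepoints.
 */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a pure-ASCII string the two functions agree. For the UTF-8 encoding of a
string such as "h\xc3\xa9llo" they differ: six bytes, five codepoints. Note
that counting still walks every byte, so the cost concern raised above is
mostly about index() and match(), where positions have to be reported in
characters.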