Date:      Fri, 10 Jan 2020 01:47:07 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 243229] awk in base system does not work with UTF-8 strings correctly
Message-ID:  <bug-243229-227-EMzg8wht7b@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-243229-227@https.bugs.freebsd.org/bugzilla/>
References:  <bug-243229-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243229

--- Comment #1 from Conrad Meyer <cem@freebsd.org> ---
I'm not sure it makes sense to compute length() on UTF-8 strings as Unicode
codepoints.  POSIX awk is somewhat clear that you're correct:


> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of text
> data as characters (for example, single-byte as opposed to multi-byte
> characters in arguments and input files), the behavior of character classes
> within regular expressions, the identification of characters as letters, and
> the mapping of uppercase and lowercase characters for the toupper and
> tolower functions.

However, the resulting behavior around indexing is nutty: this implies that
index(), match(), etc., are measured in *characters*.  To do this efficiently
one probably has to convert non-ASCII strings to wchar_t and operate on those.
As you could imagine, that would immensely slow down awk as a fast stream
processing utility.
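
To make the byte/character distinction concrete, here is a minimal Python
sketch (not awk, and not how any awk implementation does it). The function
names are hypothetical and exist only for illustration. It shows that the
same string yields different length() and index() results depending on
whether positions are counted in UTF-8 bytes or in codepoints, and that
codepoints in valid UTF-8 can be counted without a wide-character
conversion by skipping continuation bytes.

```python
def byte_length(s: str) -> int:
    """Length of s in bytes once encoded as UTF-8."""
    return len(s.encode("utf-8"))


def char_length(s: str) -> int:
    """Length of s in Unicode codepoints (Python's native str length)."""
    return len(s)


def count_codepoints_utf8(data: bytes) -> int:
    """Count codepoints in valid UTF-8 without decoding.

    Continuation bytes have the bit pattern 10xxxxxx, so every byte that
    does not match that pattern starts a new codepoint.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def awk_style_index(s: str, t: str, by_bytes: bool) -> int:
    """1-based position of t in s, or 0 if absent (awk's index() convention).

    by_bytes selects whether the position is measured in UTF-8 bytes or
    in codepoints. This is a hypothetical helper, not an awk API.
    """
    if by_bytes:
        return s.encode("utf-8").find(t.encode("utf-8")) + 1
    return s.find(t) + 1


if __name__ == "__main__":
    word = "na\u00efve"  # U+00EF takes two bytes in UTF-8
    print(byte_length(word), char_length(word))
    print(awk_style_index(word, "v", by_bytes=True),
          awk_style_index(word, "v", by_bytes=False))
```

For the sample word, the byte-based and codepoint-based answers differ by
one for everything at or after the two-byte character, which is the
mismatch under discussion.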

POSIX is more explicit about toupper() and tolower(), where taking locale into
consideration is easier.

I guess I'm not clear on what value a length() function has when it operates
on codepoints rather than bytes.

-- 
You are receiving this mail because:
You are the assignee for the bug.


