Date: Fri, 10 Jan 2020 01:47:07 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 243229] awk in base system does not work with UTF-8 strings correctly
Message-ID: <bug-243229-227-EMzg8wht7b@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-243229-227@https.bugs.freebsd.org/bugzilla/>
References: <bug-243229-227@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243229

--- Comment #1 from Conrad Meyer <cem@freebsd.org> ---
I'm not sure it makes sense to compute length() on UTF-8 strings as Unicode
codepoints.

POSIX awk is somewhat clear that you're correct:

> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of text
> data as characters (for example, single-byte as opposed to multi-byte
> characters in arguments and input files), the behavior of character classes
> within regular expressions, the identification of characters as letters, and
> the mapping of uppercase and lowercase characters for the toupper and
> tolower functions.

However, the resulting behavior around indexing is nutty: this implies that
index(), match(), etc., are measured in *characters*. To do this efficiently
one probably has to convert non-ASCII strings to wchar_t and operate on those.
As you could imagine, that would immensely slow down awk as a fast stream
processing utility.

POSIX is more explicit about toupper() and tolower(), where taking locale into
consideration is easier.

I guess I'm not clear on what value a length() function has that operates on
codepoints rather than bytes.

-- 
You are receiving this mail because:
You are the assignee for the bug.