Date:      Tue, 23 Jun 2020 14:14:03 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 247494] sort(1) order affected by LC_CTYPE
Message-ID:  <bug-247494-227-3yC9EAK1fK@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-247494-227@https.bugs.freebsd.org/bugzilla/>
References:  <bug-247494-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247494

--- Comment #1 from Conrad Meyer <cem@freebsd.org> ---
On CURRENT:

$ LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C locale
LANG=C
LC_CTYPE=ja_JP.UTF-8
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

sort(1) tries to detect when it can run in a fast, byte-compare-only mode, and
it makes that decision by looking only at LC_COLLATE.  The --debug option
shows more information:

$ (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug
Memory to be used for sorting: 17100230656
Using collate rules of C locale
Byte sort is used
sort_method=radixsort
; offset=1
; k1=<耳>(1), k2=<脳>(1); offset=1; s1=<耳>, s2=<脳>; cmp1=0
; offset=1
; k1=<脳>(1), k2=<耳>(1); offset=1; s1=<脳>, s2=<耳>; cmp1=0
耳
脳
耳

Both comparisons seem wrong: cmp1=0 reports the keys as equal, but the UTF-8
sequences (e8 80 b3 and e8 84 b3) share only the first byte, 0xe8, as a common
prefix.
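
The byte-level claim can be checked independently of sort(1).  The following
is a minimal Python sketch, not anything taken from the sort(1) sources; the
helper name common_prefix_len is made up for illustration.

```python
# Illustrative check only; this is not code from sort(1).
ear = "耳".encode("utf-8")    # U+8033
brain = "脳".encode("utf-8")  # U+8133


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return how many leading bytes a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


print(ear.hex(" "))                   # e8 80 b3
print(brain.hex(" "))                 # e8 84 b3
print(common_prefix_len(ear, brain))  # 1
```

The two keys diverge at the second byte (0x80 versus 0x84), so a byte-wise
comparison of the full keys should not return 0.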

In LC_CTYPE=C mode:

; offset=1
; k1=<耳>(3), k2=<脳>(3); offset=1; s1=<耳>, s2=<脳>; cmp1=-4
; offset=1
; k1=<脳>(3), k2=<耳>(3); offset=1; s1=<脳>, s2=<耳>; cmp1=4
; offset=1
; k1=<耳>(3), k2=<耳>(3); offset=1; s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳
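
The cmp1 values in this trace are what a plain byte-wise comparison of the
UTF-8 encodings would give, assuming the reported number is the difference
between the first pair of differing bytes (an assumption about the debug
output, not something confirmed from the sort(1) sources).  A small Python
sketch with a hypothetical byte_cmp helper:

```python
# Illustrative only; byte_cmp is a hypothetical stand-in for a
# memcmp-style comparison and is not taken from sort(1).
def byte_cmp(a: bytes, b: bytes) -> int:
    """Difference of the first pair of differing bytes, else length difference."""
    for x, y in zip(a, b):
        if x != y:
            return x - y
    return len(a) - len(b)


lines = ["耳", "脳", "耳"]
keys = [s.encode("utf-8") for s in lines]

print(byte_cmp(keys[0], keys[1]))  # -4  (0x80 - 0x84)
print(byte_cmp(keys[1], keys[2]))  # 4   (0x84 - 0x80)
print(byte_cmp(keys[0], keys[2]))  # 0   (identical keys)

# Sorting by encoded bytes reproduces the order printed above.
print(sorted(lines, key=lambda s: s.encode("utf-8")))  # ['耳', '耳', '脳']
```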

Here the comparisons look correct.  I will investigate further.  I think this
is a bug rather than intended behavior, but I am not sure yet.

-- 
You are receiving this mail because:
You are the assignee for the bug.
