Date:      Tue, 23 Jun 2020 14:42:57 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 247494] sort(1) order affected by LC_CTYPE
Message-ID:  <bug-247494-227-dOAHDvuAzn@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-247494-227@https.bugs.freebsd.org/bugzilla/>
References:  <bug-247494-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247494

--- Comment #2 from Conrad Meyer <cem@freebsd.org> ---
I think the lengths printed in the bad example are correct; that is a measure
of wchar_t's, whereas in LC_CTYPE=C, the length is in bytes.  So it seems like
it is a comparison problem.

I think we invoke wstrcoll() -> bwscoll() in the latter case.  bwscoll() seems
to be broken for short strings:

        if (len1 <= offset)
                return ((len2 <= offset) ? 0 : -1);

E.g.:

$ (echo a耳 ; echo a脳 ; echo a耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug
...
; offset=1
; k1=<a耳>(2), k2=<a脳>(2); offset=1; s1=<a耳>, s2=<a脳>; cmp1=-256
; offset=1
; k1=<a脳>(2), k2=<a耳>(2); offset=1; s1=<a脳>, s2=<a耳>; cmp1=256
; offset=1
; k1=<a耳>(2), k2=<a耳>(2); offset=1; s1=<a耳>, s2=<a耳>; cmp1=0
a耳
a耳
a脳

The result is correct, because length (2) > offset (1), so the early return
above is not taken.  I don't know if 'offset' here is wrong, or if bwscoll()
is wrong.  It seems like maybe it only invokes bwscoll() on strings it thinks
are identical from a radix perspective.  So perhaps the problem is some
combination of wcstr and byte_sort in radixsort.

In --mergesort mode, the result and comparisons are correct:

$ (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --mergesort --debug
Memory to be used for sorting: 17100230656
Using collate rules of C locale
Byte sort is used
sort_method=mergesort
; k1=<耳>(1), k2=<脳>(1); s1=<耳>, s2=<脳>; cmp1=-256
; k1=<脳>(1), k2=<耳>(1); s1=<脳>, s2=<耳>; cmp1=256
; k1=<耳>(1), k2=<耳>(1); s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳

Something is broken in radixsort.

-- 
You are receiving this mail because:
You are the assignee for the bug.


