From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 247494] sort(1) order affected by LC_CTYPE
Date: Tue, 23 Jun 2020 14:42:57 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Product: Base System
X-Bugzilla-Component: bin
X-Bugzilla-Version: 12.1-STABLE
X-Bugzilla-Severity: Affects Many People
X-Bugzilla-Who: cem@freebsd.org
X-Bugzilla-Status: New
X-Bugzilla-Assigned-To: bugs@FreeBSD.org
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247494

--- Comment #2 from Conrad Meyer ---
I think the lengths printed in the bad example are correct; that is a
measure of wchar_t's, whereas in LC_CTYPE=C, the length is in bytes. So it
seems like it is a comparison problem.

I think we invoke wstrcoll() -> bwscoll() in the latter case. bwscoll()
seems to be broken for short strings:

        if (len1 <= offset)
                return ((len2 <= offset) ? 0 : -1);

E.g.,

$ (echo a耳 ; echo a脳 ; echo a耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug
...
; offset=1
; k1=(2), k2=(2); offset=1; s1=, s2=; cmp1=-256
; offset=1
; k1=(2), k2=(2); offset=1; s1=, s2=; cmp1=256
; offset=1
; k1=(2), k2=(2); offset=1; s1=, s2=; cmp1=0
a耳
a耳
a脳

The result is correct here, because length (2) > offset (1), so the early
return above is not taken.

I don't know if 'offset' here is wrong, or if bwscoll() is wrong. It seems
like maybe it only invokes bwscoll() on strings it thinks are identical
from a radix perspective. So perhaps the problem is some combination of
wcstr and byte_sort in radixsort.

In --mergesort mode, the result and comparisons are correct:

$ (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --mergesort --debug
Memory to be used for sorting: 17100230656
Using collate rules of C locale
Byte sort is used
sort_method=mergesort
; k1=<耳>(1), k2=<脳>(1); s1=<耳>, s2=<脳>; cmp1=-256
; k1=<脳>(1), k2=<耳>(1); s1=<脳>, s2=<耳>; cmp1=256
; k1=<耳>(1), k2=<耳>(1); s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳

Something is broken in radixsort.
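
To make the suspected failure mode concrete, here is a minimal stand-alone
sketch of the early-return pattern quoted from bwscoll(). It is not the real
sort(1) code: the function name cmp_from_offset() and everything after the
first two checks are hypothetical, and it assumes (unconfirmed) that the
radix sort passes an offset equal to the number of leading characters it
already believes to be equal.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical stand-in for the pattern discussed above, NOT the actual
 * bwscoll() from FreeBSD sort(1). Compares two wide strings, skipping the
 * first 'offset' characters on the assumption that they are already equal.
 */
static int
cmp_from_offset(const wchar_t *s1, size_t len1,
    const wchar_t *s2, size_t len2, size_t offset)
{
	/* Early return: nothing at or past 'offset' is ever inspected. */
	if (len1 <= offset)
		return ((len2 <= offset) ? 0 : -1);
	if (len2 <= offset)
		return (1);

	/* Compare the remaining characters code point by code point. */
	size_t n = (len1 < len2) ? len1 : len2;
	for (size_t i = offset; i < n; i++) {
		if (s1[i] != s2[i])
			return ((s1[i] < s2[i]) ? -1 : 1);
	}

	/* Common part is equal: the shorter string sorts first. */
	if (len1 == len2)
		return (0);
	return ((len1 < len2) ? -1 : 1);
}
```

Under that assumption, calling it on the one-character strings 耳 (U+8033)
and 脳 (U+8133) with offset 1 reports them as equal even though they differ,
while the same call with offset 0, or on a耳 and a脳 with offset 1, orders
them as expected. That matches the observation that the prefixed example
sorts correctly, and would point at how 'offset' is chosen rather than at
the comparison loop itself.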