Date: Fri, 15 Dec 2023 00:42:46 GMT
Message-Id: <202312150042.3BF0gk3Z093442@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-branches@FreeBSD.org
From: Christos Margiolis
Subject: git: bd1739a707ff - stable/14 - sort: test against all month formats in month-sort
List-Id: Commits to the stable branches of the FreeBSD src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-branches
Sender: owner-dev-commits-src-branches@freebsd.org
X-BeenThere:
 dev-commits-src-branches@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: christos
X-Git-Repository: src
X-Git-Refname: refs/heads/stable/14
X-Git-Reftype: branch
X-Git-Commit: bd1739a707ff0bda50dedb8aa58b2b26254bdda3
Auto-Submitted: auto-generated

The branch stable/14 has been updated by christos:

URL: https://cgit.FreeBSD.org/src/commit/?id=bd1739a707ff0bda50dedb8aa58b2b26254bdda3

commit bd1739a707ff0bda50dedb8aa58b2b26254bdda3
Author:     Christos Margiolis
AuthorDate: 2023-12-01 00:30:10 +0000
Commit:     Christos Margiolis
CommitDate: 2023-12-15 00:42:26 +0000

    sort: test against all month formats in month-sort

    The CLDR specification [1] defines three possible month formats:
    - Abbreviation (e.g. Jan, Ιαν)
    - Full (e.g. January, Ιανουαρίου)
    - Standalone (e.g. January, Ιανουάριος)

    Many languages use different case endings depending on whether the
    month is referenced as a standalone word (nominative case), or in date
    context (genitive, partitive, etc.). sort(1)'s -M option currently
    sorts months by testing input against only the abbreviation format,
    which is essentially a substring of the full format. While this works
    fine for languages like English, where there are no cases, it is not
    sufficient for languages where the abbreviation/full formats and the
    standalone format have different case endings.

    For example, in Greek, "May" can take the following forms:

    Abbreviation: Μαΐ (genitive case)
    Full: Μαΐου (genitive case)
    Standalone: Μάιος (nominative case)

    If the input uses the standalone format in Greek, sort(1) will not be
    able to match "Μαΐ" against "Μάιος" and the sort will fail.

    This change makes sort(1) test against all three formats. It also
    works when the input contains mixed formats.

    [1] https://cldr.unicode.org/translation/date-time/date-time-patterns

    Reviewed by:    markj
    MFC after:      2 weeks
    Differential Revision:  https://reviews.freebsd.org/D42847

    (cherry picked from commit 3d44dce90a6946e2ef2ab30ffbf8e2930acf888b)
---
 usr.bin/sort/bwstring.c                   | 144 +++++++++++++++++++--------
 usr.bin/sort/sort.1.in                    |   6 +-
 usr.bin/sort/tests/Makefile               |   1 +
 usr.bin/sort/tests/sort_monthsort_test.sh | 159 ++++++++++++++++++++++++++++++
 4 files changed, 263 insertions(+), 47 deletions(-)

diff --git a/usr.bin/sort/bwstring.c b/usr.bin/sort/bwstring.c
index fc1b50cb78ac..b0c14e996b23 100644
--- a/usr.bin/sort/bwstring.c
+++ b/usr.bin/sort/bwstring.c
@@ -43,63 +43,114 @@
 
 bool byte_sort;
 
-static wchar_t **wmonths;
-static char **cmonths;
+struct wmonth {
+	wchar_t *mon;
+	wchar_t *ab;
+	wchar_t *alt;
+};
 
-/* initialise months */
+struct cmonth {
+	char *mon;
+	char *ab;
+	char *alt;
+};
+
+static struct wmonth *wmonths;
+static struct cmonth *cmonths;
+
+static int
+populate_cmonth(char **field, const nl_item item, int idx)
+{
+	char *tmp, *m;
+	size_t i, len;
+
+	tmp = nl_langinfo(item);
+	if (debug_sort)
+		printf("month[%d]=%s\n", idx, tmp);
+	if (*tmp == '\0')
+		return (0);
+	m = sort_strdup(tmp);
+	len = strlen(tmp);
+	for (i = 0; i < len; i++)
+		m[i] = toupper(m[i]);
+	*field = m;
+
+	return (1);
+}
+
+static int
+populate_wmonth(wchar_t **field, const nl_item item, int idx)
+{
+	wchar_t *m;
+	char *tmp;
+	size_t i, len;
+
+	tmp = nl_langinfo(item);
+	if (debug_sort)
+		printf("month[%d]=%s\n", idx, tmp);
+	if (*tmp == '\0')
+		return (0);
+	len = strlen(tmp);
+	m = sort_malloc(SIZEOF_WCHAR_STRING(len + 1));
+	if (mbstowcs(m, tmp, len) == ((size_t) - 1)) {
+		sort_free(m);
+		return (0);
+	}
+	m[len] = L'\0';
+	for (i = 0; i < len; i++)
+		m[i] = towupper(m[i]);
+	*field = m;
+
+	return (1);
+}
 
 void
 initialise_months(void)
 {
-	const nl_item item[12] = { ABMON_1, ABMON_2, ABMON_3, ABMON_4,
+	const nl_item mon_item[12] = { MON_1, MON_2, MON_3, MON_4,
+	    MON_5, MON_6, MON_7, MON_8, MON_9, MON_10,
+	    MON_11, MON_12 };
+	const nl_item ab_item[12] = { ABMON_1, ABMON_2, ABMON_3, ABMON_4,
 	    ABMON_5, ABMON_6, ABMON_7, ABMON_8, ABMON_9, ABMON_10,
 	    ABMON_11, ABMON_12 };
-	char *tmp;
-	size_t len;
-
+	const nl_item alt_item[12] = { ALTMON_1, ALTMON_2, ALTMON_3, ALTMON_4,
+	    ALTMON_5, ALTMON_6, ALTMON_7, ALTMON_8, ALTMON_9, ALTMON_10,
+	    ALTMON_11, ALTMON_12 };
+	int i;
+
+	/*
+	 * Handle all possible month formats: abbrevation, full name,
+	 * standalone name (without case ending).
+	 */
 	if (mb_cur_max == 1) {
 		if (cmonths == NULL) {
-			char *m;
-
-			cmonths = sort_malloc(sizeof(char*) * 12);
-			for (int i = 0; i < 12; i++) {
-				cmonths[i] = NULL;
-				tmp = nl_langinfo(item[i]);
-				if (debug_sort)
-					printf("month[%d]=%s\n", i, tmp);
-				if (*tmp == '\0')
+			cmonths = sort_malloc(sizeof(struct cmonth) * 12);
+			for (i = 0; i < 12; i++) {
+				if (!populate_cmonth(&cmonths[i].mon,
+				    mon_item[i], i))
+					continue;
+				if (!populate_cmonth(&cmonths[i].ab,
+				    ab_item[i], i))
+					continue;
+				if (!populate_cmonth(&cmonths[i].alt,
+				    alt_item[i], i))
 					continue;
-				m = sort_strdup(tmp);
-				len = strlen(tmp);
-				for (unsigned int j = 0; j < len; j++)
-					m[j] = toupper(m[j]);
-				cmonths[i] = m;
 			}
 		}
 	} else {
 		if (wmonths == NULL) {
-			wchar_t *m;
-
-			wmonths = sort_malloc(sizeof(wchar_t *) * 12);
-			for (int i = 0; i < 12; i++) {
-				wmonths[i] = NULL;
-				tmp = nl_langinfo(item[i]);
-				if (debug_sort)
-					printf("month[%d]=%s\n", i, tmp);
-				if (*tmp == '\0')
+			wmonths = sort_malloc(sizeof(struct wmonth) * 12);
+			for (i = 0; i < 12; i++) {
+				if (!populate_wmonth(&wmonths[i].mon,
+				    mon_item[i], i))
 					continue;
-				len = strlen(tmp);
-				m = sort_malloc(SIZEOF_WCHAR_STRING(len + 1));
-				if (mbstowcs(m, tmp, len) ==
-				    ((size_t) - 1)) {
-					sort_free(m);
+				if (!populate_wmonth(&wmonths[i].ab,
+				    ab_item[i], i))
+					continue;
+				if (!populate_wmonth(&wmonths[i].alt,
+				    alt_item[i], i))
 					continue;
-				}
-				m[len] = L'\0';
-				for (unsigned int j = 0; j < len; j++)
-					m[j] = towupper(m[j]);
-				wmonths[i] = m;
 			}
 		}
 	}
@@ -754,8 +805,11 @@ bws_month_score(const struct bwstring *s0)
 			++s;
 
 		for (int i = 11; i >= 0; --i) {
-			if (cmonths[i] &&
-			    (s == strstr(s, cmonths[i])))
+			if (cmonths[i].mon && (s == strstr(s, cmonths[i].mon)))
+				return (i);
+			if (cmonths[i].ab && (s == strstr(s, cmonths[i].ab)))
+				return (i);
+			if (cmonths[i].alt && (s == strstr(s, cmonths[i].alt)))
 				return (i);
 		}
 
@@ -769,7 +823,11 @@ bws_month_score(const struct bwstring *s0)
 			++s;
 
 		for (int i = 11; i >= 0; --i) {
-			if (wmonths[i] && (s == wcsstr(s, wmonths[i])))
+			if (wmonths[i].ab && (s == wcsstr(s, wmonths[i].ab)))
+				return (i);
+			if (wmonths[i].mon && (s == wcsstr(s, wmonths[i].mon)))
+				return (i);
+			if (wmonths[i].alt && (s == wcsstr(s, wmonths[i].alt)))
 				return (i);
 		}
 	}
diff --git a/usr.bin/sort/sort.1.in b/usr.bin/sort/sort.1.in
index 4e27838a9250..80cc1dcb0282 100644
--- a/usr.bin/sort/sort.1.in
+++ b/usr.bin/sort/sort.1.in
@@ -30,9 +30,7 @@
 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 .\" SUCH DAMAGE.
 .\"
-.\" @(#)sort.1 8.1 (Berkeley) 6/6/93
-.\"
-.Dd September 4, 2019
+.Dd November 30, 2023
 .Dt SORT 1
 .Os
 .Sh NAME
@@ -181,7 +179,7 @@ options (human-readable).
 .It Fl i , Fl Fl ignore-nonprinting
 Ignore all non-printable characters.
 .It Fl M , Fl Fl month-sort , Fl Fl sort=month
-Sort by month abbreviations.
+Sort by month.
 Unknown strings are considered smaller than the month names.
 .It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
 Sort fields numerically by arithmetic value.
diff --git a/usr.bin/sort/tests/Makefile b/usr.bin/sort/tests/Makefile
index 1982fd1cee0a..752dec06bbff 100644
--- a/usr.bin/sort/tests/Makefile
+++ b/usr.bin/sort/tests/Makefile
@@ -2,6 +2,7 @@
 PACKAGE=	tests
 
 NETBSD_ATF_TESTS_SH=	sort_test
+ATF_TESTS_SH=		sort_monthsort_test
 
 ${PACKAGE}FILES+=		d_any_char_dflag_out.txt
 ${PACKAGE}FILES+=		d_any_char_fflag_out.txt
diff --git a/usr.bin/sort/tests/sort_monthsort_test.sh b/usr.bin/sort/tests/sort_monthsort_test.sh
new file mode 100755
index 000000000000..db42981fb107
--- /dev/null
+++ b/usr.bin/sort/tests/sort_monthsort_test.sh
@@ -0,0 +1,159 @@
+#
+# SPDX-License-Identifier: BSD-2-Clause
+#
+# Copyright (c) 2023 Christos Margiolis
+#
+
+get_months_fmt()
+{
+	rm -f in
+	for i in $(seq 12 1); do
+		printf "2000-%02d-01\n" ${i} | xargs -I{} \
+		    date -jf "%Y-%m-%d" {} "${1}" >>in
+	done
+}
+
+atf_test_case monthsort_english
+monthsort_english_head()
+{
+	atf_set "descr" "Test the -M flag with English months"
+}
+monthsort_english_body()
+{
+	export LC_TIME="en_US.UTF-8"
+
+	cat >expout <expout <expout <expout <in <expout <
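
The matching idea described in the commit message can be illustrated with a
minimal standalone sketch in C. This is not the code from the patch, only an
illustration under stated assumptions: struct month_forms, has_prefix(),
month_score() and month_score_ab_only() are hypothetical names, the table is
hard-coded with the Greek forms quoted in the commit message rather than
filled from nl_langinfo(3), and matching is byte-wise on UTF-8 with no case
folding.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry, loosely modelled on the patch's struct cmonth. */
struct month_forms {
	int idx;		/* zero-based month number */
	const char *mon;	/* full name, date context */
	const char *ab;		/* abbreviation */
	const char *alt;	/* standalone name */
};

/* Greek January and May, taken from the commit message (assumed UTF-8). */
static const struct month_forms months[] = {
	{ 0, "Ιανουαρίου", "Ιαν", "Ιανουάριος" },
	{ 4, "Μαΐου", "Μαΐ", "Μάιος" },
};

#define	NMONTHS	(sizeof(months) / sizeof(months[0]))

/* Nonzero if s begins with prefix; same effect as s == strstr(s, prefix). */
static int
has_prefix(const char *s, const char *prefix)
{
	return (prefix != NULL && strncmp(s, prefix, strlen(prefix)) == 0);
}

/* Old behaviour: only the abbreviation is tried. */
int
month_score_ab_only(const char *s)
{
	size_t i;

	for (i = 0; i < NMONTHS; i++) {
		if (has_prefix(s, months[i].ab))
			return (months[i].idx);
	}
	return (-1);
}

/* New behaviour: full, abbreviated and standalone forms are all tried. */
int
month_score(const char *s)
{
	size_t i;

	for (i = 0; i < NMONTHS; i++) {
		if (has_prefix(s, months[i].mon) ||
		    has_prefix(s, months[i].ab) ||
		    has_prefix(s, months[i].alt))
			return (months[i].idx);
	}
	return (-1);
}
```

With this table, month_score_ab_only("Μάιος") returns -1 because "Μαΐ" is not
a prefix of the standalone form, while month_score("Μάιος") returns 4. Input
in the full form, such as "Μαΐου", is matched by both functions, which is why
the abbreviation-only test appeared to work for date-context input.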