Date: Fri, 7 Sep 2018 01:55:11 +1200
From: Thomas Munro <munro@ip9.org>
To: freebsd-hackers@freebsd.org
Subject: Tracking CLDR version in collation definitions
Message-ID: <CADLWmXWY0doSQ-7uoBC0JUzCgZTx9iw-k_viYWB7ze8Pi7R8gw@mail.gmail.com>
[-- Attachment #1 --]
Hello FreeBSD hackers,
PostgreSQL users (and probably users of other database-like systems)
occasionally run into a problem where collation definitions change
and on-disk indexes built under the old ordering become corrupted.
This was one motivation for PostgreSQL to adopt optional support for
ICU, and to track ucol_getVersion() and detect when it changes so that
the user can be warned that dependent indexes need to be rebuilt.
However, for various reasons many users prefer to use the OS collation
support, which remains the default, and PostgreSQL supports both ways.
I'd like to be able to track collation definition versions for libc
collations too. There doesn't currently seem to be a good way to do
that. Am I missing something?
Here's the idea I had:
1. Add a new option -V to localedef(1) so that an arbitrary version
string can be stored in some spare space in the header of LC_COLLATE
files.
2. Add a new libc function: const char *querylocaleversion(int mask,
locale_t locale).
3. Modify the Perl scripts under tools/tools/locale/tools/... to
invoke localedef(1) either with a version set by the maintainer in
unicode.conf (e.g. "30.0.3"), or perhaps extracted from CLDR data files
directly.
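To make step 1 concrete, here is a minimal sketch of how a reader could
pull the version string out of the file header. It assumes the layout
described in the attached patch (12 bytes of format identifier followed
by 12 bytes of data version). The helper name collate_data_version() is
hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sizes and identifier as defined in the attached patch. */
#define COLLATE_FORMAT_VERSION_LEN	12
#define COLLATE_DATA_VERSION_LEN	12
#define COLLATE_VERSION			"BSD 1.0\n"

/*
 * Hypothetical helper, not part of the patch: extract the data version
 * from the start of an LC_COLLATE file image.  'out' must have room for
 * COLLATE_DATA_VERSION_LEN + 1 bytes.  Returns 0 on success, or -1 if
 * the image is too short, 'out' is too small, or the format identifier
 * does not match.  A header with no version stored yields "".
 */
int
collate_data_version(const unsigned char *buf, size_t len, char *out,
    size_t outlen)
{
	const unsigned char *src;
	size_t n;

	assert(buf != NULL && out != NULL);
	if (len < COLLATE_FORMAT_VERSION_LEN + COLLATE_DATA_VERSION_LEN)
		return (-1);
	if (outlen < COLLATE_DATA_VERSION_LEN + 1)
		return (-1);
	/* Same 12-byte comparison the patched loader performs. */
	if (strncmp((const char *)buf, COLLATE_VERSION,
	    COLLATE_FORMAT_VERSION_LEN) != 0)
		return (-1);
	/* Bounded copy: do not trust the file to be NUL-terminated. */
	src = buf + COLLATE_FORMAT_VERSION_LEN;
	for (n = 0; n < COLLATE_DATA_VERSION_LEN && src[n] != '\0'; n++)
		out[n] = (char)src[n];
	out[n] = '\0';
	return (0);
}
```

Because files written before this change carry NUL bytes in that
space, the same code reads them as having an empty version string.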
I've attached a proof-of-concept patch which has a very rough
implementation of steps 1 and 2. It probably needs better bounds
checking, more thought about how to report the lack of a version
string ("" or NULL?), and other details. Before doing any further work
on that I thought I'd check whether people think the idea has legs, or
know of an existing way to get this information.
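For illustration, a caller could sidestep the "" versus NULL question
by treating both as "no version available". The sketch below is one
possible caller-side check, not part of the patch. The names
collversion_check() and enum collversion_status are hypothetical, and
'current' stands in for whatever the proposed querylocaleversion()
would return.

```c
#include <stddef.h>
#include <string.h>

enum collversion_status {
	COLLVERSION_UNKNOWN,	/* no version on one or both sides */
	COLLVERSION_SAME,	/* versions match, index can be trusted */
	COLLVERSION_CHANGED	/* versions differ, index needs a rebuild */
};

/* A missing version may be reported as either NULL or "". */
static int
collversion_missing(const char *v)
{
	return (v == NULL || v[0] == '\0');
}

/*
 * Hypothetical check: 'recorded' is the version string saved when the
 * index was built, 'current' is the one reported by libc now.  The
 * strings are opaque, so only equality is tested.
 */
enum collversion_status
collversion_check(const char *recorded, const char *current)
{
	if (collversion_missing(recorded) || collversion_missing(current))
		return (COLLVERSION_UNKNOWN);
	if (strcmp(recorded, current) == 0)
		return (COLLVERSION_SAME);
	return (COLLVERSION_CHANGED);
}
```

Returning a separate "unknown" state is a design choice in this
sketch: it lets the application decide whether a missing version
should warn, be ignored, or be treated like a change.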
I also considered less invasive approaches to detect collation
changes: using a checksum (i.e. the program needs to know how to find
the LC_COLLATE files), or using the FreeBSD version on the basis that
collations should only change when the base system is upgraded
(which would generate false positives). I don't really like those
approaches much.
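As a rough picture of the checksum alternative, the sketch below
hashes a file with 64-bit FNV-1a. The choice of hash is an assumption
made for brevity, and the function names fnv1a64_update() and
file_checksum() are hypothetical; any stable checksum would serve.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Standard 64-bit FNV-1a offset basis and prime. */
#define FNV1A64_INIT	UINT64_C(0xcbf29ce484222325)
#define FNV1A64_PRIME	UINT64_C(0x100000001b3)

/* Fold 'len' bytes of 'data' into the running hash 'h'. */
uint64_t
fnv1a64_update(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= FNV1A64_PRIME;
	}
	return (h);
}

/*
 * Hash the whole file at 'path'.  Returns 0 and stores the result in
 * '*sum' on success, or -1 if the file cannot be opened or read.
 */
int
file_checksum(const char *path, uint64_t *sum)
{
	unsigned char buf[4096];
	uint64_t h = FNV1A64_INIT;
	size_t n;
	FILE *f;
	int failed;

	if ((f = fopen(path, "rb")) == NULL)
		return (-1);
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
		h = fnv1a64_update(h, buf, n);
	failed = ferror(f);
	(void)fclose(f);
	if (failed)
		return (-1);
	*sum = h;
	return (0);
}
```

The caller still has to supply the path, for example something like
/usr/share/locale/en_US.UTF-8/LC_COLLATE (an assumed location), which
is exactly the drawback noted above.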
I'd be grateful for any feedback, flames etc.
Thanks,
Thomas Munro
[-- Attachment #2 --]
From aca1936962e42c42861b98f996e9c32bfbd2a772 Mon Sep 17 00:00:00 2001
From: Thomas Munro <munro@ip9.org>
Date: Sat, 11 Aug 2018 07:46:45 +1200
Subject: [PATCH] Add querylocaleversion().
Allow user programs to ask for a version string for the
components of a locale_t. The structure of the version
string is undefined, but can be used to detect changes
in the locale's definition.
The intended use-case is databases that use libc collations
to implement indexes. When collation definitions change,
database indexes are frequently corrupted. By exposing
the CLDR version string, databases can detect the change
and issue an error or force an index rebuild.
For now only LC_COLLATE components have a way to report
their version. The LC_COLLATE file format is backwards
and forwards compatible. Where previously 24 bytes held
"BSD 1.0\n" followed by NUL characters, there are now 12
bytes for the file format identifier and then 12 bytes for
the data version string.
*** WORK IN PROGRESS, PROOF OF CONCEPT CODE ONLY ***
---
include/xlocale/_locale.h | 1 +
lib/libc/locale/Symbol.map | 1 +
lib/libc/locale/collate.c | 6 ++--
lib/libc/locale/collate.h | 6 +++-
lib/libc/locale/querylocaleversion.3 | 50 ++++++++++++++++++++++++++++
lib/libc/locale/xlocale.c | 16 +++++++++
lib/libc/locale/xlocale_private.h | 4 +++
usr.bin/localedef/collate.c | 14 +++++---
usr.bin/localedef/localedef.1 | 7 +++-
usr.bin/localedef/localedef.c | 7 +++-
usr.bin/localedef/localedef.h | 2 ++
11 files changed, 105 insertions(+), 9 deletions(-)
create mode 100644 lib/libc/locale/querylocaleversion.3
diff --git a/include/xlocale/_locale.h b/include/xlocale/_locale.h
index a4e04f082fa..83024175c07 100644
--- a/include/xlocale/_locale.h
+++ b/include/xlocale/_locale.h
@@ -54,6 +54,7 @@ locale_t duplocale(locale_t base);
void freelocale(locale_t loc);
locale_t newlocale(int mask, const char *locale, locale_t base);
const char *querylocale(int mask, locale_t loc);
+const char *querylocaleversion(int mask, locale_t loc);
locale_t uselocale(locale_t loc);
#endif /* _XLOCALE_LOCALE_H */
diff --git a/lib/libc/locale/Symbol.map b/lib/libc/locale/Symbol.map
index b2f2a35f2fe..fa22f399de9 100644
--- a/lib/libc/locale/Symbol.map
+++ b/lib/libc/locale/Symbol.map
@@ -207,6 +207,7 @@ FBSD_1.3 {
mbrtoc16_l;
mbrtoc32;
mbrtoc32_l;
+ querylocaleversion;
};
FBSDprivate_1.0 {
diff --git a/lib/libc/locale/collate.c b/lib/libc/locale/collate.c
index 8d040c19486..fbb15000af9 100644
--- a/lib/libc/locale/collate.c
+++ b/lib/libc/locale/collate.c
@@ -150,12 +150,14 @@ __collate_load_tables_l(const char *encoding, struct xlocale_collate *table)
return (_LDP_ERROR);
}
- if (strncmp(TMP, COLLATE_VERSION, COLLATE_STR_LEN) != 0) {
+ if (strncmp(TMP, COLLATE_VERSION, COLLATE_FORMAT_VERSION_LEN) != 0) {
(void) munmap(map, sbuf.st_size);
errno = EINVAL;
return (_LDP_ERROR);
}
- TMP += COLLATE_STR_LEN;
+ TMP += COLLATE_FORMAT_VERSION_LEN;
+ strlcpy(table->header.version, TMP, sizeof(table->header.version));
+ TMP += COLLATE_DATA_VERSION_LEN;
info = (void *)TMP;
TMP += sizeof (*info);
diff --git a/lib/libc/locale/collate.h b/lib/libc/locale/collate.h
index 4abb1f936ae..d36f2cfa891 100644
--- a/lib/libc/locale/collate.h
+++ b/lib/libc/locale/collate.h
@@ -53,7 +53,10 @@
#endif
#define COLLATE_STR_LEN 24 /* should be 64-bit multiple */
+
+#define COLLATE_FORMAT_VERSION_LEN 12
#define COLLATE_VERSION "BSD 1.0\n"
+#define COLLATE_DATA_VERSION_LEN 12
#define COLLATE_MAX_PRIORITY (0x7fffffff) /* max signed value */
#define COLLATE_SUBST_PRIORITY (0x40000000) /* bit indicates subst table */
@@ -69,7 +72,8 @@
/*
* The collate file format is as follows:
*
- * char version[COLLATE_STR_LEN]; // must be COLLATE_VERSION
+ * char format_version[COLLATE_FORMAT_VERSION_LEN]; // must be COLLATE_VERSION
+ * char data_version[COLLATE_DATA_VERSION_LEN]; // NUL-terminated, may be empty
* collate_info_t info; // see below, includes padding
* collate_char_pri_t char_data[256]; // 8 bit char values
* collate_subst_t subst[*]; // 0 or more substitutions
diff --git a/lib/libc/locale/querylocaleversion.3 b/lib/libc/locale/querylocaleversion.3
new file mode 100644
index 00000000000..9458014a555
--- /dev/null
+++ b/lib/libc/locale/querylocaleversion.3
@@ -0,0 +1,50 @@
+.\" Copyright (c) 2018 The FreeBSD Foundation
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd September 3, 2018
+.Dt QUERYLOCALEVERSION 3
+.Os
+.Sh NAME
+.Nm querylocaleversion
+.Nd Look up the locale version for a specified category
+.Sh LIBRARY
+.Lb libc
+.Sh SYNOPSIS
+.In locale.h
+.Ft const char *
+.Fn querylocaleversion "int mask" "locale_t locale"
+.Sh DESCRIPTION
+Returns the version of the locale for the category specified by
+.Fa mask .
+The possible values for the mask are the same as those in
+.Xr newlocale 3 .
+Currently the only component that can provide version information
+is LC_COLLATE_MASK. If no version information is available, a
+pointer to an empty string is returned.
+If more than one bit in the mask is set, the returned value is undefined.
+.Sh SEE ALSO
+.Xr querylocale 3 ,
+.Xr localedef 1
diff --git a/lib/libc/locale/xlocale.c b/lib/libc/locale/xlocale.c
index 87160b26b4d..dc492c087a1 100644
--- a/lib/libc/locale/xlocale.c
+++ b/lib/libc/locale/xlocale.c
@@ -231,6 +231,8 @@ static int dupcomponent(int type, locale_t base, locale_t new)
if (new->components[type]) {
strncpy(new->components[type]->locale, src->locale,
ENCODING_LEN);
+ strncpy(new->components[type]->version, src->version,
+ LOCALE_VERSION_LEN);
}
} else if (base->components[type]) {
new->components[type] = xlocale_retain(base->components[type]);
@@ -355,6 +357,20 @@ const char *querylocale(int mask, locale_t loc)
return ("C");
}
+/*
+ * Returns the version of the locale for a particular component of a locale_t.
+ */
+const char *querylocaleversion(int mask, locale_t loc)
+{
+ int type = ffs(mask) - 1;
+ FIX_LOCALE(loc);
+ if (type < 0 || type >= XLC_LAST)
+ return (NULL);
+ if (loc->components[type])
+ return (loc->components[type]->version);
+ return ("");
+}
+
/*
* Installs the specified locale_t as this thread's locale.
*/
diff --git a/lib/libc/locale/xlocale_private.h b/lib/libc/locale/xlocale_private.h
index 9aa4d86c87c..370c0a53af1 100644
--- a/lib/libc/locale/xlocale_private.h
+++ b/lib/libc/locale/xlocale_private.h
@@ -74,6 +74,8 @@ _Static_assert(XLC_TIME == LC_TIME - 1,
_Static_assert(XLC_MESSAGES == LC_MESSAGES - 1,
"XLC_MESSAGES doesn't match the LC_MESSAGES value.");
+#define LOCALE_VERSION_LEN 11
+
/**
* Header used for objects that are reference counted. Objects may optionally
* have a destructor associated, which is responsible for destroying the
@@ -99,6 +101,8 @@ struct xlocale_component {
struct xlocale_refcounted header;
/** Name of the locale used for this component. */
char locale[ENCODING_LEN+1];
+ /** Version of the data for this component. */
+ char version[LOCALE_VERSION_LEN+1];
};
/**
diff --git a/usr.bin/localedef/collate.c b/usr.bin/localedef/collate.c
index d2e8dcb922a..a6d81890742 100644
--- a/usr.bin/localedef/collate.c
+++ b/usr.bin/localedef/collate.c
@@ -1113,7 +1113,8 @@ dump_collate(void)
collelem_t *ce;
collchar_t *cc;
subst_t *sb;
- char vers[COLLATE_STR_LEN];
+ char format_version[COLLATE_FORMAT_VERSION_LEN];
+ char data_version[COLLATE_DATA_VERSION_LEN];
collate_char_t chars[UCHAR_MAX + 1];
collate_large_t *large;
collate_subst_t *subst[COLL_WEIGHTS_MAX];
@@ -1154,8 +1155,12 @@ dump_collate(void)
}
(void) memset(&chars, 0, sizeof (chars));
- (void) memset(vers, 0, COLLATE_STR_LEN);
- (void) strlcpy(vers, COLLATE_VERSION, sizeof (vers));
+ (void) memset(format_version, 0, COLLATE_FORMAT_VERSION_LEN);
+ (void) strlcpy(format_version, COLLATE_VERSION,
+ sizeof (format_version));
+ (void) memset(data_version, 0, COLLATE_DATA_VERSION_LEN);
+ if (version)
+ (void) strlcpy(data_version, version, sizeof (data_version));
/*
* We need to make sure we arrange for the UNDEFINED field
@@ -1288,7 +1293,8 @@ dump_collate(void)
/* Time to write the entire data set out */
- if ((wr_category(vers, COLLATE_STR_LEN, f) < 0) ||
+ if ((wr_category(format_version, COLLATE_FORMAT_VERSION_LEN, f) < 0) ||
+ (wr_category(data_version, COLLATE_DATA_VERSION_LEN, f) < 0) ||
(wr_category(&collinfo, sizeof (collinfo), f) < 0) ||
(wr_category(&chars, sizeof (chars), f) < 0)) {
return;
diff --git a/usr.bin/localedef/localedef.1 b/usr.bin/localedef/localedef.1
index f096ca05336..f37517f3c8a 100644
--- a/usr.bin/localedef/localedef.1
+++ b/usr.bin/localedef/localedef.1
@@ -131,6 +131,10 @@ If not supplied, then default screen widths will be assumed, which will
generally not account for East Asian encodings requiring more than a single
character cell to display, nor for combining or accent marks that occupy
no additional screen width.
+.It Fl V Ar version
+Specifies a version string describing the source collation data. This
+string can be retrieved using querylocaleversion(3), and is intended to allow
+applications to detect when the definition of a collation has changed.
.El
.Pp
The following operands are required:
@@ -195,7 +199,8 @@ If an error is detected, no permanent output will be created.
.Xr iconv_open 3 ,
.Xr nl_langinfo 3 ,
.Xr strftime 3 ,
-.Xr environ 7
+.Xr environ 7 ,
+.Xr querylocaleversion 3
.Sh WARNINGS
If warnings occur, permanent output will be created if the
.Fl c
diff --git a/usr.bin/localedef/localedef.c b/usr.bin/localedef/localedef.c
index 473de7b3db1..c13ff7aba73 100644
--- a/usr.bin/localedef/localedef.c
+++ b/usr.bin/localedef/localedef.c
@@ -59,6 +59,7 @@ int undefok = 0;
int warnok = 0;
static char *locname = NULL;
static char locpath[PATH_MAX];
+char *version = NULL;
const char *
category_name(void)
@@ -236,6 +237,7 @@ usage(void)
(void) fprintf(stderr, " -u encoding : assume encoding\n");
(void) fprintf(stderr, " -w widths : use screen widths file\n");
(void) fprintf(stderr, " -i locsrc : source file for locale\n");
+ (void) fprintf(stderr, " -V version : version string for locale\n");
exit(4);
}
@@ -260,7 +262,7 @@ main(int argc, char **argv)
(void) setlocale(LC_ALL, "");
- while ((c = getopt(argc, argv, "w:i:cf:u:vUD")) != -1) {
+ while ((c = getopt(argc, argv, "w:i:cf:u:vUDV:")) != -1) {
switch (c) {
case 'D':
bsd = 1;
@@ -289,6 +291,9 @@ main(int argc, char **argv)
case '?':
usage();
break;
+ case 'V':
+ version = optarg;
+ break;
}
}
diff --git a/usr.bin/localedef/localedef.h b/usr.bin/localedef/localedef.h
index 4367a19e2e8..6ca50e198f1 100644
--- a/usr.bin/localedef/localedef.h
+++ b/usr.bin/localedef/localedef.h
@@ -53,6 +53,8 @@ extern int undefok; /* mostly ignore undefined symbols */
extern int warnok;
extern int warnings;
+extern char *version;
+
int yylex(void);
void yyerror(const char *);
_Noreturn void errf(const char *, ...) __printflike(1, 2);
--
2.18.0
