Date:      Mon, 4 Aug 2014 14:04:00 -0700
From:      John-Mark Gurney <jmg@funkthat.com>
To:        Phil Shafer <phil@juniper.net>
Cc:        arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>, marcel@freebsd.org, "Simon J. Gerraty" <sjg@juniper.net>
Subject:   Re: XML Output: libxo - provide single API to output TXT, XML, JSON and HTML
Message-ID:  <20140804210400.GG88623@funkthat.com>
In-Reply-To: <201408041449.s74Emwk0019816@idle.juniper.net>
References:  <63132.1406924887@critter.freebsd.dk> <201408041449.s74Emwk0019816@idle.juniper.net>

Phil Shafer wrote this message on Mon, Aug 04, 2014 at 10:48 -0400:
> Poul-Henning Kamp writes:
> >First off, this is not just ENOMEM, this is also invalid UTF-8 strings,
> >NULL pointers and much more bogosity.
> 
> Yup, there are 26 failure cases at present, ranging from missing
> close braces in format strings to unbalanced open/close calls.
> 
> >>Seeing broken output is better than limping
> >>along with output that looks right but isn't.
> >The output should preferably be explicitly broken, so that nobody 
> >downstream mistakenly takes it and runs with it.
> 
> I think we're in agreement, but there is the question of what
> constitutes sufficient problems to trigger abort.  I'm coding the
> UTF-8 support now and that's a perfect example.  If the output
> character set (the user's LANG setting) doesn't support a character
> of output (u+10d6), does that constitute a complete failure?  I'll

It depends... For terminal/text output, you should use iconv's
ICONV_SET_TRANSLITERATE option (see iconvctl(3), which wasn't linked
from iconv(3), but now is)...

> presumably give flags to tailor the behavior, but by default, I'd
> be upset if character conversion issues like this turned into
> complete failure.  But a format string with an invalid UTF-8 sequence
> would be more severe.
> 
> FWIW, the UTF-8 strategy for libxo is this:
> - all format strings are UTF-8
> - argument strings (%s) are UTF-8
> - "%ls" handles wide characters
> - "%hs" will handle locale-based strings
> - XML, JSON, and HTML will be UTF-8 output
> - text will be locale-based

This looks like exactly what I had in mind...

Though for XML and HTML, you might want to add the proper declaration
that says the encoding is UTF-8 (the XML declaration, or a meta charset
tag in HTML)...  How about making this an option that can be turned off?
That way, if someone wants to nest the output in another document, they
can set the option to turn it off, while by default you end up w/ a
properly formed HTML or XML document.
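
A rough sketch of the "on by default, option to turn off" idea for the
XML case, in C.  The function name and the flag below are made up for
illustration and are not part of libxo; only the declaration string
itself is standard XML:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag (not a libxo API): set it to suppress the prolog. */
#define SKETCH_NO_PROLOG	0x01

/*
 * Copy the XML declaration into buf unless the caller opted out.
 * Returns the number of bytes written, not counting the NUL; returns 0
 * when the prolog is suppressed or does not fit in buf.
 */
static size_t
sketch_xml_prolog(char *buf, size_t bufsz, unsigned flags)
{
	static const char prolog[] =
	    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
	size_t len = sizeof(prolog) - 1;

	assert(buf != NULL && bufsz > 0);
	buf[0] = '\0';
	if (flags & SKETCH_NO_PROLOG)
		return 0;	/* caller is nesting the output elsewhere */
	if (len >= bufsz)
		return 0;	/* no room for the prolog plus the NUL */
	memcpy(buf, prolog, len + 1);
	return len;
}
```

A caller embedding the output in a larger document would pass
SKETCH_NO_PROLOG; everyone else passes 0 and gets a standalone document.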

> The painful part is that I've been using vsnprintf as the plumbing
> for formatting strings, but it doesn't handle field widths for UTF-8
> data correctly, so I'll need to start doing that by hand myself.

iconv or another i18n library should help w/ that...  Some languages,
like Thai, have combining characters, so even though there might be a
6 byte UTF-8 sequence, it'll only take up one column width...
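
To illustrate the bytes vs. columns mismatch, here is a minimal sketch
in C.  Everything in it is an assumption for illustration: the function
names are made up, the decoder is simplified (it does not reject
overlong forms or surrogates), and the combining-character table is a
tiny hard-coded subset.  Real code would more likely lean on the
locale's wcwidth(3) or an i18n library; double-width East Asian
characters are not handled here at all:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s (len bytes available).
 * Stores the code point in *cp and returns the bytes consumed,
 * or 0 if the sequence is invalid or truncated.
 */
static size_t
sketch_utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	size_t need, i;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		*cp = s[0] & 0x1F; need = 2;
	} else if ((s[0] & 0xF0) == 0xE0) {
		*cp = s[0] & 0x0F; need = 3;
	} else if ((s[0] & 0xF8) == 0xF0) {
		*cp = s[0] & 0x07; need = 4;
	} else
		return 0;	/* stray continuation or invalid lead byte */
	if (len < need)
		return 0;
	for (i = 1; i < need; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		*cp = (*cp << 6) | (s[i] & 0x3F);
	}
	return need;
}

/* Incomplete, hard-coded set of zero-width combining marks. */
static int
sketch_is_combining(uint32_t cp)
{
	return (cp >= 0x0300 && cp <= 0x036F) ||  /* combining diacriticals */
	    cp == 0x0E31 ||                       /* Thai mai han-akat */
	    (cp >= 0x0E34 && cp <= 0x0E3A) ||     /* Thai vowel signs */
	    (cp >= 0x0E47 && cp <= 0x0E4E);       /* Thai tone marks etc. */
}

/* Count display columns in a UTF-8 buffer; returns -1 on bad input. */
static int
sketch_utf8_columns(const char *str, size_t len)
{
	const unsigned char *s = (const unsigned char *)str;
	int cols = 0;
	uint32_t cp;
	size_t n;

	assert(str != NULL);
	while (len > 0) {
		n = sketch_utf8_decode(s, len, &cp);
		if (n == 0)
			return -1;
		if (!sketch_is_combining(cp))
			cols++;
		s += n;
		len -= n;
	}
	return cols;
}
```

With this, the Thai consonant U+0E01 followed by the vowel sign U+0E34
is 6 bytes but counts as 1 column, so a field of width 5 needs 4 pad
spaces rather than the 0 that byte-based padding would give.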

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."


