Date: Sun, 30 Sep 2012 14:55:48 +0200
From: Michael Gmelin <freebsd@grem.de>
To: freebsd-ports@freebsd.org
Subject: Re: Problems submitting patch containing UTF-8 characters
Message-ID: <20120930145548.59b03149@bsd64.grem.de>
In-Reply-To: <20120930050803.7914caf6@bsd64.grem.de>
References: <20120930050803.7914caf6@bsd64.grem.de>

On Sun, 30 Sep 2012 05:08:03 +0200
Michael Gmelin <freebsd@grem.de> wrote:

> Hi,
>
> I recently ran into a problem submitting a PR containing UTF-8
> characters, they ended up garbled, so the maintainer couldn't apply
> the patch cleanly.
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645
>
> The characters included were 0xe4 0xb8 0xad and 0xe5 0x9b 0xbd (two
> three byte characters). The code affected is about testing utf-8, so
> the characters are required. And even if not, patching them away would
> require stating them as part of the patch.
>
> The original e-mail was created using porttools and therefore had no
> character set specification, which usually shouldn't be a problem. The
> patch was just inline as part of the body.
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=1
>
> The character sequence had been recoded to
> 0xc3 0xa4 0xc2 0xb8 0xc2 0xad 0xc3 0xa5 0xc2 0x9b 0xc2 0xbd
>
> It seems like it had been interpreted as latin1 on receipt and then
> reencoded as utf-8:
> 0xe4 => 0xc3 0xa4
> 0xb8 => 0xc2 0xb8
> 0xad => 0xc2 0xad
> 0xe5 => 0xc3 0xa5
> 0x9b => 0xc2 0x9b
> 0xbd => 0xc2 0xbd
>
> Which is obviously not what should happen. The recipient shouldn't
> make any assumptions about the character set used.
>
> The next attempt was sending the patch as a bug-followup through a
> graphical MUA. The patch was attached and had been encoded as
> quoted-printable (no specific charset specification):
>
> +-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"
> ++configPath =3D
> u"./config/=E4=B8=AD=E5=9B=BD_client.config".encode("utf-8=")
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=2
>
> Unfortunately the results are the same. I did not try forcing a
> charset by manually modifying the email (not sure if this will work,
> I'm willing to test, but I don't want to further litter that PR).
>
> At this point I figured, that sending the patch in gzipped format
> might help. Said and done, the patch shows up as base64 in the PR.
> When copy and pasting and decoding the base64 text, the resulting .gz
> can be decompressed correctly and the content is what I expected. When
> clicking the download link though:
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=3
>
> The resulting .gz file has the correct file size, but is corrupted.
> Checking it using the hex editor it looks like it has been reencoded
> as utf-8 (and then truncated at the expected file size):
>
> Hex of the original file (first 16 bytes):
> 1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49
>
> Hex of the file downloaded by using the link:
> 1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37
>
> As you can see, all non 7bit characters have been utf-8 encoded, which
> is pretty suboptimal in a binary file.
>
> 0x8b => 0xc2 0x8b
> 0xad => 0xc2 0xad
> ...
>
> As a result the truncated and utf-8 encoded gzip file cannot be
> decompressed.
>
> I'm relatively certain that this has worked at some point in the past.
>
> Ideas anyone?
>
> Thanks,
>

By the way, the two three byte sequences mean "China", see also
http://goo.gl/4muUF

-- 
Michael Gmelin
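
For anyone who wants to check the byte sequences quoted above, the suspected recoding can be reproduced with a few lines of Python. This is only a sketch of the hypothesis in the message (bytes read as latin1, then written out as utf-8); it says nothing about what the PR system actually does internally. The function names recode_as_latin1, undo_recode and hexdump are made up for illustration and belong to no existing tool.

```python
import quopri


def recode_as_latin1(data: bytes) -> bytes:
    """Treat every byte as a Latin-1 character, then encode as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def undo_recode(data: bytes) -> bytes:
    """Reverse recode_as_latin1(); only works on complete, untruncated data."""
    return data.decode("utf-8").encode("latin-1")


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


# The two three-byte UTF-8 characters from the patch.
original_text = bytes.fromhex("e4 b8 ad e5 9b bd")
garbled_text = recode_as_latin1(original_text)
print(hexdump(garbled_text))
# c3 a4 c2 b8 c2 ad c3 a5 c2 9b c2 bd

# The quoted-printable form from the second attempt decodes to the same bytes.
qp_bytes = quopri.decodestring(b"=E4=B8=AD=E5=9B=BD")
print(qp_bytes == original_text)
# True

# First 16 bytes of the original .gz file, recoded and then cut back to
# 16 bytes (assumed to mirror the truncation at the expected file size).
gzip_head = bytes.fromhex("1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49")
garbled_head = recode_as_latin1(gzip_head)
print(hexdump(garbled_head[:16]))
# 1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37
```

Both outputs match the hex quoted in the message, which is consistent with the latin1 theory. For the text patches the damage is reversible: undo_recode(garbled_text) gives back the original six bytes. For the gzip download it is not, because the truncation discards the tail of the file.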