Date:      Sun, 30 Sep 2012 14:55:48 +0200
From:      Michael Gmelin <freebsd@grem.de>
To:        freebsd-ports@freebsd.org
Subject:   Re: Problems submitting patch containing UTF-8 characters
Message-ID:  <20120930145548.59b03149@bsd64.grem.de>
In-Reply-To: <20120930050803.7914caf6@bsd64.grem.de>
References:  <20120930050803.7914caf6@bsd64.grem.de>



On Sun, 30 Sep 2012 05:08:03 +0200
Michael Gmelin <freebsd@grem.de> wrote:

> Hi,
> 
> I recently ran into a problem submitting a PR containing UTF-8
> characters: they ended up garbled, so the maintainer couldn't apply
> the patch cleanly.
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645
> 
> The characters included were 0xe4 0xb8 0xad and 0xe5 0x9b 0xbd (two
> three-byte characters). The affected code tests UTF-8 handling, so
> the characters are required. Even if they weren't, patching them away
> would still require including them in the patch.
> 
> The original e-mail was created using porttools and therefore had no
> character set specification, which usually shouldn't be a problem. The
> patch was just inline as part of the body.
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=1
> 
> The character sequence had been recoded to
> 0xc3 0xa4 0xc2 0xb8 0xc2 0xad 0xc3 0xa5 0xc2 0x9b 0xc2 0xbd
> 
> It seems that each byte was interpreted as Latin-1 on receipt and
> then re-encoded as UTF-8:
> 0xe4 => 0xc3 0xa4
> 0xb8 => 0xc2 0xb8
> 0xad => 0xc2 0xad
> 0xe5 => 0xc3 0xa5
> 0x9b => 0xc2 0x9b
> 0xbd => 0xc2 0xbd
> 
> This is obviously not what should happen: the recipient shouldn't
> make any assumptions about the character set used.
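
The mapping above can be reproduced in a few lines of Python. This is only a sketch of the suspected behaviour (decode as Latin-1, re-encode as UTF-8); the function name is made up for illustration, and I don't know what the PR system actually does internally.

```python
def reencode_latin1_as_utf8(raw: bytes) -> bytes:
    """Treat every byte as a Latin-1 character, then encode the text as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


# The two three-byte UTF-8 sequences from the patch.
original = bytes.fromhex("e4b8ad e59bbd")

# What the suspected recoding produces.
garbled = reencode_latin1_as_utf8(original)
print(garbled.hex())

# The recoding is reversible as long as no bytes were lost.
repaired = garbled.decode("utf-8").encode("latin-1")
print(repaired == original)
```

Because Latin-1 maps every byte value to exactly one code point, the damage can be undone as long as nothing has been truncated.
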
> 
> The next attempt was sending the patch as a bug-followup through a
> graphical MUA. The patch was attached and had been encoded as
> quoted-printable (no specific charset specification):
> 
> +-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"
> ++configPath =3D
> u"./config/=E4=B8=AD=E5=9B=BD_client.config".encode("utf-8=")
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=2
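
For reference, the quoted-printable form itself is intact. A short check with Python's standard quopri module (only the path string from the patch is used here) decodes it back to the expected bytes:

```python
import quopri

# The quoted-printable fragment as it appears in the attachment.
encoded = b'u"./config/=E4=B8=AD=E5=9B=BD_client.config"'

# Undo the =XX escapes; the result is raw bytes with no charset attached.
decoded = quopri.decodestring(encoded)
print(decoded)

# Interpreting those bytes as UTF-8 yields the intended characters.
text = decoded.decode("utf-8")
print(text)
```

So the bytes survive the transfer encoding; the damage happens when something later guesses the charset.
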
> 
> Unfortunately the results are the same. I did not try forcing a
> charset by manually modifying the email (not sure if this will work,
> I'm willing to test, but I don't want to further litter that PR).
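
If someone wants to try the explicit-charset variant, a message with a declared charset could be built roughly like this. This is an untested-against-GNATS sketch using Python's standard email package; the subject line is a placeholder, and whether the PR system honours the charset parameter is exactly the open question.

```python
from email.message import EmailMessage

# One line of the patch, containing the two three-byte characters.
patch_line = 'configPath = u"./config/\u4e2d\u56fd_client.config"\n'

msg = EmailMessage()
msg["Subject"] = "Re: ports/171645"  # placeholder subject, for illustration only

# Declare the charset explicitly and use quoted-printable as transfer encoding.
msg.set_content(patch_line, charset="utf-8", cte="quoted-printable")

print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print(msg.get_payload())
```
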
> 
> At this point I figured that sending the patch in gzipped format
> might help. No sooner said than done: the patch shows up as base64 in
> the PR. When I copy, paste, and decode the base64 text, the resulting
> .gz decompresses correctly and the content is what I expected. When
> clicking the download link, though:
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=3
> 
> The resulting .gz file has the correct file size but is corrupted.
> In a hex editor, it looks like it has been re-encoded as UTF-8 (and
> then truncated to the expected file size):
> 
> Hex of the original file (first 16 bytes):
> 1f 8b 08 08 ad 79 65 50  00 03 70 79 32 37 2d 49
> 
> Hex of the file downloaded by using the link:
> 1f c2 8b 08 08 c2 ad 79  65 50 00 03 70 79 32 37
> 
> As you can see, all bytes outside the 7-bit range have been UTF-8
> encoded, which is pretty suboptimal in a binary file.
> 
> 0x8b => 0xc2 0x8b
> 0xad => 0xc2 0xad
> ...
> 
> As a result, the truncated and UTF-8-encoded gzip file cannot be
> decompressed.
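
The same hypothesis explains the gzip header. The sketch below applies the suspected recoding to the first 16 bytes of the original file and truncates to the original length; the function name is again made up for illustration.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def reencode_and_truncate(data: bytes) -> bytes:
    """Recode Latin-1 -> UTF-8, then cut back to the original length."""
    return data.decode("latin-1").encode("utf-8")[: len(data)]


# First 16 bytes of the original .gz file.
original_head = bytes.fromhex("1f8b0808ad796550 0003707932372d49")

# Simulated download: two bytes grow to two-byte sequences, the tail is cut off.
downloaded_head = reencode_and_truncate(original_head)
print(downloaded_head.hex())

# The gzip magic number no longer matches, so decompression fails immediately.
print(original_head.startswith(GZIP_MAGIC))
print(downloaded_head.startswith(GZIP_MAGIC))
```

The two bytes above 0x7f each expand to two bytes, so the 16 bytes become 18 and the last two are lost to truncation. Unlike the text case, this cannot be fully reversed, because the truncated bytes are gone.
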
> 
> I'm relatively certain that this has worked at some point in the past.
> 
> Ideas anyone?
> 
> Thanks,
> 

By the way, the two three-byte sequences together mean
"China" (中国); see also http://goo.gl/4muUF

-- 
Michael Gmelin


