Date: Sun, 30 Sep 2012 05:08:03 +0200
From: Michael Gmelin <freebsd@grem.de>
To: freebsd-ports@freebsd.org
Subject: Problems submitting patch containing UTF-8 characters
Message-ID: <20120930050803.7914caf6@bsd64.grem.de>
Hi,

I recently ran into a problem submitting a PR containing UTF-8 characters. They ended up garbled, so the maintainer couldn't apply the patch cleanly.

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645

The characters included were 0xe4 0xb8 0xad and 0xe5 0x9b 0xbd (two three-byte characters). The affected code tests UTF-8 handling, so the characters are required. Even if they weren't, patching them away would still require stating them as part of the patch.

The original e-mail was created using porttools and therefore had no character set specification, which usually shouldn't be a problem. The patch was simply inline as part of the body.

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=1

The character sequence had been recoded to

  0xc3 0xa4 0xc2 0xb8 0xc2 0xad 0xc3 0xa5 0xc2 0x9b 0xc2 0xbd

It seems it was interpreted as latin1 on receipt and then re-encoded as UTF-8:

  0xe4 => 0xc3 0xa4
  0xb8 => 0xc2 0xb8
  0xad => 0xc2 0xad
  0xe5 => 0xc3 0xa5
  0x9b => 0xc2 0x9b
  0xbd => 0xc2 0xbd

This is obviously not what should happen. The recipient shouldn't make any assumptions about the character set used.

The next attempt was to send the patch as a bug follow-up through a graphical MUA. The patch was attached and was encoded as quoted-printable (again without a charset specification):

  +-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"
  ++configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config".encode("utf-8=")

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=2

Unfortunately, the results are the same. I did not try forcing a charset by manually modifying the e-mail (I'm not sure this will work; I'm willing to test, but I don't want to litter that PR any further).

At this point I figured that sending the patch in gzipped format might help. No sooner said than done: the patch shows up as base64 in the PR. When I copy and paste the base64 text and decode it, the resulting .gz decompresses correctly and the content is what I expected.
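As an aside on the recoding reported for the first two attempts: the latin1-then-UTF-8 hypothesis can be checked with a few lines of Python. This is only a sketch of the suspected failure mode, not a description of what the PR system actually does internally. The variable names are made up for illustration; the byte values are the ones quoted in this message.

```python
import quopri

# UTF-8 bytes of the two characters in the patch (U+4E2D and U+56FD).
original = bytes([0xE4, 0xB8, 0xAD, 0xE5, 0x9B, 0xBD])

# Assumed failure mode: every byte is read as a latin1 character,
# then the resulting text is written out again as UTF-8.
garbled = original.decode("latin-1").encode("utf-8")
print(" ".join(f"{b:02x}" for b in garbled))

# The quoted-printable form from the second attempt carries the same bytes,
# so it would be damaged in exactly the same way.
qp_decoded = quopri.decodestring(b"=E4=B8=AD=E5=9B=BD")

# If this really is what happened, the damage is reversible.
repaired = garbled.decode("utf-8").encode("latin-1")
```

Under that assumption, `garbled` is the 12-byte sequence observed in the PR, and reversing the two steps restores the original six bytes, so a maintainer could in principle recover the patch by hand.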
When clicking the download link, though:

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=3

the resulting .gz file has the correct file size but is corrupted. Inspecting it in a hex editor, it looks like it has been re-encoded as UTF-8 (and then truncated at the expected file size):

Hex of the original file (first 16 bytes):

  1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49

Hex of the file downloaded using the link (first 16 bytes):

  1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37

As you can see, every byte outside the 7-bit range has been UTF-8 encoded, which is pretty suboptimal in a binary file:

  0x8b => 0xc2 0x8b
  0xad => 0xc2 0xad
  ...

As a result, the truncated and UTF-8-encoded gzip file cannot be decompressed.

I'm relatively certain that this has worked at some point in the past.

Ideas, anyone?

Thanks,

--
Michael Gmelin
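The two hex dumps of the gzip attachment can be compared in the same way. The sketch below models the suspected behaviour (latin1 decode, UTF-8 encode, cut to the original length) on the 16 bytes quoted in this message only. The function names `recode_and_truncate` and `has_gzip_magic` are hypothetical helpers, not part of any FreeBSD tool.

```python
# First 16 bytes of each file, as quoted in this message.
original_head = bytes.fromhex("1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49")
downloaded_head = bytes.fromhex("1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37")


def recode_and_truncate(data: bytes) -> bytes:
    """Hypothetical model: read bytes as latin1, write UTF-8, cut to the input length."""
    return data.decode("latin-1").encode("utf-8")[: len(data)]


def has_gzip_magic(data: bytes) -> bool:
    """A gzip stream starts with the two magic bytes 0x1f 0x8b."""
    return data[:2] == b"\x1f\x8b"


print(recode_and_truncate(original_head) == downloaded_head)
print(has_gzip_magic(original_head), has_gzip_magic(downloaded_head))
```

The model reproduces the downloaded prefix exactly. The recoded file no longer starts with the gzip magic bytes, which is enough for decompression to fail. Because the tail is cut off to keep the original size, the download (unlike the inline patch) cannot be fully repaired by reversing the recoding.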