From owner-freebsd-ports@FreeBSD.ORG Sun Sep 30 03:08:10 2012
Delivered-To: freebsd-ports@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5BFFB106564A
	for ; Sun, 30 Sep 2012 03:08:10 +0000 (UTC)
	(envelope-from freebsd@grem.de)
Received: from mail.grem.de (outcast.grem.de [213.239.217.27])
	by mx1.freebsd.org (Postfix) with SMTP id BBC868FC08
	for ; Sun, 30 Sep 2012 03:08:09 +0000 (UTC)
Received: (qmail 16200 invoked by uid 89); 30 Sep 2012 03:08:02 -0000
Received: from unknown (HELO bsd64.grem.de) (mg@grem.de@79.251.9.2)
	by mail.grem.de with ESMTPA; 30 Sep 2012 03:08:02 -0000
Date: Sun, 30 Sep 2012 05:08:03 +0200
From: Michael Gmelin
To: freebsd-ports@freebsd.org
Message-ID: <20120930050803.7914caf6@bsd64.grem.de>
X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.6; amd64-portbld-freebsd9.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Problems submitting patch containing UTF-8 characters
X-BeenThere: freebsd-ports@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Porting software to FreeBSD
X-List-Received-Date: Sun, 30 Sep 2012 03:08:10 -0000

Hi,

I recently ran into a problem submitting a PR containing UTF-8
characters: they ended up garbled, so the maintainer couldn't apply the
patch cleanly.

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645

The characters included were 0xe4 0xb8 0xad and 0xe5 0x9b 0xbd (two
three-byte characters). The code affected is about testing UTF-8, so the
characters are required. And even if they were not, patching them away
would require stating them as part of the patch.

The original e-mail was created using porttools and therefore had no
character set specification, which usually shouldn't be a problem. The
patch was just inline as part of the body.

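
For reference, a minimal Python sketch (Python 3 assumed, variable names
purely illustrative) that decodes the six bytes quoted above and checks
that they really are two characters of three bytes each:

```python
# The six bytes quoted above, as they appear in the patch.
raw = bytes([0xe4, 0xb8, 0xad, 0xe5, 0x9b, 0xbd])

# Decoding them as UTF-8 yields two characters.
text = raw.decode("utf-8")

# Each of the two characters takes three bytes in UTF-8.
sizes = [len(ch.encode("utf-8")) for ch in text]

# Code points, for cross-checking against a Unicode table.
code_points = ["U+%04X" % ord(ch) for ch in text]

print(len(text), sizes, code_points)
# prints: 2 [3, 3] ['U+4E2D', 'U+56FD']
```
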
http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=1

The character sequence had been recoded to

  0xc3 0xa4 0xc2 0xb8 0xc2 0xad 0xc3 0xa5 0xc2 0x9b 0xc2 0xbd

It seems it was interpreted as latin1 on receipt and then re-encoded as
UTF-8:

  0xe4 => 0xc3 0xa4
  0xb8 => 0xc2 0xb8
  0xad => 0xc2 0xad
  0xe5 => 0xc3 0xa5
  0x9b => 0xc2 0x9b
  0xbd => 0xc2 0xbd

This is obviously not what should happen. The recipient shouldn't make
any assumptions about the character set used.

The next attempt was sending the patch as a bug-followup through a
graphical MUA. The patch was attached and had been encoded as
quoted-printable (no specific charset specification):

  +-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"
  ++configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config".encode("utf-8=
  ")

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=2

Unfortunately the results are the same. I did not try forcing a charset
by manually modifying the e-mail (not sure if this will work; I'm willing
to test, but I don't want to further litter that PR).

At this point I figured that sending the patch in gzipped format might
help. No sooner said than done: the patch shows up as base64 in the PR.
When I copy and paste the base64 text and decode it, the resulting .gz
can be decompressed correctly and the content is what I expected. When
clicking the download link, though:

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=3

the resulting .gz file has the correct file size, but is corrupted.
Checking it in a hex editor, it looks like it has been re-encoded as
UTF-8 (and then truncated at the expected file size):

Hex of the original file (first 16 bytes):
  1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49

Hex of the file downloaded by using the link:
  1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37

As you can see, all non-7-bit bytes have been UTF-8 encoded, which is
pretty suboptimal in a binary file.

  0x8b => 0xc2 0x8b
  0xad => 0xc2 0xad
  ...

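
The recoding described above can be reproduced with a short Python
sketch. It assumes the PR system decodes incoming bytes as latin1 and
writes them back out as UTF-8. That is only a guess based on the observed
bytes, not something confirmed from the PR system's code, and the
function names are made up for illustration:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Misread bytes as latin1, then re-encode the result as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def undo_latin1_to_utf8(data: bytes) -> bytes:
    """Reverse the recoding. Only reliable if nothing was truncated."""
    return data.decode("utf-8").encode("latin-1")


def hexdump(data: bytes) -> str:
    """Format bytes the way they are quoted in this mail."""
    return " ".join("%02x" % b for b in data)


# Case 1: the two three-byte characters from the patch.
original_text = bytes.fromhex("e4 b8 ad e5 9b bd")
garbled_text = latin1_to_utf8(original_text)
print(hexdump(garbled_text))
# prints: c3 a4 c2 b8 c2 ad c3 a5 c2 9b c2 bd

# Case 2: the first 16 bytes of the gzip file, recoded and then cut
# back to the original length, as the download appears to be.
gzip_head = bytes.fromhex("1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49")
damaged_head = latin1_to_utf8(gzip_head)[:len(gzip_head)]
print(hexdump(damaged_head))
# prints: 1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37

# The text case can be undone, because no bytes were lost there.
assert undo_latin1_to_utf8(garbled_text) == original_text
```

Both outputs match the byte sequences seen in the PR, which is
consistent with the latin1 guess. The inline patches could be repaired
this way on the receiving end, but the gzip download cannot, since the
truncation drops the tail of the file.
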
As a result, the truncated and UTF-8 encoded gzip file cannot be
decompressed.

I'm relatively certain that this has worked at some point in the past.
Any ideas?

Thanks,

-- 
Michael Gmelin