From owner-freebsd-ports@FreeBSD.ORG  Sun Sep 30 03:08:10 2012
Return-Path: <owner-freebsd-ports@FreeBSD.ORG>
Delivered-To: freebsd-ports@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5BFFB106564A
	for <freebsd-ports@freebsd.org>; Sun, 30 Sep 2012 03:08:10 +0000 (UTC)
	(envelope-from freebsd@grem.de)
Received: from mail.grem.de (outcast.grem.de [213.239.217.27])
	by mx1.freebsd.org (Postfix) with SMTP id BBC868FC08
	for <freebsd-ports@freebsd.org>; Sun, 30 Sep 2012 03:08:09 +0000 (UTC)
Received: (qmail 16200 invoked by uid 89); 30 Sep 2012 03:08:02 -0000
Received: from unknown (HELO bsd64.grem.de) (mg@grem.de@79.251.9.2)
	by mail.grem.de with ESMTPA; 30 Sep 2012 03:08:02 -0000
Date: Sun, 30 Sep 2012 05:08:03 +0200
From: Michael Gmelin <freebsd@grem.de>
To: freebsd-ports@freebsd.org
Message-ID: <20120930050803.7914caf6@bsd64.grem.de>
X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.6; amd64-portbld-freebsd9.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Problems submitting patch containing UTF-8 characters
X-BeenThere: freebsd-ports@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Porting software to FreeBSD <freebsd-ports.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-ports>,
	<mailto:freebsd-ports-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-ports>
List-Post: <mailto:freebsd-ports@freebsd.org>
List-Help: <mailto:freebsd-ports-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-ports>,
	<mailto:freebsd-ports-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Sep 2012 03:08:10 -0000

Hi,

I recently ran into a problem submitting a PR containing UTF-8
characters: they ended up garbled, so the maintainer couldn't apply the
patch cleanly.

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645

The characters included were 0xe4 0xb8 0xad and 0xe5 0x9b 0xbd (two
three-byte characters). The affected code tests UTF-8 handling, so the
characters are required. Even if they weren't, a patch removing them
would still have to contain them.

The original e-mail was created using porttools and therefore had no
character set specification, which usually shouldn't be a problem. The
patch was just inline as part of the body.

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=1

The character sequence had been recoded to
0xc3 0xa4 0xc2 0xb8 0xc2 0xad 0xc3 0xa5 0xc2 0x9b 0xc2 0xbd

It seems it was interpreted as Latin-1 on receipt and then re-encoded
as UTF-8:
0xe4 => 0xc3 0xa4
0xb8 => 0xc2 0xb8
0xad => 0xc2 0xad
0xe5 => 0xc3 0xa5
0x9b => 0xc2 0x9b
0xbd => 0xc2 0xbd
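
For reference, the suspected transformation can be reproduced in a few
lines of Python. This is only a sketch: the Latin-1 step is my guess at
what happens on the receiving side, not something I have confirmed, and
hexdump() is just a helper for this example.

```python
# The six bytes from the original patch: two three-byte UTF-8 characters.
original = bytes([0xE4, 0xB8, 0xAD, 0xE5, 0x9B, 0xBD])

# Assumed faulty step: read every byte as one Latin-1 character, then
# encode the resulting text as UTF-8.
garbled = original.decode("latin-1").encode("utf-8")


def hexdump(data: bytes) -> str:
    """Format bytes the same way as the listings in this mail."""
    return " ".join(f"0x{b:02x}" for b in data)


print(hexdump(garbled))

# If the guess is right, undoing both steps recovers the original bytes.
recovered = garbled.decode("utf-8").encode("latin-1")
print(hexdump(recovered))
```

The first line printed matches the recoded sequence above, and the
second matches the original six bytes, which is at least consistent
with the Latin-1 theory.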

That is obviously not what should happen: the recipient shouldn't make
any assumptions about the character set used.

The next attempt was sending the patch as a bug-followup through a
graphical MUA. The patch was attached and had been encoded as
quoted-printable (no specific charset specification):

+-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"
++configPath =3D
u"./config/=E4=B8=AD=E5=9B=BD_client.config".encode("utf-8=")

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=2
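
To double-check that the attachment itself carried the right bytes, the
quoted-printable text can be decoded with Python's standard quopri
module. A small sketch, using the first quoted line from the excerpt
above as input:

```python
import quopri

# First line of the quoted-printable excerpt, copied verbatim.
qp_line = b'+-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"'

# Undo the quoted-printable transfer encoding; this yields raw bytes.
decoded = quopri.decodestring(qp_line)
print(decoded)

# Those bytes are valid UTF-8 and contain the two expected characters.
text = decoded.decode("utf-8")
print("\u4e2d\u56fd" in text)
```

So the damage does not seem to come from the transfer encoding; the
bytes are still intact at that stage.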

Unfortunately the results are the same. I did not try forcing a charset
by manually modifying the email (I'm not sure whether this would work;
I'm willing to test, but I don't want to litter that PR further).
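
For anyone who wants to try it, this is roughly what a message with an
explicit charset would look like when built with Python's email
package. It is an untested idea as far as the PR system is concerned: I
do not know whether declaring the charset changes how the patch is
stored, and the patch line below is only a stand-in for the real patch.

```python
from email.message import EmailMessage

# Stand-in for the real patch body (hypothetical, shortened).
patch_text = '+configPath = u"./config/\u4e2d\u56fd_client.config"\n'

msg = EmailMessage()
# Declare the charset explicitly and use a 7-bit-safe transfer encoding.
msg.set_content(patch_text, charset="utf-8", cte="base64")

print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])

# Decoding the message locally gives the text back unchanged.
round_tripped = msg.get_content()
print("\u4e2d\u56fd" in round_tripped)
```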

At this point I figured that sending the patch gzipped might help. No
sooner said than done: the patch shows up as base64 in the PR. When I
copy and paste the base64 text and decode it, the resulting .gz can be
decompressed correctly and the content is what I expected. When
clicking the download link, though:

http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=3

The resulting .gz file has the correct file size, but is corrupted.
In a hex editor it looks like it has been re-encoded as UTF-8 (and
then truncated at the expected file size):

Hex of the original file (first 16 bytes):
1f 8b 08 08 ad 79 65 50  00 03 70 79 32 37 2d 49

Hex of the file downloaded by using the link:
1f c2 8b 08 08 c2 ad 79  65 50 00 03 70 79 32 37

As you can see, all bytes with the high bit set have been UTF-8
encoded, which is pretty suboptimal in a binary file.

0x8b => 0xc2 0x8b
0xad => 0xc2 0xad
...

As a result, the truncated, UTF-8-encoded gzip file cannot be
decompressed.
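
The same guess (bytes read as Latin-1, written back as UTF-8, then cut
at the original length) reproduces the downloaded header exactly. A
sketch using only the 16 bytes quoted above; the truncation step is my
assumption about what the download script does:

```python
# First 16 bytes of the original .gz, as listed above.
original_head = bytes.fromhex(
    "1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49"
)

# Assumed faulty steps: Latin-1 decode, UTF-8 encode, then truncate to
# the original length.
reencoded = original_head.decode("latin-1").encode("utf-8")
downloaded_head = reencoded[: len(original_head)]

print(downloaded_head.hex(" "))

# The gzip magic number is 1f 8b; after re-encoding it no longer matches.
print(original_head[:2] == b"\x1f\x8b")
print(downloaded_head[:2] == b"\x1f\x8b")
```

The re-encoded header starts with 1f c2 8b, so gzip no longer
recognizes the file at all, quite apart from the data lost at the end.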

I'm relatively certain that this has worked at some point in the past.

Ideas anyone?

Thanks,

-- 
Michael Gmelin