Date: Sun, 30 Sep 2012 23:20:11 GMT
From: Michael Gmelin <freebsd@grem.de>
To: freebsd-www@FreeBSD.org
Subject: Re: www/172195: PR database corrupts patches
Message-ID: <201209302320.q8UNKBaV062869@freefall.freebsd.org>

The following reply was made to PR www/172195; it has been noted by GNATS.

From: Michael Gmelin <freebsd@grem.de>
To: bug-followup@FreeBSD.org
Cc:
Subject: Re: www/172195: PR database corrupts patches
Date: Mon, 1 Oct 2012 01:12:00 +0200

Analysis:

1. The PR system assumes some encoding other than UTF-8 to be the
   default. This means:

   a) Patches uploaded through the web form will be corrupted.
   b) Patches mailed as attachments without an explicit charset
      specification will be corrupted.
   c) Standard send-pr patches break. Adding a UTF-8 charset header
      manually will probably work, but is too easy to forget. It also
      won't fix the download option.

2. The PR system can handle binary attachments correctly in its base64
   view.

3. Downloaded patches are corrupted in all cases!

   a) File attached via web form:

      fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=1" | hd
      c3 a4 c2 b8 c2 ad
      (should have been: e4 b8 ad e5 9b bd)

      This looks like the input has been assumed to be latin1,
      transcoded to UTF-8 and truncated.

   b) File sent as follow-up attachment without UTF-8 charset:

      fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=2" | hd
      c3 a4 c2 b8 c2 ad c3 a5
      (should have been: e4 b8 ad e5 9b bd)

      This looks like the input has been assumed to be latin1 and
      transcoded to UTF-8.

   c) File sent as follow-up attachment WITH UTF-8 charset (this one
      shows up correctly on the web page, but the download is still
      broken):

      fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=3" | hd
      e4 b8 ad e5
      (should have been: e4 b8 ad e5 9b bd)

      This looks like it got the encoding right, but can't handle
      three-byte characters (string length calculation problem?!)

   d) Gzipped version of the patch:

      The base64 encoded version shown on the PR web page is correct:

      md5 china.txt.gz
      MD5 (china.txt.gz) = 29009c79690c58b0762274da0e3ad80d

      echo "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA==" \
        | openssl enc -d -a | md5
      29009c79690c58b0762274da0e3ad80d

      Downloading through the download link fails though:

      fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=4" | md5
      ae9f2f3531871be8c4af662863eb542e

      A closer look at the gzip file shows that there has been an
      attempt to somehow UTF-8 encode the binary content:

      Original:
      00000000  1f 8b 08 08 81 bb 68 50  00 03 63 68 69 6e 61 2e
      00000010  74 78 74 00 7b b2 63 ed  d3 d9 7b b9 00 b5 48 f4
      00000020  b5 07 00 00 00
      00000025

      File as downloaded from the PR website:
      00000000  1f c2 8b 08 08 c2 81 c2  bb 68 50 00 03 63 68 69
      00000010  6e 61 2e 74 78 74 00 7b  c2 b2 63 c3 ad c3 93 c3
      00000020  99 7b c2 b9 00
      00000025

      As you can see, 8-bit characters have been UTF-8 encoded, and the
      resulting file got truncated at the original file size.

Conclusion:

There is no simple way of submitting a patch through the PR system so
that it can be downloaded using the download link. Right now the
options are:

1. Send the file as an email attachment, making sure that the character
   encoding in the MIME header is set to UTF-8 (not all email clients
   will do this automatically). This way a patch can be acquired by
   using copy and paste. The download link will still not work
   correctly and will yield surprising results: a patch acquired
   through it might actually apply, but cause unintended behavior.

2. Send the file gzipped and make people use base64 decode to get the
   gzip. This way, when the download link is used, people will at least
   realize something went wrong.

3. Base64 encode the patch before sending it. This way everything stays
   us-ascii and cannot be messed with by the PR system.
   This requires users to base64 decode on their own and makes it hard
   to discuss the patch in a way that's transparent to users of the
   web page.

None of these options seems very appealing, especially since they make
it easy for people to get it wrong and hard to get it right. Also,
various tools used by port maintainers (porttools, send-pr etc.) might
not be prepared to help the user get it right.

There will be more and more UTF-8 encoded patches in the future, so I
think this should be fixed.

Suggested fixes:

- Change the default encoding (the encoding assumed when no encoding is
  specified) to UTF-8. This might not be practical in all cases, but
  should be discussed.
- Make sure that the download option provides correct files (it should
  treat all files as binary and not try to alter them in any way).

I hope all of this makes sense.

-- 
Michael Gmelin
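
The corruption pattern described in the analysis (input read as latin1, re-encoded as UTF-8, then cut at the original byte count) can be checked offline. What follows is a minimal Python sketch of that hypothesis only, not the actual query-pr.cgi code. The function name simulate_download is invented for illustration, and the six-byte text input is simply the expected byte sequence quoted in the analysis. The sketch also shows why base64 encoding a patch sidesteps the problem: the encoded text is pure ASCII, so the same transformation leaves it untouched.

```python
import base64


def simulate_download(original: bytes) -> bytes:
    """Hypothetical model of the observed corruption (an assumption,
    not the PR system's real code): read the bytes as Latin-1,
    re-encode them as UTF-8, then truncate to the original length."""
    reencoded = original.decode("latin-1").encode("utf-8")
    return reencoded[: len(original)]


# Expected bytes of the text patch, as quoted in the analysis.
text_patch = bytes.fromhex("e4 b8 ad e5 9b bd")

# The gzipped patch, taken from the base64 view quoted in the analysis.
gz_b64 = "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA=="
gz_patch = base64.b64decode(gz_b64)

# Compare these dumps with the hd output in cases a) and d).
print(simulate_download(text_patch).hex(" "))
print(simulate_download(gz_patch).hex(" "))

# Base64 text contains only ASCII, so the model leaves it unchanged.
ascii_patch = gz_b64.encode("ascii")
print(simulate_download(ascii_patch) == ascii_patch)
```

Under this model the first dump reproduces case a) and the second reproduces the downloaded gzip file byte for byte. Cases b) and c) do not fit this exact model, so they probably involve a different code path.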