From owner-freebsd-doc Thu Nov 15 11:55: 5 2001
Delivered-To: freebsd-doc@freebsd.org
Received: from columbus.cris.net (columbus.cris.net [212.110.128.65]) by hub.freebsd.org (Postfix) with ESMTP id EA8A137B420; Thu, 15 Nov 2001 11:53:38 -0800 (PST)
Received: from ark.cris.net (ns2.cris.net [212.110.128.68]) by columbus.cris.net (8.9.3/8.9.3) with ESMTP id VAA14562; Thu, 15 Nov 2001 21:53:16 +0200 (EET)
Received: (from phantom@localhost) by ark.cris.net (8.11.1/8.11.1) id fAFJqiR08270; Thu, 15 Nov 2001 21:52:44 +0200 (EET)
Date: Thu, 15 Nov 2001 21:52:44 +0200
From: Alexey Zelkin
To: Hiroki Sato
Cc: horcicka@FreeBSD.cz, freebsd-doc@FreeBSD.ORG, nik@FreeBSD.ORG, saken@hotel.rmta.org
Subject: Re: Why TIDY can never work correctly with ISO-8859-2 and others
Message-ID: <20011115215244.A7285@ark.cris.net>
References: <20011115105650.W57038-100000@dual.ms.mff.cuni.cz> <20011115.214017.71143189.hrs@sekine00.ee.noda.sut.ac.jp> <20011115160532.A61351@ark.cris.net>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="OgqxwSJOaUobr8KG"
X-Mailer: Mutt 1.0i
In-Reply-To: <20011115160532.A61351@ark.cris.net>; from phantom@FreeBSD.ORG on Thu, Nov 15, 2001 at 04:05:32PM +0200
X-Operating-System: FreeBSD 3.5-STABLE i386
Sender: owner-freebsd-doc@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

--OgqxwSJOaUobr8KG
Content-Type: text/plain; charset=us-ascii

[ Cc'ed to the maintainer of ports/www/tidy ]

hi,

The attached patch does the job. At least my simple tests passed.
I added a new option, '-preserve', to tidy. This option disables
the translation of character entities to characters before processing.
As a side effect, all entities are written to the output file unchanged.

I would like to have feedback on this one. For the Russian Doc Project,
at least, it should do a good job, and I'd like to see it committed.
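To make the effect of such an option concrete, here is a minimal sketch in C. It is not code from tidy: the function name copy_text, its parameters, and the choice to handle only the two entities "&nbsp;" and "&amp;" are assumptions made for illustration. With preserve set to 0 the two entities are expanded to single bytes; with preserve set to a nonzero value every '&' is copied through untouched, which is the behaviour the '-preserve' option is meant to give.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of tidy: copies `in` to `out`.
 * preserve == 0: "&nbsp;" and "&amp;" are expanded to single bytes.
 * preserve != 0: every '&' is copied through unchanged.
 * Returns 0 on success, -1 if `out` is too small. */
int copy_text(const char *in, char *out, size_t outsz, int preserve)
{
    size_t o = 0;

    if (outsz == 0)
        return -1;

    while (*in != '\0') {
        char c = *in;
        size_t step = 1;

        if (c == '&' && !preserve) {
            if (strncmp(in, "&nbsp;", 6) == 0) {
                c = (char)0xA0;     /* character code 160 */
                step = 6;
            } else if (strncmp(in, "&amp;", 5) == 0) {
                c = '&';
                step = 5;
            }
        }

        /* always leave room for the terminating NUL */
        if (o + 1 >= outsz)
            return -1;
        out[o++] = c;
        in += step;
    }

    out[o] = '\0';
    return 0;
}
```

Calling copy_text("a&nbsp;b", buf, sizeof buf, 0) leaves three bytes in buf, the middle one being 0xA0, while the same call with preserve set to 1 leaves the eight characters of the input as they were.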

On Thu, Nov 15, 2001 at 04:05:32PM +0200, Alexey Zelkin wrote:
> hi,
>
> On Thu, Nov 15, 2001 at 09:40:17PM +0900, Hiroki Sato wrote:
>
> > horcicka> And if you use char-encoding: raw - character entities with values above 255
> > horcicka> are not printed as entities - this is really bad in 8-bit encodings.
> >
> > Yes, Japanese docs also suffer from it. The input routine of tidy expands
> > any entities first, even if the -raw flag is specified.
> >
> > horcicka> In my opinion Tidy cannot be used for encodings it does not natively support
> > horcicka> (i.e. for Russian and Czech (- still not in main CVS) translations of pages
> > horcicka> and docs).
> >
> > I think so, too.
> >
> > As a workaround, we can apply a patch and use a modified
> > version of tidy that does not interpret the given entities
> > and leaves them as entities, but I do not know if it will be a good solution.
>
> The most noticeable problem in the -raw case
> is the conversion of &nbsp; to the character with code 160. As a sufficient
> workaround for the Russian translation we have used the -latin1 case, but
> expanding all entities except &nbsp; and &amp; is bad anyway.
>
> I am working on a patch for tidy(1) that adds a new option which should
> suppress all entity -> character recoding. I hope it will be enough.

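The complaint about values above 255 can be illustrated from the output side. The sketch below is not tidy's output routine; emit_char and its parameters are hypothetical names chosen for this example. It shows the behaviour an 8-bit encoding needs: a code that fits in one byte is written as that byte, and anything larger is written as a numeric character reference instead of being dropped or truncated.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical output helper, not taken from tidy.
 * Writes one character code into `out` for an 8-bit encoding:
 * codes below 256 become a single byte, larger codes become a
 * numeric character reference such as "&#352;".
 * Returns the number of bytes written (excluding the NUL),
 * or -1 if `out` is too small. */
int emit_char(unsigned long code, char *out, size_t outsz)
{
    int n;

    if (code < 256) {
        if (outsz < 2)
            return -1;
        out[0] = (char)code;
        out[1] = '\0';
        return 1;
    }

    n = snprintf(out, outsz, "&#%lu;", code);
    if (n < 0 || (size_t)n >= outsz)
        return -1;      /* output was truncated or failed */
    return n;
}
```

For example, emit_char(65, buf, sizeof buf) stores "A", while emit_char(352, buf, sizeof buf) stores the six characters "&#352;".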
--OgqxwSJOaUobr8KG
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="tidy.preserve.entities.patch"

diff -u work/tidy4aug00/config.c tidy4aug00.patched/config.c
--- work/tidy4aug00/config.c    Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/config.c Thu Nov 15 21:55:25 2001
@@ -94,6 +94,7 @@
 Bool TidyMark = yes;        /* add meta element indicating tidied doc */
 Bool Emacs = no;            /* if true format error output for GNU Emacs */
 Bool LiteralAttribs = no;   /* if true attributes may use newlines */
+Bool PreserveEntities = no; /* if true don't convert entities to chars */
 
 typedef struct _lex PLex;
 
@@ -186,6 +187,7 @@
     {"doctype", {(int *)&doctype_str}, ParseDocType},
     {"fix-backslash", {(int *)&FixBackslash}, ParseBool},
     {"gnu-emacs", {(int *)&Emacs}, ParseBool},
+    {"preserve-entities", {(int *)&PreserveEntities}, ParseBool},
 
     /* this must be the final entry */
     {0, 0, 0}
@@ -533,6 +535,12 @@
     {
         QuoteAmpersand = yes;
         HideEndTags = no;
+    }
+
+    /* Avoid &amp;copy; in preserve-entities case */
+    if (PreserveEntities)
+    {
+        QuoteAmpersand = no;
     }
 }
 
diff -u work/tidy4aug00/html.h tidy4aug00.patched/html.h
--- work/tidy4aug00/html.h      Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/html.h   Thu Nov 15 21:55:26 2001
@@ -758,6 +758,7 @@
 extern Bool Word2000;
 extern Bool Emacs;  /* sasdjb 01May00 GNU Emacs error output format */
 extern Bool LiteralAttribs;
+extern Bool PreserveEntities;
 
 /* Parser methods for tags */
 
diff -u work/tidy4aug00/lexer.c tidy4aug00.patched/lexer.c
--- work/tidy4aug00/lexer.c     Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/lexer.c  Thu Nov 15 21:55:26 2001
@@ -1517,8 +1517,10 @@
 
                     continue;
                 }
-                else if (c == '&' && mode != IgnoreMarkup)
-                    ParseEntity(lexer, mode);
+                else if (c == '&' && mode != IgnoreMarkup
+                         && !PreserveEntities) {
+                    ParseEntity(lexer, mode);
+                }
 
                 /* this is needed to avoid trimming trailing whitespace */
                 if (mode == IgnoreWhitespace)
@@ -2624,7 +2626,7 @@
                 seen_gt = yes;
             }
 
-            if (c == '&')
+            if (c == '&') /* XXX: possibly need support for PreserveEntities */
             {
                 AddCharToLexer(lexer, c);
                 ParseEntity(lexer, null);
diff -u work/tidy4aug00/localize.c tidy4aug00.patched/localize.c
--- work/tidy4aug00/localize.c  Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/localize.c       Thu Nov 15 21:55:26 2001
@@ -736,6 +736,7 @@
     tidy_out(out, " -xml use this when input is wellformed xml\n");
     tidy_out(out, " -asxml to convert html to wellformed xml\n");
     tidy_out(out, " -slides to burst into slides on h2 elements\n");
+    tidy_out(out, " -preserve to preserve entities as is in source file\n");
     tidy_out(out, "\n");
 
     tidy_out(out, "Character encodings\n");
diff -u work/tidy4aug00/man_page.txt tidy4aug00.patched/man_page.txt
--- work/tidy4aug00/man_page.txt        Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/man_page.txt     Thu Nov 15 21:55:26 2001
@@ -12,6 +12,7 @@
 .IR column ]
 .RB [ -upper ]
 .RB [ -clean ]
+.RB [ -preserve ]
 .RB [ -raw
 |
 .B -ascii
@@ -106,6 +107,9 @@
 .TP
 .B -slides
 Burst into slides on <h2> elements.
+.TP
+.B -preserve
+Preserve source file entities as is.
 .TP
 .BR -help ", " -h
 List command-line options.
diff -u work/tidy4aug00/tidy.c tidy4aug00.patched/tidy.c
--- work/tidy4aug00/tidy.c      Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/tidy.c   Thu Nov 15 21:55:26 2001
@@ -785,6 +785,8 @@
                 Quiet = yes;
             else if (strcmp(arg, "slides") == 0)
                 BurstSlides = yes;
+            else if (strcmp(arg, "preserve") == 0)
+                PreserveEntities = yes;
             else if (strcmp(arg, "help") == 0 ||
                      argv[1][1] == '?'|| argv[1][1] == 'h')
             {

--OgqxwSJOaUobr8KG--

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-doc" in the body of the message