Date: Thu, 15 Nov 2001 21:52:44 +0200
From: Alexey Zelkin <phantom@FreeBSD.ORG>
To: Hiroki Sato <hrs@eos.ocn.ne.jp>
Cc: horcicka@FreeBSD.cz, freebsd-doc@FreeBSD.ORG, nik@FreeBSD.ORG, saken@hotel.rmta.org
Subject: Re: Why TIDY can never work correctly with ISO-8859-2 and others
Message-ID: <20011115215244.A7285@ark.cris.net>
In-Reply-To: <20011115160532.A61351@ark.cris.net>; from phantom@FreeBSD.ORG on Thu, Nov 15, 2001 at 04:05:32PM +0200
References: <20011115105650.W57038-100000@dual.ms.mff.cuni.cz> <20011115.214017.71143189.hrs@sekine00.ee.noda.sut.ac.jp> <20011115160532.A61351@ark.cris.net>
[-- Attachment #1 --]
[ Cc'ed to maintainer of ports/www/tidy ]
hi,
The attached patch does the job. At least, it passed my simple tests.
I just added a new option, '-preserve', to tidy. This option disables
translation of character entities to characters before processing.
As a "side effect", all entities are saved correctly in the output file.
I would like to get feedback on this one. At least for the Russian Doc Project
it should do a good job, and I'd like to see it committed.
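To illustrate the idea only (this is not the patch itself), here is a minimal
sketch in C. The helper copy_entities() and its preserve argument are
hypothetical names made up for this example. It handles only decimal numeric
references with values 1..255, whereas tidy does the real work, including
named entities, in ParseEntity() in lexer.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of tidy.
 * Copy src to dst. When preserve is zero, decimal numeric character
 * references (&#NNN;) with values 1..255 are replaced by the single byte
 * they denote. When preserve is non-zero, everything is copied verbatim.
 * dst must have room for at least strlen(src) + 1 bytes.
 */
void copy_entities(const char *src, char *dst, int preserve)
{
    assert(src != NULL && dst != NULL);

    while (*src != '\0') {
        if (!preserve && src[0] == '&' && src[1] == '#'
            && isdigit((unsigned char)src[2])) {
            char *end;
            long code = strtol(src + 2, &end, 10);

            /* Expand only well-formed references that fit in one byte. */
            if (*end == ';' && code > 0 && code <= 255) {
                *dst++ = (char)(unsigned char)code;
                src = end + 1;
                continue;
            }
        }
        /* Anything else, including an unexpanded '&', is copied as is. */
        *dst++ = *src++;
    }
    *dst = '\0';
}
```

With preserve set to zero, "a&#160;b" becomes three bytes with the byte 160 in
the middle. With preserve set to non-zero, the same input comes out unchanged,
which is the behaviour '-preserve' is meant to give for entities.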
On Thu, Nov 15, 2001 at 04:05:32PM +0200, Alexey Zelkin wrote:
> hi,
>
> On Thu, Nov 15, 2001 at 09:40:17PM +0900, Hiroki Sato wrote:
>
> > horcicka> And if you use char-encoding: raw - character entities with values above 255
> > horcicka> are not printed as entities - this is really bad in 8-bit encodings.
> >
> > Yes, Japanese docs also suffer from it. The input routine of tidy expands
> > any entities first, even if -raw flag is specified.
> >
> > horcicka> In my opinion Tidy cannot be used for encodings it does not natively support
> > horcicka> (i.e. for Russian and Czech (- still not in main CVS) translations of pages
> > horcicka> and docs).
> >
> > I think so, too.
> >
> > As a workaround, we can apply a patch and use a modified
> > version of tidy that does not interpret the given entities
> > and leaves them as entities, but I do not know if that will be a good solution.
>
> The most noticeable problem in the -raw case
> is the conversion of &nbsp; to the character with code 160. As a sufficient
> workaround for the Russian translation we've used the -latin1 case, but
> anyway expanding of all entities except and & is bad.
>
> I am working on a patch for tidy(1) to add a new option which should
> suppress all entity -> character recoding. I hope it will be enough.
[-- Attachment #2 --]
diff -u work/tidy4aug00/config.c tidy4aug00.patched/config.c
--- work/tidy4aug00/config.c Fri Aug 4 19:21:05 2000
+++ tidy4aug00.patched/config.c Thu Nov 15 21:55:25 2001
@@ -94,6 +94,7 @@
Bool TidyMark = yes; /* add meta element indicating tidied doc */
Bool Emacs = no; /* if true format error output for GNU Emacs */
Bool LiteralAttribs = no; /* if true attributes may use newlines */
+Bool PreserveEntities = no; /* if true don't convert entities to chars */
typedef struct _lex PLex;
@@ -186,6 +187,7 @@
{"doctype", {(int *)&doctype_str}, ParseDocType},
{"fix-backslash", {(int *)&FixBackslash}, ParseBool},
{"gnu-emacs", {(int *)&Emacs}, ParseBool},
+ {"preserve-entities", {(int *)&PreserveEntities}, ParseBool},
/* this must be the final entry */
{0, 0, 0}
@@ -533,6 +535,12 @@
{
QuoteAmpersand = yes;
HideEndTags = no;
+ }
+
+ /* Avoid &copy; in preserve-entities case */
+ if (PreserveEntities)
+ {
+ QuoteAmpersand = no;
}
}
diff -u work/tidy4aug00/html.h tidy4aug00.patched/html.h
--- work/tidy4aug00/html.h Fri Aug 4 19:21:05 2000
+++ tidy4aug00.patched/html.h Thu Nov 15 21:55:26 2001
@@ -758,6 +758,7 @@
extern Bool Word2000;
extern Bool Emacs; /* sasdjb 01May00 GNU Emacs error output format */
extern Bool LiteralAttribs;
+extern Bool PreserveEntities;
/* Parser methods for tags */
diff -u work/tidy4aug00/lexer.c tidy4aug00.patched/lexer.c
--- work/tidy4aug00/lexer.c Fri Aug 4 19:21:05 2000
+++ tidy4aug00.patched/lexer.c Thu Nov 15 21:55:26 2001
@@ -1517,8 +1517,10 @@
continue;
}
- else if (c == '&' && mode != IgnoreMarkup)
- ParseEntity(lexer, mode);
+ else if (c == '&' && mode != IgnoreMarkup
+ && !PreserveEntities) {
+ ParseEntity(lexer, mode);
+ }
/* this is needed to avoid trimming trailing whitespace */
if (mode == IgnoreWhitespace)
@@ -2624,7 +2626,7 @@
seen_gt = yes;
}
- if (c == '&')
+ if (c == '&') /* XXX: possibly need support for PreserveEntities */
{
AddCharToLexer(lexer, c);
ParseEntity(lexer, null);
diff -u work/tidy4aug00/localize.c tidy4aug00.patched/localize.c
--- work/tidy4aug00/localize.c Fri Aug 4 19:21:05 2000
+++ tidy4aug00.patched/localize.c Thu Nov 15 21:55:26 2001
@@ -736,6 +736,7 @@
tidy_out(out, " -xml use this when input is wellformed xml\n");
tidy_out(out, " -asxml to convert html to wellformed xml\n");
tidy_out(out, " -slides to burst into slides on h2 elements\n");
+ tidy_out(out, " -preserve to preserve entities as is in source file\n");
tidy_out(out, "\n");
tidy_out(out, "Character encodings\n");
diff -u work/tidy4aug00/man_page.txt tidy4aug00.patched/man_page.txt
--- work/tidy4aug00/man_page.txt Fri Aug 4 19:21:05 2000
+++ tidy4aug00.patched/man_page.txt Thu Nov 15 21:55:26 2001
@@ -12,6 +12,7 @@
.IR column ]
.RB [ -upper ]
.RB [ -clean ]
+.RB [ -preserve ]
.RB [ -raw
|
.B -ascii
@@ -106,6 +107,9 @@
.TP
.B -slides
Burst into slides on <H2> elements.
+.TP
+.B -preserve
+Preserve source file entities as is.
.TP
.BR -help ", " -h
List command-line options.
diff -u work/tidy4aug00/tidy.c tidy4aug00.patched/tidy.c
--- work/tidy4aug00/tidy.c Fri Aug 4 19:21:05 2000
+++ tidy4aug00.patched/tidy.c Thu Nov 15 21:55:26 2001
@@ -785,6 +785,8 @@
Quiet = yes;
else if (strcmp(arg, "slides") == 0)
BurstSlides = yes;
+ else if (strcmp(arg, "preserve") == 0)
+ PreserveEntities = yes;
else if (strcmp(arg, "help") == 0 ||
argv[1][1] == '?'|| argv[1][1] == 'h')
{
