Date:      Mon, 23 Jan 2012 12:39:25 -0700 (MST)
From:      Warren Block <wblock@wonkity.com>
To:        Gabor Kovesdan <gabor@FreeBSD.org>
Cc:        freebsd-doc@FreeBSD.org
Subject:   Re: Tidy and HTML tab spacing
Message-ID:  <alpine.BSF.2.00.1201231145380.90760@wonkity.com>
In-Reply-To: <4F1D93E0.2050709@FreeBSD.org>
References:  <alpine.BSF.2.00.1201181255210.39534@wonkity.com> <alpine.BSF.2.00.1201181520140.40712@wonkity.com> <4F1B4767.5070105@FreeBSD.org> <alpine.BSF.2.00.1201211648030.72083@wonkity.com> <4F1D93E0.2050709@FreeBSD.org>

[-- Attachment #1 --]
On Mon, 23 Jan 2012, Gabor Kovesdan wrote:

> On 2012.01.22. 1:30, Warren Block wrote:
>> On Sun, 22 Jan 2012, Gabor Kovesdan wrote:
>> 
>>> On 2012.01.18. 23:49, Warren Block wrote:
>>>> 5. Don't tidy HTML files at all (suggested as an option by Benedict
>>>>    Reuschling).  The unprocessed HTML is ugly, but few people are going
>>>>    to look at it directly.  Files that haven't been through tidy are a
>>>>    little larger, about 4% in the case of the Porter's Handbook. 
>>> I also think tidy should be removed. As hrs wrote, new standards should be 
>>> evaluated and probably they are much better. (I think they are.) If there 
>>> are some nits, then we should process it with a custom script or 
>>> something, instead of this crapware.
>> 
>> Tidy does a lot; it would be a lot of work to recreate. 
> Tidy is also the reason that our webpages are not valid HTML.

A new version of Tidy is supposed to be out soonish.  Whether it will 
solve the problems, I don't know.

What about lxml?  Available in ports (devel/py-lxml), reputed to be good 
at parsing problem HTML and creating good XHTML.  A quick test showed 
that it seems to do okay with <pre> elements.

A quick script to generate a test file is attached.  The W3C validator says 
the lxml version of the Porter's Handbook has eight errors, versus the six 
errors and five warnings of the Tidy version.  (The ugly special case in 
line 12 of the script drops the lxml version to five errors.)
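
If the special case turns out to be worth keeping, it could be generalized 
a little.  What follows is only a sketch, not part of the attached script: 
the function name and the attribute list are made up for illustration, and 
it assumes the markup has already been serialized to a string.  It rewrites 
any listed boolean attribute whose value is its own name in the wrong case 
(compact="COMPACT") to the lowercase form that XHTML 1.0 expects.

```python
import re

# Illustrative list only; extend it with whatever the validator complains about.
BOOLEAN_ATTRS = ("compact", "nowrap", "checked", "disabled", "selected")

# Matches name="value" pairs where both sides are purely alphabetic.
_ATTR = re.compile(r'\b([A-Za-z]+)="([A-Za-z]+)"')


def lowercase_boolean_attrs(markup):
    """Rewrite e.g. compact="COMPACT" as compact="compact".

    Hypothetical helper: it works on the whole string, so it would also
    touch matching text outside of tags.
    """
    def fix(match):
        name = match.group(1).lower()
        value = match.group(2).lower()
        if name in BOOLEAN_ATTRS and value == name:
            return '%s="%s"' % (name, name)
        # Anything else (class="Note", etc.) is left exactly as it was.
        return match.group(0)

    return _ATTR.sub(fix, markup)


if __name__ == "__main__":
    print(lowercase_boolean_attrs('<ul compact="COMPACT"><li>item</li></ul>'))
```

A plain string replace, as in the script, is simpler when only one 
attribute is involved; the regex version only pays off if several of these 
show up.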
[-- Attachment #2 --]
#!/usr/bin/env python

from lxml import etree
import re

inhtml = open('book.html', 'r').read()

tree = etree.HTML(inhtml.replace('\r', ''))
outxhtml = '\n'.join([ etree.tostring(stree, pretty_print=True, method="xml",
		encoding="unicode") for stree in tree ])

outxhtml = outxhtml.replace('compact="COMPACT"', 'compact="compact"')

f = open('lxml.html', 'w')
f.write('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n')
f.write('<html xmlns="http://www.w3.org/1999/xhtml">\n')
f.write(outxhtml)
f.write('</html>\n')
f.close()