Date: Tue, 26 Oct 2010 13:40:12 -0700
From: Gary Kline <kline@thought.org>
To: Roland Smith <rsmith@xs4all.nl>
Cc: Polytropon <freebsd@edvax.de>, FreeBSD Mailing List <freebsd-questions@freebsd.org>, Liontaur <liontaur@gmail.com>
Subject: Re: Is there any way of transfering my excellent PDF file into plain HTML
Message-ID: <20101026204012.GD3792@thought.org>
In-Reply-To: <20101026200301.GA12886@slackbox.erewhon.net>
References: <20101026182958.GA3646@thought.org> <AANLkTikBba5k4CAG_Qa%2BBjwy-URiKY0ak%2BNK0cc38OJB@mail.gmail.com> <20101026205924.91748d4c.freebsd@edvax.de> <20101026193020.GA3792@thought.org> <20101026200301.GA12886@slackbox.erewhon.net>

On Tue, Oct 26, 2010 at 10:03:01PM +0200, Roland Smith wrote:
> On Tue, Oct 26, 2010 at 12:30:20PM -0700, Gary Kline wrote:
> > On Tue, Oct 26, 2010 at 08:59:24PM +0200, Polytropon wrote:
> > > On Tue, 26 Oct 2010 11:38:20 -0700, Liontaur <liontaur@gmail.com> wrote:
> > > > Related but slightly OT, I've never had much luck getting it the other way
> > > > around, HTML to PDF. It's often off a bit. I can't remember off the top of
> > > > my head what ports I've tried but yeah. Either the images are wonky or my
> > > > forms go wonky.
> > >
> > > This is simply because HTML is not typesetting-capable. Depending
> > > on the source of the PDF file, it may help to convert from THAT
> > > format instead of from PDF. E. g. if you have a .tex (LaTeX) file
> > > that has been the source of the PDF file, you can use a converter
> > > from LaTeX to HTML, often with acceptable results.
> > >
> > > The HTML concept, especially when incorporating CSS for formatting,
> > > _can_ be used to gain a bit of typographic quality, e. g. by defining
> > > parameters for "screen" and for "printed" media. Still it suffers
> > > from things like maintaining good grey values, hyphenation and
> > > ligatures.
>
> You can add proper justification to the list that HTML doesn't do well!
>
> > Hmm. The ligatures that looked so great in my .tex/PDF output
> > got lost.
>
> Very few programs do ligatures well. If you're using Unicode text, you can use
> them directly in your text, like this: ??? ??? ??? ??? ??? ???
>
> How well these look depends on the fonts used. I've got a whole list of handy
> Unicode characters on my webpage. See the entry marked 2010-10-16.
>
> > Only that somehow, HTML4 can read the hex code that
> > abiword's html created. :-) Also, the `` and '' look great in
> > Times. I fixed the page numbers--all had to go away; I edited
> > the chapter headings--all by hand. What's left are the hundreds
> > of broken paragraphs.
>
> You might fare better by taking the TeX source, running it through detex(1) and
> using markdown [http://daringfireball.net/projects/markdown/] to create HTML.
>
> > What utility takes a LaTeX file -> HTML? (Be nice to have both
> > *strictly professional typeset* and then HTML. I can add
> > indents for AE style paragraphing, and much more. Fix the
> > hyphenation, etc.)
>
> Next to the obvious textproc/latex2html? :-)
>

Yeah, found it with locate! And found some very interesting results. I
haven't checked my .tex source, but latex2html produces some **very**
interesting results. In my latest j.html file there are hundreds of
"broken paragraphs" such as:

She stopped and turned around. "What?" he said. "I just thought I'=
taking the wrong course."

And so on. There is a "<br>" embedded in hundreds of paragraphs. Now that
I have the output from latex2html to check against, things can be that
much easier.

Do you or does any regex wiz have a way of catching embedded <br>'s
within sentences? It might save me. It would certainly make things
_easier_!

I'll play around with /[a-zA-Z]<br>[A-Za-z]/. Hope the > and < aren't a
problem in regexland.... :-)

thanks much,

gary

> Roland
> --
> R.F.Smith http://www.xs4all.nl/~rsmith/
> [plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
> pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)

--
Gary Kline kline@thought.org http://www.thought.org Public Service Unix
The 7.90a release of Jottings: http://jottings.thought.org/index.php
http://journey.thought.org
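
A note on the embedded <br> question in the message: a minimal sketch in Python, which is an assumption since the thread names no language. It takes the letter, <br>, letter idea from the message, also tolerates whitespace or a newline around the tag, and swaps the tag for a single space. The names EMBEDDED_BR and join_broken_sentences are made up for illustration. In Python's re module, < and > are ordinary characters outside of group syntax such as (?<=...), so they need no escaping.

```python
import re

# A <br> tag (any case, optional "/") that has a letter right before it
# (ignoring whitespace next to the tag) and a letter right after it.
EMBEDDED_BR = re.compile(
    r"(?<=[A-Za-z])\s*<br\s*/?>\s*(?=[A-Za-z])",
    re.IGNORECASE,
)


def join_broken_sentences(html: str) -> str:
    """Replace each mid-sentence <br> with a single space."""
    return EMBEDDED_BR.sub(" ", html)


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the real j.html file.
    sample = 'He said he was<br>\ntaking the wrong course."<br>\nNext paragraph.'
    print(join_broken_sentences(sample))
```

A <br> that follows closing punctuation, such as a period or a quotation mark, is left alone, so genuine paragraph breaks survive. A break right after a comma is missed as well, so the character classes may need widening for real text.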