From owner-freebsd-questions@FreeBSD.ORG Wed Oct 27 00:06:54 2010
Date: Tue, 26 Oct 2010 19:04:50 -0500 (CDT)
From: Robert Bonomi
Message-ID: <201010270004.o9R04o1Y004753@mail.r-bonomi.com>
To: freebsd-questions@freebsd.org
Subject: Re: Is there any way of transfering my excellent PDF file into plain HTML

> From owner-freebsd-questions@freebsd.org Tue Oct 26 13:28:24 2010
> Date: Tue, 26 Oct 2010 11:30:01 -0700
> From: Gary Kline
> To: FreeBSD Mailing List
> Subject: Is there any way of transfering my excellent PDF file into plain
>  HTML
>
> One thing that Linux misses--or seems to--is all the conversion
> programs that go from one format to another. I _was_ able to use
> abiread to get a PDF text into an obscure HTML, but hundreds of
> paragraphs get broken up. So: is there any conversion program to
> do it *right*?

Authoritative answer: "maybe".

This is one of those things where there's no substitute for a trained eyeball.
Depending on _how_ the PDF was generated, there can be things in it that 'look
like' breaks to a mechanical parser, but don't appear that way on the page.
It's -really- hard for a parser to tell a 'near no-op' from something that
does something 'significant'.

Maybe Ghostscript's "pdf2ps", followed by "ps2ascii"; then wrap the result in
minimal HTML framing that simply declares it to be a '<pre>' block.
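
For what it's worth, the pipeline suggested above could be sketched roughly as follows. This is only an illustration, not a tested recipe: the function names (wrap_in_pre, pdf_to_html), the page title, and the latin-1 decoding of the ps2ascii output are my own assumptions, and it presumes Ghostscript's "pdf2ps" and "ps2ascii" are on the PATH. The text still needs that trained eyeball afterwards.

```python
import html
import subprocess
import tempfile
from pathlib import Path


def wrap_in_pre(text: str, title: str = "Converted document") -> str:
    """Wrap plain text in minimal HTML that declares it a <pre> block."""
    # Escape &, < and > so the text cannot be mistaken for markup.
    body = html.escape(text, quote=False)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"<pre>\n{body}</pre>\n"
        "</body>\n</html>\n"
    )


def pdf_to_html(pdf_path: str, html_path: str) -> None:
    """Hypothetical helper: pdf2ps, then ps2ascii, then the <pre> wrapper."""
    with tempfile.TemporaryDirectory() as tmp:
        ps_file = Path(tmp) / "document.ps"
        # Step 1: PDF -> PostScript.
        subprocess.run(["pdf2ps", pdf_path, str(ps_file)], check=True)
        # Step 2: PostScript -> plain text, written to stdout.
        result = subprocess.run(
            ["ps2ascii", str(ps_file)], check=True, capture_output=True
        )
    # Assumption: decode as latin-1, which never fails on arbitrary bytes.
    text = result.stdout.decode("latin-1")
    # Step 3: minimal HTML framing around the extracted text.
    Path(html_path).write_text(wrap_in_pre(text), encoding="utf-8")
```

Calling pdf_to_html("input.pdf", "output.html") would write a single page whose line breaks are exactly those ps2ascii produced; nothing here tries to repair broken paragraphs.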