From owner-freebsd-questions@FreeBSD.ORG Wed May 16 01:35:13 2007
Return-Path:
X-Original-To: freebsd-questions@freebsd.org
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 306B216A403 for ; Wed, 16 May 2007 01:35:13 +0000 (UTC) (envelope-from kline@tao.thought.org)
Received: from tao.thought.org (dsl231-043-140.sea1.dsl.speakeasy.net [216.231.43.140]) by mx1.freebsd.org (Postfix) with ESMTP id 9F7D213C447 for ; Wed, 16 May 2007 01:35:12 +0000 (UTC) (envelope-from kline@tao.thought.org)
Received: from tao.thought.org (localhost [127.0.0.1]) by tao.thought.org (8.13.8/8.13.1) with ESMTP id l4E2QvgT001392; Sun, 13 May 2007 19:26:58 -0700 (PDT) (envelope-from kline@tao.thought.org)
Received: (from kline@localhost) by tao.thought.org (8.13.8/8.13.1/Submit) id l4E2QuxD001391; Sun, 13 May 2007 19:26:56 -0700 (PDT) (envelope-from kline)
Date: Sun, 13 May 2007 19:26:55 -0700
From: Gary Kline
To: Ian Smith
Message-ID: <20070514022655.GA1304@thought.org>
References: <20070514210933.1024A16A478@hub.freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4.2.2i
X-Organization: Thought Unlimited. Public service Unix since 1986.
X-Of_Interest: Observing twenty years of service to the Unix community
Cc: Gary Kline , freebsd-questions@freebsd.org
Subject: Re: what's the easiest way to de-html-ize files?
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 16 May 2007 01:35:13 -0000

On Tue, May 15, 2007 at 03:34:14PM +1000, Ian Smith wrote:
> On Sat, 12 May 2007 14:34:52 -0700 Gary Kline wrote:
>  > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
>  > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
>  > > > This is for those of us who appreciate ASCII or straight
>  > > > ISO_8859-15 rather than marked up files. I have slapped together
>  > > > a crude C program that does scotch (or *cleanse*) text of
>  > > > and so on. Still... is there some standalone converter
>  > > > that gets rid of markup more elegantly? Something where I
>  > > > can say
>  > > >
>  > > > % cmd file_1.html ... file_N.html and output file_1.text ...
>  > > > file_N.text?
>  > >
>  > > Perhaps:
>  > >
>  > > lynx -dump file1.html ... > file.text
>  > >
>  > > ...?
>  >
>  > Hm, maybe I need Bill Campbell's -force_html switch.
>  >
>  > Yes, seems that way. Using just -dump got most of them, but
>  > using the -force_html caught all. Need to script something to
>  > reformat, but the worst of it's done!
>
> Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
> dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
>
> This does a pretty decent job of producing text from HTML files, and is
> quicker than firing up lynx (or links) if you're already viewing a page.

Oh sure; I've been saving html as text, ascii/8859-1, for years. But
what I've got, and there are more saved **somewhere**, are files that
were saved by default in markup. I have a slew of these on different
boxen and have been moving them to one place. Problem is: how to
de-html the bunch. I'm too lazy to write something that would automate
what can be automated; markup like "&foo;" is problematic.

So probably the easiest way would be to create a dehtml.sh script
that is just a wrapper around lynx. I don't think I'm the only hacker
who wants just-plain-ascii, so this might make a good project for
somebody who's new to C or perl.

That's my two pennies' worth!

gary

>
> Cheers, Ian
>

-- 
Gary Kline kline@thought.org www.thought.org Public Service Unix
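
For illustration, here is a minimal sketch of the dehtml.sh wrapper suggested above. It assumes lynx is installed and uses only the two switches mentioned in this thread (-dump and -force_html). The function name dehtml and the .text output naming are assumptions taken from the "file_1.html ... file_N.text" wording, not an existing tool.

```sh
#!/bin/sh
# dehtml.sh: hypothetical wrapper around lynx (a sketch, not a finished tool).
# Usage: dehtml.sh file_1.html ... file_N.html
# Writes file_1.text ... file_N.text next to the inputs.

dehtml() {
    status=0
    for src in "$@"; do
        # Derive the output name: strip a trailing .html/.htm, append .text.
        case "$src" in
            *.html) dst="${src%.html}.text" ;;
            *.htm)  dst="${src%.htm}.text" ;;
            *)      dst="$src.text" ;;
        esac
        # -dump renders the page as plain text on stdout;
        # -force_html treats the input as HTML whatever its file name.
        if ! lynx -dump -force_html "$src" > "$dst"; then
            echo "dehtml: failed to convert $src" >&2
            rm -f "$dst"
            status=1
        fi
    done
    return $status
}

dehtml "$@"
```

Run as "sh dehtml.sh *.html". Since lynx decodes character entities such as "&amp;" while rendering, the wrapper itself needs no special handling for them; any further reformatting of the output would be a separate step.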