From: Garrett Cooper
Date: Tue, 15 May 2007 18:46:27 -0700
To: Gary Kline
Cc: Ian Smith, freebsd-questions@freebsd.org
Subject: Re: what's the easiest way to de-html-ize files?
Message-ID: <464A6273.8080705@u.washington.edu>
In-Reply-To: <20070514022655.GA1304@thought.org>

Gary Kline wrote:
> On Tue, May 15, 2007 at 03:34:14PM +1000, Ian Smith wrote:
>> On Sat, 12 May 2007 14:34:52 -0700 Gary Kline wrote:
>> > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
>> > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
>> > > > This is for those of us who appreciate ASCII or straight
>> > > > ISO_8859-15 rather than marked up files. I have slapped together
>> > > > a crude C program that does scotch (or *cleanse*) text of
>> > > > and so on. Still... is there some standalone converter
>> > > > that gets rid of markup more elegantly? Something where I
>> > > > can say
>> > > >
>> > > > % cmd file_1.html ... file_N.html and output file_1.text ...
>> > > > file_N.text?
>> > >
>> > > Perhaps:
>> > >
>> > > lynx -dump file1.html ... > file.text
>> > >
>> > > ...?
>> >
>> > Hm, maybe I need Bill Campbell's -force_html switch.
>> >
>> > Yes, seems that way. Using just -dump got most of them, but
>> > using the -force_html caught all. Need to script something to
>> > reformat, but the worst of it's done!
>>
>> Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
>> dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
>>
>> This does a pretty decent job of producing text from HTML files, and is
>> quicker than firing up lynx (or links) if you're already viewing a page.
>
>
> Oh sure; I've been saving html in text, ascii/8859-1 for years.
> But what I've got, and there are more saved **somewhere**, are
> files that are saved by default in markup. I have a slew of
> these on different boxen and have been moving them to one place.
> Problem is: how to de-html the bunch.
>
> I'm too lazy to write something that would automate what can be
> automated--markup like "&foo;" is problematic. So probably the
> easiest way would be to create a dehtml.sh script that is just a
> wrapper around lynx.
>
> I don't think I'm the only hacker who wants just-plain-ascii, so
> this might make a good project for somebody who's new to C or
> perl. That's my two pennies' worth!
>
> gary
>
>> Cheers, Ian
>>
>

If you don't want formatting and the number of tags is trivial, the
solution is fairly simple in Perl (less than 150 lines, if even that).

-Garrett
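
For illustration only, here is a rough sketch of the kind of small tag
stripper discussed above, written in Python rather than Perl or C. It is
not anyone's actual program from this thread. The names TextExtractor and
dehtml, the list of block-level tags, and the latin-1 file encoding are
all assumptions made for the example. Known entities such as "&amp;" are
decoded by the standard library parser; unknown ones like "&foo;" are
left as they are.

```python
import sys
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Hypothetical helper: keep text, drop tags and script/style bodies."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should force a line break in the output.
    BLOCK = {"p", "br", "div", "li", "tr", "pre", "blockquote",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes known entities (&amp;, &eacute;, ...)
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def dehtml(markup):
    """Return the plain text of an HTML string, one trailing newline."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Normalise whitespace inside each line, then collapse blank-line runs.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    out = []
    for line in lines:
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip() + "\n"


if __name__ == "__main__":
    # file_1.html ... file_N.html -> file_1.text ... file_N.text
    # latin-1 is an assumption, based on the ISO-8859 files mentioned above.
    for name in sys.argv[1:]:
        src = Path(name)
        text = dehtml(src.read_text(encoding="latin-1"))
        src.with_suffix(".text").write_text(text, encoding="latin-1")
```

Saved under a name of your choosing, it would be run with the HTML files
as arguments and would write a matching .text file next to each one. For
badly broken markup, lynx -dump with -force_html is likely still the more
robust route.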