From: Giorgos Keramidas
To: Gary Kline
Cc: FreeBSD Mailing List
Subject: Re: How to divide up?
Date: Sun, 20 Jul 2008 03:44:07 +0300
Message-ID: <878wvxfkq0.fsf@kobe.laptop>
In-Reply-To: <20080720002345.GA9173@thought.org>
On Sat, 19 Jul 2008 17:23:48 -0700, Gary Kline wrote:
> Guys,
> Is there an easy way of splitting up these tags into one-per-line?
>
> I'm not obcessive [[?, :)]], but for what I've got in mind, the tags
> and stuff would look better to my eyes? ....the outcome of this will
> go into a special database, not html .
>
> is there some clever perl one-liner ...

I don't know about 'easy', because this looks pretty much like 'free
form HTML'.  Parsing liberally formatted HTML code from untrusted
sources is a lot like trying to reinvent Firefox's HTML parsing engine
or something similar.  That's bound to be up there at the 'insanely
difficult' end of the scale, and not so much at the 'easy to hack with
sed and a bit of awk or some Perl' end.

If you have some sort of guarantee about the well-formedness of the HTML
source though (i.e. it passes some sort of validation suite), then you
can probably use tidy(1) to convert it to XML and then use xsltproc to
convert the XML source to pretty much anything imaginable.

Now, if you want to merely "hack something quick and dirty", a short
Perl script can probably do regexp substitution similar to

    #
    # WARNING: THIS HAS NOT BEEN TESTED :P
    #
    # Slurp all of standard input into one string.
    my $foo = do { local $/; <STDIN> };
    # Append a newline after every <tag>text</tag> group.
    $foo =~ s:(<[^>]+>[^<]*</[^>]+>):$1\n:g;
    print $foo;

but you shouldn't trust the output of such a quick hack too much.
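
If pulling in a real HTML tokenizer is an option, the same
one-item-per-line split can be sketched without regexps.  What follows
is only a minimal sketch, assuming Python's standard html.parser module
is acceptable in place of Perl; the names TagPerLine and split_tags are
made up for illustration.  It drops comments and doctype declarations,
decodes character references, and lowercases end tags, so treat it as a
starting point and not as a faithful pretty-printer.

```python
from html.parser import HTMLParser


class TagPerLine(HTMLParser):
    """Hypothetical helper: collect each tag and each text run separately."""

    def __init__(self):
        super().__init__()  # default settings decode character references
        self.items = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the input.
        self.items.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as a single item.
        self.items.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser hands us the lowercased tag name only.
        self.items.append("</%s>" % tag)

    def handle_data(self, data):
        # Collapse runs of whitespace and skip whitespace-only text.
        text = " ".join(data.split())
        if text:
            self.items.append(text)


def split_tags(html):
    """Return the input with one tag or text run per line."""
    parser = TagPerLine()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return "\n".join(parser.items)
```

Something like print(split_tags(sys.stdin.read())), after an import of
sys, would turn it into a filter.  The tokenizer copes with attributes
that contain '>' inside quotes, which the regexp above does not.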