From owner-freebsd-questions@FreeBSD.ORG  Sun Jul 20 00:44:22 2008
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2293D1065674
	for <freebsd-questions@freebsd.org>;
	Sun, 20 Jul 2008 00:44:22 +0000 (UTC)
	(envelope-from keramida@ceid.upatras.gr)
Received: from igloo.linux.gr (igloo.linux.gr [62.1.205.36])
	by mx1.freebsd.org (Postfix) with ESMTP id B1FFF8FC18
	for <freebsd-questions@freebsd.org>;
	Sun, 20 Jul 2008 00:44:21 +0000 (UTC)
	(envelope-from keramida@ceid.upatras.gr)
Received: from kobe.laptop (adsl133-207.kln.forthnet.gr [77.49.252.207])
	(authenticated bits=128)
	by igloo.linux.gr (8.14.3/8.14.3/Debian-4) with ESMTP id m6K0i8oD000962
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT);
	Sun, 20 Jul 2008 03:44:14 +0300
Received: from kobe.laptop (kobe.laptop [127.0.0.1])
	by kobe.laptop (8.14.2/8.14.2) with ESMTP id m6K0i80L003624;
	Sun, 20 Jul 2008 03:44:08 +0300 (EEST)
	(envelope-from keramida@ceid.upatras.gr)
Received: (from keramida@localhost)
	by kobe.laptop (8.14.2/8.14.2/Submit) id m6K0i7Fu003623;
	Sun, 20 Jul 2008 03:44:07 +0300 (EEST)
	(envelope-from keramida@ceid.upatras.gr)
From: Giorgos Keramidas <keramida@ceid.upatras.gr>
To: Gary Kline <kline@thought.org>
References: <20080720002345.GA9173@thought.org>
Date: Sun, 20 Jul 2008 03:44:07 +0300
In-Reply-To: <20080720002345.GA9173@thought.org> (Gary Kline's message of
	"Sat, 19 Jul 2008 17:23:48 -0700")
Message-ID: <878wvxfkq0.fsf@kobe.laptop>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-MailScanner-ID: m6K0i8oD000962
X-Hellug-MailScanner: Found to be clean
X-Hellug-MailScanner-SpamCheck: not spam, SpamAssassin (not cached,
	score=-3.787, required 5, autolearn=not spam, ALL_TRUSTED -1.80,
	AWL 0.61, BAYES_00 -2.60)
X-Hellug-MailScanner-From: keramida@ceid.upatras.gr
X-Spam-Status: No
Cc: FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: How to divide up?
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 20 Jul 2008 00:44:22 -0000

On Sat, 19 Jul 2008 17:23:48 -0700, Gary Kline <kline@thought.org> wrote:
> Guys,
> Is there an easy way of splitting up these tags into one-per-line?
>
> I'm not obsessive [[?, :)]], but for what I've got in mind, the tags
> and stuff would look better to my eyes?  ....the outcome of this will
> go into a special database, not html .
>
> is there some clever perl one-liner ...

I don't know about 'easy', because this looks pretty much like 'free
form HTML'.  Parsing liberally formatted HTML from untrusted sources is
a lot like trying to reinvent Firefox's HTML parsing engine.  That puts
it at the 'insanely difficult' end of the scale, not the 'easy to hack
with sed and a bit of awk or some Perl' end.

If you have some sort of guarantee about the well-formedness of the HTML
source though (i.e. it passes some sort of validation suite), then you
can probably use tidy(1) to convert it to XML and then use xsltproc to
convert the XML source to pretty much anything imaginable.

Now, if you want to merely "hack something quick and dirty", a short
Perl script can probably do regexp substitution similar to

        #
        # WARNING: THIS HAS NOT BEEN TESTED :P
        #
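        # Slurp all of stdin, not just the first line.
        my $foo = do { local $/; <STDIN> };
        # Bind with =~ (not =), and no /e: the replacement is a plain string.
        $foo =~ s:(<[^>]+>[^<]*</[^>]+>):$1\n:g;
        print $foo;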

but you shouldn't trust the output of such a quick hack too much.