Date:      Wed, 23 Feb 2005 10:43:16 +0100
From:      Simon Barner <barner@gmx.de>
To:        Mike Hauber <m.hauber@mchsi.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: filtering HTML tags from email
Message-ID:  <20050223094316.GA70078@zi025.glhnet.mhn.de>
In-Reply-To: <200502230218.37665.m.hauber@mchsi.com>
References:  <200502222316.32866.m.hauber@mchsi.com> <20050223055018.GA82969@keyslapper.net> <200502230218.37665.m.hauber@mchsi.com>



Mike Hauber wrote:
> > Mutt saves to a temp file then calls the following command:
> > lynx -localhost -dump %s
> > where '%s' is the temporary file you saved it to.
> >
> > You could also just pipe it to the following:
> > lynx -localhost -dump -stdin
> >
> > the -localhost argument prevents lynx from simply following
> > links external to your machine - helpful to avoid generating
> > hits for unscrupulous spammers that get paid for hits on a URL.
> >
> > Just make sure lynx is installed.
> >
> > Lou
>
> Okay, so to be sure, there is no filter (as of yet) to simply open
> an email file, strip the HTML tags, and resave it?  I'm not
> complaining, as this may actually be something I'm capable of
> creating myself.  (I'll make this my first python project. :) )
>
> I'm just making sure I'm not missing anything obvious before I
> start working on it.  It's irritating to spend time on something=20
> only to find out that it's already been done.

You could probably also do it with procmail + lynx (or w3m) during the
delivery process.

Another possibility is to put the following entries in your ~/.mailcap
file; they convert HTML, Word (.doc) and RTF to plain text.

text/html; w3m -dump -T text/html; copiousoutput;
application/msword; antiword %s; copiousoutput
application/rtf; rtfreader %s; copiousoutput

As for your Python script: I don't think that simply stripping everything
matching the following expressions is correct, because such text may also
appear in non-HTML emails: <.*> <\/.*> (Perl syntax).
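
To illustrate the concern, here is a minimal Python sketch that applies the
two patterns above exactly as written. The function name and the sample
strings (including the example.org address) are made up for illustration.

```python
import re

# The two patterns from the message, combined and used as written (greedy).
NAIVE = re.compile(r"<.*>|<\/.*>")

def strip_naive(text):
    """Delete everything the naive tag patterns match."""
    return NAIVE.sub("", text)

# Real HTML: the greedy match also swallows the text between the tags.
print(repr(strip_naive("<b>Hello</b> world")))

# Plain text that merely contains angle brackets is mangled as well.
print(repr(strip_naive("Mail me at <simon@example.org> if x < 3 and y > 4")))
```

In both cases more than the markup disappears: the first call leaves only
" world", and the second loses the address and the comparison.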

At the very least, you'd need a list of valid HTML tags, i.e. a regular
grammar for HTML: <b> | </b> | <i> | </i> | ... (BNF notation).
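
A rough Python sketch of that idea, assuming a deliberately tiny tag list
(KNOWN_TAGS and strip_known_tags are hypothetical names; a usable version
would have to list every HTML tag):

```python
import re

# Hypothetical, incomplete whitelist of tag names, for illustration only.
KNOWN_TAGS = ("b", "i", "p", "br", "a", "html", "body")

# Matches <tag ...> and </tag> for whitelisted names only, case-insensitively.
TAG_RE = re.compile(
    r"</?(?:%s)\b[^<>]*>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)

def strip_known_tags(text):
    """Remove only those angle-bracket constructs that name a known tag."""
    return TAG_RE.sub("", text)

print(strip_known_tags("<b>bold</b> and <i>italic</i>"))
print(strip_known_tags("write to <simon@example.org>"))
```

This still has false positives (an address such as <a@example.org> starts
like an <a> tag), so it is a sketch of the approach and not a robust filter.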

While this is not too hard to implement (and possibly a good project for
learning a new programming language), it would be too much work for
something that can be achieved more easily with existing tools (for me
personally, that is ;-).
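
For a Python project, one existing tool is the standard library's
html.parser module, which avoids hand-written tag patterns. The following is
only a sketch under that assumption; TextOnly and html_to_text are
hypothetical names, and it does not handle MIME parts, character sets, or
the contents of <script> and <style> elements.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        # Called for runs of text; entities such as &amp; are already decoded.
        self.chunks.append(data)

def html_to_text(html):
    """Return the text content of an HTML string with all tags dropped."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

print(html_to_text("<p>Hello <b>world</b> &amp; all</p>"))
```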

Simon



