Date:      Thu, 12 Dec 2013 16:24:40 +0000
From:      "Teske, Devin" <Devin.Teske@fisglobal.com>
To:        "Daniel O'Connor" <doconnor@gsoft.com.au>
Cc:        Kevin Oberman <rkoberman@gmail.com>, Devin Teske <dteske@freebsd.org>, "freebsd-stable@freebsd.org Stable" <freebsd-stable@freebsd.org>, "Teske, Devin" <Devin.Teske@fisglobal.com>, Darren Pilgrim <list_freebsd@bluerosetech.com>
Subject:   Re: BIND segway -> python -> first-class ports
Message-ID:  <85EE26D8-0AB4-41B0-85AE-5439160EC602@fisglobal.com>
In-Reply-To: <6052F96E-0CD3-4C56-A619-8337C4ED890C@gsoft.com.au>
References:  <20131210023615.GR55638@funkthat.com> <52A68141.6010003@mu.org> <622122.74675.bm@smtp120.sbc.mail.gq1.yahoo.com> <20131210224915.GA55638@funkthat.com> <CAN6yY1tSqbrkt5bkjhDW6npT4PAXmMck0Xco%2BERwBE=wkkBDBQ@mail.gmail.com> <52A82099.9080100@bluerosetech.com> <B62F85D0-89E6-4FF8-ADE4-5025FB360462@gsoft.com.au> <D0F85D74-E727-4487-AEA1-B9C16660192E@fisglobal.com> <0EC3A50D-A6BE-4F3B-87D6-AB0470F0BA64@gsoft.com.au> <4174A92E-F202-4FFB-BFED-C38A9D0A7F91@fisglobal.com> <0D92E13A-F869-492C-852B-37A0BFB1674C@gsoft.com.au> <E4058C5F-9360-4A1D-BFB6-4658FC8D5945@fisglobal.com> <38856510-A2D9-41E6-8CDC-ED282BDA933A@gsoft.com.au> <5A92C643-0BA6-4D15-AB54-DB78BE00583A@fisglobal.com> <6052F96E-0CD3-4C56-A619-8337C4ED890C@gsoft.com.au>

On Dec 11, 2013, at 11:07 PM, Daniel O'Connor wrote:

>
> On 12 Dec 2013, at 17:32, Teske, Devin <Devin.Teske@fisglobal.com> wrote:
>> On Dec 11, 2013, at 9:46 PM, Daniel O'Connor wrote:
>>> On 12 Dec 2013, at 12:24, Teske, Devin <Devin.Teske@fisglobal.com> wrote:
>>>>> Thanks, if only I'd know about this 6 months ago :)
>>>>
>>>> I just wrote it from scratch, so didn't exist until today ;D
>>>
>>> Hah nice, although I imagine there is plenty of legal XML it can't parse.
>>>
>>> That plays to another point about this sort of work - it's very hard to write shell script that will work properly in all cases (things like spaces, or even newlines and unprintable characters in filenames).
>>>
>>
>> If I had spent more time on it, then it would be able to parse any
>> XML. However, it wasn't worth going further without first having
>> a look at the C code that produces the output.
>>
>> For example, different XML encoding libraries may encode the
>> property values more or less strictly. For example, are values
>> properly encoded to prevent a value of "</name>" from
>> prematurely terminating the property and borking the XML
>> validation? My guess would be that it would be encoded fully as
>> "<name>&lt;/name&gt;</name>".
>>
>> Just a matter of extending the extract_data() and extract_attr()
>> functions and then generalizing a little more.
>
> I think looking at what produces it is 'cheating' and can end up biting you in the ass later on.
>
> Basically my point is that there needs to be _some_ interchange format where you can reliably parse output from tools generating it (which by and by might be written by different people with different assumptions etc). So a core extremely robust parser is necessary.
>
> Perhaps there could be a base tool which can take such output and convert it to a set of struct commands. That is really my second choice, but I think that it is politically infeasible to modify our /bin/sh to parse XML (or any other useful interchange format).

Two 'nits'...

I remember having these same types of discussions decades ago. They
seem to repeat themselves every 6-12 years.

I seem to recall that every time the topic of format parsing and data
management comes up, there's a split between two types of people.

A. The folks that want "purpose built" parsers that compartmentalize the logic

and

B. The folks that want "general built" parsers that potentially have to be
tuned for the data that you're parsing.

In my experience in building, developing, and *using* both...

Nit 1. The general-purpose tool forces you to use the data structure that it
uses for access, and it can still fail on edge-cases if you don't "cheat", as
you put it, and look at the code generating the output that you will feed to
the generalized parser.

NB: Notice how you don't get away from the fact that you really *ought* to
be looking at the code that generates the output (always) to make sure you
don't have a gaping edge-case.
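
The "</name>" case from earlier in the thread makes this concrete. Below is a
minimal sketch in Python (chosen only for illustration; the tools discussed
here are sh and C). emit_name() is a hypothetical producer, not the code that
generates the real output, and parse_name() stands in for any generalized XML
parser. The point is only that the same generalized parser accepts or rejects
the document depending on what the producer did.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def emit_name(value, escape_value=True):
    """Hypothetical producer: wrap a property value in <name>...</name>."""
    body = escape(value) if escape_value else value
    return "<name>" + body + "</name>"


def parse_name(document):
    """Generalized parser: return the text of the root element,
    or None when the document is not well-formed XML."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


strict = emit_name("</name>")                      # producer escapes the value
sloppy = emit_name("</name>", escape_value=False)  # producer does not

print(strict, "->", parse_name(strict))
print(sloppy, "->", parse_name(sloppy))
```

Whether the real producer behaves like the strict or the sloppy variant is
something only its source can tell you.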

Nit 2. The purpose-built parser can often lend simplicity to a situation.
That is to say, if you can get by with a simple parser, more often than not
this approach is desirable because you localize the logic to the point where
changes will occur less often. Conversely, we find that changes to a
generalized library may unintentionally break the parsing at multiple
code-points when all you wanted was to add some basic "thing".
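
As a rough illustration of how small a purpose-built parser can be, here is a
sketch in Python. It assumes a format in which each value sits in a
<name>...</name> element and is entity-encoded by the producer; that
assumption, the regular expression and the extract_names() helper are all
hypothetical and are not the extract_data()/extract_attr() functions mentioned
above.

```python
import re
from xml.sax.saxutils import unescape

# Assumed format: the value between <name> and </name> never contains a raw
# "<", because the producer entity-encodes it.
_NAME_RE = re.compile(r"<name>([^<]*)</name>")


def extract_names(text):
    """Return the decoded value of every <name> element found in text."""
    return [unescape(raw) for raw in _NAME_RE.findall(text)]


sample = (
    "<prop><name>&lt;/name&gt;</name></prop>\n"
    "<prop><name>a &amp; b</name></prop>\n"
)
print(extract_names(sample))
```

All of the logic lives in one pattern and one function, so the only reason to
touch it is a change in the format itself.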

Ultimately, the benefit of not over-complicating every "parse-job" is that...

+ With a localized logic, you won't have to worry about the end-to-end
regression testing that is required for such a beast.

That's why all the great generalized parsers have their own test-harnesses
and a giant pool of sample data to make sure that each change is rigorously
tested against each/every known format.
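
A toy version of such a harness, sketched in Python under the assumption that
the sample pool is a list of (document, expected result) pairs. The SAMPLES
list, parse_name() and run_harness() are hypothetical names made up for this
sketch; a real pool would be far larger.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pool: (input document, expected text, or None when the
# document must be rejected as not well-formed).
SAMPLES = [
    ("<name>plain</name>", "plain"),
    ("<name>&lt;/name&gt;</name>", "</name>"),
    ("<name>a &amp; b</name>", "a & b"),
    ("<name></name></name>", None),
]


def parse_name(document):
    """Parser under test: text of the root element, None on a parse error."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


def run_harness(parser, samples):
    """Return (document, expected, actual) for every sample the parser fails."""
    failures = []
    for document, expected in samples:
        actual = parser(document)
        if actual != expected:
            failures.append((document, expected, actual))
    return failures


print(run_harness(parse_name, SAMPLES))
```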

That's great, but a purpose-built parser can last 15-20 years without a change
(be it written in C, C++, Obj-C, Assembly, whatever) because the only time it
will change is when the format it parses changes.

So what we gain by (a) giving up the use of a generalized parser to "Parse
The World"(tm) and (b) using a localized purpose-built parser for
individualized parse-jobs...

+ Longevity of code
+ Equal or lesser cost of maintenance
+ A little team-work

Just my 2-cents. Been doing the whole "Parse The World"(tm) thing for a while
and it's given me some perspective.
-- 
Devin
