From owner-freebsd-hackers Fri Jan 9 14:43:35 1998
Return-Path:
Received: (from majordom@localhost)
	by hub.freebsd.org (8.8.7/8.8.7) id OAA20940
	for hackers-outgoing; Fri, 9 Jan 1998 14:43:35 -0800 (PST)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id OAA20920
	for ; Fri, 9 Jan 1998 14:43:19 -0800 (PST)
	(envelope-from tlambert@usr04.primenet.com)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.8.8/8.8.8) id PAA12761;
	Fri, 9 Jan 1998 15:43:08 -0700 (MST)
Received: from usr04.primenet.com(206.165.6.204) via SMTP
	by smtp03.primenet.com, id smtpd012737; Fri Jan 9 15:43:04 1998
Received: (from tlambert@localhost)
	by usr04.primenet.com (8.8.5/8.8.5) id PAA00305;
	Fri, 9 Jan 1998 15:42:55 -0700 (MST)
From: Terry Lambert
Message-Id: <199801092242.PAA00305@usr04.primenet.com>
Subject: Re: FreeBSD Netcards
To: jamie@itribe.net (Jamie Bowden)
Date: Fri, 9 Jan 1998 22:42:55 +0000 (GMT)
Cc: jdevale@ece.cmu.edu, hackers@FreeBSD.ORG
In-Reply-To: <199801091427.JAA07552@gatekeeper.itribe.net> from "Jamie Bowden" at Jan 9, 98 09:29:15 am
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk

> > Thus when her/his program aborts catastrophically, they debug it, and
> > feel stupid that they did something like pass NULL to atoi (which
> > actually causes an abort in FreeBSD).  The Linux users, on the other
> > hand, can't figure it out, or bitch and whine, and as a result, the
> > system call is fixed to return an error code instead of an abort.
> > Since you have to know about these bugs to fix them, it seems likely
> > that FreeBSD users/programmers just do fewer stupid things.

The abort you are seeing is a NULL pointer dereference that tries to
dereference the contents of page 0 in the process address space.  In
FreeBSD, page zero is unmapped.

I will demonstrate why this is actually a Good Thing(tm).

In SVR4, there is a tunable in the kernel configuration that allows you
to modify this behaviour.  There are two behaviours available:

1)	Map a zero filled page at page zero.  This causes the NULL
	pointer to be treated, incorrectly, as a NULL valued string
	instead of as a NULL pointer.  A process which is still running
	can be examined via /proc for its address space mappings, to
	see whether it has triggered the page 0 mapping.

	This is the default behaviour.  It is the default because of
	the large volume of badly written code which depended on a
	NULL dereference being treated as a NULL valued string, and
	because, historically, page zero was mapped and contained a
	magic number.  Though the magic number was not a NULL valued
	string, you were unlikely to get a match between whatever lay
	between offset zero and the first occurrence of a NULL byte
	and whatever string you were comparing against with strcmp(),
	etc.

	The "atoi problem" existed because the magic number did not
	contain a digit before the terminating 0, and thus atoi(NULL)
	returned zero.

2)	Fault on a NULL pointer dereference, exactly as FreeBSD faults,
	instead of mapping a zero filled page at page zero.

	This is the non-default behaviour.

Technically, it is incorrect to map a zero filled page at page zero,
since it masks programming errors.
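As a concrete illustration, here is a minimal, hypothetical test
program (not from either system; strictly speaking, passing NULL to
atoi() is undefined behaviour, so this only shows what typically
happens):

/* nulltest.c -- hypothetical example.
 *
 * On FreeBSD, page zero is unmapped, so the dereference inside atoi()
 * delivers SIGSEGV and the process aborts, pointing straight at the
 * bug.  On an SVR4 box with the default tunable (a zero filled page
 * mapped at page zero), atoi() quietly reads an empty string and
 * returns 0, masking the programming error.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char *p = NULL;		/* e.g. a global pointer nobody initialized */

	printf("atoi(NULL) = %d\n", atoi(p));
	return 0;
}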
Specifically, it is not possible to trap a NULL pointer dereference
(ie: a dereference of a global pointer which is not initialized before
use, and about which the compiler will not complain, since global and
locally scoped static pointers are considered to be aggregate
initialized to zero -- the compiler can only trap auto [stack]
variables which are not initialized before use, and then only when the
use occurs within the peephole optimization window from the start of
the scope in which the use occurs).

An alternate method of trapping NULL pointers traps fewer wild
pointers, and still results in an abort for non-NULL valued (wild)
pointers: this method requires that you compare function arguments to
determine whether they are NULL valued.  You can then either return an
error, or you can replace the argument with a pointer to a static NULL
valued string, in order to ensure compatibility with historical
(incorrect) code.

The first approach is wrong.  It reports only a tiny subset of error
conditions in a given class of error conditions, and still faults on
the rest.

The second approach is also wrong.  It masks badly written code, and
prevents the detection and correction of the errors.

Both the first and the second approach add a compare and a branch, and
make good code slower, while neither fixing nor flagging the errors in
the bad code.  Why not optimize for the non-error case, and encourage
correcting historically bad code?  Why protect bad engineers from
harsh notification of the fact that they are bad engineers?

It is interesting to note that on systems which map page zero, it is
almost impossible to implement "Purify" type tools.  Have you seen a
"Purify" type tool for an MS OS?  No?  Well, you probably won't.

> > So, now that I have explained what the benchmarks are, you may be
> > saying to yourself, that sounds really stupid.  You wouldn't be the
> > first.  While it would be nice if the OS people fixed all these
> > things, keep in mind that the real target is third party libraries
> > to be used in mission-critical systems that are supposed to be
> > fault tolerant.  Meaning they degrade gracefully, rather than crash
> > the process.

You are mixing metaphors here.  The definition of "fault tolerance" is
meant to include "tolerance of hardware faults", not "tolerance of bad
programming practice".  Passing invalid values to library routines is
bad programming practice.

A better "benchmark" for "system fault tolerance", the definition of
which is meant to include "the isolation of well behaved programs from
the effects of badly behaved programs", would be whether or not FreeBSD
is robust in the face of bad system call arguments.

There exists a program to test this, called "crashme".  It randomly
generates code, then attempts to execute it, in order to identify areas
where system calls or other memory protection failures would allow one
process to damage the execution of another process (a stripped-down
sketch of the idea appears below).  It generally demonstrates such a
failure by crashing the machine; assuming the OS correctly enforces
protection domains, and that only bad programming practice would put
two processes in the same protection domain (ie: run by the same UID in
the same branch of the common filesystem), the only other "effect" that
is possible is denial of service.

> > Operating systems just provided us with a rich set of different
> > objects with a common interface to test on.

Consisting of system calls and libraries, yes.  But that does not mean
that we should be "tolerant" enough to print out "Hello World!" from a
program that only does a ``puts("foo!\n");'' merely because the source
file is named ``hello.c''.  8-).
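For reference, a stripped-down, hypothetical sketch of the crashme idea
mentioned above (the real tool is considerably more elaborate); run it
only as an unprivileged user, since the whole point is that any damage
must stay confined to the offending process:

/* crashme-lite.c -- hypothetical illustration only, not the real tool.
 *
 * Fill a buffer with random bytes and try to execute it in a child
 * process.  On a correctly protected system the child almost always
 * dies with SIGILL, SIGSEGV, or SIGBUS, and nothing else is harmed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CODELEN 64

int
main(void)
{
	unsigned char junk[CODELEN];
	void *page;
	pid_t pid;
	int i, status;

	srandom((unsigned)(getpid() ^ time(NULL)));
	for (i = 0; i < CODELEN; i++)
		junk[i] = random() & 0xff;

	/* An executable mapping is needed to "run" the garbage from. */
	page = mmap(NULL, CODELEN, PROT_READ | PROT_WRITE | PROT_EXEC,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (page == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memcpy(page, junk, CODELEN);

	pid = fork();
	if (pid == -1) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		/* Child: jump into the random bytes.  This must never
		 * be able to harm any other process or the kernel. */
		((void (*)(void))page)();
		_exit(0);		/* unlikely to get here */
	}
	waitpid(pid, &status, 0);
	if (WIFSIGNALED(status))
		printf("child killed by signal %d; rest of system fine\n",
		    WTERMSIG(status));
	else
		printf("child exited with status %d\n", WEXITSTATUS(status));
	return 0;
}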
> > Although some people here in the fault tolerant computing group
> > think that they should fix every possible bug, this is ludicrous.
> > With the exception of the OS's we tested that claimed to be fault
> > tolerant real time operating systems, there is no payoff for the
> > vendor to jump through hoops to get all of these robustness
> > problems fixed.

Incorrect.  Fixing the problem prevents denial of service attacks on
multiuser systems in corporate, ISP, and education environments, etc.
This is why the Pentium "f00f bug" was such a big deal.

> > It might do some good for them to fix the easy ones though.  For
> > instance, many many problems are caused by nulls getting passed in
> > to system calls.

Not on working systems, there aren't.  The only problem possible on a
working system (Windows 95 is *not* "a working system", in the sense
that it does not enforce full memory protection by domain, nor does it
engage in full resource tracking of process resources) is damage to
the incorrect program's own ability to produce a correct result.

This may seem obvious to the rest of us, but... incorrect programs can
not logically be expected to produce correct results.  To argue
otherwise is to argue that "even a broken [analog] watch is correct
twice a day".  It's a true statement, but the error is +/- 6 hours, so
even though true, it's not empirically useful.  8-).

> > Sure, the programmer should test this out, but they still creep in,
> > especially in complex programs.  If you could fix a large amount of
> > these problems just by adding in null checks to the system calls,
> > it would be pretty easy, and inexpensive in terms of cpu overhead.

That's what a SIGSEGV *is*: a NULL check for system calls and library
calls -- and even for calls in the user's own program -- which *don't*
have explicit NULL checks.

BTW: it's possible to automatically generate test suites for code
completeness and correctness testing.  The process is called "Branch
Path Analysis".  One tool which can do this is called "BattleMap"; you
can obtain it for ~$20,000 a license for Solaris/SunOS systems.

Alternately, several years ago in comp.sources, someone posted a C++
branch path analyzer for producing code coverage test cases for C
code.  You could easily use this tool, in conjunction with TET or ETET
(also freely available off the net), to produce a validation suite for
your application.

Frankly, this speaks to the distinction I draw between "Programmers"
and "Software Engineers".  A Programmer can take someone else's
problem solution and translate it into machine instructions.  A
Software Engineer, on the other hand, can solve problems using a
computer.

There will always be differences in craftsmanship as long as we
continue to use tools.  That's the nature of the connection between
humans and tools.  There are good craftsmen, and there are bad.

I think that your complaint is with the craftsman, and not with his
tools.  If you can find flaws in the tools, then by all means take it
up with the toolmaker.  But it is not the responsibility of a
toolmaker to label hammers "do not use on screws"; that's the
responsibility of the people who train the next generation of
craftsmen.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.