From owner-freebsd-hackers Fri Jan 9 14:43:35 1998
Return-Path:
Received: (from majordom@localhost)
	by hub.freebsd.org (8.8.7/8.8.7) id OAA20940
	for hackers-outgoing; Fri, 9 Jan 1998 14:43:35 -0800 (PST)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id OAA20920
	for ; Fri, 9 Jan 1998 14:43:19 -0800 (PST)
	(envelope-from tlambert@usr04.primenet.com)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.8.8/8.8.8) id PAA12761;
	Fri, 9 Jan 1998 15:43:08 -0700 (MST)
Received: from usr04.primenet.com(206.165.6.204) via SMTP
	by smtp03.primenet.com, id smtpd012737; Fri Jan 9 15:43:04 1998
Received: (from tlambert@localhost)
	by usr04.primenet.com (8.8.5/8.8.5) id PAA00305;
	Fri, 9 Jan 1998 15:42:55 -0700 (MST)
From: Terry Lambert
Message-Id: <199801092242.PAA00305@usr04.primenet.com>
Subject: Re: FreeBSD Netcards
To: jamie@itribe.net (Jamie Bowden)
Date: Fri, 9 Jan 1998 22:42:55 +0000 (GMT)
Cc: jdevale@ece.cmu.edu, hackers@FreeBSD.ORG
In-Reply-To: <199801091427.JAA07552@gatekeeper.itribe.net> from "Jamie Bowden" at Jan 9, 98 09:29:15 am
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk

> > Thus when her/his program aborts catastrophically, they debug it, and
> > feel stupid that they did something like pass NULL to atoi (which
> > actually causes an abort in FreeBSD).  The Linux users, on the other
> > hand, can't figure it out, or bitch and whine, and as a result, the
> > system call is fixed to return an error code instead of an abort.
> > Since you have to know about these bugs to fix them, it seems likely
> > that FreeBSD users/programmers just do fewer stupid things.

The abort you are seeing is a NULL pointer dereference that tries to
dereference the contents of page 0 in the process address space.  In
FreeBSD, page zero is unmapped.

I will demonstrate why this is actually a Good Thing(tm).

In SVR4, there is a tunable in the kernel configuration that allows you
to modify this behaviour.  There are two behaviours available:

1)	Map a zero filled page at page zero.  This causes the NULL
	pointer to be treated, incorrectly, as a NULL valued string
	instead of as a NULL pointer.  A process which is still running
	can be examined via /proc for its address space mappings, to
	see whether it has triggered the page 0 mapping.

	This is the default behaviour.  It is the default because of
	the large volume of badly written code which depended on a
	NULL dereference being treated as a NULL valued string, and
	because, historically, page zero was mapped and contained a
	magic number.  Though the magic number was not a NULL valued
	string, you were unlikely to get a match between whatever lay
	between offset zero and the first occurrence of a NULL byte
	and whatever string you were comparing against with strcmp(),
	etc.

	The "atoi problem" existed because the magic number did not
	contain a digit before the terminating 0, and thus atoi(NULL)
	returned zero.

2)	Fault on a NULL pointer dereference, exactly as FreeBSD faults,
	instead of mapping a zero filled page at page zero.

	This is the non-default behaviour.

Technically, it is incorrect to map a zero filled page at page zero,
since it masks programming errors.
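As a concrete illustration, here is a minimal, hypothetical test
program (not from either system; strictly speaking, passing NULL to
atoi() is undefined behaviour, so this only shows what typically
happens):

/* nulltest.c -- hypothetical example.
 *
 * On FreeBSD, page zero is unmapped, so the dereference inside atoi()
 * delivers SIGSEGV and the process aborts, pointing straight at the
 * bug.  On an SVR4 box with the default tunable (a zero filled page
 * mapped at page zero), atoi() quietly reads an empty string and
 * returns 0, masking the programming error.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char *p = NULL;		/* e.g. a global pointer nobody initialized */

	printf("atoi(NULL) = %d\n", atoi(p));
	return 0;
}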
Specifically, it is not possible to trap a NULL pointer dereference
(ie: a dereference of a global pointer which is not initialized before
use, and about which the compiler will not complain, since global and
locally scoped static pointers are considered to be aggregate
initialized to zero -- the compiler can only trap auto [stack]
variables which are not initialized before use, and then only when the
use occurs within the peephole optimization window from the start of
the scope in which the use occurs).

An alternate method of trapping NULL pointers traps fewer wild
pointers, and still results in an abort for non-NULL valued (wild)
pointers: this method requires that you compare function arguments to
determine whether they are NULL valued.  You can then either return an
error, or you can replace the argument with a pointer to a static NULL
valued string, in order to ensure compatibility with historical
(incorrect) code.

The first approach is wrong.  It reports only a tiny subset of error
conditions in a given class of error conditions, and still faults on
the rest.

The second approach is also wrong.  It masks badly written code, and
prevents the detection and correction of the errors.

Both the first and the second approach add a compare and a branch, and
make good code slower, while neither fixing nor flagging the errors in
the bad code.  Why not optimize for the non-error case, and encourage
correcting historically bad code?  Why protect bad engineers from
harsh notification of the fact that they are bad engineers?

It is interesting to note that on systems which map page zero, it is
almost impossible to implement "Purify" type tools.  Have you seen a
"Purify" type tool for an MS OS?  No?  Well, you probably won't.

> > So, now that I have explained what the benchmarks are, you may be
> > saying to yourself, that sounds really stupid.  You wouldn't be the
> > first.  While it would be nice if the OS people fixed all these
> > things, keep in mind that the real target is third party libraries
> > to be used in mission-critical systems that are supposed to be
> > fault tolerant.  Meaning they degrade gracefully, rather than crash
> > the process.

You are mixing metaphors here.  The definition of "fault tolerance" is
meant to include "tolerance of hardware faults", not "tolerance of bad
programming practice".  Passing invalid values to library routines is
bad programming practice.

A better "benchmark" for "system fault tolerance", the definition of
which is meant to include "the isolation of well behaved programs from
the effects of badly behaved programs", would be whether or not FreeBSD
is robust in the face of bad system call arguments.

There exists a program to test this, called "crashme".  It randomly
generates code, then attempts to execute it, in order to identify areas
where system calls or other memory protection failures would allow one
process to damage the execution of another process (a stripped-down
sketch of the idea appears below).  It generally demonstrates such a
failure by crashing the machine; assuming the OS correctly enforces
protection domains, and that only bad programming practice would put
two processes in the same protection domain (ie: run by the same UID in
the same branch of the common filesystem), the only other "effect" that
is possible is denial of service.

> > Operating systems just provided us with a rich set of different
> > objects with a common interface to test on.

Consisting of system calls and libraries, yes.  But that does not mean
that we should be "tolerant" enough to print out "Hello World!" from a
program that only does a ``puts("foo!\n");'' merely because the source
file is named ``hello.c''.  8-).
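For reference, a stripped-down, hypothetical sketch of the crashme idea
mentioned above (the real tool is considerably more elaborate); run it
only as an unprivileged user, since the whole point is that any damage
must stay confined to the offending process:

/* crashme-lite.c -- hypothetical illustration only, not the real tool.
 *
 * Fill a buffer with random bytes and try to execute it in a child
 * process.  On a correctly protected system the child almost always
 * dies with SIGILL, SIGSEGV, or SIGBUS, and nothing else is harmed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CODELEN 64

int
main(void)
{
	unsigned char junk[CODELEN];
	void *page;
	pid_t pid;
	int i, status;

	srandom((unsigned)(getpid() ^ time(NULL)));
	for (i = 0; i < CODELEN; i++)
		junk[i] = random() & 0xff;

	/* An executable mapping is needed to "run" the garbage from. */
	page = mmap(NULL, CODELEN, PROT_READ | PROT_WRITE | PROT_EXEC,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (page == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memcpy(page, junk, CODELEN);

	pid = fork();
	if (pid == -1) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		/* Child: jump into the random bytes.  This must never
		 * be able to harm any other process or the kernel. */
		((void (*)(void))page)();
		_exit(0);		/* unlikely to get here */
	}
	waitpid(pid, &status, 0);
	if (WIFSIGNALED(status))
		printf("child killed by signal %d; rest of system fine\n",
		    WTERMSIG(status));
	else
		printf("child exited with status %d\n", WEXITSTATUS(status));
	return 0;
}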
> > Although some people here in the fault tolerant computing group
> > think that they should fix every possible bug, this is ludicrous.
> > With the exception of the OS's we tested that claimed to be fault
> > tolerant real time operating systems, there is no payoff for the
> > vendor to jump through hoops to get all of these robustness
> > problems fixed.

Incorrect.  Fixing the problem prevents denial of service attacks on
multiuser systems in corporate, ISP, and education environments, etc.
This is why the Pentium "f00f bug" was such a big deal.

> > It might do some good for them to fix the easy ones though.  For
> > instance, many many problems are caused by nulls getting passed in
> > to system calls.

Not on working systems, there aren't.  The only problem possible on a
working system (Windows 95 is *not* "a working system", in the sense
that it does not enforce full memory protection by domain, nor does it
engage in full resource tracking of process resources) is damage to
the incorrect program's own ability to produce a correct result.

This may seem obvious to the rest of us, but... incorrect programs can
not logically be expected to produce correct results.  To argue
otherwise is to argue that "even a broken [analog] watch is correct
twice a day".  It's a true statement, but the error is +/- 6 hours, so
even though true, it's not empirically useful.  8-).

> > Sure, the programmer should test this out, but they still creep in,
> > especially in complex programs.  If you could fix a large amount of
> > these problems just by adding in null checks to the system calls,
> > it would be pretty easy, and inexpensive in terms of cpu overhead.

That's what a SIGSEGV *is*: a NULL check for system calls and library
calls -- and even for calls in the user's own program -- which *don't*
have explicit NULL checks.

BTW: it's possible to automatically generate test suites for code
completeness and correctness testing.  The process is called "Branch
Path Analysis".  One tool which can do this is called "BattleMap"; you
can obtain it for ~$20,000 a license for Solaris/SunOS systems.

Alternately, several years ago in comp.sources, someone posted a C++
branch path analyzer for producing code coverage test cases for C
code.  You could easily use this tool, in conjunction with TET or ETET
(also freely available off the net), to produce a validation suite for
your application.

Frankly, this speaks to the distinction I draw between "Programmers"
and "Software Engineers".  A Programmer can take someone else's
problem solution and translate it into machine instructions.  A
Software Engineer, on the other hand, can solve problems using a
computer.

There will always be differences in craftsmanship as long as we
continue to use tools.  That's the nature of the connection between
humans and tools.  There are good craftsmen, and there are bad.

I think that your complaint is with the craftsman, and not with his
tools.  If you can find flaws in the tools, then by all means take it
up with the toolmaker.  But it is not the responsibility of a
toolmaker to label hammers "do not use on screws"; that's the
responsibility of the people who train the next generation of
craftsmen.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.