From owner-freebsd-bugs  Mon Jan 29 13:04:47 1996
Return-Path: owner-bugs
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id NAA03082
          for bugs-outgoing; Mon, 29 Jan 1996 13:04:47 -0800 (PST)
Received: from obelix.cica.es (obelix.cica.es [150.214.1.10])
          by freefall.freebsd.org (8.7.3/8.7.3) with ESMTP id NAA03068
          for <bugs@freebsd.org>; Mon, 29 Jan 1996 13:04:37 -0800 (PST)
Received: (from amora@localhost) by obelix.cica.es (8.7.1/8.7.1) id WAA02694 for bugs@freebsd.org; Mon, 29 Jan 1996 22:01:53 +0100 (GMT-1:00)
Date: Mon, 29 Jan 1996 22:01:53 +0100 (GMT-1:00)
From: "Jesus A. Mora Marin" <amora@obelix.cica.es>
Message-Id: <199601292101.WAA02694@obelix.cica.es>
To: bugs@freebsd.org
Sender: owner-bugs@freebsd.org
Precedence: bulk


Hi, world! On the road again. I apologize for my delay in answering, but
I was given the replies to my previous message some days after they were
posted, and was busy last week.

Thomas Graichen (graichen@omega.physik.fu-berlin.de) said:
> is this a joke or truth ... ?

I like jokes, but I also DO hate wasting bandwidth just on hoaxes. Sending
endless bug reports causes me no ethical concerns :)

> ... - you must have been sitting for hours to
> write this bug-report (?) ...

Yep. I spent a full Sunday afternoon, mostly collecting data and trying to
figure out what /dev/<#?_^@! stood for in my handwritten notes. Also,
trying to polish my awful English was no piece of cake.


Jordan -jkh@time.cdrom.com- said:

> Yes, Jesus, we will indeed do our best to help you with this problem!

Nice to meet you, Jordan. Well, in fact I didn't mean to ask for help with
something when it isn't yet clear whether it is a real bug somewhere in the
FreeBSD code or a peculiarity in my bitty-box's guts. When I got interested in
FreeBSD -or in Linux or whatever free stuff-, I knew that no support should be
expected. In writing that report, my aim was only to report a *possible* bug
and to lend a hand, if possible. Just an ACK would suffice, but I see I've
got much more. Thanks.


Now, replying to Frank Durda IV -uhclem@nemesis.lonestar.org-. First things
first: many, many thanks, Frank, for your suggestions and ideas. I am very
pleased to be working with you. Now, let's review some results:

> This version of firmware is newer than any I have seen, but I don't
> think this is a problem if you can do something like
>         dd if=/dev/rmatcd0a of=/dev/null bs=100k
> and let that run for ten minutes or so without any crashes or data errors.

Ok, I've run a command like this, using block sizes ranging from 64k up to
256k. After transferring more than 100MB, no problem.
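
The read test described above could be repeated along these lines. This is
only a rough sketch: the function name read_sweep is made up for
illustration, and the intermediate block size of 128k is an assumption (the
message only says the sizes ranged from 64k up to 256k).

```shell
# read_sweep DEVICE [BLOCK_SIZE...]
# Read DEVICE to /dev/null once per block size and report each result.
# Default block sizes are an assumption: 64k, 128k, 256k.
read_sweep() {
    dev=$1
    shift
    [ "$#" -gt 0 ] || set -- 64k 128k 256k
    for bs in "$@"; do
        # dd reports its statistics on stderr; hide them and keep the status
        if dd if="$dev" of=/dev/null bs="$bs" 2>/dev/null; then
            echo "bs=$bs ok"
        else
            echo "bs=$bs FAILED"
            return 1
        fi
    done
}
```

On the machine in question it would be called as `read_sweep /dev/rmatcd0a`;
a non-zero return means one of the reads hit an error.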

> Does the crash occur with the GENERIC kernel, ie, the one that
> came on the CD-ROM?  If that version also crashes, it will help
> eliminate the numerous differences between the GENERIC kernel and
> your custom kernel.

Yes. It happens all the time. I've seen it with kernel.GENERIC, and with some
previous versions of customized kernels. It doesn't seem to be related to
any option I can imagine: with or without DDB, KTRACE, XSERVER, and so on,
the nasty crash remains. The panic message with the GENERIC kernel looks like
this:

 Fatal trap 12: page fault while in kernel mode
 fault virtual address = 0xf1dff000
 fault code            = supervisor write, page not present
 instruction pointer   = 0x8:0xf01cffce
 code segment          = base 0x0, limit 0xfffff, type 0x1b
                       = DPL 0, pres 1, def32 1, gran 1
 processor eflags      = interrupt enabled, resume, IOPL=0
 current process       = Idle
 interrupt mask        =
 panic: page fault

That is, exactly the same as the one obtained with the custom kernel,
except for the fault address (JAMMBSD: 0xf1e17000) and the
eip (JAMMBSD: 0x8:0xf01839de). Of course: different kernels -> different
addresses. Note that with the same kernel, you always get the same
addresses. I am not sure whether different virtual addresses, in different
kernels, can translate to the same physical address (apologies if I am
talking nonsense, but I've never seen a good text clearly describing the
inner workings of the 80x86 MMU). I wonder about this because I am thinking
of a possibly faulty RAM SIMM. More about this later.
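
On the virtual-versus-physical question: with the 4 KB pages of the 386/486,
a 32-bit linear address splits into a 10-bit page directory index, a 10-bit
page table index and a 12-bit offset within the page. The sketch below only
performs that split (decode_linear is a made-up name). It cannot give the
physical address, because that depends on the contents of the page tables,
which the panic message does not show.

```shell
# decode_linear ADDRESS
# Split a 32-bit linear address into its 386/486 paging fields
# (4 KB pages assumed): directory index, table index, byte offset.
decode_linear() {
    addr=$(( $1 ))
    dir=$(( (addr >> 22) & 0x3ff ))     # top 10 bits: page directory entry
    table=$(( (addr >> 12) & 0x3ff ))   # next 10 bits: page table entry
    offset=$(( addr & 0xfff ))          # low 12 bits: offset inside the page
    echo "dir=$dir table=$table offset=$offset"
}
```

For the two fault addresses above, `decode_linear 0xf1dff000` prints
"dir=967 table=511 offset=0" and `decode_linear 0xf1e17000` prints
"dir=967 table=535 offset=0". Both addresses sit exactly on a page boundary,
which may hint at a write running off the end of a mapped region onto a page
that is not present, though that is only a guess.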


> I assume you issued a umount before this command since the system attempted
> to mount the CD automatically when it came up.  FreeBSD will let you mount
> on top of mounts, although it isn't a real good idea.

> If you did not do a umount first, please do so and try again, OR
> don't do the mount at all  since the media should be mounted.

Well, even with the CD-ROM in the drive before booting, it seems that
it's not mounted until you do it yourself -if `mount' is to be believed-.
Further, trying to umount /dev/matcd0a just after the boot finishes gives an
error, i.e., 'device not mounted'. Anyway, I verified this, and an explicit
mount was required to access the CD-ROM. And, of course, the crash was there.

> It would be nice if you could cause a failure with some utility that
> is part of the bin distribution (/bin /usr/bin /sbin /usr/sbin, etc)
> and that would let me look at it right away.

That's the really funny side of this story! Frank, I have tried dd, cat, less,
more, cp,..., on files in /cdrom/ports -where the offending file appears to be-.
To be sure that the I/O was not using blocks in the buffer cache, each
command issued was preceded by a full dismount-mount cycle of matcd0a (I think
this suffices to return the blocks to the free list, though I am not sure of
the implementation). They all worked! I cannot believe that this is a bug
related to a specific user app, but it only happens with Midnight Commander.
Of course, I've also tried running it under accounts other than superuser, and
verified that it has no SUID/SGID bits set. Still the same...
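
The dismount-mount cycle before each command could be wrapped roughly as
follows. This is a sketch under assumptions: fresh_read, CD_DEV, CD_DIR and
RUN are made-up names, and the `-t cd9660' file system type is assumed, since
the exact mount command used is not given above. Setting RUN=echo prints the
commands instead of executing them.

```shell
# Assumed defaults, taken from the device and directory named in this message.
CD_DEV=${CD_DEV:-/dev/matcd0a}
CD_DIR=${CD_DIR:-/cdrom}
RUN=${RUN:-}    # set RUN=echo for a dry run that only prints the commands

# fresh_read COMMAND [ARG...]
# Remount the CD so cached blocks are not reused, then run COMMAND.
fresh_read() {
    # The CD may not be mounted yet; ignore a failing umount.
    $RUN umount "$CD_DIR" 2>/dev/null || true
    $RUN mount -t cd9660 "$CD_DEV" "$CD_DIR" || return 1
    $RUN "$@"
}
```

A dry run such as `RUN=echo; fresh_read cat /cdrom/ports/FILE' (FILE being a
placeholder) shows the three commands that would be issued.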


> The code in question should have been reading from the CD (does the light
> on the drive stay on when the panic occurs?), but some of the other
> state doesn't make sense right now.   If the drive light is out when the
> panic occurs, the processor has somehow wandered into this section of
> code by accident.

Yes, Frank: the light in the CD drive is on just before the panic occurs and
then goes off. Checked.

> As to all the settings of your BIOS, I really can't advise except to
> recommend you go with the settings that were present when the board
> was purchased, rather than any accelerated values you may be using now.

I've tried this with the original settings, and it doesn't change the
picture.


> Because of what I see in the rest of your description, you might make
> sure you don't have a memory problem.   This is easy to try...

This is a point to check! I have been wondering about this because, when I
bought this 486 board, I had to buy new SIMMs as well. I had a lot of
trouble with Windows, Doom -THIS broke my heart- and even a panic
in SCO Unix 386 (a trap 0x0e, i.e., exactly the same: a page fault in kernel
mode). I identified the damned SIMM and got rid of it, and all has been
working great since then. Nevertheless, a faulty SIMM must be ruled out.
I have rotated the three 4MB SIMMs in this scheme: 123 -> 312, so no SIMM
remained in its original position. But, alas, this didn't fix the problem:
the crash reproduced exactly as before.

Well, perhaps there is some obscure hardware incompatibility causing the
problem -I think this can never be ruled out when dealing with PC clones-.
Now I'll try to do some hacking -I cannot promise anything but I will do
my best :) -. I'll re-`config -g' the kernel, turn the CD-ROM driver
debugging options on, and so on. I'll also try to get a kernel dump
after the crash. Well, I must think again about all this and plan carefully.


Now, time to finish. Any contribution, idea, or suggestion will be welcome.
Thanks,


                                        Jesus A. Mora Marin
                                        amora@obelix.cica.es