From owner-freebsd-hackers  Mon Jan 15  8: 6:41 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from smtp.nettoll.com (matrix.nettoll.net [212.155.143.61])
	by hub.freebsd.org (Postfix) with ESMTP id A4D2437B402
	for <freebsd-hackers@FreeBSD.ORG>; Mon, 15 Jan 2001 08:06:19 -0800 (PST)
Received: by smtp.nettoll.com; Mon, 15 Jan 2001 17:02:43 +0100 (MET)
Message-ID: <3A632009.1030604@enition.com>
Date: Mon, 15 Jan 2001 17:06:33 +0100
From: Xavier Galleri <xgalleri@enition.com>
User-Agent: Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; m18) Gecko/20001108 Netscape6/6.0
X-Accept-Language: en
MIME-Version: 1.0
To: freebsd-hackers@FreeBSD.ORG
Subject: Re: Need help for kernel crash dump analysis
References: <20010111163903.E6FF737B400@hub.freebsd.org> <3A5DE59F.6060602@enition.com> <3A5E090B.40601@enition.com> <20010111114318.C7240@fw.wintelcom.net>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Ok, let's start again (in plain text this time, thanx again, Daniel ;-)

I use a private scheme to interact with the 'ipintr' isr. The two 
following routines are expected to be called either by our modified 
version of 'ip_input' at network SWI level or at user level.

int my_global_ipl=0;
void my_enter() {
  int s=splnet();
  /* We do not expect this routine to be reentrant, thus the following
     sanity check. */
  ASSERT(my_global_ipl==0);
  my_global_ipl=s;
}
void my_exit() {
  int s=my_global_ipl;
  my_global_ipl=0;
  splx(s);
}

The crashes I got are always due to the assertion failure occurring in
the 'ipintr' isr. This *seems* to indicate that 'my_enter' is called at
the network SWI level after another execution flow has called 'my_enter'
itself and has *NOT* called 'my_exit' yet! This actually seems strange
given the 'splnet', and the only explanation I have found is that the
first execution flow has fallen asleep somewhere in the kernel (which
is not expected, of course!).
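
A side note on the sanity check above: it uses the saved level itself as
the "in use" marker, so it can only notice a second entry if the first
caller's previous level was nonzero. The following is a minimal userland
sketch, not the kernel code, that keeps a separate flag instead. It assumes
that 'splnet' can return 0 for a caller that had nothing masked. The names
model_cpl, model_splnet, model_splx, MODEL_NET_MASK, my_saved_ipl,
my_entered, my_enter_model and my_exit_model are all made up for
illustration, and the stand-in spl functions only track a mask; they do
not block anything.

```c
#include <assert.h>

/* Hypothetical userland stand-ins for the kernel spl routines: they only
   track a priority mask so the logic can be exercised outside the kernel. */
static int model_cpl = 0;          /* current (modelled) priority mask */
#define MODEL_NET_MASK 0x2         /* arbitrary bit standing for the net class */

static int model_splnet(void) {
  int old = model_cpl;
  model_cpl |= MODEL_NET_MASK;
  return old;                      /* previous mask, may legitimately be 0 */
}

static void model_splx(int s) {
  model_cpl = s;
}

static int my_saved_ipl = 0;       /* level to restore in my_exit_model() */
static int my_entered = 0;         /* 1 while the section is held */

/* Returns 0 on success, -1 if the section is already held. */
static int my_enter_model(void) {
  int s = model_splnet();
  if (my_entered) {
    model_splx(s);                 /* undo the raise before reporting */
    return -1;
  }
  my_entered = 1;
  my_saved_ipl = s;
  return 0;
}

static void my_exit_model(void) {
  int s = my_saved_ipl;
  my_entered = 0;
  my_saved_ipl = 0;
  model_splx(s);
}
```

In this model, a second my_enter_model() before my_exit_model() returns -1
even when the first caller entered from level 0, which a check on the saved
level alone would miss.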

Now, if you've read my first mail, I was actually asking for help on how
to dump the stack of an interrupted process with GDB when the
kernel crash occurs in the context of an isr. Actually, I would like to
know how I could dump the stack of *any* process at the time of the
crash. This way, I would be able to see where my user-land daemon was
in the kernel when the interrupt occurred.

Anyway, without this information, I am reduced to adding some traps along
the execution path of my process within my kernel code. This led
me to surround calls to MALLOC with counters as follows:

somewhere_else() {
  ...
  my_enter();    /* handle competition with network isr (especially
                    ipintr) */
  ...
  some_counter++;
  MALLOC(buf,cast,size,M_DEVBUF,M_NOWAIT);
  some_other_counter++;
  ...
  my_exit();
  ...
}

In every crash I got, the following relation holds at the time of the crash:

    ( some_counter - some_other_counter == 1 )

which *seems* to indicate that my process has been somehow
preempted during the call to MALLOC.
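
To make the counter argument concrete, here is a small userland model (an
illustration only, with a hypothetical hook standing in for whatever runs
while the allocation is in progress). The difference between the two
counters is 1 only for an observer that runs between the two increments,
and 0 before or after the bracketed call. model_alloc, during_alloc_hook,
record_diff, observed_diff and bracketed_alloc are assumed names, and plain
malloc() replaces the kernel MALLOC macro.

```c
#include <assert.h>
#include <stdlib.h>

static unsigned long some_counter = 0;
static unsigned long some_other_counter = 0;
static long observed_diff = -1;    /* last difference seen by record_diff() */

/* Hypothetical hook: stands for any code that gets to run while the
   allocation call is still in progress (e.g. an interrupt handler). */
static void (*during_alloc_hook)(void) = NULL;

/* Observer: records the counter difference at the moment it runs. */
static void record_diff(void) {
  observed_diff = (long)(some_counter - some_other_counter);
}

/* Stand-in for the allocator: runs the hook, then allocates. */
static void *model_alloc(size_t size) {
  if (during_alloc_hook != NULL)
    during_alloc_hook();
  return malloc(size);
}

/* Same bracketing as in somewhere_else(): one counter before the
   allocation, the other one after it. */
static void *bracketed_alloc(size_t size) {
  void *buf;
  some_counter++;
  buf = model_alloc(size);
  some_other_counter++;
  return buf;
}
```

Setting during_alloc_hook to record_diff and calling bracketed_alloc()
leaves observed_diff at 1; calling record_diff() again after the call
returns gives 0.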

My belief is that the FreeBSD kernel is (currently) a monolithic
non-preemptive non-threaded UNIX kernel, which implies that:

* system-scope scheduling is still done at process level (no kernel 
thread yet)
* any process executing in the kernel cannot be preempted for execution
by another process unless it either returns to user code or explicitly
goes to sleep.
* the only interlocking that must be done is with interrupts (when 
relevant), using the 'spl' management routine set.
Is that correct?

Well, I am obviously tracking a bug in my own code, but I would greatly
appreciate help, either with my GDB usage question or through
technical hints on where I should look to make progress in my investigation.

Thank you very much for your attention,

Rgds,

Xavier

Alfred Perlstein wrote:

> * Xavier Galleri <xgalleri@enition.com> [010111 11:27] wrote:
> 
>> Hi everybody,
>> 
>> I have reached a point where I am wondering if a call to 'malloc' with
>> the M_NOWAIT flag is not falling asleep!
> 
> 
> M_NOWAIT shouldn't sleep.
> 
>> In fact, I suspect that the interrupted context is somewhere during a
>> call to 'malloc' (I increment a counter just before calling malloc and
>> increment another just after and the difference is one!) while I have
>> called 'splnet' beforehand, thus normally preventing competing with any
>> network isr. I assume that this should never occur unless the code is
>> somewhere calling 'sleep' and provoking a context switch.
> 
> 
> if you add 1 to a variable the difference is expected to be one.
> 
>> Is there anybody who can help on this ?
> 
> 
> I'm not sure, you need to be more specific/clear.
> 

