From: Xavier Galleri
Date: Fri, 12 Jan 2001 11:13:29 +0100
To: Alfred Perlstein
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Need help for kernel crash dump analysis
Message-ID: <3A5ED8C9.3050309@wanadoo.fr>

Thank you for your answer,

OK, let me make this a bit clearer.

I use a private scheme to interact with the 'ipintr' isr. The following two routines are expected to be called either by our modified version of 'ip_input' at network SWI level or at user level.

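int my_global_ipl = 0;    /* level saved by my_enter(); 0 means "not held" */

void my_enter() {
  int s = splnet();       /* raise to splnet and remember the previous level */
  /* We do not expect this routine to be reentrant, thus the following sanity check. */
  ASSERT(my_global_ipl == 0);
  my_global_ipl = s;
}

void my_exit() {
  int s = my_global_ipl;
  my_global_ipl = 0;
  splx(s);                /* restore the level saved by my_enter() */
}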

The crashes I get are always due to the assertion failing in the 'ipintr' isr. This *seems* to indicate that 'my_enter' is called at network SWI level after another execution flow has itself called 'my_enter' and has *NOT* yet called 'my_exit'. This is strange given the 'splnet', and the only explanation I have found is that the first execution flow has fallen asleep somewhere in the kernel (which is not expected, of course).
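To make the suspected failure mode concrete, here is a minimal user-space sketch of the same guard. It is only a model and not the kernel code: 'fake_splnet', 'fake_splx' and 'current_level' are hypothetical stand-ins for the spl routines, and re-entry is reported through a return value instead of ASSERT. The sketch also keeps a separate 'guard_held' flag, on the assumption that a saved level of 0 can be a legitimate value and is therefore a fragile "not held" marker.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's spl routines (user-space model). */
#define FAKE_NET_LEVEL 1
static int current_level = 0;

/* Raise the modelled level to "net" and return the previous level. */
static int fake_splnet(void) {
    int old = current_level;
    if (current_level < FAKE_NET_LEVEL)
        current_level = FAKE_NET_LEVEL;
    return old;
}

/* Restore a previously saved level. */
static void fake_splx(int saved) {
    current_level = saved;
}

static bool guard_held = false;     /* explicit "held" flag (assumption) */
static int guard_saved_level = 0;   /* level to restore in guard_exit() */

/* Returns true on success, false if the guard is already held (re-entry). */
static bool guard_enter(void) {
    int s = fake_splnet();
    if (guard_held) {
        fake_splx(s);               /* undo our own raise before reporting */
        return false;
    }
    guard_held = true;
    guard_saved_level = s;
    return true;
}

/* Releases the guard and restores the level saved by guard_enter(). */
static void guard_exit(void) {
    int s = guard_saved_level;
    guard_held = false;
    fake_splx(s);
}
```

In this model a second 'guard_enter' before the matching 'guard_exit' returns false, which corresponds to the situation in which the ASSERT in 'my_enter' fires.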

Now, if you have read my first mail, I was actually asking for help on how to dump the stack of an interrupted process with GDB when the kernel crash occurs in the context of an isr. More generally, I would like to know how to dump the stack of *any* process at the time of the crash. This way, I would be able to see where my user-land daemon was executing in the kernel when the interrupt occurred.

Anyway, without this information, I am reduced to adding traces along the execution path of my process within my kernel code. This led me to surround calls to MALLOC with counters as follows:

somewhere_else() {
  ...
  my_enter();    /* handle competition with network isr (especially ipintr) */
  ...
  some_counter++;
  MALLOC(buf,cast,size,M_DEVBUF,M_NOWAIT);
  some_other_counter++;
  ...
  my_exit();
  ...
}

All the crashes I get then show the following relation at the time of the crash:
( some_counter - some_other_counter == 1 )
which *seems* to indicate that my process has somehow been preempted during the call to MALLOC.
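As a sanity check on what this relation can and cannot show, here is a small user-space sketch of the same instrumentation. The names 'counted_alloc', 'probe_alloc' and 'calls_in_flight' are hypothetical, and plain malloc stands in for the kernel MALLOC macro. A difference of one only says that the snapshot was taken between the two increments; it does not by itself say why execution was there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static unsigned long calls_started;            /* bumped just before the call */
static unsigned long calls_finished;           /* bumped just after the call  */
static unsigned long in_flight_seen_by_probe;  /* snapshot taken mid-call     */

/* Number of allocation calls that have started but not yet returned. */
static unsigned long calls_in_flight(void) {
    return calls_started - calls_finished;
}

typedef void *(*alloc_fn)(size_t);

/* Allocator used for the demonstration: records the in-flight count
 * as observed from inside the allocation call, then allocates. */
static void *probe_alloc(size_t size) {
    in_flight_seen_by_probe = calls_in_flight();
    return malloc(size);
}

/* Wraps an allocation with the two counters, as in the mail above. */
static void *counted_alloc(alloc_fn fn, size_t size) {
    void *p;

    calls_started++;
    p = fn(size);
    calls_finished++;
    return p;
}
```

Observed from inside the call the difference is one, and after the call returns it is back to zero, so the crash-time value locates the interrupted flow but not the cause.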

My belief is that the FreeBSD kernel is (currently) a monolithic, non-preemptive, non-threaded UNIX kernel, which implies that:
  • system-scope scheduling is still done at process level (no kernel threads yet)
  • a process executing in the kernel cannot be preempted in favour of another process unless it either returns to user code or explicitly goes to sleep.
  • the only interlocking that must be done is with interrupts (when relevant), using the 'spl' family of routines.
Is that correct?
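The model I have in mind for the third point can be sketched in user space as follows. This is a toy and not the kernel implementation: 'model_splnet', 'model_splx' and 'model_post_net' are hypothetical names, and a single flag stands in for the pending network software interrupt. The idea is that a soft interrupt posted while the level is raised stays pending and only runs once the level is lowered again.

```c
#include <assert.h>
#include <stdbool.h>

#define LEVEL_NONE 0
#define LEVEL_NET  1

static int  ipl_level = LEVEL_NONE;  /* modelled interrupt priority level */
static bool net_pending = false;     /* a net soft interrupt is waiting   */
static int  handler_runs = 0;        /* how many times the handler ran    */

/* Stand-in for the network soft interrupt handler. */
static void net_handler(void) {
    handler_runs++;
}

/* Run the pending handler only if the current level does not mask it. */
static void run_pending(void) {
    if (net_pending && ipl_level < LEVEL_NET) {
        int saved = ipl_level;

        net_pending = false;
        ipl_level = LEVEL_NET;       /* the handler itself runs at net level */
        net_handler();
        ipl_level = saved;
    }
}

/* Raise to net level and return the previous level. */
static int model_splnet(void) {
    int old = ipl_level;

    if (ipl_level < LEVEL_NET)
        ipl_level = LEVEL_NET;
    return old;
}

/* Restore a saved level, then deliver anything that became runnable. */
static void model_splx(int old) {
    ipl_level = old;
    run_pending();
}

/* Post a net soft interrupt; it is deferred while the level masks it. */
static void model_post_net(void) {
    net_pending = true;
    run_pending();
}
```

Under this model, the handler cannot run between 'model_splnet' and the matching 'model_splx', which is why an assertion failure inside that window suggests the first flow gave up the CPU some other way.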

I am obviously tracking a bug in my own code, but I would greatly appreciate help, either with my GDB usage question or in the form of technical hints on where I should look to make progress in my investigation.

Thank you very much for your attention,

Rgds,

Xavier

Alfred Perlstein wrote:
> * Xavier Galleri <xgalleri@enition.com> [010111 11:27] wrote:
>> Hi everybody,
>>
>> I have reached a point where I am wondering if a call to 'malloc' with
>> the M_NOWAIT flag is not falling asleep !
>
> M_NOWAIT shouldn't sleep.
>
>> In fact, I suspect that the interrupted context is somewhere during a
>> call to 'malloc' (I increment a counter just before calling malloc and
>> increment another just after and the difference is one !) while I have
>> called 'splnet' beforehand, thus normally preventing competing with any
>> network isr. I assume that this should never occur unless the code is
>> somewhere calling 'sleep' and provoke a context switch.
>
> if you add 1 to a variable the difference is expected to be one.
>
>> Is there anybody who can help on this ?
>
> I'm not sure, you need to be more specific/clear.

