From owner-freebsd-hackers  Thu Sep 13  9:43:37 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from goose.mail.pas.earthlink.net (goose.mail.pas.earthlink.net [207.217.120.18])
	by hub.freebsd.org (Postfix) with ESMTP id 941EF37B40C
	for <freebsd-hackers@freebsd.org>; Thu, 13 Sep 2001 09:43:29 -0700 (PDT)
Received: from mindspring.com (dialup-209.247.137.158.Dial1.SanJose1.Level3.net [209.247.137.158])
	by goose.mail.pas.earthlink.net (EL-8_9_3_3/8.9.3) with ESMTP id JAA26198;
	Thu, 13 Sep 2001 09:43:22 -0700 (PDT)
Message-ID: <3BA0E256.10B8F05B@mindspring.com>
Date: Thu, 13 Sep 2001 09:44:06 -0700
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Zhihui Zhang <zzhang@cs.binghamton.edu>
Cc: freebsd-hackers@freebsd.org
Subject: Re: Kernel module debug help
References: <Pine.SOL.4.21.0109131124040.29270-100000@opal>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
X-Loop: FreeBSD.ORG

Ah.  Interesting bug; perhaps related to a similar experience
of my own... so let's stare at it!


Zhihui Zhang wrote:
> 
> I am debugging a KLD and I have got the following panic inside an
> interrupt context:
> 
> fault virtual address = 0x1080050
> ...
> interrupt mask = bio
> kernel trap: type 12, code = 0
> Stopped at vwakeup+0x14: decl 0x44(%eax)
> 
> Where eax is 0x108000c and vwakeup() is called from biodone().
> 
> Since this panic occurs in an interrupt environment, I have no idea how to
> trace it. Is there a way to find the bug by tracing or what is the prime
> suspect in this case.  Thanks!

The best advice would be to repeat this failure in the
context of linking the module in statically instead of
dynamically.

If it won't repeat for you then, the problem has to be in
the form of memory allocation you are using as part of the
module.

If you want to brute-force the issue, find out what is being
dereferenced at vwakeup+0x14 ...it looks to be:

	vp->v_numoutput--;

though mine is at:

	0x40189c9c <vwakeup+20>:        decl   0x44(%eax)

Note that gdb prints the offset in decimal, and 20 is 0x14, so
this is the same instruction at the same offset as yours; your
vwakeup code most likely matches mine, including the "if" test
that verifies it's a non-NULL vnode pointer being dereferenced:

	void
	vwakeup(bp)
		register struct buf *bp;
	{
		register struct vnode *vp;

		bp->b_flags &= ~B_WRITEINPROG;
		if ((vp = bp->b_vp)) {
			vp->v_numoutput--;
			if (vp->v_numoutput < 0)
				panic("vwakeup: neg numoutput");
			if ((vp->v_numoutput == 0) && (vp->v_flag & VBWAIT)) {
				vp->v_flag &= ~VBWAIT;
				wakeup((caddr_t) &vp->v_numoutput);
			}
		}
	}


I'll also note that 0x44 is 68, which implies 17 long words
before v_numoutput is declared in struct vnode; this didn't
match my quick count.
	

I rather expect that it's either in a swappable memory region
that's currently not mapped, or NULL (we see it's not NULL), so
this implies that it's an uninitialized vnode from the zone -- a
thing you can't initialize at interrupt time.

This can happen as the result of a kevent() completion being
noted (e.g. readable) at interrupt context, since you can get
swappable objects there.  It also looks like you may be on your
way out of splbio, which implies networking.  My guess is
therefore that you are working on network file system code, and
have a "shadow" vnode that you are using as a context for the
calls; that vnode should have been allocated out of an interrupt
zone instead of out of the main memory allocator, which is not
interrupt safe for new allocations... 8-).

For example, I use LRP, which drastically increases my connections
per second out of the TCP stack and eliminates receiver livelock
and a number of other problems for heavily loaded servers, but it
means that sockets need to be capable of accept'ing to completion
(creating a new socket) at interrupt context.

But when this happens, I don't have a proc structure handy to
deal with the issue (since I'm at interrupt context).  The
sneaky way around this is to use the proc from the already
existing listening socket, the one on which the accept being
completed was initially posted.  That gets me the proc struct,
which gets me the ucred, so I have the proc pointer and the
ucred pointer necessary to run the connection to completion.

I rather expect that if you are depending on the existence of
something similar at interrupt context, you will have to
either queue it and run to completion at a software interrupt
level (e.g. NETISR -- not recommended, even for networking!),
or just "lose" the wakeup (which is what the vwakeup code I
have does, with its "if" test).

Still, your best bet is to compile the thing in static, repeat
the problem, and then look at where things went wrong in the
kernel debugger.

-- Terry
