Date:      Sat, 02 Aug 2003 17:56:49 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Greg 'groggy' Lehey <grog@FreeBSD.org>
Cc:        current@freebsd.org
Subject:   Re: Yet another crash in FreeBSD 5.1
Message-ID:  <3F2C5DD1.36570B38@mindspring.com>
References:  <1079.192.168.0.3.1059811884.squirrel@webmail.aminor.no> <3F2B803C.21D38E0B@mindspring.com> <20030803000302.GE95375@wantadilla.lemis.com>

Greg 'groggy' Lehey wrote:
> > You don't actually need a crash dump to debug a stack traceback.
> 
> Great!  So you know the answer?  Please submit a patch.
> 
> Seriously, this is nonsense.  Yes, it's a null pointer dereference.
> What?

Doing what I suggested discovers precisely that, Greg.

In case you haven't seen his follow-up posting:

(kgdb) list *(g_dev_strategy+29)
0xc02e812d is in g_dev_strategy (/usr/src/sys/geom/geom_dev.c:415).
410             KASSERT(cp->acr || cp->acw,
411                 ("Consumer with zero access count in g_dev_strategy"));
412
413             bp2 = g_clone_bio(bp);
414             KASSERT(bp2 != NULL, ("XXX: ENOMEM in a bad place"));
415             bp2->bio_offset = (off_t)bp->bio_blkno << DEV_BSHIFT;
416             KASSERT(bp2->bio_offset >= 0,
417                 ("Negative bio_offset (%jd) on bio %p",
418                 (intmax_t)bp2->bio_offset, bp));
419             bp2->bio_length = (off_t)bp->bio_bcount;


Line 415 dereferences both bp2 and bp, so clearly one of them is NULL at the time of the dereference.
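For anyone following along, the arithmetic behind `list *(g_dev_strategy+29)` is simple. kgdb takes the start address of g_dev_strategy, adds 29 bytes (decimal, i.e. 0x1d), and maps the result, 0xc02e812d, to a source line using the debug kernel's line tables. Working backwards, g_dev_strategy must start at 0xc02e812d - 29 = 0xc02e8110. The sketch below models only the symbol step, assuming a small hypothetical sorted symbol table: only the g_dev_strategy address is derived from the output above, the neighbouring entries are placeholders rather than real kernel addresses, and the line-table lookup is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry of a hypothetical kernel symbol table, sorted by address. */
struct ksym {
	const char *name;
	uint32_t    addr;	/* start address of the function */
};

/*
 * Only g_dev_strategy's address is derived from the kgdb output
 * (0xc02e812d - 29); the other two entries are placeholders.
 */
static const struct ksym symtab[] = {
	{ "placeholder_before", 0xc02e8000u },
	{ "g_dev_strategy",     0xc02e8110u },
	{ "placeholder_after",  0xc02e8300u },
};

/* Forward direction, as in "list *(g_dev_strategy+29)". */
static uint32_t
sym_to_addr(uint32_t base, uint32_t off)
{
	return base + off;
}

/*
 * Reverse direction, as in a traceback entry "g_dev_strategy+0x1d":
 * find the last symbol starting at or below addr and report the offset.
 * Returns NULL if addr lies below the first symbol in the table.
 */
static const struct ksym *
addr_to_sym(uint32_t addr, uint32_t *offp)
{
	const struct ksym *best = NULL;
	size_t i;

	for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
		if (symtab[i].addr > addr)
			break;
		best = &symtab[i];
	}
	if (best != NULL)
		*offp = addr - best->addr;
	return best;
}
```

A traceback entry of the form function+offset is the reverse lookup; given the exact source and debug kernel that produced the crash, that offset is enough to find the failing line, which is why the object set matters.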


> Why?

Programmer error.  Either bp2 or bp is a NULL pointer.


> How do you fix it?

It depends on the root cause.  If the root cause is that bp is
NULL, then I'd hope that it would have been caught higher up; if it
wasn't, then I'd hope that g_clone_bio(bp) would have returned NULL.

Was the KASSERT() active at the time of the problem, i.e. was his
kernel built with INVARIANTS?  I don't know; if it wasn't, the check
probably should be converted to an if()...panic(), so that it is
always compiled in.
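To make the KASSERT() point concrete: in FreeBSD, KASSERT() expands to a real check only in kernels built with `options INVARIANTS`; otherwise it compiles to nothing, so line 414 would not catch a NULL bp2 at all. The userland sketch below models that difference. `mock_panic()`, `check_with_kassert()` and `check_always()` are hypothetical names, and unlike the real panic(), `mock_panic()` records the message and returns so the behaviour can be observed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userland stand-in for panic(9): it records the message
 * and returns, whereas the real panic() never returns.
 */
static int  panicked;
static char panicmsg[128];

static void
mock_panic(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(panicmsg, sizeof(panicmsg), fmt, ap);
	va_end(ap);
	panicked = 1;
}

/* Modelled on FreeBSD's KASSERT(): the check exists only with INVARIANTS. */
#ifdef INVARIANTS
#define	KASSERT(exp, msg)	do { if (!(exp)) mock_panic msg; } while (0)
#else
#define	KASSERT(exp, msg)	do { } while (0)
#endif

struct bio;	/* opaque here; only the pointer value matters */

/* As written in geom_dev.c: silently skipped in a non-INVARIANTS build. */
static void
check_with_kassert(const struct bio *bp2)
{
	KASSERT(bp2 != NULL, ("XXX: ENOMEM in a bad place"));
	(void)bp2;	/* unused when KASSERT compiles away */
}

/* The if()...panic() form: the check is always compiled in. */
static void
check_always(const struct bio *bp2)
{
	if (bp2 == NULL)
		mock_panic("g_dev_strategy: g_clone_bio() returned NULL");
}
```

Built without INVARIANTS, passing NULL to `check_with_kassert()` does nothing, while `check_always()` still reports the problem; that is the trade-off of paying for the test in production kernels.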

If it was, then I'd have to expect that the pointer's validity
changed out from under it as a result of an interrupt, preemption,
reentrancy (if the locking didn't prevent it) or SMP races (if the
locking didn't prevent it).
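The race case is the classic check-then-use window: a pointer is tested, something else changes it, and the later dereference sees a different value. A minimal userland sketch with POSIX threads, assuming a hypothetical shared pointer `shared_req` guarded by a hypothetical `req_mtx` (a kernel would use mtx(9), not pthreads), shows the usual fix of doing the check and the use under the same lock the writer takes.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state: a pointer that another thread may clear. */
struct req { long blkno; };

static pthread_mutex_t req_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct req *shared_req;	/* protected by req_mtx */

/*
 * Racy pattern: the NULL check and the dereference are not done under
 * any lock, so another thread (or, in a kernel, an interrupt or
 * preemption) may clear shared_req in between.
 */
static long
racy_blkno(void)
{
	if (shared_req == NULL)
		return -1;
	return shared_req->blkno;	/* may already be NULL here */
}

/* Holding the lock across both the check and the use closes the window. */
static long
locked_blkno(void)
{
	long v = -1;

	pthread_mutex_lock(&req_mtx);
	if (shared_req != NULL)
		v = shared_req->blkno;
	pthread_mutex_unlock(&req_mtx);
	return v;
}

/* Writers take the same lock before changing the pointer. */
static void
set_req(struct req *r)
{
	pthread_mutex_lock(&req_mtx);
	shared_req = r;
	pthread_mutex_unlock(&req_mtx);
}
```

Single-threaded, both readers return the same values; the difference only shows up under concurrency, which is exactly why such bugs are hard to reproduce without the crashed kernel's state.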

I really can't answer that, for the same reason that I couldn't
locate the failing source line from his original posting of hex
offsets into functions compiled from unknown source code: I don't
have his object set for the problem in question, nor his debug
kernel.


> Finding the first step doesn't solve the problem.

No.  Finding the first step is *necessary* to solving the problem,
but you are entirely correct in pointing out that it's not in
itself *sufficient*.

But it's one step farther along than he was.  I didn't see anyone
else helping him take that first step, so I did.

-- Terry


