Date:      Thu, 4 Jan 2001 02:33:18 +0200 (IST)
From:      Roman Shterenzon <roman@xpert.com>
To:        Greg Lehey <grog@lemis.com>
Cc:        Daniel Lang <dl@leo.org>, <freebsd-stable@freebsd.org>
Subject:   Re: Vinum saga continues
Message-ID:  <Pine.LNX.4.30.0101040217000.19919-100000@jamus.xpert.com>
In-Reply-To: <20010104103426.C4336@wantadilla.lemis.com>

On Thu, 4 Jan 2001, Greg Lehey wrote:

> [Format recovered--see http://www.lemis.com/email/email-format.html]
>
> On Wednesday,  3 January 2001 at 14:15:14 +0200, Roman Shterenzon wrote:
> > Hi,
> >
> > Attached is the most valuable information from my PR 22103.
> > I've read the vinumdebug page and the other guy's PR.
> > I still don't see what is missing.
> > You told the other guy to submit the backtrace, but it was in fact submitted!
> > It's in my PR as well.
> > Your responses are very brief - "please read vinumdebug" - but if
> > there's something missing, you could be more specific.
>
> OK.  I don't know what's so difficult about this, but here we go.  On
> the web page to which I refer, I say:
>
>   If you need to contact me because of problems with Vinum, please send
>   me a mail message with the following information:
>
>   - What problems are you having?
>
>     You don't say this, but I suppose it's obvious.
Hmm, I still think that you didn't read my PR.
It was in the synopsis.

>   - Which version of FreeBSD are you running?
>
>     I can't find this in your report.
So I was right. You didn't read it.
What I re-sent is only a small part of the PR. The version was there:
4.1-RELEASE.

>   - Have you made any changes to the system sources, including Vinum?
>
>     I can't find this in your report.
No, I didn't.


>   - Supply the output of the vinum list command.  If you can't start
>     Vinum, supply the on-disk configuration, as described below.  If
>     you can't get the on-disk configuration either, then (and only
>     then) send a copy of the configuration file.
>
>     I can't find this in your report.
>
>   - Supply an extract of the Vinum history file.  Unless you have
>     explicitly renamed it, it will be /var/log/vinum_history.  This
>     file can get very big; please limit it to the time around when you
>     have the problems.  Each line contains a timestamp at the
>     beginning, so you will have no difficulty in establishing which
>     data is of relevance.
>
>     I can't find this in your report.
>
>   - Supply an extract of the file /var/log/messages.  Restrict the
>     extract to the same time frame as the history file.  Again, each
>     line contains a timestamp at the beginning, so you will have no
>     difficulty in establishing which data is of relevance.
>
>     I can't find this in your report.
All of these were in one or both of my PRs.
There was nothing interesting there.

>
>   - If you have a crash, please supply a backtrace from the dump
>     analysis as discussed below under Kernel Panics.  Please don't
>     delete the crash dump; it may be needed for further analysis.
>
> Basically, all I can see here is the backtrace, which is still wrapped
> at 80 characters, despite all my requests.  I've had to manually
> reformat it to make it legible.  Have you really read the web page?
Hmm, I don't know why it wrapped at 80 characters.
I took it out of the "Raw PR" view.
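
(As an aside, here is roughly how one could gather everything Greg asks
for above. The time window and the kernel path below are placeholders,
not taken from my system:

    uname -a                                 # FreeBSD version
    vinum list > vinum-list.txt              # current Vinum state
    # extract the relevant time window from the logs
    sed -n '/Jan  3 14:00/,/Jan  3 14:30/p' /var/log/vinum_history > hist.txt
    sed -n '/Jan  3 14:00/,/Jan  3 14:30/p' /var/log/messages > msgs.txt
    # backtrace from the dump; kernel.debug must match the crashed kernel
    gdb -k /sys/compile/MYKERNEL/kernel.debug /var/crash/vmcore.0

Then "bt" inside gdb produces the backtrace.)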


> > Alfred Perlstein looked at my PR once and he thinks that it's due to
> > stack smashing.
> > However, he wasn't able to find where it happens.
> > It may in fact be an interaction with some other driver, like you said -
> > for example, fxp. This is why I submitted the dmesg output.
>
> Please, only if I ask for it.
OK.

>
> >  #62 0xc023660b in trap (frame={tf_fs = 0xc0270010, tf_es = 0xc0150010, tf_ds = 0x680010, tf_edi = 0xc16e9588,
> >        tf_esi = 0xc16e9400, tf_ebp = 0xc02773b0, tf_isp = 0xc0277380, tf_ebx = 0xc208e340, tf_edx = 0x0,
> >        tf_ecx = 0x5610001, tf_eax = 0xff9773bf, tf_trapno = 0xc, tf_err = 0x2, tf_eip = 0xc150fc67, tf_cs = 0x8,
> >        tf_eflags = 0x10246, tf_esp = 0xc16e9588, tf_ss = 0xc14bd000}) at ../../i386/i386/trap.c:426
> >  #63 0xc150fc67 in complete_rqe () at /usr/src/sys/modules/vinum/../../dev/vinum/vinuminterrupt.c:199
> >  #64 0xc0178d6b in biodone (bp=0xc16e9588) at ../../kern/vfs_bio.c:2637
> >  #65 0xc0126bb9 in dadone (periph=0xc14ca700, done_ccb=0xc1808400) at ../../cam/scsi/scsi_da.c:1246
> >  #66 0xc0122aff in camisr (queue=0xc0298690) at ../../cam/cam_xpt.c:6319
> >  #67 0xc0122911 in swi_cambio () at ../../cam/cam_xpt.c:6222
> >  #68 0xc022d0e0 in splz_swi ()
> >  (kgdb) up 63
> >  #64 0xc0178d6b in biodone (bp=0xc16e9588) at ../../kern/vfs_bio.c:2637
> >  2637                   (*bp->b_iodone) (bp);
> >  (kgdb) print bp
> >  $1 = (struct buf *) 0xc16e9588
> >  (kgdb) print *bp->b_iodone
> >  $2 = {void ()} 0xc150f6ac <complete_rqe>
> >  (kgdb) down
> >  #63 0xc150fc67 in complete_rqe () at /usr/src/sys/modules/vinum/../../dev/vinum/vinuminterrupt.c:199
> >  199    }
> >  (kgdb) list
> >  194                    VOL[rq->volplex.volno].active--;            /* another request finished */
> >  195                biodone(ubp);                                   /* top level buffer completed */
> >  196                freerq(rq);                                     /* return the request storage */
> >  197            }
> >  198        }
> >  199    }
> >  (kgdb) down
> >  #62 0xc023660b in trap (frame={tf_fs = 0xc0270010, tf_es = 0xc0150010, tf_ds = 0x680010, tf_edi = 0xc16e9588,
> >        tf_esi = 0xc16e9400, tf_ebp = 0xc02773b0, tf_isp = 0xc0277380, tf_ebx = 0xc208e340, tf_edx = 0x0,
> >        tf_ecx = 0x5610001, tf_eax = 0xff9773bf, tf_trapno = 0xc, tf_err = 0x2, tf_eip = 0xc150fc67, tf_cs = 0x8,
> >        tf_eflags = 0x10246, tf_esp = 0xc16e9588, tf_ss = 0xc14bd000}) at ../../i386/i386/trap.c:426
> >  426                            (void) trap_pfault(&frame, FALSE, eva);
> >  (kgdb) up 2
> >  #64 0xc0178d6b in biodone (bp=0xc16e9588) at ../../kern/vfs_bio.c:2637
> >  2637                   (*bp->b_iodone) (bp);
> >  (kgdb) up
> >  #65 0xc0126bb9 in dadone (periph=0xc14ca700, done_ccb=0xc1808400) at ../../cam/scsi/scsi_da.c:1246
> >  1246                   biodone(bp);
> >  (kgdb) print bp
> >  $3 = (struct buf *) 0xc16e9588
> >  (kgdb) print *bp
> >    b_flags = 0x204,
> >    b_qindex = 0x0,
> >    b_xflags = 0x0,
> >    b_lock = {
> >      lk_interlock = {
> >        lock_data = 0x0
> >      },
> >      lk_flags = 0x400,
> >      lk_sharecount = 0x0,
> >      lk_waitcount = 0x0,
> >      lk_exclusivecount = 0x1,
> >      lk_prio = 0x14,
> >      lk_wmesg = 0xc0257a24 "bufwait",
> >      lk_timo = 0x0,
> >      lk_lockholder = 0x5
> >    },
> >    b_error = 0x0,
> >    b_bufsize = 0x2000,
> >    b_bcount = 0x2000,
> >    b_resid = 0x0,
> >    b_dev = 0xc15cd880,
> >    b_data = 0xcbdcc000 "jA\002",
> >    b_kvabase = 0x0,
> >    b_kvasize = 0x0,
> >    b_lblkno = 0x0,
> >    b_blkno = 0x2b08149,
> >    b_offset = 0x0,
> >    b_iodone = 0xc150f6ac <complete_rqe>,
>
> OK, this is *not* the buffer header corruption bug, but it's happening
> in a very similar position.  With the buffer header corruption, you
> wouldn't have got as far as this, because b_iodone would be zeroed
> out.  I also can't see any other obvious damage to the buffer header.
>
> What we need to do now is to find out where the trap occurred.  That's
> at line 199 of complete_rqe, which shows as the very end of the
> function.  Could you give me the following information from gdb,
> please?
>
>  (gdb) x/20i 0xc150fc60
>
> Thanks

Heh :( I wish you had read the PR when I submitted it. It was there. I
only took it out of the closed PR and resent it to you.
The crash dump is not available anymore. I used the disks for a RAID-1
setup.

But, out of curiosity, what does this command do?
The fact that the page fault occurs at the end of the function, i.e. at
the return, hints that perhaps the return address of the call was smashed.

I think the crash can be reproduced: three disks in a RAID-5 setup,
more than 50% full. find /raid -print should crash it.
I can send you my setup. I had 3 x 36GB IBM drives on an Adaptec
controller; a sketch of the configuration follows below.
My mutilated and huge PR kern/22103 has more info (dmesg!) that you may
or may not find interesting.
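
Roughly like this - a sketch from memory, not my exact configuration;
device names, stripe size and mount point are placeholders:

    # three-drive RAID-5 volume; feed this file to "vinum create"
    drive d1 device /dev/da0s1e
    drive d2 device /dev/da1s1e
    drive d3 device /dev/da2s1e
    volume raid
      plex org raid5 512k
        sd length 0 drive d1
        sd length 0 drive d2
        sd length 0 drive d3

    # then, roughly:
    #   vinum init raid.p0        (zero the subdisks of the RAID-5 plex)
    #   newfs -v /dev/vinum/raid
    #   mount /dev/vinum/raid /raid
    #   fill it past 50%, then run: find /raid -print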

Daniel, do you have any other assumptions or ideas about how this can
be reproduced? I know that we had different hardware.

BTW, Daniel Lang had his crash in *exactly* the same place - kern/21148,
if I'm not mistaken. But he doesn't have his crash file anymore either.
You just didn't get back to us in time. Well... until next time.

--Roman Shterenzon, UNIX System Administrator and Consultant
[ Xpert UNIX Systems Ltd., Herzlia, Israel. Tel: +972-9-9522361 ]


