Date:      Sat, 26 Nov 2005 22:20:12 -0500
From:      Kris Kennaway <kris@obsecurity.org>
To:        Kris Kennaway <kris@obsecurity.org>
Cc:        amd64@freeBSD.org
Subject:   smp_tlb_shootdown loop (Re: spin lock smp rendezvous held by 0xffffff01250a7980 for > 5 seconds)
Message-ID:  <20051127032012.GA86016@xor.obsecurity.org>
In-Reply-To: <20051126232244.GA83432@xor.obsecurity.org>
References:  <20051124232616.GA32023@xor.obsecurity.org> <20051126232244.GA83432@xor.obsecurity.org>


On Sat, Nov 26, 2005 at 06:22:45PM -0500, Kris Kennaway wrote:
> On Thu, Nov 24, 2005 at 06:26:16PM -0500, Kris Kennaway wrote:
> > I got this on a quad amd64 machine running 6.0-STABLE.  At the time it
> > was running 21 simultaneous tar extractions onto a sync-mounted md.
> >
> > panic() at panic+0x1e6
> > _mtx_lock_spin() at _mtx_lock_spin+0xad
> > pmap_invalidate_range() at pmap_invalidate_range+0xb3
> > pmap_qremove() at pmap_qremove+0x53
> > vfs_vmio_release() at vfs_vmio_release+0x1e0
> > getnewbuf() at getnewbuf+0x368
> > getblk() at getblk+0x3d9
> > ffs_balloc_ufs1() at ffs_balloc_ufs1+0x662
> > ffs_write() at ffs_write+0x31b
> > VOP_WRITE_APV() at VOP_WRITE_APV+0xed
> > vn_write() at vn_write+0x228
> > dofilewrite() at dofilewrite+0x90
> > kern_writev() at kern_writev+0x54
> > write() at write+0x4b

Another CPU is here:

smp_tlb_shootdown() at smp_tlb_shootdown+0x40
smp_invlpg_range() at smp_invlpg_range+0x1e
pmap_invalidate_range() at pmap_invalidate_range+0xf9
pmap_qenter() at pmap_qenter+0x64
allocbuf() at allocbuf+0x9a0
getblk() at getblk+0x52d
ffs_balloc_ufs1() at ffs_balloc_ufs1+0x662
ffs_write() at ffs_write+0x31b
VOP_WRITE_APV() at VOP_WRITE_APV+0xed
vn_write() at vn_write+0x228
dofilewrite() at dofilewrite+0x90
kern_writev() at kern_writev+0x54
write() at write+0x4b
syscall() at syscall+0x404
Xfast_syscall() at Xfast_syscall+0xa8
--- syscall (4, FreeBSD ELF64, write), rip = 0x80070ea6c, rsp = 0x7fffffffe6a8, rbp = 0x52a800 ---

It is looping:

smp_tlb_shootdown+0x40: repe nop
smp_tlb_shootdown+0x42: movl    0x21c4f8,%eax
smp_tlb_shootdown+0x48: cmpl    %ebx,%eax
smp_tlb_shootdown+0x4a: jb      smp_tlb_shootdown+0x40

smp_tlb_shootdown(u_int vector, vm_offset_t addr1, vm_offset_t addr2)
{
        u_int ncpu;

        ncpu = mp_ncpus - 1;    /* does not shootdown self */
        if (ncpu < 1)
                return;         /* no other cpus */
        mtx_assert(&smp_ipi_mtx, MA_OWNED);
        smp_tlb_addr1 = addr1;
        smp_tlb_addr2 = addr2;
        atomic_store_rel_int(&smp_tlb_wait, 0);
        ipi_all_but_self(vector);
        while (smp_tlb_wait < ncpu)
                ia32_pause();
}

which seems to be the while loop at the end.

db> x/x smp_tlb_wait
smp_tlb_wait:   1
db> x mp_ncpus
mp_ncpus:       4
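
A minimal user-space sketch of the acknowledgement counting that the
while loop depends on, applied to the values printed above. It is only
a model: the model_* names are hypothetical, and it assumes that each
target CPU's IPI handler increments smp_tlb_wait exactly once after
invalidating its TLB.

```c
#include <assert.h>
#include <stdatomic.h>

/* User-space model only; not kernel code. Stands in for smp_tlb_wait. */
static atomic_uint model_tlb_wait;

/* Initiator: reset the acknowledgement counter before sending the IPI. */
static void model_shootdown_begin(void)
{
    atomic_store_explicit(&model_tlb_wait, 0, memory_order_release);
}

/* Target CPU (assumed behaviour): acknowledge once the TLB is invalidated. */
static void model_shootdown_ack(void)
{
    atomic_fetch_add_explicit(&model_tlb_wait, 1, memory_order_acq_rel);
}

/* Number of other CPUs the initiator is still waiting for. */
static unsigned model_shootdown_pending(unsigned mp_ncpus)
{
    unsigned ncpu, acked;

    assert(mp_ncpus >= 1);
    ncpu = mp_ncpus - 1;            /* the initiator does not count itself */
    acked = atomic_load_explicit(&model_tlb_wait, memory_order_acquire);
    return acked < ncpu ? ncpu - acked : 0;
}
```

With mp_ncpus = 4 and a single acknowledgement, as in the ddb output,
model_shootdown_pending(4) returns 2, so the loop condition
smp_tlb_wait < ncpu stays true.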

So it looks like it's stuck waiting for the TLB shootdown on the other
processors: smp_tlb_wait is 1, but ncpu is mp_ncpus - 1 = 3.  However,
the 3 other CPUs are all in the same place:

> _mtx_lock_spin() at _mtx_lock_spin+0x6b
> getit() at getit+0x6f
> DELAY() at DELAY+0x44
> _mtx_lock_spin() at _mtx_lock_spin+0x6b
> pmap_invalidate_range() at pmap_invalidate_range+0xb3
> pmap_qremove() at pmap_qremove+0x53
> vfs_vmio_release() at vfs_vmio_release+0x1e0
> getnewbuf() at getnewbuf+0x368
> getblk() at getblk+0x3d9
> ffs_balloc_ufs1() at ffs_balloc_ufs1+0x662
> ffs_write() at ffs_write+0x31b
> VOP_WRITE_APV() at VOP_WRITE_APV+0xed
> vn_write() at vn_write+0x228
> dofilewrite() at dofilewrite+0x90
> kern_writev() at kern_writev+0x54
> write() at write+0x4b
> syscall() at syscall+0x404
> Xfast_syscall() at Xfast_syscall+0xa8
> --- syscall (4, FreeBSD ELF64, write), rip = 0x80070ea6c, rsp = 0x7fffffffe6a8, rbp = 0x52ae00 ---
>
> i.e. the first _mtx_lock_spin() tried to acquire the ipi lock and
> spun, which called DELAY and getit, which tried to acquire the clock
> lock:
>
>         mtx_lock_spin(&clock_lock);
>
> which *also* spun, and called DELAY...and at that point things went to
> hell and it recursed until it blew out the stack.
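
The recursion described in the quoted text can be illustrated with a
toy model. This is a sketch under stated assumptions, not the kernel
code: the model_* names and MODEL_MAX_DEPTH are hypothetical, a
contended lock is represented by a flag, and one backoff step is
modelled as a single call through DELAY() and getit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEPTH 64      /* hypothetical stand-in for the stack limit */

struct model_lock { bool held_elsewhere; };

static struct model_lock model_clock_lock;  /* stands in for clock_lock */
static int model_depth;                     /* current nesting level */
static int model_max_seen;                  /* deepest nesting observed */

static bool model_lock_spin(struct model_lock *m);

/* getit() reads the timer under clock_lock; here it only takes the lock. */
static bool model_getit(void)
{
    return model_lock_spin(&model_clock_lock);
}

/* DELAY() needs the timer, so it goes through getit(). */
static bool model_delay(void)
{
    return model_getit();
}

/* One backoff step; returns false once nesting exceeds MODEL_MAX_DEPTH. */
static bool model_lock_spin(struct model_lock *m)
{
    bool ok = true;

    assert(m != NULL);
    if (++model_depth > model_max_seen)
        model_max_seen = model_depth;
    if (model_depth > MODEL_MAX_DEPTH)
        ok = false;             /* "blew out the stack" */
    else if (m->held_elsewhere)
        ok = model_delay();     /* contended: back off via DELAY() */
    model_depth--;
    return ok;
}
```

In this model, spinning on a contended lock while clock_lock is free
nests only two levels deep. Once clock_lock is itself contended, every
backoff re-enters model_lock_spin() on clock_lock and the nesting grows
until the depth limit is hit.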

So why aren't they processing the IPI?  Was the IPI lost somehow?

Kris


