Date:      Wed, 23 Jul 2008 11:23:13 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        Kostik Belousov <kostikbel@gmail.com>
Cc:        Mikhail Teterin <mi+mill@aldan.algebra.com>, Kris Kennaway <kris@freebsd.org>, stable@freebsd.org
Subject:   Re: "sleeping without queue" ?
Message-ID:  <200807231123.14229.jhb@freebsd.org>
In-Reply-To: <20080723120348.GJ17123@deviant.kiev.zoral.com.ua>
References:  <48860725.9050808@aldan.algebra.com> <48863C3D.7090401@aldan.algebra.com> <20080723120348.GJ17123@deviant.kiev.zoral.com.ua>

On Wednesday 23 July 2008 08:03:48 am Kostik Belousov wrote:
> On Tue, Jul 22, 2008 at 03:59:57PM -0400, Mikhail Teterin wrote:
> > Kostik Belousov wrote:
> > >On Tue, Jul 22, 2008 at 03:26:29PM -0400, Mikhail Teterin wrote:
> > >>Kostik Belousov wrote:
> > >>>Did you switch to the process before doing the backtrace (using the
> > >>>proc <pid> command)?
> > >>Ok, thanks. Did not know about this one. Here:
> > >>...
> > >>(kgdb) proc 79759
> > >>(kgdb) bt
> > >>#0  sched_switch (td=0xffffff01286dc000, newtd=0xffffff00010ce000,
> > >>flags=2) at /var/src/sys/kern/sched_4bsd.c:928
> > >>#1  0x0000000000000000 in ?? ()
> > >>#2  0xffffffff802f1108 in mi_switch (flags=678281216, newtd=0x2) at
> > >>/var/src/sys/kern/kern_synch.c:442
> > >>#3  0xffffffff80318513 in sleepq_check_timeout () at
> > >>/var/src/sys/kern/subr_sleepqueue.c:519
> > >>#4  0xffffffff80318c85 in sleepq_timedwait (wchan=0xffffffff80688408) at
> > >>/var/src/sys/kern/subr_sleepqueue.c:597
> > >>#5  0xffffffff802f16a2 in _sleep (ident=0xffffffff80688408, lock=0x0,
> > >>priority=0, wmesg=0xffffffff804f3059 "vmo_de", timo=1) at
> > >>/var/src/sys/kern/kern_synch.c:224
> > >>#6  0xffffffff8043036b in vm_object_deallocate
> > >>(object=0xffffff0053024a90) at /var/src/sys/vm/vm_object.c:509
> > >From this frame, please print the object (like p *object) and
> > >likewise, print the object that is at the head of the
> > >object->shadow_head list.
> > kgdb /usr/obj/var/src/sys/SILVER-SMP/kernel.debug /dev/mem
> > [GDB will not be able to debug user-mode threads:
> > /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
> > GNU gdb 6.1.1 [FreeBSD]
> > Copyright 2004 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public License, and you are
> > welcome to change it and/or distribute copies of it under certain
> > conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB.  Type "show warranty" for details.
> > This GDB was configured as "amd64-marcel-freebsd".
> > There is no member named pathname.
> > Reading symbols from /opt/modules/fuse.ko...done.
> > Loaded symbols for /opt/modules/fuse.ko
> > Reading symbols from /opt/modules/rtc.ko...done.
> > Loaded symbols for /opt/modules/rtc.ko
> > Reading symbols from /boot/kernel/snd_ich.ko...Reading symbols from
> > /boot/kernel/snd_ich.ko.symbols...done.
> > done.
> > Loaded symbols for /boot/kernel/snd_ich.ko
> > Reading symbols from /boot/kernel/msdosfs.ko...Reading symbols from
> > /boot/kernel/msdosfs.ko.symbols...done.
> > done.
> > Loaded symbols for /boot/kernel/msdosfs.ko
> > #0  0x0000000000000000 in ?? ()
> > (kgdb) frame 6
> > Error accessing memory address 0x0: Bad address.
> > (kgdb) pid 79759
> > Undefined command: "pid".  Try "help".
> > (kgdb) proc 79759
> > (kgdb) frame 6
> > #6  0xffffffff8043036b in vm_object_deallocate
> > (object=3D0xffffff0053024a90) at /var/src/sys/vm/vm_object.c:509
> > 509                                             pause("vmo_de", 1);
> > (kgdb) p *object
> > $1 = {mtx = {lock_object = {lo_name = 0xffffffff804f21c4 "vm object",
> > lo_type = 0xffffffff804f3018 "standard object", lo_flags = 21168128,
> > lo_witness_data = {
> >        lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, mtx_lock = 4,
> > mtx_recurse = 0}, object_list = {tqe_next = 0xffffff0005018a90,
> >    tqe_prev = 0xffffff00539a6850}, shadow_head = {lh_first =
> > 0xffffff005d3afa90}, shadow_list = {le_next = 0x0, le_prev =
> > 0xffffff005d2cd048}, memq = {
> >    tqh_first = 0xffffff007eb9fa58, tqh_last = 0xffffff007f864820}, root
> > = 0xffffff007ee14d38, size = 427, generation = 66, ref_count = 2,
> > shadow_count = 1,
> >  type = 0 '\0', flags = 256, pg_color = 0, paging_in_progress = 0,
> > resident_page_count = 44, backing_object = 0x0, backing_object_offset =
> > 0, pager_object_list = {
> >    tqe_next = 0x0, tqe_prev = 0x0}, cache = 0x0, handle = 0x0, un_pager
> > = {vnp = {vnp_size = 576646}, devp = {devp_pglist = {tqh_first = 0x8cc86,
> >        tqh_last = 0x0}}, swp = {swp_bcount = 576646}}}
> > (kgdb) p (object->shadow_head)
> > $2 = {lh_first = 0xffffff005d3afa90}
> > (kgdb) p *object->shadow_head.lh_first
> > $3 = {mtx = {lock_object = {lo_name = 0xffffffff804f21c4 "vm object",
> > lo_type = 0xffffffff804f3018 "standard object", lo_flags = 21168128,
> > lo_witness_data = {
> >        lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, mtx_lock = 4,
> > mtx_recurse = 0}, object_list = {tqe_next = 0xffffff0066c32340,
> >    tqe_prev = 0xffffff012f673ac0}, shadow_head = {lh_first = 0x0},
> > shadow_list = {le_next = 0x0, le_prev = 0xffffff0053024ad0}, memq = {
> >    tqh_first = 0xffffff007779f9a0, tqh_last = 0xffffff0077c04140}, root
> > = 0xffffff0077c04130, size = 387, generation = 3, ref_count = 1,
> > shadow_count = 0,
> >  type = 0 '\0', flags = 8452, pg_color = 0, paging_in_progress = 0,
> > resident_page_count = 2, backing_object = 0xffffff0053024a90,
> > backing_object_offset = 163840,
> >  pager_object_list = {tqe_next = 0x0, tqe_prev = 0x0}, cache = 0x0,
> > handle = 0x0, un_pager = {vnp = {vnp_size = 365278}, devp = {devp_pglist = {
> >        tqh_first = 0x592de, tqh_last = 0x0}}, swp = {swp_bcount = 365278}}}
> >
> >
> > >
> > >Another question: which scheduler do you use?
> > options         SCHED_4BSD              # 4BSD scheduler
> > options         PREEMPTION              # Enable kernel thread preemption
> The state of both the object being destroyed and the object that is shadowed
> looks right to me. Moreover, the shadowed object is not locked: the value
> of mtx_lock is 4. It seems as if the thread missed the wakeup
> owed by pause.
>
> John, could it be that the following commit is supposed to fix the issue?
>
> r179974 | jhb | 2008-06-24 22:36:33 +0300 (Tue, 24 Jun 2008) | 3 lines
>
> MFC: Change the roundrobin implementation in the 4BSD scheduler to trigger a
> userland preemption directly from hardclock() via sched_clock()

I don't think this would fix the issue.  This patch fixed problems where you
had a thread pinned to another CPU that held a lock (typically Giant) that a
callout handler run from softclock needed.  This prevented the 'roundrobin'
callout, which forces all the CPUs to do a context switch, from running
(normally this would have forced the pinned thread holding the lock to
eventually run).  That problem involves threads on the run queue not getting
to run, even though they may have a higher priority than what is running now.
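
To make that failure mode concrete, here is a toy userland model in C.  It is
only a sketch: the toy_* names and the single in-order callout queue are
assumptions made for illustration and are not taken from the kernel sources.
softclock is modelled as walking the queue in order and getting stuck at the
first handler whose lock is held by a thread that cannot run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one pending callout; not a kernel structure. */
struct toy_callout {
	const char *name;
	bool needs_giant;	/* handler must acquire Giant before running */
};

/*
 * Walk the queue in order, the way this sketch assumes softclock does.
 * If a handler needs Giant while a pinned, non-running thread holds it,
 * the walk blocks there and nothing queued behind it runs.
 * Returns the number of callouts that actually ran.
 */
size_t
toy_softclock(const struct toy_callout *q, size_t n, bool giant_held)
{
	size_t i, ran = 0;

	assert(q != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (q[i].needs_giant && giant_held)
			break;		/* stuck waiting for the lock holder */
		ran++;
	}
	return (ran);
}

/* Example queue: a Giant-needing handler sits ahead of 'roundrobin'. */
const struct toy_callout toy_queue[] = {
	{ "giant_handler", true },
	{ "roundrobin", false },
};
```

With giant_held set, toy_softclock(toy_queue, 2, true) runs nothing, so the
'roundrobin' entry never gets the chance to force the context switch that
would let the lock holder run.  With the lock free, both entries run.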

I think this case is still a lingering bug in the sleep queue code since the
thread lock stuff went in.  There have been several reports of it but I have
been unable to figure out how the wakeup is being missed.
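
The cause is not known at this point, so the following is only an illustration
of the general class of bug that "missing a wakeup" refers to, not a
description of what the sleep queue code actually does.  It is a
single-threaded toy model in C with hypothetical toy_* names: the wakeup
arrives before the sleeper has blocked, and whether that early wakeup is
remembered decides whether the sleeper hangs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sleeper state; not the kernel's struct thread. */
struct toy_thread {
	bool asleep;		/* thread has blocked */
	bool wakeup_pending;	/* a wakeup arrived before it blocked */
};

/* Deliver a wakeup; 'remember' selects whether an early one is recorded. */
void
toy_wakeup(struct toy_thread *t, bool remember)
{
	assert(t != NULL);
	if (t->asleep)
		t->asleep = false;		/* normal case: wake the sleeper */
	else if (remember)
		t->wakeup_pending = true;	/* early: leave a note */
	/* otherwise the early wakeup is silently dropped */
}

/* Block, unless a wakeup was already recorded for this thread. */
void
toy_block(struct toy_thread *t)
{
	assert(t != NULL);
	if (t->wakeup_pending) {
		t->wakeup_pending = false;
		return;
	}
	t->asleep = true;
}

/* The wakeup fires first, then the thread blocks.  True means it hangs. */
bool
toy_early_wakeup_hangs(bool remember)
{
	struct toy_thread t = { false, false };

	toy_wakeup(&t, remember);
	toy_block(&t);
	return (t.asleep);
}

/* The expected order: the thread blocks, then the wakeup arrives. */
bool
toy_normal_order_hangs(void)
{
	struct toy_thread t = { false, false };

	toy_block(&t);
	toy_wakeup(&t, false);
	return (t.asleep);
}
```

In this model toy_early_wakeup_hangs(false) leaves the thread asleep with no
further wakeup coming, which is the kind of state the stuck tcsh appears to be
in, while toy_early_wakeup_hangs(true) and toy_normal_order_hangs() both end
with the thread running again.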

> > >>>Also, show the output of ps axl <pid>.
> > >> UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
> > >>   0 79759 79758   0  96  0     0    16 -      DE+   p6    0:00,00
> > >>/bin/tcsh -fc
> > >>/meow/ports/editors/openoffice.org-3/work/BEB300_m3/solver/300/unxfbsdx.pro/bin/ma
> > >
> > >It makes sense to show the whole ps axl output.
> > See http://aldan.algebra.com/~mi/tmp/ps-axl.txt -- I edited it for
> > privacy a little bit, but process-states are intact.
> > The java-processes in the linuxf have remained unkillable for weeks now
> > -- I even forgot about them. But those are linuxulator problems, whereas
> > the tcsh is native...
> It seems that pid 63930 is problematic too?
>



-- 
John Baldwin


