Date: Mon, 22 Jul 2013 12:17:03 -0400
From: John Baldwin <jhb@freebsd.org>
To: Yuri <yuri@rawbw.com>
Cc: Alan Cox <alc@freebsd.org>, freebsd-hackers@freebsd.org
Subject: Re: Kernel crashes after sleep: how to debug?
Message-ID: <201307221217.03525.jhb@freebsd.org>
In-Reply-To: <51E9F2EF.6000908@rawbw.com>
References: <51E3A334.8020203@rawbw.com> <201307191704.47622.jhb@freebsd.org> <51E9F2EF.6000908@rawbw.com>
On Friday, July 19, 2013 10:16:15 pm Yuri wrote:
> On 07/19/2013 14:04, John Baldwin wrote:
> > Hmm, that definitely looks like garbage.  How are you with gdb scripting?
> > You could write a script that walks the PQ_ACTIVE queue and see if this
> > pointer ends up in there.  It would then be interesting to see if the
> > previous page's next pointer is corrupted, or if the pageq.tqe_prev references
> > that page then it could be that this vm_page structure has been stomped on
> > instead.
>
> As you suggested, I printed the list of pages.  Actually, iteration in
> frame 8 goes through PQ_INACTIVE pages, so I printed those.
> <...skipped...>
> ### page#2245 ###
> $4492 = (struct vm_page *) 0xfffffe00b5a27658
> $4493 = {pageq = {tqe_next = 0xfffffe00b5a124d8, tqe_prev = 0xfffffe00b5b79038},
>   listq = {tqe_next = 0x0, tqe_prev = 0xfffffe00b5a276e0},
>   left = 0x0, right = 0x0, object = 0xfffffe005e3f7658, pindex = 5,
>   phys_addr = 1884901376, md = {pv_list = {tqh_first = 0xfffffe005e439ce8,
>       tqh_last = 0xfffffe00795eacc0}, pat_mode = 6}, queue = 0 '\0',
>   segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
>   cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 0,
>   act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2246 ###
> $4494 = (struct vm_page *) 0xfffffe00b5a124d8
> $4495 = {pageq = {tqe_next = 0xfffffe00b460abf8, tqe_prev = 0xfffffe00b5a27658},
>   listq = {tqe_next = 0x0, tqe_prev = 0xfffffe005e3f7cf8},
>   left = 0x0, right = 0x0, object = 0xfffffe005e3f7cb0, pindex = 1,
>   phys_addr = 1881952256, md = {pv_list = {tqh_first = 0xfffffe005e42dd48,
>       tqh_last = 0xfffffe007adb03a8}, pat_mode = 6}, queue = 0 '\0',
>   segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
>   cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 0,
>   act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2247 ###
> $4496 = (struct vm_page *) 0xfffffe00b460abf8
> $4497 = {pageq = {tqe_next = 0xfe26, tqe_prev = 0xfffffe00b5a124d8},
>   listq = {tqe_next = 0xfffffe0081ad8f70, tqe_prev = 0xfffffe0081ad8f78},
>   left = 0x6, right = 0xd00000201, object = 0x100000000, pindex = 4294901765,
>   phys_addr = 18446741877712530608, md = {pv_list = {
>       tqh_first = 0xfffffe00b460abc0, tqh_last = 0xfffffe00b5579020},
>     pat_mode = -1268733096}, queue = 72 'H', segind = -85 '\253',
>   hold_count = -19360, order = 0 '\0', pool = 254 '\376', cow = 65535,
>   wire_count = 0, aflags = 0 '\0', flags = 0 '\0', oflags = 0,
>   act_count = 0 '\0', busy = 176 '\260', valid = 208 '\320', dirty = 126 '~'}
> ### page#2248 ###
> $4498 = (struct vm_page *) 0xfe26
>
> Page #2247 is the same one that caused the problem in frame 8.  Its tqe_next
> is apparently invalid, so iteration stopped here.
> It appears that this structure has been stomped on.  This page is
> probably supposed to be a valid inactive page.

Yes, its phys_addr is also way off.  I think you might even be able to
figure out which phys_addr it is supposed to have based on the virtual
address (see PHYS_TO_VM_PAGE() in vm/vm_page.c) by using the vm_page
address and phys_addr of the prior entries to establish the relative
offset.  It is certainly a page "earlier" in the array.

> > Ultimately I think you will need to look at any malloc/VM/page operations
> > done in the suspend and resume paths to see where this happens.  It might
> > be slightly easier if the same page gets trashed every time as you could
> > print out the relevant field periodically during suspend and resume to
> > narrow down where the breakage occurs.
>
> I am thinking of putting code that walks through all the page queues and
> verifies that they are not damaged in this way into the path where each
> device wakes up from sleep.
> dev/acpica/acpi.c has acpi_EnterSleepState(), which, as I understand it,
> contains the top-level code for S3 sleep.  Before sleep it invokes the
> 'power_suspend' event on all devices, and after sleep it invokes
> 'power_resume' on them.  So maybe I will call the page check procedure
> after 'power_suspend' and 'power_resume'.
>
> But it is possible that memory gets damaged somewhere else after
> power_resume happens.
> Do you have any thoughts/suggestions?

Well, I think you should try what you've suggested above first.  If that
doesn't narrow it down then we can brainstorm some other places to
inspect.

-- 
John Baldwin