Date: Mon, 22 Jul 2013 12:17:03 -0400
From: John Baldwin <jhb@freebsd.org>
To: Yuri <yuri@rawbw.com>
Cc: Alan Cox <alc@freebsd.org>, freebsd-hackers@freebsd.org
Subject: Re: Kernel crashes after sleep: how to debug?
Message-ID: <201307221217.03525.jhb@freebsd.org>
In-Reply-To: <51E9F2EF.6000908@rawbw.com>
References: <51E3A334.8020203@rawbw.com> <201307191704.47622.jhb@freebsd.org> <51E9F2EF.6000908@rawbw.com>
On Friday, July 19, 2013 10:16:15 pm Yuri wrote:
> On 07/19/2013 14:04, John Baldwin wrote:
> > Hmm, that definitely looks like garbage.  How are you with gdb scripting?
> > You could write a script that walks the PQ_ACTIVE queue and see if this
> > pointer ends up in there.  It would then be interesting to see if the
> > previous page's next pointer is corrupted, or if the pageq.tqe_prev references
> > that page then it could be that this vm_page structure has been stomped on
> > instead.
>
> As you suggested, I printed the list of pages.  Actually, iteration in
> frame 8 goes through PQ_INACTIVE pages, so I printed those.
> <...skipped...>
> ### page#2245 ###
> $4492 = (struct vm_page *) 0xfffffe00b5a27658
> $4493 = {pageq = {tqe_next = 0xfffffe00b5a124d8, tqe_prev = 0xfffffe00b5b79038},
>   listq = {tqe_next = 0x0, tqe_prev = 0xfffffe00b5a276e0},
>   left = 0x0, right = 0x0, object = 0xfffffe005e3f7658, pindex = 5,
>   phys_addr = 1884901376, md = {pv_list = {tqh_first = 0xfffffe005e439ce8,
>       tqh_last = 0xfffffe00795eacc0}, pat_mode = 6}, queue = 0 '\0',
>   segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
>   cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 0,
>   act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2246 ###
> $4494 = (struct vm_page *) 0xfffffe00b5a124d8
> $4495 = {pageq = {tqe_next = 0xfffffe00b460abf8, tqe_prev = 0xfffffe00b5a27658},
>   listq = {tqe_next = 0x0, tqe_prev = 0xfffffe005e3f7cf8},
>   left = 0x0, right = 0x0, object = 0xfffffe005e3f7cb0, pindex = 1,
>   phys_addr = 1881952256, md = {pv_list = {tqh_first = 0xfffffe005e42dd48,
>       tqh_last = 0xfffffe007adb03a8}, pat_mode = 6}, queue = 0 '\0',
>   segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
>   cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 0,
>   act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2247 ###
> $4496 = (struct vm_page *) 0xfffffe00b460abf8
> $4497 = {pageq = {tqe_next = 0xfe26, tqe_prev = 0xfffffe00b5a124d8},
>   listq = {tqe_next = 0xfffffe0081ad8f70, tqe_prev = 0xfffffe0081ad8f78},
>   left = 0x6, right = 0xd00000201, object = 0x100000000, pindex = 4294901765,
>   phys_addr = 18446741877712530608, md = {pv_list = {
>       tqh_first = 0xfffffe00b460abc0, tqh_last = 0xfffffe00b5579020},
>     pat_mode = -1268733096}, queue = 72 'H', segind = -85 '\253',
>   hold_count = -19360, order = 0 '\0', pool = 254 '\376', cow = 65535,
>   wire_count = 0, aflags = 0 '\0', flags = 0 '\0', oflags = 0,
>   act_count = 0 '\0', busy = 176 '\260', valid = 208 '\320', dirty = 126 '~'}
> ### page#2248 ###
> $4498 = (struct vm_page *) 0xfe26
>
> Page #2247 is the same one that caused the problem in frame 8.  Its tqe_next
> is apparently invalid, so iteration stopped here.
> It appears that this structure has been stomped on.  This page is
> probably supposed to be a valid inactive page.

Yes, its phys_addr is also way off.  I think you might even be able to
figure out which phys_addr it is supposed to have based on the virtual
address (see PHYS_TO_VM_PAGE() in vm/vm_page.c) by using the vm_page
address and phys_addr of the prior entries to establish the relative
offset.  It is certainly a page "earlier" in the array.

> > Ultimately I think you will need to look at any malloc/VM/page operations
> > done in the suspend and resume paths to see where this happens.  It might
> > be slightly easier if the same page gets trashed every time as you could
> > print out the relevant field periodically during suspend and resume to
> > narrow down where the breakage occurs.
>
> I am thinking of putting code that walks through all the page queues and
> verifies that they are not damaged in this way into the path where each
> device wakes up from sleep.
> dev/acpica/acpi.c has acpi_EnterSleepState(), which, as I understand it,
> contains the top-level code for S3 sleep.  Before sleep it invokes the
> 'power_suspend' event on all devices, and after sleep it invokes
> 'power_resume' on them.  So maybe I will call the page check procedure
> after 'power_suspend' and 'power_resume'.
>
> But it is possible that memory gets damaged somewhere else after
> power_resume happens.
> Do you have any thoughts/suggestions?

Well, I think you should try what you've suggested above first.  If that
doesn't narrow it down then we can brainstorm some other places to
inspect.

-- 
John Baldwin