From: John Baldwin
To: Yuri
Cc: Alan Cox, freebsd-hackers@freebsd.org
Subject: Re: Kernel crashes after sleep: how to debug?
Date: Fri, 19 Jul 2013 17:04:47 -0400

On Friday, July 19, 2013 3:32:43 pm Yuri wrote:
> On 07/19/2013 08:00, John Baldwin wrote:
> > Well, you can probably find the value of 'm' in a register if you look
> > at the disassembly around the fault. You can then cast that pointer to
> > the right type and print its contents.
>
> Here is the value of *m in frame 8:
> (kgdb) p *(struct vm_page*)0xfffffe00b460abf8
> $3 = {pageq = {tqe_next = 0xfe26, tqe_prev = 0xfffffe00b5a124d8},
>   listq = {tqe_next = 0xfffffe0081ad8f70, tqe_prev = 0xfffffe0081ad8f78},
>   left = 0x6, right = 0xd00000201, object = 0x100000000,
>   pindex = 4294901765, phys_addr = 18446741877712530608,
>   md = {pv_list = {tqh_first = 0xfffffe00b460abc0,
>       tqh_last = 0xfffffe00b5579020}, pat_mode = -1268733096},
>   queue = 72 'H', segind = -85 '\253', hold_count = -19360,
>   order = 0 '\0', pool = 254 '\376', cow = 65535, wire_count = 0,
>   aflags = 0 '\0', flags = 0 '\0', oflags = 0, act_count = 0 '\0',
>   busy = 176 '\260', valid = 208 '\320', dirty = 126 '~'}

Hmm, that definitely looks like garbage.  How are you with gdb scripting?
You could write a script that walks the PQ_ACTIVE queue and checks whether
this pointer ends up in there.  If it does, it would be interesting to see
whether the previous page's next pointer is corrupted.  If instead this
page's pageq.tqe_prev refers back to the previous page, then it could be
that this vm_page structure itself has been stomped on.

Ultimately I think you will need to look at any malloc/VM/page operations
done in the suspend and resume paths to see where this happens.  It might
be slightly easier if the same page gets trashed every time, as you could
then print out the relevant field periodically during suspend and resume
to narrow down where the breakage occurs.

-- 
John Baldwin
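
The queue walk suggested above can be sketched roughly as follows. This is only a model in Python, not a kgdb script: it does not read kernel memory, and the names Node, enqueue and check_suspect are made up for illustration. It also simplifies the TAILQ layout: a real tqe_prev points at the previous element's tqe_next field (or at the head's tqh_first), whereas here `prev` simply refers to the node that should be holding us in its `next` slot. The logic is the part that would carry over to a real script: walk forward from the head, stop when the suspect pointer shows up, and compare its back link with the node you arrived from.

```python
class Node:
    """Models either the queue head or one vm_page's pageq linkage."""

    def __init__(self, name):
        self.name = name
        self.next = None  # models tqh_first (head) or pageq.tqe_next (page)
        self.prev = None  # models pageq.tqe_prev (simplified, see lead-in)


def enqueue(head, page):
    """Append page at the tail, keeping forward and back links in sync."""
    tail = head
    while tail.next is not None:
        tail = tail.next
    tail.next = page
    page.prev = tail


def check_suspect(head, suspect, max_steps=100000):
    """Walk forward from head looking for suspect.

    Returns one of:
      "back_link_ok"       suspect is reachable and its back link agrees
                           with the node we arrived from, so garbage in
                           its other fields suggests the structure itself
                           was overwritten
      "back_link_mismatch" suspect is reachable but was never linked after
                           its predecessor, so the predecessor's next
                           pointer is the likelier victim
      "not_on_queue"       the walk hit the end without seeing suspect
      "gave_up"            step limit reached (a corrupted list may loop)
    """
    holder = head
    for _ in range(max_steps):
        elm = holder.next
        if elm is None:
            return "not_on_queue"
        if elm is suspect:
            if elm.prev is holder:
                return "back_link_ok"
            return "back_link_mismatch"
        holder = elm
    return "gave_up"


if __name__ == "__main__":
    head = Node("head")
    pages = [Node("p%d" % i) for i in range(3)]
    for p in pages:
        enqueue(head, p)
    print(check_suspect(head, pages[1]))

    # Simulate a predecessor whose next pointer was overwritten.
    stray = Node("stray")
    pages[0].next = stray
    print(check_suspect(head, stray))
```

The step limit is there because a trashed next pointer can just as easily create a cycle as point off the queue, and a walker without a bound would then never return.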