Date: Mon, 24 Sep 2001 12:14:04 -0700 (PDT)
From: Matt Dillon
Message-Id: <200109241914.f8OJE4l95477@earth.backplane.com>
To: hackers@freebsd.org
Cc: Alfred Perlstein, Bruce Evans, Poul-Henning Kamp, Julian Elischer
Subject: VM Corruption - stumped, anyone have any ideas?

    A number of people have been seeing these on STABLE:

	panic: vm_page_remove(): page not found in hash

    They appear to be reproducible after a random period of time on
    certain machines.  I tracked the problem down to corruption in the
    vm_page_array, but I cannot figure out what is causing the corruption.

    I would appreciate it if people could look at the following structural
    and hex dump of a corrupted vm_page_t.  Does anyone recognize the
    subsystem the data is coming from?

    This is stumping me, and it appears to be rather serious.  I can't
    reproduce it myself.  The only other hint I have is from Mike Tancsa's
    messing around... when he bumps up the number of Apache children
    forked, the problem appears to be easier to trigger.  This also occurs
    on some Yahoo boxes.  I don't think it's bad memory.
						-Matt

$8 = 58630
(kgdb) print vm_page_buckets[$8]
$9 = (struct vm_page *) 0xc08428cc
(kgdb) print *vm_page_buckets[$8]
$10 = {pageq = {tqe_next = 0xd715000, tqe_prev = 0x1}, hnext = 0xc0e26a34,
  listq = {tqe_next = 0xc0e26a3c, tqe_prev = 0xb00000}, object = 0x10015,
  pindex = 0, phys_addr = 255, md = {pv_list_count = -1066105816, pv_list = {
      tqh_first = 0xc09616b0, tqh_last = 0x0}}, queue = 0, flags = 0,
  pc = 5820, wire_count = 49302, hold_count = 9152, act_count = 151 '\227',
  busy = 214 '\xd6', valid = 1 '\001', dirty = 0 '\000'}

    tqe_prev is garbage.  phys_addr is garbage.  It's almost all garbage.
    The question is: how did it become garbage?  The vm_page_t is a valid
    page in the preallocated vm_page_array[].  The VM system is physically
    incapable of corrupting a vm_page_t this badly.

(kgdb) print vm_page_array_size
$16 = 130743
(kgdb) print m
$17 = 0xc0842acc
(kgdb) print m - vm_page_array
$18 = 55069
(kgdb) print &vm_page_array[55069]
$19 = (struct vm_page *) 0xc0842acc
(kgdb)
0xc08428cc:	0x0d715000	0x00000001	0xc0e26a34	0xc0e26a3c
0xc08428dc:	0x00b00000	0x00010015	0x00000000	0x000000ff
0xc08428ec:	0xc0748428	0xc09616b0	0x00000000	0x00000000
0xc08428fc:	0xc09616bc	0xd69723c0	0x00000001	0x0d716000
0xc084290c:	0x00000000	0x00000000	0xc0842910	0x00800022
0xc084291c:	0x00000016	0x00050000	0x000000ff	0xc0909564
0xc084292c:	0xc0921aec	0x00000000	0x00000000	0xd696aaf8
0xc084293c:	0xd696aae0	0x00000000	0x0d717000	0x00000000
0xc084294c:	0x00000000	0xc084294c	0x00800022	0x00000017
0xc084295c:	0x00050000	0x000000ff	0xc08d3e64	0xc0b6fde4
0xc084296c:	0x00000000	0xc0998430	0xd706ea98	0x00000000
0xc084297c:	0x00000010	0x0d718000	0x00000000	0x00000000
0xc084298c:	0xc0842988	0x00c00019	0x00000018	0x00050000
0xc084299c:	0x00000000	0xc09d9f5c	0xc0691764	0x00000000
0xc08429ac:	0xc083244c	0xc0848fa0	0xc02a6740	0x000001db

						-Matt
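
One way to look for structure in a dump like the one above is to check whether any values recur at a fixed interval. The Python sketch below is not part of the original report. It parses the hex dump, picks out words that are nonzero, page-aligned, and exactly one page apart in value, and prints the byte distance between their addresses. The 0x1000 page size is an assumption (4 KB pages on i386), and parse_dump() and find_page_runs() are hypothetical helper names, not gdb or kernel interfaces.

```python
# Sketch: look for fixed-stride structure in the kgdb hex dump from the message.
# parse_dump() and find_page_runs() are illustrative names, not gdb/kernel APIs.

DUMP = """\
0xc08428cc: 0x0d715000 0x00000001 0xc0e26a34 0xc0e26a3c
0xc08428dc: 0x00b00000 0x00010015 0x00000000 0x000000ff
0xc08428ec: 0xc0748428 0xc09616b0 0x00000000 0x00000000
0xc08428fc: 0xc09616bc 0xd69723c0 0x00000001 0x0d716000
0xc084290c: 0x00000000 0x00000000 0xc0842910 0x00800022
0xc084291c: 0x00000016 0x00050000 0x000000ff 0xc0909564
0xc084292c: 0xc0921aec 0x00000000 0x00000000 0xd696aaf8
0xc084293c: 0xd696aae0 0x00000000 0x0d717000 0x00000000
0xc084294c: 0x00000000 0xc084294c 0x00800022 0x00000017
0xc084295c: 0x00050000 0x000000ff 0xc08d3e64 0xc0b6fde4
0xc084296c: 0x00000000 0xc0998430 0xd706ea98 0x00000000
0xc084297c: 0x00000010 0x0d718000 0x00000000 0x00000000
0xc084298c: 0xc0842988 0x00c00019 0x00000018 0x00050000
0xc084299c: 0x00000000 0xc09d9f5c 0xc0691764 0x00000000
0xc08429ac: 0xc083244c 0xc0848fa0 0xc02a6740 0x000001db
"""

PAGE_SIZE = 0x1000  # assumption: 4 KB pages (i386)


def parse_dump(text):
    """Map each dumped address to the 32-bit word stored there."""
    words = {}
    for line in text.splitlines():
        addr, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not an "address: words" line
        base = int(addr, 16)
        for i, tok in enumerate(rest.split()):
            words[base + 4 * i] = int(tok, 16)
    return words


def find_page_runs(words, step=PAGE_SIZE):
    """Return (address, value) pairs whose value is nonzero, a multiple of
    `step`, and has another dumped value exactly `step` above or below it."""
    values = set(words.values())
    return [
        (addr, val)
        for addr, val in sorted(words.items())
        if val and val % step == 0
        and (val + step in values or val - step in values)
    ]


words = parse_dump(DUMP)
hits = find_page_runs(words)
# Byte distance between the addresses of consecutive hits.
strides = [b[0] - a[0] for a, b in zip(hits, hits[1:])]

for addr, val in hits:
    print(f"{addr:#010x}: {val:#010x}")
print("strides (bytes):", strides)
```

Run against the dump in this message, the sketch picks out 0x0d715000 through 0x0d718000, each 60 bytes (15 words) after the previous one. That is also the number of words the fields in the kgdb structure print appear to span on this 32-bit kernel, though this is only an observation about the data and does not by itself identify what wrote it.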