Date: Thu, 12 Nov 2009 19:59:32 +0100
From: Kai Gallasch <gallasch@free.de>
To: freebsd-current@freebsd.org
Subject: Re: 8.0RC2 amd64 - kernel panic running make buildworld
Message-ID: <20091112195932.5875387e@orwell.free.de>
In-Reply-To: <200911111504.14906.jhb@freebsd.org>
References: <1031257439203@webmail57.yandex.ru> <hdc73v$4rt$1@ger.gmane.org> <941257966918@webmail42.yandex.ru> <200911111504.14906.jhb@freebsd.org>

On Wed, 11 Nov 2009 15:04:14 -0500, John Baldwin <jhb@freebsd.org> wrote:

> On Wednesday 11 November 2009 2:15:18 pm S.N.Grigoriev wrote:
> >
> > 10.11.09, 09:15, "Mark Atkinson" <atkin901@yahoo.com> wrote:
> >
> > > Andriy Gapon wrote:
> > > > on 10/11/2009 17:22 gary.jennejohn@freenet.de said the
> > > > following:
> > > > Not a trivial issue unless it is hardware indeed.
> > > >
> > > Also, you can try adding:
> > > hw.mca.enabled="1" in /boot/loader.conf, reboot, and then see if
> > > there is a machine check exception on the console during the
> > > buildworld.
> >
> > Mark,
> >
> > I've added hw.mca.enabled="1" in /boot/loader.conf and got the
> > following screen during the buildworld:
> >
> > .....
> > -c /usr/src/gnu/usr.bin/binutils/as/../../../../contrib/binutils/gas/sb.c
> >
> > MCA: CPU3 UNCOR PCC OVER DTLIB L1 error
> > MCA: Address 0x8015fb000
>
> Your hardware is broken and it is telling you so. You have had
> multiple machine checks with the most severe one being an
> uncorrectable error in your data TLB (i.e. in the CPU itself).

John,

I also set hw.mca.enabled="1" and vm.pmap.pg_ps_enabled="1" in
/boot/loader.conf on my Opteron ProLiant server, which reboots
spontaneously under load. The server was upgraded to
FreeBSD 8.0-PRERELEASE today. This is what happened:

---- machine check trap, first run ----

sonnenkraft:/usr/obj # MCA: CPU 5 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x80e5c8000

Fatal trap 28: machine check trap while in user mode
cpuid = 5; apic id = 05
instruction pointer     = 0x43:0x691688
stack pointer           = 0x3b:0x7fffffffdf90
frame pointer           = 0x3b:0x6a2
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, IOPL = 0
current process         = 29319 (cc1)
[thread pid 29319 tid 100086 ]
Stopped at      0x691688:       leal    0x1(%rax),%edx
db> where
Tracing pid 29319 tid 100086 td 0xffffff000e065390
WAKEUP_cpu() at 0x691688
*** error reading from address 6aa ***
db> bt
Tracing pid 29319 tid 100086 td 0xffffff000e065390
WAKEUP_cpu() at 0x691688
*** error reading from address 6aa ***
db> call doadump
Cannot dump. Device not defined or unavailable.
= 0x30

---- machine check trap, second run - this time with dumpdev defined ----

sonnenkraft:~ # MCA: CPU 2 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x8011d3000

Fatal trap 28: machine check trap while in user mode
cpuid = 2; apic id = 02
instruction pointer     = 0x43:0x6b1241
stack pointer           = 0x3b:0x7fffffffe200
frame pointer           = 0x3b:0x7fffffffe240
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, IOPL = 0
current process         = 69498 (cc1)
[thread pid 69498 tid 100338 ]
Stopped at      0x6b1241:       call    0x6af140
db> where
Tracing pid 69498 tid 100338 td 0xffffff000ef75720
WAKEUP_cpu() at 0x6b1241
db> bt
Tracing pid 69498 tid 100338 td 0xffffff000ef75720
WAKEUP_cpu() at 0x6b1241
db> call doadump
Physical memory: 20462 MB
Dumping 2303 MB:
  2288 2272 2256 2240 2224 2208 2192 2176 2160 2144 2128 2112 2096 2080 2064 2048
  2032 2016 2000 1984 1968 1952 1936 1920 1904 1888 1872 1856 1840 1824 1808 1792
  1776 1760 1744 1728 1712 1696 1680 1664 1648 1632 1616 1600 1584 1568 1552 1536
  1520 1504 1488 1472 1456 1440 1424 1408 1392 1376 1360 1344 1328 1312 1296 1280
  1264 1248 1232 1216 1200 1184 1168 1152 1136 1120 1104 1088 1072 1056 1040 1024
  1008 992 976 960 944 928 912 896 880 864 848 832 816 800 784 768
  752 736 720 704 688 672 656 640 624 608 592 576 560 544 528 512
  496 480 464 448 432 416 400 384 368 352 336 320 304 288 272 256
  240 224 208 192 176 160 144 128 112 96 80 64 48 32 16
Dump complete
= 0
db> reboot
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 2

---- machine check trap, third run - BIOS: static low power mode enabled, to rule out power/heat issue ----

sonnenkraft:~ # MCA: CPU 4 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x8011fd000

Fatal trap 28: machine check trap while in user mode
cpuid = 4; apic id = 04
instruction pointer     = 0x43:0x76127d
stack pointer           = 0x3b:0x7fffffffe068
frame pointer           = 0x3b:0x7fffffffe090
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, IOPL = 0
current process         = 73135 (cc1)
[thread pid 73135 tid 100146 ]
Stopped at      0x76127d:       xorl    %edx,%edx
db> where
Tracing pid 73135 tid 100146 td 0xffffff00071caab0
WAKEUP_cpu() at 0x76127d
db> bt
Tracing pid 73135 tid 100146 td 0xffffff00071caab0
WAKEUP_cpu() at 0x76127d
db> call doadump
Physical memory: 20462 MB
Dumping 2335 MB:
  2320 2304 2288 2272 2256 2240 2224 2208 2192 2176 2160 2144 2128 2112 2096 2080
  2064 2048 2032 2016 2000 1984 1968 1952 1936 1920 1904 1888 1872 1856 1840 1824
  1808 1792 1776 1760 1744 1728 1712 1696 1680 1664 1648 1632 1616 1600 1584 1568
  1552 1536 1520 1504 1488 1472 1456 1440 1424 1408 1392 1376 1360 1344 1328 1312
  1296 1280 1264 1248 1232 1216 1200 1184 1168 1152 1136 1120 1104 1088 1072 1056
  1040 1024 1008 992 976 960 944 928 912 896 880 864 848 832 816 800
  784 768 752 736 720 704 688 672 656 640 624 608 592 576 560 544
  528 512 496 480 464 448 432 416 400 384 368 352 336 320 304 288
  272 256 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16
Dump complete
= 0
db> reboot
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 4

---- END: ----

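To compare the three runs at a glance, the MCA lines can be pulled out of a saved console log. The following is only a minimal sketch: the regular expression is an assumption based on the few "MCA: CPU ..." lines quoted in this thread, not on the kernel's full output format, and the function name summarize_mca is hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: matches lines such as
# "MCA: CPU 5 UNCOR PCC OVER DTLB L1 error" (also "CPU3" without a space).
# It is derived only from the lines quoted in this thread.
MCA_LINE = re.compile(r"MCA: CPU\s*(\d+)\s+(.*?)\s+error")


def summarize_mca(log_text):
    """Return (per-CPU event counts, list of (cpu, description)) tuples."""
    events = [(int(m.group(1)), m.group(2)) for m in MCA_LINE.finditer(log_text)]
    return Counter(cpu for cpu, _ in events), events


# The MCA lines from the three runs above.
sample = """\
MCA: CPU 5 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x80e5c8000
MCA: CPU 2 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x8011d3000
MCA: CPU 4 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x8011fd000
"""

counts, events = summarize_mca(sample)
for cpu, desc in events:
    print(f"CPU {cpu}: {desc}")
```

Run against the three traps above, this lists the same error description each time, reported on CPU 5, CPU 2 and CPU 4 respectively.
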
Which hardware parts are defective and need replacement: CPU, memory,
or mainboard?

I now have two vmcores plus the crashinfo core.txt available on the
server. Are they of any use for getting further information?

--Kai.

-- 
Draft beer, not people.