From owner-freebsd-current@FreeBSD.ORG Sat Oct 8 20:37:17 2011 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 41CE9106566B for ; Sat, 8 Oct 2011 20:37:17 +0000 (UTC) (envelope-from shuvaev@physik.uni-wuerzburg.de) Received: from mailrelay.rz.uni-wuerzburg.de (wrz3028.rz.uni-wuerzburg.de [132.187.3.28]) by mx1.freebsd.org (Postfix) with ESMTP id E6BA08FC13 for ; Sat, 8 Oct 2011 20:37:16 +0000 (UTC) Received: from virusscan.mail (localhost [127.0.0.1]) by mailrelay.mail (Postfix) with ESMTP id A914F5AEE1; Sat, 8 Oct 2011 22:14:58 +0200 (CEST) Received: from localhost (localhost [127.0.0.1]) by virusscan.mail (Postfix) with ESMTP id A6D055AEDE; Sat, 8 Oct 2011 22:14:58 +0200 (CEST) X-Virus-Scanned: by amavisd-new at uni-wuerzburg.de Received: from mail.physik.uni-wuerzburg.de (wthp192.physik.uni-wuerzburg.de [132.187.40.192]) by mailmaster.uni-wuerzburg.de (Postfix) with ESMTP id 7E2445CC67; Sat, 8 Oct 2011 22:14:58 +0200 (CEST) Received: from lexx.ifp.tuwien.ac.at ([128.131.127.223]) by mail.physik.uni-wuerzburg.de (Lotus Domino Release 8.5.2FP3HF61) with ESMTP id 2011100822145781-98869 ; Sat, 8 Oct 2011 22:14:57 +0200 Date: Sat, 8 Oct 2011 22:14:56 +0200 From: Alexey Shuvaev To: freebsd-current@freebsd.org Message-ID: <20111008201456.GA3529@lexx.ifp.tuwien.ac.at> MIME-Version: 1.0 Organization: Universitaet Wuerzburg User-Agent: Mutt/1.5.21 (2010-09-15) X-MIMETrack: Itemize by SMTP Server on domino1/uni-wuerzburg(Release 8.5.2FP3HF61 | August 2, 2011) at 10/08/2011 10:14:57 PM, Serialize by Router on domino1/uni-wuerzburg(Release 8.5.2FP3HF61 | August 2, 2011) at 10/08/2011 10:14:58 PM, Serialize complete at 10/08/2011 10:14:58 PM Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Cc: Alexander Motin Subject: Panics after AHCI timeouts X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 08 Oct 2011 20:37:17 -0000 Hello list! In the view of upcoming RELEASE-9.0 I should have reported it earlier, but it is better later than never... Every time I wanted to report this, the system was ~one month old and I tried to upgrade it to see, if the problem was still there, waiting for the next panic... and when it finally paniced it was one month old again. For some time (around half a year actually :) under some heavy disk load my desktop panics. The panics are not 100% reproducible, two situations which could lead to a panic are: - svn up in /usr/src - last stage of openoffice building (I think, during the packaging stage) Updating the sources is less probable to panic the system (I would give ~30% of probability), but building OOO is close to 100% to cause the panic. The root cause seems to be the Samsung hard drive in use - it times out on some NCQ slots under heavy load. Same problem is reported, for example here: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/157397 Earlier this year (I would say March - May), AHCI handled these timeouts, at least to the point when the disk (and, possibly, the controlled too) were completely wedged (only power cycling had brought them back again, reset was not sufficient). From ~June, however, the system paniced right after the first timeout. So it is this panic immediately after ahci timeout which could be hopefully fixed. Some info about the system: ~> uname -a FreeBSD lexx.ifp.tuwien.ac.at 9.0-BETA2 FreeBSD 9.0-BETA2 #0 r225311: Thu Sep 1 17:17:57 CEST 2011 root@lexx.ifp.tuwien.ac.at:/usr/obj/usr/src/sys/GENERIC amd64 >From dmesg.boot: [snip] ahci0: port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe5ffc00-0xfe5fffff irq 19 at device 17.0 on pci0 ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported ahcich0: at channel 0 on ahci0 ahcich1: at channel 1 on ahci0 ahcich2: at channel 2 on ahci0 ahcich3: at channel 3 on ahci0 ahcich4: at channel 4 on ahci0 ahcich5: at channel 5 on ahci0 [snip] ada0 at ahcich0 bus 0 scbus1 target 0 lun 0 ada0: ATA-8 SATA 2.x device ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada0: Command Queueing enabled ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C) ada0: Previously was known as ad4 [snip] ~> cat /etc/rc.local #!/bin/sh ddb script kdb.enter.panic="capture on; show pcpu; trace; ps; show locks; alltrace; show alllocks; show lockedvnods; call doadump; reset" sysctl debug.debugger_on_panic=0 I have two cores: ~> ll /var/crash/ total 2628524 -rw-r--r-- 1 root wheel 2 Oct 7 15:12 bounds -rw-r----- 1 root wheel 224897 Aug 5 18:07 core.txt.3 -rw-r----- 1 root wheel 151478 Oct 7 15:13 core.txt.5 -rw-r----- 1 root wheel 475 Aug 5 18:06 info.3 -rw-r----- 1 root wheel 478 Oct 7 15:12 info.5 -rw-r--r-- 1 root wheel 5 Jan 26 2011 minfree -rw-r----- 1 root wheel 1390616576 Aug 5 18:07 vmcore.3 -rw-r----- 1 root wheel 1371832320 Oct 7 15:13 vmcore.5 However, I do not have the old kernel for the core N 3 anymore (but the current kernel is the one which produced core N 5). >From core.txt.3: [snip] Unread portion of the kernel message buffer: Fatal trap 9: general protection fault while in kernel mode cpuid = 0; apic id = 00 instruction pointer = 0x20:0xffffffff8092988e stack pointer = 0x28:0xffffff8000256a80 frame pointer = 0x28:0xffffff8000256aa0 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 12 (swi4: clock) trap number = 9 panic: general protection fault cpuid = 0 VNASSERT failed Uptime: 0xfffffe0073591780: 3dtag ufs, type VDIR 7h59m usecount 1, writecount 0, refcount 4 mountedhere 0 45s flags () v_object 0xfffffe00888501b0 ref 0 pages 1 lock type ufs: UNLOCKED ino 2709060, on dev ada0p6 VNASSERT failed 0xfffffe0073591780: tag ufs, type VDIR usecount 1, writecount 0, refcount 4 mountedhere 0 flags () v_object 0xfffffe00888501b0 ref 0 pages 1 lock type ufs: UNLOCKED ino 2709060, on dev ada0p6 Dumping 1326 out of 7914 MB:..2%..11%..21%..31%..42%..51%..61%..72%..81%..91% [snip] #0 doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:252 252 if (textdump && textdump_pending) { (kgdb) #0 doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:252 #1 0xffffffff8081f3ca in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:430 #2 0xffffffff8081ee61 in panic (fmt=Variable "fmt" is not available. ) at /usr/src/sys/kern/kern_shutdown.c:595 #3 0xffffffff80b04b70 in trap_fatal (frame=0x9, eva=Variable "eva" is not available. ) at /usr/src/sys/amd64/amd64/trap.c:805 #4 0xffffffff80b05145 in trap (frame=0xffffff80002569d0) at /usr/src/sys/amd64/amd64/trap.c:616 #5 0xffffffff80aef91f in calltrap () at /usr/src/sys/amd64/amd64/exception.S:228 #6 0xffffffff8092988e in igmp_slowtimo () at /usr/src/sys/netinet/igmp.c:2088 #7 0xffffffff8088648b in pfslowtimo (arg=0xffffffff8130b440) at /usr/src/sys/kern/uipc_domain.c:518 #8 0xffffffff80832e0a in softclock (arg=Variable "arg" is not available. ) at /usr/src/sys/kern/kern_timeout.c:564 #9 0xffffffff807f6676 in intr_event_execute_handlers (p=Variable "p" is not available. ) at /usr/src/sys/kern/kern_intr.c:1257 #10 0xffffffff807f74b2 in ithread_loop (arg=0xfffffe0005144be0) at /usr/src/sys/kern/kern_intr.c:1270 #11 0xffffffff807f3b85 in fork_exit ( callout=0xffffffff807f7400 , arg=0xfffffe0005144be0, frame=0xffffff8000256c50) at /usr/src/sys/kern/kern_fork.c:941 #12 0xffffffff80aefe4e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:603 [snip] [last message in dmesg] ahcich0: Timeout on slot 29 port 0 ahcich0: is 00000000 cs 00000000 ss ffffffff rs ffffffff tfd 40 serr 00000000 cmd 0000fc17 [snip] >From core.txt.5: [snip] Unread portion of the kernel message buffer: Memory modified after free 0xfffffe000416e200(248) val=79e8800 @ 0xfffffe000416e200 panic: Most recently used by cred cpuid = 2 Uptime: 20h11m1s Dumping 1308 out of 7914 MB:..2%..12%..21%..31%..41%..51%..62%..71%..81%..91% [snip] #0 doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:252 252 if (textdump && textdump_pending) { (kgdb) #0 doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:252 #1 0xffffffff808234aa in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:430 #2 0xffffffff80822f41 in panic (fmt=Variable "fmt" is not available. ) at /usr/src/sys/kern/kern_shutdown.c:595 #3 0xffffffff80a6f7b4 in mtrash_ctor (mem=Variable "mem" is not available. ) at /usr/src/sys/vm/uma_dbg.c:137 #4 0xffffffff80a6f01c in uma_zalloc_arg (zone=0xfffffe021ffe0700, udata=0x0, flags=258) at /usr/src/sys/vm/uma_core.c:2018 #5 0xffffffff808108be in malloc (size=Variable "size" is not available. ) at uma.h:305 #6 0xffffffff8081c21f in crget () at /usr/src/sys/kern/kern_prot.c:1809 #7 0xffffffff8081c269 in crdup (cr=0xfffffe0143103300) at /usr/src/sys/kern/kern_prot.c:1911 #8 0xffffffff808c5ca6 in kern_accessat (td=0xfffffe0007dd7000, fd=-100, path=0x80065c000
, pathseg=UIO_USERSPACE, flags=Variable "flags" is not available. ) at /usr/src/sys/kern/vfs_syscalls.c:2201 #9 0xffffffff8086719a in syscallenter (td=0xfffffe0007dd7000, sa=0xffffff8223f67bb0) at /usr/src/sys/kern/subr_trap.c:344 #10 0xffffffff80b0b43c in syscall (frame=0xffffff8223f67c50) at /usr/src/sys/amd64/amd64/trap.c:910 #11 0xffffffff80af617d in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:384 #12 0x000000080062dbdc in ?? () Previous frame inner to this frame (corrupt stack?) [snip] [last message in dmesg] ahcich0: Timeout on slot 29 port 0 ahcich0: is 00000000 cs 00000000 ss ffffffff rs ffffffff tfd 40 serr 00000000 cm d 0000fc17 [snip] HTH, Alexey.