Date: Wed, 6 Aug 2008 11:00:57 -0400 (EDT)
From: Weldon S Godfrey 3 <weldon@excelsus.com>
To: freebsd-fs@freebsd.org
Subject: ZFS-NFS kernel panic under load
Message-ID: <20080806101621.H24586@emmett.excelsus.com>

Hello,

Please forgive me, I didn't really see this discussed in the archives, but I am wondering if anyone has seen this issue. I can replicate it under FreeBSD amd64 7.0-RELEASE and the latest -STABLE (RELENG_7). I cannot reproduce any problems when running 9 instances of postmark on the machine directly, so the issue appears to be isolated to NFS.

There are backtraces and more information in ticket kern/124280.

I am experiencing random kernel panics while running the postmark benchmark from 9 NFS clients (clients on RedHat) against a 3TB ZFS filesystem exported with NFS. The panics can happen as soon as 5 minutes after starting the benchmark, or it may take hours before the server panics and reboots. They do not coincide with the time of any cron job.

I am using the following settings in postmark:

set number 20000
set transactions 10000000
set subdirectories 1000
set size 10000 15000
set report verbose
set location /var/mail/store1/X

(where X is a number 1-9, so each is operating in its own tree)

The problem happens if I run 1 postmark on 9 NFS clients at the same time (each client is its own server) or if I run 9 postmarks on one NFS client.

Commands used to create the filesystem:

zpool create tank mirror da0 da12 mirror da1 da13 mirror da2 da14 mirror da3 da15 \
    mirror da4 da16 mirror da5 da17 mirror da6 da18 mirror da7 da19 mirror da8 da20 \
    mirror da9 da21 mirror da10 da22 spare da11 da23
zfs set atime=off tank
zfs create tank/mail
zfs set mountpoint=/var/mail tank/mail
zfs set sharenfs="-maproot=root -network 192.168.2.0 -mask 255.255.255.0" tank/mail

I am using a 3ware 9690 SAS controller. I have 2 IBM EXP3000 enclosures; each drive is shown as a single disk by the controller.

This is my loader.conf:

vm.kmem_size_max="1073741824"
vm.kmem_size="1073741824"
kern.maxvnodes="800000"
vfs.zfs.prefetch_disable="1"
vfs.zfs.cache_flush_disable="1"

(I should note that kern.maxvnodes in loader.conf does not appear to do anything; after boot, sysctl shows it at 100000. It does change to 800000 if I set it manually with sysctl. However, my vnode usage sits at around 25-26K and is near that within 5s of the panic.)

The server has 16GB of RAM and 2 quad-core Xeon processors. This server is only an NFS fileserver. The only non-default daemon running is sshd. It is running the GENERIC kernel, currently unmodified. I am using two NICs. NFS is exported only on the secondary NIC. Each NIC is in its own subnet.

Nothing in /var/log/messages near the time of the panic except:

Aug 6 08:45:30 store1 savecore: reboot after panic: page fault
Aug 6 08:45:30 store1 savecore: writing core to vmcore.2

I can provide cores if needed.

Thank you for your time!

Weldon

kgdb with backtrace:

store1# kgdb kernel.debug /var/crash/vmcore.2
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:

Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 05
fault virtual address = 0xdc
fault code = supervisor read data, page not present
instruction pointer = 0x8:0xffffffff8063b3d8
stack pointer = 0x10:0xffffffffdfbc5720
frame pointer = 0x10:0xffffff00543ed000
code segment = base 0x0, limit 0xfffff, type 0x1b
 = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 839 (nfsd)
trap number = 12
panic: page fault
cpuid = 5
Uptime: 18m53s
Physical memory: 16366 MB
Dumping 1991 MB: 1976 1960 1944 1928 1912 1896 1880 1864 1848 1832 1816 1800 1784 1768 1752 1736 1720 1704 1688 1672 1656 1640 1624 1608 1592 1576 1560 1544 1528 1512 1496 1480 1464 1448 1432 1416 1400 1384 1368 1352 1336 1320 1304 1288 1272 1256 1240 1224 1208 1192 1176 1160 1144 1128 1112 1096 1080 1064 1048 1032 1016 1000 984 968 952 936 920 904 888 872 856 840 824 808 792 776 760 744 728 712 696 680 664 648 632 616 600 584 568 552 536 520 504 488 472 456 440 424 408 392 376 360 344 328 312 296 280 264 248 232 216 200 184 168 152 136 120 104 88 72 56 40 24 8

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /boot/kernel/zfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
#0 doadump () at pcpu.h:194
194 __asm __volatile("movq %%gs:0,%0" : "=r" (td));
(kgdb) backtrace
#0 doadump () at pcpu.h:194
#1 0x0000000000000004 in ?? ()
#2 0xffffffff804a7049 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418
#3 0xffffffff804a744d in panic (fmt=0x104 <Address 0x104 out of bounds>) at /usr/src/sys/kern/kern_shutdown.c:572
#4 0xffffffff807780e4 in trap_fatal (frame=0xffffff000bce26c0, eva=18446742974395967712) at /usr/src/sys/amd64/amd64/trap.c:724
#5 0xffffffff807784b5 in trap_pfault (frame=0xffffffffdfbc5670, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:641
#6 0xffffffff80778de8 in trap (frame=0xffffffffdfbc5670) at /usr/src/sys/amd64/amd64/trap.c:410
#7 0xffffffff8075e7ce in calltrap () at /usr/src/sys/amd64/amd64/exception.S:169
#8 0xffffffff8063b3d8 in nfsrv_access (vp=0xffffff00207d7dc8, flags=128, cred=0xffffff00403d4800, rdonly=0, td=0xffffff000bce26c0, override=0) at /usr/src/sys/nfsserver/nfs_serv.c:4284
#9 0xffffffff8063c4f1 in nfsrv3_access (nfsd=0xffffff00543ed000, slp=0xffffff0006396d00, td=0xffffff000bce26c0, mrq=0xffffffffdfbc5af0) at /usr/src/sys/nfsserver/nfs_serv.c:234
#10 0xffffffff8064cd1d in nfssvc (td=Variable "td" is not available.
) at /usr/src/sys/nfsserver/nfs_syscalls.c:456
#11 0xffffffff80778737 in syscall (frame=0xffffffffdfbc5c70) at /usr/src/sys/amd64/amd64/trap.c:852
#12 0xffffffff8075e9db in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:290
#13 0x0000000800687acc in ?? ()
Previous frame inner to this frame (corrupt stack?)
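
When comparing several cores from panics like this one, it can help to pull the frames out of each kgdb backtrace mechanically. The following is only a rough sketch in Python, not part of the original report: the names parse_backtrace and faulting_frame are made up for illustration, and it assumes the frame directly below calltrap is the one that took the page fault (which matches this trace, where frame #8 carries the same address as the reported instruction pointer).

```python
import re

# Start of a kgdb frame: "#N [0xADDR in ]FUNC (".
HEAD_RE = re.compile(
    r"#(?P<num>\d+)\s+(?:(?P<addr>0x[0-9a-f]+)\s+in\s+)?(?P<func>\S+)\s*\("
)
# Source location that follows the argument list: ") at FILE:LINE".
LOC_RE = re.compile(r"\)\s+at\s+(?P<file>\S+):(?P<line>\d+)")


def parse_backtrace(text):
    """Return one dict per frame found in kgdb backtrace text."""
    frames = []
    # Split at each line that starts a frame, so a frame that kgdb
    # wrapped over two lines (like #10 below) stays in one chunk.
    for chunk in re.split(r"(?m)^(?=#\d+\s)", text):
        flat = " ".join(chunk.split())
        head = HEAD_RE.match(flat)
        if head is None:
            continue
        loc = LOC_RE.search(flat)
        frames.append({
            "num": int(head.group("num")),
            "addr": head.group("addr"),
            "func": head.group("func"),
            "file": loc.group("file") if loc else None,
            "line": int(loc.group("line")) if loc else None,
        })
    return frames


def faulting_frame(frames):
    """Assumption: the frame right after calltrap is where the trap hit."""
    for i, frame in enumerate(frames[:-1]):
        if frame["func"] == "calltrap":
            return frames[i + 1]
    return None


# Frames #6 to #10 copied from the backtrace above.
SAMPLE = """
#6 0xffffffff80778de8 in trap (frame=0xffffffffdfbc5670) at /usr/src/sys/amd64/amd64/trap.c:410
#7 0xffffffff8075e7ce in calltrap () at /usr/src/sys/amd64/amd64/exception.S:169
#8 0xffffffff8063b3d8 in nfsrv_access (vp=0xffffff00207d7dc8, flags=128, cred=0xffffff00403d4800, rdonly=0, td=0xffffff000bce26c0, override=0) at /usr/src/sys/nfsserver/nfs_serv.c:4284
#9 0xffffffff8063c4f1 in nfsrv3_access (nfsd=0xffffff00543ed000, slp=0xffffff0006396d00, td=0xffffff000bce26c0, mrq=0xffffffffdfbc5af0) at /usr/src/sys/nfsserver/nfs_serv.c:234
#10 0xffffffff8064cd1d in nfssvc (td=Variable "td" is not available.
) at /usr/src/sys/nfsserver/nfs_syscalls.c:456
"""

frames = parse_backtrace(SAMPLE)
fault = faulting_frame(frames)
print(fault["func"], fault["file"], fault["line"])
```

Running the same helper over the backtrace from each vmcore would show quickly whether every panic lands at the same place in nfs_serv.c.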
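
For anyone trying to reproduce the load, the nine postmark setting blocks quoted in the message (one per tree under /var/mail/store1/X, X from 1 to 9) can be generated rather than typed by hand. This is a minimal sketch in Python; the names POSTMARK_SETTINGS and postmark_config are hypothetical, and how the text is fed to postmark (including the command that starts the run) is not stated in the message, so it is left out here.

```python
# Settings exactly as listed in the message; only the location varies.
POSTMARK_SETTINGS = [
    "set number 20000",
    "set transactions 10000000",
    "set subdirectories 1000",
    "set size 10000 15000",
    "set report verbose",
]


def postmark_config(instance, base="/var/mail/store1"):
    """Build the settings text for one postmark instance (hypothetical helper)."""
    lines = list(POSTMARK_SETTINGS)
    # Each instance works in its own tree: /var/mail/store1/1 .. /var/mail/store1/9
    lines.append(f"set location {base}/{instance}")
    return "\n".join(lines) + "\n"


# One settings block per instance, keyed by instance number 1-9.
configs = {i: postmark_config(i) for i in range(1, 10)}
print(configs[1])
```

The same nine blocks cover both scenarios described in the message: one postmark on each of 9 clients, or 9 postmarks on a single client.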