Date: Mon, 18 Aug 2008 15:29:12 -0400 (EDT)
From: Weldon S Godfrey 3 <weldon@excelsus.com>
To: freebsd-fs@freebsd.org, pjd@FreeBSD.org
Subject: Re: ZFS-NFS kernel panic under load
Message-ID: <20080814091337.Y94482@emmett.excelsus.com>
In-Reply-To: <20080806101621.H24586@emmett.excelsus.com>
References: <20080806101621.H24586@emmett.excelsus.com>
Update on what else I have tried (all yield the same results and backtraces, with no indication in the logs/console of why it is panicking other than the page fault):

(FYI: I have tried to load 8-CURRENT, but it panics during install on the Dell 2950-3 I am using. I see a patch for a newer port of ZFS that looks like it is for 8; is there a patch for 7.0-RELEASE?)

I have tried breaking it into two smaller < 2TB filesystems and performed the same test on one; it still panicked.
I tried disabling swap altogether (although I wasn't swapping).
I upped the number of nfs daemons from 12 to 100.
I turned on ZFS debugging and WITNESS to see if anything would show up, such as locking issues (nothing did).
I ran loops every 3s to monitor max vnodes, kmem, and ARC during the tests, and up until the panic nothing was climbing (a rough sketch of the loop is appended after the quoted message below).
I turned off the ZIL and disabled prefetch; the problem still occurs.

I didn't get a panic in these situations:
I created a ZFS mirror filesystem of only two drives (one on each chassis) and performed the test.
I took one drive, created a UFS filesystem, and performed the test.

If memory serves me right, sometime around Aug 6, Weldon S Godfrey 3 told me:

> Hello,
>
> Please forgive me, I didn't really see this discussed in the archives, but I am
> wondering if anyone has seen this issue. I can replicate this issue under
> FreeBSD amd64 7.0-RELEASE and the latest -STABLE (RELENG_7). I cannot
> reproduce any problems running 9 instances of postmark on the machine
> directly, so the issue appears to be isolated to NFS.
>
> There are backtraces and more information in ticket kern/124280.
>
> I am experiencing random kernel panics while running the postmark benchmark from 9
> NFS clients (clients on RedHat) to a 3TB ZFS filesystem exported with NFS.
> The panics happen as soon as 5 minutes after starting the benchmark, or it may take
> hours before the machine panics and reboots. It doesn't correspond to a time when a
> cron job is running. I am using the following settings in postmark:
>
> set number 20000
> set transactions 10000000
> set subdirectories 1000
> set size 10000 15000
> set report verbose
> set location /var/mail/store1/X (where X is a number 1-9 so each is operating
> in its own tree)
>
> The problem happens if I run 1 postmark on each of 9 NFS clients at the same time
> (each client is its own server) or if I run 9 postmarks on one NFS client.
>
> Commands used to create the filesystem:
>
> zpool create tank mirror da0 da12 mirror da1 da13 mirror da2 da14 mirror da3 da15 \
>   mirror da4 da16 mirror da5 da17 mirror da6 da18 mirror da7 da19 mirror da8 da20 \
>   mirror da9 da21 mirror da10 da22 spare da11 da23
> zfs set atime=off tank
> zfs create tank/mail
> zfs set mountpoint=/var/mail tank/mail
> zfs set sharenfs="-maproot=root -network 192.168.2.0 -mask 255.255.255.0" tank/mail
>
> I am using a 3ware 9690 SAS controller. I have 2 IBM EXP3000 enclosures, and each
> drive is shown as a single disk by the controller.
>
> This is my loader.conf:
>
> vm.kmem_size_max="1073741824"
> vm.kmem_size="1073741824"
> kern.maxvnodes="800000"
> vfs.zfs.prefetch_disable="1"
> vfs.zfs.cache_flush_disable="1"
>
> (I should note that kern.maxvnodes in loader.conf does not appear to do
> anything; after boot, it is shown to be at 100000 with sysctl. It does change
> to 800000 if I manually set it with sysctl. However, it appears my vnode usage
> sits at around 25-26K and is near that within 5s of the panic.)
>
> The server has 16GB of RAM and 2 quad-core Xeon processors.
>
> This server is only an NFS fileserver. The only non-default daemon running is
> sshd.
> It is running the GENERIC kernel, right now, unmodified.
>
> I am using two NICs. NFS is exported only on the secondary NIC. Each NIC is
> in its own subnet.
>
> Nothing in /var/log/messages near the time of the panic except:
>
> Aug 6 08:45:30 store1 savecore: reboot after panic: page fault
> Aug 6 08:45:30 store1 savecore: writing core to vmcore.2
>
> I can provide cores if needed.
>
> Thank you for your time!
>
> Weldon
>
>
> kgdb with backtrace:
>
> store1# kgdb kernel.debug /var/crash/vmcore.2
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "amd64-marcel-freebsd"...
>
> Unread portion of the kernel message buffer:
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 5; apic id = 05
> fault virtual address = 0xdc
> fault code = supervisor read data, page not present
> instruction pointer = 0x8:0xffffffff8063b3d8
> stack pointer = 0x10:0xffffffffdfbc5720
> frame pointer = 0x10:0xffffff00543ed000
> code segment = base 0x0, limit 0xfffff, type 0x1b
>                = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 839 (nfsd)
> trap number = 12
> panic: page fault
> cpuid = 5
> Uptime: 18m53s
> Physical memory: 16366 MB
> Dumping 1991 MB: 1976 1960 1944 1928 1912 1896 1880 1864 1848 1832 1816 1800
> 1784 1768 1752 1736 1720 1704 1688 1672 1656 1640 1624 1608 1592 1576 1560
> 1544 1528 1512 1496 1480 1464 1448 1432 1416 1400 1384 1368 1352 1336 1320
> 1304 1288 1272 1256 1240 1224 1208 1192 1176 1160 1144 1128 1112 1096 1080
> 1064 1048 1032 1016 1000 984 968 952 936 920 904 888 872 856 840 824 808 792
> 776 760 744 728 712 696 680 664 648 632 616 600 584 568 552 536 520 504 488
> 472 456 440 424 408 392 376 360 344 328 312 296 280 264 248 232 216 200 184
> 168 152 136 120 104 88 72 56 40 24 8
>
> Reading symbols from /boot/kernel/zfs.ko...Reading symbols from
> /boot/kernel/zfs.ko.symbols...done.
> done.
> Loaded symbols for /boot/kernel/zfs.ko
> #0  doadump () at pcpu.h:194
> 194             __asm __volatile("movq %%gs:0,%0" : "=r" (td));
> (kgdb) backtrace
> #0  doadump () at pcpu.h:194
> #1  0x0000000000000004 in ?? ()
> #2  0xffffffff804a7049 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418
> #3  0xffffffff804a744d in panic (fmt=0x104 <Address 0x104 out of bounds>) at /usr/src/sys/kern/kern_shutdown.c:572
> #4  0xffffffff807780e4 in trap_fatal (frame=0xffffff000bce26c0, eva=18446742974395967712)
>     at /usr/src/sys/amd64/amd64/trap.c:724
> #5  0xffffffff807784b5 in trap_pfault (frame=0xffffffffdfbc5670, usermode=0)
>     at /usr/src/sys/amd64/amd64/trap.c:641
> #6  0xffffffff80778de8 in trap (frame=0xffffffffdfbc5670) at /usr/src/sys/amd64/amd64/trap.c:410
> #7  0xffffffff8075e7ce in calltrap () at /usr/src/sys/amd64/amd64/exception.S:169
> #8  0xffffffff8063b3d8 in nfsrv_access (vp=0xffffff00207d7dc8, flags=128, cred=0xffffff00403d4800, rdonly=0,
>     td=0xffffff000bce26c0, override=0) at /usr/src/sys/nfsserver/nfs_serv.c:4284
> #9  0xffffffff8063c4f1 in nfsrv3_access (nfsd=0xffffff00543ed000, slp=0xffffff0006396d00, td=0xffffff000bce26c0,
>     mrq=0xffffffffdfbc5af0) at /usr/src/sys/nfsserver/nfs_serv.c:234
> #10 0xffffffff8064cd1d in nfssvc (td=Variable "td" is not available.
> ) at /usr/src/sys/nfsserver/nfs_syscalls.c:456
> #11 0xffffffff80778737 in syscall (frame=0xffffffffdfbc5c70) at /usr/src/sys/amd64/amd64/trap.c:852
> #12 0xffffffff8075e9db in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:290
> #13 0x0000000800687acc in ?? ()
> Previous frame inner to this frame (corrupt stack?)
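
As mentioned above, the 3-second monitoring loop was roughly along these lines. This is only a sketch: the exact counter names (vfs.numvnodes, kstat.zfs.misc.arcstats.size, the "solaris" malloc type in vmstat -m) are approximations of what was watched, not an exact record.

#!/bin/sh
# rough sketch of the 3s monitoring loop; counter names are approximate
while :; do
    date
    sysctl vfs.numvnodes kern.maxvnodes     # vnodes in use vs. the limit
    sysctl kstat.zfs.misc.arcstats.size     # ARC size in bytes
    vmstat -m | grep -i solaris             # kernel memory used by the ZFS port
    sleep 3
done >> /var/tmp/zfs-nfs-monitor.log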
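If it helps, I can also poke at the core directly. Frame 8 (nfsrv_access) is where the fault happens, and the small fault address (0xdc) looks like a dereference through a NULL or stale pointer, so a kgdb session against vmcore.2 along the following lines should show which pointer is bad. This is only a sketch; apart from the vp/cred/td arguments shown in the backtrace above, none of the names are confirmed.

(kgdb) frame 8
(kgdb) list
(kgdb) info args
(kgdb) info locals
(kgdb) print vp
(kgdb) print *vp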