Date: Sat, 6 Oct 2012 02:00:55 -0400
From: David Wimsey <david@wimsey.us>
To: freebsd-fs@freebsd.org
Subject: Deadlock on zfs import
Message-ID: <074F3CC1-E29F-4552-840F-A38FDDCC7E76@wimsey.us>
I have a FreeBSD 9-RELEASE machine that deadlocks when importing one of the ZFS pools on it. When I run 'zpool import zfs01' (the offending pool), the data becomes available and the mounts show up as expected. Then it continues to chew away at the disks as if it's doing a scrub, eventually deadlocking. I've confirmed I can get to some of the data before the deadlock happens, so I can get data off of it, but it's a tedious process and doesn't help me long term.

Here's how I got to this point:

This machine is essentially a network file server: it serves NFS for a VMware ESXi machine, Samba for the family's Windows-based machines, and afpd for the Macs, as well as a couple of jails for Subversion and tftp/netboot services. Other than home directories, all of the mount points on this machine are generally set read-only and are never writable over the network. If I need to add something to the server, it's dropped into my home directory and then moved to its final location from the command line on the server itself.

Noticing the offending pool was at 94% capacity, I started rearranging things and cleaning up. I had multiple shells open copying to multiple different file systems on the same ZFS pool. This normally works fine; this time it didn't. At some point while copying roughly 25GB between different filesystems on the same pool, the machine deadlocked.

On reboot the machine reaches the 'mounting local filesystems' phase and then starts chugging away at the disks until it locks up again. The only way to get it to boot is to boot to single-user mode and then 'zpool export' the offending pool. After doing so, the machine works fine except for the bits that depend on file systems on the offending pool. The two other pools on the machine (zboot and zfs02) work perfectly.

If I boot with zfs01 exported and then import it after boot, it chugs away at the disks for a long time and then eventually deadlocks the machine.
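For what it's worth, since the pool is at version 28 (which supports read-only imports), one thing that might make the copy-off less risky is importing the pool read-only so nothing can write to it while I pull data. This is only a sketch, untested on the affected machine; the '|| echo' fallback just keeps the snippet harmless to paste on a box without the pool:

```shell
# Sketch: import the damaged pool read-only (supported at pool version 28)
# so no new writes can land on it while copying data off.
# Hypothetical; '|| echo' makes the line safe on a machine without the pool.
zpool import -o readonly=on zfs01 2>/dev/null || echo "zpool import failed"
```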
Some filesystems have compression and/or dedup enabled, but I have been turning that off due to the machine only having 4GB of RAM.

So, can someone point me in the direction of figuring out what's wrong and how to maybe go about fixing it? How can I tell if it's memory exhaustion that's causing the problem?

Is there a way to roll the pool back (without snapshots, which I had actually just deleted from the pool, heh) to maybe the last valid state of the pool?

Summary of machine config (output of various commands shown at the bottom due to its size):

4GB of RAM
2 SSDs, 64GB each
4 standard drives, 500GB each (2 Western Digital, 2 Seagate)
3 ZFS pools:

zboot - configured with one vdev, a mirror of 2 slices from the SSDs - this pool imports normally with no issues
zfs02 - configured with one vdev, a mirror of 2 slices from the SSDs - this pool imports normally with no issues
zfs01 - the offending pool, and of course the only one with data that can't be replaced easily, if at all:
    1 raidz vdev consisting of 3 HDDs, plus one hot-spare HDD
    1 mirrored vdev consisting of 2 slices from the SSDs for the ZIL
    2 slices from the SSDs for L2ARC

The drives are all SATA, split between the motherboard SATA ports and a 4-port RocketPort PCI-e 'raid controller' - no RAID configured, just using it for additional SATA ports and to provide fault tolerance if the onboard controller fails.

Output of various commands:

mayham# dmesg | head -n 15
Copyright (c) 1992-2012 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 9.0-RELEASE-p3 #0: Tue Jun 12 02:52:29 UTC 2012
    root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
CPU: AMD Phenom(tm) II X4 945 Processor (3013.28-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0x100f42  Family = 10  Model = 4  Stepping = 2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x37ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT>
TSC: P-state invariant
real memory  = 4294967296 (4096 MB)
avail memory = 4075692032 (3886 MB)

mayham# dmesg | grep ada
ada0 at ahcich1 bus 0 scbus1 target 0 lun 0
ada0: <ST3500418AS CC34> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 476940MB (976773168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad6
ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
ada1: <ST3500418AS CC34> ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 476940MB (976773168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad8
ada2 at ahcich3 bus 0 scbus3 target 0 lun 0
ada2: <M4-CT064M4SSD2 0309> ATA-9 SATA 3.x device
ada2: 600.000MB/s transfers (SATA 3.x, UDMA5, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 61057MB (125045424 512 byte sectors: 16H 63S/T 16383C)
ada2: Previously was known as ad10
ada3 at ahcich4 bus 0 scbus5 target 0 lun 0
ada3: <M4-CT064M4SSD2 0309> ATA-9 SATA 3.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 61057MB (125045424 512 byte sectors: 16H 63S/T 16383C)
ada3: Previously was known as ad14
ada4 at ahcich5 bus 0 scbus6 target 0 lun 0
ada4: <WDC WD5000AAKS-65YGA0 12.01C02> ATA-8 SATA 2.x device
ada4: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada4: Command Queueing enabled
ada4: 476940MB (976773168 512 byte sectors: 16H 63S/T 16383C)
ada4: Previously was known as ad16
ada5 at ahcich6 bus 0 scbus7 target 0 lun 0
ada5: <WDC WD5000AACS-00ZUB0 01.01B01> ATA-8 SATA 2.x device
ada5: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada5: Command Queueing enabled
ada5: 476940MB (976773168 512 byte sectors: 16H 63S/T 16383C)
ada5: Previously was known as ad18

mayham# dmesg | grep -i zfs
ZFS filesystem version 5
ZFS storage pool version 28
Trying to mount root from zfs:zboot []...

mayham# cat /boot/loader.conf
zfs_load="YES"
vfs.root.mountfrom="zfs:zboot"
splash_bmp_load="YES"
vesa_load="YES"
loader_logo="orb"
loader_color="YES"
bitmap_load="YES"
if_vlan_load="YES"

# Added after deadlock occurred
vm.kmem_size="512M"
vm.kmem_size_max="512M"
vfs.zfs.arc_max="40M"
vfs.zfs.vdev.cache.size="5M"
vfs.zfs.prefetch_disable="1"

mayham# zpool status
  pool: zboot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Sun Aug 12 03:35:52 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zboot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0

errors: No known data errors

  pool: zfs01
 state: ONLINE
  scan: resilvered 144K in 0h0m with 0 errors on Thu Aug 30 02:35:33 2012
config:

        NAME          STATE     READ WRITE CKSUM
        zfs01         ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            ada1p3    ONLINE       0     0     0
            ada0p3    ONLINE       0     0     0
            ada5p3    ONLINE       0     0     0
        logs
          ada2p4      ONLINE       0     0     0
          ada3p4      ONLINE       0     0     0
        cache
          ada2p5      ONLINE       0     0     0
          ada3p5      ONLINE       0     0     0
        spares
          gpt/disk3   AVAIL

errors: No known data errors

  pool: zfs02
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Fri Oct  5 04:42:19 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zfs02       ONLINE       0     0     0
          ada2p6    ONLINE       0     0     0
          ada3p6    ONLINE       0     0     0

errors: No known data errors

mayham# zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zboot  3.97G  2.32G  1.65G    58%  1.12x  ONLINE  -
zfs01  1.30T  1.24T  69.3G    94%  1.36x  ONLINE  -
zfs02    41G  39.4G  1.63G    96%  1.21x  ONLINE  -
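On my memory-exhaustion question above: the best handle I've found so far on comparing live ARC usage against the cap I set in loader.conf is sysctl. A rough sketch (sysctl names as on FreeBSD 9.x ZFS; the '|| echo 0' fallbacks just keep it harmless on a box where those OIDs don't exist):

```shell
# Compare live ARC size against the configured cap (FreeBSD 9.x sysctl names).
# Sketch only; fallbacks make it safe to run on a system without ZFS loaded.
arc_size=$(sysctl -n kstat.zfs.misc.arcstats.size 2>/dev/null || echo 0)
arc_max=$(sysctl -n vfs.zfs.arc_max 2>/dev/null || echo 0)
echo "ARC: ${arc_size} of ${arc_max} bytes"
```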