Date: Wed, 28 Mar 2012 13:08:21 +0200 From: Peter Maloney <peter.maloney@brockmann-consult.de> To: freebsd-fs@freebsd.org Subject: Re: Strange ZFS deadlock today Message-ID: <4F72F125.6070404@brockmann-consult.de> In-Reply-To: <4F72DF48.6040108@brockmann-consult.de> References: <4F72DF48.6040108@brockmann-consult.de>
next in thread | previous in thread | raw e-mail | index | archive | help
And the top command reported (I think this was before running the first java program): 54 processes: 1 running, 52 sleeping, 1 stopped CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle Mem: 46M Active, 5504K Inact, 34G Wired, 5808K Cache, 2592K Buf, 13G Free Swap: 80G Total, 25M Used, 80G Free And after running "zdb -S" I am fairly sure zfs-stats also showed arc around 10-13 GB below the arc max I set. But I'm not sure I ran the command when "zdb -S" was done, or just after a long time of it running. And compression and dedup are not enabled on any dataset. /boot/loader.conf snippet: vm.kmem_size="40g" vm.kmem_size_max="40g" vfs.zfs.arc_min="80m" vfs.zfs.arc_max="38g" vfs.zfs.arc_meta_limit="24g" vfs.zfs.vdev.cache.size="32m" vfs.zfs.vdev.cache.max="256m" On 03/28/2012 11:52 AM, Peter Maloney wrote: > I believe it is caused by ZFS, since the state looks like the code in > this page about ARC: > [url]http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/[/url] > > To cause/reveal this problem, first, I tried to run something with > OpenJDK 6. > > In one window: > [CODE]# ./CopyFromHd.sh > load: 0.00 cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k > load: 0.00 cmd: java 51239 [ucond] 46.22r 0.18u 0.06s 0% 19588k > load: 0.00 cmd: java 51239 [ucond] 51.14r 0.18u 0.06s 0% 19588k > load: 0.00 cmd: java 51239 [ucond] 52.90r 0.18u 0.06s 0% 19588k > ^C > load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] > 58.35r 0.18u 0.06s 0% 19872k > load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] > 61.73r 0.18u 0.06s 0% 19872k > ^C > load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] > 89.67r 0.18u 0.06s 0% 19872k > load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] > 141.07r 0.18u 0.06s 0% 19872k > ^C[/CODE] > > Then since it was hung, I thought maybe it was zfs' fault rather than > OpenJDK, so tried with du. > > Second window: > [CODE] > # du -shx /data/archive2/2011/09/11/x > 3.1G /data/archive2/2011/09/11/x > > # du -shx /data/archive2/2011/09/11 > 24G /data/archive2/2011/09/11 > > # du -shx /data/archive2/2011/ > load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 13.75r > 0.00u 0.00s 0% 1012k > ^C^C^C^Z^C > load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 221.97r > 0.00u 0.00s 0% 1012k > [/CODE] > > Then I thought I could just run the simplest version of the program > (which does pretty much no IO) > > Third window: > [CODE]# ./CopyFromHd.sh --help > ^C^C^C^C^C^C > load: 0.00 cmd: java 52339 [suspended] 26.33r 0.15u 0.04s 0% 25644k > load: 0.00 cmd: java 52339 [suspended] 27.38r 0.15u 0.04s 0% 25644k > ^C > load: 0.00 cmd: java 52339 [suspended] 285.23r 0.15u 0.04s 0% 25644k > ^Z > [1]+ Stopped ./CopyFromHd.sh > # jobs -p > 51988 > # kill -9 51988 > # jobs -p > 51988 > [1]+ Killed: 9 ./CopyFromHd.sh > # jobs -p > [/CODE] > > Back to the first window: > [CODE]load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] > 459.38r 0.18u 0.06s 0% 19872k > ^C^Z > [1]+ Stopped ./CopyFromHd.sh ... > # jobs -p > 51128 > # kill -9 51128 > # jobs -p > 51128 > [1]+ Interrupt: 2 ./CopyFromHd.sh ... > # jobs -p > # ps axl | grep java > 0 51239 1 0 44 0 1264940 19904 - T 0 0:00.25 > [java] > 0 76933 77797 0 44 0 9124 1180 piperd S+ 0 0:00.00 > grep java > 0 52339 1 0 44 0 1266988 25676 - T 1 0:00.20 > /usr/local/openjdk6 //bin/java -Xmx1024M -classpath ... > [/CODE] > > > The java above is the "--help" one from the 2nd window. So I guess the > first one ended. > > But "dd" commands I ran are all stuck, and so is the 2nd java, and > also "zdb" "zdb zroot" "zdb tank" and "zdb data" are all stuck. Also > running "find" on either tank or data hangs. > > [CMD="#"]zpool iostat[/CMD] [CMD="#"]zpool status [-v][/CMD] and > [CMD="#"]zfs list[/CMD] all run without hanging. > > Then I thought maybe I would disable primarycache to see what happens. > > [CODE]# zfs set primarycache=none data > # zfs set primarycache=none tank > load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 5.73r 0.00u 0.00s 0% > 1636k > (hang) > ^Z^C > load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 87.28r 0.00u 0.00s > 0% 1636k > [/CODE] > > Another window: > [CODE]# zfs get primarycache data > NAME PROPERTY VALUE SOURCE > data primarycache none local > # zfs get primarycache tank > NAME PROPERTY VALUE SOURCE > tank primarycache none local[/CODE] > > > On a Linux server that nfs mounts /data on this FreeBSD server, > [CMD="#"]df[/CMD] hangs at the point where the nfs mount should be > listed. (So I have to reboot now rather than later) > > I have not run memtest on this machine. > > I have no l2arc. > > Uptime is 22 days. > > It is normally perfectly stable. I ran CopyFromHd to process over 20 > TB of data so far (2 weeks ago), so it is not normal for this to > happen. Other similar servers don't hang this way. (most hangs I've > caused in the past are not in this > "buf_hash_table.ht_locks[i].ht_lock" state. > > The last thing I did before this was running "zdb -S tank" yesterday. > (tank is not the same pool as data) > > FreeBSD version is 8.2-STABLE csupped on 2012-02-04. > > > And then on restart, the console showed: "some processes would not > die; ps axl advised" and "shutdown terminated abnormally" > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" -- -------------------------------------------- Peter Maloney Brockmann Consult Max-Planck-Str. 2 21502 Geesthacht Germany Tel: +49 4152 889 300 Fax: +49 4152 889 333 E-mail: peter.maloney@brockmann-consult.de Internet: http://www.brockmann-consult.de --------------------------------------------
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4F72F125.6070404>