Date: Wed, 28 Mar 2012 11:52:08 +0200 From: Peter Maloney <peter.maloney@brockmann-consult.de> To: freebsd-fs@freebsd.org Subject: Strange ZFS deadlock today Message-ID: <4F72DF48.6040108@brockmann-consult.de>
next in thread | raw e-mail | index | archive | help
I believe it is caused by ZFS, since the state looks like the code in this page about ARC: [url]http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/[/url] To cause/reveal this problem, first, I tried to run something with OpenJDK 6. In one window: [CODE]# ./CopyFromHd.sh load: 0.00 cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k load: 0.00 cmd: java 51239 [ucond] 46.22r 0.18u 0.06s 0% 19588k load: 0.00 cmd: java 51239 [ucond] 51.14r 0.18u 0.06s 0% 19588k load: 0.00 cmd: java 51239 [ucond] 52.90r 0.18u 0.06s 0% 19588k ^C load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 58.35r 0.18u 0.06s 0% 19872k load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 61.73r 0.18u 0.06s 0% 19872k ^C load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 89.67r 0.18u 0.06s 0% 19872k load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 141.07r 0.18u 0.06s 0% 19872k ^C[/CODE] Then since it was hung, I thought maybe it was zfs' fault rather than OpenJDK, so tried with du. Second window: [CODE] # du -shx /data/archive2/2011/09/11/x 3.1G /data/archive2/2011/09/11/x # du -shx /data/archive2/2011/09/11 24G /data/archive2/2011/09/11 # du -shx /data/archive2/2011/ load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 13.75r 0.00u 0.00s 0% 1012k ^C^C^C^Z^C load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 221.97r 0.00u 0.00s 0% 1012k [/CODE] Then I thought I could just run the simplest version of the program (which does pretty much no IO) Third window: [CODE]# ./CopyFromHd.sh --help ^C^C^C^C^C^C load: 0.00 cmd: java 52339 [suspended] 26.33r 0.15u 0.04s 0% 25644k load: 0.00 cmd: java 52339 [suspended] 27.38r 0.15u 0.04s 0% 25644k ^C load: 0.00 cmd: java 52339 [suspended] 285.23r 0.15u 0.04s 0% 25644k ^Z [1]+ Stopped ./CopyFromHd.sh # jobs -p 51988 # kill -9 51988 # jobs -p 51988 [1]+ Killed: 9 ./CopyFromHd.sh # jobs -p [/CODE] Back to the first window: [CODE]load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 459.38r 0.18u 0.06s 0% 19872k ^C^Z [1]+ Stopped ./CopyFromHd.sh ... # jobs -p 51128 # kill -9 51128 # jobs -p 51128 [1]+ Interrupt: 2 ./CopyFromHd.sh ... # jobs -p # ps axl | grep java 0 51239 1 0 44 0 1264940 19904 - T 0 0:00.25 [java] 0 76933 77797 0 44 0 9124 1180 piperd S+ 0 0:00.00 grep java 0 52339 1 0 44 0 1266988 25676 - T 1 0:00.20 /usr/local/openjdk6 //bin/java -Xmx1024M -classpath ... [/CODE] The java above is the "--help" one from the 2nd window. So I guess the first one ended. But "dd" commands I ran are all stuck, and so is the 2nd java, and also "zdb" "zdb zroot" "zdb tank" and "zdb data" are all stuck. Also running "find" on either tank or data hangs. [CMD="#"]zpool iostat[/CMD] [CMD="#"]zpool status [-v][/CMD] and [CMD="#"]zfs list[/CMD] all run without hanging. Then I thought maybe I would disable primarycache to see what happens. [CODE]# zfs set primarycache=none data # zfs set primarycache=none tank load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 5.73r 0.00u 0.00s 0% 1636k (hang) ^Z^C load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 87.28r 0.00u 0.00s 0% 1636k [/CODE] Another window: [CODE]# zfs get primarycache data NAME PROPERTY VALUE SOURCE data primarycache none local # zfs get primarycache tank NAME PROPERTY VALUE SOURCE tank primarycache none local[/CODE] On a Linux server that nfs mounts /data on this FreeBSD server, [CMD="#"]df[/CMD] hangs at the point where the nfs mount should be listed. (So I have to reboot now rather than later) I have not run memtest on this machine. I have no l2arc. Uptime is 22 days. It is normally perfectly stable. I ran CopyFromHd to process over 20 TB of data so far (2 weeks ago), so it is not normal for this to happen. Other similar servers don't hang this way. (most hangs I've caused in the past are not in this "buf_hash_table.ht_locks[i].ht_lock" state. The last thing I did before this was running "zdb -S tank" yesterday. (tank is not the same pool as data) FreeBSD version is 8.2-STABLE csupped on 2012-02-04. And then on restart, the console showed: "some processes would not die; ps axl advised" and "shutdown terminated abnormally"
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4F72DF48.6040108>