Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 28 Mar 2012 11:52:08 +0200
From:      Peter Maloney <peter.maloney@brockmann-consult.de>
To:        freebsd-fs@freebsd.org
Subject:   Strange ZFS deadlock today
Message-ID:  <4F72DF48.6040108@brockmann-consult.de>

next in thread | raw e-mail | index | archive | help
I believe it is caused by ZFS, since the state looks like the code in 
this page about ARC: 
[url]http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/[/url]

To cause/reveal this problem, first, I tried to run something with 
OpenJDK 6.

In one window:
[CODE]# ./CopyFromHd.sh
load: 0.00  cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k
load: 0.00  cmd: java 51239 [ucond] 46.22r 0.18u 0.06s 0% 19588k
load: 0.00  cmd: java 51239 [ucond] 51.14r 0.18u 0.06s 0% 19588k
load: 0.00  cmd: java 51239 [ucond] 52.90r 0.18u 0.06s 0% 19588k
^C
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 58.35r 
0.18u 0.06s 0% 19872k
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 61.73r 
0.18u 0.06s 0% 19872k
^C
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 89.67r 
0.18u 0.06s 0% 19872k
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 141.07r 
0.18u 0.06s 0% 19872k
^C[/CODE]

Then since it was hung, I thought maybe it was zfs' fault rather than 
OpenJDK, so tried with du.

Second window:
[CODE]
# du -shx /data/archive2/2011/09/11/x
3.1G    /data/archive2/2011/09/11/x

# du -shx /data/archive2/2011/09/11
  24G    /data/archive2/2011/09/11

# du -shx /data/archive2/2011/
load: 0.00  cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 13.75r 
0.00u 0.00s 0% 1012k
^C^C^C^Z^C
load: 0.00  cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 221.97r 
0.00u 0.00s 0% 1012k
[/CODE]

Then I thought I could just run the simplest version of the program 
(which does pretty much no IO)

Third window:
[CODE]# ./CopyFromHd.sh --help
^C^C^C^C^C^C
load: 0.00  cmd: java 52339 [suspended] 26.33r 0.15u 0.04s 0% 25644k
load: 0.00  cmd: java 52339 [suspended] 27.38r 0.15u 0.04s 0% 25644k
^C
load: 0.00  cmd: java 52339 [suspended] 285.23r 0.15u 0.04s 0% 25644k
^Z
[1]+  Stopped                 ./CopyFromHd.sh
# jobs -p
51988
# kill -9 51988
# jobs -p
51988
[1]+  Killed: 9               ./CopyFromHd.sh
# jobs -p
[/CODE]

Back to the first window:
[CODE]load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 
459.38r 0.18u 0.06s 0% 19872k
^C^Z
[1]+  Stopped                 ./CopyFromHd.sh ...
# jobs -p
51128
# kill -9 51128
# jobs -p
51128
[1]+  Interrupt: 2            ./CopyFromHd.sh ...
# jobs -p
# ps axl | grep java
     0 51239     1   0  44  0 1264940 19904 -      T      0    0:00.25 
[java]
     0 76933 77797   0  44  0  9124  1180 piperd S+     0    0:00.00 
grep java
     0 52339     1   0  44  0 1266988 25676 -      T      1    0:00.20 
/usr/local/openjdk6 //bin/java -Xmx1024M -classpath ...
[/CODE]


The java above is the "--help" one from the 2nd window. So I guess the 
first one ended.

But "dd" commands I ran are all stuck, and so is the 2nd java, and also 
"zdb" "zdb zroot" "zdb tank" and "zdb data" are all stuck. Also running 
"find" on either tank or data hangs.

[CMD="#"]zpool iostat[/CMD] [CMD="#"]zpool status [-v][/CMD] and 
[CMD="#"]zfs list[/CMD] all run without hanging.

Then I thought maybe I would disable primarycache to see what happens.

[CODE]# zfs set primarycache=none data
# zfs set primarycache=none tank
load: 0.00  cmd: zfs 80750 [tx->tx_sync_done_cv)] 5.73r 0.00u 0.00s 0% 1636k
(hang)
^Z^C
load: 0.00  cmd: zfs 80750 [tx->tx_sync_done_cv)] 87.28r 0.00u 0.00s 0% 
1636k
[/CODE]

Another window:
[CODE]# zfs get primarycache data
NAME  PROPERTY      VALUE         SOURCE
data  primarycache  none          local
# zfs get primarycache tank
NAME  PROPERTY      VALUE         SOURCE
tank  primarycache  none          local[/CODE]


On a Linux server that nfs mounts /data on this FreeBSD server, 
[CMD="#"]df[/CMD] hangs at the point where the nfs mount should be 
listed. (So I have to reboot now rather than later)

I have not run memtest on this machine.

I have no l2arc.

Uptime is 22 days.

It is normally perfectly stable. I ran CopyFromHd to process over 20 TB 
of data so far (2 weeks ago), so it is not normal for this to happen. 
Other similar servers don't hang this way. (most hangs I've caused in 
the past are not in this "buf_hash_table.ht_locks[i].ht_lock" state.

The last thing I did before this was running "zdb -S tank" yesterday. 
(tank is not the same pool as data)

FreeBSD version is 8.2-STABLE csupped on 2012-02-04.


And then on restart, the console showed: "some processes would not die; 
ps axl advised" and "shutdown terminated abnormally"



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4F72DF48.6040108>