Date: Tue, 29 Sep 2009 10:29:26 +0200
From: Borja Marcos <borjam@sarenet.es>
To: freebsd-stable@freebsd.org
Subject: 8.0RC1, ZFS: deadlock
Message-ID: <089F63A7-574B-4646-97C7-D82B226CD4CF@sarenet.es>
Hello,

I have observed a deadlock condition when using ZFS. We make heavy use of zfs send/zfs receive to keep a replica of a dataset on a remote machine. It can be done at one-minute intervals. Maybe this is a somewhat atypical use of ZFS, but it seems to be a great way to keep filesystem replicas once this is sorted out.

How to reproduce:

Set up two systems. A dataset with heavy I/O activity is replicated from the first to the second one. I've used a dataset containing /usr/obj while I did a make buildworld.

Replicate the dataset from the first machine to the second one using an incremental send:

    zfs send -i pool/dataset@Nminus1 pool/dataset@N | ssh destination zfs receive -d pool

When there is read activity on the second system, that is, when the replicated dataset is being read while zfs receive is updating it, there can be a deadlock. We discovered this during a test on a server that will hopefully go into production soon, with 8 GB RAM. A Bacula backup agent was running and ZFS deadlocked.

I have set up a couple of VMware Fusion virtual machines in order to test this, and it has deadlocked as well. The virtual machines have little memory, 512 MB, but I don't believe this is the actual problem. There is no complaint about lack of memory.
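The replication step above could be scripted roughly as follows. This is only a minimal sketch: the function names build_replication_cmd and replicate_once, the DRY_RUN switch, and the numeric snapshot names are illustrative assumptions, not the setup actually used on these machines. By default it only prints the commands, so it can be tried on a machine without ZFS.

```shell
#!/bin/sh
# Sketch only: function names, DRY_RUN and the snapshot naming scheme are
# hypothetical; only the zfs send/receive pipeline comes from the report.

# Print the incremental send/receive pipeline for two consecutive snapshots.
build_replication_cmd() {
    dataset=$1; prev=$2; cur=$3; dest_host=$4; dest_pool=$5
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive -d %s\n' \
        "$dataset" "$prev" "$dataset" "$cur" "$dest_host" "$dest_pool"
}

# One replication round: take the new snapshot, then send the increment.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
replicate_once() {
    dataset=$1; prev=$2; cur=$3; dest_host=$4; dest_pool=$5
    cmd=$(build_replication_cmd "$dataset" "$prev" "$cur" "$dest_host" "$dest_pool")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "zfs snapshot $dataset@$cur"
        echo "$cmd"
    else
        zfs snapshot "$dataset@$cur" && sh -c "$cmd"
    fi
}
```

Calling replicate_once repeatedly with increasing snapshot names and a 60-second sleep between rounds would approximate the one-minute interval; reading the replica on the destination at the same time (for example with a tar of the dataset) is what triggers the deadlock described here.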
A running top shows processes stuck on "zfsvfs":

last pid:  2051;  load averages:  0.00,  0.07,  0.55    up 0+01:18:25  12:05:48
37 processes:  1 running, 36 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 18M Active, 20M Inact, 114M Wired, 40K Cache, 59M Buf, 327M Free
Swap: 1024M Total, 1024M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1914 root        1  62    0 11932K  2564K zfsvfs  0   0:51  0.00% bsdtar
 1093 borjam      1  44    0  8304K  2464K CPU1    1   0:32  0.00% top
 1913 root        1  54    0 11932K  2600K rrl->r  0   0:19  0.00% bsdtar
 1019 root        1  44    0 25108K  4812K select  0   0:05  0.00% sshd
 2008 root        1  76    0 13600K  1904K tx->tx  0   0:04  0.00% zfs
 1089 borjam      1  44    0 37040K  5216K select  1   0:04  0.00% sshd
  995 root        1  76    0  8252K  2652K pause   0   0:02  0.00% csh
  840 root        1  44    0 11044K  3828K select  1   0:02  0.00% sendmail
 1086 root        1  76    0 37040K  5156K sbwait  1   0:01  0.00% sshd
  850 root        1  44    0  6920K  1612K nanslp  0   0:01  0.00% cron
  607 root        1  44    0  5992K  1540K select  1   0:01  0.00% syslogd
 1090 borjam      1  76    0  8252K  2636K pause   1   0:01  0.00% csh
  990 borjam      1  44    0 37040K  5220K select  0   0:00  0.00% sshd
  985 root        1  48    0 37040K  5160K sbwait  1   0:00  0.00% sshd
  911 root        1  44    0  8252K  2608K ttyin   0   0:00  0.00% csh
  991 borjam      1  56    0  8252K  2636K pause   0   0:00  0.00% csh
  844 smmsp       1  46    0 11044K  3852K pause   0   0:00  0.00% sendmail

Interestingly, this has blocked access to all the filesystems. I cannot, for instance, ssh into the machine anymore, even though all the system-important filesystems are on UFS; I was just using ZFS for a test.

Any ideas on what information might be useful to collect? I have the VMware machine right now. I've made a couple of VMware snapshots of it: the first before breaking into DDB, with the deadlock just started, and the second while in DDB (I broke into DDB with sysctl).

Also, a copy of the VMware virtual machine with snapshots is available on request. Your choice ;)

Borja.