From: Attila Nagy
Date: Wed, 03 Feb 2010 10:48:57 +0100
To: freebsd-fs@freebsd.org
Subject: Machine stops for some seconds with ZFS
Message-ID: <4B694689.2030704@fsn.hu>

Hello,

After a long time, I've switched back to ZFS on my desktop. It runs
8-STABLE/amd64 with two SATA disks and a USB pendrive. One partition
from each disk is used for the zpool, which is encrypted using GELI, and
the pendrive is there for L2ARC:

        NAME            STATE     READ WRITE CKSUM
        data            ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            ad0s1d.eli  ONLINE       0     0     0
            ad1s1d.eli  ONLINE       0     0     0
        cache
          da0           ONLINE       0     0     0

Today, after 12 days of uptime, the machine froze.
I could ping it from a different machine, and could even telnet to its
ssh port, but I couldn't get the ssh banner.

Now I'm building a 9-CURRENT kernel and world to see whether the same
problem persists there, and during the make process I've noticed a
strange thing. I build with -j4 (the machine has one dual core CPU), so
the fans are screaming during the process. But every few minutes (I
couldn't recognize any pattern in it) the machine goes completely silent
(even more silent than normal), and everything halts.

During this, the top running on the machine can refresh itself, and I
can type on pass-through ssh connections (that is, I use the machine in
question to access other machines with ssh), but I can't open new ssh
connections to it, and can't start anything new (for example from an
open shell).

ping keeps running seamlessly during this, and top shows the following:

last pid: 36503;  load averages:  1.59,  3.04,  3.01  up 0+00:49:53  10:32:10
97 processes:  1 running, 96 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 218M Active, 24M Inact, 639M Wired, 40M Cache, 6208K Buf, 1022M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME   THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1342 root         1  44    0  3204K   620K select 0   0:02  0.00% make
 1424 root         1  44    0  3204K  1036K select 0   0:01  0.00% make
 1280 root         1  44    0 12540K  1900K select 0   0:01  0.00% hald-addon-storage
 1234 haldaemon    1  44    0 24116K  4464K select 0   0:01  0.00% hald
93600 root         1  44    0  3204K  1028K select 0   0:00  0.00% make
 1260 root         1  44    0 19704K  2688K select 0   0:00  0.00% hald-addon-mouse-sy
15142 bra          1  44    0  9332K  2864K CPU0   0   0:00  0.00% top
 1263 root         1  44    0 12540K  1896K cgticb 0   0:00  0.00% hald-addon-storage
94415 bra          1  44    0 37944K  4992K select 1   0:00  0.00% sshd
35837 root         1  44    0  5252K  2424K select 1   0:00  0.00% make
95361 bra          1  44    0 37944K  4992K select 1   0:00  0.00% sshd
35973 root         1  44    0  3204K  1772K select 0   0:00  0.00% make
  608 root         1  44    0  6892K  1436K select 1   0:00  0.00% syslogd
96928 root         1  44    0  3204K   728K select 0   0:00  0.00% make
94369 root         1  51    0 37944K  4584K sbwait 0   0:00  0.00% sshd
82631 root         1  50    0 37944K  4584K sbwait 0   0:00  0.00% sshd
16304 root         1  44    0 37944K  4576K zio->i 1   0:00  0.00% sshd
  951 _ntp         1  44    0  6876K  1692K select 0   0:00  0.00% ntpd
 1238 root         1  76    0 16768K  2372K select 0   0:00  0.00% hald-runner
 4916 root         1  44    0  3204K   728K select 1   0:00  0.00% make
95338 root         1  49    0 37944K  4584K sbwait 1   0:00  0.00% sshd
 1259 root         1  44    0 10280K  2712K pause  1   0:00  0.00% csh
33357 bra          1  44    0 21596K  4004K select 0   0:00  0.00% ssh
16405 bra          1  44    0 37944K  5012K zio->i 0   0:00  0.00% sshd
 1044 root         1  44    0  9104K  1796K kqread 0   0:00  0.00% master
34765 root         1  76    0  8260K  1764K wait   1   0:00  0.00% sh
82685 bra          1  44    0 37944K  4960K select 1   0:00  0.00% sshd
 1065 postfix      1  44    0  9100K  1872K kqread 0   0:00  0.00% qmgr
 1237 root        17  44    0 27460K  4124K waitvt 0   0:00  0.00% console-kit-daemon
95362 bra          1  44    0 10216K  2612K ttyin  0   0:00  0.00% bash
34764 root         1  44    0  3204K   852K select 0   0:00  0.00% make
 1222 root         1  49    0 21672K  1896K wait   0   0:00  0.00% login
35728 root         1  44    0  3204K   860K select 0   0:00  0.00% make
 1064 postfix      1  44    0  9104K  1772K zio->i 1   0:00  0.00% pickup
82696 bra          1  44    0 10216K  2596K wait   0   0:00  0.00% bash
94417 bra          1  44    0 10216K  2596K wait   1   0:00  0.00% bash
35455 root         1  44    0  3204K   744K select 0   0:00  0.00% make
35774 root         1  44    0  3204K   728K select 1   0:00  0.00% make
16409 bra          1  44    0 10216K  2592K ttyin  0   0:00  0.00% bash
 1155 root         1  44    0  7948K  1604K nanslp 0   0:00  0.00% cron
 1077 messagebus   1  53    0  8092K  2060K select 0   0:00  0.00% dbus-daemon
 1149 root         1  44    0 26012K  3960K select 1   0:00  0.00% sshd
35729 root         1  76    0  8260K  1760K wait   0   0:00  0.00% sh
 4921 root         1  57    0  8260K  1748K wait   0   0:00  0.00% sh
  825 root         1  76    0 39212K  2372K lockf  1   0:00  0.00% saslauthd
35460 root         1  76    0  8260K  1748K wait   0   0:00  0.00% sh
34761 root         1  48    0  8260K  1740K wait   1   0:00  0.00% sh
96923 root         1  50    0  8260K  1740K wait   0   0:00  0.00% sh

As you can see, top reports that the machine is 100% idle, while a make -j4
buildworld runs. This lasts for a few seconds (10-20), then everything
goes back to normal: the fans start to scream, the build continues and I
can use the machine.

This occasional halt is new to me, although I've only just switched to
ZFS on my desktop, and on a server it's harder to notice if you don't
use it for interactive sessions. The final freeze, however, I have seen
on more than one server.

How could I help to debug this, and the final freeze?

Thanks,