Date: Thu, 25 Oct 2012 08:34:01 -0700 From: Dennis Glatting <freebsd@pki2.com> To: freebsd-fs@freebsd.org Subject: stable/9 + ZFS IS NOT ready, me thinks Message-ID: <1351179241.12775.34.camel@btw.pki2.com>
At least that is what I suspect. As I have previously mentioned, I have five servers running ZFS under stable/9. Four are AMD systems (similar but not identical) and the fifth is Intel. The AMD systems are the workhorses, and they have a long history of stalling under load. Specifically, the kernel, keyboard, display, and network I/O remain responsive, but disk I/O stalls across all volumes, arrays, and disks. For example, a statically linked command that lives off the stalled disks (say, on a memory disk) will still run, but any other command will NOT.

Over the last week I changed operating systems on two of these systems. System #1 I downgraded to stable/8. On System #3 I installed CentOS 6.3 with ZFS-on-Linux (ZoL). Both have since been running the same job (2d17h on the first, 3d on the second) without trouble. Previously, System #1 would have stalled within 48 hours, typically in less than 12, and System #3 would spontaneously reboot whenever I tried to send a data set to it via "zfs send".

On System #1 I found that one of the OS disks, part of a hardware RAID1 array, was toast. I found and replaced that disk before installing 8.3. You could argue the problem with stable/9 was that disk, but I don't believe it, because I see the SAME problem across all four systems. When a new set of disks arrives I plan to re-introduce stable/9 on that system to see whether the faulting returns. Also, smartd says I need to update the firmware on some of my disks, which I plan to do this weekend (below).

Under ZoL and 8.3 the systems are more responsive than under stable/9. For example, an "ls" of the busy data set returns MUCH more quickly under ZoL and 8.3; under stable/9 it sputters out the data.
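[Editorial sketch: the memory-disk trick described above can be staged ahead of time. This is a minimal, hypothetical example, not the poster's exact setup; it assumes a FreeBSD system where the statically linked /rescue binaries are present. The mount point and size are arbitrary choices.]

```shell
# Stage statically linked tools on a swap-backed memory filesystem so they
# remain runnable when I/O to every pool, array, and disk stalls.
# /mnt/ramtools and the 64m size are illustrative, not from the original post.
mkdir -p /mnt/ramtools
mdmfs -s 64m md /mnt/ramtools                 # create and mount a memory filesystem
cp /rescue/ls /rescue/ps /rescue/sh /mnt/ramtools/   # /rescue binaries are static
# During a stall, commands run from the memory disk should still execute,
# while anything that must be paged in from the stalled disks will hang:
/mnt/ramtools/ls /
```

(mdmfs is a convenience wrapper around mdconfig, newfs, and mount; the same thing can be done with those three commands directly.)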
Here is the current load on System #1:

mc# top
last pid: 53918;  load averages: 73.73, 73.08, 72.81   up 2+17:58:24  08:16:47
61 processes:  10 running, 51 sleeping
CPU: 11.4% user, 46.0% nice, 42.6% system, 0.1% interrupt, 0.0% idle
Mem: 702M Active, 1003M Inact, 35G Wired, 160K Cache, 88M Buf, 88G Free
ARC: 32G Total, 3594M MRU, 27G MFU, 32M Anon, 581M Header, 562M Other
Swap: 233G Total, 233G Free

mc# zpool list
NAME     SIZE   ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
disk-1  16.2T  6.57T  9.68T    40%  1.33x  ONLINE   -
disk-2  3.62T  3.63G  3.62T     0%  1.00x  ONLINE   -

All of the data is going onto disk-1, which had under 10GB allocated when I started the job.

Here is System #3, running the same job but with only 25% as many cores as System #1:

[root@rotfl ~]# top
top - 08:19:13 up 3 days, 16:13,  7 users,  load average: 94.61, 94.57, 100.94
Tasks: 710 total,  10 running, 700 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.3%us,  4.4%sy, 82.2%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  65951592k total, 39561920k used, 26389672k free,   154372k buffers
Swap: 134217720k total,        0k used, 134217720k free,   377996k cached

[root@rotfl ~]# zpool list
NAME     SIZE   ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
disk-1  16.2T  6.72T  9.53T    41%  1.00x  ONLINE   -
disk-2  1.81T  3.24G  1.81T     0%  1.00x  ONLINE   -

Like System #1, the data is going to disk-1, which also had less than 10GB allocated when the job started.

I am working on getting many TB of data off one of the remaining two stable/9 systems for further experimentation, but the system stalls, which makes the process a bit cumbersome. I strongly suspect a contributing factor is the system cron scripts that run at night.

Finally, as I have also previously mentioned, I am NOT the only one having this problem. One individual stated that he updated his BIOS, his controller firmware, and his disk firmware, but that didn't help. I am happy to work with folks knowledgeable about the relevant FreeBSD components, but only one has stepped forward.
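[Editorial sketch: a generic way to confirm that disk I/O, rather than CPU, is what has stalled is to watch per-device activity while the load is running. These are standard commands, not output or commands taken from the systems above; the pool name disk-1 is reused from the post for illustration.]

```shell
# Watch per-vdev bandwidth and operations at 5-second intervals; during a
# stall of the kind described, the ops/s columns drop to zero on every
# device while the load average stays high.
zpool iostat -v disk-1 5

# gstat (FreeBSD GEOM statistics) shows per-disk queue length and %busy;
# the regex filter limits output to da* devices (adjust for your hardware).
gstat -f 'da[0-9]+$'
```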