From owner-freebsd-fs@FreeBSD.ORG Fri Jun 29 08:57:06 2012 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B1BA5106564A for ; Fri, 29 Jun 2012 08:57:06 +0000 (UTC) (envelope-from lserinol@gmail.com) Received: from mail-yx0-f182.google.com (mail-yx0-f182.google.com [209.85.213.182]) by mx1.freebsd.org (Postfix) with ESMTP id 614E28FC17 for ; Fri, 29 Jun 2012 08:57:06 +0000 (UTC) Received: by yenl8 with SMTP id l8so2984185yen.13 for ; Fri, 29 Jun 2012 01:57:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=3WdepfDigT5VM94qr4Ocb/4VxSj7ZKXJhIEW6XB32KE=; b=iFAwz5OdO0rEMcSGaSZajAKf8NB+m5tbCyU8Ar7X7pr0zyokIrZwkCpf4mWNcy9VEB tV+TquyoBvEauRjBsiZD6FCMZyNyghY/rqnnqrmFyvYB9FwyszmNVCbszGd76yCwEnWA LsOt5mpvQpuxpLp2IgRP4G8tMt4mB7Kk90cJmMg22Tnc8mjf2tm4MjrGEyfIDlzilKoQ ORbJneC/+Ax6jL+ZWqdSJ17BCsdPRMTQ7TYxhGPnuMQmW9DJESEK3c09NfHmEAcinZia srgLN5HficQZ+NRhUhioEtUcKG3SXCspE7S7jdlKl7vVLD8jsLgQZcNHrgLDkD+NhlkC 3IDA== MIME-Version: 1.0 Received: by 10.101.165.13 with SMTP id s13mr272020ano.1.1340960224937; Fri, 29 Jun 2012 01:57:04 -0700 (PDT) Received: by 10.146.103.7 with HTTP; Fri, 29 Jun 2012 01:57:04 -0700 (PDT) In-Reply-To: <20120627192843.69214ea0@fabiankeil.de> References: <20120627192843.69214ea0@fabiankeil.de> Date: Fri, 29 Jun 2012 11:57:04 +0300 Message-ID: From: Levent Serinol To: Fabian Keil Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: freebsd-fs@freebsd.org Subject: Re: ZFS stalls on Heavy I/O X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 29 Jun 2012 08:57:06 -0000 Hi, On Wed, Jun 27, 2012 at 8:28 PM, Fabian Keil wrote: > Levent Serinol wrote: > > > On 27 Haz 2012, at 19:34, Andreas Nilsson wrote: > > > > On Wed, Jun 27, 2012 at 5:50 PM, Dean Jones < > dean.jones@oregonstate.edu> wrote: > > > On Wed, Jun 27, 2012 at 2:15 AM, Levent Serinol > wrote: > > > > Hi, > > > > > > > > Under heavy I/O load we face freeze problems on ZFS volumes on both > > > > Freebsd 9 Release and 10 Current versions. Machines are HP servers > (64bit) > > > > with HP Smart array 6400 raid controllers (with battery units). > Every da > > > > device is a hardware raid5 where each one includes 9x300GB 10K SCSI > hard > > > > drivers. Main of I/O pattern happens on local system except some > small NFS > > > > I/O from some other servers (NFS lookup/getattr/ etc.). These > servers are > > > > mail servers (qmail) with small I/O patterns (64K Read/Write). > Below you > > > > can find procstat output on freeze time. write_limit is set to 200MB > > > > because of the huge amount of txg_wait_opens observed before. Every > process > > > > stops on D state I think due to txg queue and other 2 queues are > full. Is > > > > there any suggestion to fix the problem ? > > > > > > > > btw inject_compress is the main process injecting emails to user > inboxes > > > > (databases). Also, those machines were running without problems on > > > > Linux/XFS filesystem. For a while ago, we started migration from > Linux to > > > > Freebsd > > > > > > > > > > > > http://pastebin.com/raw.php?i=ic3YepWQ > > > > _______________________________________________ > > > > > > Looks like you are running dedup with only 12 gigs of ram? > > > > > > Dedup is very ram hungry and the dedup tables are probably no longer > > > fitting entirely in memory and therefore the system is swapping and > > > thrashing about during writes. > > > > > > Also ZFS really prefers to directly address drives instead of RAID > > > controllers. It can not guarantee or know what the controller is > > > doing behind the scenes. > > > You might want to read > http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe and see > if you need more ram. > > > > > > And yes, having raid below zfs somewhat defeats the point of zfs. > > > That was one the machines, i'm running several similar machines except > few changes. For examplw some of them have 50gb and 20gbs of ram and some > of them has direct access every disk itself on poil as you suggested ( > pools including 24 disks) some of the machines also running p812 hp raid > cards (1gb cache ) , every hp card has battery unit. Every machine wheter > rumning on 50gb ram or pools with lots of disks have the same stall problem > except one of them which is using hp p6300 san with fc connection . It's > running zfs without problems. Do i have to suspect on ciss driver which is > common on all machines where problems occur ? Wheter they use 6400 or p812 > raid cards all of them is using same ciss driver except the one connected > via fc to san. > > > > Btw when zfs stalls after 1-2 minutes later it contiunes to write and > read as usual. > > Do the stalls get shorter if you decrease kern.cam.da.default_timeout? > > Fabian > I set kern.cam.da.default_timeout to 30 later tried 10, nothing happened. I just tried this on a machine where there are plenty of ram (50GB) , hp p812 raid controller (1gb cache+battery) and zpool configuration as below. state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM mail213 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 da1 ONLINE 0 0 0 da2 ONLINE 0 0 0 da3 ONLINE 0 0 0 da4 ONLINE 0 0 0 da5 ONLINE 0 0 0 da6 ONLINE 0 0 0 da7 ONLINE 0 0 0 da8 ONLINE 0 0 0 da9 ONLINE 0 0 0 da10 ONLINE 0 0 0 da11 ONLINE 0 0 0 raidz1-1 ONLINE 0 0 0 da13 ONLINE 0 0 0 da14 ONLINE 0 0 0 da15 ONLINE 0 0 0 da16 ONLINE 0 0 0 da17 ONLINE 0 0 0 da18 ONLINE 0 0 0 da19 ONLINE 0 0 0 da20 ONLINE 0 0 0 da21 ONLINE 0 0 0 da22 ONLINE 0 0 0 da24 ONLINE 0 0 0 cache da23 ONLINE 0 0 0 spares da12 AVAIL zpool list NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT mail213 6T 4.32T 1.68T 71% 2.25x ONLINE - every process where trying to write or read on this pool is in "D" state. here is a small trace of a process. Every one of them, strangely waits too long on lseek syscall. I have also replaced the lseek with pread/pwrite but same delay occures. 20338: 0.008389433 lseek(3,0x51dc00,SEEK_SET) = 5364736 (0x51dc00) 20338: 0.469858534 read(3,"\^B\0\0\0\r\0Y\0\0\0Yd\^C\M-/\^C"...,1024) = 1024 (0x400) 20338: 0.470058003 lseek(3,0xf1000,SEEK_SET) = 987136 (0xf1000) 20338: 0.492391062 read(3,"\n\0\0\0\^N\0007\0\0007\0z\0\M-B"...,1024) = 1024 (0x400) 20338: 0.492525369 lseek(4,0x2a50,SEEK_SET) = 10832 (0x2a50) 20338: 0.492680628 write(4,"\0\0\^Tx",4) = 4 (0x4) 20338: 0.492759340 lseek(4,0x2a54,SEEK_SET) = 10836 (0x2a54) 20338: 0.492950148 write(4,"\^B\0\0\0\r\0Y\0\0\0Yd\^C\M-/\^C"...,1024) = 1024 (0x400) 20338: 0.493027673 lseek(4,0x2e54,SEEK_SET) = 11860 (0x2e54) 20338: 0.493124404 write(4,"\M-&"\t\M-K",4) = 4 (0x4) 20338: 0.493204793 lseek(3,0xb3e800,SEEK_SET) = 11790336 (0xb3e800) 20338: 0.555674360 read(3,"\n\0\0\0\r\0}\0\0\M-B\^A\^F\^AV"...,1024) = 1024 (0x400) 20338: 0.555823752 lseek(3,0x2c7800,SEEK_SET) = 2914304 (0x2c7800) 20338: 0.568206547 read(3,"\n\0\0\0\r\0K\0\0\M^V\0\M-W\^A'"...,1024) = 1024 (0x400) 20338: 0.568324720 lseek(4,0x2e58,SEEK_SET) = 11864 (0x2e58) 20338: 0.568437026 write(4,"\0\0\^C\M-E",4) = 4 (0x4) 20338: 0.568511757 lseek(4,0x2e5c,SEEK_SET) = 11868 (0x2e5c) 20338: 0.568666527 write(4,"\n\0\0\0\^N\0007\0\0007\0z\0\M-B"...,1024) = 1024 (0x400) 20338: 0.568742795 lseek(4,0x325c,SEEK_SET) = 12892 (0x325c) 20338: 0.568852168 write(4,"\M-&"\no",4) = 4 (0x4) 20338: 0.568930600 lseek(4,0x3260,SEEK_SET) = 12896 (0x3260) 20338: 0.569028379 write(4,"\0\0\v\^_",4) = 4 (0x4) 20338: 0.569094659 lseek(4,0x3264,SEEK_SET) = 12900 (0x3264) 20338: 0.569235950 write(4,"\n\0\0\0\r\0K\0\0\M^V\0\M-W\^A'"...,1024) = 1024 (0x400) 20338: 0.569302439 lseek(4,0x3664,SEEK_SET) = 13924 (0x3664) 20338: 0.569390022 write(4,"\M-&"\n\^[",4) = 4 (0x4) 20338: 0.569451692 lseek(4,0x3668,SEEK_SET) = 13928 (0x3668) 20338: 0.569534176 write(4,"\0\0,\M-{",4) = 4 (0x4) 20338: 0.569594170 lseek(4,0x366c,SEEK_SET) = 13932 (0x366c) 20338: 0.569721771 write(4,"\n\0\0\0\r\0}\0\0\M-B\^A\^F\^AV"...,1024) = 1024 (0x400) 20338: 0.569782604 lseek(4,0x3a6c,SEEK_SET) = 14956 (0x3a6c) 20338: 0.569873818 write(4,"\M-&"\nW",4) = 4 (0x4) 20338: 0.569953368 lseek(3,0x9000,SEEK_SET) = 36864 (0x9000) 20338: 0.570027610 read(3,"\^B\0\0\0\^S\0\M^D\0\0\0\M-(\M-X"...,1024) = 1024 (0x400) 20338: 0.570122595 lseek(3,0x2a35c00,SEEK_SET) = 44260352 (0x2a35c00) 20338: 0.728753280 read(3,"\^B\0\0\0\^S\0\M^D\^A\0\0\M-/7"...,1024) = 1024 (0x400) 20338: 0.728882977 lseek(3,0x2bcd800,SEEK_SET) = 45930496 (0x2bcd800) 20338: 0.728959175 read(3,"\n\0\0\0\^R\0\M-~\0\0\M-~\^A)\^A"...,1024) = 1024 (0x400) 20338: 0.729059188 lseek(4,0x3a70,SEEK_SET) = 14960 (0x3a70) 20338: 0.729143278 write(4,"\0\0\M-/7",4) = 4 (0x4) 20338: 0.729203063 lseek(4,0x3a74,SEEK_SET) = 14964 (0x3a74) 20338: 0.729328011 write(4,"\n\0\0\0\^R\0\M-~\0\0\M-~\^A)\^A"...,1024) = 1024 (0x400) 20338: 0.729385351 lseek(4,0x3e74,SEEK_SET) = 15988 (0x3e74) 20338: 0.729473212 write(4,"\M-&"\t\M^T",4) = 4 (0x4) 20338: 0.729539772 lseek(3,0x9800,SEEK_SET) = 38912 (0x9800) 20338: 0.729608776 read(3,"\^B\0\0\0\^A\^C\M-G\0\0\0b\^\\^C"...,1024) = 1024 (0x400) 20338: 0.729702783 lseek(3,0x1886c00,SEEK_SET) = 25717760 (0x1886c00) 20338: 0.876752391 read(3,"\^B\^A\M^P\0\^P\09\^A\0\0.I\^A"...,1024) = 1024 (0x400) 20338: 0.876904926 lseek(3,0x2999000,SEEK_SET) = 43618304 (0x2999000) 20338: 1.088693201 read(3,"\^B\0\0\0\^O\0\M-*\0\0\0$l\0\M-*"...,1024) = 1024 (0x400) 20338: 1.088931153 lseek(3,0x2bc1400,SEEK_SET) = 45880320 (0x2bc1400) 20338: 1.089040037 read(3,"\n\0\0\0\^N\^A\^Z\0\^A\^Z\^AN\^A"...,1024) = 1024 (0x400) 20338: 1.089176648 lseek(4,0x3e78,SEEK_SET) = 15992 (0x3e78) 20338: 1.089300199 write(4,"\0\0\M-/\^F",4) = 4 (0x4) 20338: 1.089384009 lseek(4,0x3e7c,SEEK_SET) = 15996 (0x3e7c) 20338: 1.089572373 write(4,"\n\0\0\0\^N\^A\^Z\0\^A\^Z\^AN\^A"...,1024) = 1024 (0x400) 20338: 1.089654927 lseek(4,0x427c,SEEK_SET) = 17020 (0x427c) 20338: 1.089771773 write(4,"\M-&"\n\^D",4) = 4 (0x4) 20338: 1.089882891 lseek(4,0x4280,SEEK_SET) = 17024 (0x4280) 20338: 1.089998899 write(4,"\0\0\0\^A",4) = 4 (0x4) 20338: 1.090082290 lseek(4,0x4284,SEEK_SET) = 17028 (0x4284) 20338: 1.090246000 write(4,"SQLite format 3\0\^D\0\^A\^A\0@ "...,1024) = 1024 (0x400) 20338: 1.090328135 lseek(4,0x4684,SEEK_SET) = 18052 (0x4684) 20338: 1.090441349 write(4,"\M-&"\t3",4) = 4 (0x4) 20338: 1.090547439 lseek(4,0x4688,SEEK_SET) = 18056 (0x4688) 20338: 1.090660234 write(4,"\0\0\M-.'",4) = 4 (0x4) 20338: 1.090750959 lseek(4,0x468c,SEEK_SET) = 18060 (0x468c) 20338: 1.090939462 write(4,"\^E\0\0\0]\^Au\0\0\0\M-/\^?\^A"...,1024) = 1024 (0x400) 20338: 1.091030327 lseek(4,0x4a8c,SEEK_SET) = 19084 (0x4a8c) 20338: 1.091149966 write(4,"\M-&"\n1",4) = 4 (0x4) 20338: 1.091271003 lseek(4,0x4a90,SEEK_SET) = 19088 (0x4a90) 20338: 1.091398535 write(4,"\0\0\0\^F",4) = 4 (0x4) 20338: 1.091485069 lseek(4,0x4a94,SEEK_SET) = 19092 (0x4a94) 20338: 1.091673573 write(4,"\r\0\0\0\^B\^C\M-g\0\^C\M-s\^C"...,1024) = 1024 (0x400) 20338: 1.091756336 lseek(4,0x4e94,SEEK_SET) = 20116 (0x4e94) 20338: 1.091875836 write(4,"\M-&"\b\M-)",4) = 4 (0x4) 20338: 1.091975011 fcntl(3,F_SETLK,0x7fffffffcf80) = 0 (0x0) 20338: 1.092062384 fcntl(3,F_SETLK,0x7fffffffcf80) = 0 (0x0) 20338: 1.092152550 lseek(4,0x5000,SEEK_SET) = 20480 (0x5000) 20338: 1.092241040 read(4,0x7fffffffcf10,8) = 0 (0x0) 20338: 1.092337352 fsync(0x4,0x0,0x0,0x0,0x1,0x0) = 0 (0x0) Here is a stack trace on problem time http://pastebin.com/raw.php?i=T1iwhuEG Thanks, Levent -- http://lserinol.blogspot.com/