From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 07:48:31 2007
From: Ighighi
Date: Mon, 24 Sep 2007 03:23:18 -0400
To: freebsd-fs@freebsd.org
Subject: nmount() version of mount_ntfs(8)

Is anybody working on an nmount() version of mount_ntfs(8) to complete the
transition of the mount_xxx() tools, or does no such plan exist?  I'm not sure
where I've read that these tools are to be merged into one in some BSD.
Anyway, I could do the work...

I also added support for a dirmask option in NTFS, as used by MSDOSFS, which I
tested on 6.2-STABLE.  If this feature is recognized as being as useful as it
is in msdosfs(5), I could make the newer nmount()-based mount_ntfs(8)
understand it as well, so it could debut in 7.0.
Details in:
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/114847

Can any developer with commit rights apply the patch for the NTFS bug in
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/114856 ?

Regards,
Igh.
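For context, this is the split that the dirmask option already gives msdosfs,
which the patch mirrors for NTFS; a minimal sketch (the device and mount point
are hypothetical):

# -m sets the permission mask applied to files, -M the mask applied to
# directories; the proposed NTFS option would allow the same distinction.
mount_msdosfs -m 0644 -M 0755 /dev/da0s1 /mnt/usb
ls -ld /mnt/usb    # directories appear as mode 0755
ls -l /mnt/usb     # files appear as mode 0644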
From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 11:08:20 2007
From: FreeBSD bugmaster
Date: Mon, 24 Sep 2007 11:08:18 GMT
To: freebsd-fs@FreeBSD.org
Subject: Current problem reports assigned to you

Current FreeBSD problem reports

Critical problems

Serious problems
S Tracker     Resp. Description
--------------------------------------------------------------------------------
o kern/112658 fs    [smbfs] [patch] smbfs and caching problems (resolves b
o kern/114676 fs    [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o kern/114856 fs    [ntfs] [patch] Bug in NTFS allows bogus file modes.
o kern/116170 fs    Kernel panic when mounting /tmp

4 problems total.

Non-critical problems
S Tracker     Resp. Description
--------------------------------------------------------------------------------
o kern/114847 fs    [ntfs] [patch] dirmask support for NTFS ala MSDOSFS

1 problem total.

From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 17:13:51 2007
From: Randy Bush
Date: Mon, 24 Sep 2007 07:03:19 -1000
To: freebsd-fs@FreeBSD.ORG
Subject: zfs in production?

we are thinking of using zfs on a production server, using gmirror for
booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.

but we would like to hear from folk using zfs in production for any
length of time, as we do not really have the resources to be pioneers.

thanks.

randy

From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 21:29:26 2007
From: Barry Pederson
Date: Mon, 24 Sep 2007 15:46:31 -0500
To: Randy Bush
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: zfs in production?

Randy Bush wrote:
> we are thinking of using zfs on a production server, using gmirror for
> booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.
>
> but we would like to hear from folk using zfs in production for any
> length of time, as we do not really have the resources to be pioneers.
>
> thanks.
>
> randy

I've set up a few machines now using a CompactFlash device for booting,
plugged straight into the motherboard with a CF-IDE adapter, and then having
zfs-on-root with the actual hard disks 100% controlled by ZFS (no gmirror or
slices otherwise).  One machine is a zfs mirror and the other is an 8-disk
raidz2.

The CF hardware is only $30 or so, and it's nice not to have to deal with two
different mirroring systems.  A bonus is that CF devices are so large nowadays
that it's convenient to just have a complete installation of FreeBSD on one
and be able to use it as an emergency recovery system just by entering
"vfs.root.mountfrom=ufs:ad0s1a" at the loader.

I've also found it works well to name the disks using glabel and add them to
the pool using the glabel names, to eliminate uncertainty as to which disk
exactly you're offlining or seeing errors from (especially with SAS-connected
drives, where the /dev/da name doesn't correspond to a particular physical
port).
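A rough sketch of that labelling approach, with hypothetical disk names and a
simple mirror rather than the exact layouts described above:

# give each disk a persistent name, independent of probe order
glabel label zdisk0 /dev/da0
glabel label zdisk1 /dev/da1

# build the pool from the labels rather than the raw da numbers
zpool create tank mirror label/zdisk0 label/zdisk1
zpool status tank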
	Barry

From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 23:56:06 2007
From: Kris Kennaway
Date: Tue, 25 Sep 2007 01:56:04 +0200
To: Randy Bush
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: zfs in production?

Randy Bush wrote:
> we are thinking of using zfs on a production server, using gmirror for
> booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.
>
> but we would like to hear from folk using zfs in production for any
> length of time, as we do not really have the resources to be pioneers.
>
> thanks.
>
> randy

I use it on a couple of heavily loaded servers.  The only issues are those I
have posted about on current before.
Kris

From owner-freebsd-fs@FreeBSD.ORG Tue Sep 25 01:23:50 2007
From: Dmitry Marakasov
Date: Tue, 25 Sep 2007 04:02:54 +0400
To: freebsd-fs@freebsd.org
Subject: Shooting yourself in the foot with ZFS: is quite easy

Hi!

I'm just playing with ZFS in qemu, and I think I've found a bug in the logic
which can lead to a shoot-yourself-in-the-foot condition, and which can be
avoided.

First of all, I constructed a raidz array:

---
# mdconfig -a -tswap -s64m
md0
# mdconfig -a -tswap -s64m
md1
# mdconfig -a -tswap -s64m
md2
# zpool create pool raidz md{0,1,2}
---

Next, I brought one of the devices offline and rewrote part of it.  Let's
imagine I needed some space in an emergency situation.

---
# zpool offline pool md0
Bringing device md0 offline
# zpool status
...
        NAME        STATE     READ WRITE CKSUM
        pool        DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            md0     OFFLINE      0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
...
# dd if=/dev/zero of=/dev/md0 bs=1m count=1
1+0 records in
1+0 records out
1048576 bytes transferred in 0.084011 secs (12481402 bytes/sec)
---

Now, how do I put md0 back into the pool?  `zpool online pool md0' seems
reasonable, and the pool will recover itself on scrub, but I'm paranoid and I
want to recreate the data on md0 completely.  But:

---
# zpool replace pool md0
cannot replace md0 with md0: md0 is busy
# zpool replace -f pool md0
cannot replace md0 with md0: md0 is busy
---

It seems it looks at the on-disk data (the remains of ZFS) and thinks the
device is still used in the pool, because if I erase the whole device with dd,
it treats md0 as a new disk and replaces it without problems:

---
# dd if=/dev/zero of=/dev/md0 bs=1m
dd: /dev/md0: end of device
65+0 records in
64+0 records out
67108864 bytes transferred in 10.154127 secs (6609023 bytes/sec)
# zpool replace pool md0
# zpool status
...
        NAME           STATE     READ WRITE CKSUM
        pool           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            replacing  DEGRADED     0     0     0
              md0/old  OFFLINE      0     0     0
              md0      ONLINE       0     0     0
            md1        ONLINE       0     0     0
            md2        ONLINE       0     0     0
...
# zpool status
...
        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
...
---

This behaviour is, I think, undesired: one should be able to replace an
offline device with itself at any time.

Which is worse:

---
# zpool offline pool md0
Bringing device md0 offline
# dd if=/dev/zero of=/dev/md0 bs=1m
dd: /dev/md0: end of device
65+0 records in
64+0 records out
67108864 bytes transferred in 8.076568 secs (8309082 bytes/sec)
# zpool online pool md0
Bringing device md0 online
# zpool status
  pool: pool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Mon Sep 24 23:21:49 2007
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     UNAVAIL      0     0     0  corrupted data
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors
# zpool replace pool md0
invalid vdev specification
use '-f' to override the following errors:
md0 is in use (r1w1e1)
# zpool replace -f pool md0
invalid vdev specification
the following errors must be manually repaired:
md0 is in use (r1w1e1)
# zpool scrub pool
# zpool status
  pool: pool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Mon Sep 24 23:22:22 2007
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     UNAVAIL      0     0     0  corrupted data
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors
# zpool offline md0
missing device name
usage:
        offline [-t] ...
# zpool offline pool md0
cannot offline md0: no valid replicas
# mdconfig -du0
mdconfig: ioctl(/dev/mdctl): Device busy
---

This is very confusing: md0 is UNAVAIL, but the table says the pool is ONLINE
(not DEGRADED!), though the status text says it's degraded.  Still, I can
neither bring the device offline nor replace it with itself (though replacing
it with an equal md3 worked).

My opinion is that such a situation should be avoided.  First of all, the
zpool behaviour with one of the disks in the UNAVAIL state seems to be a clear
bug (the array is shown as ONLINE, the unavailable device cannot be brought
offline, etc.).  Also, ZFS should not trust any on-disk contents after
bringing a disk online.  The best solution is to completely recreate the ZFS
data structures on the disk in such a case.  This should solve these cases:

1) `zpool replace` won't say that the offline disk is busy
2) One won't need to clear the disk with dd to recreate it
3) `zpool online` won't lead to the UNAVAIL state.
4) I think there could be more potential problems with the current behaviour:
   for example, what happens if I replace a disk in a raidz with another disk
   that was used in another raidz before?

As I understand it, currently `zpool offline`/`zpool online` on a disk leads
to its resilvering anyway?
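For completeness, a sketch of the workaround that did work here -- replacing
the wiped member with a fresh md device (md3 is assumed to be unused):

---
# mdconfig -a -tswap -s64m
md3
# zpool replace pool md0 md3
# zpool status pool
(md3 resilvers and md0 is dropped from the pool)
---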
--
Best regards,
Dmitry Marakasov   mailto:amdmi3@amdmi3.ru

From owner-freebsd-fs@FreeBSD.ORG Tue Sep 25 08:56:28 2007
From: Dag-Erling Smørgrav
Date: Tue, 25 Sep 2007 10:56:22 +0200
To: Randy Bush
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: zfs in production?

Randy Bush writes:
> we are thinking of using zfs on a production server, using gmirror for
> booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.
>
> but we would like to hear from folk using zfs in production for any
> length of time, as we do not really have the resources to be pioneers.

Works fine, but if using SATA, avoid Promise controllers.
DES
--
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-fs@FreeBSD.ORG Tue Sep 25 10:12:16 2007
From: Bernd Walter
Date: Tue, 25 Sep 2007 12:12:04 +0200
To: Dag-Erling Smørgrav
Cc: Randy Bush, freebsd-fs@freebsd.org
Subject: Re: zfs in production?

On Tue, Sep 25, 2007 at 10:56:22AM +0200, Dag-Erling Smørgrav wrote:
> Randy Bush writes:
> > we are thinking of using zfs on a production server, using gmirror for
> > booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.
> >
> > but we would like to hear from folk using zfs in production for any
> > length of time, as we do not really have the resources to be pioneers.
>
> Works fine, but if using SATA, avoid Promise controllers.

It is worse:

  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad4     ONLINE       0     0     5
            ad6     ONLINE       0     0     8
            ad8     ONLINE       0     0    11

These are WDC WD3200AAKS-00SBA0/12.01B01 drives connected to an SIL3114.
The system is amd64 from the 26th of June, on a Core 2 Quad with ECC RAM.
My home system uses the same controller on i386/P3 and has no checksum
errors - it is running source from the 12th of July.
Considering that I've seen lots of silent data corruption with PATA disks on
alpha during the last few years, I'm not so sure the problem depends on a
specific controller; it may be more a matter of timing or some such.
It is easy to blame the controller, especially since SIL isn't known for
quality, but in this case I believe it is our problem somehow.

--
B.Walter                http://www.bwct.de      http://www.fizon.de
bernd@bwct.de           info@bwct.de            support@fizon.de

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 03:19:02 2007
From: "Rick C. Petty"
Date: Tue, 25 Sep 2007 22:12:20 -0500
To: Bruce Evans
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?

On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
>
> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
> size of only 3K, so there is not much to be gained by tuning -i but

I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
183.77 MB.. that's a 22% difference.  However, the biggest gain from tuning
-i is the loss of the extra (unused) inodes.  Care should be used with the -i
option -- running out of inodes when you have gigs of free space could be
very frustrating.  But I newfs all my volumes knowing an approximate inode
density based on already-existing files and a minor fudge factor.  The only
time I ran out of inodes with this method was due to a calculation error on
my part.

> more to be gained by tuning -b and -f (several doublings are reasonable).

I completely agree with this.  It's unfortunate that newfs doesn't scale
the defaults here based on the device size.  Before someone dives in and
commits any adjustments, I hope they do sufficient testing and post their
results on this mailing list.

--
Rick C. Petty
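As an illustration of estimating that density, a rough sketch (the mount
point, numbers, and the factor-of-two fudge below are hypothetical):

# df -i on an existing, representative filesystem shows used space and
# used inodes; divide to get the bytes per inode actually needed.
df -i /big/media
# Suppose it reports roughly 400 GB used and 80000 inodes used:
#   400e9 / 80000 ~= 5 MB per inode; halve it as a fudge factor.
newfs -U -i 2500000 /dev/ad6s1e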
From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 03:30:41 2007
From: "Rick C. Petty"
Date: Tue, 25 Sep 2007 22:03:58 -0500
To: Ivan Voras
Cc: freebsd-fs@freebsd.org
Subject: Re: Writing contigiously to UFS2?

On Fri, Sep 21, 2007 at 02:45:35PM +0200, Ivan Voras wrote:
> Stefan Esser wrote:
>
> From experience (not from reading code or the docs) I conclude that
> cylinder groups cannot be larger than around 190 MB.  I know this from
> numerous runnings of newfs and during development of gvirstor, which
> interacts with cg in an "interesting" way.

Then you didn't run newfs enough:

# newfs -N -i 12884901888 /dev/gvinum/mm-flac
density reduced from 2147483647 to 3680255
/mm/flac: 196608.0MB (402653184 sectors) block size 16384, fragment size 2048
        using 876 cylinder groups of 224.50MB, 14368 blks, 64 inodes.

When you specify the -i option to newfs, it will minimize the number of
inodes created.  If the density value is high enough, it will use only one
block of inodes per CG (the minimum).  From there, the density is reduced
(as per the message above) and the CG size is increased until the frag bitmap
can fit into a single block.  With UFS2 and the default options of -b 16384
-f 2048, this gives you 224.50 MB per CG.

If you wish to play around with the block/frag sizes, you can greatly
increase the CG size:

# newfs -N -f 8192 -b 65536 -i 12884901888 /dev/gvinum/mm-flac
density reduced from 2147483647 to 14868479
/mm/flac: 196608.0MB (402653184 sectors) block size 65536, fragment size 8192
        using 55 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.

Doing this is quite appropriate for large disks.  This last command means:
blocks are allocated in 64k chunks and the minimum allocation size is 8k.
Some may say this is wasteful, but one could also argue that using less
than 10% of your inodes is also wasteful.

> I know the reasons why cgs
> exist (mainly to lower latencies from seeking) but with todays drives

I don't believe that is true.  CGs exist to prevent complete data loss if the
front of the disk is trashed.  The blocks and inodes have close proximity
partly for lower latency, but also to reduce corruption risk.  It is
suggested that the CG offsets are staggered to make best use of rotational
delay, but this is obviously irrelevant with modern drives.

> and memory configurations it would sometimes be nice to make them larger
> or in the extreme, make just one cg that covers the entire drive.

And put it in the middle of the drive, not at the front.  Gee, this is what
NTFS does.. Hmm...  There are significant advantages to staggering the CGs
across the device (or, in the case of some GEOM providers, across the
provider).

Here might be an interesting experiment to try.  Write a new version of
/usr/src/sbin/newfs/mkfs.c that doesn't have the restriction that the free
fragment bitmap resides in one block.  I'm not 100% sure if the FFS code
would handle it properly, but in theory it should work (the offsets are
stored in the superblocks).  This is the biggest restriction on the CG
size.  You should be able to create 2-4 CGs to span each of your 1TB
drives without increasing the block size and thus minimum allocation unit.

--
Rick C. Petty

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 07:59:35 2007
From: Bruce Evans
Date: Wed, 26 Sep 2007 17:59:24 +1000 (EST)
To: "Rick C. Petty"
Cc: freebsd-fs@FreeBSD.org, Ivan Voras
Subject: Re: Writing contigiously to UFS2?

On Tue, 25 Sep 2007, Rick C. Petty wrote:

> On Fri, Sep 21, 2007 at 02:45:35PM +0200, Ivan Voras wrote:
>> Stefan Esser wrote:
>>
>> From experience (not from reading code or the docs) I conclude that
>> cylinder groups cannot be larger than around 190 MB.  I know this from
>> numerous runnings of newfs and during development of gvirstor which
>> interacts with cg in an "interesting" way.
>
> Then you didn't run newfs enough:
>
> # newfs -N -i 12884901888 /dev/gvinum/mm-flac
> density reduced from 2147483647 to 3680255
> /mm/flac: 196608.0MB (402653184 sectors) block size 16384, fragment size 2048
>         using 876 cylinder groups of 224.50MB, 14368 blks, 64 inodes.

That's insignificantly more.  Even doubling the size wouldn't make much
difference.
I see differences of at most 25% going the other way and halving the block
size twice, which halves the cg size 4 times: on ffs1:

4K blocks, 512-frags -e 512 (broken default): 40MB/S
4K blocks, 512-frags -e 1024 (broken default): 44MB/S
4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
    (kernel fixes are not complete enough to handle this case;
    defaults and -e values which are < the cg size work best except
    possibly when the fixes are complete): 45MB/S
16K blocks, 2K-frags -e 2K (broken default): 50MB/S
16K blocks, 2K-frags -e 4K (fixed default): 50.5MB/S
16K blocks, 2K-frags -e 8K (best): 51.5MB/S
16K blocks, 2K-frags -e 64K (try too hard): < 51MB/S again

Getting a 3% improvement just by avoiding a seek or 2 every cg is very
surprising for 16K-blocks with 2K frags.  There has to be a seek for every
cg, and bugs give 2 seeks.  However, with -e 2K, that is only 2 extra seeks
every 2048 blocks, where the block size is large, so I would have expected an
improvement of at most 2 in 2048.  The access pattern is probably confusing
the drive's cache (it's an old ATA drive with only 2MB cache).

> If you wish to play around with the block/frag sizes, you can greatly
> increase the CG size:
>
> # newfs -N -f 8192 -b 65536 -i 12884901888 /dev/gvinum/mm-flac
> density reduced from 2147483647 to 14868479
> /mm/flac: 196608.0MB (402653184 sectors) block size 65536, fragment size 8192
>         using 55 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.
>
> Doing this is quite appropriate for large disks.  This last command means:
> blocks are allocated in 64k chunks and the minimum allocation size is 8k.
> Some may say this is wasteful, but one could also argue that using less
> than 10% of your inodes is also wasteful.

Both are wasteful.  The kernel buffer cache is tuned for 16K-blocks.
64K-blocks cause either resource contention (if you don't tune BKVASIZE) or
bogusly reduced resources (if you do tune it without fixing other really
arcane parameters (wrong magic numbers in source code...)).  There is lots of
FUD about block sizes larger than 16K causing bugs, but I haven't seen any
problems from them except slowness.  64K-blocks also cause slowness in
general because they are just too big, but this shouldn't be a problem if
most files are large.

> Here might be an interesting experiment to try.  Write a new version of
> /usr/src/sbin/newfs/mkfs.c that doesn't have the restriction that the free
> fragment bitmap resides in one block.  I'm not 100% sure if the FFS code
> would handle it properly, but in theory it should work (the offsets are
> stored in the superblocks).  This is the biggest restriction on the CG
> size.  You should be able to create 2-4 CGs to span each of your 1TB
> drives without increasing the block size and thus minimum allocation unit.

In theory it won't work.  From fs.h:

%%%
/*
 * The size of a cylinder group is calculated by CGSIZE. The maximum size
 * is limited by the fact that cylinder groups are at most one block.
 * Its size is derived from the size of the maps maintained in the
 * cylinder group and the (struct cg) size.
 */
%%%

Only offsets to the inode blocks, etc. are stored in the superblock.
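For anyone wanting to check these numbers on an existing filesystem, dumpfs
prints the cg geometry that newfs chose; a quick sketch (the device name is
hypothetical):

# bsize, fsize, cgsize and the per-cg block/inode counts are near the top
# of the superblock dump
dumpfs /dev/ad0s1f | head -20

# newer versions can also print the equivalent newfs command line
dumpfs -m /dev/ad0s1f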
Bruce

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 08:37:22 2007
From: Bruce Evans
Date: Wed, 26 Sep 2007 18:37:18 +1000 (EST)
To: "Rick C. Petty"
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?

On Tue, 25 Sep 2007, Rick C. Petty wrote:

> On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
>>
>> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
>> size of only 3K, so there is not much to be gained by tuning -i but
>
> I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
> 183.77 MB.. that's a 22% difference.

That's a 22% reduction in seeks where the cost of seeking every 187MB
is a few mS every second.  Say the disk speed is 61MB/S and the seek cost
is 15 mS.  Then we waste 15 mS every 3 seconds with 183 MB cg's, or 2%.
After saving 22%, we waste only 1.8%.

These estimates are consistent with numbers I gave in previous mail.  With
the broken default of -e 2048 for 16K-blocks for ffs1, there was an
unnecessary seek or 2 after only every 32MB.  The disk speed was 52 MB/S
(disk manufacturer's MB = 10^6 B).  -e 2048 gave 50 MB/S and -e 8192 gave
51.5 MB/S.  (52 MB/S was measured on the raw disk using dd.  The raw disk
tends to actually be slower than the file system due to not streaming.)
Seeking after every 32MB (real MB) gives a seek every 645 mS, so if 2 seeks
take 15 mS each the wastage was 4.7%, so it was not surprising to get a
speedup of 3% using -e 8192.  Since I got to within 1% of the raw disk
speed, there is little more to be gained in speed here.  (The OP's problem
was not speed.)

(All this is for the benchmark "dd if=/dev/zero of=zz bs=1m count=N" where
N = 200 or 1000.)

>> more to be gained by tuning -b and -f (several doublings are reasonable).
>
> I completely agree with this.  It's unfortunate that newfs doesn't scale
> the defaults here based on the device size.  Before someone dives in and
> commits any adjustments, I hope they do sufficient testing and post their
> results on this mailing list.

Testing shows that only one doubling of -b and -f is reasonable for /usr/src,
but it makes little difference, so nothing should be changed.  I'm still
trying to make halving -b and -f back to 512/512 work right, so that it has
the same disk speed as any/any, using contiguous layout and clustering so
that physical disk i/o sizes are independent of the fs block sizes unless
small i/o sizes are sufficient.  Clustering already almost does this for data
blocks, provided the allocator manages to do a contiguous layout.  Clustering
already wastes a lot of CPU doing this by brute force, but CPU is relatively
free.

Bruce

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 12:08:44 2007
From: Bruce Evans
Date: Wed, 26 Sep 2007 22:08:37 +1000 (EST)
To: Bruce Evans
Cc: freebsd-fs@FreeBSD.org, "Rick C. Petty", Ivan Voras
Subject: Re: Writing contigiously to UFS2?

On Wed, 26 Sep 2007, I wrote:

> ... Even doubling the [block] size wouldn't make much
> difference.  I see differences of at most 25% going the other way and
> halving the block size twice, which halves the cg size 4 times: on ffs1:
>
> 4K blocks, 512-frags -e 512 (broken default): 40MB/S
> 4K blocks, 512-frags -e 1024 (broken default): 44MB/S
> 4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
> 4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
>     (kernel fixes are not complete enough to handle this case;
>     defaults and -e values which are < the cg size work best except
>     possibly when the fixes are complete): 45MB/S

[Max possible is 52 MB/S.  1MB = 10^6 bytes.]

All of these must have been with some kernel fixes.  Retesting with -current
gives 33MB/S with -e 512 and a max of 42MB/S with each of -e 1024, 2048 and
3072.  Reducing the -e parameter below 512 gives surprisingly (if you don't
remember how slow seeks can be) large further losses.  E.g., -e 128 gives
21MB/S and the following bad layout:

% fs_bsize = 4096
% fs_fsize = 512
% [bpg = 3240, maxbpg = 128]
% 4: lbn 0-11 blkno 624-719
% lbn [<1>indir]12-1035 blkno 608-615

The first indirect block is laid out discontiguously.  This is a standard
bug in reallocblks.

% lbn 12-127 blkno 720-1647

The blocks pointed to by the first indirect block are laid out contiguously
with the last direct block.  Reallocblks does extra work to move the indirect
block out of the way so that only the data blocks are contiguous.

% lbn 128-255 blkno 1744-2767
% lbn 256-383 blkno 2864-3887

There is a bug after every maxbpg = 128 blocks.  Due to a hack,
ffs_blkpref_ufs1() handles the blocks pointed to by the first indirect block
specially (maxbpg doesn't work for them), and due to a bug somewhere, it
leaves a gap of 2864-2768 = 96 blkno's (frags) after every maxbpg blocks.

% lbn 384-511 blkno 3984-5007
% lbn 512-639 blkno 5104-6127
% lbn 640-767 blkno 6224-7247
% lbn 768-895 blkno 7344-8367
% lbn 896-1035 blkno 8464-9583

It keeps leaving gaps of 96 blkno's until the end of the indirect block.

% lbn [<2>indir]1036-1049611 blkno 207368-207375
% lbn [<1>indir]1036-2059 blkno 207376-207383
% lbn 1036-1163 blkno 207824-208847

Now ffs_blkpref_ufs1() skips 7 cg's (1 cg = 3240 blocks = 8*3240 = 25920
frags) because it doesn't know the current cg and its best guess is off by a
factor of 8, due to our maxbpg being weird by a factor of 8.  Normally it
only skips 1 cg, due to the default maxbpg being wrong by a factor of 2.

% lbn 1164-1291 blkno 259664-260687
% lbn 1292-1419 blkno 311504-312527
% lbn 1420-1547 blkno 363344-364367
% lbn 1548-1675 blkno 415184-416207
% lbn 1676-1803 blkno 467024-468047
% lbn 1804-1931 blkno 518864-519887
% lbn 1932-2059 blkno 570704-571727

Indirect blocks after the first are not handled specially, so a new cg is
preferred after every maxbpg blocks, and some other bug causes 1 cg to be
skipped for every maxbpg blocks.

% ...

The pattern continues for subsequent indirect blocks.  The layout is only
unusual for the first one, so the pessimizations for the first one have
little effect for large files -- for large files, the speed is dominated by
seeking every 128 blocks, as requested by -e 128.

Bruce

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 17:10:56 2007
From: "Rick C. Petty"
Date: Wed, 26 Sep 2007 12:10:54 -0500
To: Bruce Evans
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?

On Wed, Sep 26, 2007 at 05:59:24PM +1000, Bruce Evans wrote:
> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>
> That's insignificantly more.
> Even doubling the size wouldn't make much
> difference.  I see differences of at most 25% going the other way and

Some would say that a 25% difference is significant.  Obviously you disagree.

> 4K blocks, 512-frags -e 512 (broken default): 40MB/S
> 4K blocks, 512-frags -e 1024 (broken default): 44MB/S
> 4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
> 4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
>     (kernel fixes are not complete enough to handle this case;
>     defaults and -e values which are < the cg size work best except
>     possibly when the fixes are complete): 45MB/S
> 16K blocks, 2K-frags -e 2K (broken default): 50MB/S
> 16K blocks, 2K-frags -e 4K (fixed default): 50.5MB/S
> 16K blocks, 2K-frags -e 8K (best): 51.5MB/S
> 16K blocks, 2K-frags -e 64K (try too hard): < 51MB/S again

Are you talking about throughputs now?  I was just talking about space.
Time and space are usually mutually-exclusive optimizations.

> > Here might be an interesting experiment to try.  Write a new version of
> > /usr/src/sbin/newfs/mkfs.c that doesn't have the restriction that the free
> > fragment bitmap resides in one block.  I'm not 100% sure if the FFS code
> > would handle it properly, but in theory it should work (the offsets are
> > stored in the superblocks).  This is the biggest restriction on the CG
> > size.  You should be able to create 2-4 CGs to span each of your 1TB
> > drives without increasing the block size and thus minimum allocation unit.
>
> In theory it won't work.  From fs.h:
>
> %%%
> /*
>  * The size of a cylinder group is calculated by CGSIZE. The maximum size
>  * is limited by the fact that cylinder groups are at most one block.
>  * Its size is derived from the size of the maps maintained in the
>  * cylinder group and the (struct cg) size.
>  */
> %%%

Debug code, not comments!  :-P

> Only offsets to the inode blocks, etc. are stored in the superblock.

Yes, the offset to the cylinder group block and the offset to the inode
block are in the superblock (struct fs).  It wouldn't be too difficult to
tweak the ffs code to read in CGs larger than one block, by checking the
difference between fs_iblkno and fs_cblkno.  I'm saying it's theoretically
possible, although it will require tweaks in the ffs code.  Again, I think
it's worth investigating, especially if you believe there are performance
penalties for having block sizes greater than the kernel buffer size.

--
Rick C. Petty

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 17:17:58 2007
From: "Rick C. Petty"
Date: Wed, 26 Sep 2007 12:17:56 -0500
To: Bruce Evans
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?

On Wed, Sep 26, 2007 at 06:37:18PM +1000, Bruce Evans wrote:
> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>
> > On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
> >>
> >> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
> >> size of only 3K, so there is not much to be gained by tuning -i but
> >
> > I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
> > 183.77 MB.. that's a 22% difference.
>
> That's a 22% reduction in seeks where the cost of seeking every 187MB
> is a few mS every second.  Say the disk speed is 61MB/S and the seek cost
> is 15 mS.  Then we waste 15 mS every 3 seconds with 183 MB cg's, or 2%.
> After saving 22%, we waste only 1.8%.

I'm not sure why this discussion has moved into speed/performance
comparisons.  I'm saying there's a 22% difference in CG size.

> Since I
> got to within 1% of the raw disk speed, there is little more to be
> gained in speed here.  (The OP's problem was not speed.)

I agree -- why are you discussing speed?  I mean, it's interesting.  But I
was only discussing CG sizes and suggesting using the inode density option
to reduce the amount of space "wasted" with filesystem metadata.

I do think the performance differences are interesting, but how much of
those differences become irrelevant when looking at modern drives with
tagged queuing, large I/O caches, and reordered block operations?

--
Rick C. Petty

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 20:06:58 2007
From: Bruce Evans
Date: Thu, 27 Sep 2007 06:06:29 +1000 (EST)
To: "Rick C. Petty"
Cc: freebsd-fs@freebsd.org
Subject: Re: Writing contigiously to UFS2?

On Wed, 26 Sep 2007, Rick C. Petty wrote:

> On Wed, Sep 26, 2007 at 05:59:24PM +1000, Bruce Evans wrote:
>> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>>
>> That's insignificantly more.  Even doubling the size wouldn't make much
>> difference.  I see differences of at most 25% going the other way and
>
> Some would say that a 25% difference is significant.  Obviously you disagree.

No, 25% is significant, but it takes intentional mistuning, combined with no
attempt to optimize the mistuned case and with bugs in the general case that
are more harmful for the mistuned case, to get as much as 25%.

>> 4K blocks, 512-frags -e 512 (broken default): 40MB/S
>> 4K blocks, 512-frags -e 1024 (broken default): 44MB/S

er, fixed default

>> 4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
>> 4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
>>     (kernel fixes are not complete enough to handle this case;
>>     defaults and -e values which are < the cg size work best except
>>     possibly when the fixes are complete): 45MB/S
>> 16K blocks, 2K-frags -e 2K (broken default): 50MB/S
>> 16K blocks, 2K-frags -e 4K (fixed default): 50.5MB/S
>> 16K blocks, 2K-frags -e 8K (best): 51.5MB/S
>> 16K blocks, 2K-frags -e 64K (try too hard): < 51MB/S again

64K-blocks, 8K-frags, -e barely matters: close to the max of 52 MB/S

(I was able to create a perfectly contiguous (modulo indirect blocks, which
were allocated as contiguously as possible) file of size 1GB on a fs with a
cg size of almost 2GB.  A second file would not have been allocated so well,
since it would be started on the same cg as the directory inode = same cg as
the first file.)

> Are you talking about throughputs now?  I was just talking about space.
> Time and space are usually mutually-exclusive optimizations.

These are all throughputs, starting with a new file system.  Since it's a new
file system with defaults for most parameters, it has the usual space/time
tuning (-m 8 -o time), but normal space/time tuning doesn't apply for huge
files anyway, since there are no normal fragments.

>> ...
>>> size.  You should be able to create 2-4 CGs to span each of your 1TB
>>> drives without increasing the block size and thus minimum allocation unit.
>>
>> In theory it won't work.  From fs.h:
>> ...
>> Only offsets to the inode blocks, etc. are stored in the superblock.
>
> Yes, the offset to the cylinder group block and the offset to the inode
> block are in the superblock (struct fs).  It wouldn't be too difficult to
> tweak the ffs code to read in CGs larger than one block, by checking the
> difference between fs_iblkno and fs_cblkno.  I'm saying it's theoretically
> possible, although it will require tweaks in the ffs code.  Again, I think
> it's worth investigating, especially if you believe there are performance
> penalties for having block sizes greater than the kernel buffer size.

But then it won't be binary compatible.  The performance penalties are easier
to fix (they should just never have existed on 64-bit platforms).

My main point here is that small cylinder groups alone are not a problem for
large files, provided they are not too small.  They cost a few percent in the
best cases.  In the worst cases, this loss is in the noise.

Bruce

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 20:20:05 2007
From: Bruce Evans
Date: Thu, 27 Sep 2007 06:19:47 +1000 (EST)
To: "Rick C. Petty"
Cc: freebsd-fs@freebsd.org
Subject: Re: Writing contigiously to UFS2?

On Wed, 26 Sep 2007, Rick C. Petty wrote:

> On Wed, Sep 26, 2007 at 06:37:18PM +1000, Bruce Evans wrote:
>> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>>
>>> On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
>>>>
>>>> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
>>>> size of only 3K, so there is not much to be gained by tuning -i but
>>>
>>> I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
>>> 183.77 MB.. that's a 22% difference.
>>
>> That's a 22% reduction in seeks where the cost of seeking every 187MB
>> is a few mS every second.  Say the disk speed is 61MB/S and the seek cost
>> is 15 mS.  Then we waste 15 mS every 3 seconds with 183 MB cg's, or 2%.
>> After saving 22%, we waste only 1.8%.
>
> I'm not sure why this discussion has moved into speed/performance
> comparisons.  I'm saying there's a 22% difference in CG size.

Size is uninteresting except where it affects speed.  "-i large" saves some
disk space, but not 22%, and disk space is almost free.  "-b large -f large"
costs disk space.

>> Since I
>> got to within 1% of the raw disk speed, there is little more to be
>> gained in speed here.  (The OP's problem was not speed.)
>
> I agree -- why are you discussing speed?  I mean, it's interesting.  But I
> was only discussing CG sizes and suggesting using the inode density option
> to reduce the amount of space "wasted" with filesystem metadata.
The OP's problem was that, due to an apparently untuned maxbpg and/or maxbpg not actually working, data was scattered over all cg's and thus over all disks when it was expected/wanted to be packed into a small number of disks. Packing into a large number of small cg's should give the same effect on the number of disks used as packing into a small number of large cg's, but apparently doesn't, due to the untuned maxbpg and/or bugs. > I do think the performance differences are interesting, but how much of the > differences are irrelevant when looking at modern drives with tagged > queuing, large I/O caches, and reordered block operations? It depends on how big the seeks are (except a really modern drive would be RAM with infinitely fast seeks :-). I think any large-enough cylinder group would be large enough for the seek time to be significant. Bruce From owner-freebsd-fs@FreeBSD.ORG Thu Sep 27 07:59:48 2007 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2423216A417 for ; Thu, 27 Sep 2007 07:59:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id C240F13C457 for ; Thu, 27 Sep 2007 07:59:47 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au [220.239.235.248]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l8R7xhfe003034 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 27 Sep 2007 17:59:45 +1000 Date: Thu, 27 Sep 2007 17:59:43 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: fs@freebsd.org Message-ID: <20070927175933.L770@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: deadlock for large writes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Sep 2007 07:59:48 -0000 I'm getting a deadlock with wait message "nbufkv" for "dd if=/dev/zero of=zz bs=1m count=1000" to an msdosfs file system with a block size of 64K. There seems to be nothing except the buf_dirty_count_severe() hack to prevent deadlock for large writes in general. Large writes want to generate a lot of dirty buffers using bdwrite() or cluster_write(). The vnode lock is normally held exclusively throughout VOP_WRITE() calls, so there seems to be no way to complete the delayed writes until VOP_WRITE() returns (since flushbufqueues() needs to hold the vnode lock exclusively to write). Deadlock is handled in some cases by the buf_dirty_count_severe() hack: in just 2 file systems (ffs and msdosfs), in the main loop in VOP_WRITE(), if (vm_page_count_severe() || buf_dirty_count_severe()), then bawrite() is used to avoid creating [m]any more dirty buffers. (I don't like this because it makes the slow case even slower -- when the system gets congested doing writes, we switch to a slower writing method so the congestion will take even longer to clear.) Some file systems use the old pessimization of always writing complete blocks using bawrite(), so they don't need to call buf_dirty_count_severe() but are slow even without it. msdosfs did that until recently when I implemented write clustering in it.
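To make the shape of that hack concrete, here is a rough sketch of the tail of such a VOP_WRITE() main loop (ffs/msdosfs style). The surrounding conditions and variables (full_block, can_cluster, bp, vp, ioflag, filesize, seqcount) are placeholders rather than the exact kernel code; only the middle branch is the buf_dirty_count_severe() hack being discussed:

	/*
	 * Sketch of how one iteration of a VOP_WRITE() main loop might
	 * dispose of the buffer it has just filled.
	 */
	if (ioflag & IO_SYNC)
		bwrite(bp);		/* synchronous: write and wait */
	else if (vm_page_count_severe() || buf_dirty_count_severe())
		bawrite(bp);		/* congested: write now, async, and
					   avoid adding another dirty buffer */
	else if (full_block && can_cluster)
		cluster_write(vp, bp, filesize, seqcount);	/* normal path */
	else
		bdwrite(bp);		/* delayed write: just mark it dirty */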
buf_dirty_count_severe() just uses the dirty buffer count, so it can only prevent buffer kva resource starvation by accident. The accident apparently doesn't happen with a block size of 64K (I think not just for msdosfs -- this may be why large block sizes for ffs are considered dangerous). When the nbufkv deadlock occurred, the dirty count was 0x5d3 and bufspace was 0x5d30000 -- bufspace consisted entirely of the 0x5d3 dirty buffers each of size 0x10000. hidirtycount was 0x709. Since this is larger than 0x5d3, the buf_dirty_count_severe() hack didn't help, but since it is not much larger, it almost helped. ffs also has a buf_dirty_count_severe() check in ffs_update(). This is missing in msdosfs and of course in all other file systems. This is less important than in the write loop, since at most one dirty buffer can be generated per vnode. But this check probably belongs in bdwrite() itself, so that file systems can't forget to do it, and so that it covers all of their bdwrite()'s, not just the ones in *fs_write() and *fs_update(). (ffs does about 50 bdwrite()'s without checking, mainly for indirect blocks and snapshots.) I think it is safe to blindly turn bdwrite() into bawrite(). ffs_update() actually blindly turns bdwrite() into bwrite() and returns the result, but this seems wrong since it makes the congested case even more congested than with bawrite(), for no advantage (callers shouldn't be checking the result in the !waitfor case that can use bdwrite()). Bruce From owner-freebsd-fs@FreeBSD.ORG Fri Sep 28 18:36:28 2007 Return-Path: Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AB8B816A4E0 for ; Fri, 28 Sep 2007 18:36:28 +0000 (UTC) (envelope-from scode@hyperion.scode.org) Received: from hyperion.scode.org (cl-1361.ams-04.nl.sixxs.net [IPv6:2001:960:2:550::2]) by mx1.freebsd.org (Postfix) with ESMTP id 59D0D13C4CC for ; Fri, 28 Sep 2007 18:36:28 +0000 (UTC) (envelope-from scode@hyperion.scode.org) Received: by hyperion.scode.org (Postfix, from userid 1001) id 7433523C44A; Fri, 28 Sep 2007 20:36:26 +0200 (CEST) Date: Fri, 28 Sep 2007 20:36:26 +0200 From: Peter Schuller To: Randy Bush Message-ID: <20070928183625.GA8655@hyperion.scode.org> References: <46F7EDD7.6060904@psg.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="tKW2IUtsqtDRztdT" Content-Disposition: inline In-Reply-To: <46F7EDD7.6060904@psg.com> User-Agent: Mutt/1.5.16 (2007-06-09) Cc: freebsd-fs@FreeBSD.ORG Subject: Re: zfs in production? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Sep 2007 18:36:28 -0000 --tKW2IUtsqtDRztdT Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable > but we would like to hear from folk using zfs in production for any > length of time, as we do not really have the resources to be pioneers. I'm using it in production on at least three machines (not counting e.g. my workstation). By production I mean for real data and/or services that are important, but not necessarily stressing the system in terms of performance/load or edge cases.
Some minor issues exist (memory issues on 32-bit, wanting to disable prefetch, swap not working on zfs, etc.) but I have never had any showstoppers. And ZFS has btw already saved me from silent (until some time later) data corruption (sort of; I tried hot swapping SATA devices in a situation where I did not know whether it was supposed to be supported - in all fairness I would never have tried it to begin with if I had not been running ZFS, but had I tried it without ZFS I would have had silent corruption). My personal gut feeling is that I am not too worried about data loss, but I would be more hesitant to deploy without proper testing in cases where performance/latency/soft real-time performance is a concern. The biggest problem so far has actually been hardware rather than software. A huge joy of ZFS is the fact that it actually does send cache flush commands to constituent drives. I have, however, recently found out that the Perc 5/i controllers will not pass this through to underlying drives (at least not with SATA). So suddenly my crappy cheap-o home server is more reliable in the case of power failure than a more expensive server with a real raid controller (when running without BBU; I can only hope that they will actually flush SATA drive caches prior to evicting contents from the cache when running with BBU enabled). -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller ' Key retrieval: Send an E-Mail to getpgpkey@scode.org E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org --tKW2IUtsqtDRztdT Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.4 (FreeBSD) iD8DBQFG/UmpDNor2+l1i30RAtHmAKCvTBXB/XW3bDpNNTBh84HrT693SQCg6Bcq TPNPmCYPMRMntQmAdUDlStw= =1C0I -----END PGP SIGNATURE----- --tKW2IUtsqtDRztdT--