From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 07:48:31 2007
From: Ighighi
Date: Mon, 24 Sep 2007 03:23:18 -0400
To: freebsd-fs@freebsd.org
Subject: nmount() version of mount_ntfs(8)

Is anybody working on an nmount() version of mount_ntfs(8) to complete the
transition of the mount_xxx() tools, or does no such plan exist?  I'm not sure
where I've read that these tools are to be merged into one in some BSD.
Anyway, I could do the work...

I also added support for a dirmask option in NTFS, as used by MSDOSFS, which I
tested on 6.2-STABLE.  If this feature is recognized as being as useful as it
is in msdosfs(5), I could make the newer nmount()-based mount_ntfs(8)
understand it as well, so it could debut in 7.0.
Details in:
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/114847

Can any developer with commit rights apply the patch for the NTFS bug in
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/114856 ?

Regards,
Igh.
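For context, this is the split that the dirmask option already gives msdosfs,
which the patch mirrors for NTFS; a minimal sketch (the device and mount point
are hypothetical):

# -m sets the permission mask applied to files, -M the mask applied to
# directories; the proposed NTFS option would allow the same distinction.
mount_msdosfs -m 0644 -M 0755 /dev/da0s1 /mnt/usb
ls -ld /mnt/usb    # directories appear as mode 0755
ls -l /mnt/usb     # files appear as mode 0644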
From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 11:08:20 2007
From: FreeBSD bugmaster
Date: Mon, 24 Sep 2007 11:08:18 GMT
To: freebsd-fs@FreeBSD.org
Subject: Current problem reports assigned to you

Current FreeBSD problem reports

Critical problems

Serious problems
S Tracker     Resp. Description
--------------------------------------------------------------------------------
o kern/112658 fs    [smbfs] [patch] smbfs and caching problems (resolves b
o kern/114676 fs    [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o kern/114856 fs    [ntfs] [patch] Bug in NTFS allows bogus file modes.
o kern/116170 fs    Kernel panic when mounting /tmp

4 problems total.

Non-critical problems
S Tracker     Resp. Description
--------------------------------------------------------------------------------
o kern/114847 fs    [ntfs] [patch] dirmask support for NTFS ala MSDOSFS

1 problem total.

From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 17:13:51 2007
From: Randy Bush
Date: Mon, 24 Sep 2007 07:03:19 -1000
To: freebsd-fs@FreeBSD.ORG
Subject: zfs in production?

we are thinking of using zfs on a production server, using gmirror for
booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.

but we would like to hear from folk using zfs in production for any
length of time, as we do not really have the resources to be pioneers.

thanks.

randy

From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 21:29:26 2007
From: Barry Pederson
Date: Mon, 24 Sep 2007 15:46:31 -0500
To: Randy Bush
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: zfs in production?

Randy Bush wrote:
> we are thinking of using zfs on a production server, using gmirror for
> booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.
>
> but we would like to hear from folk using zfs in production for any
> length of time, as we do not really have the resources to be pioneers.
>
> thanks.
>
> randy

I've set up a few machines now using a CompactFlash device for booting,
plugged straight into the motherboard with a CF-IDE adapter, and then having
zfs-on-root with the actual hard disks 100% controlled by ZFS (no gmirror or
slices otherwise).  One machine is a zfs mirror and the other is an 8-disk
raidz2.

The CF hardware is only $30 or so, and it's nice not to have to deal with two
different mirroring systems.  A bonus is that CF devices are so large nowadays
that it's convenient to just have a complete installation of FreeBSD on one
and be able to use it as an emergency recovery system just by entering
"vfs.root.mountfrom=ufs:ad0s1a" at the loader.

I've also found it works well to name the disks using glabel and add them to
the pool using the glabel names, to eliminate uncertainty as to which disk
exactly you're offlining or seeing errors from (especially with SAS-connected
drives, where the /dev/da name doesn't correspond to a particular physical
port).
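A rough sketch of that labelling approach, with hypothetical disk names and a
simple mirror rather than the exact layouts described above:

# give each disk a persistent name, independent of probe order
glabel label zdisk0 /dev/da0
glabel label zdisk1 /dev/da1

# build the pool from the labels rather than the raw da numbers
zpool create tank mirror label/zdisk0 label/zdisk1
zpool status tank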
	Barry

From owner-freebsd-fs@FreeBSD.ORG Mon Sep 24 23:56:06 2007
From: Kris Kennaway
Date: Tue, 25 Sep 2007 01:56:04 +0200
To: Randy Bush
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: zfs in production?

Randy Bush wrote:
> we are thinking of using zfs on a production server, using gmirror for
> booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.
>
> but we would like to hear from folk using zfs in production for any
> length of time, as we do not really have the resources to be pioneers.
>
> thanks.
>
> randy

I use it on a couple of heavily loaded servers.  The only issues are those I
have posted about on current before.
Kris

From owner-freebsd-fs@FreeBSD.ORG Tue Sep 25 01:23:50 2007
From: Dmitry Marakasov
Date: Tue, 25 Sep 2007 04:02:54 +0400
To: freebsd-fs@freebsd.org
Subject: Shooting yourself in the foot with ZFS: is quite easy

Hi!

I'm just playing with ZFS in qemu, and I think I've found a bug in the logic
which can lead to a shoot-yourself-in-the-foot condition, and which can be
avoided.

First of all, I constructed a raidz array:

---
# mdconfig -a -tswap -s64m
md0
# mdconfig -a -tswap -s64m
md1
# mdconfig -a -tswap -s64m
md2
# zpool create pool raidz md{0,1,2}
---

Next, I brought one of the devices offline and rewrote part of it.  Let's
imagine I needed some space in an emergency situation.

---
# zpool offline pool md0
Bringing device md0 offline
# zpool status
...
        NAME        STATE     READ WRITE CKSUM
        pool        DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            md0     OFFLINE      0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
...
# dd if=/dev/zero of=/dev/md0 bs=1m count=1
1+0 records in
1+0 records out
1048576 bytes transferred in 0.084011 secs (12481402 bytes/sec)
---

Now, how do I put md0 back into the pool?  `zpool online pool md0' seems
reasonable, and the pool will recover itself on scrub, but I'm paranoid and I
want to recreate the data on md0 completely.  But:

---
# zpool replace pool md0
cannot replace md0 with md0: md0 is busy
# zpool replace -f pool md0
cannot replace md0 with md0: md0 is busy
---

It seems it looks at the on-disk data (the remains of ZFS) and thinks the
device is still used in the pool, because if I erase the whole device with dd,
it treats md0 as a new disk and replaces it without problems:

---
# dd if=/dev/zero of=/dev/md0 bs=1m
dd: /dev/md0: end of device
65+0 records in
64+0 records out
67108864 bytes transferred in 10.154127 secs (6609023 bytes/sec)
# zpool replace pool md0
# zpool status
...
        NAME           STATE     READ WRITE CKSUM
        pool           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            replacing  DEGRADED     0     0     0
              md0/old  OFFLINE      0     0     0
              md0      ONLINE       0     0     0
            md1        ONLINE       0     0     0
            md2        ONLINE       0     0     0
...
# zpool status
...
        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
...
---

This behaviour is, I think, undesired: one should be able to replace an
offline device with itself at any time.

Which is worse:

---
# zpool offline pool md0
Bringing device md0 offline
# dd if=/dev/zero of=/dev/md0 bs=1m
dd: /dev/md0: end of device
65+0 records in
64+0 records out
67108864 bytes transferred in 8.076568 secs (8309082 bytes/sec)
# zpool online pool md0
Bringing device md0 online
# zpool status
  pool: pool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Mon Sep 24 23:21:49 2007
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     UNAVAIL      0     0     0  corrupted data
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors
# zpool replace pool md0
invalid vdev specification
use '-f' to override the following errors:
md0 is in use (r1w1e1)
# zpool replace -f pool md0
invalid vdev specification
the following errors must be manually repaired:
md0 is in use (r1w1e1)
# zpool scrub pool
# zpool status
  pool: pool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Mon Sep 24 23:22:22 2007
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     UNAVAIL      0     0     0  corrupted data
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors
# zpool offline md0
missing device name
usage:
        offline [-t] ...
# zpool offline pool md0
cannot offline md0: no valid replicas
# mdconfig -du0
mdconfig: ioctl(/dev/mdctl): Device busy
---

This is very confusing: md0 is UNAVAIL, but the table says the pool is ONLINE
(not DEGRADED!), though the status text says it's degraded.  Still, I can
neither bring the device offline nor replace it with itself (though replacing
it with an equal md3 worked).

My opinion is that such a situation should be avoided.  First of all, the
zpool behaviour with one of the disks in the UNAVAIL state seems to be a clear
bug (the array is shown as ONLINE, the unavailable device cannot be brought
offline, etc.).  Also, ZFS should not trust any on-disk contents after
bringing a disk online.  The best solution is to completely recreate the ZFS
data structures on the disk in such a case.  This should solve these cases:

1) `zpool replace` won't say that the offline disk is busy
2) One won't need to clear the disk with dd to recreate it
3) `zpool online` won't lead to the UNAVAIL state.
4) I think there could be more potential problems with the current behaviour:
   for example, what happens if I replace a disk in a raidz with another disk
   that was used in another raidz before?

As I understand it, currently `zpool offline`/`zpool online` on a disk leads
to its resilvering anyway?
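For completeness, a sketch of the workaround that did work here -- replacing
the wiped member with a fresh md device (md3 is assumed to be unused):

---
# mdconfig -a -tswap -s64m
md3
# zpool replace pool md0 md3
# zpool status pool
(md3 resilvers and md0 is dropped from the pool)
---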
--
Best regards,
Dmitry Marakasov   mailto:amdmi3@amdmi3.ru

From owner-freebsd-fs@FreeBSD.ORG Tue Sep 25 08:56:28 2007
From: Dag-Erling Smørgrav
Date: Tue, 25 Sep 2007 10:56:22 +0200
To: Randy Bush
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: zfs in production?

Randy Bush writes:
> we are thinking of using zfs on a production server, using gmirror for
> booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.
>
> but we would like to hear from folk using zfs in production for any
> length of time, as we do not really have the resources to be pioneers.

Works fine, but if using SATA, avoid Promise controllers.
DES
--
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-fs@FreeBSD.ORG Tue Sep 25 10:12:16 2007
From: Bernd Walter
Date: Tue, 25 Sep 2007 12:12:04 +0200
To: Dag-Erling Smørgrav
Cc: Randy Bush, freebsd-fs@freebsd.org
Subject: Re: zfs in production?

On Tue, Sep 25, 2007 at 10:56:22AM +0200, Dag-Erling Smørgrav wrote:
> Randy Bush writes:
> > we are thinking of using zfs on a production server, using gmirror for
> > booting and then following http://wiki.freebsd.org/ZFSOnRoot for the rest.
> >
> > but we would like to hear from folk using zfs in production for any
> > length of time, as we do not really have the resources to be pioneers.
>
> Works fine, but if using SATA, avoid Promise controllers.

It is worse:

  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad4     ONLINE       0     0     5
            ad6     ONLINE       0     0     8
            ad8     ONLINE       0     0    11

These are WDC WD3200AAKS-00SBA0/12.01B01 drives connected to an SIL3114.
The system is amd64 from the 26th of June, on a Core 2 Quad with ECC RAM.
My home system uses the same controller on i386/P3 and has no checksum
errors - it is running source from the 12th of July.
Considering that I've seen lots of silent data corruption with PATA disks on
alpha during the last few years, I'm not so sure the problem depends on a
specific controller; it may be more a matter of timing or some such.
It is easy to blame the controller, especially since SIL isn't known for
quality, but in this case I believe it is our problem somehow.

--
B.Walter                http://www.bwct.de      http://www.fizon.de
bernd@bwct.de           info@bwct.de            support@fizon.de

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 03:19:02 2007
From: "Rick C. Petty"
Date: Tue, 25 Sep 2007 22:12:20 -0500
To: Bruce Evans
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?

On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
>
> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
> size of only 3K, so there is not much to be gained by tuning -i but

I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
183.77 MB.. that's a 22% difference.  However, the biggest gain from tuning
-i is the loss of the extra (unused) inodes.  Care should be used with the -i
option -- running out of inodes when you have gigs of free space could be
very frustrating.  But I newfs all my volumes knowing an approximate inode
density based on already-existing files and a minor fudge factor.  The only
time I ran out of inodes with this method was due to a calculation error on
my part.

> more to be gained by tuning -b and -f (several doublings are reasonable).

I completely agree with this.  It's unfortunate that newfs doesn't scale
the defaults here based on the device size.  Before someone dives in and
commits any adjustments, I hope they do sufficient testing and post their
results on this mailing list.

--
Rick C. Petty
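As an illustration of estimating that density, a rough sketch (the mount
point, numbers, and the factor-of-two fudge below are hypothetical):

# df -i on an existing, representative filesystem shows used space and
# used inodes; divide to get the bytes per inode actually needed.
df -i /big/media
# Suppose it reports roughly 400 GB used and 80000 inodes used:
#   400e9 / 80000 ~= 5 MB per inode; halve it as a fudge factor.
newfs -U -i 2500000 /dev/ad6s1e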
From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 03:30:41 2007
From: "Rick C. Petty"
Date: Tue, 25 Sep 2007 22:03:58 -0500
To: Ivan Voras
Cc: freebsd-fs@freebsd.org
Subject: Re: Writing contigiously to UFS2?

On Fri, Sep 21, 2007 at 02:45:35PM +0200, Ivan Voras wrote:
> Stefan Esser wrote:
>
> From experience (not from reading code or the docs) I conclude that
> cylinder groups cannot be larger than around 190 MB.  I know this from
> numerous runnings of newfs and during development of gvirstor, which
> interacts with cg in an "interesting" way.

Then you didn't run newfs enough:

# newfs -N -i 12884901888 /dev/gvinum/mm-flac
density reduced from 2147483647 to 3680255
/mm/flac: 196608.0MB (402653184 sectors) block size 16384, fragment size 2048
        using 876 cylinder groups of 224.50MB, 14368 blks, 64 inodes.

When you specify the -i option to newfs, it will minimize the number of
inodes created.  If the density value is high enough, it will use only one
block of inodes per CG (the minimum).  From there, the density is reduced
(as per the message above) and the CG size is increased until the frag bitmap
can fit into a single block.  With UFS2 and the default options of -b 16384
-f 2048, this gives you 224.50 MB per CG.

If you wish to play around with the block/frag sizes, you can greatly
increase the CG size:

# newfs -N -f 8192 -b 65536 -i 12884901888 /dev/gvinum/mm-flac
density reduced from 2147483647 to 14868479
/mm/flac: 196608.0MB (402653184 sectors) block size 65536, fragment size 8192
        using 55 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.

Doing this is quite appropriate for large disks.  This last command means:
blocks are allocated in 64k chunks and the minimum allocation size is 8k.
Some may say this is wasteful, but one could also argue that using less
than 10% of your inodes is also wasteful.

> I know the reasons why cgs
> exist (mainly to lower latencies from seeking) but with todays drives

I don't believe that is true.  CGs exist to prevent complete data loss if the
front of the disk is trashed.  The blocks and inodes have close proximity
partly for lower latency, but also to reduce corruption risk.  It is
suggested that the CG offsets are staggered to make best use of rotational
delay, but this is obviously irrelevant with modern drives.

> and memory configurations it would sometimes be nice to make them larger
> or in the extreme, make just one cg that covers the entire drive.

And put it in the middle of the drive, not at the front.  Gee, this is what
NTFS does.. Hmm...  There are significant advantages to staggering the CGs
across the device (or, in the case of some GEOM providers, across the
provider).

Here might be an interesting experiment to try.  Write a new version of
/usr/src/sbin/newfs/mkfs.c that doesn't have the restriction that the free
fragment bitmap resides in one block.  I'm not 100% sure if the FFS code
would handle it properly, but in theory it should work (the offsets are
stored in the superblocks).  This is the biggest restriction on the CG
size.  You should be able to create 2-4 CGs to span each of your 1TB
drives without increasing the block size and thus minimum allocation unit.

--
Rick C. Petty

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 07:59:35 2007
From: Bruce Evans
Date: Wed, 26 Sep 2007 17:59:24 +1000 (EST)
To: "Rick C. Petty"
Cc: freebsd-fs@FreeBSD.org, Ivan Voras
Subject: Re: Writing contigiously to UFS2?

On Tue, 25 Sep 2007, Rick C. Petty wrote:

> On Fri, Sep 21, 2007 at 02:45:35PM +0200, Ivan Voras wrote:
>> Stefan Esser wrote:
>>
>> From experience (not from reading code or the docs) I conclude that
>> cylinder groups cannot be larger than around 190 MB.  I know this from
>> numerous runnings of newfs and during development of gvirstor which
>> interacts with cg in an "interesting" way.
>
> Then you didn't run newfs enough:
>
> # newfs -N -i 12884901888 /dev/gvinum/mm-flac
> density reduced from 2147483647 to 3680255
> /mm/flac: 196608.0MB (402653184 sectors) block size 16384, fragment size 2048
>         using 876 cylinder groups of 224.50MB, 14368 blks, 64 inodes.

That's insignificantly more.  Even doubling the size wouldn't make much
difference.
I see differences of at most 25% going the other way and halving the block
size twice, which halves the cg size 4 times: on ffs1:

4K blocks, 512-frags -e 512 (broken default): 40MB/S
4K blocks, 512-frags -e 1024 (broken default): 44MB/S
4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
    (kernel fixes are not complete enough to handle this case;
    defaults and -e values which are < the cg size work best except
    possibly when the fixes are complete): 45MB/S
16K blocks, 2K-frags -e 2K (broken default): 50MB/S
16K blocks, 2K-frags -e 4K (fixed default): 50.5MB/S
16K blocks, 2K-frags -e 8K (best): 51.5MB/S
16K blocks, 2K-frags -e 64K (try too hard): < 51MB/S again

Getting a 3% improvement just by avoiding a seek or 2 every cg is very
surprising for 16K-blocks with 2K frags.  There has to be a seek for every
cg, and bugs give 2 seeks.  However, with -e 2K, that is only 2 extra seeks
every 2048 blocks, where the block size is large, so I would have expected an
improvement of at most 2 in 2048.  The access pattern is probably confusing
the drive's cache (it's an old ATA drive with only 2MB cache).

> If you wish to play around with the block/frag sizes, you can greatly
> increase the CG size:
>
> # newfs -N -f 8192 -b 65536 -i 12884901888 /dev/gvinum/mm-flac
> density reduced from 2147483647 to 14868479
> /mm/flac: 196608.0MB (402653184 sectors) block size 65536, fragment size 8192
>         using 55 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.
>
> Doing this is quite appropriate for large disks.  This last command means:
> blocks are allocated in 64k chunks and the minimum allocation size is 8k.
> Some may say this is wasteful, but one could also argue that using less
> than 10% of your inodes is also wasteful.

Both are wasteful.  The kernel buffer cache is tuned for 16K-blocks.
64K-blocks cause either resource contention (if you don't tune BKVASIZE) or
bogusly reduced resources (if you do tune it without fixing other really
arcane parameters (wrong magic numbers in source code...)).  There is lots of
FUD about block sizes larger than 16K causing bugs, but I haven't seen any
problems from them except slowness.  64K-blocks also cause slowness in
general because they are just too big, but this shouldn't be a problem if
most files are large.

> Here might be an interesting experiment to try.  Write a new version of
> /usr/src/sbin/newfs/mkfs.c that doesn't have the restriction that the free
> fragment bitmap resides in one block.  I'm not 100% sure if the FFS code
> would handle it properly, but in theory it should work (the offsets are
> stored in the superblocks).  This is the biggest restriction on the CG
> size.  You should be able to create 2-4 CGs to span each of your 1TB
> drives without increasing the block size and thus minimum allocation unit.

In theory it won't work.  From fs.h:

%%%
/*
 * The size of a cylinder group is calculated by CGSIZE. The maximum size
 * is limited by the fact that cylinder groups are at most one block.
 * Its size is derived from the size of the maps maintained in the
 * cylinder group and the (struct cg) size.
 */
%%%

Only offsets to the inode blocks, etc. are stored in the superblock.
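For anyone wanting to check these numbers on an existing filesystem, dumpfs
prints the cg geometry that newfs chose; a quick sketch (the device name is
hypothetical):

# bsize, fsize, cgsize and the per-cg block/inode counts are near the top
# of the superblock dump
dumpfs /dev/ad0s1f | head -20

# newer versions can also print the equivalent newfs command line
dumpfs -m /dev/ad0s1f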
Bruce

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 08:37:22 2007
From: Bruce Evans
Date: Wed, 26 Sep 2007 18:37:18 +1000 (EST)
To: "Rick C. Petty"
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?

On Tue, 25 Sep 2007, Rick C. Petty wrote:

> On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
>>
>> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
>> size of only 3K, so there is not much to be gained by tuning -i but
>
> I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
> 183.77 MB.. that's a 22% difference.

That's a 22% reduction in seeks where the cost of seeking every 187MB
is a few mS every second.  Say the disk speed is 61MB/S and the seek cost
is 15 mS.  Then we waste 15 mS every 3 seconds with 183 MB cg's, or 2%.
After saving 22%, we waste only 1.8%.

These estimates are consistent with numbers I gave in previous mail.  With
the broken default of -e 2048 for 16K-blocks for ffs1, there was an
unnecessary seek or 2 after only every 32MB.  The disk speed was 52 MB/S
(disk manufacturer's MB = 10^6 B).  -e 2048 gave 50 MB/S and -e 8192 gave
51.5 MB/S.  (52 MB/S was measured on the raw disk using dd.  The raw disk
tends to actually be slower than the file system due to not streaming.)
Seeking after every 32MB (real MB) gives a seek every 645 mS, so if 2 seeks
take 15 mS each the wastage was 4.7%, so it was not surprising to get a
speedup of 3% using -e 8192.  Since I got to within 1% of the raw disk
speed, there is little more to be gained in speed here.  (The OP's problem
was not speed.)

(All this is for the benchmark "dd if=/dev/zero of=zz bs=1m count=N" where
N = 200 or 1000.)

>> more to be gained by tuning -b and -f (several doublings are reasonable).
>
> I completely agree with this.  It's unfortunate that newfs doesn't scale
> the defaults here based on the device size.  Before someone dives in and
> commits any adjustments, I hope they do sufficient testing and post their
> results on this mailing list.

Testing shows that only one doubling of -b and -f is reasonable for /usr/src,
but it makes little difference, so nothing should be changed.  I'm still
trying to make halving -b and -f back to 512/512 work right, so that it has
the same disk speed as any/any, using contiguous layout and clustering so
that physical disk i/o sizes are independent of the fs block sizes unless
small i/o sizes are sufficient.  Clustering already almost does this for data
blocks, provided the allocator manages to do a contiguous layout.  Clustering
already wastes a lot of CPU doing this by brute force, but CPU is relatively
free.

Bruce

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 12:08:44 2007
From: Bruce Evans
Date: Wed, 26 Sep 2007 22:08:37 +1000 (EST)
To: Bruce Evans
Cc: freebsd-fs@FreeBSD.org, "Rick C. Petty", Ivan Voras
Subject: Re: Writing contigiously to UFS2?

On Wed, 26 Sep 2007, I wrote:

> ... Even doubling the [block] size wouldn't make much
> difference.  I see differences of at most 25% going the other way and
> halving the block size twice, which halves the cg size 4 times: on ffs1:
>
> 4K blocks, 512-frags -e 512 (broken default): 40MB/S
> 4K blocks, 512-frags -e 1024 (broken default): 44MB/S
> 4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
> 4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
>     (kernel fixes are not complete enough to handle this case;
>     defaults and -e values which are < the cg size work best except
>     possibly when the fixes are complete): 45MB/S

[Max possible is 52 MB/S.  1MB = 10^6 bytes.]

All of these must have been with some kernel fixes.  Retesting with -current
gives 33MB/S with -e 512 and a max of 42MB/S with each of -e 1024, 2048 and
3072.  Reducing the -e parameter below 512 gives surprisingly (if you don't
remember how slow seeks can be) large further losses.  E.g., -e 128 gives
21MB/S and the following bad layout:

% fs_bsize = 4096
% fs_fsize = 512
% [bpg = 3240, maxbpg = 128]
% 4: lbn 0-11 blkno 624-719
% lbn [<1>indir]12-1035 blkno 608-615

The first indirect block is laid out discontiguously.  This is a standard
bug in reallocblks.

% lbn 12-127 blkno 720-1647

The blocks pointed to by the first indirect block are laid out contiguously
with the last direct block.  Reallocblks does extra work to move the indirect
block out of the way so that only the data blocks are contiguous.

% lbn 128-255 blkno 1744-2767
% lbn 256-383 blkno 2864-3887

There is a bug after every maxbpg = 128 blocks.  Due to a hack,
ffs_blkpref_ufs1() handles the blocks pointed to by the first indirect block
specially (maxbpg doesn't work for them), and due to a bug somewhere, it
leaves a gap of 2864-2768 = 96 blkno's (frags) after every maxbpg blocks.

% lbn 384-511 blkno 3984-5007
% lbn 512-639 blkno 5104-6127
% lbn 640-767 blkno 6224-7247
% lbn 768-895 blkno 7344-8367
% lbn 896-1035 blkno 8464-9583

It keeps leaving gaps of 96 blkno's until the end of the indirect block.

% lbn [<2>indir]1036-1049611 blkno 207368-207375
% lbn [<1>indir]1036-2059 blkno 207376-207383
% lbn 1036-1163 blkno 207824-208847

Now ffs_blkpref_ufs1() skips 7 cg's (1 cg = 3240 blocks = 8*3240 = 25920
frags) because it doesn't know the current cg and its best guess is off by a
factor of 8, due to our maxbpg being weird by a factor of 8.  Normally it
only skips 1 cg, due to the default maxbpg being wrong by a factor of 2.

% lbn 1164-1291 blkno 259664-260687
% lbn 1292-1419 blkno 311504-312527
% lbn 1420-1547 blkno 363344-364367
% lbn 1548-1675 blkno 415184-416207
% lbn 1676-1803 blkno 467024-468047
% lbn 1804-1931 blkno 518864-519887
% lbn 1932-2059 blkno 570704-571727

Indirect blocks after the first are not handled specially, so a new cg is
preferred after every maxbpg blocks, and some other bug causes 1 cg to be
skipped for every maxbpg blocks.

% ...

The pattern continues for subsequent indirect blocks.  The layout is only
unusual for the first one, so the pessimizations for the first one have
little effect for large files -- for large files, the speed is dominated by
seeking every 128 blocks, as requested by -e 128.

Bruce

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 17:10:56 2007
From: "Rick C. Petty"
Date: Wed, 26 Sep 2007 12:10:54 -0500
To: Bruce Evans
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?

On Wed, Sep 26, 2007 at 05:59:24PM +1000, Bruce Evans wrote:
> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>
> That's insignificantly more.
> Even doubling the size wouldn't make much
> difference.  I see differences of at most 25% going the other way and

Some would say that a 25% difference is significant.  Obviously you disagree.

> 4K blocks, 512-frags -e 512 (broken default): 40MB/S
> 4K blocks, 512-frags -e 1024 (broken default): 44MB/S
> 4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
> 4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
>     (kernel fixes are not complete enough to handle this case;
>     defaults and -e values which are < the cg size work best except
>     possibly when the fixes are complete): 45MB/S
> 16K blocks, 2K-frags -e 2K (broken default): 50MB/S
> 16K blocks, 2K-frags -e 4K (fixed default): 50.5MB/S
> 16K blocks, 2K-frags -e 8K (best): 51.5MB/S
> 16K blocks, 2K-frags -e 64K (try too hard): < 51MB/S again

Are you talking about throughputs now?  I was just talking about space.
Time and space are usually mutually-exclusive optimizations.

> > Here might be an interesting experiment to try.  Write a new version of
> > /usr/src/sbin/newfs/mkfs.c that doesn't have the restriction that the free
> > fragment bitmap resides in one block.  I'm not 100% sure if the FFS code
> > would handle it properly, but in theory it should work (the offsets are
> > stored in the superblocks).  This is the biggest restriction on the CG
> > size.  You should be able to create 2-4 CGs to span each of your 1TB
> > drives without increasing the block size and thus minimum allocation unit.
>
> In theory it won't work.  From fs.h:
>
> %%%
> /*
>  * The size of a cylinder group is calculated by CGSIZE. The maximum size
>  * is limited by the fact that cylinder groups are at most one block.
>  * Its size is derived from the size of the maps maintained in the
>  * cylinder group and the (struct cg) size.
>  */
> %%%

Debug code, not comments!  :-P

> Only offsets to the inode blocks, etc. are stored in the superblock.

Yes, the offset to the cylinder group block and the offset to the inode
block are in the superblock (struct fs).  It wouldn't be too difficult to
tweak the ffs code to read in CGs larger than one block, by checking the
difference between fs_iblkno and fs_cblkno.  I'm saying it's theoretically
possible, although it will require tweaks in the ffs code.  Again, I think
it's worth investigating, especially if you believe there are performance
penalties for having block sizes greater than the kernel buffer size.

--
Rick C. Petty

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 17:17:58 2007
From: "Rick C. Petty"
Date: Wed, 26 Sep 2007 12:17:56 -0500
To: Bruce Evans
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?

On Wed, Sep 26, 2007 at 06:37:18PM +1000, Bruce Evans wrote:
> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>
> > On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
> >>
> >> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
> >> size of only 3K, so there is not much to be gained by tuning -i but
> >
> > I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
> > 183.77 MB.. that's a 22% difference.
>
> That's a 22% reduction in seeks where the cost of seeking every 187MB
> is a few mS every second.  Say the disk speed is 61MB/S and the seek cost
> is 15 mS.  Then we waste 15 mS every 3 seconds with 183 MB cg's, or 2%.
> After saving 22%, we waste only 1.8%.

I'm not sure why this discussion has moved into speed/performance
comparisons.  I'm saying there's a 22% difference in CG size.

> Since I
> got to within 1% of the raw disk speed, there is little more to be
> gained in speed here.  (The OP's problem was not speed.)

I agree -- why are you discussing speed?  I mean, it's interesting.  But I
was only discussing CG sizes and suggesting using the inode density option
to reduce the amount of space "wasted" with filesystem metadata.

I do think the performance differences are interesting, but how much of
those differences become irrelevant when looking at modern drives with
tagged queuing, large I/O caches, and reordered block operations?

--
Rick C. Petty

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 20:06:58 2007
From: Bruce Evans
Date: Thu, 27 Sep 2007 06:06:29 +1000 (EST)
To: "Rick C. Petty"
Cc: freebsd-fs@freebsd.org
Subject: Re: Writing contigiously to UFS2?

On Wed, 26 Sep 2007, Rick C. Petty wrote:

> On Wed, Sep 26, 2007 at 05:59:24PM +1000, Bruce Evans wrote:
>> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>>
>> That's insignificantly more.  Even doubling the size wouldn't make much
>> difference.  I see differences of at most 25% going the other way and
>
> Some would say that a 25% difference is significant.  Obviously you disagree.

No, 25% is significant, but it takes intentional mistuning, combined with no
attempt to optimize the mistuned case and with bugs in the general case that
are more harmful for the mistuned case, to get as much as 25%.

>> 4K blocks, 512-frags -e 512 (broken default): 40MB/S
>> 4K blocks, 512-frags -e 1024 (broken default): 44MB/S

er, fixed default

>> 4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
>> 4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
>>     (kernel fixes are not complete enough to handle this case;
>>     defaults and -e values which are < the cg size work best except
>>     possibly when the fixes are complete): 45MB/S
>> 16K blocks, 2K-frags -e 2K (broken default): 50MB/S
>> 16K blocks, 2K-frags -e 4K (fixed default): 50.5MB/S
>> 16K blocks, 2K-frags -e 8K (best): 51.5MB/S
>> 16K blocks, 2K-frags -e 64K (try too hard): < 51MB/S again

64K-blocks, 8K-frags, -e barely matters: close to the max of 52 MB/S

(I was able to create a perfectly contiguous (modulo indirect blocks, which
were allocated as contiguously as possible) file of size 1GB on a fs with a
cg size of almost 2GB.  A second file would not have been allocated so well,
since it would be started on the same cg as the directory inode = same cg as
the first file.)

> Are you talking about throughputs now?  I was just talking about space.
> Time and space are usually mutually-exclusive optimizations.

These are all throughputs, starting with a new file system.  Since it's a new
file system with defaults for most parameters, it has the usual space/time
tuning (-m 8 -o time), but normal space/time tuning doesn't apply for huge
files anyway, since there are no normal fragments.

>> ...
>>> size.  You should be able to create 2-4 CGs to span each of your 1TB
>>> drives without increasing the block size and thus minimum allocation unit.
>>
>> In theory it won't work.  From fs.h:
>> ...
>> Only offsets to the inode blocks, etc. are stored in the superblock.
>
> Yes, the offset to the cylinder group block and the offset to the inode
> block are in the superblock (struct fs).  It wouldn't be too difficult to
> tweak the ffs code to read in CGs larger than one block, by checking the
> difference between fs_iblkno and fs_cblkno.  I'm saying it's theoretically
> possible, although it will require tweaks in the ffs code.  Again, I think
> it's worth investigating, especially if you believe there are performance
> penalties for having block sizes greater than the kernel buffer size.

But then it won't be binary compatible.  The performance penalties are easier
to fix (they should just never have existed on 64-bit platforms).

My main point here is that small cylinder groups alone are not a problem for
large files, provided they are not too small.  They cost a few percent in the
best cases.  In the worst cases, this loss is in the noise.

Bruce

From owner-freebsd-fs@FreeBSD.ORG Wed Sep 26 20:20:05 2007
From: Bruce Evans
Date: Thu, 27 Sep 2007 06:19:47 +1000 (EST)
To: "Rick C. Petty"
Cc: freebsd-fs@freebsd.org
Subject: Re: Writing contigiously to UFS2?

On Wed, 26 Sep 2007, Rick C. Petty wrote:

> On Wed, Sep 26, 2007 at 06:37:18PM +1000, Bruce Evans wrote:
>> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>>
>>> On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
>>>>
>>>> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
>>>> size of only 3K, so there is not much to be gained by tuning -i but
>>>
>>> I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
>>> 183.77 MB.. that's a 22% difference.
>>
>> That's a 22% reduction in seeks where the cost of seeking every 187MB
>> is a few mS every second.  Say the disk speed is 61MB/S and the seek cost
>> is 15 mS.  Then we waste 15 mS every 3 seconds with 183 MB cg's, or 2%.
>> After saving 22%, we waste only 1.8%.
>
> I'm not sure why this discussion has moved into speed/performance
> comparisons.  I'm saying there's a 22% difference in CG size.

Size is uninteresting except where it affects speed.  "-i large" saves some
disk space, but not 22%, and disk space is almost free.  "-b large -f large"
costs disk space.

>> Since I
>> got to within 1% of the raw disk speed, there is little more to be
>> gained in speed here.  (The OP's problem was not speed.)
>
> I agree -- why are you discussing speed?  I mean, it's interesting.  But I
> was only discussing CG sizes and suggesting using the inode density option
> to reduce the amount of space "wasted" with filesystem metadata.
The OP's problem was that, due to an apparently untuned maxbpg and/or maxbpg not actually working, data was scattered over all cg's and thus over all disks when it was expected/wanted to be packed into a small number of disks. Packing into a large number of small cg's should give the same effect on the number of disks used as packing into a small number of large cg's, but apparently doesn't, due to the untuned maxbpg and/or bugs. > I do think the performance differences are interesting, but how much of the > differences are irrelevant when looking at modern drives with tagged > queuing, large I/O caches, and reordered block operations? It depends on how big the seeks are (except a really modern drive would be RAM with infinitely fast seeks :-). I think any large-enough cylinder group would be large enough for the seek time to be significant. Bruce From owner-freebsd-fs@FreeBSD.ORG Thu Sep 27 07:59:48 2007 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2423216A417 for ; Thu, 27 Sep 2007 07:59:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id C240F13C457 for ; Thu, 27 Sep 2007 07:59:47 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au [220.239.235.248]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l8R7xhfe003034 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 27 Sep 2007 17:59:45 +1000 Date: Thu, 27 Sep 2007 17:59:43 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: fs@freebsd.org Message-ID: <20070927175933.L770@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: deadlock for large writes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Sep 2007 07:59:48 -0000 I'm getting a deadlock with wait message "nbufkv" for "dd if=/dev/zero of=zz bs=1m count=1000" to an msdosfs file system with a block size of 64K. There seems to be nothing except the buf_dirty_count_severe() hack to prevent deadlock for large writes in general. Large writes want to generate a lot of dirty buffers using bdwrite() or cluster_write(). The vnode lock is normally held exclusively throughout VOP_WRITE() calls, so there seems to be no way to complete the delayed writes until VOP_WRITE() returns (since flushbufqueues() needs to hold the vnode lock exclusively to write). Deadlock is handled in some cases by the buf_dirty_count_severe() hack: in just 2 file systems (ffs and msdosfs), in the main loop in VOP_WRITE(), if (vm_page_count_severe() || buf_dirty_count_severe()), then bawrite() is used to avoid creating [m]any more dirty buffers. (I don't like this because it makes the slow case even slower -- when the system gets congested doing writes, we switch to a slower writing method so the congestion will take even longer to clear.) Some file systems use the old pessimization of always writing complete blocks using bawrite(), so they don't need to call buf_dirty_count_severe() but are slow even without it. msdosfs did that until recently when I implemented write clustering in it.
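To make the shape of that hack concrete, here is a rough sketch of the tail of such a VOP_WRITE() main loop (ffs/msdosfs style). The surrounding conditions and variables (full_block, can_cluster, bp, vp, ioflag, filesize, seqcount) are placeholders rather than the exact kernel code; only the middle branch is the buf_dirty_count_severe() hack being discussed:

	/*
	 * Sketch of how one iteration of a VOP_WRITE() main loop might
	 * dispose of the buffer it has just filled.
	 */
	if (ioflag & IO_SYNC)
		bwrite(bp);		/* synchronous: write and wait */
	else if (vm_page_count_severe() || buf_dirty_count_severe())
		bawrite(bp);		/* congested: write now, async, and
					   avoid adding another dirty buffer */
	else if (full_block && can_cluster)
		cluster_write(vp, bp, filesize, seqcount);	/* normal path */
	else
		bdwrite(bp);		/* delayed write: just mark it dirty */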
buf_dirty_count_severe() just uses the dirty buffer count, so it can only prevent buffer kva resource starvation by accident. The accident apparently doesn't happen with a block size of 64K (I think not just for msdosfs -- this may be why large block sizes for ffs are considered dangerous). When the nbufkv deadlock occurred, the dirty count was 0x5d3 and bufspace was 0x5d30000 -- bufspace consisted entirely of the 0x5d3 dirty buffers each of size 0x10000. hidirtycount was 0x709. Since this is larger than 0x5d3, the buf_dirty_count_severe() hack didn't help, but since it is not much larger, it almost helped. ffs also has a buf_dirty_count_severe() check in ffs_update(). This is missing in msdosfs and of course in all other file systems. This is less important than in the write loop, since at most one dirty buffer can be generated per vnode. But this check probably belongs in bdwrite() itself, so that file systems can't forget to do it, and so that it covers all of their bdwrite()'s, not just the ones in *fs_write() and *fs_update(). (ffs does about 50 bdwrite()'s without checking, mainly for indirect blocks and snapshots.) I think it is safe to blindly turn bdwrite() into bawrite(). ffs_update() actually blindly turns bdwrite() into bwrite() and returns the result, but this seems wrong since it makes the congested case even more congested than with bawrite(), for no advantage (callers shouldn't be checking the result in the !waitfor case that can use bdwrite()). Bruce From owner-freebsd-fs@FreeBSD.ORG Fri Sep 28 18:36:28 2007 Return-Path: Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AB8B816A4E0 for ; Fri, 28 Sep 2007 18:36:28 +0000 (UTC) (envelope-from scode@hyperion.scode.org) Received: from hyperion.scode.org (cl-1361.ams-04.nl.sixxs.net [IPv6:2001:960:2:550::2]) by mx1.freebsd.org (Postfix) with ESMTP id 59D0D13C4CC for ; Fri, 28 Sep 2007 18:36:28 +0000 (UTC) (envelope-from scode@hyperion.scode.org) Received: by hyperion.scode.org (Postfix, from userid 1001) id 7433523C44A; Fri, 28 Sep 2007 20:36:26 +0200 (CEST) Date: Fri, 28 Sep 2007 20:36:26 +0200 From: Peter Schuller To: Randy Bush Message-ID: <20070928183625.GA8655@hyperion.scode.org> References: <46F7EDD7.6060904@psg.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="tKW2IUtsqtDRztdT" Content-Disposition: inline In-Reply-To: <46F7EDD7.6060904@psg.com> User-Agent: Mutt/1.5.16 (2007-06-09) Cc: freebsd-fs@FreeBSD.ORG Subject: Re: zfs in production? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Sep 2007 18:36:28 -0000 --tKW2IUtsqtDRztdT Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable > but we would like to hear from folk using zfs in production for any > length of time, as we do not really have the resources to be pioneers. I'm using it in production on at least three machines (not counting e.g. my workstation). By production I mean for real data and/or services that are important, but not necessarily stressing the system in terms of performance/load or edge cases.
Some minor issues exist (memory issues on 32-bit, wanting to disable prefetch, swap not working on zfs, etc.) but I have never had any showstoppers. And ZFS has btw already saved me from silent (until some time later) data corruption (sort of; I tried hot swapping SATA devices in a situation where I did not know whether it was supposed to be supported - in all fairness I would never have tried it to begin with if I had not been running ZFS, but had I tried it without ZFS I would have had silent corruption). My personal gut feeling is that I am not too worried about data loss, but I would be more hesitant to deploy without proper testing in cases where performance/latency/soft real-time performance is a concern. The biggest problem so far has actually been hardware rather than software. A huge joy of ZFS is the fact that it actually does send cache flush commands to constituent drives. I have, however, recently found out that the Perc 5/i controllers will not pass this through to underlying drives (at least not with SATA). So suddenly my crappy cheap-o home server is more reliable in the case of power failure than a more expensive server with a real raid controller (when running without BBU; I can only hope that they will actually flush SATA drive caches prior to evicting contents from the cache when running with BBU enabled). -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller ' Key retrieval: Send an E-Mail to getpgpkey@scode.org E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org --tKW2IUtsqtDRztdT Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.4 (FreeBSD) iD8DBQFG/UmpDNor2+l1i30RAtHmAKCvTBXB/XW3bDpNNTBh84HrT693SQCg6Bcq TPNPmCYPMRMntQmAdUDlStw= =1C0I -----END PGP SIGNATURE----- --tKW2IUtsqtDRztdT--