From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 15:22:14 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id DB1FA16A407
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 15:22:14 +0000 (UTC)
	(envelope-from arne_woerner@yahoo.com)
Received: from web30307.mail.mud.yahoo.com (web30307.mail.mud.yahoo.com
	[209.191.69.69]) by mx1.FreeBSD.org (Postfix) with SMTP id 5839743D4C
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 15:22:14 +0000 (GMT)
	(envelope-from arne_woerner@yahoo.com)
Received: (qmail 59249 invoked by uid 60001); 8 Oct 2006 15:22:13 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com;
	h=Message-ID:Received:Date:From:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding;
	b=w6QoOQsswjFqdafmJM6xLek4RZ1MpW+ocM7gaUbOvwaf64wTEGKj51wt77yyKkOaZT3jRLQWUH0hbC4CM9kArEoQWrR8hCPJ+fHjdKMK8gIXzhY19nPIiVhK2louPKVszgZP/LyK0/umqlAvloUKjAyR1LnlJqPia0jFTISpklY=
	; 
Message-ID: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com>
Received: from [83.129.181.92] by web30307.mail.mud.yahoo.com via HTTP;
	Sun, 08 Oct 2006 08:22:13 PDT
Date: Sun, 8 Oct 2006 08:22:13 -0700 (PDT)
From: "R. B. Riddick" <arne_woerner@yahoo.com>
To: freebsd-fs@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Content-Transfer-Encoding: quoted-printable
Subject: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 15:22:14 -0000

Hi!

We (Veronica and I) noticed that starting 2 bonnie (ports/bench)
processes on a UFS
1. on a geom_bsd, geom_disk (like ad4 or da0) and geom_stripe (using
   ad4a, ..., ad10a)
2. with different controllers (areca and nVidia) and different
   motherboards and
3. with up to 8 SATA disks
results in a system whose disk activity stops permanently.

Veronica's box had more than 700MB of free memory (according to top)
when it happened.

Heavy load (caused by blogbench, rawio, raidtest and dd) causes no
problem, while bonnie gets stuck somewhere between the putc phase and
the end of the rewrite phase.

The bonnie processes were blocked on "nbufkv" (some VFS reason).
GEOM activity is impossible then (no file system activity happens).
No syslog message can be seen on the console.

Bye
-A&V

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 16:58:24 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 92AC916A40F
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 16:58:24 +0000 (UTC)
	(envelope-from kris@obsecurity.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 508C543D46
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 16:58:24 +0000 (GMT)
	(envelope-from kris@obsecurity.org)
Received: from obsecurity.dyndns.org (elvis.mu.org [192.203.228.196])
	by elvis.mu.org (Postfix) with ESMTP id 3454F1A3C1A;
	Sun,  8 Oct 2006 09:58:24 -0700 (PDT)
Received: by obsecurity.dyndns.org (Postfix, from userid 1000)
	id 9D6A25176D; Sun,  8 Oct 2006 12:58:23 -0400 (EDT)
Date: Sun, 8 Oct 2006 12:58:23 -0400
From: Kris Kennaway <kris@obsecurity.org>
To: "R. B. Riddick" <arne_woerner@yahoo.com>
Message-ID: <20061008165823.GA2061@xor.obsecurity.org>
References: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="oyUTqETQ0mS9luUI"
Content-Disposition: inline
In-Reply-To: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com>
User-Agent: Mutt/1.4.2.2i
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 16:58:24 -0000


--oyUTqETQ0mS9luUI
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Oct 08, 2006 at 08:22:13AM -0700, R. B. Riddick wrote:
> Hi!
>
> We (me and Veronica) mentioned, that starting 2 bonnie (ports/bench) processes on a UFS
> 1. on a geom_bsd, geom_disk (like ad4 or da0) and geom_stripe (using ad4a, ..., ad10a)
> 2. with different controllers areca and nVidia and different motherboards and
> 3. with up to 8 SATA disks
> results in a permanently disk-dead system.
>
>
> Veronica's box had more than 700MB of free memory (according to top), when it happened.
>
> Heavy load (caused by blogbench, rawio, raidtest and dd) causes no problem, while bonnie gets stuck somewhere between putc phase and end of rewrite phase.
>
> The bonnie processes were blocked due to "nbufkv" (some VFS reason).
> Geom activity is impossible then (no file system activity happens).
> No syslog message can be seen on the console.

You forgot to even mention what version you're running ;-)

Also show your kernel config file.  Configure DDB per the chapter on
kernel debugging in the Developers' Handbook, break to DDB from the
console or serial console, then show us what processes are running and
what their backtraces are.

Kris

--oyUTqETQ0mS9luUI
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (FreeBSD)

iD8DBQFFKS4uWry0BWjoQKURAnd2AKDmi5t4q69iCzZxaIWKz37ve9LX9ACg/tvb
KyJ0BYVUNgjVdPZU7SGQebc=
=ByF9
-----END PGP SIGNATURE-----

--oyUTqETQ0mS9luUI--

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 17:01:59 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 4CC0916A403
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 17:01:59 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.FreeBSD.org (Postfix) with ESMTP id A420D43D49
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 17:01:56 +0000 (GMT)
	(envelope-from scottl@samsco.org)
Received: from [192.168.254.14] (imini.samsco.home [192.168.254.14])
	(authenticated bits=0)
	by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k98H1lPM025130;
	Sun, 8 Oct 2006 11:01:52 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Message-ID: <45292EFA.4060903@samsco.org>
Date: Sun, 08 Oct 2006 11:01:46 -0600
From: Scott Long <scottl@samsco.org>
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
	rv:1.7.7) Gecko/20050416
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Kris Kennaway <kris@obsecurity.org>
References: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com>
	<20061008165823.GA2061@xor.obsecurity.org>
In-Reply-To: <20061008165823.GA2061@xor.obsecurity.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-1.4 required=3.8 tests=ALL_TRUSTED autolearn=failed 
	version=3.1.1
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 17:01:59 -0000

Kris Kennaway wrote:

> On Sun, Oct 08, 2006 at 08:22:13AM -0700, R. B. Riddick wrote:
> 
>>Hi!
>>
>>We (me and Veronica) mentioned, that starting 2 bonnie (ports/bench) processes on a UFS
>>1. on a geom_bsd, geom_disk (like ad4 or da0) and geom_stripe (using ad4a, ..., ad10a)
>>2. with different controllers areca and nVidia and different motherboards and
>>3. with up to 8 SATA disks
>>results in a permanently disk-dead system.
>>
>>
>>Veronica's box had more than 700MB of free memory (according to top), when it happened.
>>
>>Heavy load (caused by blogbench, rawio, raidtest and dd) causes no problem, while bonnie gets stuck somewhere between putc phase and end of rewrite phase.
>>
>>The bonnie processes were blocked due to "nbufkv" (some VFS reason).
>>Geom activity is impossible then (no file system activity happens).
>>No syslog message can be seen on the console.
> 
> 
> You forgot to even mention what version you're running ;-)
> 
> Also show your kernel config file.  Configure DDB per the chapter on
> kernel debugging in the developers handbook, break to DDB from the
> console or serial console, then show us what processes are running and
> what are their backtraces.
> 
> Kris

No need for all of that information, the bug in vfs_bio.c is quite 
obvious. =-(  Fixing it will take some thought, though.

Scott


From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 19:39:45 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 2B2E816A403
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 19:39:45 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210])
	by mx1.FreeBSD.org (Postfix) with ESMTP id BE9C643D45
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 19:39:44 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout1.pacific.net.au (Postfix) with ESMTP id 80C8D5BFC33;
	Mon,  9 Oct 2006 05:39:19 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge3) with ESMTP
	id k98JdFVA028027; Mon, 9 Oct 2006 05:39:17 +1000
Date: Mon, 9 Oct 2006 05:39:15 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Scott Long <scottl@samsco.org>
In-Reply-To: <45292EFA.4060903@samsco.org>
Message-ID: <20061009052237.X30864@delplex.bde.org>
References: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com>
	<20061008165823.GA2061@xor.obsecurity.org>
	<45292EFA.4060903@samsco.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@freebsd.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 19:39:45 -0000

On Sun, 8 Oct 2006, Scott Long wrote:

> Kris Kennaway wrote:
>> You forgot to even mention what version you're running ;-)
>> 
>> Also show your kernel config file.  Configure DDB per the chapter on

> No need for all of that information, the bug in vfs_bio.c is quite obvious. 
> =-(  Fixing it will take some thought, though.

Is it really obvious?  I think it is only obvious that many things are
not quite right.  The quick fix of increasing BKVASIZE to the size of
the largest buffer used should still work to prevent bkva fragmentation.

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 20:33:56 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 17D0416A40F
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 20:33:56 +0000 (UTC)
	(envelope-from arne_woerner@yahoo.com)
Received: from web30312.mail.mud.yahoo.com (web30312.mail.mud.yahoo.com
	[209.191.69.74]) by mx1.FreeBSD.org (Postfix) with SMTP id 928D143D45
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 20:33:55 +0000 (GMT)
	(envelope-from arne_woerner@yahoo.com)
Received: (qmail 84150 invoked by uid 60001); 8 Oct 2006 20:33:50 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com;
	h=Message-ID:Received:Date:From:Subject:To:Cc:MIME-Version:Content-Type:Content-Transfer-Encoding;
	b=Y1kC8scDcax36/g3c9FPS5QfiX8UX0gnfO+l5bi3IpKr8KuJwNHKF65pqMukWeZ3Qyp+44969gn0ZtE7htGHc7CIEu56FGkTm72cHDxGrJqewZ/VE+EtLW34utR8Wczz074QZJnzEliccbIcErccs9XgKi1hvtuO/G5+ZNicbog=
	; 
Message-ID: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com>
Received: from [83.129.181.92] by web30312.mail.mud.yahoo.com via HTTP;
	Sun, 08 Oct 2006 13:33:49 PDT
Date: Sun, 8 Oct 2006 13:33:49 -0700 (PDT)
From: "R. B. Riddick" <arne_woerner@yahoo.com>
To: Bruce Evans <bde@zeta.org.au>, Scott Long <scottl@samsco.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Content-Transfer-Encoding: quoted-printable
Cc: freebsd-fs@freebsd.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 20:33:56 -0000

Bruce wrote:
>On Sun, 8 Oct 2006, Scott Long wrote:
>> Kris Kennaway wrote:
>>> You forgot to even mention what version you're running ;-)
>>>
>>> Also show your kernel config file.  Configure DDB per the chapter on
>>>
>> No need for all of that information, the bug in vfs_bio.c is quite obvious.
>> =-(  Fixing it will take some thought, though.
>
>Is it really obvious?  I think it is only obvious that many things are
>not quite right.  The quick fix of increasing BKVASIZE to the size of
>the largest buffer used should still work to prevent bkva fragmentation.
>
OK: The FreeBSD version varied: R6.1, R6.1-CURRENT, R7-CURRENT.

But we just found out that it happens when we use "newfs -b 65536", but
not with the default "-b" value (whatever that might be)...

So if somebody wants to reproduce it, he/she should use >R6 and
"newfs -b 65536"...
I think those are all the steps needed...

Can somebody reproduce it now?
DDB is not my bag, so I would be glad if somebody with an appropriate
setup could reproduce it...

-Arne

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 20:43:09 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8B8B116A412
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 20:43:09 +0000 (UTC)
	(envelope-from kris@obsecurity.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3E5F843D46
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 20:43:09 +0000 (GMT)
	(envelope-from kris@obsecurity.org)
Received: from obsecurity.dyndns.org (elvis.mu.org [192.203.228.196])
	by elvis.mu.org (Postfix) with ESMTP id F0C4F1A3C1C;
	Sun,  8 Oct 2006 13:43:08 -0700 (PDT)
Received: by obsecurity.dyndns.org (Postfix, from userid 1000)
	id 6F230515FA; Sun,  8 Oct 2006 16:43:08 -0400 (EDT)
Date: Sun, 8 Oct 2006 16:43:08 -0400
From: Kris Kennaway <kris@obsecurity.org>
To: "R. B. Riddick" <arne_woerner@yahoo.com>
Message-ID: <20061008204308.GA7702@xor.obsecurity.org>
References: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="mP3DRpeJDSE+ciuQ"
Content-Disposition: inline
In-Reply-To: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com>
User-Agent: Mutt/1.4.2.2i
Cc: freebsd-fs@freebsd.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 20:43:09 -0000


--mP3DRpeJDSE+ciuQ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Oct 08, 2006 at 01:33:49PM -0700, R. B. Riddick wrote:
> Bruce wrote:
> >On Sun, 8 Oct 2006, Scott Long wrote:
> >> Kris Kennaway wrote:
> >>> You forgot to even mention what version you're running ;-)
> >>>
> >>> Also show your kernel config file.  Configure DDB per the chapter on
> >>>
> >> No need for all of that information, the bug in vfs_bio.c is quite obvious.
> >> =-(  Fixing it will take some thought, though.
> >
> >Is it really obvious?  I think it is only obvious that many things are
> >not quite right.  The quick fix of increasing BKVASIZE to the size of
> >the largest buffer used should still work to prevent bkva fragmentation.
> >
> OK: The FBSD version was varying: R6.1, R6.1-CURRENT, R7-CURRENT.
>
> But we just found out, that it happens when we use "newfs -b 65536", but not with default "-b" value (whatever that might be)...
>
> So if somebody wants to reproduce it, he/she should use >R6 and "newfs -b 65536"...
> I think that were all steps to do...

Thanks, I can now reproduce on 7.0.

 8197  3980  8197     0  S+      nbufkv   0xc07cec08 bonnie
 8196  3980  8196     0  S+      nbufkv   0xc07cec08 bonnie

db> wh 8197
Tracing pid 8197 tid 100205 td 0xc87a6510
sched_switch(c87a6510,0,1,15e,4,...) at sched_switch+0x120
mi_switch(1,0,c0758aba,1bf,0,...) at mi_switch+0x1b2
sleepq_switch(c07c5390,0,c0758aba,211,ec9217d0,...) at sleepq_switch+0xee
sleepq_wait(c07cec08,0,c075614c,c9,0,...) at sleepq_wait+0x3e
msleep(c07cec08,c07cec0c,50,c075dece,0,...) at msleep+0x171
getnewbuf(10000,10000,c075da89,9fe,10000,...) at getnewbuf+0x319
getblk(c5d58514,fffffff4,ffffffff,10000,0,...) at getblk+0x307
breadn(c5d58514,fffffff4,ffffffff,10000,0,...) at breadn+0x4d
bread(c5d58514,fffffff4,ffffffff,10000,0,...) at bread+0x4c
ffs_balloc_ufs2(c5d58514,273a000,0,2000,c51b0e00,...) at ffs_balloc_ufs2+0x5ab
ffs_write(ec921b9c,0,c07535f4,0,0,...) at ffs_write+0x2f2
VOP_WRITE_APV(c07ada20,ec921b9c,c87a6510,c54d5c60,2,...) at VOP_WRITE_APV+0x9a
vn_write(c54d5c60,ec921c64,c51b0e00,0,c87a6510,...) at vn_write+0x1d5
dofilewrite(c54d5c60,ec921c64,ffffffff,ffffffff,0,...) at dofilewrite+0x7c
kern_writev(c87a6510,3,ec921c64,bfbfc820,2000,...) at kern_writev+0x6b
write(c87a6510,ec921d04,c,158,3,...) at write+0x4d
syscall(820003b,3b,bfbf003b,0,2000,...) at syscall+0x152
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (4, FreeBSD ELF32, write), eip = 0x28155dff, esp = 0xbfbf736c, ebp = 0xbfbfe838 ---
db> wh 8196
Tracing pid 8196 tid 100138 td 0xc50c3d80
sched_switch(c50c3d80,0,1,15e,246,...) at sched_switch+0x120
mi_switch(1,0,c0758aba,1bf,0,...) at mi_switch+0x1b2
sleepq_switch(c07c5390,0,c0758aba,211,ec79b820,...) at sleepq_switch+0xee
sleepq_wait(c07cec08,0,c075614c,c9,0,...) at sleepq_wait+0x3e
msleep(c07cec08,c07cec0c,50,c075dece,0,...) at msleep+0x171
getnewbuf(10000,10000,c075da89,9fe,10000,...) at getnewbuf+0x319
getblk(c5e6d514,fffffff4,ffffffff,10000,0,...) at getblk+0x307
ufs_bmaparray(c5e6d514,3cc,0,ec79b994,0,...) at ufs_bmaparray+0x298
ufs_bmap(ec79b9dc,c075da89,1ac) at ufs_bmap+0x69
VOP_BMAP_APV(c07ada20,ec79b9dc,c075da89,3b7,ffffffff,...) at VOP_BMAP_APV+0x72
bdwrite(ddbe5790,0,ec79bc64,2000,c51b0e00,...) at bdwrite+0x485
ffs_write(ec79bb9c,0,c07535f4,0,0,...) at ffs_write+0x5b5
VOP_WRITE_APV(c07ada20,ec79bb9c,c50c3d80,c5058c60,2,...) at VOP_WRITE_APV+0x9a
vn_write(c5058c60,ec79bc64,c51b0e00,0,c50c3d80,...) at vn_write+0x1d5
dofilewrite(c5058c60,ec79bc64,ffffffff,ffffffff,0,...) at dofilewrite+0x7c
kern_writev(c50c3d80,3,ec79bc64,bfbfe820,0,...) at kern_writev+0x6b
write(c50c3d80,ec79bd04,c,158,3,...) at write+0x4d
syscall(3b,3b,bfbf003b,0,2000,...) at syscall+0x152
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (4, FreeBSD ELF32, write), eip = 0x28155dff, esp = 0xbfbf736c, ebp = 0xbfbfe838 ---
db>

Kris

--mP3DRpeJDSE+ciuQ
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (FreeBSD)

iD8DBQFFKWLbWry0BWjoQKURAhYEAKCr/KsTlrpfgcn5JPq6Lc7HcY/LBwCgjsDn
NCg2VRMnxO8xbit/xmqKtuQ=
=ror+
-----END PGP SIGNATURE-----

--mP3DRpeJDSE+ciuQ--

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 21:02:17 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7478116A403
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 21:02:17 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5909F43D53
	for <freebsd-fs@freebsd.org>; Sun,  8 Oct 2006 21:02:15 +0000 (GMT)
	(envelope-from scottl@samsco.org)
Received: from [192.168.254.14] (imini.samsco.home [192.168.254.14])
	(authenticated bits=0)
	by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k98L27Xc026112;
	Sun, 8 Oct 2006 15:02:13 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Message-ID: <4529674E.6000405@samsco.org>
Date: Sun, 08 Oct 2006 15:02:06 -0600
From: Scott Long <scottl@samsco.org>
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
	rv:1.7.7) Gecko/20050416
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Bruce Evans <bde@zeta.org.au>
References: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com>
	<20061008165823.GA2061@xor.obsecurity.org>
	<45292EFA.4060903@samsco.org>
	<20061009052237.X30864@delplex.bde.org>
In-Reply-To: <20061009052237.X30864@delplex.bde.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-1.4 required=3.8 tests=ALL_TRUSTED autolearn=failed 
	version=3.1.1
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org
Cc: freebsd-fs@freebsd.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 21:02:17 -0000

Bruce Evans wrote:

> On Sun, 8 Oct 2006, Scott Long wrote:
> 
>> Kris Kennaway wrote:
>>
>>> You forgot to even mention what version you're running ;-)
>>>
>>> Also show your kernel config file.  Configure DDB per the chapter on
> 
> 
>> No need for all of that information, the bug in vfs_bio.c is quite 
>> obvious. =-(  Fixing it will take some thought, though.
> 
> 
> Is it really obvious?  I think it is only obvious that many things are
> not quite right.  The quick fix of increasing BKVASIZE to the size of
> the largest buffer used should still work to prevent bkva fragmentation.
> 
> Bruce

The use of needsbuffer global presents a very wide open race.
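
A minimal sketch of the kind of race a shared "need" flag can allow,
assuming a classic lost-wakeup interleaving. The thread does not say
which interleaving is meant, and this is not the vfs_bio.c code: the
class Toy and the functions waiter_check, releaser, waiter_sleep and
replay are hypothetical names used only for illustration. The steps
are replayed in a fixed order on one thread so the outcome is
deterministic.

```python
# Toy, single-threaded replay of a lost-wakeup interleaving.
# Illustrative only: every name here is hypothetical.

class Toy:
    def __init__(self):
        self.need_flag = False   # stands in for a global "someone needs a buffer" flag
        self.sleepers = []       # threads currently asleep on the wait channel
        self.free_buffers = 0

def waiter_check(state):
    """Waiter, step 1: see that no buffer is free and set the flag."""
    if state.free_buffers == 0:
        state.need_flag = True
        return True              # decided to sleep, but not asleep yet
    return False

def releaser(state):
    """Free one buffer; wake sleepers only if the flag says someone waits."""
    state.free_buffers += 1
    woken = []
    if state.need_flag:
        state.need_flag = False
        woken, state.sleepers = state.sleepers, []
    return woken

def waiter_sleep(state, name):
    """Waiter, step 2: go to sleep on the wait channel."""
    state.sleepers.append(name)

def replay():
    state = Toy()
    decided = waiter_check(state)    # waiter sets the flag
    woken = releaser(state)          # runs in the gap: clears the flag, wakes nobody
    if decided:
        waiter_sleep(state, "waiter")  # sleeps although a buffer is now free
    return state, woken

state, woken = replay()
print(woken, state.sleepers, state.free_buffers, state.need_flag)
```

In this replay the waiter ends up asleep while a buffer is free and the
flag is clear, so nothing is left to wake it. Checking the flag and
going to sleep under one lock closes the gap in this toy model.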

Scott


From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 22:25:00 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@FreeBSD.org
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id E719D16A403
	for <freebsd-fs@FreeBSD.org>; Sun,  8 Oct 2006 22:25:00 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5625A43D45
	for <freebsd-fs@FreeBSD.org>; Sun,  8 Oct 2006 22:25:00 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout1.pacific.net.au (Postfix) with ESMTP id E03555DFC21;
	Mon,  9 Oct 2006 08:24:58 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge3) with ESMTP
	id k98MOuXF008714; Mon, 9 Oct 2006 08:24:56 +1000
Date: Mon, 9 Oct 2006 08:24:55 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: "R. B. Riddick" <arne_woerner@yahoo.com>
In-Reply-To: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com>
Message-ID: <20061009075528.W31379@delplex.bde.org>
References: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@FreeBSD.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 22:25:01 -0000

On Sun, 8 Oct 2006, R. B. Riddick wrote:

> Bruce wrote:
>> On Sun, 8 Oct 2006, Scott Long wrote:
>>> Kris Kennaway wrote:
>>>> You forgot to even mention what version you're running ;-)
>>>>
>>>> Also show your kernel config file.  Configure DDB per the chapter on
>>>>
>>> No need for all of that information, the bug in vfs_bio.c is quite obvious.
>>> =-(  Fixing it will take some thought, though.
>>
>> Is it really obvious?  I think it is only obvious that many things are
>> not quite right.  The quick fix of increasing BKVASIZE to the size of
>> the largest buffer used should still work to prevent bkva fragmentation.
>>
> OK: The FBSD version was varying: R6.1, R6.1-CURRENT, R7-CURRENT.
>
> But we just found out, that it happens when we use "newfs -b 65536", but not with default "-b" value (whatever that might be)...

That's certainly a good way to exercise bkva fragmentation.  I don't
know any other use for such large block sizes in ffs :-).  Such a large
block size might be best for file systems with mainly very large files,
but the possible benefits are not large and might be smaller than the
extra overheads for defragmentation (even if it works).

The fragmentation can also be reduced by not using different block
sizes for different mounted file systems (including non-ffs ones)
once one of the sizes exceeds BKVASIZE.  Alternatively it can be
increased by doing the reverse.  I think "newfs -b 65536 -f 8192"
gives the bad mixture with different (ffs)block and (ffs)frag sizes.
"newfs -b 65536 -f 65536" usually gives very bad performance because
its frag size is too large.
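
A toy model of why mixed buffer sizes can fragment a fixed address
range, as a sketch only. It assumes a slot size of 16 KB standing in
for BKVASIZE and a hypothetical 8-slot arena; the helper names
largest_free_run and can_allocate are made up for illustration and do
not correspond to anything in vfs_bio.c.

```python
# Toy model of address-space fragmentation; not the real buffer allocator.
SLOT = 16 * 1024          # assumed slot size, standing in for BKVASIZE
ARENA_SLOTS = 8           # hypothetical arena: 8 slots = 128 KB

def largest_free_run(used):
    """Length, in slots, of the longest run of free (False) entries."""
    best = run = 0
    for slot_used in used:
        run = 0 if slot_used else run + 1
        best = max(best, run)
    return best

def can_allocate(used, nbytes):
    """A buffer needs ceil(nbytes / SLOT) contiguous free slots."""
    need = -(-nbytes // SLOT)
    return largest_free_run(used) >= need

# Fill the arena with 16 KB buffers, then release every second one.
used = [True] * ARENA_SLOTS
for i in range(0, ARENA_SLOTS, 2):
    used[i] = False

free_bytes = used.count(False) * SLOT
print(free_bytes)                       # total free space in bytes
print(can_allocate(used, 16 * 1024))    # a small buffer still fits
print(can_allocate(used, 64 * 1024))    # a 64 KB buffer does not
```

Here 64 KB is free in total, yet no 64 KB request can be satisfied
because the free slots are not adjacent. With a slot size equal to the
largest buffer, every free slot would fit every request in this model.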

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct  8 22:38:09 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@FreeBSD.org
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 4F8A716A40F
	for <freebsd-fs@FreeBSD.org>; Sun,  8 Oct 2006 22:38:09 +0000 (UTC)
	(envelope-from info@fluffles.net)
Received: from auriate.fluffles.net (a83-68-3-169.adsl.cistron.nl
	[83.68.3.169]) by mx1.FreeBSD.org (Postfix) with ESMTP id CBE6B43DA0
	for <freebsd-fs@FreeBSD.org>; Sun,  8 Oct 2006 22:38:01 +0000 (GMT)
	(envelope-from info@fluffles.net)
Received: from destiny ([10.0.0.21])
	by auriate.fluffles.net with esmtpa (Exim 4.63 (FreeBSD))
	(envelope-from <info@fluffles.net>)
	id 1GWhHQ-0009vG-Nw; Mon, 09 Oct 2006 00:37:56 +0200
Message-ID: <45297DA2.4000509@fluffles.net>
Date: Mon, 09 Oct 2006 00:37:22 +0200
From: "Fluffles.net" <info@fluffles.net>
User-Agent: Thunderbird 1.5.0.7 (X11/20060917)
MIME-Version: 1.0
To: freebsd-fs@FreeBSD.org
X-Enigmail-Version: 0.94.0.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 08 Oct 2006 22:38:09 -0000

Hi Bruce,

I'm the "Veronica" Arne mentioned on the freebsd-fs mailing list.
Regarding the effectiveness of a larger block size, these are my findings:

areca RAID5 (8x da, 128KB stripe, default newfs, NCQ enabled)
              ---------Sequential Output--------- ---Sequential Input---- --Random---
              -Per Char-- ---Block--- --Rewrite-- -Per Char-- ---Block--- ---Seeks---
Machine    MB  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
ARC8xR5  8480 119973 91.3 247178 58.6  67862 17.5  90426 86.9 172490 24.0  120.7  0.5

areca RAID5 (8x da, 128KB stripe, 64KB blocksize newfs, NCQ enabled)
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
ARC8xR5  8480 128920 97.8 265920 58.9 116787 31.0 103261 97.8 392970 53.8 119.8  0.6

As you can see, the block read speed increased from ~172MB/s to ~392MB/s,
quite a significant increase. Also, the rewrite speed increased from
~67MB/s to ~116MB/s.
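
As a rough check of those two comparisons, the ratios can be computed directly from the K/sec columns of the two bonnie runs. This is only a minimal sketch; the function name `speedup` is made up for illustration and is not part of bonnie.

```c
#include <assert.h>

/* Ratio of the 64KB-blocksize result to the default-newfs result.
 * Both arguments are throughputs in K/sec as printed by bonnie. */
double speedup(double before_kps, double after_kps)
{
    assert(before_kps > 0.0);
    return after_kps / before_kps;
}
```

With the numbers above, `speedup(172490.0, 392970.0)` comes out at roughly 2.28 for block reads and `speedup(67862.0, 116787.0)` at roughly 1.72 for rewrite.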

Of course these tests are on a brand-new, clean filesystem, which might not
tally with real-life crowded filesystems. But at least there is much
potential in a higher blocksize, and it would be a shame for it to crash
FreeBSD. There are quite a few people who store big files on big RAID
arrays; they could profit from a non-crashing FreeBSD with bigger
blocksize. Besides, a crashing VFS/Geom isn't all that sexy. ;-)

- Veronica

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct  9 19:38:28 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@FreeBSD.org
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1689716A403
	for <freebsd-fs@FreeBSD.org>; Mon,  9 Oct 2006 19:38:28 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 4AF1443D45
	for <freebsd-fs@FreeBSD.org>; Mon,  9 Oct 2006 19:38:27 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout2.pacific.net.au (Postfix) with ESMTP id 325996E125;
	Tue, 10 Oct 2006 05:38:25 +1000 (EST)
Received: from epsplex.bde.org (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge3) with ESMTP
	id k99JcKSo014207; Tue, 10 Oct 2006 05:38:21 +1000
Date: Tue, 10 Oct 2006 05:37:33 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@epsplex.bde.org
To: "Fluffles.net" <info@fluffles.net>
In-Reply-To: <45297DA2.4000509@fluffles.net>
Message-ID: <20061010051216.G814@epsplex.bde.org>
References: <45297DA2.4000509@fluffles.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@FreeBSD.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Oct 2006 19:38:28 -0000

On Mon, 9 Oct 2006, Fluffles.net wrote:

> I'm the "veronica" Arne mentioned on the freebsd-fs mailing list.
> Regarding the effectiveness of a higher blocksize, these are my findings:
>
> areca RAID5 (8x da, 128KB stripe, default newfs, NCQ enabled)
>              -------Sequential Output-------- ---Sequential Input-- --Random--
>              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> ARC8xR5  8480 119973 91.3 247178 58.6 67862 17.5 90426 86.9 172490 24.0 120.7  0.5
>
> areca RAID5 (8x da, 128KB stripe, 64KB blocksize newfs, NCQ enabled)
>              -------Sequential Output-------- ---Sequential Input-- --Random--
>              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> ARC8xR5  8480 128920 97.8 265920 58.9 116787 31.0 103261 97.8 392970 53.8 119.8  0.6
>
> As you can see, the block read speed increased from ~172MB/s to ~392MB/s,
> quite a significant increase. Also, the rewrite speed increased from
> ~67MB/s to ~116MB/s.
>
> Of course these tests are on a brand-new, clean filesystem, which might not
> tally with real-life crowded filesystems. But at least there is much
> ...

This is a bit surprising.  FreeBSD is supposed to cluster the i/o so
that (especially for large files on new file systems) almost all i/o
is done in blocks of size 64K or 128K.

I suspect the problems are that the 64K-block i/o is usually perfectly
misaligned unless the fs itself has 64K-blocks and the fs's partition
starts on a 64K-block boundary, and that some hardware or firmware
(mainly RAIDs) want the blocks to be aligned.  I'm not very familiar
with RAIDs but think it would take a fairly advanced/expensive one to
reblock all the i/o so that the alignment doesn't matter.  It would
take more advanced/complicated clustering code or better buffering code
than FreeBSD has to do the reblocking at the clustering or buffering
level.  Perhaps even 64K-blocks are too small with your RAID's stripe
size of 128K.
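
To make the alignment effect concrete, here is a minimal sketch that counts how many stripe units a single transfer touches. The function name is hypothetical, and the example numbers in the note below (512-byte sectors, a filesystem starting 63 sectors into the array) are assumptions for illustration, not measurements from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Number of stripe units touched by a transfer of `len` bytes that
 * starts at byte offset `off` on the array, for a stripe unit of
 * `stripe` bytes.  A result above 1 means the transfer straddles a
 * stripe-unit boundary. */
uint64_t stripe_units_touched(uint64_t off, uint64_t len, uint64_t stripe)
{
    assert(len > 0 && stripe > 0);
    uint64_t first = off / stripe;              /* unit holding the first byte */
    uint64_t last = (off + len - 1) / stripe;   /* unit holding the last byte */
    return last - first + 1;
}
```

With a 128K stripe unit, 64K transfers that start on 64K boundaries always stay within one unit.  Shift the starting point by 63 sectors (32256 bytes) and, in a sequential stream of 64K transfers, every second one spans two units; a shifted 128K transfer always spans two.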

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct  9 21:20:46 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 271C316A412
	for <freebsd-fs@freebsd.org>; Mon,  9 Oct 2006 21:20:46 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 36CEA43D77
	for <freebsd-fs@freebsd.org>; Mon,  9 Oct 2006 21:20:39 +0000 (GMT)
	(envelope-from scottl@samsco.org)
Received: from [10.10.3.185] ([165.236.175.187]) (authenticated bits=0)
	by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k99KlZ1s039109;
	Mon, 9 Oct 2006 14:47:42 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Message-ID: <452AB55D.9090607@samsco.org>
Date: Mon, 09 Oct 2006 14:47:25 -0600
From: Scott Long <scottl@samsco.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20060206
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Bruce Evans <bde@zeta.org.au>
References: <45297DA2.4000509@fluffles.net>
	<20061010051216.G814@epsplex.bde.org>
In-Reply-To: <20061010051216.G814@epsplex.bde.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=0.0 required=3.8 tests=none autolearn=failed 
	version=3.1.1
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org
Cc: freebsd-fs@freebsd.org, "Fluffles.net" <info@fluffles.net>,
	Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Oct 2006 21:20:46 -0000

Bruce Evans wrote:
> On Mon, 9 Oct 2006, Fluffles.net wrote:
> 
>> I'm the "veronica" Arne mentioned on the freebsd-fs mailing list.
>> Regarding the effectiveness of a higher blocksize, these are my findings:
>>
>> areca RAID5 (8x da, 128KB stripe, default newfs, NCQ enabled)
>>              -------Sequential Output-------- ---Sequential Input-- --Random--
>>              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
>> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>> ARC8xR5  8480 119973 91.3 247178 58.6 67862 17.5 90426 86.9 172490 24.0 120.7  0.5
>>
>> areca RAID5 (8x da, 128KB stripe, 64KB blocksize newfs, NCQ enabled)
>>              -------Sequential Output-------- ---Sequential Input-- --Random--
>>              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
>> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>> ARC8xR5  8480 128920 97.8 265920 58.9 116787 31.0 103261 97.8 392970 53.8 119.8  0.6
>>
>> As you can see, the block read speed increased from ~172MB/s to ~392MB/s,
>> quite a significant increase. Also, the rewrite speed increased from
>> ~67MB/s to ~116MB/s.
>>
>> Of course these tests are on a brand-new, clean filesystem, which might not
>> tally with real-life crowded filesystems. But at least there is much
>> ...
> 
> 
> This is a bit surprising.  FreeBSD is supposed to cluster the i/o so
> that (especially for large files on new file systems) almost all i/o
> is done in blocks of size 64K or 128K.
> 
> I suspect the problems are that the 64K-block i/o is usually perfectly
> misaligned unless the fs itself has 64K-blocks and the fs's partition
> starts on a 64K-block boundary, and that some hardware or firmware
> (mainly RAIDs) want the blocks to be aligned.  I'm not very familiar
> with RAIDs but think it would take a fairly advanced/expensive one to
> reblock all the i/o so that the alignment doesn't matter.  It would
> take more advanced/complicated clustering code or better buffering code
> than FreeBSD has to do the reblocking at the clustering or buffering
> level.  Perhaps even 64K-blocks are too small with your RAID's stripe
> size of 128K.
> 
> Bruce

Yes, it's a well-known problem that the combination of 
fdisk+disklabel+ufs means that all FS blocks are mis-aligned in the 
worst way possible (blocks start on odd sector numbers).  This
_horribly_ pessimizes RAID-5 on most controllers.  Solving it reliably
and automatically is hard, though.  The filesystem ultimately needs to
know the physical sector that it starts on, and compensate accordingly.
You could cheat by having the disklabel tools always align partitions,
but the tool would still need to know the physical address of where it
starts in the slice.  Either way, something high up needs to get the
logical to physical translation of the sectors.
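
The translation described above is simple arithmetic once the offsets are known; the hard part is getting them.  A minimal sketch, assuming 512-byte sectors and a slice that starts at sector 63 (the conventional MBR layout, not a number taken from this message); both function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Physical start sector of a partition: where the slice starts on the
 * disk plus where the partition starts inside the slice. */
uint64_t physical_start(uint64_t slice_start, uint64_t part_offset)
{
    return slice_start + part_offset;
}

/* Sectors of padding needed in front of `start` so that it lands on a
 * multiple of `align` sectors (0 if it is already aligned). */
uint64_t pad_to_align(uint64_t start, uint64_t align)
{
    assert(align > 0);
    uint64_t rem = start % align;
    return rem == 0 ? 0 : align - rem;
}
```

A partition at offset 0 in a slice starting at sector 63 begins on physical sector 63, an odd number.  Reaching a 64K boundary (128 sectors) from there takes 65 sectors of padding; a tool can only compute that if it is told the 63.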

Suggestions have been made to just put blind offsets into the disklabel
tool that assumes the common case (mbr is present and is a known length,
and that the disklabel is in the first slice of the MBR).  Obviously,
this is only a crude hack.  I get around this right now by not using a
disklabel or fdisk table on arrays where I value speed.  For those, I
just put a filesystem directly on the array, and boot off of a small
system disk.

Scott

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct  9 21:50:20 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 773F516A417
	for <freebsd-fs@freebsd.org>; Mon,  9 Oct 2006 21:50:20 +0000 (UTC)
	(envelope-from mike@sentex.net)
Received: from smarthost2.sentex.ca (smarthost2.sentex.ca [205.211.164.50])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 0C46043D55
	for <freebsd-fs@freebsd.org>; Mon,  9 Oct 2006 21:50:19 +0000 (GMT)
	(envelope-from mike@sentex.net)
Received: from BLUELAPIS.sentex.ca (cage.simianscience.com [64.7.134.1])
	by smarthost2.sentex.ca (8.13.8/8.13.8) with SMTP id k99LoIiu065164;
	Mon, 9 Oct 2006 17:50:19 -0400 (EDT) (envelope-from mike@sentex.net)
From: Mike Tancsa <mike@sentex.net>
To: Scott Long <scottl@samsco.org>
Date: Mon, 09 Oct 2006 17:50:30 -0400
Message-ID: <ltgli2hqprmfqq3ttahrtbppj4l3dkt4b7@4ax.com>
References: <45297DA2.4000509@fluffles.net>
	<20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org>
In-Reply-To: <452AB55D.9090607@samsco.org>
X-Mailer: Forte Agent 1.93/32.576 English (American)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Oct 2006 21:50:20 -0000

On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
wrote:

>this is only a crude hack.  I get around this right now by not using a
>disklabel or fdisk table on arrays where I value speed.  For those, I
>just put a filesystem directly on the array, and boot off of a small
>system disk.


Hi Scott,
	How is that done ?  just newfs -O2 -U /dev/da0  ?

	---Mike
--------------------------------------------------------
Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
mike@sentex.net, (http://www.tancsa.com)

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct  9 21:54:13 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 622D916A416
	for <freebsd-fs@freebsd.org>; Mon,  9 Oct 2006 21:54:13 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7C33E43D46
	for <freebsd-fs@freebsd.org>; Mon,  9 Oct 2006 21:54:10 +0000 (GMT)
	(envelope-from scottl@samsco.org)
Received: from [10.10.3.185] ([165.236.175.187]) (authenticated bits=0)
	by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k99LrvaD040223;
	Mon, 9 Oct 2006 15:54:04 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Message-ID: <452AC4EB.8000006@samsco.org>
Date: Mon, 09 Oct 2006 15:53:47 -0600
From: Scott Long <scottl@samsco.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20060206
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Mike Tancsa <mike@sentex.net>
References: <45297DA2.4000509@fluffles.net>
	<20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org>
	<ltgli2hqprmfqq3ttahrtbppj4l3dkt4b7@4ax.com>
In-Reply-To: <ltgli2hqprmfqq3ttahrtbppj4l3dkt4b7@4ax.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=0.0 required=3.8 tests=none autolearn=failed 
	version=3.1.1
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Oct 2006 21:54:13 -0000

Mike Tancsa wrote:
> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
> wrote:
> 
> 
>>this is only a crude hack.  I get around this right now by not using a
>>disklabel or fdisk table on arrays where I value speed.  For those, I
>>just put a filesystem directly on the array, and boot off of a small
>>system disk.
> 
> 
> 
> Hi Scott,
> 	How is that done ?  just newfs -O2 -U /dev/da0  ?
> 
> 	---Mike

Yup.

Scott

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct  9 23:13:55 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@FreeBSD.org
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id C3E9516A407
	for <freebsd-fs@FreeBSD.org>; Mon,  9 Oct 2006 23:13:55 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 76B2343D45
	for <freebsd-fs@FreeBSD.org>; Mon,  9 Oct 2006 23:13:45 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout1.pacific.net.au (Postfix) with ESMTP id 443F3328117;
	Tue, 10 Oct 2006 09:13:41 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge3) with ESMTP
	id k99NDb9u008107; Tue, 10 Oct 2006 09:13:38 +1000
Date: Tue, 10 Oct 2006 09:13:36 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Scott Long <scottl@samsco.org>
In-Reply-To: <452AB55D.9090607@samsco.org>
Message-ID: <20061010081212.I35683@delplex.bde.org>
References: <45297DA2.4000509@fluffles.net>
	<20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@FreeBSD.org, "Fluffles.net" <info@fluffles.net>,
	Kris Kennaway <kris@obsecurity.org>
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Oct 2006 23:13:55 -0000

On Mon, 9 Oct 2006, Scott Long wrote:

> Bruce Evans wrote:
>> ...
>> I suspect the problems are that the 64K-block i/o is usually perfectly
>> misaligned unless the fs itself has 64K-blocks and the fs's partition
>> starts on a 64K-block boundary, and that some hardware or firmware
>> (mainly RAIDs) want the blocks to be aligned.  I'm not very familiar
>> ...
>
> Yes, it's a well-known problem that the combination of fdisk+disklabel+ufs 
> means that all FS blocks are mis-aligned in the worst way possible (blocks 
> start on odd sector numbers).  This
> _horribly_ pessimizes RAID-5 on most controllers.

Apparently the internal fs block alignment/size problem is not so well
known.  I knew about the external one but didn't connect it with fs
block sizes at first.  How horribly do aligned 16K-blocks pessimize
RAID-5?  Does it help much to have misaligned 64K-blocks instead of
misaligned 16K-blocks?

> Solving it reliably
> and automatically is hard, though.  The filesystem ultimately needs to
> know the physical sector that it starts on, and compensate accordingly.
> You could cheat by having the disklabel tools always align partitions,
> but the tool would still need to know the physical address of where it
> starts in the slice.  Either way, something high up needs to get the
> logical to physical translation of the sectors.

The filesystem shouldn't need to know more than that its starting sector
is not physically misaligned.  The clustering code could use knowledge
of physical offsets and alignment requirements to fix up some cases.
My version of newfs_msdosfs(8) uses the (unimplemented) ioctl
DIOCMEDIAOFFSET to ask the system for the physical offset.  Using
this is much easier than parsing XML.

> Suggestions have been made to just put blind offsets into the disklabel
> tool that assumes the common case (mbr is present and is a known length,
> and that the disklabel is in the first slice of the MBR).  Obviously,
> this is only a crude hack.  I get around this right now by not using a
> disklabel or fdisk table on arrays where I value speed.  For those, I
> just put a filesystem directly on the array, and boot off of a small
> system disk.

I normally align FreeBSD slices and partitions manually to a "cylinder"
boundary, and this sometimes gives alignment to a large power of 2
accidentally.  I normally use a fake cylinder size of 16065 (255 fake
heads and 63 sectors per fake track).  This is just as bad for cylinder
alignment as 63 is for track alignment, but new systems only need it
for compatibility with other systems.
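
A small sketch of why cylinder alignment only gives power-of-2 alignment by accident, using the fake geometry above.  The helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_HEADS             255u
#define FAKE_SECTORS_PER_TRACK  63u

/* First sector of fake cylinder `cyl` (16065 sectors per cylinder). */
uint64_t cylinder_start_sector(uint64_t cyl)
{
    return cyl * FAKE_HEADS * FAKE_SECTORS_PER_TRACK;
}

/* Largest power of two that divides x, i.e. the alignment of sector x
 * measured in sectors.  Isolates the lowest set bit of x. */
uint64_t pow2_alignment(uint64_t x)
{
    assert(x > 0);
    return x & (~x + 1);
}
```

Since 16065 is odd, the alignment of a cylinder boundary comes entirely from the cylinder number: cylinder 1 starts on an odd sector, while cylinder 128 happens to start on a 128-sector boundary (64K, assuming 512-byte sectors).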

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct  9 23:52:40 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@FreeBSD.org
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 18BDE16A40F
	for <freebsd-fs@FreeBSD.org>; Mon,  9 Oct 2006 23:52:40 +0000 (UTC)
	(envelope-from etc@fluffles.net)
Received: from auriate.fluffles.net (a83-68-3-169.adsl.cistron.nl
	[83.68.3.169]) by mx1.FreeBSD.org (Postfix) with ESMTP id AEF1743D73
	for <freebsd-fs@FreeBSD.org>; Mon,  9 Oct 2006 23:52:39 +0000 (GMT)
	(envelope-from etc@fluffles.net)
Received: from destiny ([10.0.0.21])
	by auriate.fluffles.net with esmtpa (Exim 4.63 (FreeBSD))
	(envelope-from <etc@fluffles.net>)
	id 1GX4vF-000GrP-LR; Tue, 10 Oct 2006 01:52:37 +0200
Message-ID: <452AE0A5.3010503@fluffles.net>
Date: Tue, 10 Oct 2006 01:52:05 +0200
From: Fluffles <etc@fluffles.net>
User-Agent: Thunderbird 1.5.0.7 (X11/20060917)
MIME-Version: 1.0
To: freebsd-fs@FreeBSD.org
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: 
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Oct 2006 23:52:40 -0000

Bruce Evans wrote:
>I suspect the problems are that the 64K-block i/o is usually perfectly
>misaligned unless the fs itself has 64K-blocks and the fs's partition
>starts on a 64K-block boundary, and that some hardware or firmware
>(mainly RAIDs) want the blocks to be aligned.


But I have done these tests on /dev/da0, thus without any labeling! This
means there is no offset, such as the 16 sectors caused by disk labeling,
that could spoil my stripe-block alignment. So I would assume there is no
alignment problem, is there? I would assume that if I run newfs directly
on /dev/da0, the 64KB blocks start at offset 0, which implies
that no alignment problems exist.

If all this is true, another reason for the huge performance increase
must be sought. In all my tests, using a 64KB blocksize instead of the
default 16KB yielded better results, also with software RAID like
gstripe. And I never use labeling. Actually, I would have liked the
blocksize limit to be higher, to try out whether 128KB or even more would
continue to yield better results.

- Veronica

From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 10 10:02:20 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id B9A6016A403;
	Tue, 10 Oct 2006 10:02:20 +0000 (UTC)
	(envelope-from danny@cs.huji.ac.il)
Received: from cs1.cs.huji.ac.il (cs1.cs.huji.ac.il [132.65.16.10])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 4658A43D58;
	Tue, 10 Oct 2006 10:02:20 +0000 (GMT)
	(envelope-from danny@cs.huji.ac.il)
Received: from pampa.cs.huji.ac.il ([132.65.80.32])
	by cs1.cs.huji.ac.il with esmtp
	id 1GXERG-000BLK-2p; Tue, 10 Oct 2006 12:02:18 +0200
X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.2
To: Daichi GOTO <daichi@freebsd.org>
In-reply-to: <44FD8B2B.60501@freebsd.org> 
References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org> 
	<20060903170129.GA98917@xor.obsecurity.org>
	<20060903172033.GA99212@xor.obsecurity.org>
	<20060904184717.GA41475@xor.obsecurity.org>
	<44FD8B2B.60501@freebsd.org>
Comments: In-reply-to Daichi GOTO <daichi@freebsd.org>
	message dated "Tue, 05 Sep 2006 23:35:23 +0900."
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 10 Oct 2006 12:02:17 +0200
From: Danny Braniss <danny@cs.huji.ac.il>
Message-ID: <E1GXERG-000BLK-2p@cs1.cs.huji.ac.il>
Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org,
	Kris Kennaway <kris@obsecurity.org>
Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 10 Oct 2006 10:02:20 -0000

[...]
> Yeah, we have a new patchset to solve above problem I think.

any chance that the new unionfs will make it to 6.2?
I'm using it, and it's working just fine - as opposed to the unusable
one supplied.

If not, Daichi GOTO, will you have a new set of patches? 
	union_vfsops.c just changed, for example.
thanks,
	danny


From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 10 10:25:19 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id BEC0C16A407;
	Tue, 10 Oct 2006 10:25:19 +0000 (UTC)
	(envelope-from daichi@freebsd.org)
Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.232.58])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5A36D43D45;
	Tue, 10 Oct 2006 10:25:19 +0000 (GMT)
	(envelope-from daichi@freebsd.org)
Received: from [192.168.1.101] (dullmdaler.ongs.co.jp [202.216.232.62])
	by natial.ongs.co.jp (Postfix) with ESMTP id 6EF00244C29;
	Tue, 10 Oct 2006 19:25:17 +0900 (JST)
Message-ID: <452B750D.2020104@freebsd.org>
Date: Tue, 10 Oct 2006 19:25:17 +0900
From: Daichi GOTO <daichi@freebsd.org>
User-Agent: Thunderbird 1.5.0.7 (X11/20060915)
MIME-Version: 1.0
To: Danny Braniss <danny@cs.huji.ac.il>
References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org>
	<20060903170129.GA98917@xor.obsecurity.org>
	<20060903172033.GA99212@xor.obsecurity.org>
	<20060904184717.GA41475@xor.obsecurity.org>
	<44FD8B2B.60501@freebsd.org> <E1GXERG-000BLK-2p@cs1.cs.huji.ac.il>
In-Reply-To: <E1GXERG-000BLK-2p@cs1.cs.huji.ac.il>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org,
	Kris Kennaway <kris@obsecurity.org>
Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 10 Oct 2006 10:25:19 -0000

Danny Braniss wrote:
> [...]
>> Yeah, we have a new patchset to solve above problem I think.
> 
> any chance that the new unionfs will make it to 6.2?

We cannot merge the unionfs patch into the 6.x branch. It will only go
into -current. For 6.x, the patchset remains just a patchset.

> I'm using it, and it's working just fine - as opposed to the unusable
> one supplied.

Under heavy load together with mount_nullfs, it has a problem caused by
the lock mechanism. To solve that problem, we need a new API (function)
for VFS. We are discussing it and need help from the VFS hackers.
Sorry for my slow response :(

> If not, Daichi GOTO, will you have a new set of patches? 
> 	union_vfsops.c just changed, for example.
> thanks,
> 	danny

Hmm...  do you need a new patchset even while it is under construction?

-- 
   Daichi GOTO, http://people.freebsd.org/~daichi

From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 10 14:06:00 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A34E716A4E6;
	Tue, 10 Oct 2006 14:06:00 +0000 (UTC)
	(envelope-from daichi@freebsd.org)
Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.232.58])
	by mx1.FreeBSD.org (Postfix) with ESMTP id AD92E43D58;
	Tue, 10 Oct 2006 14:05:59 +0000 (GMT)
	(envelope-from daichi@freebsd.org)
Received: from [192.168.1.101] (dullmdaler.ongs.co.jp [202.216.232.62])
	by natial.ongs.co.jp (Postfix) with ESMTP id 8D497244C2C;
	Tue, 10 Oct 2006 23:05:57 +0900 (JST)
Message-ID: <452BA8C4.7040906@freebsd.org>
Date: Tue, 10 Oct 2006 23:05:56 +0900
From: Daichi GOTO <daichi@freebsd.org>
User-Agent: Thunderbird 1.5.0.7 (X11/20060915)
MIME-Version: 1.0
To: freebsd-hackers@freebsd.org,  freebsd-current@freebsd.org, 
	freebsd-fs@freebsd.org,  rodrigc@crodrigues.org
Content-Type: text/plain; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit
Cc: daichi@freebsd.org
Subject: [REQUEST] unionfs needs some guys can do implements new 2 APIs for
 VFS
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 10 Oct 2006 14:06:00 -0000

Hi Guys!

We need someone who can implement two new APIs for VFS.
Someone please help us!!

  http://people.freebsd.org/~daichi/unionfs/request-new-api-for-vfs.html


----

The FreeBSD new unionfs implementation: New API request for FreeBSD VFS
=======================================================================


Daichi GOTO (daichi@freebsd.org)


1 Introduction

      We have always tried to confine our changes to the unionfs code.
  But we can make no further progress that way; we now need changes in
  other parts of the system.


2 Problem Description

      So far we have made many improvements to unionfs, but we have
  now hit a limitation in how unionfs handles "copied-up" files.
  Considering future support for MAC extensions, ADVLOCK lock
  information, and similar features, there is all the more reason
  to be careful.


3 Impact

      This leads to confusion in the unionfs implementation and to
  some problems around the lock mechanism. We cannot solve those
  problems with changes to the unionfs code alone.


4 Solution Request

      We need two new APIs (functions) for VFS. Could some developer
  please implement new APIs like the following:

    int VOP_GETALLATTR(struct vnode *vp, struct vnode_xxx *data,
                       struct thread *td)
    {
            /* copy all attributes of vp into data */
            ...;
    }

    int VOP_SETALLATTR(struct vnode *vp, struct vnode_xxx *data,
                       struct thread *td)
    {
            /* apply all attributes in data to vp */
            ...;
    }

  The functions above get/set the vnode information (currently attr,
  extattr and ADVLOCK) in one call when the vnode type is VREG.

  We cannot implement them ourselves because we lack the VFS arcana.
  Please raise your hand and do it.


5 References

  http://people.freebsd.org/~daichi/unionfs/
  http://people.freebsd.org/~daichi/unionfs/index-ja.html
  http://people.freebsd.org/~daichi/unionfs/reason-for-sys-uio-file.html
----

We need your help. Please help us.

-- 
  Daichi GOTO, http://people.freebsd.org/~daichi

From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 10 14:11:12 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id BBDD116A40F;
	Tue, 10 Oct 2006 14:11:12 +0000 (UTC)
	(envelope-from daichi@freebsd.org)
Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.232.58])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 9C2DA43D46;
	Tue, 10 Oct 2006 14:11:11 +0000 (GMT)
	(envelope-from daichi@freebsd.org)
Received: from [192.168.1.101] (dullmdaler.ongs.co.jp [202.216.232.62])
	by natial.ongs.co.jp (Postfix) with ESMTP id 7E775244C2C;
	Tue, 10 Oct 2006 23:11:08 +0900 (JST)
Message-ID: <452BA9FB.3080401@freebsd.org>
Date: Tue, 10 Oct 2006 23:11:07 +0900
From: Daichi GOTO <daichi@freebsd.org>
User-Agent: Thunderbird 1.5.0.7 (X11/20060915)
MIME-Version: 1.0
To: Daichi GOTO <daichi@freebsd.org>
References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org>
	<20060903170129.GA98917@xor.obsecurity.org>
	<20060903172033.GA99212@xor.obsecurity.org>
	<20060904184717.GA41475@xor.obsecurity.org>
	<44FD8B2B.60501@freebsd.org> <E1GXERG-000BLK-2p@cs1.cs.huji.ac.il>
	<452B750D.2020104@freebsd.org>
In-Reply-To: <452B750D.2020104@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Danny Braniss <danny@cs.huji.ac.il>, freebsd-fs@freebsd.org,
	freebsd-current@freebsd.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 10 Oct 2006 14:11:12 -0000

Daichi GOTO wrote:
> Danny Braniss wrote:
>> [...]
>>> Yeah, we have a new patchset to solve above problem I think.
>>
>> any chance that the new unionfs will make it to 6.2?
> 
> We cannot merger unionfs patch to 6.x branch. It'll just only for
> -current. For 6.x patchset is just a patchset.
> 
>> I'm using it, and it's working just fine - as opposed to the unusable
>> one supplied.
> 
> For under some heavy situation with mount_nullfs, it has a problem since 
> the lock mechanism. To solve that problem, we need a new API(function)
> for VFS. We are discussing about it and need vfs-hackers help.
> Sorry for my slow response :(
> 
>> If not, Daichi GOTO, will you have a new set of patches? 
>>     union_vfsops.c just changed, for example.
>> thanks,
>>     danny
> 
> uhmm...  you need a new patchset if it is under construction?

I have updated two documents:

   http://people.freebsd.org/~daichi/unionfs/reason-for-sys-uio-file.html

   http://people.freebsd.org/~daichi/unionfs/request-new-api-for-vfs.html

Anyone who has an interest in unionfs, please read them :)

-- 
   Daichi GOTO, http://people.freebsd.org/~daichi

From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 10 14:57:11 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 4AD6116A403;
	Tue, 10 Oct 2006 14:57:11 +0000 (UTC)
	(envelope-from kris@obsecurity.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 36C3443D7B;
	Tue, 10 Oct 2006 14:57:00 +0000 (GMT)
	(envelope-from kris@obsecurity.org)
Received: from obsecurity.dyndns.org (elvis.mu.org [192.203.228.196])
	by elvis.mu.org (Postfix) with ESMTP id 1B55E1A3C19;
	Tue, 10 Oct 2006 07:57:00 -0700 (PDT)
Received: by obsecurity.dyndns.org (Postfix, from userid 1000)
	id 9287251398; Tue, 10 Oct 2006 10:56:59 -0400 (EDT)
Date: Tue, 10 Oct 2006 10:56:59 -0400
From: Kris Kennaway <kris@obsecurity.org>
To: Danny Braniss <danny@cs.huji.ac.il>
Message-ID: <20061010145659.GA76958@xor.obsecurity.org>
References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org>
	<20060903170129.GA98917@xor.obsecurity.org>
	<20060903172033.GA99212@xor.obsecurity.org>
	<20060904184717.GA41475@xor.obsecurity.org>
	<44FD8B2B.60501@freebsd.org> <E1GXERG-000BLK-2p@cs1.cs.huji.ac.il>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="gKMricLos+KVdGMg"
Content-Disposition: inline
In-Reply-To: <E1GXERG-000BLK-2p@cs1.cs.huji.ac.il>
User-Agent: Mutt/1.4.2.2i
Cc: freebsd-fs@freebsd.org, Daichi GOTO <daichi@freebsd.org>,
	freebsd-current@freebsd.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 10 Oct 2006 14:57:11 -0000


--gKMricLos+KVdGMg
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Oct 10, 2006 at 12:02:17PM +0200, Danny Braniss wrote:
> [...]
> > Yeah, we have a new patchset to solve above problem I think.
>
> any chance that the new unionfs will make it to 6.2?

None, unfortunately - it's not even in 7.0 yet.

Kris

> I'm using it, and it's working just fine - as opposed to the unusable
> one supplied.
>
> If not, Daichi GOTO, will you have a new set of patches?
> 	union_vfsops.c just changed, for example.
> thanks,
> 	danny
>

--gKMricLos+KVdGMg
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (FreeBSD)

iD8DBQFFK7S7Wry0BWjoQKURAuVmAJ0WR7w8ti+QU60g+j8XlYF/58Q0vACg2W5B
VIfhvbhHGR2LPWUWs66mUfs=
=OMU4
-----END PGP SIGNATURE-----

--gKMricLos+KVdGMg--

From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 10 18:09:36 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D258C16A403;
	Tue, 10 Oct 2006 18:09:36 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7786943D79;
	Tue, 10 Oct 2006 18:09:36 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id E160A46B08;
	Tue, 10 Oct 2006 14:09:35 -0400 (EDT)
Date: Tue, 10 Oct 2006 19:09:36 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Daichi GOTO <daichi@freebsd.org>
In-Reply-To: <452BA8C4.7040906@freebsd.org>
Message-ID: <20061010190815.L92182@fledge.watson.org>
References: <452BA8C4.7040906@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org,
	freebsd-current@freebsd.org
Subject: Re: [REQUEST] unionfs needs some guys can do implements new 2 APIs
 for VFS
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 10 Oct 2006 18:09:36 -0000


On Tue, 10 Oct 2006, Daichi GOTO wrote:

> 1 Introduction
>
>      We have always tried to confine our changes to the unionfs code.
>  That approach has reached its limit, and we now need changes outside
>  unionfs as well.
>
>
> 2 Problem Description
>
>      Until now we have made many improvements to unionfs, but
>  we now feel a limitation around the handling of unionfs's
>  "copied-up file".  Considering future support for MAC extensions,
>  ADVLOCK lock information and similar things, there is all the
>  more reason to be careful.
>
> 3 Impact
>
>      This leads to confusion in the unionfs implementation and to
>  some problems around the lock mechanism. We cannot solve those
>  problems with changes to the unionfs code alone.

So, just to be clear that I understand things: the basic problem here is that 
when unionfs copies a file up a layer in the stack due to local modifications 
in the upper layer, you are not able to properly preserve the full set of file 
attributes, so are looking for a way to do this?

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-fs@FreeBSD.ORG  Wed Oct 11 02:42:32 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 9A43D16A4C8
	for <freebsd-fs@freebsd.org>; Wed, 11 Oct 2006 02:42:32 +0000 (UTC)
	(envelope-from janm@transactionware.com)
Received: from mail.transactionware.com (mail.transactionware.com
	[203.14.245.7]) by mx1.FreeBSD.org (Postfix) with SMTP id 1674043D5A
	for <freebsd-fs@freebsd.org>; Wed, 11 Oct 2006 02:41:58 +0000 (GMT)
	(envelope-from janm@transactionware.com)
Received: (qmail 9574 invoked from network); 11 Oct 2006 02:42:12 -0000
Received: from new.transactionware.com (192.168.1.55)
	by dm.transactionware.com with SMTP; 11 Oct 2006 02:42:12 -0000
Received: (qmail 10705 invoked by uid 1026); 11 Oct 2006 02:42:11 -0000
Received: from 192.168.1.51 by new.transactionware.com (envelope-from
	<janm@transactionware.com>, uid 1003) with qmail-scanner-1.25 
	(spamassassin: 3.0.2.  Clear:RC:1(192.168.1.51):. 
	Processed in 3.221908 secs); 11 Oct 2006 02:42:11 -0000
Received: from unknown (HELO janmxp) (192.168.1.51)
	by new.transactionware.com with SMTP; 11 Oct 2006 02:42:07 -0000
Message-ID: <004d01c6ecde$db9ca990$3301a8c0@janmxp>
From: "Jan Mikkelsen" <janm@transactionware.com>
To: "Daichi GOTO" <daichi@freebsd.org>, "Danny Braniss" <danny@cs.huji.ac.il>
References: <44B67340.1080405@freebsd.org>
	<44B74036.6060101@freebsd.org><20060903170129.GA98917@xor.obsecurity.org><20060903172033.GA99212@xor.obsecurity.org><20060904184717.GA41475@xor.obsecurity.org><44FD8B2B.60501@freebsd.org>
	<E1GXERG-000BLK-2p@cs1.cs.huji.ac.il>
	<452B750D.2020104@freebsd.org>
Date: Wed, 11 Oct 2006 12:42:13 +1000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
	reply-type=response
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.3790.2663
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.2757
Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org,
	Kris Kennaway <kris@obsecurity.org>
Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 11 Oct 2006 02:42:32 -0000

Daichi GOTO wrote:
> Danny Braniss wrote:
>> [...]
>>> Yeah, we have a new patchset to solve above problem I think.
>>
>> any chance that the new unionfs will make it to 6.2?
>
> We cannot merger unionfs patch to 6.x branch. It'll just only for
> -current. For 6.x patchset is just a patchset.

Getting it to 6-STABLE at some point would be very nice;  what is currently 
there is unusable.

I have been using your patch successfully and I certainly don't see any 
regressions.  The man pages make it clear that the subsystem will be subject 
to change.

>> I'm using it, and it's working just fine - as opposed to the unusable
>> one supplied.
>
> For under some heavy situation with mount_nullfs, it has a problem since 
> the lock mechanism. To solve that problem, we need a new API(function)
> for VFS. We are discussing about it and need vfs-hackers help.
> Sorry for my slow response :(

Even so, your patch works better than what is there.

>> If not, Daichi GOTO, will you have a new set of patches? union_vfsops.c 
>> just changed, for example.
>> thanks,
>> danny
>
> uhmm...  you need a new patchset if it is under construction?

The patch at http://people.freebsd.org/~daichi/unionfs/unionfs6-p16.diff no 
longer applies cleanly to 6-STABLE.  Where you have replaced complete files, 
it might be worth just providing the new file.

Thank you for your work on this;  I find it very useful.

Regards,

Jan Mikkelsen


From owner-freebsd-fs@FreeBSD.ORG  Wed Oct 11 15:53:03 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 29BC516A415
	for <freebsd-fs@freebsd.org>; Wed, 11 Oct 2006 15:53:03 +0000 (UTC)
	(envelope-from mike@sentex.net)
Received: from smarthost2.sentex.ca (smarthost2.sentex.ca [205.211.164.50])
	by mx1.FreeBSD.org (Postfix) with ESMTP id BF4EE43D66
	for <freebsd-fs@freebsd.org>; Wed, 11 Oct 2006 15:53:02 +0000 (GMT)
	(envelope-from mike@sentex.net)
Received: from BLUELAPIS.sentex.ca (cage.simianscience.com [64.7.134.1])
	by smarthost2.sentex.ca (8.13.8/8.13.8) with SMTP id k9BFr1Vi012893;
	Wed, 11 Oct 2006 11:53:01 -0400 (EDT) (envelope-from mike@sentex.net)
From: Mike Tancsa <mike@sentex.net>
To: Scott Long <scottl@samsco.org>
Date: Wed, 11 Oct 2006 11:53:04 -0400
Message-ID: <dl4qi2t4ms9gp44crfa0marascvoo93grf@4ax.com>
References: <45297DA2.4000509@fluffles.net>
	<20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org>
	<ltgli2hqprmfqq3ttahrtbppj4l3dkt4b7@4ax.com>
	<452AC4EB.8000006@samsco.org>
In-Reply-To: <452AC4EB.8000006@samsco.org>
X-Mailer: Forte Agent 1.93/32.576 English (American)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 11 Oct 2006 15:53:03 -0000

On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
wrote:

>Mike Tancsa wrote:
>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>> wrote:
>>
>>
>>>this is only a crude hack.  I get around this right now by not using a
>>>disklabel or fdisk table on arrays where I value speed.  For those, I
>>>just put a filesystem directly on the array, and boot off of a small
>>>system disk.
>>
>>
>>
>> 	How is that done ?  just newfs -O2 -U /dev/da0  ?
>
>Yup.

Hi,
	Is this going to work in most/all cases ?  In other words, how
do I make sure the file system I lay down is indeed properly /
optimally aligned with the underlying structure ?

	---Mike
--------------------------------------------------------
Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
mike@sentex.net, (http://www.tancsa.com)

From owner-freebsd-fs@FreeBSD.ORG  Wed Oct 11 16:59:20 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 6B0C316A47C
	for <freebsd-fs@freebsd.org>; Wed, 11 Oct 2006 16:59:20 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.FreeBSD.org (Postfix) with ESMTP id CE80A43D99
	for <freebsd-fs@freebsd.org>; Wed, 11 Oct 2006 16:55:40 +0000 (GMT)
	(envelope-from scottl@samsco.org)
Received: from [10.10.3.185] ([165.236.175.187]) (authenticated bits=0)
	by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k9BGtRo7063167;
	Wed, 11 Oct 2006 10:55:33 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Message-ID: <452D21F6.20601@samsco.org>
Date: Wed, 11 Oct 2006 10:55:18 -0600
From: Scott Long <scottl@samsco.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20060206
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Mike Tancsa <mike@sentex.net>
References: <45297DA2.4000509@fluffles.net>
	<20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org>
	<ltgli2hqprmfqq3ttahrtbppj4l3dkt4b7@4ax.com>
	<452AC4EB.8000006@samsco.org>
	<dl4qi2t4ms9gp44crfa0marascvoo93grf@4ax.com>
In-Reply-To: <dl4qi2t4ms9gp44crfa0marascvoo93grf@4ax.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=0.0 required=3.8 tests=none autolearn=failed 
	version=3.1.1
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 11 Oct 2006 16:59:20 -0000

Mike Tancsa wrote:
> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
> wrote:
> 
> 
>>Mike Tancsa wrote:
>>
>>>On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>>wrote:
>>>
>>>
>>>
>>>>this is only a crude hack.  I get around this right now by not using a
>>>>disklabel or fdisk table on arrays where I value speed.  For those, I
>>>>just put a filesystem directly on the array, and boot off of a small
>>>>system disk.
>>>
>>>
>>>
>>>	How is that done ?  just newfs -O2 -U /dev/da0  ?
>>
>>Yup.
> 
> 
> Hi,
> 	Is this going to work in most/all cases ?  In other words, how
> do I make sure the file system I lay down is indeed properly /
> optimally aligned with the underlying structure ?
> 
> 	---Mike

UFS1 skips the first 8k of its space to allow for
bootstrapping/partitioning data.  UFS2 skips the first 64k.
Blocks are then aligned to that skip.  64K is a good alignment
for most RAID cases.  But understanding exactly how RAID-5 works
will help you make appropriate choices.

(Note that in the following write-up I'm actually describing RAID-4.
The only difference between RAID-4 and 5 is that the parity data
is spread out to all of the disks instead of being kept all on a
single disk.  However, this is just a performance detail, and it's
easier to describe how things work if you ignore it.)

As you might know, RAID-4/5 takes N disks and writes data to N-1 of
them while computing and writing a parity calculation to the Nth
disk.  That parity calculation is a logical XOR of the data disks.
One of the neat properties of XOR is that it's a reversible algorithm;
you can take the final answer and re-run the XOR using all but one of
the original components and get an answer that represents the data of
the missing component.
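
A minimal sketch of that XOR property, assuming each disk's share of a
stripe is just a byte buffer in memory.  The name xor_blocks() is made up
for this illustration and is not taken from any real RAID driver.  The
same routine serves both purposes: run it over the data blocks to get the
parity, or run it over the parity plus the surviving data blocks to get
the missing block back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * XOR the first len bytes of nblocks buffers together into out.
 * Hypothetical helper for illustration only.
 */
void
xor_blocks(const uint8_t *const *blocks, size_t nblocks, size_t len,
    uint8_t *out)
{
    assert(nblocks > 0);
    for (size_t i = 0; i < len; i++) {
        uint8_t acc = 0;

        for (size_t b = 0; b < nblocks; b++)
            acc ^= blocks[b][i];
        out[i] = acc;
    }
}
```

To rebuild a lost data block, pass the remaining data blocks and the
parity block as the inputs; the output is the content of the lost block.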

The array is divided into 'stripes', each stripe containing an equal
subsection of each data disk plus the parity disk.  When we talk about
'stripe size', what we are referring to is the size of one of those
subsections.  A 64K stripe size means that each disk is divided into
64K subsections.  The total amount of data in a stripe is then a
function of the stripe size and the number of disks in the array.  If
you have 5 disks in your array and have set a stripe size of 64K, each
stripe will hold a total of 256K of data (4 data disks and 1 parity
disk).
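
The arithmetic above can be sketched as follows.  This assumes the
simplified RAID-4 layout used in this write-up (parity always on the last
disk) and uses 'stripe size' in the sense defined above, i.e. the size of
one disk's subsection.  The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Where a byte offset into the array's data space lands (RAID-4 layout). */
struct stripe_loc {
    uint64_t stripe;     /* stripe number */
    unsigned data_disk;  /* index of the data disk, 0 .. ndisks-2 */
    uint64_t offset;     /* byte offset inside that disk's subsection */
};

/* Bytes of user data held by one stripe: every disk except the parity one. */
uint64_t
stripe_data_bytes(unsigned ndisks, uint64_t stripe_size)
{
    assert(ndisks >= 2);
    return (uint64_t)(ndisks - 1) * stripe_size;
}

/* Map a byte offset in the array to stripe, data disk and inner offset. */
struct stripe_loc
stripe_locate(uint64_t off, unsigned ndisks, uint64_t stripe_size)
{
    struct stripe_loc loc;
    uint64_t chunk;

    assert(ndisks >= 2 && stripe_size > 0);
    chunk = off / stripe_size;          /* which subsection, array-wide */
    loc.stripe = chunk / (ndisks - 1);
    loc.data_disk = (unsigned)(chunk % (ndisks - 1));
    loc.offset = off % stripe_size;
    return loc;
}
```

With 5 disks and a 64K stripe size, stripe_data_bytes() gives 256K,
matching the example above.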

Every time you write to a RAID-5 array, parity needs to be updated.
As everything operates in terms of the stripes, the most
straightforward way to do this is to read all of the data from the stripe,
replace the portion that is being written, recompute the parity, and
then write out the updates.  This is also the slowest way to do it.

An easy optimization is to buffer the writes and look for situations
where all of the data in a stripe is being written sequentially.  If
all of the data in the stripe is being replaced, there is no need to
read any of the old data.  Just collect all of the writes together,
compute the parity, and write everything out all at once.

Another optimization is to recognize when only one member of the stripe
is being updated.  For that, you read the parity, read the old data, and
then XOR out the old data and XOR in the new data.  You still have the
latency of waiting for a read, but on a busy system you reduce head
movement on all of the disks, which is a big win.
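
As a rough sketch of the single-member update, assuming in-memory byte
buffers again: the new parity is the old parity with the old data XORed
out and the new data XORed in.  parity_update() is a made-up name for
this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Update a parity block in place when exactly one member of the stripe
 * changes from old_data to new_data.  No other disk has to be read.
 */
void
parity_update(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t len)
{
    assert(parity != NULL && old_data != NULL && new_data != NULL);
    for (size_t i = 0; i < len; i++)
        parity[i] = (uint8_t)(parity[i] ^ old_data[i] ^ new_data[i]);
}
```

The result is the same parity you would get by recomputing the XOR over
the whole stripe, but only two reads (old data, old parity) are needed.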

Both of these optimizations rely on the writes having a certain amount
of alignment.  If your stripe size is 64k and your writes are 64k, but
they all start at an 8k offset into the stripe, you lose.  Each 64K
write will have to touch 56k of one disk and 8k of the next disk.  But,
an 8k offset can be made to work if you reduce your stripe size to 8k.
It then becomes an exercise in balancing the parameters of FS block
size and array stripe size to give you the best performance for your
needs.  The 64k offset in UFS2 gives you more room to work here, so
that's why I said at the beginning that it's a good value.  In any case,
you want to choose parameters that result in each block write covering
either a single disk or a whole stripe.
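
One way to check a candidate set of parameters is to count how many
per-disk subsections a write touches.  The sketch below assumes byte
offsets measured from the start of the array and the RAID-4 layout used
above; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of per-disk subsections touched by a write of len bytes at start. */
uint64_t
units_touched(uint64_t start, uint64_t len, uint64_t stripe_size)
{
    assert(stripe_size > 0 && len > 0);
    return (start + len - 1) / stripe_size - start / stripe_size + 1;
}

/* Nonzero if the write starts on a stripe boundary and covers whole stripes. */
int
covers_whole_stripes(uint64_t start, uint64_t len, unsigned ndisks,
    uint64_t stripe_size)
{
    uint64_t stripe_bytes;

    assert(ndisks >= 2 && stripe_size > 0 && len > 0);
    stripe_bytes = (uint64_t)(ndisks - 1) * stripe_size;
    return start % stripe_bytes == 0 && len % stripe_bytes == 0;
}
```

For the losing case above, units_touched(8192, 65536, 65536) reports two
subsections (56k on one disk, 8k on the next), while the same write at a
64k-aligned offset reports one.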

Where things really go bad for BSD is when a _63_ sector offset gets
introduced for the MBR.  Now everything is offset to an odd,
non-power-of-2 value, and there isn't anything that you can tweak in the
filesystem or array to compensate.  The best you can do is to manually
calculate a compensating offset in the disklabel for each partition.
But at that point, it often becomes easier to just ditch all of that and
put the filesystem directly on the disk.
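
For anyone who does want to compute the compensating offset by hand, a
sketch of the calculation follows.  It assumes 512-byte sectors unless
told otherwise and only looks at where the partition starts; it does not
account for the filesystem's own internal skip discussed at the top.
align_pad_sectors() is a made-up helper, not an existing tool.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sectors to skip so that a partition beginning at start_sector begins
 * on a multiple of unit_bytes instead.
 */
uint64_t
align_pad_sectors(uint64_t start_sector, uint64_t unit_bytes,
    uint64_t sector_size)
{
    uint64_t unit_sectors;

    assert(sector_size > 0);
    assert(unit_bytes >= sector_size && unit_bytes % sector_size == 0);
    unit_sectors = unit_bytes / sector_size;
    return (unit_sectors - start_sector % unit_sectors) % unit_sectors;
}
```

With a 63-sector MBR offset and a 64k stripe size (128 sectors of 512
bytes), this yields 65 sectors of padding, which moves the start to
sector 128.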

Scott

From owner-freebsd-fs@FreeBSD.ORG  Wed Oct 11 19:41:19 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 33ECD16A403
	for <freebsd-fs@freebsd.org>; Wed, 11 Oct 2006 19:41:19 +0000 (UTC)
	(envelope-from anderson@centtech.com)
Received: from mh2.centtech.com (moat3.centtech.com [64.129.166.50])
	by mx1.FreeBSD.org (Postfix) with ESMTP id AA34E43D6E
	for <freebsd-fs@freebsd.org>; Wed, 11 Oct 2006 19:41:18 +0000 (GMT)
	(envelope-from anderson@centtech.com)
Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220])
	by mh2.centtech.com (8.13.1/8.13.1) with ESMTP id k9BJfHfv044470;
	Wed, 11 Oct 2006 14:41:17 -0500 (CDT)
	(envelope-from anderson@centtech.com)
Message-ID: <452D48DF.5010502@centtech.com>
Date: Wed, 11 Oct 2006 14:41:19 -0500
From: Eric Anderson <anderson@centtech.com>
User-Agent: Thunderbird 1.5.0.7 (X11/20060923)
MIME-Version: 1.0
To: Scott Long <scottl@samsco.org>
References: <45297DA2.4000509@fluffles.net>	<20061010051216.G814@epsplex.bde.org>
	<452AB55D.9090607@samsco.org>	<ltgli2hqprmfqq3ttahrtbppj4l3dkt4b7@4ax.com>	<452AC4EB.8000006@samsco.org>	<dl4qi2t4ms9gp44crfa0marascvoo93grf@4ax.com>
	<452D21F6.20601@samsco.org>
In-Reply-To: <452D21F6.20601@samsco.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV 0.87.1/2024/Wed Oct 11 05:53:09 2006 on
	mh2.centtech.com
X-Virus-Status: Clean
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 11 Oct 2006 19:41:19 -0000

On 10/11/06 11:55, Scott Long wrote:
> Mike Tancsa wrote:
>> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
>> wrote:
>>
>>
>>> Mike Tancsa wrote:
>>>
>>>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>>> wrote:
>>>>
>>>>
>>>>
>>>>> this is only a crude hack.  I get around this right now by not using a
>>>>> disklabel or fdisk table on arrays where I value speed.  For those, I
>>>>> just put a filesystem directly on the array, and boot off of a small
>>>>> system disk.
>>>>
>>>>
>>>> 	How is that done ?  just newfs -O2 -U /dev/da0  ?
>>> Yup.
>>
>> Hi,
>> 	Is this going to work in most/all cases ?  In other words, how
>> do I make sure the file system I lay down is indeed properly /
>> optimally aligned with the underlying structure ?
>>
>> 	---Mike
> 
> UFS1 skips the first 8k of its space to allow for
> bootstrapping/partitioning data.  UFS2 skips the first 64k.
> Blocks are then aligned to that skip.  64K is a good alignment
> for most RAID cases.  But understanding exactly how RAID-5 works
> will help you make appropriate choices.
> 
> (Note that in the following write-up I'm actually describing RAID-4.
> The only difference between RAID-4 and 5 is that the parity data
> is spread out to all of the disks instead of being kept all on a
> single disk.  However, this is just a performance detail, and it's
> easier to describe how things work if you ignore it.)
> 
> As you might know, RAID-4/5 takes N disks and writes data to N-1 of
> them while computing and writing a parity calculation to the Nth
> disk.  That parity calculation is a logical XOR of the data disks.
> One of the neat properties of XOR is that it's a reversible algorithm;
> you can take the final answer and re-run the XOR using all but one of
> the original components and get an answer that represents the data of
> the missing component.
> 
> The array is divided into 'stripes', each stripe containing an equal
> subsection of each data disk plus the parity disk.  When we talk about
> 'stripe size', what we are referring to is the size of one of those
> subsections.  A 64K stripe size means that each disk is divided into
> 64K subsections.  The total amount of data in a stripe is then a
> function of the stripe size and the number of disks in the array.  If
> you have 5 disks in your array and have set a stripe size of 64K, each
> stripe will hold a total of 256K of data (4 data disks and 1 parity
> disk).
> 
> Every time you write to a RAID-5 array, parity needs to be updated.
> As everything operates in terms of the stripes, the most
> straightforward way to do this is to read all of the data from the stripe,
> replace the portion that is being written, recompute the parity, and
> then write out the updates.  This is also the slowest way to do it.
> 
> An easy optimization is to buffer the writes and look for situations
> where all of the data in a stripe is being written sequentially.  If
> all of the data in the stripe is being replaced, there is no need to
> read any of the old data.  Just collect all of the writes together,
> compute the parity, and write everything out all at once.
> 
> Another optimization is to recognize when only one member of the stripe
> is being updated.  For that, you read the parity, read the old data, and
> then XOR out the old data and XOR in the new data.  You still have the
> latency of waiting for a read, but on a busy system you reduce head
> movement on all of the disks, which is a big win.
> 
> Both of these optimizations rely on the writes having a certain amount
> of alignment.  If your stripe size is 64k and your writes are 64k, but
> they all start at an 8k offset into the stripe, you lose.  Each 64K
> write will have to touch 56k of one disk and 8k of the next disk.  But,
> an 8k offset can be made to work if you reduce your stripe size to 8k.
> It then becomes an exercise in balancing the parameters of FS block
> size and array stripe size to give you the best performance for your
> needs.  The 64k offset in UFS2 gives you more room to work here, so
> that's why I said at the beginning that it's a good value.  In any case,
> you want to choose parameters that result in each block write covering
> either a single disk or a whole stripe.
> 
> Where things really go bad for BSD is when a _63_ sector offset gets
> introduced for the MBR.  Now everything is offset to an odd,
> non-power-of-2 value, and there isn't anything that you can tweak in the
> filesystem or array to compensate.  The best you can do is to manually
> calculate a compensating offset in the disklabel for each partition.
> But at that point, it often becomes easier to just ditch all of that and
> put the filesystem directly on the disk.
> 
> Scott
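
A minimal sketch of the two parity shortcuts described in the quoted text,
assuming a single XOR parity chunk per stripe (RAID-5 style).  The function
names and the byte-array layout are illustrative assumptions, not taken from
any particular driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Full-stripe write: the parity is the XOR of all data chunks, so none
 * of the old data has to be read.  chunks[i] points to chunk_len bytes.
 */
void
parity_full_stripe(uint8_t *parity, const uint8_t *const *chunks,
    size_t nchunks, size_t chunk_len)
{
	assert(nchunks > 0);
	for (size_t off = 0; off < chunk_len; off++) {
		uint8_t p = 0;

		for (size_t i = 0; i < nchunks; i++)
			p ^= chunks[i][off];
		parity[off] = p;
	}
}

/*
 * Single-member update: XOR the old data out of the parity and XOR the
 * new data in.  Only the parity and the one changed chunk are read.
 */
void
parity_update_one(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t chunk_len)
{
	for (size_t off = 0; off < chunk_len; off++)
		parity[off] ^= old_data[off] ^ new_data[off];
}
```

Updating one member with parity_update_one() should leave the same parity
as recomputing it over the whole stripe with parity_full_stripe(), which is
why the other members never need to be read.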


Scott,

Just wanted to say thanks for such a well put explanation on this, with 
all the right details.
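
As a concrete check of the alignment rule quoted above, here is a small
sketch that counts how many per-disk chunks a single write touches.  It
assumes "stripe size" in the quoted text means the amount of data placed on
one disk before moving to the next; chunks_touched() is a hypothetical
helper, not an existing API:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of per-disk chunks touched by a write of len bytes that starts
 * at byte offset off from the start of the array.  chunk is the number
 * of bytes placed on one disk before moving on to the next disk.
 */
uint64_t
chunks_touched(uint64_t off, uint64_t len, uint64_t chunk)
{
	uint64_t first, last;

	assert(chunk > 0 && len > 0);
	first = off / chunk;
	last = (off + len - 1) / chunk;
	return (last - first + 1);
}
```

With 64k chunks, a 64k write at offset 0 touches one chunk, while the same
write at an 8k offset touches two (56k on one disk, 8k on the next).  With
a 63-sector (63 * 512 byte) partition offset and 16k blocks, some blocks
straddle a 64k chunk boundary no matter how the filesystem is tuned.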

Eric


-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG  Sat Oct 14 06:07:00 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 3583A16A412;
	Sat, 14 Oct 2006 06:07:00 +0000 (UTC) (envelope-from bde@zeta.org.au)
Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7FA7243D55;
	Sat, 14 Oct 2006 06:06:57 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout2.pacific.net.au (Postfix) with ESMTP id E67B110A1BC;
	Sat, 14 Oct 2006 16:06:55 +1000 (EST)
Received: from epsplex.bde.org (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (Postfix) with ESMTP id 103EA2740C;
	Sat, 14 Oct 2006 16:06:53 +1000 (EST)
Date: Sat, 14 Oct 2006 16:06:53 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@epsplex.bde.org
To: fs@freebsd.org
In-Reply-To: <20061006050913.Y5250@epsplex.bde.org>
Message-ID: <20061014143825.F1264@epsplex.bde.org>
References: <20061006050913.Y5250@epsplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: mohans@freebsd.org
Subject: Re: lost dotdot caching pessimizes nfs especially
X-List-Received-Date: Sat, 14 Oct 2006 06:07:00 -0000

On Fri, 6 Oct 2006, Bruce Evans wrote:

> This change:
>
> % Index: vfs_cache.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/kern/vfs_cache.c,v
> % retrieving revision 1.102
> % retrieving revision 1.103
> % diff -u -2 -r1.102 -r1.103
> % --- vfs_cache.c	13 Jun 2005 05:59:59 -0000	1.102
> % +++ vfs_cache.c	17 Jun 2005 01:05:13 -0000	1.103
> % ...
>
> is responsible for about half of the performance loss since RELENG_4
> for building kernels over nfs (/usr and sys trees on nfs).  The kernel
> build uses "../../" a lot, and the above change apparently results in
> lots of network activity for things that should be cached locally.
>
> Some times for building a RELENG_4 kernel under conditions invariant
> except for the host kernel (after "make clean; sleep 2; make depend;
> make; make clean; sleep 2; make depend" to warm up caches):
>
> kernel:
> RELENG_4                 77.51 real        60.62 user         4.36 sys
> current.2004.07.01       ~78.5 (lost details)
> current.2005.01.01       ~79 (lost details)
> current.2005.06.17       82.42 real        62.50 user         4.71 sys
> current.2005.06.19       89.53 real        62.18 user         5.44 sys
> current.2005.06.17+      ~89.5 (lost details)
>               .17+ = .17 plus above change
> current.2005.06.17+*     86.08 real        62.43 user         5.13 sys
>               .17+* = .17+ with ../.. in Makefile avoided using a symlink
> 			    @ -> <path to sys not using ..>
> RELENG_6                 91.14 real        62.04 user         5.71 sys
> current                  similar to RELENG_6 (lost details)
>
> The total performance loss is about 18%.
>
> The total performance loss for a local sys tree (/usr still on nfs) is much
> smaller (about 4%):
>
> RELENG_4                 65.19 real        60.50 user         3.95 sys
> current.2005.06.17       67.49 real        62.13 user         4.27 sys
> RELENG_6                 67.83 real        61.84 user         4.71 sys
> current                  similar to RELENG_6 (lost details)
>
> The nfs performance for building of things that should be entirely
> cached locally is very dependent on network latency.  Not caching
> things very well causes lots of unnecessary network traffic for Getattr
> and Lookup.  The packets are small, so throughput is unimportant and
> latency dominates.  For building over nfs without -j, the dead time
> (real - user - sys) is almost directly proportional to the latency.
> My usual local network has fairly low latency (~100uS unloaded) and
> the ~14 seconds dead time in the above is for it.  Switching to a 1
> Gbps network with lower quality NICs gives an unloaded latency of ~160uS
> and a dead time of ~21 seconds.  Building with -j helps even for UP,
> at the cost of extra CPU, by letting some processes advance using cached
> stuff while others are waiting for the network.  Building with -j helps
> even more on FreeBSD cluster machines, more because they have a much
> higher network latency than because they are SMP.

I finished finding almost all the lost performance.  As indicated above,
it was almost all in nfs.
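
A small sketch of the arithmetic behind the quoted figures: the dead time
(real - user - sys) and the relative slowdown in real time.  The struct and
function names are assumptions made for this example:

```c
#include <assert.h>

/* One line of time(1)-style output, in seconds. */
struct build_time {
	double real;
	double user;
	double sys;
};

/* Time spent neither in user nor in system mode, e.g. waiting for RPCs. */
double
dead_time(struct build_time t)
{
	return (t.real - t.user - t.sys);
}

/* Slowdown of t relative to base, as a percentage of base's real time. */
double
slowdown_percent(struct build_time base, struct build_time t)
{
	assert(base.real > 0.0);
	return ((t.real - base.real) / base.real * 100.0);
}
```

For the quoted nfs sys tree timings, RELENG_4 (77.51 real, 60.62 user,
4.36 sys) has 12.53 seconds of dead time and RELENG_6 (91.14 real, 62.04
user, 5.71 sys) has 23.39 seconds; the slowdown in real time is about
17.6%, matching the "about 18%" in the quoted text.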

This change:

% Index: nfs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
% retrieving revision 1.235
% retrieving revision 1.236
% diff -u -2 -r1.235 -r1.236
% --- nfs_vnops.c	6 Dec 2004 18:52:28 -0000	1.235
% +++ nfs_vnops.c	6 Dec 2004 19:18:00 -0000	1.236
% @@ -418,10 +418,11 @@
%  		if (error)
%  			return (error);
% -		np->n_mtime = vattr.va_mtime.tv_sec;
% +		np->n_mtime = vattr.va_mtime;
%  	} else {
% +		np->n_attrstamp = 0;
    		^^^^^^^^^^^^^^^^^^^^
%  		error = VOP_GETATTR(vp, &vattr, ap->a_cred, ap->a_td);
%  		if (error)
%  			return (error);
% -		if (np->n_mtime != vattr.va_mtime.tv_sec) {
% +		if (NFS_TIMESPEC_COMPARE(&np->n_mtime, &vattr.va_mtime)) {
%  			if (vp->v_type == VDIR)
%  				np->n_direofoffset = 0;

and associated changes give silly behaviour that almost doubles the
number of Access RPCs.  One of the associated changes clears n_attrstamp
on close().  Then on open(), since lookup() is called before the above
is reached, nfs_access_otw() has always just been called, and the above
forces another call.
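
The effect can be illustrated with a toy model.  This is only a sketch of
the behaviour described above, not the real nfs client code: struct
toy_node, the toy_*() functions and the timeout value are all invented for
the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ATTR_TIMEOUT	3	/* arbitrary cache lifetime, toy seconds */

/* Toy stand-in for the per-file attribute cache; not struct nfsnode. */
struct toy_node {
	long attrstamp;		/* when attributes were fetched; 0 = invalid */
	unsigned rpcs;		/* simulated over-the-wire calls so far */
};

/* Simulated over-the-wire call: refreshes the cache, counts one RPC. */
static void
toy_fetch_otw(struct toy_node *np, long now)
{
	np->attrstamp = now;
	np->rpcs++;
}

/* Simulated getattr: goes over the wire only if the cache is unusable. */
static void
toy_getattr(struct toy_node *np, long now)
{
	if (np->attrstamp == 0 || now - np->attrstamp >= TOY_ATTR_TIMEOUT)
		toy_fetch_otw(np, now);
}

/*
 * Simulated open(): the lookup has just gone over the wire, then the
 * open path optionally invalidates the cache before asking for the
 * attributes.  Returns the number of RPCs this open() caused.
 */
unsigned
toy_open(struct toy_node *np, long now, bool clear_stamp)
{
	unsigned before = np->rpcs;

	assert(now > 0);
	toy_fetch_otw(np, now);
	if (clear_stamp)
		np->attrstamp = 0;
	toy_getattr(np, now);
	return (np->rpcs - before);
}
```

In this model an open() with the clearing costs two calls and one without
it costs a single call, which is the doubling described above.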

Counting RPCs gives a good metric for the pessimizations.  Removing the
above clearing in RELENG_6 gives the following improvement:

Before:
        89.90 real        62.16 user         5.50 sys
  Lookup Read Write Create Access Fsstat Setattr Other   Total
   60010 2410  5353    442  43785   1742    5194     6  118942
After:
        86.46 real        62.22 user         5.21 sys
  Lookup Read Write Create Access Fsstat Setattr Other   Total
   59986 2410  5353    442  20935   1742    5194     6   96068

Note the RPC delta-counts barely changed except for the Access one.
About 23000 Access calls (43785 - 20935 = 22850) were avoided.  Just
removing the clearing is not correct but is close.
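
A sketch of how the two count lines above can be compared mechanically.
The enum and function names are made up for this example; the column order
follows the tables above:

```c
#include <assert.h>
#include <stddef.h>

/* Column order of the RPC count tables. */
enum rpc_op {
	OP_LOOKUP, OP_READ, OP_WRITE, OP_CREATE,
	OP_ACCESS, OP_FSSTAT, OP_SETATTR, OP_OTHER,
	OP_NOPS
};

/* Sum of all per-operation counts. */
long
rpc_total(const long counts[OP_NOPS])
{
	long sum = 0;

	for (int i = 0; i < OP_NOPS; i++)
		sum += counts[i];
	return (sum);
}

/*
 * Returns the operation whose count dropped the most between the two
 * runs and stores the size of that drop in *drop.
 */
enum rpc_op
largest_drop(const long before[OP_NOPS], const long after[OP_NOPS],
    long *drop)
{
	enum rpc_op best = OP_LOOKUP;

	assert(drop != NULL);
	*drop = before[OP_LOOKUP] - after[OP_LOOKUP];
	for (int i = 1; i < OP_NOPS; i++) {
		long d = before[i] - after[i];

		if (d > *drop) {
			*drop = d;
			best = (enum rpc_op)i;
		}
	}
	return (best);
}
```

Fed the Before and After lines, this reports Access as the largest drop
(22850 calls) and totals of 118942 and 96068, so almost the whole
difference of 22874 is in Access.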

The pessimization in vfs_cache.c 1.103 is now easy to quantify.  It
triples the number of Lookup RPCs.  Removing it in addition to the
above gives a much larger improvement:

        79.24 real        61.87 user         5.04 sys
  Lookup Read Write Create Access Fsstat Setattr Other   Total
   19548 2410  5353    442  20922   1742    5194     6   55617

Note the RPC delta-counts barely changed except for the Lookup one.
About 40000 Lookup calls were avoided.  Just removing the change in
vfs_cache.c 1.103 is not close to being correct.

The last major pessimization is another silly one.  The changes to
mark atimes on exec() and mmap() cause a silly null Setattr RPC for
every exec() (more for interpreters?) and every mmap().  This is
easy to fix (almost) correctly.  VOP_SETATTR() is assumed to do
nothing for requests that it doesn't understand, but nfs_setattr()
does null RPCs instead.  The following fix:

% diff -c2 ./nfsclient/nfs_vnops.c~ ./nfsclient/nfs_vnops.c
% *** ./nfsclient/nfs_vnops.c~	Sun Oct  8 23:08:57 2006
% --- ./nfsclient/nfs_vnops.c	Fri Oct 13 09:58:12 2006
% ***************
% *** 669,675 ****
% 
%   	/*
% ! 	 * Setting of flags is not supported.
%   	 */
% ! 	if (vap->va_flags != VNOVAL)
%   		return (EOPNOTSUPP);
% 
% --- 677,684 ----
% 
%   	/*
% ! 	 * Setting of flags and marking of atimes are not supported.
%   	 */
% ! 	if (vap->va_flags != VNOVAL ||
% ! 	    ((bdefix & 4) && (vap->va_vaflags & VA_MARK_ATIME)))
%   		return (EOPNOTSUPP);
%

in addition to the removals gives the following improvement with
bdefix set to 7:

        78.14 real        62.03 user         4.79 sys
  Lookup Read Write Create Access Fsstat Other   Total
   19556 2410  5353    442  19581   1738    14   49094
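
The guard added by the nfs_setattr() patch above can be exercised in
isolation.  The following is a sketch only: the TOY_* constants are
placeholder values rather than the kernel's definitions, struct toy_vattr
is a cut-down stand-in for struct vattr, and bdefix is treated simply as a
bit mask whose bit 4 enables the atime check, as in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder values for this sketch; see the kernel headers for real ones. */
#define TOY_VNOVAL		(-1L)
#define TOY_VA_MARK_ATIME	0x04
#define TOY_BDEFIX_ATIME	4	/* the bit tested as (bdefix & 4) */

/* Cut-down stand-in for the two struct vattr fields the guard reads. */
struct toy_vattr {
	long va_flags;
	int va_vaflags;
};

/*
 * True when the request should be rejected up front (EOPNOTSUPP in the
 * patch) instead of being turned into an RPC.
 */
bool
toy_setattr_rejected(const struct toy_vattr *vap, int bdefix)
{
	assert(vap != (const struct toy_vattr *)0);
	return (vap->va_flags != TOY_VNOVAL ||
	    ((bdefix & TOY_BDEFIX_ATIME) != 0 &&
	    (vap->va_vaflags & TOY_VA_MARK_ATIME) != 0));
}
```

With bdefix set to 7, an atime-marking request is rejected before any RPC
is built; with bit 4 clear it falls through as before.  Requests that set
flags are rejected in both cases.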

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Sat Oct 14 14:37:45 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 0F12E16A407;
	Sat, 14 Oct 2006 14:37:45 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8DEDA43D46;
	Sat, 14 Oct 2006 14:37:44 +0000 (GMT)
	(envelope-from scottl@samsco.org)
Received: from [192.168.254.11] (phobos.samsco.home [192.168.254.11])
	(authenticated bits=0)
	by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k9EEbbHp087005;
	Sat, 14 Oct 2006 08:37:43 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Message-ID: <4530F62E.20308@samsco.org>
Date: Sat, 14 Oct 2006 08:37:34 -0600
From: Scott Long <scottl@samsco.org>
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US;
	rv:1.8.0.7) Gecko/20060910 SeaMonkey/1.0.5
MIME-Version: 1.0
To: Bruce Evans <bde@zeta.org.au>
References: <20061006050913.Y5250@epsplex.bde.org>
	<20061014143825.F1264@epsplex.bde.org>
In-Reply-To: <20061014143825.F1264@epsplex.bde.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-1.4 required=3.8 tests=ALL_TRUSTED autolearn=failed 
	version=3.1.1
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org
Cc: fs@freebsd.org, mohans@freebsd.org
Subject: Re: lost dotdot caching pessimizes nfs especially
X-List-Received-Date: Sat, 14 Oct 2006 14:37:45 -0000

Bruce Evans wrote:
> On Fri, 6 Oct 2006, Bruce Evans wrote:
> 
[...]
> The last major pessimization is another silly one.  The changes to
> mark atimes on exec() and mmap() cause a silly null Setattr RPC for
> every exec() (more for interpreters?) and every mmap().  This is
> easy to fix (almost) correctly.  VOP_SETATTR() is assumed to do
> nothing for requests that it doesn't understand, but nfs_setattr()
> does null RPCs instead.  The following fix:
> 
> % diff -c2 ./nfsclient/nfs_vnops.c~ ./nfsclient/nfs_vnops.c
> % *** ./nfsclient/nfs_vnops.c~    Sun Oct  8 23:08:57 2006
> % --- ./nfsclient/nfs_vnops.c    Fri Oct 13 09:58:12 2006
> % ***************
> % *** 669,675 ****
> %
> %       /*
> % !      * Setting of flags is not supported.
> %        */
> % !     if (vap->va_flags != VNOVAL)
> %           return (EOPNOTSUPP);
> %
> % --- 677,684 ----
> %
> %       /*
> % !      * Setting of flags and marking of atimes are not supported.
> %        */
> % !     if (vap->va_flags != VNOVAL ||
> % !         ((bdefix & 4) && (vap->va_vaflags & VA_MARK_ATIME)))
> %           return (EOPNOTSUPP);
> %
> 
> in addition to the removals gives the following improvement with
> bdefix set to 7:
> 
>        78.14 real        62.03 user         4.79 sys
>  Lookup Read Write Create Access Fsstat Other   Total
>   19556 2410  5353    442  19581   1738    14   49094
> 
> Bruce

I've seen hints that the excessive null SETATTR calls also create
unpredictable problems with some servers.  Thanks a lot for tracking
this down.

Scott