From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 15:22:14 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DB1FA16A407 for ; Sun, 8 Oct 2006 15:22:14 +0000 (UTC) (envelope-from arne_woerner@yahoo.com) Received: from web30307.mail.mud.yahoo.com (web30307.mail.mud.yahoo.com [209.191.69.69]) by mx1.FreeBSD.org (Postfix) with SMTP id 5839743D4C for ; Sun, 8 Oct 2006 15:22:14 +0000 (GMT) (envelope-from arne_woerner@yahoo.com) Received: (qmail 59249 invoked by uid 60001); 8 Oct 2006 15:22:13 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=w6QoOQsswjFqdafmJM6xLek4RZ1MpW+ocM7gaUbOvwaf64wTEGKj51wt77yyKkOaZT3jRLQWUH0hbC4CM9kArEoQWrR8hCPJ+fHjdKMK8gIXzhY19nPIiVhK2louPKVszgZP/LyK0/umqlAvloUKjAyR1LnlJqPia0jFTISpklY= ; Message-ID: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com> Received: from [83.129.181.92] by web30307.mail.mud.yahoo.com via HTTP; Sun, 08 Oct 2006 08:22:13 PDT Date: Sun, 8 Oct 2006 08:22:13 -0700 (PDT) From: "R. B. Riddick" To: freebsd-fs@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=ascii Content-Transfer-Encoding: quoted-printable Subject: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 15:22:14 -0000

Hi!

We (me and Veronica) mentioned, that starting 2 bonnie (ports/bench) processes on a UFS
1. on a geom_bsd, geom_disk (like ad4 or da0) and geom_stripe (using ad4a, ..., ad10a)
2. with different controllers areca and nVidia and different motherboards and
3. with up to 8 SATA disks
results in a permanently disk-dead system.


Veronica's box had more than 700MB of free memory (according to top), when it happened.

Heavy load (caused by blogbench, rawio, raidtest and dd) causes no problem, while bonnie gets stuck somewhere between putc phase and end of rewrite phase.

The bonnie processes were blocked due to "nbufkv" (some VFS reason).
Geom activity is impossible then (no file system activity happens).
No syslog message can be seen on the console.

Bye
-A&V

From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 16:58:24 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 92AC916A40F for ; Sun, 8 Oct 2006 16:58:24 +0000 (UTC) (envelope-from kris@obsecurity.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.FreeBSD.org (Postfix) with ESMTP id 508C543D46 for ; Sun, 8 Oct 2006 16:58:24 +0000 (GMT) (envelope-from kris@obsecurity.org) Received: from obsecurity.dyndns.org (elvis.mu.org [192.203.228.196]) by elvis.mu.org (Postfix) with ESMTP id 3454F1A3C1A; Sun, 8 Oct 2006 09:58:24 -0700 (PDT) Received: by obsecurity.dyndns.org (Postfix, from userid 1000) id 9D6A25176D; Sun, 8 Oct 2006 12:58:23 -0400 (EDT) Date: Sun, 8 Oct 2006 12:58:23 -0400 From: Kris Kennaway To: "R. B.
Riddick" Message-ID: <20061008165823.GA2061@xor.obsecurity.org> References: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="oyUTqETQ0mS9luUI" Content-Disposition: inline In-Reply-To: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com> User-Agent: Mutt/1.4.2.2i Cc: freebsd-fs@freebsd.org Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 16:58:24 -0000 --oyUTqETQ0mS9luUI Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Sun, Oct 08, 2006 at 08:22:13AM -0700, R. B. Riddick wrote:
> Hi!
>
> We (me and Veronica) mentioned, that starting 2 bonnie (ports/bench) processes on a UFS
> 1. on a geom_bsd, geom_disk (like ad4 or da0) and geom_stripe (using ad4a, ..., ad10a)
> 2. with different controllers areca and nVidia and different motherboards and
> 3. with up to 8 SATA disks
> results in a permanently disk-dead system.
>
>
> Veronica's box had more than 700MB of free memory (according to top), when it happened.
>
> Heavy load (caused by blogbench, rawio, raidtest and dd) causes no problem, while bonnie gets stuck somewhere between putc phase and end of rewrite phase.
>
> The bonnie processes were blocked due to "nbufkv" (some VFS reason).
> Geom activity is impossible then (no file system activity happens).
> No syslog message can be seen on the console.

You forgot to even mention what version you're running ;-)

Also show your kernel config file.
Configure DDB per the chapter on kernel debugging in the developers handbook, break to DDB from the console or serial console, then show us what processes are running and what are their backtraces. Kris --oyUTqETQ0mS9luUI Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFFKS4uWry0BWjoQKURAnd2AKDmi5t4q69iCzZxaIWKz37ve9LX9ACg/tvb KyJ0BYVUNgjVdPZU7SGQebc= =ByF9 -----END PGP SIGNATURE----- --oyUTqETQ0mS9luUI-- From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 17:01:59 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4CC0916A403 for ; Sun, 8 Oct 2006 17:01:59 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.FreeBSD.org (Postfix) with ESMTP id A420D43D49 for ; Sun, 8 Oct 2006 17:01:56 +0000 (GMT) (envelope-from scottl@samsco.org) Received: from [192.168.254.14] (imini.samsco.home [192.168.254.14]) (authenticated bits=0) by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k98H1lPM025130; Sun, 8 Oct 2006 11:01:52 -0600 (MDT) (envelope-from scottl@samsco.org) Message-ID: <45292EFA.4060903@samsco.org> Date: Sun, 08 Oct 2006 11:01:46 -0600 From: Scott Long User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.7) Gecko/20050416 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Kris Kennaway References: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com> <20061008165823.GA2061@xor.obsecurity.org> In-Reply-To: <20061008165823.GA2061@xor.obsecurity.org> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-1.4 required=3.8 tests=ALL_TRUSTED autolearn=failed version=3.1.1 X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org Cc: freebsd-fs@freebsd.org Subject: Re: 2 
bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 17:01:59 -0000 Kris Kennaway wrote: > On Sun, Oct 08, 2006 at 08:22:13AM -0700, R. B. Riddick wrote: > >>Hi! >> >>We (me and Veronica) mentioned, that starting 2 bonnie (ports/bench) processes on a UFS >>1. on a geom_bsd, geom_disk (like ad4 or da0) and geom_stripe (using ad4a, ..., ad10a) >>2. with different controllers areca and nVidia and different motherboards and >>3. with up to 8 SATA disks >>results in a permanently disk-dead system. >> >> >>Veronica's box had more than 700MB of free memory (according to top), when it happened. >> >>Heavy load (caused by blogbench, rawio, raidtest and dd) causes no problem, while bonnie gets stuck somewhere between putc phase and end of rewrite phase. >> >>The bonnie processes were blocked due to "nbufkv" (some VFS reason). >>Geom activity is impossible then (no file system activity happens). >>No syslog message can be seen on the console. > > > You forgot to even mention what version you're running ;-) > > Also show your kernel config file. Configure DDB per the chapter on > kernel debugging in the developers handbook, break to DDB from the > console or serial console, then show us what processes are running and > what are their backtraces. > > Kris No need for all of that information, the bug in vfs_bio.c is quite obvious. =-( Fixing it will take some thought, though. 
Scott From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 19:39:45 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2B2E816A403 for ; Sun, 8 Oct 2006 19:39:45 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210]) by mx1.FreeBSD.org (Postfix) with ESMTP id BE9C643D45 for ; Sun, 8 Oct 2006 19:39:44 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout1.pacific.net.au (Postfix) with ESMTP id 80C8D5BFC33; Mon, 9 Oct 2006 05:39:19 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge3) with ESMTP id k98JdFVA028027; Mon, 9 Oct 2006 05:39:17 +1000 Date: Mon, 9 Oct 2006 05:39:15 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Scott Long In-Reply-To: <45292EFA.4060903@samsco.org> Message-ID: <20061009052237.X30864@delplex.bde.org> References: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com> <20061008165823.GA2061@xor.obsecurity.org> <45292EFA.4060903@samsco.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org, Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 19:39:45 -0000 On Sun, 8 Oct 2006, Scott Long wrote: > Kris Kennaway wrote: >> You forgot to even mention what version you're running ;-) >> >> Also show your kernel config file. Configure DDB per the chapter on > No need for all of that information, the bug in vfs_bio.c is quite obvious. > =-( Fixing it will take some thought, though. 
Is it really obvious? I think it is only obvious that many things are not quite right. The quick fix of increasing BKVASIZE to the size of the largest buffer used should still work to prevent bkva fragmentation. Bruce From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 20:33:56 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 17D0416A40F for ; Sun, 8 Oct 2006 20:33:56 +0000 (UTC) (envelope-from arne_woerner@yahoo.com) Received: from web30312.mail.mud.yahoo.com (web30312.mail.mud.yahoo.com [209.191.69.74]) by mx1.FreeBSD.org (Postfix) with SMTP id 928D143D45 for ; Sun, 8 Oct 2006 20:33:55 +0000 (GMT) (envelope-from arne_woerner@yahoo.com) Received: (qmail 84150 invoked by uid 60001); 8 Oct 2006 20:33:50 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:Cc:MIME-Version:Content-Type:Content-Transfer-Encoding; b=Y1kC8scDcax36/g3c9FPS5QfiX8UX0gnfO+l5bi3IpKr8KuJwNHKF65pqMukWeZ3Qyp+44969gn0ZtE7htGHc7CIEu56FGkTm72cHDxGrJqewZ/VE+EtLW34utR8Wczz074QZJnzEliccbIcErccs9XgKi1hvtuO/G5+ZNicbog= ; Message-ID: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com> Received: from [83.129.181.92] by web30312.mail.mud.yahoo.com via HTTP; Sun, 08 Oct 2006 13:33:49 PDT Date: Sun, 8 Oct 2006 13:33:49 -0700 (PDT) From: "R. B. 
Riddick" To: Bruce Evans , Scott Long MIME-Version: 1.0 Content-Type: text/plain; charset=ascii Content-Transfer-Encoding: quoted-printable Cc: freebsd-fs@freebsd.org, Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 20:33:56 -0000

Bruce wrote:
>On Sun, 8 Oct 2006, Scott Long wrote:
>> Kris Kennaway wrote:
>>> You forgot to even mention what version you're running ;-)
>>>
>>> Also show your kernel config file. Configure DDB per the chapter on
>>>
>> No need for all of that information, the bug in vfs_bio.c is quite obvious.
>> =-( Fixing it will take some thought, though.
>
>Is it really obvious? I think it is only obvious that many things are
>not quite right. The quick fix of increasing BKVASIZE to the size of
>the largest buffer used should still work to prevent bkva fragmentation.
>
OK: The FBSD version was varying: R6.1, R6.1-CURRENT, R7-CURRENT.

But we just found out, that it happens when we use "newfs -b 65536", but not with default "-b" value (whatever that might be)...

So if somebody wants to reproduce it, he/she should use >R6 and "newfs -b 65536"...
I think that were all steps to do...

Can somebody reproduce it now?
DDB is not my bag, so that I would be glad, if somebody with an appropriate setting could reproduce it...

-Arne

From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 20:43:09 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8B8B116A412 for ; Sun, 8 Oct 2006 20:43:09 +0000 (UTC) (envelope-from kris@obsecurity.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
by mx1.FreeBSD.org (Postfix) with ESMTP id 3E5F843D46 for ; Sun, 8 Oct 2006 20:43:09 +0000 (GMT) (envelope-from kris@obsecurity.org) Received: from obsecurity.dyndns.org (elvis.mu.org [192.203.228.196]) by elvis.mu.org (Postfix) with ESMTP id F0C4F1A3C1C; Sun, 8 Oct 2006 13:43:08 -0700 (PDT) Received: by obsecurity.dyndns.org (Postfix, from userid 1000) id 6F230515FA; Sun, 8 Oct 2006 16:43:08 -0400 (EDT) Date: Sun, 8 Oct 2006 16:43:08 -0400 From: Kris Kennaway To: "R. B. Riddick" Message-ID: <20061008204308.GA7702@xor.obsecurity.org> References: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="mP3DRpeJDSE+ciuQ" Content-Disposition: inline In-Reply-To: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com> User-Agent: Mutt/1.4.2.2i Cc: freebsd-fs@freebsd.org, Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 20:43:09 -0000 --mP3DRpeJDSE+ciuQ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Oct 08, 2006 at 01:33:49PM -0700, R. B. Riddick wrote: > Bruce wrote: > >On Sun, 8 Oct 2006, Scott Long wrote: > >> Kris Kennaway wrote: > >>> You forgot to even mention what version you're running ;-) > >>>=20 > >>> Also show your kernel config file. Configure DDB per the chapter on > >>> > >> No need for all of that information, the bug in vfs_bio.c is quite obv= ious.=20 > >> =3D-( Fixing it will take some thought, though. > > > >Is it really obvious? I think it is only obvious that many things are > >not quite right. The quick fix of increasing BKVASIZE to the size of > >the largest buffer used should still work to prevent bkva fragmentation. 
> >
> OK: The FBSD version was varying: R6.1, R6.1-CURRENT, R7-CURRENT.
>
> But we just found out, that it happens when we use "newfs -b 65536", but not with default "-b" value (whatever that might be)...
>
> So if somebody wants to reproduce it, he/she should use >R6 and "newfs -b 65536"...
> I think that were all steps to do...

Thanks, I can now reproduce on 7.0.

 8197  3980  8197     0  S+      nbufkv   0xc07cec08 bonnie
 8196  3980  8196     0  S+      nbufkv   0xc07cec08 bonnie

db> wh 8197
Tracing pid 8197 tid 100205 td 0xc87a6510
sched_switch(c87a6510,0,1,15e,4,...) at sched_switch+0x120
mi_switch(1,0,c0758aba,1bf,0,...) at mi_switch+0x1b2
sleepq_switch(c07c5390,0,c0758aba,211,ec9217d0,...) at sleepq_switch+0xee
sleepq_wait(c07cec08,0,c075614c,c9,0,...) at sleepq_wait+0x3e
msleep(c07cec08,c07cec0c,50,c075dece,0,...) at msleep+0x171
getnewbuf(10000,10000,c075da89,9fe,10000,...) at getnewbuf+0x319
getblk(c5d58514,fffffff4,ffffffff,10000,0,...) at getblk+0x307
breadn(c5d58514,fffffff4,ffffffff,10000,0,...) at breadn+0x4d
bread(c5d58514,fffffff4,ffffffff,10000,0,...) at bread+0x4c
ffs_balloc_ufs2(c5d58514,273a000,0,2000,c51b0e00,...) at ffs_balloc_ufs2+0x5ab
ffs_write(ec921b9c,0,c07535f4,0,0,...) at ffs_write+0x2f2
VOP_WRITE_APV(c07ada20,ec921b9c,c87a6510,c54d5c60,2,...) at VOP_WRITE_APV+0x9a
vn_write(c54d5c60,ec921c64,c51b0e00,0,c87a6510,...) at vn_write+0x1d5
dofilewrite(c54d5c60,ec921c64,ffffffff,ffffffff,0,...) at dofilewrite+0x7c
kern_writev(c87a6510,3,ec921c64,bfbfc820,2000,...) at kern_writev+0x6b
write(c87a6510,ec921d04,c,158,3,...) at write+0x4d
syscall(820003b,3b,bfbf003b,0,2000,...) at syscall+0x152
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (4, FreeBSD ELF32, write), eip = 0x28155dff, esp = 0xbfbf736c, ebp = 0xbfbfe838 ---

db> wh 8196
Tracing pid 8196 tid 100138 td 0xc50c3d80
sched_switch(c50c3d80,0,1,15e,246,...) at sched_switch+0x120
mi_switch(1,0,c0758aba,1bf,0,...) at mi_switch+0x1b2
sleepq_switch(c07c5390,0,c0758aba,211,ec79b820,...) at sleepq_switch+0xee
sleepq_wait(c07cec08,0,c075614c,c9,0,...) at sleepq_wait+0x3e
msleep(c07cec08,c07cec0c,50,c075dece,0,...) at msleep+0x171
getnewbuf(10000,10000,c075da89,9fe,10000,...) at getnewbuf+0x319
getblk(c5e6d514,fffffff4,ffffffff,10000,0,...) at getblk+0x307
ufs_bmaparray(c5e6d514,3cc,0,ec79b994,0,...) at ufs_bmaparray+0x298
ufs_bmap(ec79b9dc,c075da89,1ac) at ufs_bmap+0x69
VOP_BMAP_APV(c07ada20,ec79b9dc,c075da89,3b7,ffffffff,...) at VOP_BMAP_APV+0x72
bdwrite(ddbe5790,0,ec79bc64,2000,c51b0e00,...) at bdwrite+0x485
ffs_write(ec79bb9c,0,c07535f4,0,0,...) at ffs_write+0x5b5
VOP_WRITE_APV(c07ada20,ec79bb9c,c50c3d80,c5058c60,2,...) at VOP_WRITE_APV+0x9a
vn_write(c5058c60,ec79bc64,c51b0e00,0,c50c3d80,...) at vn_write+0x1d5
dofilewrite(c5058c60,ec79bc64,ffffffff,ffffffff,0,...) at dofilewrite+0x7c
kern_writev(c50c3d80,3,ec79bc64,bfbfe820,0,...) at kern_writev+0x6b
write(c50c3d80,ec79bd04,c,158,3,...) at write+0x4d
syscall(3b,3b,bfbf003b,0,2000,...) at syscall+0x152
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (4, FreeBSD ELF32, write), eip = 0x28155dff, esp = 0xbfbf736c, ebp = 0xbfbfe838 ---
db>

Kris

--mP3DRpeJDSE+ciuQ Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFFKWLbWry0BWjoQKURAhYEAKCr/KsTlrpfgcn5JPq6Lc7HcY/LBwCgjsDn NCg2VRMnxO8xbit/xmqKtuQ= =ror+ -----END PGP SIGNATURE----- --mP3DRpeJDSE+ciuQ--

From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 21:02:17 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7478116A403 for ; Sun, 8 Oct 2006 21:02:17 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5909F43D53 for ; Sun, 8 Oct 2006 21:02:15 +0000 (GMT) (envelope-from scottl@samsco.org) Received: from
[192.168.254.14] (imini.samsco.home [192.168.254.14]) (authenticated bits=0) by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k98L27Xc026112; Sun, 8 Oct 2006 15:02:13 -0600 (MDT) (envelope-from scottl@samsco.org) Message-ID: <4529674E.6000405@samsco.org> Date: Sun, 08 Oct 2006 15:02:06 -0600 From: Scott Long User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.7) Gecko/20050416 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Bruce Evans References: <20061008152213.59247.qmail@web30307.mail.mud.yahoo.com> <20061008165823.GA2061@xor.obsecurity.org> <45292EFA.4060903@samsco.org> <20061009052237.X30864@delplex.bde.org> In-Reply-To: <20061009052237.X30864@delplex.bde.org> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-1.4 required=3.8 tests=ALL_TRUSTED autolearn=failed version=3.1.1 X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org Cc: freebsd-fs@freebsd.org, Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 21:02:17 -0000 Bruce Evans wrote: > On Sun, 8 Oct 2006, Scott Long wrote: > >> Kris Kennaway wrote: >> >>> You forgot to even mention what version you're running ;-) >>> >>> Also show your kernel config file. Configure DDB per the chapter on > > >> No need for all of that information, the bug in vfs_bio.c is quite >> obvious. =-( Fixing it will take some thought, though. > > > Is it really obvious? I think it is only obvious that many things are > not quite right. The quick fix of increasing BKVASIZE to the size of > the largest buffer used should still work to prevent bkva fragmentation. > > Bruce The use of needsbuffer global presents a very wide open race. 
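Scott does not spell out the race he has in mind. As a rough illustration only, the sketch below models a generic lost wakeup around a shared "someone is waiting" flag. Everything in it is hypothetical: the names `BufferPool`, `needs_buffer`, and `run`, and the three-step interleaving, are assumptions made for the example and are not taken from vfs_bio.c.

```python
class BufferPool:
    """Toy model (hypothetical): a free-buffer count plus one global flag."""

    def __init__(self):
        self.free = 0               # number of free buffers
        self.needs_buffer = False   # global "a waiter exists" flag
        self.sleeping = False       # True while the waiter is asleep


def waiter_check(pool):
    """Step 1: the waiter looks for a buffer and reports whether it must sleep."""
    return pool.free == 0


def waiter_sleep(pool):
    """Step 2: the waiter announces itself via the flag and goes to sleep."""
    pool.needs_buffer = True
    pool.sleeping = True


def releaser(pool):
    """A buffer is freed; a wakeup is issued only if the flag is already set."""
    pool.free += 1
    if pool.needs_buffer:
        pool.needs_buffer = False
        pool.sleeping = False       # wakeup


def run(order):
    """Replay the steps in the given order and return (sleeping, free)."""
    pool = BufferPool()
    must_sleep = False
    for step in order:
        if step == "check":
            must_sleep = waiter_check(pool)
        elif step == "sleep" and must_sleep:
            waiter_sleep(pool)
        elif step == "release":
            releaser(pool)
    return pool.sleeping, pool.free
```

With the order check, sleep, release the waiter is woken. With check, release, sleep the release happens before the flag is set, so the waiter ends up asleep although a buffer is free. Closing such a window normally means doing the check and the sleep under one lock.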
Scott From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 22:25:00 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E719D16A403 for ; Sun, 8 Oct 2006 22:25:00 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5625A43D45 for ; Sun, 8 Oct 2006 22:25:00 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout1.pacific.net.au (Postfix) with ESMTP id E03555DFC21; Mon, 9 Oct 2006 08:24:58 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge3) with ESMTP id k98MOuXF008714; Mon, 9 Oct 2006 08:24:56 +1000 Date: Mon, 9 Oct 2006 08:24:55 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: "R. B. Riddick" In-Reply-To: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com> Message-ID: <20061009075528.W31379@delplex.bde.org> References: <20061008203349.84148.qmail@web30312.mail.mud.yahoo.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@FreeBSD.org, Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 22:25:01 -0000 On Sun, 8 Oct 2006, R. B. Riddick wrote: > Bruce wrote: >> On Sun, 8 Oct 2006, Scott Long wrote: >>> Kris Kennaway wrote: >>>> You forgot to even mention what version you're running ;-) >>>> >>>> Also show your kernel config file. Configure DDB per the chapter on >>>> >>> No need for all of that information, the bug in vfs_bio.c is quite obvious. 
>>> =-( Fixing it will take some thought, though.
>>
>> Is it really obvious? I think it is only obvious that many things are
>> not quite right. The quick fix of increasing BKVASIZE to the size of
>> the largest buffer used should still work to prevent bkva fragmentation.
>>
> OK: The FBSD version was varying: R6.1, R6.1-CURRENT, R7-CURRENT.
>
> But we just found out, that it happens when we use "newfs -b 65536", but not with default "-b" value (whatever that might be)...

That's certainly a good way to exercise bkva fragmentation. I don't know of any other use for such large block sizes in ffs :-). Such a large block size might be best for file systems with mainly very large files, but the possible benefits are not large and might be smaller than the extra overheads for defragmentation (even if it works).

The fragmentation can also be reduced by not using different block sizes for different mounted file systems (including non-ffs ones) once one of the sizes exceeds BKVASIZE. Alternatively it can be increased by doing the reverse. I think "newfs -b 65536 -f 8192" gives the bad mixture with different (ffs)block and (ffs)frag sizes. "newfs -b 65536 -f 65536" usually gives very bad performance because its frag size is too large.
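To make the fragmentation argument concrete, here is a toy first-fit model of an address range shared by buffers of different sizes. It is only a sketch: the 16-slot arena, the slot granularity, and the function names are assumptions for illustration, and the real buffer KVA management in the kernel is more involved.

```python
def first_fit(free_map, size_units):
    """Return the index of the first run of `size_units` free slots, or -1."""
    run = 0
    for i, is_free in enumerate(free_map):
        run = run + 1 if is_free else 0
        if run == size_units:
            return i - size_units + 1
    return -1


def allocate(free_map, size_units):
    """Mark the first fitting run as used and return its start, or -1."""
    start = first_fit(free_map, size_units)
    if start < 0:
        return -1
    for i in range(start, start + size_units):
        free_map[i] = False
    return start


def release(free_map, start, size_units):
    """Mark a previously allocated run as free again."""
    for i in range(start, start + size_units):
        free_map[i] = True


def demo_mixed_sizes():
    """Fill the arena with 1-slot buffers, free every other one, then ask for 4 slots."""
    arena = [True] * 16
    small = [allocate(arena, 1) for _ in range(16)]
    for start in small[::2]:
        release(arena, start, 1)
    free_slots = sum(arena)          # half the arena is free again
    big = allocate(arena, 4)         # but no 4 contiguous slots exist
    return free_slots, big


def demo_uniform_sizes():
    """With one buffer size only, any freed buffer can be reused as-is."""
    arena = [True] * 16
    blocks = [allocate(arena, 4) for _ in range(4)]
    release(arena, blocks[1], 4)
    return allocate(arena, 4)
```

In the mixed case half of the arena is free, yet the large request cannot be satisfied because the free space is scattered. In the uniform case a freed buffer always fits the next request, which is the intuition behind using a single block size or raising BKVASIZE to the largest size in use.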
Bruce From owner-freebsd-fs@FreeBSD.ORG Sun Oct 8 22:38:09 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4F8A716A40F for ; Sun, 8 Oct 2006 22:38:09 +0000 (UTC) (envelope-from info@fluffles.net) Received: from auriate.fluffles.net (a83-68-3-169.adsl.cistron.nl [83.68.3.169]) by mx1.FreeBSD.org (Postfix) with ESMTP id CBE6B43DA0 for ; Sun, 8 Oct 2006 22:38:01 +0000 (GMT) (envelope-from info@fluffles.net) Received: from destiny ([10.0.0.21]) by auriate.fluffles.net with esmtpa (Exim 4.63 (FreeBSD)) (envelope-from ) id 1GWhHQ-0009vG-Nw; Mon, 09 Oct 2006 00:37:56 +0200 Message-ID: <45297DA2.4000509@fluffles.net> Date: Mon, 09 Oct 2006 00:37:22 +0200 From: "Fluffles.net" User-Agent: Thunderbird 1.5.0.7 (X11/20060917) MIME-Version: 1.0 To: freebsd-fs@FreeBSD.org X-Enigmail-Version: 0.94.0.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 08 Oct 2006 22:38:09 -0000 Hi Bruce, I'm the "veronica" Arne mentioned in the freebsd-fs mailinglist. 
Regarding the effectiveness of a higher blocksize, these are my findings:

areca RAID5 (8x da, 128KB stripe, default newfs, NCQ enabled)
               ---------Sequential Output--------- ----Sequential Input--- ---Random--
               -Per Char-- ---Block--- --Rewrite-- -Per Char-- ---Block--- ---Seeks---
Machine     MB  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
ARC8xR5   8480 119973 91.3 247178 58.6  67862 17.5  90426 86.9 172490 24.0  120.7  0.5

areca RAID5 (8x da, 128KB stripe, 64KB blocksize newfs, NCQ enabled)
               ---------Sequential Output--------- ----Sequential Input--- ---Random--
               -Per Char-- ---Block--- --Rewrite-- -Per Char-- ---Block--- ---Seeks---
Machine     MB  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
ARC8xR5   8480 128920 97.8 265920 58.9 116787 31.0 103261 97.8 392970 53.8  119.8  0.6

As you can see, the block read speed increased from ~172MB/s to ~392MB/s, quite a significant increase. The rewrite speed also increased, from ~67MB/s to ~116MB/s.

Of course these tests were run on a brand-new, clean filesystem, which might not tally with real-life crowded filesystems. But at least there is much potential in a higher blocksize, and it would be a shame for it to crash FreeBSD. There are quite a few people who store big files on big RAID arrays; they could profit from a non-crashing FreeBSD with a bigger blocksize.

Besides, a crashing VFS/Geom isn't all that sexy.
;-) - Veronica From owner-freebsd-fs@FreeBSD.ORG Mon Oct 9 19:38:28 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1689716A403 for ; Mon, 9 Oct 2006 19:38:28 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4AF1443D45 for ; Mon, 9 Oct 2006 19:38:27 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout2.pacific.net.au (Postfix) with ESMTP id 325996E125; Tue, 10 Oct 2006 05:38:25 +1000 (EST) Received: from epsplex.bde.org (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge3) with ESMTP id k99JcKSo014207; Tue, 10 Oct 2006 05:38:21 +1000 Date: Tue, 10 Oct 2006 05:37:33 +1000 (EST) From: Bruce Evans X-X-Sender: bde@epsplex.bde.org To: "Fluffles.net" In-Reply-To: <45297DA2.4000509@fluffles.net> Message-ID: <20061010051216.G814@epsplex.bde.org> References: <45297DA2.4000509@fluffles.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@FreeBSD.org, Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 09 Oct 2006 19:38:28 -0000 On Mon, 9 Oct 2006, Fluffles.net wrote: > I'm the "veronica" Arne mentioned in the freebsd-fs mailinglist. 
> Regarding the effectiveness of a higher blocksize, these are my findings:
>
> areca RAID5 (8x da, 128KB stripe, default newfs, NCQ enabled)
>                ---------Sequential Output--------- ----Sequential Input--- ---Random--
>                -Per Char-- ---Block--- --Rewrite-- -Per Char-- ---Block--- ---Seeks---
> Machine     MB  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
> ARC8xR5   8480 119973 91.3 247178 58.6  67862 17.5  90426 86.9 172490 24.0  120.7  0.5
>
> areca RAID5 (8x da, 128KB stripe, 64KB blocksize newfs, NCQ enabled)
>                ---------Sequential Output--------- ----Sequential Input--- ---Random--
>                -Per Char-- ---Block--- --Rewrite-- -Per Char-- ---Block--- ---Seeks---
> Machine     MB  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
> ARC8xR5   8480 128920 97.8 265920 58.9 116787 31.0 103261 97.8 392970 53.8  119.8  0.6
>
> As you can see, the block read increased from ~172MB/s to ~392MB/s,
> quite significant increase. Also the rewrite speed increased from
> ~67MB/s to ~116MB/s.
>
> Of course these tests are on a brand clean filesystem, which might not
> tally with real-life crowded filesystems. But at least there is much
> ...

This is a bit surprising. FreeBSD is supposed to cluster the i/o so that (especially for large files on new file systems) almost all i/o is done in blocks of size 64K or 128K.

I suspect the problems are that the 64K-block i/o is usually perfectly misaligned unless the fs itself has 64K-blocks and the fs's partition starts on a 64K-block boundary, and that some hardware or firmware (mainly RAIDs) want the blocks to be aligned. I'm not very familiar with RAIDs but think it would take a fairly advanced/expensive one to reblock all the i/o so that the alignment doesn't matter. It would take more advanced/complicated clustering code or better buffering code than FreeBSD has to do the reblocking at the clustering or buffering level. Perhaps even 64K-blocks are too small with your RAID's stripe size of 128K.
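The alignment hypothesis can be checked with simple arithmetic. The sketch below is a minimal model under stated assumptions: 512-byte sectors, back-to-back transfers of a fixed size, a partition starting at sector 63 (a conventional MBR offset that the thread does not state), and the quoted 128KB stripe treated as the per-disk stripe unit. The function name `crossing_fraction` is made up for the example.

```python
SECTOR_SIZE = 512  # bytes per sector (assumed)


def crossing_fraction(start_sector, io_size, stripe_unit, count=1024):
    """Fraction of `count` back-to-back transfers that span two stripe units.

    start_sector: physical sector where the first transfer begins
    io_size:      size of each transfer in bytes
    stripe_unit:  size of one stripe unit in bytes
    """
    crossings = 0
    for i in range(count):
        first = start_sector * SECTOR_SIZE + i * io_size
        last = first + io_size - 1
        if first // stripe_unit != last // stripe_unit:
            crossings += 1
    return crossings / count
```

For example, `crossing_fraction(0, 65536, 131072)` is 0.0 for an aligned start, while `crossing_fraction(63, 65536, 131072)` is 0.5: with a 63-sector offset every second 64K transfer straddles a 128K boundary. With a 64K stripe unit and the same offset, every transfer straddles one.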
Bruce From owner-freebsd-fs@FreeBSD.ORG Mon Oct 9 21:20:46 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 271C316A412 for ; Mon, 9 Oct 2006 21:20:46 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.FreeBSD.org (Postfix) with ESMTP id 36CEA43D77 for ; Mon, 9 Oct 2006 21:20:39 +0000 (GMT) (envelope-from scottl@samsco.org) Received: from [10.10.3.185] ([165.236.175.187]) (authenticated bits=0) by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k99KlZ1s039109; Mon, 9 Oct 2006 14:47:42 -0600 (MDT) (envelope-from scottl@samsco.org) Message-ID: <452AB55D.9090607@samsco.org> Date: Mon, 09 Oct 2006 14:47:25 -0600 From: Scott Long User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20060206 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Bruce Evans References: <45297DA2.4000509@fluffles.net> <20061010051216.G814@epsplex.bde.org> In-Reply-To: <20061010051216.G814@epsplex.bde.org> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=3.8 tests=none autolearn=failed version=3.1.1 X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org Cc: freebsd-fs@freebsd.org, "Fluffles.net" , Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 09 Oct 2006 21:20:46 -0000 Bruce Evans wrote: > On Mon, 9 Oct 2006, Fluffles.net wrote: > >> I'm the "veronica" Arne mentioned in the freebsd-fs mailinglist. 
>> Regarding the effectiveness of a higher blocksize, these are my findings:
>>
>> areca RAID5 (8x da, 128KB stripe, default newfs, NCQ enabled)
>>                -------Sequential Output--------    ---Sequential Input--   --Random--
>>                -Per Char-  --Block---  -Rewrite--  -Per Char-  --Block---  --Seeks---
>> Machine    MB  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
>> ARC8xR5  8480 119973 91.3 247178 58.6  67862 17.5  90426 86.9 172490 24.0  120.7  0.5
>>
>> areca RAID5 (8x da, 128KB stripe, 64KB blocksize newfs, NCQ enabled)
>>                -------Sequential Output--------    ---Sequential Input--   --Random--
>>                -Per Char-  --Block---  -Rewrite--  -Per Char-  --Block---  --Seeks---
>> Machine    MB  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
>> ARC8xR5  8480 128920 97.8 265920 58.9 116787 31.0 103261 97.8 392970 53.8  119.8  0.6
>>
>> As you can see, the block read increased from ~172MB/s to ~392MB/s,
>> quite a significant increase. Also the rewrite speed increased from
>> ~67MB/s to ~116MB/s.
>>
>> Of course these tests are on a brand clean filesystem, which might not
>> tally with real-life crowded filesystems. But at least there is much
>> ...
>
>
> This is a bit surprising.  FreeBSD is supposed to cluster the i/o so
> that (especially for large files on new file systems) almost all i/o
> is done in blocks of size 64K or 128K.
>
> I suspect the problems are that the 64K-block i/o is usually perfectly
> misaligned unless the fs itself has 64K-blocks and the fs's partition
> starts on a 64K-block boundary, and that some hardware or firmware
> (mainly RAIDs) want the blocks to be aligned.  I'm not very familiar
> with RAIDs but think it would take a fairly advanced/expensive one to
> reblock all the i/o so that the alignment doesn't matter.  It would
> take more advanced/complicated clustering code or better buffering code
> than FreeBSD has to do the reblocking at the clustering or buffering
> level.
> Perhaps even 64K-blocks are too small with your RAID's stripe
> size of 128K.
>
> Bruce

Yes, it's a well-known problem that the combination of
fdisk+disklabel+ufs means that all FS blocks are mis-aligned in the
worst way possible (blocks start on odd sector numbers).  This
_horribly_ pessimizes RAID-5 on most controllers.

Solving it reliably and automatically is hard, though.  The filesystem
ultimately needs to know the physical sector that it starts on, and
compensate accordingly.  You could cheat by having the disklabel tools
always align partitions, but the tool would still need to know the
physical address of where it starts in the slice.  Either way,
something high up needs to get the logical to physical translation of
the sectors.

Suggestions have been made to just put blind offsets into the disklabel
tool that assume the common case (MBR is present and is a known length,
and the disklabel is in the first slice of the MBR).  Obviously, this
is only a crude hack.  I get around this right now by not using a
disklabel or fdisk table on arrays where I value speed.  For those, I
just put a filesystem directly on the array, and boot off of a small
system disk.
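
[A rough sketch, not part of the original message, of the arithmetic a
labeling tool would need in order to "always align partitions". The
slice start of 63 sectors and the partition offset of 16 sectors are
assumed example values, sectors are assumed to be 512 bytes, and
padding_to_align is a hypothetical helper, not an existing tool or API.]

```python
SECTOR = 512  # bytes per sector (assumed)


def padding_to_align(slice_start, part_offset, boundary_bytes):
    """Return how many sectors must be added to `part_offset` so that the
    partition's absolute start (slice_start + part_offset) lands on a
    multiple of `boundary_bytes`."""
    sectors_per_boundary = boundary_bytes // SECTOR
    absolute_start = slice_start + part_offset
    return (-absolute_start) % sectors_per_boundary


# Example: slice at sector 63, partition 16 sectors into the slice,
# target alignment 64K (128 sectors).
pad = padding_to_align(63, 16, 64 * 1024)
print(pad, 63 + 16 + pad)
```

[The point of the sketch is that the result depends on slice_start,
i.e. on the physical position of the slice, which is exactly the piece
of information the tool does not have without some logical to physical
translation.]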
Scott From owner-freebsd-fs@FreeBSD.ORG Mon Oct 9 21:50:20 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 773F516A417 for ; Mon, 9 Oct 2006 21:50:20 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost2.sentex.ca (smarthost2.sentex.ca [205.211.164.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0C46043D55 for ; Mon, 9 Oct 2006 21:50:19 +0000 (GMT) (envelope-from mike@sentex.net) Received: from BLUELAPIS.sentex.ca (cage.simianscience.com [64.7.134.1]) by smarthost2.sentex.ca (8.13.8/8.13.8) with SMTP id k99LoIiu065164; Mon, 9 Oct 2006 17:50:19 -0400 (EDT) (envelope-from mike@sentex.net) From: Mike Tancsa To: Scott Long Date: Mon, 09 Oct 2006 17:50:30 -0400 Message-ID: References: <45297DA2.4000509@fluffles.net> <20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org> In-Reply-To: <452AB55D.9090607@samsco.org> X-Mailer: Forte Agent 1.93/32.576 English (American) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Cc: freebsd-fs@freebsd.org Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 09 Oct 2006 21:50:20 -0000 On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you wrote: >this is only a crude hack. I get around this right now by not using a >disklabel or fdisk table on arrays where I value speed. For those, I >just put a filesystem directly on the array, and boot off of a small >system disk. Hi Scott, How is that done ? just newfs -O2 -U /dev/da0 ? 
---Mike -------------------------------------------------------- Mike Tancsa, Sentex communications http://www.sentex.net Providing Internet Access since 1994 mike@sentex.net, (http://www.tancsa.com) From owner-freebsd-fs@FreeBSD.ORG Mon Oct 9 21:54:13 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 622D916A416 for ; Mon, 9 Oct 2006 21:54:13 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7C33E43D46 for ; Mon, 9 Oct 2006 21:54:10 +0000 (GMT) (envelope-from scottl@samsco.org) Received: from [10.10.3.185] ([165.236.175.187]) (authenticated bits=0) by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id k99LrvaD040223; Mon, 9 Oct 2006 15:54:04 -0600 (MDT) (envelope-from scottl@samsco.org) Message-ID: <452AC4EB.8000006@samsco.org> Date: Mon, 09 Oct 2006 15:53:47 -0600 From: Scott Long User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20060206 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Mike Tancsa References: <45297DA2.4000509@fluffles.net> <20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org> In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=3.8 tests=none autolearn=failed version=3.1.1 X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org Cc: freebsd-fs@freebsd.org Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 09 Oct 2006 21:54:13 -0000 Mike Tancsa wrote: > On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you > wrote: > > >>this is only 
a crude hack. I get around this right now by not using a >>disklabel or fdisk table on arrays where I value speed. For those, I >>just put a filesystem directly on the array, and boot off of a small >>system disk. > > > > Hi Scott, > How is that done ? just newfs -O2 -U /dev/da0 ? > > ---Mike Yup. Scott From owner-freebsd-fs@FreeBSD.ORG Mon Oct 9 23:13:55 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C3E9516A407 for ; Mon, 9 Oct 2006 23:13:55 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210]) by mx1.FreeBSD.org (Postfix) with ESMTP id 76B2343D45 for ; Mon, 9 Oct 2006 23:13:45 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout1.pacific.net.au (Postfix) with ESMTP id 443F3328117; Tue, 10 Oct 2006 09:13:41 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge3) with ESMTP id k99NDb9u008107; Tue, 10 Oct 2006 09:13:38 +1000 Date: Tue, 10 Oct 2006 09:13:36 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Scott Long In-Reply-To: <452AB55D.9090607@samsco.org> Message-ID: <20061010081212.I35683@delplex.bde.org> References: <45297DA2.4000509@fluffles.net> <20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@FreeBSD.org, "Fluffles.net" , Kris Kennaway Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 09 Oct 2006 23:13:55 -0000 On Mon, 9 Oct 2006, Scott Long 
wrote: > Bruce Evans wrote: >> ... >> I suspect the problems are that the 64K-block i/o is usually perfectly >> misaligned unless the fs itself has 64K-blocks and the fs's partition >> starts on a 64K-block boundary, and that some hardware or firmware >> (mainly RAIDs) want the blocks to be aligned. I'm not very familiar >> ... > > Yes, it's a well-known problem that the combination of fdisk+disklabel+ufs > means that all FS blocks are mis-aligned in the worst way possible (blocks > start on odd sector numbers). This > _horribly_ pessimizes RAID-5 on most controllers. Apparently the internal fs block alignment/size problem is not so well known. I knew about the external one but didn't connect it with fs block sizes at first. How horribly do aligned 16K-blocks pessimize RAID-5? Does it help much to have misaligned 64K-blocks instead of misaligned 16K-blocks? > Solving it reliably > and automatically is hard, though. The filesystem ultimately needs to > know the physical sector that it starts on, and compensate accordingly. > You could cheat by having the disklabel tools always align partitions, > but the tool would still need to know the physical address of where it > starts in the slice. Either way, something high up needs to get the > logical to physical translation of the sectors. The filesystem shouldn't need to know more than that its starting sector is not physically misaligned. The clustering code could use knowledge of physical offsets and alignment requirements to fix up some cases. My version of newfs_msdosfs(8) uses the (unimplemented) ioctl DIOCMEDIAOFFSET to ask the system for the physical offset. Using this is much easier than parsing XML. > Suggestions have been made to just put blind offsets into the disklabel > tool that assumes the common case (mbr is present and is a known length, > and that the disklabel is in the first slice of the MBR). Obviously, > this is only a crude hack. 
I get around this right now by not using a > disklabel or fdisk table on arrays where I value speed. For those, I > just put a filesystem directly on the array, and boot off of a small > system disk. I normally align FreeBSD slices and partitions manually to a "cylinder" boundary, and this sometimes gives alignment to a large power of 2 accidentally. I normally use a fake cylinder size of 16065 (255 fake heads and 63 sectors per fake track). This is just as bad for cylinder alignment as 63 is for track alignment, but new systems only need it for compatibility with other systems. Bruce From owner-freebsd-fs@FreeBSD.ORG Mon Oct 9 23:52:40 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 18BDE16A40F for ; Mon, 9 Oct 2006 23:52:40 +0000 (UTC) (envelope-from etc@fluffles.net) Received: from auriate.fluffles.net (a83-68-3-169.adsl.cistron.nl [83.68.3.169]) by mx1.FreeBSD.org (Postfix) with ESMTP id AEF1743D73 for ; Mon, 9 Oct 2006 23:52:39 +0000 (GMT) (envelope-from etc@fluffles.net) Received: from destiny ([10.0.0.21]) by auriate.fluffles.net with esmtpa (Exim 4.63 (FreeBSD)) (envelope-from ) id 1GX4vF-000GrP-LR; Tue, 10 Oct 2006 01:52:37 +0200 Message-ID: <452AE0A5.3010503@fluffles.net> Date: Tue, 10 Oct 2006 01:52:05 +0200 From: Fluffles User-Agent: Thunderbird 1.5.0.7 (X11/20060917) MIME-Version: 1.0 To: freebsd-fs@FreeBSD.org Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Subject: Re: 2 bonnies can stop disk activity permanently X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 09 Oct 2006 23:52:40 -0000 Bruce Evans wrote: >I suspect the problems are that the 64K-block i/o is usually perfectly >misaligned unless the fs itself has 
64K-blocks and the fs's partition >starts on a 64K-block boundary, and that some hardware or firmware >(mainly RAIDs) want the blocks to be aligned. But i have done these tests on /dev/da0, thus without any labeling! This means there is no offset such as 16 sectors caused by disk labeling, which then spoils my stripe-block. So i would assume there is no alignment problem, is there? I would assume that if i do newfs directly on /dev/da0, that the 64KB blocksize starts at offset 0, which implies no alignment problems exist. If all this is true, another reason for the huge performance increase must be sought. In all my tests using 64KB blocksize instead of the default 16KB yielded better results; also with software RAID like gstripe. And i never use labeling. Actually i would have liked the blocksize limit to be higher, and try out if 128KB or even higher would continue to yield better results. - Veronica From owner-freebsd-fs@FreeBSD.ORG Tue Oct 10 10:02:20 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B9A6016A403; Tue, 10 Oct 2006 10:02:20 +0000 (UTC) (envelope-from danny@cs.huji.ac.il) Received: from cs1.cs.huji.ac.il (cs1.cs.huji.ac.il [132.65.16.10]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4658A43D58; Tue, 10 Oct 2006 10:02:20 +0000 (GMT) (envelope-from danny@cs.huji.ac.il) Received: from pampa.cs.huji.ac.il ([132.65.80.32]) by cs1.cs.huji.ac.il with esmtp id 1GXERG-000BLK-2p; Tue, 10 Oct 2006 12:02:18 +0200 X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.2 To: Daichi GOTO In-reply-to: <44FD8B2B.60501@freebsd.org> References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org> <20060903170129.GA98917@xor.obsecurity.org> <20060903172033.GA99212@xor.obsecurity.org> <20060904184717.GA41475@xor.obsecurity.org> <44FD8B2B.60501@freebsd.org> Comments: In-reply-to Daichi GOTO message dated "Tue, 05 
Sep 2006 23:35:23 +0900." Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Tue, 10 Oct 2006 12:02:17 +0200 From: Danny Braniss Message-ID: Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org, Kris Kennaway Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 10 Oct 2006 10:02:20 -0000 [...] > Yeah, we have a new patchset to solve above problem I think. any chance that the new unionfs will make it to 6.2? I'm using it, and it's working just fine - as opposed to the unusable one supplied. If not, Daichi GOTO, will you have a new set of patches? union_vfsops.c just changed, for example. thanks, danny From owner-freebsd-fs@FreeBSD.ORG Tue Oct 10 10:25:19 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id BEC0C16A407; Tue, 10 Oct 2006 10:25:19 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.232.58]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5A36D43D45; Tue, 10 Oct 2006 10:25:19 +0000 (GMT) (envelope-from daichi@freebsd.org) Received: from [192.168.1.101] (dullmdaler.ongs.co.jp [202.216.232.62]) by natial.ongs.co.jp (Postfix) with ESMTP id 6EF00244C29; Tue, 10 Oct 2006 19:25:17 +0900 (JST) Message-ID: <452B750D.2020104@freebsd.org> Date: Tue, 10 Oct 2006 19:25:17 +0900 From: Daichi GOTO User-Agent: Thunderbird 1.5.0.7 (X11/20060915) MIME-Version: 1.0 To: Danny Braniss References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org> <20060903170129.GA98917@xor.obsecurity.org> <20060903172033.GA99212@xor.obsecurity.org> <20060904184717.GA41475@xor.obsecurity.org> <44FD8B2B.60501@freebsd.org> In-Reply-To: 
Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org, Kris Kennaway Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 10 Oct 2006 10:25:19 -0000 Danny Braniss wrote: > [...] >> Yeah, we have a new patchset to solve above problem I think. > > any chance that the new unionfs will make it to 6.2? We cannot merger unionfs patch to 6.x branch. It'll just only for -current. For 6.x patchset is just a patchset. > I'm using it, and it's working just fine - as opposed to the unusable > one supplied. For under some heavy situation with mount_nullfs, it has a problem since the lock mechanism. To solve that problem, we need a new API(function) for VFS. We are discussing about it and need vfs-hackers help. Sorry for my slow response :( > If not, Daichi GOTO, will you have a new set of patches? > union_vfsops.c just changed, for example. > thanks, > danny uhmm... you need a new patchset if it is under construction? 
-- Daichi GOTO, http://people.freebsd.org/~daichi From owner-freebsd-fs@FreeBSD.ORG Tue Oct 10 14:06:00 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A34E716A4E6; Tue, 10 Oct 2006 14:06:00 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.232.58]) by mx1.FreeBSD.org (Postfix) with ESMTP id AD92E43D58; Tue, 10 Oct 2006 14:05:59 +0000 (GMT) (envelope-from daichi@freebsd.org) Received: from [192.168.1.101] (dullmdaler.ongs.co.jp [202.216.232.62]) by natial.ongs.co.jp (Postfix) with ESMTP id 8D497244C2C; Tue, 10 Oct 2006 23:05:57 +0900 (JST) Message-ID: <452BA8C4.7040906@freebsd.org> Date: Tue, 10 Oct 2006 23:05:56 +0900 From: Daichi GOTO User-Agent: Thunderbird 1.5.0.7 (X11/20060915) MIME-Version: 1.0 To: freebsd-hackers@freebsd.org, freebsd-current@freebsd.org, freebsd-fs@freebsd.org, rodrigc@crodrigues.org Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Cc: daichi@freebsd.org Subject: [REQUEST] unionfs needs some guys can do implements new 2 APIs for VFS X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 10 Oct 2006 14:06:00 -0000 Hi Guys! Now we need a man or a guy who can do implements new 2 APIs for VFS. Someone please help us!! http://people.freebsd.org/~daichi/unionfs/request-new-api-for-vfs.html ---- The FreeBSD new unionfs implementation: New API request for FreeBSD VFS ======================================================================= Daichi GOTO (daichi@freebsd.org) 1 Introduction We have always tried to keep changes just in unionfs segment only. But by accomplish nothing, we need change the other segment. 
2 Problem Description

  Until now we have made many improvements to unionfs, but
  now we feel the limitation around the handling of unionfs's
  "copied-up file". Considering future support for MAC extensions,
  ADVLOCK lock information and the like, there is
  all the more reason to be careful.

3 Impact

  This leads to confusion in the unionfs implementation and to some
  problems around the lock mechanism. We cannot solve those problems
  by changes in the unionfs segment alone.

4 Solution Request

  We need 2 new APIs (functions) for VFS. Could some developer please
  implement new APIs like the following:

    int
    VOP_GETALLATTR(struct vnode *vp, struct vnode_xxx *data,
        struct thread *td)
    {
            /* copy all attributes from vp into data */
            ...;
    }

    int
    VOP_SETALLATTR(struct vnode *vp, struct vnode_xxx *data,
        struct thread *td)
    {
            /* apply all attributes in data to vp */
            ...;
    }

  The above functions get/set all vnode information (currently attr,
  extattr and ADVLOCK) together if the vnode's type is VREG.

  We cannot implement them ourselves because we lack the necessary
  knowledge of VFS arcana. Please raise your hands and do it, please.

5 References

  http://people.freebsd.org/~daichi/unionfs/
  http://people.freebsd.org/~daichi/unionfs/index-ja.html
  http://people.freebsd.org/~daichi/unionfs/reason-for-sys-uio-file.html

----

We need your help. Please help us.
-- Daichi GOTO, http://people.freebsd.org/~daichi From owner-freebsd-fs@FreeBSD.ORG Tue Oct 10 14:11:12 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id BBDD116A40F; Tue, 10 Oct 2006 14:11:12 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.232.58]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9C2DA43D46; Tue, 10 Oct 2006 14:11:11 +0000 (GMT) (envelope-from daichi@freebsd.org) Received: from [192.168.1.101] (dullmdaler.ongs.co.jp [202.216.232.62]) by natial.ongs.co.jp (Postfix) with ESMTP id 7E775244C2C; Tue, 10 Oct 2006 23:11:08 +0900 (JST) Message-ID: <452BA9FB.3080401@freebsd.org> Date: Tue, 10 Oct 2006 23:11:07 +0900 From: Daichi GOTO User-Agent: Thunderbird 1.5.0.7 (X11/20060915) MIME-Version: 1.0 To: Daichi GOTO References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org> <20060903170129.GA98917@xor.obsecurity.org> <20060903172033.GA99212@xor.obsecurity.org> <20060904184717.GA41475@xor.obsecurity.org> <44FD8B2B.60501@freebsd.org> <452B750D.2020104@freebsd.org> In-Reply-To: <452B750D.2020104@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Danny Braniss , freebsd-fs@freebsd.org, freebsd-current@freebsd.org, Kris Kennaway Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 10 Oct 2006 14:11:12 -0000 Daichi GOTO wrote: > Danny Braniss wrote: >> [...] >>> Yeah, we have a new patchset to solve above problem I think. >> >> any chance that the new unionfs will make it to 6.2? > > We cannot merger unionfs patch to 6.x branch. It'll just only for > -current. 
For 6.x patchset is just a patchset. > >> I'm using it, and it's working just fine - as opposed to the unusable >> one supplied. > > For under some heavy situation with mount_nullfs, it has a problem since > the lock mechanism. To solve that problem, we need a new API(function) > for VFS. We are discussing about it and need vfs-hackers help. > Sorry for my slow response :( > >> If not, Daichi GOTO, will you have a new set of patches? >> union_vfsops.c just changed, for example. >> thanks, >> danny > > uhmm... you need a new patchset if it is under construction? I updated new two dosuments: http://people.freebsd.org/~daichi/unionfs/reason-for-sys-uio-file.html http://people.freebsd.org/~daichi/unionfs/request-new-api-for-vfs.html Folks who have a interest in unionfs, read it please :) -- Daichi GOTO, http://people.freebsd.org/~daichi From owner-freebsd-fs@FreeBSD.ORG Tue Oct 10 14:57:11 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4AD6116A403; Tue, 10 Oct 2006 14:57:11 +0000 (UTC) (envelope-from kris@obsecurity.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.FreeBSD.org (Postfix) with ESMTP id 36C3443D7B; Tue, 10 Oct 2006 14:57:00 +0000 (GMT) (envelope-from kris@obsecurity.org) Received: from obsecurity.dyndns.org (elvis.mu.org [192.203.228.196]) by elvis.mu.org (Postfix) with ESMTP id 1B55E1A3C19; Tue, 10 Oct 2006 07:57:00 -0700 (PDT) Received: by obsecurity.dyndns.org (Postfix, from userid 1000) id 9287251398; Tue, 10 Oct 2006 10:56:59 -0400 (EDT) Date: Tue, 10 Oct 2006 10:56:59 -0400 From: Kris Kennaway To: Danny Braniss Message-ID: <20061010145659.GA76958@xor.obsecurity.org> References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org> <20060903170129.GA98917@xor.obsecurity.org> <20060903172033.GA99212@xor.obsecurity.org> <20060904184717.GA41475@xor.obsecurity.org> 
<44FD8B2B.60501@freebsd.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="gKMricLos+KVdGMg" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.2i Cc: freebsd-fs@freebsd.org, Daichi GOTO , freebsd-current@freebsd.org, Kris Kennaway Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 10 Oct 2006 14:57:11 -0000 --gKMricLos+KVdGMg Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Oct 10, 2006 at 12:02:17PM +0200, Danny Braniss wrote: > [...] > > Yeah, we have a new patchset to solve above problem I think. >=20 > any chance that the new unionfs will make it to 6.2? None, unfortunately - it's not even in 7.0 yet. Kris > I'm using it, and it's working just fine - as opposed to the unusable > one supplied. >=20 > If not, Daichi GOTO, will you have a new set of patches?=20 > union_vfsops.c just changed, for example. 
> thanks, > danny >=20 >=20 >=20 >=20 --gKMricLos+KVdGMg Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFFK7S7Wry0BWjoQKURAuVmAJ0WR7w8ti+QU60g+j8XlYF/58Q0vACg2W5B VIfhvbhHGR2LPWUWs66mUfs= =OMU4 -----END PGP SIGNATURE----- --gKMricLos+KVdGMg-- From owner-freebsd-fs@FreeBSD.ORG Tue Oct 10 18:09:36 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D258C16A403; Tue, 10 Oct 2006 18:09:36 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7786943D79; Tue, 10 Oct 2006 18:09:36 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id E160A46B08; Tue, 10 Oct 2006 14:09:35 -0400 (EDT) Date: Tue, 10 Oct 2006 19:09:36 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Daichi GOTO In-Reply-To: <452BA8C4.7040906@freebsd.org> Message-ID: <20061010190815.L92182@fledge.watson.org> References: <452BA8C4.7040906@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-current@freebsd.org Subject: Re: [REQUEST] unionfs needs some guys can do implements new 2 APIs for VFS X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 10 Oct 2006 18:09:36 -0000 On Tue, 10 Oct 2006, Daichi GOTO wrote: > 1 Introduction > > We have always tried to keep changes just in unionfs segment > only. But by accomplish nothing, we need change the other segment. 
>
> 2 Problem Description
>
>   Until now we have made many improvements to unionfs, but
>   now we feel the limitation around the process of unionfs's
>   "copied-up file". Additionally, thinking of future support for
>   the MAC extension, ADVLOCK lock information and things like those
>   is all the more reason to be careful.
>
> 3 Impact
>
>   It leads to confusion in the unionfs implementation and some
>   problems around the lock mechanism. We cannot solve those problems
>   with changes in the unionfs segment alone.

So, just to be clear that I understand things: the basic problem here is
that when unionfs copies a file up a layer in the stack due to local
modifications in the upper layer, you are not able to properly preserve
the full set of file attributes, so you are looking for a way to do this?

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-fs@FreeBSD.ORG Wed Oct 11 02:42:32 2006
Message-ID: <004d01c6ecde$db9ca990$3301a8c0@janmxp>
From: "Jan Mikkelsen"
To: "Daichi GOTO", "Danny Braniss"
References: <44B67340.1080405@freebsd.org> <44B74036.6060101@freebsd.org>
 <20060903170129.GA98917@xor.obsecurity.org>
 <20060903172033.GA99212@xor.obsecurity.org>
 <20060904184717.GA41475@xor.obsecurity.org>
 <44FD8B2B.60501@freebsd.org> <452B750D.2020104@freebsd.org>
Date: Wed, 11 Oct 2006 12:42:13 +1000
Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org, Kris Kennaway
Subject: Re: [ANN] unionfs patchset-16 release, it is ready for the merge

Daichi GOTO wrote:
> Danny Braniss wrote:
>> [...]
>>> Yeah, we have a new patchset to solve above problem I think.
>>
>> any chance that the new unionfs will make it to 6.2?
>
> We cannot merge the unionfs patch into the 6.x branch. It will only be
> for -current. For 6.x the patchset is just a patchset.

Getting it to 6-STABLE at some point would be very nice; what is
currently there is unusable. I have been using your patch successfully
and I certainly don't see any regressions. The man pages make it clear
that the subsystem will be subject to change.

>> I'm using it, and it's working just fine - as opposed to the unusable
>> one supplied.
>
> Under some heavy load with mount_nullfs, it has a problem caused by
> the lock mechanism. To solve that problem, we need a new API (function)
> for VFS. We are discussing it and need the vfs hackers' help.
> Sorry for my slow response :(

Even so, your patch works better than what is there.

>> If not, Daichi GOTO, will you have a new set of patches? union_vfsops.c
>> just changed, for example.
>> thanks,
>> danny
>
> uhmm... you need a new patchset if it is under construction?

The patch at http://people.freebsd.org/~daichi/unionfs/unionfs6-p16.diff
no longer applies cleanly to 6-STABLE. Where you have replaced complete
files, it might be worth just providing the new file.

Thank you for your work on this; I find it very useful.

Regards,

Jan Mikkelsen

From owner-freebsd-fs@FreeBSD.ORG Wed Oct 11 15:53:03 2006
From: Mike Tancsa
To: Scott Long
Date: Wed, 11 Oct 2006 11:53:04 -0400
References: <45297DA2.4000509@fluffles.net>
 <20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org>
 <452AC4EB.8000006@samsco.org>
In-Reply-To: <452AC4EB.8000006@samsco.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently

On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
wrote:

>Mike Tancsa wrote:
>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>> wrote:
>>
>>>this is only a crude hack.  I get around this right now by not using a
>>>disklabel or fdisk table on arrays where I value speed.  For those, I
>>>just put a filesystem directly on the array, and boot off of a small
>>>system disk.
>>
>> How is that done ? just newfs -O2 -U /dev/da0 ?
>
>Yup.

Hi,
	Is this going to work in most/all cases ?  In other words, how
do I make sure the file system I lay down is indeed properly /
optimally aligned with the underlying structure ?

	---Mike
--------------------------------------------------------
Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
mike@sentex.net, (http://www.tancsa.com)

From owner-freebsd-fs@FreeBSD.ORG Wed Oct 11 16:59:20 2006
Message-ID: <452D21F6.20601@samsco.org>
Date: Wed, 11 Oct 2006 10:55:18 -0600
From: Scott Long
To: Mike Tancsa
References: <45297DA2.4000509@fluffles.net>
 <20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org>
 <452AC4EB.8000006@samsco.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently

Mike Tancsa wrote:
> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
> wrote:
>
>>Mike Tancsa wrote:
>>
>>>On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>>wrote:
>>>
>>>>this is only a crude hack.  I get around this right now by not using a
>>>>disklabel or fdisk table on arrays where I value speed.  For those, I
>>>>just put a filesystem directly on the array, and boot off of a small
>>>>system disk.
>>>
>>> How is that done ? just newfs -O2 -U /dev/da0 ?
>>
>>Yup.
>
> Hi,
> 	Is this going to work in most/all cases ?  In other words, how
> do I make sure the file system I lay down is indeed properly /
> optimally aligned with the underlying structure ?
>
> 	---Mike

UFS1 skips the first 8k of its space to allow for
bootstrapping/partitioning data.  UFS2 skips the first 64k.
Blocks are then aligned to that skip.  64K is a good alignment
for most RAID cases.  But understanding exactly how RAID-5 works
will help you make appropriate choices.

(Note that in the following write-up I'm actually describing RAID-4.
The only difference between RAID-4 and RAID-5 is that in RAID-5 the
parity data is spread out across all of the disks instead of being
kept all on a single disk.
However, this is just a performance detail, and it's easier to describe
how things work if you ignore it.)

As you might know, RAID-4/5 takes N disks and writes data to N-1 of
them while computing and writing a parity calculation to the Nth
disk.  That parity calculation is a logical XOR of the data disks.
One of the neat properties of XOR is that it's a reversible algorithm;
you can take the final answer and re-run the XOR using all but one of
the original components and get an answer that represents the data of
the missing component.

The array is divided into 'stripes', each stripe containing an equal
subsection of each data disk plus the parity disk.  When we talk about
'stripe size', what we are referring to is the size of one of those
subsections.  A 64K stripe size means that each disk is divided into
64K subsections.  The total amount of data in a stripe is then a
function of the stripe size and the number of disks in the array.  If
you have 5 disks in your array and have set a stripe size of 64K, each
stripe will hold a total of 256K of data (4 data disks and 1 parity
disk).

Every time you write to a RAID-5 array, parity needs to be updated.
As everything operates in terms of the stripes, the most
straightforward way to do this is to read all of the data from the
stripe, replace the portion that is being written, recompute the
parity, and then write out the updates.  This is also the slowest way
to do it.

An easy optimization is to buffer the writes and look for situations
where all of the data in a stripe is being written sequentially.  If
all of the data in the stripe is being replaced, there is no need to
read any of the old data.  Just collect all of the writes together,
compute the parity, and write everything out all at once.

Another optimization is to recognize when only one member of the stripe
is being updated.  For that, you read the parity, read the old data, and
then XOR out the old data and XOR in the new data.  You still have the
latency of waiting for a read, but on a busy system you reduce head
movement on all of the disks, which is a big win.

Both of these optimizations rely on the writes having a certain amount
of alignment.  If your stripe size is 64k and your writes are 64k, but
they all start at an 8k offset into the stripe, you lose.  Each 64K
write will have to touch 56k of one disk and 8k of the next disk.  But
an 8k offset can be made to work if you reduce your stripe size to 8k.
It then becomes an exercise in balancing the parameters of FS block
size and array stripe size to give you the best performance for your
needs.  The 64k offset in UFS2 gives you more room to work here, so
that's why I said at the beginning that it's a good value.  In any case,
you want to choose parameters that result in each block write covering
either a single disk or a whole stripe.

Where things really go bad for BSD is when a _63_ sector offset gets
introduced for the MBR.  Now everything is offset to an odd,
non-power-of-2 value, and there isn't anything that you can tweak in the
filesystem or array to compensate.  The best you can do is to manually
calculate a compensating offset in the disklabel for each partition.
But at that point, it often becomes easier to just ditch all of that and
put the filesystem directly on the disk.
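The XOR and alignment arithmetic described above can be sketched in a few
lines of C.  This is only an illustration under stated assumptions: the
names xor_into and chunks_touched are made up for this example and do not
come from any FreeBSD driver, and "chunk" means the per-disk subsection
that the text calls the stripe size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * XOR src into dst, byte by byte.  Hypothetical helper; the three parity
 * operations described in the text all reduce to repeated calls of it:
 *   - full-stripe write:   zero the parity, then XOR in every data chunk
 *   - single-chunk update: XOR the old data out, then XOR the new data in
 *   - rebuild:             start from the parity, XOR in every survivor
 */
void
xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/*
 * Number of per-disk chunks touched by a write of len bytes that starts
 * at byte offset from the beginning of the array.  chunk_size is the
 * per-disk subsection size (the "stripe size" in the text).
 */
uint64_t
chunks_touched(uint64_t offset, uint64_t len, uint64_t chunk_size)
{
	assert(chunk_size > 0 && len > 0);
	return ((offset + len - 1) / chunk_size - offset / chunk_size + 1);
}
```

With 64k chunks, chunks_touched(8192, 65536, 65536) is 2, which is the
56k + 8k split described above, while chunks_touched(65536, 65536, 65536)
is 1.  A 63-sector offset (63 * 512 = 32256 bytes) also gives 2 for a 64k
write, and no power-of-2 chunk size larger than a sector divides 32256
evenly.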

Scott

From owner-freebsd-fs@FreeBSD.ORG Wed Oct 11 19:41:19 2006
Message-ID: <452D48DF.5010502@centtech.com>
Date: Wed, 11 Oct 2006 14:41:19 -0500
From: Eric Anderson
To: Scott Long
References: <45297DA2.4000509@fluffles.net>
 <20061010051216.G814@epsplex.bde.org> <452AB55D.9090607@samsco.org>
 <452AC4EB.8000006@samsco.org> <452D21F6.20601@samsco.org>
In-Reply-To: <452D21F6.20601@samsco.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently

On 10/11/06 11:55, Scott Long wrote:
> [...]

Scott,

Just wanted to say thanks for such a well put explanation on this, with
all the right details.

Eric

-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG Sat Oct 14 06:07:00 2006
Date: Sat, 14 Oct 2006 16:06:53 +1000 (EST)
From: Bruce Evans
To: fs@freebsd.org
In-Reply-To: <20061006050913.Y5250@epsplex.bde.org>
Message-ID: <20061014143825.F1264@epsplex.bde.org>
References: <20061006050913.Y5250@epsplex.bde.org>
Cc: mohans@freebsd.org
Subject: Re: lost dotdot caching pessimizes nfs especially

On Fri, 6 Oct 2006, Bruce Evans wrote:

> This change:
>
> % Index: vfs_cache.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/kern/vfs_cache.c,v
> % retrieving revision 1.102
> % retrieving revision 1.103
> % diff -u -2 -r1.102 -r1.103
> % --- vfs_cache.c	13 Jun 2005 05:59:59 -0000	1.102
> % +++ vfs_cache.c	17 Jun 2005 01:05:13 -0000	1.103
> % ...
>
> is responsible for about half of the performance loss since RELENG_4
> for building kernels over nfs (/usr and sys trees on nfs).  The kernel
> build uses "../../" a lot, and the above change apparently results in
> lots of network activity for things that should be cached locally.
>
> Some times for building a RELENG_4 kernel under conditions invariant
> except for the host kernel (after "make clean; sleep 2; make depend;
> make; make clean; sleep 2; make depend" to warm up caches):
>
> kernel:
> RELENG_4              77.51 real  62.62... 
>
> The total performance loss for a local sys tree (/usr still on nfs) is much
> smaller (about 4%):
>
> RELENG_4              65.19 real  60.50 user  3.95 sys
> current.2005.06.17    67.49 real  62.13 user  4.27 sys
> RELENG_6              67.83 real  61.84 user  4.71 sys
> current               similar to RELENG_6 (lost details)
>
> The nfs performance for building of things that should be entirely
> cached locally is very dependent on network latency.  Not caching
> things very well causes lots of unnecessary network traffic for Getattr
> and Lookup.  The packets are small, so throughput is unimportant and
> latency dominates.  For building over nfs without -j, the dead time
> (real - user - sys) is almost directly proportional to the latency.
> My usual local network has fairly low latency (~100uS unloaded) and
> the ~14 seconds dead time in the above is for it.  Switching to a 1
> Gbps network with lower quality NICs gives an unloaded latency of ~160uS
> and a dead time of ~21 seconds.  Building with -j helps even for UP,
> at the cost of extra CPU, by letting some processes advance using cached
> stuff while others are waiting for the network.  Building with -j helps
> even more on FreeBSD cluster machines, more because they have a much
> higher network latency than because they are SMP.

I finished finding almost all the lost performance.  As indicated above,
it was almost all in nfs.
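
As a rough check of the "dead time is proportional to latency" claim
quoted above, the arithmetic can be written out in C.  This is a sketch
under a loud assumption: it treats all dead time as strictly serialized,
unloaded round trips with no overlap.  The function names are
hypothetical and only serve this example.

```c
#include <assert.h>

/* Wall-clock seconds not accounted for by CPU time: real - user - sys. */
double
dead_time(double real, double user, double sys)
{
	return (real - user - sys);
}

/*
 * Number of strictly serialized round trips that would account for
 * dead_seconds of waiting at latency_us microseconds each.  Assumes no
 * overlap between requests, which only roughly holds without -j.
 */
double
implied_round_trips(double dead_seconds, double latency_us)
{
	assert(latency_us > 0.0);
	return (dead_seconds * 1e6 / latency_us);
}
```

Under that assumption, ~14 seconds at ~100uS works out to about 140000
round trips and ~21 seconds at ~160uS to about 131000, so the two
networks imply a similar amount of waiting per build, which is what
"almost directly proportional" suggests.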

This change:

% Index: nfs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
% retrieving revision 1.235
% retrieving revision 1.236
% diff -u -2 -r1.235 -r1.236
% --- nfs_vnops.c	6 Dec 2004 18:52:28 -0000	1.235
% +++ nfs_vnops.c	6 Dec 2004 19:18:00 -0000	1.236
% @@ -418,10 +418,11 @@
%                  if (error)
%                          return (error);
% -                np->n_mtime = vattr.va_mtime.tv_sec;
% +                np->n_mtime = vattr.va_mtime;
%          } else {
% +                np->n_attrstamp = 0;
                   ^^^^^^^^^^^^^^^^^^^^
%                  error = VOP_GETATTR(vp, &vattr, ap->a_cred, ap->a_td);
%                  if (error)
%                          return (error);
% -                if (np->n_mtime != vattr.va_mtime.tv_sec) {
% +                if (NFS_TIMESPEC_COMPARE(&np->n_mtime, &vattr.va_mtime)) {
%                          if (vp->v_type == VDIR)
%                                  np->n_direofoffset = 0;

and associated changes give silly behaviour that almost doubles the
number of Access RPCs.  One of the associated changes clears n_attrstamp
on close().  Then on open(), since lookup() is called before the above
is reached, nfs_access_otw() has always just been called, and the above
forces another call.

Counting RPCs gives a good metric for the pessimizations.  Removing the
above clearing in RELENG_6 gives the following improvement:

Before:    89.90 real  62.16 user  5.50 sys
Lookup   Read  Write Create Access Fsstat Setattr  Other   Total
 60010   2410   5353    442  43785   1742    5194      6  118942

After:     86.46 real  62.22 user  5.21 sys
Lookup   Read  Write Create Access Fsstat Setattr  Other   Total
 59986   2410   5353    442  20935   1742    5194      6   96068

Note the RPC delta-counts barely changed except for the Access one.
About 20000 Access calls were avoided.  Just removing the clearing is
not correct but is close.

The pessimization in vfs_cache.c 1.103 is now easy to quantify.  It
triples the number of Lookup RPCs.
Removing it in addition to the above gives a much larger improvement:

           79.24 real  61.87 user  5.04 sys
Lookup   Read  Write Create Access Fsstat Setattr  Other   Total
 19548   2410   5353    442  20922   1742    5194      6   55617

Note the RPC delta-counts barely changed except for the Lookup one.
About 40000 Lookup calls were avoided.  Just removing the change in
vfs_cache.c 1.103 is not close to being correct.

The last major pessimization is another silly one.  The changes to
mark atimes on exec() and mmap() cause a silly null Setattr RPC for
every exec() (more for interpreters?) and every mmap().  This is
easy to fix (almost) correctly.  VOP_SETATTR() is assumed to do
nothing for requests that it doesn't understand, but nfs_setattr()
does null RPCs instead.  The following fix:

% diff -c2 ./nfsclient/nfs_vnops.c~ ./nfsclient/nfs_vnops.c
% *** ./nfsclient/nfs_vnops.c~	Sun Oct  8 23:08:57 2006
% --- ./nfsclient/nfs_vnops.c	Fri Oct 13 09:58:12 2006
% ***************
% *** 669,675 ****
%
%           /*
% !          * Setting of flags is not supported.
%            */
% !         if (vap->va_flags != VNOVAL)
%                   return (EOPNOTSUPP);
%
% --- 677,684 ----
%
%           /*
% !          * Setting of flags and marking of atimes are not supported.
%            */
% !         if (vap->va_flags != VNOVAL ||
% !             ((bdefix & 4) && (vap->va_vaflags & VA_MARK_ATIME)))
%                   return (EOPNOTSUPP);
%

in addition to the removals gives the following improvement with
bdefix set to 7:

           78.14 real  62.03 user  4.79 sys
Lookup   Read  Write Create Access Fsstat  Other   Total
 19556   2410   5353    442  19581   1738     14   49094

Bruce

From owner-freebsd-fs@FreeBSD.ORG Sat Oct 14 14:37:45 2006
Message-ID: <4530F62E.20308@samsco.org>
Date: Sat, 14 Oct 2006 08:37:34 -0600
From: Scott Long
To: Bruce Evans
References: <20061006050913.Y5250@epsplex.bde.org>
 <20061014143825.F1264@epsplex.bde.org>
In-Reply-To: <20061014143825.F1264@epsplex.bde.org>
Cc: fs@freebsd.org, mohans@freebsd.org
Subject: Re: lost dotdot caching pessimizes nfs especially

Bruce Evans wrote:
> On Fri, 6 Oct 2006, Bruce Evans wrote:
> [...]
> The last major pessimization is another silly one.  The changes to
> mark atimes on exec() and mmap() cause a silly null Setattr RPC for
> every exec() (more for interpreters?) and every mmap().  This is
> easy to fix (almost) correctly.  VOP_SETATTR() is assumed to do
> nothing for requests that it doesn't understand, but nfs_setattr()
> does null RPCs instead.  The following fix:
>
> % diff -c2 ./nfsclient/nfs_vnops.c~ ./nfsclient/nfs_vnops.c
> % *** ./nfsclient/nfs_vnops.c~	Sun Oct  8 23:08:57 2006
> % --- ./nfsclient/nfs_vnops.c	Fri Oct 13 09:58:12 2006
> % ***************
> % *** 669,675 ****
> %
> %           /*
> % !          * Setting of flags is not supported.
> %            */
> % !         if (vap->va_flags != VNOVAL)
> %                   return (EOPNOTSUPP);
> %
> % --- 677,684 ----
> %
> %           /*
> % !          * Setting of flags and marking of atimes are not supported.
> %            */
> % !         if (vap->va_flags != VNOVAL ||
> % !             ((bdefix & 4) && (vap->va_vaflags & VA_MARK_ATIME)))
> %                   return (EOPNOTSUPP);
> %
>
> in addition to the removals gives the following improvement with
> bdefix set to 7:
>
>            78.14 real  62.03 user  4.79 sys
> Lookup   Read  Write Create Access Fsstat  Other   Total
>  19556   2410   5353    442  19581   1738     14   49094
>
> Bruce

I've seen hints that the excessive null SETATTR calls also create
unpredictable problems with some servers.  Thanks a lot for tracking
this down.

Scott
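
For anyone who wants to cross-check the figures quoted in this thread,
the RPC totals and the dead time can be tied together with a little C.
This is a minimal sketch: struct nfs_run and both function names are
hypothetical, and dividing all of the dead time by the RPC total assumes
that every second of dead time is spent waiting on an RPC, which the
thread suggests but does not prove.

```c
#include <assert.h>

/* One measured kernel build: times in seconds plus the total RPC count. */
struct nfs_run {
	double real, user, sys;
	long rpcs;
};

/* RPCs saved going from one run to another. */
long
rpcs_avoided(struct nfs_run before, struct nfs_run after)
{
	return (before.rpcs - after.rpcs);
}

/* Average dead time (real - user - sys) per RPC, in microseconds. */
double
dead_us_per_rpc(struct nfs_run r)
{
	assert(r.rpcs > 0);
	return ((r.real - r.user - r.sys) * 1e6 / (double)r.rpcs);
}
```

Fed the first RELENG_6 measurement (89.90 real, 62.16 user, 5.50 sys,
118942 RPCs) and the last one (78.14 real, 62.03 user, 4.79 sys, 49094
RPCs), this gives 69848 RPCs avoided, with the dead time falling from
about 22.2 to about 11.3 seconds.  The per-RPC figure stays in the same
range (roughly 187 and 231 microseconds), which fits the observation
that latency rather than throughput dominates.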