From owner-freebsd-fs@FreeBSD.ORG Sun Aug 20 17:28:51 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C837916A4DA for ; Sun, 20 Aug 2006 17:28:51 +0000 (UTC) (envelope-from pho@holm.cc) Received: from relay03.pair.com (relay03.pair.com [209.68.5.17]) by mx1.FreeBSD.org (Postfix) with SMTP id F08FB43D49 for ; Sun, 20 Aug 2006 17:28:50 +0000 (GMT) (envelope-from pho@holm.cc) Received: (qmail 22701 invoked from network); 20 Aug 2006 17:28:49 -0000 Received: from unknown (HELO peter.osted.lan) (unknown) by unknown with SMTP; 20 Aug 2006 17:28:49 -0000 X-pair-Authenticated: 80.165.155.106 Received: from peter.osted.lan (localhost.osted.lan [127.0.0.1]) by peter.osted.lan (8.13.6/8.13.6) with ESMTP id k7KHSk3q074813; Sun, 20 Aug 2006 19:28:46 +0200 (CEST) (envelope-from pho@peter.osted.lan) Received: (from pho@localhost) by peter.osted.lan (8.13.6/8.13.6/Submit) id k7KHSkOB074812; Sun, 20 Aug 2006 19:28:46 +0200 (CEST) (envelope-from pho) Date: Sun, 20 Aug 2006 19:28:45 +0200 From: Peter Holm To: Konstantin Belousov Message-ID: <20060820172845.GA74767@peter.osted.lan> References: <20060816155310.GA64420@peter.osted.lan> <20060817105155.GC1483@deviant.kiev.zoral.com.ua> <22339.193.3.142.123.1155814154.squirrel@webmail4.pair.com> <20060817113203.GD1483@deviant.kiev.zoral.com.ua> <20060817170314.GA17490@peter.osted.lan> <20060818164903.GF20768@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20060818164903.GF20768@deviant.kiev.zoral.com.ua> User-Agent: Mutt/1.4.2.1i Cc: freebsd-fs@freebsd.org, tegge@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. 
	[Was: Re: Livelock while accessing /tmp]
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 20 Aug 2006 17:28:51 -0000

On Fri, Aug 18, 2006 at 07:49:03PM +0300, Konstantin Belousov wrote:
> On Thu, Aug 17, 2006 at 07:03:14PM +0200, Peter Holm wrote:
> >
> > Ok, I got a new one after some 6 hours of testing with the attached
> > script + the default stress test:
> > http://people.freebsd.org/~pho/stress/log/cons205a.html
> >
> > - Peter
>
> First, big thanks to Peter for helping to debug the problem!
>
> This deadlock happens between processes 764 (nfsd) and 62981 (mksnap_ffs).
> In fact, the deadlock is not specific to nfsd. It happens when
> ufs_inactive() interleaves with ffs_snapshot().
>
>
> Look:
>
> db> where 764
> Tracing pid 764 tid 100076 td 0xc3fdb870
> sched_switch(c3fdb870,0,1) at sched_switch+0x183
> mi_switch(1,0) at mi_switch+0x280
> sleepq_switch(c40ca57c,c0a0b0b0,0,c092000a,211,...) at sleepq_switch+0xcd
> sleepq_wait(c40ca57c,0,c0927acf,3f3,c093229c,...) at sleepq_wait+0x46
> msleep(c40ca57c,c40ca534,29f,c0927b18,0,...) at msleep+0x27d
> vn_start_secondary_write(c59bc820,e6586988,1) at vn_start_secondary_write+0x122
> ufs_inactive(e65869b8) at ufs_inactive+0x257
> VOP_INACTIVE_APV(c09d9a00,e65869b8) at VOP_INACTIVE_APV+0x7e
> vinactive(c59bc820,c3fdb870) at vinactive+0x72
> vput(c59bc820,c0a0b0c8,1,c0932293,407,...) at vput+0x1b3
> nfsrv_read(c4703600,c3f12900,c3fdb870,e6586c40) at nfsrv_read+0xc21
> nfssvc_nfsd(c3fdb870) at nfssvc_nfsd+0x409
> nfssvc(c3fdb870,e6586d04) at nfssvc+0x18c
> syscall(3b,3b,3b,1,0,...) at syscall+0x256
> Xint0x80_syscall() at Xint0x80_syscall+0x1f
>
> db> where 62981
> Tracing pid 62981 tid 100135 td 0xc46e3d80
> sched_switch(c46e3d80,0,1) at sched_switch+0x183
> mi_switch(1,0) at mi_switch+0x280
> sleepq_switch(c59bc878,c0a0b0b0,0,c092000a,211,...) at sleepq_switch+0xcd
> sleepq_wait(c59bc878,0,c59bc89c,b1,c0926903,...) at sleepq_wait+0x46
> msleep(c59bc878,c0a0a930,50,c0924f24,0,...) at msleep+0x27d
> acquire(e66ee5a8,40,60000,c46e3d80,0,...) at acquire+0x76
> lockmgr(c59bc878,2002,c59bc89c,c46e3d80) at lockmgr+0x44a
> ffs_lock(e66ee600) at ffs_lock+0x6e
> VOP_LOCK_APV(c09d9a00,e66ee600) at VOP_LOCK_APV+0x87
> vn_lock(c59bc820,2002,c46e3d80,c59bc820) at vn_lock+0xa8
> ffs_snapshot(c40ca510,c3defb60,c3defb60,c401e000,c4016514,...) at ffs_snapshot+0x1210
> ffs_mount(c40ca510,c46e3d80,20000000,201300,0,...) at ffs_mount+0x927
> vfs_domount(c46e3d80,c3dffa80,c3d45b40,1211300,c3f662c0,c0a0b0c8,0,c09268fa,2b0) at vfs_domount+0x554
> vfs_donmount(c46e3d80,1211300,e66eebac) at vfs_donmount+0x414
> kernel_mount(c3fc5690,1211300,bfbfecdc,0,0,...) at kernel_mount+0x6d
> ffs_cmount(c3fc5690,bfbfe500,1211300,c46e3d80,c09d96e0,...) at ffs_cmount+0x5d
> mount(c46e3d80,e66eed04) at mount+0x15e
> syscall(3b,3b,3b,2816772c,bfbfe4a0,...) at syscall+0x256
>
> mnt_kern_flag = 0x2c000000 (MNTK_SUSPEND | MNTK_SUSPEND2 | MNTK_MPSAFE).
>
> vn_lock() in ffs_snapshot() is called with flags LK_INTERLOCK | LK_EXCLUSIVE.
> There is only one such place in ffs_snapshot.c, at line 541.
>
> On the other hand, ufs_inactive() calls
> vn_start_secondary_write(vp, XXX, V_WAIT). ufs_inactive() runs with the
> vnode locked. If this happens at the right time, the system will deadlock.
>
> nfsd is the most vulnerable to the problem because it is often the only
> (and last) user of a vnode, so vput() from nfsd has a high chance of
> resulting in vinactive().
>
> Below is the patch that sets VI_OWEINACT for the inode if the last call to
> vn_start_secondary_write(..., V_NOWAIT) fails. Returning at that point is
> safe because mp == NULL means that no earlier code that changes the inode
> was executed.
>
> Please review and test.
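
The approach described above (try the non-blocking call and, if the file system is being suspended, record that inactivation is still owed and return) can be illustrated with a small userland model. This is only a sketch under stated assumptions, not the kernel code: model_mount, model_vnode, model_start_write_nowait() and model_inactive() are hypothetical names, the "suspended" field stands in for the MNTK_SUSPEND2 | MNTK_SUSPENDED check, and "owe_inactive" stands in for VI_OWEINACT.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel objects named in the thread. */
struct model_mount {
	bool suspended;		/* models MNTK_SUSPEND2 | MNTK_SUSPENDED */
};

struct model_vnode {
	bool locked;		/* the caller holds the vnode lock */
	bool owe_inactive;	/* models VI_OWEINACT */
};

/*
 * Non-blocking analogue of vn_start_secondary_write(..., V_NOWAIT):
 * returns 0 if writing may start, nonzero if writes are suspended.
 */
int
model_start_write_nowait(const struct model_mount *mp)
{
	return (mp->suspended ? 1 : 0);
}

/*
 * Returns true if the inode update was done, false if it was deferred.
 * The function never sleeps while the vnode lock is held, which is the
 * property that avoids the lock-order deadlock in this model.
 */
bool
model_inactive(struct model_vnode *vp, const struct model_mount *mp)
{
	assert(vp->locked);
	if (model_start_write_nowait(mp) != 0) {
		vp->owe_inactive = true;	/* retry inactivation later */
		return (false);
	}
	/* The real code would update the inode here. */
	vp->owe_inactive = false;
	return (true);
}
```

In this model a caller that sees false simply drops the vnode lock and moves on; the suspending thread can then take the lock, which is the behaviour the patch aims for.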
>

I have tested your patch for more than 24 hours and ran into this panic:

http://people.freebsd.org/~pho/stress/log/cons205b.html

- Peter

> Index: sys/ufs/ufs/ufs_inode.c
> ===================================================================
> RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_inode.c,v
> retrieving revision 1.67
> diff -u -r1.67 ufs_inode.c
> --- sys/ufs/ufs/ufs_inode.c	9 May 2006 22:33:43 -0000	1.67
> +++ sys/ufs/ufs/ufs_inode.c	18 Aug 2006 16:42:48 -0000
> @@ -147,9 +147,23 @@
>  			mp = NULL;
>  			ip->i_flag &= ~IN_ACCESS;
>  		} else {
> -			if (mp == NULL)
> -				(void) vn_start_secondary_write(vp, &mp,
> -				    V_WAIT);
> +			if (mp == NULL) {
> +	loop1:
> +				if (vn_start_secondary_write(vp, &mp, V_NOWAIT)) {
> +					MNT_ILOCK(mp);
> +					if ((mp->mnt_kern_flag &
> +					    (MNTK_SUSPEND2 | MNTK_SUSPENDED)) == 0) {
> +						MNT_IUNLOCK(mp);
> +						goto loop1;
> +					}
> +
> +					VI_LOCK(vp);
> +					vp->v_iflag |= VI_OWEINACT;
> +					VI_UNLOCK(vp);
> +					MNT_IUNLOCK(mp);
> +					return (0);
> +				}
> +			}
>  			UFS_UPDATE(vp, 0);
>  		}
>  	}

-- 
Peter Holm

From owner-freebsd-fs@FreeBSD.ORG Mon Aug 21 13:22:16 2006
Date: Mon, 21 Aug 2006 13:21:51 +0000 (UTC)
Message-Id: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org>
To: kostikbel@gmail.com
From: Tor Egge
In-Reply-To: <20060818.202001.74745664.Tor.Egge@cvsup.no.freebsd.org>
Cc: freebsd-fs@freebsd.org, tegge@freebsd.org
Subject: Re: Deadlock between nfsd and snapshots.

I wrote:

> The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or
> IN_UPDATE was set on the inode, indicating a write operation
> (e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(),
> VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that was
> not protected by vn_start_write() or vn_start_secondary_write().

The most common "write" operation was probably VOP_GETATTR().

ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode flag
if IN_ACCESS is set on the inode even if neither IN_CHANGE nor IN_UPDATE is
set, transitioning the inode flags into a state where ufs_inactive() calls the
blocking variant of vn_start_secondary_write().

Calling ufs_itimes() with only a shared vnode lock might cause unsafe accesses
to the inode flags.  Setting of IN_ACCESS at the end of ffs_read() and
ffs_extread() might also be unsafe.  If DIRECTIO is enabled then O_DIRECT reads
might not even attempt to set the IN_ACCESS flag.

- Tor Egge

From owner-freebsd-fs@FreeBSD.ORG Mon Aug 21 13:37:18 2006
Message-ID: <44E9B722.2040407@centtech.com>
Date: Mon, 21 Aug 2006 08:37:38 -0500
From: Eric Anderson
To: Tor Egge
In-Reply-To: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org>
Cc: freebsd-fs@freebsd.org, tegge@freebsd.org
Subject: Re: Deadlock between nfsd and snapshots.

On 08/21/06 08:21, Tor Egge wrote:
> I wrote:
>
>> The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or
>> IN_UPDATE was set on the inode, indicating a write operation
>> (e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(),
>> VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that was
>> not protected by vn_start_write() or vn_start_secondary_write().
>
> The most common "write" operation was probably VOP_GETATTR().
>
> ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode flag
> if IN_ACCESS is set on the inode even if neither IN_CHANGE nor IN_UPDATE is
> set, transitioning the inode flags into a state where ufs_inactive() calls the
> blocking variant of vn_start_secondary_write().
>
> Calling ufs_itimes() with only a shared vnode lock might cause unsafe accesses
> to the inode flags.  Setting of IN_ACCESS at the end of ffs_read() and
> ffs_extread() might also be unsafe.  If DIRECTIO is enabled then O_DIRECT reads
> might not even attempt to set the IN_ACCESS flag.

Does this mean that setting the noatime flag on mount would dodge this?

Eric

-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG Mon Aug 21 13:38:58 2006
Date: Mon, 21 Aug 2006 16:38:36 +0300
From: Kostik Belousov
To: Tor Egge
Message-ID: <20060821133836.GB56637@deviant.kiev.zoral.com.ua>
In-Reply-To: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org>
Cc: freebsd-fs@freebsd.org, tegge@freebsd.org
Subject: Re: Deadlock between nfsd and snapshots.

On Mon, Aug 21, 2006 at 01:21:51PM +0000, Tor Egge wrote:
>
> I wrote:
>
> > The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or
> > IN_UPDATE was set on the inode, indicating a write operation
> > (e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(),
> > VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that was
> > not protected by vn_start_write() or vn_start_secondary_write().
>
> The most common "write" operation was probably VOP_GETATTR().
>
> ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode flag
> if IN_ACCESS is set on the inode even if neither IN_CHANGE nor IN_UPDATE is
> set, transitioning the inode flags into a state where ufs_inactive() calls the
> blocking variant of vn_start_secondary_write().
>
> Calling ufs_itimes() with only a shared vnode lock might cause unsafe accesses
> to the inode flags.  Setting of IN_ACCESS at the end of ffs_read() and
> ffs_extread() might also be unsafe.  If DIRECTIO is enabled then O_DIRECT reads
> might not even attempt to set the IN_ACCESS flag.

Thanks for the analysis! I had already thought about ufs_itimes()/GETATTR.
I am currently musing about storing the list of threads that called
vn_start_write() in the mount struct, and checking that the current thread
is on the list during modifying operations, at least those ops that set
the IN_* flags.

From owner-freebsd-fs@FreeBSD.ORG Mon Aug 21 13:44:51 2006
Date: Mon, 21 Aug 2006 16:44:21 +0300
From: Kostik Belousov
To: Eric Anderson
Message-ID: <20060821134421.GC56637@deviant.kiev.zoral.com.ua>
In-Reply-To: <44E9B722.2040407@centtech.com>
Cc: freebsd-fs@freebsd.org, tegge@freebsd.org
Subject: Re: Deadlock between nfsd and snapshots.

On Mon, Aug 21, 2006 at 08:37:38AM -0500, Eric Anderson wrote:
> On 08/21/06 08:21, Tor Egge wrote:
> >I wrote:
> >
> >>The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or
> >>IN_UPDATE was set on the inode, indicating a write operation
> >>(e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(),
> >>VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that was
> >>not protected by vn_start_write() or vn_start_secondary_write().
> >
> >The most common "write" operation was probably VOP_GETATTR().
> >
> >ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode flag
> >if IN_ACCESS is set on the inode even if neither IN_CHANGE nor IN_UPDATE is
> >set, transitioning the inode flags into a state where ufs_inactive() calls the
> >blocking variant of vn_start_secondary_write().
> >
> >Calling ufs_itimes() with only a shared vnode lock might cause unsafe accesses
> >to the inode flags.  Setting of IN_ACCESS at the end of ffs_read() and
> >ffs_extread() might also be unsafe.  If DIRECTIO is enabled then O_DIRECT reads
> >might not even attempt to set the IN_ACCESS flag.
>
> Does this mean that setting the noatime flag on mount would dodge this?

On the server, yes.

From owner-freebsd-fs@FreeBSD.ORG Mon Aug 21 14:09:34 2006
Date: Mon, 21 Aug 2006 14:09:20 +0000 (UTC)
Message-Id: <20060821.140920.85376544.Tor.Egge@cvsup.no.freebsd.org>
To: anderson@centtech.com
From: Tor Egge
In-Reply-To: <44E9B722.2040407@centtech.com>
Cc: freebsd-fs@freebsd.org, tegge@freebsd.org
Subject: Re: Deadlock between nfsd and snapshots.

> Does this mean that setting the noatime flag on mount would dodge this?

It might solve the deadlock issue when creating snapshots.  Note that snapshots
might fail to make copies of the original content when file system metadata
changes on some systems (cf. PR kern/100365).

Setting the noatime flag does not prevent ufs_itimes() from changing the inode
flags without proper locking.  IN_CHANGE might be set on the inode after a
chmod() system call; a following fstat() system call can then trigger a call to
ufs_itimes().

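
The system call sequence mentioned above, chmod() followed by fstat(), can be written as a short userland helper. This is a sketch only: chmod_then_fstat() is a hypothetical name, and the code merely issues the two calls in order. Whether a given kernel then reaches ufs_itimes() without the needed protection is the question under discussion in this thread, not something this program can demonstrate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: change the mode of an existing file, then read
 * its attributes back through a descriptor.  Returns 0 on success and
 * -1 on failure (errno is left as set by the failing call).
 */
int
chmod_then_fstat(const char *path, mode_t mode, struct stat *sb)
{
	int fd, rc;

	assert(path != NULL && sb != NULL);

	/* chmod() is the call that may leave a change mark on the inode. */
	if (chmod(path, mode) != 0)
		return (-1);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (-1);

	/* fstat() is the attribute read that follows the change. */
	rc = fstat(fd, sb);
	(void)close(fd);
	return (rc);
}
```

A caller would pass the path of a file it owns and a mode that still lets it open the file for reading, for example 0600.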
- Tor Egge

From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 09:15:45 2006
Date: Tue, 22 Aug 2006 19:13:15 +1000 (EST)
From: Bruce Evans
To: Tor Egge
In-Reply-To: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org>
Message-ID: <20060822175540.V58720@delplex.bde.org>
Cc: freebsd-fs@freebsd.org, tegge@freebsd.org
Subject: Re: Deadlock between nfsd and snapshots.

On Mon, 21 Aug 2006, Tor Egge wrote:

> I wrote:
>
>> The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or
>> IN_UPDATE was set on the inode, indicating a write operation
>> (e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(),
>> VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that was
>> not protected by vn_start_write() or vn_start_secondary_write().
>
> The most common "write" operation was probably VOP_GETATTR().

Reading the attributes really is a write operation if it causes marks for
update to be turned into updates.  ufs_inactive() is also a write operation
if it causes this.  I think marking for update shouldn't require much locking.

> ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode flag
> if IN_ACCESS is set on the inode even if neither IN_CHANGE nor IN_UPDATE is
> set, transitioning the inode flags into a state where ufs_inactive() calls the
> blocking variant of vn_start_secondary_write().

In the implementation of marking access times for update on exec, we try
hard to avoid calling vn_start_write(), etc.  Early implementations did
call vn_start_write(), etc., and had some bugs from this.  The current
implementation is mainly for ffs and is rather fragile: VOP_SETATTR()
depends on callers calling vfs_start_write, but vfs_mark_atime() calls
VOP_SETATTR() without calling vfs_start_write().  The correctness of this
depends on the VA_MARK_ATIME case of VOP_SETATTR() not writing any more
than VOP_READ() would or should.

I think ufs_itimes() shouldn't call vn_start_write() any more than
ufs_getattr() should.  Callers should be aware that GETATTR may write and
thus it seems to be necessary for them to call vn_start_write()
unconditionally.

ufs_itimes() is a utility function that is called from places other than
ufs_getattr().  The other places are ufs_*close() and ufs_setattr().  These
don't cause any additional problems.  VOP_CLOSE() is called from several
places which already seem to have sufficient locking.  VOP_GETATTR() is
called from many places that don't call vn_start_write(), starting with
vn_stat().

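
The flag transition quoted above can be modeled in a few lines of userland C. This is a sketch under stated assumptions, not the UFS code: the M_IN_* bit values are made up for illustration (the real IN_* constants are not reproduced here), and clearing the pending marks once the modified bit is set is an assumption drawn from the phrase "marks for update to be turned into updates".

```c
#include <assert.h>

/* Hypothetical bit values standing in for the IN_* inode flags. */
enum {
	M_IN_ACCESS   = 1 << 0,
	M_IN_CHANGE   = 1 << 1,
	M_IN_UPDATE   = 1 << 2,
	M_IN_MODIFIED = 1 << 3
};

/*
 * Model of the transition described for ufs_itimes(): any pending mark,
 * including an access mark alone, turns into "inode modified".
 */
unsigned
model_itimes(unsigned flags)
{
	const unsigned marks = M_IN_ACCESS | M_IN_CHANGE | M_IN_UPDATE;

	if ((flags & marks) == 0)
		return (flags);		/* nothing pending, nothing to do */
	flags |= M_IN_MODIFIED;
	flags &= ~marks;		/* assumption: marks are consumed */
	assert((flags & marks) == 0);
	return (flags);
}

/*
 * Model of the condition under which inactivation has to write: one of
 * the change, modified or update bits is set.
 */
int
model_inactive_needs_write(unsigned flags)
{
	return ((flags & (M_IN_CHANGE | M_IN_MODIFIED | M_IN_UPDATE)) != 0);
}
```

In this model an inode carrying only the access mark does not need a write at inactivation time, but it does after one pass through model_itimes(), which is the state change the quoted paragraph points at.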
The updates to timestamps in ufs_itimes() and ufs_getattr() are still
soft (not even delayed writes, but writes to the vnode that will usually
become delayed writes later).  Perhaps vn_start_write() can be avoided for
them, but since they are logically writes this might be hard to implement
correctly.  ufs_inactive() has to do a (possibly delayed) physical write
to force any updates to disk, so it needs strong locking.

BTW, ufs_itimes() has some possibly related kludges involving changing from
r/w mounts to r/o mounts.  Some of the marks for update aren't handled quite
right and are still present after the change.  Then they want to be turned
into updates on a r/o file system.  This is impossible.  The problem is
handled bogusly, essentially by clearing them without doing the update, in
a way that triggers some of my local debugging code.

> Calling ufs_itimes() with only a shared vnode lock might cause unsafe accesses
> to the inode flags.  Setting of IN_ACCESS at the end of ffs_read() and
> ffs_extread() might also be unsafe.  If DIRECTIO is enabled then O_DIRECT reads
> might not even attempt to set the IN_ACCESS flag.

I think setting it should be safe -- see above.  Not setting the IN_ACCESS
flag in the early return for the O_DIRECT case is a different bug.  read()
is supposed to set IN_ACCESS on successful completion and returning early
for the O_DIRECT case would defeat this.

Bruce

From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 10:45:38 2006
Date: Tue, 22 Aug 2006 12:45:16 +0200
From: Pawel Jakub Dawidek
To: freebsd-current@FreeBSD.org
Message-ID: <20060822104516.GB16033@garage.freebsd.pl>
Cc: freebsd-fs@FreeBSD.org, zfs-discuss@opensolaris.org
Subject: Porting ZFS file system to FreeBSD.

Hi.

I started porting the ZFS file system to the FreeBSD operating system.

There is a lot to do, but I'm making good progress, I think.

I'm doing my work in these directories:

	contrib/opensolaris/ - userland files taken directly from
		OpenSolaris (libzfs, zpool, zfs and others)

	sys/contrib/opensolaris/ - kernel files taken directly from
		OpenSolaris (zfs, taskq, callb and others)

	compat/opensolaris/ - compatibility userland layer, so I can
		reduce diffs against vendor files

	sys/compat/opensolaris/ - compatibility kernel layer, so I can
		reduce diffs against vendor files (kmem based on
		malloc(9) and uma(9), mutexes based on our sx(9) locks,
		condvars based on sx(9) locks and more)

	cddl/ - FreeBSD specific makefiles for userland bits

	sys/modules/zfs/ - FreeBSD specific makefile for the kernel
		module

You can find all of these on the FreeBSD perforce server:

	http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/pjd/zfs&HIDEDEL=NO

Ok, so where am I?

I ported the userland bits (libzfs, zfs and zpool). I had ztest and
libzpool compiling and working as well, but I left them behind for now
to focus on kernel bits.

I'm building all (except 2) files into zfs.ko (kernel module).

I created a new VDEV - vdev_geom, which fits into FreeBSD's GEOM
infrastructure, so basically you can use any GEOM provider to build your
ZFS pool. VDEV_GEOM is implemented as a consumers-only GEOM class.

I reimplemented ZVOL to also export storage as a GEOM provider. This time
it is a providers-only GEOM class.

This way one can create for example RAID-Z on top of GELI encrypted
disks or encrypt a ZFS volume. The order is free.
Basically you can already put UFS on ZFS volumes and it is really
stable even under heavy load.

Currently I'm working on the file system bits (ZPL), which are the hardest
part of the entire ZFS port, because they talk to one of the most complex
parts of the FreeBSD kernel - VFS.

I can already mount ZFS-created file systems (with the 'zfs create'
command), create files/directories, change permissions/owner/etc., list
directory contents, and perform a few other minor operations.

Some "screenshots":

	lcf:root:~# uname -a
	FreeBSD lcf 7.0-CURRENT FreeBSD 7.0-CURRENT #74: Tue Aug 22 03:04:01 UTC 2006 root@lcf:/usr/obj/zoo/pjd/lcf/sys/LCF i386

	lcf:root:~# zpool create tank raidz /dev/ad4a /dev/ad6a /dev/ad5a

	lcf:root:~# zpool list
	NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
	tank                   35,8G   11,7M   35,7G     0%  ONLINE     -

	lcf:root:~# zpool status
	  pool: tank
	 state: ONLINE
	 scrub: none requested
	config:

	        NAME        STATE     READ WRITE CKSUM
	        tank        ONLINE       0     0     0
	          raidz1    ONLINE       0     0     0
	            ad4a    ONLINE       0     0     0
	            ad6a    ONLINE       0     0     0
	            ad5a    ONLINE       0     0     0

	errors: No known data errors

	lcf:root:# zfs create -V 10g tank/vol
	lcf:root:# newfs /dev/zvol/tank/vol
	lcf:root:# mount /dev/zvol/tank/vol /mnt/test

	lcf:root:# zfs create tank/fs

	lcf:root:~# mount -t zfs,ufs
	tank on /tank (zfs, local)
	tank/fs on /tank/fs (zfs, local)
	/dev/zvol/tank/vol on /mnt/test (ufs, local)

	lcf:root:~# df -ht zfs,ufs
	Filesystem            Size    Used   Avail Capacity  Mounted on
	tank                   13G     34K     13G     0%    /tank
	tank/fs                13G     33K     13G     0%    /tank/fs
	/dev/zvol/tank/vol    9.7G    4.0K    8.9G     0%    /mnt/test

	lcf:root:~# mkdir /tank/fs/foo
	lcf:root:~# touch /tank/fs/foo/bar
	lcf:root:~# chown root:operator /tank/fs/foo /tank/fs/foo/bar
	lcf:root:~# chmod 500 /tank/fs/foo
	lcf:root:~# ls -ld /tank/fs/foo /tank/fs/foo/bar
	dr-x------  2 root  operator  3 22 sie 05:41 /tank/fs/foo
	-rw-r--r--  1 root  operator  0 22 sie 05:42 /tank/fs/foo/bar

The most important missing pieces:
- Most of the ZPL layer.
- Autoconfiguration. I need to implement vdev discovery based on GEOM's taste
  mechanism.
- .zfs/ control directory (entirely commented out for now).
And many more, but hey, this is after 10 days of work.

PS. Please contact me privately if your company would like to donate to the
    ZFS effort. Even without sponsorship the work will be finished, but
    your contributions will allow me to spend more time working on ZFS.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--A6N2fC+uXW/VQSAv Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.4 (FreeBSD) iD8DBQFE6uA8ForvXbEpPzQRAr1vAJ0T/FHgwwNxWYXh3a3298DHiOTeiwCgh/NZ ixnrVrJZoTppOnLxNeAoGfM= =doT1 -----END PGP SIGNATURE----- --A6N2fC+uXW/VQSAv--

From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 10:57:48 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 46B8F16A4E5 for ; Tue, 22 Aug 2006 10:57:48 +0000 (UTC) (envelope-from joao.barros@gmail.com) Received: from py-out-1112.google.com (py-out-1112.google.com [64.233.166.182]) by mx1.FreeBSD.org (Postfix) with ESMTP id 31B0043D45 for ; Tue, 22 Aug 2006 10:57:46 +0000 (GMT) (envelope-from joao.barros@gmail.com) Received: by py-out-1112.google.com with SMTP id o67so2818286pye for ; Tue, 22 Aug 2006 03:57:46 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=i4Hg2s0OBkVZfE+ytVGmk4rpHMPSOv8T5qdR+iFxqLp++tc/RI9Yjn8WuXWVfdrXdCyyciLlMbnhQo55eQH3LaVz9GYYbxmU/m0Sh7+cLn3u0aMG32W+0FdFKrzfDqPPV8+33GHANksEe3eonLCGjm6r09WqfMTDXwtnYoQ6nWQ= Received: by 10.35.63.2 with SMTP id q2mr15311043pyk; Tue, 22 Aug 2006 03:57:46 -0700 (PDT) Received: by 10.35.114.2 with HTTP; Tue, 22 Aug 2006 03:57:46 -0700 (PDT) Message-ID:
<70e8236f0608220357i22767c5dm239b36b10de2158b@mail.gmail.com> Date: Tue, 22 Aug 2006 11:57:46 +0100 From: "Joao Barros" To: "Pawel Jakub Dawidek" In-Reply-To: <20060822104516.GB16033@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20060822104516.GB16033@garage.freebsd.pl> Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org Subject: Re: Porting ZFS file system to FreeBSD. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 10:57:48 -0000 On 8/22/06, Pawel Jakub Dawidek wrote: > Hi. > > I started porting the ZFS file system to the FreeBSD operating system. > > There is a lot to do, but I'm making good progress, I think. > > And many more, but hey, this is after 10 days of work. Impressive! I'm available for beta testing whenever you feel it's ready to. I also have a machine available if you need it. Very good work! 
:-) -- Joao Barros From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 13:08:06 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0560316A4E1; Tue, 22 Aug 2006 13:08:06 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from fw.zoral.com.ua (fw.zoral.com.ua [213.186.206.134]) by mx1.FreeBSD.org (Postfix) with ESMTP id 60CCC43D7E; Tue, 22 Aug 2006 13:07:53 +0000 (GMT) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by fw.zoral.com.ua (8.13.4/8.13.4) with ESMTP id k7MD7ijX093621 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 22 Aug 2006 16:07:44 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.13.6/8.13.6) with ESMTP id k7MD7jaP022271; Tue, 22 Aug 2006 16:07:45 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.13.6/8.13.6/Submit) id k7MD7hdr022270; Tue, 22 Aug 2006 16:07:43 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 22 Aug 2006 16:07:43 +0300 From: Kostik Belousov To: Bruce Evans Message-ID: <20060822130743.GL56637@deviant.kiev.zoral.com.ua> References: <20060817170314.GA17490@peter.osted.lan> <20060818164903.GF20768@deviant.kiev.zoral.com.ua> <20060818.202001.74745664.Tor.Egge@cvsup.no.freebsd.org> <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> <20060822175540.V58720@delplex.bde.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="L/Qt9NZ8t00Dhfad" Content-Disposition: inline In-Reply-To: <20060822175540.V58720@delplex.bde.org> User-Agent: Mutt/1.4.2.2i 
X-Virus-Scanned: ClamAV version 0.88.4, clamav-milter version 0.88.4 on fw.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=1.4 required=5.0 tests=SPF_NEUTRAL, UNPARSEABLE_RELAY autolearn=no version=3.1.4 X-Spam-Level: * X-Spam-Checker-Version: SpamAssassin 3.1.4 (2006-07-25) on fw.zoral.com.ua Cc: freebsd-fs@freebsd.org, tegge@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 13:08:06 -0000 --L/Qt9NZ8t00Dhfad Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Tue, Aug 22, 2006 at 07:13:15PM +1000, Bruce Evans wrote:
> On Mon, 21 Aug 2006, Tor Egge wrote:
>
> >I wrote:
> >
> >>The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or
> >>IN_UPDATE was set on the inode, indicating a write operation
> >>(e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(),
> >>VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that
> >>was not protected by vn_start_write() or vn_start_secondary_write().
> >
> >The most common "write" operation was probably VOP_GETATTR().
>
> Reading the attributes really is a write operation if it causes marks for
> update to be turned into updates.  ufs_inactive() is also a write operation
> if it causes this.  I think marking for update shouldn't require much
> locking.
>
> >ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode
> >flag if IN_ACCESS is set on the inode even if neither IN_CHANGE nor
> >IN_UPDATE is set, transitioning the inode flags into a state where
> >ufs_inactive() calls the blocking variant of vn_start_secondary_write().
>
> In the implementation of marking access times for update on exec, we try
> hard to avoid calling vn_start_write(), etc.  Early implementations did call
> vn_start_write(), etc., and had some bugs from this.  The current
> implementation is mainly for ffs and is rather fragile: VOP_SETATTR()
> depends on callers calling vfs_start_write, but vfs_mark_atime() calls
> VOP_SETATTR() without calling vfs_start_write().  The correctness of this
> depends on the VA_MARK_ATIME case of VOP_SETATTR() not writing any more
> than VOP_READ() would or should.
>
> I think ufs_itimes() shouldn't call vn_start_write() any more than
> ufs_getattr() should.  Callers should be aware that GETATTR may write
> and thus it seems to be necessary for them to call vn_start_write()
> unconditionally.  ufs_itimes() is a utility function that is called
> from places other than ufs_getattr().  The other places are ufs_*close()
> and ufs_setattr().  These don't cause any additional problems.
> VOP_CLOSE() is called from several places which already seem to have
> sufficient locking.  VOP_GETATTR() is called from many places that don't
> call vn_start_write(), starting with vn_stat().
>
> The updates to timestamps in ufs_itimes() and ufs_getattr() are still
> soft (not even delayed writes, but writes to the vnode that will usually
> become delayed writes later).  Perhaps vn_start_write() can be avoided
> for them, but since they are logically writes this might be hard to
> implement correctly.  ufs_inactive() has to do a (possibly delayed) physical
> write to force any updates to disk, so it needs strong locking.
>
> BTW, ufs_itimes() has some possibly related kludges involving changing
> from r/w mounts to r/o mounts.  Some of the marks for update aren't
> handled quite right and are still present after the change.  Then they
> want to be turned into updates on a r/o file system.  This is impossible.
> The problem is handled bogusly, essentially by clearing them without doing
> the update, in a way that triggers some of my local debugging code.

I have a proposal.

1. Remove IN_ACCESS, IN_UPDATE, IN_CHANGE from i_flag. For each flag,
   introduce two new i_ fields, e.g., i_access of type timespec, and
   i_accessed of boolean type.
2. All places that currently set IN_ACCESS would instead increment
   i_accessed using the atomic ops. ufs_itimes shall update i_access under
   some mutex if i_accessed is greater than zero.
3. Check the i_access instead of the IN_ACCESS.
4. ffs_update and ffs_syncvnode shall do the DIP_SET(i_atime) under the
   mutex from #2 before the main run and set IN_MODIFIED accordingly if
   i_accessed is not 0.
5. ufs_getattr shall retrieve the *time from the new i_ fields under the
   mutex from #2 if the corresponding i_ flag is set.

Basically, I want to set the IN_MODIFIED i_flag (induced by IN_ACCESS and
others) only under the exclusive vnode lock. Moreover, i_accessed can be
zeroed only under the exclusive lock.

This way, even a shared lock on the vnode shall be enough to safely update
modification times, and the times are moved to the disk often enough (at
least, at the sync of the syncer vnodes).

Am I missing something obvious there?

I want to hear your opinion before starting to prototype the changes.
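
As an illustration only, the proposal above might be sketched in userland C roughly as follows. This is not UFS code and not a prototype from the thread: struct toy_inode and all toy_* functions are hypothetical names, a pthread mutex stands in for the unnamed mutex from step 2, C11 atomics stand in for the kernel atomic ops, i_accessed is treated as a counter (as step 2 implies), and the caller of toy_syncvnode() is assumed to hold the vnode exclusively. Only the access-time flag and the ffs_syncvnode side of step 4 are modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the few inode fields the proposal touches. */
struct toy_inode {
	pthread_mutex_t	i_time_mtx;	/* the "some mutex" from step 2 */
	atomic_uint	i_accessed;	/* pending access marks, replaces IN_ACCESS */
	struct timespec	i_access;	/* soft access time, guarded by i_time_mtx */
	struct timespec	di_atime;	/* stand-in for the on-disk atime */
	bool		i_modified;	/* stand-in for IN_MODIFIED */
};

void
toy_inode_init(struct toy_inode *ip)
{
	assert(ip != NULL);
	pthread_mutex_init(&ip->i_time_mtx, NULL);
	atomic_init(&ip->i_accessed, 0);
	ip->i_access = (struct timespec){ .tv_sec = 0, .tv_nsec = 0 };
	ip->di_atime = (struct timespec){ .tv_sec = 0, .tv_nsec = 0 };
	ip->i_modified = false;
}

/* Step 2: mark an access without touching i_flag (shared lock is enough). */
void
toy_mark_access(struct toy_inode *ip)
{
	atomic_fetch_add(&ip->i_accessed, 1);
}

/*
 * ufs_itimes analogue; the caller holds i_time_mtx.  Refreshes the soft
 * timestamp when marks are pending and reports whether any were.
 */
static bool
toy_itimes_locked(struct toy_inode *ip, const struct timespec *now)
{
	if (atomic_load(&ip->i_accessed) == 0)
		return (false);
	ip->i_access = *now;
	return (true);
}

/* Step 5, ufs_getattr analogue: never sets the IN_MODIFIED stand-in. */
struct timespec
toy_getattr_atime(struct toy_inode *ip, const struct timespec *now)
{
	struct timespec ts;

	pthread_mutex_lock(&ip->i_time_mtx);
	ts = toy_itimes_locked(ip, now) ? ip->i_access : ip->di_atime;
	pthread_mutex_unlock(&ip->i_time_mtx);
	return (ts);
}

/*
 * Step 4, ffs_syncvnode analogue.  Assumed to run under the exclusive
 * vnode lock; the only place that zeroes i_accessed or dirties the inode.
 */
void
toy_syncvnode(struct toy_inode *ip, const struct timespec *now)
{
	pthread_mutex_lock(&ip->i_time_mtx);
	if (toy_itimes_locked(ip, now)) {
		ip->di_atime = ip->i_access;	/* DIP_SET(i_atime) stand-in */
		ip->i_modified = true;
		atomic_store(&ip->i_accessed, 0);
	}
	pthread_mutex_unlock(&ip->i_time_mtx);
}
```

The point the sketch tries to capture is that toy_getattr_atime() can report an up-to-date access time without ever dirtying the inode, so the state that makes ufs_inactive() block in vn_start_secondary_write() would arise only from the sync path, which already runs inside vn_start_write().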
--L/Qt9NZ8t00Dhfad Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFE6wGeC3+MBN1Mb4gRAnGdAKCPPwG71MitriTA3GePiti0ynpb2ACdFTHl Ox1lfWlTikgax/ltr9iTg+o= =+90j -----END PGP SIGNATURE----- --L/Qt9NZ8t00Dhfad-- From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 13:17:08 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id AA50316A4DE; Tue, 22 Aug 2006 13:17:08 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from fw.zoral.com.ua (fw.zoral.com.ua [213.186.206.134]) by mx1.FreeBSD.org (Postfix) with ESMTP id 532CF43D58; Tue, 22 Aug 2006 13:17:07 +0000 (GMT) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by fw.zoral.com.ua (8.13.4/8.13.4) with ESMTP id k7MDH10Y093926 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 22 Aug 2006 16:17:01 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.13.6/8.13.6) with ESMTP id k7MDH2ur022500; Tue, 22 Aug 2006 16:17:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.13.6/8.13.6/Submit) id k7MDH2NH022499; Tue, 22 Aug 2006 16:17:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 22 Aug 2006 16:17:02 +0300 From: Kostik Belousov To: Bruce Evans Message-ID: <20060822131702.GM56637@deviant.kiev.zoral.com.ua> References: <20060817170314.GA17490@peter.osted.lan> <20060818164903.GF20768@deviant.kiev.zoral.com.ua> <20060818.202001.74745664.Tor.Egge@cvsup.no.freebsd.org> <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> 
<20060822175540.V58720@delplex.bde.org> <20060822130743.GL56637@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="6Mt39TZj+HFMr11E" Content-Disposition: inline In-Reply-To: <20060822130743.GL56637@deviant.kiev.zoral.com.ua> User-Agent: Mutt/1.4.2.2i X-Virus-Scanned: ClamAV version 0.88.4, clamav-milter version 0.88.4 on fw.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=1.9 required=5.0 tests=DNS_FROM_RFC_ABUSE, SPF_NEUTRAL,UNPARSEABLE_RELAY autolearn=no version=3.1.4 X-Spam-Level: * X-Spam-Checker-Version: SpamAssassin 3.1.4 (2006-07-25) on fw.zoral.com.ua Cc: freebsd-fs@freebsd.org, tegge@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 13:17:08 -0000 --6Mt39TZj+HFMr11E Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Tue, Aug 22, 2006 at 04:07:43PM +0300, Kostik Belousov wrote:
> 4. ffs_update and ffs_syncvnode shall do the DIP_SET(i_atime) under the
> mutex from #2 before the main run and set IN_MODIFIED accordingly if
> i_accessed is not 0.
>
ffs_update shall be excluded, only ffs_syncvnode left in the list.
ffs_syncvnode is enclosed in the vn_start_write braces.
--6Mt39TZj+HFMr11E Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFE6wPNC3+MBN1Mb4gRAtP1AJ9MiFeqzHKNL7O8ifm0hnt6dAuG8wCeM3XF MQediFnkExD3QrWuY5uQ5BY= =Na/O -----END PGP SIGNATURE----- --6Mt39TZj+HFMr11E-- From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 14:30:10 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4BCB016A52F; Tue, 22 Aug 2006 14:30:10 +0000 (UTC) (envelope-from tataz@tataz.chchile.org) Received: from smtp1-g19.free.fr (smtp1-g19.free.fr [212.27.42.27]) by mx1.FreeBSD.org (Postfix) with ESMTP id E14C543D73; Tue, 22 Aug 2006 14:30:08 +0000 (GMT) (envelope-from tataz@tataz.chchile.org) Received: from tatooine.tataz.chchile.org (tataz.chchile.org [82.233.239.98]) by smtp1-g19.free.fr (Postfix) with ESMTP id BB7649132B; Tue, 22 Aug 2006 16:30:07 +0200 (CEST) Received: from obiwan.tataz.chchile.org (unknown [192.168.1.25]) by tatooine.tataz.chchile.org (Postfix) with ESMTP id C8A329C46F; Tue, 22 Aug 2006 14:30:44 +0000 (UTC) Received: by obiwan.tataz.chchile.org (Postfix, from userid 1000) id BF0B9405B; Tue, 22 Aug 2006 16:30:44 +0200 (CEST) Date: Tue, 22 Aug 2006 16:30:44 +0200 From: Jeremie Le Hen To: Pawel Jakub Dawidek Message-ID: <20060822143044.GD58048@obiwan.tataz.chchile.org> References: <20060822104516.GB16033@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20060822104516.GB16033@garage.freebsd.pl> User-Agent: Mutt/1.5.12-2006-07-14 Cc: freebsd-fs@FreeBSD.org, zfs-discuss@opensolaris.org, freebsd-current@FreeBSD.org Subject: Re: [fbsd] Porting ZFS file system to FreeBSD. 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 14:30:10 -0000 Hi Pawel, On Tue, Aug 22, 2006 at 12:45:16PM +0200, Pawel Jakub Dawidek wrote: > I started porting the ZFS file system to the FreeBSD operating system. First, thank you for working on this. I must admit I am quite impressed by the amount of work you've achieved this last months, you are really a coding machine. I don't say others have done less, but maybe I am simply not smart enough to estimate their work. I don't know much about ZFS, but Sun states this is a "128 bits" filesystem. How will you handle this in regards to the FreeBSD kernel interface that is already struggling to be 64 bits compliant ? (I'm stating this based on this URL [1], but maybe it's not fully up-to-date.) [1] http://www.freebsd.org/projects/bigdisk/index.html Thank you. Best regards, -- Jeremie Le Hen < jeremie at le-hen dot org >< ttz at chchile dot org > From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 14:36:40 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9768916A4DF; Tue, 22 Aug 2006 14:36:40 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (arm132.internetdsl.tpnet.pl [83.17.198.132]) by mx1.FreeBSD.org (Postfix) with ESMTP id CFA5D43D49; Tue, 22 Aug 2006 14:36:39 +0000 (GMT) (envelope-from pjd@garage.freebsd.pl) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 4C7ED51397; Tue, 22 Aug 2006 16:36:37 +0200 (CEST) Received: from localhost (pjd.wheel.pl [10.0.1.1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 53D8D51391; Tue, 22 Aug 2006 16:36:31 +0200 (CEST) 
Date: Tue, 22 Aug 2006 16:36:19 +0200 From: Pawel Jakub Dawidek To: Jeremie Le Hen Message-ID: <20060822143619.GG16033@garage.freebsd.pl> References: <20060822104516.GB16033@garage.freebsd.pl> <20060822143044.GD58048@obiwan.tataz.chchile.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="KIzF6Cje4W/osXrF" Content-Disposition: inline In-Reply-To: <20060822143044.GD58048@obiwan.tataz.chchile.org> X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 7.0-CURRENT i386 User-Agent: mutt-ng/devel-r804 (FreeBSD) X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-5.9 required=3.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.0.4 Cc: freebsd-fs@FreeBSD.org, zfs-discuss@opensolaris.org, freebsd-current@FreeBSD.org Subject: Re: [fbsd] Porting ZFS file system to FreeBSD. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 14:36:40 -0000 --KIzF6Cje4W/osXrF Content-Type: text/plain; charset=iso-8859-2 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote:
> I don't know much about ZFS, but Sun states this is a "128 bits"
> filesystem. How will you handle this in regards to the FreeBSD
> kernel interface that is already struggling to be 64 bits
> compliant ? (I'm stating this based on this URL [1], but maybe
> it's not fully up-to-date.)

128 bits is not my goal, but I do want all the other goodies:)

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
--KIzF6Cje4W/osXrF Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.4 (FreeBSD) iD8DBQFE6xZjForvXbEpPzQRAoyyAJ9C7jHBvUhrZ/nwBxF84+ir/IiETgCeJPsn ROYooOKJTwdQhVzbyLYRTh0= =tWfl -----END PGP SIGNATURE----- --KIzF6Cje4W/osXrF-- From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 14:39:28 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id F3FD916A4DE for ; Tue, 22 Aug 2006 14:39:27 +0000 (UTC) (envelope-from dan.cojocar@gmail.com) Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.190]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9E61743D5C for ; Tue, 22 Aug 2006 14:39:26 +0000 (GMT) (envelope-from dan.cojocar@gmail.com) Received: by nf-out-0910.google.com with SMTP id n29so76052nfc for ; Tue, 22 Aug 2006 07:39:25 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=mcyDluvnLR8ACpfRNJer5wM0oGj3EiXr2aqVUczxcRsH7HLuKmWk5AERd80Ir5mo8CmIjaJPD6/nU+hdri5EVh8Yj0RtUjJfNojweB8Vza0pks9TszFWfiulTuOKjiTY4gU+lIMV/pCbSs/oygdIDjRwHFx+pV6xMgOmZ6SR6D8= Received: by 10.49.94.20 with SMTP id w20mr505834nfl; Tue, 22 Aug 2006 07:39:25 -0700 (PDT) Received: by 10.78.150.6 with HTTP; Tue, 22 Aug 2006 07:39:25 -0700 (PDT) Message-ID: Date: Tue, 22 Aug 2006 17:39:25 +0300 From: "Dan Cojocar" To: "Pawel Jakub Dawidek" In-Reply-To: <20060822104516.GB16033@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20060822104516.GB16033@garage.freebsd.pl> Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org Subject: Re: Porting ZFS file system to FreeBSD. 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 14:39:28 -0000 On 8/22/06, Pawel Jakub Dawidek wrote: > Hi. > > I started porting the ZFS file system to the FreeBSD operating system. > > There is a lot to do, but I'm making good progress, I think. > > I'm doing my work in those directories: > > contrib/opensolaris/ - userland files taken directly from > OpenSolaris (libzfs, zpool, zfs and others) > > sys/contrib/opensolaris/ - kernel files taken directly from > OpenSolaris (zfs, taskq, callb and others) > > compat/opensolaris/ - compatibility userland layer, so I can > reduce diffs against vendor files > > sys/compat/opensolaris/ - compatibility kernel layer, so I can > reduce diffs against vendor files (kmem based on > malloc(9) and uma(9), mutexes based on our sx(9) locks, > condvars based on sx(9) locks and more) > > cddl/ - FreeBSD specific makefiles for userland bits > > sys/modules/zfs/ - FreeBSD specific makefile for the kernel > module > > You can find all those on FreeBSD perforce server: > > http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/pjd/zfs&HIDEDEL=NO > > Ok, so where am I? > > I ported the userland bits (libzfs, zfs and zpool). I had ztest and > libzpool compiling and working as well, but I left them behind for now > to focus on kernel bits. > > I'm building in all (except 2) files into zfs.ko (kernel module). > > I created new VDEV - vdev_geom, which fits to FreeBSD's GEOM > infrastructure, so basically you can use any GEOM provider to build your > ZFS pool. VDEV_GEOM is implemented as consumers-only GEOM class. > > I reimplemented ZVOL to also export storage as GEOM provider. This time > it is providers-only GEOM class. > > This way one can create for example RAID-Z on top of GELI encrypted > disks or encrypt ZFS volume. The order is free. 
> Basically you can put UFS on ZFS volumes already and it behaves really > stable even under heavy load. > > Currently I'm working on file system bits (ZPL), which is the most hard > part of the entire ZFS port, because it talks to one of the most complex > part of the FreeBSD kernel - VFS. > > I can already mount ZFS-created file systems (with 'zfs create' > command), create files/directories, change permissions/owner/etc., list > directories content, and perform few other minor operation. > > Some "screenshots": > > lcf:root:~# uname -a > FreeBSD lcf 7.0-CURRENT FreeBSD 7.0-CURRENT #74: Tue Aug 22 03:04:01 UTC 2006 root@lcf:/usr/obj/zoo/pjd/lcf/sys/LCF i386 > > lcf:root:~# zpool create tank raidz /dev/ad4a /dev/ad6a /dev/ad5a > > lcf:root:~# zpool list > NAME SIZE USED AVAIL CAP HEALTH ALTROOT > tank 35,8G 11,7M 35,7G 0% ONLINE - > > lcf:root:~# zpool status > pool: tank > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > tank ONLINE 0 0 0 > raidz1 ONLINE 0 0 0 > ad4a ONLINE 0 0 0 > ad6a ONLINE 0 0 0 > ad5a ONLINE 0 0 0 > > errors: No known data errors > > lcf:root:# zfs create -V 10g tank/vol > lcf:root:# newfs /dev/zvol/tank/vol > lcf:root:# mount /dev/zvol/tank/vol /mnt/test > > lcf:root:# zfs create tank/fs > > lcf:root:~# mount -t zfs,ufs > tank on /tank (zfs, local) > tank/fs on /tank/fs (zfs, local) > /dev/zvol/tank/vol on /mnt/test (ufs, local) > > lcf:root:~# df -ht zfs,ufs > Filesystem Size Used Avail Capacity Mounted on > tank 13G 34K 13G 0% /tank > tank/fs 13G 33K 13G 0% /tank/fs > /dev/zvol/tank/vol 9.7G 4.0K 8.9G 0% /mnt/test > > lcf:root:~# mkdir /tank/fs/foo > lcf:root:~# touch /tank/fs/foo/bar > lcf:root:~# chown root:operator /tank/fs/foo /tank/fs/foo/bar > lcf:root:~# chmod 500 /tank/fs/foo > lcf:root:~# ls -ld /tank/fs/foo /tank/fs/foo/bar > dr-x------ 2 root operator 3 22 sie 05:41 /tank/fs/foo > -rw-r--r-- 1 root operator 0 22 sie 05:42 /tank/fs/foo/bar > > The most important missing pieces: > - Most of the 
ZPL layer. > - Autoconfiguration. I need implement vdev discovery based on GEOM's taste > mechanism. > - .zfs/ control directory (entirely commented out for now). > And many more, but hey, this is after 10 days of work. > > PS. Please contact me privately if your company would like to donate to the > ZFS effort. Even without sponsorship the work will be finished, but > your contributions will allow me to spend more time working on ZFS. > > -- > Pawel Jakub Dawidek http://www.wheel.pl > pjd@FreeBSD.org http://www.FreeBSD.org > FreeBSD committer Am I Evil? Yes, I Am! > > > Hello Pawel, Thank you for your work. When will you release a patch so we can test this? Thanks, Dan From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 14:42:44 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 43C4116A4DD; Tue, 22 Aug 2006 14:42:44 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (arm132.internetdsl.tpnet.pl [83.17.198.132]) by mx1.FreeBSD.org (Postfix) with ESMTP id 70EAA43D6A; Tue, 22 Aug 2006 14:42:43 +0000 (GMT) (envelope-from pjd@garage.freebsd.pl) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 40F645139A; Tue, 22 Aug 2006 16:42:42 +0200 (CEST) Received: from localhost (pjd.wheel.pl [10.0.1.1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 90F975133B; Tue, 22 Aug 2006 16:42:37 +0200 (CEST) Date: Tue, 22 Aug 2006 16:42:26 +0200 From: Pawel Jakub Dawidek To: Dan Cojocar Message-ID: <20060822144226.GH16033@garage.freebsd.pl> References: <20060822104516.GB16033@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="MGu/vTNewDGZ7tmp" Content-Disposition: inline In-Reply-To: X-PGP-Key-URL: 
http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 7.0-CURRENT i386 User-Agent: mutt-ng/devel-r804 (FreeBSD) X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-5.9 required=3.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.0.4 Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org Subject: Re: Porting ZFS file system to FreeBSD. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 14:42:44 -0000 --MGu/vTNewDGZ7tmp Content-Type: text/plain; charset=iso-8859-2 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Tue, Aug 22, 2006 at 05:39:25PM +0300, Dan Cojocar wrote:
> When will you release a patch so we can test this?

Sure, when it will be ready for testing:)

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
--MGu/vTNewDGZ7tmp Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.4 (FreeBSD) iD8DBQFE6xfSForvXbEpPzQRAvrRAJ0cbwhdsdWys02SJ4g/faSH5YYxcACfWA3T pBWI4gm4trHAtVBhae376JU= =MoM7 -----END PGP SIGNATURE----- --MGu/vTNewDGZ7tmp-- From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 14:43:02 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2351116A506; Tue, 22 Aug 2006 14:43:02 +0000 (UTC) (envelope-from Michael.Schuster@Sun.COM) Received: from gmpea-pix-1.sun.com (gmpea-pix-1.sun.com [192.18.1.36]) by mx1.FreeBSD.org (Postfix) with ESMTP id CBA9A43D7B; Tue, 22 Aug 2006 14:43:00 +0000 (GMT) (envelope-from Michael.Schuster@Sun.COM) Received: from d1-emea-09.sun.com ([192.18.2.119]) by gmpea-pix-1.sun.com (8.13.6+Sun/8.12.9) with ESMTP id k7MEgx1J020119; Tue, 22 Aug 2006 15:42:59 +0100 (BST) Received: from conversion-daemon.d1-emea-09.sun.com by d1-emea-09.sun.com (Sun Java System Messaging Server 6.2-4.02 (built Sep 9 2005)) id <0J4E00E01M486A00@d1-emea-09.sun.com> (original mail from Michael.Schuster@Sun.COM); Tue, 22 Aug 2006 15:42:59 +0100 (BST) Received: from [129.157.133.195] by d1-emea-09.sun.com (Sun Java System Messaging Server 6.2-4.02 (built Sep 9 2005)) with ESMTPSA id <0J4E00M5ZM7L3E1S@d1-emea-09.sun.com>; Tue, 22 Aug 2006 15:42:58 +0100 (BST) Date: Tue, 22 Aug 2006 16:42:57 +0200 From: Michael Schuster - Sun Microsystems In-reply-to: <20060822143619.GG16033@garage.freebsd.pl> Sender: Michael.Schuster@Sun.COM To: Pawel Jakub Dawidek Message-id: <44EB17F1.7070407@sun.com> MIME-version: 1.0 Content-type: text/plain; format=flowed; charset=ISO-8859-1 Content-transfer-encoding: 7BIT References: <20060822104516.GB16033@garage.freebsd.pl> <20060822143044.GD58048@obiwan.tataz.chchile.org> <20060822143619.GG16033@garage.freebsd.pl> User-Agent: Thunderbird 
1.5.0.2 (X11/20060602) Cc: freebsd-fs@FreeBSD.org, zfs-discuss@opensolaris.org, freebsd-current@FreeBSD.org, Jeremie Le Hen Subject: Re: [zfs-discuss] Re: [fbsd] Porting ZFS file system to FreeBSD. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 14:43:02 -0000 Pawel Jakub Dawidek wrote: > On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote: >> I don't know much about ZFS, but Sun states this is a "128 bits" >> filesystem. How will you handle this in regards to the FreeBSD >> kernel interface that is already struggling to be 64 bits >> compliant ? (I'm stating this based on this URL [1], but maybe >> it's not fully up-to-date.) > > 128 bits is not my goal, but I do want all the other goodies:) are you going to attempt on-disk compatibility? Michael -- Michael Schuster +49 89 46008-2974 / x62974 visit the online support center: http://www.sun.com/osc/ Recursion, n.: see 'Recursion' From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 14:45:57 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1766C16A4DE; Tue, 22 Aug 2006 14:45:57 +0000 (UTC) (envelope-from kris@obsecurity.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.FreeBSD.org (Postfix) with ESMTP id 69DAB43D77; Tue, 22 Aug 2006 14:45:56 +0000 (GMT) (envelope-from kris@obsecurity.org) Received: from obsecurity.dyndns.org (elvis.mu.org [192.203.228.196]) by elvis.mu.org (Postfix) with ESMTP id 4CBCE1A3C2F; Tue, 22 Aug 2006 07:45:56 -0700 (PDT) Received: by obsecurity.dyndns.org (Postfix, from userid 1000) id AB4A951305; Tue, 22 Aug 2006 10:45:55 -0400 (EDT) Date: Tue, 22 Aug 2006 10:45:55 -0400 From: Kris Kennaway To: Jeremie Le Hen Message-ID: 
<20060822144555.GA74174@xor.obsecurity.org> References: <20060822104516.GB16033@garage.freebsd.pl> <20060822143044.GD58048@obiwan.tataz.chchile.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="liOOAslEiF7prFVr" Content-Disposition: inline In-Reply-To: <20060822143044.GD58048@obiwan.tataz.chchile.org> User-Agent: Mutt/1.4.2.2i Cc: freebsd-fs@FreeBSD.org, zfs-discuss@opensolaris.org, freebsd-current@FreeBSD.org, Pawel Jakub Dawidek Subject: Re: [fbsd] Porting ZFS file system to FreeBSD. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 14:45:57 -0000 --liOOAslEiF7prFVr Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote: > Hi Pawel, > > On Tue, Aug 22, 2006 at 12:45:16PM +0200, Pawel Jakub Dawidek wrote: > > I started porting the ZFS file system to the FreeBSD operating system. > > First, thank you for working on this. I must admit I am quite impressed > by the amount of work you've achieved this last months, you are really a > coding machine. I don't say others have done less, but maybe I am > simply not smart enough to estimate their work. > > I don't know much about ZFS, but Sun states this is a "128 bits" > filesystem. How will you handle this in regards to the FreeBSD > kernel interface that is already struggling to be 64 bits > compliant ? (I'm stating this based on this URL [1], but maybe > it's not fully up-to-date.) > > [1] http://www.freebsd.org/projects/bigdisk/index.html Actually if you read that URL then it shows that it's only some userland tools + secondary UFS features that do not handle 64-bit filesystem sizes in FreeBSD. 
Kris --liOOAslEiF7prFVr Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFE6xijWry0BWjoQKURArtHAKDNVYsqnCh1wI8hPHQbSMl8DW6VcgCgtVM1 ooJ/g3rG17jXaAgGnlRcKXE= =/iEZ -----END PGP SIGNATURE----- --liOOAslEiF7prFVr-- From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 15:11:11 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8ECE416A4DA; Tue, 22 Aug 2006 15:11:11 +0000 (UTC) (envelope-from eschrock@zion.eng.sun.com) Received: from brmea-mail-2.sun.com (brmea-mail-2.Sun.COM [192.18.98.43]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2651643D72; Tue, 22 Aug 2006 15:11:07 +0000 (GMT) (envelope-from eschrock@zion.eng.sun.com) Received: from engmail3mpk.sfbay.Sun.COM ([129.146.11.26]) by brmea-mail-2.sun.com (8.13.6+Sun/8.12.9) with ESMTP id k7MFB7Gm013675; Tue, 22 Aug 2006 09:11:07 -0600 (MDT) Received: from zion.eng.sun.com (zion.SFBay.Sun.COM [129.146.17.75]) by engmail3mpk.sfbay.Sun.COM (8.13.6+Sun/8.13.6/ENSMAIL,v2.2) with ESMTP id k7MFB7IP007028; Tue, 22 Aug 2006 08:11:07 -0700 (PDT) Received: from zion.eng.sun.com (localhost [127.0.0.1]) by zion.eng.sun.com (8.13.7+Sun/8.13.7) with ESMTP id k7MFB77t014822; Tue, 22 Aug 2006 08:11:07 -0700 (PDT) Received: (from eschrock@localhost) by zion.eng.sun.com (8.13.7+Sun/8.13.7/Submit) id k7MFB7Sf014821; Tue, 22 Aug 2006 08:11:07 -0700 (PDT) Date: Tue, 22 Aug 2006 08:11:07 -0700 From: Eric Schrock To: Michael Schuster - Sun Microsystems Message-ID: <20060822151107.GA13426@eng.sun.com> References: <20060822104516.GB16033@garage.freebsd.pl> <20060822143044.GD58048@obiwan.tataz.chchile.org> <20060822143619.GG16033@garage.freebsd.pl> <44EB17F1.7070407@sun.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <44EB17F1.7070407@sun.com> User-Agent: 
Mutt/1.4.2.1i Cc: freebsd-fs@FreeBSD.org, Jeremie Le Hen , zfs-discuss@opensolaris.org, freebsd-current@FreeBSD.org, Pawel Jakub Dawidek Subject: Re: [zfs-discuss] Re: [fbsd] Porting ZFS file system to FreeBSD. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 15:11:11 -0000 On Tue, Aug 22, 2006 at 04:42:57PM +0200, Michael Schuster - Sun Microsystems wrote: > Pawel Jakub Dawidek wrote: > >On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote: > >>I don't know much about ZFS, but Sun states this is a "128 bits" > >>filesystem. How will you handle this in regards to the FreeBSD > >>kernel interface that is already struggling to be 64 bits > >>compliant ? (I'm stating this based on this URL [1], but maybe > >>it's not fully up-to-date.) > > > >128 bits is not my goal, but I do want all the other goodies:) > > are you going to attempt on-disk compatibility? Please note that the '128-bitness' of ZFS currently only comes into play in the on-disk format, and the allowed size of the storage pool. This should be very easy to maintain compatibility with. However, each filesystem is currently limited to 64-bits, due largely to the lack of 128-bit support in the POSIX interfaces. So there's very little 128-bit code floating around except at the SPA layer, and as long as you have an unsigned 64-bit type there shouldn't be any problems at higher layers. 
- Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 16:02:17 2006 Return-Path: X-Original-To: freebsd-fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DFF5D16A4DD; Tue, 22 Aug 2006 16:02:17 +0000 (UTC) (envelope-from Mark.Maybee@Sun.COM) Received: from brmea-mail-4.sun.com (brmea-mail-4.Sun.COM [192.18.98.36]) by mx1.FreeBSD.org (Postfix) with ESMTP id EDE1C43D6D; Tue, 22 Aug 2006 16:02:11 +0000 (GMT) (envelope-from Mark.Maybee@Sun.COM) Received: from fe-amer-09.sun.com ([192.18.108.183]) by brmea-mail-4.sun.com (8.13.6+Sun/8.12.9) with ESMTP id k7MG2B65027098; Tue, 22 Aug 2006 10:02:11 -0600 (MDT) Received: from conversion-daemon.mail-amer.sun.com by mail-amer.sun.com (Sun Java System Messaging Server 6.2-4.02 (built Sep 9 2005)) id <0J4E00B01PT71X00@mail-amer.sun.com> (original mail from Mark.Maybee@Sun.COM); Tue, 22 Aug 2006 10:02:11 -0600 (MDT) Received: from [192.168.0.100] ([199.45.247.21]) by mail-amer.sun.com (Sun Java System Messaging Server 6.2-4.02 (built Sep 9 2005)) with ESMTPSA id <0J4E00ESKPVLSUZN@mail-amer.sun.com>; Tue, 22 Aug 2006 10:02:11 -0600 (MDT) Date: Tue, 22 Aug 2006 10:02:09 -0600 From: Mark Maybee In-reply-to: <44EB17F1.7070407@sun.com> Sender: Mark.Maybee@Sun.COM To: Pawel Jakub Dawidek Message-id: <44EB2A81.9050300@sun.com> MIME-version: 1.0 Content-type: text/plain; format=flowed; charset=us-ascii Content-transfer-encoding: 7BIT X-Accept-Language: en-us, en References: <20060822104516.GB16033@garage.freebsd.pl> <20060822143044.GD58048@obiwan.tataz.chchile.org> <20060822143619.GG16033@garage.freebsd.pl> <44EB17F1.7070407@sun.com> User-Agent: Mozilla/5.0 (X11; U; SunOS i86pc; en-US; rv:1.7) Gecko/20051027 Cc: freebsd-fs@FreeBSD.org, zfs-discuss@opensolaris.org, freebsd-current@FreeBSD.org, Jeremie Le Hen , Michael Schuster - Sun Microsystems Subject: Re: 
[zfs-discuss] Re: [fbsd] Porting ZFS file system to FreeBSD. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 16:02:18 -0000 Michael Schuster - Sun Microsystems wrote: > Pawel Jakub Dawidek wrote: > >> On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote: >> >>> I don't know much about ZFS, but Sun states this is a "128 bits" >>> filesystem. How will you handle this in regards to the FreeBSD >>> kernel interface that is already struggling to be 64 bits >>> compliant ? (I'm stating this based on this URL [1], but maybe >>> it's not fully up-to-date.) >> >> >> 128 bits is not my goal, but I do want all the other goodies:) > > > are you going to attempt on-disk compatibility? > > Michael Amazing work Pawel! Please do try to maintain on-disk compatibility! Let us know if you run into anything that might prevent that (or any other issues that you run across). 
-Mark From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 20:35:16 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5CADD16A4DA; Tue, 22 Aug 2006 20:35:16 +0000 (UTC) (envelope-from anderson@centtech.com) Received: from mh2.centtech.com (moat3.centtech.com [207.200.51.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id EE31143D45; Tue, 22 Aug 2006 20:35:15 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220]) by mh2.centtech.com (8.13.1/8.13.1) with ESMTP id k7MKZE9w094480; Tue, 22 Aug 2006 15:35:15 -0500 (CDT) (envelope-from anderson@centtech.com) Message-ID: <44EB6A81.20307@centtech.com> Date: Tue, 22 Aug 2006 15:35:13 -0500 From: Eric Anderson User-Agent: Thunderbird 1.5.0.5 (X11/20060802) MIME-Version: 1.0 To: freebsd-fs@freebsd.org, FreeBSD Hackers Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.87.1/1709/Tue Aug 22 14:34:50 2006 on mh2.centtech.com X-Virus-Status: Clean Cc: Subject: devfs related panic info X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 20:35:16 -0000 While removing a tape device from an FC fabric, I got a nice panic. I have screen captures posted here: http://www.googlebit.com/freebsd/snapshots/devfs_panic/ This is 6-STABLE (amd64) as of about a week ago. Sorry for the screen captures - that's all I had at the time. I do have a vmcore sitting around now though. Anything else I can provide? Eric -- ------------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Anything that works is better than anything that doesn't. 
------------------------------------------------------------------------ From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 21:47:18 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B629516A4E0 for ; Tue, 22 Aug 2006 21:47:18 +0000 (UTC) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from pil.idi.ntnu.no (pil.idi.ntnu.no [129.241.107.93]) by mx1.FreeBSD.org (Postfix) with ESMTP id 278CC43D45 for ; Tue, 22 Aug 2006 21:47:17 +0000 (GMT) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from cvsup.no.freebsd.org (c2h5oh.idi.ntnu.no [129.241.103.69]) by pil.idi.ntnu.no (8.13.6/8.13.1) with ESMTP id k7MLlFcw017912 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Tue, 22 Aug 2006 23:47:15 +0200 (MEST) Received: from localhost (localhost [127.0.0.1]) by cvsup.no.freebsd.org (8.13.4/8.13.4) with ESMTP id k7MLlEnW056180; Tue, 22 Aug 2006 21:47:14 GMT (envelope-from Tor.Egge@cvsup.no.freebsd.org) Date: Tue, 22 Aug 2006 21:46:38 +0000 (UTC) Message-Id: <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> To: kostikbel@gmail.com From: Tor Egge In-Reply-To: <20060822130743.GL56637@deviant.kiev.zoral.com.ua> References: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> <20060822175540.V58720@delplex.bde.org> <20060822130743.GL56637@deviant.kiev.zoral.com.ua> X-Mailer: Mew version 3.3 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Virus-Scanned-By: mimedefang.idi.ntnu.no, using CLAMD X-SMTP-From: Sender=, Relay/Client=c2h5oh.idi.ntnu.no [129.241.103.69], EHLO=cvsup.no.freebsd.org X-Scanned-By: MIMEDefang 2.48 on 129.241.107.38 X-Scanned-By: mimedefang.idi.ntnu.no, using MIMEDefang 2.48 with local filter 16.42-idi X-Filter-Time: 1 seconds Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and 
snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 21:47:18 -0000 > I have a proposal. > > 1. Remove IN_ACCESS, IN_UPDATE, IN_CHANGE from i_flag. For each flag, > introduce two new i_ fields, e.g., i_access of type timespec, and > i_accessed of boolean type. On amd64, sizeof(struct timespec) is 16 bytes and sizeof(struct boolean_t) is 4 bytes. 3 * (16 + 4) = 60 bytes extra per inode. With 100K inodes that becomes 6 MB extra memory. I don't see why all these extra fields are needed. > 2. All places that currently set IN_ACCESS, instead would increment > i_accessed using the atomic ops. ufs_itimes shall update i_access under some > mutex if i_accessed is greater than zero. Protecting the existing i_flag and the timestamps with the vnode interlock when the current thread only has a shared vnode lock should be sufficient to protect against the races, removing the need for #3, #4 and #4 below. What's left is avoiding setting IN_MODIFIED when it's unsafe, to protect against the deadlock. > 3. Check the i_access instead of the IN_ACCESS. > 4. ffs_update and ffs_syncvnode shall do the DIP_SET(i_atime) under the mutex > from #2 before the main run and set IN_MODIFIED accordingly if i_accessed is > not 0. > 4. ufs_getattr shall retrieve the *time from new i_ fields under the mutex > from #2 if corresponding i_ flag is set. > Basically, I want to set IN_MODIFIED i_flag (induced by IN_ACCESS and others) > only under exclusive vnode lock. Moreover, i_accessed can be zeroed only > under exclusive lock. This way, even shared lock on the vnode shall be enough > to safely update modification times, and the times are moved to the disk > often enough (at least, at the sync of the syncer vnodes). An exclusive vnode lock isn't needed, see above. 
Holding an exclusive vnode lock does not make it safe to set IN_MODIFIED. There are some constraints with regards to setting IN_MODIFIED on an inode. If neither IN_CHANGE nor IN_UPDATE is set then it might be unsafe to set IN_MODIFIED since the file system might be suspended or in the process of being suspended with the vnode sync loop in ffs_sync() having iterated past the vnode. If the file system is suspended then IN_MODIFIED cannot be set. If IN_MODIFIED, IN_CHANGE or IN_UPDATE is set and the file system is suspended then something is wrong. If the file system is in the process of being suspended then IN_MODIFIED can be set at the cost of triggering a restart of the vnode sync loop in ffs_sync(). If either IN_MODIFIED, IN_CHANGE or IN_UPDATE is already set then the vnode sync loop has not reached the vnode, and a restart isn't needed. When ufs_itimes() cannot set IN_MODIFIED then it has to either risk losing the access time update or use some mechanism to defer it (e.g. set IN_LAZYMOD or a new flag and let process_deferred_inactive() set IN_MODIFIED after the file system has been resumed). 
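The constraints above can be read as a small decision table. What follows is only an illustrative user-space sketch, not kernel code and not the eventual patch: the fs_state and itimes_action enums and the itimes_decide() name are hypothetical, and the flag values are assumed here for illustration (check sys/ufs/ufs/inode.h for the real ones).

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values assumed for illustration only; see sys/ufs/ufs/inode.h. */
#define IN_ACCESS   0x0001
#define IN_CHANGE   0x0002
#define IN_UPDATE   0x0004
#define IN_MODIFIED 0x0008

/* Hypothetical summary of the mount point's suspension state. */
enum fs_state { FS_RUNNING, FS_SUSPENDING, FS_SUSPENDED };

/* Hypothetical outcomes for a caller that wants to set IN_MODIFIED. */
enum itimes_action {
	ACT_SET_MODIFIED,         /* safe to set IN_MODIFIED */
	ACT_SET_MODIFIED_RESTART, /* may set it, but the ffs_sync() vnode loop must restart */
	ACT_DEFER,                /* must not set it: defer or risk losing the atime update */
	ACT_INCONSISTENT          /* dirty flags on a suspended fs: something is wrong */
};

enum itimes_action
itimes_decide(enum fs_state state, int i_flag)
{
	/* Any of these flags means the sync loop has not passed this vnode yet. */
	bool dirty = (i_flag & (IN_MODIFIED | IN_CHANGE | IN_UPDATE)) != 0;

	assert(state == FS_RUNNING || state == FS_SUSPENDING ||
	    state == FS_SUSPENDED);

	switch (state) {
	case FS_SUSPENDED:
		return (dirty ? ACT_INCONSISTENT : ACT_DEFER);
	case FS_SUSPENDING:
		return (dirty ? ACT_SET_MODIFIED : ACT_SET_MODIFIED_RESTART);
	default:
		return (ACT_SET_MODIFIED);
	}
}
```

In the suspending case an implementation could equally choose to defer instead of restarting the sync loop; the sketch only records what is permitted, not which trade-off is better.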
- Tor Egge From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 04:40:55 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C7ACB16A4DD for ; Wed, 23 Aug 2006 04:40:55 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from fw.zoral.com.ua (fw.zoral.com.ua [213.186.206.134]) by mx1.FreeBSD.org (Postfix) with ESMTP id 16E4C43D45 for ; Wed, 23 Aug 2006 04:40:54 +0000 (GMT) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by fw.zoral.com.ua (8.13.4/8.13.4) with ESMTP id k7N4ej2v018159 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 23 Aug 2006 07:40:45 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.13.6/8.13.6) with ESMTP id k7N4eksB094117; Wed, 23 Aug 2006 07:40:46 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.13.6/8.13.6/Submit) id k7N4ehE4094116; Wed, 23 Aug 2006 07:40:43 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 23 Aug 2006 07:40:43 +0300 From: Kostik Belousov To: Tor Egge Message-ID: <20060823044043.GA64800@deviant.kiev.zoral.com.ua> References: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> <20060822175540.V58720@delplex.bde.org> <20060822130743.GL56637@deviant.kiev.zoral.com.ua> <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Nq2Wo0NMKNjxTN9z" Content-Disposition: inline In-Reply-To: <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> User-Agent: Mutt/1.4.2.2i X-Virus-Scanned: ClamAV 
version 0.88.4, clamav-milter version 0.88.4 on fw.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=1.9 required=5.0 tests=DNS_FROM_RFC_ABUSE, SPF_NEUTRAL,UNPARSEABLE_RELAY autolearn=no version=3.1.4 X-Spam-Level: * X-Spam-Checker-Version: SpamAssassin 3.1.4 (2006-07-25) on fw.zoral.com.ua Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 04:40:56 -0000 --Nq2Wo0NMKNjxTN9z Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Tue, Aug 22, 2006 at 09:46:38PM +0000, Tor Egge wrote: > > 2. All places that currently set IN_ACCESS, instead would increment > > i_accessed using the atomic ops. ufs_itimes shall update i_access > > under some mutex if i_accessed is greater than zero. > > Protecting the existing i_flag and the timestamps with the vnode > interlock when the current thread only has a shared vnode lock should > be sufficient to protect against the races, removing the need for #3, > #4 and #4 below. > > What's left is avoiding setting IN_MODIFIED when it's unsafe, to > protect against the deadlock. So, I will do the following: 1. Protect both setting and reading inode times and i_flag with vnode interlock. This shall be done through all the sys/ufs/*/* code. 2. Modify ufs_itimes: > If neither IN_CHANGE nor IN_UPDATE is set then it might be unsafe to > set IN_MODIFIED since the file system might be suspended or in the > process of being suspended with the vnode sync loop in ffs_sync() > having iterated past the vnode. In other words, if IN_CHANGE or IN_UPDATE are already set, I can safely convert IN_ACCESS into IN_MOD. Otherwise, I shall implement the algorithm below. Suspending/suspended checks need to take MNT_ILOCK. > > If the file system is suspended then IN_MODIFIED cannot be set. 
If > IN_MODIFIED, IN_CHANGE or IN_UPDATE is set and the file system is > suspended then something is wrong. > > If the file system is in the process of being suspended then > IN_MODIFIED can be set at the cost of triggering a restart of the > vnode sync loop in ffs_sync(). If either IN_MODIFIED, IN_CHANGE or > IN_UPDATE is already set then the vnode sync loop has not reached the > vnode, and a restart isn't needed. > > When ufs_itimes() cannot set IN_MODIFIED then it has to either risk > losing the access time update or use some mechanism to defer it (e.g. > set IN_LAZYMOD or a new flag and let process_deferred_inactive() set > IN_MODIFIED after the file system has been resumed). > BTW, shall the test for MNT_RDONLY in the ufs_itimes moved earlier ? 3. Add the process_deferred_lazymod procedure, called from ffs_snapshot before proc_deferred_inactive, that shall convert IN_LAZYMOD | IN_ACCESS into IN_MODIFIED. To be safe, the proc_def_lazymod needs vn_start_write braces. --Nq2Wo0NMKNjxTN9z Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFE69xLC3+MBN1Mb4gRAqsrAKCEh5Tb/vYlSAXLGErJJP6AeE6H0ACcCN/O gDmX+uH3k1p+6QZWJcOY+nk= =hPmb -----END PGP SIGNATURE----- --Nq2Wo0NMKNjxTN9z-- From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 06:29:28 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8EA4316A4DF for ; Wed, 23 Aug 2006 06:29:28 +0000 (UTC) (envelope-from dudu@dudu.ro) Received: from wr-out-f131.google.com (wr-out-f131.google.com [64.233.184.131]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1AE6243D60 for ; Wed, 23 Aug 2006 06:29:27 +0000 (GMT) (envelope-from dudu@dudu.ro) Received: by wr-out-f131.google.com with SMTP id c5so129747wra for ; Tue, 22 Aug 2006 23:29:27 -0700 (PDT) Received: by 10.65.240.17 with SMTP id s17mr9676760qbr; Tue, 22 
Aug 2006 23:29:27 -0700 (PDT) Received: by 10.65.225.6 with HTTP; Tue, 22 Aug 2006 23:29:27 -0700 (PDT) Message-ID: Date: Wed, 23 Aug 2006 09:29:27 +0300 From: "Vlad GALU" To: freebsd-fs@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline Subject: fdescfs/devfs X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 06:29:28 -0000 Is it just a wrong impression on my side, or devfs really supersedes fdescfs ? I've been using /dev/fd/* for a while having the impression I had fdescfs mounted on /dev/fd, and I hadn't. I didn't see any differences whatsoever in terms of usage. -- If it's there, and you can see it, it's real. If it's not there, and you can see it, it's virtual. If it's there, and you can't see it, it's transparent. If it's not there, and you can't see it, you erased it. 
From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 07:38:23 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0C5B616A4DA for ; Wed, 23 Aug 2006 07:38:23 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id A7ED943D45 for ; Wed, 23 Aug 2006 07:38:22 +0000 (GMT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id 210C9170C6; Wed, 23 Aug 2006 07:38:20 +0000 (UTC) To: "Vlad GALU" From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 23 Aug 2006 09:29:27 +0300." Date: Wed, 23 Aug 2006 07:38:20 +0000 Message-ID: <75690.1156318700@critter.freebsd.dk> Cc: freebsd-fs@freebsd.org Subject: Re: fdescfs/devfs X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 07:38:23 -0000 In message , "Vlad GALU" writes: > Is it just a wrong impression on my side, or devfs really >supersedes fdescfs ? I've been using /dev/fd/* for a while having the >impression I had fdescfs mounted on /dev/fd, and I hadn't. I didn't >see any differences whatsoever in terms of usage. It's not really devfs, it's kern_descrip.c and it only implements /dev/fd[0-2] and the /dev/std{in,out,err} aliases. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 07:42:32 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4D6AB16A4DD for ; Wed, 23 Aug 2006 07:42:32 +0000 (UTC) (envelope-from dudu@dudu.ro) Received: from wr-out-f131.google.com (wr-out-f131.google.com [64.233.184.131]) by mx1.FreeBSD.org (Postfix) with ESMTP id E4BBF43D45 for ; Wed, 23 Aug 2006 07:42:31 +0000 (GMT) (envelope-from dudu@dudu.ro) Received: by wr-out-f131.google.com with SMTP id 20so116239wra for ; Wed, 23 Aug 2006 00:42:31 -0700 (PDT) Received: by 10.65.139.9 with SMTP id r9mr29086qbn; Wed, 23 Aug 2006 00:42:31 -0700 (PDT) Received: by 10.65.225.6 with HTTP; Wed, 23 Aug 2006 00:42:31 -0700 (PDT) Message-ID: Date: Wed, 23 Aug 2006 10:42:31 +0300 From: "Vlad GALU" To: freebsd-fs@freebsd.org In-Reply-To: <75690.1156318700@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <75690.1156318700@critter.freebsd.dk> Subject: Re: fdescfs/devfs X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 07:42:32 -0000 On 8/23/06, Poul-Henning Kamp wrote: > In message , "Vlad > GALU" writes: > > > Is it just a wrong impression on my side, or devfs really > >supersedes fdescfs ? I've been using /dev/fd/* for a while having the > >impression I had fdescfs mounted on /dev/fd, and I hadn't. I didn't > >see any differences whatsoever in terms of usage. > > It's not really devfs, it's kern_descrip.c and it only implements > /dev/fd[0-2] and the /dev/std{in,out,err} aliases. Thanks. I wrote a small test program that opened a socket and indeed it didn't show up in /dev/fd/. 
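A check along those lines might look like the sketch below. This is not the original test program (which was not posted); the helper names are hypothetical, and the result depends on the system: without fdescfs mounted on /dev/fd, only descriptors 0-2 are expected to show up there.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Format "/dev/fd/<fd>" into buf; returns 0 on success, -1 if buf is too small. */
int
devfd_path(char *buf, size_t len, int fd)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "/dev/fd/%d", fd);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/* Return 1 if /dev/fd/<fd> is present in the namespace, 0 otherwise. */
int
devfd_entry_exists(int fd)
{
	char path[32];
	struct stat sb;

	if (devfd_path(path, sizeof(path), fd) != 0)
		return (0);
	return (stat(path, &sb) == 0);
}

/*
 * Open a socket and report whether its descriptor is visible under
 * /dev/fd: 1 if visible, 0 if not, -1 if the socket could not be created.
 */
int
socket_visible_in_devfd(void)
{
	int s, seen;

	s = socket(AF_UNIX, SOCK_STREAM, 0);
	if (s < 0)
		return (-1);
	seen = devfd_entry_exists(s);
	close(s);
	return (seen);
}
```

Calling socket_visible_in_devfd() from a small main() and printing the result once with and once without fdescfs mounted should make the difference in behaviour visible.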
I was just about to return this info to the list. > > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk@FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. > -- If it's there, and you can see it, it's real. If it's not there, and you can see it, it's virtual. If it's there, and you can't see it, it's transparent. If it's not there, and you can't see it, you erased it. From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 11:08:17 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 74EA316A4DA for ; Wed, 23 Aug 2006 11:08:17 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from fw.zoral.com.ua (fw.zoral.com.ua [213.186.206.134]) by mx1.FreeBSD.org (Postfix) with ESMTP id A6E5F43D49 for ; Wed, 23 Aug 2006 11:08:16 +0000 (GMT) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by fw.zoral.com.ua (8.13.4/8.13.4) with ESMTP id k7NB88Wh028309 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 23 Aug 2006 14:08:08 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.13.8/8.13.8) with ESMTP id k7NB8AtC082815; Wed, 23 Aug 2006 14:08:10 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.13.8/8.13.8/Submit) id k7NB88sh082814; Wed, 23 Aug 2006 14:08:08 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 23 Aug 2006 14:08:08 +0300 From: Kostik Belousov To: Tor Egge Message-ID: <20060823110808.GD64800@deviant.kiev.zoral.com.ua> References: 
<20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> <20060822175540.V58720@delplex.bde.org> <20060822130743.GL56637@deviant.kiev.zoral.com.ua> <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> <20060823044043.GA64800@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="at6+YcpfzWZg/htY" Content-Disposition: inline In-Reply-To: <20060823044043.GA64800@deviant.kiev.zoral.com.ua> User-Agent: Mutt/1.4.2.2i X-Virus-Scanned: ClamAV version 0.88.4, clamav-milter version 0.88.4 on fw.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=1.9 required=5.0 tests=DNS_FROM_RFC_ABUSE, SPF_NEUTRAL,UNPARSEABLE_RELAY autolearn=no version=3.1.4 X-Spam-Level: * X-Spam-Checker-Version: SpamAssassin 3.1.4 (2006-07-25) on fw.zoral.com.ua Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 11:08:17 -0000 --at6+YcpfzWZg/htY Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Attached is the prototype change. A system booted from the modified kernel survived cyclic snapshotting of the exported partition. The partition was also mounted by loopback nfs, and a loop of extracting perl-5.8.8 from an archive, grepping the tree for foobar and removing it ran over nfs. I have at least one question: > > Protecting the existing i_flag and the timestamps with the vnode > > interlock when the current thread only has a shared vnode lock should > > be sufficient to protect against the races, removing the need for #3, > > #4 and #4 below. Could you, please, explain this point ? I did not wrap all accesses to i_flag and timestamps with vnode interlock, only ufs_itimes, ufs_lazyaccess and ufs_getattr for now. 
Index: ffs/ffs_snapshot.c
===================================================================
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ffs/ffs_snapshot.c,v
retrieving revision 1.128
diff -u -r1.128 ffs_snapshot.c
--- ffs/ffs_snapshot.c	21 Aug 2006 17:20:19 -0000	1.128
+++ ffs/ffs_snapshot.c	23 Aug 2006 10:59:24 -0000
@@ -2322,6 +2322,8 @@
  loop:
 	MNT_VNODE_FOREACH(vp, mp, mvp) {
 		VI_LOCK(vp);
+		if (vp->v_type == VREG)
+			ufs_lazyaccess(vp);
 		if ((vp->v_iflag & (VI_DOOMED | VI_OWEINACT)) != VI_OWEINACT ||
 		    vp->v_usecount > 0 ||
 		    vp->v_type == VNON) {
Index: ufs/inode.h
===================================================================
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/inode.h,v
retrieving revision 1.49
diff -u -r1.49 inode.h
--- ufs/inode.h	14 Mar 2005 10:21:16 -0000	1.49
+++ ufs/inode.h	23 Aug 2006 10:59:24 -0000
@@ -119,6 +119,7 @@
 #define	IN_RENAME	0x0010		/* Inode is being renamed. */
 #define	IN_LAZYMOD	0x0040		/* Modified, but don't write yet. */
 #define	IN_SPACECOUNTED	0x0080		/* Blocks to be freed in free count. */
+#define	IN_LAZYACCESS	0x0100		/* Process IN_ACCESS after the suspension finished */
 
 #define i_devvp i_ump->um_devvp
 #define i_umbufobj i_ump->um_bo
Index: ufs/ufs_extern.h
===================================================================
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_extern.h,v
retrieving revision 1.55
diff -u -r1.55 ufs_extern.h
--- ufs/ufs_extern.h	14 Mar 2005 10:21:16 -0000	1.55
+++ ufs/ufs_extern.h	23 Aug 2006 10:59:24 -0000
@@ -74,6 +74,7 @@
 int	 ufs_inactive(struct vop_inactive_args *);
 int	 ufs_init(struct vfsconf *);
 void	 ufs_itimes(struct vnode *vp);
+void	 ufs_lazyaccess(struct vnode *vp);
 int	 ufs_lookup(struct vop_cachedlookup_args *);
 int	 ufs_readdir(struct vop_readdir_args *);
 int	 ufs_reclaim(struct vop_reclaim_args *);
Index: ufs/ufs_vnops.c
===================================================================
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.277
diff -u -r1.277 ufs_vnops.c
--- ufs/ufs_vnops.c	31 May 2006 15:55:52 -0000	1.277
+++ ufs/ufs_vnops.c	23 Aug 2006 10:59:24 -0000
@@ -128,31 +128,70 @@
 {
 	struct inode *ip;
 	struct timespec ts;
+	int mnt_locked;
 
 	ip = VTOI(vp);
+	mnt_locked = 0;
+
+	if ((vp->v_mount->mnt_flag & MNT_RDONLY) != 0) {
+		VI_LOCK(vp);
+		goto out;
+	}
+
+	vfs_timestamp(&ts);
+
+	MNT_ILOCK(vp->v_mount);		/* For reading of mnt_kern_flags */
+	mnt_locked = 1;
+
+	VI_LOCK(vp);
 	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0)
-		return;
+		goto out_unl;
+
 	if ((vp->v_type == VBLK || vp->v_type == VCHR) && !DOINGSOFTDEP(vp))
 		ip->i_flag |= IN_LAZYMOD;
-	else
+	else if (((vp->v_mount->mnt_kern_flag & MNTK_SUSPENDED) == 0) ||
+		    ((ip->i_flag & (IN_CHANGE | IN_UPDATE)) != 0))
 		ip->i_flag |= IN_MODIFIED;
-	if ((vp->v_mount->mnt_flag & MNT_RDONLY) == 0) {
-		vfs_timestamp(&ts);
-		if (ip->i_flag & IN_ACCESS) {
-			DIP_SET(ip, i_atime, ts.tv_sec);
-			DIP_SET(ip, i_atimensec, ts.tv_nsec);
-		}
-		if (ip->i_flag & IN_UPDATE) {
-			DIP_SET(ip, i_mtime, ts.tv_sec);
-			DIP_SET(ip, i_mtimensec, ts.tv_nsec);
-			ip->i_modrev++;
-		}
-		if (ip->i_flag & IN_CHANGE) {
-			DIP_SET(ip, i_ctime, ts.tv_sec);
-			DIP_SET(ip, i_ctimensec, ts.tv_nsec);
-		}
+	else if ((ip->i_flag & IN_ACCESS) != 0)
+		ip->i_flag |= IN_LAZYACCESS;
+	
+	if (ip->i_flag & IN_ACCESS) {
+		DIP_SET(ip, i_atime, ts.tv_sec);
+		DIP_SET(ip, i_atimensec, ts.tv_nsec);
+	}
+	if (ip->i_flag & IN_UPDATE) {
+		DIP_SET(ip, i_mtime, ts.tv_sec);
+		DIP_SET(ip, i_mtimensec, ts.tv_nsec);
+		ip->i_modrev++;
+	}
+	if (ip->i_flag & IN_CHANGE) {
+		DIP_SET(ip, i_ctime, ts.tv_sec);
+		DIP_SET(ip, i_ctimensec, ts.tv_nsec);
 	}
+
+ out:
 	ip->i_flag &= ~(IN_ACCESS | IN_CHANGE | IN_UPDATE);
+ out_unl:
+	VI_UNLOCK(vp);
+	if (mnt_locked)
+		MNT_IUNLOCK(vp->v_mount);
+}
+
+/*
+ * Clear the IN_LAZYACCESS i_flag. vnode shall be interlocked.
+ */
+
+void
+ufs_lazyaccess(vp)
+	struct vnode *vp;
+{
+	struct inode *ip;
+
+	ip = VTOI(vp);
+	if ((ip->i_flag & IN_LAZYACCESS) != 0) {
+		ip->i_flag &= ~IN_LAZYACCESS;
+		ip->i_flag |= IN_MODIFIED;
+	}
 }
 
 /*
@@ -266,11 +305,15 @@
 	} */ *ap;
 {
 	struct vnode *vp = ap->a_vp;
+	int usecount;
 
 	VI_LOCK(vp);
-	if (vp->v_usecount > 1)
-		ufs_itimes(vp);
+	usecount = vp->v_usecount;
 	VI_UNLOCK(vp);
+
+	if (usecount > 1)
+		ufs_itimes(vp);
+
 	return (0);
 }
 
@@ -378,8 +421,10 @@
 	if (ip->i_ump->um_fstype == UFS1) {
 		vap->va_rdev = ip->i_din1->di_rdev;
 		vap->va_size = ip->i_din1->di_size;
+		VI_LOCK(vp);
 		vap->va_atime.tv_sec = ip->i_din1->di_atime;
 		vap->va_atime.tv_nsec = ip->i_din1->di_atimensec;
+		VI_UNLOCK(vp);
 		vap->va_mtime.tv_sec = ip->i_din1->di_mtime;
 		vap->va_mtime.tv_nsec = ip->i_din1->di_mtimensec;
 		vap->va_ctime.tv_sec = ip->i_din1->di_ctime;
@@ -390,8 +435,10 @@
 	} else {
 		vap->va_rdev = ip->i_din2->di_rdev;
 		vap->va_size = ip->i_din2->di_size;
+		VI_LOCK(vp);
 		vap->va_atime.tv_sec = ip->i_din2->di_atime;
 		vap->va_atime.tv_nsec = ip->i_din2->di_atimensec;
+		VI_UNLOCK(vp);
 		vap->va_mtime.tv_sec = ip->i_din2->di_mtime;
 		vap->va_mtime.tv_nsec = ip->i_din2->di_mtimensec;
 		vap->va_ctime.tv_sec = ip->i_din2->di_ctime;
@@ -400,7 +447,9 @@
 		vap->va_birthtime.tv_nsec = ip->i_din2->di_birthnsec;
 		vap->va_bytes = dbtob((u_quad_t)ip->i_din2->di_blocks);
 	}
+	VI_LOCK(vp);
 	vap->va_flags = ip->i_flags;
+	VI_UNLOCK(vp);
 	vap->va_gen = ip->i_gen;
 	vap->va_blocksize = vp->v_mount->mnt_stat.f_iosize;
 	vap->va_type = IFTOVT(ip->i_mode);
@@ -1992,11 +2041,15 @@
 	} */ *ap;
 {
 	struct vnode *vp = ap->a_vp;
+	int usecount;
 
 	VI_LOCK(vp);
-	if (vp->v_usecount > 1)
-		ufs_itimes(vp);
+	usecount = vp->v_usecount;
 	VI_UNLOCK(vp);
+
+	if (usecount > 1)
+		ufs_itimes(vp);
+
 	return (fifo_specops.vop_close(ap));
 }
 
--at6+YcpfzWZg/htY Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5
(FreeBSD) iD8DBQFE7DcYC3+MBN1Mb4gRAp1bAJ9lQu5xqeISUdM/+exevjAt6osv2QCfXYoB uJDDlH8BXDHMOY1HnDZuT7U= =Vih+ -----END PGP SIGNATURE----- --at6+YcpfzWZg/htY-- From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 14:05:07 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EFFC016A4DF for ; Wed, 23 Aug 2006 14:05:07 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout2.pacific.net.au (mailout2.pacific.net.au [61.8.0.85]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4000D43D4C for ; Wed, 23 Aug 2006 14:05:07 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162]) by mailout2.pacific.net.au (Postfix) with ESMTP id E93F8109AF8; Thu, 24 Aug 2006 00:05:05 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (8.13.4/8.13.4/Debian-3sarge1) with ESMTP id k7NE52Bn014610; Thu, 24 Aug 2006 00:05:03 +1000 Date: Thu, 24 Aug 2006 00:05:01 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Tor Egge In-Reply-To: <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> Message-ID: <20060823203148.M62850@delplex.bde.org> References: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> <20060822175540.V58720@delplex.bde.org> <20060822130743.GL56637@deviant.kiev.zoral.com.ua> <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 14:05:08 -0000 On Tue, 22 Aug 2006, Tor Egge wrote: >> I have a proposal. Sorry, I don't like it. >> 1. 
Remove IN_ACCESS, IN_UPDATE, IN_CHANGE from i_flag. For each flag,
>> introduce two new i_ fields, e.g., i_access of type timespec, and
>> i_accessed of boolean type.
>
> On amd64, sizeof(struct timespec) is 16 bytes and sizeof(struct boolean_t) is 4
> bytes. 3 * (16 + 4) = 60 bytes extra per inode. With 100K inodes that becomes
> 6 MB extra memory.
>
> I don't see why all these extra fields are needed.

I think i_access is to avoid changing the dinode, but changes to the
dinode are best avoided by just not changing it (as you give details on
below). IN_ACCESS is not in the dinode so it shouldn't need a new field.

>> 2. All places that currently set IN_ACCESS, instead would increment
>> i_accessed using the atomic ops. ufs_itimes shall update i_access under some M
>> mutex if i_accessed is greater than zero.
>
> Protecting the existing i_flag and the timestamps with the vnode interlock when
> the current thread only has a shared vnode lock should be sufficient to protect
> against the races, removing the need for #3, #4 and #4 below.

I agree that this should be sufficient. Don't know if it is. Actually,
I thought that the vnode lock was more exclusive. How can a shared lock
work even for a reader if a writer is changing inode contents?

> What's left is avoiding setting IN_MODIFIED when it's unsafe, to protect
> against the deadlock.
>
>> 3. Check the i_access instead of the IN_ACCESS.
>
>> 4. ffs_update and ffs_syncvnode shall do the DIP_SET(i_atime) under the mutex
>> from #2 before the main run and set IN_MODIFIED accordingly if i_accessed is
>> not 0.

From a followup:
% ffs_update shall be excluded, only ffs_syncvnode left in the list.
% ffs_syncvnode is enclosed in the vn_start_write braces.

>> 4. ufs_getattr shall retrieve the *time from new i_ fields under the mutex
>> from #2 if corresponding i_ flag is set.
>
>> Basically, I want to set IN_MODIFIED i_flag (induced by IN_ACCESS and others)
>> only under exclusive vnode lock.
Moreover, i_accessed can be zeroed only
>> under exclusive lock. This way, even shared lock on the vnode shall be enough
>> to safely update modification times, and the times are moved to the disk
>> often enough (at least, at the sync of the syncer vnodes).
>
> An exclusive vnode lock isn't needed, see above. Holding an exclusive vnode
> lock does not make it safe to set IN_MODIFIED.

Locking is complicated enough even if you can actually lock things :-(.

> There are some constraints with regards to setting IN_MODIFIED on an inode.
>
> If neither IN_CHANGE nor IN_UPDATE is set then it might be unsafe to set
> IN_MODIFIED since the file system might be suspended or in the process of being
> suspended with the vnode sync loop in ffs_sync() having iterated past the
> vnode.

The case can't happen. IN_CHANGE is always set if IN_MODIFIED will
(or should be) set later. There are some buggy cases where the combined
setting of { IN_CHANGE, IN_MODIFIED } is incorrect, but these don't cause
any problems here. We just have to avoid setting (or having to set)
IN_MODIFIED when setting it is not safe. Hopefully this is only in suspend
mode.

> If the file system is suspended then IN_MODIFIED cannot be set. If IN_MODIFIED,
> IN_CHANGE or IN_UPDATE is set and the file system is suspended then something
> is wrong.

I think ufs_itimes() needs to use a non-blocking vn_start_write() and do
nothing (except perhaps assert that the above harmful IN_* flags are not
set) if it (ufs_itimes()) would set IN_MODIFIED.

> If the file system is in the process of being suspended then IN_MODIFIED can be
> set at the cost of triggering a restart of the vnode sync loop in ffs_sync().

Yes, once IN_MODIFIED is set, it is up to the sync loop or somewhere near
it to ensure that suspend mode is not entered while an IN_MODIFIED flag is
set for _any_ inode, irrespective of why or how IN_MODIFIED was set
(whatever set it should be free to call vn_finish_write()).
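As an editorial illustration only (this is not code from the thread), the constraints quoted above can be modelled in plain userspace C. The numeric flag values, enum fs_state, struct itimes_result and model_itimes() are all assumptions made for the sketch; the real ufs_itimes() works on struct inode under kernel locks. The sketch shows only the decision: a suspended file system leaves the pending flags alone, while a suspending one may set IN_MODIFIED at the cost of a possible rescan by the sync loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values assumed for this model; only their distinctness matters. */
#define IN_ACCESS	0x0001
#define IN_CHANGE	0x0002
#define IN_UPDATE	0x0004
#define IN_MODIFIED	0x0008

/* Hypothetical three-way view of the mount's suspension state. */
enum fs_state { FS_RUNNING, FS_SUSPENDING, FS_SUSPENDED };

struct itimes_result {
	unsigned flags;		/* flag word after the call */
	bool skipped;		/* nothing was done: the fs is suspended */
	bool sync_restart;	/* the vnode sync loop may have to rescan */
};

struct itimes_result
model_itimes(unsigned flags, enum fs_state state)
{
	struct itimes_result r = { flags, false, false };

	if ((flags & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0)
		return (r);		/* nothing pending */
	if (state == FS_SUSPENDED) {
		/* Pending real writes on a suspended fs would be a bug. */
		assert((flags & (IN_CHANGE | IN_UPDATE | IN_MODIFIED)) == 0);
		r.skipped = true;	/* leave IN_ACCESS pending */
		return (r);
	}
	if (state == FS_SUSPENDING)
		r.sync_restart = true;
	/* Safe to mark the inode dirty and consume the pending flags. */
	r.flags = (flags | IN_MODIFIED) & ~(IN_ACCESS | IN_CHANGE | IN_UPDATE);
	return (r);
}
```

In this model a caller that gets skipped back simply calls again after the file system has been resumed, at which point the still-pending IN_ACCESS is converted as usual.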
> If either IN_MODIFIED, IN_CHANGE or IN_UPDATE is already set then the vnode
> sync loop has not reached the vnode, and a restart isn't needed.
>
> When ufs_itimes() cannot set IN_MODIFIED then it has to either risk losing the
> access time update or use some mechanism to defer it (e.g. set IN_LAZYMOD or
> a new flag and let process_deferred_inactive() set IN_MODIFIED after the file
> system has been resumed).

Yes, IN_LAZYMOD can be used to reduce the problem, and not much would be
lost by throwing away atime changes that occur during a suspend if there
are too many to convert to IN_LAZYMOD. Note that IN_LAZYMOD is completely
(?) unused in -current:
- it is not used for soft updates for historical reasons (initially mostly
  FUD)
- it was only used for special files, but now devfs is used for special
  files.
I have used IN_LAZYMOD for atime-only updates of all file types in
ufs_itimes() for several years, but haven't tested it much since I mount
almost all file systems with -noatime. The patch is simple.

IN_LAZYMOD still gets turned into IN_MODIFIED in ufs_reclaim() so using
it only for atime-only updates wouldn't completely fix the problem. It's
hard for ufs_reclaim() to do anything except discard IN_LAZYMOD updates
now.

Summary of my understanding of this problem:
- no problem with normal "writes" (at least modulo the sync loop checking
  the correct flags and/or bugs related to missing settings of
  IN_MODIFIED). We wait until writes complete before entering suspend
  mode, and don't allow writes in while in suspend mode.
- problem for atime updates caused by reads. These become writes while in
  suspend mode. We want reads to work in suspend mode, so we cannot
  disallow reads, and we cannot disallow the implicit writes without
  breaking atime semantics. This problem can be partly avoided by
  ignoring IN_ACCESS in ufs_itimes() while in suspend mode, or by
  converting it to IN_LAZYMOD.
If the inode gets reclaimed then we lose the atime update; otherwise the atime gets updated some time after we leave suspend mode, and either way the update doesn't go to snapshots, as is necessary for having a coherent snapshot. - problem for IN_LAZYMOD in ufs_reclaim(). Currently not reached. A quick fix would be to lose whatever updates are being done lazily. - problem syncing atimes while entering suspend mode. For writes, hopefully we get a consistent snapshot by disallowing new writes while entering sync mode (not just when it is entered). This doesn't work for reads. Bruce From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 14:07:27 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2D4C716A503; Wed, 23 Aug 2006 14:07:27 +0000 (UTC) (envelope-from lscharf@vt.edu) Received: from lennier.cc.vt.edu (lennier.cc.vt.edu [198.82.162.213]) by mx1.FreeBSD.org (Postfix) with ESMTP id 77A3E43D45; Wed, 23 Aug 2006 14:07:26 +0000 (GMT) (envelope-from lscharf@vt.edu) Received: from steiner.cc.vt.edu (IDENT:mirapoint@evil-steiner.cc.vt.edu [10.1.1.14]) by lennier.cc.vt.edu (8.12.11.20060308/8.12.11) with ESMTP id k7NDvp7H021842; Wed, 23 Aug 2006 10:07:25 -0400 Received: from authsmtp1.cc.vt.edu (imp.cc.vt.edu [198.82.161.55]) by steiner.cc.vt.edu (MOS 3.8.0-FCS) with ESMTP id FVN66918; Wed, 23 Aug 2006 10:07:24 -0400 (EDT) Received: from [128.173.14.24] (scharf.cc.vt.edu [128.173.14.24]) (authenticated bits=0) by authsmtp1.cc.vt.edu (8.13.1/8.13.1) with ESMTP id k7NE7N8x023537; Wed, 23 Aug 2006 10:07:24 -0400 Message-ID: <44EC611B.30905@vt.edu> Date: Wed, 23 Aug 2006 10:07:23 -0400 From: Luke Scharf User-Agent: Mail/News 1.5 (X11/20051201) MIME-Version: 1.0 To: Ricardo Correia References: <20060822104516.GB16033@garage.freebsd.pl> <200608221830.29039.zfs-opensolaris@wizy.org> In-Reply-To: 
<200608221830.29039.zfs-opensolaris@wizy.org> Content-Type: multipart/signed; protocol="application/x-pkcs7-signature"; micalg=sha1; boundary="------------ms020907050702080103080401" Cc: freebsd-fs@freebsd.org, zfs-discuss@opensolaris.org, freebsd-current@freebsd.org, Pawel Jakub Dawidek Subject: Re: [zfs-discuss] Porting ZFS file system to FreeBSD. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 14:07:27 -0000 This is a cryptographically signed message in MIME format. --------------ms020907050702080103080401 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Ricardo Correia wrote: > Wow, congratulations, nice work! > > I'm the one porting ZFS to FUSE and seeing you doing such progress so fast is > very very encouraging :) > I'd like to throw a "me too" into the pile of thank-you messages! I spent part of the weekend expanding and manipulating a set of LVM volumes on a pair of RHEL4-ish Linux servers... And I kept grumbling to myself "if this were ZFS, I could be done by now!" Not only that, but I could have matched the configuration to the needs of the users more closely. [0] I look forward to ZFS on both Linux and FreeBSD. It will be a powerful addition to both platforms! Thanks, -Luke [0] Changing a production server from an RHEL4 clone to Solaris isn't something that I'm likely to just-do in a couple of hours over the weekend on a cross-platform domain where I'm just assisting. If I were the sysadmin there, though, it would be practical. 
--------------ms020907050702080103080401--
From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 15:40:01 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 386AB16A4DD for ; Wed, 23 Aug 2006 15:40:01 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1.pacific.net.au [61.8.0.84]) by mx1.FreeBSD.org (Postfix) with ESMTP id B2C8C43D73 for ; Wed, 23 Aug 2006 15:39:54 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout1.pacific.net.au (Postfix) with ESMTP id 4CF505A0D14; Thu, 24 Aug 2006 01:39:53 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge1) with ESMTP id k7NFdn5u010329; Thu, 24 Aug 2006 01:39:50 +1000 Date: Thu, 24 Aug 2006 01:39:48 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Kostik Belousov In-Reply-To: <20060823044043.GA64800@deviant.kiev.zoral.com.ua> Message-ID: <20060824003453.M63627@delplex.bde.org> References: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org>
<20060822175540.V58720@delplex.bde.org> <20060822130743.GL56637@deviant.kiev.zoral.com.ua> <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> <20060823044043.GA64800@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org, Tor Egge Subject: Re: Deadlock between nfsd and snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 15:40:01 -0000

On Wed, 23 Aug 2006, Kostik Belousov wrote:

> On Tue, Aug 22, 2006 at 09:46:38PM +0000, Tor Egge wrote:
>>> 2. All places that currently set IN_ACCESS, instead would increment
>>> i_accessed using the atomic ops. ufs_itimes shall update i_access
>>> under some mutex if i_accessed is greater than zero.
>>
>> Protecting the existing i_flag and the timestamps with the vnode
>> interlock when the current thread only has a shared vnode lock should
>> be sufficient to protect against the races, removing the need for #3,
>> #4 and #4 below.

You asked about this in a later reply (the one with the patch). This
seems wrong to me. I think MNT_ILOCK() (like you used) is sufficient,
but you should just use a nonblocking vn_start_write() to avoid knowing
about the internals of vn_start_write(). If the shared (or whatever)
vnode lock is insufficient, then there are much larger, much older bugs.
Inodes are accessed a lot with just the vnode lock, and the vnode
interlock here won't affect races with most other accesses.

>> What's left is avoiding setting IN_MODIFIED when it's unsafe, to
>> protect against the deadlock.
>
> So, I will do the following:
>
> 1. Protect both setting and reading inode times and i_flag with vnode
> interlock. This shall be done through all the sys/ufs/*/* code.

The patch doesn't change so much. Most places shouldn't need changing.

> 2.
Modify ufs_itimes:
>> If neither IN_CHANGE nor IN_UPDATE is set then it might be unsafe to
>> set IN_MODIFIED since the file system might be suspended or in the
>> process of being suspended with the vnode sync loop in ffs_sync()
>> having iterated past the vnode.
> In other words, if IN_CHANGE or IN_UPDATE are already set, I can
> safely convert IN_ACCESS into IN_MOD.

Not quite. IN_ACCESS can also be handled if IN_MODIFIED is already set,
even if neither IN_CHANGE nor IN_UPDATE is set. All this is when the
file system is suspended; otherwise we can do anything to the inode.
In all cases, we depend on the inode not changing underneath us. I think
ordinary vnode locking gives that, just like it did before suspension
existed.

> Otherwise, I shall implement the algorithm below. Suspending/suspended
> checks need to take MNT_ILOCK.
>> ...
>> When ufs_itimes() cannot set IN_MODIFIED then it has to either risk
>> losing the access time update or use some mechanism to defer it (e.g.
>> set IN_LAZYMOD or a new flag and let process_deferred_inactive() set
>> IN_MODIFIED after the file system has been resumed).
>>
> BTW, shall the test for MNT_RDONLY in the ufs_itimes be moved earlier?

Probably not. It is supposed to be earlier, but is misplaced to work
around bugs in the MNT_RDONLY case (this case should never set a flag or
a timestamp, but in fact does set flags to work around bugs elsewhere).

> 3. Add the process_deferred_lazymod procedure, called from ffs_snapshot
> before proc_deferred_inactive, that shall convert IN_LAZYMOD | IN_ACCESS
> into IN_MODIFIED. To be safe, the proc_def_lazymod needs vn_start_write braces.

I think you should just ignore IN_ACCESS when it cannot be converted to a
timestamp, or use IN_LAZYMOD. ufs_inactive() needs to do something with
IN_LAZYMOD other than blindly turn it into IN_MODIFIED (but I believe the
problem case is unreachable in -current -- see other mail).

ffs_update() clears IN_MODIFIED.
I now think this and other clearings in ffs_update() are safe. vnode locking should make it safe to change the inode, and it doesn't matter if the file system is suspended (or being suspended) provided IN_MODIFIED is not set when it shouldn't be. Bruce From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 15:47:27 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2C38D16A4E1 for ; Wed, 23 Aug 2006 15:47:27 +0000 (UTC) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from pil.idi.ntnu.no (pil.idi.ntnu.no [129.241.107.93]) by mx1.FreeBSD.org (Postfix) with ESMTP id DB2DF43D7D for ; Wed, 23 Aug 2006 15:47:20 +0000 (GMT) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from cvsup.no.freebsd.org (c2h5oh.idi.ntnu.no [129.241.103.69]) by pil.idi.ntnu.no (8.13.6/8.13.1) with ESMTP id k7NFlIjE025893 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Wed, 23 Aug 2006 17:47:19 +0200 (MEST) Received: from localhost (localhost [127.0.0.1]) by cvsup.no.freebsd.org (8.13.4/8.13.4) with ESMTP id k7NFlIPC063934; Wed, 23 Aug 2006 15:47:18 GMT (envelope-from Tor.Egge@cvsup.no.freebsd.org) Date: Wed, 23 Aug 2006 15:47:18 +0000 (UTC) Message-Id: <20060823.154718.126633648.Tor.Egge@cvsup.no.freebsd.org> To: kostikbel@gmail.com From: Tor Egge In-Reply-To: <20060823110808.GD64800@deviant.kiev.zoral.com.ua> References: <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> <20060823044043.GA64800@deviant.kiev.zoral.com.ua> <20060823110808.GD64800@deviant.kiev.zoral.com.ua> X-Mailer: Mew version 3.3 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Virus-Scanned-By: mimedefang.idi.ntnu.no, using CLAMD X-SMTP-From: Sender=, Relay/Client=c2h5oh.idi.ntnu.no [129.241.103.69], EHLO=cvsup.no.freebsd.org X-Scanned-By: MIMEDefang 2.48 on 
129.241.107.38 X-Scanned-By: mimedefang.idi.ntnu.no, using MIMEDefang 2.48 with local filter 16.42-idi X-Filter-Time: 0 seconds Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 15:47:27 -0000

> I have at least one question:
>
> > > Protecting the existing i_flag and the timestamps with the vnode
> > > interlock when the current thread only has a shared vnode lock should
> > > be sufficient to protect against the races, removing the need for #3,
> > > #4 and #4 below.
> Could you, please, explain this point? I did not wrap all accesses to
> i_flag and timestamps with vnode interlock, only ufs_itimes, ufs_lazyaccess
> and ufs_getattr for now.

As long as i_flag and the timestamps are never accessed without holding a
shared or exclusive vnode lock, the vnode interlock can be used to
serialize access for those holding a shared vnode lock. Access is already
serialized for those holding an exclusive vnode lock, since no other
thread can hold a shared or exclusive lock for the same vnode at the same
time.

An alternate locking protocol is to always use the vnode interlock to
serialize access to i_flag and the timestamps. That increases the cost of
accessing the fields in code that uses an exclusive vnode lock, but can
temporarily lower the cost in other parts of the code (e.g. when scanning
all the vnodes belonging to a mount point looking for dirty vnodes) since
the vnode lock would no longer be needed to access i_flag and the
timestamps.

Problems with your suggested patch:

ufs_lazyaccess() changes i_flag with only the vnode interlock held. The
vnode interlock is not sufficient by itself to access i_flag without
switching to the alternate locking protocol.

ufs_itimes() doesn't optimize the common case where none of the flags are
set.
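As an editorial illustration only (this is not code from the thread), the last point can be sketched in userspace C: test the pending flags first, and pay for the timestamp and the mount lock only when something is pending. struct model_inode, its counters, IN_PENDING and model_itimes_fastpath() are hypothetical stand-ins, the flag values are assumed, and the counters merely mark where VI_LOCK(), MNT_ILOCK() and vfs_timestamp() would be called. Dropping the interlock before taking the mount lock and re-checking afterwards keeps the lock order used in the patch; whether the kernel needs that is an assumption here.

```c
#include <assert.h>

/* Flag values assumed for this model. */
#define IN_ACCESS	0x0001
#define IN_CHANGE	0x0002
#define IN_UPDATE	0x0004
#define IN_PENDING	(IN_ACCESS | IN_CHANGE | IN_UPDATE)	/* hypothetical helper */

/* Hypothetical stand-in for an inode, with counters for the costly steps. */
struct model_inode {
	unsigned i_flag;
	int interlock_taken;	/* where VI_LOCK() would be called */
	int mntlock_taken;	/* where MNT_ILOCK() would be called */
	int timestamps_read;	/* where vfs_timestamp() would be called */
};

/* Returns 1 if timestamps were processed, 0 if there was nothing to do. */
int
model_itimes_fastpath(struct model_inode *ip)
{
	int pending;

	/* Cheap check first: only the interlock is taken to read i_flag. */
	ip->interlock_taken++;
	pending = (ip->i_flag & IN_PENDING) != 0;
	/* The interlock would be dropped again here. */
	if (!pending)
		return (0);

	/* Slow path in the patch's order: timestamp, mount lock, interlock. */
	ip->timestamps_read++;
	ip->mntlock_taken++;
	ip->interlock_taken++;
	if ((ip->i_flag & IN_PENDING) == 0)
		return (0);	/* another thread handled the flags meanwhile */
	/* The timestamps would be stored in the dinode here. */
	ip->i_flag &= ~IN_PENDING;
	return (1);
}
```

For an inode with no pending flags the model touches neither the mount lock nor the clock, which is the common case the remark above is about.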
- Tor Egge From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 16:31:27 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9419016A4E8 for ; Wed, 23 Aug 2006 16:31:27 +0000 (UTC) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from pil.idi.ntnu.no (pil.idi.ntnu.no [129.241.107.93]) by mx1.FreeBSD.org (Postfix) with ESMTP id 06F9E43D4C for ; Wed, 23 Aug 2006 16:31:16 +0000 (GMT) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from cvsup.no.freebsd.org (c2h5oh.idi.ntnu.no [129.241.103.69]) by pil.idi.ntnu.no (8.13.6/8.13.1) with ESMTP id k7NGVEbe001453 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Wed, 23 Aug 2006 18:31:15 +0200 (MEST) Received: from localhost (localhost [127.0.0.1]) by cvsup.no.freebsd.org (8.13.4/8.13.4) with ESMTP id k7NGVCjM064187; Wed, 23 Aug 2006 16:31:14 GMT (envelope-from Tor.Egge@cvsup.no.freebsd.org) Date: Wed, 23 Aug 2006 16:31:11 +0000 (UTC) Message-Id: <20060823.163111.41690377.Tor.Egge@cvsup.no.freebsd.org> To: bde@zeta.org.au From: Tor Egge In-Reply-To: <20060823203148.M62850@delplex.bde.org> References: <20060822130743.GL56637@deviant.kiev.zoral.com.ua> <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> <20060823203148.M62850@delplex.bde.org> X-Mailer: Mew version 3.3 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Virus-Scanned-By: mimedefang.idi.ntnu.no, using CLAMD X-SMTP-From: Sender=, Relay/Client=c2h5oh.idi.ntnu.no [129.241.103.69], EHLO=cvsup.no.freebsd.org X-Scanned-By: MIMEDefang 2.48 on 129.241.107.38 X-Scanned-By: mimedefang.idi.ntnu.no, using MIMEDefang 2.48 with local filter 16.42-idi X-Filter-Time: 1 seconds Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 16:31:27 -0000

> > Protecting the existing i_flag and the timestamps with the vnode interlock
> > when the current thread only has a shared vnode lock should be sufficient
> > to protect against the races, removing the need for #3, #4 and #4 below.
> I agree that this should be sufficient. Don't know if it is. Actually,
> I thought that the vnode lock was more exclusive. How can a shared lock
> work even for a reader if a writer is changing inode contents?

Holding a shared lock prevents others from holding an exclusive lock, thus
the possible modifications to the inode are limited (i_flag and timestamps).

> > There are some constraints with regard to setting IN_MODIFIED on an inode.
> > If neither IN_CHANGE nor IN_UPDATE is set then it might be unsafe to set
> > IN_MODIFIED since the file system might be suspended or in the process of
> > being suspended with the vnode sync loop in ffs_sync() having iterated past
> > the vnode.
> The case can't happen. IN_CHANGE is always set if IN_MODIFIED will
> (or should be) set later. There are some buggy cases where the combined
> setting of { IN_CHANGE, IN_MODIFIED } is incorrect, but these don't cause
> any problems here. We just have to avoid setting (or having to set)
> IN_MODIFIED when setting it is not safe. Hopefully this is only in suspend
> mode.

If VOP_READ() has recently been called then IN_ACCESS might be set without
IN_CHANGE, IN_UPDATE or IN_MODIFIED set. That's when ufs_itimes() needs to
be careful. The VOP_READ() call might have been performed after the vnode
was handled in the vnode sync loop.

> > If the file system is suspended then IN_MODIFIED cannot be set. If
> > IN_MODIFIED, IN_CHANGE or IN_UPDATE is set and the file system is suspended
> > then something is wrong.
> I think ufs_itimes() needs to use a non-blocking vn_start_write() and do
> nothing (except perhaps assert that the above harmful IN_* flags are not
> set) if it (ufs_itimes()) would set IN_MODIFIED.

If a nonblocking vn_start_write() call fails then you don't know if the file
system is suspended or in the process of being suspended. If the file system
is in the process of being suspended then the vnode sync loop might still be
running.

> - problem for atime updates caused by reads. These become writes while in
>   suspend mode. We want reads to work in suspend mode, so we cannot
>   disallow reads, and we cannot disallow the implicit writes without
>   breaking atime semantics. This problem can be partly avoided by
>   ignoring IN_ACCESS in ufs_itimes() while in suspend mode, or by
>   converting it to IN_LAZYMOD. If the inode gets reclaimed then we
>   lose the atime update; otherwise the atime gets updated some time
>   after we leave suspend mode, and either way the update doesn't go
>   to snapshots, as is necessary for having a coherent snapshot.

vnodes are not reclaimed while the file system is suspended or in the
process of being suspended.

> - problem for IN_LAZYMOD in ufs_reclaim(). Currently not reached. A quick
>   fix would be to lose whatever updates are being done lazily.

What problem?
- Tor Egge From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 16:58:14 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id CDB2C16A4E5 for ; Wed, 23 Aug 2006 16:58:14 +0000 (UTC) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from pil.idi.ntnu.no (pil.idi.ntnu.no [129.241.107.93]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3203443D53 for ; Wed, 23 Aug 2006 16:58:12 +0000 (GMT) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from cvsup.no.freebsd.org (c2h5oh.idi.ntnu.no [129.241.103.69]) by pil.idi.ntnu.no (8.13.6/8.13.1) with ESMTP id k7NGwBsj005260 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Wed, 23 Aug 2006 18:58:11 +0200 (MEST) Received: from localhost (localhost [127.0.0.1]) by cvsup.no.freebsd.org (8.13.4/8.13.4) with ESMTP id k7NGwAi8064351; Wed, 23 Aug 2006 16:58:10 GMT (envelope-from Tor.Egge@cvsup.no.freebsd.org) Date: Wed, 23 Aug 2006 16:58:10 +0000 (UTC) Message-Id: <20060823.165810.130180685.Tor.Egge@cvsup.no.freebsd.org> To: bde@zeta.org.au From: Tor Egge In-Reply-To: <20060824003453.M63627@delplex.bde.org> References: <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> <20060823044043.GA64800@deviant.kiev.zoral.com.ua> <20060824003453.M63627@delplex.bde.org> X-Mailer: Mew version 3.3 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Virus-Scanned-By: mimedefang.idi.ntnu.no, using CLAMD X-SMTP-From: Sender=, Relay/Client=c2h5oh.idi.ntnu.no [129.241.103.69], EHLO=cvsup.no.freebsd.org X-Scanned-By: MIMEDefang 2.48 on 129.241.107.38 X-Scanned-By: mimedefang.idi.ntnu.no, using MIMEDefang 2.48 with local filter 16.42-idi X-Filter-Time: 1 seconds Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 16:58:14 -0000

> This seems wrong to me. I think MNT_ILOCK() (like you used) is sufficient,
> but you should just use a nonblocking vn_start_write() to avoid knowing
> about the internals of vn_start_write(). If the shared (or whatever)
> vnode lock is insufficient, then there are much larger, much older bugs.
> Inodes are accessed a lot with just the vnode lock, and the vnode
> interlock here won't affect races with most other accesses.

The check for MNTK_SUSPENDED in the proposed patch tests whether the file
system is suspended (cf. vn_start_secondary_write()). To test whether the
file system is in the process of being suspended, MNTK_SUSPEND is needed
(cf. vn_start_write()).

If IN_MODIFIED is set while the file system is in the process of being
suspended and it isn't known that the vnode sync loop has not passed beyond
this vnode then some hint must be left to indicate that the vnode sync loop
should be restarted. vn_start_secondary_write() uses mnt_secondary_accwrites
for this.
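The distinction between the two suspension states, and the hint left for the
sync loop, can be sketched as a small decision function. This is an
illustrative model only, under stated assumptions: the `SK_` bit values, the
`sk_mount` struct and `sk_try_mark_modified()` are hypothetical stand-ins,
and the caller is assumed to hold whatever lock protects the mount state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; only the names echo the kernel flags. */
#define SK_MNTK_SUSPEND   0x1u /* suspension in progress */
#define SK_MNTK_SUSPENDED 0x2u /* suspension complete */

/* Toy mount state; the counter models the hint read by the sync loop. */
struct sk_mount {
	unsigned kern_flag;
	unsigned secondary_accwrites;
};

enum sk_mark_result { SK_MARKED, SK_MARKED_WITH_HINT, SK_REFUSED };

/*
 * Decide whether an inode may be marked modified.  The caller is assumed
 * to hold the lock protecting *mp.
 */
enum sk_mark_result sk_try_mark_modified(struct sk_mount *mp, bool *modified)
{
	assert(mp != NULL && modified != NULL);
	if (mp->kern_flag & SK_MNTK_SUSPENDED)
		return SK_REFUSED;         /* fully suspended: no new dirty state */
	*modified = true;
	if (mp->kern_flag & SK_MNTK_SUSPEND) {
		mp->secondary_accwrites++; /* the sync loop may already be past us */
		return SK_MARKED_WITH_HINT;
	}
	return SK_MARKED;
}
```

Testing the "suspended" bit first matters in this model, because a fully
suspended mount is treated as possibly still carrying the "in progress" bit.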
- Tor Egge From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 18:32:30 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B86C416A4DA for ; Wed, 23 Aug 2006 18:32:30 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout2.pacific.net.au (mailout2.pacific.net.au [61.8.0.85]) by mx1.FreeBSD.org (Postfix) with ESMTP id 20A4E43D46 for ; Wed, 23 Aug 2006 18:32:30 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout2.pacific.net.au (Postfix) with ESMTP id E2E426E004; Thu, 24 Aug 2006 04:04:44 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3sarge1) with ESMTP id k7NI4glx029318; Thu, 24 Aug 2006 04:04:42 +1000 Date: Thu, 24 Aug 2006 04:04:41 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Tor Egge In-Reply-To: <20060823.154718.126633648.Tor.Egge@cvsup.no.freebsd.org> Message-ID: <20060824031104.B64391@delplex.bde.org> References: <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> <20060823044043.GA64800@deviant.kiev.zoral.com.ua> <20060823110808.GD64800@deviant.kiev.zoral.com.ua> <20060823.154718.126633648.Tor.Egge@cvsup.no.freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 18:32:30 -0000

On Wed, 23 Aug 2006, Tor Egge wrote:

>> I have at least one question:
>>
>>>> Protecting the existing i_flag and the timestamps with the vnode
>>>> interlock when the current thread only has a shared vnode lock should
>>>> be sufficient to protect against the races, removing the need for #3,
>>>> #4 and #4 below.
>> Could you, please, explain this point? I did not wrap all accesses to
>> i_flag and timestamps with vnode interlock, only ufs_itimes, ufs_lazyaccess
>> and ufs_getattr for now.
>
> As long as i_flag and the timestamps are never accessed without holding a shared
> or exclusive vnode lock, the vnode interlock can be used to serialize access
> for those holding a shared vnode lock. Access is already serialized for those
> holding an exclusive vnode lock, since no other thread can hold a shared or
> exclusive lock for the same vnode at the same time.

I found some cases where i_flag, timestamps and probably other parts of the
vnode seem to be accessed without any proper locks (maybe a refcount) :-(.
See another (private) reply and below.

I don't understand how the vnode interlock can help much for interoperation
with code which uses vnode locks. Aren't these locks almost independent of
each other, so if the vnode lock is already held then acquiring the vnode
interlock wouldn't block? So to protect i_flag with the interlock you
would have to add the interlock to mounds of code that just use the vnode
lock now.

vn_stat() uses an exclusive vnode lock although it is almost read-only.
Thus non-exclusive locks in ufs_itimes() don't occur for the most common
problem case of calls from ufs_getattr() for at least stat(2).

> An alternate locking protocol is to always use the vnode interlock to serialize
> access to i_flag and the timestamps. That increases the cost of accessing the
> fields in code that uses an exclusive vnode lock, but can temporarily lower the
> cost in other parts of the code (e.g. when scanning all the vnodes belonging to
> a mount point looking for dirty vnodes) since the vnode lock would no longer be
> needed to access i_flag and the timestamps.

I understand this locking :-). ffs_sync() actually only uses the vnode
interlock to access i_flag. I think this is intentionally quick and not
quite right -- there is a comment a few lines before it saying that we depend
on a mntvnode lock to keep things stable enough for a quick test. The scope
of the comment is unclear. I think the quick test is only good enough for
sync(2).

> Problems with your suggested patch:
>
> ufs_lazyaccess() changes i_flag with only the vnode interlock held. The vnode
> interlock is not sufficient by itself to access i_flag without switching to
> the alternate locking protocol.

That's what I thought. In fact, i_flag is accessed a lot without the vnode
lock held, sometimes even without the vnode interlock.

One case is ufs_close(). It is called without the vnode lock since
VOP_CLOSE() doesn't lock (I think). It calls ufs_itimes() without
acquiring the vnode lock. It calls ufs_itimes() with the vnode interlock,
but only accidentally since it has to acquire the interlock for accessing
v_usecount.

This bug seems to be very old. In FreeBSD-1, ufs_close() does the
equivalent of ufs_itimes() if the vnode is NOT locked. This makes some
sense:
- if the vnode is locked, then write accesses to it (just memory accesses)
  would race with whatever has it locked, and apparently doing nothing
  was considered good enough,
- if the vnode is not locked, then write accesses to it without locking it
  were safe because the kernel was not preemptible and the accesses don't
  block or trap.
Things are now more complicated.
Bruce From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 19:13:02 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7EB4D16A4DF for ; Wed, 23 Aug 2006 19:13:02 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1.pacific.net.au [61.8.0.84]) by mx1.FreeBSD.org (Postfix) with ESMTP id E37A643D79 for ; Wed, 23 Aug 2006 19:13:00 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162]) by mailout1.pacific.net.au (Postfix) with ESMTP id 33B3A5A0EC1; Thu, 24 Aug 2006 04:43:29 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (8.13.4/8.13.4/Debian-3sarge1) with ESMTP id k7NIhQRm007486; Thu, 24 Aug 2006 04:43:27 +1000 Date: Thu, 24 Aug 2006 04:43:26 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Tor Egge In-Reply-To: <20060823.163111.41690377.Tor.Egge@cvsup.no.freebsd.org> Message-ID: <20060824040523.B64391@delplex.bde.org> References: <20060822130743.GL56637@deviant.kiev.zoral.com.ua> <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> <20060823203148.M62850@delplex.bde.org> <20060823.163111.41690377.Tor.Egge@cvsup.no.freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 19:13:02 -0000

On Wed, 23 Aug 2006, Tor Egge wrote:

>>> Protecting the existing i_flag and the timestamps with the vnode interlock
>>> when the current thread only has a shared vnode lock should be sufficient
>>> to protect against the races, removing the need for #3, #4 and #4 below.
>
>> I agree that this should be sufficient. Don't know if it is. Actually,
>> I thought that the vnode lock was more exclusive. How can a shared lock
>> work even for a reader if a writer is changing inode contents?
>
> Holding a shared lock prevents others from holding an exclusive lock, thus
> the possible modifications to the inode are limited (i_flag and timestamps).

Seems to be not much of a problem (since vn_stat() acquires an exclusive
lock).

>>> There are some constraints with regard to setting IN_MODIFIED on an inode.
>>> If neither IN_CHANGE nor IN_UPDATE is set then it might be unsafe to set
>>> IN_MODIFIED since the file system might be suspended or in the process of
>>> being suspended with the vnode sync loop in ffs_sync() having iterated past
>>> the vnode.
>
>> The case can't happen. IN_CHANGE is always set if IN_MODIFIED will
>> (or should be) set later. There are some buggy cases where the combined
>> setting of { IN_CHANGE, IN_MODIFIED } is incorrect, but these don't cause
>> any problems here. We just have to avoid setting (or having to set)
>> IN_MODIFIED when setting it is not safe. Hopefully this is only in suspend
>> mode.
>
> If VOP_READ() has recently been called then IN_ACCESS might be set without
> IN_CHANGE, IN_UPDATE or IN_MODIFIED set. That's when ufs_itimes() needs to be
> careful. The VOP_READ() call might have been performed after the vnode was
> handled in the vnode sync loop.

Certainly. You didn't mention IN_ACCESS originally, and I thought about
it but didn't mention it either. For IN_ACCESS, we can consider the
change to not actually have occurred (and thus force a setting of
IN_MODIFIED) until we set the atime, or even later with IN_LAZYMOD, or
never if we are really lazy. We can set IN_ACCESS in ffs_read() safely
because it's not in the dinode (and vn_rdwr() acquires an exclusive lock).
We can decide not to convert it to IN_MODIFIED later if this is too hard.
Not setting the atime properly is harmless compared with not setting other
timestamps.

I should only have said that "IN_CHANGE is always set if IN_MODIFIED will
be set later due to a _disk_ inode change that has _already_ occurred"
(IN_ACCESS and IN_UPDATE both give in-core-only inode changes that will
affect the disk inode later).

>>> If the file system is suspended then IN_MODIFIED cannot be set. If
>>> IN_MODIFIED, IN_CHANGE or IN_UPDATE is set and the file system is suspended
>>> then something is wrong.
>
>> I think ufs_itimes() needs to use a non-blocking vn_start_write() and do
>> nothing (except perhaps assert that the above harmful IN_* flags are not
>> set) if it (ufs_itimes()) would set IN_MODIFIED.
>
> If a nonblocking vn_start_write() call fails then you don't know if the file
> system is suspended or in the process of being suspended. If the file system
> is in the process of being suspended then the vnode sync loop might still be
> running.

Yes. I missed the different suspend flags. Now I'll ask for a vfs level
function to avoid accessing the is-suspended flag directly.

>> - problem for atime updates caused by reads. These become writes while in
>>   suspend mode. We want reads to work in suspend mode, so we cannot
>>   disallow reads, and we cannot disallow the implicit writes without
>>   breaking atime semantics. This problem can be partly avoided by
>>   ignoring IN_ACCESS in ufs_itimes() while in suspend mode, or by
>>   converting it to IN_LAZYMOD. If the inode gets reclaimed then we
>>   lose the atime update; otherwise the atime gets updated some time
>>   after we leave suspend mode, and either way the update doesn't go
>>   to snapshots, as is necessary for having a coherent snapshot.
>
> vnodes are not reclaimed while the file system is suspended or in the
> process of being suspended.

Does this mean that big tree walks soon run out of vnodes and cause
everything to block until the file system is unsuspended?

>> - problem for IN_LAZYMOD in ufs_reclaim(). Currently not reached. A quick
>>   fix would be to lose whatever updates are being done lazily.
>
> What problem?

It blindly converts IN_LAZYMOD to IN_MODIFIED and calls ufs_update(), but
it has no problem with suspend if it is not called while suspended. It
isn't missing vnode locking like I first thought -- the locking annotation
was just wrong in the old version of vnode_if.src that I was looking at.
The locking used to be "L L L" but was documented as "U U U". Now it is
"E E E" and it probably needs to be exclusive to set IN_MODIFIED.
Bruce From owner-freebsd-fs@FreeBSD.ORG Wed Aug 23 20:06:56 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C684316A4E0 for ; Wed, 23 Aug 2006 20:06:56 +0000 (UTC) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from pil.idi.ntnu.no (pil.idi.ntnu.no [129.241.107.93]) by mx1.FreeBSD.org (Postfix) with ESMTP id 87C2F43D4C for ; Wed, 23 Aug 2006 20:06:55 +0000 (GMT) (envelope-from Tor.Egge@cvsup.no.freebsd.org) Received: from cvsup.no.freebsd.org (c2h5oh.idi.ntnu.no [129.241.103.69]) by pil.idi.ntnu.no (8.13.6/8.13.1) with ESMTP id k7NK6r4d028075 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Wed, 23 Aug 2006 22:06:54 +0200 (MEST) Received: from localhost (localhost [127.0.0.1]) by cvsup.no.freebsd.org (8.13.4/8.13.4) with ESMTP id k7NK6pSA065827; Wed, 23 Aug 2006 20:06:53 GMT (envelope-from Tor.Egge@cvsup.no.freebsd.org) Date: Wed, 23 Aug 2006 20:06:46 +0000 (UTC) Message-Id: <20060823.200646.74691347.Tor.Egge@cvsup.no.freebsd.org> To: bde@zeta.org.au From: Tor Egge In-Reply-To: <20060824031104.B64391@delplex.bde.org> References: <20060823110808.GD64800@deviant.kiev.zoral.com.ua> <20060823.154718.126633648.Tor.Egge@cvsup.no.freebsd.org> <20060824031104.B64391@delplex.bde.org> X-Mailer: Mew version 3.3 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Virus-Scanned-By: mimedefang.idi.ntnu.no, using CLAMD X-SMTP-From: Sender=, Relay/Client=c2h5oh.idi.ntnu.no [129.241.103.69], EHLO=cvsup.no.freebsd.org X-Scanned-By: MIMEDefang 2.48 on 129.241.107.38 X-Scanned-By: mimedefang.idi.ntnu.no, using MIMEDefang 2.48 with local filter 16.42-idi X-Filter-Time: 0 seconds Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Aug 2006 20:06:56 -0000

> I found some cases where i_flag, timestamps and probably other parts of the
> vnode seem to be accessed without any proper locks (maybe a refcount) :-(.
> See another (private) reply and below.
>
> I don't understand how the vnode interlock can help much for interoperation
> with code which uses vnode locks. Aren't these locks almost independent of
> each other, so if the vnode lock is already held then acquiring the vnode
> interlock wouldn't block? So to protect i_flag with the interlock you
> would have to add the interlock to mounds of code that just use the vnode
> lock now.

The vnode interlock is used to resolve races between those holding a shared
lock.

There are 4 locking protocols for the protection of i_flag and the
timestamps:

1. Previous locking protocol: Giant

2. Current locking protocol: vnode lock. This breaks down when
   processes with a shared vnode lock perform changes (e.g. setting
   flags or timestamps).

3. Proposed locking protocol: exclusive vnode lock or shared vnode lock
   plus the vnode interlock.

4. Alternate locking protocol: vnode interlock. Further code changes
   might be needed for this to work properly, e.g. IN_MODIFIED might
   have to be set before making changes to the inode instead of
   afterwards.

> I understand this locking :-). ffs_sync() actually only uses the vnode
> interlock to access i_flag. I think this is intentionally quick and not
> quite right -- there is a comment a few lines before it saying that we depend
> on a mntvnode lock to keep things stable enough for a quick test. The scope
> of the comment is unclear. I think the quick test is only good enough for
> sync(2).

The correct unoptimized version of ffs_sync() would have the vnode lock
before checking i_flag. When this loop executes as part of suspending the
file system, MNTK_SUSPEND is set. No code is executing within regions
protected by vn_start_write() and new calls to vn_start_write() will fail or
block. Calls to vn_start_secondary_write() might succeed but will then
trigger a retry of the whole vnode sync loop. While not having the vnode
lock when checking i_flag in ffs_sync() opens a race, the consequences are
limited due to other locking and the restart.

- Tor Egge
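The restart behaviour described in the message above can be illustrated with
a single-threaded toy model: a sync pass walks all inodes, and if a late
writer bumped a counter during the pass, the whole pass is repeated. This is
a sketch under stated assumptions, not the ffs_sync() code. The `sk_fs`
struct, the hook type and `sk_late_writer_once()` are hypothetical, and the
"race" is simulated by a callback rather than by a real second thread.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_INODES 8

/* Toy file system state for the simulation. */
struct sk_fs {
	int      dirty[SK_MAX_INODES]; /* 1 if the toy inode needs flushing */
	size_t   ninodes;
	unsigned secondary_accwrites;  /* bumped by late writers */
	int      raced;                /* lets the example hook fire only once */
};

/* Hypothetical hook standing in for activity that races with the loop. */
typedef void (*sk_race_hook)(struct sk_fs *fs, size_t just_visited);

/* Example hook: once, after inode 1 is visited, dirty inode 0 behind the loop. */
void sk_late_writer_once(struct sk_fs *fs, size_t just_visited)
{
	if (just_visited == 1 && !fs->raced) {
		fs->raced = 1;
		fs->dirty[0] = 1;
		fs->secondary_accwrites++; /* the hint that forces a restart */
	}
}

/* Flush every inode; rescan if the counter moved.  Returns the pass count. */
int sk_sync_all(struct sk_fs *fs, sk_race_hook hook)
{
	int passes = 0;
	unsigned seen;

	assert(fs != NULL && fs->ninodes <= SK_MAX_INODES);
	do {
		seen = fs->secondary_accwrites;
		passes++;
		for (size_t i = 0; i < fs->ninodes; i++) {
			fs->dirty[i] = 0;  /* "flush" this inode */
			if (hook != NULL)
				hook(fs, i);   /* racing activity may dirty inodes */
		}
	} while (seen != fs->secondary_accwrites);
	return passes;
}
```

With `sk_late_writer_once` as the hook, the first pass leaves inode 0 dirty
again, the counter change is noticed, and a second pass cleans it. Without
the counter the loop would finish after one pass and miss that inode.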