From owner-svn-src-stable@FreeBSD.ORG Tue Sep 29 10:53:06 2009 Return-Path: Delivered-To: svn-src-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A9628106566B; Tue, 29 Sep 2009 10:53:06 +0000 (UTC) (envelope-from pjd@FreeBSD.org) Received: from svn.freebsd.org (svn.freebsd.org [IPv6:2001:4f8:fff6::2c]) by mx1.freebsd.org (Postfix) with ESMTP id 94D2F8FC17; Tue, 29 Sep 2009 10:53:06 +0000 (UTC) Received: from svn.freebsd.org (localhost [127.0.0.1]) by svn.freebsd.org (8.14.3/8.14.3) with ESMTP id n8TAr6X8083218; Tue, 29 Sep 2009 10:53:06 GMT (envelope-from pjd@svn.freebsd.org) Received: (from pjd@localhost) by svn.freebsd.org (8.14.3/8.14.3/Submit) id n8TAr6A5083207; Tue, 29 Sep 2009 10:53:06 GMT (envelope-from pjd@svn.freebsd.org) Message-Id: <200909291053.n8TAr6A5083207@svn.freebsd.org> From: Pawel Jakub Dawidek Date: Tue, 29 Sep 2009 10:53:06 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-8@freebsd.org X-SVN-Group: stable-8 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Cc: Subject: svn commit: r197613 - in stable/8: cddl/contrib/opensolaris cddl/contrib/opensolaris/cmd/zfs sys sys/amd64/include/xen sys/cddl/contrib/opensolaris sys/cddl/contrib/opensolaris/uts/common/fs sys/cd... X-BeenThere: svn-src-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SVN commit messages for all the -stable branches of the src tree List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Sep 2009 10:53:06 -0000 Author: pjd Date: Tue Sep 29 10:53:06 2009 New Revision: 197613 URL: http://svn.freebsd.org/changeset/base/197613 Log: MFC r197287, r197289, r197351, r197426, r197458, r197459, r197497, r197498, r197512, r197513, r197514, r197515, r197525: r197287: Purge namecache for the file system being rolled back, so it doesn't point at invalid vnodes after the rollback resulting in EIO errors when trying to access files which are in the namecache. Reported by: des r197289: Purge file system namecache when receiving incremental stream and rolling back to it. r197351: Purge namecache in the same place OpenSolaris does. r197426: Restore BSD behaviour - when creating new directory entry use parent directory gid to set group ownership and not process gid. This was overlooked during v6 -> v13 switch. PR: kern/139076 Reported by: Sean Winn r197458: Close race in zfs_zget(). We have to increase usecount first and then check for VI_DOOMED flag. Before this change vnode could be reclaimed between checking for the flag and increasing usecount. r197459: Before calling vflush(FORCECLOSE) mark file system as unmounted so the following vnops will fail. This is very important, because without this change vnode could be reclaimed at any point, even if we increased usecount. The only way to ensure that vnode won't be reclaimed was to lock it, which would be very hard to do in ZFS without changing a lot of code. With this change simply increasing usecount is enough to be sure vnode won't be reclaimed from under us. To be precise it can still be reclaimed but we won't be able to see it, because every try to enter ZFS through VFS will result in EIO. The only function that cannot return EIO, because it is needed for vflush() is zfs_root(). Introduce ZFS_ENTER_NOERROR() macro that only locks z_teardown_lock and never returns EIO. r197497: Switch to fletcher4 as the default checksum algorithm. Fletcher2 was proven to be a bit weak and OpenSolaris also switched to fletcher4. r197498: head/cddl/contrib/opensolaris Fletcher4 is not the default checksum algorithm. r197512: - Don't depend on value returned by gfs_*_inactive(), it doesn't work well with forced unmounts when GFS vnodes are referenced. - Make other preparations to GFS for forced unmounts. PR: kern/139062 Reported by: trasz r197513: Use traverse() function to find and return mount point's vnode instead of covered vnode when snapshot is already mounted. r197514: On lookup error VFS expects *vpp to be set to NULL, be sure to do that. r197515: Handle cases where virtual (GFS) vnodes are referenced when doing forced unmount. In that case we cannot depend on the proper order of invalidating vnodes, so we have to free resources when we have a chance. PR: kern/139062 Reported by: trasz r197525: Ensure that tv_sec is between INT32_MIN and INT32_MAX, so ZFS won't object. This completes the fix from r185586. PR: kern/139059 Reported by: Daniel Braniss Submitted by: Jaakko Heinonen Tested by: Daniel Braniss Approved by: re (kib) Modified: stable/8/cddl/contrib/opensolaris/ (props changed) stable/8/cddl/contrib/opensolaris/cmd/zfs/zfs.8 stable/8/sys/ (props changed) stable/8/sys/amd64/include/xen/ (props changed) stable/8/sys/cddl/contrib/opensolaris/ (props changed) stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/gfs.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/fletcher.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c stable/8/sys/contrib/dev/acpica/ (props changed) stable/8/sys/contrib/pf/ (props changed) stable/8/sys/dev/xen/xenpci/ (props changed) stable/8/sys/nfsserver/nfs_serv.c Modified: stable/8/cddl/contrib/opensolaris/cmd/zfs/zfs.8 ============================================================================== --- stable/8/cddl/contrib/opensolaris/cmd/zfs/zfs.8 Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/cddl/contrib/opensolaris/cmd/zfs/zfs.8 Tue Sep 29 10:53:06 2009 (r197613) @@ -535,7 +535,7 @@ This property is not inherited. .ad .sp .6 .RS 4n -Controls the checksum used to verify data integrity. The default value is "on", which automatically selects an appropriate algorithm (currently, \fIfletcher2\fR, but this may change in future releases). The value "off" disables integrity +Controls the checksum used to verify data integrity. The default value is "on", which automatically selects an appropriate algorithm (currently, \fIfletcher4\fR, but this may change in future releases). The value "off" disables integrity checking on user data. Disabling checksums is NOT a recommended practice. .RE Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/gfs.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/gfs.c Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/gfs.c Tue Sep 29 10:53:06 2009 (r197613) @@ -595,7 +595,6 @@ found: if (vp->v_flag & V_XATTRDIR) VI_LOCK(fp->gfs_parent); VI_LOCK(vp); - ASSERT(vp->v_count < 2); /* * Really remove this vnode */ @@ -607,12 +606,7 @@ found: */ ge->gfse_vnode = NULL; } - if (vp->v_count == 1) { - vp->v_usecount--; - vdropl(vp); - } else { - VI_UNLOCK(vp); - } + VI_UNLOCK(vp); /* * Free vnode and release parent @@ -1084,18 +1078,16 @@ gfs_vop_inactive(ap) { vnode_t *vp = ap->a_vp; gfs_file_t *fp = vp->v_data; - void *data; if (fp->gfs_type == GFS_DIR) - data = gfs_dir_inactive(vp); + gfs_dir_inactive(vp); else - data = gfs_file_inactive(vp); - - if (data != NULL) - kmem_free(data, fp->gfs_size); + gfs_file_inactive(vp); VI_LOCK(vp); vp->v_data = NULL; VI_UNLOCK(vp); + kmem_free(fp, fp->gfs_size); + return (0); } Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/fletcher.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/fletcher.c Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/fletcher.c Tue Sep 29 10:53:06 2009 (r197613) @@ -19,11 +19,111 @@ * CDDL HEADER END */ /* - * Copyright 2006 Sun Microsystems, Inc. All rights reserved. + * Copyright 2009 Sun Microsystems, Inc. All rights reserved. * Use is subject to license terms. */ -#pragma ident "%Z%%M% %I% %E% SMI" +/* + * Fletcher Checksums + * ------------------ + * + * ZFS's 2nd and 4th order Fletcher checksums are defined by the following + * recurrence relations: + * + * a = a + f + * i i-1 i-1 + * + * b = b + a + * i i-1 i + * + * c = c + b (fletcher-4 only) + * i i-1 i + * + * d = d + c (fletcher-4 only) + * i i-1 i + * + * Where + * a_0 = b_0 = c_0 = d_0 = 0 + * and + * f_0 .. f_(n-1) are the input data. + * + * Using standard techniques, these translate into the following series: + * + * __n_ __n_ + * \ | \ | + * a = > f b = > i * f + * n /___| n - i n /___| n - i + * i = 1 i = 1 + * + * + * __n_ __n_ + * \ | i*(i+1) \ | i*(i+1)*(i+2) + * c = > ------- f d = > ------------- f + * n /___| 2 n - i n /___| 6 n - i + * i = 1 i = 1 + * + * For fletcher-2, the f_is are 64-bit, and [ab]_i are 64-bit accumulators. + * Since the additions are done mod (2^64), errors in the high bits may not + * be noticed. For this reason, fletcher-2 is deprecated. + * + * For fletcher-4, the f_is are 32-bit, and [abcd]_i are 64-bit accumulators. + * A conservative estimate of how big the buffer can get before we overflow + * can be estimated using f_i = 0xffffffff for all i: + * + * % bc + * f=2^32-1;d=0; for (i = 1; d<2^64; i++) { d += f*i*(i+1)*(i+2)/6 }; (i-1)*4 + * 2264 + * quit + * % + * + * So blocks of up to 2k will not overflow. Our largest block size is + * 128k, which has 32k 4-byte words, so we can compute the largest possible + * accumulators, then divide by 2^64 to figure the max amount of overflow: + * + * % bc + * a=b=c=d=0; f=2^32-1; for (i=1; i<=32*1024; i++) { a+=f; b+=a; c+=b; d+=c } + * a/2^64;b/2^64;c/2^64;d/2^64 + * 0 + * 0 + * 1365 + * 11186858 + * quit + * % + * + * So a and b cannot overflow. To make sure each bit of input has some + * effect on the contents of c and d, we can look at what the factors of + * the coefficients in the equations for c_n and d_n are. The number of 2s + * in the factors determines the lowest set bit in the multiplier. Running + * through the cases for n*(n+1)/2 reveals that the highest power of 2 is + * 2^14, and for n*(n+1)*(n+2)/6 it is 2^15. So while some data may overflow + * the 64-bit accumulators, every bit of every f_i effects every accumulator, + * even for 128k blocks. + * + * If we wanted to make a stronger version of fletcher4 (fletcher4c?), + * we could do our calculations mod (2^32 - 1) by adding in the carries + * periodically, and store the number of carries in the top 32-bits. + * + * -------------------- + * Checksum Performance + * -------------------- + * + * There are two interesting components to checksum performance: cached and + * uncached performance. With cached data, fletcher-2 is about four times + * faster than fletcher-4. With uncached data, the performance difference is + * negligible, since the cost of a cache fill dominates the processing time. + * Even though fletcher-4 is slower than fletcher-2, it is still a pretty + * efficient pass over the data. + * + * In normal operation, the data which is being checksummed is in a buffer + * which has been filled either by: + * + * 1. a compression step, which will be mostly cached, or + * 2. a bcopy() or copyin(), which will be uncached (because the + * copy is cache-bypassing). + * + * For both cached and uncached data, both fletcher checksums are much faster + * than sha-256, and slower than 'off', which doesn't touch the data at all. + */ #include #include Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h Tue Sep 29 10:53:06 2009 (r197613) @@ -255,6 +255,7 @@ VTOZ(vnode_t *vp) /* * ZFS_ENTER() is called on entry to each ZFS vnode and vfs operation. + * ZFS_ENTER_NOERROR() is called when we can't return EIO. * ZFS_EXIT() must be called before exitting the vop. * ZFS_VERIFY_ZP() verifies the znode is valid. */ @@ -267,6 +268,9 @@ VTOZ(vnode_t *vp) } \ } +#define ZFS_ENTER_NOERROR(zfsvfs) \ + rrw_enter(&(zfsvfs)->z_teardown_lock, RW_READER, FTAG) + #define ZFS_EXIT(zfsvfs) rrw_exit(&(zfsvfs)->z_teardown_lock, FTAG) #define ZFS_VERIFY_ZP(zp) \ Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h Tue Sep 29 10:53:06 2009 (r197613) @@ -76,7 +76,7 @@ enum zio_checksum { ZIO_CHECKSUM_FUNCTIONS }; -#define ZIO_CHECKSUM_ON_VALUE ZIO_CHECKSUM_FLETCHER_2 +#define ZIO_CHECKSUM_ON_VALUE ZIO_CHECKSUM_FLETCHER_4 #define ZIO_CHECKSUM_DEFAULT ZIO_CHECKSUM_ON enum zio_compress { Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c Tue Sep 29 10:53:06 2009 (r197613) @@ -1841,7 +1841,7 @@ zfs_perm_init(znode_t *zp, znode_t *pare fgid = zfs_fuid_create_cred(zfsvfs, ZFS_GROUP, tx, cr, fuidp); #ifdef __FreeBSD__ - gid = parent->z_phys->zp_gid; + gid = fgid = parent->z_phys->zp_gid; #else gid = crgetgid(cr); #endif Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c Tue Sep 29 10:53:06 2009 (r197613) @@ -818,7 +818,11 @@ zfsctl_snapdir_lookup(ap) if ((sep = avl_find(&sdp->sd_snaps, &search, &where)) != NULL) { *vpp = sep->se_root; VN_HOLD(*vpp); - if ((*vpp)->v_mountedhere == NULL) { + err = traverse(vpp, LK_EXCLUSIVE | LK_RETRY); + if (err) { + VN_RELE(*vpp); + *vpp = NULL; + } else if (*vpp == sep->se_root) { /* * The snapshot was unmounted behind our backs, * try to remount it. @@ -832,10 +836,9 @@ zfsctl_snapdir_lookup(ap) */ (*vpp)->v_flag &= ~VROOT; } - vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY); mutex_exit(&sdp->sd_lock); ZFS_EXIT(zfsvfs); - return (0); + return (err); } /* @@ -895,6 +898,8 @@ domount: } mutex_exit(&sdp->sd_lock); ZFS_EXIT(zfsvfs); + if (err != 0) + *vpp = NULL; return (err); } @@ -1002,15 +1007,24 @@ zfsctl_snapdir_inactive(ap) { vnode_t *vp = ap->a_vp; zfsctl_snapdir_t *sdp = vp->v_data; - void *private; + zfs_snapentry_t *sep; - private = gfs_dir_inactive(vp); - if (private != NULL) { - ASSERT(avl_numnodes(&sdp->sd_snaps) == 0); - mutex_destroy(&sdp->sd_lock); - avl_destroy(&sdp->sd_snaps); - kmem_free(private, sizeof (zfsctl_snapdir_t)); + /* + * On forced unmount we have to free snapshots from here. + */ + mutex_enter(&sdp->sd_lock); + while ((sep = avl_first(&sdp->sd_snaps)) != NULL) { + avl_remove(&sdp->sd_snaps, sep); + kmem_free(sep->se_name, strlen(sep->se_name) + 1); + kmem_free(sep, sizeof (zfs_snapentry_t)); } + mutex_exit(&sdp->sd_lock); + gfs_dir_inactive(vp); + ASSERT(avl_numnodes(&sdp->sd_snaps) == 0); + mutex_destroy(&sdp->sd_lock); + avl_destroy(&sdp->sd_snaps); + kmem_free(sdp, sizeof (zfsctl_snapdir_t)); + return (0); } @@ -1068,6 +1082,9 @@ zfsctl_snapshot_inactive(ap) int locked; vnode_t *dvp; + if (vp->v_count > 0) + goto end; + VERIFY(gfs_dir_lookup(vp, "..", &dvp, cr, 0, NULL, NULL) == 0); sdp = dvp->v_data; VOP_UNLOCK(dvp, 0); @@ -1075,11 +1092,6 @@ zfsctl_snapshot_inactive(ap) if (!(locked = MUTEX_HELD(&sdp->sd_lock))) mutex_enter(&sdp->sd_lock); - if (vp->v_count > 1) { - if (!locked) - mutex_exit(&sdp->sd_lock); - return (0); - } ASSERT(!vn_ismntpt(vp)); sep = avl_first(&sdp->sd_snaps); @@ -1099,6 +1111,7 @@ zfsctl_snapshot_inactive(ap) if (!locked) mutex_exit(&sdp->sd_lock); VN_RELE(dvp); +end: VFS_RELE(vp->v_vfsp); /* Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c Tue Sep 29 10:53:06 2009 (r197613) @@ -864,7 +864,7 @@ zfs_root(vfs_t *vfsp, int flags, vnode_t znode_t *rootzp; int error; - ZFS_ENTER(zfsvfs); + ZFS_ENTER_NOERROR(zfsvfs); error = zfs_zget(zfsvfs, zfsvfs->z_root, &rootzp); if (error == 0) { @@ -898,6 +898,9 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolea * 'z_parent' is self referential for non-snapshots. */ (void) dnlc_purge_vfsp(zfsvfs->z_parent->z_vfs, 0); +#ifdef FREEBSD_NAMECACHE + cache_purgevfs(zfsvfs->z_parent->z_vfs); +#endif } /* @@ -1027,6 +1030,17 @@ zfs_umount(vfs_t *vfsp, int fflag) ASSERT(zfsvfs->z_ctldir == NULL); } + if (fflag & MS_FORCE) { + /* + * Mark file system as unmounted before calling + * vflush(FORCECLOSE). This way we ensure no future vnops + * will be called and risk operating on DOOMED vnodes. + */ + rrw_enter(&zfsvfs->z_teardown_lock, RW_WRITER, FTAG); + zfsvfs->z_unmounted = B_TRUE; + rrw_exit(&zfsvfs->z_teardown_lock, FTAG); + } + /* * Flush all the files. */ @@ -1093,8 +1107,7 @@ zfs_umount(vfs_t *vfsp, int fflag) if (zfsvfs->z_issnap) { vnode_t *svp = vfsp->mnt_vnodecovered; - ASSERT(svp->v_count == 2 || svp->v_count == 1); - if (svp->v_count == 2) + if (svp->v_count >= 2) VN_RELE(svp); } zfs_freevfs(vfsp); Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c Tue Sep 29 10:53:06 2009 (r197613) @@ -890,17 +890,25 @@ again: if (zp->z_unlinked) { err = ENOENT; } else { - if ((vp = ZTOV(zp)) != NULL) { - VI_LOCK(vp); + int dying = 0; + + vp = ZTOV(zp); + if (vp == NULL) + dying = 1; + else { + VN_HOLD(vp); if ((vp->v_iflag & VI_DOOMED) != 0) { - VI_UNLOCK(vp); - vp = NULL; - } else - VI_UNLOCK(vp); + dying = 1; + /* + * Don't VN_RELE() vnode here, because + * it can call vn_lock() which creates + * LOR between vnode lock and znode + * lock. We will VN_RELE() the vnode + * after droping znode lock. + */ + } } - if (vp != NULL) - VN_HOLD(vp); - else { + if (dying) { if (first) { ZFS_LOG(1, "dying znode detected (zp=%p)", zp); first = 0; @@ -912,6 +920,8 @@ again: dmu_buf_rele(db, NULL); mutex_exit(&zp->z_lock); ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num); + if (vp != NULL) + VN_RELE(vp); tsleep(zp, 0, "zcollide", 1); goto again; } Modified: stable/8/sys/nfsserver/nfs_serv.c ============================================================================== --- stable/8/sys/nfsserver/nfs_serv.c Tue Sep 29 10:50:02 2009 (r197612) +++ stable/8/sys/nfsserver/nfs_serv.c Tue Sep 29 10:53:06 2009 (r197613) @@ -1332,7 +1332,7 @@ nfsrv_create(struct nfsrv_descript *nfsd tl = nfsm_dissect_nonblock(u_int32_t *, NFSX_V3CREATEVERF); /* Unique bytes, endianness is not important. */ - cverf.tv_sec = tl[0]; + cverf.tv_sec = (int32_t)tl[0]; cverf.tv_nsec = tl[1]; exclusive_flag = 1; break;