From owner-svn-src-all@FreeBSD.ORG Mon Apr 18 20:13:07 2011 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 59F88106566C; Mon, 18 Apr 2011 20:13:07 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id B519A8FC17; Mon, 18 Apr 2011 20:13:06 +0000 (UTC) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id p3IKD2Ff054822 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 18 Apr 2011 23:13:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4) with ESMTP id p3IKD2qt027527; Mon, 18 Apr 2011 23:13:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4/Submit) id p3IKD2d1027526; Mon, 18 Apr 2011 23:13:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 18 Apr 2011 23:13:02 +0300 From: Kostik Belousov To: mdf@freebsd.org Message-ID: <20110418201302.GY48734@deviant.kiev.zoral.com.ua> References: <201104181632.p3IGWM5v077720@svn.freebsd.org> <20110418192810.GX48734@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="UC+RhZhEc8lcmajv" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-3.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00, DNS_FROM_OPENWHOIS autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Apr 2011 20:13:07 -0000 --UC+RhZhEc8lcmajv Content-Type: text/plain; charset=koi8-r Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Apr 18, 2011 at 12:49:38PM -0700, mdf@freebsd.org wrote: > On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov w= rote: > > On Mon, Apr 18, 2011 at 04:32:22PM +0000, Matthew D Fleming wrote: > >> Author: mdf > >> Date: Mon Apr 18 16:32:22 2011 > >> New Revision: 220791 > >> URL: http://svn.freebsd.org/changeset/base/220791 > >> > >> Log: > >> =9A Add the posix_fallocate(2) syscall. =9AThe default implementation = in > >> =9A vop_stdallocate() is filesystem agnostic and will run as slow as a > >> =9A read/write loop in userspace; however, it serves to correctly > >> =9A implement the functionality for filesystems that do not implement a > >> =9A VOP_ALLOCATE. > >> > >> =9A Note that __FreeBSD_version was already bumped today to 900036 for= any > >> =9A ports which would like to use this function. > >> > >> =9A Also reserve space in the syscall table for posix_fadvise(2). >=20 > >> +#ifdef __notyet__ > >> + =9A =9A /* > >> + =9A =9A =9A* Check if the filesystem sets f_maxfilesize; if not use > >> + =9A =9A =9A* VOP_SETATTR to perform the check. > >> + =9A =9A =9A*/ > >> + =9A =9A error =3D VFS_STATFS(vp->v_mount, &sfs, td); > >> + =9A =9A if (error !=3D 0) > >> + =9A =9A =9A =9A =9A =9A goto out; > >> + =9A =9A if (sfs.f_maxfilesize) { > >> + =9A =9A =9A =9A =9A =9A if (offset > sfs.f_maxfilesize || len > sfs.= f_maxfilesize || > >> + =9A =9A =9A =9A =9A =9A =9A =9A offset + len > sfs.f_maxfilesize) { > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A error =3D EFBIG; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A goto out; > >> + =9A =9A =9A =9A =9A =9A } > >> + =9A =9A } else > >> +#endif > >> + =9A =9A if (offset + len > vap->va_size) { > >> + =9A =9A =9A =9A =9A =9A VATTR_NULL(vap); > >> + =9A =9A =9A =9A =9A =9A vap->va_size =3D offset + len; > >> + =9A =9A =9A =9A =9A =9A error =3D VOP_SETATTR(vp, vap, td->td_ucred); > >> + =9A =9A =9A =9A =9A =9A if (error !=3D 0) > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A goto out; > >> + =9A =9A } > > I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will > > do auto-extend as needed. Also, see below. >=20 > The need is, as commented, to return EFBIG when the new file size will > be larger than the FS supports. Without this code, passing in > something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem > out of space. Handling max file size and not overflowing the fs are different things. VOP_WRITE() will handle file size on its own too. I see no problem with exhausting free space if this is what user asked for. >=20 > >> + > >> + =9A =9A while (len > 0) { > >> + =9A =9A =9A =9A =9A =9A if (should_yield()) { > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A VOP_UNLOCK(vp, 0); > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A locked =3D 0; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A kern_yield(-1); > > Please note that, despite dropping the vnode lock, the snapshot creation > > is still blocked at this point, due to previous vn_start_write(). > > > > If doing vn_finished_write() there, then bwillwrite() before > > next iteration is desired. > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A error =3D vn_lock(vp, LK_EXC= LUSIVE); > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A if (error !=3D 0) > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A break; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A locked =3D 1; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A error =3D VOP_GETATTR(vp, va= p, td->td_ucred); > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A if (error !=3D 0) > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A break; > >> + =9A =9A =9A =9A =9A =9A } > >> + > >> + =9A =9A =9A =9A =9A =9A /* > >> + =9A =9A =9A =9A =9A =9A =9A* Read and write back anything below the = nominal file > >> + =9A =9A =9A =9A =9A =9A =9A* size. =9AThere's currently no way outsi= de the filesystem > >> + =9A =9A =9A =9A =9A =9A =9A* to know whether this area is sparse or = not. > >> + =9A =9A =9A =9A =9A =9A =9A*/ > >> + =9A =9A =9A =9A =9A =9A cur =3D iosize; > >> + =9A =9A =9A =9A =9A =9A if ((offset % iosize) !=3D 0) > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A cur -=3D (offset % iosize); > >> + =9A =9A =9A =9A =9A =9A if (cur > len) > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A cur =3D len; > >> + =9A =9A =9A =9A =9A =9A if (offset < vap->va_size) { > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A aiov.iov_base =3D buf; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A aiov.iov_len =3D cur; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_iov =3D &aiov; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_iovcnt =3D 1; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_offset =3D offset; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_resid =3D cur; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_segflg =3D UIO_SYSS= PACE; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_rw =3D UIO_READ; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_td =3D td; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A error =3D VOP_READ(vp, &auio= , 0, td->td_ucred); > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A if (error !=3D 0) > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A break; > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A if (auio.uio_resid > 0) { > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A bzero(buf + = cur - auio.uio_resid, > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio= .uio_resid); > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A } > >> + =9A =9A =9A =9A =9A =9A } else { > >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A bzero(buf, cur); > >> + =9A =9A =9A =9A =9A =9A } > > Wouldn't VOP_SETATTR() at the start of the function mostly prevent > > this bzero from executing ? >=20 > Yes. If struct statfs had a member indicating the file system's max > file size, then the extend wouldn't be necessary. We have that > feature locally, but it's only implemented for ufs and our custom file > system, and it requires an ABI change so it's a bit of work to > upstream. And as with most of those things, it's hard to find the > time to upstream it outside of work hours. >=20 > > I estimated what it would take to do the optimized implementation for U= FS, > > and I think that the following change would allow to lessen the code > > duplication much. > > > > What if the vnode lock drop and looping be handled by the syscall, inst= ead > > of the vop implementation ? In other words, allow the VOP_ALLOCATE() > > to =9Aallocate less then requested, and return the allocated amount to > > the caller. The loop would be centralized then, freeing fs from doing > > the dance. Also, if fs considers that suitable, it would do a whole > > allocation in one run. >=20 > I like this idea. Perhaps input to the vop should be pointers to > offset and len, and the vop can change them as it iterates? Otherwise > the return code must be overloaded to distinguish between an error > code and the len that has been handled so far. And, I think the VOP > interface must return int, so there would be no way to indicate that > more than 2GB had been allocated. >=20 > So something in the kernel like: > while ((error =3D VOP_ALLOCATE(vp, &offset, &len)) =3D=3D EAGAIN) { > VOP_UNLOCK(vp, 0); > /* XXX unlock other resources */ > maybe_yield(); > bwillwrite(); > /* XXX vn_start_write dance */ > error =3D VOP_LOCK(vp, LK_EXCLUSIVE); > } Exactly. I would not even bother to return EAGAIN, just 0, and the need to retry the loop can be determined by len being non-zero. Also, by rearranging the loop, we can avoid the duplication of calls to bwillwrite, vn_start_write, vn_lock etc. for (;;) { /* All error handling is removed for convenience */ bwillwrite(); vn_start_write(); vn_lock(); VOP_ALLOCATE(vp, &offset, &len); VOP_UNLOCK(); vn_finished_write(); if (len =3D=3D 0) break; yield(); } >=20 > Cheers, > matthew --UC+RhZhEc8lcmajv Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (FreeBSD) iEYEARECAAYFAk2sm04ACgkQC3+MBN1Mb4g83wCgt/Wfj5af9rhuGwPP6287phXX p0YAn1rtl6ou7yaYNd7dDocwsRVEZ8Ny =VatD -----END PGP SIGNATURE----- --UC+RhZhEc8lcmajv--