Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 18 Apr 2011 23:13:02 +0300
From:      Kostik Belousov <kostikbel@gmail.com>
To:        mdf@freebsd.org
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys
Message-ID:  <20110418201302.GY48734@deviant.kiev.zoral.com.ua>
In-Reply-To: <BANLkTi=A-C6PY%2B5VSbigQpLKLVJYQ2b2Mg@mail.gmail.com>
References:  <201104181632.p3IGWM5v077720@svn.freebsd.org> <20110418192810.GX48734@deviant.kiev.zoral.com.ua> <BANLkTi=A-C6PY%2B5VSbigQpLKLVJYQ2b2Mg@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help

--UC+RhZhEc8lcmajv
Content-Type: text/plain; charset=koi8-r
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Mon, Apr 18, 2011 at 12:49:38PM -0700, mdf@freebsd.org wrote:
> On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov <kostikbel@gmail.com> w=
rote:
> > On Mon, Apr 18, 2011 at 04:32:22PM +0000, Matthew D Fleming wrote:
> >> Author: mdf
> >> Date: Mon Apr 18 16:32:22 2011
> >> New Revision: 220791
> >> URL: http://svn.freebsd.org/changeset/base/220791
> >>
> >> Log:
> >> =9A Add the posix_fallocate(2) syscall. =9AThe default implementation =
in
> >> =9A vop_stdallocate() is filesystem agnostic and will run as slow as a
> >> =9A read/write loop in userspace; however, it serves to correctly
> >> =9A implement the functionality for filesystems that do not implement a
> >> =9A VOP_ALLOCATE.
> >>
> >> =9A Note that __FreeBSD_version was already bumped today to 900036 for=
 any
> >> =9A ports which would like to use this function.
> >>
> >> =9A Also reserve space in the syscall table for posix_fadvise(2).
>=20
> >> +#ifdef __notyet__
> >> + =9A =9A /*
> >> + =9A =9A =9A* Check if the filesystem sets f_maxfilesize; if not use
> >> + =9A =9A =9A* VOP_SETATTR to perform the check.
> >> + =9A =9A =9A*/
> >> + =9A =9A error =3D VFS_STATFS(vp->v_mount, &sfs, td);
> >> + =9A =9A if (error !=3D 0)
> >> + =9A =9A =9A =9A =9A =9A goto out;
> >> + =9A =9A if (sfs.f_maxfilesize) {
> >> + =9A =9A =9A =9A =9A =9A if (offset > sfs.f_maxfilesize || len > sfs.=
f_maxfilesize ||
> >> + =9A =9A =9A =9A =9A =9A =9A =9A offset + len > sfs.f_maxfilesize) {
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A error =3D EFBIG;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A goto out;
> >> + =9A =9A =9A =9A =9A =9A }
> >> + =9A =9A } else
> >> +#endif
> >> + =9A =9A if (offset + len > vap->va_size) {
> >> + =9A =9A =9A =9A =9A =9A VATTR_NULL(vap);
> >> + =9A =9A =9A =9A =9A =9A vap->va_size =3D offset + len;
> >> + =9A =9A =9A =9A =9A =9A error =3D VOP_SETATTR(vp, vap, td->td_ucred);
> >> + =9A =9A =9A =9A =9A =9A if (error !=3D 0)
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A goto out;
> >> + =9A =9A }
> > I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will
> > do auto-extend as needed. Also, see below.
>=20
> The need is, as commented, to return EFBIG when the new file size will
> be larger than the FS supports.  Without this code, passing in
> something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem
> out of space.
Handling max file size and not overflowing the fs are different things.
VOP_WRITE() will handle file size on its own too. I see no problem with
exhausting free space if this is what user asked for.

>=20
> >> +
> >> + =9A =9A while (len > 0) {
> >> + =9A =9A =9A =9A =9A =9A if (should_yield()) {
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A VOP_UNLOCK(vp, 0);
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A locked =3D 0;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A kern_yield(-1);
> > Please note that, despite dropping the vnode lock, the snapshot creation
> > is still blocked at this point, due to previous vn_start_write().
> >
> > If doing vn_finished_write() there, then bwillwrite() before
> > next iteration is desired.
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A error =3D vn_lock(vp, LK_EXC=
LUSIVE);
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A if (error !=3D 0)
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A break;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A locked =3D 1;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A error =3D VOP_GETATTR(vp, va=
p, td->td_ucred);
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A if (error !=3D 0)
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A break;
> >> + =9A =9A =9A =9A =9A =9A }
> >> +
> >> + =9A =9A =9A =9A =9A =9A /*
> >> + =9A =9A =9A =9A =9A =9A =9A* Read and write back anything below the =
nominal file
> >> + =9A =9A =9A =9A =9A =9A =9A* size. =9AThere's currently no way outsi=
de the filesystem
> >> + =9A =9A =9A =9A =9A =9A =9A* to know whether this area is sparse or =
not.
> >> + =9A =9A =9A =9A =9A =9A =9A*/
> >> + =9A =9A =9A =9A =9A =9A cur =3D iosize;
> >> + =9A =9A =9A =9A =9A =9A if ((offset % iosize) !=3D 0)
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A cur -=3D (offset % iosize);
> >> + =9A =9A =9A =9A =9A =9A if (cur > len)
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A cur =3D len;
> >> + =9A =9A =9A =9A =9A =9A if (offset < vap->va_size) {
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A aiov.iov_base =3D buf;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A aiov.iov_len =3D cur;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_iov =3D &aiov;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_iovcnt =3D 1;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_offset =3D offset;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_resid =3D cur;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_segflg =3D UIO_SYSS=
PACE;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_rw =3D UIO_READ;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio.uio_td =3D td;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A error =3D VOP_READ(vp, &auio=
, 0, td->td_ucred);
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A if (error !=3D 0)
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A break;
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A if (auio.uio_resid > 0) {
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A bzero(buf + =
cur - auio.uio_resid,
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A auio=
.uio_resid);
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A }
> >> + =9A =9A =9A =9A =9A =9A } else {
> >> + =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A bzero(buf, cur);
> >> + =9A =9A =9A =9A =9A =9A }
> > Wouldn't VOP_SETATTR() at the start of the function mostly prevent
> > this bzero from executing ?
>=20
> Yes.  If struct statfs had a member indicating the file system's max
> file size, then the extend wouldn't be necessary.  We have that
> feature locally, but it's only implemented for ufs and our custom file
> system, and it requires an ABI change so it's a bit of work to
> upstream.  And as with most of those things, it's hard to find the
> time to upstream it outside of work hours.
>=20
> > I estimated what it would take to do the optimized implementation for U=
FS,
> > and I think that the following change would allow to lessen the code
> > duplication much.
> >
> > What if the vnode lock drop and looping be handled by the syscall, inst=
ead
> > of the vop implementation ? In other words, allow the VOP_ALLOCATE()
> > to =9Aallocate less then requested, and return the allocated amount to
> > the caller. The loop would be centralized then, freeing fs from doing
> > the dance. Also, if fs considers that suitable, it would do a whole
> > allocation in one run.
>=20
> I like this idea.  Perhaps input to the vop should be pointers to
> offset and len, and the vop can change them as it iterates? Otherwise
> the return code must be overloaded to distinguish between an error
> code and the len that has been handled so far.  And, I think the VOP
> interface must return int, so there would be no way to indicate that
> more than 2GB had been allocated.
>=20
> So something in the kernel like:
> 	while ((error =3D VOP_ALLOCATE(vp, &offset, &len)) =3D=3D EAGAIN) {
> 		VOP_UNLOCK(vp, 0);
> 		/* XXX unlock other resources */
> 		maybe_yield();
> 		bwillwrite();
> 		/* XXX vn_start_write dance */
> 		error =3D VOP_LOCK(vp, LK_EXCLUSIVE);
> 	}
Exactly. I would not even bother to return EAGAIN, just 0, and the need
to retry the loop can be determined by len being non-zero.

Also, by rearranging the loop, we can avoid the duplication of
calls to bwillwrite, vn_start_write, vn_lock etc.

	for (;;) {
		/* All error handling is removed for convenience */
		bwillwrite();
		vn_start_write();
		vn_lock();
		VOP_ALLOCATE(vp, &offset, &len);
		VOP_UNLOCK();
		vn_finished_write();
		if (len =3D=3D 0)
			break;
		yield();
	}
>=20
> Cheers,
> matthew

--UC+RhZhEc8lcmajv
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (FreeBSD)

iEYEARECAAYFAk2sm04ACgkQC3+MBN1Mb4g83wCgt/Wfj5af9rhuGwPP6287phXX
p0YAn1rtl6ou7yaYNd7dDocwsRVEZ8Ny
=VatD
-----END PGP SIGNATURE-----

--UC+RhZhEc8lcmajv--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20110418201302.GY48734>