From owner-svn-src-head@FreeBSD.ORG Mon Apr 18 20:14:55 2011 Return-Path: Delivered-To: svn-src-head@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9E7A9106564A; Mon, 18 Apr 2011 20:14:55 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-wy0-f182.google.com (mail-wy0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id D2B538FC14; Mon, 18 Apr 2011 20:14:54 +0000 (UTC) Received: by wyf23 with SMTP id 23so5224170wyf.13 for ; Mon, 18 Apr 2011 13:14:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=ZoEtvvR6L/yqyCKCwxtYIkC2T1oZrHOMBiLLYfC/dSw=; b=lHE+fgm7yl1jig5s/Q6AhFSCmat9Eh/qzkR+OJwNlX8XHZCc4+iNTDe56WYS1z+BES pLTMjf6FzXlo/O7fDQnQH1oP+c1lzf5I9EzpgDWqSaVrQfJhvGAuCTXL4F4Swk4yQv0G hRzcapIfw4/YxPX/ty8LmseT6pCzR0CClX5UA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=f3dfx+jIWmXRA3ynHXHhlNEdFTOtDgd+BrMmPcH8C/sObK/moEAewVaXUt6faY3Aj+ FuhgmZZ2EaBZ3BKsFcU4oFc5HQ7sJ+axgOhTAEwhajuup6Cd4qafGIKN+T/WE4JbYrK7 QzDnmUDnJw6su935vLZi72oKqbLuDZhOnOpNM= MIME-Version: 1.0 Received: by 10.216.120.144 with SMTP id p16mr357944weh.10.1303156178891; Mon, 18 Apr 2011 12:49:38 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.216.9.67 with HTTP; Mon, 18 Apr 2011 12:49:38 -0700 (PDT) In-Reply-To: <20110418192810.GX48734@deviant.kiev.zoral.com.ua> References: <201104181632.p3IGWM5v077720@svn.freebsd.org> <20110418192810.GX48734@deviant.kiev.zoral.com.ua> Date: Mon, 18 Apr 2011 12:49:38 -0700 X-Google-Sender-Auth: eUOV2by3w2lDNGjdOa7DtuY8uA4 Message-ID: From: mdf@FreeBSD.org To: Kostik Belousov Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Apr 2011 20:14:55 -0000 On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov wro= te: > On Mon, Apr 18, 2011 at 04:32:22PM +0000, Matthew D Fleming wrote: >> Author: mdf >> Date: Mon Apr 18 16:32:22 2011 >> New Revision: 220791 >> URL: http://svn.freebsd.org/changeset/base/220791 >> >> Log: >> =A0 Add the posix_fallocate(2) syscall. =A0The default implementation in >> =A0 vop_stdallocate() is filesystem agnostic and will run as slow as a >> =A0 read/write loop in userspace; however, it serves to correctly >> =A0 implement the functionality for filesystems that do not implement a >> =A0 VOP_ALLOCATE. >> >> =A0 Note that __FreeBSD_version was already bumped today to 900036 for a= ny >> =A0 ports which would like to use this function. >> >> =A0 Also reserve space in the syscall table for posix_fadvise(2). >> +#ifdef __notyet__ >> + =A0 =A0 /* >> + =A0 =A0 =A0* Check if the filesystem sets f_maxfilesize; if not use >> + =A0 =A0 =A0* VOP_SETATTR to perform the check. >> + =A0 =A0 =A0*/ >> + =A0 =A0 error =3D VFS_STATFS(vp->v_mount, &sfs, td); >> + =A0 =A0 if (error !=3D 0) >> + =A0 =A0 =A0 =A0 =A0 =A0 goto out; >> + =A0 =A0 if (sfs.f_maxfilesize) { >> + =A0 =A0 =A0 =A0 =A0 =A0 if (offset > sfs.f_maxfilesize || len > sfs.f_= maxfilesize || >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 offset + len > sfs.f_maxfilesize) { >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 error =3D EFBIG; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out; >> + =A0 =A0 =A0 =A0 =A0 =A0 } >> + =A0 =A0 } else >> +#endif >> + =A0 =A0 if (offset + len > vap->va_size) { >> + =A0 =A0 =A0 =A0 =A0 =A0 VATTR_NULL(vap); >> + =A0 =A0 =A0 =A0 =A0 =A0 vap->va_size =3D offset + len; >> + =A0 =A0 =A0 =A0 =A0 =A0 error =3D VOP_SETATTR(vp, vap, td->td_ucred); >> + =A0 =A0 =A0 =A0 =A0 =A0 if (error !=3D 0) >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out; >> + =A0 =A0 } > I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will > do auto-extend as needed. Also, see below. The need is, as commented, to return EFBIG when the new file size will be larger than the FS supports. Without this code, passing in something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem out of space. >> + >> + =A0 =A0 while (len > 0) { >> + =A0 =A0 =A0 =A0 =A0 =A0 if (should_yield()) { >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 VOP_UNLOCK(vp, 0); >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 locked =3D 0; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 kern_yield(-1); > Please note that, despite dropping the vnode lock, the snapshot creation > is still blocked at this point, due to previous vn_start_write(). > > If doing vn_finished_write() there, then bwillwrite() before > next iteration is desired. >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 error =3D vn_lock(vp, LK_EXCLU= SIVE); >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (error !=3D 0) >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 locked =3D 1; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 error =3D VOP_GETATTR(vp, vap,= td->td_ucred); >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (error !=3D 0) >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break; >> + =A0 =A0 =A0 =A0 =A0 =A0 } >> + >> + =A0 =A0 =A0 =A0 =A0 =A0 /* >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0* Read and write back anything below the no= minal file >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0* size. =A0There's currently no way outside= the filesystem >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0* to know whether this area is sparse or no= t. >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0*/ >> + =A0 =A0 =A0 =A0 =A0 =A0 cur =3D iosize; >> + =A0 =A0 =A0 =A0 =A0 =A0 if ((offset % iosize) !=3D 0) >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 cur -=3D (offset % iosize); >> + =A0 =A0 =A0 =A0 =A0 =A0 if (cur > len) >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 cur =3D len; >> + =A0 =A0 =A0 =A0 =A0 =A0 if (offset < vap->va_size) { >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 aiov.iov_base =3D buf; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 aiov.iov_len =3D cur; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 auio.uio_iov =3D &aiov; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 auio.uio_iovcnt =3D 1; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 auio.uio_offset =3D offset; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 auio.uio_resid =3D cur; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 auio.uio_segflg =3D UIO_SYSSPA= CE; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 auio.uio_rw =3D UIO_READ; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 auio.uio_td =3D td; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 error =3D VOP_READ(vp, &auio, = 0, td->td_ucred); >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (error !=3D 0) >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break; >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (auio.uio_resid > 0) { >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 bzero(buf + cu= r - auio.uio_resid, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 auio.u= io_resid); >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 } >> + =A0 =A0 =A0 =A0 =A0 =A0 } else { >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 bzero(buf, cur); >> + =A0 =A0 =A0 =A0 =A0 =A0 } > Wouldn't VOP_SETATTR() at the start of the function mostly prevent > this bzero from executing ? Yes. If struct statfs had a member indicating the file system's max file size, then the extend wouldn't be necessary. We have that feature locally, but it's only implemented for ufs and our custom file system, and it requires an ABI change so it's a bit of work to upstream. And as with most of those things, it's hard to find the time to upstream it outside of work hours. > I estimated what it would take to do the optimized implementation for UFS= , > and I think that the following change would allow to lessen the code > duplication much. > > What if the vnode lock drop and looping be handled by the syscall, instea= d > of the vop implementation ? In other words, allow the VOP_ALLOCATE() > to =A0allocate less then requested, and return the allocated amount to > the caller. The loop would be centralized then, freeing fs from doing > the dance. Also, if fs considers that suitable, it would do a whole > allocation in one run. I like this idea. Perhaps input to the vop should be pointers to offset and len, and the vop can change them as it iterates? Otherwise the return code must be overloaded to distinguish between an error code and the len that has been handled so far. And, I think the VOP interface must return int, so there would be no way to indicate that more than 2GB had been allocated. So something in the kernel like: while ((error =3D VOP_ALLOCATE(vp, &offset, &len)) =3D=3D EAGAIN) { VOP_UNLOCK(vp, 0); /* XXX unlock other resources */ maybe_yield(); bwillwrite(); /* XXX vn_start_write dance */ error =3D VOP_LOCK(vp, LK_EXCLUSIVE); } Cheers, matthew