From: mdf@FreeBSD.org
Date: Fri, 15 Apr 2011 09:28:01 -0700
To: Gleb Kurtsou
Cc: FreeBSD Arch
Subject: Re: posix_fallocate(2)

On Fri, Apr 15, 2011 at 3:54 AM, Gleb Kurtsou wrote:
> On (14/04/2011 15:41), mdf@FreeBSD.org wrote:
>> On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou wrote:
>> > On (14/04/2011 12:35), mdf@FreeBSD.org wrote:
>> >> For work we need functionality in our filesystem that is pretty much
>> >> like posix_fallocate(2), so we're using the name and I've added a
>> >> default VOP_ALLOCATE definition that does the right, but dumb, thing.
>> >>
>> >> The most recent mention of this function in FreeBSD was another thread
>> >> lamenting its failure to exist:
>> >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html
>> >>
>> >> The attached files are the core of the kernel implementation of the
>> >> syscall and a default VOP for any filesystem not supporting
>> >> VOP_ALLOCATE, which allows the syscall to work as expected but in a
>> >> non-performant manner.  I didn't see this syscall in NetBSD or
>> >> OpenBSD, so I plan to add it to the end of our syscall table.
>> >>
>> >> What I wanted to check with -arch about was:
>> >>
>> >> 1) is there still a desire for this syscall?
>> > It does not look like it plays well architecturally with modern COW
>> > file systems like ZFS and HAMMER. So potentially it can be implemented
>> > only for UFS.
>>
>> The syscall, or the dumb implementation?  I don't see why the syscall
>> itself would be a problem; presumably ZFS can figure out whether an
>> fallocate() block is worth COWing or not...
> It is good to have if there is a chance to get a real implementation for
> UFS. Having only a dumb implementation will fool user software into
> thinking that we support it.
>
> As far as I understand, ZFS caches a large chunk of changes and then
> writes all of them at once. I doubt blocks can be preallocated. You
> preallocate a block, it's marked as used in the file system's metadata,
> changes to the metadata are written to disk -- it results in
> inconsistency because the preallocated block is marked as "used" in the
> metadata and thus can't be overwritten. I might be absolutely wrong; ZFS
> experts are better placed to answer this. Grepping reveals no fallocate
> support in ZFS.
>
>> >> 2) is this naive implementation useful enough to serve as a default
>> >> for all filesystems until someone with more knowledge fills them in?
>> > Maillist ate the patch. Only man page attached.
>>
>> Whoops!
>>
>> http://people.freebsd.org/~mdf/bsd-fallocate.diff
> What was the performance impact on copying large files?

I don't know and I don't care. :-)

Specifically, one problem is that there is no file-system
implementation of "copy"; copy is implemented in userspace with read(2)
then write(2).  If the caller says posix_fallocate() then they want
blocks.  If copying a large file is slower after that, well, they asked
for it.

This implementation only meets the spec; it's not meant to be optimal.
An optimal VOP_WRITE() implementation may check that e.g. the next
block on write is all zero, and so will make a new logical-zero block
in the same manner as VOP_ALLOCATE.  This is up to each filesystem.

> I had sparse file support in PEFS implemented in a similar way.

posix_fallocate() is specifically to *not* have a sparse file.

> Performance was terrible, vm
> and buf caches were saturated first by writing huge chunks of zeros and
> then by mmap'ing and writing actual data. sched_yield() and/or vnode
> lock/unlock didn't improve interactive performance either.
>
> Why wouldn't you just call VOP_SETATTR(newsize) in the dumb
> implementation? File systems expect such behavior; cp has been using
> mmap for a while already.

VOP_SETATTR(newsize) could truncate, if e.g. the file is already large
and sparse and the fallocate(2) was to provide guaranteed storage only
to the first 1MB.

Thanks,
matthew
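
The truncation point can be checked from userspace. What follows is only a
rough sketch, not the kernel patch: it uses Python's os.posix_fallocate() and
os.ftruncate(), and it assumes that ftruncate() is a fair userspace stand-in
for a size-only VOP_SETATTR(newsize). The helper names (size_after,
reserve_with_fallocate, reserve_with_setsize) are made up for illustration,
and it assumes the temporary directory is on a filesystem where
posix_fallocate() succeeds.

```python
import os
import tempfile

MB = 1024 * 1024


def size_after(reserve):
    """Create a 4 MB sparse file, apply `reserve` to its first 1 MB,
    and return the resulting file size in bytes."""
    fd, path = tempfile.mkstemp()
    try:
        # Large, sparse file: the size is set but no data is written.
        os.ftruncate(fd, 4 * MB)
        reserve(fd, MB)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)


def reserve_with_fallocate(fd, length):
    # Asks for guaranteed storage for bytes [0, length). Per POSIX the
    # file size only changes if offset + len exceeds the current size.
    os.posix_fallocate(fd, 0, length)


def reserve_with_setsize(fd, length):
    # Hypothetical size-only approach: sets the file size to `length`,
    # which shrinks the file when it is already longer than that.
    os.ftruncate(fd, length)


fallocate_size = size_after(reserve_with_fallocate)
setsize_size = size_after(reserve_with_setsize)
print("posix_fallocate:", fallocate_size // MB, "MB")
print("set size only:  ", setsize_size // MB, "MB")
```

With posix_fallocate() the file should stay at its original 4 MB, while the
size-only call cuts it down to 1 MB and discards everything past that point.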