From: Warner Losh <wlosh@bsdimp.com>
Date: Tue, 8 Dec 2015 10:07:42 -0700
To: Dag-Erling Smørgrav
Cc: "freebsd-hackers@freebsd.org"
Subject: Re: Fwd: DELETE support in the VOP_STRATEGY(9)?
In-Reply-To: <864mfssxgt.fsf@desk.des.no>
References: <201512052002.tB5K2ZEA026540@chez.mckusick.com>
 <86poyhqsdh.fsf@desk.des.no> <86fuzdqjwn.fsf@desk.des.no>
 <864mfssxgt.fsf@desk.des.no>
List-Id: Technical Discussions relating to FreeBSD

On Tue, Dec 8, 2015 at 9:42 AM, Dag-Erling Smørgrav wrote:
> Warner Losh writes:
> > Dag-Erling Smørgrav writes:
> > > Here are a few of our options for implementing FALLOC_FL_PUNCH_HOLE:
> > >
> > > a) create a filesystem-level hole in the disk image;
> > > b) perform a), then issue a BIO_DELETE for the blocks that were
> > >    released;
> > > c) perform a) or b), then zero the overspill if the requested range
> > >    is unaligned;
> > > d) zero the entire range;
> > > e) perform d) followed by either a) or b);
> > > f) nothing at all.
> >
> > I don't think f is an option.
> > Unless it is OK to have random contents after creating a file and
> > seeking some ways into it and writing a byte. When you punch a hole
> > in the file, you should get the same semantics as if you'd written up
> > to just before the hole originally, then skipped to the end of the
> > punched range and written the rest of the file.
>
> I didn't realize there was a spec, so I didn't know what the intended
> semantics were.

I am assuming the semantics are the same as those of the Linux operation
of the same name.

> > You are correct, though, that the decision to issue a BIO_DELETE is
> > between the filesystem and the storage device. This makes a-e
> > possible implementations, but some are stupider than others (which
> > ones depend on the situation).
>
> Each of them except f) is the optimal solution for at least one of the
> 36 cases I outlined, or 18 if you ignore the zvol and device points on
> the first axis.

True.

> > > Discuss the advantages and drawbacks of each option I listed above
> > > for each of the 36 points in the space defined by the following
> > > axes:
> > > [...]
> > > If you think the answer is the same in all cases, you are deluded.
> >
> > That's why these decisions are left to the stack.
>
> Define "stack". Do you mean the entire food chain from the hardware to
> the POSIX filesystem API? By design, no element in the stack has any
> knowledge of any other element, beyond the names and dimensions of its
> immediate consumers and suppliers (I find "producer" ambiguous).

Also true. The stack makes the best choice it can at each level and
passes the rest on down. There are administrative overrides, however,
for the default actions (like sending down a BIO_DELETE).

> > The only semantic that is required by the punch hole operation is
> > that the filesystem return 0's on reads to that range. What the
> > filesystem does to ensure this is up to the filesystem.
> That's easy to say, but each option has advantages and disadvantages
> depending on information which is not necessarily available where it
> is needed. A filesystem-level hole results in fragmentation, which can
> have a huge performance impact on electromechanical storage but is
> negligible on solid-state storage.

It may result in fragmentation. UFS has techniques to cope, however. For
ZFS it doesn't matter: ZFS is log-structured, so writing zeroes would
also produce a non-local copy.

> But the filesystem does not know whether the underlying storage is
> electromechanical or solid-state, nor does it know whether the user
> cares much about seek times (unless we introduce the heuristic "avoid
> creating holes unless the file already has them, in which case the
> userland probably does not care").

Actually, the filesystem does know, or at least has some knowledge of
what is supported and what isn't. BIO_DELETE support is a strong
indicator of flash or some other log-structured device.

> Then again, either the filesystem or the underlying storage *or both*
> may have copy-on-write semantics, in which case zeroing is worse than
> creating a hole.

It may have that implementation. That's what administrative controls are
for.

> BTW, writing zeroes to NAND flash does not require erasing the block.
> I don't know whether SSDs take advantage of that to avoid
> unnecessarily reallocating or erasing a block, nor whether they
> automatically release and erase blocks that end up being completely
> zeroed.

It turns out this hasn't been true for four or five generations of NAND.
You cannot program the pages in a block twice: that's not a supported
operation, and it cannot possibly work with multi-level cell technology.
Flash drives never exploit this fact because it only works with first-
or second-generation single-bit-per-cell technology. These days, NAND
must be written from first page to last.
It is also strongly advised that the dwell time in the erased state be
as short as possible, that the writing be done as quickly as possible,
and that active countermeasures be taken when you can't write a block
within a maximum amount of time (such as garbage-collecting data forward
to meet the timing constraints). You can violate these guidelines from
time to time, but then storage retention cannot be guaranteed to meet
vendor specs. The firmware in recent generations of planar NAND makes
certain assumptions to maximize the life of the data, and if you violate
them you'll get bit errors: NAND cells are best thought of as tiny
capacitors which lose charge over time, so giving the data time to decay
before the neighboring cells are programmed (which disturbs the earlier
cells in a predictable way) corrupts it. And the new 3-D NAND brings a
whole new set of constraints, different from those that plagued the old
planar arrangements.

SSDs have a log structure under the covers anyway. As you write data, it
gets laid down into a log. When you write new data over old, a note is
made that the old data is obsolete, but it isn't erased until either the
entire erase block that held it is invalidated, or the data is garbage-
collected forward to a new page of NAND in a new erase block. A
BIO_DELETE turns into some flavor of TRIM or DELETE operation under the
covers. The drive's FTL (Flash Translation Layer) then uses this to
invalidate blocks in its internal tables. Maybe this will actually erase
the data, maybe not, but that operation is decoupled from the TRIM
completing. The invalidation occurs both in some kind of 'used' mask and
in the LBA-to-physical mapping, so that when requests for the LBA come
in later, all 0's can be returned (on newer drives complying with the
latest standards).

SD cards are a whole other domain of hacks and optimizations.
Since I don't have direct experience making them or writing their
firmware, I can't comment on what those are, other than to say that
telling the drive that contents are no longer needed, or that you are
about to write multiple blocks, can help it out a lot.

As technologies evolve, we should take advantage of them. The old
assumptions, like "writing 0's to NAND can be done without erasing", may
prove to be unwise. This is why we let the drive tell us about its
support for TRIM-like technologies and let the filesystem decide when
the best time to do that operation is. Sometimes these new technologies
can map to existing facilities; other times they can't.

Warner