From: Dag-Erling Smørgrav
To: Maxim Sobolev
Cc: Kirk McKusick, Pawel Jakub Dawidek, Warner Losh, freebsd-hackers@freebsd.org
Subject: Re: DELETE support in the VOP_STRATEGY(9)?
Date: Tue, 08 Dec 2015 12:06:32 +0100

Maxim Sobolev writes:
> Dag-Erling Smørgrav writes:
> > 1) why did you take this off the list?
> There was a complaint from list admin about this being off-topic.

Yes, and Eitan moved the discussion to hackers@.  It should have stayed
there.

> > 2) why did you even bother to cc: me if you were going to completely
> > ignore everything I said anyway?
> I did not really ignore it, it's just that I did not have much to reply
> at that point.  [...]  Basically I don't think your concerns wrt DELETE
> reliability/guarantees have much to do with this particular feature.
> The reason being that BIO_DELETE essentially tells the storage layer
> that whichever code "owns" the block in question (e.g. ZFS or UFS)
> has moved it into the free pool and will NEVER ever want to read its
> value back again (until it's written into again).

No, it means that the contents of that block are no longer important and
that the lower layers *may* reclaim it.  It does not mean that nobody
will ever try to read the block, nor does it guarantee that the block
will actually be reclaimed or zeroed.  We cannot rely on the lower
layers to ensure that reading from a previously deleted block never
returns data that may have belonged to a different file.

BTW, I've encountered CF cards (including the SanDisk card in my home
router) that freeze if issued a TRIM command.  Furthermore, many CF, MMC
and SD cards, especially those marketed for use in digital cameras,
perform wear leveling "automagically" based on their own understanding
of the filesystem layout, and will therefore work poorly with anything
other than FAT (Kingston call it "optimized recording performance" in
their marketing literature).

> Technically speaking, on 100% correctly working OS/hardware, an attempt
> to read a block after it's been successfully BIO_DELETE'd could produce
> an exception of some sort without any ill effects.

If that were the case, it would never be safe to do

    # dd if=/dev/da0 of=/dev/da1 bs=4096 conv=sparse

which I'm sure you'll agree is not acceptable.

> [...] in this particular case of VOP_ALLOCATE(FALLOC_FL_PUNCH_HOLE), a
> filesystem in question is responsible for making sure the range that
> has been punched through reads 0, whether by making a real logical hole
> in the file and/or by padding it with zeroes as needed.

Is it really?
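
To make the dd objection above concrete, here is a toy Python model.  It
is not FreeBSD code: the ToyDisk class and the sparse_copy function are
invented for illustration.  It treats a delete as advisory, so the device
may or may not reclaim the block, and a later read returns either zeros
or the stale contents but never fails.  The copy loop mimics what a
sparse copy has to do: read every source block, deleted or not, and skip
writing only those that come back as all zeros.

```python
BLOCK = 4096  # block size in bytes, matching bs=4096 in the dd example


class ToyDisk:
    """Hypothetical in-memory block device; not a model of any real driver."""

    def __init__(self, nblocks, honours_delete):
        self.blocks = [bytes(BLOCK) for _ in range(nblocks)]
        self.honours_delete = honours_delete

    def write(self, n, data):
        assert len(data) == BLOCK
        self.blocks[n] = bytes(data)

    def delete(self, n):
        # Advisory hint: the device *may* reclaim the block, or ignore it.
        if self.honours_delete:
            self.blocks[n] = bytes(BLOCK)

    def read(self, n):
        # Always succeeds, whether or not the block was deleted earlier.
        return self.blocks[n]


def sparse_copy(src, dst):
    """Copy src to dst, skipping all-zero blocks; return blocks written."""
    written = 0
    for n in range(len(src.blocks)):
        data = src.read(n)   # deleted blocks are read like any other
        if any(data):        # at least one non-zero byte
            dst.write(n, data)
            written += 1
    return written
```

With one source that zeroes deleted blocks and one that ignores the hint,
the copy writes a different number of blocks but completes in both cases.
If read() were allowed to raise for a deleted block, the loop could not
finish at all.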
Here are a few of our options for implementing FALLOC_FL_PUNCH_HOLE:

 a) create a filesystem-level hole in the disk image;
 b) perform a), then issue a BIO_DELETE for the blocks that were
    released;
 c) perform a) or b), then zero the overspill if the requested range is
    unaligned;
 d) zero the entire range;
 e) perform d) followed by either a) or b);
 f) nothing at all.

Now, consider the case of the guest OS in a VM issuing TRIM commands to
the emulated storage controller, which the hypervisor translates into a
FALLOC_FL_PUNCH_HOLE request for the corresponding range in the disk
image.  Discuss the advantages and drawbacks of each option I listed
above for each of the 36 points in the space defined by the following
axes:

 - The disk image is:
   - a preallocated file on a filesystem (or an md(4) device backed by a
     preallocated file)
   - a dynamically allocated file on a filesystem (or an md(4) device
     backed by an unallocated file)
   - a zvol
   - a device

 - The underlying storage's preferred block size is:
   - small (e.g. 4 kB sectors on an AF drive)
   - medium (e.g. 64 kB stripes on a RAID)
   - large (e.g. 1 MB erase blocks on an SSD)

 - The physical storage is:
   - volatile
   - solid-state
   - electromechanical

If you think the answer is the same in all cases, you are deluded.

DES
-- 
Dag-Erling Smørgrav - des@des.no
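
As a rough illustration of option c) and of the size of the space
described above, the following Python sketch splits a requested byte
range into the part that covers whole blocks (which could become a hole
or be passed down as a delete) and the unaligned head and tail that would
have to be zeroed explicitly.  The function name split_range, the
byte-offset convention and the short axis labels are assumptions made for
the example, not anything taken from FreeBSD.

```python
from itertools import product


def split_range(offset, length, blksize):
    """Split [offset, offset + length) into (head, aligned, tail).

    Each element is a (start, end) byte pair; start == end means empty.
    Only `aligned` covers whole blocks; head and tail are the overspill.
    """
    end = offset + length
    first = -(-offset // blksize) * blksize  # offset rounded up to a block
    last = (end // blksize) * blksize        # end rounded down to a block
    if first >= last:
        # No whole block fits inside the range: all of it is overspill.
        return (offset, end), (end, end), (end, end)
    return (offset, first), (first, last), (last, end)


# The three axes from the message, with shortened (assumed) labels.
IMAGE = ("preallocated file", "dynamically allocated file", "zvol", "device")
BLOCK_SIZE = ("small", "medium", "large")
MEDIUM = ("volatile", "solid-state", "electromechanical")
OPTIONS = "abcdef"

points = list(product(IMAGE, BLOCK_SIZE, MEDIUM))
cases = list(product(OPTIONS, points))
```

The same request gives very different results depending on the block
size: split_range(1000, 10000, 4096) leaves one whole 4 kB block to
release, whereas with a 1 MB block size the entire range is overspill
and nothing can be released.  The enumeration yields 4 x 3 x 3 = 36
points, or 216 cases once the six options are included.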