From owner-freebsd-hackers@freebsd.org Tue Dec 8 19:28:47 2015
Sender: wlosh@bsdimp.com
In-Reply-To: <56672C94.30404@multiplay.co.uk>
References: <201512052002.tB5K2ZEA026540@chez.mckusick.com>
 <86poyhqsdh.fsf@desk.des.no> <86fuzdqjwn.fsf@desk.des.no>
 <864mfssxgt.fsf@desk.des.no> <86wpsord9l.fsf@desk.des.no>
 <566726ED.2010709@multiplay.co.uk>
 <0DB97CBA-4DC3-4D52-AE9D-54546292D66F@bsdimp.com>
 <56672C94.30404@multiplay.co.uk>
Date: Tue, 8 Dec 2015 12:28:46 -0700
Subject: Re: DELETE support in the VOP_STRATEGY(9)?
From: Warner Losh
To: Steven Hartland
Cc: "freebsd-hackers@freebsd.org"
Content-Type: text/plain; charset=UTF-8
List-Id: Technical Discussions relating to FreeBSD

On Tue, Dec 8, 2015 at 12:16 PM, Steven Hartland wrote:

> On 08/12/2015 19:03, Warner Losh wrote:
>
>> On Dec 8, 2015, at 11:52 AM, Steven Hartland wrote:
>>>
>>> On 08/12/2015 18:44, Dag-Erling Smørgrav wrote:
>>>
>>>> Warner Losh writes:
>>>>
>>>>> Dag-Erling Smørgrav writes:
>>>>>
>>>>>> But the filesystem does not know whether the underlying storage is
>>>>>> electromechanical or solid-state, nor does it know whether the user
>>>>>> cares much about seek times (unless we introduce the heuristic
>>>>>> "avoid creating
>>>>>> holes unless the file already has them, in which
>>>>>> case the userland probably does not care").
>>>>>>
>>>>> Actually, the filesystem does know. Or has some knowledge of what
>>>>> is supported and what isn't. BIO_DELETE support is a strong indicator
>>>>> of a flash or other log-type system.
>>>>>
>>>> The filesystem can ask the layer below if BIO_DELETE is supported, but
>>>> should not assume anything about what it means. For instance, I could
>>>> write a gnop-like module that translates BIO_DELETE into an all-zeroes
>>>> BIO_WRITE and passes everything else unmodified. It would provide a
>>>> stronger guarantee than, say, SATA TRIM but would also have a completely
>>>> different performance profile (even on SSDs, since it would do its work
>>>> synchronously whereas TRIM works asynchronously).
>>>>
>> That ship has sailed. UFS, at least, assumes that if TRIM is supported
>> then relocating files to be contiguous is bad.
>>
>> But writing a gnop module that did the BIO_DELETE thing would be bogus.
>> BIO_DELETE does not mean that blocks will read back as zeros. That’s
>> not what BIO_DELETE means. So, sure, you could invent a stupid thing that
>> breaks the rules, and thus the assumptions of the other code, but why
>> would you want to do that?
>>
>> The SATA trims are actually synchronous (in the absence of power
>> failures). Once you TRIM the data, it is gone. And depending on what bits
>> are set in the identify response, you can count on different things. But
>> to say they happen asynchronously because of implementation details about
>> when the data is actually erased is missing the point. Also, your
>> BIO_DELETE example wouldn’t guarantee the data is erased either. Writes
>> to log append devices (like SSDs) are like a TRIM followed by a write:
>> the old LBA mapping is discarded and a new one replaces it.
>>
>
> Not all SATA TRIMs are synchronous; some FW does process them in the
> background.
>
> Saying once you TRIM data it's gone is actually too strong I'm afraid, as
> it's advisory; the FW can ignore you if it so chooses.
>
> There is the concept of DSM deterministic read which if set "should"
> result in returning the same values from read of a TRIMed sector every
> time, but even this is unreliable due to FW bugs (yes I've seen this).

I guess I've been lucky. In FreeBSD we only depend on the data reading
without error after a BIO_DELETE, and on a subsequent BIO_WRITE making
BIO_READ deterministic again. But I was mostly trying to say that once
you issue a TRIM to the drive, and it returns, the TRIM is done in the
sense that there's not another TRIM_COMPLETED message that comes back
from the drive.

>>>> Anyway, my point is that Maxim needs to revise his assumptions.
>>>>
>>> Just to clarify, most consumer devices process TRIM synchronously, not
>>> asynchronously.
>>>
>> It also depends on what you mean by ‘process’ here.
>>
> Indeed it does; here I mean when / if the data is removed from the media
> by the HW.

I agree. Most firmware is asynchronous in this sense. You have to do
something called a SECURE ERASE to have the data be actually gone. The
granularity of that command, though, is the entire drive.

>>> Your example isn't actually just an example; CAM scsi_da has a number of
>>> different ways it can process BIO_DELETE:
>>> * ATA TRIM
>>> * SCSI UNMAP
>>> * Write Same 16
>>> * Write Same 10
>>> * Zero
>>>
>>> So your example actually exists in practice in the FreeBSD code base
>>> ;-)
>>>
>> All these are effectively TRIM operations. The devices that implement them
>> use them as hints to optimize storage. DES’ BIO_DELETE -> WRITE zero
>> example doesn’t optimize storage at all, nor does it give the lower layers
>> any clue about how to optimize the storage. All the SCSI delete types
>> do give that hint.
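
To make the list of delete methods concrete, here is a minimal sketch of picking one method from whatever a device supports. It is a userland model, not the scsi_da code: the enum names, the bitmask encoding, and in particular the preference order are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; these are not the identifiers used in scsi_da. */
enum delete_method {
	DM_NONE = 0,	/* nothing usable: BIO_DELETE is not supported */
	DM_ATA_TRIM,
	DM_UNMAP,
	DM_WS16,	/* Write Same 16 */
	DM_WS10,	/* Write Same 10 */
	DM_ZERO,	/* write zeroes */
	DM_COUNT
};

/* Bit n of a support mask is set when method n is supported. */
#define DM_BIT(m)	(1u << (m))

/*
 * Return the first supported method in preference order, or DM_NONE
 * when the mask is empty. The order below is an assumption.
 */
enum delete_method
choose_delete_method(unsigned int supported)
{
	static const enum delete_method preference[] = {
		DM_ATA_TRIM, DM_UNMAP, DM_WS16, DM_WS10, DM_ZERO
	};
	size_t i;

	/* Only bits that name a real method may be set. */
	assert((supported & ~(DM_BIT(DM_COUNT) - 1u)) == 0);
	assert((supported & DM_BIT(DM_NONE)) == 0);

	for (i = 0; i < sizeof(preference) / sizeof(preference[0]); i++) {
		if (supported & DM_BIT(preference[i]))
			return (preference[i]);
	}
	return (DM_NONE);
}
```

With this model, a device that reports only Write Same 10 and UNMAP would get UNMAP, and a device that reports nothing would get DM_NONE, which the upper layer sees as "no BIO_DELETE support".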
>>
> This is true, just wanted to highlight that "TRIM" can mean very different
> things even at the CAM layer.
>

Agreed. There are many different ways to implement BIO_DELETE's rather
loose semantics. This is one reason we give people knobs to turn it off
if it hurts performance in their application. That is the ultimate escape
hatch when the performance profile of BIO_DELETE in the actual drive
doesn't match the upper layer's assumptions.

Warner
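
The loose contract discussed in this thread can be sketched as a small userland model in C: a read after a BIO_DELETE succeeds but its contents are not something to rely on, a later BIO_WRITE makes reads deterministic again, and a knob can turn deletes off entirely. Every name here (struct model_dev, model_delete, the delete_enabled flag) is hypothetical and none of it is a kernel interface. Returning the stale data after a delete is just one of the behaviors a real device might show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NBLOCKS	8
#define BLKSZ	16

/* Hypothetical model of a device; not a kernel data structure. */
struct model_dev {
	unsigned char data[NBLOCKS][BLKSZ];
	bool defined[NBLOCKS];	/* false: contents unspecified after a delete */
	bool delete_enabled;	/* the knob: false drops deletes */
	unsigned int deletes_sent;	/* deletes actually passed down */
};

void
model_init(struct model_dev *d, bool delete_enabled)
{
	size_t i;

	assert(d != NULL);
	memset(d, 0, sizeof(*d));
	for (i = 0; i < NBLOCKS; i++)
		d->defined[i] = true;
	d->delete_enabled = delete_enabled;
}

/* Models BIO_WRITE: afterwards, reads of this block are deterministic. */
int
model_write(struct model_dev *d, size_t blk, const unsigned char *buf)
{
	if (blk >= NBLOCKS)
		return (-1);
	memcpy(d->data[blk], buf, BLKSZ);
	d->defined[blk] = true;
	return (0);
}

/* Models BIO_DELETE: a hint only; dropped when the knob is off. */
int
model_delete(struct model_dev *d, size_t blk)
{
	if (blk >= NBLOCKS)
		return (-1);
	if (!d->delete_enabled)
		return (0);
	d->defined[blk] = false;
	d->deletes_sent++;
	return (0);
}

/*
 * Models BIO_READ: succeeds for any valid block. *defined tells the
 * caller whether the returned bytes can be relied on.
 */
int
model_read(const struct model_dev *d, size_t blk, unsigned char *buf,
    bool *defined)
{
	if (blk >= NBLOCKS)
		return (-1);
	memcpy(buf, d->data[blk], BLKSZ);
	*defined = d->defined[blk];
	return (0);
}
```

An upper layer written against this model can only assume that model_read returns 0 for a deleted block, never what the bytes are, which is why the knob is safe to flip: with delete_enabled false the data simply stays defined.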