From: Lev Serebryakov
Organization: FreeBSD Project
Date: Wed, 6 Mar 2013 12:41:39 +0400
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Reply-To: lev@FreeBSD.org
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID: <1198028260.20130306124139@serebryakov.spb.ru>
In-Reply-To: <201303060815.r268FIl5015220@gw.catspoiler.org>
References: <1644513757.20130306113250@serebryakov.spb.ru> <201303060815.r268FIl5015220@gw.catspoiler.org>

Hello, Don.
You wrote on 6 March 2013, at 12:15:18:

>> This scenario looks plausible, but it raises another question: will
>> barriers protect against it? It doesn't look like it, since right now a
>> barrier write is issued only when a new inode BLOCK is allocated. And
>> that leads to my other question: why not mark such vital writes with a
>> flag that forces the driver to treat them as "uncacheable" (and the
>> same for fsync()-induced writes)? Again, not BIO_FLUSH, which flushes
>> the whole cache, but a per-BIO flag. I was told by mav@ (the ahci
>> driver author) that ATA has such a capability, and I'm sure SCSI/SAS
>> drives have one too.

DL> In the existing implementation, barriers wouldn't help since they
DL> aren't used in nearly enough places.  UFS+SU currently expects the
DL> drive to tell it when the data actually hits the platter so that it
DL> can control the write ordering.  In theory, barriers could be used
DL> instead, but performance would be terrible if they got turned into
DL> cache flushes.

 Yep! So we need either stream (file/vnode/inode)-related barriers or a
simple per-request (bp/bio) flag (a rough sketch follows below).

DL> With NCQ or TCQ, the drive can have a sizeable number of writes
DL> internally queued and it is free to reorder them as it pleases even
DL> with write caching disabled, but if write caching is disabled it has
DL> to delay the notification of their completion until the data is on
DL> the platters so that UFS+SU can enforce the proper dependency
DL> ordering.

 But, again, performance would be terrible :( I've checked it: with very
sparse multi-threaded write patterns (several torrents downloading on a
fast link in my simple home setup, and I think things could be even worse
for a big file server in an organization) and "simple" SATA drives,
disabling write caching is significantly slower in my experience :(
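 To make the per-request flag idea concrete, here is roughly what I have
in mind at the GEOM level. Note that BIO_NOCACHE and metadata_write()
below are made up for illustration -- nothing like them exists in the
tree today -- and the disk driver would still have to translate the flag
into an ATA FUA write (or the SCSI FUA bit) on each command:

/*
 * Hypothetical sketch: BIO_NOCACHE would be a new bio_flags bit meaning
 * "write this one request through the drive cache", as opposed to
 * BIO_FLUSH, which drains the whole cache.
 */
#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

#define BIO_NOCACHE     0x8000  /* invented for this sketch, not in sys/bio.h */

static void
metadata_write_done(struct bio *bp)
{
        /* UFS+SU dependency processing would run here on completion. */
        g_destroy_bio(bp);
}

/*
 * Queue a single metadata block so that the drive may not hold it in
 * its volatile write cache; all other writes keep using the cache.
 */
static void
metadata_write(struct g_consumer *cp, off_t offset, void *data, off_t length)
{
        struct bio *bp;

        bp = g_alloc_bio();
        bp->bio_cmd = BIO_WRITE;
        bp->bio_flags |= BIO_NOCACHE;   /* driver maps this to a FUA write */
        bp->bio_offset = offset;
        bp->bio_data = data;
        bp->bio_length = length;
        bp->bio_done = metadata_write_done;
        g_io_request(bp, cp);
}

 fsync()-induced data writes could be tagged the same way, so that only
those requests bypass the cache instead of the whole queue being flushed.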
DL> I don't know enough about ATA to say if it supports marking
DL> individual writes as uncacheable.  To support consistency on a drive
DL> with write caching enabled, UFS+SU would have to mark many of its
DL> writes as uncacheable.

 I don't see this as a big problem. About a year and a half ago I did an
experiment: I added counters all over the UFS/FFS code where it writes
metadata, and metadata came to about 1% of the writes on a busy file
system (torrents, csup updates, buildworld, all on one big FS).

DL> Even if this works, calls to fsync() would have to be turned into
DL> cache flushes to force the file data (assuming that it was written
DL> with a cacheable write) to be written to the platters and only return
DL> to the userland program after the data is written.  If drive write
DL> caching is off, then UFS+SU keeps track of the outstanding writes and
DL> an fsync() call won't return until the drive notifies UFS+SU that the
DL> data blocks for that file are actually written.  In this case, the
DL> fsync() call doesn't need to get propagated down to the drive.

 I see. But then we should turn off the disk cache by default and write a
whitepaper about this situation. I don't know what is really better for
commodity SATA drives. And I'm not sure I understand the UFS/FFS code
well enough to do a proper experiment by adding such a flag to our whole
storage stack :(

 And there is a second problem: SSDs. I know nothing about their caching
strategies, and SSDs have very big RAM buffers compared to commodity HDDs
(something like 512MiB vs 64MiB).

-- 
// Black Lion AKA Lev Serebryakov