From: Lev Serebryakov
Date: Sat, 26 Nov 2011 12:13:54 +0400
To: Kostik Belousov
Cc: Kirk McKusick, freebsd-fs@freebsd.org
Subject: Re: Does UFS2 send BIO_FLUSH to GEOM when update metadata (with softupdates)?

Hello, Kostik.

You wrote on 26 November 2011, 12:03:51:

>> You are entirely correct when you say that the requirement for
>> SU and SU+J is that it requires that notification of a disk-write
>> complete mean that the data is on the disk (stable).
>> The problem that arises is that (apparently) some tag-queue
>> implementations report back that tags have been written when in
>> fact they have not been written.

> Right, and my belief that real hardware is not much affected,

You have the wrong idea about modern hardware, sorry. Again: don't
forget multi-megabyte caches and the absence of any guarantee about
the order in which these caches are flushed. Many controllers, and the
drives themselves, group writes. If the companion for a data block in
the cache is found before the companion for a metadata block (the
drive doesn't distinguish them) or before the wait timeout expires,
the data block will be written first. The same applies to two metadata
blocks, of course. And it is not a question of BROKEN QUEUEING. Again,
I'm speaking not about cheap ATA drives here, but about expensive
high-performance RAID controllers and server drives with huge caches.

> except probably some ultra-cheap and old ATA disks. Another issue
> is broken-by-design 'drivers' which authors do not understand the
> environment they programming for.

And, again, either you have a storage stack that is synchronous from
top to bottom, with performance that is miserable compared to other
OSes, or you need to give driver authors some freedom and provide them
with hints about the semantics of individual operations. Every drive
and controller that does write caching and reordering (except old,
cheap, broken ATA ones) HAS flags and knobs to send an individual
block to the platters as soon as possible. But right now drivers have
no idea when they should use these flags, so they don't use them.

> I do not see how this proposal change much, except limiting potential
> havoc to the last 100ms of system operation. In fact, reordering,
> besides causing fs consistency problems, may cause the security issues
> as well [*].
> If user data is written into the reused blocks, but
> metadata update was ordered after data write, we can end with the
> arbitrary override of the sensitive authorization or accounting
> information.

That is why metadata requests should be marked as non-reorderable and
non-queueable: individual requests, not some global barrier every
100ms.

-- 
// Black Lion AKA Lev Serebryakov
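
The per-request hint argued for above can be illustrated with a toy
model. This is only a sketch in Python, not FreeBSD or driver code: the
`Write` class, its `ordered` flag, and `flush_order` are hypothetical
names, and sorting by block address merely stands in for whatever
grouping a real write cache performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    lba: int               # target block address
    kind: str              # "data" or "meta"; the drive itself cannot see this
    ordered: bool = False  # hypothetical per-request "do not reorder" hint


def flush_order(queue):
    """Return the order in which a toy write cache commits `queue`.

    Unordered writes are committed sorted by LBA (an assumed stand-in for
    the cache's grouping policy). An ordered write is never moved:
    everything queued before it is committed first, everything queued
    after it is committed later.
    """
    committed, batch = [], []
    for w in queue:
        if w.ordered:
            committed.extend(sorted(batch, key=lambda x: x.lba))
            committed.append(w)
            batch = []
        else:
            batch.append(w)
    committed.extend(sorted(batch, key=lambda x: x.lba))
    return committed


# The filesystem issues the metadata update first and the data write
# second, as in the reused-block case quoted above.
plain = [Write(900, "meta"), Write(100, "data")]
hinted = [Write(900, "meta", ordered=True), Write(100, "data")]

print([w.kind for w in flush_order(plain)])   # ['data', 'meta']
print([w.kind for w in flush_order(hinted)])  # ['meta', 'data']
```

Without the hint the model commits the data block before the metadata
block; with the hint on the single metadata request the issue order is
preserved, and no global barrier is needed.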