From owner-freebsd-fs@FreeBSD.ORG  Thu Apr 11 06:30:56 2013
Date: Thu, 11 Apr 2013 16:30:52 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Kevin Day <toasty@dragondata.com>
Subject: Re: Does sync(8) really flush everything? Lost writes with journaled
 SU after sync+power cycle
In-Reply-To: <87CC14D8-7DC6-481A-8F85-46629F6D2249@dragondata.com>
Message-ID: <20130411160253.V1041@besplex.bde.org>
References: <87CC14D8-7DC6-481A-8F85-46629F6D2249@dragondata.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: "freebsd-fs@FreeBSD.org Filesystems" <freebsd-fs@FreeBSD.org>

On Wed, 10 Apr 2013, Kevin Day wrote:

> Working with an environment where a system (with journaled soft-updates) is going to be notified that it's going to be losing power shortly, and needs to shut down daemons and flush everything to disk. It doesn't actually shutdown though, because the "power down now" command may get cancelled and we need to bring things back up. My understanding was that we could call sync(8), then just wait for the power to drop.
>
> The problem is that we were frequently losing the last 30-60 seconds worth of filesystem changes prior to the shutdown. i.e. newly created directories would disappear or fsck would reclaim them and throw them into lost+found.
>
> I confirmed that there is no caching disk controller, and write caching is disabled on the drives themselves, and the problem continued.
>
> On a whim, after running sync(8) once and waiting 10 seconds, I did "mount -u -o ro -f /" to force the filesystem into read-only mode. It took about 8 seconds to finish, gstat showed a lot of write activity, and SIGINFO on the mount command showed:

sync(2) only schedules the writing of all modified buffers to disk.  Its
man page even says this.  It doesn't wait for any of the writes to complete.
Its man page says that this is a BUG, but it is intentional and sync() has
always done this.  There is no way for sync() to guarantee that all modified
buffers have been written to disk when it returns, since even if it waited,
buffers might be modified while it is returning.  Perhaps even buffers whose
writes would take 8 seconds to complete can be dirtied in the few nanoseconds
that it takes to return.

sync(8) is just a wrapper around sync(2).  One that doesn't even check
for errors.  Not that it could handle sync() failure.  Its man page
bogusly first claims that it "forces completion".  This is not
completely wrong, since it doesn't claim that the completion occurs
before sync(8) exits.  But then it claims that sync(8) is suitable "to
ensure that all disk writes have been completed in a way not suitably
done by reboot(8) or halt(8)".  This wording is poor, unless it is
intentionally weaselishly worded so that it doesn't actually claim
full completion.  It only claims more suitable completion than with
reboot or halt.  Actually, completion is not guaranteed, and what
sync(8) provides is just less unsuitable than what reboot and halt
provide.
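
As a minimal sketch (not the actual sync(8) source), the wrapper
described above amounts to the following.  The function name
sync_and_return is hypothetical and used only for illustration; the one
real call is sync(2), which returns void, so there is no status that a
caller could check.

```c
#include <unistd.h>

/*
 * Hypothetical helper mirroring what a sync(8)-style wrapper does:
 * a single call to sync(2).  sync() returns void, so no error can be
 * reported, and returning says nothing about whether the scheduled
 * writes have completed.
 */
int
sync_and_return(void)
{
	sync();		/* schedules the writes; does not wait for them */
	return (0);	/* always "success": nothing else is available */
}
```

A caller that sees 0 here has learned only that the writes were
scheduled, which is why this alone is not enough before a power cut.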

To ensure completion, you have to freeze the file systems of interest
before rebooting.  I don't know of any way to do this from userland
except mount -u -o ro or unmounting.
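
The remount approach can be driven from a program by running the mount
command and waiting for it to exit before treating the file system as
flushed.  The following is only a sketch under assumptions: the helper
names build_remount_ro and run_and_wait are hypothetical, the mount
point is supplied by the caller, and the -f flag from the original
report is left out.

```c
#include <sys/types.h>
#include <sys/wait.h>

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill argv with "mount -u -o ro <mountpoint>".
 * argv must have room for 6 pointers.
 */
void
build_remount_ro(const char *mountpoint, const char *argv[6])
{
	argv[0] = "mount";
	argv[1] = "-u";
	argv[2] = "-o";
	argv[3] = "ro";
	argv[4] = mountpoint;
	argv[5] = NULL;
}

/*
 * Hypothetical helper: run a command and block until it exits.
 * Returns the exit status, or -1 if the command could not be started
 * or was killed by a signal.
 */
int
run_and_wait(const char *const argv[])
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		execvp(argv[0], (char *const *)argv);
		_exit(127);	/* exec failed */
	}
	/* Unlike sync(2), this waits; retry if a signal interrupts us. */
	while (waitpid(pid, &status, 0) == -1) {
		if (errno != EINTR)
			return (-1);
	}
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}
```

Calling build_remount_ro("/", argv) followed by run_and_wait(argv)
returns only after mount has exited, so a zero result means the remount
command itself reported success rather than that writes were merely
scheduled.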

There should be a syscall to cause syncing with waiting.  The kernel
has a wait option for syncing, but doesn't use it for sync(2).  But
using this would only reduce the races, not eliminate them.

Bruce