Date: Thu, 11 Apr 2013 16:30:52 +1000 (EST)
From: Bruce Evans
To: Kevin Day
Cc: "freebsd-fs@FreeBSD.org Filesystems"
Subject: Re: Does sync(8) really flush everything? Lost writes with journaled SU after sync+power cycle
In-Reply-To: <87CC14D8-7DC6-481A-8F85-46629F6D2249@dragondata.com>
Message-ID: <20130411160253.V1041@besplex.bde.org>

On Wed, 10 Apr 2013, Kevin Day wrote:

> Working with an environment where a system (with journaled soft-updates) is going to be notified that it's going to be losing power shortly, and needs to shut down daemons and flush everything to disk.
> It doesn't actually shutdown though, because the "power down now" command may get cancelled and we need to bring things back up. My understanding was that we could call sync(8), then just wait for the power to drop.
>
> The problem is that we were frequently losing the last 30-60 seconds worth of filesystem changes prior to the shutdown. i.e. newly created directories would disappear or fsck would reclaim them and throw them into lost+found.
>
> I confirmed that there is no caching disk controller, and write caching is disabled on the drives themselves, and the problem continued.
>
> On a whim, after running sync(8) once and waiting 10 seconds, I did "mount -u -o ro -f /" to force the filesystem into read-only mode. It took about 8 seconds to finish, gstat showed a lot of write activity, and SIGINFO on the mount command showed:

sync(2) only schedules the writing of all modified buffers to disk. Its man page even says this. It doesn't wait for any of the writes to complete. Its man page says that this is a BUG, but it is intentional and sync() has always done this. There is no way for sync() to guarantee that all modified buffers have been written to disk when it returns, since even if it waited, buffers might be modified while it is returning. Perhaps even ones that would take 8 seconds to complete can be written in the few nanoseconds that it takes to return.

sync(8) is just a wrapper around sync(2). One that doesn't even check for errors. Not that it could handle sync() failure. Its man page bogusly first claims that it "forces completion". This is not completely wrong, since it doesn't claim that the completion occurs before sync(8) exits. But then it claims that sync(8) is suitable "to ensure that all disk writes have been completed in a way not suitably done by reboot(8) or halt(8)". This wording is poor, unless it is intentionally weaselishly worded so that it doesn't actually claim full completion.
It only claims more suitable completion than with reboot or halt. Actually, completion is not guaranteed, and what sync(8) provides is just less unsuitable than what reboot and halt provide.

To ensure completion, you have to freeze the file systems of interest before rebooting. I don't know of any ways to do this from userland except mount -u -o ro or unmount.

There should be a syscall to cause syncing with waiting. The kernel has a wait option for syncing, but doesn't use it for sync(2). But using this would only reduce the races.

Bruce
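
A narrower illustration of the difference between scheduling writes and waiting for them: sync(2) returns void and does not wait, whereas fsync(2) on a single descriptor blocks until that one file's modified data and attributes have been moved to storage, and it reports failure. The sketch below is only an assumption-laden example and not something proposed in this thread. flush_path() is a hypothetical helper name, it covers only the one file or directory named, and it is not a substitute for mount -u -o ro on a whole file system.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): wait until the file or
 * directory at `path` has been flushed, using fsync(2).
 * Returns 0 on success and -1 on any error (errno is left set).
 */
int
flush_path(const char *path)
{
	int fd, rc;

	/* A read-only descriptor is enough for fsync(); this also lets
	 * the helper be used on directories. */
	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);

	/* Unlike sync(), fsync() does not return until the flush of
	 * this one object has completed or failed. */
	rc = fsync(fd);

	if (close(fd) == -1)
		rc = -1;
	return (rc);
}
```

A caller that has just created a directory might call flush_path() on both the new directory and its parent. Whether that is sufficient in the journaled soft-updates case described above is not established here.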