Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 06 Jul 2015 08:28:13 -0600
From:      Ian Lepore <ian@freebsd.org>
To:        Renato Botelho <garga@FreeBSD.org>
Cc:        Mateusz Guzik <mjguzik@gmail.com>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: rename() + fsync() implementation
Message-ID:  <1436192893.1334.27.camel@freebsd.org>
In-Reply-To: <4B5ABFF0-AAE5-4E74-BAD4-E7B301DC3F7D@FreeBSD.org>
References:  <2770DA5F-D7F2-46D9-9158-10C86115F8AC@FreeBSD.org> <20150704133034.GA3102@dft-labs.eu> <4B5ABFF0-AAE5-4E74-BAD4-E7B301DC3F7D@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 2015-07-06 at 10:20 -0300, Renato Botelho wrote:
> > On Jul 4, 2015, at 10:30, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > 
> > On Fri, Jul 03, 2015 at 01:59:43PM -0300, Renato Botelho wrote:
> >> Some time ago we found a bug on pfSense and after investigating we figured out the root cause was passwd / group related tools was not checking if files were safe in disk, and system ended up with a 0 length passwd/group db after a power cycle. There are more context at revision [1].
> >> 
> >> After that, bapt@ suggest to do similar fix for cap_mkdb and services_mkdb, that also can be found at another review [2].
> >> 
> > 
> > This definitely needs an explanation what is going on here.
> > 
> > When the problem from [1] is encountered, which file appears to be
> > zeroed?
> > 
> > If the new one, should not O_SYNC you added in several places take care
> > of that? (btw, it would be nicer if that was fsynced before close
> > instead)
> 
> According comment #11 of pfSense bug #4523 [3]:
> 
> ----------
> With SU, with or without J, you end up with 0 byte master.passwd, passwd, group, pwd.db, spwd.db. Or some subset of those. Without SU, you end up with master.passwd and/or group corrupted, containing parts of other files in /etc/.
> 
> It's replicable on stock FreeBSD 9.0 through 11-CURRENT by running the following: 
> 
> #!/bin/sh
> 
> /usr/sbin/pw userdel -n 'admin'
> /usr/sbin/pw groupadd all -g 1998
> /usr/sbin/pw groupadd admins -g 1999
> /usr/sbin/pw groupmod all -g 1998 -M ''
> echo \$6\$O/T6GYkcgYvOBTGm\$KvOh3zhFKiA6HMEPktuImAI8/cetwEFsgj7obXdeTcQvn6mhs50HgkWt6nfnxNhTIb2w4Je6dqdKtARavxThP1 | /usr/sbin/pw usermod -q -n root -s /bin/tcsh -H 0
> echo \$6\$O/T6GYkcgYvOBTGm\$KvOh3zhFKiA6HMEPktuImAI8/cetwEFsgj7obXdeTcQvn6mhs50HgkWt6nfnxNhTIb2w4Je6dqdKtARavxThP1 | /usr/sbin/pw useradd -m -k /etc/skel -o -q -u 0 -n admin -g wheel -s /bin/sh -d /root -c 'System Administrator' -H 0
> /usr/sbin/pw unlock admin -q
> /usr/sbin/pw groupmod all -g 1998 -M '0'
> /usr/sbin/pw groupmod admins -g 1999 -M '0'
> 
> then power cycling the system. If using SU, you'll end up with 0 byte files. Without SU, you'll have corrupted files containing parts of some other file(s) in /etc.
> ----------
> 
> > If the old one, this still has the window (although miniscule compared
> > to 5 mins) since whatever crash/failure you experienced can happen
> > before you fsync. It may be ok enough in practice, but then the question
> > is whether O_SYNC on the new file was of any significance. Or to state
> > differently, do callers have to fsync/O_SYNC the file they are passing
> > as an argument?
> > 
> > Of course it may be either one can appear truncated.
> 

I can't quite unpack the wording of this, but it is necessary to fsync()
the temp file before closing it (or maybe opening using O_SYNC is just
as good, I've never tried that).  Without the fsync, SU g'tees that the
metadata is consistant but not necessarily the file's contents.  Very
often that metadata consistancy manifests as a file with no contents
after an abrupt power cycle.

> The idea of using O_SYNC on temporary file, and call fsync() on directory after rename came from Kirk McKusick:
> 
> ----------
> Applications that are properly written take these steps:
> 
> 	1. write new file to a temporary name.
> 	2. fsync newly written file.
> 	3. rename temporary name to file to be updated.
> 	4. For faster results, fsync directory that has updated file to commit the new file.
> ----------
> 

I can confirm based on years of embedded-systems experience that the
scenarios you describe above do occur, and the solution you're proposing
is basically the one I've used to fix the problem in the past.  The one
difference is that instead of opening the temp file O_SYNC I've used
fsync() on it just before the close to ensure the contents reach the
media.

I've always wanted something like a rename_sync() syscall that makes
stronger guarantees than the current rename(2) (which only promises that
a file of that name will exist after the call, but not which of the two
potential files it might be, and there are no promises about the file's
contents).  It seems like the system should be able to more-efficiently
figure out what needs to be sync'd -- it should be able to sync the
directories involved without needing to do string manipulations of
pathnames and whatnot.

Speaking of "the directories involved" if the rename operation moves the
file within the hierarchy there are really two directories that would
need to be fsync'd for a more general solution, right?

> I decided to change it on libutil because the functions are specific for passwd and group operations, otherwise it would need to be changed on all applications that calls it (usr.bin/chpass, usr.sbin/pw, usr.sbin/rpc.yppasswdd).
> 
> > Assuming the whole approach makes sense here are some comments about the
> > code itself:
> > 
> >>        /*
> >>         * Make sure file is safe on disk. To improve performance we will call
> >>         * fsync() to the directory where file lies
> >>         */
> >>        if (rename(tname, dbname) == -1 ||
> >>            (dbname_dir = dirname(dbname)) == NULL ||
> > 
> > dirname returns a pointer to an internal buffer, so it is not suitable
> > for use in a library function (think: foo = dirname(...); this(...);)
> 
> What do you think about having dirname_r implemented? Using the same approach we have today for basename/basename_r.
> 

That sounds like a good idea to me.

> > [...]

Unrelated to this proposed patch, but maybe useful information... some
files are less critical than passwd/group but you'd still like to
increase the chances that recent writes hit the media to protect against
power loss.  I've found that these values in /etc/sysctl.conf reduce the
time that buffers linger in memory before being pushed out to devices:

 # Tune the flush time for pending disk writes.  The syncer process flushes 
 # dirty blocks from memory to disk after they've been in the queues for these
 # amounts of time.  It is important that metadelay < dirdelay < filedelay.
 kern.metadelay=3
 kern.dirdelay=4
 kern.filedelay=5

While I can't quite remember the details, I do remember that these
values are the smallest delays you can set and still have the system
function correctly.  (At least, that was true in the 8.x timeframe when
I cooked up these tweaks.)

On a sysytem that does a lot of IO, these parameters will be bad for IO
throughput, they're purely for an embedded system that does relatively
little IO and needs to increase the odds that things like log files get
written without too much latency (so that the evidence needed to debug a
crash gets preserved, for example).  Afaik, these are system-wide parms
and there's no way to set an equivelent per-filesystem tuning of them.

-- Ian





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1436192893.1334.27.camel>