From owner-freebsd-hackers@freebsd.org Mon Jul 6 14:28:21 2015 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E824CA78C for ; Mon, 6 Jul 2015 14:28:20 +0000 (UTC) (envelope-from ian@freebsd.org) Received: from pmta2.delivery4.ore.mailhop.org (pmta2.delivery4.ore.mailhop.org [54.200.247.200]) by mx1.freebsd.org (Postfix) with SMTP id B549B1E48 for ; Mon, 6 Jul 2015 14:28:20 +0000 (UTC) (envelope-from ian@freebsd.org) Received: from ilsoft.org (unknown [73.34.117.227]) by outbound1.ore.mailhop.org (Halon Mail Gateway) with ESMTPSA; Mon, 6 Jul 2015 14:28:12 +0000 (UTC) Received: from rev (rev [172.22.42.240]) by ilsoft.org (8.14.9/8.14.9) with ESMTP id t66ESDcf032359; Mon, 6 Jul 2015 08:28:13 -0600 (MDT) (envelope-from ian@freebsd.org) Message-ID: <1436192893.1334.27.camel@freebsd.org> Subject: Re: rename() + fsync() implementation From: Ian Lepore To: Renato Botelho Cc: Mateusz Guzik , FreeBSD Hackers Date: Mon, 06 Jul 2015 08:28:13 -0600 In-Reply-To: <4B5ABFF0-AAE5-4E74-BAD4-E7B301DC3F7D@FreeBSD.org> References: <2770DA5F-D7F2-46D9-9158-10C86115F8AC@FreeBSD.org> <20150704133034.GA3102@dft-labs.eu> <4B5ABFF0-AAE5-4E74-BAD4-E7B301DC3F7D@FreeBSD.org> Content-Type: text/plain; charset="us-ascii" X-Mailer: Evolution 3.12.10 FreeBSD GNOME Team Port Mime-Version: 1.0 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 06 Jul 2015 14:28:21 -0000 On Mon, 2015-07-06 at 10:20 -0300, Renato Botelho wrote: > > On Jul 4, 2015, at 10:30, Mateusz Guzik wrote: > > > > On Fri, Jul 03, 2015 at 01:59:43PM -0300, Renato Botelho wrote: > >> Some time ago we found a bug on pfSense and after investigating we figured out the root cause was passwd / group related tools was not checking if files were safe in disk, and system ended up with a 0 length passwd/group db after a power cycle. There are more context at revision [1]. > >> > >> After that, bapt@ suggest to do similar fix for cap_mkdb and services_mkdb, that also can be found at another review [2]. > >> > > > > This definitely needs an explanation what is going on here. > > > > When the problem from [1] is encountered, which file appears to be > > zeroed? > > > > If the new one, should not O_SYNC you added in several places take care > > of that? (btw, it would be nicer if that was fsynced before close > > instead) > > According comment #11 of pfSense bug #4523 [3]: > > ---------- > With SU, with or without J, you end up with 0 byte master.passwd, passwd, group, pwd.db, spwd.db. Or some subset of those. Without SU, you end up with master.passwd and/or group corrupted, containing parts of other files in /etc/. > > It's replicable on stock FreeBSD 9.0 through 11-CURRENT by running the following: > > #!/bin/sh > > /usr/sbin/pw userdel -n 'admin' > /usr/sbin/pw groupadd all -g 1998 > /usr/sbin/pw groupadd admins -g 1999 > /usr/sbin/pw groupmod all -g 1998 -M '' > echo \$6\$O/T6GYkcgYvOBTGm\$KvOh3zhFKiA6HMEPktuImAI8/cetwEFsgj7obXdeTcQvn6mhs50HgkWt6nfnxNhTIb2w4Je6dqdKtARavxThP1 | /usr/sbin/pw usermod -q -n root -s /bin/tcsh -H 0 > echo \$6\$O/T6GYkcgYvOBTGm\$KvOh3zhFKiA6HMEPktuImAI8/cetwEFsgj7obXdeTcQvn6mhs50HgkWt6nfnxNhTIb2w4Je6dqdKtARavxThP1 | /usr/sbin/pw useradd -m -k /etc/skel -o -q -u 0 -n admin -g wheel -s /bin/sh -d /root -c 'System Administrator' -H 0 > /usr/sbin/pw unlock admin -q > /usr/sbin/pw groupmod all -g 1998 -M '0' > /usr/sbin/pw groupmod admins -g 1999 -M '0' > > then power cycling the system. If using SU, you'll end up with 0 byte files. Without SU, you'll have corrupted files containing parts of some other file(s) in /etc. > ---------- > > > If the old one, this still has the window (although miniscule compared > > to 5 mins) since whatever crash/failure you experienced can happen > > before you fsync. It may be ok enough in practice, but then the question > > is whether O_SYNC on the new file was of any significance. Or to state > > differently, do callers have to fsync/O_SYNC the file they are passing > > as an argument? > > > > Of course it may be either one can appear truncated. > I can't quite unpack the wording of this, but it is necessary to fsync() the temp file before closing it (or maybe opening using O_SYNC is just as good, I've never tried that). Without the fsync, SU g'tees that the metadata is consistant but not necessarily the file's contents. Very often that metadata consistancy manifests as a file with no contents after an abrupt power cycle. > The idea of using O_SYNC on temporary file, and call fsync() on directory after rename came from Kirk McKusick: > > ---------- > Applications that are properly written take these steps: > > 1. write new file to a temporary name. > 2. fsync newly written file. > 3. rename temporary name to file to be updated. > 4. For faster results, fsync directory that has updated file to commit the new file. > ---------- > I can confirm based on years of embedded-systems experience that the scenarios you describe above do occur, and the solution you're proposing is basically the one I've used to fix the problem in the past. The one difference is that instead of opening the temp file O_SYNC I've used fsync() on it just before the close to ensure the contents reach the media. I've always wanted something like a rename_sync() syscall that makes stronger guarantees than the current rename(2) (which only promises that a file of that name will exist after the call, but not which of the two potential files it might be, and there are no promises about the file's contents). It seems like the system should be able to more-efficiently figure out what needs to be sync'd -- it should be able to sync the directories involved without needing to do string manipulations of pathnames and whatnot. Speaking of "the directories involved" if the rename operation moves the file within the hierarchy there are really two directories that would need to be fsync'd for a more general solution, right? > I decided to change it on libutil because the functions are specific for passwd and group operations, otherwise it would need to be changed on all applications that calls it (usr.bin/chpass, usr.sbin/pw, usr.sbin/rpc.yppasswdd). > > > Assuming the whole approach makes sense here are some comments about the > > code itself: > > > >> /* > >> * Make sure file is safe on disk. To improve performance we will call > >> * fsync() to the directory where file lies > >> */ > >> if (rename(tname, dbname) == -1 || > >> (dbname_dir = dirname(dbname)) == NULL || > > > > dirname returns a pointer to an internal buffer, so it is not suitable > > for use in a library function (think: foo = dirname(...); this(...);) > > What do you think about having dirname_r implemented? Using the same approach we have today for basename/basename_r. > That sounds like a good idea to me. > > [...] Unrelated to this proposed patch, but maybe useful information... some files are less critical than passwd/group but you'd still like to increase the chances that recent writes hit the media to protect against power loss. I've found that these values in /etc/sysctl.conf reduce the time that buffers linger in memory before being pushed out to devices: # Tune the flush time for pending disk writes. The syncer process flushes # dirty blocks from memory to disk after they've been in the queues for these # amounts of time. It is important that metadelay < dirdelay < filedelay. kern.metadelay=3 kern.dirdelay=4 kern.filedelay=5 While I can't quite remember the details, I do remember that these values are the smallest delays you can set and still have the system function correctly. (At least, that was true in the 8.x timeframe when I cooked up these tweaks.) On a sysytem that does a lot of IO, these parameters will be bad for IO throughput, they're purely for an embedded system that does relatively little IO and needs to increase the odds that things like log files get written without too much latency (so that the evidence needed to debug a crash gets preserved, for example). Afaik, these are system-wide parms and there's no way to set an equivelent per-filesystem tuning of them. -- Ian