Date: Fri, 14 Mar 2014 16:28:59 -0400
From: Richard Yao <ryao@gentoo.org>
To: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: "freebsd-hackers@FreeBSD.org" <freebsd-hackers@FreeBSD.org>, RW <rwmaillists@googlemail.com>, Ian Lepore <ian@FreeBSD.org>
Subject: Re: GSoC proposition: multiplatform UFS2 driver
Message-ID: <F5E8863B-7889-4B3A-9D3E-DC70EAC031C2@gentoo.org>
In-Reply-To: <9DA009CD-0629-4402-A2A0-0A6BDE1E86FD@FreeBSD.org>
References: <CAA3ZYrCPJ1AydSS9n4dDBMFjHh5Ug6WDvTzncTtTw4eYrmcywg@mail.gmail.com> <20140314152732.0f6fdb02@gumby.homeunix.com> <1394811577.1149.543.camel@revolution.hippie.lan> <0405D29C-D74B-4343-82C7-57EA8BEEF370@FreeBSD.org> <53235014.1040003@gentoo.org> <9DA009CD-0629-4402-A2A0-0A6BDE1E86FD@FreeBSD.org>
On Mar 14, 2014, at 3:18 PM, Edward Tomasz Napierała <trasz@FreeBSD.org> wrote:

> Message written by Richard Yao on 14 Mar 2014, at 19:53:
>> On 03/14/2014 02:36 PM, Edward Tomasz Napierała wrote:
>>> Message written by Ian Lepore on 14 Mar 2014, at 16:39:
>>>> On Fri, 2014-03-14 at 15:27 +0000, RW wrote:
>>>>> On Thu, 13 Mar 2014 18:22:10 -0800
>>>>> Dieter BSD wrote:
>>>>>
>>>>>> Julio writes,
>>>>>>> That being said, I do not like the idea of using NetBSD's UFS2
>>>>>>> code. It lacks Soft-Updates, which I consider to make FreeBSD UFS2
>>>>>>> second only to ZFS in desirability.
>>>>>>
>>>>>> FFS has been in production use for decades. ZFS is still wet behind
>>>>>> the ears. Older versions of NetBSD have soft updates, and they work
>>>>>> fine for me. I believe that NetBSD 6.0 is the first release without
>>>>>> soft updates. They claimed that soft updates were "too difficult" to
>>>>>> maintain. I find that soft updates are *essential* for data
>>>>>> integrity (I don't know *why*, I'm not a FFS guru).
>>>>>
>>>>> NetBSD didn't simply drop soft-updates; they replaced it with
>>>>> journalling, which is the approach used by practically all modern
>>>>> filesystems.
>>>>>
>>>>> A number of people on the questions list have said that they find
>>>>> UFS+SU to be considerably less robust than the journalled filesystems
>>>>> of other OSes.
>>>
>>> Let me remind you that some other OSes had problems such as truncation
>>> of files which were _not_ written (XFS), silently corrupting metadata when
>>> there were too many files in a single directory (ext3), and panicking
>>> instead of returning ENOSPC (btrfs). ;->
>>
>> Let's be clear that such problems live between the VFS and block layer
>> and therefore are isolated to specific filesystems. Such problems
>> disappear when using ZFS.
>
> Such problems disappear after fixing bugs that caused them. Just like
> with ZFS - some people _have_ lost zpools in the past.

People with problems who get in touch with me usually can save their pools. I
cannot recall an incident where a user came to me for help and suffered
complete loss of a pool. However, there have been incidents of partial data
loss involving user error (running zfs destroy on data you want to keep is
bad), faulty memory (one user ignored my warnings about non-ECC memory, put it
into production without running memtest, and then blamed ZFS) and two
incidents where bugs in ZoL's autotools checks disabled flushing to disk.
Regression tests have since been put into place for the latter two cases to
catch the errors that permitted them.

>>>> What I've seen claimed is that UFS+SUJ is less robust. That's a very
>>>> different thing than UFS+SU. Journaling was nailed onto the side of
>>>> UFS+SU as an afterthought, and it shows.
>>>
>>> Not really - it was developed rather recently, and with filesystems it
>>> usually shows, but it's not "nailed onto the side": it complements SU
>>> operation by journalling the few things which SU doesn't really handle
>>> and which used to require background fsck.
>>>
>>> One problem with SU is that it depends on hardware not lying about
>>> write completion. Journalling filesystems usually just issue flushes
>>> instead.
>>
>> This point about write completion being reported on unflushed data and no
>> flushes being issued could explain the disconnect between RW's statements
>> and what Soft Updates should accomplish.
>> However, it does not change my assertion that placing UFS SU on a ZFS zvol
>> will avoid such failure modes.
>
> Assuming everything between UFS and ZFS below behaves correctly.

For ZFS, this means that the hardware honors flushes and does not deduplicate
data (e.g. SandForce controllers), so that ditto blocks have an effect. The
latter failure mode does not appear to have been observed in the wild. The
former has never been observed, to my knowledge, when ZFS is given the
physical disks and the SAS/SATA controller does not do its own write caching.
It has been observed on certain iSCSI targets, though.

>> In ZFS, we have a two-stage transaction commit that issues a
>> flush at each stage to ensure that data goes to disk, no matter what the
>> drive reported. Unless the hardware disobeys flushes, the second stage
>> cannot happen if the first stage does not complete, and if the second
>> stage does not complete, all changes are ignored.
>>
>> What keeps soft updates from issuing a flush following write completion?
>> If there are no pending writes, it is a no-op. If the hardware lies, then
>> this will force the write. The internal dependency tracking mechanisms
>> in Soft Updates should make it rather simple to figure out when a flush
>> needs to be issued, should the hardware have lied about completion. At a
>> high level, what needs to be done is to batch the things that can be done
>> simultaneously and separate those that cannot by flushes. If such
>> behavior is implemented, it should have a mount option for toggling it.
>> It simply is not needed on well-behaved devices, such as ZFS zvols.
>
> As you say, it's not needed on well-behaved devices. While it could
> help with crappy hardware, I think it would be either very complicated
> (batching, as described) or would perform very poorly.

For ZFS, a well-behaved device is a device that honors flushes. As long as
flush semantics are obeyed, ZFS should be fine. The only exceptions known to
me involve drives that deduplicate ZFS ditto blocks (so far unobserved in the
wild), non-ECC RAM (which breaks everything equally) and driver bugs (ZFS does
not replace backups). UFS Soft Updates seems to have stricter requirements
than ZFS in that IO completion must be honest, but the end result is not as
good, as there are no ditto blocks and no checksums forming a Merkle tree.
Also, in all fairness, ZFS relies on this information too, but for performance
purposes, not consistency.

> To be honest, I wonder how many problems could be avoided by
> disabling write cache by default. With NCQ it shouldn't cause
> performance problems, right?

I think you need to specify which cache causes the problem. There is the
buffer cache (removed in recent FreeBSD and bypassed on Linux by ZFSOnLinux),
the RAID controller cache (using it gives good performance numbers, but it is
terrible for reliability) and the actual drive cache (ZFS is okay with this;
UFS2 with SU possibly not).
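
To make the flush-after-completion idea concrete, here is a minimal sketch of
what such a barrier could look like against FreeBSD's GEOM layer. g_io_flush()
and BIO_FLUSH are existing GEOM primitives; everything else here (the wrapper
function and the trust_completion toggle standing in for a mount option) is
hypothetical and is not anything in the UFS code today.

/*
 * Sketch only, not from the FreeBSD tree: force previously acknowledged
 * writes onto stable media in case the drive reported completion for data
 * still sitting in its volatile cache.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <geom/geom.h>

static int
softdep_post_completion_barrier(struct g_consumer *cp, int trust_completion)
{
	/*
	 * On well-behaved devices (e.g. a ZFS zvol) completion already
	 * implies stability, so a mount option could skip the barrier.
	 */
	if (trust_completion)
		return (0);

	/*
	 * g_io_flush() issues a synchronous BIO_FLUSH to the provider and
	 * waits until the device reports that prior writes are on stable
	 * storage.  If nothing is pending in the drive cache, it is
	 * effectively a no-op.
	 */
	return (g_io_flush(cp));
}

The same primitive is what the two-stage transaction commit described above
relies on: each stage only proceeds once the flush for the previous stage has
returned.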