Date: Sat, 2 Apr 2011 09:57:02 +0200
From: Olivier Smedts <olivier@gid0.org>
To: freebsd-stable@freebsd.org
Subject: Re: Constant rebooting after power loss
Message-ID: <BANLkTik9aN7TZ_pSZ1b=nMeXO-mW-fYuUA@mail.gmail.com>
In-Reply-To: <201104020335.p323Zp8Q018666@apollo.backplane.com>
References: <87d3l6p5xv.fsf@cosmos.claresco.hr>
 <AANLkTi=kEyz-mKLzdV8LAf91ZhMTP8gLKs=3Eu5WD8mh@mail.gmail.com>
 <874o6ip0ak.fsf@cosmos.claresco.hr>
 <7b15d37d28f8ddac9eb81e4390231c96.HRCIM@webmail.1command.com>
 <AANLkTi=KEwmm1hM6Z=r_SWUAn9KhUrkTVzfF6VmqQauW@mail.gmail.com>
 <14c23d4bf5b47a7790cff65e70c66151.HRCIM@webmail.1command.com>
 <AANLkTi=6pqRwJ96Lg=603cYg_f8QUXkg8aXtbjbYpFrV@mail.gmail.com>
 <201104020335.p323Zp8Q018666@apollo.backplane.com>

2011/4/2 Matthew Dillon <dillon@apollo.backplane.com>:
>    The core of the issue here comes down to two things:
>
>    First, a power loss to the drive will cause the drive's dirty write cache
>    to be lost; that data will not make it to disk.  Nor do you really want
>    to turn off write caching on the physical drive.  Well, you CAN turn it
>    off, but if you do performance will become so bad that there's no point.
>    So turning off the write caching is really a non-starter.
>
>    The solution to this first item is for the OS/filesystem to issue a
>    disk flush command to the drive at appropriate times.  If I recall, the
>    ZFS implementation in FreeBSD *DOES* do this for transaction groups,
>    which guarantees that a prior transaction group is fully synced before
>    a new one starts running (HAMMER in DragonFly also does this).
>    (Just getting an 'ack' from the write transaction over the SATA bus only
>    means the data made it to the drive's cache, not that it made it to
>    the platter).

Amen!

>    I'm not sure about UFS vis-a-vis the recent UFS logging features...
>    it might be an option but I don't know if it is a default.  Perhaps
>    someone can comment on that.
>
>    One last note here.  Many modern drives have very large RAM caches.
>    OCZ's SSDs have something like 256MB write caches and many modern HDs
>    now come with 32MB and 64MB caches.  Aged drives with lots of relocated
>    sectors and bit errors can also take a very long time to perform writes
>    on certain sectors.  So these large caches take time to drain and one
>    can't really assume that an acknowledged write to disk will actually
>    make it to the disk under adverse circumstances any more.  All sorts
>    of bad things can happen.
>
>    Finally, the drives don't order their writes to the platter (you can
>    set a bit to tell them to, but like many similar bits in the past there
>    is no real guarantee that the drives will honor it).  So if two
>    transactions do not have a disk flush command in between them it is
>    possible for data from the second transaction to commit to the platter
>    before all the data from the first transaction commits to the platter.
>    Or worse, for the non-transactional data to update out of order relative
>    to the transactional data which was supposed to commit first.
>
>    Hence IMHO the OS/filesystem must use the disk flush command in such
>    situations for good reliability.
>
>    --
>
>    The second problem is that a physical loss of power to the drive can
>    cause the drive to physically lose one or more sectors, and can even
>    effectively destroy the drive (even with the fancy auto-park)... if the
>    drive happens to be in the middle of a track write-back when power is
>    lost it is possible to lose far more than a single sector, including
>    sectors unrelated to recent filesystem operations.
>
>    The only solution to #2 is to make sure your machines (or at least the
>    drives if they happen to be in external enclosures) are connected to
>    a UPS and that the machines are communicating with the UPS via
>    something like the "apcupsd" port.  AND also that you test to make
>    sure the machines properly shut themselves down when AC is lost before
>    the UPS itself runs out of battery time.  After all, a UPS won't help
>    if the machines don't at least idle their drives before power is lost!!!
>
>    I learned this lesson the hard way about 3 years ago.  I had something
>    like a dozen drives in two RAID arrays doing heavy write activity and
>    lost physical power and several of the drives were totally destroyed,
>    with thousands of sector errors.  Not just one or two... thousands.
>
>    (It is unclear how SSDs react to physical loss of power during heavy
>    writing activity.  Theoretically while they will certainly lose their
>    write cache they shouldn't wind up with any read errors).
>
>                                                -Matt
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
>

-- 
Olivier Smedts                                                 _
                                        ASCII ribbon campaign ( )
e-mail: olivier@gid0.org        - against HTML email & vCards  X
www: http://www.gid0.org    - against proprietary attachments / \

  "There are only 10 kinds of people in the world:
  those who understand binary,
  and those who don't."
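
For readers who want to see the ordering argument from the quoted message in concrete form, here is a minimal application-level sketch. It is only an analogue: the quoted text is about the filesystem issuing a disk flush command to the drive, whereas this uses os.fsync() from user space. Whether an fsync() actually forces the drive's write cache to the platter depends on the filesystem and the drive, which is exactly the concern raised above. The function names, file names and the "commit" record format are hypothetical and chosen for illustration.

```python
import os
import tempfile


def write_and_flush(path, payload):
    """Append payload to path, then ask the OS to push it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Flush request between steps. Assumption: the OS and drive honor
        # it; the quoted message explains why that cannot be taken for
        # granted.
        os.fsync(fd)
    finally:
        os.close(fd)


def commit_transaction(data_path, log_path, txid, payload):
    """Write the data, flush, and only then write the commit record."""
    # Step 1: the transaction's data, followed by a flush.
    write_and_flush(data_path, payload)
    # Step 2: the commit record, written only after step 1 was flushed,
    # so the record should never reach disk ahead of the data it describes.
    record = ("commit %d %d\n" % (txid, len(payload))).encode("ascii")
    write_and_flush(log_path, record)
    return record


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    data_file = os.path.join(workdir, "data.bin")
    log_file = os.path.join(workdir, "commit.log")
    commit_transaction(data_file, log_file, 1, b"first transaction")
    commit_transaction(data_file, log_file, 2, b"second transaction")
    with open(log_file, "rb") as fh:
        print(fh.read().decode("ascii"), end="")
```

Without the flush in step 1, nothing would stop the commit record from reaching stable storage before the data it refers to, which is the out-of-order case described in the quote.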