Date: Fri, 1 Apr 2011 20:35:51 -0700 (PDT)
From: Matthew Dillon
Message-Id: <201104020335.p323Zp8Q018666@apollo.backplane.com>
To: freebsd-stable@freebsd.org
References: <87d3l6p5xv.fsf@cosmos.claresco.hr> <874o6ip0ak.fsf@cosmos.claresco.hr> <7b15d37d28f8ddac9eb81e4390231c96.HRCIM@webmail.1command.com> <14c23d4bf5b47a7790cff65e70c66151.HRCIM@webmail.1command.com>
Subject: Re: Constant rebooting after power loss

The core of the issue here comes down to two things:

First, a power loss to the drive will cause the drive's dirty write cache to be lost; that data will not make it to disk. Nor do you really want to turn off write caching on the physical drive. Well, you CAN turn it off, but if you do, performance will become so bad that there's no point. So turning off write caching is really a non-starter.
The solution to this first item is for the OS/filesystem to issue a disk flush command to the drive at appropriate times. If I recall, the ZFS implementation in FreeBSD *DOES* do this for transaction groups, which guarantees that a prior transaction group is fully synced before a new one starts running (HAMMER in DragonFly also does this). (Just getting an 'ack' from the write transaction over the SATA bus only means the data made it to the drive's cache, not that it made it to the platter.)

I'm not sure about UFS vis-a-vis the recent UFS logging features... it might be an option, but I don't know if it is a default. Perhaps someone can comment on that.

One last note here. Many modern drives have very large RAM caches. OCZ's SSDs have something like 256MB write caches, and many modern HDs now come with 32MB and 64MB caches. Aged drives with lots of relocated sectors and bit errors can also take a very long time to perform writes on certain sectors. So these large caches take time to drain, and one can't really assume that an acknowledged write to disk will actually make it to the disk under adverse circumstances any more. All sorts of bad things can happen.

Finally, the drives don't order their writes to the platter (you can set a bit to tell them to, but like many similar bits in the past there is no real guarantee that the drives will honor it). So if two transactions do not have a disk flush command in between them, it is possible for data from the second transaction to commit to the platter before all the data from the first transaction commits to the platter. Or worse, for the non-transactional data to update out of order relative to the transactional data which was supposed to commit first.

Hence IMHO the OS/filesystem must use the disk flush command in such situations for good reliability.
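To make the ordering argument concrete, here is a small toy simulation in Python. It is only a sketch under stated assumptions: CachedDrive, run() and ordering_violated() are names made up for this illustration, not any real driver or OS interface, and the "drive" simply drains its cache in random order. Each transaction writes to its own distinct sectors.

```python
import random


class CachedDrive:
    """Toy model of a drive with a volatile write cache (hypothetical)."""

    def __init__(self, seed=0):
        self.cache = []    # acknowledged writes that are not yet on the platter
        self.platter = {}  # sector -> data that survives a power loss
        self.rng = random.Random(seed)

    def write(self, sector, data):
        # The drive acknowledges as soon as the data is in its cache.
        self.cache.append((sector, data))

    def drain_some(self, count):
        # The drive commits cached writes in whatever order it likes.
        for _ in range(min(count, len(self.cache))):
            idx = self.rng.randrange(len(self.cache))
            sector, data = self.cache.pop(idx)
            self.platter[sector] = data

    def flush(self):
        # Disk flush command: does not return until the cache is empty.
        self.drain_some(len(self.cache))

    def power_loss(self):
        # Anything still sitting in the cache is gone.
        self.cache.clear()


def run(use_flush, seed):
    """Issue two transactions, then cut power after two more sector commits."""
    drive = CachedDrive(seed)
    for sector in (10, 11):          # transaction 1
        drive.write(sector, "txn1")
    if use_flush:
        drive.flush()                # barrier between the two transactions
    for sector in (20, 21):          # transaction 2
        drive.write(sector, "txn2")
    drive.drain_some(2)              # drive gets through two writes...
    drive.power_loss()               # ...and then the power goes out
    txn1_complete = all(drive.platter.get(s) == "txn1" for s in (10, 11))
    txn2_any = any(s in drive.platter for s in (20, 21))
    return txn1_complete, txn2_any


def ordering_violated(use_flush, seed):
    """True if txn2 data reached the platter while txn1 is incomplete."""
    txn1_complete, txn2_any = run(use_flush, seed)
    return txn2_any and not txn1_complete
```

Running ordering_violated(False, seed) over a range of seeds turns up cases where second-transaction data is on the platter while the first transaction is incomplete. With use_flush=True that cannot happen in this model, because the flush empties the cache before the second transaction is issued. A real drive is of course far more complicated; the point is only that the flush is what supplies the ordering.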
--

The second problem is that a physical loss of power to the drive can cause the drive to physically lose one or more sectors, and can even effectively destroy the drive (even with the fancy auto-park). If the drive happens to be in the middle of a track write-back when power is lost, it is possible to lose far more than a single sector, including sectors unrelated to recent filesystem operations.

The only solution to #2 is to make sure your machines (or at least the drives, if they happen to be in external enclosures) are connected to a UPS and that the machines are communicating with the UPS via something like the "apcupsd" port. AND also that you test to make sure the machines properly shut themselves down when AC is lost, before the UPS itself runs out of battery time. After all, a UPS won't help if the machines don't at least idle their drives before power is lost!!!

I learned this lesson the hard way about 3 years ago. I had something like a dozen drives in two RAID arrays doing heavy write activity when physical power was lost, and several of the drives were totally destroyed, with thousands of sector errors. Not just one or two... thousands.

(It is unclear how SSDs react to physical loss of power during heavy write activity. Theoretically, while they will certainly lose their write cache, they shouldn't wind up with any read errors.)

-Matt