Date: Sat, 23 Jul 2022 11:33:03 +0300
From: bogdan-lists@neant.ro
To: freebsd-fs@freebsd.org, freebsd-cloud@freebsd.org
Subject: AWS - UFS corrupted when restoring from AWS Backup service
Message-ID: <11F07F6C-E93B-42E3-BD27-3FEC4E342B1A@neant.ro>
Hello,

TL;DR: We have a bunch of EC2 machines in AWS running FreeBSD. The AMI is from the Marketplace and the file system is UFS. We have the AWS Backup service taking hourly snapshots of these machines (AMI + EBS snapshots, I believe). After a few months of snapshots we had to restore one of them and found that the file system was corrupted and fsck was not able to recover it. We are going to enable sync in fstab to see if that helps, but it’s hard to know, because the problem is hard to reproduce and the details of how everything works are fuzzy to me.

Longer version:

We use FreeBSD on web servers in AWS. Until January we were doing weekly AMI snapshots by running a script that would shut down the machine, create the AMI, then start the machine back up. That worked for a long time, but it is less than ideal, and shutting down production more often than weekly is rude.

At the start of this year we switched to running AWS Backup hourly. It takes snapshots of a running machine without stopping it. I believe it’s the same as creating an AMI and checking the “No reboot” checkbox. It should use the same API call, but I wouldn’t know. We ran a few recovery tests, read the docs, and confirmed with support; everything looked like it should work with no issues.

A couple of weeks ago the EBS disk on one of the machines failed and we needed to restore it. When we did, it ran fsck on boot (which it didn’t on our previous tests) and failed to recover the file system, so the machine was effectively dead. I know we can mount the disk on a different machine and recover (some) data; that’s not the point. We tried a few backups going back two weeks, same issue. We tried a few more instances, about 5, and all of them ran fsck on boot. A couple were recovered, but it doesn’t matter: it still means it’s not working as we thought. So now we’re effectively running without backups on EC2 instances.

I’m not sure why it happens. Information is sparse and I’m making a lot of assumptions. Basically, I believe that the snapshot process is equivalent to cutting off power to the machine, and that happens every hour for months. The docs on UFS soft updates say that there’s a small chance of data loss, but since that power-cutting snapshot happens every hour over a period of months, that chance isn’t that small any more. Still, apparently Linux doesn’t have this problem, and everywhere I read it says that data might be lost, but the file system should not be corrupted. And yet fsck isn’t always able to recover it.

As far as I understand, with soft updates and “noasync” in fstab (default), data is flushed to disk about every 30 seconds (according to the syncer man page), asynchronously, while metadata is written synchronously. I’m thinking that maybe that’s an issue and turning on sync in fstab might help. On the other hand, the man page for syncer says “It is possible on some systems that a sync(2) occurring simultaneously with a crash may cause file system damage.”, which means it might make it worse? I don’t know.

We were not able to reproduce the problem reliably so that we can test. I’m not sure if or how anyone can help. I just wanted to send this message so that at least some other people are aware that AWS Backup doesn’t play nice with FreeBSD.
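
As a rough illustration of the argument above, that a small per-snapshot risk stops being small when the snapshot runs every hour for months, here is a minimal Python sketch. The per-snapshot probability `P_PER_SNAPSHOT` is a made-up number, not a measured one, and the model assumes every snapshot is independent, which may well not be true in practice.

```python
def prob_at_least_one_bad(p_per_snapshot: float, snapshots: int) -> float:
    """Chance that at least one of `snapshots` independent snapshots is bad.

    Assumes each snapshot fails independently with probability p_per_snapshot.
    """
    return 1.0 - (1.0 - p_per_snapshot) ** snapshots


HOURS_PER_DAY = 24

# Hypothetical value, chosen only for illustration (0.1% per snapshot).
P_PER_SNAPSHOT = 0.001

for days in (1, 30, 180):
    n = days * HOURS_PER_DAY  # one snapshot per hour
    chance = prob_at_least_one_bad(P_PER_SNAPSHOT, n)
    print(f"{days:>3} days, {n:>4} snapshots: {chance:.3f}")
```

This only models the chance that some snapshot in the series is bad. It does not explain why several backups in a row needed fsck, which suggests the per-snapshot chance itself may not be small.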