Date:      Sat, 23 Jul 2022 11:33:03 +0300
From:      bogdan-lists@neant.ro
To:        freebsd-fs@freebsd.org, freebsd-cloud@freebsd.org
Subject:   AWS - UFS corrupted when restoring from AWS Backup service
Message-ID:  <11F07F6C-E93B-42E3-BD27-3FEC4E342B1A@neant.ro>

Hello,

TL;DR: We have a bunch of EC2 machines in AWS running FreeBSD. The AMI
is from the AWS Marketplace and the file system is UFS. We have the AWS
Backup service taking hourly snapshots of these machines (AMI + EBS
snapshots, I believe). After a few months of snapshots we had to restore
one of them and found that the file system was corrupted and fsck was
not able to recover it. We are going to enable sync in fstab and see if
that helps, but it’s hard to tell, because the problem is hard to
reproduce and the details of how everything works are fuzzy to me.

Longer version:

We use FreeBSD on web servers in AWS. Until January we were doing weekly
AMI snapshots by running a script that would shut down the machine,
create the AMI, then start the machine back up. That worked for a long
time, but it is less than ideal, and shutting down production more often
than weekly is rude.
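
For reference, the stop / create AMI / start sequence could look roughly
like the sketch below. It is not our actual script. It assumes a boto3
EC2 client is passed in as "ec2"; the function name, instance ID and
image name are made up for illustration. It does not wait for the AMI to
finish building before starting the machine again.

```python
def weekly_ami_with_downtime(ec2, instance_id, image_name):
    """Stop the instance, create an AMI from it, then start it again.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    # Block until the instance is fully stopped, so nothing is writing
    # to the disk when the image is taken.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    image = ec2.create_image(InstanceId=instance_id, Name=image_name)

    ec2.start_instances(InstanceIds=[instance_id])
    return image["ImageId"]
```

Usage would be something like
weekly_ami_with_downtime(boto3.client("ec2"), "i-0123456789abcdef0",
"web1-weekly"), with both values being placeholders.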

At the start of this year we switched to running AWS Backup hourly. It
takes snapshots of a running machine without stopping it. I believe
it’s the same as creating an AMI and checking the “No reboot” checkbox.
It should use the same API call, but I wouldn’t know. We ran a few
recovery tests, read the docs and confirmed with support; everything
looked like it should work with no issues.

A couple of weeks ago the EBS disk on one of the machines failed and we
needed to restore it. When we did, it ran fsck on boot (which it didn’t
in our previous tests) and fsck failed to recover the file system, so
the machine was effectively dead. I know we can mount the disk on a
different machine and recover (some) data; that’s not the point. We
tried a few backups going back two weeks, same issue. We tried a few
more instances, about 5, and all of them ran fsck on boot. A couple were
recovered, but it doesn’t matter: it still means it’s not working as we
thought. So now we’re effectively running without backups on EC2
instances.

I’m not sure why it happens. Information is sparse and I’m making a lot
of assumptions. Basically, I believe that the snapshot process is
equivalent to cutting off power to the machine, and that happens every
hour for months. The docs on UFS soft updates say that there’s a small
chance of data loss, but since that power-cutting snapshot happens every
hour over a period of months, that chance isn’t that small any more.
Still, apparently Linux doesn’t have this problem, and everywhere I read
it says that data might be lost but the file system should not be
corrupted. And yet fsck isn’t always able to recover it.
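
To put rough numbers on "not that small any more": if each snapshot
independently has some small chance p of catching the file system in a
bad state, the chance that at least one of n snapshots is bad is
1 - (1 - p)^n. The sketch below just evaluates that. The values of p are
invented for illustration (I have no measured figure), and independence
between snapshots is an assumption. Note that this is the chance of
having at least one bad snapshot somewhere in the set, not the chance
that any particular snapshot is bad.

```python
def chance_of_at_least_one_bad_snapshot(p_bad, snapshots):
    """Probability that at least one of `snapshots` independent
    snapshots is bad, given a per-snapshot probability `p_bad`."""
    return 1.0 - (1.0 - p_bad) ** snapshots


# Hourly snapshots over roughly six months of 30 days each.
hourly_snapshots = 24 * 30 * 6

# Hypothetical per-snapshot probabilities, not measured values.
for p_bad in (1e-5, 1e-4, 1e-3):
    chance = chance_of_at_least_one_bad_snapshot(p_bad, hourly_snapshots)
    print(f"p={p_bad:g}: {chance:.1%} chance of at least one bad snapshot")
```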

As far as I understand, with soft updates and “noasync” in fstab (the
default), data is flushed to disk asynchronously about every 30 seconds
(according to the syncer man page), while metadata is written
synchronously. I’m thinking that maybe that’s an issue and turning on
sync in fstab might help. On the other hand, the man page for syncer
says “It is possible on some systems that a sync(2) occurring
simultaneously with a crash may cause file system damage.”, which means
it might make it worse? I don’t know.
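
For clarity, "turning on sync in fstab" means adding the sync mount
option to the options field of the UFS entry. A small sketch of that
edit follows. The helper name and the example fstab line are
hypothetical (the device path on our machines may differ), and this only
shows the change itself, not whether it actually helps.

```python
def add_mount_option(fstab_line, option="sync"):
    """Return the fstab line with `option` added to its options field."""
    fields = fstab_line.split()
    # fstab fields: device, mountpoint, fstype, options, dump, pass.
    # Leave blank lines, comments and malformed lines untouched.
    if len(fields) < 4 or fields[0].startswith("#"):
        return fstab_line
    options = fields[3].split(",")
    if option not in options:
        options.append(option)
    fields[3] = ",".join(options)
    return "\t".join(fields)


# Hypothetical root file system entry, for illustration only.
example = "/dev/gpt/rootfs\t/\tufs\trw\t1\t1"
print(add_mount_option(example))
```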

We were not able to reproduce the problem reliably, so we cannot test
fixes. I’m not sure if or how anyone can help. I just wanted to send
this message so that at least some other people are aware that AWS
Backup doesn’t play nice with FreeBSD.


