From: bogdan-lists@neant.ro
To: freebsd-fs@freebsd.org, freebsd-cloud@freebsd.org
Subject: AWS - UFS corrupted when restoring from AWS Backup service
Date: Sat, 23 Jul 2022 11:33:03 +0300
Message-Id: <11F07F6C-E93B-42E3-BD27-3FEC4E342B1A@neant.ro>
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

Hello,

TL;DR:
We have a bunch of EC2 machines in AWS running FreeBSD. AMI from the AWS Marketplace, file system is UFS. We have AWS Backup service taking hourly snapshots of these machines (AMI + EBS snapshots I believe). After a few months of snapshots we had to restore one of them and found out that the file system is corrupted and fsck was not able to recover it. We are going to enable sync in fstab, see if that helps, but it’s hard to know because it is hard to reproduce the problem, and details about how everything works are fuzzy to me.

Longer version:
We use FreeBSD on web servers in AWS. Until January we were doing weekly AMI snapshots by running a script that would shut down the machine, create the AMI, then start the machine back up. That worked for a long time, but it is less than ideal, and shutting down production more often than weekly is rude.

At the start of this year we switched to running AWS Backup hourly. It takes snapshots of a running machine without stopping it.
I believe it’s the same as creating an AMI and checking the “No reboot” checkbox. It should use the same API call, but I wouldn’t know. We ran a few recovery tests, we read the docs, we confirmed with support, everything looked like it should work with no issues.

A couple of weeks ago the EBS disk on one of the machines failed and we needed to restore it. When we did, it ran fsck on boot (which it didn’t on our previous tests) and failed to recover it, so the machine was effectively dead. I know we can mount the disk on a different machine and recover (some) data, that’s not the point. We tried a few backups going back two weeks, same issue. We tried a few more instances, about 5, all of them ran fsck on boot. A couple were recovered, but it doesn’t matter, it still means it’s not working as we thought. So now we’re effectively running without backups on EC2 instances.

I’m not sure why it happens. Information is sparse and I’m making a lot of assumptions. Basically I believe that the snapshot process is equivalent to cutting off power to the machine and that happens every hour for months. The docs on UFS soft updates say that there’s a small chance of data loss, but since that power-cutting snapshot happens every hour over a period of months, that chance isn’t that small any more. Still, apparently Linux doesn’t have this problem, and everywhere I read it says that data might be lost, but the file system should not be corrupted. And yet fsck isn’t always able to recover it.

As far as I understand, with soft updates and “noasync” in fstab (default), data is flushed to disk about every 30 seconds (according to syncer man page), asynchronously, while metadata is written synchronously. I’m thinking that maybe that’s an issue and turning on sync in fstab might help.
On the other hand, the man page for syncer says “It is possible on some systems that a sync(2) occurring simultaneously with a crash may cause file system damage.”, which means it might make it worse? I don’t know.

We were not able to reproduce the problem reliably so that we can test. I’m not sure if or how anyone can help. I just wanted to send this message so that at least some other people are aware that AWS Backup doesn’t play nice with FreeBSD.
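
For reference, the weekly procedure described above (shut down the machine, create the AMI, start it back up) could be sketched roughly as follows. This is an illustration only, not the script that was actually used. It assumes Python with boto3; the function name, the `name_prefix` parameter and the choice to wait until the image is available before restarting are all assumptions made for the example.

```python
from datetime import datetime, timezone


def cold_ami_backup(ec2, instance_id, name_prefix="weekly-backup"):
    """Stop an instance, create an AMI from it, then start it again.

    `ec2` is assumed to behave like a boto3 EC2 client, for example
    the object returned by boto3.client("ec2"). Returns the new AMI ID.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")

    # Stop the instance first so the OS shuts down and the snapshot is
    # taken from a disk that is no longer being written to.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    try:
        image = ec2.create_image(
            InstanceId=instance_id,
            Name=f"{name_prefix}-{instance_id}-{stamp}",
        )
        image_id = image["ImageId"]
        # Assumption: be conservative and wait for the AMI to become
        # available before bringing the machine back up.
        ec2.get_waiter("image_available").wait(ImageIds=[image_id])
        return image_id
    finally:
        # Always try to bring production back, even if imaging failed.
        ec2.start_instances(InstanceIds=[instance_id])
```

The client is passed in rather than created inside the function so the sequence can be exercised against a stub before it is pointed at a real account. The downtime this causes is the reason it is only practical weekly.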