From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 245430] ZFS boot failure following memory exhaustion
Date: Tue, 07 Apr 2020 19:10:07 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=245430

            Bug ID: 245430
           Summary: ZFS boot failure following memory exhaustion
           Product: Base System
           Version: 12.1-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: jwb@freebsd.org

The error message here is about the same as PR 221077, but I'm opening a new
PR due to differences in configuration and some new info.

The exact error is as follows:

ZFS: i/o error - all block copies unavailable
ZFS: can't read object set for dataset u
ZFS: can't open root filesystem
gptzfsboot: failed to mount default pool zroot

This occurred following a memory exhaustion event that flooded my messages
file with entries like the following:

Apr  6 09:56:14 compute-001 kernel: swap_pager_getswapspace(11): failed

This was caused by some computational jobs exhausting available memory.  The
system was still up and accepting logins, but was no longer reading or
writing the ZFS pool.

All attempts to reboot resulted in the message above.

The system was installed recently using the default ZFS configuration
provided by bsdinstall on 4 disks configured as RAID-0 volumes on a Dell PERC
H700 (which does not support JBOD).  That is, ZFS is running a RAID-Z across
mfid0 through mfid3.

I updated an old forum thread on the issue and included my successful fix:

https://forums.freebsd.org/threads/10-1-doesnt-boot-anymore-from-zroot-after-applying-p25.54422/

Unlike that thread, this did not appear to be triggered by an upgrade.

The gist of it is that some of the datasets (filesystems) appear to have been
corrupted, and the out-of-swap errors seem likely to be the cause.

zpool scrub did not find any errors.

All drives are reported as online and optimal by the RAID controller.

My fix was as follows:

Boot from USB drive, go to live image, log in as root.

# mount -u -o rw /    # Allow creating directories on USB drive
# zpool import -R /mnt -fF zroot
# cd /mnt
# mount zroot/ROOT/default    # Not mounted by default, canmount defaults to noauto
# mv boot boot.orig
# cp -Rp /boot .    # Note -p to make sure permissions are correct in the new /boot
# cp boot.orig/loader.conf boot/    # Restore customizations
# cp boot.orig/zfs/zpool.cache boot/zfs/
# cd
# zfs get canmount
# zfs set canmount=on var/log    (and a couple of others that did not match defaults)
# zpool export
# reboot

After a successful reboot, I ran "freebsd-update fetch install" and rebooted
again so that my /boot would be up to date.

Everything seems fine now.  I've made a backup of my /boot directory and plan
to do so following every freebsd-update so I can hopefully correct this
quickly if it happens again.

I am seeing the same error on a workstation using 4 vanilla SATA ports, but
have not had physical access to it due to COVID-19.  This is the first time
I've seen the error without the presence of an underlying hardware RAID.
I'll update this thread when I can gather more information.
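
The /boot backup mentioned above could be scripted roughly as follows. This is only a minimal sketch and not part of the original report: the function name backup_boot, the default destination /root/boot-backups, and the timestamped directory naming are all assumptions made for illustration. It uses the same "cp -Rp" as the recovery steps so that permissions are preserved.

```shell
#!/bin/sh
# Sketch: copy a boot directory into a timestamped backup directory.
# backup_boot, the default paths, and the naming scheme are hypothetical;
# none of them come from the bug report.

backup_boot() {
    src="${1:-/boot}"                      # directory to back up
    dest_root="${2:-/root/boot-backups}"   # assumed backup location
    stamp="$(date +%Y%m%d-%H%M%S)"
    dest="${dest_root}/boot-${stamp}"

    # Refuse to continue if the source directory is missing.
    [ -d "$src" ] || {
        echo "backup_boot: $src is not a directory" >&2
        return 1
    }
    mkdir -p "$dest_root" || return 1

    # -R recurses; -p preserves modes, ownership, and timestamps.
    cp -Rp "$src" "$dest" || return 1

    # Print the backup location so callers can log or verify it.
    echo "$dest"
}
```

One way to use it would be to run it right after each update, for example "freebsd-update fetch install && backup_boot", so that a known-good copy of /boot is available to restore from the live image.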