From: David Cross
Date: Fri, 24 Aug 2018 22:42:57 -0400
Subject: Re: weird geli behavior
To: John-Mark Gurney
Cc: FreeBSD Hackers
In-Reply-To: <20180825010023.GD45503@funkthat.com>
References: <20180825010023.GD45503@funkthat.com>
List-Id: Technical Discussions relating to FreeBSD

> On Aug 24, 2018, at 21:00, John-Mark Gurney wrote:
>
> David Cross wrote this message on Fri, Aug 24, 2018 at 17:54 -0400:
>> Ok, I am seeing something truly bizarre. I am sending this out as a shot
>> across the bow since I am not even sure where or how to begin debugging
>> this.
>>
>> Some background: this is on an Intel Xeon 5520 based machine, 72G ECC
>> memory, 11.2, fully patched, though this has been a problem since at
>> least 11.1, probably 11.0, and maybe earlier. (~4G of eli encrypted swap,
>> which it basically never even touches, even when problems are occurring.)
>
> I assume you've applied the lazyfpu SA patch?
>
> If so, there's another patch you need to apply, see:
> https://docs.freebsd.org/cgi/mid.cgi?20180821081150.GU2340@kib.kiev.ua
>

I will definitely apply this, but I don’t think it applies to the problem
in question. This system doesn’t have AESNI, this problem well preceded
the lazyfpu patch, and I am not seeing any corruption on disk.

>> The first symptom was (and I think these are all aspects of the same root
>> underlying cause) that fsck on a geli encrypted d stripe of 2 USB drives
>> would *randomly* error out on a corrupt entry. Upon investigating this I
>> discovered by watching gstat that as this happened the IO on the drives
>> would STOP. The L(q) would hover at 1 for a number of seconds, and then
>> when it returned fsck was complaining about various corrupt structures. A
>> ktrace of fsck shows that it got back data from the pread() that was
>> partially corrupted (I am guessing, but I cannot confirm, that 'some part'
>> of the stack handed back a zeroed page, or otherwise 'not the right data'
>> that geli dutifully 'decrypted'). No errors are ever logged in the kernel
>> about da0 or da1 (the respective underlying USB disks). It *seems* this is
>> *always* on phase 2 of fsck (files and paths), and it's never the same
>> inode. No data is *ever* corrupted when in the filesystem, no matter how
>> hard I hit the disks (all data on these devices is fully checksummed).
>> Devices have passed multiple SMART full diag checks and full read/write
>> tests with no issues. Under heavy FS IO it does occasionally lock.. but
>> recovers, and again data and filesystem are fully consistent.
>>
>> I was willing to live with that.. weird as it was (these are backup disks,
>> data is fully checksummed, and I was only fscking out of extreme paranoia
>> every reboot). Then I added an internal drive, configured with gmirror
>> (broken mirror currently, second disk hasn't been added) and geli.
>> On this disk I have a postgres 10 database in WAL replication. This was
>> working fine and then the other day the system just locked for a few
>> hours. During that time I saw the L(q) of the _internal_ disk in the
>> 10,000+ range, and it was doing _1_ operation a second to the underlying
>> disk... all the while geli was logging 'error 11' to the console (nothing
>> about the underlying disk). After this happened a static file on the disk
>> (a zip file) had bad data in the middle of a page (after reboot the file
>> was ok.. so it was just in cache). Again, this disk fully checks ok, no
>> corruption on the disk, no errors from the disk itself.
>>
>>
>> Halp? Where do I even begin with this? It really feels like there is
>> some massive locking going on in geli in some way? Where should I even
>> begin looking? I run geli on most of my systems and don't have any issues.
>
> --
> John-Mark Gurney				Voice: +1 415 225 5579
>
>      "All that I will do, has been done, All that I have, has not."