Date: Fri, 24 Aug 2018 22:23:30 -0700
From: John-Mark Gurney <jmg@funkthat.com>
To: David Cross <dcrosstech@gmail.com>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: weird geli behavior
Message-ID: <20180825052330.GE45503@funkthat.com>
In-Reply-To: <CD43A15C-74B2-4F29-ADB5-B831A0CD5BF6@gmail.com>
References: <CAM9edePfxANDxXAjgQsZPXzPc3Ezw4Pn%2BdaVcnkaHx1oY%2BUoDA@mail.gmail.com> <20180825010023.GD45503@funkthat.com> <CD43A15C-74B2-4F29-ADB5-B831A0CD5BF6@gmail.com>

David Cross wrote this message on Fri, Aug 24, 2018 at 22:42 -0400:
>
>
> > On Aug 24, 2018, at 21:00, John-Mark Gurney <jmg@funkthat.com> wrote:
> >
> > David Cross wrote this message on Fri, Aug 24, 2018 at 17:54 -0400:
> >> Ok, I am seeing something truely bizzare, I am sending this out as a shot
> >> across the bow since I am not even sure where or how to begin debugging
> >> this.
> >>
> >> Some background. This in on an Intel Xeon 5520 based machine, 72G ECC
> >> memory, 11.2, fully patched. Though this has been a problem since at least
> >> 11.1, probably 11.0, and maybe earlier. ~4G of eli encrypted swap, which it
> >> basically never even touches, even when problems are occuring)
> >
> > I assume you've applied the lazyfpu SA patch?
> >
> > If so, there's another patch you need to apply, see:
> > https://docs.freebsd.org/cgi/mid.cgi?20180821081150.GU2340@kib.kiev.ua
> >
> I will definitely apply this, but I don't think it applies to the problem
> in question. This system doesn't have AESNI, this problem well preceded
> the lazyfpu patch, and i am not seeing any corruption on disk.

Hmm, k...

> >> The first symptom was (and I think these are all aspects of the same root
> >> underlying cause) that fsck on a geli encrypted d stripe of 2 USB drives
> >> would *randomly* error out on a corrupt entry. Upon investigating this I
> >> discovered by watching gstat that as this happened the IO on the drives
> >> would STOP. the L(q) would hover at 1 for a number of seconds, and then
> >> when it returned fsck was complaining about various corrupt structures. a
> >> ktrace of fsck shows that it got back data from the pread() that was
> >> partially corrupted (I am guessing, but I cannot confirm that 'some part'
> >> of the stack handed back a zeroed page, or otherwise 'not the right data'
> >> that geli dutifully 'decrypted'. No errors are ever logged in the kernel
> >> about da0 or da1 (the respective underlying USB disks). It *seems* this is
> >> *always* on phase 2 of fsck (files and paths), and its never the same
> >> inode. no data is *ever* corrupted when in the filesystem, no matter how
> >> hard I hit the disks (all data on these devices is fully checksummed)
> >> Devices have passed multiple SMART full diag checks, full read/write tests
> >> with no issues. Under heavy FS IO it does occasionally lock.. but
> >> recovers, and again data and filesystem are fully consistent.
> >>
> >> I was willing to live with that.. weird as it was (these are backup disks,
> >> data is fully checksummed, and I was only fscking out of extreme paranoia
> >> every reboot) Then I added an internal drive, configured with gmirror
> >> (broken mirror currently, second disk hasn't been added) and geli. On this
> >> disk I have a postgres 10 database in WAL replication. This was working
> >> fine and then the other day the system just locked for a few hours. During
> >> that time I saw the L(q) of the _internal_ disk in the 10,000+ range, and
> >> it doing _1_ operation a second to the underlying disk... all the while
> >> geli is logging 'error 11' to the console (nothing about the underlying
> >> disk) After this happened a static file on the disk (a zip file) had bad
> >> data in the middle of a page (after reboot the file was ok.. so it was
> >> just in cache). Again, this disk fully checks ok, no corruption on the
> >> disk, no errors from the disk itself.
> >>
> >>
> >> Halp? where do I even begin with this? It really feels like there is
> >> some massive locking going on in geli in some way? Where should I even
> >> begin looking?

I run geli on most of my systems and don't have any issues.

Can you post actual log lines? geli has lots of error log lines, so w/o
more info, pretty hard to say WHAT in geli is returning EAGAIN.

I do see that _read_done and _write_done may not handle an EAGAIN error,
which could cause this problem, but to confirm, I need the actual log
lines...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."