Date:      Fri, 24 Aug 2018 22:42:57 -0400
From:      David Cross <dcrosstech@gmail.com>
To:        John-Mark Gurney <jmg@funkthat.com>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: weird geli behavior
Message-ID:  <CD43A15C-74B2-4F29-ADB5-B831A0CD5BF6@gmail.com>
In-Reply-To: <20180825010023.GD45503@funkthat.com>
References:  <CAM9edePfxANDxXAjgQsZPXzPc3Ezw4Pn%2BdaVcnkaHx1oY%2BUoDA@mail.gmail.com> <20180825010023.GD45503@funkthat.com>


> On Aug 24, 2018, at 21:00, John-Mark Gurney <jmg@funkthat.com> wrote:
>
> David Cross wrote this message on Fri, Aug 24, 2018 at 17:54 -0400:
>> Ok, I am seeing something truly bizarre. I am sending this out as a shot
>> across the bow since I am not even sure where or how to begin debugging
>> this.
>>
>> Some background.  This is on an Intel Xeon 5520 based machine, 72G ECC
>> memory, 11.2, fully patched.  Though this has been a problem since at least
>> 11.1, probably 11.0, and maybe earlier. ~4G of eli encrypted swap (which it
>> basically never even touches, even when problems are occurring)
>
> I assume you've applied the lazyfpu SA patch?
>
> If so, there's another patch you need to apply, see:
> https://docs.freebsd.org/cgi/mid.cgi?20180821081150.GU2340@kib.kiev.ua
>
I will definitely apply this, but I don’t think it applies to the problem
in question. This system doesn’t have AESNI, this problem well preceded
the lazyfpu patch, and I am not seeing any corruption on disk.

>> The first symptom was (and I think these are all aspects of the same root
>> underlying cause) that fsck on a geli encrypted d stripe of 2 USB drives
>> would *randomly* error out on a corrupt entry.  Upon investigating this I
>> discovered by watching gstat that as this happened the IO on the drives
>> would STOP.  The L(q) would hover at 1 for a number of seconds, and then
>> when it returned fsck was complaining about various corrupt structures. A
>> ktrace of fsck shows that it got back data from the pread() that was
>> partially corrupted (I am guessing, but I cannot confirm, that 'some part'
>> of the stack handed back a zeroed page, or otherwise 'not the right data'
>> that geli dutifully 'decrypted').  No errors are ever logged in the kernel
>> about da0 or da1 (the respective underlying USB disks). It *seems* this is
>> *always* on phase 2 of fsck (files and paths), and it's never the same
>> inode.  No data is *ever* corrupted when in the filesystem, no matter how
>> hard I hit the disks (all data on these devices is fully checksummed).
>> Devices have passed multiple SMART full diag checks, full read/write tests
>> with no issues.  Under heavy FS IO it does occasionally lock.. but
>> recovers, and again data and filesystem are fully consistent.
>>
>> I was willing to live with that.. weird as it was (these are backup disks,
>> data is fully checksummed, and I was only fscking out of extreme paranoia
>> every reboot).  Then I added an internal drive, configured with gmirror
>> (broken mirror currently, second disk hasn't been added) and geli.  On this
>> disk I have a postgres 10 database in WAL replication.  This was working
>> fine and then the other day the system just locked for a few hours.  During
>> that time I saw the L(q) of the _internal_ disk in the 10,000+ range, and
>> it doing _1_ operation a second to the underlying disk... all the while
>> geli is logging 'error 11' to the console (nothing about the underlying
>> disk).  After this happened a static file on the disk (a zip file) had bad
>> data in the middle of a page (after reboot the file was ok.. so it was
>> just in cache).  Again, this disk fully checks ok, no corruption on the
>> disk, no errors from the disk itself.
>>
>>
>> Halp?  where do I even begin with this?   It really feels like there is
>> some massive locking going on in geli in some way?  Where should I even
>> begin looking?  I run geli on most of my systems and don't have any issues.
>> _______________________________________________
>> freebsd-hackers@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
>
> --
>  John-Mark Gurney                Voice: +1 415 225 5579
>
>     "All that I will do, has been done, All that I have, has not."


