Date:      Tue, 12 Jul 2022 11:33:47 -0700
From:      John Baldwin <jhb@FreeBSD.org>
To:        Evgeniy Khramtsov <evgeniy@khramtsov.org>, Ryan Moeller <freqlabs@FreeBSD.org>
Cc:        FreeBSD-CURRENT@FreeBSD.org
Subject:   Re: BLAKE3 unstability?
Message-ID:  <653d1a91-1468-ba15-5365-87a63ed0e2d1@FreeBSD.org>
In-Reply-To: <20220712084101.iqvwyfuhge6myteq@vax.khramtsov.org>
References:  <20220709162640.7my2bq6rax5npdhf@vax.khramtsov.org> <20220709175605.ofkoft2mglrkaqpf@vax.khramtsov.org> <fab2145d-0a0c-aa62-9866-717d3f8c51d5@FreeBSD.org> <be7f6fb2-1215-c6d1-73da-fc15571b66be@FreeBSD.org> <20220712084101.iqvwyfuhge6myteq@vax.khramtsov.org>

On 7/12/22 1:41 AM, Evgeniy Khramtsov wrote:
>>>> I can reproduce via:
>>>>
>>>> $ truncate -s 10G /tmp/test
>>>> $ mdconfig -f /tmp/test -S 4096
>>>> $ zpool create test /dev/md1
>>>> $ zfs create -o checksum=blake3 test/b
>>>> $ dd if=/dev/random of=/test/b/noise bs=1M count=4096
>>>> $ sync
>>>> $ zpool scrub test
>>>> $ zpool status
>>>
>>> I cannot reproduce this on openzfs/zfs@cb01da68057 (the commit that was
>>> most recently merged) built out of tree on either stable/13 70fd40edb86
>>> or main 9aa02d5120a.
>>>
>>> I'll update a system and see if I can reproduce it with the in-tree ZFS.
>>>
>>> - Ryan
>>>
>> It did not reproduce for me with in-tree ZFS on main@3c9ad9398fcd either.
>>
>> Could you share sysctl kstat.zfs.misc.chksum_bench, maybe we are using
>> different implementations?
>> I do see that blake3 went in with only a Linux module parameter for the
>> implementation selection, so I'll have to fix that. For now we can at least
>> see which was fastest, which should be the one selected. You just won't be
>> able to manually change it to see if that helps.
>>
>> - Ryan
> 
> I found the culprit (kernel and base from download.FreeBSD.org
> kernel.txz and base.txz respectively) (I forgot about local sysctl.conf...):
> 
> kern.sched.steal_thresh=1
> kern.sched.preempt_thresh=121
> 
> Then
> 
> #!/bin/sh
> 
> truncate -s 10G /tmp/test
> mdconfig -f /tmp/test -S 4096
> zpool create test /dev/md0
> zfs create -o checksum=blake3 test/b
> dd if=/dev/random of=/test/b/noise bs=1M count=4096
> sync
> zpool scrub test
> sleep 3
> zpool status
> 
> zpool destroy test
> mdconfig -d -u 0
> rm /tmp/test
> 
> As for ULE "tuning", these values give me fine desktop interactivity
> when building lang/rust when nice and idprio did not help, so I left
> them in sysctl.conf. Not sure if scheduling parameters are worthy of
> a ZFS PR, maybe something essential is preempted.

It could be a missing fpu_kern_enter()/fpu_kern_leave() pair that a lack
of preemption would otherwise paper over.  I thought that omitting them
would panic the kernel, though, since FPU instructions (including vector
instructions) are disabled there.  Maybe ZFS isn't calling
fpu_kern_enter() with FPU_KERN_NOCTX and is instead trying to juggle
contexts, with a bug in how it manages and reuses saved FPU contexts?
If so, I would just suggest that ZFS switch to FPU_KERN_NOCTX, which
runs all SSE-type code in a critical section to disable preemption and
avoids having to allocate and manage FPU contexts.
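
For illustration, here is a rough sketch of what an FPU_KERN_NOCTX section
looks like around a SIMD routine.  It is not ZFS code: checksum_simd() and
checksum_block() are hypothetical names, the header list for the kernel
branch is my assumption for amd64, and the #else branch only holds userland
stand-ins (including a placeholder flag value) so the sketch compiles
outside a kernel build.

```c
#ifdef _KERNEL
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <machine/fpu.h>	/* assumed location of fpu_kern_enter() on amd64 */
#else
/* Userland stand-ins, for illustration only; not the kernel implementation. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
struct thread;
struct fpu_kern_ctx;
#define	FPU_KERN_NOCTX	0x0004	/* placeholder value for the stand-in */
static struct thread *curthread = NULL;
static int fpu_depth;		/* counts open FPU sections in the stand-in */

static void
fpu_kern_enter(struct thread *td, struct fpu_kern_ctx *ctx, unsigned int flags)
{
	(void)td;
	/* With FPU_KERN_NOCTX the caller passes no context. */
	assert(ctx == NULL && flags == FPU_KERN_NOCTX);
	fpu_depth++;
}

static int
fpu_kern_leave(struct thread *td, struct fpu_kern_ctx *ctx)
{
	(void)td;
	(void)ctx;
	assert(fpu_depth > 0);	/* every leave must match an enter */
	fpu_depth--;
	return (0);
}
#endif

/* Hypothetical stand-in for a vectorized hash routine (plain C here). */
static uint32_t
checksum_simd(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return (sum);
}

/* Hypothetical caller: brackets the SIMD work with enter/leave. */
static uint32_t
checksum_block(const uint8_t *buf, size_t len)
{
	uint32_t sum;

	/*
	 * No FPU context is allocated or saved by the caller: ctx is NULL
	 * and the code between enter and leave runs with preemption
	 * disabled.
	 */
	fpu_kern_enter(curthread, NULL, FPU_KERN_NOCTX);
	sum = checksum_simd(buf, len);
	fpu_kern_leave(curthread, NULL);
	return (sum);
}
```

The point of the shape above is that nothing has to be allocated, tracked,
or reused across calls, so there is no saved-context bookkeeping to get
wrong; the cost is that the section cannot be preempted, so it should stay
short.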

-- 
John Baldwin


