Date: Tue, 12 Jul 2022 11:33:47 -0700
From: John Baldwin <jhb@FreeBSD.org>
To: Evgeniy Khramtsov <evgeniy@khramtsov.org>, Ryan Moeller <freqlabs@FreeBSD.org>
Cc: FreeBSD-CURRENT@FreeBSD.org
Subject: Re: BLAKE3 unstability?
Message-ID: <653d1a91-1468-ba15-5365-87a63ed0e2d1@FreeBSD.org>
In-Reply-To: <20220712084101.iqvwyfuhge6myteq@vax.khramtsov.org>
References: <20220709162640.7my2bq6rax5npdhf@vax.khramtsov.org>
 <20220709175605.ofkoft2mglrkaqpf@vax.khramtsov.org>
 <fab2145d-0a0c-aa62-9866-717d3f8c51d5@FreeBSD.org>
 <be7f6fb2-1215-c6d1-73da-fc15571b66be@FreeBSD.org>
 <20220712084101.iqvwyfuhge6myteq@vax.khramtsov.org>

On 7/12/22 1:41 AM, Evgeniy Khramtsov wrote:
>>>> I can reproduce via:
>>>>
>>>> $ truncate -s 10G /tmp/test
>>>> $ mdconfig -f /tmp/test -S 4096
>>>> $ zpool create test /dev/md1
>>>> $ zfs create -o checksum=blake3 test/b
>>>> $ dd if=/dev/random of=/test/b/noise bs=1M count=4096
>>>> $ sync
>>>> $ zpool scrub test
>>>> $ zpool status
>>>
>>> I cannot reproduce this on openzfs/zfs@cb01da68057 (the commit that was
>>> most recently merged) built out of tree on either stable/13 70fd40edb86
>>> or main 9aa02d5120a.
>>>
>>> I'll update a system and see if I can reproduce it with the in-tree ZFS.
>>>
>>> - Ryan
>>>
>> It did not reproduce for me with in-tree ZFS on main@3c9ad9398fcd either.
>>
>> Could you share sysctl kstat.zfs.misc.chksum_bench, maybe we are using
>> different implementations?
>> I do see that blake3 went in with only a Linux module parameter for the
>> implementation selection, so I'll have to fix that. For now we can at least
>> see which was fastest, which should be the one selected. You just won't be
>> able to manually change it to see if that helps.
>>
>> - Ryan
>
> I found the culprit (kernel and base from download.FreeBSD.org
> kernel.txz and base.txz respectively) (I forgot about local sysctl.conf...):
>
> kern.sched.steal_thresh=1
> kern.sched.preempt_thresh=121
>
> Then
>
> #!/bin/sh
>
> truncate -s 10G /tmp/test
> mdconfig -f /tmp/test -S 4096
> zpool create test /dev/md0
> zfs create -o checksum=blake3 test/b
> dd if=/dev/random of=/test/b/noise bs=1M count=4096
> sync
> zpool scrub test
> sleep 3
> zpool status
>
> zpool destroy test
> mdconfig -d -u 0
> rm /tmp/test
>
> As for ULE "tuning", these values give me fine desktop interactivity
> when building lang/rust when nice and idprio did not help, so I left
> them in sysctl.conf. Not sure if scheduling parameters are worthy of
> a ZFS PR, maybe something essential is preempted.

It could be a missing fpu_kern_enter/leave that a lack of preemption
would otherwise cover over. I thought that missing that would give a
panic in the kernel though, due to FPU instructions being disabled
(including vector instructions). Maybe ZFS isn't using
fpu_kern_enter() with FPU_KERN_NOCTX and is instead trying to juggle
contexts, and it has a bug in how it manages saved FPU contexts and
reuses a context? If so, I would just suggest that ZFS switch to using
FPU_KERN_NOCTX instead, which runs all SSE-type code in a critical
section to disable preemption but avoids having to allocate and manage
FPU contexts.

-- 
John Baldwin
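
For illustration, the bracket pattern suggested above can be sketched
as follows. This is a userland mock, not ZFS or kernel source: the
fpu_kern_enter()/fpu_kern_leave() names and argument order follow the
x86 <machine/fpu.h> interface, but the bodies here are simplified
stand-ins that only track state. struct thread, mock_thread, simd_mix()
and checksum_with_fpu() are hypothetical names invented for the example,
and the flag value is arbitrary.

```c
/*
 * Userland mock of the in-kernel FPU bracket pattern with FPU_KERN_NOCTX.
 * In a real kernel the declarations come from <machine/fpu.h>; everything
 * marked "stand-in" is hypothetical scaffolding so this compiles anywhere.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FPU_KERN_NOCTX 0x0004	/* stand-in value for the mock */

struct thread {			/* stand-in for the kernel's struct thread */
	int critnest;		/* > 0 means "in a critical section" */
	int fpu_in_use;		/* set between enter and leave */
};
struct fpu_kern_ctx;		/* opaque; not needed with NOCTX */

static struct thread mock_thread;
#define curthread (&mock_thread)	/* stand-in for the kernel macro */

/* Stand-in: with NOCTX no context is passed and preemption is disabled. */
static int
fpu_kern_enter(struct thread *td, struct fpu_kern_ctx *ctx, unsigned int flags)
{
	assert((flags & FPU_KERN_NOCTX) != 0 && ctx == NULL);
	assert(!td->fpu_in_use);	/* no nesting in this mock */
	td->critnest++;			/* models entering a critical section */
	td->fpu_in_use = 1;
	return (0);
}

/* Stand-in: releases the FPU and leaves the critical section. */
static int
fpu_kern_leave(struct thread *td, struct fpu_kern_ctx *ctx)
{
	assert(ctx == NULL && td->fpu_in_use);
	td->fpu_in_use = 0;
	td->critnest--;
	return (0);
}

/* Hypothetical placeholder for a vectorized hash routine. */
static uint32_t
simd_mix(const uint8_t *buf, size_t len)
{
	uint32_t h = 0;

	/* Vector code must only run inside the enter/leave bracket. */
	assert(curthread->fpu_in_use);
	for (size_t i = 0; i < len; i++)
		h = h * 31u + buf[i];
	return (h);
}

/* Every path that reaches vector code brackets it with enter/leave. */
uint32_t
checksum_with_fpu(const uint8_t *buf, size_t len)
{
	uint32_t h;

	fpu_kern_enter(curthread, NULL, FPU_KERN_NOCTX);
	h = simd_mix(buf, len);
	fpu_kern_leave(curthread, NULL);
	return (h);
}
```

The assert in simd_mix() models the failure being discussed: if a code
path reaches the vector routine without the bracket, nothing protects
the FPU state across a preemption, and a more aggressive
kern.sched.preempt_thresh would make that window easier to hit.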