Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 1 Sep 2023 12:04:22 -0400
From:      Alexander Motin <mav@FreeBSD.org>
To:        Kyle Evans <kevans@FreeBSD.org>, Martin Matuska <mm@freebsd.org>
Cc:        src-committers@freebsd.org, dev-commits-src-all@freebsd.org, dev-commits-src-main@freebsd.org, Cy Schubert <Cy.Schubert@cschubert.com>
Subject:   Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
Message-ID:  <80777717-1d67-104a-94f6-2ac8112e41b8@FreeBSD.org>
In-Reply-To: <7b12cc47-0e41-ee8c-2165-9e81874c3490@FreeBSD.org>
References:  <202308270509.37R596B5048298@gitrepo.freebsd.org> <ZO_aOaf-eGiCMCKy@cell.glebi.us> <c09c92df-90f5-8c94-4125-9e33262bc686@FreeBSD.org> <a9a0b8b4-b47b-b629-37b6-1c18c8736859@FreeBSD.org> <65269e7a-4c3f-95ff-3e81-91b76e023fbd@FreeBSD.org> <7b12cc47-0e41-ee8c-2165-9e81874c3490@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On 01.09.2023 11:46, Kyle Evans wrote:
> On 9/1/23 08:41, Alexander Motin wrote:
>> On 31.08.2023 22:18, Kyle Evans wrote:
>>> It seems to have clearly been stomped on by uma trashing. Encountered 
>>> while running a pkgbase build, I think while it was in the packaging 
>>> phase. I note in particular in that frame:
>>>
>>> (kgdb) p/x lwb->lwb_issued_timestamp
>>> $4 = 0xdeadc0dedeadc0de
>>>
>>> So I guess it was freed sometime during one of the previous two 
>>> zio_nowait() calls.
>>
>> Thank you, Kyle.  If the source lines are resolved correctly and it 
>> really crashes on lwb_child_zio access, then I do see there a possible 
>> race condition, even though I think it would involve at least 2 or may 
>> be even 3 different threads.
>>
> 
> Oh, sorry- yes, it was the access to lwb_child_zio there.
> 
> 
>> I've just created this new PR to address it:
>> https://github.com/openzfs/zfs/pull/15233
>>
>> If you'll be able to test it, include also the two previous:
>> https://github.com/openzfs/zfs/pull/15227
>> https://github.com/openzfs/zfs/pull/15228
>>
>> Thank you for something actionable, it really feels much better! :)
>>
> 
> Perfect, thanks! I haven't been able to reproduce it since the first 
> time, but your explanation sounds plausible to me.
> 
> I'm not a ZFS developer, but it's not clear to me how I didn't end up 
> tripping over other assertions, though; e.g., in zil_lwb_flush_vdevs_done:
> 
> 1442         ASSERT3S(lwb->lwb_state, ==, LWB_STATE_WRITE_DONE);
> 1443         lwb->lwb_state = LWB_STATE_FLUSH_DONE;
> 
> lwb_state seems to only be set to LWB_STATE_WRITE_DONE in 
> zil_lwb_write_done (lwb_write_zio's completion routine). I would've 
> thought all three of these were executed synchronously in 
> __zio_execute(), which would presumably put us in LWB_STATE_ISSUED at 
> the time of completing the lwb_root_zio?

That is where ZIO dependencies work.  lwb_root_zio can never complete 
before lwb_write_zio completion.   So first zil_lwb_write_done() on 
lwb_write_zio completion should move the lwb to LWB_STATE_WRITE_DONE, 
then zil_lwb_flush_vdevs_done() on lwb_root_zio completion should move 
it to LWB_STATE_FLUSH_DONE, at which state zil_sync() can free it.  If 
only at that point we try to check lwb->lwb_child_zio, we see the 
0xdeadc0dedeadc0de and try call zio_nowait() on it with the result you 
saw.  Would lwb_child_zio actually be used by the specific lwb, 
lwb_write_zio could not proceed before its completion first and so late 
zio_nowait() call for it would be legal, but not otherwise.

-- 
Alexander Motin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?80777717-1d67-104a-94f6-2ac8112e41b8>