Date:      Tue, 13 Feb 2018 15:30:10 +0200
From:      Konstantin Belousov <kib@freebsd.org>
To:        Elliott.Rabe@dell.com
Cc:        freebsd-hackers@freebsd.org, Eric.Van.Gyzen@dell.com, alc@FreeBSD.org, markj@FreeBSD.org, truckman@FreeBSD.org
Subject:   Re: Stale memory during post fork cow pmap update
Message-ID:  <20180213133010.GD94212@kib.kiev.ua>
In-Reply-To: <5A82AB7C.6090404@dell.com>
References:  <5A7E7F2B.80900@dell.com> <20180210111848.GL33564@kib.kiev.ua> <5A7F6A7C.80607@dell.com> <20180210225608.GM33564@kib.kiev.ua> <5A82AB7C.6090404@dell.com>

On Tue, Feb 13, 2018 at 09:10:21AM +0000, Elliott.Rabe@dell.com wrote:
> 
> On 02/10/2018 04:56 PM, Konstantin Belousov wrote:
> > On Sat, Feb 10, 2018 at 09:56:20PM +0000, Elliott.Rabe@dell.com wrote:
> >> On 02/10/2018 05:18 AM, Konstantin Belousov wrote:
> >>> On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote:
> >>>> ...
> >>>> I've been hunting for the root cause of elusive, slight memory
> >>>> corruptions in a large, complex process that manages many threads. All
> >>>> failures and experimentation thus far have been on x86_64 architecture
> >>>> machines, and pmap_pcid is not in use.
> >>>> ...
> >>> It is necessary for you to provide the test and provide
> >>> some kind of the test trace or the output which illustrates the issue
> >>> you found.
> >> Here is the sequence of actions I am referring to.  There is only one
> >> lock, and all the writes/reads are on one logical page.
> >>
> >> +The process is forked transitioning a map entry to COW
> >> +Thread A writes to a page on the map entry, faults, updates the pmap to
> >> writable at a new phys addr, and starts TLB invalidations...
> >> +Thread B acquires a lock, writes to a location on the new phys addr,
> >> and releases the lock
> >> +Thread C acquires the lock, reads from the location on the old phys addr...
> >> +Thread A ...continues the TLB invalidations which are completed
> >> +Thread C ...reads from the location on the new phys addr, and releases
> >> the lock
> >>
> >> In this example Thread B and C [lock, use and unlock] properly and
> >> neither own the lock at the same time.  Thread A was writing somewhere
> >> else on the page and so never had/needed the lock.  Thread B sees a
> >> location that is only ever read|modified under a lock change beneath it
> >> while it is the lock owner.
> > I believe you mean 'Thread C' in the last sentence.
> You are correct, I did mean Thread C.
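To make the sequence above concrete, the access pattern can be sketched in
user space roughly as below.  This is only an illustration of what threads
A, B and C do; the names and sizes are made up and it is not the attached
fixture.  As noted further down, a purely user-space version did not
reproduce the problem reliably, so treat it as a model of the ordering,
not as a regression test.

#include <assert.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

#define	PAGE	4096
#define	ITERS	1000000L

/* Everything below lives on one logical page of the parent. */
static struct {
	volatile long	unlocked_word;	/* written by A without the lock */
	volatile long	locked_word;	/* written by B, read by C, locked */
	char		pad[PAGE - 2 * sizeof(long)];
} shared __attribute__((aligned(PAGE)));

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *
thread_a(void *arg)	/* dirties the COW page, takes the write fault */
{
	for (long i = 0; i < ITERS; i++)
		shared.unlocked_word = i;
	return (NULL);
}

static void *
thread_b(void *arg)	/* writes the locked location under the lock */
{
	for (long i = 1; i <= ITERS; i++) {
		pthread_mutex_lock(&lock);
		shared.locked_word = i;
		pthread_mutex_unlock(&lock);
	}
	return (NULL);
}

static void *
thread_c(void *arg)	/* reads the locked location under the lock */
{
	long seen, last = 0;

	for (long i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&lock);
		seen = shared.locked_word;
		pthread_mutex_unlock(&lock);
		assert(seen >= last);	/* stale TLB entry => goes backwards */
		last = seen;
	}
	return (NULL);
}

int
main(void)
{
	pthread_t a, b, c;

	shared.unlocked_word = 0;	/* make sure the page is resident */
	if (fork() == 0)		/* turn the mapping into COW */
		_exit(0);
	pthread_create(&a, NULL, thread_a, NULL);
	pthread_create(&b, NULL, thread_b, NULL);
	pthread_create(&c, NULL, thread_c, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_join(c, NULL);
	wait(NULL);
	return (0);
}

The assert encodes the invariant that values read under the lock can never
go backwards; the reported corruption is exactly such a backwards read,
taken through a stale TLB entry that still points at the old physical page.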
> >> I will get a test patch together and make it available as soon as I can.
> > Please.
> 
> Sorry for my delayed response; I had been working off a separate project 
> based on releng/11.1, and it took me longer than I expected to get a dev 
> rig set up off of master on which I could re-evaluate the situation.
> 
> I am attaching my test apparatus, however, calling it a test is probably 
> a disservice to tests everywhere.  I consider this entire fixture 
> disposable, so I didn't get carried away trying to properly 
> style/partition/locate the code.  I never wanted anything this 
> complicated either; it pretty much just evolved into a development aid 
> to spelunk around in the fault/pmap handling.  My attempts thus far at 
> reducing the fixture to be user-space only have not been successful.  
> Additionally, I have noticed that the fixture is /very/ sensitive to any 
> changes in timing; several of the debugging entries even seem key to 
> hitting the problem.  I didn't have much luck getting the problem to 
> manifest on a virtual machine guest w/ a VirtualBox host either.  For 
> all of these reasons, I don't think there is value here in trying to use 
> this as any sort of regression fixture, unless perhaps if someone is 
> willing to try to turn it into something less ridiculous.  Despite all 
> shortcomings, on my hardware anyways, it is able to reproduce the 
> example I described pretty much immediately when I use it with the 
> debugging knob "-v". Instructions and expectations are at the top of the 
> main test fixture source file.
Apparently Ryzen CPUs are able to demonstrate it quite reliably with the
python driver for the samba build.  It was very surprising to me, esp.
because I spent the whole of last week trying to understand the Ryzen bug
and thought it was more likely a CPU store/load inconsistency than a
software issue.  See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584.

> 
> I am also attaching a patch that I have been using to prevent the 
> problem.  I was looking at things with a much narrower view and made the 
> changes directly in pmap_enter.  I suspect the internal 
> double-update-invalidate is slightly better performance-wise than taking 
> two whole faults, but I haven't benchmarked it; it probably doesn't 
> matter much compared to the cost and frequency of the actual copies, and 
> it also has the disadvantage of being architecture specific.  I also 
> don't feel like I have enough experience with the vm fault code in 
> general for my commentary to be very valuable here.  However, I do 
> wonder: 1) if there are any other scenarios where a potentially 
> accessible page might be undergoing an [address+writable] change in the 
> same way (this sort of thing seems hard to read out of code), and 2) if 
> there is ever any legal reason why an accessible page should be 
> undergoing such a change?  If not, perhaps we could come up with an 
> appropriate sanity-check condition to guard against any cases of this 
> sort of thing accidentally slipping in the future.
I am not sure how to formulate such a check.  I believe there is no other
place in the kernel, except the COW handling, which can create a similar
situation, but I cannot guarantee it.

I also believe that the problem exists for all SMP hardware, so fixing
the pmaps would create a lot of work, which is also quite hard to validate.
Note that the system generally considers faults which do not cause disk
I/O as 'easy'; they are accounted as 'soft faults'.  Sure, not causing
the second fault for the write would be more efficient than arranging for
it, if only because we would not need to look up and lock the map and the
shadow chain of vm objects and pages, but fork is quite costly already and
code simplicity there is also important.
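
For reference, the 'soft fault' accounting is visible from userland via
getrusage(2): a post-fork COW write is resolved without any disk I/O and
shows up only in ru_minflt.  A minimal sketch (hypothetical; the counts may
be slightly off because of unrelated faults taken by the child):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

#define	NPAGES	256
#define	PGSZ	4096

int
main(void)
{
	struct rusage before, after;
	char *buf;

	buf = malloc(NPAGES * PGSZ);
	memset(buf, 1, NPAGES * PGSZ);	/* fault the pages in before fork */
	if (fork() == 0) {
		/* Every write below hits a COW page inherited from the parent. */
		getrusage(RUSAGE_SELF, &before);
		for (int i = 0; i < NPAGES; i++)
			buf[i * PGSZ] = 2;
		getrusage(RUSAGE_SELF, &after);
		printf("soft faults: %ld, hard faults: %ld\n",
		    after.ru_minflt - before.ru_minflt,
		    after.ru_majflt - before.ru_majflt);
		_exit(0);
	}
	wait(NULL);
	free(buf);
	return (0);
}

The difference between the two counters is exactly the 'disk io' criterion
mentioned above: the COW copies cost a trap and a page copy each, but no I/O.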

> 
> The attached git patches should apply and build cleanly on master commit 
> fe0ee5c.  I have verified at least these three scenarios in my environment:
> 1) the fixture alone reproduces the problem.
> 2) the fixture with my patch does not reproduce the problem.
> 3) the fixture with your patch does not reproduce the problem.

I put the review with the patch at https://reviews.freebsd.org/D14347.
You have to add yourself as a subscriber or reviewer.


