Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 10 Feb 2018 13:18:48 +0200
From:      Konstantin Belousov <kib@freebsd.org>
To:        Elliott.Rabe@dell.com
Cc:        freebsd-hackers@freebsd.org, markj@FreeBSD.org, alc@FreeBSD.org, Eric.Van.Gyzen@dell.com
Subject:   Re: Stale memory during post fork cow pmap update
Message-ID:  <20180210111848.GL33564@kib.kiev.ua>
In-Reply-To: <5A7E7F2B.80900@dell.com>
References:  <5A7E7F2B.80900@dell.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote:
> Greetings-
> 
> I've been hunting for the root cause of elusive, slight memory 
> corruptions in a large, complex process that manages many threads. All 
> failures and experimentation thus far has been on x86_64 architecture 
> machines, and pmap_pcid is not in use.
> 
> I believe I have stumbled into a very unlikely race condition in the way 
> the vm code updates the pmap during write fault processing following a 
> fork of the process.  In this situation, when the process is forked, 
> appropriate vm entries are marked copy-on-write. One such entry 
> allocated by static process initialization is frequently used by many 
> threads in the process.  This makes it a prime candidate to write-fault 
> shortly after a fork system call is made.  In this scenario, such a 
> fault normally burdens the faulting thread with the task of allocating a 
> new page, entering the page as part of managed memory, and updating the 
> pmap with the new physical address and the change to writeable status.  
> This action is followed with an invalidation of the TLB on the current 
> CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same 
> on other CPUs (there are often many active threads in this process).  
> Before this remote TLB invalidation has completed, other CPUs are free 
> to act on either the old OR new page characteristics.  If other threads 
> are alive and using contents of the faulting page on other CPUs, bad 
> things can occur.
> 
> In one simplified and somewhat contrived example, one thread attempts to 
> write to a location on the faulting page under the protection of a lock 
> while another thread attempts to read from the same location twice in 
> succession under the protection of the same lock.  If both the writing 
> thread and reading thread are running on different CPUs, and if the 
> write is directed to the new physical address, the reads may come from 
> different physical addresses if a TLB invalidation occurs between them.  
> This seemingly violates the guarantees provided by the locking 
> primitives and can result in subtle memory corruption symptoms.
> 
> It took me quite a while to chase these symptoms from user-space down 
> into the operating system, and even longer to end up with a stand-alone 
> test fixture able to reproduce the situation described above on demand.  
> If I alter the kernel code to perform a two-stage update of the pmap 
> entry, the observed corruption symptoms disappear.  This two-stage 
> mechanism updates and invalidates the new physical address in a 
> read-only state first, and then does a second pmap update and 
> invalidation to change the status to writeable.  The intended effect was 
> to cause any other threads writing to the faulting page to become 
> obstructed until the earlier fault is complete, thus eliminating the 
> possibility of the physical pages having different contents until the 
> new physical address was fully visible.  This is goofy, and from an 
> efficiency standpoint it is obviously undesirable, but it was the first 
> thing that came to mind, and it seems to be working fine.
> 
> I am not terribly familliar with the higher level design here, so it is 
> unclear to me if this problem is simply a very unlikely race condition 
> that hasn't yet been diagnosed or if this is instead the breakdown of 
> some other mechanism of which I am not aware.  I would appreciate the 
> insights of those of you who have more history and experience with this 
> area of the code.

You are right describing the operation of the memory copy on fork. But I
cannot understand what parts of it, exactly, are problematic, from your
description. It is necessary for you to provide the test and provide
some kind of the test trace or the output which illustrates the issue
you found.

Do you mean something like that:
- after fork
- thread A writes into the page, causing page fault and COW because the
  entry has write permissions removed
- thread B reads from the page, and since invalidation IPI was not yet
  delivered, it reads from the need-copy page, effectively seeing the
  old content, before thread A write

And you claim is that you can create a situation where both threads A
and B owns the same lock around the write and read ?  I do not understand
this, since if thread A owns a (usermode) lock, it prevents thread B from
taking the same lock until fault is fully handled, in particular, the IPI
is delivered.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180210111848.GL33564>