Date:      Sun, 11 Feb 2018 00:56:08 +0200
From:      Konstantin Belousov <kib@freebsd.org>
To:        Elliott.Rabe@dell.com
Cc:        alc@freebsd.org, markj@freebsd.org, freebsd-hackers@freebsd.org, Eric.Van.Gyzen@dell.com
Subject:   Re: Stale memory during post fork cow pmap update
Message-ID:  <20180210225608.GM33564@kib.kiev.ua>
In-Reply-To: <5A7F6A7C.80607@dell.com>
References:  <5A7E7F2B.80900@dell.com> <20180210111848.GL33564@kib.kiev.ua> <5A7F6A7C.80607@dell.com>

On Sat, Feb 10, 2018 at 09:56:20PM +0000, Elliott.Rabe@dell.com wrote:
> On 02/10/2018 05:18 AM, Konstantin Belousov wrote:
> > On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote:
> >> Greetings-
> >>
> >> I've been hunting for the root cause of elusive, slight memory
> >> corruptions in a large, complex process that manages many threads. All
> >> failures and experimentation thus far have been on x86_64 architecture
> >> machines, and pmap_pcid is not in use.
> >>
> >> I believe I have stumbled into a very unlikely race condition in the way
> >> the vm code updates the pmap during write fault processing following a
> >> fork of the process.  In this situation, when the process is forked,
> >> appropriate vm entries are marked copy-on-write. One such entry
> >> allocated by static process initialization is frequently used by many
> >> threads in the process.  This makes it a prime candidate to write-fault
> >> shortly after a fork system call is made.  In this scenario, such a
> >> fault normally burdens the faulting thread with the task of allocating a
> >> new page, entering the page as part of managed memory, and updating the
> >> pmap with the new physical address and the change to writeable status.
> >> This action is followed by an invalidation of the TLB on the current
> >> CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same
> >> on other CPUs (there are often many active threads in this process).
> >> Before this remote TLB invalidation has completed, other CPUs are free
> >> to act on either the old OR new page characteristics.  If other threads
> >> are alive and using contents of the faulting page on other CPUs, bad
> >> things can occur.
> >>
> >> In one simplified and somewhat contrived example, one thread attempts to
> >> write to a location on the faulting page under the protection of a lock
> >> while another thread attempts to read from the same location twice in
> >> succession under the protection of the same lock.  If both the writing
> >> thread and reading thread are running on different CPUs, and if the
> >> write is directed to the new physical address, the reads may come from
> >> different physical addresses if a TLB invalidation occurs between them.
> >> This seemingly violates the guarantees provided by the locking
> >> primitives and can result in subtle memory corruption symptoms.
> >>
> >> It took me quite a while to chase these symptoms from user-space down
> >> into the operating system, and even longer to end up with a stand-alone
> >> test fixture able to reproduce the situation described above on demand.
> >> If I alter the kernel code to perform a two-stage update of the pmap
> >> entry, the observed corruption symptoms disappear.  This two-stage
> >> mechanism updates and invalidates the new physical address in a
> >> read-only state first, and then does a second pmap update and
> >> invalidation to change the status to writeable.  The intended effect was
> >> to cause any other threads writing to the faulting page to become
> >> obstructed until the earlier fault is complete, thus eliminating the
> >> possibility of the physical pages having different contents until the
> >> new physical address was fully visible.  This is goofy, and from an
> >> efficiency standpoint it is obviously undesirable, but it was the first
> >> thing that came to mind, and it seems to be working fine.
> >>
> >> I am not terribly familiar with the higher-level design here, so it is
> >> unclear to me if this problem is simply a very unlikely race condition
> >> that hasn't yet been diagnosed or if this is instead the breakdown of
> >> some other mechanism of which I am not aware.  I would appreciate the
> >> insights of those of you who have more history and experience with this
> >> area of the code.
> > You are right in describing the operation of the memory copy on fork,
> > but I cannot understand what parts of it, exactly, are problematic, from
> > your description. It is necessary for you to provide the test, along
> > with some kind of test trace or output which illustrates the issue
> > you found.
> >
> > Do you mean something like that:
> > - after fork
> > - thread A writes into the page, causing page fault and COW because the
> >    entry has write permissions removed
> > - thread B reads from the page, and since the invalidation IPI was not yet
> >    delivered, it reads from the need-copy page, effectively seeing the
> >    old content from before thread A's write
> >
> > And your claim is that you can create a situation where both threads A
> > and B own the same lock around the write and read?  I do not understand
> > this, since if thread A owns a (usermode) lock, it prevents thread B from
> > taking the same lock until the fault is fully handled, in particular, until the IPI
> > is delivered.
> 
> Here is the sequence of actions I am referring to.  There is only one 
> lock, and all the writes/reads are on one logical page.
> 
> +The process is forked, transitioning a map entry to COW
> +Thread A writes to a page on the map entry, faults, updates the pmap to 
> writable at a new phys addr, and starts TLB invalidations...
> +Thread B acquires a lock, writes to a location on the new phys addr, 
> and releases the lock
> +Thread C acquires the lock, reads from the location on the old phys addr...
> +Thread A ...continues the TLB invalidations, which are completed
> +Thread C ...reads from the location on the new phys addr, and releases 
> the lock
> 
> In this example Threads B and C [lock, use and unlock] properly, and 
> neither owns the lock at the same time.  Thread A was writing somewhere 
> else on the page and so never had/needed the lock.  Thread B sees a 
> location, which is only ever read|modified under a lock, change beneath it 
> while it is the lock owner.
I believe you mean 'Thread C' in the last sentence.

> 
> I will get a test patch together and make it available as soon as I can.
Please.
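
Something along these lines is what I have in mind, I assume (a minimal
sketch only, not your fixture; the buffer layout, the indices and the
thread split are my guesses):

/*
 * cow_race.c (name is mine): three threads touching one static page,
 * while the main thread forks in a loop to re-arm copy-on-write.
 */
#include <sys/wait.h>

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Page-aligned static buffer, so that all accesses hit a single page. */
static volatile char buf[4096] __attribute__((aligned(4096)));
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void *
writer(void *arg)
{

	for (;;) {
		pthread_mutex_lock(&mtx);
		buf[128]++;			/* write under the lock */
		pthread_mutex_unlock(&mtx);
	}
}

static void *
reader(void *arg)
{
	char a, b;

	for (;;) {
		pthread_mutex_lock(&mtx);
		a = buf[128];			/* two reads under a single */
		b = buf[128];			/* lock acquisition */
		pthread_mutex_unlock(&mtx);
		if (a != b)
			abort();		/* must never differ */
	}
}

static void *
faulter(void *arg)
{

	for (;;)
		buf[3000]++;	/* unlocked write elsewhere on the page; this
				   is the thread that takes the COW fault */
}

int
main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, writer, NULL);
	pthread_create(&t, NULL, reader, NULL);
	pthread_create(&t, NULL, faulter, NULL);
	for (;;) {
		pid_t pid = fork();	/* marks the data pages COW again */
		if (pid == 0) {
			usleep(2000);	/* keep the child around so the
					   parent really copies the page */
			_exit(0);
		}
		waitpid(pid, NULL, 0);
	}
}

If the race you describe fires, the reader aborts after seeing two
different values under one lock acquisition, shortly after a fork.
Build with something like 'cc -O2 -o cow_race cow_race.c -lpthread'.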

So I agree that doing a two-stage COW, with the first stage copying the
page but keeping it read-only, is the right solution. Below is my take.
During the smoke boot, I noted that there is a somewhat related issue in
the reevaluation of the map entry permissions.
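
With the first hunk, the freshly copied page should be entered into the
pmap without write permission, so the writer faults once more, and write
access is granted only after the shootdown of the stale translation of
the old page has completed.  In pmap terms the two stages amount to
roughly this (a schematic only, not the patch; locking, error handling
and the wired case are omitted, and pmap/va/new_m stand for the faulting
pmap, the faulting address and the copied page):

	/* Stage 1: expose the copy, but without write permission. */
	pmap_enter(pmap, va, new_m, VM_PROT_READ, VM_PROT_READ, 0);
	/*
	 * Replacing the old translation includes the TLB shootdown, so
	 * by the time anybody is allowed to write to the copy, no CPU
	 * can still read the old page through a stale TLB entry.
	 */

	/* Stage 2: the next write faults again and gets write access. */
	pmap_enter(pmap, va, new_m, VM_PROT_READ | VM_PROT_WRITE,
	    VM_PROT_READ | VM_PROT_WRITE, 0);

The second hunk handles the reevaluation issue: if the map entry
protection was reduced in parallel and prot becomes empty after masking
with retry_prot, the fault is restarted instead of proceeding.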

diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
index 83e12a588ee..149a15f1d9d 100644
--- a/sys/vm/vm_fault.c
+++ b/sys/vm/vm_fault.c
@@ -1135,6 +1157,10 @@ RetryFault:;
 				 */
 				pmap_copy_page(fs.m, fs.first_m);
 				fs.first_m->valid = VM_PAGE_BITS_ALL;
+				if ((fault_flags & VM_FAULT_WIRE) == 0) {
+					prot &= ~VM_PROT_WRITE;
+					fault_type &= ~VM_PROT_WRITE;
+				}
 				if (wired && (fault_flags &
 				    VM_FAULT_WIRE) == 0) {
 					vm_page_lock(fs.first_m);
@@ -1219,6 +1245,12 @@ RetryFault:;
 			 * write-enabled after all.
 			 */
 			prot &= retry_prot;
+			fault_type &= retry_prot;
+			if (prot == 0) {
+				release_page(&fs);
+				unlock_and_deallocate(&fs);
+				goto RetryFault;
+			}
 		}
 	}
 


