From owner-freebsd-hackers@freebsd.org  Sat Feb 10 22:56:23 2018
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 34A67F042C7
 for <freebsd-hackers@mailman.ysv.freebsd.org>;
 Sat, 10 Feb 2018 22:56:23 +0000 (UTC) (envelope-from kib@freebsd.org)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id B052F7742C;
 Sat, 10 Feb 2018 22:56:22 +0000 (UTC) (envelope-from kib@freebsd.org)
Received: from tom.home (kib@localhost [127.0.0.1])
 by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id w1AMu8fa087483
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Sun, 11 Feb 2018 00:56:11 +0200 (EET) (envelope-from kib@freebsd.org)
DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua w1AMu8fa087483
Received: (from kostik@localhost)
 by tom.home (8.15.2/8.15.2/Submit) id w1AMu83p087482;
 Sun, 11 Feb 2018 00:56:08 +0200 (EET) (envelope-from kib@freebsd.org)
X-Authentication-Warning: tom.home: kostik set sender to kib@freebsd.org using
 -f
Date: Sun, 11 Feb 2018 00:56:08 +0200
From: Konstantin Belousov <kib@freebsd.org>
To: Elliott.Rabe@dell.com
Cc: alc@freebsd.org, markj@freebsd.org, freebsd-hackers@freebsd.org,
 Eric.Van.Gyzen@dell.com
Subject: Re: Stale memory during post fork cow pmap update
Message-ID: <20180210225608.GM33564@kib.kiev.ua>
References: <5A7E7F2B.80900@dell.com> <20180210111848.GL33564@kib.kiev.ua>
 <5A7F6A7C.80607@dell.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5A7F6A7C.80607@dell.com>
User-Agent: Mutt/1.9.3 (2018-01-21)
X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00
 autolearn=ham autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.25
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 10 Feb 2018 22:56:23 -0000

On Sat, Feb 10, 2018 at 09:56:20PM +0000, Elliott.Rabe@dell.com wrote:
> On 02/10/2018 05:18 AM, Konstantin Belousov wrote:
> > On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote:
> >> Greetings-
> >>
> >> I've been hunting for the root cause of elusive, slight memory
> >> corruptions in a large, complex process that manages many threads. All
> >> failures and experimentation thus far has been on x86_64 architecture
> >> machines, and pmap_pcid is not in use.
> >>
> >> I believe I have stumbled into a very unlikely race condition in the way
> >> the vm code updates the pmap during write fault processing following a
> >> fork of the process.  In this situation, when the process is forked,
> >> appropriate vm entries are marked copy-on-write. One such entry
> >> allocated by static process initialization is frequently used by many
> >> threads in the process.  This makes it a prime candidate to write-fault
> >> shortly after a fork system call is made.  In this scenario, such a
> >> fault normally burdens the faulting thread with the task of allocating a
> >> new page, entering the page as part of managed memory, and updating the
> >> pmap with the new physical address and the change to writeable status.
> >> This action is followed with an invalidation of the TLB on the current
> >> CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same
> >> on other CPUs (there are often many active threads in this process).
> >> Before this remote TLB invalidation has completed, other CPUs are free
> >> to act on either the old OR new page characteristics.  If other threads
> >> are alive and using contents of the faulting page on other CPUs, bad
> >> things can occur.
> >>
> >> In one simplified and somewhat contrived example, one thread attempts to
> >> write to a location on the faulting page under the protection of a lock
> >> while another thread attempts to read from the same location twice in
> >> succession under the protection of the same lock.  If both the writing
> >> thread and reading thread are running on different CPUs, and if the
> >> write is directed to the new physical address, the reads may come from
> >> different physical addresses if a TLB invalidation occurs between them.
> >> This seemingly violates the guarantees provided by the locking
> >> primitives and can result in subtle memory corruption symptoms.
> >>
> >> It took me quite a while to chase these symptoms from user-space down
> >> into the operating system, and even longer to end up with a stand-alone
> >> test fixture able to reproduce the situation described above on demand.
> >> If I alter the kernel code to perform a two-stage update of the pmap
> >> entry, the observed corruption symptoms disappear.  This two-stage
> >> mechanism updates and invalidates the new physical address in a
> >> read-only state first, and then does a second pmap update and
> >> invalidation to change the status to writeable.  The intended effect was
> >> to cause any other threads writing to the faulting page to become
> >> obstructed until the earlier fault is complete, thus eliminating the
> >> possibility of the physical pages having different contents until the
> >> new physical address was fully visible.  This is goofy, and from an
> >> efficiency standpoint it is obviously undesirable, but it was the first
> >> thing that came to mind, and it seems to be working fine.
> >>
> >> I am not terribly familliar with the higher level design here, so it is
> >> unclear to me if this problem is simply a very unlikely race condition
> >> that hasn't yet been diagnosed or if this is instead the breakdown of
> >> some other mechanism of which I am not aware.  I would appreciate the
> >> insights of those of you who have more history and experience with this
> >> area of the code.
> > You are right describing the operation of the memory copy on fork. But I
> > cannot understand what parts of it, exactly, are problematic, from your
> > description. It is necessary for you to provide the test and provide
> > some kind of the test trace or the output which illustrates the issue
> > you found.
> >
> > Do you mean something like that:
> > - after fork
> > - thread A writes into the page, causing page fault and COW because the
> >    entry has write permissions removed
> > - thread B reads from the page, and since invalidation IPI was not yet
> >    delivered, it reads from the need-copy page, effectively seeing the
> >    old content, before thread A write
> >
> > And you claim is that you can create a situation where both threads A
> > and B owns the same lock around the write and read ?  I do not understand
> > this, since if thread A owns a (usermode) lock, it prevents thread B from
> > taking the same lock until fault is fully handled, in particular, the IPI
> > is delivered.
> 
> Here is the sequence of actions I am referring to.  There is only one 
> lock, and all the writes/reads are on one logical page.
> 
> +The process is forked transitioning a map entry to COW
> +Thread A writes to a page on the map entry, faults, updates the pmap to 
> writable at a new phys addr, and starts TLB invalidations...
> +Thread B acquires a lock, writes to a location on the new phys addr, 
> and releases the lock
> +Thread C acquires the lock, reads from the location on the old phys addr...
> +Thread A ...continues the TLB invalidations which are completed
> +Thread C ...reads from the location on the new phys addr, and releases 
> the lock
> 
> In this example Thread B and C [lock, use and unlock] properly and 
> neither own the lock at the same time.  Thread A was writing somewhere 
> else on the page and so never had/needed the lock.  Thread B sees a 
> location that is only ever read|modified under a lock change beneath it 
> while it is the lock owner.
I believe you mean 'Thread C' in the last sentence.

> 
> I will get a test patch together and make it available as soon as I can.
Please.

So I agree that doing two-stage COW, with the first stage copying page
but keeping it read-only, is the right solution. Below is my take.
During the smoke boot, I noted that there is somewhat related issue in
reevaluation of the map entry permissions.

diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
index 83e12a588ee..149a15f1d9d 100644
--- a/sys/vm/vm_fault.c
+++ b/sys/vm/vm_fault.c
@@ -1135,6 +1157,10 @@ RetryFault:;
 				 */
 				pmap_copy_page(fs.m, fs.first_m);
 				fs.first_m->valid = VM_PAGE_BITS_ALL;
+				if ((fault_flags & VM_FAULT_WIRE) == 0) {
+					prot &= ~VM_PROT_WRITE;
+					fault_type &= ~VM_PROT_WRITE;
+				}
 				if (wired && (fault_flags &
 				    VM_FAULT_WIRE) == 0) {
 					vm_page_lock(fs.first_m);
@@ -1219,6 +1245,12 @@ RetryFault:;
 			 * write-enabled after all.
 			 */
 			prot &= retry_prot;
+			fault_type &= retry_prot;
+			if (prot == 0) {
+				release_page(&fs);
+				unlock_and_deallocate(&fs);
+				goto RetryFault;
+			}
 		}
 	}