Date:      Mon, 12 Feb 2018 13:40:40 -0800 (PST)
From:      Don Lewis <truckman@FreeBSD.org>
To:        Konstantin Belousov <kib@freebsd.org>
Cc:        Elliott.Rabe@dell.com, alc@freebsd.org, freebsd-hackers@freebsd.org, markj@freebsd.org, Eric.Van.Gyzen@dell.com
Subject:   Re: Stale memory during post fork cow pmap update
Message-ID:  <tkrat.949e51f1241d331f@FreeBSD.org>
In-Reply-To: <20180210225608.GM33564@kib.kiev.ua>
References:  <5A7E7F2B.80900@dell.com> <20180210111848.GL33564@kib.kiev.ua> <5A7F6A7C.80607@dell.com> <20180210225608.GM33564@kib.kiev.ua>

On 11 Feb, Konstantin Belousov wrote:
> On Sat, Feb 10, 2018 at 09:56:20PM +0000, Elliott.Rabe@dell.com wrote:
>> On 02/10/2018 05:18 AM, Konstantin Belousov wrote:
>> > On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote:
>> >> Greetings-
>> >>
>> >> I've been hunting for the root cause of elusive, slight memory
>> >> corruptions in a large, complex process that manages many threads. All
>> >> failures and experimentation thus far have been on x86_64 architecture
>> >> machines, and pmap_pcid is not in use.
>> >>
>> >> I believe I have stumbled into a very unlikely race condition in the way
>> >> the vm code updates the pmap during write fault processing following a
>> >> fork of the process.  In this situation, when the process is forked,
>> >> appropriate vm entries are marked copy-on-write. One such entry
>> >> allocated by static process initialization is frequently used by many
>> >> threads in the process.  This makes it a prime candidate to write-fault
>> >> shortly after a fork system call is made.  In this scenario, such a
>> >> fault normally burdens the faulting thread with the task of allocating a
>> >> new page, entering the page as part of managed memory, and updating the
>> >> pmap with the new physical address and the change to writeable status.
>> >> This action is followed with an invalidation of the TLB on the current
>> >> CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same
>> >> on other CPUs (there are often many active threads in this process).
>> >> Before this remote TLB invalidation has completed, other CPUs are free
>> >> to act on either the old OR new page characteristics.  If other threads
>> >> are alive and using contents of the faulting page on other CPUs, bad
>> >> things can occur.
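>> >>
>> >> Schematically (this is only illustrative pseudocode, not the exact
>> >> sequence of calls made by vm_fault()), the single-stage update is:
>> >>
>> >>     /* copy complete; switch the mapping to the new page, writeable */
>> >>     pmap_enter(pmap, va, new_m, prot | VM_PROT_WRITE, ...);
>> >>     pmap_invalidate_page(pmap, va); /* local flush + IPI_INVLPG */
>> >>     /*
>> >>      * Window: until every remote CPU acknowledges the IPI, it may
>> >>      * still load or store through a stale TLB entry that points at
>> >>      * the old physical page.
>> >>      */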
>> >>
>> >> In one simplified and somewhat contrived example, one thread attempts to
>> >> write to a location on the faulting page under the protection of a lock
>> >> while another thread attempts to read from the same location twice in
>> >> succession under the protection of the same lock.  If both the writing
>> >> thread and reading thread are running on different CPUs, and if the
>> >> write is directed to the new physical address, the reads may come from
>> >> different physical addresses if a TLB invalidation occurs between them.
>> >> This seemingly violates the guarantees provided by the locking
>> >> primitives and can result in subtle memory corruption symptoms.
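>> >>
>> >> A minimal sketch of the kind of check that trips (hypothetical names,
>> >> not my actual fixture; the real fixture forks first to arm the COW):
>> >>
>> >>     #include <assert.h>
>> >>     #include <pthread.h>
>> >>
>> >>     static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
>> >>     static volatile int cell;   /* lives on the COW-faulting page */
>> >>
>> >>     static void *writer(void *arg)
>> >>     {
>> >>             pthread_mutex_lock(&lk);
>> >>             cell = 42;          /* store goes to the new phys page */
>> >>             pthread_mutex_unlock(&lk);
>> >>             return (NULL);
>> >>     }
>> >>
>> >>     static void *reader(void *arg)
>> >>     {
>> >>             int a, b;
>> >>
>> >>             pthread_mutex_lock(&lk);
>> >>             a = cell;           /* may come from the stale TLB entry */
>> >>             b = cell;           /* ...and this from the new page */
>> >>             pthread_mutex_unlock(&lk);
>> >>             assert(a == b);     /* fires if the shootdown lands
>> >>                                    between the two reads */
>> >>             return (NULL);
>> >>     }
>> >>
>> >>     int main(void)
>> >>     {
>> >>             pthread_t tw, tr;
>> >>
>> >>             pthread_create(&tw, NULL, writer, NULL);
>> >>             pthread_create(&tr, NULL, reader, NULL);
>> >>             pthread_join(tw, NULL);
>> >>             pthread_join(tr, NULL);
>> >>             return (0);
>> >>     }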
>> >>
>> >> It took me quite a while to chase these symptoms from user-space down
>> >> into the operating system, and even longer to end up with a stand-alone
>> >> test fixture able to reproduce the situation described above on demand.
>> >> If I alter the kernel code to perform a two-stage update of the pmap
>> >> entry, the observed corruption symptoms disappear.  This two-stage
>> >> mechanism updates and invalidates the new physical address in a
>> >> read-only state first, and then does a second pmap update and
>> >> invalidation to change the status to writeable.  The intended effect was
>> >> to cause any other threads writing to the faulting page to become
>> >> obstructed until the earlier fault is complete, thus eliminating the
>> >> possibility of the physical pages having different contents until the
>> >> new physical address was fully visible.  This is goofy, and from an
>> >> efficiency standpoint it is obviously undesirable, but it was the first
>> >> thing that came to mind, and it seems to be working fine.
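>> >>
>> >> In the same illustrative pseudocode, the two-stage variant is:
>> >>
>> >>     /* stage 1: enter the new physical page, but read-only */
>> >>     pmap_enter(pmap, va, new_m, prot & ~VM_PROT_WRITE, ...);
>> >>     pmap_invalidate_page(pmap, va); /* every CPU now sees the copy */
>> >>
>> >>     /* stage 2: upgrade the mapping to writeable */
>> >>     pmap_enter(pmap, va, new_m, prot, ...);
>> >>     pmap_invalidate_page(pmap, va);
>> >>     /*
>> >>      * Any CPU that tries to write before stage 2 completes simply
>> >>      * faults and waits, so the old and new pages cannot diverge.
>> >>      */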
>> >>
>> >> I am not terribly familiar with the higher-level design here, so it is
>> >> unclear to me if this problem is simply a very unlikely race condition
>> >> that hasn't yet been diagnosed or if this is instead the breakdown of
>> >> some other mechanism of which I am not aware.  I would appreciate the
>> >> insights of those of you who have more history and experience with this
>> >> area of the code.
>> > You are right in describing the operation of the memory copy on fork.
>> > But I cannot understand which parts of it, exactly, are problematic
>> > from your description. You need to provide the test, along with some
>> > kind of trace or output that illustrates the issue you found.
>> >
>> > Do you mean something like that:
>> > - after fork
>> > - thread A writes into the page, causing page fault and COW because the
>> >    entry has write permissions removed
>> > - thread B reads from the page, and since the invalidation IPI has not
>> >    yet been delivered, it reads from the not-yet-copied page,
>> >    effectively seeing the old content from before thread A's write
>> >
>> > And your claim is that you can create a situation where threads A and B
>> > both own the same lock around the write and the read?  I do not
>> > understand this, since if thread A owns a (usermode) lock, it prevents
>> > thread B from taking the same lock until the fault is fully handled, in
>> > particular until the IPI is delivered.
>> 
>> Here is the sequence of actions I am referring to.  There is only one 
>> lock, and all the writes/reads are on one logical page.
>> 
>> +The process is forked, transitioning a map entry to COW
>> +Thread A writes to a page on the map entry, faults, updates the pmap to 
>> writable at a new phys addr, and starts TLB invalidations...
>> +Thread B acquires a lock, writes to a location on the new phys addr, 
>> and releases the lock
>> +Thread C acquires the lock, reads from the location on the old phys addr...
>> +Thread A ...continues the TLB invalidations which are completed
>> +Thread C ...reads from the location on the new phys addr, and releases 
>> the lock
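>>
>> Annotated per CPU (old = original physical page, new = the copy), the
>> window looks like this:
>>
>>     CPU0 (A): pmap updated to new+writable, shootdown IPIs posted ...
>>     CPU1 (B): lock; store hits new (TLB refilled after the update); unlock
>>     CPU2 (C): lock; load hits old (stale TLB entry); ...
>>     CPU0 (A): ... shootdown completes on all CPUs
>>     CPU2 (C): ... load now hits new; unlock  -> two reads, two values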
>> 
>> In this example Threads B and C [lock, use, and unlock] properly, and
>> neither owns the lock at the same time as the other.  Thread A was
>> writing somewhere else on the page and so never had/needed the lock.
>> Thread B sees a location that is only ever read|modified under a lock
>> change beneath it while it is the lock owner.
> I believe you mean 'Thread C' in the last sentence.
> 
>> 
>> I will get a test patch together and make it available as soon as I can.
> Please.
> 
> So I agree that doing two-stage COW, with the first stage copying page
> but keeping it read-only, is the right solution. Below is my take.
> During the smoke boot, I noted that there is a somewhat related issue in
> the reevaluation of the map entry permissions.
> 
> diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
> index 83e12a588ee..149a15f1d9d 100644
> --- a/sys/vm/vm_fault.c
> +++ b/sys/vm/vm_fault.c
> @@ -1135,6 +1157,10 @@ RetryFault:;
>  				 */
>  				pmap_copy_page(fs.m, fs.first_m);
>  				fs.first_m->valid = VM_PAGE_BITS_ALL;
> +				if ((fault_flags & VM_FAULT_WIRE) == 0) {
> +					prot &= ~VM_PROT_WRITE;
> +					fault_type &= ~VM_PROT_WRITE;
> +				}
>  				if (wired && (fault_flags &
>  				    VM_FAULT_WIRE) == 0) {
>  					vm_page_lock(fs.first_m);
> @@ -1219,6 +1245,12 @@ RetryFault:;
>  			 * write-enabled after all.
>  			 */
>  			prot &= retry_prot;
> +			fault_type &= retry_prot;
> +			if (prot == 0) {
> +				release_page(&fs);
> +				unlock_and_deallocate(&fs);
> +				goto RetryFault;
> +			}
>  		}
>  	}
>  
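
If I'm reading the patch right (my gloss, not part of the diff): after
pmap_copy_page() the first hunk strips VM_PROT_WRITE, so the copied page
is entered read-only; the faulting write is retried, takes a second
fault, and only that second fault upgrades the mapping to writeable, by
which point the read-only entry and its shootdown have already made the
new physical page visible on every CPU.  Roughly:

    /* first fault: copy, then map the new page read-only */
    prot &= ~VM_PROT_WRITE;         /* hunk 1 */
    /* second fault (the retried write): the page is already resident
       in the first object, so it is mapped writeable as usual */

The second hunk looks like the permission-reevaluation issue mentioned
above: if the map entry lost all relevant permissions while the fault
was in progress, the fault is restarted rather than proceeding with
prot == 0.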

I'm seeing really good results with this patch on my Ryzen package build
machine.  I have a set of 1782 ports that I build on a regular basis.  I
just did three consecutive poudriere runs, and other than five ports that
hard-fail due to incompatibility with clang 6, plus one gmake-related
build runaway of lang/rust on one run that I think is a separate issue, I
saw none of the flaky build issues that I've had with this machine.
I've seen no recurrence of the lang/go build problems that I've had
since day 1 on this machine (but not my AMD FX-8320E machine), and also
no sign of guile-related build failures that I've seen both on this
machine and the FX-8320E.  I also didn't see the net/samba build hangs
that seem to be caused by a heavily threaded python process getting
deadlocked.



