Date: Mon, 12 Feb 2018 13:40:40 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: Konstantin Belousov <kib@freebsd.org>
Cc: Elliott.Rabe@dell.com, alc@freebsd.org, freebsd-hackers@freebsd.org, markj@freebsd.org, Eric.Van.Gyzen@dell.com
Subject: Re: Stale memory during post fork cow pmap update
Message-ID: <tkrat.949e51f1241d331f@FreeBSD.org>
In-Reply-To: <20180210225608.GM33564@kib.kiev.ua>
References: <5A7E7F2B.80900@dell.com> <20180210111848.GL33564@kib.kiev.ua> <5A7F6A7C.80607@dell.com> <20180210225608.GM33564@kib.kiev.ua>
On 11 Feb, Konstantin Belousov wrote:
> On Sat, Feb 10, 2018 at 09:56:20PM +0000, Elliott.Rabe@dell.com wrote:
>> On 02/10/2018 05:18 AM, Konstantin Belousov wrote:
>> > On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote:
>> >> Greetings-
>> >>
>> >> I've been hunting for the root cause of elusive, slight memory
>> >> corruptions in a large, complex process that manages many threads.  All
>> >> failures and experimentation thus far have been on x86_64 architecture
>> >> machines, and pmap_pcid is not in use.
>> >>
>> >> I believe I have stumbled into a very unlikely race condition in the way
>> >> the vm code updates the pmap during write fault processing following a
>> >> fork of the process.  In this situation, when the process is forked,
>> >> appropriate vm entries are marked copy-on-write.  One such entry
>> >> allocated by static process initialization is frequently used by many
>> >> threads in the process.  This makes it a prime candidate to write-fault
>> >> shortly after a fork system call is made.  In this scenario, such a
>> >> fault normally burdens the faulting thread with the task of allocating a
>> >> new page, entering the page as part of managed memory, and updating the
>> >> pmap with the new physical address and the change to writeable status.
>> >> This action is followed by an invalidation of the TLB on the current
>> >> CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same
>> >> on other CPUs (there are often many active threads in this process).
>> >> Before this remote TLB invalidation has completed, other CPUs are free
>> >> to act on either the old OR new page characteristics.  If other threads
>> >> are alive and using contents of the faulting page on other CPUs, bad
>> >> things can occur.
>> >>
>> >> In one simplified and somewhat contrived example, one thread attempts to
>> >> write to a location on the faulting page under the protection of a lock
>> >> while another thread attempts to read from the same location twice in
>> >> succession under the protection of the same lock.  If the writing
>> >> thread and the reading thread are running on different CPUs, and if the
>> >> write is directed to the new physical address, the reads may come from
>> >> different physical addresses if a TLB invalidation occurs between them.
>> >> This seemingly violates the guarantees provided by the locking
>> >> primitives and can result in subtle memory corruption symptoms.
>> >>
>> >> It took me quite a while to chase these symptoms from user space down
>> >> into the operating system, and even longer to end up with a stand-alone
>> >> test fixture able to reproduce the situation described above on demand.
>> >> If I alter the kernel code to perform a two-stage update of the pmap
>> >> entry, the observed corruption symptoms disappear.  This two-stage
>> >> mechanism updates and invalidates the new physical address in a
>> >> read-only state first, and then does a second pmap update and
>> >> invalidation to change the status to writeable.  The intended effect was
>> >> to cause any other threads writing to the faulting page to become
>> >> obstructed until the earlier fault is complete, thus eliminating the
>> >> possibility of the physical pages having different contents until the
>> >> new physical address was fully visible.  This is goofy, and from an
>> >> efficiency standpoint it is obviously undesirable, but it was the first
>> >> thing that came to mind, and it seems to be working fine.
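A rough sketch of that two-stage ordering, expressed against the pmap_enter(9)
interface, is shown below.  It is only an illustration of the idea, not the
change that was actually tested; the names (fs, vaddr, prot, fault_type,
wired) are placeholders borrowed loosely from vm_fault.c, and it assumes, as
on amd64, that pmap_enter() completes the TLB shootdown for a mapping it
replaces before returning.

    /*
     * Illustrative sketch only; placeholder names, not the tested change.
     *
     * Stage 1: enter the freshly copied page read-only and let the pmap
     * layer invalidate the old mapping.  Until the second call below, no
     * CPU can modify the copy, so a CPU that still reads the old physical
     * page through a stale TLB entry cannot observe diverging contents.
     */
    pmap_enter(fs.map->pmap, vaddr, fs.m, prot & ~VM_PROT_WRITE,
        (fault_type & ~VM_PROT_WRITE) | (wired ? PMAP_ENTER_WIRED : 0),
        0 /* psind */);

    /*
     * Stage 2: now that the new physical address is visible everywhere,
     * repeat the update with write permission restored.  Writers that
     * faulted in the meantime simply retry against the final mapping.
     */
    pmap_enter(fs.map->pmap, vaddr, fs.m, prot,
        fault_type | (wired ? PMAP_ENTER_WIRED : 0), 0 /* psind */);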
>> >>
>> >> I am not terribly familiar with the higher level design here, so it is
>> >> unclear to me if this problem is simply a very unlikely race condition
>> >> that hasn't yet been diagnosed or if this is instead the breakdown of
>> >> some other mechanism of which I am not aware.  I would appreciate the
>> >> insights of those of you who have more history and experience with this
>> >> area of the code.
>> > You are right describing the operation of the memory copy on fork.  But I
>> > cannot understand what parts of it, exactly, are problematic, from your
>> > description.  It is necessary for you to provide the test and provide
>> > some kind of test trace or output which illustrates the issue you found.
>> >
>> > Do you mean something like this:
>> > - after fork
>> > - thread A writes into the page, causing a page fault and COW because the
>> >   entry has write permissions removed
>> > - thread B reads from the page, and since the invalidation IPI was not yet
>> >   delivered, it reads from the need-copy page, effectively seeing the
>> >   old content, before thread A's write
>> >
>> > And your claim is that you can create a situation where both threads A
>> > and B own the same lock around the write and read?  I do not understand
>> > this, since if thread A owns a (usermode) lock, it prevents thread B from
>> > taking the same lock until the fault is fully handled, in particular,
>> > until the IPI is delivered.
>>
>> Here is the sequence of actions I am referring to.  There is only one
>> lock, and all the writes/reads are on one logical page.
>>
>> + The process is forked, transitioning a map entry to COW
>> + Thread A writes to a page on the map entry, faults, updates the pmap to
>>   writable at a new phys addr, and starts TLB invalidations...
>> + Thread B acquires a lock, writes to a location on the new phys addr,
>>   and releases the lock
>> + Thread C acquires the lock, reads from the location on the old phys addr...
>> + Thread A ...continues the TLB invalidations, which are completed
>> + Thread C ...reads from the location on the new phys addr, and releases
>>   the lock
>>
>> In this example Threads B and C [lock, use and unlock] properly and
>> neither owns the lock at the same time.  Thread A was writing somewhere
>> else on the page and so never had/needed the lock.  Thread B sees a
>> location that is only ever read|modified under a lock change beneath it
>> while it is the lock owner.
> I believe you mean 'Thread C' in the last sentence.
>
>> I will get a test patch together and make it available as soon as I can.
> Please.
>
> So I agree that doing two-stage COW, with the first stage copying the page
> but keeping it read-only, is the right solution.  Below is my take.
> During the smoke boot, I noted that there is a somewhat related issue in
> the reevaluation of the map entry permissions.
>
> diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
> index 83e12a588ee..149a15f1d9d 100644
> --- a/sys/vm/vm_fault.c
> +++ b/sys/vm/vm_fault.c
> @@ -1135,6 +1157,10 @@ RetryFault:;
>  			 */
>  			pmap_copy_page(fs.m, fs.first_m);
>  			fs.first_m->valid = VM_PAGE_BITS_ALL;
> +			if ((fault_flags & VM_FAULT_WIRE) == 0) {
> +				prot &= ~VM_PROT_WRITE;
> +				fault_type &= ~VM_PROT_WRITE;
> +			}
>  			if (wired && (fault_flags &
>  			    VM_FAULT_WIRE) == 0) {
>  				vm_page_lock(fs.first_m);
> @@ -1219,6 +1245,12 @@ RetryFault:;
>  			 * write-enabled after all.
>  			 */
>  			prot &= retry_prot;
> +			fault_type &= retry_prot;
> +			if (prot == 0) {
> +				release_page(&fs);
> +				unlock_and_deallocate(&fs);
> +				goto RetryFault;
> +			}
>  		}
>  	}
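For reference, below is a minimal sketch of the sort of stand-alone
user-space fixture described earlier in the thread.  The real fixture was
not posted, so the page layout, thread roles and iteration count here are
assumptions; it simply mirrors the Thread A/B/C sequence above, and on an
unpatched kernel an inconsistent pair of reads is only expected to show up
when a fork lands in exactly the wrong window.

    #include <sys/types.h>
    #include <sys/wait.h>

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* One page shared by all three threads, with two distinct locations. */
    static char page_buf[4096] __attribute__((aligned(4096)));
    #define SLOT_A (*(volatile long *)(page_buf + 0))   /* written with no lock held */
    #define SLOT_B (*(volatile long *)(page_buf + 512)) /* accessed only under the lock */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int done;

    /* Thread A: keeps dirtying the page without the lock; right after a
     * fork this write is what triggers the COW fault and TLB shootdown. */
    static void *
    thread_a(void *arg)
    {
    	(void)arg;
    	while (!done)
    		SLOT_A++;
    	return (NULL);
    }

    /* Thread B: writes a location on the same page under the lock. */
    static void *
    thread_b(void *arg)
    {
    	(void)arg;
    	while (!done) {
    		pthread_mutex_lock(&lock);
    		SLOT_B++;
    		pthread_mutex_unlock(&lock);
    	}
    	return (NULL);
    }

    /* Thread C: reads that location twice inside one critical section; the
     * two values must never differ if the lock provides what it promises. */
    static void *
    thread_c(void *arg)
    {
    	long r1, r2;

    	(void)arg;
    	while (!done) {
    		pthread_mutex_lock(&lock);
    		r1 = SLOT_B;
    		r2 = SLOT_B;
    		pthread_mutex_unlock(&lock);
    		if (r1 != r2) {
    			printf("inconsistent reads under lock: %ld != %ld\n",
    			    r1, r2);
    			abort();
    		}
    	}
    	return (NULL);
    }

    int
    main(void)
    {
    	pthread_t a, b, c;
    	int i;

    	pthread_create(&a, NULL, thread_a, NULL);
    	pthread_create(&b, NULL, thread_b, NULL);
    	pthread_create(&c, NULL, thread_c, NULL);

    	/* Each fork re-arms COW on the page; repeat to widen the window. */
    	for (i = 0; i < 100000; i++) {
    		pid_t pid = fork();
    		if (pid == 0)
    			_exit(0);
    		if (pid > 0)
    			waitpid(pid, NULL, 0);
    	}

    	done = 1;
    	pthread_join(a, NULL);
    	pthread_join(b, NULL);
    	pthread_join(c, NULL);
    	return (0);
    }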
I'm seeing really good results with this patch on my Ryzen package build
machine.  I have a set of 1782 ports that I build on a regular basis, and I
just did three consecutive poudriere runs.  Other than five ports that hard
fail due to incompatibility with clang 6, and one gmake-related build
runaway of lang/rust on one run that I think is another issue, I saw none of
the flaky build issues that I've had with this machine.  I've seen no
recurrence of the lang/go build problems that I've had since day 1 on this
machine (but not on my AMD FX-8320E machine), and also no sign of the
guile-related build failures that I've seen both on this machine and on the
FX-8320E.  I also didn't see the net/samba build hangs that seem to be
caused by a heavily threaded python process getting deadlocked.