Date: Sat, 10 Feb 2018 21:56:20 +0000 From: <Elliott.Rabe@dell.com> To: <kib@freebsd.org> Cc: <freebsd-hackers@freebsd.org>, <Eric.Van.Gyzen@dell.com> Subject: Re: Stale memory during post fork cow pmap update Message-ID: <5A7F6A7C.80607@dell.com> In-Reply-To: <20180210111848.GL33564@kib.kiev.ua> References: <5A7E7F2B.80900@dell.com> <20180210111848.GL33564@kib.kiev.ua>
index | next in thread | previous in thread | raw e-mail
On 02/10/2018 05:18 AM, Konstantin Belousov wrote: > On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote: >> Greetings- >> >> I've been hunting for the root cause of elusive, slight memory >> corruptions in a large, complex process that manages many threads. All >> failures and experimentation thus far has been on x86_64 architecture >> machines, and pmap_pcid is not in use. >> >> I believe I have stumbled into a very unlikely race condition in the way >> the vm code updates the pmap during write fault processing following a >> fork of the process. In this situation, when the process is forked, >> appropriate vm entries are marked copy-on-write. One such entry >> allocated by static process initialization is frequently used by many >> threads in the process. This makes it a prime candidate to write-fault >> shortly after a fork system call is made. In this scenario, such a >> fault normally burdens the faulting thread with the task of allocating a >> new page, entering the page as part of managed memory, and updating the >> pmap with the new physical address and the change to writeable status. >> This action is followed with an invalidation of the TLB on the current >> CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same >> on other CPUs (there are often many active threads in this process). >> Before this remote TLB invalidation has completed, other CPUs are free >> to act on either the old OR new page characteristics. If other threads >> are alive and using contents of the faulting page on other CPUs, bad >> things can occur. >> >> In one simplified and somewhat contrived example, one thread attempts to >> write to a location on the faulting page under the protection of a lock >> while another thread attempts to read from the same location twice in >> succession under the protection of the same lock. If both the writing >> thread and reading thread are running on different CPUs, and if the >> write is directed to the new physical address, the reads may come from >> different physical addresses if a TLB invalidation occurs between them. >> This seemingly violates the guarantees provided by the locking >> primitives and can result in subtle memory corruption symptoms. >> >> It took me quite a while to chase these symptoms from user-space down >> into the operating system, and even longer to end up with a stand-alone >> test fixture able to reproduce the situation described above on demand. >> If I alter the kernel code to perform a two-stage update of the pmap >> entry, the observed corruption symptoms disappear. This two-stage >> mechanism updates and invalidates the new physical address in a >> read-only state first, and then does a second pmap update and >> invalidation to change the status to writeable. The intended effect was >> to cause any other threads writing to the faulting page to become >> obstructed until the earlier fault is complete, thus eliminating the >> possibility of the physical pages having different contents until the >> new physical address was fully visible. This is goofy, and from an >> efficiency standpoint it is obviously undesirable, but it was the first >> thing that came to mind, and it seems to be working fine. >> >> I am not terribly familliar with the higher level design here, so it is >> unclear to me if this problem is simply a very unlikely race condition >> that hasn't yet been diagnosed or if this is instead the breakdown of >> some other mechanism of which I am not aware. I would appreciate the >> insights of those of you who have more history and experience with this >> area of the code. > You are right describing the operation of the memory copy on fork. But I > cannot understand what parts of it, exactly, are problematic, from your > description. It is necessary for you to provide the test and provide > some kind of the test trace or the output which illustrates the issue > you found. > > Do you mean something like that: > - after fork > - thread A writes into the page, causing page fault and COW because the > entry has write permissions removed > - thread B reads from the page, and since invalidation IPI was not yet > delivered, it reads from the need-copy page, effectively seeing the > old content, before thread A write > > And you claim is that you can create a situation where both threads A > and B owns the same lock around the write and read ? I do not understand > this, since if thread A owns a (usermode) lock, it prevents thread B from > taking the same lock until fault is fully handled, in particular, the IPI > is delivered. Here is the sequence of actions I am referring to. There is only one lock, and all the writes/reads are on one logical page. +The process is forked transitioning a map entry to COW +Thread A writes to a page on the map entry, faults, updates the pmap to writable at a new phys addr, and starts TLB invalidations... +Thread B acquires a lock, writes to a location on the new phys addr, and releases the lock +Thread C acquires the lock, reads from the location on the old phys addr... +Thread A ...continues the TLB invalidations which are completed +Thread C ...reads from the location on the new phys addr, and releases the lock In this example Thread B and C [lock, use and unlock] properly and neither own the lock at the same time. Thread A was writing somewhere else on the page and so never had/needed the lock. Thread B sees a location that is only ever read|modified under a lock change beneath it while it is the lock owner. I will get a test patch together and make it available as soon as I can.help
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5A7F6A7C.80607>
