From owner-freebsd-hackers@freebsd.org Sat Feb 10 22:56:23 2018 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 34A67F042C7 for ; Sat, 10 Feb 2018 22:56:23 +0000 (UTC) (envelope-from kib@freebsd.org) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B052F7742C; Sat, 10 Feb 2018 22:56:22 +0000 (UTC) (envelope-from kib@freebsd.org) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id w1AMu8fa087483 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Sun, 11 Feb 2018 00:56:11 +0200 (EET) (envelope-from kib@freebsd.org) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua w1AMu8fa087483 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id w1AMu83p087482; Sun, 11 Feb 2018 00:56:08 +0200 (EET) (envelope-from kib@freebsd.org) X-Authentication-Warning: tom.home: kostik set sender to kib@freebsd.org using -f Date: Sun, 11 Feb 2018 00:56:08 +0200 From: Konstantin Belousov To: Elliott.Rabe@dell.com Cc: alc@freebsd.org, markj@freebsd.org, freebsd-hackers@freebsd.org, Eric.Van.Gyzen@dell.com Subject: Re: Stale memory during post fork cow pmap update Message-ID: <20180210225608.GM33564@kib.kiev.ua> References: <5A7E7F2B.80900@dell.com> <20180210111848.GL33564@kib.kiev.ua> <5A7F6A7C.80607@dell.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5A7F6A7C.80607@dell.com> User-Agent: Mutt/1.9.3 (2018-01-21) X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 10 Feb 2018 22:56:23 -0000 On Sat, Feb 10, 2018 at 09:56:20PM +0000, Elliott.Rabe@dell.com wrote: > On 02/10/2018 05:18 AM, Konstantin Belousov wrote: > > On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote: > >> Greetings- > >> > >> I've been hunting for the root cause of elusive, slight memory > >> corruptions in a large, complex process that manages many threads. All > >> failures and experimentation thus far has been on x86_64 architecture > >> machines, and pmap_pcid is not in use. > >> > >> I believe I have stumbled into a very unlikely race condition in the way > >> the vm code updates the pmap during write fault processing following a > >> fork of the process. In this situation, when the process is forked, > >> appropriate vm entries are marked copy-on-write. One such entry > >> allocated by static process initialization is frequently used by many > >> threads in the process. This makes it a prime candidate to write-fault > >> shortly after a fork system call is made. In this scenario, such a > >> fault normally burdens the faulting thread with the task of allocating a > >> new page, entering the page as part of managed memory, and updating the > >> pmap with the new physical address and the change to writeable status. > >> This action is followed with an invalidation of the TLB on the current > >> CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same > >> on other CPUs (there are often many active threads in this process). > >> Before this remote TLB invalidation has completed, other CPUs are free > >> to act on either the old OR new page characteristics. If other threads > >> are alive and using contents of the faulting page on other CPUs, bad > >> things can occur. > >> > >> In one simplified and somewhat contrived example, one thread attempts to > >> write to a location on the faulting page under the protection of a lock > >> while another thread attempts to read from the same location twice in > >> succession under the protection of the same lock. If both the writing > >> thread and reading thread are running on different CPUs, and if the > >> write is directed to the new physical address, the reads may come from > >> different physical addresses if a TLB invalidation occurs between them. > >> This seemingly violates the guarantees provided by the locking > >> primitives and can result in subtle memory corruption symptoms. > >> > >> It took me quite a while to chase these symptoms from user-space down > >> into the operating system, and even longer to end up with a stand-alone > >> test fixture able to reproduce the situation described above on demand. > >> If I alter the kernel code to perform a two-stage update of the pmap > >> entry, the observed corruption symptoms disappear. This two-stage > >> mechanism updates and invalidates the new physical address in a > >> read-only state first, and then does a second pmap update and > >> invalidation to change the status to writeable. The intended effect was > >> to cause any other threads writing to the faulting page to become > >> obstructed until the earlier fault is complete, thus eliminating the > >> possibility of the physical pages having different contents until the > >> new physical address was fully visible. This is goofy, and from an > >> efficiency standpoint it is obviously undesirable, but it was the first > >> thing that came to mind, and it seems to be working fine. > >> > >> I am not terribly familliar with the higher level design here, so it is > >> unclear to me if this problem is simply a very unlikely race condition > >> that hasn't yet been diagnosed or if this is instead the breakdown of > >> some other mechanism of which I am not aware. I would appreciate the > >> insights of those of you who have more history and experience with this > >> area of the code. > > You are right describing the operation of the memory copy on fork. But I > > cannot understand what parts of it, exactly, are problematic, from your > > description. It is necessary for you to provide the test and provide > > some kind of the test trace or the output which illustrates the issue > > you found. > > > > Do you mean something like that: > > - after fork > > - thread A writes into the page, causing page fault and COW because the > > entry has write permissions removed > > - thread B reads from the page, and since invalidation IPI was not yet > > delivered, it reads from the need-copy page, effectively seeing the > > old content, before thread A write > > > > And you claim is that you can create a situation where both threads A > > and B owns the same lock around the write and read ? I do not understand > > this, since if thread A owns a (usermode) lock, it prevents thread B from > > taking the same lock until fault is fully handled, in particular, the IPI > > is delivered. > > Here is the sequence of actions I am referring to. There is only one > lock, and all the writes/reads are on one logical page. > > +The process is forked transitioning a map entry to COW > +Thread A writes to a page on the map entry, faults, updates the pmap to > writable at a new phys addr, and starts TLB invalidations... > +Thread B acquires a lock, writes to a location on the new phys addr, > and releases the lock > +Thread C acquires the lock, reads from the location on the old phys addr... > +Thread A ...continues the TLB invalidations which are completed > +Thread C ...reads from the location on the new phys addr, and releases > the lock > > In this example Thread B and C [lock, use and unlock] properly and > neither own the lock at the same time. Thread A was writing somewhere > else on the page and so never had/needed the lock. Thread B sees a > location that is only ever read|modified under a lock change beneath it > while it is the lock owner. I believe you mean 'Thread C' in the last sentence. > > I will get a test patch together and make it available as soon as I can. Please. So I agree that doing two-stage COW, with the first stage copying page but keeping it read-only, is the right solution. Below is my take. During the smoke boot, I noted that there is somewhat related issue in reevaluation of the map entry permissions. diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c index 83e12a588ee..149a15f1d9d 100644 --- a/sys/vm/vm_fault.c +++ b/sys/vm/vm_fault.c @@ -1135,6 +1157,10 @@ RetryFault:; */ pmap_copy_page(fs.m, fs.first_m); fs.first_m->valid = VM_PAGE_BITS_ALL; + if ((fault_flags & VM_FAULT_WIRE) == 0) { + prot &= ~VM_PROT_WRITE; + fault_type &= ~VM_PROT_WRITE; + } if (wired && (fault_flags & VM_FAULT_WIRE) == 0) { vm_page_lock(fs.first_m); @@ -1219,6 +1245,12 @@ RetryFault:; * write-enabled after all. */ prot &= retry_prot; + fault_type &= retry_prot; + if (prot == 0) { + release_page(&fs); + unlock_and_deallocate(&fs); + goto RetryFault; + } } }