Date: Tue, 13 Feb 2018 09:10:21 +0000
From: <Elliott.Rabe@dell.com>
To: <kib@freebsd.org>
Cc: <freebsd-hackers@freebsd.org>, <Eric.Van.Gyzen@dell.com>, <alc@FreeBSD.org>, <markj@FreeBSD.org>, <truckman@FreeBSD.org>
Subject: Re: Stale memory during post fork cow pmap update
Message-ID: <5A82AB7C.6090404@dell.com>
In-Reply-To: <20180210225608.GM33564@kib.kiev.ua>
References: <5A7E7F2B.80900@dell.com> <20180210111848.GL33564@kib.kiev.ua> <5A7F6A7C.80607@dell.com> <20180210225608.GM33564@kib.kiev.ua>
[-- Attachment #1 --]
On 02/10/2018 04:56 PM, Konstantin Belousov wrote:
> On Sat, Feb 10, 2018 at 09:56:20PM +0000, Elliott.Rabe@dell.com wrote:
>> On 02/10/2018 05:18 AM, Konstantin Belousov wrote:
>>> On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe@dell.com wrote:
>>>> ...
>>>> I've been hunting for the root cause of elusive, slight memory
>>>> corruptions in a large, complex process that manages many threads. All
>>>> failures and experimentation thus far has been on x86_64 architecture
>>>> machines, and pmap_pcid is not in use.
>>>> ...
>>> It is necessary for you to provide the test and provide
>>> some kind of the test trace or the output which illustrates the issue
>>> you found.
>> Here is the sequence of actions I am referring to. There is only one
>> lock, and all the writes/reads are on one logical page.
>>
>> +The process is forked transitioning a map entry to COW
>> +Thread A writes to a page on the map entry, faults, updates the pmap to
>> writable at a new phys addr, and starts TLB invalidations...
>> +Thread B acquires a lock, writes to a location on the new phys addr,
>> and releases the lock
>> +Thread C acquires the lock, reads from the location on the old phys addr...
>> +Thread A ...continues the TLB invalidations which are completed
>> +Thread C ...reads from the location on the new phys addr, and releases
>> the lock
>>
>> In this example Thread B and C [lock, use and unlock] properly and
>> neither own the lock at the same time. Thread A was writing somewhere
>> else on the page and so never had/needed the lock. Thread B sees a
>> location that is only ever read|modified under a lock change beneath it
>> while it is the lock owner.
> I believe you mean 'Thread C' in the last sentence.
You are correct, I did mean Thread C.
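For anyone following along, the sequence above can be replayed with a
toy model in C. This is only a sketch and not kernel code: the
toy_machine structure, the single cached translation per CPU, and every
toy_* name are assumptions made up for illustration. It models a store
as always ending up on the frame the page table currently maps (a stale
read-only translation cannot satisfy a write), while a load may still
use a stale translation until that CPU is invalidated.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: two physical frames and one cached translation per CPU. */
enum { TOY_FRAME_OLD = 0, TOY_FRAME_NEW = 1, TOY_NCPU = 3 };
enum { TOY_CPU_A = 0, TOY_CPU_B = 1, TOY_CPU_C = 2 };

struct toy_machine {
	uint64_t frames[2];	/* contents of the old and new pages */
	int pte;		/* frame the page table maps right now */
	int tlb[TOY_NCPU];	/* frame each CPU's TLB still points at */
};

/* A load may be satisfied by a stale cached translation. */
static uint64_t toy_read(const struct toy_machine *m, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NCPU);
	return (m->frames[m->tlb[cpu]]);
}

/* A store cannot use a stale read-only translation, so in this model
 * it always lands on the frame the page table maps right now. */
static void toy_write(struct toy_machine *m, int cpu, uint64_t v)
{
	assert(cpu >= 0 && cpu < TOY_NCPU);
	m->tlb[cpu] = m->pte;
	m->frames[m->tlb[cpu]] = v;
}

/* Completion of the TLB invalidation on one CPU. */
static void toy_invalidate(struct toy_machine *m, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NCPU);
	m->tlb[cpu] = m->pte;
}

/* Replays the sequence; returns 1 if thread C sees the value change
 * between two reads made while it holds the lock. */
int toy_replay_sequence(void)
{
	struct toy_machine m = {
		.frames = { 0, 0 },
		.pte = TOY_FRAME_OLD,
		.tlb = { TOY_FRAME_OLD, TOY_FRAME_OLD, TOY_FRAME_OLD },
	};
	uint64_t first, second;

	/* Thread A: the COW fault copies the page and installs the new
	 * writable mapping in a single step, then starts invalidating. */
	m.frames[TOY_FRAME_NEW] = m.frames[TOY_FRAME_OLD];
	m.pte = TOY_FRAME_NEW;
	toy_invalidate(&m, TOY_CPU_A);

	/* Thread B: lock, write, unlock. The write hits the new frame. */
	toy_write(&m, TOY_CPU_B, 1);

	/* Thread C: lock, then read through the still-stale translation. */
	first = toy_read(&m, TOY_CPU_C);

	/* Thread A: the invalidation reaches thread C's CPU. */
	toy_invalidate(&m, TOY_CPU_C);

	/* Thread C: read again, now from the new frame, then unlock. */
	second = toy_read(&m, TOY_CPU_C);

	return (first != second);
}
```

In this model toy_replay_sequence() reports a mismatch even though
threads B and C never hold the lock at the same time, which is the
symptom described above.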
>> I will get a test patch together and make it available as soon as I can.
> Please.
Sorry for my delayed response; I had been working off a separate project
based on releng/11.1 and it took me longer than I expected to get a dev
rig set up off of master on which I could re-evaluate the situation.
I am attaching my test apparatus, however, calling it a test is probably
a disservice to tests everywhere. I consider this entire fixture
disposable, so I didn't get carried away trying to properly
style/partition/locate the code. I never wanted anything this
complicated either; it pretty much just evolved into a development aid
to spelunk around in the fault/pmap handling. My attempts thus far at
reducing the fixture to be user-space only have not been successful.
Additionally, I have noticed that the fixture is /very/ sensitive to any
changes in timing; several of the debugging entries even seem key to
hitting the problem. I didn't have much luck getting the problem to
manifest on a virtual machine guest w/ a VirtualBox host either. For
all of these reasons, I don't think there is value here in trying to use
this as any sort of regression fixture, unless perhaps someone is
willing to try to turn it into something less ridiculous. Despite all
shortcomings, on my hardware anyways, it is able to reproduce the
example I described pretty much immediately when I use it with the
debugging knob "-v". Instructions and expectations are at the top of the
main test fixture source file.
I am also attaching a patch that I have been using to prevent the
problem. I was looking at things with a much narrower view and made the
changes directly in pmap_enter. I suspect the internal
double-update-invalidate performs slightly better than taking two
whole faults, but I haven't benchmarked it. It probably doesn't
matter much compared to the cost and frequency of the actual copies, and
it also has the disadvantage of being architecture specific. I also
don't feel like I have enough experience with the vm fault code in
general for my commentary to be very valuable here. However, I do
wonder: 1) if there are any other scenarios where a potentially
accessible page might be undergoing an [address+writable] change in the
same way (this sort of thing seems hard to read out of code), and 2) if
there is ever any legal reason why an accessible page should be
undergoing such a change? If not, perhaps we could come up with an
appropriate sanity-check condition to guard against any cases of this
sort of thing accidentally slipping in later.
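To make the sanity-check idea a little more concrete, here is a rough
sketch of the kind of predicate I have in mind. It is an illustration
only: the TOY_PG_* constants are stand-ins that I am assuming mirror
the amd64 valid, writable and frame bits, and
toy_pte_update_is_suspect() is a made-up name, not something in
pmap.c. Where such a check would live, and whether any legitimate
caller would trip it, are exactly the open questions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in PTE bits (assumed layout, for illustration only). */
#define TOY_PG_V	0x001ULL		/* mapping is valid */
#define TOY_PG_RW	0x002ULL		/* mapping is writable */
#define TOY_PG_FRAME	0x000ffffffffff000ULL	/* physical frame bits */

/*
 * Flags a single-step update that replaces a valid (hence possibly
 * cached and accessible) mapping with a writable mapping of a
 * different physical frame.
 */
bool toy_pte_update_is_suspect(uint64_t origpte, uint64_t newpte)
{
	bool was_valid = (origpte & TOY_PG_V) != 0;
	bool frame_changes =
	    (origpte & TOY_PG_FRAME) != (newpte & TOY_PG_FRAME);
	bool new_writable = (newpte & TOY_PG_V) != 0 &&
	    (newpte & TOY_PG_RW) != 0;

	return (was_valid && frame_changes && new_writable);
}

/* A few example updates and how the predicate classifies them. */
void toy_selfcheck(void)
{
	/* Read-only frame 0x1000 replaced by writable frame 0x2000. */
	assert(toy_pte_update_is_suspect(0x1000ULL | TOY_PG_V,
	    0x2000ULL | TOY_PG_V | TOY_PG_RW));

	/* First-time mapping: nothing could have been cached before. */
	assert(!toy_pte_update_is_suspect(0,
	    0x2000ULL | TOY_PG_V | TOY_PG_RW));

	/* Same frame upgraded from read-only to writable. */
	assert(!toy_pte_update_is_suspect(0x1000ULL | TOY_PG_V,
	    0x1000ULL | TOY_PG_V | TOY_PG_RW));
}
```

Only the first case in toy_selfcheck() is flagged; a fresh mapping or
a permission upgrade on the same frame passes.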
The attached git patches should apply and build cleanly on master commit
fe0ee5c. I have verified at least these three scenarios in my environment:
1) the fixture alone reproduces the problem.
2) the fixture with my patch does not reproduce the problem.
3) the fixture with your patch does not reproduce the problem.
Thanks!
[-- Attachment #2 --]
From 3090b8232f6f421c0c6de2102b18cfac5700b51a Mon Sep 17 00:00:00 2001
From: Elliott Rabe <elliott.rabe@dell.com>
Date: Sun, 11 Feb 2018 17:19:26 -0600
Subject: [PATCH 1/3] DISPOSABLE: A test fixture that can repro a pmap
update-invalidate race condition
A high-level description of the fixture is available in forking_stale.c
---
stand/libsa/printf.c | 16 +
stand/libsa/stand.h | 1 +
sys/amd64/amd64/pmap.c | 15 +
sys/amd64/conf/GENERIC | 19 +-
sys/vm/forking_stale.c | 1054 ++++++++++++++++++++++++++++++++++++++++++++++++
sys/vm/forking_stale.h | 296 ++++++++++++++
sys/vm/vm_fault.c | 16 +
sys/vm/vm_page.c | 459 +++++++++++++++++++++
sys/vm/vm_page.h | 5 +
9 files changed, 1872 insertions(+), 9 deletions(-)
create mode 100755 sys/vm/forking_stale.c
create mode 100755 sys/vm/forking_stale.h
diff --git a/stand/libsa/printf.c b/stand/libsa/printf.c
index d0c409d..c77e941 100644
--- a/stand/libsa/printf.c
+++ b/stand/libsa/printf.c
@@ -149,6 +149,22 @@ vsprintf(char *buf, const char *cfmt, va_list ap)
buf[retval] = '\0';
}
+int
+vsnprintf(char *buf, size_t size, const char *cfmt, va_list ap)
+{
+ int retval;
+ struct print_buf arg;
+
+ arg.buf = buf;
+ arg.size = size;
+
+ retval = kvprintf(cfmt, &snprint_func, &arg, 10, ap);
+
+ if (arg.size >= 1)
+ *(arg.buf)++ = 0;
+ return retval;
+}
+
/*
* Put a NUL-terminated ASCII number (base <= 36) in a buffer in reverse
* order; return an optional length and a pointer to the last character
diff --git a/stand/libsa/stand.h b/stand/libsa/stand.h
index f6a612b..8d50efe 100644
--- a/stand/libsa/stand.h
+++ b/stand/libsa/stand.h
@@ -274,6 +274,7 @@ extern void vprintf(const char *fmt, __va_list);
extern int sprintf(char *buf, const char *cfmt, ...) __printflike(2, 3);
extern int snprintf(char *buf, size_t size, const char *cfmt, ...) __printflike(3, 4);
extern void vsprintf(char *buf, const char *cfmt, __va_list);
+extern int vsnprintf(char *buf, size_t size, const char *cfmt, __va_list);
extern void twiddle(u_int callerdiv);
extern void twiddle_divisor(u_int globaldiv);
diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index b9889e3..7bb9c1b 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -4628,6 +4628,8 @@ int
pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
u_int flags, int8_t psind)
{
+ fstf_conditional_fault_debug(__FILE__, __LINE__, pmap, va, prot);
+
struct rwlock *lock;
pd_entry_t *pde;
pt_entry_t *pte, PG_G, PG_A, PG_M, PG_RW, PG_V;
@@ -4792,9 +4794,20 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
/*
* Update the PTE.
*/
+ fstf_conditional_point_debug(__FILE__, __LINE__, pmap, va);
+
if ((origpte & PG_V) != 0) {
validate:
+ fstf_conditional_pmapmod_advance(__FILE__,
+ __LINE__,
+ va,
+ PHYS_TO_VM_PAGE(newpte & PG_FRAME));
+
origpte = pte_load_store(pte, newpte);
+
+ fstf_conditional_pte_debug(__FILE__, __LINE__, pmap, va, origpte);
+ fstf_conditional_pte_debug(__FILE__, __LINE__, pmap, va, newpte);
+
opa = origpte & PG_FRAME;
if (opa != pa) {
if ((origpte & PG_MANAGED) != 0) {
@@ -4833,6 +4846,8 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
} else
pte_store(pte, newpte);
+ fstf_conditional_point_debug(__FILE__, __LINE__, pmap, va);
+
unchanged:
#if VM_NRESERVLEVEL > 0
diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
index 1af7b7b..4c45290 100644
--- a/sys/amd64/conf/GENERIC
+++ b/sys/amd64/conf/GENERIC
@@ -86,16 +86,16 @@ options RCTL # Resource limits
options KDB # Enable kernel debugger support.
options KDB_TRACE # Print a stack trace for a panic.
# For full debugger support use (turn off in stable branch):
-options BUF_TRACKING # Track buffer history
-options DDB # Support DDB.
-options FULL_BUF_TRACKING # Track more buffer history
+#options BUF_TRACKING # Track buffer history
+#options DDB # Support DDB.
+#options FULL_BUF_TRACKING # Track more buffer history
options GDB # Support remote GDB.
-options DEADLKRES # Enable the deadlock resolver
-options INVARIANTS # Enable calls of extra sanity checking
-options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
-options WITNESS # Enable checks to detect deadlocks and cycles
-options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed
-options MALLOC_DEBUG_MAXZONES=8 # Separate malloc(9) zones
+#options DEADLKRES # Enable the deadlock resolver
+#options INVARIANTS # Enable calls of extra sanity checking
+#options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
+#options WITNESS # Enable checks to detect deadlocks and cycles
+#options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed
+#options MALLOC_DEBUG_MAXZONES=8 # Separate malloc(9) zones
# Make an SMP-capable kernel by default
options SMP # Symmetric MultiProcessor Kernel
@@ -103,6 +103,7 @@ options EARLY_AP_STARTUP
# CPU frequency control
device cpufreq
+device cpuctl
# Bus support.
device acpi
diff --git a/sys/vm/forking_stale.c b/sys/vm/forking_stale.c
new file mode 100755
index 0000000..47a1bb9
--- /dev/null
+++ b/sys/vm/forking_stale.c
@@ -0,0 +1,1054 @@
+/*
+ * FSTF
+ *
+ * forking stale test fixture
+ *
+ * A test fixture to repro a specific race condition in the
+ * FreeBSD kernel amd64 pmap code.
+ *
+ * The test fixture has components in the kernel and userspace.
+ * A sysctl hook in the kernel is called from userspace to allocate
+ * a region of memory in the kernel, the physical address of which
+ * is mmaped somewhere into the test process. 64-bits of this region
+ * is then atomically manipulated in both kernel and userspace to
+ * help coordinate test actions. This coordination is nothing more
+ * than bit-flag changing & "spin-wait" loops intended to advance
+ * actions in a specific sequence. The test creates three threads:
+ * a faulter, a writer, and a reader. The main process updates
+ * the value at a virtual address to a known state under a lock.
+ * The process is forked and held to keep the vm entries in a
+ * copy-on-write state. The faulter thread is awoken to trap in
+ * the kernel to perform the copy-on-write operation for a page.
+ * Once the pmap page update that applies the new physical address
+ * AND the clearing of the read-only state has occurred the writer
+ * and reader threads are released. The reader thread continually
+ * reads the value under lock twice and ensures the values read are
+ * the same. The writer thread changes the value under the lock.
+ * The iteration stops once the fault handling has been completed.
+ * This whole test cycle is repeated an arbitrary number of iterations.
+ *
+ * The expected "working" behavior from the OS is that, because the
+ * test value is always modified and read under the same lock, it
+ * should not be possible for a thread to read the value twice and see
+ * a change in value. In practice a "mismatch" can be observed, presumably
+ * if the TLB is invalidated for the CPU the reader thread is running
+ * on in-between successive reads.
+ *
+ * The expected "working" behavior from the test fixture below when it
+ * detects the problem is to error out with a non-zero exit code
+ * displaying "complete MISMATCH" when the reader thread sees the
+ * illegal value change.
+ *
+ * This test is expected to run only on an x86_64 platform with at
+ * least 4 CPUs configured for SMP. It is not expected to work (nor
+ * was it ever tried) on any other architecture types or platforms with
+ * smaller CPU counts. It also uses the "rdtscp" instruction
+ * when generating debugging info to help correlate actions occurring
+ * across different CPUs/TSCs. If this is not available, the fixture
+ * will core attempting to execute an illegal instruction. The TSC
+ * MSR AUX can be seeded with the CPU ID from a shell like this:
+ * root@:/usr/src/sys/vm # msr_tsc_aux=0xc0000103; max_cpu_id=`sysctl -n kern.smp.maxid`; cpu=0
+ * root@:/usr/src/sys/vm # while [ ${cpu} -le ${max_cpu_id} ]; do cpuhex=`printf 0x%x ${cpu}`; cpucontrol -m "${msr_tsc_aux}=${cpuhex}" "/dev/cpuctl${cpu}"; cpu=`expr "${cpu}" + 1`; done
+ *
+ * NOTE: Although I have seen it fail occasionally on test virtual machines,
+ * the fixture is far more reliable on real hardware (usually fails in a few seconds).
+ * The timing necessary to hit this race condition seems very delicate.
+ *
+ * The test fixture requires a kernel built w/ the test hooks as well.
+ * Apply the test code patch from /usr/src, rebuild and boot into the kernel.
+ * The test fixture can be built with the following command:
+ * root@:/usr/src/sys/vm # clang -Wall -g -O0 forking_stale.c -o forking_stale -lpthread
+ *
+ * Example output if the fixture detects the problem:
+ *
+ * root@:/usr/src/sys/vm # ./forking_stale -v > /dev/null
+ * RUNTIME PARAMS:niters 100000
+ * ERROR: Mismatch: Old=0, New=1
+ * runtime (sec) 0.378667
+ * forks 1149
+ * faulter iters 1149
+ * writer iters 1149
+ * reader iters 9063
+ * reader value old 6498
+ * reader value new 2565
+ * pmapstalls 1149
+ * mismatches 1
+ * completion reason MISMATCH
+ * exit 1
+ *
+ * Debugging:
+ *
+ * The fixture can be directed to capture runtime debugging about the timing and
+ * actions leading up to the mismatch. To use this, a command like this will
+ * interleave both userspace and kernel debugging info:
+ *
+ * Example command to generate debug timing data in a file named 'fr_sorted.txt'
+ * ./forking_stale -v > fr.txt; sysctl -n kern.fstf_debug_output >> fr.txt && cat fr.txt | sort > fr_sorted.txt
+ *
+ * Output above the following line is from a previous iteration and should be ignored:
+ * TSC: 40256065075175 CPU:00 TID:100187 Forker CODE:forking_stale.c:0481 STATE:0x0000000000000007 MSG:Test hook active...
+ *
+ */
+
+#include <pthread.h>
+#include <pthread_np.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <signal.h>
+
+#include <sys/param.h>
+#include <sys/cpuset.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/sysctl.h>
+#include <sys/mman.h>
+
+#include <machine/atomic.h>
+#include <machine/stdarg.h>
+
+#include "forking_stale.h"
+
+#define value_old 0
+#define value_new 1
+
+#define opt_iterations_default 100000lu
+
+static int fork_func();
+static void * main_func(void *);
+static void * faulter_func(void *);
+static void * writer_func(void *);
+static void * reader_func(void *);
+
+// A structure housed on the memory page we will be testing
+struct test_data
+{
+ uint64_t value;
+ uint64_t something_else;
+};
+
+// The miscellaneous stuff necessary to perform the testing
+struct test_machinery
+{
+ pthread_mutexattr_t mattr;
+ pthread_mutex_t mutex;
+ pthread_attr_t tattr;
+ int main_tid;
+ pthread_t faulter;
+ int faulter_tid;
+ uint64_t faulter_iters;
+ pthread_t writer;
+ int writer_tid;
+ uint64_t writer_iters;
+ pthread_t reader;
+ int reader_tid;
+ uint64_t reader_iters;
+ int fork_tid;
+ uint64_t tsc_freq;
+ uint64_t forks;
+ uint64_t pmapstalls;
+ uint64_t old_data;
+ uint64_t new_data;
+ uint64_t mismatches;
+ uint64_t debug_value;
+ long opt_iterations;
+ int opt_early_error_exit;
+ int opt_wired;
+ struct test_data * test_data;
+ volatile uint64_t * state_control;
+ uint64_t state_control_phys;
+ struct fstf_debug_data debug_entries[FSTF_DEBUG_NUMBER_ENTRIES];
+ u_int debug_position;
+} * test;
+
+static __thread int lcl_tid = 0;
+
+// Initialize any 'test_machinery' values
+static void test_machinery_init(void)
+{
+ memset(test, 0x0, sizeof(struct test_machinery));
+ test->opt_iterations = opt_iterations_default;
+ test->opt_early_error_exit = 1;
+}
+
+// Describe a tid to make debug output more readable
+static const char * test_actor_descr(int tid)
+{
+ char * desc = "<unknown>";
+ if (tid == test->main_tid)
+ {
+ desc = "Forker";
+ }
+ else if (tid == test->faulter_tid)
+ {
+ desc = "Faulter";
+ }
+ else if (tid == test->writer_tid)
+ {
+ desc = "Writer";
+ }
+ else if (tid == test->reader_tid)
+ {
+ desc = "Reader";
+ }
+ else if (tid == test->fork_tid)
+ {
+ desc = "Forkee";
+ }
+ return (desc);
+}
+
+// Retrieve the tid for the current thread
+static int get_current_tid(void)
+{
+ if (lcl_tid == 0)
+ {
+ lcl_tid = pthread_getthreadid_np();
+ }
+ return (lcl_tid);
+}
+
+// Best effort set of the name of the current thread
+static void set_current_thread_name(void)
+{
+ pthread_set_name_np(pthread_self(), test_actor_descr(get_current_tid()));
+}
+
+// Get the location of the 64-bit state/control region shared by userspace and the kernel
+volatile uint64_t * fstf_state_control(void)
+{
+ return (test->state_control);
+}
+
+// Get the tsc frequency
+uint64_t fstf_tsc_frequency_seconds(void)
+{
+ return (test->tsc_freq);
+}
+
+// Fast formatted debug
+void fstf_debug_fast(const char * file, int line, uint32_t type, uint64_t state)
+{
+ if (test->debug_value)
+ {
+ int pos = FSTF_DEBUG_POSITION_INDEX(atomic_fetchadd_int(&test->debug_position, 1));
+
+ test->debug_entries[pos].tsc = fstf_tsc_and_aux(&test->debug_entries[pos].aux);
+ test->debug_entries[pos].tid = get_current_tid();
+ test->debug_entries[pos].file = file;
+ test->debug_entries[pos].line = line;
+ test->debug_entries[pos].state = state;
+ test->debug_entries[pos].dbg_type = type;
+ }
+}
+
+// Misc text debug
+void fstf_debug_misc(const char * file, int line, const char *fmt, ...)
+{
+ if (test->debug_value)
+ {
+ char buffer[1024];
+ va_list ap;
+
+ va_start(ap, fmt);
+
+ vsnprintf(buffer, sizeof(buffer), fmt, ap);
+
+ va_end(ap);
+
+ uint64_t tst = fstf_state_value();
+
+ int pos = FSTF_DEBUG_POSITION_INDEX(atomic_fetchadd_int(&test->debug_position, 1));
+
+ uint32_t aux;
+ uint64_t tsc = fstf_tsc_and_aux(&aux);
+ test->debug_entries[pos].dbg_type = FSTF_DEBUG_TYPE_MISCELLANEOUS;
+
+ va_start(ap, fmt);
+
+ int tid = get_current_tid();
+
+ snprintf(test->debug_entries[pos].misc,
+ sizeof(test->debug_entries[pos].misc),
+ FSTF_DEBUG_FORMAT_STR,
+ tsc,
+ aux,
+ tid,
+ test_actor_descr(tid),
+ file,
+ line,
+ tst,
+ buffer);
+
+ va_end(ap);
+ }
+}
+
+// General debugging/text output
+static void debug_output(const char *fmt, ...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+
+ vprintf(fmt, ap);
+
+ va_end(ap);
+}
+
+// General stats/error output
+static void console_output(const char *fmt, ...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+
+ vfprintf(stderr, fmt, ap);
+
+ va_end(ap);
+}
+
+// Traverse the captured debug entries and output them
+static void dump_debug(void)
+{
+ int num;
+ for (num = 0; num < FSTF_DEBUG_NUMBER_ENTRIES; num++)
+ {
+ if (test->debug_entries[num].dbg_type != FSTF_DEBUG_TYPE_UNUSED)
+ {
+ if (test->debug_entries[num].dbg_type == FSTF_DEBUG_TYPE_MISCELLANEOUS)
+ {
+ debug_output("%s", test->debug_entries[num].misc);
+ }
+ else
+ {
+ char msg[1024];
+ char temp[1024];
+ char buffer[1024];
+
+ fstf_debug_fill_state_descr(test->debug_entries[num].state, temp, sizeof(temp));
+
+ snprintf(msg,
+ sizeof(msg),
+ "State: %s (0x%016lx) [%s]\n",
+ temp,
+ test->debug_entries[num].state,
+ fstf_debug_type_descr(test->debug_entries[num].dbg_type));
+
+ snprintf(buffer,
+ sizeof(buffer),
+ FSTF_DEBUG_FORMAT_STR,
+ test->debug_entries[num].tsc,
+ test->debug_entries[num].aux,
+ test->debug_entries[num].tid,
+ test_actor_descr(test->debug_entries[num].tid),
+ fstf_strip_file_path(test->debug_entries[num].file),
+ test->debug_entries[num].line,
+ test->debug_entries[num].state,
+ msg);
+
+ debug_output("%s", buffer);
+ }
+ }
+ }
+}
+
+// Get the fd int for /dev/mem
+static int dev_mem_fd(void)
+{
+ static int memfd = -1;
+ if (memfd < 0)
+ {
+ memfd = open("/dev/mem", O_RDWR|O_CLOEXEC);
+ }
+ return (memfd);
+}
+
+// Prepare some stuff for threading
+static int thread_prep(void)
+{
+ if ((pthread_mutexattr_init(&test->mattr) != 0) ||
+ (pthread_mutexattr_settype(&test->mattr, PTHREAD_MUTEX_ADAPTIVE_NP) != 0))
+
+ {
+ console_output("ERROR: test setting up mutex attribute for AdaptiveNP\n");
+ return (1);
+ }
+
+ if (pthread_mutex_init(&test->mutex, &test->mattr) != 0)
+ {
+ console_output("ERROR: test initializing mutex\n");
+ return (1);
+ }
+
+ if (pthread_attr_init(&test->tattr) != 0)
+ {
+ console_output("ERROR: test initializing pthread attr\n");
+ return (1);
+ }
+
+ if (pthread_attr_setdetachstate(&test->tattr, PTHREAD_CREATE_JOINABLE) != 0)
+ {
+ console_output("ERROR: test setting pthread attr joinable\n");
+ return (1);
+ }
+
+ return (0);
+}
+
+// Create and start test threads and setup affinities
+static int start_threads(void)
+{
+ int rc;
+ cpuset_t thecpuset;
+
+ CPU_ZERO(&thecpuset);
+ CPU_SET(0, &thecpuset);
+ rc = pthread_setaffinity_np(pthread_self(), sizeof(cpuset_t), &thecpuset);
+ if (rc != 0)
+ {
+ console_output("ERROR: pthread_setaffinity_np() main\n");
+ return (1);
+ }
+
+ rc = pthread_create(&test->faulter, &test->tattr, faulter_func, NULL);
+ if (rc != 0)
+ {
+ console_output("ERROR: pthread_create() faulter\n");
+ return (1);
+ }
+ CPU_ZERO(&thecpuset);
+ CPU_SET(1, &thecpuset);
+ rc = pthread_setaffinity_np(test->faulter, sizeof(cpuset_t), &thecpuset);
+ if (rc != 0)
+ {
+ console_output("ERROR: pthread_setaffinity_np() faulter\n");
+ return (1);
+ }
+
+ rc = pthread_create(&test->writer, &test->tattr, writer_func, NULL);
+ if (rc != 0)
+ {
+ console_output("ERROR: pthread_create() writer\n");
+ return (1);
+ }
+ CPU_ZERO(&thecpuset);
+ CPU_SET(2, &thecpuset);
+ rc = pthread_setaffinity_np(test->writer, sizeof(cpuset_t), &thecpuset);
+ if (rc != 0)
+ {
+ console_output("ERROR: pthread_setaffinity_np() writer\n");
+ return (1);
+ }
+
+ rc = pthread_create(&test->reader, &test->tattr, reader_func, NULL);
+ if (rc != 0)
+ {
+ console_output("ERROR: pthread_create() reader\n");
+ return (1);
+ }
+ CPU_ZERO(&thecpuset);
+ CPU_SET(3, &thecpuset);
+ rc = pthread_setaffinity_np(test->reader, sizeof(cpuset_t), &thecpuset);
+ if (rc != 0)
+ {
+ console_output("ERROR: pthread_setaffinity_np() reader\n");
+ return (1);
+ }
+
+ return (0);
+}
+
+// main processing function; do as many 'fork' iterations as requested or until done
+static void * main_func(void *m)
+{
+ int f;
+ long i;
+
+ for (i = 0; i < test->opt_iterations; i++)
+ {
+ memset(test->debug_entries, 0x00, sizeof(test->debug_entries));
+
+ FSTF_DEBUG_MISC("Test hook active. tsc_freq:%lu, control vaddr:%p, control paddr:0x%lx, test_data vaddr:0x%lx, test_data value location:%p\n",
+ test->tsc_freq,
+ test->state_control,
+ test->state_control_phys,
+ (void*)FSTF_CONSTANT_VADDR,
+ &test->test_data->value);
+
+ pthread_mutex_lock(&test->mutex);
+
+ test->test_data->value = value_old;
+
+ pthread_mutex_unlock(&test->mutex);
+
+ if ((f = fork_func()) < 0)
+ {
+ console_output("Error forking: %d, Errno: %d\n", f, errno);
+ goto Done;
+ }
+
+ test->forks++;
+
+ if (fstf_state_impediment_all(FSTF_STATE_BITS_IDLE) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+ }
+
+ fstf_state_transition(0,
+ FSTF_STATE_BIT_COMPLETE);
+
+Done:
+
+ return (0);
+}
+
+// fork processing function; coordinate the testing for the race
+static int fork_func(void)
+{
+ pid_t pid, savedpid;
+ int pstat;
+
+ switch(pid = fork())
+ {
+ case -1:
+ // error
+ break;
+ case 0:
+ {
+ // child
+ test->fork_tid = get_current_tid();
+ set_current_thread_name();
+
+ // tell threads we have forked and they can get ready
+ fstf_state_transition((FSTF_STATE_BITS_PRIMED | FSTF_STATE_BIT_TEST_ACTIVE),
+ FSTF_STATE_BIT_READY_TO_PRIME);
+
+ // wait for threads to be in the right state
+ if (fstf_state_impediment_all(FSTF_STATE_BITS_PRIMED) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+
+ // tell the faulter he can go
+ fstf_state_transition(FSTF_STATE_BIT_READY_TO_PRIME,
+ FSTF_STATE_BIT_TEST_ACTIVE);
+
+ // wait for the faulter to be done, at that point the iteration is over
+ if (fstf_state_impediment_all(FSTF_STATE_BIT_FAULTER_DONE) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+
+ // cleanup
+ uint64_t st = fstf_state_impediment_all(FSTF_STATE_BITS_IDLE);
+ if (st & FSTF_STATE_BIT_KERNEL_PMAPMOD)
+ {
+ test->pmapstalls++;
+ }
+ if (st & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+ fstf_state_transition(FSTF_STATE_BITS_RESET,
+ 0);
+
+ Done:
+ _exit(127);
+ }
+ default:
+ {
+ // parent
+ savedpid = pid;
+ do
+ {
+ pid = wait4(savedpid, &pstat, 0, (struct rusage *)0);
+ }
+ while (pid == -1 && errno == EINTR);
+ break;
+ }
+ }
+
+ return(pid == -1 ? -1 : pstat);
+}
+
+// faulter thread processing function
+static void * faulter_func(void *t)
+{
+ test->faulter_tid = get_current_tid();
+ set_current_thread_name();
+
+ for (;;)
+ {
+ // indicate we have reached the 'idle' state
+ fstf_state_transition(0,
+ FSTF_STATE_BIT_FAULTER_IDLE);
+
+ // wait to be told we can prepare
+ if (fstf_state_impediment_any(FSTF_STATE_BIT_READY_TO_PRIME) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+
+ pthread_mutex_lock(&test->mutex);
+ pthread_mutex_unlock(&test->mutex);
+
+ test->faulter_iters++;
+ test->faulter_iters--;
+
+ // indicate we are prepared
+ fstf_state_transition(FSTF_STATE_BIT_FAULTER_IDLE,
+ FSTF_STATE_BIT_FAULTER_PRIMED);
+
+ // wait to be told we can write-fault
+ if (fstf_state_impediment_all(FSTF_STATE_BIT_TEST_ACTIVE) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+
+ // actually do the write fault
+ FSTF_DEBUG_MISC("Writing to address: %p\n", &test->test_data->something_else);
+ test->test_data->something_else = 1;
+
+ fstf_state_transition(0,
+ FSTF_STATE_BIT_FAULTER_DONE);
+
+ test->faulter_iters++;
+ }
+
+Done:
+
+ return (0);
+}
+
+// writer thread processing function
+static void * writer_func(void *t)
+{
+ test->writer_tid = get_current_tid();
+ set_current_thread_name();
+
+ for (;;)
+ {
+ // indicate we have reached the 'idle' state
+ fstf_state_transition(0,
+ FSTF_STATE_BIT_WRITER_IDLE);
+
+ // wait to be told we can prepare
+ if (fstf_state_impediment_any(FSTF_STATE_BIT_READY_TO_PRIME) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+
+ pthread_mutex_lock(&test->mutex);
+ (void)test->test_data->value;
+ pthread_mutex_unlock(&test->mutex);
+
+ test->writer_iters++;
+ test->writer_iters--;
+
+ // indicate we are prepared
+ fstf_state_transition(FSTF_STATE_BIT_WRITER_IDLE,
+ FSTF_STATE_BIT_WRITER_PRIMED);
+
+ // wait for either the kernel code to have modified the pmap or the fault to be over
+ if (fstf_state_impediment_any(FSTF_STATE_BIT_KERNEL_PMAPMOD | FSTF_STATE_BIT_FAULTER_DONE) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+
+ // change the value at the test location to something else
+ pthread_mutex_lock(&test->mutex);
+
+ (void)test->test_data->value;
+
+ FSTF_DEBUG_MISC("Writing to address: %p\n", &test->test_data->value);
+
+ test->test_data->value = value_new;
+
+ pthread_mutex_unlock(&test->mutex);
+
+ test->writer_iters++;
+ }
+
+Done:
+
+ return (0);
+}
+
+// reader thread processing function
+static void * reader_func(void *t3)
+{
+ test->reader_tid = get_current_tid();
+ set_current_thread_name();
+
+ for (;;)
+ {
+ // indicate we have reached the 'idle' state
+ fstf_state_transition(0,
+ FSTF_STATE_BIT_READER_IDLE);
+
+ // wait to be told we can prepare
+ if (fstf_state_impediment_any(FSTF_STATE_BIT_READY_TO_PRIME) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+
+ pthread_mutex_lock(&test->mutex);
+ (void)test->test_data->value;
+ pthread_mutex_unlock(&test->mutex);
+
+ test->reader_iters++;
+ test->reader_iters--;
+
+ // indicate we are prepared
+ fstf_state_transition(FSTF_STATE_BIT_READER_IDLE,
+ FSTF_STATE_BIT_READER_PRIMED);
+
+ // wait for the test to become active (the faulter has been released)
+ if (fstf_state_impediment_any(FSTF_STATE_BIT_TEST_ACTIVE) & FSTF_STATE_BITS_FINISHED)
+ {
+ goto Done;
+ }
+
+ while (!fstf_state_check_any(FSTF_STATE_BIT_FAULTER_DONE | FSTF_STATE_BITS_FINISHED))
+ {
+ // read the value at the test location twice and complain if it is different
+ if (pthread_mutex_trylock(&test->mutex) == 0)
+ {
+ uint64_t initialValue = test->test_data->value;
+
+ uint64_t nextValue = test->test_data->value;
+
+ if (initialValue != nextValue)
+ {
+ test->mismatches++;
+
+ FSTF_DEBUG_MISC("Mismatch detected! Old=%lu, New=%lu\n",
+ initialValue,
+ nextValue);
+
+ if (test->opt_early_error_exit)
+ {
+ console_output("ERROR: Mismatch: Old=%lu, New=%lu\n",
+ initialValue,
+ nextValue);
+ fstf_state_transition(0,
+ FSTF_STATE_BIT_MISMATCH);
+ }
+ }
+
+ if (nextValue == value_old)
+ {
+ test->old_data++;
+ }
+ else if (nextValue == value_new)
+ {
+ test->new_data++;
+ }
+
+ pthread_mutex_unlock(&test->mutex);
+
+ test->reader_iters++;
+ }
+ }
+ }
+
+Done:
+
+ return (0);
+}
+
+// shutdown/join threads for completion
+static int finish_threads(void)
+{
+ int err = 0;
+ if (pthread_join(test->faulter, NULL))
+ {
+ console_output("ERROR: pthread_join() faulter\n");
+ err = 1;
+ }
+ if (pthread_join(test->writer, NULL))
+ {
+ console_output("ERROR: pthread_join() writer\n");
+ err = 1;
+ }
+ if (pthread_join(test->reader, NULL))
+ {
+ console_output("ERROR: pthread_join() reader\n");
+ err = 1;
+ }
+ return (err);
+}
+
+// Cleanup stuff for threading
+static int thread_cleanup(void)
+{
+ pthread_mutex_destroy(&test->mutex);
+ pthread_attr_destroy(&test->tattr);
+ return (0);
+}
+
+// do the actual testing
+static int run_test(void)
+{
+ int err = 0;
+
+ if (thread_prep())
+ {
+ err = 1;
+ }
+ else
+ {
+ if (start_threads())
+ {
+ err = 1;
+ }
+
+ main_func(0);
+
+ if (finish_threads())
+ {
+ err = 1;
+ }
+
+ thread_cleanup();
+ }
+ return (err);
+}
+
+// output basic help info
+static void usage(const char * progname)
+{
+ console_output("Usage: %s [opts]\n"
+ "OPTIONS:\n"
+ " -n N ........ set number of iterations, default=%lu\n"
+ " -v ........ output debugging data at the end of the run\n"
+ " -d ........ request heavier debugging in the kernel\n"
+ " -k ........ keep running; don't exit early on mismatch\n"
+ " -w ........ wire the memory with mlockall\n"
+ " -h .......... show this help\n",
+ progname,
+ opt_iterations_default);
+}
+
+// a cancel handler so a ctrl-c'ed long running test can still output useful info
+static void cancel_function(int signo)
+{
+ fstf_state_transition(0,
+ FSTF_STATE_BIT_CANCELLED);
+}
+
+// main
+int main(int argc, char *argv[])
+{
+ uint64_t tm1, tm2, tdiff;
+ int opt, err;
+ char * pEnd;
+
+ // setup our cancel handler
+ if (signal(SIGINT, cancel_function) == SIG_ERR)
+ {
+ console_output("ERROR: setting signal handler cancel\n");
+ exit(1);
+ }
+
+ // allocate memory for the test machinery; we don't care where it is but want it shared (non-COW)
+ void *machinery_mem = mmap(NULL,
+ sizeof(struct test_machinery),
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANON,
+ -1,
+ 0);
+ if (machinery_mem == MAP_FAILED)
+ {
+ console_output("ERROR: allocating memory for test machinery\n");
+ exit(1);
+ }
+
+ test = machinery_mem;
+ test_machinery_init();
+
+ test->main_tid = get_current_tid();
+ set_current_thread_name();
+
+ while ((opt = getopt(argc, argv, "hvdkwn:")) != -1)
+ {
+ switch (opt)
+ {
+ case 'n':
+ test->opt_iterations = strtol(optarg, &pEnd, 10);
+ break;
+ case 'v':
+ test->debug_value |= FSTF_SYSCTL_DEBUG_ON;
+ break;
+ case 'd':
+ test->debug_value |= FSTF_SYSCTL_DEBUG_HEAVY;
+ break;
+ case 'k':
+ test->opt_early_error_exit = 0;
+ break;
+ case 'w':
+ test->opt_wired = 1;
+ break;
+ case 'h':
+ usage(argv[0]);
+ exit(0);
+ break;
+ default : usage(argv[0]);
+ exit(-1);
+ break;
+ }
+ }
+
+ console_output("RUNTIME PARAMS:"
+ "niters %-20lu\n",
+ test->opt_iterations);
+
+ size_t size;
+ pid_t pid;
+ uint64_t old;
+ uint64_t new;
+
+ // obtain the tsc frequency from the sysctl
+ size = sizeof(old);
+ err = sysctlbyname("machdep.tsc_freq", &old, &size, NULL, 0);
+ if (err)
+ {
+ console_output("ERROR: calling machdep.tsc_freq\n");
+ exit(1);
+ }
+ test->tsc_freq = old;
+
+ // allocate memory for the test region; put it at a fixed address and COW.
+ void *test_data_mem = mmap((void*)FSTF_CONSTANT_VADDR,
+ sizeof(struct test_data),
+ PROT_READ | PROT_WRITE,
+ MAP_FIXED | MAP_ANON | MAP_PRIVATE,
+ -1,
+ 0);
+ if (test_data_mem == MAP_FAILED)
+ {
+ console_output("ERROR: null test struct addr\n");
+ exit(1);
+ }
+ test->test_data = test_data_mem;
+ test->test_data->value = 0xffffffffffffffffull;
+
+ // call into the sysctl to prepare the test; pass the desired debugging options
+ size = sizeof(old);
+ pid = getpid();
+ new = ((uint64_t)pid);
+ if (test->debug_value)
+ {
+ new = test->debug_value | ((uint64_t)pid);
+ }
+ err = sysctlbyname("kern.fstf_setup", &old, &size, &new, sizeof(new));
+ if (err)
+ {
+ console_output("ERROR: calling kernel test hook, new: 0x%lx\n", new);
+ exit(1);
+ }
+
+ // call into a sysctl to get the physical address the kernel set up
+ size = sizeof(old);
+ err = sysctlbyname("kern.fstf_control_paddr", &old, &size, NULL, 0);
+ if (err)
+ {
+ console_output("ERROR: calling kernel test addr\n");
+ exit(1);
+ }
+ test->state_control_phys = old;
+
+ // map the physical control address into our address space so we can coordinate
+ void *control_mem = mmap(NULL,
+ PAGE_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED,
+ dev_mem_fd(),
+ (off_t)test->state_control_phys);
+ if (control_mem == MAP_FAILED)
+ {
+ console_output("ERROR: mmaping control region\n");
+ exit(1);
+ }
+ test->state_control = control_mem;
+
+ // allow a runtime toggle to 'lock' the test region to demonstrate it "fixes the glitch"
+ if (test->opt_wired)
+ {
+ err = mlockall(MCL_CURRENT);
+ if (err)
+ {
+ console_output("ERROR: calling mlockall\n");
+ exit(1);
+ }
+ }
+
+ // run the test
+ uint32_t aux;
+ tm1 = fstf_tsc_and_aux(&aux);
+ err = run_test();
+ tm2 = fstf_tsc_and_aux(&aux);
+ tdiff = (tm2 - tm1);
+
+ // error if we found any discrepancies
+ pthread_mutex_lock(&test->mutex);
+ uint64_t mismatchez = test->mismatches;
+ pthread_mutex_unlock(&test->mutex);
+ if (mismatchez)
+ {
+ err = 1;
+ }
+
+ // output the debugging from the test
+ dump_debug();
+
+ // output status information about the test run
+ char status[1024];
+ fstf_debug_fill_state_descr((fstf_state_value() & FSTF_STATE_BITS_FINISHED), status, sizeof(status));
+ const char * reason = status;
+ while (*reason == ' ') { reason++; }
+ console_output("runtime (sec) %lf\n"
+ "forks %lu\n"
+ "faulter iters %lu\n"
+ "writer iters %lu\n"
+ "reader iters %lu\n"
+ "reader value old %lu\n"
+ "reader value new %lu\n"
+ "pmapstalls %lu\n"
+ "mismatches %lu\n"
+ "completion reason %s\n"
+ "exit %d\n",
+ ((1.0*tdiff)/fstf_tsc_frequency_seconds()),
+ test->forks,
+ test->faulter_iters,
+ test->writer_iters,
+ test->reader_iters,
+ test->old_data,
+ test->new_data,
+ test->pmapstalls,
+ mismatchez,
+ reason,
+ err);
+
+ if (machinery_mem)
+ {
+ munmap(machinery_mem, sizeof(struct test_machinery));
+ }
+ if (control_mem)
+ {
+ munmap(control_mem, PAGE_SIZE);
+ }
+ if (test_data_mem)
+ {
+ munmap(test_data_mem, sizeof(struct test_data));
+ }
+
+ return (err);
+}
diff --git a/sys/vm/forking_stale.h b/sys/vm/forking_stale.h
new file mode 100755
index 0000000..5cfde83
--- /dev/null
+++ b/sys/vm/forking_stale.h
@@ -0,0 +1,296 @@
+/*
+ * FSTF
+ *
+ * forking stale test fixture header
+ *
+ * Common constants for userspace/kernel fork-cow-bug repro
+ */
+
+#ifndef _FORKING_STALE__
+#define _FORKING_STALE__
+
+#include <machine/atomic.h>
+
+// A common virtual address to target
+#define FSTF_CONSTANT_OFFSET 31337
+#define FSTF_CONSTANT_VADDR (31337 * PAGE_SIZE)
+#define FSTF_CONSTANT_TIMEOUT_SECONDS 1
+
+// Various bits of state information easily accessed/shared in a uint64_t size value
+#define FSTF_STATE_BIT_COMPLETE 0x8000000000000000llu // Natural termination (iteration count reached)
+#define FSTF_STATE_BIT_TIMEOUT 0x4000000000000000llu // State change timeout (unexpected test condition)
+#define FSTF_STATE_BIT_MISMATCH 0x2000000000000000llu // Mismatch detected (real problem found)
+#define FSTF_STATE_BIT_CANCELLED 0x1000000000000000llu // Early cancellation requested by the user
+#define FSTF_STATE_BIT_FAULTER_IDLE 0x0000000000000001llu // The faulter thread at the beginning
+#define FSTF_STATE_BIT_WRITER_IDLE 0x0000000000000002llu // The writer thread at the beginning
+#define FSTF_STATE_BIT_READER_IDLE 0x0000000000000004llu // The reader thread at the beginning
+#define FSTF_STATE_BIT_READY_TO_PRIME 0x0000000000000008llu // Fork is ready for thread to prepare (vm_map in COW state)
+#define FSTF_STATE_BIT_FAULTER_PRIMED 0x0000000000000010llu // The faulter thread has "primed" variables it intends to use
+#define FSTF_STATE_BIT_WRITER_PRIMED 0x0000000000000020llu // The writer thread has "primed" variables it intends to use
+#define FSTF_STATE_BIT_READER_PRIMED 0x0000000000000040llu // The reader thread has "primed" variables it intends to use
+#define FSTF_STATE_BIT_TEST_ACTIVE 0x0000000000000080llu // The race is on
+#define FSTF_STATE_BIT_KERNEL_PMAPMOD 0x0000000000000100llu // The pmap has been updated in the kernel
+#define FSTF_STATE_BIT_FAULTER_DONE 0x0000000000000200llu // The faulter thread has "completed" its update
+
+// Various combinations of the above bits
+#define FSTF_STATE_BITS_ALL 0xffffffffffffffffllu
+#define FSTF_STATE_BITS_FINISHED 0xf000000000000000llu
+#define FSTF_STATE_BITS_IDLE (FSTF_STATE_BIT_FAULTER_IDLE | FSTF_STATE_BIT_WRITER_IDLE | FSTF_STATE_BIT_READER_IDLE)
+#define FSTF_STATE_BITS_PRIMED (FSTF_STATE_BIT_FAULTER_PRIMED | FSTF_STATE_BIT_WRITER_PRIMED | FSTF_STATE_BIT_READER_PRIMED)
+#define FSTF_STATE_BITS_DONE (FSTF_STATE_BIT_FAULTER_DONE)
+#define FSTF_STATE_BITS_RESET (FSTF_STATE_BITS_PRIMED | FSTF_STATE_BIT_KERNEL_PMAPMOD | FSTF_STATE_BITS_DONE | FSTF_STATE_BIT_TEST_ACTIVE)
+
+// Some pre-determined "faster" debugging modes
+#define FSTF_DEBUG_TYPE_UNUSED 0
+#define FSTF_DEBUG_TYPE_MISCELLANEOUS 1
+#define FSTF_DEBUG_TYPE_TRANSITION_INITIAL 2
+#define FSTF_DEBUG_TYPE_TRANSITION_FINAL 3
+#define FSTF_DEBUG_TYPE_IMPEDIMENT_INITIAL 4
+#define FSTF_DEBUG_TYPE_IMPEDIMENT_FINAL 5
+#define FSTF_DEBUG_TYPE_CHECK_STATE 6
+
+// sysctl test hook flags
+#define FSTF_SYSCTL_PID_MASK 0x00000000ffffffffull
+#define FSTF_SYSCTL_DEBUG_MASK 0xffffffff00000000ull
+#define FSTF_SYSCTL_DEBUG_ON 0x8000000000000000ull
+#define FSTF_SYSCTL_DEBUG_HEAVY 0xC000000000000000ull
+
+// Debug constants
+#define FSTF_DEBUG_CONST_MISC_SIZE 800
+#define FSTF_DEBUG_FORMAT_STR "TSC:%18lu CPU:%02d TID:%06d %-10s CODE:%15s:%04d STATE:0x%016lx MSG:%s"
+
+// helper macros to do consistent debugging
+#define FSTF_DEBUG_ENTRY_POF2 9
+#define FSTF_DEBUG_NUMBER_ENTRIES (1 << FSTF_DEBUG_ENTRY_POF2)
+#define FSTF_DEBUG_POSITION_INDEX(num) (num & (FSTF_DEBUG_NUMBER_ENTRIES-1))
+#define FSTF_DEBUG_MISC(fmt, ...) fstf_debug_misc(__FILE__, __LINE__, fmt, ##__VA_ARGS__);
+
+// Structure containing data for debug messages
+struct fstf_debug_data
+{
+ uint64_t tsc;
+ uint32_t aux;
+ int tid;
+ const char * file;
+ int line;
+ uint64_t state;
+ uint32_t dbg_type;
+ char misc[FSTF_DEBUG_CONST_MISC_SIZE];
+};
+
+// A "slim" debugging variant
+void fstf_debug_fast(const char * file, int line, uint32_t type, uint64_t state);
+
+// A "misc" debugging variant
+void fstf_debug_misc(const char * file, int line, const char *fmt, ...);
+
+// A location where the state variable can be accessed
+volatile uint64_t * fstf_state_control(void);
+
+// Get the tsc frequency
+uint64_t fstf_tsc_frequency_seconds(void);
+
+// Get the "current" state value
+static uint64_t fstf_state_value(void)
+{
+ volatile uint64_t * control = fstf_state_control();
+ uint64_t state = 0;
+ if (control)
+ {
+ state = atomic_load_acq_64(control);
+ }
+ return (state);
+}
+
+// Pushes a textual description of the state flags into a supplied buffer
+static void fstf_debug_fill_state_descr(uint64_t state, char * buf, int len)
+{
+ snprintf(buf,
+ len,
+ "%14s %14s %14s %14s %14s %14s %14s %14s %14s %14s %14s %14s %14s %14s",
+ ((state & FSTF_STATE_BIT_COMPLETE) ? "COMPLETE" : ""),
+ ((state & FSTF_STATE_BIT_TIMEOUT) ? "TIMEOUT" : ""),
+ ((state & FSTF_STATE_BIT_MISMATCH) ? "MISMATCH" : ""),
+ ((state & FSTF_STATE_BIT_CANCELLED) ? "CANCELLED" : ""),
+ ((state & FSTF_STATE_BIT_FAULTER_IDLE) ? "FAULTER_IDLE" : ""),
+ ((state & FSTF_STATE_BIT_WRITER_IDLE) ? "WRITER_IDLE" : ""),
+ ((state & FSTF_STATE_BIT_READER_IDLE) ? "READER_IDLE" : ""),
+ ((state & FSTF_STATE_BIT_READY_TO_PRIME) ? "READY_TO_PRIME" : ""),
+ ((state & FSTF_STATE_BIT_FAULTER_PRIMED) ? "FAULTER_PRIMED" : ""),
+ ((state & FSTF_STATE_BIT_WRITER_PRIMED) ? "WRITER_PRIMED" : ""),
+ ((state & FSTF_STATE_BIT_READER_PRIMED) ? "READER_PRIMED" : ""),
+ ((state & FSTF_STATE_BIT_TEST_ACTIVE) ? "TEST_ACTIVE" : ""),
+ ((state & FSTF_STATE_BIT_KERNEL_PMAPMOD) ? "KERNEL_PMAPMOD" : ""),
+ ((state & FSTF_STATE_BIT_FAULTER_DONE) ? "FAULTER_DONE" : ""));
+}
+
+// Describe the debug action
+static const char * fstf_debug_type_descr(uint32_t type)
+{
+ switch(type)
+ {
+ case FSTF_DEBUG_TYPE_MISCELLANEOUS:
+ return ("Miscellaneous");
+ case FSTF_DEBUG_TYPE_TRANSITION_INITIAL:
+ return ("Transition Initial");
+ case FSTF_DEBUG_TYPE_TRANSITION_FINAL:
+ return ("Transition Final");
+ case FSTF_DEBUG_TYPE_IMPEDIMENT_INITIAL:
+ return ("Impediment Initial");
+ case FSTF_DEBUG_TYPE_IMPEDIMENT_FINAL:
+ return ("Impediment Final");
+ case FSTF_DEBUG_TYPE_CHECK_STATE:
+ return ("Check State");
+ default:
+ break;
+ }
+ return ("Unknown");
+}
+
+// strip "misc" path information from a file name leaving just the file element name
+static const char * fstf_strip_file_path(const char * file)
+{
+ const char * p;
+ const char * fn;
+ size_t len;
+
+ fn = file;
+ if (fn != 0)
+ {
+ for (p = fn, len = strlen(fn);
+ len > 0;
+ len--, p++)
+ {
+ if (*p == '/' || *p == '\\')
+ {
+ fn = p + 1;
+ }
+ }
+ }
+
+ return (fn);
+}
+
+// "rdtscp"
+static uint64_t fstf_tsc_and_aux(uint32_t *ipAux)
+{
+// uint32_t a, d;
+// *ipAux = 0;
+// __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
+// return ((uint64_t)a) | (((uint64_t)d) << 32);
+ uint64_t a, d;
+ __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (*ipAux));
+ return (d << 32) + a;
+}
+
+// Updates the state data to remove 'remove_bits' and add 'include_bits'
+static uint64_t fstf_change_state_bits(uint64_t remove_bits, uint64_t include_bits)
+{
+ uint64_t old_val;
+ uint64_t new_val;
+
+ do
+ {
+ old_val = atomic_load_acq_64(fstf_state_control());
+ new_val = ((old_val & (~remove_bits)) | include_bits);
+ if (old_val == new_val)
+ {
+ break;
+ }
+
+ } while (!atomic_cmpset_64(fstf_state_control(), old_val, new_val));
+
+ return (new_val);
+}
+
+// Waits for the state data to match a certain bitmask
+// opts == fstf_opt_all --> wait for state control to have all bits in bit_mask
+// opts == fstf_opt_any --> wait for state control to have any bits in bit_mask
+static int fstf_opt_all = 0;
+static int fstf_opt_any = 1;
+static uint64_t fstf_wait_for_state_bits(uint64_t wait_for_bit_mask, int opts)
+{
+ uint64_t max_wait = FSTF_CONSTANT_TIMEOUT_SECONDS * fstf_tsc_frequency_seconds();
+ uint64_t now;
+ uint32_t aux;
+ uint64_t start_cycles = fstf_tsc_and_aux(&aux);
+ uint64_t end_cycles = start_cycles + max_wait;
+ uint64_t tst;
+ do
+ {
+ __asm__ volatile("pause\n": : :"memory");
+
+ tst = atomic_load_acq_64(fstf_state_control());
+ if (tst & FSTF_STATE_BITS_FINISHED)
+ {
+ return (tst);
+ }
+ if (opts == fstf_opt_all)
+ {
+ if ((tst & wait_for_bit_mask) == wait_for_bit_mask)
+ {
+ return (tst);
+ }
+ }
+ else if (opts == fstf_opt_any)
+ {
+ if ((tst & wait_for_bit_mask))
+ {
+ return (tst);
+ }
+ }
+ now = fstf_tsc_and_aux(&aux);
+ } while (now < end_cycles);
+
+ tst = fstf_change_state_bits(0, FSTF_STATE_BIT_TIMEOUT);
+
+ return (tst);
+}
+
+// Window dressing for fstf_change_state_bits
+static uint64_t fstf_wrap_state_transition(const char * file, int line, uint64_t remove_bits, uint64_t include_bits)
+{
+ fstf_debug_fast(file, line, FSTF_DEBUG_TYPE_TRANSITION_INITIAL, fstf_state_value());
+ uint64_t newstate = fstf_change_state_bits(remove_bits, include_bits);
+ fstf_debug_fast(file, line, FSTF_DEBUG_TYPE_TRANSITION_FINAL, newstate);
+ return (newstate);
+}
+
+// Window dressing for fstf_wait_for_state_bits
+static uint64_t fstf_wrap_state_impediment(const char * file, int line, uint64_t wait_for_bitmask, int opts)
+{
+ fstf_debug_fast(file, line, FSTF_DEBUG_TYPE_IMPEDIMENT_INITIAL, fstf_state_value());
+ uint64_t newstate = fstf_wait_for_state_bits(wait_for_bitmask, opts);
+ fstf_debug_fast(file, line, FSTF_DEBUG_TYPE_IMPEDIMENT_FINAL, newstate);
+ return (newstate);
+}
+
+// Window dressing for a non-blocking check of the state bits
+static int fstf_wrap_check_for_bits(const char * file, int line, uint64_t check_for_bitmask, int opts)
+{
+ uint64_t state = fstf_state_value();
+ fstf_debug_fast(file, line, FSTF_DEBUG_TYPE_CHECK_STATE, state);
+ if (opts == fstf_opt_all)
+ {
+ if ((state & check_for_bitmask) == check_for_bitmask)
+ {
+ return (1);
+ }
+ }
+ else if (opts == fstf_opt_any)
+ {
+ if ((state & check_for_bitmask))
+ {
+ return (1);
+ }
+ }
+ return (0);
+}
+
+#define fstf_state_impediment_all(wait_for) fstf_wrap_state_impediment(__FILE__, __LINE__, wait_for, fstf_opt_all)
+#define fstf_state_impediment_any(wait_for) fstf_wrap_state_impediment(__FILE__, __LINE__, wait_for, fstf_opt_any)
+#define fstf_state_transition(remove, add) fstf_wrap_state_transition(__FILE__, __LINE__, remove, add)
+#define fstf_state_check_all(check_for) fstf_wrap_check_for_bits(__FILE__, __LINE__, check_for, fstf_opt_all)
+#define fstf_state_check_any(check_for) fstf_wrap_check_for_bits(__FILE__, __LINE__, check_for, fstf_opt_any)
+
+#endif
diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
index 83e12a5..eaab9c6 100644
--- a/sys/vm/vm_fault.c
+++ b/sys/vm/vm_fault.c
@@ -544,6 +544,9 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
RetryFault:;
+ fstf_conditional_point_debug(__FILE__, __LINE__, map->pmap, vaddr);
+
+
/*
* Find the backing store object and offset into it to begin the
* search.
@@ -558,6 +561,7 @@ RetryFault:;
}
fs.map_generation = fs.map->timestamp;
+ fstf_conditional_fault_debug(__FILE__, __LINE__, map->pmap, vaddr, prot);
if (fs.entry->eflags & MAP_ENTRY_NOFAULT) {
panic("%s: fault on nofault entry, addr: %#lx",
@@ -607,6 +611,7 @@ RetryFault:;
(fs.first_object->type != OBJT_VNODE &&
(fs.first_object->flags & OBJ_TMPFS_NODE) == 0) ||
(fs.first_object->flags & OBJ_MIGHTBEDIRTY) != 0) {
+ fstf_conditional_point_debug(__FILE__, __LINE__, map->pmap, vaddr);
rv = vm_fault_soft_fast(&fs, vaddr, prot, fault_type,
fault_flags, wired, m_hold);
if (rv == KERN_SUCCESS)
@@ -721,6 +726,9 @@ RetryFault:;
* found the page ).
*/
vm_page_xbusy(fs.m);
+ if (fs.map) {
+ fstf_conditional_point_debug(__FILE__, __LINE__, fs.map->pmap, vaddr);
+ }
if (fs.m->valid != VM_PAGE_BITS_ALL)
goto readrest;
break;
@@ -784,6 +792,9 @@ RetryFault:;
alloc_req |= VM_ALLOC_ZERO;
fs.m = vm_page_alloc(fs.object, fs.pindex,
alloc_req);
+ if (fs.m != NULL && fs.map != NULL) {
+ fstf_conditional_point_debug(__FILE__, __LINE__, fs.map->pmap, vaddr);
+ }
}
if (fs.m == NULL) {
unlock_and_deallocate(&fs);
@@ -1133,6 +1144,7 @@ RetryFault:;
/*
* Oh, well, lets copy it.
*/
+ fstf_conditional_point_debug(__FILE__, __LINE__, fs.map->pmap, vaddr);
pmap_copy_page(fs.m, fs.first_m);
fs.first_m->valid = VM_PAGE_BITS_ALL;
if (wired && (fault_flags &
@@ -1168,6 +1180,7 @@ RetryFault:;
curthread->td_cow++;
} else {
prot &= ~VM_PROT_WRITE;
+ fstf_conditional_fault_debug(__FILE__, __LINE__, map->pmap, vaddr, prot);
}
}
@@ -1185,6 +1198,7 @@ RetryFault:;
if (fs.map->timestamp != fs.map_generation) {
result = vm_map_lookup_locked(&fs.map, vaddr, fault_type,
&fs.entry, &retry_object, &retry_pindex, &retry_prot, &wired);
+ fstf_conditional_fault_debug(__FILE__, __LINE__, map->pmap, vaddr, retry_prot);
/*
* If we don't need the page any longer, put it on the inactive
@@ -1194,6 +1208,7 @@ RetryFault:;
if (result != KERN_SUCCESS) {
release_page(&fs);
unlock_and_deallocate(&fs);
+ fstf_conditional_point_debug(__FILE__, __LINE__, map->pmap, vaddr);
/*
* If retry of map lookup would have blocked then
@@ -1219,6 +1234,7 @@ RetryFault:;
* write-enabled after all.
*/
prot &= retry_prot;
+ fstf_conditional_fault_debug(__FILE__, __LINE__, map->pmap, vaddr, prot);
}
}
diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index f17a981..82c5fd7f 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -4094,6 +4094,465 @@ vm_page_trylock_KBI(vm_page_t m, const char *file, int line)
return (mtx_trylock_flags_(vm_page_lockptr(m), 0, file, line));
}
+
+
+
+
+
+
+#include <sys/types.h>
+#include <sys/malloc.h>
+#include <sys/sysctl.h>
+#include <vm/vm_map.h>
+#include <machine/atomic.h>
+#include <machine/stdarg.h>
+
+#include <vm/forking_stale.h>
+
+extern uint64_t tsc_freq;
+
+MALLOC_DEFINE(M_FSTF, "fstf", "FSTF Test Hook");
+
+static pmap_t test_pmap = NULL;
+static uintptr_t test_vaddr = FSTF_CONSTANT_VADDR;
+struct fstf_debug_data debug_entries[FSTF_DEBUG_NUMBER_ENTRIES];
+static u_int debug_position;
+static uint64_t debug_value = 0;
+static void * state_control_alloc = 0;
+static size_t state_control_alloc_length = PAGE_SIZE;
+static vm_paddr_t state_control_paddr = 0;
+static volatile uint64_t * state_control = 0;
+
+static int sysctl_fstf_setup(SYSCTL_HANDLER_ARGS);
+static int sysctl_debug_output(SYSCTL_HANDLER_ARGS);
+
+SYSCTL_NODE(_kern,
+ OID_AUTO,
+ fstf_setup,
+ CTLFLAG_RW,
+ sysctl_fstf_setup,
+ "Test hook to setup to help debug a race condition");
+
+SYSCTL_U64(_kern,
+ OID_AUTO,
+ fstf_control_paddr,
+ CTLFLAG_RD,
+ &state_control_paddr,
+ 0,
+ "Read the physical memory address allocated for the test");
+
+SYSCTL_OID(_kern,
+ OID_AUTO,
+ fstf_debug_output,
+ CTLTYPE_STRING | CTLFLAG_RD,
+ NULL,
+ 0,
+ sysctl_debug_output,
+ "A",
+ "Output test fixture debugging");
+
+/* Output debugging captured in memory in the kernel during the last test run */
+static int
+sysctl_debug_output(SYSCTL_HANDLER_ARGS)
+{
+ struct sbuf sbuf;
+ int num, error;
+
+ error = sysctl_wire_old_buffer(req, 0);
+ if (error != 0) {
+ return (error);
+ }
+
+ sbuf_new_for_sysctl(&sbuf, NULL, sizeof(debug_entries[0].misc) + 1024, req);
+
+ sbuf_printf(&sbuf, "\n");
+
+ for (num = 0; num < FSTF_DEBUG_NUMBER_ENTRIES; num++) {
+ if (debug_entries[num].dbg_type != FSTF_DEBUG_TYPE_UNUSED) {
+ if (debug_entries[num].dbg_type == FSTF_DEBUG_TYPE_MISCELLANEOUS) {
+
+ error = sbuf_printf(&sbuf, "%s", debug_entries[num].misc);
+ if (error) {
+ printf("Error: %d handling sysctl printf\n", error);
+ break;
+ }
+
+ } else {
+
+ char stbuf[512];
+ char state_str[1024];
+ fstf_debug_fill_state_descr(debug_entries[num].state,
+ stbuf,
+ sizeof(stbuf));
+
+ snprintf(state_str,
+ sizeof(state_str),
+ "State: %s (0x%016lx) [%s]\n",
+ stbuf,
+ debug_entries[num].state,
+ fstf_debug_type_descr(debug_entries[num].dbg_type));
+
+ error = sbuf_printf(&sbuf,
+ FSTF_DEBUG_FORMAT_STR,
+ debug_entries[num].tsc,
+ debug_entries[num].aux,
+ debug_entries[num].tid,
+ "",
+ fstf_strip_file_path(debug_entries[num].file),
+ debug_entries[num].line,
+ debug_entries[num].state,
+ state_str);
+ if (error) {
+ printf("Error: %d handling sysctl printf\n", error);
+ break;
+ }
+ }
+ }
+ }
+
+ error = sbuf_finish(&sbuf);
+
+ sbuf_delete(&sbuf);
+ return (error);
+}
+
+/* Setup for testing the race condition */
+static int
+sysctl_fstf_setup(SYSCTL_HANDLER_ARGS)
+{
+ int error;
+ pid_t pid;
+ uint64_t pidmore;
+ struct proc *p;
+
+ /* Get the value supplied; a 64-bit integer with <debug | pid> */
+
+ error = sysctl_handle_64(oidp, &pidmore, 0, req);
+ if (error || req->newptr == NULL) {
+ return (error);
+ }
+
+ FSTF_DEBUG_MISC("Test Hook: Entry: %0lx\n", pidmore);
+
+ pid = (pid_t)(FSTF_SYSCTL_PID_MASK & pidmore);
+ debug_value = (FSTF_SYSCTL_DEBUG_MASK & pidmore);
+
+ FSTF_DEBUG_MISC("Test Hook: Initial pid=%d dbg=%lu\n",
+ pid,
+ debug_value);
+
+ /* Confirm that the calling process is correct & look it up */
+ error = pget(pid, PGET_CANSEE | PGET_ISCURRENT | PGET_CANDEBUG | PGET_NOTWEXIT | PGET_NOTID, &p);
+ if (error != 0) {
+ printf("Test Hook: Setup Error, unable to lookup/match PID. Error: %d\n", error);
+ return (error);
+ }
+
+ FSTF_DEBUG_MISC("Test Hook: Process Located pid=%d name=%s\n",
+ pid,
+ p->p_comm);
+
+ /* Release any previous test resources */
+ if (state_control_alloc) {
+ contigfree(state_control_alloc,
+ state_control_alloc_length,
+ M_FSTF);
+ state_control = NULL;
+ state_control_alloc = NULL;
+ }
+
+ /* Allocate a control region & make the physical address available via a sysctl */
+ state_control_alloc = contigmalloc(state_control_alloc_length,
+ M_FSTF,
+ M_WAITOK | M_ZERO,
+ 0ul,
+ ~0ul,
+ PAGE_SIZE,
+ 0);
+ if (state_control_alloc == NULL) {
+ error = ENOMEM;
+ printf("Test Hook: Setup Error, unable to allocate kernel control memory.\n");
+ goto Done;
+ }
+ state_control = state_control_alloc;
+ state_control_paddr = vtophys(state_control_alloc);
+
+ /* Squirrel away the process pmap and vaddr; we use these to make the
+ * test hooks conditional only on select access */
+ test_pmap = vmspace_pmap(p->p_vmspace);
+ test_vaddr = FSTF_CONSTANT_VADDR;
+
+ /* Clear the debugging out from any previous runs */
+ int num;
+ for (num = 0; num < FSTF_DEBUG_NUMBER_ENTRIES; num++) {
+ debug_entries[num].dbg_type = FSTF_DEBUG_TYPE_UNUSED;
+ }
+
+ FSTF_DEBUG_MISC("Test Hook: Initiated pid=%d, name=%s, pmap=%p, vaddr=0x%jx, "
+ "control vaddr=0x%jx, control paddr=0x%jx\n",
+ pid,
+ p->p_comm,
+ test_pmap,
+ (uintmax_t)test_vaddr,
+ (uintmax_t)state_control,
+ (uintmax_t)state_control_paddr);
+
+ Done:
+
+ PROC_UNLOCK(p);
+
+ return (error);
+}
+
+/* The state/control region */
+volatile uint64_t * fstf_state_control(void)
+{
+ return (state_control);
+}
+
+/* Get the TSC frequency in cycles per second */
+uint64_t fstf_tsc_frequency_seconds(void)
+{
+ uint64_t freq = atomic_load_acq_64(&tsc_freq);
+ return (freq);
+}
+
+/* A fast pre-determined debugging function */
+void fstf_debug_fast(const char * file, int line, uint32_t type, uint64_t state)
+{
+ if (debug_value) {
+
+ struct thread *td = curthread;
+ if ((td != NULL)) {
+
+ int pos = FSTF_DEBUG_POSITION_INDEX(atomic_fetchadd_int(&debug_position, 1));
+
+ debug_entries[pos].tsc = fstf_tsc_and_aux(&debug_entries[pos].aux);
+ debug_entries[pos].tid = td->td_tid;
+ debug_entries[pos].file = file;
+ debug_entries[pos].line = line;
+ debug_entries[pos].state = state;
+ debug_entries[pos].dbg_type = type;
+ }
+ }
+}
+
+/* A misc text debugging function */
+void fstf_debug_misc(const char * file, int line, const char *fmt, ...)
+{
+ if (debug_value) {
+
+ struct thread *td = curthread;
+ if ((td != NULL)) {
+
+ char buffer[512];
+ va_list ap;
+
+ va_start(ap, fmt);
+
+ vsnprintf(buffer, sizeof(buffer), fmt, ap);
+
+ va_end(ap);
+
+ uint64_t state = fstf_state_value();
+
+ int pos = FSTF_DEBUG_POSITION_INDEX(atomic_fetchadd_int(&debug_position, 1));
+
+ uint32_t aux;
+ uint64_t tsc = fstf_tsc_and_aux(&aux);
+
+ debug_entries[pos].dbg_type = FSTF_DEBUG_TYPE_MISCELLANEOUS;
+
+ snprintf(debug_entries[pos].misc,
+ sizeof(debug_entries[pos].misc),
+ FSTF_DEBUG_FORMAT_STR,
+ tsc,
+ aux,
+ (td == NULL ? 0 : td->td_tid),
+ (td == NULL ? "" : td->td_name),
+ fstf_strip_file_path(file),
+ line,
+ state,
+ buffer);
+ }
+ }
+}
+
+/* A 'got here' debugging function */
+void fstf_conditional_point_debug(const char * file, int line, pmap_t pmap, vm_offset_t vaddr)
+{
+ if ((pmap != NULL) &&
+ (pmap == test_pmap) &&
+ (((debug_value & FSTF_SYSCTL_DEBUG_HEAVY) == FSTF_SYSCTL_DEBUG_HEAVY) ||
+ ((test_vaddr <= vaddr) && (vaddr < (test_vaddr + PAGE_SIZE))))) {
+
+ fstf_debug_misc(file,
+ line,
+ "Debug: pmap=%p vaddr=0x%jx\n",
+ pmap,
+ (uintmax_t)vaddr);
+ }
+}
+
+/* A vm_page_t info debugging function */
+static void fstf_debug_page(const char * file, int line, vm_offset_t vaddr, vm_page_t m)
+{
+ fstf_debug_misc(file,
+ line,
+ "Debug: VAddr:0x%jx page %p obj %p pidx 0x%jx phys 0x%jx q %d hold %d wire %d"
+ " af 0x%x of 0x%x f 0x%x act %d busy %x valid 0x%x dirty 0x%x\n",
+ (uintmax_t)vaddr,
+ m,
+ m->object,
+ (uintmax_t)m->pindex,
+ (uintmax_t)m->phys_addr,
+ m->queue,
+ m->hold_count,
+ m->wire_count,
+ m->aflags,
+ m->oflags,
+ m->flags,
+ m->act_count,
+ m->busy_lock,
+ m->valid,
+ m->dirty);
+}
+
+/* A pt_entry_t info debugging function */
+void fstf_conditional_pte_debug(const char * file, int line, pmap_t pmap, vm_offset_t vaddr, pt_entry_t pte)
+{
+ if ((pmap != NULL) &&
+ (pmap == test_pmap) &&
+ (((debug_value & FSTF_SYSCTL_DEBUG_HEAVY) == FSTF_SYSCTL_DEBUG_HEAVY) ||
+ ((test_vaddr <= vaddr) && (vaddr < (test_vaddr + PAGE_SIZE))))) {
+
+ vm_page_t m = PHYS_TO_VM_PAGE(pte & PG_FRAME);
+ fstf_debug_misc(file,
+ line,
+ "Debug: Hit[pmap:%p VAddr:0x%jx P:0x%jx F:%016lx] %s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s\n",
+ pmap,
+ (uintmax_t)vaddr,
+ (uintmax_t)(pte & PG_FRAME),
+ pte,
+ ((pte & PG_V) ? "Valid" : "Invalid"),
+ ((pte & PG_RW) ? "Read-Write" : "No-Read-Write"),
+ ((pte & PG_A) ? "Accessed" : "Not-Accessed"),
+ ((pte & PG_M) ? "Dirty" : "Not-Dirty"),
+ ((pte & PG_PS) ? "2M-Page-Size" : "4K-Page-Size"),
+ ((pte & PG_G) ? "Global" : "Not-Global"),
+ ((pte & PG_AVAIL1) ? "Avail1" : "No-Avail1"),
+ ((pte & PG_AVAIL2) ? "Managed" : "Unmanaged"),
+ ((pte & PG_AVAIL3) ? "Wired" : "Not-Wired"),
+ ((pte & PG_U) ? "User-Supervisor" : "No-User-Supervisor"),
+ ((pte & PG_NC_PWT) ? "Write-Through" : "No-Write-Through"),
+ ((pte & PG_NC_PCD) ? "Cache-Disable" : "No-Cache-Disable"),
+ ((pte & PG_PTE_PAT) ? "PAT-Index" : "No-PAT-Index"));
+ if (m != NULL)
+ {
+ fstf_debug_page(file, line, vaddr, m);
+ }
+ }
+}
+
+/* A 'got here' debugging function */
+void fstf_conditional_fault_debug(const char * file, int line, pmap_t pmap, vm_offset_t vaddr, vm_prot_t prot)
+{
+ if ((pmap != NULL) &&
+ (pmap == test_pmap) &&
+ (((debug_value & FSTF_SYSCTL_DEBUG_HEAVY) == FSTF_SYSCTL_DEBUG_HEAVY) ||
+ ((test_vaddr <= vaddr) && (vaddr < (test_vaddr + PAGE_SIZE))))) {
+
+ fstf_debug_misc(file,
+ line,
+ "Debug: pmap=%p vaddr=0x%jx prot=%s|%s|%s|%s|%s\n",
+ pmap,
+ (uintmax_t)vaddr,
+ ((prot & VM_PROT_READ) ? "Read" : "No-Read"),
+ ((prot & VM_PROT_WRITE) ? "Write" : "No-Write"),
+ ((prot & VM_PROT_EXECUTE) ? "Execute" : "No-Execute"),
+ ((prot & VM_PROT_COPY) ? "Copy" : "No-Copy"),
+ ((prot & VM_PROT_FAULT_LOOKUP) ? "Lookup" : "No-Lookup"));
+ }
+}
+
+/* A helper to check if a supplied pmap and vaddr match and should trigger test code */
+static int fstf_conditional(pmap_t pmap, vm_offset_t vaddr)
+{
+ if ((pmap != NULL) &&
+ (pmap == test_pmap) &&
+ ((test_vaddr <= vaddr) && (vaddr < (test_vaddr + PAGE_SIZE)))) {
+
+ return (1);
+ }
+ return (0);
+}
+
+/* Debug & advance the state once at the appropriate error-gen state */
+int fstf_conditional_pmapmod_advance(const char * file, int line, vm_offset_t vaddr, vm_page_t m)
+{
+ int status = 0;
+ pmap_t pmap = PCPU_GET(curpmap);
+ if (fstf_conditional(pmap, vaddr)) {
+
+ if (state_control) {
+
+ uint64_t old_val;
+ uint64_t new_val;
+
+ fstf_debug_fast(file,
+ line,
+ FSTF_DEBUG_TYPE_TRANSITION_INITIAL,
+ fstf_state_value());
+
+ fstf_debug_page(file, line, vaddr, m);
+
+ do
+ {
+ old_val = atomic_load_acq_64(state_control);
+ if ((old_val & FSTF_STATE_BITS_PRIMED) != FSTF_STATE_BITS_PRIMED)
+ {
+ status = 0;
+ goto Done;
+ }
+ if (old_val & FSTF_STATE_BITS_FINISHED)
+ {
+ status = 0;
+ goto Done;
+ }
+ new_val = (old_val | FSTF_STATE_BIT_KERNEL_PMAPMOD);
+
+ } while (!atomic_cmpset_64(state_control, old_val, new_val));
+
+ fstf_debug_fast(file,
+ line,
+ FSTF_DEBUG_TYPE_TRANSITION_FINAL,
+ new_val);
+ status = 1;
+ }
+ }
+
+ Done:
+
+ return (status);
+}
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
#if defined(INVARIANTS) || defined(INVARIANT_SUPPORT)
void
vm_page_assert_locked_KBI(vm_page_t m, const char *file, int line)
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 6ca808b..e250be0 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -452,6 +452,11 @@ malloc2vm_flags(int malloc_flags)
#define PS_ALL_VALID 0x2
#define PS_NONE_BUSY 0x4
+void fstf_conditional_point_debug(const char * file, int line, pmap_t pmap, vm_offset_t vaddr);
+void fstf_conditional_fault_debug(const char * file, int line, pmap_t pmap, vm_offset_t vaddr, vm_prot_t prot);
+void fstf_conditional_pte_debug(const char * file, int line, pmap_t pmap, vm_offset_t vaddr, pt_entry_t pte);
+int fstf_conditional_pmapmod_advance(const char * file, int line, vm_offset_t vaddr, vm_page_t m);
+
void vm_page_busy_downgrade(vm_page_t m);
void vm_page_busy_sleep(vm_page_t m, const char *msg, bool nonshared);
void vm_page_flash(vm_page_t m);
--
2.10.2
[-- Attachment #3 --]
From 004d3e879cdf894771e260fadaf92c48e48180fe Mon Sep 17 00:00:00 2001
From: Elliott Rabe <elliott.rabe@dell.com>
Date: Sun, 11 Feb 2018 17:29:15 -0600
Subject: [PATCH 2/3] TRIAL: Double invalidate when finishing COW pmap update

When a process forks, the first write to a page of memory starts a
copy-on-write operation. The pmap is currently updated with the new physical
address and the writable status in a single atomic operation, followed by the
necessary TLB invalidations. Marking the page writable before the
invalidations are complete allows the page contents to be changed while other
CPUs may still be using a stale translation to the old physical page. This
can result in subtle memory corruption.

Here is a simplified example of what can occur:

+A process is forked, which transitions a map entry to COW
+Thread A writes to a page on the map entry, faults, updates the pmap to
writable at a new phys addr, and starts TLB invalidations...
+Thread B acquires a lock, writes to a location on the new phys addr,
and releases the lock
+Thread C acquires the lock, reads from the location on the old phys addr...
+Thread A ...continues the TLB invalidations, which are completed
+Thread C ...reads from the location on the new phys addr, and releases
the lock

In this example Threads B and C each lock, use memory, and unlock properly,
and they never own the lock at the same time. Thread C nevertheless sees data
protected by the lock change beneath it while it is the lock owner. Thread A
was writing somewhere else on the page and so never needed the lock.

This commit introduces a double update and invalidation for the scenario
above. The pmap update first installs the new address in a read-only state
and performs the TLB invalidations. Immediately afterwards, the pmap is
updated again to mark the region writable. This strategy ensures the page
contents are immutable until all CPUs know the correct location.
---
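Note (not part of the commit): the ordering described in the commit message
can be illustrated with a small userspace toy model. This is only a sketch
under stated assumptions, not the kernel implementation. All `sk_` and `SK_`
names are hypothetical and invented for illustration, the "TLB" is just a
per-CPU cached copy of one PTE, and a shootdown is modelled as completing
instantly once requested. The model only checks one property: whether the
PTE is ever writable while some CPU still caches a translation to a
different frame.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for this toy model; not the kernel's PG_* values. */
#define SK_PG_V     0x1ull        /* valid */
#define SK_PG_RW    0x2ull        /* writable */
#define SK_PG_FRAME (~0xfffull)   /* physical frame bits */
#define SK_NCPU     4

struct sk_model {
	uint64_t pte;             /* the single page-table entry */
	uint64_t tlb[SK_NCPU];    /* per-CPU cached copy of the PTE, 0 = none */
};

/* Start with every CPU caching a read-only (COW) mapping of old_pa. */
static void
sk_init(struct sk_model *m, uint64_t old_pa)
{
	m->pte = (old_pa & SK_PG_FRAME) | SK_PG_V;
	for (int i = 0; i < SK_NCPU; i++)
		m->tlb[i] = m->pte;
}

/* Model a completed shootdown: every CPU drops its cached entry. */
static void
sk_invalidate(struct sk_model *m)
{
	for (int i = 0; i < SK_NCPU; i++)
		m->tlb[i] = 0;
}

/*
 * Return 1 if the hazard exists right now: the PTE is writable while some
 * CPU still caches a translation to a different physical frame.
 */
static int
sk_hazard(const struct sk_model *m)
{
	if ((m->pte & SK_PG_RW) == 0)
		return (0);
	for (int i = 0; i < SK_NCPU; i++) {
		if (m->tlb[i] != 0 &&
		    (m->tlb[i] & SK_PG_FRAME) != (m->pte & SK_PG_FRAME))
			return (1);
	}
	return (0);
}

/* Single stage: publish new frame and RW together, then invalidate. */
static int
sk_update_single_stage(struct sk_model *m, uint64_t new_pa)
{
	int hazard;

	m->pte = (new_pa & SK_PG_FRAME) | SK_PG_V | SK_PG_RW;
	hazard = sk_hazard(m);    /* sampled before the shootdown finishes */
	sk_invalidate(m);
	return (hazard);
}

/* Two stage: publish new frame read-only, invalidate, then add RW. */
static int
sk_update_two_stage(struct sk_model *m, uint64_t new_pa)
{
	int hazard;

	m->pte = (new_pa & SK_PG_FRAME) | SK_PG_V;   /* stage 1: read-only */
	hazard = sk_hazard(m);
	sk_invalidate(m);
	m->pte |= SK_PG_RW;                          /* stage 2: writable */
	hazard |= sk_hazard(m);
	sk_invalidate(m);
	return (hazard);
}

/* Usage: compare the two orderings on the same starting state. */
static void
sk_self_check(void)
{
	struct sk_model m;

	sk_init(&m, 0x1000);
	assert(sk_update_single_stage(&m, 0x2000) == 1);
	sk_init(&m, 0x1000);
	assert(sk_update_two_stage(&m, 0x2000) == 0);
}
```

In this model the single-stage ordering reports the hazard and the two-stage
ordering does not, at the cost of a second invalidation. The model says
nothing about real TLB behaviour or about the cost of the extra shootdown.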
sys/amd64/amd64/pmap.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 7bb9c1b..aa91cb1 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -4638,7 +4638,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
vm_paddr_t opa, pa;
vm_page_t mpte, om;
int rv;
- boolean_t nosleep;
+ boolean_t nosleep, delayrw;
PG_A = pmap_accessed_bit(pmap);
PG_G = pmap_global_bit(pmap);
@@ -4728,6 +4728,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
panic("pmap_enter: invalid page directory va=%#lx", va);
origpte = *pte;
+ delayrw = 0;
/*
* Is the specified virtual address already mapped?
@@ -4768,6 +4769,11 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
if (((origpte ^ newpte) & ~(PG_M | PG_A)) == 0)
goto unchanged;
goto validate;
+ } else if (((origpte & PG_MANAGED) != 0) &&
+ ((origpte & PG_RW) == 0) &&
+ ((newpte & PG_RW) != 0)) {
+ newpte &= ~PG_RW;
+ delayrw = 1;
}
} else {
/*
@@ -4841,6 +4847,17 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
*/
goto unchanged;
}
+ if (delayrw) {
+ /*
+ * If the physical address has changed and we are also adding
+ * RW, do a two-stage update so the region is immutable
+ * until all CPUs can see the new address.
+ */
+ if ((origpte & PG_A) != 0)
+ pmap_invalidate_page(pmap, va);
+ newpte |= PG_RW;
+ origpte = pte_load_store(pte, newpte);
+ }
if ((origpte & PG_A) != 0)
pmap_invalidate_page(pmap, va);
} else
--
2.10.2