From owner-svn-src-all@FreeBSD.ORG Wed Oct 13 00:29:13 2010
Date: Wed, 13 Oct 2010 11:29:08 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Gennady Proskurin
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
    src-committers@freebsd.org, Bruce Evans, Andriy Gapon
Subject: Re: svn commit: r213648 - head/sys/kern
In-Reply-To: <20101012210804.GA4286@gpr.nnz-home.ru>
Message-ID: <20101013094338.U1075@besplex.bde.org>
References: <201010090807.o9987nG0030939@svn.freebsd.org>
    <20101009191600.N3531@besplex.bde.org> <4CB03A82.6040701@freebsd.org>
    <20101012210804.GA4286@gpr.nnz-home.ru>

On Wed, 13 Oct 2010, Gennady Proskurin wrote:

> On Sat, Oct 09, 2010 at 12:48:50PM +0300, Andriy Gapon wrote:
>> on 09/10/2010 12:33 Bruce Evans said the following:
>>> On Sat, 9 Oct 2010, Andriy Gapon wrote:
>[>]*...
>>> Now, why don't the partial memory barriers prevent caching the variable?
>>>
>>> % 	if (panic_cpu != PCPU_GET(cpuid))
>>> % 		while (atomic_cmpset_int(&panic_cpu, NOCPU,
>>> % 		    PCPU_GET(cpuid)) == 0)
>>> % 			while (panic_cpu != NOCPU)
>>> % 				;	/* nothing */
>>>
>>> The very first access can't reasonably use a cached value.  atomic_cmpset()

So the first access doesn't need a compiler-type memory barrier.  Not
needing a CPU-type one is more subtle -- see below.

>>> can change panic_cpu, so even without the memory barrier, panic_cpu must
>>> be reloaded for the third access the first time.  But then when the third
>>> access is repeated in the second while loop, the missing atomic op with
>>> barrier makes things very broken.  panic_cpu isn't changed by the loop,
>>> and the compiler thinks that it isn't changed by anything else either, so
>>> the compiler may reduce the loop to:
>>>
>>> % 	if (panic_cpu != NOCPU)
>>> % 		for (;;)
>>> % 			;	/* nothing */
>>
>> Yes, it's exactly the last loop that had the problem.
>> On amd64 with -O2:
>> 	.loc 1 544 0
>> 	movl	panic_cpu(%rip), %eax
>> .LVL134:
>> 	.p2align 4,,7
>> .L210:
>> 	cmpl	$255, %eax
>> 	jne	.L210
>> 	jmp	.L225
>>
>>> except I've seen claims that even an endless for loop can be optimized
>>> to nothing.  Declaring panic_cpu as volatile prevents the compiler doing
>>> this, but I think it is insufficient.  We really do want to see panic_cpu
>>> changed by other CPUs, and what is the point of atomic_load_acq*() if not
>>> to use for this -- if declaring things volatile were sufficient, then we
>>> could just use *(volatile *)&var.

atomic_load/store_acq/rel*() on i386
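(To make the quoted idiom concrete: a minimal sketch only, not committed
code, assuming panic_cpu is a plain u_int:)

% 	/*
% 	 * The volatile-qualified access forces the compiler to reload
% 	 * panic_cpu on every iteration, so the loop cannot be reduced
% 	 * to "for (;;);".  It provides no CPU-level (fence) ordering.
% 	 */
% 	while (*(volatile u_int *)&panic_cpu != NOCPU)
% 		;	/* nothing */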
>>
>> I discussed this with kib and the idea is that an atomic operation is not
>> needed in that place and that volatile is sufficient, because we don't
>> need barrier semantics there and only want to ensure that the variable
>> is reloaded from memory.
>
> If I understand correctly, without an acquire barrier the variable is not
> guaranteed to be loaded from memory; it can end up with some stale value
> from the CPU's cache or pipeline.  If an incorrect value is checked in
> the first "if" statement, it can lead to skipping all the atomic
> synchronization.  The same applies to writing the variable: without a
> barrier, the new value may be written to some local CPU cache and not be
> seen by readers until it is flushed from the cache.

That's what I was thinking.  Now I think the only problem is that not
locking everything just makes the correctness very hard to understand,
and it shouldn't be done, since this code is as far as possible from
needing to be efficient.

If the access in the inner while loop sees a stale value, then there is
no problem, provided the value eventually becomes unstale or just changes
to a different stale value -- then we exit the inner loop and go back to
the locked access in the outer loop.

The access before the outer loop can at most see a stale value, which
doesn't matter.  Only the current CPU can set panic_cpu to
PCPU_GET(cpuid).  When panic() is called with panic_cpu having this
value, it means that the panic is recursive (since panic_cpu is initially
NOCPU, and the current CPU doesn't/shouldn't change it to anything but
itself).  For recursive panics, the current CPU should already have
locked everything and stopped other CPUs.  It doesn't need to lock
against itself, and it needs the test using the first access to avoid
deadlocking against itself.

When panic() is called with panic_cpu = NOCPU and this value is not
stale, then there is no problem.  When panic() is called with panic_cpu =
NOCPU and this value is stale, it means that one CPU entered and left
panic() and set the value to NOCPU, but another CPU has entered panic()
and set the value to !NOCPU and we haven't seen this yet.  Then we reach
the outer while loop expecting to acquire the lock, but when we try to
acquire it we don't get it, and there is no problem.  Finally, when
panic() is called with panic_cpu = another_cpu, there is similarly no
problem, whether or not this value is stale.  Then we reach the outer
while loop expecting not to acquire the lock, but we may sometimes
acquire it, even on the first try, if the value was stale or if it just
became stale after we read it.

Perhaps my rule for locking read accesses allows/requires the complexity
in the above.  The rule is: never lock read accesses, since the value may
be stale after you read it, whether or not you locked the access.  It is
only in delicate cases, where you need a value that is non-stale at the
time of the access but are happy with a value that is stale later, that
locking a read access is useful.

The panic locking could be written with only standard complications,
using a recursive mutex (or a trylock), except that then it might have to
fight with the mutex implementation doing more than the simple spin:

	mtx_lock_spin(&panic_mtx);
	/*
	 * Already have a problem -- the implementation disables
	 * interrupts, which we should want, but since we do wrong things
	 * like syncing with normal i/o expected to work, we probably
	 * need interrupts.
	 */
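(For concreteness, a sketch of the setup such a variant would need --
panic_mtx is hypothetical, not existing code; mtx(9) spin mutexes created
with MTX_RECURSE may be re-acquired by their owner:)

% #include <sys/param.h>
% #include <sys/kernel.h>
% #include <sys/lock.h>
% #include <sys/mutex.h>
%
% /* Hypothetical: a recursable spin mutex guarding panic(). */
% static struct mtx panic_mtx;
% MTX_SYSINIT(panic_mtx, &panic_mtx, "panic", MTX_SPIN | MTX_RECURSE);
%
% 	/*
% 	 * In panic(): MTX_RECURSE lets the owning CPU re-enter on a
% 	 * recursive panic instead of deadlocking.  mtx_lock_spin() also
% 	 * disables interrupts, as noted in the comment above.
% 	 */
% 	mtx_lock_spin(&panic_mtx);

One concrete fight, beyond the interrupt disabling: the spin-mutex
implementation itself panics if it decides the lock has been held too
long, which is the last thing wanted inside panic().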
It's interesting that the implementation of mutexes doesn't optimize for
the unusual case of a mutex being recursed on, but we do that here.  We
could be more like the mutex implementation and do:

	while (atomic_cmpset_int(&panic_cpu, NOCPU, PCPU_GET(cpuid)) == 0) {
		if (panic_cpu == PCPU_GET(cpuid))
			break;		/* recursive */
		while (panic_cpu != NOCPU)
			continue;
	}

>
> This also applies to r213736.
>
>> atomic_store_rel at the end seems to be needed because of the
>> inter-dependency between panicstr and panic_cpu.  That is, we want
>> changes to panicstr to become visible at the same time as (before,
>> actually) changes to panic_cpu are visible.

There are similar but larger complications for debugger entry.  Debugger
entry has the additional problem that a thundering herd of CPUs may enter
at a breakpoint, or at any debugger trap where handling is necessary but
normal handling is not possible.  Debugger entry has the additional
non-problem that it at least tries to stop other CPUs before proceeding
far.  My i386 ddb entry starts a bit like the above:

% 	ef = read_eflags();
% 	disable_intr();
% 	if (atomic_cmpset_int(&kdb_trap_lock, NOCPU, PCPU_GET(cpuid)) == 0 &&
% 	    kdb_trap_lock != PCPU_GET(cpuid)) {

It has the anti-recursion detection in the same if statement as the
cmpset.

% 		while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
% 			;
% 		db_printf(
% 		    "concurrent ddb entry: type %d trap, code=%x cpu=%d\n",
% 		    type, code, PCPU_GET(cpuid));
% 		atomic_store_rel_int(&output_lock, 0);

Another lock, to serialize output for debugging this stuff.  Thundering
herds at breakpoints are very easy to arrange, and they will interleave
their output almost perfectly with some console drivers (say, a serial
console at 50 bps, so that you can easily watch the problem; this is also
a good test for console drivers).  Elsewhere I posted a similar method
for properly serializing plain printf output.  That bounds the time for
the output, so it might not work on 50 bps consoles.  The above has no
bound and depends on db_printf() never blocking endlessly, even under
fire from places like here.

% 		if (type == T_BPTFLT)
% 			regs->tf_eip--;

This backs off from the breakpoint.  It is not MI enough.  Some nut might
have put a 2-byte breakpoint instruction in the instruction stream.  Then
the breakpoint won't belong to us, so not handling it here would be
correct.  Backing off 2 would work OK too (it allows the instruction to
trap again).  Backing off 1 backs into it and corrupts it, but that works
OK in the same way as when the breakpoint is not ours.  Backing off 1
here handles the case where the breakpoint is ours now but will go away
because another ddb entry proceeds and removes it.

% 		else {
% 			while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
% 				;
% 			db_printf(
% "concurrent ddb entry on non-breakpoint: too hard to handle properly\n");
% 			atomic_store_rel_int(&output_lock, 0);
% 		}

More debugging.

% 		while (atomic_load_acq_int(&kdb_trap_lock) != NOCPU)
% 			;

This corresponds to the inner while loop in panic().  According to the
above discussion, I don't need the atomic op here.  I should use pause()
here (see the sketch after this listing).

% 		write_eflags(ef);

panic() doesn't disable interrupts, especially in the inner while loop.
Interrupts must be kept disabled to avoid them doing bad things,
including recursive debugger and/or panic entries, although masking of
all interrupts breaks sending of most IPIs.

% 		return (1);
% 	}
% 	// Proceed to stopping CPUs.  Since some CPUs may be looping with
% 	// interrupts masked, either in the above loop or accidentally, the
% 	// old method of sending maskable IPIs may block us endlessly.  I
% 	// use a hack to work around this.
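(A sketch of that pause() point, assuming "pause()" means the x86 PAUSE
instruction as wrapped by FreeBSD's cpu_spinwait(), not the pause(9)
sleep, which can't be used with interrupts disabled; the volatile cast
keeps the compiler reloading the variable, per the discussion above:)

% 		while (*(volatile u_int *)&kdb_trap_lock != NOCPU)
% 			cpu_spinwait();	/* x86 PAUSE: be easy on the bus */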
Bruce