From owner-freebsd-hackers@FreeBSD.ORG  Wed May 29 07:27:48 2013
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 1780470
 for <freebsd-hackers@freebsd.org>; Wed, 29 May 2013 07:27:48 +0000 (UTC)
 (envelope-from bright@mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
 by mx1.freebsd.org (Postfix) with ESMTP id EC161F3
 for <freebsd-hackers@freebsd.org>; Wed, 29 May 2013 07:27:47 +0000 (UTC)
Received: from Alfreds-MacBook-Pro-9.local
 (c-67-180-208-218.hsd1.ca.comcast.net [67.180.208.218])
 by elvis.mu.org (Postfix) with ESMTPSA id 71CB41A3C1D
 for <freebsd-hackers@freebsd.org>; Wed, 29 May 2013 00:27:46 -0700 (PDT)
Message-ID: <51A5ADF2.3070004@mu.org>
Date: Wed, 29 May 2013 00:27:46 -0700
From: Alfred Perlstein <bright@mu.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
 rv:17.0) Gecko/20130509 Thunderbird/17.0.6
MIME-Version: 1.0
To: freebsd-hackers@freebsd.org
Subject: Re: please review, patch for lost camisr
References: <51A44B0C.8010908@ixsystems.com>
 <201305281204.14146.jhb@freebsd.org> <51A505A5.7030105@ixsystems.com>
 <201305281613.32414.jhb@freebsd.org> <51A514F5.9050405@ixsystems.com>
 <20130529050835.GP3047@kib.kiev.ua> <51A592BC.1050100@mu.org>
 <20130529071645.GR3047@kib.kiev.ua>
In-Reply-To: <20130529071645.GR3047@kib.kiev.ua>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 May 2013 07:27:48 -0000

On 5/29/13 12:16 AM, Konstantin Belousov wrote:
> On Tue, May 28, 2013 at 10:31:40PM -0700, Alfred Perlstein wrote:
>> On 5/28/13 10:08 PM, Konstantin Belousov wrote:
>>> On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote:
>>>> [[  moved to hackers@ from private mail. ]]
>>>>
>>>> On 5/28/13 1:13 PM, John Baldwin wrote:
>>>>> On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
>>>>>> On 5/28/13 9:04 AM, John Baldwin wrote:
>>>>>>> On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> I had a talk with Nathan Whitehorn about the camisr issue.  The issue we
>>>>>>>> are seeing we mostly know, but to summarize, we are losing the camisr
>>>>>>>> signal and the camisr is not being run.
>>>>>>>>
>>>>>>>> I gave him a summary of what we have been seeing and pointed him to the
>>>>>>>> code I am concerned about here:
>>>>>>>> http://pastebin.com/tLKr7mCV  (this is inside of kern_intr.c).
>>>>>>>>
>>>>>>>> What I think that is happening is that the setting of it_need to 0
>>>>>>>> inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
>>>>>>>> correctly and it is being delayed until AFTER the call to
>>>>>>>> ithread_execute_handlers() right below the atomic_store_rel_int().
>>>>>>> This seems highly unlikely, to the extent that if this were true all our
>>>>>>> locking primitives would be broken.  The store_rel is actually a release
>>>>>>> barrier which acts like more a read/write fence.  No memory accesses (read or
>>>>>>> write) from before the release can be scheduled after the associated store,
>>>>>>> either by the compiler or CPU.  That is what Konstantin is referring to in his
>>>>>>> commit when he says "release" semantics".
>>>>>> Yes, that makes sense, however does it specify that the writes *must*
>>>>>> occur at that *point*?  If it only enforces ordering then we may have
>>>>>> some issue, specifically because the setting of it to '1' inside of
>>>>>> intr_event_schedule_thread has no barrier other than the acq semantics
>>>>>> of the thread lock.  I am wondering what is forcing out the '1' there.
>>>>> Nothing ever forces writes.  You would have to invalidate the cache to do that
>>>>> and that is horribly expensive.  It is always only about ordering and knowing
>>>>> that if you can complete another operation on the same "cookie" variable with
>>>>> acquire semantics that earlier writes will be visible.
>>>> By cookie, you mean a specific memory address, basically a lock? This is
>>>> starting to reinforce my suspicions as the setting of it_need is done
>>>> with release semantics, however the acq on the other CPU is done on the
>>>> thread lock.  Maybe that is irrelevant.  We will find out shortly.
>>>>
>>>>>> See below as I think we have proof that this is somehow happening.
>>>>> Having ih_need of 1 and it_need of 0 is certainly busted.  The simplest fix
>>>>> is probably to stop using atomics on it_need and just grab the thread lock
>>>>> in the main ithread loop and hold it while checking and clearing it_need.
>>>>>
>>>> OK, we have some code that will either prove this, or perturb the memory
>>>> ordering enough to make the bug go away, or prove this assertion wrong.
>>>>
>>>> We will update on our findings hopefully in the next few days.
>>> IMO the read of it_need in the 'while (ithd->it_need)' should
>>> have acquire semantic, otherwise the future reads in the
>>> ithread_execute_handlers(), in particular, of the ih_need, could pass
>>> the read of it_need and cause the situation you reported.  I do not
>>> see any acquire barrier between a condition in the while() statement
>>> and the read of ih_need in the execute_handlers().
>>>
>>> It is probably true that the issue you see was caused by r236456, in the
>>> sense that implicitely locked xchgl instruction on x86 has a full barrier
>>> semantic.  As result, the store_rel() was actually an acquire too, making
>>> this reordering impossible.  I argue that this is not a bug in r236456,
>>> but the issue in the kern_intr.c.
>> If I remember the code correctly that would probably explain why we see
>> it only on 9.1 system.
>>> On the other hand, the John' suggestion to move the manipulations of
>>> it_need under the lock is probably the best anyway.
>>>
>> I was wondering if it would be lower latency to maintain it_need,
>> however to keep another variable it_needlocked under the thread lock.
>> This would result in potential superfluous interrupts, however under
>> load you would allow the ithread to loop without taking the thread lock
>> some number of times.
>>
>> I am not really sure if this is really worth the optimization
>> (especially since it can result in superfluous interrupts) however it
>> may reduce latency and that might be important.
>>
>> Is there some people that I can pass the patch onto for help with
>> performance once we confirm that this is the actual bug?   We can do
>> internal testing, but I am worried about regressing performance of any
>> form of IO for the kernel.
>>
>> I'll show the patch soon.
>>
>> Thank you for the information.  This is promising.
> Well, if you and I are right, the minimal patch should be
>
> diff --git a/sys/kern/kern_intr.c b/sys/kern/kern_intr.c
> index 8d63c9b..7c21015 100644
> --- a/sys/kern/kern_intr.c
> +++ b/sys/kern/kern_intr.c
> @@ -1349,7 +1349,7 @@ ithread_loop(void *arg)
>   		 * we are running, it will set it_need to note that we
>   		 * should make another pass.
>   		 */
> -		while (ithd->it_need) {
> +		while (atomic_load_acq_int(&ithd->it_need)) {
>   			/*
>   			 * This might need a full read and write barrier
>   			 * to make sure that this write posts before any
>

OK we can try this.

I've been pretty good at locking when using mutexes, but when we get 
into the atomic ops like this it gets a little tough for me to follow 
without extensive research.  I know that the signalling thread 
(swi_sched caller) does not use any atomic ops... is this OK?

-Alfred