From: David Xu <davidxu@freebsd.org>
Date: Tue, 03 Dec 2013 10:23:49 +0800
To: Mikolaj Golub
Cc: freebsd-stable@freebsd.org, Pete French
Subject: Re: Hast locking up under 9.2
Message-ID: <529D40B5.5040605@freebsd.org>
In-Reply-To: <20131125094111.GA22396@gmail.com>

On 2013/11/25 17:41, Mikolaj Golub wrote:
> On Mon, Nov 25, 2013 at 09:32:23AM +0100, Pawel Jakub Dawidek wrote:
>> On Sat, Nov 23, 2013 at 11:59:51PM +0200, Mikolaj Golub wrote:
>>> On Fri, Nov 22, 2013 at 11:18:29AM +0000, Pete French wrote:
>>>
>>>> "Assertion failed: (!hio->hio_done), function write_complete, file
>>>> /usr/src/sbin/hastd/primary.c, line 1130."
>>>
>>> It looks like the use of write_complete() (which should be called
>>> once per write request) is racy for memsync.
>>>
>>> Consider the following scenario:
>>>
>>> 1) remote_recv_thread: memsync ack received, refcount -> 2;
>>> 2) local_send_thread: local write completed, refcount -> 1, entering
>>>    write_complete();
>>> 3) remote_recv_thread: memsync fin received, refcount -> 0, hio is
>>>    moved to the done queue; ggate_send_thread gets the hio, checks
>>>    !hio->hio_done and (if local_send_thread is still inside
>>>    write_complete()) enters write_complete() too.
>>
>> I don't see how that is possible. The write_complete() function is
>> called only when hio_countdown goes from 2 to 1, and because this is
>> an atomic operation it can only happen in one thread. Can you
>> elaborate on how calling write_complete() concurrently for the same
>> request is possible?
>
> Yes, hio_countdown protects against write_complete() being called
> concurrently by the "component" threads. But it may also be called by
> ggate_send_thread():
>
>     if (!hio->hio_done)
>         write_complete(res, hio);
>
> So if write_complete() has already started executing in
> local_send_thread(), and at that time a memsync fin is received, the
> request is moved to ggate_send_thread(), and write_complete() can be
> reentered while it is still in progress in local_send_thread()
> (hio_done is set only on exiting write_complete()).
>
> That is why statement (3) of my patch says that write_complete() in
> the component threads is called only before releasing hio_countdown.
> Otherwise you are not protected from running it simultaneously in
> ggate_send_thread(), or even from the hio being moved to the free
> queue before write_complete() has finished in local_send_thread().
> And so hio_countdown cannot be used for detecting the current memsync
> state.
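To make the interleaving above concrete, here is a minimal standalone
sketch of the pattern (hypothetical, simplified types and helper names;
not the actual primary.c code, and the done-queue hand-off to
ggate_send_thread is inlined into the second completion path for
brevity):

/*
 * Sketch of the memsync completion race.  write_complete() sets
 * hio_done only on exit, so the !hio_done check in one thread can pass
 * while write_complete() is still running in another thread.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

struct hio {
        atomic_int      hio_countdown;
        bool            hio_done;
};

static void
write_complete(struct hio *hio)
{
        /* ... complete the GEOM Gate request ... */
        usleep(1000);           /* widen the race window for the demo */
        hio->hio_done = true;   /* set only on exit -- hence the race */
}

static void
component_done(struct hio *hio)
{
        switch (atomic_fetch_sub(&hio->hio_countdown, 1)) {
        case 2:
                /* countdown 2 -> 1: first completion, call
                 * write_complete() from the component thread. */
                write_complete(hio);
                break;
        case 1:
                /* countdown 1 -> 0: the hio moves to the done queue,
                 * where ggate_send_thread performs the racy check
                 * (inlined here for brevity). */
                if (!hio->hio_done)
                        write_complete(hio);    /* possible reentry */
                break;
        }
}

static void *
component_thread(void *arg)
{
        component_done(arg);
        return (NULL);
}

int
main(void)
{
        /* The memsync ack already dropped the countdown from 3 to 2;
         * the two threads stand in for local_send_thread and
         * remote_recv_thread. */
        struct hio hio = { .hio_countdown = 2, .hio_done = false };
        pthread_t lt, rt;

        pthread_create(&lt, NULL, component_thread, &hio);
        pthread_create(&rt, NULL, component_thread, &hio);
        pthread_join(lt, NULL);
        pthread_join(rt, NULL);
        return (0);
}

Because hio_done is written only at the end of write_complete(), the
first thread can still be inside write_complete() when the second
thread's !hio_done check passes, which is exactly what trips the
assertion quoted above.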
Also I found the following code in hastd:

#define QUEUE_INSERT1(hio, name, ncomp) do {                            \
        bool _wakeup;                                                   \
                                                                        \
        mtx_lock(&hio_##name##_list_lock[(ncomp)]);                     \
        _wakeup = TAILQ_EMPTY(&hio_##name##_list[(ncomp)]);             \
        TAILQ_INSERT_TAIL(&hio_##name##_list[(ncomp)], (hio),           \
            hio_next[(ncomp)]);                                         \
        hio_##name##_list_size[(ncomp)]++;                              \
        mtx_unlock(&hio_##name##_list_lock[ncomp]);                     \
        if (_wakeup)                                                    \
                cv_broadcast(&hio_##name##_list_cond[(ncomp)]);         \
} while (0)

Our thread library optimizes condition variable wait/signal lock
contention: we have implemented wait-queue morphing, so there is no
need to unlock the mutex first and only then call cv_broadcast. Code
like the above actually increases spurious wakeups and can be worse in
some cases. For example: before cv_broadcast is called, the producer
thread is preempted; a consumer thread removes the elements from the
queue and sleeps again; the producer thread is then scheduled again
and blindly calls cv_broadcast, so a consumer thread wakes up, finds
nothing in the queue, and sleeps again. I think the following code is
enough for our thread library, and works better:

#define QUEUE_INSERT1(hio, name, ncomp) do {                            \
        mtx_lock(&hio_##name##_list_lock[(ncomp)]);                     \
        if (TAILQ_EMPTY(&hio_##name##_list[(ncomp)]))                   \
                cv_broadcast(&hio_##name##_list_cond[(ncomp)]);         \
        TAILQ_INSERT_TAIL(&hio_##name##_list[(ncomp)], (hio),          \
            hio_next[(ncomp)]);                                         \
        hio_##name##_list_size[(ncomp)]++;                              \
        mtx_unlock(&hio_##name##_list_lock[ncomp]);                     \
} while (0)
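The same idea in a minimal standalone producer/consumer sketch with
plain pthreads (hypothetical names; queue_len stands in for the hio
list, this is not hastd code). The broadcast happens while the mutex is
still held, which wait-queue morphing makes cheap: the woken threads
are moved onto the mutex's wait queue instead of racing for the lock.

#include <pthread.h>

static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   queue_cond = PTHREAD_COND_INITIALIZER;
static int              queue_len;      /* stands in for the hio list */

/* Producer: signal before inserting, with the mutex still held. */
static void
queue_insert(void)
{
        pthread_mutex_lock(&queue_lock);
        if (queue_len == 0)     /* waiters can exist only when empty */
                pthread_cond_broadcast(&queue_cond);
        queue_len++;
        pthread_mutex_unlock(&queue_lock);
}

/* Consumer: the usual wait loop. */
static void *
consumer(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&queue_lock);
        while (queue_len == 0)
                pthread_cond_wait(&queue_cond, &queue_lock);
        queue_len--;
        pthread_mutex_unlock(&queue_lock);
        return (NULL);
}

int
main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, consumer, NULL);
        queue_insert();
        pthread_join(t, NULL);
        return (0);
}

Because the emptiness check and the broadcast now happen atomically
with respect to consumers, a consumer can no longer drain the queue and
go back to sleep between the two, which eliminates the spurious-wakeup
sequence described above.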