From: David Xu <davidxu@freebsd.org>
Date: Tue, 03 Dec 2013 10:23:49 +0800
To: Mikolaj Golub
Cc: freebsd-stable@freebsd.org, Pete French
Subject: Re: Hast locking up under 9.2
Message-ID: <529D40B5.5040605@freebsd.org>
In-Reply-To: <20131125094111.GA22396@gmail.com>

On 2013/11/25 17:41, Mikolaj Golub wrote:
> On Mon, Nov 25, 2013 at 09:32:23AM +0100, Pawel Jakub Dawidek wrote:
>> On Sat, Nov 23, 2013 at 11:59:51PM +0200, Mikolaj Golub wrote:
>>> On Fri, Nov 22, 2013 at 11:18:29AM +0000, Pete French wrote:
>>>
>>>> "Assertion failed: (!hio->hio_done), function write_complete, file
>>>> /usr/src/sbin/hastd/primary.c, line 1130."
>>>
>>> It looks like the use of write_complete() (which should be called
>>> once per write request) is racy for memsync.
>>>
>>> Consider the following scenario:
>>>
>>> 1) remote_recv_thread: memsync ack received, refcount -> 2;
>>> 2) local_send_thread: local write completed, refcount -> 1, entering
>>>    write_complete();
>>> 3) remote_recv_thread: memsync fin received, refcount -> 0, hio is
>>>    moved to the done queue; ggate_send_thread gets the hio, checks
>>>    !hio->hio_done and (if local_send_thread is still inside
>>>    write_complete()) enters write_complete() too.
>>
>> I don't see how that is possible. The write_complete() function is
>> called only when hio_countdown goes from 2 to 1, and because this is
>> an atomic operation it can only happen in one thread. Can you
>> elaborate on how calling write_complete() concurrently for the same
>> request is possible?
>
> Yes, hio_countdown protects against write_complete() being called
> concurrently by the "component" threads. But it may also be called by
> ggate_send_thread():
>
>     if (!hio->hio_done)
>         write_complete(res, hio);
>
> So if write_complete() has already started executing in
> local_send_thread(), and at that time a memsync fin is received, the
> request is moved to ggate_send_thread(), and write_complete() can be
> reentered while it is still in progress in local_send_thread()
> (hio_done is set only on exiting write_complete()).
>
> That is why statement (3) of my patch says that write_complete() in
> the component threads is called only before releasing hio_countdown.
> Otherwise you are not protected from running it simultaneously in
> ggate_send_thread(), or even from the hio being moved to the free
> queue before write_complete() has finished in local_send_thread().
> And so hio_countdown cannot be used for detecting the current memsync
> state.
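To make the interleaving above concrete, here is a minimal standalone
sketch of the pattern (hypothetical, simplified types and helper names;
not the actual primary.c code, and the done-queue hand-off to
ggate_send_thread is inlined into the second completion path for
brevity):

/*
 * Sketch of the memsync completion race.  write_complete() sets
 * hio_done only on exit, so the !hio_done check in one thread can pass
 * while write_complete() is still running in another thread.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

struct hio {
        atomic_int      hio_countdown;
        bool            hio_done;
};

static void
write_complete(struct hio *hio)
{
        /* ... complete the GEOM Gate request ... */
        usleep(1000);           /* widen the race window for the demo */
        hio->hio_done = true;   /* set only on exit -- hence the race */
}

static void
component_done(struct hio *hio)
{
        switch (atomic_fetch_sub(&hio->hio_countdown, 1)) {
        case 2:
                /* countdown 2 -> 1: first completion, call
                 * write_complete() from the component thread. */
                write_complete(hio);
                break;
        case 1:
                /* countdown 1 -> 0: the hio moves to the done queue,
                 * where ggate_send_thread performs the racy check
                 * (inlined here for brevity). */
                if (!hio->hio_done)
                        write_complete(hio);    /* possible reentry */
                break;
        }
}

static void *
component_thread(void *arg)
{
        component_done(arg);
        return (NULL);
}

int
main(void)
{
        /* The memsync ack already dropped the countdown from 3 to 2;
         * the two threads stand in for local_send_thread and
         * remote_recv_thread. */
        struct hio hio = { .hio_countdown = 2, .hio_done = false };
        pthread_t lt, rt;

        pthread_create(&lt, NULL, component_thread, &hio);
        pthread_create(&rt, NULL, component_thread, &hio);
        pthread_join(lt, NULL);
        pthread_join(rt, NULL);
        return (0);
}

Because hio_done is written only at the end of write_complete(), the
first thread can still be inside write_complete() when the second
thread's !hio_done check passes, which is exactly what trips the
assertion quoted above.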
Also I found the following code in hastd:

#define QUEUE_INSERT1(hio, name, ncomp) do {                            \
        bool _wakeup;                                                   \
                                                                        \
        mtx_lock(&hio_##name##_list_lock[(ncomp)]);                     \
        _wakeup = TAILQ_EMPTY(&hio_##name##_list[(ncomp)]);             \
        TAILQ_INSERT_TAIL(&hio_##name##_list[(ncomp)], (hio),           \
            hio_next[(ncomp)]);                                         \
        hio_##name##_list_size[(ncomp)]++;                              \
        mtx_unlock(&hio_##name##_list_lock[ncomp]);                     \
        if (_wakeup)                                                    \
                cv_broadcast(&hio_##name##_list_cond[(ncomp)]);         \
} while (0)

Our thread library optimizes condition variable wait/signal lock
contention: we have implemented wait-queue morphing, so there is no
need to unlock the mutex first and only then call cv_broadcast. Code
like the above actually increases spurious wakeups and can be worse in
some cases. For example: before cv_broadcast is called, the producer
thread is preempted; a consumer thread removes the elements from the
queue and sleeps again; the producer thread is then scheduled again
and blindly calls cv_broadcast, so a consumer thread wakes up, finds
nothing in the queue, and sleeps again. I think the following code is
enough for our thread library, and works better:

#define QUEUE_INSERT1(hio, name, ncomp) do {                            \
        mtx_lock(&hio_##name##_list_lock[(ncomp)]);                     \
        if (TAILQ_EMPTY(&hio_##name##_list[(ncomp)]))                   \
                cv_broadcast(&hio_##name##_list_cond[(ncomp)]);         \
        TAILQ_INSERT_TAIL(&hio_##name##_list[(ncomp)], (hio),          \
            hio_next[(ncomp)]);                                         \
        hio_##name##_list_size[(ncomp)]++;                              \
        mtx_unlock(&hio_##name##_list_lock[ncomp]);                     \
} while (0)
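The same idea in a minimal standalone producer/consumer sketch with
plain pthreads (hypothetical names; queue_len stands in for the hio
list, this is not hastd code). The broadcast happens while the mutex is
still held, which wait-queue morphing makes cheap: the woken threads
are moved onto the mutex's wait queue instead of racing for the lock.

#include <pthread.h>

static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   queue_cond = PTHREAD_COND_INITIALIZER;
static int              queue_len;      /* stands in for the hio list */

/* Producer: signal before inserting, with the mutex still held. */
static void
queue_insert(void)
{
        pthread_mutex_lock(&queue_lock);
        if (queue_len == 0)     /* waiters can exist only when empty */
                pthread_cond_broadcast(&queue_cond);
        queue_len++;
        pthread_mutex_unlock(&queue_lock);
}

/* Consumer: the usual wait loop. */
static void *
consumer(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&queue_lock);
        while (queue_len == 0)
                pthread_cond_wait(&queue_cond, &queue_lock);
        queue_len--;
        pthread_mutex_unlock(&queue_lock);
        return (NULL);
}

int
main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, consumer, NULL);
        queue_insert();
        pthread_join(t, NULL);
        return (0);
}

Because the emptiness check and the broadcast now happen atomically
with respect to consumers, a consumer can no longer drain the queue and
go back to sleep between the two, which eliminates the spurious-wakeup
sequence described above.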