From owner-freebsd-stable@FreeBSD.ORG Sat Jan 28 23:04:28 2012
Return-Path:
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id 2A07D106566B
    for ; Sat, 28 Jan 2012 23:04:28 +0000 (UTC)
    (envelope-from to.my.trociny@gmail.com)
Received: from mail-bk0-f54.google.com (mail-bk0-f54.google.com [209.85.214.54])
    by mx1.freebsd.org (Postfix) with ESMTP id 9E12B8FC13
    for ; Sat, 28 Jan 2012 23:04:27 +0000 (UTC)
Received: by bkbc12 with SMTP id c12so77443bkb.13
    for ; Sat, 28 Jan 2012 15:04:26 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
    d=gmail.com; s=gamma;
    h=from:to:cc:subject:references:x-comment-to:sender:date:in-reply-to
    :message-id:user-agent:mime-version:content-type;
    bh=XauBNaq4vSP1Y5i7/FW6ii7LIcDyWAimeMsfnyEGwuo=;
    b=Eht8RinZ4s3TF6UdrSXwNz7u0YF0CpCrNywDdyhHcdVkiD+nAqWWU83IAOISuzMk+m
    OYTAYiPinQpPU4Pm1FEMVAt5iykVQd6dB0t/kVll3FlLi1o/bKEEHGAuB9dKM9aRiRLO
    6tIyQiUazg5/mO+2p+NEXLIxFv5vXrMsCTbug=
Received: by 10.205.122.134 with SMTP id gg6mr5825540bkc.41.1327790140419;
    Sat, 28 Jan 2012 14:35:40 -0800 (PST)
Received: from localhost ([95.69.173.122])
    by mx.google.com with ESMTPS id 20sm15276226bkr.0.2012.01.28.14.35.37
    (version=TLSv1/SSLv3 cipher=OTHER);
    Sat, 28 Jan 2012 14:35:38 -0800 (PST)
From: Mikolaj Golub
To: Artem Kajalainen
References:
X-Comment-To: Artem Kajalainen
Sender: Mikolaj Golub
Date: Sun, 29 Jan 2012 00:35:35 +0200
In-Reply-To: (Artem Kajalainen's message of "Wed, 18 Jan 2012 20:23:25 +0200")
Message-ID: <86ipjvbglk.fsf@kopusha.home.net>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Cc: Pawel Jakub Dawidek , freebsd-stable@freebsd.org
Subject: Re: problems with hast
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 28 Jan 2012 23:04:28 -0000

--=-=-=

Hi,

On Wed, 18 Jan 2012 20:23:25 +0200 Artem Kajalainen wrote:

 AK> Hello,

 AK> I'm trying to setup hastd on two servers and got error, which I can't
 AK> understand. Box is running as primary, then i reboot it, another box
 AK> get primary role by carp events, then 1st box at boot tries to set up
 AK> primary role on own hast instance and fails with this:

 AK> Jan 18 22:13:03 gw_chlb_2 hastd[1387]: [storage0] (primary)
 AK> G_GATE_CMD_DONE failed: No such file or directory.
 AK> Jan 18 22:13:08 gw_chlb_2 hastd[1004]: [storage0] (primary) Worker
 AK> process exited ungracefully (pid=1387, exitcode=71).

 AK> I thought that geom_gate module can be problem, so i compiled it in
 AK> kernel. As you can see - it doesn't help. Both servers are
 AK> FreeBSD9.0-stable, updated 1 week ago. Hastd use whole disk. More info
 AK> from hastd:

 AK> gw_chlb_2# hastd -dF -c /etc/hast.conf
 AK> [INFO] Started successfully, running protocol version 1.
 AK> [DEBUG][1] Listening on control address /var/run/hastctl.
 AK> [INFO] Listening on address 192.168.0.1:8457.
 AK> [INFO] [storage0] (init) Role changed to primary.
 AK> [DEBUG][1] [storage0] (primary) Obtained info about /dev/ada2.
 AK> [DEBUG][1] [storage0] (primary) Locked /dev/ada2.
 AK> [INFO] [storage0] (primary) Device hast/storage0 created.
 AK> [DEBUG][1] [storage0] (primary) Privileges successfully dropped using
 AK> jail+setgid+setuid.
 AK> [INFO] [storage0] (primary) Privileges successfully dropped.
 AK> [INFO] [storage0] (primary) Connected to tcp4://192.168.0.2.
 AK> [INFO] [storage0] (primary) Synchronization started. 6.0MB to go.
 AK> [ERROR] [storage0] (primary) G_GATE_CMD_DONE failed: No such file or directory.
 AK> [INFO] [storage0] (primary) Received cancel from the kernel, exiting.
 AK> [DEBUG][1] Unable to receive event header: Socket is not connected.
 AK> [ERROR] [storage0] (primary) Worker process exited ungracefully
 AK> (pid=1452, exitcode=71).
 AK> [INFO] [storage0] (primary) Changing resource role back to init.

 AK> Any thoughts?

Sorry, Artem, I only read your email today.

After investigating, it looks like since r226859, when 'async' mode was
added, we have two issues with synchronization from secondary to primary
(normally a very rare case):

1) When synchronization from secondary to primary is running and the
primary gets a READ request, the request should be sent to the secondary,
but it is actually lost. As a result, the READ operation gets stuck. After
the synchronization is complete, subsequent READ requests, which can now be
served by the primary, work OK.

2) In async mode, for synchronization requests, the write_complete()
function, which sends the G_GATE_CMD_DONE command to ggate, is called
twice, and the second call fails.

Artem, did you run async mode? If you did, then I suppose you observed the
second issue.

Could you please try the attached patch?

--
Mikolaj Golub

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=hastd.remote_read.patch

Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c	(revision 230661)
+++ sbin/hastd/primary.c	(working copy)
@@ -1255,7 +1255,7 @@ ggate_recv_thread(void *arg)
 		pjdlog_debug(2,
 		    "ggate_recv: (%p) Moving request to the send queues.", hio);
 		refcount_init(&hio->hio_countdown, ncomps);
-		for (ii = ncomp; ii < ncomps; ii++)
+		for (ii = ncomp; ncomps != 0; ncomps--, ii++)
 			QUEUE_INSERT1(hio, send, ii);
 	}
 	/* NOTREACHED */
@@ -1326,7 +1326,7 @@ local_send_thread(void *arg)
 		} else {
 			hio->hio_errors[ncomp] = 0;
 			if (hio->hio_replication ==
-			    HAST_REPLICATION_ASYNC) {
+			    HAST_REPLICATION_ASYNC && !ISSYNCREQ(hio)) {
 				ggio->gctl_error = 0;
 				write_complete(res, hio);
 			}
--=-=-=--
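
The loop change in the first hunk of the patch can be checked in isolation,
outside hastd. What follows is only a minimal sketch, not hastd code. It
assumes that a READ redirected to the secondary is expressed as start index
ncomp = 1 with ncomps = 1 (one target component), which is what the
description of issue 1 suggests but is not shown in the patch context. The
function names queued_old() and queued_new() are hypothetical, and the
counter stands in for QUEUE_INSERT1(hio, send, ii).

```c
#include <stdio.h>

/*
 * Old loop form: ncomps acts as an exclusive upper bound on the index.
 * Returns how many send queues the request would be inserted into.
 */
static unsigned int
queued_old(unsigned int ncomp, unsigned int ncomps)
{
	unsigned int ii, n = 0;

	for (ii = ncomp; ii < ncomps; ii++)
		n++;	/* stands in for QUEUE_INSERT1(hio, send, ii) */
	return (n);
}

/*
 * Patched loop form: ncomps acts as a count of components, starting at
 * index ncomp.
 */
static unsigned int
queued_new(unsigned int ncomp, unsigned int ncomps)
{
	unsigned int ii, n = 0;

	for (ii = ncomp; ncomps != 0; ncomps--, ii++)
		n++;	/* stands in for QUEUE_INSERT1(hio, send, ii) */
	return (n);
}

/* Print both results for one (ncomp, ncomps) pair. */
static void
show(unsigned int ncomp, unsigned int ncomps)
{
	printf("ncomp=%u ncomps=%u old=%u new=%u\n", ncomp, ncomps,
	    queued_old(ncomp, ncomps), queued_new(ncomp, ncomps));
}
```

Under that assumption, show(1, 1) reports old=0 and new=1: the old loop
never queues the request, which would match the stuck READ, while the
patched loop queues it once. For start index 0 the two forms agree, so
requests that start at the local component are unaffected.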