Date:      Sun, 29 Jan 2012 00:35:35 +0200
From:      Mikolaj Golub <trociny@freebsd.org>
To:        Artem Kajalainen <artem@kayalaynen.ru>
Cc:        Pawel Jakub Dawidek <pjd@FreeBSD.org>, freebsd-stable@freebsd.org
Subject:   Re: problems with hast
Message-ID:  <86ipjvbglk.fsf@kopusha.home.net>
In-Reply-To: <CAGS-ug=KPuuDHTYYcVFrk4D3Q=PhJtEfb4+1NknU-Qfu9pJZNw@mail.gmail.com> (Artem Kajalainen's message of "Wed, 18 Jan 2012 20:23:25 +0200")
References:  <CAGS-ug=KPuuDHTYYcVFrk4D3Q=PhJtEfb4+1NknU-Qfu9pJZNw@mail.gmail.com>

--=-=-=


Hi, 

On Wed, 18 Jan 2012 20:23:25 +0200 Artem Kajalainen wrote:

 AK> Hello,

 AK> I'm trying to setup hastd on two servers and got error, which I can't
 AK> understand. Box is running as primary, then i reboot it, another box
 AK> get primary role by carp events, then 1st box at boot tries to set up
 AK> primary role on own hast instance and fails with this:
 AK> Jan 18 22:13:03 gw_chlb_2 hastd[1387]: [storage0] (primary)
 AK> G_GATE_CMD_DONE failed: No such file or directory.
 AK> Jan 18 22:13:08 gw_chlb_2 hastd[1004]: [storage0] (primary) Worker
 AK> process exited ungracefully (pid=1387, exitcode=71).

 AK> I thought that geom_gate module can be problem, so i compiled it in
 AK> kernel. As you can see - it doesn't help. Both servers are
 AK> FreeBSD9.0-stable, updated 1 week ago. Hastd use whole disk. More info
 AK> from hastd:
 AK> gw_chlb_2# hastd -dF -c /etc/hast.conf
 AK> [INFO] Started successfully, running protocol version 1.
 AK> [DEBUG][1] Listening on control address /var/run/hastctl.
 AK> [INFO] Listening on address 192.168.0.1:8457.
 AK> [INFO] [storage0] (init) Role changed to primary.
 AK> [DEBUG][1] [storage0] (primary) Obtained info about /dev/ada2.
 AK> [DEBUG][1] [storage0] (primary) Locked /dev/ada2.
 AK> [INFO] [storage0] (primary) Device hast/storage0 created.
 AK> [DEBUG][1] [storage0] (primary) Privileges successfully dropped using
 AK> jail+setgid+setuid.
 AK> [INFO] [storage0] (primary) Privileges successfully dropped.
 AK> [INFO] [storage0] (primary) Connected to tcp4://192.168.0.2.
 AK> [INFO] [storage0] (primary) Synchronization started. 6.0MB to go.
 AK> [ERROR] [storage0] (primary) G_GATE_CMD_DONE failed: No such file or directory.
 AK> [INFO] [storage0] (primary) Received cancel from the kernel, exiting.
 AK> [DEBUG][1] Unable to receive event header: Socket is not connected.
 AK> [ERROR] [storage0] (primary) Worker process exited ungracefully
 AK> (pid=1452, exitcode=71).
 AK> [INFO] [storage0] (primary) Changing resource role back to init.

 AK> Any thoughts?

Sorry, Artem, I read your email only today.

After investigating, it looks like since r226859, when 'async' mode was added,
we have had 2 issues with synchronization from secondary to primary (normally
a rather rare case):

1) When synchronization from secondary to primary is running and the primary
gets a READ request, the request should be sent to the secondary, but it is
actually lost. As a result, the READ operation gets stuck. After the
synchronization is complete, subsequent READ requests, which can now be served
by the primary, work ok.

2) In async mode, for synchronization requests, the write_complete() function,
which sends the G_GATE_CMD_DONE command to ggate, is called twice and the
second call fails.
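
To illustrate the first issue, here is a minimal standalone sketch of the two
loop forms from the ggate_recv_thread() hunk of the patch. It is not hastd
code: the queued[] array and both function names are hypothetical stand-ins
for the send queues and QUEUE_INSERT1(), and it assumes that a READ redirected
to the secondary is submitted with ncomp = 1 (the remote component) and
ncomps = 1 (one component to send to).

```c
#include <assert.h>

#define NQUEUES 2	/* assumed: component 0 is local, component 1 is remote */

/*
 * Loop as it was before the patch: ncomps is used as an end index.
 * Returns the number of send queues that received the request.
 */
static int
enqueue_end_index(int ncomp, int ncomps, int queued[NQUEUES])
{
	int ii, n = 0;

	for (ii = ncomp; ii < ncomps; ii++) {
		assert(ii >= 0 && ii < NQUEUES);
		queued[ii]++;	/* stands in for QUEUE_INSERT1(hio, send, ii) */
		n++;
	}
	return (n);
}

/*
 * Loop as in the patch: ncomps is used as a count of components,
 * starting at index ncomp.
 */
static int
enqueue_count(int ncomp, int ncomps, int queued[NQUEUES])
{
	int ii, n = 0;

	for (ii = ncomp; ncomps != 0; ncomps--, ii++) {
		assert(ii >= 0 && ii < NQUEUES);
		queued[ii]++;	/* stands in for QUEUE_INSERT1(hio, send, ii) */
		n++;
	}
	return (n);
}
```

Under that assumption, enqueue_end_index(1, 1, queued) inserts the request
into no queue at all, which would match the lost READ, while
enqueue_count(1, 1, queued) inserts it into the remote queue only. For
ncomp = 0 and ncomps = 2 both forms behave the same.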

Artem, were you running in async mode? If so, I suppose you observed the
second issue. Could you please try the attached patch?
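
For the second issue, the condition changed in the local_send_thread() hunk
can be looked at in isolation. This is only a sketch: struct toy_hio, its
fields and both function names are hypothetical stand-ins for
hio->hio_replication and ISSYNCREQ(hio), not the real hastd definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the replication modes. */
enum toy_replication { TOY_FULLSYNC, TOY_MEMSYNC, TOY_ASYNC };

/* Hypothetical stand-in for the request; not the real struct hio. */
struct toy_hio {
	enum toy_replication replication;
	bool syncreq;	/* true for a synchronization request */
};

/* Condition before the patch: any request in async mode is completed early. */
static bool
early_complete_before(const struct toy_hio *hio)
{
	return (hio->replication == TOY_ASYNC);
}

/* Condition after the patch: synchronization requests are excluded. */
static bool
early_complete_after(const struct toy_hio *hio)
{
	return (hio->replication == TOY_ASYNC && !hio->syncreq);
}
```

The two conditions differ only for a synchronization request in async mode,
which is the case where write_complete() was called one time too many.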

-- 
Mikolaj Golub


--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=hastd.remote_read.patch

Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c	(revision 230661)
+++ sbin/hastd/primary.c	(working copy)
@@ -1255,7 +1255,7 @@ ggate_recv_thread(void *arg)
 		pjdlog_debug(2,
 		    "ggate_recv: (%p) Moving request to the send queues.", hio);
 		refcount_init(&hio->hio_countdown, ncomps);
-		for (ii = ncomp; ii < ncomps; ii++)
+		for (ii = ncomp; ncomps != 0; ncomps--, ii++)
 			QUEUE_INSERT1(hio, send, ii);
 	}
 	/* NOTREACHED */
@@ -1326,7 +1326,7 @@ local_send_thread(void *arg)
 			} else {
 				hio->hio_errors[ncomp] = 0;
 				if (hio->hio_replication ==
-				    HAST_REPLICATION_ASYNC) {
+				    HAST_REPLICATION_ASYNC && !ISSYNCREQ(hio)) {
 					ggio->gctl_error = 0;
 					write_complete(res, hio);
 				}

--=-=-=--


