Date: Sun, 29 Jan 2012 00:35:35 +0200
From: Mikolaj Golub <trociny@freebsd.org>
To: Artem Kajalainen <artem@kayalaynen.ru>
Cc: Pawel Jakub Dawidek <pjd@FreeBSD.org>, freebsd-stable@freebsd.org
Subject: Re: problems with hast
Message-ID: <86ipjvbglk.fsf@kopusha.home.net>
In-Reply-To: <CAGS-ug=KPuuDHTYYcVFrk4D3Q=PhJtEfb4%2B1NknU-Qfu9pJZNw@mail.gmail.com> (Artem Kajalainen's message of "Wed, 18 Jan 2012 20:23:25 +0200")
References: <CAGS-ug=KPuuDHTYYcVFrk4D3Q=PhJtEfb4%2B1NknU-Qfu9pJZNw@mail.gmail.com>
--=-=-=

Hi,

On Wed, 18 Jan 2012 20:23:25 +0200 Artem Kajalainen wrote:

 AK> Hello,
 AK> I'm trying to setup hastd on two servers and got error, which I can't
 AK> understand. Box is running as primary, then i reboot it, another box
 AK> get primary role by carp events, then 1st box at boot tries to set up
 AK> primary role on own hast instance and fails with this:

 AK> Jan 18 22:13:03 gw_chlb_2 hastd[1387]: [storage0] (primary)
 AK> G_GATE_CMD_DONE failed: No such file or directory.
 AK> Jan 18 22:13:08 gw_chlb_2 hastd[1004]: [storage0] (primary) Worker
 AK> process exited ungracefully (pid=1387, exitcode=71).

 AK> I thought that geom_gate module can be problem, so i compiled it in
 AK> kernel. As you can see - it doesn't help. Both servers are
 AK> FreeBSD9.0-stable, updated 1 week ago. Hastd use whole disk. More info
 AK> from hastd:

 AK> gw_chlb_2# hastd -dF -c /etc/hast.conf
 AK> [INFO] Started successfully, running protocol version 1.
 AK> [DEBUG][1] Listening on control address /var/run/hastctl.
 AK> [INFO] Listening on address 192.168.0.1:8457.
 AK> [INFO] [storage0] (init) Role changed to primary.
 AK> [DEBUG][1] [storage0] (primary) Obtained info about /dev/ada2.
 AK> [DEBUG][1] [storage0] (primary) Locked /dev/ada2.
 AK> [INFO] [storage0] (primary) Device hast/storage0 created.
 AK> [DEBUG][1] [storage0] (primary) Privileges successfully dropped using
 AK> jail+setgid+setuid.
 AK> [INFO] [storage0] (primary) Privileges successfully dropped.
 AK> [INFO] [storage0] (primary) Connected to tcp4://192.168.0.2.
 AK> [INFO] [storage0] (primary) Synchronization started. 6.0MB to go.
 AK> [ERROR] [storage0] (primary) G_GATE_CMD_DONE failed: No such file or directory.
 AK> [INFO] [storage0] (primary) Received cancel from the kernel, exiting.
 AK> [DEBUG][1] Unable to receive event header: Socket is not connected.
 AK> [ERROR] [storage0] (primary) Worker process exited ungracefully
 AK> (pid=1452, exitcode=71).
 AK> [INFO] [storage0] (primary) Changing resource role back to init.

 AK> Any thoughts?
Sorry, Artem, I read your email only today.

After investigating, it looks like since r226859, when 'async' mode was
added, we have had two issues with synchronization from secondary to
primary (normally a rather rare case):

1) When synchronization from secondary to primary is running and the
primary gets a READ request, the request should be sent to the secondary,
but it is actually lost. As a result, the READ operation gets stuck. After
the synchronization is complete, subsequent READ requests, which can now be
served by the primary, work fine.

2) In async mode, for synchronization requests, the write_complete()
function, which sends the G_GATE_CMD_DONE command to ggate, is called
twice, and the second call fails.

Artem, did you run async mode? If you did, then I suppose you observed the
second issue.

Could you please try the attached patch?

-- 
Mikolaj Golub

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=hastd.remote_read.patch

Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c	(revision 230661)
+++ sbin/hastd/primary.c	(working copy)
@@ -1255,7 +1255,7 @@ ggate_recv_thread(void *arg)
 		pjdlog_debug(2,
 		    "ggate_recv: (%p) Moving request to the send queues.", hio);
 		refcount_init(&hio->hio_countdown, ncomps);
-		for (ii = ncomp; ii < ncomps; ii++)
+		for (ii = ncomp; ncomps != 0; ncomps--, ii++)
 			QUEUE_INSERT1(hio, send, ii);
 	}
 	/* NOTREACHED */
@@ -1326,7 +1326,7 @@ local_send_thread(void *arg)
 			} else {
 				hio->hio_errors[ncomp] = 0;
 				if (hio->hio_replication ==
-				    HAST_REPLICATION_ASYNC) {
+				    HAST_REPLICATION_ASYNC && !ISSYNCREQ(hio)) {
 					ggio->gctl_error = 0;
 					write_complete(res, hio);
 				}

--=-=-=--
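
A note on the first hunk, as a minimal sketch rather than the actual hastd code: the loop bounds can be checked in isolation by counting how many send queues a request would be inserted into. The helper names queued_old() and queued_new() are hypothetical, and the numbering is an assumption not stated in the message above: component 0 is taken to be local, component 1 remote, and a READ redirected to the secondary is taken to arrive with ncomp = 1 and ncomps = 1.

```c
/*
 * Count how many send queues a request reaches with the original loop
 * bounds, where ii starts at ncomp but is compared against the *count*
 * ncomps rather than against ncomp + ncomps.
 */
int
queued_old(int ncomp, int ncomps)
{
	int ii, n = 0;

	for (ii = ncomp; ii < ncomps; ii++)
		n++;		/* stands in for QUEUE_INSERT1(hio, send, ii) */
	return (n);
}

/*
 * Same count with the patched bounds: iterate exactly ncomps times,
 * starting at component index ncomp.
 */
int
queued_new(int ncomp, int ncomps)
{
	int ii, n = 0;

	for (ii = ncomp; ncomps != 0; ncomps--, ii++)
		n++;		/* stands in for QUEUE_INSERT1(hio, send, ii) */
	return (n);
}
```

Under those assumptions, queued_old(1, 1) returns 0, so the request is never queued, which would match the lost READ described in issue 1, while queued_new(1, 1) returns 1. For a request that starts at component 0, for example (0, 1) or (0, 2), both versions give the same count, which would explain why the problem only shows up while synchronizing from the secondary.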