Date:      Fri, 08 Oct 2010 17:45:49 +0300
From:      Mikolaj Golub <to.my.trociny@gmail.com>
To:        Pawel Jakub Dawidek <pjd@FreeBSD.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
Message-ID:  <86lj68n3oi.fsf@zhuzha.ua1>
In-Reply-To: <20101007182436.GB1733@garage.freebsd.pl> (Pawel Jakub Dawidek's message of "Thu, 7 Oct 2010 20:24:36 +0200")
References:  <86hbh44wgl.fsf@kopusha.home.net> <86aamw4l42.fsf@kopusha.home.net> <20101004213647.GK7322@garage.freebsd.pl> <86tyl1m85y.fsf@zhuzha.ua1> <20101005074736.GM7322@garage.freebsd.pl> <20101007182436.GB1733@garage.freebsd.pl>

--=-=-=


On Thu, 7 Oct 2010 20:24:36 +0200 Pawel Jakub Dawidek wrote:

 PJD> On Tue, Oct 05, 2010 at 09:47:36AM +0200, Pawel Jakub Dawidek wrote:
 >> On Tue, Oct 05, 2010 at 10:05:13AM +0300, Mikolaj Golub wrote:
 >> > 
 >> > On Mon, 4 Oct 2010 23:36:47 +0200 Pawel Jakub Dawidek wrote:
 >> > 
 >> >  PJD> I see three problems:)
 >> > 
 >> >  PJD> 1. In child_kill() you interpret status value always, even if it is
 >> >  PJD>    invalid due to earlier errors.
 >> >  PJD> 2. While copying the code you changed style. Don't you like style(9)?:)
 >> > 
 >> > Me like :-). But it looks like my emacs don't. Need to teach it somehow...
 >> > 
 >> >  PJD> 3. The patch doesn't fix the root cause of the problem.
 >> > 
 >> > Thank you for your comments.
 >> 
 >> The hang you reported is still not fixed, but I'm working on it.

 PJD> Could you verify if the primary/secondary loop doesn't cause hangs
 PJD> anymore with most recent hast?

It doesn't, thanks!

But to test I had to run hastd with two changes. The first was needed to fix
the issue described in the Subject :-) (adding a (res->hr_event != NULL) check
in child_cleanup() -- you wrote that the fix was correct but did not commit
it).

The second one was needed to fix the issue that I observed after the latest
commit r213533:

Oct  8 16:14:04 hasta hastd[2175]: [storage] (primary) G_GATE_CMD_START failed: Invalid argument.
Oct  8 16:14:04 hasta kernel: Version mismatch 0 != 2.

By zeroing hio->hio_ggio we clear the version and the data pointer. This looks
wrong to me -- they are set and allocated in init_environment(). Also, it looks
like setting ggio->gctl_length = MAXPHYS is not needed here either. See the
attached patch.
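
To illustrate why the zeroing looks wrong, here is a minimal standalone sketch.
It does not use the real ggate structures: struct fake_ggio, FAKE_VERSION,
FAKE_BUFSIZE and the fake_*() functions are all hypothetical stand-ins, and the
value 2 is only taken from the "0 != 2" kernel message above. It assumes the
version and data buffer are set once at init time and are expected to survive
every per-request reset.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define	FAKE_VERSION	2	/* hypothetical; mirrors "0 != 2" in the log */
#define	FAKE_BUFSIZE	4096	/* hypothetical buffer size */

/* Simplified, hypothetical stand-in for the per-request control struct. */
struct fake_ggio {
	unsigned int	 version;	/* set once at init */
	void		*data;		/* allocated once at init */
	int		 unit;		/* set per request */
	int		 error;		/* set per request */
};

/* One-time setup, playing the role init_environment() has in the text. */
static int
fake_init(struct fake_ggio *ggio)
{

	memset(ggio, 0, sizeof(*ggio));
	ggio->version = FAKE_VERSION;
	ggio->data = malloc(FAKE_BUFSIZE);
	return (ggio->data != NULL ? 0 : -1);
}

/* Per-request reset that zeroes everything: version and data are lost. */
static void
fake_reset_zeroing(struct fake_ggio *ggio, int unit)
{

	memset(ggio, 0, sizeof(*ggio));
	ggio->unit = unit;
	ggio->error = 0;
}

/* Per-request reset that touches only the per-request fields. */
static void
fake_reset_selective(struct fake_ggio *ggio, int unit)
{

	assert(ggio->data != NULL);	/* buffer from fake_init() must exist */
	ggio->unit = unit;
	ggio->error = 0;
}
```

After fake_reset_zeroing() the version reads back as 0 and the data pointer as
NULL, which is the kind of state that would explain the version mismatch;
fake_reset_selective() leaves both untouched.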

-- 
Mikolaj Golub


--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=hastd.patch

Index: sbin/hastd/control.c
===================================================================
--- sbin/hastd/control.c	(revision 213573)
+++ sbin/hastd/control.c	(working copy)
@@ -58,8 +58,10 @@ child_cleanup(struct hast_resource *res)
 
 	proto_close(res->hr_ctrl);
 	res->hr_ctrl = NULL;
-	proto_close(res->hr_event);
-	res->hr_event = NULL;
+	if (res->hr_event != NULL) {
+		proto_close(res->hr_event);
+		res->hr_event = NULL;
+	}
 	res->hr_workerpid = 0;
 }
 
Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c	(revision 213573)
+++ sbin/hastd/primary.c	(working copy)
@@ -930,9 +930,7 @@ ggate_recv_thread(void *arg)
 		QUEUE_TAKE2(hio, free);
 		pjdlog_debug(2, "ggate_recv: (%p) Got free request.", hio);
 		ggio = &hio->hio_ggio;
-		bzero(ggio, sizeof(*ggio));
 		ggio->gctl_unit = res->hr_ggateunit;
-		ggio->gctl_length = MAXPHYS;
 		ggio->gctl_error = 0;
 		pjdlog_debug(2,
 		    "ggate_recv: (%p) Waiting for request from the kernel.",

--=-=-=--
