Date: Fri, 08 Oct 2010 17:45:49 +0300
From: Mikolaj Golub <to.my.trociny@gmail.com>
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
Message-ID: <86lj68n3oi.fsf@zhuzha.ua1>
In-Reply-To: <20101007182436.GB1733@garage.freebsd.pl> (Pawel Jakub Dawidek's message of "Thu, 7 Oct 2010 20:24:36 +0200")
References: <86hbh44wgl.fsf@kopusha.home.net> <86aamw4l42.fsf@kopusha.home.net> <20101004213647.GK7322@garage.freebsd.pl> <86tyl1m85y.fsf@zhuzha.ua1> <20101005074736.GM7322@garage.freebsd.pl> <20101007182436.GB1733@garage.freebsd.pl>
--=-=-=

On Thu, 7 Oct 2010 20:24:36 +0200 Pawel Jakub Dawidek wrote:

 PJD> On Tue, Oct 05, 2010 at 09:47:36AM +0200, Pawel Jakub Dawidek wrote:
 >> On Tue, Oct 05, 2010 at 10:05:13AM +0300, Mikolaj Golub wrote:
 >> >
 >> > On Mon, 4 Oct 2010 23:36:47 +0200 Pawel Jakub Dawidek wrote:
 >> >
 >> >  PJD> I see three problems:)
 >> >
 >> >  PJD> 1. In child_kill() you interpret status value always, even if it is
 >> >  PJD>    invalid due to earlier errors.
 >> >  PJD> 2. While copying the code you changed style. Don't you like style(9)?:)
 >> >
 >> > Me like :-). But it looks like my emacs don't. Need to teach it somehow...
 >> >
 >> >  PJD> 3. The patch doesn't fix the root cause of the problem.
 >> >
 >> > Thank you for your comments.
 >>
 >> The hang you reported is still not fixed, but I'm working on it.

 PJD> Could you verify if the primary/secondary loop doesn't cause hangs
 PJD> anymore with most recent hast?

It doesn't, thanks!

But to test I had to run hastd with two changes.

The first was needed to fix the issue described in the Subject :-) (adding
a (res->hr_event != NULL) check in child_cleanup() -- you wrote that the fix
was correct but did not commit it).

The second one was needed to fix an issue that I observed after the latest
commit, r213533:

Oct  8 16:14:04 hasta hastd[2175]: [storage] (primary) G_GATE_CMD_START failed: Invalid argument.
Oct  8 16:14:04 hasta kernel: Version mismatch 0 != 2.

By zeroing hio->hio_ggio we clear the version and the data pointer. This looks
wrong to me -- they are set and allocated in init_environment(). Also, it
looks like setting ggio->gctl_length = MAXPHYS is not needed here either.

See the attached patch.
-- 
Mikolaj Golub

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=hastd.patch

Index: sbin/hastd/control.c
===================================================================
--- sbin/hastd/control.c	(revision 213573)
+++ sbin/hastd/control.c	(working copy)
@@ -58,8 +58,10 @@ child_cleanup(struct hast_resource *res)
 
 	proto_close(res->hr_ctrl);
 	res->hr_ctrl = NULL;
-	proto_close(res->hr_event);
-	res->hr_event = NULL;
+	if (res->hr_event != NULL) {
+		proto_close(res->hr_event);
+		res->hr_event = NULL;
+	}
 	res->hr_workerpid = 0;
 }
 
Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c	(revision 213573)
+++ sbin/hastd/primary.c	(working copy)
@@ -930,9 +930,7 @@ ggate_recv_thread(void *arg)
 		QUEUE_TAKE2(hio, free);
 		pjdlog_debug(2, "ggate_recv: (%p) Got free request.", hio);
 		ggio = &hio->hio_ggio;
-		bzero(ggio, sizeof(*ggio));
 		ggio->gctl_unit = res->hr_ggateunit;
-		ggio->gctl_length = MAXPHYS;
 		ggio->gctl_error = 0;
 		pjdlog_debug(2,
 		    "ggate_recv: (%p) Waiting for request from the kernel.",
--=-=-=--
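
The primary.c change in the patch above can be illustrated with a small
standalone sketch. It is not hastd code: struct sketch_ggio, SKETCH_VERSION,
SKETCH_BUFSIZE and the sketch_* functions are hypothetical stand-ins, and the
expected version of 2 is taken only from the kernel log line quoted in the
message. The sketch shows why zeroing a request structure on every loop
iteration discards fields that were set once at start-up, while resetting only
the per-request fields keeps them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical values; only the "2" comes from the quoted kernel log. */
#define SKETCH_VERSION	2
#define SKETCH_BUFSIZE	4096

/*
 * Hypothetical stand-in for the ggate control structure. Field names follow
 * those mentioned in the message; the layout is illustrative only.
 */
struct sketch_ggio {
	unsigned int	 gctl_version;
	int		 gctl_unit;
	size_t		 gctl_length;
	int		 gctl_error;
	void		*gctl_data;
};

/* One-time setup, playing the role that init_environment() has in the message. */
static int
sketch_init(struct sketch_ggio *ggio)
{

	ggio->gctl_version = SKETCH_VERSION;
	ggio->gctl_unit = -1;
	ggio->gctl_length = SKETCH_BUFSIZE;
	ggio->gctl_error = 0;
	ggio->gctl_data = malloc(SKETCH_BUFSIZE);
	return (ggio->gctl_data == NULL ? -1 : 0);
}

/* Per-request reset that zeroes the whole structure first. */
static void
sketch_reset_zeroing(struct sketch_ggio *ggio, int unit)
{

	/* Wipes gctl_version and gctl_data along with everything else. */
	memset(ggio, 0, sizeof(*ggio));
	ggio->gctl_unit = unit;
}

/* Per-request reset that touches only the per-request fields. */
static void
sketch_reset_fields(struct sketch_ggio *ggio, int unit)
{

	ggio->gctl_unit = unit;
	ggio->gctl_error = 0;
}
```

After sketch_init(), calling sketch_reset_zeroing() leaves gctl_version at 0
and gctl_data at NULL, which matches the "Version mismatch 0 != 2" symptom in
spirit; calling sketch_reset_fields() instead leaves both as initialised.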