Date: Fri, 08 Oct 2010 17:45:49 +0300
From: Mikolaj Golub <to.my.trociny@gmail.com>
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
Message-ID: <86lj68n3oi.fsf@zhuzha.ua1>
In-Reply-To: <20101007182436.GB1733@garage.freebsd.pl> (Pawel Jakub Dawidek's message of "Thu, 7 Oct 2010 20:24:36 +0200")
References: <86hbh44wgl.fsf@kopusha.home.net> <86aamw4l42.fsf@kopusha.home.net> <20101004213647.GK7322@garage.freebsd.pl> <86tyl1m85y.fsf@zhuzha.ua1> <20101005074736.GM7322@garage.freebsd.pl> <20101007182436.GB1733@garage.freebsd.pl>
--=-=-=
On Thu, 7 Oct 2010 20:24:36 +0200 Pawel Jakub Dawidek wrote:
PJD> On Tue, Oct 05, 2010 at 09:47:36AM +0200, Pawel Jakub Dawidek wrote:
>> On Tue, Oct 05, 2010 at 10:05:13AM +0300, Mikolaj Golub wrote:
>> >
>> > On Mon, 4 Oct 2010 23:36:47 +0200 Pawel Jakub Dawidek wrote:
>> >
>> > PJD> I see three problems:)
>> >
>> > PJD> 1. In child_kill() you interpret status value always, even if it is
>> > PJD> invalid due to earlier errors.
>> > PJD> 2. While copying the code you changed style. Don't you like style(9)?:)
>> >
>> > Me like :-). But it looks like my emacs don't. Need to teach it somehow...
>> >
>> > PJD> 3. The patch doesn't fix the root cause of the problem.
>> >
>> > Thank you for your comments.
>>
>> The hang you reported is still not fixed, but I'm working on it.
PJD> Could you verify if the primary/secondary loop doesn't cause hangs
PJD> anymore with most recent hast?
It doesn't, thanks!
But to test I had to run hastd with two changes. The first was needed to fix
the issue described in the Subject :-) (adding a (res->hr_event != NULL) check
in child_cleanup() -- you wrote that the fix was correct but did not commit
it).
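
To illustrate the first change outside of hastd, here is a minimal sketch.
All names in it (struct conn, struct resource, conn_close(),
resource_cleanup()) are hypothetical stand-ins, not the hastd API, and it
assumes a close routine that does not tolerate a NULL argument:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a connection and a resource that owns two. */
struct conn {
	int fd;
};

struct resource {
	struct conn *ctrl;	/* always set while the worker exists */
	struct conn *event;	/* may still be NULL on some paths */
};

static int closed_count = 0;

/* Assumption: like many close routines, this one asserts on NULL. */
static void
conn_close(struct conn *c)
{

	assert(c != NULL);
	free(c);
	closed_count++;
}

/* Cleanup that is safe to call even if the event connection was never set. */
static void
resource_cleanup(struct resource *res)
{

	conn_close(res->ctrl);
	res->ctrl = NULL;
	if (res->event != NULL) {
		conn_close(res->event);
		res->event = NULL;
	}
}
```

With the guard, cleaning up a resource whose event connection was never
created closes only the control connection instead of tripping the assertion.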
The second one was needed to fix the issue that I observed after the latest
commit r213533:
Oct 8 16:14:04 hasta hastd[2175]: [storage] (primary) G_GATE_CMD_START failed: Invalid argument.
Oct 8 16:14:04 hasta kernel: Version mismatch 0 != 2.
Zeroing hio->hio_ggio clears the version and the data pointer. This looks
wrong to me -- they are set and allocated in init_environment(). Also, it
looks like setting ggio->gctl_length = MAXPHYS is not needed here either. See
the attached patch.
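
The zeroing problem can be sketched in isolation roughly like this. The
struct and function names below are hypothetical and only mimic the pattern
(fields set once at init time, then the structure reused per request); the
version value 2 is taken from the kernel message above, everything else is an
assumption:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define	REQ_VERSION	2	/* mirrors the "0 != 2" in the kernel log */
#define	REQ_BUFSIZE	4096	/* hypothetical buffer size */

/* Hypothetical stand-in for a reusable I/O control structure. */
struct req {
	unsigned int version;	/* set once at init time */
	void *data;		/* allocated once at init time */
	int unit;		/* set for every request */
	int error;		/* set for every request */
};

/* One-time setup, analogous in spirit to an init_environment() step. */
static int
req_init(struct req *r)
{

	r->version = REQ_VERSION;
	r->data = malloc(REQ_BUFSIZE);
	if (r->data == NULL)
		return (-1);
	r->unit = 0;
	r->error = 0;
	return (0);
}

/* Problematic reuse: wipes the version and drops the buffer pointer. */
static void
req_reuse_zeroing(struct req *r, int unit)
{

	memset(r, 0, sizeof(*r));
	r->unit = unit;
}

/* Reuse that touches only the per-request fields. */
static void
req_reuse(struct req *r, int unit)
{

	r->unit = unit;
	r->error = 0;
}

/* Stand-in for the consumer-side validation that rejects a bad request. */
static int
req_submit(const struct req *r)
{

	if (r->version != REQ_VERSION || r->data == NULL)
		return (-1);
	return (0);
}
```

After req_reuse() the request still passes req_submit(); after
req_reuse_zeroing() it is rejected, which is the same kind of failure as the
version mismatch in the log.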
--
Mikolaj Golub
--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=hastd.patch
Index: sbin/hastd/control.c
===================================================================
--- sbin/hastd/control.c (revision 213573)
+++ sbin/hastd/control.c (working copy)
@@ -58,8 +58,10 @@ child_cleanup(struct hast_resource *res)
 	proto_close(res->hr_ctrl);
 	res->hr_ctrl = NULL;
-	proto_close(res->hr_event);
-	res->hr_event = NULL;
+	if (res->hr_event != NULL) {
+		proto_close(res->hr_event);
+		res->hr_event = NULL;
+	}
 	res->hr_workerpid = 0;
 }
Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c (revision 213573)
+++ sbin/hastd/primary.c (working copy)
@@ -930,9 +930,7 @@ ggate_recv_thread(void *arg)
 		QUEUE_TAKE2(hio, free);
 		pjdlog_debug(2, "ggate_recv: (%p) Got free request.", hio);
 		ggio = &hio->hio_ggio;
-		bzero(ggio, sizeof(*ggio));
 		ggio->gctl_unit = res->hr_ggateunit;
-		ggio->gctl_length = MAXPHYS;
 		ggio->gctl_error = 0;
 		pjdlog_debug(2,
 		    "ggate_recv: (%p) Waiting for request from the kernel.",
--=-=-=--
