From owner-freebsd-fs@FreeBSD.ORG Fri Oct 8 14:46:26 2010
From: Mikolaj Golub
To: Pawel Jakub Dawidek
Organization: TOA Ukraine
References: <86hbh44wgl.fsf@kopusha.home.net> <86aamw4l42.fsf@kopusha.home.net>
 <20101004213647.GK7322@garage.freebsd.pl> <86tyl1m85y.fsf@zhuzha.ua1>
 <20101005074736.GM7322@garage.freebsd.pl>
 <20101007182436.GB1733@garage.freebsd.pl>
Date: Fri, 08 Oct 2010 17:45:49 +0300
In-Reply-To: <20101007182436.GB1733@garage.freebsd.pl> (Pawel Jakub Dawidek's message of "Thu, 7 Oct 2010 20:24:36 +0200")
Message-ID: <86lj68n3oi.fsf@zhuzha.ua1>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Cc: freebsd-fs@freebsd.org
Subject: Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
List-Id: Filesystems

--=-=-=

On Thu, 7 Oct 2010 20:24:36 +0200 Pawel Jakub Dawidek wrote:

 PJD> On Tue, Oct 05, 2010 at 09:47:36AM +0200, Pawel Jakub Dawidek wrote:
 >> On Tue, Oct 05, 2010 at 10:05:13AM +0300, Mikolaj Golub wrote:
 >> >
 >> > On Mon, 4 Oct 2010 23:36:47 +0200 Pawel Jakub Dawidek wrote:
 >> >
 >> >  PJD> I see three problems:)
 >> >
 >> >  PJD> 1. In child_kill() you interpret status value always, even if it is
 >> >  PJD>    invalid due to earlier errors.
 >> >  PJD> 2. While copying the code you changed style. Don't you like style(9)?:)
 >> >
 >> > Me like :-). But it looks like my emacs don't. Need to teach it somehow...
 >> >
 >> >  PJD> 3. The patch doesn't fix the root cause of the problem.
 >> >
 >> > Thank you for your comments.
 >>
 >> The hang you reported is still not fixed, but I'm working on it.

 PJD> Could you verify if the primary/secondary loop doesn't cause hangs
 PJD> anymore with most recent hast?

It doesn't, thanks!

But to test it I had to run hastd with two changes. The first was needed to
fix the issue described in the Subject :-) (adding a (res->hr_event != NULL)
check in child_cleanup() -- you wrote that the fix was correct but did not
commit it).
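For illustration, the first change amounts to guarding the close of a handle
that may never have been opened. The following is a minimal standalone sketch,
not hastd code: struct conn, struct resource, conn_close(), resource_cleanup()
and closed_count are hypothetical stand-ins for hast_resource, proto_close()
and child_cleanup(), and it assumes that the close routine must never be
handed NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a connection handle. */
struct conn {
	int fd;
};

/* Hypothetical stand-in for the per-resource state. */
struct resource {
	struct conn *ctrl;	/* assumed to be set whenever cleanup runs */
	struct conn *event;	/* may still be NULL if setup stopped early */
	int workerpid;
};

/* Counts successful closes; present only so the sketch can be checked. */
static int closed_count;

/* Assumption: like the real close routine, this must not be given NULL. */
static void
conn_close(struct conn *c)
{

	assert(c != NULL);
	free(c);
	closed_count++;
}

/* Release the handles, skipping the event handle if it was never opened. */
void
resource_cleanup(struct resource *res)
{

	conn_close(res->ctrl);
	res->ctrl = NULL;
	if (res->event != NULL) {
		conn_close(res->event);
		res->event = NULL;
	}
	res->workerpid = 0;
}
```

With a resource whose event handle was never set up, resource_cleanup() closes
only the control handle and still leaves all three fields reset.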
The second one was needed to fix the issue that I observed after the latest
commit, r213533:

Oct 8 16:14:04 hasta hastd[2175]: [storage] (primary) G_GATE_CMD_START failed: Invalid argument.
Oct 8 16:14:04 hasta kernel: Version mismatch 0 != 2.

By zeroing hio->hio_ggio we clear the version and the data pointer. This looks
wrong to me -- they are set and allocated in init_environment(). Also, it
looks like setting ggio->gctl_length = MAXPHYS is not needed here either.

See the attached patch.

-- 
Mikolaj Golub

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=hastd.patch

Index: sbin/hastd/control.c
===================================================================
--- sbin/hastd/control.c	(revision 213573)
+++ sbin/hastd/control.c	(working copy)
@@ -58,8 +58,10 @@ child_cleanup(struct hast_resource *res)
 
 	proto_close(res->hr_ctrl);
 	res->hr_ctrl = NULL;
-	proto_close(res->hr_event);
-	res->hr_event = NULL;
+	if (res->hr_event != NULL) {
+		proto_close(res->hr_event);
+		res->hr_event = NULL;
+	}
 	res->hr_workerpid = 0;
 }
 
Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c	(revision 213573)
+++ sbin/hastd/primary.c	(working copy)
@@ -930,9 +930,7 @@ ggate_recv_thread(void *arg)
 		QUEUE_TAKE2(hio, free);
 		pjdlog_debug(2, "ggate_recv: (%p) Got free request.", hio);
 		ggio = &hio->hio_ggio;
-		bzero(ggio, sizeof(*ggio));
 		ggio->gctl_unit = res->hr_ggateunit;
-		ggio->gctl_length = MAXPHYS;
 		ggio->gctl_error = 0;
 		pjdlog_debug(2,
 		    "ggate_recv: (%p) Waiting for request from the kernel.",

--=-=-=--
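To make the second problem concrete, here is a small standalone sketch of why
zeroing a pre-initialized request structure breaks later submissions. It does
not use the real g_gate structures: struct sketch_io, SKETCH_VERSION,
SKETCH_BUFSIZE and the sketch_* functions are hypothetical, with the version
value 2 taken from the kernel message above and the one-time setup standing in
for init_environment().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define	SKETCH_VERSION	2	/* value taken from "Version mismatch 0 != 2" */
#define	SKETCH_BUFSIZE	4096	/* arbitrary buffer size for the sketch */

/* Hypothetical request structure that is set up once and then reused. */
struct sketch_io {
	unsigned int version;	/* set once at init time */
	int unit;		/* set per request */
	int error;		/* set per request */
	void *data;		/* allocated once at init time */
};

/* One-time setup: record the version and allocate the data buffer. */
int
sketch_init(struct sketch_io *io)
{

	memset(io, 0, sizeof(*io));
	io->version = SKETCH_VERSION;
	io->data = malloc(SKETCH_BUFSIZE);
	return (io->data != NULL ? 0 : -1);
}

/* Problematic reset: wipes the version and loses the buffer pointer. */
void
sketch_reset_zeroing(struct sketch_io *io, int unit)
{

	memset(io, 0, sizeof(*io));
	io->unit = unit;
}

/* Reset that touches only the per-request fields. */
void
sketch_reset_fields(struct sketch_io *io, int unit)
{

	io->unit = unit;
	io->error = 0;
}

/* Stand-in for the consumer's check: reject a wrong version or no buffer. */
int
sketch_submit(const struct sketch_io *io)
{

	assert(io != NULL);
	if (io->version != SKETCH_VERSION || io->data == NULL)
		return (-1);
	return (0);
}
```

After sketch_init(), a request prepared with sketch_reset_fields() passes
sketch_submit(), while one prepared with sketch_reset_zeroing() is rejected
because the version reads as 0 and the buffer pointer is gone.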