Date: Mon, 4 Oct 2010 23:36:47 +0200
From: Pawel Jakub Dawidek
To: Mikolaj Golub
Cc: freebsd-fs@freebsd.org
Subject: Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
Message-ID: <20101004213647.GK7322@garage.freebsd.pl>
In-Reply-To: <86aamw4l42.fsf@kopusha.home.net>
References: <86hbh44wgl.fsf@kopusha.home.net> <86aamw4l42.fsf@kopusha.home.net>

On Sat, Oct 02, 2010 at 07:26:05PM +0300, Mikolaj Golub wrote:
> Running with this fix another issue is observed. On split-brain `hastctl
> status' on secondary will return "[ERROR] Error 32 received from hastd" most
> of the times. And only for some runs an output will be returned.
>
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> lolek# hastctl status storage
> storage:
>   role: secondary
>   provname: storage
>   localpath: /dev/ad4
>   extentsize: 2097152
>   keepdirty: 0
>   remoteaddr: tcp4://bolek
>   replication: memsync
>   status: complete
>   dirty: 0 bytes
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
>
> This is because hastd clears res->hr_workerpid only when a new connection from
> the primary comes. Whilst hastd checks res->hr_workerpid in control_status()
> and if it is not zero it tries to get info from the worker and returns error
> (broken pipe) if the worker is actually not running.
>
> So it looks like it is better not just to close res->hr_ctrl in main_loop()
> but to do full child cleanup here -- straight away its exit is detected.
>
> What do you think about the attached patch?

I see three problems:)

1. In child_kill() you always interpret the status value, even if it is
   invalid due to earlier errors.

2. While copying the code you changed the style. Don't you like style(9)?:)

3. The patch doesn't fix the root cause of the problem. The real problem,
   also behind the "hastd: zombies after hooks" issue you reported, is that
   sigprocmask(2) doesn't mask ignored signals. In this case SIGCHLD is
   ignored by default, so it was never reported. We first need to install
   a dummy signal handler for SIGCHLD.

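To illustrate the third point, here is a minimal sketch of the idea; it is not
the actual hastd change. It installs a do-nothing handler for SIGCHLD, blocks
the signal with sigprocmask(2), and then collects it synchronously with
sigtimedwait(2) before reaping the child. The names dummy_sighandler() and
wait_for_child_signal() and the 5 second timeout are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <sys/types.h>
#include <sys/wait.h>

#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Does nothing; it exists only so that SIGCHLD is no longer ignored. */
static void
dummy_sighandler(int sig)
{

	(void)sig;
}

/*
 * Hypothetical helper, not taken from hastd: fork a child that exits
 * immediately with 'exitcode', wait for its SIGCHLD synchronously and
 * reap it. Returns the child's exit code, or -1 on error or timeout.
 */
static int
wait_for_child_signal(int exitcode)
{
	struct sigaction sa;
	struct timespec timeout = { 5, 0 };	/* Arbitrary for this sketch. */
	sigset_t mask;
	pid_t pid;
	int status;

	/* Install the dummy handler first, so the signal is not discarded. */
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = dummy_sighandler;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, NULL) == -1)
		return (-1);

	/* Block SIGCHLD, so it stays pending until we ask for it. */
	sigemptyset(&mask);
	sigaddset(&mask, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
		return (-1);

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0)
		_exit(exitcode);

	/* Pick up the pending SIGCHLD, then reap the child. */
	if (sigtimedwait(&mask, NULL, &timeout) != SIGCHLD)
		return (-1);
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return (-1);
	return (WEXITSTATUS(status));
}
```

Calling wait_for_child_signal(7) from main() should return 7 once the child
has exited. The order is what matters here: the handler goes in before the
signal is blocked and before fork(2).
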
The fix I have here (and am going to commit after a bit more testing) fixes
the zombie hooks you observed completely and makes the window for the
'Error 32 received from hastd' problem much smaller. You can still see this
message, because we can send a request to the child before we know it has
terminated, but it is not as visible as it was before.

Thanks for the report!

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!