Date: Mon, 4 Oct 2010 23:36:47 +0200
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: Mikolaj Golub <to.my.trociny@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
Message-ID: <20101004213647.GK7322@garage.freebsd.pl>
In-Reply-To: <86aamw4l42.fsf@kopusha.home.net>
References: <86hbh44wgl.fsf@kopusha.home.net> <86aamw4l42.fsf@kopusha.home.net>

On Sat, Oct 02, 2010 at 07:26:05PM +0300, Mikolaj Golub wrote:
> Running with this fix another issue is observed. On split-brain `hastctl
> status' on secondary will return "[ERROR] Error 32 received from hastd" most
> of the times. And only for some runs an output will be returned.
>
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> lolek# hastctl status storage
> storage:
>   role: secondary
>   provname: storage
>   localpath: /dev/ad4
>   extentsize: 2097152
>   keepdirty: 0
>   remoteaddr: tcp4://bolek
>   replication: memsync
>   status: complete
>   dirty: 0 bytes
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
>
> This is because hastd clears res->hr_workerpid only when a new connection from
> the primary comes. Whilst hastd checks res->hr_workerpid in control_status()
> and if it is not zero it tries to get info from the worker and returns error
> (broken pipe) if the worker is actually not running.
>
> So it looks like it is better not just to close res->hr_ctrl in main_loop()
> but to do full child cleanup here -- straight away its exit is detected.
>
> What do you think about the attached patch?

I see three problems:)

1. In child_kill() you always interpret the status value, even if it is
   invalid due to earlier errors.
2. While copying the code you changed style. Don't you like style(9)?:)
3. The patch doesn't fix the root cause of the problem.

The real problem, also behind the "hastd: zombies after hooks" issue you
reported, is that sigprocmask(2) doesn't mask ignored signals. In this
case SIGCHLD is ignored by default, so it was never reported. We first
need to install a dummy signal handler for SIGCHLD.
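
To illustrate the point about SIGCHLD, here is a minimal sketch of the idea, not the actual hastd change. The names dummy_handler() and wait_for_child_signal() are hypothetical, and the use of sigtimedwait(2) with a 5-second timeout is an assumption made only to keep the example self-contained. It installs a do-nothing handler so that SIGCHLD is no longer in its default "ignore" disposition, blocks the signal with sigprocmask(2), and then collects it synchronously. It also interprets the wait status only after waitpid(2) has succeeded.

```c
#define _POSIX_C_SOURCE 200809L	/* for sigaction(2), sigtimedwait(2) */

#include <sys/types.h>
#include <sys/wait.h>

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/*
 * Dummy handler (hypothetical name). It does no work; it exists only so
 * that SIGCHLD has a handler instead of the default "ignore" action.
 */
static void
dummy_handler(int sig)
{

	(void)sig;
}

/*
 * Hypothetical helper: fork a child that exits immediately and wait for
 * the resulting SIGCHLD synchronously. Returns the received signal
 * number (SIGCHLD on success) or -1 on error or timeout. SIGCHLD stays
 * blocked and the dummy handler stays installed on return.
 */
static int
wait_for_child_signal(void)
{
	struct sigaction sa;
	struct timespec timeout = { 5, 0 };	/* arbitrary upper bound */
	sigset_t mask;
	pid_t pid;
	int signo, status;

	/* 1. Install the dummy handler first. */
	sa.sa_handler = dummy_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGCHLD, &sa, NULL) == -1)
		return (-1);

	/* 2. Only now block the signal, so it is queued, not discarded. */
	sigemptyset(&mask);
	sigaddset(&mask, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
		return (-1);

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0)
		_exit(0);	/* child: terminate right away */

	/* 3. Pick up the pending SIGCHLD, retrying if interrupted. */
	do {
		signo = sigtimedwait(&mask, NULL, &timeout);
	} while (signo == -1 && errno == EINTR);

	/* 4. Reap the child; look at status only if waitpid() succeeded. */
	if (waitpid(pid, &status, 0) != pid)
		return (-1);
	/* The child always calls _exit(0), so this is an invariant here. */
	assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);

	return (signo);
}
```

Calling wait_for_child_signal() from a small test program should return SIGCHLD. Some other systems keep a blocked SIGCHLD pending even without a handler, so removing step 1 may only show the lost signal on systems that behave as described above.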

The fix I have here (which I'm going to commit after a bit more testing)
fixes the zombie hooks you observed completely and makes the window for
the 'Error 32 received from hastd' problem much smaller. You can still
see this message, because we can send a request to the child before we
know it has terminated, but it is not as visible as it was before.

Thanks for the report!

--
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!