Date:      Mon, 4 Oct 2010 23:36:47 +0200
From:      Pawel Jakub Dawidek <pjd@FreeBSD.org>
To:        Mikolaj Golub <to.my.trociny@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
Message-ID:  <20101004213647.GK7322@garage.freebsd.pl>
In-Reply-To: <86aamw4l42.fsf@kopusha.home.net>
References:  <86hbh44wgl.fsf@kopusha.home.net> <86aamw4l42.fsf@kopusha.home.net>

On Sat, Oct 02, 2010 at 07:26:05PM +0300, Mikolaj Golub wrote:
> Running with this fix another issue is observed. On split-brain `hastctl
> status' on secondary will return "[ERROR] Error 32 received from hastd" most
> of the times. And only for some runs an output will be returned.
>
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> lolek# hastctl status storage
> storage:
>   role: secondary
>   provname: storage
>   localpath: /dev/ad4
>   extentsize: 2097152
>   keepdirty: 0
>   remoteaddr: tcp4://bolek
>   replication: memsync
>   status: complete
>   dirty: 0 bytes
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
>
> This is because hastd clears res->hr_workerpid only when a new connection from
> the primary comes. Whilst hastd checks res->hr_workerpid in control_status()
> and if it is not zero it tries to get info from the worker and returns error
> (broken pipe) if the worker is actually not running.
>
> So it looks like it is better not just to close res->hr_ctrl in main_loop()
> but to do full child cleanup here -- straight away its exit is detected.
>
> What do you think about the attached patch?

I see three problems:)

1. In child_kill() you interpret status value always, even if it is
   invalid due to earlier errors.
2. While copying the code you changed style. Don't you like style(9)?:)
3. The patch doesn't fix the root cause of the problem.
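
To illustrate point 1, here is a minimal sketch of reaping a child where
the status is examined only after waitpid(2) has succeeded. This is not
the code from the patch or from hastd; reap_child() and
spawn_exiting_child() are hypothetical names used only for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not hastd's child_kill()): reap `pid' and report
 * how it ended. Returns 0 and stores the exit code (or the negated signal
 * number) in *exitcode on success. Returns -1 if waitpid(2) failed, in
 * which case `status' is undefined and is never examined.
 */
static int
reap_child(pid_t pid, int *exitcode)
{
	pid_t ret;
	int status;

	do {
		ret = waitpid(pid, &status, 0);
	} while (ret == -1 && errno == EINTR);
	if (ret != pid)
		return (-1);	/* Do not interpret status after an error. */

	if (WIFEXITED(status))
		*exitcode = WEXITSTATUS(status);
	else if (WIFSIGNALED(status))
		*exitcode = -WTERMSIG(status);
	else
		return (-1);
	return (0);
}

/* Hypothetical demo helper: fork a child that exits at once with `code'. */
static pid_t
spawn_exiting_child(int code)
{
	pid_t pid;

	pid = fork();
	if (pid == 0)
		_exit(code);
	return (pid);
}
```

Calling reap_child() a second time on the same PID fails with ECHILD, and
the caller's exit code variable is left untouched.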

The real problem, also behind the "hastd: zombies after hooks" issue you
reported, was that blocking a signal with sigprocmask(2) has no effect
when the signal is ignored: it is discarded instead of being left
pending. SIGCHLD is ignored by default, so it was never reported. We
need to install a dummy signal handler for SIGCHLD first.

The fix I have here (and am going to commit after a bit more testing)
completely fixes the zombie hooks you observed and makes the window for
the 'Error 32 received from hastd' problem much smaller. You can still
see this message, because we can send a request to the child before we
know it has terminated, but it is not as visible as it was before.

Thanks for the report!

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
