From owner-freebsd-questions@freebsd.org Sat Jan 16 18:13:12 2016
Date: Sat, 16 Jan 2016 20:13:06 +0200
From: Mykola Golub
To: Shahin Hasanov
Cc: FREEBSD_QUESTION
Subject: Re: the switching time hastd from secondary to primary
Message-ID: <20160116181305.GA2165@gmail.com>

On Thu, Jan 14, 2016 at 02:23:46PM +0400, Shahin Hasanov wrote:

> In /usr/local/sbin/ucarp_up.sh(below shown extract of it) script
> ucarp waiting while it became primary. It tooks about 20 sec as
> written
> http://www.freebsd.org/cgi/man.cgi?query=hast.conf&apropos=0&sektion=0&manpath=FreeBSD+10.2-RELEASE&arch=default&format=html
> .
> for i in `jot 30`; do
>     pgrep -f "hastd: ${resource} \(secondary\)" >/dev/null 2>&1 || break
>     sleep 1
> done
> if pgrep -f "hastd: ${resource} \(secondary\)" >/dev/null 2>&1; then
>     logger -p local0.error -t hast "Secondary process for resource ${resource} is still running after 30 seconds."
>     exit 1
> fi

It would help to see the logs, but my guess is that you are hitting
the timeout in the thread that waits for incoming data from the primary.
This timeout is 2 * HAST_KEEPALIVE, and HAST_KEEPALIVE is hardcoded to
10 sec, which gives the 20 sec you observe. Right now it can be changed
only by recompiling hastd.

On the other hand, hitting this timeout means that the connection was
not closed properly, so it is not a case I would expect for a "planned"
failover, when the role is changed using `hastctl role` commands. It
looks more like disaster recovery after a network partition, host
crash, hang, etc.

In my opinion, waiting 20 sec is not bad compared with the possibility
of a split-brain if the former primary is still alive.

If you observe the 20 sec timeout when doing a "planned" failover, I
guess there is something wrong with the scripts that do the switching.

-- 
Mykola Golub