From: Mikolaj Golub
To: "hiroshi@soupacific.com"
Organization: TOA Ukraine
References: <4C139F9C.2090305@soupacific.com> <86iq5oc82y.fsf@kopusha.home.net> <4C14215D.9090304@soupacific.com> <20100613003635.GA60012@icarus.home.lan> <20100613074921.GB1320@garage.freebsd.pl>
    <4C149A5C.3070401@soupacific.com> <20100613102401.GE1320@garage.freebsd.pl> <86eigavzsg.fsf@kopusha.home.net> <20100614095044.GH1721@garage.freebsd.pl> <868w6hwt2w.fsf@kopusha.home.net> <20100614153746.GN1721@garage.freebsd.pl> <86zkyxvc4v.fsf@kopusha.home.net> <4C2C43D5.1080907@soupacific.com> <86mxubndrp.fsf@kopusha.home.net> <4C2D7615.5070606@soupacific.com>
Date: Fri, 02 Jul 2010 10:11:12 +0300
In-Reply-To: <4C2D7615.5070606@soupacific.com> (hiroshi@soupacific.com's message of "Fri, 02 Jul 2010 14:16:05 +0900")
Message-ID: <861vbm1hpr.fsf@zhuzha.ua1>
Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek
Subject: Re: HAST and CARP
List-Id: Filesystems

On Fri, 02 Jul 2010 14:16:05 +0900 hiroshi@soupacific.com wrote:

h> I checked that both node communication is established. And here is log
h> without hastctl create.
h> Seems ServerB once became MASTER, then back to BACKUP.
h> This situation cause unhappy split-brain happened.
h> hastctl dump shows prevrole: primary
h> error debug los is this
h> Jul 2 12:31:37 fw01B kernel: Clearing /tmp (X related).
h> Jul 2 12:31:37 fw01B kernel: Updating motd:
h> Jul 2 12:31:37 fw01B kernel: .
h> Jul 2 12:31:37 fw01B kernel: Configuring syscons:
h> Jul 2 12:31:37 fw01B kernel: blanktime
h> Jul 2 12:31:37 fw01B kernel: .
h> Jul 2 12:31:38 fw01B sm-mta[879]: gethostbyaddr(211.19.53.206) failed: 2
h> Jul 2 12:31:38 fw01B sm-mta[879]: gethostbyaddr(211.19.53.202) failed: 2
h> Jul 2 12:31:38 fw01B sm-mta[880]: starting daemon (8.14.4):
h> SMTP+queueing@00:30:00
h> Jul 2 12:31:38 fw01B sm-msp-queue[884]: starting daemon (8.14.4):
h> queueing@00:30:00
h> Jul 2 12:31:38 fw01B kernel: Starting cron.
h> Jul 2 12:31:38 fw01B kernel: Starting background file system checks
h> in 60 seconds.
h> Jul 2 12:31:38 fw01B kernel:
h> Jul 2 12:31:38 fw01B kernel: Fri Jul 2 12:31:38 UTC 2010
h> Jul 2 12:31:40 fw01B kernel: carp0: INIT -> BACKUP
h> Jul 2 12:31:40 fw01B kernel: alc0: link state changed to UP
h> Jul 2 12:31:40 fw01B kernel: carp0: 2 link states coalesced
h> Jul 2 12:31:40 fw01B kernel: carp0: link state changed to DOWN
h> Jul 2 12:31:43 fw01B login: login on ttyv0 as root
h> Jul 2 12:31:43 fw01B login: ROOT LOGIN (root) ON ttyv0
h> Jul 2 12:31:43 fw01B kernel: Jul 2 12:31:43 fw01B login: ROOT LOGIN
h> (root) ON ttyv0
h> Jul 2 12:31:48 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
h> Jul 2 12:31:48 fw01B hastd: Connection from
h> tcp4://211.19.53.201:20070 to tcp4://211.19.53.206:8457.
h> Jul 2 12:31:48 fw01B hastd: tcp4://211.19.53.201:20070: resource=zfshast
h> Jul 2 12:31:48 fw01B hastd: [zfshast] (init) We act as init for the
h> resource and not as secondary as requested by
h> tcp4://211.19.53.201:20070.
h> Jul 2 12:31:48 fw01B kernel: Jul 2 12:31:48 fw01B hastd: [zfshast]
h> (init) We act as init for the resource and not as secondary as
h> requested by tcp4://211.19.53.201:20070.
h> Jul 2 12:31:53 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
h> Jul 2 12:31:53 fw01B hastd: Connection from
h> tcp4://211.19.53.201:11542 to tcp4://211.19.53.206:8457.
h> Jul 2 12:31:53 fw01B hastd: tcp4://211.19.53.201:11542: resource=zfshast
h> Jul 2 12:31:53 fw01B hastd: [zfshast] (init) We act as init for the
h> resource and not as secondary as requested by
h> tcp4://211.19.53.201:11542.
h> Jul 2 12:31:53 fw01B kernel: Jul 2 12:31:53 fw01B hastd: [zfshast]
h> (init) We act as init for the resource and not as secondary as
h> requested by tcp4://211.19.53.201:11542.
h> Jul 2 12:31:58 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
h> Jul 2 12:31:58 fw01B hastd: Connection from
h> tcp4://211.19.53.201:49777 to tcp4://211.19.53.206:8457.
h> Jul 2 12:31:58 fw01B hastd: tcp4://211.19.53.201:49777: resource=zfshast
h> Jul 2 12:31:58 fw01B hastd: [zfshast] (init) We act as init for the
h> resource and not as secondary as requested by
h> tcp4://211.19.53.201:49777.
h> Jul 2 12:31:58 fw01B kernel: Jul 2 12:31:58 fw01B hastd: [zfshast]
h> (init) We act as init for the resource and not as secondary as
h> requested by tcp4://211.19.53.201:49777.
h> Jul 2 12:31:58 fw01B hastd: [zfshast] (init) Role changed to secondary.
h> Jul 2 12:32:03 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
h> Jul 2 12:32:03 fw01B hastd: Connection from
h> tcp4://211.19.53.201:17014 to tcp4://211.19.53.206:8457.
h> Jul 2 12:32:03 fw01B hastd: tcp4://211.19.53.201:17014: resource=zfshast
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Initial connection
h> from tcp4://211.19.53.201:17014.
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Incoming connection
h> from tcp4://211.19.53.201:17014 configured.
h> Jul 2 12:32:03 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
h> Jul 2 12:32:03 fw01B hastd: Connection from
h> tcp4://211.19.53.201:42420 to tcp4://211.19.53.206:8457.
h> Jul 2 12:32:03 fw01B hastd: tcp4://211.19.53.201:42420: resource=zfshast
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Outgoing connection
h> to tcp4://211.19.53.201:42420 configured.
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) hastd_secondary
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) calling init_local()
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) init_local
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Obtained info about
h> /dev/ad4p4.
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Locked /dev/ad4p4.
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) inside metadata.c
h> res->hr_role !=HAST_ROLE_PRIMAR
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) inside mettadata
h> secondary_localcnt: 1 secondary_remotecnt: 0
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) calling init_remote()
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) init_remote()
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) humhum secondary
h> local 1: secondary remote 0
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) init
h> hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_remotecnt
h> 0, primary_remotecnt 0
h> Jul 2 12:32:03 fw01B kernel: Jul 2 12:32:03 fw01B hastd: [zfshast]
h> (secondary) secondary_remotecnt 0, primary_remotecnt 0
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_localcnt
h> 1, primary_localcnt 1
h> Jul 2 12:32:03 fw01B kernel: Jul 2 12:32:03 fw01B hastd: [zfshast]
h> (secondary) secondary_localcnt 1, primary_localcnt 1
h> Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Split-brain
h> detected, exiting.
h> Jul 2 12:32:03 fw01B kernel: Jul 2 12:32:03 fw01B hastd: [zfshast]
h> (secondary) Split-brain detected, exiting.

So you have:

  secondary localcnt: 1
  secondary remotecnt: 0
  primary localcnt: 1
  primary remotecnt: 0

This is a split-brain condition as described on the wiki: the primary's
localcnt is greater than the secondary's remotecnt (the primary [fw01A]
was modified while fw01B wasn't watching) and the secondary's localcnt
is greater than the primary's remotecnt (fw01B was modified while fw01A
wasn't watching).

h> Hope this logs can help you ! If you need to make me debug bit more,
h> give me some idea to check!

Actually, the logs you have provided are not very informative, as they
show the state after the bad things had already happened. It would be
more useful to look at the logs from both hosts from before the
split-brain occurred.
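To make the counter comparison above concrete, here is a minimal
sketch of the check in Python. It is not the hastd code: the Counters
class and the is_split_brain function are hypothetical names made up
for illustration, and the comments on what each counter means are my
reading of the wiki description quoted above. The values are the ones
from your log.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Counters one node keeps for a resource (illustrative names only)."""
    localcnt: int   # assumed: grows when this node is modified without the peer watching
    remotecnt: int  # assumed: this node's record of the peer's counter


def is_split_brain(primary: Counters, secondary: Counters) -> bool:
    """Return True when each side has changes the other side never saw."""
    primary_has_unseen = primary.localcnt > secondary.remotecnt
    secondary_has_unseen = secondary.localcnt > primary.remotecnt
    return primary_has_unseen and secondary_has_unseen


# Values reported in the fw01B log.
primary = Counters(localcnt=1, remotecnt=0)
secondary = Counters(localcnt=1, remotecnt=0)
print(is_split_brain(primary, secondary))  # prints True
```

With only one of the two inequalities true, one node is simply ahead
of the other and a one-way synchronization would be enough; it is the
combination of both that leaves no safe direction to synchronize in.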
I would recommend:

1) Configure HAST manually and make sure that both primary and
   secondary function properly and that data is synchronized between
   the nodes. Also make sure the clocks on both hosts are in sync
   (needed when comparing logs).

2) Reboot both servers so that your CARP/HAST setup starts
   automatically, and see what happens.

3) If primary and secondary are set automatically and the status is OK
   on both nodes, initiate a failover.

4) If split-brain is detected after the switch (or earlier), provide
   the logs from both nodes since the hosts were rebooted.

-- 
Mikolaj Golub