From owner-freebsd-fs@FreeBSD.ORG Fri Jul 2 05:16:11 2010 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 85AEC1065670; Fri, 2 Jul 2010 05:16:11 +0000 (UTC) (envelope-from hiroshi@soupacific.com) Received: from mail.soupacific.com (mail.soupacific.com [211.19.53.201]) by mx1.freebsd.org (Postfix) with ESMTP id 2C4748FC13; Fri, 2 Jul 2010 05:16:10 +0000 (UTC) Received: from [127.0.0.1] (unknown [192.168.1.239]) by mail.soupacific.com (Postfix) with ESMTP id 8DD1B6E9B1; Fri, 2 Jul 2010 05:08:03 +0000 (UTC) Message-ID: <4C2D7615.5070606@soupacific.com> Date: Fri, 02 Jul 2010 14:16:05 +0900 From: "hiroshi@soupacific.com" User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.4) Gecko/20100608 Thunderbird/3.1 MIME-Version: 1.0 To: Mikolaj Golub References: <4C139F9C.2090305@soupacific.com><86iq5oc82y.fsf@kopusha.home.net> <4C14215D.9090304@soupacific.com><20100613003635.GA60012@icarus.home.lan><20100613074921.GB1320@garage.freebsd.pl><4C149A5C.3070401@soupacific.com><20100613102401.GE1320@garage.freebsd.pl><86eigavzsg.fsf@kopusha.home.net><20100614095044.GH1721@garage.freebsd.pl><868w6hwt2w.fsf@kopusha.home.net><20100614153746.GN1721@garage.freebsd.pl><86zkyxvc4v.fsf@kopusha.home.net> <4C2C43D5.1080907@soupacific.com> <86mxubndrp.fsf@kopusha.home.net> In-Reply-To: <86mxubndrp.fsf@kopusha.home.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek Subject: Re: HAST and CARP X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 02 Jul 2010 05:16:11 -0000 Thanks for your info. > Doing 'hastctl create' on every switching is wrong. Note, after 'hastctl > create' hast metadata on the disk are lost and synchronization of all blocks That's I was afraid of it ! > Split-brain means that you or your scripts did something wrong: both nodes > acted as primary (either simultaneously or one then another but there was no > communication between them so both made changes to the data that was not > synced to another node). > > The easy way to get split-brain is to change the role on secondary to primary > without changing the role on the primary host, make some changes on the > secondary (acting as a primary) and change back its role to secondary. I checked that both node communication is established. And here is log without hastctl create. Seems ServerB once became MASTER, then back to BACKUP. This situation cause unhappy split-brain happened. hastctl dump shows prevrole: primary error debug los is this Jul 2 12:31:37 fw01B kernel: Clearing /tmp (X related). Jul 2 12:31:37 fw01B kernel: Updating motd: Jul 2 12:31:37 fw01B kernel: . Jul 2 12:31:37 fw01B kernel: Configuring syscons: Jul 2 12:31:37 fw01B kernel: blanktime Jul 2 12:31:37 fw01B kernel: . Jul 2 12:31:38 fw01B sm-mta[879]: gethostbyaddr(211.19.53.206) failed: 2 Jul 2 12:31:38 fw01B sm-mta[879]: gethostbyaddr(211.19.53.202) failed: 2 Jul 2 12:31:38 fw01B sm-mta[880]: starting daemon (8.14.4): SMTP+queueing@00:30:00 Jul 2 12:31:38 fw01B sm-msp-queue[884]: starting daemon (8.14.4): queueing@00:30:00 Jul 2 12:31:38 fw01B kernel: Starting cron. Jul 2 12:31:38 fw01B kernel: Starting background file system checks in 60 seconds. Jul 2 12:31:38 fw01B kernel: Jul 2 12:31:38 fw01B kernel: Fri Jul 2 12:31:38 UTC 2010 Jul 2 12:31:40 fw01B kernel: carp0: INIT -> BACKUP Jul 2 12:31:40 fw01B kernel: alc0: link state changed to UP Jul 2 12:31:40 fw01B kernel: carp0: 2 link states coalesced Jul 2 12:31:40 fw01B kernel: carp0: link state changed to DOWN Jul 2 12:31:43 fw01B login: login on ttyv0 as root Jul 2 12:31:43 fw01B login: ROOT LOGIN (root) ON ttyv0 Jul 2 12:31:43 fw01B kernel: Jul 2 12:31:43 fw01B login: ROOT LOGIN (root) ON ttyv0 Jul 2 12:31:48 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 12:31:48 fw01B hastd: Connection from tcp4://211.19.53.201:20070 to tcp4://211.19.53.206:8457. Jul 2 12:31:48 fw01B hastd: tcp4://211.19.53.201:20070: resource=zfshast Jul 2 12:31:48 fw01B hastd: [zfshast] (init) We act as init for the resource and not as secondary as requested by tcp4://211.19.53.201:20070. Jul 2 12:31:48 fw01B kernel: Jul 2 12:31:48 fw01B hastd: [zfshast] (init) We act as init for the resource and not as secondary as requested by tcp4://211.19.53.201:20070. Jul 2 12:31:53 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 12:31:53 fw01B hastd: Connection from tcp4://211.19.53.201:11542 to tcp4://211.19.53.206:8457. Jul 2 12:31:53 fw01B hastd: tcp4://211.19.53.201:11542: resource=zfshast Jul 2 12:31:53 fw01B hastd: [zfshast] (init) We act as init for the resource and not as secondary as requested by tcp4://211.19.53.201:11542. Jul 2 12:31:53 fw01B kernel: Jul 2 12:31:53 fw01B hastd: [zfshast] (init) We act as init for the resource and not as secondary as requested by tcp4://211.19.53.201:11542. Jul 2 12:31:58 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 12:31:58 fw01B hastd: Connection from tcp4://211.19.53.201:49777 to tcp4://211.19.53.206:8457. Jul 2 12:31:58 fw01B hastd: tcp4://211.19.53.201:49777: resource=zfshast Jul 2 12:31:58 fw01B hastd: [zfshast] (init) We act as init for the resource and not as secondary as requested by tcp4://211.19.53.201:49777. Jul 2 12:31:58 fw01B kernel: Jul 2 12:31:58 fw01B hastd: [zfshast] (init) We act as init for the resource and not as secondary as requested by tcp4://211.19.53.201:49777. Jul 2 12:31:58 fw01B hastd: [zfshast] (init) Role changed to secondary. Jul 2 12:32:03 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 12:32:03 fw01B hastd: Connection from tcp4://211.19.53.201:17014 to tcp4://211.19.53.206:8457. Jul 2 12:32:03 fw01B hastd: tcp4://211.19.53.201:17014: resource=zfshast Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Initial connection from tcp4://211.19.53.201:17014. Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Incoming connection from tcp4://211.19.53.201:17014 configured. Jul 2 12:32:03 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 12:32:03 fw01B hastd: Connection from tcp4://211.19.53.201:42420 to tcp4://211.19.53.206:8457. Jul 2 12:32:03 fw01B hastd: tcp4://211.19.53.201:42420: resource=zfshast Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Outgoing connection to tcp4://211.19.53.201:42420 configured. Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) hastd_secondary Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) calling init_local() Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) init_local Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Obtained info about /dev/ad4p4. Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Locked /dev/ad4p4. Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) inside metadata.c res->hr_role !=HAST_ROLE_PRIMAR Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) inside mettadata secondary_localcnt: 1 secondary_remotecnt: 0 Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) calling init_remote() Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) init_remote() Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) humhum secondary local 1: secondary remote 0 Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) init hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0 Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_remotecnt 0, primary_remotecnt 0 Jul 2 12:32:03 fw01B kernel: Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_remotecnt 0, primary_remotecnt 0 Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_localcnt 1, primary_localcnt 1 Jul 2 12:32:03 fw01B kernel: Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_localcnt 1, primary_localcnt 1 Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Split-brain detected, exiting. Jul 2 12:32:03 fw01B kernel: Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Split-brain detected, exiting. Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Worker process exited ungracefully (pid=979, exitcode=78). Jul 2 12:32:03 fw01B kernel: Jul 2 12:32:03 fw01B hastd: [zfshast] (secondary) Worker process exited ungracefully (pid=979, exitcode=78). Jul 2 12:32:08 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 12:32:08 fw01B hastd: Connection from tcp4://211.19.53.201:53033 to tcp4://211.19.53.206:8457. Jul 2 12:32:08 fw01B hastd: tcp4://211.19.53.201:53033: resource=zfshast Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Initial connection from tcp4://211.19.53.201:53033. Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Incoming connection from tcp4://211.19.53.201:53033 configured. Jul 2 12:32:08 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 12:32:08 fw01B hastd: Connection from tcp4://211.19.53.201:50656 to tcp4://211.19.53.206:8457. Jul 2 12:32:08 fw01B hastd: tcp4://211.19.53.201:50656: resource=zfshast Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Outgoing connection to tcp4://211.19.53.201:50656 configured. Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) hastd_secondary Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) calling init_local() Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) init_local Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Obtained info about /dev/ad4p4. Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Locked /dev/ad4p4. Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) inside metadata.c res->hr_role !=HAST_ROLE_PRIMAR Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) inside mettadata secondary_localcnt: 1 secondary_remotecnt: 0 Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) calling init_remote() Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) init_remote() Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) humhum secondary local 1: secondary remote 0 Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) init hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0 Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) secondary_remotecnt 0, primary_remotecnt 0 Jul 2 12:32:08 fw01B kernel: Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) secondary_remotecnt 0, primary_remotecnt 0 Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) secondary_localcnt 1, primary_localcnt 1 Jul 2 12:32:08 fw01B kernel: Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) secondary_localcnt 1, primary_localcnt 1 Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Split-brain detected, exiting. Jul 2 12:32:08 fw01B kernel: Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Split-brain detected, exiting. Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Worker process exited ungracefully (pid=980, exitcode=78). Jul 2 12:32:08 fw01B kernel: Jul 2 12:32:08 fw01B hastd: [zfshast] (secondary) Worker process exited ungracefully (pid=980, exitcode=78). Jul 2 12:32:13 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. When debuglog working properly. Jul 2 10:24:10 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 10:24:10 fw01B hastd: tcp4://211.19.53.201:26965: resource=zfshast Jul 2 10:24:15 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 10:24:15 fw01B hastd: tcp4://211.19.53.201:50280: resource=zfshast Jul 2 10:24:20 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 10:24:20 fw01B hastd: tcp4://211.19.53.201:27929: resource=zfshast Jul 2 10:24:25 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 10:24:25 fw01B hastd: tcp4://211.19.53.201:24357: resource=zfshast Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) Initial connection from tcp4://211.19.53.201:24357. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) Incoming connection from tcp4://211.19.53.201:24357 configured. Jul 2 10:24:25 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457. Jul 2 10:24:25 fw01B hastd: tcp4://211.19.53.201:18217: resource=zfshast Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) Outgoing connection to tcp4://211.19.53.201:18217 configured. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) hastd_secondary Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) calling init_local() Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) init_local Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) Obtained info about /dev/ad4p4. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) Locked /dev/ad4p4. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) inside metadata.c res->hr_role !=HAST_ROLE_PRIMAR Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) inside mettadata secondary_localcnt: 0 secondary_remotecnt: 0 Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) calling init_remote() Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) init_remote() Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) humhum secondary local 0: secondary remote 0 Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) init hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0 Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) recv: Taking free request. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) disk: Taking request. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) disk: No requests, waiting. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) recv: (0x8013ea2e0) Got request. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) send: Taking request. Jul 2 10:24:25 fw01B hastd: [zfshast] (secondary) send: No requests, waiting. Jul 2 10:24:26 fw01B hastd: [zfshast] (secondary) disk: (0x8013ea2e0) Moving request to the send queue. Jul 2 10:24:26 fw01B hastd: [zfshast] (secondary) disk: Taking request. Jul 2 10:24:26 fw01B hastd: [zfshast] (secondary) disk: No requests, waiting. hastctl role seconary xxx should reset some value of master to backup? Hope this logs can help you ! If you need to make me debug bit more, give me some idea to check! Thanks Hiroshi Katayama