From: Damien Fleuriot
To: Freddie Cash
Cc: FreeBSD Stable, Damien Fleuriot
Date: Wed, 19 Aug 2015 11:29:52 +0200
Subject: Re: [POSSIBLE BUG] 10-STABLE CARP erroneously becomes master on boot
X-BeenThere: freebsd-stable@freebsd.org
List-Id: Production branch of FreeBSD source code

Hello list, Freddie,

I've been able to run extensive tests, the results of which I'm pasting
below.

First of all, I've been unable to replicate the problem in our
preproduction and QA environments.

The differences between our production and pre/QA envs are as follows:
- we use link aggregation and VLAN tagging in production, we cannot
  replicate this in pre/QA because of limitations with KVM guests.
- we use multiple CARP addresses with the same VHID on our public VLAN in
  production, we COULD replicate that in pre/QA if required.

The context remains the following:
- host A is supposed to be CARP MASTER, has advskew 20 and preempt
- host B is supposed to be CARP BACKUP, has advskew 150 and preempt
- host B assumes mastership if the CARPs are configured from rc.conf,
  doesn't if they're set up manually after boot

Note that these 2 boxes were upgraded from 8-STABLE to 10-STABLE.
Host A runs 10.2-BETA1
Host B runs 10.2-PRERELEASE

I used the exact same versions in our pre/QA environments (one BETA1, one
PRERELEASE from the same build) and couldn't replicate the issue.

Now, on to the tests themselves.

A/ Create a new CARP address with a new VHID, configure it in rc.conf and
see if we get double MASTERS
- on Host A: CARP created manually
- on Host B: CARP created manually
Host A is MASTER and Host B is BACKUP
- on Host B: set up the new CARP in rc.conf, reboot
Host A is MASTER and Host B is BACKUP, problem not replicated

B/ Try with only one of our production addresses
- on Host B: uncomment the production CARP address from rc.conf, reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240

C/ Try with the new CARP address + one of our production addresses,
different VHIDs
- on Host B: uncomment the new CARP address from rc.conf, reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240

D/ Try the new syntax in rc.conf, as per Freddie's suggestion
- on Host B: change the rc.conf syntax, reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240

E/ Try, just for the sake of it, to remove old files and libs on host B
- on Host B: cd /usr/src ; yes | make delete-old ; yes | make
  delete-old-libs ; reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240

F/ Edit sysctls to disable CARP demotion on advertisement send errors
- on Host A: sysctl net.inet.carp.senderr_demotion_factor=0
- on Host B: set "net.inet.carp.senderr_demotion_factor=0" in sysctl.conf,
  reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240

Now after this F/ test, I'm thinking there's some voodoo going on here and
it sure shows up:
- on Host A, 'pfctl -si' shows 3k states
- on Host B, 'pfctl -si' shows 800 states

Now that would explain why my CARP gets demoted on Host B (as per man 4
carp, pfsync failures result in a -240 demotion).
It doesn't, however, explain why the demoted CARP chooses to remain in a
MASTER state, or why it assumed MASTERship in the first place.

Sure enough, I can find some CARP errors in between my reboots:

messages:2420:2015-08-19T03:51:38.273600+00:00 pf1-gs kernel: carp: demoted by -240 to 0 (pfsync bulk fail)
messages:2429:2015-08-19T03:56:37.178575+00:00 pf1-gs kernel: carp: VHID 110@vlan410: INIT -> BACKUP
messages:2430:2015-08-19T03:56:40.664568+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
messages:2637:2015-08-19T04:00:22.482071+00:00 pf1-gs kernel: carp: VHID 111@vlan410: BACKUP -> MASTER (master down)
messages:2857:2015-08-19T04:04:02.330167+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
messages:2877:2015-08-19T04:05:03.288199+00:00 pf1-gs kernel: carp: demoted by -240 to 0 (pfsync bulk fail)
messages:3088:2015-08-19T04:08:48.961985+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)

Things I have not tested yet:
- use /24 CARPs instead of /32s
- switch my CARPs to all use different VHIDs

Things I CANNOT test:
- set up a *dedicated* pfsync link between the firewalls, they're in
  different DCs
- set up a *dedicated* VLAN for pfsync, that would entail huge changes in
  our PCI-DSS environment

After all these tests, I find myself in a situation where:
- manually set up CARPs on Host B work fine, and pfsync works
- CARPs from rc.conf on Host B result in MASTER-MASTER, and pfsync fails

I must say I'm stuck here.

The "master down" message is very confusing, since the firewalls *can* see
each other.
The "pfsync bulk fail" is rather interesting as well, since it doesn't
occur when the CARPs are unconfigured, or set up manually without rc.conf.

tcpdump -nei pflog0 does not show any dropped packets.
PF is configured to "pass in quick" pfsync and CARP packets.

I'm afraid a STFW hasn't helped overmuch here, although I'll try some more.
If anyone's got a pointer, I'll bite.
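
For anyone wanting to compare their own logs, here is a minimal sketch that
tallies CARP state transitions per VHID. It assumes the kernel message
format seen in the excerpt above ("carp: VHID <n>@<ifname>: <OLD> -> <NEW>");
the function name carp_transitions is made up for illustration and is not
part of any FreeBSD tool.

```shell
#!/bin/sh
# Tally CARP state transitions per VHID from syslog-style lines on stdin.
# Assumption: messages look like
#   ... kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
# Lines that do not match (e.g. "carp: demoted by ...") are ignored.
carp_transitions() {
    awk '
        /carp: VHID [0-9]+@[^:]+: [A-Z]+ -> [A-Z]+/ {
            # Find the "VHID" token, then read the fields that follow it.
            for (i = 1; i <= NF; i++) {
                if ($i == "VHID") {
                    vhid = $(i + 1)              # e.g. "110@vlan410:"
                    sub(/:$/, "", vhid)          # drop the trailing colon
                    key = vhid " " $(i + 2) "->" $(i + 4)
                    count[key]++
                    break
                }
            }
        }
        END {
            # One line per (vhid, transition): "<count> <vhid> <OLD>-><NEW>"
            for (k in count) print count[k], k
        }
    ' | sort -k2,2 -k3,3
}
```

Typical use would be something like
"grep 'carp: VHID' /var/log/messages | carp_transitions" (the log path is an
assumption, adjust to your syslog setup). Fed the excerpt above, it should
report three BACKUP -> MASTER transitions for VHID 110, one for VHID 111,
and the single INIT -> BACKUP for VHID 110.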

Cheers

On 17 August 2015 at 18:38, Damien Fleuriot wrote:

> On 17 August 2015 at 18:32, Freddie Cash wrote:
>
>> On Aug 17, 2015 9:22 AM, "Damien Fleuriot" wrote:
>> >
>> > Hello list,
>> >
>> > I'm seeing this very peculiar behaviour between 2 10-STABLE boxes.
>> >
>> > Host A is CARP Master with advskew 20 and runs 10.2-BETA1 from 10/07
>> > Host B is CARP Backup with advskew 150 and runs 10.2-PRERELEASE from 12/08
>> >
>> > When I configure CARP in rc.conf on host B, it becomes Master on boot,
>> > and host A remains Master as well.
>> > When I force a state change on host B (ifconfig vlanx vhid y state
>> > backup), it transitions to Backup then again to Master.
>> >
>> > When I comment out the CARP configuration in rc.conf, and configure
>> > CARP manually on host B's interfaces after it boots, it correctly
>> > becomes and remains Backup.
>> >
>> > Below is the excerpt from rc.conf pertaining to CARP configuration, the
>> > only difference between the 2 hosts being their advskew.
>> >
>> > Host A
>> > == BEGIN
>> >
>> > ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 20 alias 10.104.10.251/32"
>> >
>> > == END
>> >
>> > Host B
>> > == BEGIN
>> >
>> > ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 150 alias 10.104.10.251/32"
>> >
>> > == END
>>
>> Put the IP first, and the vhid stuff last in rc.conf for things to work
>> the most reliably. And drop the extra alias.
>>
>> ifconfig_vlan410_alias0="inet 10.104.10.251/32 vhid 110 pass passhere advskew 150"
>>
>> CARP requires that all IPs on an interface that are part of the same vhid
>> to be listed (added) in the exact same order for the vhid to be considered
>> "the same". That one trips me up all the time when manually adding an IP to
>> a CARP pair, and then later rebooting one box as they both think they're
>> master for that interface, while being a mix of master/backup for the other
>> interfaces.
>>
>> Cheers,
>> Freddie
>> (running CARP on 2 10-CURRENT boxes and 2 10.1-p13 boxes)
>
> Cheers Freddie, will try and keep the thread up to date on the results.
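
The two rc.conf layouts quoted above differ only in argument order and the
trailing "alias" keyword. As a rough sketch, a string-only check for which
layout a given value uses could look like this. carp_alias_layout is a
hypothetical helper name, not part of the rc system, and it inspects the
text only: it says nothing about what ifconfig does with it (test D above
saw no change in behaviour after switching layouts).

```shell
#!/bin/sh
# Hypothetical helper: classify an ifconfig_<if>_aliasN value by layout.
# It only looks at the string; it does not touch any interface.
carp_alias_layout() {
    case "$1" in
        "inet "*" vhid "*)
            # Address comes first; flag a leftover "alias" keyword if any.
            case " $1 " in
                *" alias "*) echo "address-first, extra alias keyword" ;;
                *)           echo "address-first" ;;
            esac
            ;;
        "vhid "*)
            echo "vhid-first"
            ;;
        *)
            echo "unrecognised"
            ;;
    esac
}

# The two Host B values from this thread:
carp_alias_layout "vhid 110 pass passhere advskew 150 alias 10.104.10.251/32"
carp_alias_layout "inet 10.104.10.251/32 vhid 110 pass passhere advskew 150"
```

The first call prints "vhid-first" and the second "address-first".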