Date:      Wed, 19 Aug 2015 18:20:18 +0200
From:      Damien Fleuriot <ml@my.gd>
To:        Freddie Cash <fjwcash@gmail.com>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>, Damien Fleuriot <dam@my.gd>
Subject:   Re: [POSSIBLE BUG] 10-STABLE CARP erroneously becomes master on boot
Message-ID:  <CAE63ME5tTuQ3tsQrsj86ujchtKk5bQycbaoqXiHjpgYTar2FPw@mail.gmail.com>
In-Reply-To: <CAE63ME4hLrVGCLwaXd4-44qkVYeQx=f6pkD%2BY78CdH6zt9nDSw@mail.gmail.com>
References:  <CAE63ME70yRFuTbVQnZ9w%2Byf2dZAQkxsdddUhTsqBtms_F%2BdibA@mail.gmail.com> <CAOjFWZ5YBEpWBUMDgmoPqkyUiuCR7QSaZg-bByizwYimXA4NUA@mail.gmail.com> <CAE63ME5030t%2BfDCLgmiY-qgJc36D%2Byq5nv0U6P4gPjUyW6HShw@mail.gmail.com> <CAE63ME4hLrVGCLwaXd4-44qkVYeQx=f6pkD%2BY78CdH6zt9nDSw@mail.gmail.com>

On 19 August 2015 at 11:29, Damien Fleuriot <ml@my.gd> wrote:

> Hello list, Freddie,
>
>
> I've been able to run extensive tests, the results of which I'm pasting
> below.
>
>
> First of all, I've been unable to replicate the problem in our
> preproduction and QA environments.
>
> The differences between our production and pre/QA envs are as follows :
> - we use link aggregation and VLAN tagging in production , we cannot
> replicate this in pre/QA because of limitations with KVM guests.
> - we use multiple CARP addresses with the same VHID on our public VLAN in
> production, we COULD replicate that in pre/QA if required.
>
>
> The context remains the following :
> - host A is supposed to be CARP MASTER, has advskew 20 and preempt
> - host B is supposed to be CARP BACKUP, has advskew 150 and preempt
> - host B assumes mastership if the CARPs are configured from rc.conf ,
> doesn't if they're set up manually after boot
>
> Note that these 2 boxes were upgraded from 8-STABLE to 10-STABLE.
> Host A runs 10.2-BETA1
> Host B runs 10.2-PRERELEASE
>
> I used the exact same versions in our pre/QA environments (one BETA1, one
> PRERELEASE from the same build) and couldn't replicate the issue.
>
>
>
> Now, on to the tests themselves.
>
>
> A/ Create a new CARP address with a new VHID, configure it in rc.conf and
> see if we get double MASTERS
> - on Host A : CARP created manually
> - on Host B : CARP created manually
> Host A is MASTER and Host B is BACKUP
>
> - on Host B : setup the new CARP in rc.conf , reboot
> Host A is MASTER and Host B is BACKUP , problem not replicated
>
>
> B/ Try with only one of our production addresses
> - on Host B : uncomment the production CARP address from rc.conf , reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
>
> C/ Try with the new CARP address + one of our production addresses ,
> different VHIDs
> - on Host B : uncomment the new CARP address from rc.conf , reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
>
> D/ Try the new syntax in rc.conf , as per Freddie's suggestion
> - on Host B : change the rc.conf syntax , reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
>
> E/ Try, just for the sake of it, to remove old files and libs on host B
> - on Host B : cd /usr/src ; yes | make delete-old ; yes | make
> delete-old-libs ; reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
>
> F/ Edit sysctls to disable CARP demotion on advertisement send errors
> - on Host A : sysctl net.inet.carp.senderr_demotion_factor=0
> - on Host B : set "net.inet.carp.senderr_demotion_factor=0" in sysctl.conf
> , reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
>
>
> Now after this F/ test, I'm thinking there's some voodoo going on here and
> it sure shows up :
> - on Host A, 'pfctl -si' shows 3k states
> - on Host B, 'pfctl -si' shows 800 states
>
> Now that would explain why my CARP gets demoted on Host B (as per man 4
> carp, pfsync failures result in a -240 demotion).
> It doesn't, however, explain why the demoted CARP chooses to remain in a
> MASTER state, or assumed MASTERship in the first place.
>
> Sure enough, I can find some CARP errors in between my reboots :
> messages:2420:2015-08-19T03:51:38.273600+00:00 pf1-gs kernel: carp:
> demoted by -240 to 0 (pfsync bulk fail)
> messages:2429:2015-08-19T03:56:37.178575+00:00 pf1-gs kernel: carp: VHID
> 110@vlan410: INIT -> BACKUP
> messages:2430:2015-08-19T03:56:40.664568+00:00 pf1-gs kernel: carp: VHID
> 110@vlan410: BACKUP -> MASTER (master down)
> messages:2637:2015-08-19T04:00:22.482071+00:00 pf1-gs kernel: carp: VHID
> 111@vlan410: BACKUP -> MASTER (master down)
> messages:2857:2015-08-19T04:04:02.330167+00:00 pf1-gs kernel: carp: VHID
> 110@vlan410: BACKUP -> MASTER (master down)
> messages:2877:2015-08-19T04:05:03.288199+00:00 pf1-gs kernel: carp:
> demoted by -240 to 0 (pfsync bulk fail)
> messages:3088:2015-08-19T04:08:48.961985+00:00 pf1-gs kernel: carp: VHID
> 110@vlan410: BACKUP -> MASTER (master down)
>
>
> Things I have not tested yet :
> - use /24 CARPs instead of /32s
> - switch my CARPs to all use different VHIDs
>
> Things I CANNOT test :
> - set up a *dedicated* pfsync link between the firewalls , they're in
> different DCs
> - set up a *dedicated* VLAN for pfsync , that would entail huge changes in
> our PCI-DSS environment
>
>
> After all these tests, I find myself in a situation where :
> - manually set up CARPs on Host B work fine , and pfsync works
> - CARPs from rc.conf on Host B result in MASTER-MASTER , and pfsync fails
>
>
>
>
> I must say I'm stuck here.
> The "master down" message is very confusing, when the firewalls *can* see
> each other.
> The "pfsync bulk fail" is rather interesting as well, since it doesn't
> occur when the CARPs are unconfigured, or set up manually without rc.conf.
> tcpdump -nei pflog0 does not show any dropped packet.
> PF is configured to "pass in quick" pfsync and CARP packets.
>
>
>
> I'm afraid a STFW hasn't helped overmuch here, although I'll try some more.
>
>
> If anyone's got a pointer, I'll bite.
>
> Cheers
>
>
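
The state transitions in the log excerpt quoted above can be tallied per
VHID with a small filter. This is only a sketch: the function name
carp_transitions is made up for illustration, and it assumes the kernel
messages keep the "carp: VHID <n>@<ifname>: <OLD> -> <NEW>" shape seen in
that excerpt, one message per line in the log file.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): count CARP state
# transitions per VHID/interface from syslog lines read on stdin.
carp_transitions() {
    # Keep only the state-change messages; "demoted by ..." lines are ignored.
    grep -E 'carp: VHID [0-9]+@[^:]+: [A-Z]+ -> [A-Z]+' |
    # Reduce each line to "<vhid>@<ifname> <OLD> -> <NEW>".
    sed -E 's/.*carp: VHID ([0-9]+@[^:]+): ([A-Z]+ -> [A-Z]+).*/\1 \2/' |
    # Group identical transitions and prefix each with its count.
    sort | uniq -c
}
```

It would be fed the log on stdin, for instance
"carp_transitions < /var/log/messages" (the log path is an assumption
about the local syslog setup). Repeated "BACKUP -> MASTER" entries for a
VHID on the box that is supposed to stay BACKUP are the symptom described
in this thread.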

FWIW, additional testing shows that on Host A (10.2-BETA1) CARP
advertisements are sent from the CARP IP itself.
As in, my Host A has physical IPs X and Y, but sends its CARP announcements
from CARP IP Z.

I have not been able to replicate this behaviour in our preproduction
environment where both -BETA and -PRERELEASE send the advertisements from
their physical IP.
These boxes however, have just the one IP, as opposed to the production
environment which has 2 physical IPs.


I will do some additional testing by swapping roles so that Host B
(running -PRERELEASE) becomes master, and see whether it sources its CARP
announcements from its physical IP (would be good) or from the CARP IP
itself (would be bad).
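
One way to see which address the advertisements are sourced from is to
capture them and count the source IPs. The sketch below is an
illustration under assumptions, not a tested procedure: the helper name
carp_adv_sources is invented, it assumes IPv4 advertisements sent to the
multicast group 224.0.0.18, and it relies on tcpdump's one-line
"src > dst:" output (no -v).

```shell
#!/bin/sh
# Hypothetical helper: read tcpdump output on stdin and count the
# source addresses of packets sent to the CARP/VRRP multicast group.
carp_adv_sources() {
    awk '{
        # Find the "<src> > 224.0.0.18:" triple and print <src>.
        for (i = 3; i <= NF; i++)
            if ($i == "224.0.0.18:" && $(i - 1) == ">")
                print $(i - 2)
    }' | sort | uniq -c
}

# Example invocation (needs root; vlan410 is the interface from this thread,
# and IP protocol 112 is the number CARP shares with VRRP):
#   tcpdump -l -n -e -i vlan410 -c 20 ip proto 112 | carp_adv_sources
```

If the only address in the output is the CARP IP rather than one of the
box's physical IPs, that matches the behaviour seen on Host A.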






>
> On 17 August 2015 at 18:38, Damien Fleuriot <ml@my.gd> wrote:
>
>>
>> On 17 August 2015 at 18:32, Freddie Cash <fjwcash@gmail.com> wrote:
>>
>>>
>>> On Aug 17, 2015 9:22 AM, "Damien Fleuriot" <ml@my.gd> wrote:
>>> >
>>> > Hello list,
>>> >
>>> >
>>> >
>>> > I'm seeing this very peculiar behaviour between 2 10-STABLE boxes.
>>> >
>>> > Host A is CARP Master with advskew 20 and runs 10.2-BETA1 from 10/07
>>> > Host B is CARP Backup with advskew 150 and runs 10.2-PRERELEASE from
>>> 12/08
>>> >
>>> >
>>> > When I configure CARP in rc.conf on host B, it becomes Master on boot,
>>> and
>>> > host A remains Master as well.
>>> > When I force a state change on host B (ifconfig vlanx vhid y state
>>> backup),
>>> > it transitions to Backup then again to Master.
>>> >
>>> > When I comment out the CARP configuration in rc.conf , and configure
>>> CARP
>>> > manually on host B's interfaces after it boots, it correctly becomes
>>> and
>>> > remains Backup.
>>> >
>>> >
>>> >
>>> > Below is the excerpt from rc.conf pertaining to CARP configuration, the
>>> > only difference between the 2 hosts being their advskew.
>>> >
>>> > Host A
>>> > == BEGIN
>>> >
>>> > ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 20 alias
>>> > 10.104.10.251/32"
>>> >
>>> > == END
>>> >
>>> > Host B
>>> > == BEGIN
>>> >
>>> > ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 150 alias
>>> > 10.104.10.251/32"
>>> >
>>> > == END
>>>
>>> Put the IP first, and the vhid stuff last in rc.conf for things to work
>>> the most reliably. And drop the extra alias.
>>>
>>> ifconfig_vlan410_alias0="inet 10.104.10.251/32 vhid 110 pass passhere
>>> advskew 150"
>>>
>>> CARP requires all IPs on an interface that are part of the same
>>> vhid to be listed (added) in the exact same order for the vhid to be
>>> considered "the same". That one trips me up all the time when manually
>>> adding an IP to a CARP pair, and then later rebooting one box as they both
>>> think they're master for that interface, while being a mix of master/backup
>>> for the other interfaces.
>>>
>>> Cheers,
>>> Freddie
>>> (running CARP on 2 10-CURRENT boxes and 2 10.1-p13 boxes)
>>>
>>
>> Cheers Freddie, will try and keep the thread up to date on the results.
>>
>>
>


