Date:      Wed, 19 Aug 2015 11:29:52 +0200
From:      Damien Fleuriot <ml@my.gd>
To:        Freddie Cash <fjwcash@gmail.com>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>, Damien Fleuriot <dam@my.gd>
Subject:   Re: [POSSIBLE BUG] 10-STABLE CARP erroneously becomes master on boot
Message-ID:  <CAE63ME4hLrVGCLwaXd4-44qkVYeQx=f6pkD%2BY78CdH6zt9nDSw@mail.gmail.com>
In-Reply-To: <CAE63ME5030t%2BfDCLgmiY-qgJc36D%2Byq5nv0U6P4gPjUyW6HShw@mail.gmail.com>
References:  <CAE63ME70yRFuTbVQnZ9w%2Byf2dZAQkxsdddUhTsqBtms_F%2BdibA@mail.gmail.com> <CAOjFWZ5YBEpWBUMDgmoPqkyUiuCR7QSaZg-bByizwYimXA4NUA@mail.gmail.com> <CAE63ME5030t%2BfDCLgmiY-qgJc36D%2Byq5nv0U6P4gPjUyW6HShw@mail.gmail.com>

Hello list, Freddie,


I've been able to run extensive tests, the results of which I'm pasting
below.


First of all, I've been unable to replicate the problem in our
preproduction and QA environments.

The differences between our production and pre/QA envs are as follows:
- we use link aggregation and VLAN tagging in production; we cannot
replicate this in pre/QA because of limitations with KVM guests.
- we use multiple CARP addresses with the same VHID on our public VLAN in
production; we COULD replicate that in pre/QA if required.


The context remains the following:
- host A is supposed to be CARP MASTER, has advskew 20 and preempt
- host B is supposed to be CARP BACKUP, has advskew 150 and preempt
- host B assumes mastership if the CARPs are configured from rc.conf, but
doesn't if they're set up manually after boot

Note that these 2 boxes were upgraded from 8-STABLE to 10-STABLE.
Host A runs 10.2-BETA1
Host B runs 10.2-PRERELEASE

I used the exact same versions in our pre/QA environments (one BETA1, one
PRERELEASE from the same build) and couldn't replicate the issue.



Now, on to the tests themselves.


A/ Create a new CARP address with a new VHID, configure it in rc.conf and
see if we get two MASTERs
- on Host A: CARP created manually
- on Host B: CARP created manually
Host A is MASTER and Host B is BACKUP

- on Host B: set up the new CARP in rc.conf, reboot
Host A is MASTER and Host B is BACKUP, problem not replicated


B/ Try with only one of our production addresses
- on Host B: uncomment the production CARP address in rc.conf, reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240


C/ Try with the new CARP address + one of our production addresses,
different VHIDs
- on Host B: uncomment the new CARP address in rc.conf, reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240


D/ Try the new syntax in rc.conf, as per Freddie's suggestion
- on Host B: change the rc.conf syntax, reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240
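
For readers skimming the thread, the syntax change in test D amounts to the
following rc.conf edit on Host B. Both forms are taken from the quoted
messages further down; only the comments are added here, for illustration.

```sh
# /etc/rc.conf on Host B

# Original form: CARP parameters first, address last behind an extra "alias".
#ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 150 alias 10.104.10.251/32"

# Form suggested by Freddie: address first, CARP parameters last.
ifconfig_vlan410_alias0="inet 10.104.10.251/32 vhid 110 pass passhere advskew 150"
```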


E/ Try, just for the sake of it, to remove old files and libs on host B
- on Host B:
  cd /usr/src ; yes | make delete-old ; yes | make delete-old-libs ; reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240


F/ Edit sysctls to disable CARP demotion on advertisement send errors
- on Host A: sysctl net.inet.carp.senderr_demotion_factor=0
- on Host B: set "net.inet.carp.senderr_demotion_factor=0" in sysctl.conf,
reboot
Host A is MASTER and Host B is MASTER
Host A shows net.inet.carp.demotion=0
Host B shows net.inet.carp.demotion=240



Now after this F/ test, I'm thinking there's some voodoo going on here, and
sure enough it shows up:
- on Host A, 'pfctl -si' shows 3k states
- on Host B, 'pfctl -si' shows 800 states

Now that would explain why my CARP gets demoted on Host B (as per man 4
carp, pfsync failures result in a -240 demotion).
It doesn't, however, explain why the demoted CARP chooses to remain in a
MASTER state, or why it assumed MASTERship in the first place.

Sure enough, I can find some CARP errors in between my reboots:
messages:2420:2015-08-19T03:51:38.273600+00:00 pf1-gs kernel: carp: demoted by -240 to 0 (pfsync bulk fail)
messages:2429:2015-08-19T03:56:37.178575+00:00 pf1-gs kernel: carp: VHID 110@vlan410: INIT -> BACKUP
messages:2430:2015-08-19T03:56:40.664568+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
messages:2637:2015-08-19T04:00:22.482071+00:00 pf1-gs kernel: carp: VHID 111@vlan410: BACKUP -> MASTER (master down)
messages:2857:2015-08-19T04:04:02.330167+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
messages:2877:2015-08-19T04:05:03.288199+00:00 pf1-gs kernel: carp: demoted by -240 to 0 (pfsync bulk fail)
messages:3088:2015-08-19T04:08:48.961985+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
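
To make the sequence of events easier to read across reboots, a small
sh/awk sketch along these lines tallies the "demoted" messages and the
transitions to MASTER per VHID. It is only an illustration: the function
name carp_summary is made up here, and it assumes unwrapped syslog lines in
the kernel message format shown above (one event per line, as in
/var/log/messages rather than as wrapped by the mail client).

```sh
#!/bin/sh
# carp_summary: read syslog lines on stdin and print
#   "demoted N"        number of "carp: demoted by ..." messages
#   "master VHID@IF N" number of transitions to MASTER per VHID/interface
# The function name and the output format are illustrative only.
carp_summary() {
    awk '
        /carp: demoted by/ { demoted++ }
        /carp: VHID .* -> MASTER/ {
            # The token after "VHID" looks like "110@vlan410:".
            for (i = 1; i < NF; i++) {
                if ($i == "VHID") {
                    key = $(i + 1)
                    sub(/:$/, "", key)
                    to_master[key]++
                }
            }
        }
        END {
            printf "demoted %d\n", demoted
            for (key in to_master)
                printf "master %s %d\n", key, to_master[key]
        }
    ' | LC_ALL=C sort
}

# Usage (on the firewall itself):
#   carp_summary < /var/log/messages
```

Run against the seven events above, this should report two demotion
messages, three transitions to MASTER for VHID 110 and one for VHID 111, all
on vlan410.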


Things I have not tested yet:
- use /24 CARPs instead of /32s
- switch my CARPs to all use different VHIDs

Things I CANNOT test:
- set up a *dedicated* pfsync link between the firewalls; they're in
different DCs
- set up a *dedicated* VLAN for pfsync; that would entail huge changes in
our PCI-DSS environment


After all these tests, I find myself in a situation where:
- CARPs set up manually on Host B work fine, and pfsync works
- CARPs from rc.conf on Host B result in MASTER-MASTER, and pfsync fails




I must say I'm stuck here.
The "master down" message is very confusing, since the firewalls *can* see
each other.
The "pfsync bulk fail" is rather interesting as well, since it doesn't
occur when the CARPs are unconfigured, or set up manually without rc.conf.
tcpdump -nei pflog0 does not show any dropped packets.
PF is configured to "pass in quick" pfsync and CARP packets.
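
For what it's worth, the timing in the log is in the right range for a plain
advertisement timeout. Assuming the usual CARP rule that a backup declares
the master down after about 3 * advbase + advskew / 256 seconds without
hearing an advertisement, and assuming the default advbase of 1 (advbase is
not stated anywhere in this thread), Host B with advskew 150 would wait
roughly 3.6 seconds. The log shows VHID 110 going INIT -> BACKUP at
03:56:37.178 and BACKUP -> MASTER at 03:56:40.664, about 3.5 seconds later,
which suggests Host B did not receive (or did not accept) any advertisement
from Host A during that window. A minimal sketch of the arithmetic, with a
hypothetical function name:

```sh
#!/bin/sh
# master_down_timeout ADVBASE ADVSKEW
# Prints the approximate time, in seconds, that a CARP backup waits without
# advertisements before taking over, using the commonly documented rule
# 3 * advbase + advskew / 256. This is an assumption about the timer, not a
# statement about the exact 10-STABLE implementation.
master_down_timeout() {
    awk -v base="$1" -v skew="$2" \
        'BEGIN { printf "%.3f\n", 3 * base + skew / 256 }'
}

# Host B in this thread: advskew 150, advbase assumed to be the default 1.
master_down_timeout 1 150
```

If that reading is right, the interesting question becomes why Host A's
advertisements are not seen on vlan410 that early in the boot, rather than
why the election itself goes wrong.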



I'm afraid a STFW hasn't helped overmuch here, although I'll try some more.


If anyone's got a pointer, I'll bite.

Cheers


On 17 August 2015 at 18:38, Damien Fleuriot <ml@my.gd> wrote:

>
> On 17 August 2015 at 18:32, Freddie Cash <fjwcash@gmail.com> wrote:
>
>>
>> On Aug 17, 2015 9:22 AM, "Damien Fleuriot" <ml@my.gd> wrote:
>> >
>> > Hello list,
>> >
>> >
>> >
>> > I'm seeing this very peculiar behaviour between 2 10-STABLE boxes.
>> >
>> > Host A is CARP Master with advskew 20 and runs 10.2-BETA1 from 10/07
>> > Host B is CARP Backup with advskew 150 and runs 10.2-PRERELEASE from 12/08
>> >
>> >
>> > When I configure CARP in rc.conf on host B, it becomes Master on boot,
>> > and host A remains Master as well.
>> > When I force a state change on host B (ifconfig vlanx vhid y state
>> > backup), it transitions to Backup then again to Master.
>> >
>> > When I comment out the CARP configuration in rc.conf, and configure CARP
>> > manually on host B's interfaces after it boots, it correctly becomes and
>> > remains Backup.
>> >
>> >
>> >
>> > Below is the excerpt from rc.conf pertaining to CARP configuration, the
>> > only difference between the 2 hosts being their advskew.
>> >
>> > Host A
>> > == BEGIN
>> >
>> > ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 20 alias 10.104.10.251/32"
>> >
>> > == END
>> >
>> > Host B
>> > == BEGIN
>> >
>> > ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 150 alias 10.104.10.251/32"
>> >
>> > == END
>>
>> Put the IP first, and the vhid stuff last in rc.conf for things to work
>> the most reliably. And drop the extra alias.
>>
>> ifconfig_vlan410_alias0="inet 10.104.10.251/32 vhid 110 pass passhere advskew 150"
>>
>> CARP requires that all IPs on an interface that are part of the same vhid
>> to be listed (added) in the exact same order for the vhid to be considered
>> "the same". That one trips me up all the time when manually adding an IP to
>> a CARP pair, and then later rebooting one box as they both think they're
>> master for that interface, while being a mix of master/backup for the other
>> interfaces.
>>
>> Cheers,
>> Freddie
>> (running CARP on 2 10-CURRENT boxes and 2 10.1-p13 boxes)
>>
>
> Cheers Freddie, will try and keep the thread up to date on the results.
>
>


