Date: Wed, 19 Aug 2015 18:20:18 +0200
From: Damien Fleuriot <ml@my.gd>
To: Freddie Cash <fjwcash@gmail.com>
Cc: FreeBSD Stable <freebsd-stable@freebsd.org>, Damien Fleuriot <dam@my.gd>
Subject: Re: [POSSIBLE BUG] 10-STABLE CARP erroneously becomes master on boot
Message-ID: <CAE63ME5tTuQ3tsQrsj86ujchtKk5bQycbaoqXiHjpgYTar2FPw@mail.gmail.com>
In-Reply-To: <CAE63ME4hLrVGCLwaXd4-44qkVYeQx=f6pkD%2BY78CdH6zt9nDSw@mail.gmail.com>
References: <CAE63ME70yRFuTbVQnZ9w%2Byf2dZAQkxsdddUhTsqBtms_F%2BdibA@mail.gmail.com>
 <CAOjFWZ5YBEpWBUMDgmoPqkyUiuCR7QSaZg-bByizwYimXA4NUA@mail.gmail.com>
 <CAE63ME5030t%2BfDCLgmiY-qgJc36D%2Byq5nv0U6P4gPjUyW6HShw@mail.gmail.com>
 <CAE63ME4hLrVGCLwaXd4-44qkVYeQx=f6pkD%2BY78CdH6zt9nDSw@mail.gmail.com>
On 19 August 2015 at 11:29, Damien Fleuriot <ml@my.gd> wrote:

> Hello list, Freddie,
>
> I've been able to run extensive tests, the results of which I'm pasting
> below.
>
> First of all, I've been unable to replicate the problem in our
> preproduction and QA environments.
>
> The differences between our production and pre/QA envs are as follows :
> - we use link aggregation and VLAN tagging in production , we cannot
>   replicate this in pre/QA because of limitations with KVM guests.
> - we use multiple CARP addresses with the same VHID on our public VLAN in
>   production, we COULD replicate that in pre/QA if required.
>
> The context remains the following :
> - host A is supposed to be CARP MASTER, has advskew 20 and preempt
> - host B is supposed to be CARP BACKUP, has advskew 150 and preempt
> - host B assumes mastership if the CARPs are configured from rc.conf ,
>   doesn't if they're set up manually after boot
>
> Note that these 2 boxes were upgraded from 8-STABLE to 10-STABLE.
> Host A runs 10.2-BETA1
> Host B runs 10.2-PRERELEASE
>
> I used the exact same versions in our pre/QA environments (one BETA1, one
> PRERELEASE from the same build) and couldn't replicate the issue.
>
> Now, on to the tests themselves.
>
> A/ Create a new CARP address with a new VHID, configure it in rc.conf and
> see if we get double MASTERS
> - on Host A : CARP created manually
> - on Host B : CARP created manually
> Host A is MASTER and Host B is BACKUP
>
> - on Host B : setup the new CARP in rc.conf , reboot
> Host A is MASTER and Host B is BACKUP , problem not replicated
>
> B/ Try with only one of our production addresses
> - on Host B : uncomment the production CARP address from rc.conf , reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
> C/ Try with the new CARP address + one of our production addresses ,
> different VHIDs
> - on Host B : uncomment the new CARP address from rc.conf , reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
> D/ Try the new syntax in rc.conf , as per Freddie's suggestion
> - on Host B : change the rc.conf syntax , reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
> E/ Try, just for the sake of it, to remove old files and libs on host B
> - on Host B : cd /usr/src ; yes | make delete-old ; yes | make delete-old-libs ; reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
> F/ Edit sysctls to disable CARP demotion on advertisement send errors
> - on Host A : sysctl net.inet.carp.senderr_demotion_factor=0
> - on Host B : set "net.inet.carp.senderr_demotion_factor=0" in sysctl.conf , reboot
> Host A is MASTER and Host B is MASTER
> Host A shows net.inet.carp.demotion=0
> Host B shows net.inet.carp.demotion=240
>
> Now after this F/ test, I'm thinking there's some voodoo going on here and
> it sure shows up :
> - on Host A, 'pfctl -si' shows 3k states
> - on Host B, 'pfctl -si' shows 800 states
>
> Now that would explain why my CARP
> gets demoted on Host B (as per man 4 carp, pfsync failures result in a -240 demotion).
> It doesn't, however, explain why the demoted CARP chooses to remain in a
> MASTER state, or assumed MASTERship in the first place.
>
> Surely enough, I can find some CARP errors in-between my reboots :
> messages:2420:2015-08-19T03:51:38.273600+00:00 pf1-gs kernel: carp: demoted by -240 to 0 (pfsync bulk fail)
> messages:2429:2015-08-19T03:56:37.178575+00:00 pf1-gs kernel: carp: VHID 110@vlan410: INIT -> BACKUP
> messages:2430:2015-08-19T03:56:40.664568+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
> messages:2637:2015-08-19T04:00:22.482071+00:00 pf1-gs kernel: carp: VHID 111@vlan410: BACKUP -> MASTER (master down)
> messages:2857:2015-08-19T04:04:02.330167+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
> messages:2877:2015-08-19T04:05:03.288199+00:00 pf1-gs kernel: carp: demoted by -240 to 0 (pfsync bulk fail)
> messages:3088:2015-08-19T04:08:48.961985+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
>
> Things I have not tested yet :
> - use /24 CARPs instead of /32s
> - switch my CARPs to all use different VHIDs
>
> Things I CANNOT test :
> - set up a *dedicated* pfsync link between the firewalls , they're in
>   different DCs
> - set up a *dedicated* VLAN for pfsync , that would entail huge changes in
>   our PCI-DSS environment
>
> After all these tests, I find myself in a situation where :
> - manually set up CARPs on Host B work fine , and pfsync works
> - CARPs from rc.conf on Host B result in MASTER-MASTER , and pfsync fails
>
> I must say I'm stuck here.
> The "master down" message is very confusing, when the firewalls *can* see
> each other.
> The "pfsync bulk fail" is rather interesting as well, since it doesn't
> occur when the CARPs are unconfigured, or set up manually without rc.conf.
> tcpdump -nei pflog0 does not show any dropped packet.
> PF is configured to "pass in quick" pfsync and CARP packets.
>
> I'm afraid a STFW hasn't helped overmuch here, although I'll try some more.
>
> If anyone's got a pointer, I'll bite.
>
> Cheers

FWIW, additional testing shows that on Host A (10.2-BETA) CARP
advertisements are sent from the CARP IP itself.
As in, my Host A has physical IPs X and Y, but sends its CARP
advertisements from CARP IP Z.

I have not been able to replicate this behaviour in our preproduction
environment, where both -BETA and -PRERELEASE send the advertisements from
their physical IP.
These boxes, however, have just the one IP, as opposed to the production
environment, which has 2 physical IPs.

I will do some additional testing by swapping roles so that Host B (running
-PRERELEASE) becomes master, and see whether it sends its CARP
advertisements from its physical IP (would be good) or from the CARP IP
itself (would be bad).

> On 17 August 2015 at 18:38, Damien Fleuriot <ml@my.gd> wrote:
>
>> On 17 August 2015 at 18:32, Freddie Cash <fjwcash@gmail.com> wrote:
>>
>>> On Aug 17, 2015 9:22 AM, "Damien Fleuriot" <ml@my.gd> wrote:
>>> >
>>> > Hello list,
>>> >
>>> > I'm seeing this very peculiar behaviour between 2 10-STABLE boxes.
>>> >
>>> > Host A is CARP Master with advskew 20 and runs 10.2-BETA1 from 10/07
>>> > Host B is CARP Backup with advskew 150 and runs 10.2-PRERELEASE from 12/08
>>> >
>>> > When I configure CARP in rc.conf on host B, it becomes Master on boot, and
>>> > host A remains Master as well.
>>> > When I force a state change on host B (ifconfig vlanx vhid y state backup),
>>> > it transitions to Backup then again to Master.
>>> >
>>> > When I comment out the CARP configuration in rc.conf , and configure CARP
>>> > manually on host B's interfaces after it boots, it correctly becomes and
>>> > remains Backup.
>>> >
>>> > Below is the excerpt from rc.conf pertaining to CARP configuration, the
>>> > only difference between the 2 hosts being their advskew.
>>> >
>>> > Host A
>>> > == BEGIN
>>> >
>>> > ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 20 alias 10.104.10.251/32"
>>> >
>>> > == END
>>> >
>>> > Host B
>>> > == BEGIN
>>> >
>>> > ifconfig_vlan410_alias0="vhid 110 pass passhere advskew 150 alias 10.104.10.251/32"
>>> >
>>> > == END
>>>
>>> Put the IP first, and the vhid stuff last in rc.conf for things to work
>>> the most reliably. And drop the extra alias.
>>>
>>> ifconfig_vlan410_alias0="inet 10.104.10.251/32 vhid 110 pass passhere advskew 150"
>>>
>>> CARP requires that all IPs on an interface that are part of the same
>>> vhid to be listed (added) in the exact same order for the vhid to be
>>> considered "the same". That one trips me up all the time when manually
>>> adding an IP to a CARP pair, and then later rebooting one box as they both
>>> think they're master for that interface, while being a mix of master/backup
>>> for the other interfaces.
>>>
>>> Cheers,
>>> Freddie
>>> (running CARP on 2 10-CURRENT boxes and 2 10.1-p13 boxes)
>>
>> Cheers Freddie, will try and keep the thread up to date on the results.
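
For anyone wanting to line up the CARP state changes from the kernel log quoted earlier in this message, here is a minimal sketch in Python. It only parses the log lines exactly as they were pasted (including the grep-style "messages:NNNN:" prefix) and reports how long a VHID stayed in BACKUP before going MASTER. The regular expression and the helper names parse_transitions and backup_durations are illustrative assumptions, not part of any FreeBSD tool, and the pattern is only known to match the line format shown in this thread.

```python
import re
from datetime import datetime

# Kernel log lines as quoted in the message (grep prefix included).
LOG = """\
messages:2420:2015-08-19T03:51:38.273600+00:00 pf1-gs kernel: carp: demoted by -240 to 0 (pfsync bulk fail)
messages:2429:2015-08-19T03:56:37.178575+00:00 pf1-gs kernel: carp: VHID 110@vlan410: INIT -> BACKUP
messages:2430:2015-08-19T03:56:40.664568+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
messages:2637:2015-08-19T04:00:22.482071+00:00 pf1-gs kernel: carp: VHID 111@vlan410: BACKUP -> MASTER (master down)
messages:2857:2015-08-19T04:04:02.330167+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
messages:2877:2015-08-19T04:05:03.288199+00:00 pf1-gs kernel: carp: demoted by -240 to 0 (pfsync bulk fail)
messages:3088:2015-08-19T04:08:48.961985+00:00 pf1-gs kernel: carp: VHID 110@vlan410: BACKUP -> MASTER (master down)
"""

# Assumed line shape: <timestamp>+00:00 <host> kernel: carp: VHID <n>@<if>: <OLD> -> <NEW>
TRANSITION = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)\+00:00 \S+ kernel: carp: "
    r"VHID (?P<vhid>\d+)@(?P<ifname>\S+): (?P<old>\w+) -> (?P<new>\w+)"
)


def parse_transitions(text):
    """Return (timestamp, vhid, ifname, old_state, new_state) for each state change."""
    events = []
    for line in text.splitlines():
        m = TRANSITION.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            events.append(
                (ts, int(m.group("vhid")), m.group("ifname"), m.group("old"), m.group("new"))
            )
    return events


def backup_durations(events):
    """Seconds spent in BACKUP before each BACKUP -> MASTER change, where the
    entry into BACKUP is also present in the log."""
    entered = {}
    durations = []
    for ts, vhid, ifname, old, new in events:
        key = (vhid, ifname)
        if new == "BACKUP":
            entered[key] = ts
        elif old == "BACKUP" and new == "MASTER" and key in entered:
            durations.append((vhid, ifname, (ts - entered.pop(key)).total_seconds()))
    return durations


if __name__ == "__main__":
    for vhid, ifname, seconds in backup_durations(parse_transitions(LOG)):
        print(f"VHID {vhid}@{ifname}: {seconds:.3f}s in BACKUP before MASTER")
```

On the excerpt above, only the first VHID 110 transition has both endpoints logged, and the sketch reports about 3.49 seconds between INIT -> BACKUP and BACKUP -> MASTER. The later "master down" lines have no matching entry into BACKUP in the pasted excerpt, so they are skipped rather than guessed at.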