From owner-freebsd-pf@freebsd.org Thu Jan 24 20:38:10 2019 Return-Path: Delivered-To: freebsd-pf@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C930B14BDB89 for ; Thu, 24 Jan 2019 20:38:10 +0000 (UTC) (envelope-from kib@freebsd.org) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 54F6985F52; Thu, 24 Jan 2019 20:38:10 +0000 (UTC) (envelope-from kib@freebsd.org) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id x0OKc20N003875 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 24 Jan 2019 22:38:05 +0200 (EET) (envelope-from kib@freebsd.org) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua x0OKc20N003875 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id x0OKc2wj003874; Thu, 24 Jan 2019 22:38:02 +0200 (EET) (envelope-from kib@freebsd.org) X-Authentication-Warning: tom.home: kostik set sender to kib@freebsd.org using -f Date: Thu, 24 Jan 2019 22:38:02 +0200 From: Konstantin Belousov To: Andreas Longwitz Cc: freebsd-pf@freebsd.org, Gleb Smirnoff , Kristof Provost Subject: Re: rdr pass for proto tcp sometimes creates states with expire time zero and so breaking connections Message-ID: <20190124203802.GU24863@kib.kiev.ua> References: <5BC51424.5000309@incore.de> <5BD45882.1000207@incore.de> <5BEB3B9A.9080402@incore.de> <20181113222533.GJ9744@FreeBSD.org> <5C49ECAA.7060505@incore.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5C49ECAA.7060505@incore.de> User-Agent: Mutt/1.11.2 (2019-01-07) X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on tom.home X-BeenThere: freebsd-pf@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Technical discussion and general questions about packet filter \(pf\)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2019 20:38:11 -0000 [Keep me in Cc:] On Thu, Jan 24, 2019 at 05:49:46PM +0100, Andreas Longwitz wrote: > after some more long term research I have an educated guess whats going > on in this problem. > > The problem only occurs on i386. > > If I replace the counter_u64_fetch() call in pf_state_expires() by > the value of V_pf_status.states, then pf works without problems, the > expire time zero problem is gone: > > > --- pf.c.1st 2018-08-14 10:17:41.000000000 +0200 > +++ pf.c 2019-01-19 17:49:18.000000000 +0100 > @@ -1542,7 +1542,7 @@ > start = state->rule.ptr->timeout[PFTM_ADAPTIVE_START]; > if (start) { > end = state->rule.ptr->timeout[PFTM_ADAPTIVE_END]; > - states = counter_u64_fetch(state->rule.ptr->states_cur); > + states = V_pf_status.states; > } else { > start = V_pf_default_rule.timeout[PFTM_ADAPTIVE_START]; > end = V_pf_default_rule.timeout[PFTM_ADAPTIVE_END]; > > The use of counter_u64_fetch() looks a little bit at random for me. For > all states not associated with the pf_default_rule the value of > pf_status.states is used and for me this value is ok for all rules. > > Further the counter(9) framework was created for quick and lockless > write of counters, but fetching is more expansive. So I suggest to let > pf_state_expires() work without a counter fetch. > > Further I can confirm that the counter states_cur of the pf_default_rule > remains correct, when the patch given above is active. Without the patch > the counter on my main firewall machine gets slowly negative. I have > verified this with a lot of live DTrace and kgdb script debugging. > > >> OK, in the meantime I did some more research and I am now quite sure the > >> problem with the bogus pf_default_rule->states_cur counter is not a > >> problem in pf. I am convinced it is a problem in counter(9) on i386 > >> server. The critical code is the machine instruction cmpxchg8b used in > >> /sys/i386/include/counter.h. > >> > >> From intel instruction set reference manual: > >> Zhis instruction can be used with a LOCK prefix allow the instruction to > >> be executed atomically. > >> > >> We have two other sources in kernel using cmpxchg8b: > >> /sys/i386/include/atomic.h and > >> /sys/cddl/contrib/opensolaris/common/atomic/i386/opensolaris_atomic.S > > > > A single CPU instruction is atomic by definition, with regards to the CPU. > > A preemption can not happen in a middle of instruction. What the "lock" > > prefix does is memory locking to avoid unlocked parallel access to the > > same address by different CPUs. > > > > What is special about counter(9) is that %fs:%esi always points to a > > per-CPU address, because %fs is unique for every CPU and is constant, > > so no other CPU may write to this address, so lock prefix isn't needed. > > > > Of course a true SMP i386 isn't a well tested arch, so I won't assert > > that counter(9) doesn't have bugs on this arch. However, I don't see > > lock prefix necessary here. > > I think the problem is the cmpxchg8b instruction used in > counter_u64_fetch(), because this machine instruction always writes to > memory, also when we only want to read and have (EDX:EAX) = (ECX:EBX): > > TEMP64 <- DEST > IF (EDX:EAX = TEMP64) > THEN > ZF <- 1 > DEST <- ECX:EBX > ELSE > ZF <- 0 > EDX:EAX <- TEMP64 > DEST <- TEMP64 > FI > > If one CPU increments the counter in pf_create_state() and another does > the fetch, then both CPU's may run the xmpxschg8b at once with the > chance that both read the same memory value in TEMP64 and the fetching > CPU is the second CPU that writes and so the increment is lossed. Thats > what I see without the above patch two or three times a week. Please try the following patch. The idea is to make the value to compare with unlikely to be equal to the memory content, for fetch_one(). Also it fixes a silly bug in zero_one(). diff --git a/sys/i386/include/counter.h b/sys/i386/include/counter.h index 7fd26d2a960..aa20831ba18 100644 --- a/sys/i386/include/counter.h +++ b/sys/i386/include/counter.h @@ -78,6 +78,9 @@ counter_u64_read_one_8b(uint64_t *p) uint32_t res_lo, res_high; __asm __volatile( + "movl (%0),%%eax\n\t" + "movl 4(%0),%%edx\n\t" + "addl $0x10000000,%%edx\n\t" /* avoid write */ "movl %%eax,%%ebx\n\t" "movl %%edx,%%ecx\n\t" "cmpxchg8b (%2)" @@ -120,11 +123,11 @@ counter_u64_zero_one_8b(uint64_t *p) { __asm __volatile( +"\n1:\n\t" "movl (%0),%%eax\n\t" - "movl 4(%0),%%edx\n" + "movl 4(%0),%%edx\n\t" "xorl %%ebx,%%ebx\n\t" "xorl %%ecx,%%ecx\n\t" -"1:\n\t" "cmpxchg8b (%0)\n\t" "jnz 1b" :