From: Andreas Longwitz
Date: Thu, 24 Jan 2019 17:49:46 +0100
To: freebsd-pf@freebsd.org
Subject: Re: rdr pass for proto tcp sometimes creates states with expire time zero and so breaking connections

After some more long-term research I have an educated guess about what is
going on in this problem. The problem only occurs on i386. If I replace the
counter_u64_fetch() call in pf_state_expires() with the value of
V_pf_status.states, then pf works without problems and the expire-time-zero
problem is gone:

--- pf.c.1st	2018-08-14 10:17:41.000000000 +0200
+++ pf.c	2019-01-19 17:49:18.000000000 +0100
@@ -1542,7 +1542,7 @@
 	start = state->rule.ptr->timeout[PFTM_ADAPTIVE_START];
 	if (start) {
 		end = state->rule.ptr->timeout[PFTM_ADAPTIVE_END];
-		states = counter_u64_fetch(state->rule.ptr->states_cur);
+		states = V_pf_status.states;
 	} else {
 		start = V_pf_default_rule.timeout[PFTM_ADAPTIVE_START];
 		end = V_pf_default_rule.timeout[PFTM_ADAPTIVE_END];

The use of counter_u64_fetch() here looks somewhat arbitrary to me.
For all states not associated with the pf_default_rule the value of
pf_status.states is used, and to me this value seems fine for all rules.
Furthermore, the counter(9) framework was created for quick, lockless
writes to counters, while fetching is more expensive. So I suggest letting
pf_state_expires() work without a counter fetch.

I can also confirm that the states_cur counter of the pf_default_rule
remains correct while the patch given above is active. Without the patch,
the counter on my main firewall machine slowly goes negative. I have
verified this with a lot of live DTrace and kgdb script debugging.

>> OK, in the meantime I did some more research and I am now quite sure the
>> problem with the bogus pf_default_rule->states_cur counter is not a
>> problem in pf. I am convinced it is a problem in counter(9) on i386
>> server. The critical code is the machine instruction cmpxchg8b used in
>> /sys/i386/include/counter.h.
>>
>> From intel instruction set reference manual:
>> This instruction can be used with a LOCK prefix to allow the instruction to
>> be executed atomically.
>>
>> We have two other sources in kernel using cmpxchg8b:
>> /sys/i386/include/atomic.h and
>> /sys/cddl/contrib/opensolaris/common/atomic/i386/opensolaris_atomic.S
>
> A single CPU instruction is atomic by definition, with regards to the CPU.
> A preemption can not happen in a middle of instruction. What the "lock"
> prefix does is memory locking to avoid unlocked parallel access to the
> same address by different CPUs.
>
> What is special about counter(9) is that %fs:%esi always points to a
> per-CPU address, because %fs is unique for every CPU and is constant,
> so no other CPU may write to this address, so lock prefix isn't needed.
>
> Of course a true SMP i386 isn't a well tested arch, so I won't assert
> that counter(9) doesn't have bugs on this arch. However, I don't see
> lock prefix necessary here.
I think the problem is the cmpxchg8b instruction used in
counter_u64_fetch(), because this machine instruction always writes to
memory, even when we only want to read and have (EDX:EAX) = (ECX:EBX):

    TEMP64 <- DEST
    IF (EDX:EAX = TEMP64)
        THEN
            ZF <- 1
            DEST <- ECX:EBX
        ELSE
            ZF <- 0
            EDX:EAX <- TEMP64
            DEST <- TEMP64
    FI

If one CPU increments the counter in pf_create_state() and another does the
fetch, then both CPUs may run cmpxchg8b at the same time, with the chance
that both read the same memory value into TEMP64 and the fetching CPU is
the second one to write, so the increment is lost. That is what I see two
or three times a week without the above patch.

Andreas