From: Bruce Evans
Date: Mon, 30 Mar 2015 12:53:41 +1100 (EST)
To: Ian Lepore
Cc: Hans Petter Selasky, Adrian Chadd, "src-committers@freebsd.org",
 "svn-src-all@freebsd.org", Gleb Smirnoff, "svn-src-head@freebsd.org",
 Fabien Thomas
Subject: Re: svn commit: r280759 - head/sys/netinet
Message-ID: <20150330110104.J1113@besplex.bde.org>
In-Reply-To: <1427666182.82583.4.camel@freebsd.org>
References: <201503271326.t2RDQxd3056112@svn.freebsd.org>
 <20150328083443.GV64665@FreeBSD.org> <20150328191629.GY64665@FreeBSD.org>
 <5517B433.5010508@selasky.org> <20150329210757.GA64665@FreeBSD.org>
 <1427666182.82583.4.camel@freebsd.org>

On Sun, 29 Mar 2015, Ian Lepore wrote:

> On Mon, 2015-03-30 at 00:07 +0300, Gleb Smirnoff wrote:
>> On Sun, Mar 29, 2015 at 08:16:46AM -0700, Adrian Chadd wrote:
>> A> On 29 March 2015 at 01:13, Hans Petter Selasky wrote:
>> A> > On 03/28/15 20:16, Gleb Smirnoff wrote:
>> A> >>
>> A> >> +uint16_t
>> A> >> +ip_newid(void)
>> A> >> +{
>> A> >> +
>> A> >> +	counter_u64_add(V_ip_id, 1);
>> A> >> +	return (htons((*(uint64_t *)zpcpu_get(V_ip_id)) & 0xffff));
>> A> >> +}
>> A> >
>> A> > Technically you would need to enter a critical section here, so that the
>> A> > current process doesn't get swapped to a different CPU between the counter
>> A> > add and the zpcpu_get.
>> A>
>> A> +10000 here.
>> A>
>> A> You can get preempted and shifted to another CPU at almost any point
>> A> in time you're not explicitly asking not to be. It's.. frustrating.
>>
>> What happens in case of the race is that one CPU will use counter of the
>> other CPU, reusing the ID that was just used by the other CPU. And this
>> isn't a fatal race, it is just reuse of ID. The probability of such event
>> is much lower than probability of a collision due to overflowing uint16_t.

It's more confusing (and probably more harmless) than that. When there
is a context switch, there are lots of atomic ops and delays which might
sync the memory system. Otherwise, the reader on a different CPU might
see stale data instead of the value that the same thread just wrote.
Then the read is non-atomic except possibly on 64-bit systems. It may
see a mixture of old and new data. But only 16 bits are used, and on
non-exotic systems these bits will be in the same memory access unit, so
the result may be new or old but not a mixture. Code that runs
unsynchronized can never depend on getting an in-order result. This is
correct iff the order is unimportant.
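
A minimal sketch of the critical-section variant suggested above, assuming
the stock critical_enter()/critical_exit() KPI from <sys/systm.h>; this
only illustrates the idea and is not the committed code:

uint16_t
ip_newid(void)
{
	uint16_t id;

	critical_enter();	/* no migration between the add and the read */
	counter_u64_add(V_ip_id, 1);
	id = htons((*(uint64_t *)zpcpu_get(V_ip_id)) & 0xffff);
	critical_exit();
	return (id);
}

Pinning the thread this way only guarantees that the value read comes from
the same per-CPU slot that was just incremented; it adds no memory
barriers, so it changes nothing for observers on other CPUs.
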
>> For example, if we emit 1 Mpps, then we overflow the ID counter 15 times
>> per second. Every ID is reused 15 times within this small interval of time,
>> which of course is larger than TTL. And we find this kinda acceptable.
>>
>> To illustrate the probability of 1 instruction race, I would suggest to
>> look at PCPU_ADD() implementation for all arches except amd64/i386 and
>> find out that it is prone to this race. Hitting the race in PCPU_ADD()
>> will usually lead to mangling vm_meter statistics. Did we ever observe
>> this in practice? No.

Ugh, PCPU_ADD() is broken on i386 too. It violates its own stated
requirement that the update be atomic with respect to interrupts. i386
can only naturally support up to 32-bit atomic accesses, but broken MI
code assumes that all arches support 64-bit accesses. i386 version:

X /*
X  * Adds a value of the per-cpu counter name.  The implementation
X  * must be atomic with respect to interrupts.
X  */

But this implementation is not atomic with respect to interrupts.

X #define	__PCPU_ADD(name, val) do {				\
X 	__pcpu_type(name) __val;				\
X 	struct __s {						\
X 		u_char	__b[MIN(sizeof(__val), 4)];		\
X 	} __s;							\
X 								\
X 	__val = (val);						\
X 	if (sizeof(__val) == 1 || sizeof(__val) == 2 ||		\
X 	    sizeof(__val) == 4) {				\
X 		__s = *(struct __s *)(void *)&__val;		\
X 		__asm __volatile("add %1,%%fs:%0"		\
X 		    : "=m" (*(struct __s *)(__pcpu_offset(name)))	\
X 		    : "r" (__s));				\

This part uses asm partly to ensure that a suitable atomic instruction
is used.

X 	} else							\
X 		*__PCPU_PTR(name) += __val;			\

This part handles 64-bit integers. Since i386 is 32 bits and SSE is
unavailable, the compiler has to do a bunch of non-atomic 32-bit
accesses, and without a compiler barrier it can do these in ways that
are especially bad for atomicity.

X } while (0)

PCPU_INC() is similarly broken. PCPU_GET() and PCPU_SET() do similar
non-atomic accesses, but never claimed that they were atomic with
respect to interrupts. PCPU_PTR() is racy as designed.

vmmeter uses a *(type_t *)&pcpu->pc_bar hack like the above. This has
smaller atomicity problems because type_t is always int (vmmeter only
supports u_int). Perhaps the i386 pcpu accesses are not so broken after
all, with most MI code only doing 32-bit accesses.

As you know, I don't like the implementation of counter64, but it is
missing some of the above problems. On amd64, it uses a simple asm with
addq; PCPU_ADD() would do the same, except that it lacks a compiler
barrier. On i386, to avoid the above bug it uses a slow method using
cmpxchg8b if available, else a slower method using critical sections.
Most other arches use the critical section method. Even sparc64.
sparc64's PCPU_*() is even sloppier.

Then the access methods are sloppy, as for vmmeter. kib wrote XXX
comments about only some of them. i386 seems to be more careful -- it
uses a complicated rendezvous method or critical section where needed.
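
The critical-section fallback mentioned above has roughly this shape on a
32-bit arch (a sketch with a hypothetical name, assuming the counter_u64_t
and zpcpu_get() definitions from <sys/counter.h> and <sys/pcpu.h>, not the
actual sys/counter.h code):

/*
 * Sketch: a 64-bit per-CPU add made safe on a 32-bit arch by disabling
 * preemption, so the non-atomic pair of 32-bit read-modify-write
 * accesses cannot be torn by preemption or migration.
 */
static inline void
pcpu_counter_add_slow(counter_u64_t c, int64_t inc)
{

	critical_enter();
	*(uint64_t *)zpcpu_get(c) += inc;
	critical_exit();
}

This is slower than a single fs-relative add, which is why the cmpxchg8b
path is preferred when the CPU supports it.
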
> I was under the impression that PCPU_ADD(), like all pcpu r-m-w
> accesses, requires the calling code to have taken steps to ensure the
> code doesn't migrate during the sequence. If callers aren't doing so,
> that seems like a bug in the calling code, not the PCPU_ADD() code.

No, PCPU_ADD() is supposed to be atomic enough, and the caller is only
supposed to use it when it doesn't matter which CPU the counter is
incremented on. It is normal operation for the addition to occur on a
different CPU than the one that called PCPU_ADD(). Similarly for
counter_64. It doesn't matter where pure counters are accumulated,
especially since there are almost no ways to see the counts for
individual CPUs.

Bruce