From owner-freebsd-net@FreeBSD.ORG Sun Jul 17 12:14:36 2011 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7E3D9106564A for ; Sun, 17 Jul 2011 12:14:36 +0000 (UTC) (envelope-from freebsd-net@m.gmane.org) Received: from lo.gmane.org (lo.gmane.org [80.91.229.12]) by mx1.freebsd.org (Postfix) with ESMTP id EE45E8FC12 for ; Sun, 17 Jul 2011 12:14:35 +0000 (UTC) Received: from list by lo.gmane.org with local (Exim 4.69) (envelope-from ) id 1QiQ0S-0004Yb-KN for freebsd-net@freebsd.org; Sun, 17 Jul 2011 13:59:32 +0200 Received: from 208.88.188.90.adsl.tomsknet.ru ([90.188.88.208]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Sun, 17 Jul 2011 13:59:32 +0200 Received: from vadim_nuclight by 208.88.188.90.adsl.tomsknet.ru with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Sun, 17 Jul 2011 13:59:32 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-net@freebsd.org From: Vadim Goncharov Date: Sun, 17 Jul 2011 11:59:17 +0000 (UTC) Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel Lines: 159 Message-ID: References: <20110714154457.GI70776@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@dough.gmane.org X-Gmane-NNTP-Posting-Host: 208.88.188.90.adsl.tomsknet.ru X-Comment-To: Gleb Smirnoff User-Agent: slrn/0.9.9p1 (FreeBSD) Subject: Re: m_pkthdr.rcvif dangling pointer problem X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: vadim_nuclight@mail.ru List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 17 Jul 2011 12:14:36 -0000 Hi Gleb Smirnoff! On Thu, 14 Jul 2011 19:44:57 +0400; Gleb Smirnoff wrote about 'm_pkthdr.rcvif dangling pointer problem': > 1) Every m_pkthdr.rcvif should if_ref() the interface. Releasing > references can be done in the mbuf allocator: mb_dtor_mbuf(), > mb_dtor_pack(). I'm afraid this approach would degrate performance, > since adding at least a couple of atomic ops on every mbuf for its > lifetime. That is +2 atomic ops per packet. Conceptually, this is the most correct solution, though not so fast. Ways to improve are to be found from this starting point. However, are that +2 atomic ops per packet really so expensive? How many of atomic ops are already on that path? Any measures? > 2) kib@ suggested to allocate ifnets from a UMA_ZONE_NOFREE zone. > I've made a compilable & working patch: > http://people.freebsd.org/~glebius/patches/ifnet.no_free > But on second though I find this a bad idea, this is just fooling > of INVARIANTS. Yes, we avoid thrashing of freed memory and rewriting > it by some other kernel allocation. But still out pointer point to > invalid ifnet. Even, if we make a check for IFF_DYING flag, we still > can not guarantee that an interface had been re-allocated for a new > instance. This would be not a panic condition, but subtle bugs in > firewalls. Indeed, for long-term solution this is not right approach. > 3) As we now have a straight if_index table that can grow, what about > storing the if_index in the m_pkthdr? Lookup of interface by index > is fast enough if done lockless. Doing it lockless isn't perfect, but > better than current pointer dereferncing. Optionally it could be > done with locking and with putting a reference. To avoid situation > with with getting to a re-allocated interface with the same index, > we can use a unique cookie, that is incremented in if_alloc(). > Smth like: > struct ifnet * > mbuf_rcvif(struct mbuf *m) > { [...] > ifp = ifnet_byindex_locked(m->m_pkthdr.rcvif_idx); [...] > The size of struct mbuf isn't going to change on amd64: > @@ -111,7 +111,8 @@ > * Record/packet header in first mbuf of chain; valid only if M_PKTHDR is set. > */ > struct pkthdr { > - struct ifnet *rcvif; /* rcv interface */ > + uint32_t rcvif_idx; /* rcv interface index */ > + uint32_t rcvif_unique; /* rcv interface unique id */ > A proof-of-concept patch is available here: > http://people.freebsd.org/~glebius/patches/pkthdr.rcvif_idx > It doesn't cover entire kernel, LINT won't compile. It covers kern, netinet, > netinet6, several interface drivers and netgraph. > One of the problems is ongoing namespace pollution: not all code that include > mbuf.h includes if_var.h and vice versa. I've cut ifnet from m_devget(), but > my new function do declare it in parameter list :(. To deal with that I had > to declare functions in mbuf.h but implement them in if.c. > Other problem is net80211 code that abuses the rcvif pointer in mbuf packet > header for its own purposes casting it to private type. This can be fixed > utilizing mbuf_tags(9), I think. I haven't made this yet, that's why LINT > doesn't compile. This solution has several other drawbacks: 1) It slows down rcvif dereference on _every_ ifnet lookup during packet lifetime. That's not just +2 ops but there could be many of them. 2) It requires converting *all* consumers of rcvif pointer in the kernel. Solution requiring to change everything are usually smell bad, could introduce bugs, and at least one case already found - net80211. 3) It comlicates not only the code, but requires additional 4 bytes of pkthdr on non-64bit architectures (i386, MIPS, ARM?) > Any alternative ideas on solving the problem are welcome! Yes, I have one, a somewhat hybrid of solutions 1 and 3. First, we leave all mbuf consumers and rcvif pointer as is. We then add two new fields: struct pkthdr { + uint16_t gencount; /* iftable version */ + uint16_t fib; /* may be reserve some bits? */ We need only 16 bits, and fib must be moved from mbuf flags anyway. I'm not sure this should be all 16 bits - we may want to use e.g. 12 bits for FIB number and 4 bits for M_HASHTYPEBITS which was placed to flags with the same hack as M_FIB ?.. This should be thought, too, but let's continue with our problem. So, we have a global variable: int iftable_gencount; /* generation counter */ this should be updated (atomically) on every operation with interfaces table, that is, on every interface departure and arrival. And when an interface is created, it gets: struct ifnet { + int gencount; /* iftable version */ at the moment of creation. This valuse then doesn't change. After then, when a new mbuf w/pkthdr is allocated, it's mb_ctor_mbuf() does: pkthdr.gencount = iftable_gencount & 0xffff; And this value also doesn't change during entire mbuf lifetime. At the same time, along with setting pkthdr.gencount, an entry in the special array, let's call it a[], is updated: index: gencount=1 gencount=2 gencount=3 +-----------+-----------+-----------+---....... |atomic_int |atomic_int |atomic_int | +-----------+-----------+-----------+---....... Those entries a[gencount] show the total number of mbufs with the same pkthdr.gencount in the system (it is decreased by mb_dtor_mbuf() on mbuf freeing). You can say, that's atomic too as in solution 1, but wait a little. Now, let's say em0 was created with gencount=1, ng0 was with gencount=2, and now ng0 is being destroyed. Now, global iftable_gencount is now 3, and all new mbufs in the system will get pkthdr.gencount=3, so numbers in previous generations now could only decrease, not grow. The ng0 if if_ref()'ed once by generation counting code and is not released until the SUM of all previous generations is zero (that is, for both gencount=1 and gencount=2). We observe that this counts could only decrease, so now this scheme could be converted to be lockless! We copy array a[] into every CPU's private memory, and all this copies are now accessed lockless, and now what matters for every generation is sum of all per-CPU values (this costs, however, 256K of memory on 32-bit archs). Now, a callout(9)-garbage collector procedure is invoked every N secs and scans arrays of all CPU without atomic(9) and sums that. If a value is stale in some cache - nothing harmless, numbers could only decrease, the if_rele() will be just deferred to next pass. Of course, this requires memory, and a TCP-like arithmetics for wrapping array in 2^16 space. So the only other problem which could be is when some mbuf exists too long to survive 32767 interfaces creates and destroys, which is very unlikely. Unfortunatelly, I don't know how to write per-CPU code, so I can't give you a PoC patch, but I hope I've explained idea detailed enough. -- WBR, Vadim Goncharov. ICQ#166852181 mailto:vadim_nuclight@mail.ru [Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight]