From: Rick Macklem
To: John-Mark Gurney
Cc: Freebsd hackers list, Garrett Wollman
Date: Fri, 14 Mar 2014 21:44:05 -0400 (EDT)
Subject: Re: kernel memory allocator: UMA or malloc?
Message-ID: <804839311.22904387.1394847845581.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140314015021.GN32089@funkthat.com>

John-Mark Gurney wrote:
> Rick Macklem wrote this message on Thu, Mar 13, 2014 at 18:22 -0400:
> > John-Mark Gurney wrote:
> > > Rick Macklem wrote this message on Wed, Mar 12, 2014 at 21:59 -0400:
> > > > John-Mark Gurney wrote:
> > > > > Rick Macklem wrote this message on Tue, Mar 11, 2014 at 21:32 -0400:
> > > > > > I've been working on a patch provided by wollman@, where
> > > > > > he uses UMA instead of malloc() to allocate an iovec array
> > > > > > for use by the NFS server's read.
> > > > > >
> > > > > > So, my question is:
> > > > > > When is it preferable to use UMA(9) vs malloc(9) if the
> > > > > > allocation is going to be a fixed size?
> > > > >
> > > > > UMA has benefits if the structure size is uniform and a
> > > > > non-power of 2. In this case, it can pack the items more
> > > > > densely; say, a 192 byte allocation can fit 21 items in a
> > > > > 4k page, versus malloc, which would round each up to 256
> > > > > bytes, leaving only 16 per page... These counts per page
> > > > > are probably a bit different, as UMA may keep some
> > > > > information in the page...
> > > > >
> > > > Ok, this one might apply. I need to look at the size.
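Just to make the packing arithmetic concrete, here is a minimal
sketch of a fixed-size uma(9) zone. The zone name, the 192 byte item
size, the limit value and the SYSINIT hook are all invented for
illustration; none of this is from Garrett's patch:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <vm/uma.h>

    static uma_zone_t example_zone;

    static void
    example_zone_init(void *arg __unused)
    {
            /*
             * 192 byte items pack 21 to a 4k page; malloc(9) would
             * round each up to its 256 byte bucket, i.e. 16 per page.
             */
            example_zone = uma_zcreate("example_iov", 192,
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
            /* The optional hard limit mentioned below. */
            uma_zone_set_max(example_zone, 1024);
    }
    SYSINIT(example_zone_init, SI_SUB_KMEM, SI_ORDER_ANY,
        example_zone_init, NULL);

    static void
    example_zone_use(void)
    {
            void *p;

            /* Allocate and free much like malloc(9)/free(9). */
            p = uma_zalloc(example_zone, M_WAITOK);
            /* ... fill in the 192 byte item ... */
            uma_zfree(example_zone, p);
    }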
> > > >
> > > > > It also has the benefit of being able to keep allocations
> > > > > "half alive"... "freed" objects can be partly initialized
> > > > > with references to buffers and other allocations still held
> > > > > by them... Then if the system needs to fully free your
> > > > > allocation, it can, and will call your function to release
> > > > > these remaining resources... Look at the ctor/dtor and
> > > > > uminit/fini functions in uma(9) for more info...
> > > > >
> > > > > uma also allows you to set a hard limit on the number of
> > > > > allocations the zone provides...
> > > > >
> > > > Yep. None of the above applies to this case, but thanks for
> > > > the good points for a future case. (I've seen where this gets
> > > > used for the "secondary zone" for mbufs+clusters.)
> > > >
> > > > > Hope this helps...
> > > > >
> > > > Yes, it did. Thanks.
> > > >
> > > > Does anyone know if there is a significant performance
> > > > difference if the allocation is a power of 2 and the "half
> > > > alive" cases don't apply?
> > >
> > > From my understanding, the malloc case is "slightly" slower, as
> > > it needs to look up which bucket to use, but after the lookup
> > > the buckets are UMA, so the performance will be the same...
> > >
> > > > Thanks all for your help, rick
> > > > ps: Garrett's patch switched to using a fixed size allocation
> > > > and using UMA(9). Since I have found that a uma allocation
> > > > request with M_WAITOK can get the thread stuck sleeping in
> > > > "btalloc", I am a bit shy of using it when I've never
> > >
> > > Hmm... I took a look at the code, and if you're stuck in
> > > btalloc, either pause(9) isn't working, or you're looping,
> > > which probably means you're really low on memory...
> > >
> > Well, this was an i386 with the default of about 400Mbytes of
> > kernel memory (address space, if I understand it correctly).
> > Since it seemed to persist in this state, I assumed that it was
> > looping and, therefore, wasn't able to find a page sized and page
> > aligned chunk of kernel address space to use. (The rest of the
> > system was still running ok.)
>
> It looks like vm.phys_free would have some useful information about
> the availability of free memory... I'm not sure if this is where
> the allocators get their memory or not... I was about to say it
> seemed weird we only have 16K as the largest allocation, but that's
> 16MEGs...
>
I can't reproduce it reliably. I saw it twice during several days of
testing.

> > I did email about this, and since no one had a better
> > explanation/fix, I avoided the problem by using M_NOWAIT on the
> > m_getjcl() call.
> >
> > Although I couldn't reproduce this reliably, it seemed to happen
> > more easily when my code was doing a mix of MCLBYTES and
> > MJUMPAGESIZE cluster allocations. Again, just a hunch, but maybe
> > the MCLBYTES cluster allocations were fragmenting the address
> > space to the point where a page sized chunk aligned to a page
> > boundary couldn't be found.
>
> By definition, you would be out of memory if there is not a page
> free (one that is aligned to a page boundary, which all pages
> are)...
>
> It'd be interesting to put a printf w/ the pause to see if it is
> looping, and to get a sysctl -a from the machine when it is
> happening...
>
> > Alternately, the code for M_WAITOK is broken in some way not
> > obvious to me.
> >
> > Either way, I avoid it by using M_NOWAIT. I also fall back on:
> > MGET(..M_WAITOK);
> > MCLGET(..M_NOWAIT);
> > which has a "side effect" of draining the mbuf cluster zone if
> > the MCLGET(..M_NOWAIT) fails to get a cluster. (For some reason
> > m_getcl() and m_getjcl() do not drain the cluster zone when they
> > fail?)
>
> Why aren't you using m_getcl(9), which does both of the above
> automatically for you? And it is faster, since there is a special
> uma zone that has both an mbuf and an mbuf cluster paired up
> already?
>
Well, remember this is only done as a fallback if
m_getjcl(..M_NOWAIT..) fails (returns NULL).
--> It will rarely happen; only when there are no easily allocatable
    clusters.

For that case, I wanted something that will reliably get at least an
mbuf without getting stuck in "btalloc". If I used
m_getcl(..M_NOWAIT..) it could still fail, and then I don't even
have an mbuf. If I used m_getcl(..M_WAITOK..) it could get stuck in
"btalloc". Since m_getjcl(..M_NOWAIT..) has already failed, memory
is constrained at this time.

Also (and I don't know why), only m_clget(..M_NOWAIT..) does a drain
on the mbuf cluster zone. This is not done by m_getcl() or
m_getjcl(), from what I saw when I looked at the code.

Note that since the above uses M_WAITOK for m_get() and M_NOWAIT for
m_clget(), it may only get an mbuf and no cluster, but I can live
with that.
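For reference, the fallback strategy I described above looks roughly
like this as code (a sketch only; the function name and the flag and
type arguments are mine, not the patch's):

    #include <sys/param.h>
    #include <sys/mbuf.h>

    static struct mbuf *
    example_getpagembuf(void)
    {
            struct mbuf *m;

            /*
             * Try for an mbuf + page size cluster without sleeping,
             * so the thread cannot get stuck in "btalloc".
             */
            m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
            if (m != NULL)
                    return (m);

            /* Sleeping for a plain mbuf is reliable... */
            MGET(m, M_WAITOK, MT_DATA);
            /*
             * ...and MCLGET(..M_NOWAIT..) drains the cluster zone
             * when no cluster is available. We may end up with just
             * the mbuf; the caller checks (m->m_flags & M_EXT) to
             * see whether a cluster was actually attached.
             */
            MCLGET(m, M_NOWAIT);
            return (m);
    }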
> > One of the advantages of having very old/small hardware to test
> > on;-)
>
> :)
>
> > > > had a problem with malloc(). Btw, this was for a pagesize
> > > > cluster allocation, so it might be related to the alignment
> > > > requirement (and running on a small i386, so the kernel
> > > > address space is relatively small).
> > >
> > > Yeah, if you put additional alignment requirements on it,
> > > that's probably it, but if you needed these alignment
> > > requirements, how was malloc satisfying your request?
> > >
> > This was for a m_getjcl(MJUMPAGESIZE, M_WAITOK..), so for this
> > case I've never done a malloc(). The code in head (which my patch
> > uses as a fallback when m_getjcl(..M_NOWAIT..) fails) does, as
> > above:
> > MGET(..M_WAITOK);
> > MCLGET(..M_NOWAIT);
>
> When that fails, a netstat -m would also be useful to see what the
> stats think of the availability of page size clusters...
>
This has never failed in testing. The case that would get stuck in
"btalloc" was a:
m_getjcl(..M_WAITOK..);
- the same as m_getcl(), but sometimes asking for a MJUMPAGESIZE
  cluster instead of a MCLBYTES cluster.
The current patch still does a m_getjcl() call, but with M_NOWAIT.
Then, if that returns NULL, it falls back to the old reliable way,
as above.

> > > > I do see that switching to a fixed size allocation to cover
> > > > the common case is a good idea, but I'm not sure if setting
> > > > up a uma zone is worth the effort over malloc()?
> > >
> > > I'd say it depends upon the size and the number... If you're
> > > allocating many megabytes of memory, and the wastage is 50%+,
> > > then think about it, but if it's just a few objects, then the
> > > coding time and maintenance isn't worth it...
> > >
> > Btw, I think the allocation is a power of 2. (It is a power of 2
> > times sizeof(struct iovec), and it looks to me that sizeof(struct
> > iovec) is a power of 2 as well. I know i386 is 8, and I think
> > most 64-bit arches will make it 16, since it is a pointer and a
> > size_t.)
>
> Yes, struct iovec is 16 on amd64...
>
> (kgdb) print sizeof(struct iovec)
> $1 = 16
>
> > This was part of Garrett's patch, so I'll admit I would have been
> > too lazy to do it.;-) Now it's in the current patch, so unless
> > there seems to be a reason to take it out..??
> >
> > Garrett mentioned that UMA(9) has a per-CPU cache. I'll admit I
> > don't know what that implies?
>
> A per-CPU cache means that on an SMP system, you can lock the local
> pool instead of grabbing a global lock... This will be MUCH faster,
> as the local lock won't have to bounce around CPUs like a global
> lock does, plus it should never contend, and contention is what
> really puts the brakes on sync primitives...
>
> > - I might guess that a per-CPU cache would be useful for items
> >   that get re-allocated a lot with minimal change to the data in
> >   the slab.
> >   --> It seems to me that if most of the bytes in the slab have
> >       the same bits, then you might improve the hit rate on the
> >       CPU's memory caches, but since I haven't looked at this, I
> >       could be way off??
>
> Caching will help some, but the lock is the main one...
>
> > - For this case, the iovec array that is allocated is filled in
> >   with different mbuf data addresses each time, so minimal change
> >   doesn't apply.
>
> So, this is where a UMA half-alive object could be helpful... Say
> that you always need to allocate an iovec + 8 mbuf clusters to
> populate the iovec... What you can do is have a uma uminit function
> that allocates the memory for the iovec and 8 mbuf clusters, and
> populates the iovec w/ the correct addresses... Then when you call
> uma_zalloc, the iovec is already initialized, and you just go on
> your merry way instead of doing all that work... When you
> uma_zfree, you don't have to worry about losing the clusters, as
> the next uma_zalloc might return the same object w/ the clusters
> already present... When the system gets low on memory, it will
> call your fini function, which will need to free the clusters...
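To sketch what you are describing (everything here is hypothetical:
the names, the 8 cluster count, and the use of page size clusters;
uminit runs when UMA creates an item from a slab, and fini only when
the item is truly reclaimed):

    #include <sys/param.h>
    #include <sys/errno.h>
    #include <sys/mbuf.h>
    #include <sys/uio.h>
    #include <vm/uma.h>

    #define EXAMPLE_NCL 8   /* clusters kept per object */

    struct iov_item {
            struct iovec    iov[EXAMPLE_NCL];
            struct mbuf     *m[EXAMPLE_NCL];
    };

    /* uminit: preallocate the clusters and point the iovec at them. */
    static int
    iov_item_init(void *mem, int size, int how)
    {
            struct iov_item *it = mem;
            int i;

            for (i = 0; i < EXAMPLE_NCL; i++) {
                    it->m[i] = m_getjcl(how, MT_DATA, 0, MJUMPAGESIZE);
                    if (it->m[i] == NULL) {
                            while (--i >= 0)
                                    m_freem(it->m[i]);
                            return (ENOMEM);
                    }
                    it->iov[i].iov_base = mtod(it->m[i], void *);
                    it->iov[i].iov_len = MJUMPAGESIZE;
            }
            return (0);
    }

    /* fini: only called when the system reclaims the item for real. */
    static void
    iov_item_fini(void *mem, int size)
    {
            struct iov_item *it = mem;
            int i;

            for (i = 0; i < EXAMPLE_NCL; i++)
                    m_freem(it->m[i]);
    }

    static uma_zone_t iov_item_zone;

    static void
    iov_item_zone_create(void)
    {
            /* No ctor/dtor; uminit/fini keep the object half alive. */
            iov_item_zone = uma_zcreate("iov_item",
                sizeof(struct iov_item), NULL, NULL,
                iov_item_init, iov_item_fini, UMA_ALIGN_PTR, 0);
    }

After that, uma_zalloc(iov_item_zone, M_WAITOK) hands back an object
whose iovec already points at live clusters.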
> > - Does the per-CPU cache help w.r.t. UMA(9) internal code perf?
> >
> > So, lots of questions that I don't have an answer for. However,
> > unless there is a downside to using UMA(9) for this, the code is
> > written and I'm ok with it.
>
> Nope, not really...
>
> --
> John-Mark Gurney                        Voice: +1 415 225 5579
>
>      "All that I will do, has been done, All that I have, has not."