Date: Thu, 13 Mar 2014 18:22:19 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John-Mark Gurney <jmg@funkthat.com>
Cc: Freebsd hackers list <freebsd-hackers@freebsd.org>, Garrett Wollman <wollman@freebsd.org>
Subject: Re: kernel memory allocator: UMA or malloc?
Message-ID: <1783335610.22308389.1394749339304.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140313054659.GG32089@funkthat.com>
John-Mark Gurney wrote:
> Rick Macklem wrote this message on Wed, Mar 12, 2014 at 21:59 -0400:
> > John-Mark Gurney wrote:
> > > Rick Macklem wrote this message on Tue, Mar 11, 2014 at 21:32 -0400:
> > > > I've been working on a patch provided by wollman@, where
> > > > he uses UMA instead of malloc() to allocate an iovec array
> > > > for use by the NFS server's read.
> > > >
> > > > So, my question is:
> > > > When is it preferable to use UMA(9) vs malloc(9) if the
> > > > allocation is going to be a fixed size?
> > >
> > > UMA has benefits if the structure size is uniform and a non-power
> > > of 2. In that case, it can pack the items more densely: say, a
> > > 192-byte allocation can fit 21 allocations in a 4k page, versus
> > > malloc, which would round it up to 256 bytes, leaving only 16 per
> > > page. These counts per page are probably different, as UMA may
> > > keep some information in the page...
> > >
> > Ok, this one might apply. I need to look at the size.
> >
> > > It also has the benefit of being able to keep allocations "half
> > > alive". "Freed" objects can be kept partly initialized, with
> > > references to buffers and other allocations still held by them.
> > > Then, if the system needs to fully free your allocation, it can,
> > > and will call your function to release these remaining resources.
> > > Look at the ctor/dtor and uminit/fini functions in uma(9) for more
> > > info...
> > >
> > > uma also allows you to set a hard limit on the number of
> > > allocations the zone provides...
> > >
> > Yep. None of the above applies to this case, but thanks for the good
> > points for a future case. (I've seen where this gets used for the
> > "secondary zone" for mbufs+clusters.)
> >
> > > Hope this helps...
> > >
> > Yes, it did. Thanks.
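[To make the "half alive" hooks concrete, here is a minimal sketch of a
zone using them. Everything here (the foo_* names, FOO_BUFSIZE, the
cached buffer) is hypothetical and not from the thread; the hook
signatures follow uma(9).]

```c
/*
 * Hypothetical sketch of a uma(9) zone whose items stay "half alive":
 * foo_init/foo_fini manage a long-lived buffer that survives uma_zfree(),
 * while foo_ctor/foo_dtor run on every allocation/free.
 */
struct foo {
	char	*f_buf;		/* kept attached across free/alloc cycles */
	int	 f_inuse;
};

static uma_zone_t foo_zone;

static int
foo_init(void *mem, int size, int flags)
{
	struct foo *f = mem;

	/* Called when the item is first backed by memory. */
	f->f_buf = malloc(FOO_BUFSIZE, M_TEMP, flags);
	return (f->f_buf == NULL ? ENOMEM : 0);
}

static void
foo_fini(void *mem, int size)
{
	struct foo *f = mem;

	/* Called only when UMA really releases the item back to the VM. */
	free(f->f_buf, M_TEMP);
}

static int
foo_ctor(void *mem, int size, void *arg, int flags)
{
	struct foo *f = mem;

	f->f_inuse = 1;		/* per-allocation setup */
	return (0);
}

static void
foo_dtor(void *mem, int size, void *arg)
{
	struct foo *f = mem;

	f->f_inuse = 0;		/* per-free teardown; f_buf stays attached */
}

void
foo_setup(void)
{
	foo_zone = uma_zcreate("foo", sizeof(struct foo),
	    foo_ctor, foo_dtor, foo_init, foo_fini,
	    UMA_ALIGN_PTR, 0);
	uma_zone_set_max(foo_zone, 1024);	/* optional hard limit */
}
```

[The split matters: ctor/dtor are cheap and run on every uma_zalloc()/
uma_zfree(), while uminit/fini only run when items actually move between
the zone and the VM, so the expensive buffer setup is amortized.]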
> > Does anyone know if there is a significant performance difference
> > if the allocation is a power of 2 and the "half alive" cases don't
> > apply?
>
> From my understanding, the malloc case is "slightly" slower, as it
> needs to look up which bucket to use, but after the lookup the
> buckets are UMA, so the performance will be the same...
>
> > Thanks all for your help, rick
> > ps: Garrett's patch switched to using a fixed size allocation and
> > using UMA(9). Since I have found that a uma allocation request with
> > M_WAITOK can get the thread stuck sleeping in "btalloc", I am a bit
> > shy of using it when I've never
>
> Hmm... I took a look at the code, and if you're stuck in btalloc,
> either pause(9) isn't working, or you're looping, which probably
> means you're really low on memory...
>
Well, this was an i386 with the default of about 400Mbytes of kernel
memory (address space, if I understand it correctly). Since it seemed
to persist in this state, I assumed that it was looping and, therefore,
wasn't able to find a page-sized, page-aligned chunk of kernel address
space to use. (The rest of the system was still running ok.) I did
email about this and, since no one had a better explanation/fix, I
avoided the problem by using M_NOWAIT on the m_getjcl() call.

Although I couldn't reproduce this reliably, it seemed to happen more
easily when my code was doing a mix of MCLBYTES and MJUMPAGESIZE
cluster allocations. Again, just a hunch, but maybe the MCLBYTES
cluster allocations were fragmenting the address space to the point
where a page-sized chunk aligned to a page boundary couldn't be found.
Alternately, the code for M_WAITOK is broken in some way not obvious
to me. Either way, I avoid it by using M_NOWAIT. I also fall back on:
   MGET(..M_WAITOK); MCLGET(..M_NOWAIT);
which has a "side effect" of draining the mbuf cluster zone if the
MCLGET(..M_NOWAIT) fails to get a cluster.
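[A sketch of that fallback; the error handling is illustrative, not the
actual patch:]

```c
/*
 * Illustrative sketch of the allocation strategy described above:
 * try a page-sized jumbo cluster without sleeping; on failure, fall
 * back to MGET(M_WAITOK) + MCLGET(M_NOWAIT), where a failing MCLGET
 * has the side effect of draining the mbuf cluster zone.
 */
struct mbuf *m;

m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
if (m == NULL) {
	MGET(m, M_WAITOK, MT_DATA);
	MCLGET(m, M_NOWAIT);
	if ((m->m_flags & M_EXT) == 0) {
		/* No regular cluster either; let the caller retry. */
		m_free(m);
		m = NULL;
	}
}
```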
(For some reason, m_getcl() and m_getjcl() do not drain the cluster
zone when they fail?) One of the advantages of having very old/small
hardware to test on. ;-)

> > had a problem with malloc(). Btw, this was for a pagesize cluster
> > allocation, so it might be related to the alignment requirement
> > (and running on a small i386, so the kernel address space is
> > relatively small).
>
> Yeh, if you put additional alignment requirements, that's probably
> it, but if you needed these alignment requirements, how was malloc
> satisfying your request?
>
This was for an m_getjcl(..MJUMPAGESIZE, M_WAITOK..), so for this case
I've never done a malloc(). The code in head (which my patch uses as a
fallback when m_getjcl(..M_NOWAIT..) fails) does, as above:
   MGET(..M_WAITOK); MCLGET(..M_NOWAIT);

> > I do see that switching to a fixed size allocation to cover the
> > common case is a good idea, but I'm not sure if setting up a uma
> > zone is worth the effort over malloc()?
>
> I'd say it depends upon how many and the number... If you're
> allocating many megabytes of memory, and the wastage is 50%+, then
> think about it, but if it's just a few objects, then the coding time
> and maintenance isn't worth it..
>
Btw, I think the allocation is a power of 2. (It is a power of 2 times
sizeof(struct iovec), and it looks to me that sizeof(struct iovec) is
a power of 2 as well. I know i386 is 8, and I think most 64-bit arches
will make it 16, since it is a pointer and a size_t.) This was part of
Garrett's patch, so I'll admit I would have been too lazy to do it. ;-)
Now it's in the current patch, so unless there seems to be a reason to
take it out..??

Garrett mentioned that UMA(9) has a per-CPU cache. I'll admit I don't
know what that implies?
- I might guess that a per-CPU cache would be useful for items that
  get re-allocated a lot with minimal change to the data in the slab.
--> It seems to me that if most of the bytes in the slab have the same
    bits, then you might improve the hit rate on the CPU's memory
    caches, but since I haven't looked at this, I could be way off??
- For this case, the iovec array that is allocated is filled in with
  different mbuf data addresses each time, so "minimal change" doesn't
  apply.
- Does the per-CPU cache help w.r.t. UMA(9) internal code performance?

So, lots of questions that I don't have an answer for. However, unless
there is a downside to using UMA(9) for this, the code is written and
I'm ok with it.

Thanks for all the good comments, rick

> --
> John-Mark Gurney			Voice: +1 415 225 5579
>
>      "All that I will do, has been done, All that I have, has not."
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to
> "freebsd-hackers-unsubscribe@freebsd.org"