From owner-freebsd-hackers@FreeBSD.ORG Fri Mar 14 01:50:28 2014
Date: Thu, 13 Mar 2014 18:50:21 -0700
From: John-Mark Gurney
To: Rick Macklem
Cc: Freebsd hackers list, Garrett Wollman
Subject: Re: kernel memory allocator: UMA or malloc?
Message-ID: <20140314015021.GN32089@funkthat.com>
In-Reply-To: <1783335610.22308389.1394749339304.JavaMail.root@uoguelph.ca>
List-Id: Technical Discussions relating to FreeBSD

Rick Macklem wrote this message on Thu, Mar 13, 2014 at 18:22 -0400:
> John-Mark Gurney wrote:
> > Rick Macklem wrote this message on Wed, Mar 12, 2014 at 21:59 -0400:
> > > John-Mark Gurney wrote:
> > > > Rick Macklem wrote this message on Tue, Mar 11, 2014 at 21:32 -0400:
> > > > > I've been working on a patch provided by wollman@, where
> > > > > he uses UMA instead of malloc() to allocate an iovec array
> > > > > for use by the NFS server's read.
> > > > >
> > > > > So, my question is:
> > > > > When is it preferable to use UMA(9) vs malloc(9) if the
> > > > > allocation is going to be a fixed size?
> > > >
> > > > UMA has benefits if the structure size is uniform and a non-power
> > > > of 2.. In this case, it can pack the items more densely: say, a
> > > > 192 byte allocation can fit 21 items in a 4k page, versus malloc,
> > > > which would round it up to 256 bytes, leaving only 16 per page...
> > > > These counts per page are probably different, as UMA may keep
> > > > some information in the page...
> > > >
> > > Ok, this one might apply. I need to look at the size.
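To put the packing point in concrete terms, the two call sequences look
roughly like this.  This isn't a complete compilable module, just the
shape of malloc(9) versus uma(9) for a fixed, non-power-of-2 size; the
struct, the "foozone" name, and M_FOO are made up for illustration:

    /* needs <sys/param.h>, <sys/kernel.h>, <sys/malloc.h>, <vm/uma.h> */

    struct foo {
        char    pad[192];       /* a uniform, non-power-of-2 size */
    };

    /*
     * malloc(9): 192 bytes gets rounded up to the 256-byte bucket,
     * so only 16 items fit in a 4k page (4096 / 256).
     */
    MALLOC_DEFINE(M_FOO, "foo", "example fixed-size items");
    struct foo *fp;

    fp = malloc(sizeof(*fp), M_FOO, M_WAITOK);
    /* ... use fp ... */
    free(fp, M_FOO);

    /*
     * uma(9): items are laid out at their real size, so roughly 21 fit
     * per 4k page (4096 / 192), less any per-page bookkeeping UMA keeps.
     */
    static uma_zone_t foo_zone;

    foo_zone = uma_zcreate("foozone", sizeof(struct foo),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
    fp = uma_zalloc(foo_zone, M_WAITOK);
    /* ... use fp ... */
    uma_zfree(foo_zone, fp);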
> > >
> > > > It also has the benefit of being able to keep allocations "half
> > > > alive"... "freed" objects can be partly initialized with references
> > > > to buffers and other allocations still held by them... Then if the
> > > > system needs to fully free your allocation, it can, and will call
> > > > your function to release these remaining resources... look at the
> > > > ctor/dtor uminit/fini functions in uma(9) for more info...
> > > >
> > > > uma also allows you to set a hard limit on the number of
> > > > allocations the zone provides...
> > > >
> > > Yep. None of the above applies to this case, but thanks for the good
> > > points for a future case. (I've seen where this gets used for the
> > > "secondary zone" for mbufs+clusters.)
> > >
> > > > Hope this helps...
> > > >
> > > Yes, it did. Thanks.
> > >
> > > Does anyone know if there is a significant performance difference
> > > if the allocation is a power of 2 and the "half alive" cases don't
> > > apply?
> >
> > From my understanding, the malloc case is "slightly" slower as it
> > needs to look up which bucket to use, but after the lookup, the
> > buckets are UMA, so the performance will be the same...
> >
> > > Thanks all for your help, rick
> > > ps: Garrett's patch switched to using a fixed size allocation and
> > > using UMA(9). Since I have found that a uma allocation request with
> > > M_WAITOK can get the thread stuck sleeping in "btalloc", I am a bit
> > > shy of using it when I've never
> >
> > Hmm... I took a look at the code, and if you're stuck in btalloc,
> > either pause(9) isn't working, or you're looping, which probably
> > means you're really low on memory...
>
> Well, this was an i386 with the default of about 400Mbytes of kernel
> memory (address space, if I understand it correctly). Since it seemed
> to persist in this state, I assumed that it was looping and, therefore,
> wasn't able to find a page-sized and page-aligned chunk of kernel
> address space to use. (The rest of the system was still running ok.)

It looks like vm.phys_free would have some useful information about the
availability of free memory... I'm not sure if this is where the
allocators get their memory or not... I was about to say it seemed weird
we only have 16K as the largest allocation, but that's 16 megs...

> I did email about this and since no one had a better explanation/fix,
> I avoided the problem by using M_NOWAIT on the m_getjcl() call.
>
> Although I couldn't reproduce this reliably, it seemed to happen more
> easily when my code was doing a mix of MCLBYTES and MJUMPAGESIZE
> cluster allocations. Again, just a hunch, but maybe the MCLBYTES
> cluster allocations were fragmenting the address space to the point
> where a page-sized chunk aligned to a page boundary couldn't be found.

By definition, you would be out of memory if there is not a page free
(that is aligned to a page boundary, which all pages are)... It'd be
interesting to put a printf w/ the pause to see if it is looping, and
to get a sysctl -a from the machine when it is happening...

> Alternately, the code for M_WAITOK is broken in some way not obvious
> to me.
>
> Either way, I avoid it by using M_NOWAIT. I also fall back on:
>   MGET(..M_WAITOK);
>   MCLGET(..M_NOWAIT);
> which has a "side effect" of draining the mbuf cluster zone if the
> MCLGET(..M_NOWAIT) fails to get a cluster. (For some reason m_getcl()
> and m_getjcl() do not drain the cluster zone when they fail?)
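If I'm reading that right, the fallback you describe looks roughly like
the sketch below.  This is not your actual patch, just my guess at the
shape of it, with the error handling trimmed and MT_DATA/the flags
picked for illustration:

    /* needs <sys/param.h>, <sys/systm.h>, <sys/mbuf.h> */

    struct mbuf *m;

    /* first choice: a page-sized jumbo cluster, without sleeping */
    m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
    if (m == NULL) {
        /* fall back: sleep for the mbuf, then try a 2k cluster */
        MGET(m, M_WAITOK, MT_DATA);
        MCLGET(m, M_NOWAIT);    /* per your note, failure drains the zone */
        if ((m->m_flags & M_EXT) == 0) {
            /* still no cluster; give up and let the caller retry */
            m_freem(m);
            m = NULL;
        }
    }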
Why aren't you using m_getcl(9), which does both of the above
automatically for you? It's also faster, since there is a special uma
zone that has both an mbuf and an mbuf cluster paired up already...

> One of the advantages of having very old/small hardware to test on ;-)

:)

> > > had a problem with malloc(). Btw, this was for a pagesize
> > > cluster allocation, so it might be related to the alignment
> > > requirement (and running on a small i386, so the kernel address
> > > space is relatively small).
> >
> > Yeh, if you put additional alignment requirements, that's probably
> > it, but if you needed these alignment requirements, how was malloc
> > satisfying your request?
>
> This was for an m_getjcl(MJUMPAGESIZE, M_WAITOK..), so for this case
> I've never done a malloc(). The code in head (which my patch uses as
> a fallback when m_getjcl(..M_NOWAIT..) fails) does (as above):
>   MGET(..M_WAITOK);
>   MCLGET(..M_NOWAIT);

When that fails, a netstat -m would also be useful to see what the
stats think of the availability of page size clusters...

> > > I do see that switching to a fixed size allocation to cover the
> > > common case is a good idea, but I'm not sure if setting up a uma
> > > zone is worth the effort over malloc()?
> >
> > I'd say it depends upon how many and how large they are... If you're
> > allocating many megabytes of memory, and the wastage is 50%+, then
> > think about it, but if it's just a few objects, then the coding time
> > and maintenance isn't worth it..
>
> Btw, I think the allocation is a power of 2. (It is a power of 2 times
> sizeof(struct iovec), and it looks to me that sizeof(struct iovec) is
> a power of 2 as well. I know i386 is 8, and I think most 64-bit arches
> will make it 16, since it is a pointer and a size_t.)

yes, struct iovec is 16 on amd64...

(kgdb) print sizeof(struct iovec)
$1 = 16

> This was part of Garrett's patch, so I'll admit I would have been too
> lazy to do it. ;-) Now it's in the current patch, so unless there seems
> to be a reason to take it out..??
>
> Garrett mentioned that UMA(9) has a per-CPU cache. I'll admit I don't
> know what that implies?

a per-CPU cache means that on an SMP system, you can lock the local
pool instead of grabbing a global lock.. This will be MUCH faster, as
the local lock won't have to bounce around CPUs like a global lock
does, plus it should never contend, which really puts the brakes on
sync primitives...

> - I might guess that a per-CPU cache would be useful for items that get
>   re-allocated a lot with minimal change to the data in the slab.
>   --> It seems to me that if most of the bytes in the slab have the
>       same bits, then you might improve hit rate on the CPU's memory
>       caches, but since I haven't looked at this, I could be way off??

caching will help some, but the lock is the main one...

> - For this case, the iovec array that is allocated is filled in with
>   different mbuf data addresses each time, so minimal change doesn't
>   apply.

So, this is where a UMA half alive object could be helpful... Say that
you always need to allocate an iovec + 8 mbuf clusters to populate the
iovec... What you can do is have a uma uminit function that allocates
the memory for the iovec and 8 mbuf clusters, and populates the iovec
w/ the correct addresses... Then when you call uma_zalloc, the iovec
is already initialized, and you just go on your merry way instead of
doing all that work...
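Roughly like this (an untested sketch; the struct/zone names and the
fixed count of 8 clusters are only for illustration, and the
uma_zcreate() would go in the subsystem's init code):

    /* needs <sys/param.h>, <sys/errno.h>, <sys/mbuf.h>, <sys/uio.h>, <vm/uma.h> */

    #define NFSD_NIOV   8

    struct nfsd_iovobj {
        struct iovec    iov[NFSD_NIOV];
        struct mbuf     *m[NFSD_NIOV];
    };

    /* uminit: runs when UMA takes a fresh item, not on every uma_zalloc */
    static int
    nfsd_iovobj_init(void *mem, int size, int flags)
    {
        struct nfsd_iovobj *o = mem;
        int how, i;

        how = (flags & M_WAITOK) ? M_WAITOK : M_NOWAIT;
        for (i = 0; i < NFSD_NIOV; i++) {
            o->m[i] = m_getcl(how, MT_DATA, 0);
            if (o->m[i] == NULL) {
                while (--i >= 0)
                    m_freem(o->m[i]);
                return (ENOMEM);
            }
            o->iov[i].iov_base = mtod(o->m[i], void *);
            o->iov[i].iov_len = MCLBYTES;
        }
        return (0);
    }

    /* fini: only called when UMA really gives the item back to the VM */
    static void
    nfsd_iovobj_fini(void *mem, int size)
    {
        struct nfsd_iovobj *o = mem;
        int i;

        for (i = 0; i < NFSD_NIOV; i++)
            m_freem(o->m[i]);
    }

    static uma_zone_t nfsd_iov_zone;

    nfsd_iov_zone = uma_zcreate("nfsdiov", sizeof(struct nfsd_iovobj),
        NULL, NULL, nfsd_iovobj_init, nfsd_iovobj_fini,
        UMA_ALIGN_PTR, 0);

The ctor/dtor slots (the two NULLs) are where per-allocation setup would
go; the init/fini pair only runs when items actually enter or leave the
zone, which is the "half alive" part...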
when you uma_zfree, you don't have to worry about losing the clusters,
as the next uma_zalloc might return the same object w/ the clusters
already present... When the system gets low on memory, it will call
your fini function, which will need to free the clusters...

> - Does the per-CPU cache help w.r.t. UMA(9) internal code perf?
>
> So, lots of questions that I don't have an answer for. However, unless
> there is a downside to using UMA(9) for this, the code is written and
> I'm ok with it.

Nope, not really...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."