Date: Wed, 12 Feb 2014 21:02:09 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: FreeBSD Net <freebsd-net@freebsd.org>
Cc: Garrett Wollman <wollman@freebsd.org>
Subject: NFS threads getting stuck in vmem_bt_alloc() at "btalloc"? (mbuf allocation)
Message-ID: <165956022.5235196.1392256929127.JavaMail.root@uoguelph.ca>
I wrote:
> I've been doing some testing using pagesize clusters (4K) for NFS
> instead of mclbytes (2K) on a single core i386.
> Sometimes I get threads stuck sleeping on "btalloc", which seems
> to happen in vmem_bt_alloc().
>
> The comment in vmem_bt_alloc() basically says:
>   out of address space or lost a fill race
> Since this is persistent, I suspect it is the first case?
>
> So, does anyone know what is going on here or what I should look
> at to try and resolve this?
>
> Btw, when I am testing, I don't see the pagesize cluster allocation
> exceed 400, so it doesn't seem to be a leak or excessive allocation.
>
> Thanks in advance for any help, rick

I originally posted this to freebsd-hackers@, but since it seems to be
related to mbuf allocation, I thought it might be better here.

When I posted this, I knew nothing about uma or the current mbuf
allocation mechanisms. Now I know a little bit, and the story is
getting interesting...

Currently, NFS does:
  MGET(..M_WAITOK);
  MCLGET(..M_NOWAIT);
when it wants an mbuf cluster. It was done this way long ago, because
mbuf clusters could become exhausted and this allowed NFS to limp
along, using long lists of regular mbufs for the data (NFS RPC
messages).

Now, it seems that this does the following (MCLGET() is just m_clget(),
which is an inline function in sys/mbuf.h):
  MGET(..M_WAITOK) - always returns an mbuf
  m_clget(..M_NOWAIT) - calls uma_zalloc_arg(zone_clust, M_NOWAIT..);
    if this fails, it then does zone_drain(zone_pack); and calls
    uma_zalloc_arg(zone_clust, M_NOWAIT..) again

As such, it will zone_drain(zone_pack) when cluster allocations become
difficult (including when a uma zone allocation for a boundary tag
can't succeed without waiting). I suspect this usually fixes the
problem and the second attempt succeeds. However, even if the second
attempt fails, NFS still has an mbuf and doesn't get stuck in
"btalloc".
When I was doing recent testing to see how pagesize clusters would
work, I switched to m_getjcl(..M_WAITOK..), which can get stuck in
"btalloc" if an attempt to allocate a boundary tag fails due to lack
of kernel address space. I test on i386, but it still isn't obvious
how I exhausted kernel address space?

One thing I notice is that zone_pack is set to the same limit as the
mbuf zone, 168765. However, unlike the mbuf zone, I think that many of
the entries in zone_pack will have a cluster associated with them. I am
thinking that the limit for zone_pack is on the high side, since
zone_clust is limited to 26368 on my i386, and maybe this is how kernel
address space gets exhausted?

In summary, to play it safe, I think that if NFS is going to use
pagesize clusters, it needs to:
- call m_getjcl(..M_NOWAIT..)
- if this fails (returns NULL), then
  - call MGET(..M_WAITOK..)
  - call MCLGET(..M_NOWAIT..)

That way, I don't think the NFS threads can get stuck sleeping on
"btalloc", and calls to zone_drain(zone_pack) will happen when
allocation gets constrained.

This means that the length of the mbuf list for a read reply could be
  length (64K or whatever) / MLEN
in the worst case, since allocation of clusters isn't guaranteed.
(Garrett, I think you have to make your iovec that big if you are
going to use a fixed-size allocation instead of the current code,
which malloc()s enough for the list.)

It seems to me that m_getcl()/m_getjcl() should do a
zone_drain(zone_pack) when an allocation fails (for M_NOWAIT), but
that is just a suggestion?

What do others think of the above?

rick