Date:       Sat, 03 Dec 2005 23:59:12 +0100
From:       Andre Oppermann <andre@freebsd.org>
To:         Robert Watson <rwatson@FreeBSD.org>
Cc:         current@FreeBSD.org
Subject:    Re: mbuf cluster leaks in -CURRENT
Message-ID: <43922340.4CAF7157@freebsd.org>
References: <20051203150119.B3216@fledge.watson.org>
Robert Watson wrote:
> Yesterday I sat down to run some benchmarks on phk's changes to the
> process time measurement system for scheduling, and discovered SMP boxes
> were wedging in [zonelimit] when running netperf tests.  I quickly
> tracked this down to an mbuf cluster leak:
>
>   /zoo/rwatson/netperf/bin/netserver
>   while (1)
>     echo ""
>     netstat -m | grep mbuf
>     /zoo/rwatson/netperf/bin/netperf -l 30 >& /dev/null
>   end
>
> Result of:
>
>   769/641/1410 mbufs in use (current/cache/total)
>   768/204/972/25600 mbuf clusters in use (current/cache/total/max)
>
>   769/4991/5760 mbufs in use (current/cache/total)
>   4341/905/5246/25600 mbuf clusters in use (current/cache/total/max)
>
>   769/8456/9225 mbufs in use (current/cache/total)
>   7901/801/8702/25600 mbuf clusters in use (current/cache/total/max)
>
>   769/11786/12555 mbufs in use (current/cache/total)
>   11242/788/12030/25600 mbuf clusters in use (current/cache/total/max)
>
>   769/15236/16005 mbufs in use (current/cache/total)
>   14570/916/15486/25600 mbuf clusters in use (current/cache/total/max)
>
>   769/18566/19335 mbufs in use (current/cache/total)
>   17948/866/18814/25600 mbuf clusters in use (current/cache/total/max)
>
> The reason for the wedge is that NFS-based systems don't like running
> out of mbuf clusters.  It turns out that the reason I likely didn't
> notice this previously was that I was running the test boxes in question
> without ACPI, and for whatever reason the race becomes many times more
> serious with ACPI turned on.  It was leaking without ACPI, but since it
> was slower, I wasn't noticing, since I had the machines up for much
> shorter tests.  Here's a sampling of kernel dates and whether or not the
> leak was present in a kernel from that date, as well as the dates of a
> few changes I was worried were likely causes:
>
>   CVS Date                Description                      Leak?
>   2005/12/3               sample                           yes
>   2005/11/28-2005/11/29   rwatson sosend changes           -
>   2005/11/25               sample                           yes
>   2005/11/15               sample                           yes
>   2005/11/02-2005/11/05   andre cluster changes            -
>   2005/10/25               sample                           no
>   2005/10/15               sample                           no
>   2005/10/1                sample                           no
>   2005/09/27               rwatson removes mbuf counters    -
>   2005/09/16               sample                           no
>
> I've not really had a chance to investigate the details of the leak --
> the number of used (allocated) mbufs remains low, but the cache number
> grows steadily.  However, the dates suggest that it was the mbuf cluster
> cleanup work you did that introduced the problem (although I don't
> guarantee it).

This seems to be the same problem I described in rev. 1.14 of kern_mbuf.c,
where mbuf+clusters from the packet zone (pre-combined m+c) never get freed
back to their original pools.  The numbers from netstat -m support that
assumption; netstat -m doesn't (and can't) show the number of cached m+c in
the packet zone.  Mbufs in the packet zone are accounted as cached in the
mbuf zone because the packet zone is a secondary zone of it.  The clusters
shown as "in use" are not leaked but attached to all those mbufs cached in
the packet zone; the cluster zone doesn't know about the packet zone and
accounts them as used.

This pseudo-leak does not come from my changes (it is a UMA bug), but it
gets amplified by kernel subsystems that make heavy use of m+c from the
packet zone.  My changes triggered the problem too, by changing the way
packets get freed back to the UMA mbuf, cluster and packet zones, but that
part was reverted in 1.14.  The other refcount changes do not cause any
such effect.  It may very well be that some part of the network stack
switched from allocating mbuf and cluster separately to pre-combined
packets from the packet zone.  That would explain the 'sudden' appearance
of the problem.
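Just to illustrate the difference (this snippet is made up for this mail,
it is not code from the tree, and the two helper names are invented), a
consumer that moves from the first form to the second starts drawing
everything through the packet zone:

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/mbuf.h>

  /* Old style: mbuf and cluster allocated separately from their native
   * zones (hypothetical helper). */
  static struct mbuf *
  alloc_separate(void)
  {
          struct mbuf *m;

          MGETHDR(m, M_DONTWAIT, MT_DATA);        /* mbuf zone */
          if (m == NULL)
                  return (NULL);
          MCLGET(m, M_DONTWAIT);                  /* cluster zone */
          if ((m->m_flags & M_EXT) == 0) {
                  m_freem(m);
                  return (NULL);
          }
          return (m);
  }

  /* New style: a pre-combined mbuf+cluster from the packet zone, the
   * secondary zone sitting on top of the mbuf zone (hypothetical helper). */
  static struct mbuf *
  alloc_packet(void)
  {
          /*
           * On m_freem() the pair is cached in the packet zone: netstat -m
           * then shows the mbuf as "cache" in the mbuf zone while the
           * attached cluster stays "in use" in the cluster zone -- the
           * pseudo-leak described above.
           */
          return (m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR));
  }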
The right fix is to have UMA free mbuf+clusters from the packet zone back
to their native zones.  This should not be done with high/low watermarks
but with a median and a positive/negative deviation method.  To be
efficient, refills and drains to/from the packet zone should happen in
batches, not for single requests or frees; a rough sketch of what such a
batched drain could look like is below.  I'll look into it tomorrow.  I
may have to summon Bosko for some help on the secondary zone stuff, as he
introduced this feature.
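To make the batching idea a bit more concrete -- again pseudo-code written
for this mail, not the actual UMA implementation, and all of pkt_cached,
pkt_target, pkt_deviation and pktzone_uncache_one() are invented names --
a deviation-based drain could work roughly like this:

  /*
   * Sketch of a deviation-based batch drain for the packet zone.
   *   pkt_cached    - pairs currently cached in the packet zone
   *   pkt_target    - median of recent demand
   *   pkt_deviation - allowed swing around the target before acting
   */
  static int pkt_cached;
  static int pkt_target;
  static int pkt_deviation;

  static void
  pktzone_uncache_one(void)
  {
          /*
           * The real work would go here: take one mbuf+cluster pair out
           * of the packet zone's cache, detach the cluster and hand both
           * pieces back to their master zones.  Left empty in this sketch.
           */
  }

  static void
  pktzone_trim(void)
  {
          int excess;

          excess = pkt_cached - (pkt_target + pkt_deviation);
          if (excess <= 0)
                  return;         /* still inside the allowed band */

          /* Drain a whole batch down to the target, not one item per free. */
          while (excess-- > 0) {
                  pktzone_uncache_one();
                  pkt_cached--;
          }
  }

The refill side would be symmetric: fall below pkt_target - pkt_deviation
and pull a whole batch of pairs in from the master zones at once.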
-- 
Andre