Date: Tue, 16 Feb 2010 21:58:55 +0200
From: Kostik Belousov <kostikbel@gmail.com>
To: Maxim Sobolev <sobomax@freebsd.org>
Cc: freebsd-net@freebsd.org, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: Sudden mbuf demand increase and shortage under the load (igb issue?)
Message-ID: <20100216195855.GG50403@deviant.kiev.zoral.com.ua>
In-Reply-To: <4B7ADFC6.7020202@FreeBSD.org>
References: <4B79297D.9080403@FreeBSD.org> <4B79205B.619A0A1A@verizon.net> <4B7ADFC6.7020202@FreeBSD.org>
[Trimmed Cc: list]

On Tue, Feb 16, 2010 at 10:11:18AM -0800, Maxim Sobolev wrote:
> OK, here is some new data that I think rules out any issues with the
> applications. Following Alfred's suggestion I have made a script to run
> every second and output some system statistics:
>
> date
> netstat -m
> vmstat -i
> ps -axl
> pstat -T
> vmstat -z
> sysctl -a
>
> The problem had hit us again today several times and upon investigating
> the log I found that increase in the mbuf usage happened in one step -
> going from normal 10% to 100% between two script runs. What is more
> interesting, is that time from two such subsequent runs were about 2
> minutes apart (instead of 1 second as it should be) and when inspecting
> cron logs I noticed the same time gap in there. I ruled out any VM
> starvation as a cause of the delay because system has plenty of free
> memory. The incoming network traffic was not sufficient to starve VM so
> quickly either - it was about 7MB/sec at that time, so even if all
> receivers stopped draining their buffers it should have taken at least
> 1-2 seconds to fill up mbuf cache and create demand for an additional
> kernel memory. The failure would likely to be more gradual and I should
> have seen how it builds up in the debug log.
>
> So it looks like kernel issue of a sort, which causes all userland
> activity to cease for 2 minutes when the system reaches certain load.
> Mbuf build-up is only the by-product of this, not really a cause. igb(4)
> is being the primary suspect now, since we have other machines with more
> load not having this problem and we don't have anybody else using this
> driver.
> The chip is the following:
>
> igb0@pci0:5:0:0: class=0x020000 card=0x323f103c chip=0x10c98086
> rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     class      = network
>     subclass   = ethernet
> igb1@pci0:5:0:1: class=0x020000 card=0x323f103c chip=0x10c98086
> rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     class      = network
>     subclass   = ethernet
>
> Hardware in question is a new HP DL160G6. I have also checked IPMI logs
> and sensors and have not found any issue in there as well. No sensors
> reported off-range values and chassis temperature is within normal limits.
>
> I am not sure how to debug this problem further. We are now
> investigating opportunity to install external non-igb card to the server
> and see if it solves the issue.
>
> I have the whole log if anyone wants to take a closer peek.

How much physical memory do you have installed in the machine?
If it is > 16Gb, try to remove some.
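
The stall described in the quoted message (samples expected every second but arriving about two minutes apart) can be located in such a log by comparing consecutive sample timestamps. Below is a minimal sketch, assuming the `date` lines have already been parsed into `datetime` objects; the function name `find_stalls`, the 5-second threshold, and the sample times are all hypothetical and not taken from the original log.

```python
from datetime import datetime, timedelta


def find_stalls(timestamps, max_gap=timedelta(seconds=5)):
    """Return (previous, current, gap) for each pair of consecutive
    samples that are further apart than max_gap.

    timestamps: list of datetime objects in the order they were logged.
    max_gap: hypothetical threshold; a 1-second sampler should normally
             stay well below it.
    """
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > max_gap:
            stalls.append((prev, cur, gap))
    return stalls


if __name__ == "__main__":
    # Made-up sample times: one-second spacing, then a two-minute hole.
    start = datetime(2010, 2, 16, 10, 0, 0)
    samples = [
        start,
        start + timedelta(seconds=1),
        start + timedelta(seconds=2),
        start + timedelta(seconds=122),
        start + timedelta(seconds=123),
    ]
    for prev, cur, gap in find_stalls(samples):
        print(f"stall of {gap.total_seconds():.0f}s between {prev} and {cur}")
```

Matching each reported gap against the `netstat -m` output of the samples on either side would show whether the mbuf jump always coincides with a stall, as the quoted message suggests.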
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100216195855.GG50403>