Date:      Tue, 16 Feb 2010 21:58:55 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Maxim Sobolev <sobomax@freebsd.org>
Cc:        freebsd-net@freebsd.org, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Sudden mbuf demand increase and shortage under the load (igb issue?)
Message-ID:  <20100216195855.GG50403@deviant.kiev.zoral.com.ua>
In-Reply-To: <4B7ADFC6.7020202@FreeBSD.org>
References:  <4B79297D.9080403@FreeBSD.org> <4B79205B.619A0A1A@verizon.net> <4B7ADFC6.7020202@FreeBSD.org>


[Trimmed Cc: list]

On Tue, Feb 16, 2010 at 10:11:18AM -0800, Maxim Sobolev wrote:
> OK, here is some new data that I think rules out any issues with the
> applications. Following Alfred's suggestion I have made a script to run
> every second and output some system statistics:
>
> date
> netstat -m
> vmstat -i
> ps -axl
> pstat -T
> vmstat -z
> sysctl -a
>
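
A minimal sketch of such a sampler, assuming Python is available on the
machine. The command list is the one quoted above; the function names, the
output markers and the stall threshold are hypothetical. Timing each
iteration against the monotonic clock makes a long pause between two
snapshots show up in the log itself.

```python
import subprocess
import time

# Commands from the list above; they exist on FreeBSD and may not elsewhere.
COMMANDS = [
    ["date"],
    ["netstat", "-m"],
    ["vmstat", "-i"],
    ["ps", "-axl"],
    ["pstat", "-T"],
    ["vmstat", "-z"],
    ["sysctl", "-a"],
]


def run_command(argv):
    """Run one command and return its output, or a note if it cannot run."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=30)
        return done.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "could not run: %s\n" % exc


def sample(commands, out, iterations, interval=1.0, stall_factor=5.0):
    """Write `iterations` snapshots to `out`; return the list of stalled gaps.

    A gap is the monotonic time between the starts of two consecutive
    snapshots. It is reported as a stall when it exceeds
    interval * stall_factor (a hypothetical threshold).
    """
    stalls = []
    previous = None
    for _ in range(iterations):
        start = time.monotonic()
        if previous is not None:
            gap = start - previous
            if gap > interval * stall_factor:
                stalls.append(gap)
                out.write("### STALL: %.1f s since previous snapshot\n" % gap)
        previous = start
        out.write("### snapshot at %s\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
        for argv in commands:
            out.write("## %s\n" % " ".join(argv))
            out.write(run_command(argv))
        # Sleep only for what is left of the interval.
        remaining = interval - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
    return stalls
```

Calling sample(COMMANDS, log_file, iterations=3600) would cover about an
hour at one snapshot per second, provided the commands themselves finish
within the interval.
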
> The problem hit us again several times today, and on investigating
> the log I found that the increase in mbuf usage happened in one step,
> going from the normal 10% to 100% between two script runs. What is more
> interesting is that these two consecutive runs were about 2
> minutes apart (instead of 1 second as they should be), and when inspecting
> the cron logs I noticed the same time gap there. I ruled out VM
> starvation as the cause of the delay because the system has plenty of free
> memory. The incoming network traffic was not sufficient to starve VM so
> quickly either: it was about 7MB/sec at that time, so even if all
> receivers stopped draining their buffers it should have taken at least
> 1-2 seconds to fill up the mbuf cache and create demand for additional
> kernel memory. The failure would likely have been more gradual, and I should
> have seen it build up in the debug log.
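
The fill-time estimate above can be checked with a few lines of arithmetic.
This is only a sketch: the 10% usage and the 7 MB/s rate come from the
message, while the pool size (8192 clusters of 2048 bytes) and the function
name are hypothetical, so the result scales with whatever the real limit is.

```python
def seconds_to_fill(total_bytes, used_fraction, rate_bytes_per_sec):
    """Time for traffic arriving at a fixed rate to fill the free part of a pool."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate_bytes_per_sec must be positive")
    free_bytes = total_bytes * (1.0 - used_fraction)
    return free_bytes / rate_bytes_per_sec


# Hypothetical pool: 8192 clusters of 2048 bytes each (not from the thread).
POOL_BYTES = 8192 * 2048
# From the message: about 10% used, about 7 MB/s incoming.
print(seconds_to_fill(POOL_BYTES, 0.10, 7 * 1000 * 1000))
```

With these assumed numbers the pool fills in roughly two seconds, so
one-second snapshots would be expected to catch at least one intermediate
reading rather than a single jump from 10% to 100%.
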
>
> So it looks like a kernel issue of some sort, which causes all userland
> activity to cease for 2 minutes when the system reaches a certain load.
> The mbuf build-up is only a by-product of this, not really a cause. igb(4)
> is the primary suspect now, since we have other machines under more
> load that do not have this problem, and nobody else here is using this
> driver.  The chip is the following:
>
> igb0@pci0:5:0:0:        class=0x020000 card=0x323f103c chip=0x10c98086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     class      = network
>     subclass   = ethernet
> igb1@pci0:5:0:1:        class=0x020000 card=0x323f103c chip=0x10c98086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     class      = network
>     subclass   = ethernet
>
> The hardware in question is a new HP DL160G6. I have also checked the IPMI
> logs and sensors and have not found any issue there either. No sensors
> reported out-of-range values and the chassis temperature is within normal
> limits.
>
> I am not sure how to debug this problem further. We are now
> looking into installing an external non-igb card in the server
> to see if it solves the issue.
>
> I have the whole log if anyone wants to take a closer look.

How much physical memory do you have installed in the machine?
If it is more than 16GB, try removing some.
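
One way to check this from a script is to read the hw.physmem sysctl, which
on FreeBSD reports physical memory in bytes (usually slightly less than the
installed amount). The sketch below treats the 16GB threshold as 16 GiB,
which is an assumption, and the function names are hypothetical.

```python
import subprocess

GIB = 1024 ** 3


def exceeds_limit(physmem_bytes, limit_gib=16):
    """Return True when physical memory is larger than limit_gib GiB."""
    return physmem_bytes > limit_gib * GIB


def read_physmem():
    """Return hw.physmem in bytes, or None if sysctl cannot provide it."""
    try:
        done = subprocess.run(["sysctl", "-n", "hw.physmem"],
                              capture_output=True, text=True, timeout=10)
        return int(done.stdout.strip())
    except (OSError, ValueError, subprocess.TimeoutExpired):
        return None


if __name__ == "__main__":
    physmem = read_physmem()
    if physmem is None:
        print("hw.physmem not available on this system")
    else:
        print("%.1f GiB reported, above 16 GiB: %s"
              % (physmem / GIB, exceeds_limit(physmem)))
```
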
