Date:      Wed, 30 Mar 2011 10:10:23 -0700
From:      YongHyeon PYUN <pyunyh@gmail.com>
To:        Vlad Galu <dudu@dudu.ro>
Cc:        freebsd-net@freebsd.org, Arnaud Lacombe <lacombar@gmail.com>
Subject:   Re: bge(4) on RELENG_8 mbuf cluster starvation
Message-ID:  <20110330171023.GA8601@michelle.cdnetworks.com>
In-Reply-To: <AANLkTi=dci-cKVuvpXCs40u8u=5LGzey6s5-jYXEPM7s@mail.gmail.com>
References:  <AANLkTimSs48ftRv8oh1wTwMEpgN1Ny3B1ahzfS=AbML_@mail.gmail.com> <AANLkTimfh3OdXOe1JFo5u6JypcLrcWKv2WpSu8Uv-tgv@mail.gmail.com> <AANLkTi=rWobA40UtCTSeOzEz65TMw8vfCcxtMWBBme+u@mail.gmail.com> <20110313011632.GA1621@michelle.cdnetworks.com> <AANLkTi=dci-cKVuvpXCs40u8u=5LGzey6s5-jYXEPM7s@mail.gmail.com>

On Wed, Mar 30, 2011 at 05:55:47PM +0200, Vlad Galu wrote:
> On Sun, Mar 13, 2011 at 2:16 AM, YongHyeon PYUN <pyunyh@gmail.com> wrote:
> 
> > On Sat, Mar 12, 2011 at 09:17:28PM +0100, Vlad Galu wrote:
> > > On Sat, Mar 12, 2011 at 8:53 PM, Arnaud Lacombe <lacombar@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > On Sat, Mar 12, 2011 at 4:03 AM, Vlad Galu <dudu@dudu.ro> wrote:
> > > > > Hi folks,
> > > > >
> > > > > On a fairly busy recent (r219010) RELENG_8 machine I keep getting
> > > > > -- cut here --
> > > > > 1096/1454/2550 mbufs in use (current/cache/total)
> > > > > 1035/731/1766/262144 mbuf clusters in use (current/cache/total/max)
> > > > > 1035/202 mbuf+clusters out of packet secondary zone in use (current/cache)
> > > > > 0/117/117/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
> > > > > 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
> > > > > 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
> > > > > 2344K/2293K/4637K bytes allocated to network (current/cache/total)
> > > > > 0/70128196/37726935 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> > > > > ^^^^^^^^^^^^^^^^^^^^^
> > > > > -- and here --
> > > > >
> > > > > kern.ipc.nmbclusters is set to 131072. Other settings:
> > > > no, netstat(8) says 262144.
> > > >
> > > >
> > > Heh, you're right, I forgot I'd doubled it a while ago. Wrote that
> > > from the top of my head.
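
(For reference, kern.ipc.nmbclusters is normally bumped at boot via
/boot/loader.conf, e.g.:

kern.ipc.nmbclusters="262144"

and, if I recall correctly, on RELENG_8 it can also be raised at
runtime with sysctl(8).)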
> > >
> > >
> > > > Maybe you can include $(sysctl dev.bge)? It might be useful.
> > > >
> > > >  - Arnaud
> > > >
> > >
> > > Sure:
> >
> > [...]
> >
> > > dev.bge.1.%desc: Broadcom NetXtreme Gigabit Ethernet Controller, ASIC
> > > rev. 0x004101
> > > dev.bge.1.%driver: bge
> > > dev.bge.1.%location: slot=0 function=0
> > > dev.bge.1.%pnpinfo: vendor=0x14e4 device=0x1659 subvendor=0x1014
> > > subdevice=0x02c6 class=0x020000
> > > dev.bge.1.%parent: pci5
> > > dev.bge.1.forced_collapse: 2
> > > dev.bge.1.forced_udpcsum: 0
> > > dev.bge.1.stats.FramesDroppedDueToFilters: 0
> > > dev.bge.1.stats.DmaWriteQueueFull: 0
> > > dev.bge.1.stats.DmaWriteHighPriQueueFull: 0
> > > dev.bge.1.stats.NoMoreRxBDs: 680050
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > This indicates bge(4) encountered an RX buffer shortage. Perhaps
> > bge(4) couldn't fill new RX buffers for incoming frames due to
> > other system activity.
> >
> > > dev.bge.1.stats.InputDiscards: 228755931
> >
> > This counter shows the number of frames discarded due to an RX
> > buffer shortage. bge(4) discards a received frame when it fails to
> > allocate a new RX buffer, which is why InputDiscards is normally
> > higher than NoMoreRxBDs.
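
For reference, the RX refill logic in most FreeBSD NIC drivers
follows roughly the pattern sketched below. This is only a
simplified illustration of the general technique (the xx_* names and
the ring size are made up), not the actual bge(4) code:

#include <sys/param.h>
#include <sys/mbuf.h>

struct xx_softc {
	struct mbuf	*rx_chain[512];	/* hypothetical RX ring */
};

static int
xx_newbuf(struct xx_softc *sc, int idx)
{
	struct mbuf *m;

	/* Try to allocate a new mbuf with a cluster attached. */
	m = m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR);
	if (m == NULL)
		return (ENOBUFS);	/* old buffer stays in the slot */
	m->m_len = m->m_pkthdr.len = MCLBYTES;
	sc->rx_chain[idx] = m;
	/* ... unload/reload the DMA map and RX descriptor here ... */
	return (0);
}

	/* In the RX interrupt handler: */
	if (xx_newbuf(sc, idx) != 0) {
		/* Allocation failed: the old mbuf is recycled and
		 * the received frame is dropped. */
		ifp->if_iqdrops++;
		continue;
	}

Since the ring slot is recycled on every allocation failure, each
buffer shortage also shows up as discarded input.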
> >
> > > dev.bge.1.stats.InputErrors: 49080818
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Something is wrong here. Too many frames were classified as error
> > frames. You may see poor RX performance.
> >
> > > dev.bge.1.stats.RecvThresholdHit: 0
> > > dev.bge.1.stats.rx.ifHCInOctets: 2095148839247
> > > dev.bge.1.stats.rx.Fragments: 47887706
> > > dev.bge.1.stats.rx.UnicastPkts: 32672557601
> > > dev.bge.1.stats.rx.MulticastPkts: 1218
> > > dev.bge.1.stats.rx.BroadcastPkts: 2
> > > dev.bge.1.stats.rx.FCSErrors: 2822217
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > FCS errors are too high. Please check the cabling again (I'm
> > assuming the controller is not broken here). I think you can use
> > the vendor's diagnostic tools to verify this.
> >
> > > dev.bge.1.stats.rx.AlignmentErrors: 0
> > > dev.bge.1.stats.rx.xonPauseFramesReceived: 0
> > > dev.bge.1.stats.rx.xoffPauseFramesReceived: 0
> > > dev.bge.1.stats.rx.ControlFramesReceived: 0
> > > dev.bge.1.stats.rx.xoffStateEntered: 0
> > > dev.bge.1.stats.rx.FramesTooLong: 0
> > > dev.bge.1.stats.rx.Jabbers: 0
> > > dev.bge.1.stats.rx.UndersizePkts: 0
> > > dev.bge.1.stats.tx.ifHCOutOctets: 48751515826
> > > dev.bge.1.stats.tx.Collisions: 0
> > > dev.bge.1.stats.tx.XonSent: 0
> > > dev.bge.1.stats.tx.XoffSent: 0
> > > dev.bge.1.stats.tx.InternalMacTransmitErrors: 0
> > > dev.bge.1.stats.tx.SingleCollisionFrames: 0
> > > dev.bge.1.stats.tx.MultipleCollisionFrames: 0
> > > dev.bge.1.stats.tx.DeferredTransmissions: 0
> > > dev.bge.1.stats.tx.ExcessiveCollisions: 0
> > > dev.bge.1.stats.tx.LateCollisions: 0
> > > dev.bge.1.stats.tx.UnicastPkts: 281039183
> > > dev.bge.1.stats.tx.MulticastPkts: 0
> > > dev.bge.1.stats.tx.BroadcastPkts: 1153
> > > -- and here --
> > >
> > > And here is something else I just remembered:
> > > -- cut here --
> > > Name    Mtu Network       Address              Ipkts     Ierrs    Idrop     Opkts Oerrs  Coll
> > > bge1   1500 <Link#2>      00:11:25:22:0d:ed 32321767025 278517070 37726837 281068216    0     0
> > > -- and here --
> > > The colo provider changed my cable a couple of times, so I wouldn't
> > > blame it on that. Unfortunately, I don't have access to the port
> > > statistics on the switch. Running netstat with -w1 yields between 0
> > > and 4 errors/second.
> > >
> >
> > The hardware MAC counters still show a high number of FCS errors.
> > The service provider should check for possible cabling issues on
> > the switch port.
> >
> 
> After swapping cables and plugging the NIC into another switch, there
> are some improvements. However:
> -- cut here --
> dev.bge.1.%desc: Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev.
> 0x004101
> dev.bge.1.%driver: bge
> dev.bge.1.%location: slot=0 function=0
> dev.bge.1.%pnpinfo: vendor=0x14e4 device=0x1659 subvendor=0x1014
> subdevice=0x02c6 class=0x020000
> dev.bge.1.%parent: pci5
> dev.bge.1.forced_collapse: 0
> dev.bge.1.forced_udpcsum: 0
> dev.bge.1.stats.FramesDroppedDueToFilters: 0
> dev.bge.1.stats.DmaWriteQueueFull: 0
> dev.bge.1.stats.DmaWriteHighPriQueueFull: 0
> dev.bge.1.stats.NoMoreRxBDs: 243248 <- this
> dev.bge.1.stats.InputDiscards: 9945500
> dev.bge.1.stats.InputErrors: 0

There are still discarded frames, but I believe they are not related
to any cabling issue since you no longer see FCS or alignment
errors.

> dev.bge.1.stats.RecvThresholdHit: 0
> dev.bge.1.stats.rx.ifHCInOctets: 36697296701
> dev.bge.1.stats.rx.Fragments: 0
> dev.bge.1.stats.rx.UnicastPkts: 549334370
> dev.bge.1.stats.rx.MulticastPkts: 113638
> dev.bge.1.stats.rx.BroadcastPkts: 0
> dev.bge.1.stats.rx.FCSErrors: 0
> dev.bge.1.stats.rx.AlignmentErrors: 0
> dev.bge.1.stats.rx.xonPauseFramesReceived: 0
> dev.bge.1.stats.rx.xoffPauseFramesReceived: 0
> dev.bge.1.stats.rx.ControlFramesReceived: 0
> dev.bge.1.stats.rx.xoffStateEntered: 0
> dev.bge.1.stats.rx.FramesTooLong: 0
> dev.bge.1.stats.rx.Jabbers: 0
> dev.bge.1.stats.rx.UndersizePkts: 0
> dev.bge.1.stats.tx.ifHCOutOctets: 10578000636
> dev.bge.1.stats.tx.Collisions: 0
> dev.bge.1.stats.tx.XonSent: 0
> dev.bge.1.stats.tx.XoffSent: 0
> dev.bge.1.stats.tx.InternalMacTransmitErrors: 0
> dev.bge.1.stats.tx.SingleCollisionFrames: 0
> dev.bge.1.stats.tx.MultipleCollisionFrames: 0
> dev.bge.1.stats.tx.DeferredTransmissions: 0
> dev.bge.1.stats.tx.ExcessiveCollisions: 0
> dev.bge.1.stats.tx.LateCollisions: 0
> dev.bge.1.stats.tx.UnicastPkts: 64545266
> dev.bge.1.stats.tx.MulticastPkts: 0
> dev.bge.1.stats.tx.BroadcastPkts: 313
> 
> and
> 0/1710531/2006005 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> -- and here --
> 
> I'll start gathering some stats/charts on this host to see if I can
> correlate the starvation with other system events.
> 

Now the MAC statistics counters show nothing abnormal, which in turn
indicates the mbuf starvation comes from some other issue. The next
step is to identify which process or kernel subsystem is consuming a
lot of mbuf clusters.
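
For example, with just the stock tools (the exact zone and column
names may differ a bit across versions):

# UMA zone usage for mbufs/clusters; the failure column matches
# the "requests denied" numbers reported by netstat -m.
vmstat -z | egrep -i 'ITEM|mbuf'

# Snapshot overall mbuf usage periodically to catch spikes.
while :; do date; netstat -m; sleep 10; done

# Check for sockets sitting on unread data (large Recv-Q), which
# would pin clusters and fit the "slow to release" guess quoted
# below.
netstat -an
sockstat -46

If some process keeps a large, stagnant Recv-Q while the denied
counter climbs, that would point at the consumer.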

> 
> 
> > However, this does not explain why you see a large number of mbuf
> > cluster allocation failures. The only wild guess I have at the
> > moment is that some process or kernel subsystem is too slow to
> > release its allocated mbuf clusters. Did you check various system
> > activities while seeing the issue?
> >


