Date:      Tue, 5 Feb 2013 21:35:03 +0100
From:      Marius Strobl <marius@alchemy.franken.de>
To:        YongHyeon PYUN <pyunyh@gmail.com>
Cc:        Kurt Lidl <lidl@pix.net>, freebsd-sparc64@freebsd.org
Subject:   Re: console stops with 9.1-RELEASE when under forwarding load
Message-ID:  <20130205203503.GR80850@alchemy.franken.de>
In-Reply-To: <20130205072553.GB1439@michelle.cdnetworks.com>
References:  <20130122043541.GA67894@pix.net> <20130123223009.GA22474@alchemy.franken.de> <20130205061956.GB40942@pix.net> <20130205072553.GB1439@michelle.cdnetworks.com>

On Tue, Feb 05, 2013 at 04:25:53PM +0900, YongHyeon PYUN wrote:
> On Tue, Feb 05, 2013 at 01:19:56AM -0500, Kurt Lidl wrote:
> > On Wed, Jan 23, 2013 at 11:30:09PM +0100, Marius Strobl wrote:
> > > On Mon, Jan 21, 2013 at 11:35:41PM -0500, Kurt Lidl wrote:
> > > > I'm not sure if this is better directed at freebsd-sparc64@
> > > > or freebsd-net@ but I'm going to guess here...
> > > > 
> > > > Anyways.  In all cases, I'm using an absolutely stock
> > > > FreeBSD 9.1-release installation.
> > > > 
> > > > I got several SunFire V120 machines recently, and have been testing
> > > > them out to verify their operation.  They all started out identically
> > > > configured -- 1 GB of memory, 2x36GB disks, DVD-ROM, 650 MHz processor.
> > > > The V120 has two on-board "gem" network interfaces.  And the machine
> > > > can take a single, 32-bit PCI card.
> > > > 
> > > > I've benchmarked the gem interfaces being able to source or sink
> > > > about 90mbit/sec of TCP traffic.  This is comparable to the speed
> > > > of "hme" interfaces that I've tested in my slower Netra-T1-105
> > > > machines.
> > > > 
> > > > So.  I put an Intel 32-bit gig-e interface (a "GT" desktop
> > > > Gig-E interface) into the machine, and it comes up like this:
> > > > 
> > > > em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.4> port 0xc00200-0xc0023f mem 0x20000-0x3ffff,0x40000-0x5ffff at device 5.0 on pci2
> > > > em0: Memory Access and/or Bus Master bits were not set!
> > > > em0: Ethernet address: 00:1b:21:<redacted>
> > > > 
> > > > That interface can source or sink TCP traffic at about
> > > > 248 mbit/sec.
> > > > 
> > > > Since I really want to make one of these machines my firewall/router,
> > > > I took a different, dual-port Intel Gig-E server adaptor (a 64-bit
> > > > PCI card) and put it into one of the machines so I could look at
> > > > the forwarding performance.  It probes like this:
> > > > 
> > > > em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.4> port 0xc00200-0xc0023f mem 0x20000-0x3ffff,0x40000-0x7ffff at device 5.0 on pci2
> > > > em0: Memory Access and/or Bus Master bits were not set!
> > > > em0: Ethernet address: 00:04:23:<redacted>
> > > > em1: <Intel(R) PRO/1000 Legacy Network Connection 1.0.4> port 0xc00240-0xc0027f mem 0xc0000-0xdffff,0x100000-0x13ffff at device 5.1 on pci2
> > > > em1: Memory Access and/or Bus Master bits were not set!
> > > > em1: Ethernet address: 00:04:23:<redacted>
> > > > 
> > > > Now this card can source traffic at about 250 mbit/sec and can sink
> > > > traffic around 204 mbit/sec.
> > > > 
> > > > But the real question is - how is the forwarding performance?
> > > > 
> > > > So I set up a test between some machines:
> > > > 
> > > > A --tcp data--> em0-sparc64-em1 --tcp data--> B
> > > > |                                             |
> > > > \---------<--------tcp acks-------<-----------/
> > > > 
> > > > So, A sends to interface em0 on the sparc64, the sparc64
> > > > forwards out em1 to host B, and the ack traffic flows out
> > > > a different interface from B to A.  (A and B are amd64
> > > > machines, with Gig-E interfaces that are considerably
> > > > faster than the sparc64 machines.)
> > > > 
> > > > This test works surprisingly well -- 270 mbit/sec of forwarding
> > > > traffic, at around 29500 packets/second.
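(For anyone who wants to reproduce this: the setup above presumably
boils down to enabling forwarding on the sparc64 and pushing a TCP
stream from A to B through it, roughly as follows. The iperf
invocations are only my assumption, Kurt doesn't say which tool he
actually used.

    # on the sparc64 router
    sysctl net.inet.ip.forwarding=1

    # on host B, the sink; iperf is just an example traffic generator
    iperf -s

    # on host A, the source, with its route to B pointing at em0 of
    # the sparc64
    iperf -c <address of B> -t 60

For the first test the return route on B has to point directly at A;
for the second test it has to point at em1 of the sparc64.)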
> > > > 
> > > > The problem is when I change the test to send the tcp ack traffic
> > > > back through the sparc64 (so, ack traffic goes from B into em1,
> > > > then forwarded out em0 to A), while doing the data in the same way.
> > > > 
> > > > The console of the sparc64 becomes completely unresponsive during
> > > > the running of this test.  The 'netstat 1' that I've been running just
> > > > stops.  When the data finishes transmitting, the netstat output
> > > > gives one giant jump, counting all the packets that were sent during
> > > > the test as if they happened in a single second.
> > > > 
> > > > It's pretty clear that the process I'm running on the console isn't
> > > > receiving any cycles at all.  This is true for whatever I have
> > > > running on the console of the machine -- a shell, vmstat, iostat,
> > > > whatever.  It just hangs until the forwarding test is over.
> > > > Then the console input/output resumes normally.
> > > > 
> > > > Has anybody else seen this type of problem?
> > > > 
> > > 
> > > I don't see what could be a sparc64-specific problem in this case.
> > > You are certainly pushing the hardware beyond its limits, though, and
> > > it would be interesting to know how a similarly "powerful" i386
> > > machine behaves in this case.
> > > In any case, in order to not burn any CPU cycles needlessly, you
> > > should use a kernel built from a config stripped down to your
> > > requirements and with options SMP removed to get the maximum out
> > > of a UP machine. It could also be that SCHED_ULE actually helps
> > > in this case (there's a bug in 9.1-RELEASE causing problems with
> > > SCHED_ULE and SMP on sparc64, but for UP it should be fine).
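(For the archive: "stripped down to your requirements with options SMP
removed" would be roughly a kernel config along these lines; this is
only a minimal sketch, the ident is made up, and further nooptions/
nodevice lines can be added to taste.)

    include         GENERIC
    ident           V120_ROUTER     # hypothetical name
    nooptions       SMP             # build a UP kernel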
> > 
> > I updated the kernel tree on one of my sparc64 machines to the
> > latest version of 9-STABLE, and gave the following combinations a
> > try:
> > 	SMP+ULE
> > 	SMP+4BSD
> > 	non-SMP+ULE
> > 	non-SMP+4BSD
> > They all performed about the same in terms of throughput, and about
> > the same in terms of user responsiveness when under load.
> > None were responsive when forwarding ~214mbit/sec of traffic.
> > 
> > I played around a bit with tuning of the rx/tx queue depths for the
> > em0/em1 devices, but none of that made any perceptible difference in
> > the level of throughput or responsiveness of the machine.
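(I assume that tuning was done via the em(4)/lem(4) loader tunables,
i.e. with something like the following in /boot/loader.conf; the names
and values below are my guess at what was adjusted, not a
recommendation.)

    # RX/TX descriptor ring sizes -- example values only
    hw.em.rxd="2048"
    hw.em.txd="2048"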
> 
> If my memory serves me right, em(4) requires a considerably fast
> machine to offset the overhead of taskqueue(9). Because the
> taskqueue handler is enqueued again and again under heavy RX
> network load, most system cycles would be consumed in the
> taskqueue handler.
> Try polling(4) and see whether it makes any difference. I'm not
> sure whether polling(4) works on sparc64 though.
> 

This might or might not work, and could even cause ill effects. In general,
Sun PCI bridges synchronize DMA on interrupts and polling(4) bypasses
that mechanism. For the host-PCI bridges found in the V120, psycho(4)
additionally synchronizes DMA manually when bus_dmamap_sync(9) is called
with BUS_DMASYNC_POSTREAD (as suggested in the datasheet). I'm not sure
whether this is also sufficient for polling(4). In any case, sun4u
hardware certainly wasn't built with something like polling(4) in mind.
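Should you want to try it anyway, that would roughly mean building a
kernel with polling enabled and switching it on per interface, along
these lines (a sketch only, untested here):

    # kernel config additions
    options         DEVICE_POLLING
    options         HZ=1000         # polling(4) suggests a higher HZ

    # then, at runtime, per interface
    ifconfig em0 polling
    ifconfig em1 polling
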
Hrm, according to my reading of the lem(4) source, it shouldn't use
taskqueue(9) for the MACs in question when the loader tunable
hw.em.use_legacy_irq is set to 1. At any rate, that is certainly easier
to test than rebuilding a kernel with polling(4) support.
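Concretely, that amounts to a single line in /boot/loader.conf (set
before boot; I haven't tried it myself on these MACs):

    hw.em.use_legacy_irq="1"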

Marius



