Date:      Mon, 6 Apr 2009 10:12:00 -0700 (PDT)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        freebsd-net@freebsd.org
Subject:   Re: Advice on a multithreaded netisr  patch?
Message-ID:  <723620.1225.qm@web63905.mail.re1.yahoo.com>
In-Reply-To: <grcsus$9vh$1@ger.gmane.org>





--- On Mon, 4/6/09, Ivan Voras <ivoras@freebsd.org> wrote:

> From: Ivan Voras <ivoras@freebsd.org>
> Subject: Re: Advice on a multithreaded netisr  patch?
> To: freebsd-net@freebsd.org
> Date: Monday, April 6, 2009, 8:35 AM
> Robert Watson wrote:
> > On Mon, 6 Apr 2009, Ivan Voras wrote:
> 
> >> So, an mbuf can reference data not yet copied from the NIC
> >> hardware? I'm specifically trying to understand what m_pullup()
> >> does.
> > 
> > I think we're talking slightly at cross purposes.  There are two
> > transfers of interest:
> > 
> > (1) DMA of the packet data to main memory from the NIC
> > (2) Servicing of CPU cache misses to access data in main memory
> > 
> > By the time you receive an interrupt, the DMA is complete, so once
> > you
> 
> OK, this was what was confusing me - for a moment I thought you
> meant it's not so.
> 
> > believe a packet referenced by the descriptor ring is done, you
> > don't have to wait for DMA.  However, the packet data is in main
> > memory rather than your CPU cache, so you'll need to take a cache
> > miss in order to retrieve it.  You don't want to prefetch before
> > you know the packet data is there, or you may prefetch stale data
> > from the previous packet sent or received from the cluster.
> > 
> > m_pullup() has to do with mbuf chain memory contiguity during
> > packet processing.  The usual usage is something along the
> > following lines:
> > 
> >     struct whatever *w;
> > 
> >     m = m_pullup(m, sizeof(*w));
> >     if (m == NULL)
> >         return;
> >     w = mtod(m, struct whatever *);
> >
> > m_pullup() here ensures that the first sizeof(*w) bytes of mbuf
> > data are contiguously stored so that the cast of w to m's data
> > will point at a
> 
> So, m_pullup() can resize / realloc() the mbuf? (not that it
> matters for this purpose)
> 
> > Is this for the loopback workload?  If so, remember that there
> > may be some other things going on:
> 
> Both loopback and physical.
> 
> > - Every packet is processed at least two times: once when sent,
> >   and then again when it's received.
> > 
> > - A TCP segment will need to be ACK'd, so if you're sending data
> >   in chunks in one direction, the ACKs will not be piggy-backed on
> >   existing data transfers, and instead be sent independently,
> >   hitting the network stack two more times.
> 
> No combination of these can make an accounting difference between
> 1,000 and 250,000 pps. I must be hitting something very bad here.
> 
> > - Remember that TCP works to expand its window, and then maintains
> >   the highest performance it can by bumping up against the top of
> >   available bandwidth continuously.  This involves detecting
> >   buffer limits by generating packets that can't be sent, adding
> >   to the packet count.  With loopback traffic, the drop point
> >   occurs when you exceed the size of the netisr's queue for IP, so
> >   you might try bumping that from the default to something much
> >   larger.
> 
> My messages are approx. 100 +/- 10 bytes. No practical way they
> will even span multiple mbufs. TCP_NODELAY is on.
> 
> > No.  x++ is massively slow if executed in parallel across many
> > cores on a variable in a single cache line.  See my recent commit
> > to kern_tc.c for an example: the updating of trivial statistics
> > for the kernel time calls reduced 30m syscalls/second to 3m
> > syscalls/second due to heavy contention on the cache line holding
> > the statistic.  One of my goals for
> 
> I don't get it:
> http://svn.freebsd.org/viewvc/base/stable/7/sys/kern/kern_tc.c?r1=189891&r2=189890&pathrev=189891
> 
> you replaced x++ with no-ops if TC_COUNTER is defined? Aren't the
> timecounters actually needed somewhere?
> 
> > 8.0 is to fix this problem for IP and TCP layers, and ideally
> > also ifnet but we'll see.  We should be maintaining those stats
> > per-CPU and then aggregating to report them to userspace.  This is
> > what we already do for a number of system stats -- UMA and kernel
> > malloc, syscall and trap counters, etc.
> 
> How magic is this? Is it just a matter of declaring
> mystatarray[NCPU] and updating mystat[current_cpu], or (more
> likely) does the spacing between array elements need to be fixed up
> so that two elements don't share a cache line?
> 
> >>> - Use cpuset to pin ithreads, the netisr, and whatever else, to
> >>>   specific cores so that they don't migrate, and if your system
> >>>   uses HTT, experiment with pinning the ithread and the netisr
> >>>   on different threads on the same core, or at least, different
> >>>   cores on the same die.
> >>
> >> I'm using em hardware; I still think there's a possibility I'm
> >> fighting the driver in some cases, but this has priority #2.
> > 
> > Have you tried LOCK_PROFILING?  It would quickly tell you if
> > driver locks were a source of significant contention.  It works
> > quite well...

I enabled lock profiling in my kernel and the system panics in
lock_init() for one of my drivers. Are you aware of any issues that
would be specific to lock profiling being enabled?
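
For the record, this is roughly how it was enabled (7.x-era, from
memory -- the LOCK_PROFILING(9) man page is the authoritative
reference, and the exact sysctl names should be double-checked there):

```
# kernel config:
options LOCK_PROFILING

# after boot, turn collection on and read the results:
sysctl debug.lock.prof.enable=1
sysctl debug.lock.prof.stats
```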

Barney


      


