From: Ivan Voras
To: freebsd-net@freebsd.org
Date: Mon, 06 Apr 2009 14:35:33 +0200
Subject: Re: Advice on a multithreaded netisr patch?

Robert Watson wrote:
> On Mon, 6 Apr 2009, Ivan Voras wrote:
>> So, a mbuf can reference data not yet copied from the NIC hardware?
>> I'm specifically trying to understand what m_pullup() does.
>
> I think we're talking slightly at cross purposes.  There are two
> transfers of interest:
>
> (1) DMA of the packet data to main memory from the NIC
> (2) Servicing of CPU cache misses to access data in main memory
>
> By the time you receive an interrupt, the DMA is complete, so once you

OK, this was what was confusing me - for a moment I thought you meant
it's not so.

> believe a packet referenced by the descriptor ring is done, you don't
> have to wait for DMA.  However, the packet data is in main memory rather
> than your CPU cache, so you'll need to take a cache miss in order to
> retrieve it.  You don't want to prefetch before you know the packet data
> is there, or you may prefetch stale data from the previous packet sent
> or received from the cluster.
>
> m_pullup() has to do with mbuf chain memory contiguity during packet
> processing.  The usual usage is something along the following lines:
>
>     struct whatever *w;
>
>     m = m_pullup(m, sizeof(*w));
>     if (m == NULL)
>         return;
>     w = mtod(m, struct whatever *);
>
> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are
> contiguously stored so that the cast of w to m's data will point at a

So, m_pullup() can resize / realloc() the mbuf? (not that it matters for
this purpose)

> Is this for the loopback workload?
> If so, remember that there may be
> some other things going on:

Both loopback and physical.

> - Every packet is processed at least two times: once when sent, and then
>   again when it's received.
>
> - A TCP segment will need to be ACK'd, so if you're sending data in
>   chunks in one direction, the ACKs will not be piggy-backed on existing
>   data transfers, and instead be sent independently, hitting the network
>   stack two more times.

No combination of these can account for the difference between 1,000
and 250,000 pps. I must be hitting something very bad here.

> - Remember that TCP works to expand its window, and then maintains the
>   highest performance it can by bumping up against the top of available
>   bandwidth continuously.  This involves detecting buffer limits by
>   generating packets that can't be sent, adding to the packet count.
>   With loopback traffic, the drop point occurs when you exceed the size
>   of the netisr's queue for IP, so you might try bumping that from the
>   default to something much larger.

My messages are approx. 100 +/- 10 bytes. No practical way they will
even span multiple mbufs. TCP_NODELAY is on.

> No.  x++ is massively slow if executed in parallel across many cores on
> a variable in a single cache line.  See my recent commit to kern_tc.c
> for an example: the updating of trivial statistics for the kernel time
> calls reduced 30m syscalls/second to 3m syscalls/second due to heavy
> contention on the cache line holding the statistic.  One of my goals for

I don't get it:

http://svn.freebsd.org/viewvc/base/stable/7/sys/kern/kern_tc.c?r1=189891&r2=189890&pathrev=189891

you replaced x++ with no-ops if TC_COUNTER is defined? Aren't the
timecounters actually needed somewhere?

> 8.0 is to fix this problem for IP and TCP layers, and ideally also ifnet
> but we'll see.  We should be maintaining those stats per-CPU and then
> aggregating to report them to userspace.
> This is what we already do for
> a number of system stats -- UMA and kernel malloc, syscall and trap
> counters, etc.

How magic is this? Is it just a matter of declaring mystatarray[NCPU]
and updating mystatarray[current_cpu], or (probably) should the spacing
between array elements be fixed so that no two elements share a cache
line?

>>> - Use cpuset to pin ithreads, the netisr, and whatever else, to
>>>   specific cores so that they don't migrate, and if your system uses
>>>   HTT, experiment with pinning the ithread and the netisr on different
>>>   threads on the same core, or at least, different cores on the same
>>>   die.
>>
>> I'm using em hardware; I still think there's a possibility I'm
>> fighting the driver in some cases but this has priority #2.
>
> Have you tried LOCK_PROFILING?  It would quickly tell you if driver
> locks were a source of significant contention.  It works quite well...

I don't think I'm fighting against locking artifacts; it looks more like
some kind of overly smart hardware thing, like interrupt moderation (but
not exactly interrupt moderation, since the number of IRQs/s remains
approx. the same).

>>> - If your card supports RSS, pass the flowid up the stack in the mbuf
>>>   packet header flowid field, and use that instead of the hash for
>>>   work placement.
>>
>> Don't know about em. Don't really want to touch it if I don't have to :)
>
> if_em doesn't support it, but if_igb does.  If this saves you a minimum
> of one and possibly two cache misses per packet, it could be a huge
> performance improvement.

If I had the funds to upgrade hardware, I wouldn't be so interested in
solving it in software :)
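
For illustration of the per-CPU statistics question above, here is a minimal user-space sketch of the "padded array" idea. It is not the kernel's actual implementation. The names CACHE_LINE, NSLOTS, padded_counter and the mystat_* helpers are hypothetical, the 64-byte cache line size is an assumption, and the CPU index is passed in as a parameter rather than taken from the running CPU as kernel code would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */
#define NSLOTS     4    /* assumption: stand-in for the number of CPUs */

/*
 * One counter per slot. _Alignas forces each element onto its own
 * cache line, so updates from different CPUs never touch the same line.
 */
struct padded_counter {
    _Alignas(CACHE_LINE) uint64_t value;
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
    "each counter must occupy exactly one cache line");

struct padded_counter mystat[NSLOTS];

/* Zero every per-slot counter. */
void mystat_reset(void)
{
    for (size_t i = 0; i < NSLOTS; i++)
        mystat[i].value = 0;
}

/* Update only the slot belonging to the given (hypothetical) CPU index. */
void mystat_inc(unsigned cpu)
{
    assert(cpu < NSLOTS);
    mystat[cpu].value++;
}

/*
 * Aggregate on demand, e.g. when reporting to userspace. With concurrent
 * writers this is only an approximate snapshot, which is usually
 * acceptable for statistics.
 */
uint64_t mystat_sum(void)
{
    uint64_t total = 0;

    for (size_t i = 0; i < NSLOTS; i++)
        total += mystat[i].value;
    return total;
}
```

After mystat_reset(), calling mystat_inc(0) twice and mystat_inc(3) once makes mystat_sum() return 3. The padding trades memory (one cache line per counter per CPU) for the absence of cache-line contention on the update path; the cost moves to the rarely executed aggregation loop.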