From owner-freebsd-arch@FreeBSD.ORG Sat Jan 4 00:55:50 2014
Date: Fri, 3 Jan 2014 16:55:48 -0800
Sender: adrian.chadd@gmail.com
Subject: Acquiring a lock on the same CPU that holds it - what can be done?
From: Adrian Chadd
To: "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
List-Id: Discussion related to FreeBSD architecture

Hi,

So here's a fun one.

When doing TCP traffic + socket affinity + thread pinning experiments, I
seem to hit this very annoying scenario that caps my performance and
scalability.

Assume I've lined up everything relating to a socket to run on the same
CPU (ie, TX, RX, TCP timers, userland thread):

* userland code calls something, let's say "kqueue"
* the kqueue lock gets grabbed
* an interrupt comes in for the NIC
* the NIC code runs some RX code, and eventually hits something that
  wants to push a knote up
* .. and the knote is for the same kqueue above
* .. so it grabs the lock..
* .. and contests..
* then the scheduler flips us back to the original userland thread doing TX
* the userland thread finishes its kqueue manipulation and releases the
  kqueue lock
* .. the scheduler then immediately flips back to the NIC thread waiting
  for the lock, grabs the lock, does a bit of work, then releases the lock

I see this on kqueue locks, sendfile locks (for sendfile notification)
and vm locks (for VM page referencing/dereferencing). This happens very
frequently.

It's very noticeable with large numbers of sockets: the chance that a
lock taken in the NIC RX path overlaps with something in the userland TX
path that you're currently fiddling with (eg kqueue manipulation) or
sending data through (eg vm_page locks, or sendfile locks for things
you're currently transmitting) is very high. As I increase traffic and
the number of sockets, the number of context switches goes way up (to
300,000+) and the lock contention / time spent doing locking is
non-trivial.
Linux doesn't "have" this problem - its lock primitives let you disable
driver bottom halves. So, in this instance, I'd just grab the lock with
spin_lock_bh() and no driver bottom halves would run while I held it. I'd
thus not have this scheduler ping-ponging and lock contention, as it would
never get a chance to happen.

So, does anyone have any ideas? Has anyone seen this?

Shall we just implement a way of doing selective thread disabling, a la
spin_lock_bh() mixed with spl${foo}() style stuff?

Thanks,

-adrian