From: Adrian Chadd
Date: Sat, 7 Jun 2014 16:45:04 -0400
Subject: Re: Best practice for accepting TCP connections on multicore?
To: Igor Mozolevsky
Cc: Hackers freeBSD, Daniel Janzon, Dirk Engling, Ian Lepore

On 7 June 2014 16:37, Igor Mozolevsky wrote:
>
> On 7 June 2014 21:18, Adrian Chadd wrote:
>>
>> > Not quite - the gist (and the point) of that slide with Rob's story
>> > was that by the time Rob wrote something that could comprehensively
>> > deal with states in an event-driven server, he ended up essentially
>> > re-inventing the wheel.
>>
>> I read the same slides you did. He didn't reinvent the wheel - threads
>> are a different concept - at any point the state can change and you
>> switch to a new thread. Event-driven, asynchronous programming isn't
>> quite like that.
>
> Not quite - unless you're dealing with stateless HTTP, you still need to
> know what the "current" state of the "current" connection is, which is
> the point of that slide.
>
>> > Paul Tyma's presentation posted earlier did conclude with various
>> > models for different types of daemons, which the OP might find at
>> > least interesting.
>>
>> Agreed, but again - it's all Java, it's all Linux, and it's 2008.
>
> Agreed, but threading models are platform-agnostic.
>
>> The current state is that threads and thread context switching are
>> more expensive than you'd like.
>> You really want to (a) avoid locking at all, (b) keep the CPU hot with
>> cached data, and (c) keep it from changing contexts.
>
> Agreed, but uncontended locking should be virtually cost-free (or close
> to that), modern CPUs have plenty of L2/L3 cache to keep enough data
> nearby, there are plenty of cores to keep cycling in the same
> thread-loop, and hyper-threading helps with ctx switching (or at least
> is supposed to). In any event, shuttling data between RAM and cache
> (especially with the on-die RAM controllers, and even if data has to go
> through QPI/HyperTransport) and the cost of changing contexts are tiny
> compared to that of disk and network IO.

I was doing 40gbit/sec testing over 2^16 connections (and was hoping to
get the chance to optimise this stuff to get to 2^17 active streaming
connections, but I ran out of CPU.) If you're not careful about keeping
work on a local CPU, you end up blowing your caches and hitting lock
contention pretty quickly. And QPI isn't free: there's a cost going
backwards and forwards with packet data and cache lines even for
uncontended data.

I'm not going to worry about QPI and socket awareness just for now -
that's a bigger problem to solve. I'll first worry about getting RSS
working for a single-socket setup and then convert a couple of drivers
over to be RSS aware. I'll then worry about multiple-socket awareness and
being aware of whether a NIC is local to a socket.

I'm hoping that with this work and the Verisign TCP locking changes,
we'll be able to handle 40gig bulk data on single-socket Sandy Bridge
Xeon hardware and/or > 100,000 TCP sessions a second with plenty of CPU
to spare. Then it's getting to 80 gig on Ivy Bridge class single-socket
hardware. I'm hoping we can aim for much higher (a million+ transactions
a second) on current-generation hardware, but that requires a bunch more
locking work.

And, well, whatever hardware I can play with. All I have at home is a
4-core Ivy Bridge desktop box with igb(4). :-P

-a
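
For reference, here is a rough, untested sketch of the "keep everything on
one core" accept model being discussed: one worker thread per CPU, each
pinned with pthread_setaffinity_np(3) and running its own kqueue(2) loop
against a shared non-blocking listen socket, so a connection accepted on a
core is serviced entirely on that core. The NCPU and PORT values are made
up for illustration and error handling is elided; this is only the
userland shape of the idea, not the RSS kernel work itself.

    /*
     * Sketch: one worker thread per CPU, each pinned to its core and
     * running its own kqueue loop, all accepting from one shared
     * non-blocking listen socket.  Once accepted, a connection is only
     * ever serviced by the thread (and therefore the core) that took it.
     *
     * NCPU and PORT are illustrative values only.
     */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/socket.h>
    #include <sys/cpuset.h>
    #include <netinet/in.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <pthread_np.h>
    #include <stdint.h>
    #include <string.h>

    #define NCPU    4               /* assumed core count */
    #define PORT    8080            /* assumed listen port */

    static int listen_fd;

    static void *
    worker(void *arg)
    {
            int cpu = (int)(intptr_t)arg;
            struct kevent kev;
            cpuset_t mask;
            int kq, fd;

            /* Pin this thread so everything it accepts stays on one core. */
            CPU_ZERO(&mask);
            CPU_SET(cpu, &mask);
            pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

            kq = kqueue();
            EV_SET(&kev, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
            kevent(kq, &kev, 1, NULL, 0, NULL);

            for (;;) {
                    if (kevent(kq, NULL, 0, &kev, 1, NULL) < 1)
                            continue;
                    if (kev.ident == (uintptr_t)listen_fd) {
                            /* Another pinned thread may have beaten us to it. */
                            fd = accept(listen_fd, NULL, NULL);
                            if (fd < 0)
                                    continue;
                            EV_SET(&kev, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
                            kevent(kq, &kev, 1, NULL, 0, NULL);
                    } else {
                            /* Read/write handling for the connection elided. */
                    }
            }
            return (NULL);
    }

    int
    main(void)
    {
            struct sockaddr_in sin;
            pthread_t tid[NCPU];
            int i;

            listen_fd = socket(AF_INET, SOCK_STREAM, 0);
            fcntl(listen_fd, F_SETFL, O_NONBLOCK);
            memset(&sin, 0, sizeof(sin));
            sin.sin_family = AF_INET;
            sin.sin_len = sizeof(sin);
            sin.sin_port = htons(PORT);
            sin.sin_addr.s_addr = htonl(INADDR_ANY);
            bind(listen_fd, (struct sockaddr *)&sin, sizeof(sin));
            listen(listen_fd, 128);

            for (i = 0; i < NCPU; i++)
                    pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
            for (i = 0; i < NCPU; i++)
                    pthread_join(tid[i], NULL);
            return (0);
    }

Build with something like "cc -o accept_sketch accept_sketch.c -lpthread".
Whether per-core accept like this beats a single dedicated accept thread
handing connections off depends on the workload and on whether the NIC's
RSS hash actually lands each flow's packets on the same core the
connection is being serviced on, which is exactly what the RSS work
described above is meant to line up.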