Date: Sat, 12 Nov 2011 00:41:54 +0200
From: Alexander Motin
To: freebsd-net
Cc: David Hooton, Gleb Smirnoff
Subject: Re: MPD LAC Scaling

Hi.

> I'm currently evaluating MPD as a potential LAC solution for a
> project I'm working on. I'm looking to try and handle at least 4Gbit
> and 20,000 sessions worth of PPPoE -> L2TP LAC traffic per server.
> The reading I've done from the archives so far seems to indicate that
> this has not yet been done.

I haven't heard of deployments that large either, but I also can't say it is theoretically impossible after some tuning and development work. At this point I have neither a production/test environment nor much time to work on it actively, but I want to share some experience and ideas in case somebody wants to take this on.

First, as Julian said, it does not necessarily have to be one server handling all the load. A cluster of smaller machines is preferable in many respects. PPPoE allows you to run several servers and load-balance between them. At the moment MPD can't balance load dynamically, but you can do it manually by limiting the number of sessions per server.

As a hardware reference point from personal experience: three years ago mpd5 on 1U servers with a single Core2Duo CPU, 1GB of RAM and two 1Gb NICs (less than $1K at the time) handled about 2000 PPPoE sessions and 600Mbps of traffic per server in production, including Netflow generation, per-customer traffic shaping by type, and accounting. (Scaled naively, the 20,000 sessions and 4Gbit/s you mention would correspond to roughly seven to ten such boxes.) Modern, more powerful hardware can do more.

Getting higher numbers mostly splits into two questions: getting more traffic and getting more sessions. The limitations are different for each.
- Getting more traffic mostly means scaling the kernel Netgraph and networking code to more CPU cores. Since Netgraph uses direct function calls where possible, this depends on the number of network interrupt threads in the system. Three years ago there was only one net SWI thread, and setting net.isr.direct=1 while having several NICs in the system allowed the load to be distributed between CPUs. Modern high-end NICs with several MSI-X interrupts should give the same effect. It is now also possible to have several net SWI threads, but I haven't tested that.

- Getting more sessions also means tuning and optimizing the user-level mpd daemon. Three years ago, on a Pentium4-level test machine, I reached about 5K PPPoE sessions with RADIUS auth/acct. The main limiting factor was user-level daemon performance. The more sessions connected, the more overhead the daemon had: LCP echo requests and event timeouts to handle, netgraph kernel sockets to listen on, and so on. At some point the daemon simply cannot handle all new incoming events in time, and requests resent by clients cause a cumulative effect. So the limiting factor is not just the number of users, but also the number of events. If users connect one by one, the number of sessions can be quite high; but if some accident drops all users and they reconnect at once, overload comes sooner. In that case even the LCP echo timeout configured on the server and clients, or how much logging is enabled, matters. My best tuning result at the time on that Pentium4-level machine was about 100 connections per second, which allowed 5000 simultaneous sessions to be set up within 50 seconds. Higher numbers were problematic.

At the moment MPD's user-level main state machine is single-threaded, except for authorization and accounting (such as RADIUS), which run in separate threads but require synchronized completion to return their data. Splitting the main FSM across several threads is difficult, because it would require grouping links and bundles into different threads with different locks; that is hard both because of multilink support and because, until a user is authorized, it is impossible to say which bundle it should join. If you need to handle several PPPoE services with different names, or several LAN segments, it may theoretically be effective to run several MPD daemon instances, one per service/segment.

Generally I've spent less time profiling and optimizing the MPD daemon itself than the kernel code, so there should still be a lot of room for improvement.
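To make the event-handling overhead concrete (and to illustrate the first optimization point listed below), here is a minimal sketch, not taken from mpd itself, of driving per-link timers and control sockets from a single kqueue(2) instead of rebuilding a poll() descriptor array on every loop iteration. The descriptor choices and handler names are assumptions for the example; only the kqueue/kevent usage is real API:

/*
 * Minimal kqueue(2) event loop sketch (FreeBSD).  Not mpd code: the
 * registered descriptor and the handler functions are hypothetical.
 */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <err.h>
#include <stdio.h>

#define LCP_ECHO_MSEC	10000	/* assumed per-link LCP echo interval */

static void
handle_socket_ready(int fd)
{
	/* hypothetical: read and dispatch one control/data message */
	printf("fd %d readable\n", fd);
}

static void
handle_link_timer(int link_id)
{
	/* hypothetical: send an LCP echo request for this link */
	printf("timer for link %d fired\n", link_id);
}

int
main(void)
{
	struct kevent ev[64];
	int kq, i, n;

	if ((kq = kqueue()) == -1)
		err(1, "kqueue");

	/*
	 * Register interest once; the kernel keeps the list, so adding a
	 * session costs one EV_ADD instead of growing a poll() array that
	 * must be rebuilt and rescanned every iteration.
	 */
	EV_SET(&ev[0], 0 /* stdin stands in for a ksocket fd */, EVFILT_READ,
	    EV_ADD, 0, 0, NULL);
	EV_SET(&ev[1], 1 /* link id used as timer ident */, EVFILT_TIMER,
	    EV_ADD, 0, LCP_ECHO_MSEC, NULL);
	if (kevent(kq, ev, 2, NULL, 0, NULL) == -1)
		err(1, "kevent register");

	for (;;) {
		n = kevent(kq, NULL, 0, ev, 64, NULL);
		if (n == -1)
			err(1, "kevent wait");
		for (i = 0; i < n; i++) {
			if (ev[i].filter == EVFILT_READ)
				handle_socket_ready((int)ev[i].ident);
			else if (ev[i].filter == EVFILT_TIMER)
				handle_link_timer((int)ev[i].ident);
		}
	}
}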
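On the RADIUS point listed below: libradius already provides an asynchronous interface, rad_init_send_request()/rad_continue_send_request() (see radlib(3)), whose descriptor and timeout could in principle be fed into the daemon's own event loop instead of creating a thread per request. A rough sketch under that assumption; the plain select() loop stands in for the event loop, the credentials are placeholders, and the exact return-value handling should be checked against radlib(3):

/* Sketch of an asynchronous RADIUS Access-Request with libradius. */
#include <sys/select.h>
#include <sys/time.h>
#include <radlib.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	struct rad_handle *h;
	struct timeval tv;
	fd_set rset;
	int fd, r, selected;

	if ((h = rad_auth_open()) == NULL)
		errx(1, "rad_auth_open");
	if (rad_config(h, NULL) == -1)		/* /etc/radius.conf */
		errx(1, "rad_config: %s", rad_strerror(h));
	if (rad_create_request(h, RAD_ACCESS_REQUEST) == -1 ||
	    rad_put_string(h, RAD_USER_NAME, "test") == -1 ||
	    rad_put_string(h, RAD_USER_PASSWORD, "secret") == -1)
		errx(1, "building request: %s", rad_strerror(h));

	/* Start the exchange; fd/tv tell us what to wait for. */
	r = rad_init_send_request(h, &fd, &tv);
	while (r == 0) {
		FD_ZERO(&rset);
		FD_SET(fd, &rset);
		selected = select(fd + 1, &rset, NULL, NULL, &tv);
		if (selected == -1)
			err(1, "select");
		/* Retransmits/timeouts are handled inside libradius. */
		r = rad_continue_send_request(h, selected, &fd, &tv);
	}
	if (r == -1)
		errx(1, "radius: %s", rad_strerror(h));
	printf("response code %d (%s)\n", r,
	    r == RAD_ACCESS_ACCEPT ? "accept" : "reject/other");
	rad_close(h);
	return (0);
}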
Some possible optimization points I still remember are:

- rework the pevent() engine used by the MPD state machine to use kqueue() instead of poll(), to reduce event overhead (roughly as sketched above);

- optimize the locking of the paction() functions used for thread creation and completion for the MPD-specific case; the idea was that, at the cost of some generality, they could be simplified to reduce the number of context switches;

- rewrite RADIUS auth/acct support to run within the main mpd thread or a fixed number of external threads; since the existing threaded approach was implemented, libradius has gained support for asynchronous operation (see the sketch above), which should reduce the overhead of thread creation/destruction;

- optimize the ng_ksocket node for working with a large number of hooks, using a better search, and/or make MPD create an additional socket for every N links to balance kernel and user-level search overheads; initially MPD created a separate set of sockets for every link, but that was found too expensive for the user-level FSM and was rewritten into the present state, with an almost minimal number of sockets and most of the multiplexing done in the kernel.

I have no personal production experience with the PPPoE-to-L2TP LAC case. It is used much less often; I have had only a few reports from people actively using it, and not many numbers. I think the LAC case should have lower overhead and CPU load, and therefore better scalability, than usual traffic termination: there is no IPCP layer in PPP to negotiate, no interfaces to create and configure, no Netflow, no shaping, no periodic accounting, etc. If you don't need to authenticate users but only to forward connections, so that the server doesn't have to handle the LCP protocol, the task becomes simpler still.

If you can set up a test environment to stress-test the LAC side, it would be interesting to see the numbers. In my test lab I used several machines, each with mpd configured for thousands of PPPoE client sessions, to generate simultaneous connections. For testing LAC you also need a fast enough L2TP terminator; if you have no such hardware, you can try several systems running mpd as L2TP servers and spread the load between them in some way to avoid a bottleneck there, although the system load in that case may differ slightly.

--
Alexander Motin