Date: Sat, 12 Nov 2011 00:41:54 +0200
From: Alexander Motin
To: freebsd-net
Cc: David Hooton, Gleb Smirnoff
Subject: Re: MPD LAC Scaling

Hi.

> I'm currently evaluating MPD as a potential LAC solution for a
> project I'm working on. I'm looking to try and handle at least 4Gbit
> and 20,000 sessions worth of PPPoE -> L2TP LAC traffic per server.
> The reading I've done from the archives so far seems to indicate that
> this has not yet been done.

I haven't heard of deployments that large either, but I also can't say it is theoretically impossible after some tuning and development work. At this point I have neither a production/test environment nor much time to work on it actively, but I want to share some experience and ideas in case somebody wants to take this on.

First, as Julian said, it does not necessarily have to be one server handling all the load. A cluster of smaller machines is preferable in many respects. PPPoE allows you to run several servers and load-balance between them. At the moment MPD can't balance load dynamically, but you can do it manually by limiting the number of sessions per server.

As a hardware reference point from personal experience: three years ago mpd5 on 1U servers with a single Core2Duo CPU, 1GB of RAM and two 1Gb NICs (less than $1K at the time) handled about 2000 PPPoE sessions and 600Mbps of traffic per server in production, including Netflow generation, per-customer traffic shaping by type, and accounting. (Scaled naively, the 20,000 sessions and 4Gbit/s you mention would correspond to roughly seven to ten such boxes.) Modern, more powerful hardware can do more.

Getting higher numbers mostly splits into two questions: getting more traffic and getting more sessions. The limitations are different for each.
- Getting more traffic mostly means scaling the kernel Netgraph and networking code to more CPU cores. Since Netgraph uses direct function calls where possible, this depends on the number of network interrupt threads in the system. Three years ago there was only one net SWI thread, and setting net.isr.direct=1 while having several NICs in the system allowed the load to be distributed between CPUs. Modern high-end NICs with several MSI-X interrupts should give the same effect. It is now also possible to have several net SWI threads, but I haven't tested that.

- Getting more sessions also means tuning and optimizing the user-level mpd daemon. Three years ago, on a Pentium4-level test machine, I reached about 5K PPPoE sessions with RADIUS auth/acct. The main limiting factor was user-level daemon performance. The more sessions connected, the more overhead the daemon had: LCP echo requests and event timeouts to handle, netgraph kernel sockets to listen on, and so on. At some point the daemon simply cannot handle all new incoming events in time, and requests resent by clients cause a cumulative effect. So the limiting factor is not just the number of users, but also the number of events. If users connect one by one, the number of sessions can be quite high; but if some accident drops all users and they reconnect at once, overload comes sooner. In that case even the LCP echo timeout configured on the server and clients, or how much logging is enabled, matters. My best tuning result at the time on that Pentium4-level machine was about 100 connections per second, which allowed 5000 simultaneous sessions to be set up within 50 seconds. Higher numbers were problematic.

At the moment MPD's user-level main state machine is single-threaded, except for authorization and accounting (such as RADIUS), which run in separate threads but require synchronized completion to return their data. Splitting the main FSM across several threads is difficult, because it would require grouping links and bundles into different threads with different locks; that is hard both because of multilink support and because, until a user is authorized, it is impossible to say which bundle it should join. If you need to handle several PPPoE services with different names, or several LAN segments, it may theoretically be effective to run several MPD daemon instances, one per service/segment.

Generally I've spent less time profiling and optimizing the MPD daemon itself than the kernel code, so there should still be a lot of room for improvement.
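To make the event-handling overhead concrete (and to illustrate the first optimization point listed below), here is a minimal sketch, not taken from mpd itself, of driving per-link timers and control sockets from a single kqueue(2) instead of rebuilding a poll() descriptor array on every loop iteration. The descriptor choices and handler names are assumptions for the example; only the kqueue/kevent usage is real API:

/*
 * Minimal kqueue(2) event loop sketch (FreeBSD).  Not mpd code: the
 * registered descriptor and the handler functions are hypothetical.
 */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <err.h>
#include <stdio.h>

#define LCP_ECHO_MSEC	10000	/* assumed per-link LCP echo interval */

static void
handle_socket_ready(int fd)
{
	/* hypothetical: read and dispatch one control/data message */
	printf("fd %d readable\n", fd);
}

static void
handle_link_timer(int link_id)
{
	/* hypothetical: send an LCP echo request for this link */
	printf("timer for link %d fired\n", link_id);
}

int
main(void)
{
	struct kevent ev[64];
	int kq, i, n;

	if ((kq = kqueue()) == -1)
		err(1, "kqueue");

	/*
	 * Register interest once; the kernel keeps the list, so adding a
	 * session costs one EV_ADD instead of growing a poll() array that
	 * must be rebuilt and rescanned every iteration.
	 */
	EV_SET(&ev[0], 0 /* stdin stands in for a ksocket fd */, EVFILT_READ,
	    EV_ADD, 0, 0, NULL);
	EV_SET(&ev[1], 1 /* link id used as timer ident */, EVFILT_TIMER,
	    EV_ADD, 0, LCP_ECHO_MSEC, NULL);
	if (kevent(kq, ev, 2, NULL, 0, NULL) == -1)
		err(1, "kevent register");

	for (;;) {
		n = kevent(kq, NULL, 0, ev, 64, NULL);
		if (n == -1)
			err(1, "kevent wait");
		for (i = 0; i < n; i++) {
			if (ev[i].filter == EVFILT_READ)
				handle_socket_ready((int)ev[i].ident);
			else if (ev[i].filter == EVFILT_TIMER)
				handle_link_timer((int)ev[i].ident);
		}
	}
}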
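On the RADIUS point listed below: libradius already provides an asynchronous interface, rad_init_send_request()/rad_continue_send_request() (see radlib(3)), whose descriptor and timeout could in principle be fed into the daemon's own event loop instead of creating a thread per request. A rough sketch under that assumption; the plain select() loop stands in for the event loop, the credentials are placeholders, and the exact return-value handling should be checked against radlib(3):

/* Sketch of an asynchronous RADIUS Access-Request with libradius. */
#include <sys/select.h>
#include <sys/time.h>
#include <radlib.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	struct rad_handle *h;
	struct timeval tv;
	fd_set rset;
	int fd, r, selected;

	if ((h = rad_auth_open()) == NULL)
		errx(1, "rad_auth_open");
	if (rad_config(h, NULL) == -1)		/* /etc/radius.conf */
		errx(1, "rad_config: %s", rad_strerror(h));
	if (rad_create_request(h, RAD_ACCESS_REQUEST) == -1 ||
	    rad_put_string(h, RAD_USER_NAME, "test") == -1 ||
	    rad_put_string(h, RAD_USER_PASSWORD, "secret") == -1)
		errx(1, "building request: %s", rad_strerror(h));

	/* Start the exchange; fd/tv tell us what to wait for. */
	r = rad_init_send_request(h, &fd, &tv);
	while (r == 0) {
		FD_ZERO(&rset);
		FD_SET(fd, &rset);
		selected = select(fd + 1, &rset, NULL, NULL, &tv);
		if (selected == -1)
			err(1, "select");
		/* Retransmits/timeouts are handled inside libradius. */
		r = rad_continue_send_request(h, selected, &fd, &tv);
	}
	if (r == -1)
		errx(1, "radius: %s", rad_strerror(h));
	printf("response code %d (%s)\n", r,
	    r == RAD_ACCESS_ACCEPT ? "accept" : "reject/other");
	rad_close(h);
	return (0);
}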
Some possible optimization points I still remember are:

- rework the pevent() engine used by the MPD state machine to use kqueue() instead of poll(), to reduce event overhead (roughly as sketched above);

- optimize the locking of the paction() functions used for thread creation and completion for the MPD-specific case; the idea was that, at the cost of some generality, they could be simplified to reduce the number of context switches;

- rewrite RADIUS auth/acct support to run within the main mpd thread or a fixed number of external threads; since the existing threaded approach was implemented, libradius has gained support for asynchronous operation (see the sketch above), which should reduce the overhead of thread creation/destruction;

- optimize the ng_ksocket node for working with a large number of hooks, using a better search, and/or make MPD create an additional socket for every N links to balance kernel and user-level search overheads; initially MPD created a separate set of sockets for every link, but that was found too expensive for the user-level FSM and was rewritten into the present state, with an almost minimal number of sockets and most of the multiplexing done in the kernel.

I have no personal production experience with the PPPoE-to-L2TP LAC case. It is used much less often; I have had only a few reports from people actively using it, and not many numbers. I think the LAC case should have lower overhead and CPU load, and therefore better scalability, than usual traffic termination: there is no IPCP layer in PPP to negotiate, no interfaces to create and configure, no Netflow, no shaping, no periodic accounting, etc. If you don't need to authenticate users but only to forward connections, so that the server doesn't have to handle the LCP protocol, the task becomes simpler still.

If you can set up a test environment to stress-test the LAC side, it would be interesting to see the numbers. In my test lab I used several machines, each with mpd configured for thousands of PPPoE client sessions, to generate simultaneous connections. For testing LAC you also need a fast enough L2TP terminator; if you have no such hardware, you can try several systems running mpd as L2TP servers and spread the load between them in some way to avoid a bottleneck there, although the system load in that case may differ slightly.

--
Alexander Motin