Date: Wed, 17 Apr 2019 13:09:50 -0500
From: Jim Thompson <jim@netgate.com>
To: Wojciech Puchar <wojtek@puchar.net>
Cc: Miroslav Lachman <000.fbsd@quip.cz>, Mark Millard via freebsd-hackers <freebsd-hackers@freebsd.org>
Subject: Re: openvpn and system overhead
Message-ID: <94EA4F3F-4D78-4E08-9AF8-441B957A4749@netgate.com>
In-Reply-To: <alpine.BSF.2.20.1904171753480.98262@puchar.net>
References: <alpine.BSF.2.20.1904171707030.87502@puchar.net> <8648d069-2172-2c09-8e59-d66a8265a120@quip.cz> <alpine.BSF.2.20.1904171753480.98262@puchar.net>

> On Apr 17, 2019, at 10:54 AM, Wojciech Puchar <wojtek@puchar.net> wrote:
>
> On Wed, 17 Apr 2019, Miroslav Lachman wrote:
>
>> Wojciech Puchar wrote on 2019/04/17 17:08:
>>> i'm running openvpn server on Xeon E5 2620 server.
>>> when receiving 100Mbit/s traffic over VPN it uses 20% of single core.
>>> At least 75% of it is system time.
>>> Seems like 500Mbit/s is a max for a single openvpn process.
>>> can anything be done about that to improve performance?
>>
>> You can play with ciphers, AES-NI etc.
>> https://community.openvpn.net/openvpn/wiki/Gigabit_Networks_Linux
>>
>> Miroslav Lachman
>
> again. it's system time mostly not user time.

Yup. I’ve looked at this a bunch over the years for pfSense.

The tun/tap device can be viewed as a simple point-to-point IP or Ethernet device which, instead of receiving packets from physical media, receives them from a user space program, and instead of sending packets via physical media, sends them to the user space program.

Let's say that you configured IP on tap0. Then, whenever the kernel sends an IP packet to tap0, it is passed to the application (OpenVPN, for example). OpenVPN optionally compresses, encrypts, and authenticates this packet, encapsulates it, and sends it to the other side over TCP or (preferably) UDP.

The application on the other side receives the packet, decrypts and decompresses the data received, and writes the packet to its TAP device. The kernel on the other side handles the packet as if it came from a real physical device.

Each time you copy data from user to kernel or kernel to user space, you also incur a context switch with all the associated overheads.

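To make the per-packet cost concrete, here is a minimal sketch (Python, not OpenVPN's actual code) of the kind of loop a user space VPN runs: one read from the tun/tap side and one send on the network side for every packet. Everything named here is an assumption for illustration: a datagram socketpair stands in for the tun/tap device (a real one would be opened via /dev/tun or /dev/tap), a second socketpair stands in for the UDP link, and toy_transform is a placeholder, not real cryptography.

```python
import os
import socket


def toy_transform(packet: bytes) -> bytes:
    """Hypothetical stand-in for compress/encrypt/authenticate/encapsulate.

    This is NOT real cryptography; it only marks where that work would go.
    """
    return b"VPN0" + bytes(b ^ 0x5A for b in packet)


def forward_packets(tun_fd: int, link: socket.socket, count: int, mtu: int = 1500) -> int:
    """Forward `count` packets one at a time; return the number of system calls."""
    syscalls = 0
    for _ in range(count):
        packet = os.read(tun_fd, mtu)      # kernel -> user copy, one packet per read
        syscalls += 1
        link.send(toy_transform(packet))   # user -> kernel copy, one packet per send
        syscalls += 1
    return syscalls


def demo():
    # Assumption: datagram socketpairs (POSIX only) play the tun device and the
    # UDP link, so the sketch runs without root or a real tun/tap interface.
    app_side, tun_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    vpn_side, peer_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        packets = [b"ping-1", b"ping-2", b"ping-3"]
        for p in packets:
            app_side.send(p)               # the "kernel" hands packets to the tun device
        syscalls = forward_packets(tun_side.fileno(), vpn_side, len(packets))
        received = [peer_side.recv(2048) for _ in packets]
        return syscalls, received
    finally:
        for s in (app_side, tun_side, vpn_side, peer_side):
            s.close()


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only the shape of the loop: two kernel crossings per packet, with nothing batched across packets.
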
Using a tun/tap device incurs an additional context switch in each direction, as you’re basically running the program to send data (say, ‘ping’ or ‘ssh’), and another program is used to encrypt and encapsulate the packet before it leaves the machine. The process is roughly the same on the other side. So you get twice the copies, and twice the number of context switches. Making things worse, the “IP stack” inside OpenVPN is single-threaded and processes one packet at a time, so all the overheads accrue to each packet, rather than being amortized across several packets.

Net-net, OpenVPN won’t do close to 1Mpps. There is a decent-enough write-up of recent actual benchmarking in a master's thesis that compares IPsec, OpenVPN and WireGuard on Linux here:
https://www.net.in.tum.de/fileadmin/bibtex/publications/theses/2018-pudelko-vpn-performance.pdf

Section 5.5 if you want to skip to the substance. Basically, with *no* encryption overheads, OpenVPN still has a static overhead of around 8500 cycles/packet on the setup they used (Xeon E5-2620 v4), which seems quite similar to yours. Given all this, they show that OpenVPN enters an overload condition at around 120Kpps.

There is some hope if you really have to have a lower-overhead OpenVPN. An OpenVPN session has two channels, multiplexed on the same connection. One is a control channel, the other is a data channel. The control channel and associated configuration code in OpenVPN is … complex. It has close to 10 trillion configuration options, and any re-write of this code would be a huge, huge undertaking. Nearly unthinkable, really. The data channel, otoh, is relatively straightforward, especially if you don’t need all the crypto options provided and instead limit yourself to, say, AES-GCM or another AEAD (ChaCha20 / Poly1305) transform. (Here, if your CPU has AES-NI or similar (e.g. ARMv8 has AES acceleration instructions), AES-GCM will always be faster.)

But if you’re willing to limit yourself to one or a few transforms, in theory it’s possible to make a specialized tun / tap device such that the data channel is kept in-kernel, with encryption/decryption and encapsulation/decapsulation of data packets occurring in the kernel, but control packets passed up and down to/from the associated user space process.

A partial attempt at this idea (for Linux) can be found here:
https://github.com/marywangran/OpenVPN-Linux-kernel
It looks abandoned, so maybe it didn’t pan out, or maybe the work just got asymptotic.

There is a bunch of work to get this right (keeping the OpenVPN user process happy, counters up to date, etc.), but, at the end of the day, it’s all software. Netflix got enough of OpenSSL's AES-GCM implementation into the kernel to run the transmit side. They didn’t care about the receive side, and just let nginx deal with the relatively light rx flows in their deployment, but it does show that it’s possible with enough work.

Even with all that work, it will probably never be as fast as a decent IPsec implementation.

Jim

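A back-of-the-envelope check on the per-packet figures above, as a rough sketch rather than a measurement. The 8500 cycles/packet and 120Kpps numbers come from the thesis as quoted; the 2.1 GHz clock (nominal base clock of a Xeon E5-2620 v4, turbo ignored) and the 1500-byte packet size are assumptions of mine, and the function names are made up for illustration.

```python
def max_pps(clock_hz: float, cycles_per_packet: float) -> float:
    """Upper bound on packets/second for one core spending a fixed cycle cost per packet."""
    return clock_hz / cycles_per_packet


def pps_for_bitrate(bits_per_second: float, packet_bytes: int) -> float:
    """Packets/second needed to carry a given bit rate at a given packet size."""
    return bits_per_second / (packet_bytes * 8)


def cycles_per_packet_at(clock_hz: float, pps: float) -> float:
    """Cycle budget per packet when one core is saturated at `pps`."""
    return clock_hz / pps


if __name__ == "__main__":
    CLOCK_HZ = 2.1e9        # assumption: nominal base clock, single core, no turbo
    STATIC_CYCLES = 8500    # from the thesis: static overhead with no encryption
    OVERLOAD_PPS = 120e3    # from the thesis: where OpenVPN enters overload
    PACKET_BYTES = 1500     # assumption: full-size packets

    print("ceiling from static overhead alone (pps):", max_pps(CLOCK_HZ, STATIC_CYCLES))
    print("pps needed for 500 Mbit/s:", pps_for_bitrate(500e6, PACKET_BYTES))
    print("cycles/packet at the overload point:", cycles_per_packet_at(CLOCK_HZ, OVERLOAD_PPS))
```

Under those assumptions the static overhead alone caps a single core at roughly a quarter of a million packets per second, well short of 1Mpps before any encryption is done.
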
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?94EA4F3F-4D78-4E08-9AF8-441B957A4749>