Date: Mon, 17 Feb 2014 15:16:35 +0400
From: Gleb Smirnoff
To: current@FreeBSD.org
Subject: [CFT] new sendfile(2)

Hello!

At Netflix and Nginx we are experimenting with improving FreeBSD wrt sending large amounts of static data via HTTP.
One of the approaches we are experimenting with is a new sendfile(2) implementation that doesn't block on the I/O done from the file descriptor.

The problem with classic sendfile(2) is that if the request length is large enough and the file data is not cached in VM, then the sendfile(2) syscall does not return until it fills the socket buffer with data. On the modern Internet, socket buffers can be up to 1 Mb, so the time taken by the syscall rises by an order of magnitude. All that time, the nginx worker is blocked in the syscall and doesn't process data from other clients.

The best current practice to mitigate that is known as "sendfile(2) + aio_read(2)". This is a special mode of nginx operation on FreeBSD. The sendfile(2) call is issued with the SF_NODISKIO flag, which forbids the syscall from performing disk I/O, so it sends only data that is already cached by VM. If sendfile(2) reports that I/O needs to be done (but is forbidden), then nginx does an aio_read() of a chunk of the file. The data read is cached by VM as a side effect. Then sendfile() is called again.

Now for the new sendfile. The core idea is that sendfile() schedules the I/O but doesn't wait for it to complete. It returns immediately to the process, and I/O completion is processed in kernel context. Unlike aio(4), no additional threads are created in the kernel. The new sendfile is a drop-in replacement for the old one. Applications (like nginx) don't need to be recompiled or reconfigured. The SF_NODISKIO flag is ignored.

At Netflix, we already see improvements with the new sendfile(2). We can send more data using the same amount of CPU, and we can push closer to 0% idle without experiencing short lags.

However, we have a somewhat modified VM subsystem that behaves optimally for our task but suboptimally for an average FreeBSD system. I'd like someone from the community to try the new sendfile(2) on other setups and see how it works for you.

To be an early tester you need to check out the projects/sendfile branch and build a kernel from it.
The world from head/ will run fine with it.

  svn co http://svn.freebsd.org/base/projects/sendfile
  cd sendfile
  ... build kernel ...

Limitations:

- Some subsystems that use socket buffers are not compilable, namely SCTP.
- No testing was done on serving files on NFS.
- No testing was done on serving files on ZFS.
- There is an mbuf leak. The leak is very slow. It takes 3 days of serving up to 20 Gbit/s to deplete the cluster zone. I'm working on finding the leak.

-- 
Totus tuus, Glebius.
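
For readers unfamiliar with the "sendfile(2) + aio_read(2)" mode described in this message, here is a minimal sketch of the decision an application makes after each classic sendfile(2) call issued with SF_NODISKIO. The enum and the function name are hypothetical, chosen for illustration; this is not nginx code. The errno meanings follow the FreeBSD sendfile(2) manual page: EBUSY means the transfer would have required disk I/O, EAGAIN means a non-blocking socket's buffer filled up.

```c
#include <errno.h>

/* Hypothetical names for illustration only; not taken from nginx. */
enum next_step {
    STEP_CONTINUE,     /* advance the offset by sbytes and call sendfile() again */
    STEP_WAIT_SOCKET,  /* socket buffer is full: wait until it is writable */
    STEP_PRELOAD_AIO,  /* data not cached in VM: aio_read() a chunk, then retry */
    STEP_FAIL          /* any other error: give up on this connection */
};

/*
 * Classify the outcome of one sendfile() call made with SF_NODISKIO.
 * rc is the return value of sendfile(), err is the errno seen right after it.
 *
 * On FreeBSD the call site would look roughly like this (sketch):
 *
 *   off_t sbytes = 0;
 *   int rc = sendfile(file_fd, sock_fd, offset, nbytes, NULL, &sbytes,
 *                     SF_NODISKIO);
 *   offset += sbytes;   // partial data may have been sent even on error
 *   switch (classify_sendfile_result(rc, errno)) { ... }
 */
enum next_step
classify_sendfile_result(int rc, int err)
{
    if (rc == 0)
        return STEP_CONTINUE;

    switch (err) {
    case EBUSY:   /* disk I/O was needed but forbidden by SF_NODISKIO */
        return STEP_PRELOAD_AIO;
    case EAGAIN:  /* non-blocking socket, buffer filled before all data was sent */
        return STEP_WAIT_SOCKET;
    case EINTR:   /* interrupted by a signal: just retry */
        return STEP_CONTINUE;
    default:
        return STEP_FAIL;
    }
}
```

With the new sendfile described in this message, the STEP_PRELOAD_AIO branch would no longer be taken, since SF_NODISKIO is ignored and the kernel schedules the disk I/O itself.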