From owner-freebsd-arch@freebsd.org Mon Dec 14 11:13:48 2015 Return-Path: Delivered-To: freebsd-arch@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 476CDA437C2 for ; Mon, 14 Dec 2015 11:13:48 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 31EE71975 for ; Mon, 14 Dec 2015 11:13:48 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id 30D39A437C1; Mon, 14 Dec 2015 11:13:48 +0000 (UTC) Delivered-To: arch@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 168ACA437C0 for ; Mon, 14 Dec 2015 11:13:48 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B4D1B1973; Mon, 14 Dec 2015 11:13:47 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id tBEBDbaD012368 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Mon, 14 Dec 2015 13:13:38 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua tBEBDbaD012368 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id tBEBDZ0J012366; Mon, 14 Dec 2015 13:13:35 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 14 Dec 2015 13:13:35 +0200 From: Konstantin Belousov To: Gleb Smirnoff Cc: jeff@FreeBSD.org, alc@FreeBSD.org, scottl@FreeBSD.org, pho@FreeBSD.org, arch@FreeBSD.org Subject: Re: new vm_pager_get_pages() KPI, round 3 Message-ID: <20151214111335.GB82577@kib.kiev.ua> References: <20151205052940.GJ42565@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151205052940.GJ42565@FreeBSD.org> User-Agent: Mutt/1.5.24 (2015-08-30) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 14 Dec 2015 11:13:48 -0000 On Sat, Dec 05, 2015 at 08:29:40AM +0300, Gleb Smirnoff wrote: > Hi, > > [first paragraph for arch subscribers, To: recepients may skip] > > This patch is kinda a prerequisite for the non-blocking sendfile(2), > that was jointly developed by NGINX and Netflix in 2014 and has been > running in Netflix production for a year, serving 35% of the whole > North America (US, Canada, Mexico) Internet traffic. > Technically, the new sendfile(2) doesn't require the new > vm_pager_get_pages() KPI. We currently run it on the old KPI. However, > kib@ suggested that we are abusing the KPI, carefully using its > edge cases. To address this critic, back in spring, I suggested a KPI, > where vm_pager_get_pages() offers all-or-none approach to the array of > pages. Again, kib@ wasn't satisfied, as for "the main user" of > vm_pager_get_pages, the vm_fault(), all-or-none approach isn't optimal. > The problem was slowly debated through the summer. And then in October > jeff@ suggested yet another extension of the KPI, which I have > implemented and it is described below. > > [for those interested in new sendfile(2), skip to the last paragraph, > for those willing to review new pager KPI, read on] > > The new KPI offers this prototype for vm_pager_get_pages(): > > int > vm_pager_get_pages(vm_object_t object, vm_page_t pages[], int count, > int *rbehind, in *rahead); > > Where "count" stands for number of pages in the array. The rbehind > and rahead if not NULL specify how many pages the caller is willing to > allow the pager to pre-cache, if the pager can. > > Pager doesn't promise to do any read behind or read ahead. If it does, > then only the pager is responsive for grabbing, busying, unbusying and > queueing these pages. It also writes the actual values of completed > read ahead and read behind back to the pointers. > > Pager promises to page in "count" pages or fail. Pager expects the > pages to be busied, and returns them busied. For a multi page requests, > the pager demands that the region is a valid region, that exists in > the pager, which can be checked by preceding call to vm_pager_haspage(). > For single page requests, there is no such demand. I fail to understand how the case of count > 1 and non-contiguous blocks in the non-readahead case is handled by new vnode_pager_generic_getpages(). I do not understand how a hole somewhere in the requested range is handled. Code has a comment that a hole must not appear in the range. Both issues mean that vm_pager_has_page() still must be called before pagein, for count > 1 use. E.g. the exec_map_first_page() uses *after value returned from has_pages() to calculate count, which is an advisory and not the contract. Same issue prevents converting GEM and TTM (and probably md) to use the count > 1 KPI. Same is true for swap pager, and this prevents the removal of the loop in vm_thread_swaping(). Code assumes that the partially valid page may only appear in the last position of the page run for the local pager, which again requires pre-validation of the vm_pager_get_pages() on the caller side. Overall this is not an KPI that was discussed. It seemingly does not change semantic for count == 1 case, but is not what it should be for count > 1. As discussed, new vm_pager_get_pages() was support to just work for any count, doing the loop over the non-contig ranges or short reads, and guaranteeing that all existing (or hole-filled) pages are read until EOF is met. This KPI was supposed to: - fix my compaints about short reads - avoid excessive VOP_BMAP() call from has_pages before get_pages() - allowed to remove the loops from all current get_pages() consumers, it vm_thread_swapin(), GEM/TTM, image activator - supposedly served your needs for sendfile(2) in one go. What is present in your patch might only satisfy the last item of the list. > > The net result is a win for both vm_fault() and for new sendfile(). > > The vm_fault() no longer needs to do prepatory vm_pager_haspage(), > which removes one I/O operation. The logic for read ahead/behind, > which is strongly UFS/EXT-centric, moves into vnode_pager.c. So > we no longer do useless operations when having a fault on ZFS. > > The vm_fault() now knows precisely the read ahead that happened, > when updates fs.entry->next_read index. This reduces number of > hardfaults by a tiny fraction (measured building world tree). > > The new sendfile() has a stronger KPI, that doesn't unbusy pages, > that sendfile() needs to be kept busied. > > Also, the new KPI removes some ugly edges. E.g., since the old > KPI used to unbusy and free pages in the array in case of an > error, the pages could not be wired. However, there are places in > kernel where we want to page in into a wired page. These places > simply violated the assumption, relying on lack of errors in the > pager. Moreover, the swap pager has a special function to skip > wired pages, while doing the freeing sweep, to avoid hitting > assertion. That means passing wired pages to swapper is kinda > OK, while to any other pager it is not. So, we end up with > vm_pager_get_pages() being not pager agnostic, while it is > designed to be so. Now this is fixed.