From owner-freebsd-arch@FreeBSD.ORG Mon May 4 08:24:37 2015 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 1CAFD95B; Mon, 4 May 2015 08:24:37 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 9A81D1E45; Mon, 4 May 2015 08:24:36 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t448OQBj088050 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Mon, 4 May 2015 11:24:26 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t448OQBj088050 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t448OQRF088049; Mon, 4 May 2015 11:24:26 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 4 May 2015 11:24:26 +0300 From: Konstantin Belousov To: Gleb Smirnoff Cc: alc@FreeBSD.org, arch@FreeBSD.org Subject: Re: more strict KPI for vm_pager_get_pages() Message-ID: <20150504082426.GC2390@kib.kiev.ua> References: <20150430142408.GS546@nginx.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150430142408.GS546@nginx.com> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 May 2015 08:24:37 -0000 On Thu, Apr 30, 2015 at 05:24:08PM +0300, Gleb Smirnoff wrote: > Hi! > > The reason to write down this patch emerges from the > projects/sendfile branch, where vm_pager_get_pages() is > used in the sendfile(2) system call. Although the new > sendfile works flawlessly, it makes some assumptions > about vnode_pager that theoretically may not be valid, > however always hold in our current code. > > Going deeper in the problem I have found more important > points, which yielded in the suggested patch. To start, > let me display the current KPI assumptions: > > 1) vm_pager_get_pages() works on an array of consequtive > array of pages. Pindex of (n+1)-th pages must be pindex > of n-th + 1. One page is special, it is called reqpage. > 2) vm_pager_get_pages() guarantees to swapin only the reqpage, > and may skip or fail other pages for different reasons, that > may vary from pager to pager. > 3) There also is function vm_pager_has_page(), which reports > availability of a page at given index in the pager, and also > provides hints on how many consequtive pages before this one > and after this one can be swapped in in single pager request. > Most pagers return zeros in these hints. The vnode pager for > UFS returns a strong promise, that one can later utilize in > vm_pager_get_pages(). > 4) All pages must be busied on enter. On exit only reqpage > will be left busied. The KPI doesn't guarantee that rest > of the pages is still in place. The pager usually calls > vm_page_readahead_finish() on them, which can either free, > or put the page on active/inactive queue, using quite > a strange approach to choose a queue. > 5) The pages must not be wired, since vm_page_free() may be > called on them. However, this is violated by several > consumers of KPI, relying on lack of errors in the pager. > Moreover, the swap pager has a special function to skip > wired pages, while doing the sweep, to avoid this problem. > So, passing wired pages to swapper is OK, while to the > reset is not. > 6) Pagers may replace a page in the object with a new one. > The sg_pager actually does that. To protect from this > event, consumers of vm_pager_get_pages() always run > vm_page_lookup() over the array of pages to relookup the pages. > However, not all consumers do this. > > Speaking of pagers and their consumers: > - 11 consumers request array of size 1, a single page > - 3 consumers actually request array > > My suggestion is to change the KPI assumptions to the following: > > 1) There is no reqpage. All pages are entered busied, all pages > are returned busied and validated. If pager fails to validate > all pages it must return error. > 2) The consumer (not the pager!) is to decide what to do with the > pages: vm_page_active, vm_page_deactivate, vm_page_flash or just > vm_page_free them. The consumer also unbusies pages, if it > wants to. The consumer is free to wire pages before the call. > 3) Consumers must first query the pager via vm_pager_has_page(), > and use the after/before hints to limit the size of the > requested pages array. > 4) In case if pager replaces pages, it must also update the array, > so that consumer doesn't need to do relookup. > > Doing this sweep, I also noticed that all pagers have a copy-pasted > code of zeroing invalid regions of partially valid pages. Also, > many pagers got a set of assertions copy and pasted from each > other. So, I decided to un-inline the vm_pager_get_pages(), bring > it to the vm_pager.c file and gather all these copy-pastes > into one place. > > The suggested patch is attached. As expected, it simplifies and > removes quite a lot of code. > > Right now it is tested on UFS only, testing NFS and ZFS is on my list. > There is one panic known, but it seems unrelated, and Peter pho@ says > that once it has been seen before. Below is the summary of my part of the internal discussion about the changes. Traditionally, Unix allows the filesystems to perform the short reads. Most fundamental change in the patch removes this freedom from the filesystem implementation, and I think that only local filesystems could be compliant with the proposed strictness. IMO, the response from vm_pager_haspages() is only advisory, since filesystem might not control the external entities which are the source of the required data.