From owner-freebsd-arch@FreeBSD.ORG Mon May 4 09:50:51 2015 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C57672DC; Mon, 4 May 2015 09:50:51 +0000 (UTC) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "vps1.elischer.org", Issuer "CA Cert Signing Authority" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 8DABC1823; Mon, 4 May 2015 09:50:51 +0000 (UTC) Received: from Julian-MBP3.local (ppp121-45-241-118.lns20.per4.internode.on.net [121.45.241.118]) (authenticated bits=0) by vps1.elischer.org (8.14.9/8.14.9) with ESMTP id t449ojR1024484 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Mon, 4 May 2015 02:50:48 -0700 (PDT) (envelope-from julian@freebsd.org) Message-ID: <554740EF.7030808@freebsd.org> Date: Mon, 04 May 2015 17:50:39 +0800 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Konstantin Belousov , Gleb Smirnoff CC: alc@FreeBSD.org, arch@FreeBSD.org Subject: Re: more strict KPI for vm_pager_get_pages() References: <20150430142408.GS546@nginx.com> <20150504082426.GC2390@kib.kiev.ua> In-Reply-To: <20150504082426.GC2390@kib.kiev.ua> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 May 2015 09:50:51 -0000 On 5/4/15 4:24 PM, Konstantin Belousov wrote: > On Thu, Apr 30, 2015 at 05:24:08PM +0300, Gleb Smirnoff wrote: >> Hi! >> >> The reason to write down this patch emerges from the >> projects/sendfile branch, where vm_pager_get_pages() is >> used in the sendfile(2) system call. Although the new >> sendfile works flawlessly, it makes some assumptions >> about vnode_pager that theoretically may not be valid, >> however always hold in our current code. >> >> Going deeper in the problem I have found more important >> points, which yielded in the suggested patch. To start, >> let me display the current KPI assumptions: >> >> 1) vm_pager_get_pages() works on an array of consequtive >> array of pages. Pindex of (n+1)-th pages must be pindex >> of n-th + 1. One page is special, it is called reqpage. >> 2) vm_pager_get_pages() guarantees to swapin only the reqpage, >> and may skip or fail other pages for different reasons, that >> may vary from pager to pager. >> 3) There also is function vm_pager_has_page(), which reports >> availability of a page at given index in the pager, and also >> provides hints on how many consequtive pages before this one >> and after this one can be swapped in in single pager request. >> Most pagers return zeros in these hints. The vnode pager for >> UFS returns a strong promise, that one can later utilize in >> vm_pager_get_pages(). >> 4) All pages must be busied on enter. On exit only reqpage >> will be left busied. The KPI doesn't guarantee that rest >> of the pages is still in place. The pager usually calls >> vm_page_readahead_finish() on them, which can either free, >> or put the page on active/inactive queue, using quite >> a strange approach to choose a queue. >> 5) The pages must not be wired, since vm_page_free() may be >> called on them. However, this is violated by several >> consumers of KPI, relying on lack of errors in the pager. >> Moreover, the swap pager has a special function to skip >> wired pages, while doing the sweep, to avoid this problem. >> So, passing wired pages to swapper is OK, while to the >> reset is not. >> 6) Pagers may replace a page in the object with a new one. >> The sg_pager actually does that. To protect from this >> event, consumers of vm_pager_get_pages() always run >> vm_page_lookup() over the array of pages to relookup the pages. >> However, not all consumers do this. >> >> Speaking of pagers and their consumers: >> - 11 consumers request array of size 1, a single page >> - 3 consumers actually request array >> >> My suggestion is to change the KPI assumptions to the following: >> >> 1) There is no reqpage. All pages are entered busied, all pages >> are returned busied and validated. If pager fails to validate >> all pages it must return error. >> 2) The consumer (not the pager!) is to decide what to do with the >> pages: vm_page_active, vm_page_deactivate, vm_page_flash or just >> vm_page_free them. The consumer also unbusies pages, if it >> wants to. The consumer is free to wire pages before the call. >> 3) Consumers must first query the pager via vm_pager_has_page(), >> and use the after/before hints to limit the size of the >> requested pages array. >> 4) In case if pager replaces pages, it must also update the array, >> so that consumer doesn't need to do relookup. >> >> Doing this sweep, I also noticed that all pagers have a copy-pasted >> code of zeroing invalid regions of partially valid pages. Also, >> many pagers got a set of assertions copy and pasted from each >> other. So, I decided to un-inline the vm_pager_get_pages(), bring >> it to the vm_pager.c file and gather all these copy-pastes >> into one place. >> >> The suggested patch is attached. As expected, it simplifies and >> removes quite a lot of code. >> >> Right now it is tested on UFS only, testing NFS and ZFS is on my list. >> There is one panic known, but it seems unrelated, and Peter pho@ says >> that once it has been seen before. > Below is the summary of my part of the internal discussion about the changes. > > Traditionally, Unix allows the filesystems to perform the short reads. > Most fundamental change in the patch removes this freedom from the > filesystem implementation, and I think that only local filesystems could > be compliant with the proposed strictness. > > IMO, the response from vm_pager_haspages() is only advisory, since > filesystem might not control the external entities which are the source > of the required data. Also since the backing object is not locked, a truncate() may be performed between the operations making the prior return information invalid. Certainly in remote filesystems, possibly on local ones too. > > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" >