From: John Baldwin
To: arch@freebsd.org
Date: Fri, 28 Oct 2011 14:25:59 -0400
Subject: [PATCH] fadvise(2) system call
Message-Id: <201110281426.00013.jhb@freebsd.org>

I have been working for the last week or so on a patch to add an fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It also only really makes sense for regular files and does not apply to other file descriptor types.

Just as with madvise(2) there are two types of advice that can be given.
One set specifies the access pattern for a specific region of the file, while the second set results in immediate action.

The first set consists of FADV_NORMAL, FADV_SEQUENTIAL, FADV_RANDOM, and FADV_NOREUSE. For these operations what I have done is to add an optional "advice region" to a file descriptor. When a read(2) or write(2) is performed on a file, if the requested region falls completely within an active "advice region", then the associated advice is used to modify the IO_* flags passed down with the request. FADV_NORMAL just uses the current IO_* flags, including using sequential_heuristic() to determine the amount of read-ahead and/or clustering to perform. FADV_RANDOM always passes a sequential count of zero to prevent read-ahead. FADV_SEQUENTIAL is the same as FADV_NORMAL for now (perhaps it should always be setting the maximum sequential count?). FADV_NOREUSE passes a sequential count of zero and sets IO_DIRECT (as if the operation were performed on a file opened with O_DIRECT).

To simplify the implementation, only a single "advice region" is maintained for now (unlike madvise(2), which will split up vm map entries if necessary to ensure all requests are honored). Since the advice is only advisory, I think this is an ok approach for now. If we really had a valid use case, we could maybe add a list of advice regions, but then you have to deal with possibly splitting up read(2) or write(2) requests that span multiple advice regions, etc. I didn't feel that this extra complexity was warranted for now.

The other two operations (FADV_WILLNEED and FADV_DONTNEED) are implemented via a new VOP_ADVISE(). The patch includes a default implementation (vop_stdadvise()) which is a nop for FADV_WILLNEED (I couldn't come up with a filesystem-independent way to trigger an async read-ahead).
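
To make the advice-region check described above for FADV_NORMAL, FADV_SEQUENTIAL, FADV_RANDOM, and FADV_NOREUSE a bit more concrete, here is a rough userland sketch. It is not taken from the patch: the names (struct advice_region, struct io_hint, apply_advice(), the ADV_* values) are hypothetical stand-ins, the "direct" field merely stands in for IO_DIRECT, and the heuristic sequential count is passed in rather than computed by sequential_heuristic().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-ins; not the names or values used in the patch. */
enum advice { ADV_NORMAL, ADV_SEQUENTIAL, ADV_RANDOM, ADV_NOREUSE };

/* The single optional advice region attached to a file descriptor. */
struct advice_region {
	bool		active;
	off_t		start;	/* first byte covered */
	off_t		end;	/* last byte covered (inclusive) */
	enum advice	advice;
};

/* What gets passed down with the I/O request in this sketch. */
struct io_hint {
	int	seqcount;	/* sequential count driving read-ahead */
	bool	direct;		/* stands in for IO_DIRECT */
};

/*
 * Compute the hint for a request of 'len' bytes at offset 'off'.
 * 'heuristic_seqcount' is whatever the normal sequential heuristic
 * would have chosen for this request.
 */
static struct io_hint
apply_advice(const struct advice_region *r, off_t off, size_t len,
    int heuristic_seqcount)
{
	struct io_hint h = { heuristic_seqcount, false };
	off_t last;

	assert(r != NULL);
	if (!r->active || len == 0)
		return (h);

	/* The advice applies only if the request lies entirely inside. */
	last = off + (off_t)len - 1;
	if (off < r->start || last > r->end)
		return (h);

	switch (r->advice) {
	case ADV_RANDOM:
		h.seqcount = 0;		/* no read-ahead */
		break;
	case ADV_NOREUSE:
		h.seqcount = 0;		/* no read-ahead ... */
		h.direct = true;	/* ... and behave like O_DIRECT */
		break;
	case ADV_NORMAL:
	case ADV_SEQUENTIAL:
		break;			/* keep the heuristic result */
	}
	return (h);
}
```

Note that a request which only partially overlaps the region falls back to the normal heuristic in this sketch, which matches the "falls completely within" rule and avoids having to split requests.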
For FADV_DONTNEED, vop_stdadvise() has a functional implementation which flushes all clean buffers from the vnode (via a new V_CLEANONLY mode for vinvalbuf()) and then moves any clean, unwired pages in the specified range of the file to the cache page queue (using a new vm_object_page_cache() routine).

Various versions of this patch have already been reviewed and/or glanced at by alc@, kib@, and mdf@, but I'd like to open it for wider review before committing it. I will likely also MFC it back to 8 after 9.0 is released.

The patch can be found at www.freebsd.org/~jhb/patches/fadvise.patch

You can read the description of posix_fadvise() (which this implements) here:

http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

-- 
John Baldwin