From owner-freebsd-current@freebsd.org Wed Aug 3 14:26:30 2016 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 21070BADB71; Wed, 3 Aug 2016 14:26:30 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citapm.icyb.net.ua (citapm.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 1EAEE1A02; Wed, 3 Aug 2016 14:26:28 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citapm.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id RAA16480; Wed, 03 Aug 2016 17:26:20 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1bUx7k-000NAA-FN; Wed, 03 Aug 2016 17:26:20 +0300 To: FreeBSD Filesystems , FreeBSD Current From: Andriy Gapon Subject: some [big] changes to ZPL (ZFS<->VFS ) Message-ID: <3f79e88d-a519-0c8d-f16a-7c83460a37c1@FreeBSD.org> Date: Wed, 3 Aug 2016 17:25:45 +0300 User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 03 Aug 2016 14:26:30 -0000 I would like to get more comments about the following review request: https://reviews.freebsd.org/D6533 Both about the code changes and about the general direction of the changes. I've been asked to try to get this change into 11.0 because of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158 So, there is some urgency to my request. To be honest I've never wanted to rush in this change, but I probably have been sitting on it for too long. ZFS POSIX Layer is originally written for Solaris VFS which is very different from FreeBSD VFS. Many things that FreeBSD VFS manages on behalf of all filesystems are also (re-)implemented in ZPL in a different way. As a result, ZPL contains code that is redundant on FreeBSD or duplicates VFS functionality or, in the worst cases, badly interacts / interferes with VFS. I am talking mostly about the operations that work on a filesystem namespace rather than on contents and properties of files. The former operations typically work on 2 or more vnodes. The vnodes are usually locked in a particular way or are expected to be locked in a particular way by the filesystem. ZPL on the other hand has quite an elaborate locking system of its own. It includes a lock that protects a znode, a lock that protects the znode's notion of its parent, locks that protect a name to znode relationship for the znode's children. So, we have two competing locking systems for the filesystem nodes and they are not aware of each other. The most prominent problem is a deadlock caused by the lock order reversal of vnode locks that may happen with concurrent zfs_rename() and lookup(). The deadlock is a result of zfs_rename() not observing the vnode locking contract expected by the VFS. I should mention here that VOP_RENAME is the weirdest method with respect to locking requirements. Every filesystem implementation is expected to perform quite an elaborate locking "dance" in VOP_RENAME. This is in contrast to all other operations where all necessary locking is done by the VFS before calling into the filesystem code. So, I started by fixing zfs_rename() but that required to make some changes to the "bottom half" of the ZPL (the "upper half" being the code that implements the vnode and VFS operations). And that required making changes in other VOPs anyway. And so the next step was to remove redundant re-lookup of child nodes in methods like zfs_rmdir and zfs_remove. Before I could stop I removed all ZPL internal locking that was supposed to protect parent-child relationships of filesystem nodes. Those relationships are now protected exclusively by the vnode locks and the code is changed to take advantage of that fact and to properly interact with VFS. As a minor detail, the removal of the internal locking allowed all ZPL dmu_tx_assign calls to use TXG_WAIT mode. Another change that was not strictly required and which is probably too intrusive is killing the support for case insensitive operations. My thinking was that FreeBSD VFS does not provide support for those anyway. But I'll probably restore the code, at least in the bottom half of the ZPL, before committing the change. On a more general level, I decided that the upper half of ZPL would necessarily be quite different between different VFS models and so it does not make much sense to keep huge chunks of ifdef-ed out code (or compiled in, but never reached code) that's not useful in FreeBSD at all. I think that keeping that code makes it harder to maintain the FreeBSD code and to _correctly_ merge upstream changes. E.g., if a change can be merged without conflicts to an ifdef-ed out blocked, that does not mean that the merge is correct. Last, but not least, the work was sponsored by HybridCluster / ClusterHQ Inc back in 2013. -- Andriy Gapon