From owner-freebsd-fs@freebsd.org Mon Jun 13 14:35:43 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 844EAAF18CF for ; Mon, 13 Jun 2016 14:35:43 +0000 (UTC) (envelope-from pawan@cloudbyte.com) Received: from mail-yw0-x22f.google.com (mail-yw0-x22f.google.com [IPv6:2607:f8b0:4002:c05::22f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 5C29D2794 for ; Mon, 13 Jun 2016 14:35:43 +0000 (UTC) (envelope-from pawan@cloudbyte.com) Received: by mail-yw0-x22f.google.com with SMTP id g20so127540260ywb.0 for ; Mon, 13 Jun 2016 07:35:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudbyte-com.20150623.gappssmtp.com; s=20150623; h=mime-version:from:date:message-id:subject:to; bh=Nk3svj1pJwXgIjjAVC+MCxRt+kavEJsVSTU3kUUMlKQ=; b=i5W7k0ZW4TsdZlKAAKQ8oxCqzvXGmdaKMaGXng+irczqMCsx5BVzlcOfmJA9C5zqOK qxnz49uNpil0D/d8q2uHM5YuzJnTyI4ic91i7s0CyL5QSPji9Fux8ExDBDZnl/oDkPS9 aoFadMWOuq6urX0QAld0AhYnyXLSj2zlPvEQaawD7C7VEVObKWHDlGfKi8yXr4jv3AlI ngFyek+k78ww9MsS8NwQThGZJt/SJ7q6En/dQWAHI9lT7gU3S4MZD0lqQ1qNJV4oFELF HtDHv86T5XJiX/ncP9ss4GCipDuYbWhSgU9iGsbAVJ+MO2JRucRJonutmLRzqxjs5MtM H27w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=Nk3svj1pJwXgIjjAVC+MCxRt+kavEJsVSTU3kUUMlKQ=; b=TrIzVj9WQe/jyiUkdQrxmNN/+VOck1WSISDZc6ULiQKHpFHrrOIO4fCUQ9/oY1DVL/ +8EhyZYSx0o9N8Y2dg/EfShoipLsCmaNTlbIVBTmvWseQzVA5MXNvMGsc4a23Mb57pS7 RonhiB4pJkj014MSAfTRoxjrlCH/HuND2IL7a293dPjH/q6b+/SIhNhCKOX17/apGtZO yY+8NuPGyqoSINSnXMq+L3XbOE9SkkPaea2iD/sTuP3KI46z7pLPoVieVu3y8pEHD1Kx MqPMyVNp17U9QLYqrW+eJ4KdUTDtupNyMpVb13iSthw/DPFmJTBWY2hdx8xcr9npPGus +THQ== X-Gm-Message-State: ALyK8tLa8p4DNuta/xQ2LrXgUGkc44zwrcMUQCHLj5rGxP7F41CXBzpxdAATadTkpfeo7kaxu5NfMkVk0SQZoA== X-Received: by 10.129.165.144 with SMTP id c138mr8744317ywh.271.1465828542420; Mon, 13 Jun 2016 07:35:42 -0700 (PDT) MIME-Version: 1.0 Received: by 10.37.29.67 with HTTP; Mon, 13 Jun 2016 07:35:42 -0700 (PDT) From: Pawan Prakash Sharma Date: Mon, 13 Jun 2016 20:05:42 +0530 Message-ID: Subject: ZFS : When resilvering is in progress, ZIL disk removal is restarting the resilvering process . To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Jun 2016 14:35:43 -0000 This is happening because as part of clean up the vdev namespace (spa_vdev_remove_from_namespace) after removing ZIL disk, we reopen (vdev_reopen) all the vdev, which checks if resilvering is needed for a leaf vdev, if it is needed, it restarts the resilvering. It doesn't check if resilvering is already going on or not. So if resilvering is already in-progress, it will start it from the beginning. My question is, why we are not checking for dsl_scan_resilvering (we do this in spa_load_impl) in vdev_open? What will be the side effects if we put this check while opening the vdev? Let me know if I am missing anything. vdev_open() :- /* * If a leaf vdev has a DTL, and seems healthy, then kick off a * resilver. 
But don't do this if we are doing a reopen for a scrub, * since this would just restart the scrub we are already doing. */ if (vd->vdev_ops->vdev_op_leaf && !spa->spa_scrub_reopen && vdev_resilver_needed(vd, NULL, NULL)) spa_async_request(spa, SPA_ASYNC_RESILVER); Regards, Pawan. From owner-freebsd-fs@freebsd.org Mon Jun 13 15:04:24 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E56C1AF131E for ; Mon, 13 Jun 2016 15:04:24 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id D5F2F25A2 for ; Mon, 13 Jun 2016 15:04:24 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u5DF4O6U059267 for ; Mon, 13 Jun 2016 15:04:24 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 204337] ZFS can have a non-empty directory, but the files don't exist on arm64. Date: Mon, 13 Jun 2016 15:04:25 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 11.0-CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: emaste@freebsd.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: Andrew@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: assigned_to Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Jun 2016 15:04:25 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D204337 Ed Maste changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|freebsd-fs@FreeBSD.org |Andrew@FreeBSD.org --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Mon Jun 13 17:28:56 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 045FBAF2376 for ; Mon, 13 Jun 2016 17:28:56 +0000 (UTC) (envelope-from matthew.ahrens@delphix.com) Received: from mail-io0-x234.google.com (mail-io0-x234.google.com [IPv6:2607:f8b0:4001:c06::234]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id CE8C72D2E for ; Mon, 13 Jun 2016 17:28:55 +0000 (UTC) (envelope-from matthew.ahrens@delphix.com) Received: by mail-io0-x234.google.com with SMTP id d2so38787481iof.0 for ; Mon, 13 Jun 2016 10:28:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=delphix.com; s=google; h=mime-version:from:date:message-id:subject:to; 
bh=F2eLjTaw9yv/Glcl1KFJ/k6NIt+rVKEziymrx8JxGsw=; b=KlsXGZIzpfDSLz4xBEJbxrUFWsByjhOr8u8bWaf+lCJqcVYgQFEfZZE44lBUMgOoTD sb7BlWxrI/fSJPwzMNVVCtp8RNdUdAvOagxmcy9JlJ9bnBkImt0jy6S0XrdC8SuoTuRq 9gsPzJ271GD5edJMkVz1bFLC9W/+QxsyYlLT4= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=F2eLjTaw9yv/Glcl1KFJ/k6NIt+rVKEziymrx8JxGsw=; b=KM1su82+OqqTtC6XMC6KNDdfm2f2ARG5wTdSvBkdG1ZV2pKgg89obW9rfvWFsU/DLG tXBYSC9qzuQuKnZR1Btzj7cJ89T4T3LtFlSpWm2/rx1dOr1ip6FXgHAZ5PoRwEbDLT8J moXh36PZ0lE3W6B62ZGWs9bS/6sxKTJn3BcryaDTWijy7Fk65ZAppnwB2XVcMvMogA0C wNNU0ejC9p5h8HdTH4PqPyffZaoo/VXbjsgl3OmL8JNMgO3pQn0G0ZRHNpTwPas/ldcP K2lE6ZerfEHDLpfVtKY93ecoZuduk93XskK0kqIO4STov74YDymfd4z/GkBSWR7Onj3P 6x3w== X-Gm-Message-State: ALyK8tL4TQSVTBWnLRs9F78wEwJpCcQHYwItL9Nja+P9VNtGSlCRppyn4hVFgJbhR/vd1mEdE+od6aVecSw++DDl X-Received: by 10.107.142.149 with SMTP id q143mr12261526iod.178.1465838934939; Mon, 13 Jun 2016 10:28:54 -0700 (PDT) MIME-Version: 1.0 Received: by 10.36.103.140 with HTTP; Mon, 13 Jun 2016 10:28:54 -0700 (PDT) From: Matthew Ahrens Date: Mon, 13 Jun 2016 13:28:54 -0400 Message-ID: Subject: OpenZFS Developer Summit 2016 - announcement & CFP To: "admin@open-zfs.org" , developer , illumos-zfs , zfs-devel@list.zfsonlinux.org, zfs-devel@freebsd.org, freebsd-fs , zfs-discuss Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Jun 2016 17:28:56 -0000 The fourth annual OpenZFS Developer Summit will be held in San Francisco, September 26-27th, 2016. All OpenZFS developers are invited to participate, and registration is now open. See http://www.open-zfs.org/wiki/OpenZFS_Developer_Summit for all of the details, including slides and videos from previous years' conferences. The goal of the event is to foster cross-community discussions of OpenZFS work and to make progress on some of the projects we have proposed. Like last year, the first day will be set aside for presentations. *If you would like to give a talk, submit a proposal via email to admin@open-zfs.org , including a 1-2 paragraph abstract.* I will review all of the submissions and select talks that will be most interesting for the audience. The registration fee will be waived for speakers. The deadlines are as follows: Aug 1, 2016 All abstracts/proposals submitted to admin@open-zfs.org Aug 12, 2016 Proposal submitters notified Aug 25, 2016 Agenda finalized September 26-27, 2016 OpenZFS Developer Summit The second day of the event will be a hackathon, where you will have the opportunity to work with OpenZFS developers that you might not normally sit next to, with the goal of having something, no matter how insignificant, to demo at the end of the day. Please add your hackathon ideas to the summit wiki page. This event is only possible because of the efforts and contributions of the community of developers and companies that sponsor the event. Special thanks to our early Platinum sponsors: Delphix and Intel Additional sponsorship opportunities are available. Please see the website for details and send email to admin@open-zfs.org if you have any questions. 
--matt From owner-freebsd-fs@freebsd.org Mon Jun 13 22:28:54 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 657F6AF2E13 for ; Mon, 13 Jun 2016 22:28:54 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id CE42B2E94; Mon, 13 Jun 2016 22:28:53 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) IronPort-PHdr: 9a23:k4UQKRQPcIX/UKf8b7wUXBjmMdpsv+yvbD5Q0YIujvd0So/mwa64YRCN2/xhgRfzUJnB7Loc0qyN4/GmCTZLucjJmUtBWaIPfidNsd8RkQ0kDZzNImzAB9muURYHGt9fXkRu5XCxPBsdMs//Y1rPvi/6tmZKSV3BPAZ4bt74BpTVx5zukbviqtuDOU4Q2nKUWvBbElaflU3prM4YgI9veO4a6yDihT92QdlQ3n5iPlmJnhzxtY+a9Z9n9DlM6bp6r5YTGfayQ6NtQ6ZVAT49PyU7/4W/uwPOQAGU6j4SSU0YiBdFCRPJqhbgUcGinDH9s79H2SKZdej/RrMwVDHqu71uQRTrjCoCHyM+/3zajtRwyqlS9kHy7ydjypLZNdnGfMF1ebnQKIsX X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2DOAQBCM19X/61jaINSChaDfn0GuyuBeSKFdYFnFAEBAQEBAQEBZCeCMYIhIwRSEgEiAg0ZAlsELogVDqhbkSwlgQGFJohlBA4CgxWCWgWNbopzhgSKDU6EBIMshTmPbAIeNoQKIDIBiEMBJR9/AQEB X-IronPort-AV: E=Sophos;i="5.26,468,1459828800"; d="scan'208";a="287204297" Received: from nipigon.cs.uoguelph.ca (HELO zcs1.mail.uoguelph.ca) ([131.104.99.173]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 13 Jun 2016 18:28:46 -0400 Received: from localhost (localhost [127.0.0.1]) by zcs1.mail.uoguelph.ca (Postfix) with ESMTP id 58D2B15F565; Mon, 13 Jun 2016 18:28:46 -0400 (EDT) Received: from zcs1.mail.uoguelph.ca ([127.0.0.1]) by localhost (zcs1.mail.uoguelph.ca [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id u2MNvh5Nvcnh; Mon, 13 Jun 2016 18:28:45 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by zcs1.mail.uoguelph.ca (Postfix) with ESMTP id 5D97A15F56D; Mon, 13 Jun 2016 18:28:45 -0400 (EDT) X-Virus-Scanned: amavisd-new at zcs1.mail.uoguelph.ca Received: from zcs1.mail.uoguelph.ca ([127.0.0.1]) by localhost (zcs1.mail.uoguelph.ca [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id p5u4RTcPjT5D; Mon, 13 Jun 2016 18:28:45 -0400 (EDT) Received: from zcs1.mail.uoguelph.ca (zcs1.mail.uoguelph.ca [172.17.95.18]) by zcs1.mail.uoguelph.ca (Postfix) with ESMTP id 3E37F15F565; Mon, 13 Jun 2016 18:28:45 -0400 (EDT) Date: Mon, 13 Jun 2016 18:28:45 -0400 (EDT) From: Rick Macklem To: freebsd-fs Cc: Jordan Hubbard , Doug Rabson , Alexander Motin Message-ID: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca> Subject: pNFS server Plan B MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.95.10] X-Mailer: Zimbra 8.0.9_GA_6191 (ZimbraWebClient - FF47 (Win)/8.0.9_GA_6191) Thread-Topic: pNFS server Plan B Thread-Index: 5QJj0Ggy8anwcaEZpfxgbgEUjWIwZA== X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Jun 2016 22:28:54 -0000 You may have already heard of Plan A, which sort of worked and you could test by following the instructions here: http://people.freebsd.org/~rmacklem/pnfs-setup.txt However, it is very slow for metadata operations (everything other than read/write) and I don't think it is very useful. After my informal talk at BSDCan, here are some thoughts I have: - I think the slowness is related to latency w.r.t. 
all the messages being passed between the nfsd, GlusterFS via Fuse and between the GlusterFS daemons. As such, I don't think faster hardware is likely to help a lot w.r.t. performance. - I have considered switching to MooseFS, but I would still be using Fuse. *** MooseFS uses a centralized metadata store, which would imply only a single Metadata Server (MDS) could be supported, I think? (More on this later...) - dfr@ suggested that avoiding Fuse and doing everything in userspace might help. - I thought of porting the nfsd to userland, but that would be quite a bit of work, since it uses the kernel VFS/VOP interface, etc. All of the above has led me to Plan B. It would be limited to a single MDS, but as you'll see I'm not sure that is as large a limitation as I thought it would be. (If you aren't interested in details of this Plan B design, please skip to "Single Metadata server..." for the issues.) Plan B: - Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would be used for both the MDS and Data Server (DS).) - One FreeBSD server running nfsd would be the MDS. It would build a file system tree that looks exactly like it would without pNFS, except that the files would be empty. (size == 0) --> As such, all the current nfsd code would do metadata operations on this file system exactly like the nfsd does now. - When a new file is created (an Open operation on NFSv4.1), the file would be created exactly like it is now for the MDS. - Then DS(s) would be selected and the MDS would do a Create of a data storage file on these DS(s). (This algorithm could become interesting later, but initially it would probably just pick one DS at random or similar.) - These file(s) would be in a single directory on the DS(s) and would have a file name which is simply the File Handle for this file on the MDS (an FH is 28bytes->48bytes of Hex in ASCII). - Extended attributes would be added to the Metadata file for: - The data file's actual size. - The DS(s) the data file in on. - The File Handle for these data files on the DS(s). This would add some overhead to the Open/create, which would be one Create RPC for each DS the data file is on. *** Initially there would only be one file on one DS. Mirroring for redundancy can be added later. Now, the layout would be generated from these extended attributes for any NFSv4.1 client that asks for it. If I/O operations (read/write/setattr_of_size) are performed on the Metadata server, it would act as a proxy and do them on the DS using the extended attribute information (doing an RPC on the DS for the client). When the file is removed on the Metadata server (link cnt --> 0), the Metadata server would do Remove RPC(s) on the DS(s) for the data file(s). (This requires the file name, which is just the Metadata FH in ASCII.) The only addition that the nfsd for the DS(s) would need would be a callback to the MDS done whenever a client (not the MDS) does a write to the file, notifying the Metadata server the file has been modified and is now Size=K, so the Metadata server can keep the attributes up to date for the file. (It can identify the file by the MDS FH.) All of this is a relatively small amount of change to the FreeBSD nfsd, so it shouldn't be that much work (I'm a lazy guy looking for a minimal solution;-). Single Metadata server... The big limitation to all of the above is the "single MDS" limitation. I had thought this would be a serious limitation to the design scaling up to large stores. However, I'm not so sure it is a big limitation?? 
1 - Since the files on the MDS are all empty, the file system is only i-nodes, directories and extended attribute blocks. As such, I hope it can be put on fast storage. *** I don't know anything about current and near term future SSD technologies. Hopefully others can suggest how large/fast a store for the MDS could be built easily? --> I am hoping that it will be possible to build an MDS that can handle a lot of DS/storage this way? (If anyone has access to hardware and something like SpecNFS, they could test an RPC load with almost no Read/Write RPCs and this would probably show about what the metadata RPC limits are for one of these.) 2 - Although it isn't quite having multiple MDSs, the directory tree could be split up with an MDS for each subtree. This would allow some scaling beyond one MDS. (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are basically an NFS server driven "automount" that redirects the NFSv4.1 client to a different server for a subtree. This might be a useful tool for splitting off subtrees to different MDSs?) If you actually read this far, any comments on this would be welcome. In particular, if you have an opinion w.r.t. this single MDS limitation and/or how big an MDS could be built, that would be appreciated. Thanks for any comments, rick From owner-freebsd-fs@freebsd.org Tue Jun 14 08:47:25 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 99B06AF0BB8 for ; Tue, 14 Jun 2016 08:47:25 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from mail-oi0-x229.google.com (mail-oi0-x229.google.com [IPv6:2607:f8b0:4003:c06::229]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 683A324BD for ; Tue, 14 Jun 2016 08:47:25 +0000 (UTC) (envelope-from dfr@rabson.org) Received: by mail-oi0-x229.google.com with SMTP id u201so119057506oie.0 for ; Tue, 14 Jun 2016 01:47:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=rabson-org.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=M48zYvUb8klHbvXpRTLZkypQ+VOC1niPXuIsKxZxffY=; b=am3RLeV3rztvLtxh965js2v8c0KxwXX2TTAGXcYxByShN2HDZ7CrBjXC1z0khkd/t+ Qsjd2OpnWuZR3MDlkysVEW5oCQ4a+kWBatzP4yTSmN5bkPFdtHRBB6nXiQLInODtzjHn cIGS1gmcP8QNs366atKRrTvG2HamaKo5j/fN/em68eKmfUbQq3zLeT2FjHXaIi6/VV+x w0JOoDDQu0WhZ0xGyWZMgH0CeJOZVgyCyCccRLNcq8K2QfDJoUZB8PWF6viZAHmYPUGH Rr0IePDI8rMl0sxnBTUDY5N6+DY+jhEnjhAjTjsL7nPgaiZk2BpLoc/O7bYbfC5wHz9u C7+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=M48zYvUb8klHbvXpRTLZkypQ+VOC1niPXuIsKxZxffY=; b=PQu1yPasGNnCabYadZP+cu3byHRCAb/Yo8eEe1jB6lfK57T/tNirn6uX+ecbsW8E09 SqtAwPg7CJxanfqq0PDIh2VaTO0uK1vnPBIiXrbtw7ILvdBHbtO767rSUEOYC+q4kW7/ KhRVAF1QXg7gXqMVu8ZI0QOGIheZEZ9HQK1RUYbx4rGgBL8OzLJGh0T0CYOpHUCtQwQ/ +c76Z75zzdCMF1wSZSQpg0xQb7QzmVJZLWDxYlYOGDyzRjBS3NiVL5TVLqoHsLb4mBJh BnO8Dm2q+U/MxzUKmvbAbWXm6RupzpkcZFVWGpbkarEG9On+Q/yF/e04DiHUApCkh1yG icRA== X-Gm-Message-State: ALyK8tLybFt3SQkInXFxP3FpD+L6HVEu1pxah+MiBmdhRaH17o5Sqtg6vAL73NjRzv+PqXNjGpSMsnuKtIP0LQ== X-Received: by 10.157.39.75 with SMTP id r69mr8582073ota.181.1465894044574; Tue, 14 Jun 2016 01:47:24 -0700 (PDT) MIME-Version: 1.0 Received: by 
10.182.22.68 with HTTP; Tue, 14 Jun 2016 01:47:23 -0700 (PDT) In-Reply-To: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca> References: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca> From: Doug Rabson Date: Tue, 14 Jun 2016 09:47:23 +0100 Message-ID: Subject: Re: pNFS server Plan B To: Rick Macklem Cc: freebsd-fs , Jordan Hubbard , Alexander Motin Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Jun 2016 08:47:25 -0000 As I mentioned to Rick, I have been working on similar lines to put together a pNFS implementation. Comments embedded below. On 13 June 2016 at 23:28, Rick Macklem wrote: > You may have already heard of Plan A, which sort of worked > and you could test by following the instructions here: > > http://people.freebsd.org/~rmacklem/pnfs-setup.txt > > However, it is very slow for metadata operations (everything other than > read/write) and I don't think it is very useful. > > After my informal talk at BSDCan, here are some thoughts I have: > - I think the slowness is related to latency w.r.t. all the messages > being passed between the nfsd, GlusterFS via Fuse and between the > GlusterFS daemons. As such, I don't think faster hardware is likely > to help a lot w.r.t. performance. > - I have considered switching to MooseFS, but I would still be using Fuse. > *** MooseFS uses a centralized metadata store, which would imply only > a single Metadata Server (MDS) could be supported, I think? > (More on this later...) > - dfr@ suggested that avoiding Fuse and doing everything in userspace > might help. > - I thought of porting the nfsd to userland, but that would be quite a > bit of work, since it uses the kernel VFS/VOP interface, etc. > I ended up writing everything from scratch as userland code rather than consider porting the kernel code. It was quite a bit of work :) > > All of the above has led me to Plan B. > It would be limited to a single MDS, but as you'll see > I'm not sure that is as large a limitation as I thought it would be. > (If you aren't interested in details of this Plan B design, please > skip to "Single Metadata server..." for the issues.) > > Plan B: > - Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would > be used for both the MDS and Data Server (DS).) > - One FreeBSD server running nfsd would be the MDS. It would > build a file system tree that looks exactly like it would without pNFS, > except that the files would be empty. (size == 0) > --> As such, all the current nfsd code would do metadata operations on > this file system exactly like the nfsd does now. > - When a new file is created (an Open operation on NFSv4.1), the file would > be created exactly like it is now for the MDS. > - Then DS(s) would be selected and the MDS would do > a Create of a data storage file on these DS(s). > (This algorithm could become interesting later, but initially it would > probably just pick one DS at random or similar.) > - These file(s) would be in a single directory on the DS(s) and would > have > a file name which is simply the File Handle for this file on the > MDS (an FH is 28bytes->48bytes of Hex in ASCII). > I have something similar but using a directory hierarchy to try to avoid any one directory being excessively large. 
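Neither message spells out the on-disk layout, so the following is only a
minimal userland C sketch of the fan-out idea, assuming the data file keeps
the hex ASCII form of the MDS file handle as its name and the first two
hex-digit pairs pick two directory levels (the 256x256 fan-out, the paths
and the helper names are invented for the example, not taken from either
implementation):

#include <stdio.h>

/*
 * Illustrative only: build the DS-side path for a data file whose name
 * is the hex ASCII form of the MDS file handle.  The first two hex-digit
 * pairs choose two directory levels, so "1a2b3c..." is stored as
 * <root>/1a/2b/1a2b3c..., giving 256 * 256 leaf directories instead of
 * one large flat directory.
 */
static int
ds_data_path(char *buf, size_t buflen, const char *root, const char *fhhex)
{
	return (snprintf(buf, buflen, "%s/%.2s/%.2s/%s",
	    root, fhhex, fhhex + 2, fhhex));
}

int
main(void)
{
	char path[256];

	ds_data_path(path, sizeof(path), "/ds/data",
	    "1a2b3c4d5e6f70819293a4b5c6d7e8f901234567");
	printf("%s\n", path);
	return (0);
}

A real server would also have to create the intermediate directories and
pick a fan-out suited to the DS file system, but the naming stays a pure
function of the MDS file handle either way.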
> - Extended attributes would be added to the Metadata file for: > - The data file's actual size. > - The DS(s) the data file in on. > - The File Handle for these data files on the DS(s). > This would add some overhead to the Open/create, which would be one > Create RPC for each DS the data file is on. > An alternative here would be to store the extra metadata in the file itself rather than use extended attributes. > *** Initially there would only be one file on one DS. Mirroring for > redundancy can be added later. > The scale of filesystem I want to build more or less requires the extra redundancy of mirroring so I added this at the start. It does add quite a bit of complexity to the MDS to keep track of which DS should have which piece of data and to handle DS failures properly, re-silvering data etc. > > Now, the layout would be generated from these extended attributes for any > NFSv4.1 client that asks for it. > > If I/O operations (read/write/setattr_of_size) are performed on the > Metadata > server, it would act as a proxy and do them on the DS using the extended > attribute information (doing an RPC on the DS for the client). > > When the file is removed on the Metadata server (link cnt --> 0), the > Metadata server would do Remove RPC(s) on the DS(s) for the data file(s). > (This requires the file name, which is just the Metadata FH in ASCII.) > Currently I have a non-nfs control protocol for this but strictly speaking it isn't necessary as you note. > > The only addition that the nfsd for the DS(s) would need would be a > callback > to the MDS done whenever a client (not the MDS) does > a write to the file, notifying the Metadata server the file has been > modified and is now Size=K, so the Metadata server can keep the attributes > up to date for the file. (It can identify the file by the MDS FH.) > I don't think you need this - the client should perform LAYOUTCOMMIT rpcs which will inform the MDS of the last write position and last modify time. This can be used to update the file metadata. The Linux client does this before the CLOSE rpc on the client as far as I can tell. > > All of this is a relatively small amount of change to the FreeBSD nfsd, > so it shouldn't be that much work (I'm a lazy guy looking for a minimal > solution;-). > > Single Metadata server... > The big limitation to all of the above is the "single MDS" limitation. > I had thought this would be a serious limitation to the design scaling > up to large stores. > However, I'm not so sure it is a big limitation?? > 1 - Since the files on the MDS are all empty, the file system is only > i-nodes, directories and extended attribute blocks. > As such, I hope it can be put on fast storage. > *** I don't know anything about current and near term future SSD > technologies. > Hopefully others can suggest how large/fast a store for the MDS could > be built easily? > --> I am hoping that it will be possible to build an MDS that can > handle > a lot of DS/storage this way? > (If anyone has access to hardware and something like SpecNFS, they > could > test an RPC load with almost no Read/Write RPCs and this would > probably > show about what the metadata RPC limits are for one of these.) > I think a single MDS can scale up to petabytes of storage easily. It remains to be seen how far it can scale for TPS. 
I will note that Google's GFS filesystem (you can find a paper describing it at http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf) uses effectively a single MDS, replicated for redundancy but still serving just from one master MDS at a time. That filesystem scaled pretty well for both data size and transactions so I think the approach is viable. > > 2 - Although it isn't quite having multiple MDSs, the directory tree could > be split up with an MDS for each subtree. This would allow some scaling > beyond one MDS. > (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are > basically > an NFS server driven "automount" that redirects the NFSv4.1 client to > a different server for a subtree. This might be a useful tool for > splitting off subtrees to different MDSs?) > > If you actually read this far, any comments on this would be welcome. > In particular, if you have an opinion w.r.t. this single MDS limitation > and/or how big an MDS could be built, that would be appreciated. > > Thanks for any comments, rick > My back-of-envelope calculation assumed a 10 Pb filesystem containing mostly large files which would be striped in 10 Mb pieces. Guessing that we need 200 bytes of metadata per piece, that gives around 200 Gb of metadata which is very reasonable. Even for file sets containing much smaller files, a single server should have no trouble storing the metadata. From owner-freebsd-fs@freebsd.org Tue Jun 14 18:03:07 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9D78DB72B2D for ; Tue, 14 Jun 2016 18:03:07 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 8E12B2A32 for ; Tue, 14 Jun 2016 18:03:07 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u5EI37CU040455 for ; Tue, 14 Jun 2016 18:03:07 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 209158] node / npm triggering zfs rename deadlock Date: Tue, 14 Jun 2016 18:03:07 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 11.0-CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: peter@FreeBSD.org X-Bugzilla-Status: Open X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Jun 2016 18:03:07 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209158 --- Comment #29 from Peter Wemm --- I was wondering what the current state is. Which patch did Doug test? 
We'= ve had some more explosions with poudriere even without npm and its threaded operations. --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Tue Jun 14 22:35:47 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 2E3D7B72BD0 for ; Tue, 14 Jun 2016 22:35:47 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id B238A234F; Tue, 14 Jun 2016 22:35:46 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) IronPort-PHdr: 9a23:g4Ew1RWYJauKOTnZpIOJ5ANXEn7V8LGtZVwlr6E/grcLSJyIuqrYZhCAt8tkgFKBZ4jH8fUM07OQ6PCxHzxdqs/Y+Fk5M7VyFDY9wf0MmAIhBMPXQWbaF9XNKxIAIcJZSVV+9Gu6O0UGUOz3ZlnVv2HgpWVKQka3CwN5K6zPF5LIiIzvjqbpq8yVM1gD3WP1SIgxBSv1hD2ZjtMRj4pmJ/R54TryiVwMRd5rw3h1L0mYhRf265T41pdi9yNNp6BprJYYAu2pN5g/GJBfETtuCWk//8rt/U3PQxGn/HIWSWIQ1B1SDF6Wwgv9W8LLsyD5/s900yqeMMi+GaoxUD+h66puYALvhzoKMyY5tmre3J8jxJlHqQ6s8kQsi7XfZ5uYYaJz X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2BeAgAshmBX/61jaINTChaDfi5PBro+doF5JIVzAoFuFAEBAQEBAQEBZCeCMYIbAQECAiMEUhACAQgOCgICDRkCAlcCBBMbiBUOrFeRBAEBAQcBAQEBASKBAYUmhE2COYFfBA4CgxWCWgWNcIpzhgVwiR1OhASDLYU6hk2JJQIeNoIEAh2BZyAyAQEBiEEBJR9/AQEB X-IronPort-AV: E=Sophos;i="5.26,472,1459828800"; d="scan'208";a="289245665" Received: from nipigon.cs.uoguelph.ca (HELO zcs1.mail.uoguelph.ca) ([131.104.99.173]) by esa-annu.net.uoguelph.ca with ESMTP; 14 Jun 2016 18:34:15 -0400 Received: from localhost (localhost [127.0.0.1]) by zcs1.mail.uoguelph.ca (Postfix) with ESMTP id 33B9215F55D; Tue, 14 Jun 2016 18:34:15 -0400 (EDT) Received: from zcs1.mail.uoguelph.ca ([127.0.0.1]) by localhost (zcs1.mail.uoguelph.ca [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id QsHp-Hn0rjxZ; Tue, 14 Jun 2016 18:34:14 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by zcs1.mail.uoguelph.ca (Postfix) with ESMTP id ED46415F565; Tue, 14 Jun 2016 18:34:13 -0400 (EDT) X-Virus-Scanned: amavisd-new at zcs1.mail.uoguelph.ca Received: from zcs1.mail.uoguelph.ca ([127.0.0.1]) by localhost (zcs1.mail.uoguelph.ca [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id RMW4akJycbVX; Tue, 14 Jun 2016 18:34:13 -0400 (EDT) Received: from zcs1.mail.uoguelph.ca (zcs1.mail.uoguelph.ca [172.17.95.18]) by zcs1.mail.uoguelph.ca (Postfix) with ESMTP id C5FDE15F55D; Tue, 14 Jun 2016 18:34:13 -0400 (EDT) Date: Tue, 14 Jun 2016 18:34:13 -0400 (EDT) From: Rick Macklem To: Doug Rabson Cc: freebsd-fs , Jordan Hubbard , Alexander Motin Message-ID: <1344776266.148298197.1465943653751.JavaMail.zimbra@uoguelph.ca> In-Reply-To: References: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca> Subject: Re: pNFS server Plan B MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.95.11] X-Mailer: Zimbra 8.0.9_GA_6191 (ZimbraWebClient - FF47 (Win)/8.0.9_GA_6191) Thread-Topic: pNFS server Plan B Thread-Index: ASrQ/M2KGJa5G0b8rxnaUXVTWDao4Q== X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Jun 2016 22:35:47 -0000 Doug Rabson wrote: > As I mentioned to Rick, I have been working on similar lines to put > together a pNFS implementation. Comments embedded below. 
> > On 13 June 2016 at 23:28, Rick Macklem wrote: > > > You may have already heard of Plan A, which sort of worked > > and you could test by following the instructions here: > > > > http://people.freebsd.org/~rmacklem/pnfs-setup.txt > > > > However, it is very slow for metadata operations (everything other than > > read/write) and I don't think it is very useful. > > > > After my informal talk at BSDCan, here are some thoughts I have: > > - I think the slowness is related to latency w.r.t. all the messages > > being passed between the nfsd, GlusterFS via Fuse and between the > > GlusterFS daemons. As such, I don't think faster hardware is likely > > to help a lot w.r.t. performance. > > - I have considered switching to MooseFS, but I would still be using Fuse. > > *** MooseFS uses a centralized metadata store, which would imply only > > a single Metadata Server (MDS) could be supported, I think? > > (More on this later...) > > - dfr@ suggested that avoiding Fuse and doing everything in userspace > > might help. > > - I thought of porting the nfsd to userland, but that would be quite a > > bit of work, since it uses the kernel VFS/VOP interface, etc. > > > > I ended up writing everything from scratch as userland code rather than > consider porting the kernel code. It was quite a bit of work :) > > > > > > All of the above has led me to Plan B. > > It would be limited to a single MDS, but as you'll see > > I'm not sure that is as large a limitation as I thought it would be. > > (If you aren't interested in details of this Plan B design, please > > skip to "Single Metadata server..." for the issues.) > > > > Plan B: > > - Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would > > be used for both the MDS and Data Server (DS).) > > - One FreeBSD server running nfsd would be the MDS. It would > > build a file system tree that looks exactly like it would without pNFS, > > except that the files would be empty. (size == 0) > > --> As such, all the current nfsd code would do metadata operations on > > this file system exactly like the nfsd does now. > > - When a new file is created (an Open operation on NFSv4.1), the file would > > be created exactly like it is now for the MDS. > > - Then DS(s) would be selected and the MDS would do > > a Create of a data storage file on these DS(s). > > (This algorithm could become interesting later, but initially it would > > probably just pick one DS at random or similar.) > > - These file(s) would be in a single directory on the DS(s) and would > > have > > a file name which is simply the File Handle for this file on the > > MDS (an FH is 28bytes->48bytes of Hex in ASCII). > > > > I have something similar but using a directory hierarchy to try to avoid > any one directory being excessively large. > I thought of that, but since no one will be doing an "ls" of it, I wasn't going to bother doing multiple dirs initially. However, now that I think of it, the Create and Remove RPCs will end up doing VOP_LOOKUP()s, so breaking these up into multiple directories sounds like a good idea. (I may just hash the FH and let the hash choose a directory.) Good suggestion, thanks. > > > - Extended attributes would be added to the Metadata file for: > > - The data file's actual size. > > - The DS(s) the data file in on. > > - The File Handle for these data files on the DS(s). > > This would add some overhead to the Open/create, which would be one > > Create RPC for each DS the data file is on. 
> > > > An alternative here would be to store the extra metadata in the file itself > rather than use extended attributes. > Yep. I'm not sure if there is any performance advantage of doing data vs. extended attributes? > > > *** Initially there would only be one file on one DS. Mirroring for > > redundancy can be added later. > > > > The scale of filesystem I want to build more or less requires the extra > redundancy of mirroring so I added this at the start. It does add quite a > bit of complexity to the MDS to keep track of which DS should have which > piece of data and to handle DS failures properly, re-silvering data etc. > > > > > > Now, the layout would be generated from these extended attributes for any > > NFSv4.1 client that asks for it. > > > > If I/O operations (read/write/setattr_of_size) are performed on the > > Metadata > > server, it would act as a proxy and do them on the DS using the extended > > attribute information (doing an RPC on the DS for the client). > > > > When the file is removed on the Metadata server (link cnt --> 0), the > > Metadata server would do Remove RPC(s) on the DS(s) for the data file(s). > > (This requires the file name, which is just the Metadata FH in ASCII.) > > > > Currently I have a non-nfs control protocol for this but strictly speaking > it isn't necessary as you note. > > > > > > The only addition that the nfsd for the DS(s) would need would be a > > callback > > to the MDS done whenever a client (not the MDS) does > > a write to the file, notifying the Metadata server the file has been > > modified and is now Size=K, so the Metadata server can keep the attributes > > up to date for the file. (It can identify the file by the MDS FH.) > > > > I don't think you need this - the client should perform LAYOUTCOMMIT rpcs > which will inform the MDS of the last write position and last modify time. > This can be used to update the file metadata. The Linux client does this > before the CLOSE rpc on the client as far as I can tell. > When I developed the NFSv4.1_Files layout client, I had three servers to test against. - The Netapp filer just returned EOPNOTSUPP for LayoutCommit. - The Linux test server (had MDS and DS on the same Linux system) accepted the LayoutCommit, but didn't do anything for it, so doing it had no effect. - The only pNFS server I've ever tested against that needed LayoutCommit was Oracle/Solaris and the Oracle folk never explained why their server required it or what would break if you didn't do it. (I don't recall attributes being messed up when I didn't do it correctly.) As such, I've never been sure what it is used for. I need to read the LayoutCommit stuff in the RFC and Flex Files draft again. It would be nice if the DS->MDS calls could be avoided for every write. Doing one when the DS receives a Commit RPC wouldn't be too bad. > > > > > All of this is a relatively small amount of change to the FreeBSD nfsd, > > so it shouldn't be that much work (I'm a lazy guy looking for a minimal > > solution;-). > > > > Single Metadata server... > > The big limitation to all of the above is the "single MDS" limitation. > > I had thought this would be a serious limitation to the design scaling > > up to large stores. > > However, I'm not so sure it is a big limitation?? > > 1 - Since the files on the MDS are all empty, the file system is only > > i-nodes, directories and extended attribute blocks. > > As such, I hope it can be put on fast storage. > > *** I don't know anything about current and near term future SSD > > technologies. 
> > Hopefully others can suggest how large/fast a store for the MDS could > > be built easily? > > --> I am hoping that it will be possible to build an MDS that can > > handle > > a lot of DS/storage this way? > > (If anyone has access to hardware and something like SpecNFS, they > > could > > test an RPC load with almost no Read/Write RPCs and this would > > probably > > show about what the metadata RPC limits are for one of these.) > > > > I think a single MDS can scale up to petabytes of storage easily. It > remains to be seen how far it can scale for TPS. I will note that Google's > GFS filesystem (you can find a paper describing it at > http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf) > uses effectively a single MDS, replicated for redundancy but still serving > just from one master MDS at a time. That filesystem scaled pretty well for > both data size and transactions so I think the approach is viable. > > > > > > > 2 - Although it isn't quite having multiple MDSs, the directory tree could > > be split up with an MDS for each subtree. This would allow some scaling > > beyond one MDS. > > (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are > > basically > > an NFS server driven "automount" that redirects the NFSv4.1 client to > > a different server for a subtree. This might be a useful tool for > > splitting off subtrees to different MDSs?) > > > > If you actually read this far, any comments on this would be welcome. > > In particular, if you have an opinion w.r.t. this single MDS limitation > > and/or how big an MDS could be built, that would be appreciated. > > > > Thanks for any comments, rick > > > > My back-of-envelope calculation assumed a 10 Pb filesystem containing > mostly large files which would be striped in 10 Mb pieces. Guessing that we > need 200 bytes of metadata per piece, that gives around 200 Gb of metadata > which is very reasonable. Even for file sets containing much smaller files, > a single server should have no trouble storing the metadata. > Thanks for all the good comments, rick ps: Good luck with your pNFS server. Maybe someday it will be available for FreeBSD? 
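For anyone who wants to experiment with the extended-attribute half of this
design without touching the kernel, FreeBSD already exposes per-file
extended attributes to userland through extattr_set_file(2) and
extattr_get_file(2); the in-kernel nfsd would go through
VOP_SETEXTATTR/VOP_GETEXTATTR instead.  The record layout, the attribute
name "pnfsd" and the DS host below are invented for the example; the thread
does not define them.

#include <sys/types.h>
#include <sys/extattr.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-file record: data file size, its DS, its DS file handle. */
struct pnfsd_rec {
	uint64_t	datasize;	/* actual size of the data file */
	char		ds_host[64];	/* DS holding the data file */
	uint8_t		ds_fh[48];	/* file handle of the data file on the DS */
};

int
main(int argc, char **argv)
{
	struct pnfsd_rec rec;

	if (argc != 2)
		errx(1, "usage: %s metadata-file", argv[0]);

	memset(&rec, 0, sizeof(rec));
	strlcpy(rec.ds_host, "ds0.example.com", sizeof(rec.ds_host));

	/* Attach the record to the (otherwise empty) metadata file... */
	if (extattr_set_file(argv[1], EXTATTR_NAMESPACE_USER, "pnfsd",
	    &rec, sizeof(rec)) == -1)
		err(1, "extattr_set_file");

	/* ...and read it back, as the MDS would when building a layout. */
	if (extattr_get_file(argv[1], EXTATTR_NAMESPACE_USER, "pnfsd",
	    &rec, sizeof(rec)) == -1)
		err(1, "extattr_get_file");

	printf("data size %ju on DS %s\n", (uintmax_t)rec.datasize, rec.ds_host);
	return (0);
}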
From owner-freebsd-fs@freebsd.org Wed Jun 15 17:34:01 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0051FA47652 for ; Wed, 15 Jun 2016 17:34:00 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id D46F81E60 for ; Wed, 15 Jun 2016 17:34:00 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u5FHXxpO073781 for ; Wed, 15 Jun 2016 17:34:00 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 209158] node / npm triggering zfs rename deadlock Date: Wed, 15 Jun 2016 17:33:59 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 11.0-CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: doug@freebsd.con.com X-Bugzilla-Status: Open X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 15 Jun 2016 17:34:01 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209158 --- Comment #30 from Doug Luce --- The last thing I tried was https://reviews.freebsd.org/D6533 on top of that day's head. Looks like there are some reviewers tagged on there but nobody seems to have gotten around to it yet. 
--=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Wed Jun 15 19:38:13 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E88B9A444CA for ; Wed, 15 Jun 2016 19:38:13 +0000 (UTC) (envelope-from kp@FreeBSD.org) Received: from venus.codepro.be (venus.codepro.be [IPv6:2a01:4f8:162:1127::2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "*.codepro.be", Issuer "Gandi Standard SSL CA 2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id B4A0F1719 for ; Wed, 15 Jun 2016 19:38:13 +0000 (UTC) (envelope-from kp@FreeBSD.org) Received: from [192.168.228.1] (94-224-12-229.access.telenet.be [94.224.12.229]) (Authenticated sender: kp) by venus.codepro.be (Postfix) with ESMTPSA id 51FEB1EDE0 for ; Wed, 15 Jun 2016 21:38:11 +0200 (CEST) From: "Kristof Provost" To: freebsd-fs@freebsd.org Subject: Panic with Root on ZFS Date: Wed, 15 Jun 2016 21:38:10 +0200 Message-ID: MIME-Version: 1.0 X-Mailer: MailMate Trial (1.9.4r5234) Content-Type: text/plain; charset=utf-8; format=flowed; markup=markdown Content-Transfer-Encoding: 8bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 15 Jun 2016 19:38:14 -0000 Hi, I’ running a root-on-ZFS system and reliably see this panic during boot: It’s a raidz vdev. The faulting kernel is r301916 (head). The last version known to boot it r299060 (head). panic: solaris assert: refcount(count(&spa->spa_refcount) >= spa->spa_minref || MUTEX_HELD(&spa_namespace_lock), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c, line: 863 Unfortunately I can’t get a dump, but here’s a picture of the backtrace: https://people.freebsd.org/~kp/zfs_panic.jpg Regards, Kristof From owner-freebsd-fs@freebsd.org Thu Jun 16 08:42:29 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 67CF3A44901 for ; Thu, 16 Jun 2016 08:42:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 57DF31519 for ; Thu, 16 Jun 2016 08:42:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u5G8gS34085992 for ; Thu, 16 Jun 2016 08:42:29 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 210316] panic after trying to r/w mount msdosfs on write protected media Date: Thu, 16 Jun 2016 08:42:29 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 11.0-CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: avg@FreeBSD.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: assigned_to cc Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 16 Jun 2016 08:42:29 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D210316 Andriy Gapon changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|freebsd-bugs@FreeBSD.org |freebsd-fs@FreeBSD.org CC| |trasz@FreeBSD.org --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Thu Jun 16 08:47:29 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 43F3EA44A98 for ; Thu, 16 Jun 2016 08:47:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 3406117D0 for ; Thu, 16 Jun 2016 08:47:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u5G8lS73092345 for ; Thu, 16 Jun 2016 08:47:29 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 210316] panic after trying to r/w mount msdosfs on write protected media Date: Thu, 16 Jun 2016 08:47:29 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 11.0-CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: avg@FreeBSD.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 16 Jun 2016 08:47:29 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D210316 --- Comment #1 from Andriy Gapon --- My preliminary analysis of the problem. mountmsdosfs() calls markvoldirty() near the end. markvoldirty() failed because of the read-only media. bwrite() called brelse() which marked the buffer as dirty because of the write error. Because of the markvoldirty() failure mountmsdosfs() failed over all and, thus, it called g_vfs_destroy() which destroyed the filesystem's geom. When the syncer tried to sync the di= rty buffer later on g_vfs_strategy9) accessed the destroyed consumer / geom and that resulted in a crash. 
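A minimal illustration of the trigger and of a workaround, assuming the
write-protected media shows up as /dev/da0s1 (the device name is only a
placeholder): the volume-dirty write that starts the chain above is, as far
as I can tell, only attempted for a read/write mount, so an explicit
read-only mount avoids it.

# the case from the report: asking for a read/write mount of
# write-protected media leads to the failed markvoldirty() write
mount -t msdosfs /dev/da0s1 /mnt

# workaround until the error path is fixed: mount read-only, so the
# dirty-flag write is never issued
mount -t msdosfs -o ro /dev/da0s1 /mnt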
--=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Fri Jun 17 03:54:34 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id BD880A77F5B for ; Fri, 17 Jun 2016 03:54:34 +0000 (UTC) (envelope-from zanchey@ucc.gu.uwa.edu.au) Received: from mail-ext-sout1.uwa.edu.au (mail-ext-sout1.uwa.edu.au [130.95.128.72]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (Client CN "IronPort Appliance Demo Certificate", Issuer "IronPort Appliance Demo Certificate" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 55B282B22 for ; Fri, 17 Jun 2016 03:54:32 +0000 (UTC) (envelope-from zanchey@ucc.gu.uwa.edu.au) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2AFBQA4c2NX/8+AX4JehRGmAwEBAQEBAQaWRxeHewEBAQEBAWYnhQwGAQE4gQxEiDCvF4UpAQEFiGyDSgiFX4JHhleBA4F9C0CCR45sigqBMZwaj3VUgUNFHIFZYQGJegEBAQ X-IPAS-Result: A2AFBQA4c2NX/8+AX4JehRGmAwEBAQEBAQaWRxeHewEBAQEBAWYnhQwGAQE4gQxEiDCvF4UpAQEFiGyDSgiFX4JHhleBA4F9C0CCR45sigqBMZwaj3VUgUNFHIFZYQGJegEBAQ X-IronPort-AV: E=Sophos;i="5.26,481,1459785600"; d="scan'208";a="222815008" Received: from f5-new.net.uwa.edu.au (HELO mooneye.ucc.gu.uwa.edu.au) ([130.95.128.207]) by mail-ext-out1.uwa.edu.au with ESMTP/TLS/ADH-AES256-SHA; 17 Jun 2016 11:53:20 +0800 Received: by mooneye.ucc.gu.uwa.edu.au (Postfix, from userid 801) id 4AD7A3C057; Fri, 17 Jun 2016 11:53:19 +0800 (AWST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=ucc.gu.uwa.edu.au; s=ucc-2016-3; t=1466135599; bh=HDmi0WXGrRaTu1ndD7mBJW9u7NfO5rbcYGlwGKTpaFY=; h=Date:From:To:Subject; b=FQ/pjfgy6MePmg0IvVS2Iw0I7pc/cwzNcwrEVH+Sb21FdADXSSdeeu5emYKLrOugh G+UbY0Jr5KdlPxABGg4ijcl2taSUrn6OBWcsmP+dnvbUVVVu0IDryZPeE+EQkcI5Wq JbtynYPaCmQHnNLotBv6Kvb4Zmdjk9rtD7l4AVOo= Received: from motsugo.ucc.gu.uwa.edu.au (motsugo.ucc.gu.uwa.edu.au [130.95.13.7]) by mooneye.ucc.gu.uwa.edu.au (Postfix) with ESMTP id 266AF3C054 for ; Fri, 17 Jun 2016 11:53:19 +0800 (AWST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=ucc.gu.uwa.edu.au; s=ucc-2016-3; t=1466135599; bh=HDmi0WXGrRaTu1ndD7mBJW9u7NfO5rbcYGlwGKTpaFY=; h=Date:From:To:Subject; b=FQ/pjfgy6MePmg0IvVS2Iw0I7pc/cwzNcwrEVH+Sb21FdADXSSdeeu5emYKLrOugh G+UbY0Jr5KdlPxABGg4ijcl2taSUrn6OBWcsmP+dnvbUVVVu0IDryZPeE+EQkcI5Wq JbtynYPaCmQHnNLotBv6Kvb4Zmdjk9rtD7l4AVOo= Received: by motsugo.ucc.gu.uwa.edu.au (Postfix, from userid 11251) id 1E5CE2001D; Fri, 17 Jun 2016 11:53:19 +0800 (AWST) Received: from localhost (localhost [127.0.0.1]) by motsugo.ucc.gu.uwa.edu.au (Postfix) with ESMTP id 17C3220019 for ; Fri, 17 Jun 2016 11:53:19 +0800 (AWST) Date: Fri, 17 Jun 2016 11:53:19 +0800 (AWST) From: David Adam To: freebsd-fs@freebsd.org Subject: Processes wedging on ZFS accesses Message-ID: User-Agent: Alpine 2.11 (DEB 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Jun 2016 03:54:34 -0000 Hi all, We're still having trouble with our 10.3-RELEASE-p3 fileserver using a ZFS pool. 
After a certain amount of uptime (usually a week or so), a Samba process will get stuck in D-state: max 2075 0.0 0.2 339928 26616 - D 26May16 0:19.59 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf Running find(1) over the hierarchy that the smbd process has open will also wedge in a D-state. Our backups also seem to get stuck, presumably in the same spot. `procstat -k` on the stuck processes (smbd and our stuck python-based backup program) shows: PID TID COMM TDNAME KSTACK 2075 100587 smbd - mi_switch+0xe1 sleepq_wait+0x3a _sx_slock_hard+0x31b namei+0x1c5 vn_open_cred+0x24d zfs_getextattr+0x1f2 VOP_GETEXTATTR_APV+0xa7 extattr_get_vp+0x15d sys_extattr_get_file+0xf4 amd64_syscall+0x40f Xfast_syscall+0xfb 2075 100623 smbd - mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0xca0 vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 knlist_remove_kq+0x24 filt_vfsdetach+0x22 knote_fdclose+0xef closefp+0x42 amd64_syscall+0x40f Xfast_syscall+0xfb 21676 101572 python2.7 - mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_statat_vnhook+0xae sys_lstat+0x30 amd64_syscall+0x40f Xfast_syscall+0xfb 36144 101585 python2.7 - mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_statat_vnhook+0xae sys_lstat+0x30 amd64_syscall+0x40f Xfast_syscall+0xfb Memory doesn't appear to be a problem, and we have the ARC wired to 10 GB maximum: Mem: 80M Active, 2127M Inact, 13G Wired, 36K Cache, 1643M Buf, 308M Free ARC: 9921M Total, 7056M MFU, 2434M MRU, 977K Anon, 162M Header, 267M Other Swap: 20G Total, 20G Free I'm getting a DDB kernel built, but is there any other information that is useful? One of the problems is that Samba won't start a new process for the user whose process is already wedged, so eventually user logins stop working. 
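For reference, the information usually worth capturing for a hang like
this, beyond the per-process procstat output above, is a kernel-stack
snapshot of every thread plus the lock state from DDB, for example (the
output file name is arbitrary, and "show alllocks" needs a kernel built
with WITNESS):

# userland side: kernel stacks for all processes, not just the stuck PIDs
procstat -kk -a > /var/tmp/procstat-kk-a.txt

# from DDB, after breaking into the debugger:
show alllocks        (lists held locks; requires options WITNESS)
show lockedvnods     (lists locked vnodes and the threads holding them)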
Thanks

David Adam
zanchey@ucc.gu.uwa.edu.au

From owner-freebsd-fs@freebsd.org Fri Jun 17 04:44:34 2016
Date: Fri, 17 Jun 2016 06:44:27 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: David Adam <zanchey@ucc.gu.uwa.edu.au>
Cc: freebsd-fs@freebsd.org, avg@freebsd.org
Subject: Re: Processes wedging on ZFS accesses
Message-ID: <20160617044427.GA6575@dft-labs.eu>

On Fri, Jun 17, 2016 at 11:53:19AM +0800, David Adam wrote:
> Hi all,
>
> We're still having trouble with our 10.3-RELEASE-p3 fileserver using a ZFS
> pool.
>
> After a certain amount of uptime (usually a week or so), a Samba process
> will get stuck in D-state:
>
> max 2075 0.0 0.2 339928 26616 - D 26May16 0:19.59
> /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
>
> Running find(1) over the hierarchy that the smbd process has open will
> also wedge in a D-state.
>
> Our backups also seem to get stuck, presumably in the same spot.
>
> `procstat -k` on the stuck processes (smbd and our stuck python-based
> backup program) shows:
> PID TID COMM TDNAME KSTACK
> 2075 100587 smbd - mi_switch+0xe1
> sleepq_wait+0x3a _sx_slock_hard+0x31b namei+0x1c5 vn_open_cred+0x24d
> zfs_getextattr+0x1f2 VOP_GETEXTATTR_APV+0xa7 extattr_get_vp+0x15d
> sys_extattr_get_file+0xf4 amd64_syscall+0x40f Xfast_syscall+0xfb
> 2075 100623 smbd - mi_switch+0xe1
> sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0xca0 vop_stdlock+0x3c
> VOP_LOCK1_APV+0xab _vn_lock+0x43 knlist_remove_kq+0x24 filt_vfsdetach+0x22
> knote_fdclose+0xef closefp+0x42 amd64_syscall+0x40f Xfast_syscall+0xfb
> 21676 101572 python2.7 - mi_switch+0xe1

These 2 threads likely deadlocked each other:
- the first one has the vnode locked and tries to take the filedesc lock
  for reading
- the second one has the filedesc lock taken for writing and tries to lock
  the vnode

The filedesc lock is going to be split, which will get rid of this
particular instance of the problem.

However, I think the real bug is the fact that zfs_getextattr calls
vn_open_cred with the vnode locked, but I don't know if we can simply
unlock it and not revalidate anything (likely yes). Cc'ing some zfs people
for comments.

-- 
Mateusz Guzik

From owner-freebsd-fs@freebsd.org Sat Jun 18 20:50:36 2016
Date: Sat, 18 Jun 2016 13:50:29 -0700
From: Jordan Hubbard <jkh@ixsystems.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: freebsd-fs <freebsd-fs@freebsd.org>, Alexander Motin
Subject: Re: pNFS server Plan B
In-Reply-To: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca>
> On Jun 13, 2016, at 3:28 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> You may have already heard of Plan A, which sort of worked
> and you could test by following the instructions here:
>
> http://people.freebsd.org/~rmacklem/pnfs-setup.txt
>
> However, it is very slow for metadata operations (everything other than
> read/write) and I don't think it is very useful.

Hi guys,

I finally got a chance to catch up and bring up Rick's pNFS setup on a
couple of test machines. He's right, obviously - the "plan A" approach is
a bit convoluted and, not at all surprisingly, slow. With all of those
transits twixt kernel and userland, not to mention glusterfs itself, which
has not really been tuned for our platform (there are a number of papers
on this we probably haven't even all read yet), we're obviously still in
the "first make it work" stage.

That said, I think there are probably more possible plans than just A and
B here, and we should give the broader topic of "what does FreeBSD want to
do in the Enterprise / Cloud computing space?" at least some consideration
at the same time, since there are more than a few goals running in
parallel here.

First, let's talk about our story around clustered filesystems +
associated command-and-control APIs in FreeBSD. There is something of an
embarrassment of riches in the industry at the moment - glusterfs, ceph,
Hadoop HDFS, RiakCS, moose, etc. All or most of them offer different pros
and cons, and all offer more than just the ability to store files and
scale "elastically". They also have ReST APIs for configuring and
monitoring the health of the cluster, some offer object as well as file
storage, and Riak offers a distributed KVS for storing information *about*
file objects in addition to the objects themselves (and when your
application involves storing and managing several million photos, for
example, the idea of distributing the index as well as the files in a
fault-tolerant fashion is also compelling). Some, if not most, of them are
also far better supported under Linux than FreeBSD (I don't think we even
have a working ceph port yet).
I'm not saying we need to blindly follow the herds and do all the same
things others are doing here, either; I'm just saying that it's a much
bigger problem space than simply "parallelizing NFS", and if we can kill
multiple birds with one stone on the way to doing that, we should
certainly consider doing so.

Why? Because pNFS was first introduced as a draft RFC (RFC5661) in 2005.
The Linux folks have been working on it since 2006. Ten years is a long
time in this business, and when I raised the topic of pNFS at the recent
SNIA DSI conference (where storage developers gather to talk about trends
and things), the most prevalent reaction I got was "people are still using
pNFS?!" This is clearly one of those technologies that may still have some
runway left, but it's been rapidly overtaken by other approaches to
solving more or less the same problems in coherent, distributed filesystem
access, and if we want to get mindshare for this, we should at least have
an answer ready for the "why did you guys do pNFS that way rather than
just shimming it on top of ${someNewerHotness}??" argument. I'm not
suggesting pNFS is dead - hell, even AFS still appears to be somewhat
alive - but there's a difference between appealing to an increasingly
narrow niche and trying to solve the sorts of problems most DevOps folks
working At Scale these days are running into.

That is also why I am not sure I would totally embrace the idea of a
central MDS being a Real Option. Sure, the risks can be mitigated (as you
say, by mirroring it), but even saying the words "central MDS" (or central
anything) may be such a turn-off to those very same DevOps folks, folks
who have been burned so many times by SPOFs and scaling bottlenecks in
large environments, that we'll lose the audience the minute they hear the
trigger phrase. Even if it means signing up for Other Problems later, it's
a lot easier to "sell" the concept of completely distributed mechanisms
where, if there is any notion of centralization at all, it's at least the
result of a quorum election and the DevOps folks don't have to do anything
manually to cause it to happen - the cluster is "resilient" and
"self-healing" and they are happy with being able to say those buzzwords
to the CIO, who nods knowingly and tells them they're doing a fine job!

Let's get back, however, to the notion of downing multiple avians with the
same semi-spherical kinetic projectile: What seems to be The Rage at the
moment, and I don't know how well it actually scales since I've yet to be
at the pointy end of such a real-world deployment, is the idea of
clustering the storage ("somehow") underneath and then providing NFS and
SMB protocol access entirely in userland, usually with both of those
services cooperating with the same lock manager and even the same ACL
translation layer. Our buddies at Red Hat do this with glusterfs at the
bottom and NFS Ganesha + Samba on top - I talked to one of the Samba core
team guys at SNIA and he indicated that this was increasingly common, with
the team having helped here and there when approached by different vendors
with the same idea.
We (iXsystems) also get a lot of requests to be able to make the same
file(s) available via both NFS and SMB at the same time, and they don't
much at all like being told "but that's dangerous - don't do that! Your
file contents and permissions models are not guaranteed to survive such an
experience!" They really want to do it, because the rest of the world
lives in heterogeneous environments and that's just the way it is.

Even the object storage folks, like Openstack's Swift project, are
spending significant amounts of mental energy on the topic of how to
re-export their object stores as shared filesystems over NFS and SMB, the
single consistent and distributed object store being, of course, Their
Thing. They wish, of course, that the rest of the world would just fall
into line and use their object system for everything, but they also get
that the "legacy stuff" just won't go away and needs some sort of
attention if they're to remain players at the standards table.

So anyway, that's the view I have from the perspective of someone who
actually sells storage solutions for a living, and while I could certainly
"sell some pNFS" to various customers who just want to add a dash of
steroids to their current NFS infrastructure, or need to use NFS but also
need to store far more data into a single namespace than any one box will
accommodate, I also know that offering even more elastic solutions will be
a necessary part of offering solutions to the growing contingent of folks
who are not tied to any existing storage infrastructure and have various
non-greybearded folks shouting in their ears about object this and cloud
that. Might there not be some compromise solution which allows us to put
more of this in userland, with fewer context switches in and out of the
kernel, also giving us the option of presenting a more united front to
multiple protocols that require more ACL and lock impedance-matching than
we'd ever want to put in the kernel anyway?
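[Editorial note: The "one lock manager shared by every protocol head" idea
described above is easy to sketch in miniature. The toy program below is
not Ganesha or Samba code - the function names and the flat in-memory
table are invented for illustration - but it shows the shape of the
design: a pretend NFS front end and a pretend SMB front end both translate
their byte-range lock requests into calls on a single shared lock table,
so a range locked through one protocol conflicts correctly with the other.
In a real deployment that role would be played by a clustered lock service
shared by the protocol daemons, not an in-process array.]

/*
 * Toy illustration of a shared userland lock manager: two protocol front
 * ends route byte-range lock requests through one common table.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_LOCKS 64

struct range_lock {
    char path[128];
    long off, len;
    char owner[16];     /* "nfs" or "smb", recorded for this demo only */
    bool used;
};

static struct range_lock table[MAX_LOCKS];   /* shared lock-manager state */

static bool overlaps(const struct range_lock *l, const char *path,
                     long off, long len)
{
    return l->used && strcmp(l->path, path) == 0 &&
           off < l->off + l->len && l->off < off + len;
}

/* Common entry point used by both protocol front ends. */
static bool lockmgr_acquire(const char *owner, const char *path,
                            long off, long len)
{
    for (int i = 0; i < MAX_LOCKS; i++)
        if (overlaps(&table[i], path, off, len))
            return false;                    /* conflict across protocols */
    for (int i = 0; i < MAX_LOCKS; i++) {
        if (!table[i].used) {
            table[i].used = true;
            table[i].off = off;
            table[i].len = len;
            snprintf(table[i].path, sizeof(table[i].path), "%s", path);
            snprintf(table[i].owner, sizeof(table[i].owner), "%s", owner);
            return true;
        }
    }
    return false;                            /* table full */
}

/* Protocol front ends: both just translate into the shared manager. */
static bool nfs_lock(const char *p, long o, long l) { return lockmgr_acquire("nfs", p, o, l); }
static bool smb_lock(const char *p, long o, long l) { return lockmgr_acquire("smb", p, o, l); }

int main(void)
{
    printf("nfs lock 0-99:    %s\n", nfs_lock("/vol/share/file", 0, 100) ? "granted" : "denied");
    printf("smb lock 50-149:  %s\n", smb_lock("/vol/share/file", 50, 100) ? "granted" : "denied");
    printf("smb lock 200-299: %s\n", smb_lock("/vol/share/file", 200, 100) ? "granted" : "denied");
    return 0;
}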
- Jordan

From owner-freebsd-fs@freebsd.org Sat Jun 18 23:05:52 2016
Date: Sat, 18 Jun 2016 19:05:41 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Jordan Hubbard <jkh@ixsystems.com>
Cc: freebsd-fs <freebsd-fs@freebsd.org>, Alexander Motin
Subject: Re: pNFS server Plan B
Message-ID: <2021361361.156197173.1466291141156.JavaMail.zimbra@uoguelph.ca>

Jordan Hubbard wrote:
>
> > On Jun 13, 2016, at 3:28 PM, Rick Macklem wrote:
> >
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> >
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> >
> > However, it is very slow for metadata
> > operations (everything other than read/write) and I don't think it is
> > very useful.
>

I am going to respond to a few of the comments, but I hope that people who
actually run server farms and might be users of a fairly large/inexpensive
storage cluster will comment. Put another way, I'd really like to hear a
"user" perspective.

> Hi guys,
>
> I finally got a chance to catch up and bring up Rick's pNFS setup on a
> couple of test machines. He's right, obviously - the "plan A" approach
> is a bit convoluted and, not at all surprisingly, slow.
> [...]
> Why? Because pNFS was first introduced as a draft RFC (RFC5661) in 2005.
> The Linux folks have been working on it since 2006. Ten years is a long
> time in this business, and when I raised the topic of pNFS at the recent
> SNIA DSI conference (where storage developers gather to talk about
> trends and things), the most prevalent reaction I got was "people are
> still using pNFS?!"

Actually, I would have worded this as "will anyone ever use pNFS?".
Although 10 years is a long time in this business, it doesn't seem to be
long at all in the standards world where the NFSv4 protocols are being
developed.
- You note that the Linux folk started development in 2006.
  I will note that RFC5661 (the RFC that describes pNFS) is dated 2010.
  I will also note that I believe the first vendor to ship a server that
  supported pNFS happened sometime after the RFC was published.
- I could be wrong, but I'd guess that Netapp's clustered Filers were the
  first to ship, about 4 years ago.

To this date, very few vendors have actually shipped working pNFS servers
as far as I am aware. Other than Netapp, the only ones I know that have
shipped are the large EMC servers (not Isilon). I am not sure if
Oracle/Solaris has ever shipped a pNFS server to customers yet. Same goes
for Panasas. I am not aware of a Linux-based pNFS server usable in a
production environment, although Ganesha-NFS might be shipping with pNFS
support now.
- If others are aware of other pNFS servers that are shipping to
  customers, please correct me. (I haven't been to an NFSv4.1 testing
  event for 3 years, so my info is definitely dated.)

Note that the "Flex Files" layout I used for the Plan A experiment is only
an Internet draft at this time and hasn't even made it to the RFC stage.
--> As such, I think it is very much an open question w.r.t. whether or
    not this protocol will become widely used or yet another forgotten
    standard? I also suspect that some storage vendors that have invested
    considerable resources in NFSv4.1/pNFS development might ask the same
    question in-house;-)

> This is clearly one of those technologies that may still have some
> runway left, but it's been rapidly overtaken by other approaches to
> solving more or less the same problems in coherent, distributed
> filesystem access [...]
>
> That is also why I am not sure I would totally embrace the idea of a
> central MDS being a Real Option. [...] The cluster is "resilient" and
> "self-healing" and they are happy with being able to say those buzzwords
> to the CIO, who nods knowingly and tells them they're doing a fine job!

I'll admit that I'm a bits and bytes guy. I have a hunch how difficult it
is to get "resilient" and "self-healing" to really work. I also know it is
way beyond what I am capable of.
> Let's get back, however, to the notion of downing multiple avians with
> the same semi-spherical kinetic projectile: What seems to be The Rage at
> the moment [...] is the idea of clustering the storage ("somehow")
> underneath and then providing NFS and SMB protocol access entirely in
> userland, usually with both of those services cooperating with the same
> lock manager and even the same ACL translation layer. Our buddies at Red
> Hat do this with glusterfs at the bottom and NFS Ganesha + Samba on top.
> [...] We (iXsystems) also get a lot of requests to be able to make the
> same file(s) available via both NFS and SMB at the same time [...]

If you want to make SMB and NFS work together on the same underlying file
systems, I suspect it is doable, although messy. To do this with the
current FreeBSD nfsd, it would require someone with Samba/Windows
knowledge pointing out what Samba needs to interact with NFSv4, and those
hooks could probably be implemented. (I know nothing about Samba/Windows,
so I'd need someone else doing that side of it.)

I actually mentioned Ganesha-NFS at the little talk/discussion I gave. At
this time, they have ripped a FreeBSD port out of their sources and they
use Linux-specific thread primitives.
--> It would probably be significant work to get Ganesha-NFS up to speed
    on FreeBSD. Maybe a good project, but it needs some person/group
    dedicating resources to get it to happen.

> Even the object storage folks, like Openstack's Swift project, are
> spending significant amounts of mental energy on the topic of how to
> re-export their object stores as shared filesystems over NFS and SMB,
> the single consistent and distributed object store being, of course,
> Their Thing. [...]
>
> So anyway, that's the view I have from the perspective of someone who
> actually sells storage solutions for a living [...]
> Might there not be some compromise solution which allows us to put more
> of this in userland, with fewer context switches in and out of the
> kernel, also giving us the option of presenting a more united front to
> multiple protocols that require more ACL and lock impedance-matching
> than we'd ever want to put in the kernel anyway?

For SMB + NFS in userland, the combination of Samba and Ganesha is
probably your main open source choice, from what I am aware of.

I am one guy who does this as a spare time retirement hobby. As such,
doing something like a Ganesha port etc. is probably beyond what I am
interested in. When saying this, I don't want to imply that it isn't a
good approach.

You sent me the URL for an abstract of a paper discussing how Facebook is
using GlusterFS. It would be nice to get more details w.r.t. how they use
it, such as:
- How do their client servers access it? (NFS, Fuse, or ???)
- Whether or not they've tried the Ganesha-NFS stuff that GlusterFS is
  transitioning to?
Put another way, they might have some insight into whether NFS in userland
via Ganesha works well or not?

Hopefully some "users" for this stuff will respond, rick

ps: Maybe this could be reposted in a place they are likely to read it.

> - Jordan