From: Sami Halabi <sodynet1@gmail.com>
Date: Sat, 22 Dec 2018 16:49:22 +0200
Subject: Re: Suggestion for hardware for ZFS fileserver
To: Peter Eriksson
Cc: freebsd-fs@freebsd.org
Hi,

What SAS HBA card do you recommend with 16/24 internal ports and 2 external
ports that is recognized by and works well with FreeBSD ZFS?

Sami

On Sat, 22 Dec 2018 at 2:48, Peter Eriksson wrote:

> > On 22 Dec 2018, at 00:49, Rick Macklem wrote:
> >
> > Peter Eriksson wrote:
> > [good stuff snipped]
> >> This has caused some interesting problems…
> >>
> >> First thing we noticed was that booting would take forever… Mounting
> >> the 20-100k filesystems _and_ enabling them to be shared via NFS is not
> >> done efficiently at all (for each filesystem it re-reads
> >> /etc/zfs/exports (a couple of times) before appending one line to the
> >> end. Repeat 20-100,000 times…). Not to mention the big kernel lock for
> >> NFS ("hold all NFS activity while we flush and reinstall all sharing
> >> information per filesystem") being done by mountd…
> >
> > Yes, /etc/exports and mountd were implemented in the 1980s, when a dozen
> > file systems would have been a large server. Scaling to 10,000 or more
> > file systems wasn't even conceivable back then.
>
> Yeah, for a normal user with a non-silly number of filesystems this is a
> non-issue. Anyway, it's the kind of issue that I like to think about how
> to solve.
It's fun :-)
>
> >> Wish list item #1: A BerkeleyDB-based 'sharetab' that replaces the
> >> horribly slow /etc/zfs/exports text file.
> >> Wish list item #2: A reimplementation of mountd and the kernel
> >> interface to allow a "diff" between the contents of the DB-based
> >> sharetab above to be input into the kernel, instead of the brute-force
> >> way it's done now.
> >
> > The parser in mountd for /etc/exports is already an ugly beast and I
> > think implementing a "diff" version will be difficult, especially
> > figuring out what needs to be deleted.
>
> Yeah, I tried to decode it (this summer) and I think I sort of got the
> hang of it eventually.
>
> > I do have a couple of questions related to this:
> > 1 - Would your case work if there was an "add these lines to
> >     /etc/exports"?
> >     (Basically adding entries for file systems, but not trying to delete
> >     anything previously exported. I am not a ZFS guy, but I think ZFS
> >     just generates another exports file and then gets mountd to export
> >     everything again.)
>
> Yeah, the ZFS library that the zfs commands use just reads and updates the
> separate /etc/zfs/exports text file (and has mountd read both
> /etc/exports and /etc/zfs/exports). The problem is that what it basically
> does when you tell it to "zfs mount -a" (mount all filesystems in all
> zpools) is a big (pseudocode):
>
> For P in ZPOOLS; do
>   For Z in ZFILESYSTEMS-AND-SNAPSHOTS in $P; do
>     Mount $Z
>     If $Z has the "sharenfs" option; Then
>       Open /etc/zfs/exports
>       Read until a matching line is found and replace it with the
>         options; if none is found, append the options at the end
>       Close /etc/zfs/exports
>       Signal mountd
>         (which then opens /etc/exports and /etc/zfs/exports and does
>         its magic)
>     End
>   End
> End
>
> All wrapped up in a Solaris compatibility layer in libzfs.
Actually I think
> it even reads the /etc/zfs/exports file twice for each loop iteration due
> to some abstractions. Btw, things got really "fun" when the hourly
> snapshots we were taking (adding 10-20k new snapshots every hour, and we
> didn't expire them fast enough in the beginning) triggered the code above
> and that code took longer than one hour to execute: mountd was 100% busy
> getting signalled and rereading, flushing and reinstalling exports into
> the kernel all the time, and basically never finished. Luckily we didn't
> have any NFS clients accessing the servers at that time :-)
>
> This summer I wrote some code to instead use a B-tree BerkeleyDB file and
> modified the libzfs code and the mountd daemon to use that database for
> much faster lookups (no need to read the whole /etc/zfs/exports file all
> the time) and additions. It worked pretty well actually, and wasn't that
> hard to add. I also wanted to add the possibility of giving "exports"
> arguments "Solaris"-style, so one could say things like:
>
> /export/staff vers=4,sec=krb5:krb5i:krb5p,rw=130.236.0.0/16,sec=sys,ro=130.236.160.0/24:10.1.2.3
>
> But I never finished that (Solaris-style exports options) part….
>
> We've lately been toying with putting the NFS sharing stuff into a
> separate "private" ZFS attribute (separate from the official "sharenfs"
> one), having another tool read those instead, and generating another
> "exports" file, so that the file can be generated in "one go" and mountd
> signalled just once after all filesystems have been mounted.
> Unfortunately that would mean that they won't be shared until after all
> of them have been mounted, but we think it would take less time
> all-in-all.
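The cost difference between the per-filesystem text-file rewrite quoted
above and Peter's keyed B-tree approach can be sketched roughly as follows.
This is a toy model, not the libzfs code: a Python dict stands in for the
BerkeleyDB B-tree, and all file names and function names are illustrative.

```python
import os
import tempfile

def share_textfile(exports_path, fs, opts):
    # Mimic the pseudocode loop body: scan the whole exports file for a
    # matching line, replace it if present, otherwise append one line.
    # Cost is proportional to the file size on every single share.
    lines, found = [], False
    if os.path.exists(exports_path):
        with open(exports_path) as f:
            for line in f:
                if line.split()[0] == fs:
                    lines.append(fs + "\t" + opts + "\n")
                    found = True
                else:
                    lines.append(line)
    if not found:
        lines.append(fs + "\t" + opts + "\n")
    with open(exports_path, "w") as f:
        f.writelines(lines)

def share_db(db, fs, opts):
    # Keyed-store variant: one insert-or-replace per share, no full scan.
    db[fs] = opts

exports = os.path.join(tempfile.mkdtemp(), "exports")
shares = {}
opts = "-sec=krb5 -network 130.236.0.0 -mask 255.255.0.0"
for i in range(1000):
    share_textfile(exports, "/export/fs%d" % i, opts)
    share_db(shares, "/export/fs%d" % i, opts)
```

With N filesystems the text-file path rewrites the file N times, scanning
roughly N/2 existing lines each time (quadratic work overall, before even
counting the mountd signal per iteration), while the keyed store does N
cheap inserts.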
>
> We also modified the FreeBSD boot scripts so that we first mount the most
> important ZFS filesystems needed on the boot disks (not just /) and then
> mount (and share via NFS) the rest in the background, so we can log in to
> the machine as root early (no need for everything to have been mounted
> before giving us a login prompt).
>
> (Right now a reboot of the bigger servers takes an hour or two before all
> filesystems are mounted and exported.)
>
> > 2 - Are all (or maybe most) of these ZFS file systems exported with the
> >     same arguments?
> >     - Here I am thinking that a "default-for-all-ZFS-filesystems" line
> >       could be put in /etc/exports that would apply to all ZFS file
> >       systems not exported by explicit lines in the exports file(s).
> >       This would be fairly easy to implement and would avoid trying to
> >       handle 1000s of entries.
>
> For us, most have exactly the same exports arguments. (We set options on
> the top-level filesystems (/export/staff, /export/students etc.) and then
> all home dirs inherit those.)
>
> > In particular, #2 above could be easily implemented on top of what is
> > already there, using a new type of line in /etc/exports and handling
> > that as a special case in the NFS server code when no specific export
> > for the file system to the client is found.
> >
> >> (I've written some code that implements item #1 above and it helps
> >> quite a bit. Nothing near production quality yet though. I have looked
> >> at item #2 a bit too but not done anything about it.)
> >
> > [more good stuff snipped]
> >
> > Btw, although I put the questions here, I think a separate thread
> > discussing how to scale to 10000+ file systems might be useful. (On
> > freebsd-fs@ or freebsd-current@. The latter sometimes gets the
> > attention of more developers.)
>
> Yeah, probably a good idea!
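In exports(5) terms, Rick's "default-for-all-ZFS-filesystems" idea could
look something like the sketch below. The per-line flags are real FreeBSD
exports syntax, but the default-line keyword is invented here purely for
illustration; the proposal in the thread does not specify any syntax.

```
# Explicit exports, as today: one line per exported tree
/export/staff    -sec=krb5:krb5i:krb5p -network 130.236.0.0 -mask 255.255.0.0
/export/students -sec=krb5:krb5i:krb5p -network 130.236.0.0 -mask 255.255.0.0

# Hypothetical new line type: applies to any ZFS filesystem not matched by
# an explicit line above ("ZFSDEFAULT" is an invented placeholder keyword)
ZFSDEFAULT       -sec=krb5:krb5i:krb5p -network 130.236.0.0 -mask 255.255.0.0
```

The point of such a line is that mountd would install one kernel export
rule instead of thousands, with the NFS server falling back to it when no
specific export matches.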
>
> - Peter
>
> > rick
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"