Subject: Re: Suggestion for hardware for ZFS fileserver
From: Willem Jan Withagen <wjw@digiware.nl>
To: Sami Halabi
Cc: freebsd-fs@freebsd.org
Date: Thu, 27 Dec 2018 11:54:16 +0100

On 22/12/2018 15:49, Sami Halabi wrote:
> Hi,
>
> What SAS HBA card do you recommend for 16/24 internal ports and 2 external
> that are recognized and work well with FreeBSD ZFS?

There is no real advice here, but what I saw is that it is relatively easy to overload a lot of the busses involved in this. I ran into this when building Ceph clusters on FreeBSD, where each disk has its own daemon hammering away on the platters.

The first bottleneck is the disk "backplane". If you do not wire every disk with a dedicated HBA-disk cable, then you are sharing the bandwidth on the backplane between all the disks. And depending on the architecture of the backplane, several disks share one expander, and the feed into that expander is in turn shared by all the disks attached to it. Some expanders have multiple inputs from the HBA, but I have seen cases where 4 SAS lanes go in and only 2 get used.

The second bottleneck is that you have all these nice disks connected to your HBA, but the HBA sits in only a PCIe x4 slot..... You will need a PCIe x8 or x16 slot for that, and preferably PCIe 3.0. Total bandwidth (x16 link): PCIe 3.0 = 32 GB/s, PCIe 2.0 = 16 GB/s, PCIe 1.1 = 8 GB/s.

So let's say your 24-port HBA has 24 disks connected, each doing 100 Mbyte/s: that is 19.2 Gbit/s, which will very likely saturate a narrow PCIe slot. Note that I'm only talking 100 Mbyte/s, since that is what I see spinning rust do under Ceph. I'm not even talking about the SSDs used for journals and cache.
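To make that arithmetic explicit, here is a small back-of-envelope sketch. The 24 disks and 100 Mbyte/s per disk are the numbers from above; the PCIe figures are the usual unidirectional per-lane payload rates (roughly 500 MB/s for PCIe 2.0 and 985 MB/s for PCIe 3.0), not measurements from any particular HBA:

#!/usr/bin/env python3
# Back-of-envelope: aggregate disk throughput vs. PCIe slot bandwidth.

disks = 24
per_disk_mb_s = 100                      # what I see spinning rust do under Ceph

total_mb_s = disks * per_disk_mb_s       # 2400 MB/s
total_gbit_s = total_mb_s * 8 / 1000     # ~19.2 Gbit/s

# Approximate unidirectional payload bandwidth per slot: lanes * per-lane rate.
pcie_mb_s = {
    ("2.0", 4): 4 * 500,     # 2000 MB/s  -> already saturated
    ("2.0", 8): 8 * 500,     # 4000 MB/s
    ("3.0", 4): 4 * 985,     # ~3940 MB/s
    ("3.0", 8): 8 * 985,     # ~7880 MB/s -> plenty of headroom
}

print(f"{disks} disks x {per_disk_mb_s} MB/s = {total_mb_s} MB/s "
      f"({total_gbit_s:.1f} Gbit/s)")
for (gen, lanes), bw in pcie_mb_s.items():
    verdict = "saturated" if total_mb_s >= bw else "ok"
    print(f"PCIe {gen} x{lanes}: {bw} MB/s -> {verdict}")

So a PCIe 2.0 x4 slot is already over the edge with nothing but spinning disks; an x8 PCIe 3.0 slot leaves room for the SSDs as well.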
For ZFS the bus challenge is a bit more of a problem, because you cannot scale out. I've seen designs where an extra disk cabinet with 96 disks is attached over something like 4*12 Gbit/s to a controller in a PCIe x16 slot, and people wondering why it doesn't do what they thought it would: roughly 48 Gbit/s shared by 96 disks is well under what each spinning disk could deliver on its own.

For Ceph there is a "nice" way out, because it can scale out across more, smaller servers with fewer disks per chassis. So we tend to use 16-drive chassis with two 8-port HBAs that have dedicated connections per disk. It is a bit more expensive, but it seems to work much better.

Note that you will then run into network problems, which are more of the same, just a bit further up the scale. With Ceph that only plays a role during recovery of lost nodes, which hopefully is not too often. But a dead/replaced disk will be able to restore at the max speed a disk can take. A lost/replaced node will recover at a speed limited by the disk infrastructure of the recovering node, since the data will come from a lot of other disks on other servers. The local busses will only saturate if the HW design was poor.

--WjW