Subject: Re: Suggestion for hardware for ZFS fileserver
From: Willem Jan Withagen <wjw@digiware.nl>
To: Sami Halabi
Cc: freebsd-fs@freebsd.org
Date: Thu, 27 Dec 2018 11:54:16 +0100

On 22/12/2018 15:49, Sami Halabi wrote:
> Hi,
>
> What SAS HBA card do you recommend for 16/24 internal ports and 2 external
> that are recognized and work well with FreeBSD ZFS?

There is no real advice here, but what I saw is that it is relatively easy to overload a lot of the busses involved in this. I ran into this when building Ceph clusters on FreeBSD, where each disk has its own daemon hammering away on the platters.

The first bottleneck is the disk "backplane". If you do not wire every disk with a dedicated HBA-disk cable, then you are sharing the bandwidth on the backplane between all the disks. And depending on the architecture of the backplane, several disks share one expander, and the feed into that expander is in turn shared by all the disks attached to it. Some expanders have multiple inputs from the HBA, but I have seen cases where 4 SAS lanes go in and only 2 get used.

The second bottleneck is that you have all these nice disks connected to your HBA, but the HBA sits in only a PCIe x4 slot..... You will need a PCIe x8 or x16 slot for that, and preferably PCIe 3.0. Total bandwidth (x16 link): PCIe 3.0 = 32 GB/s, PCIe 2.0 = 16 GB/s, PCIe 1.1 = 8 GB/s.

So let's say your 24-port HBA has 24 disks connected, each doing 100 Mbyte/s: that is 19.2 Gbit/s, which will very likely saturate a narrow PCIe slot. Note that I'm only talking 100 Mbyte/s, since that is what I see spinning rust do under Ceph. I'm not even talking about the SSDs used for journals and cache.
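To make that arithmetic explicit, here is a small back-of-envelope sketch. The 24 disks and 100 Mbyte/s per disk are the numbers from above; the PCIe figures are the usual unidirectional per-lane payload rates (roughly 500 MB/s for PCIe 2.0 and 985 MB/s for PCIe 3.0), not measurements from any particular HBA:

#!/usr/bin/env python3
# Back-of-envelope: aggregate disk throughput vs. PCIe slot bandwidth.

disks = 24
per_disk_mb_s = 100                      # what I see spinning rust do under Ceph

total_mb_s = disks * per_disk_mb_s       # 2400 MB/s
total_gbit_s = total_mb_s * 8 / 1000     # ~19.2 Gbit/s

# Approximate unidirectional payload bandwidth per slot: lanes * per-lane rate.
pcie_mb_s = {
    ("2.0", 4): 4 * 500,     # 2000 MB/s  -> already saturated
    ("2.0", 8): 8 * 500,     # 4000 MB/s
    ("3.0", 4): 4 * 985,     # ~3940 MB/s
    ("3.0", 8): 8 * 985,     # ~7880 MB/s -> plenty of headroom
}

print(f"{disks} disks x {per_disk_mb_s} MB/s = {total_mb_s} MB/s "
      f"({total_gbit_s:.1f} Gbit/s)")
for (gen, lanes), bw in pcie_mb_s.items():
    verdict = "saturated" if total_mb_s >= bw else "ok"
    print(f"PCIe {gen} x{lanes}: {bw} MB/s -> {verdict}")

So a PCIe 2.0 x4 slot is already over the edge with nothing but spinning disks; an x8 PCIe 3.0 slot leaves room for the SSDs as well.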
For ZFS the bus challenge is a bit more of a problem, because you cannot scale out. I've seen designs where an extra disk cabinet with 96 disks is attached over something like 4*12 Gbit/s to a controller in a PCIe x16 slot, and people wondering why it doesn't do what they thought it would: roughly 48 Gbit/s shared by 96 disks is well under what each spinning disk could deliver on its own.

For Ceph there is a "nice" way out, because it can scale out across more, smaller servers with fewer disks per chassis. So we tend to use 16-drive chassis with two 8-port HBAs that have dedicated connections per disk. It is a bit more expensive, but it seems to work much better.

Note that you will then run into network problems, which are more of the same, just a bit further up the scale. With Ceph that only plays a role during recovery of lost nodes, which hopefully is not too often. But a dead/replaced disk will be able to restore at the max speed a disk can take. A lost/replaced node will recover at a speed limited by the disk infrastructure of the recovering node, since the data will come from a lot of other disks on other servers. The local busses will only saturate if the HW design was poor.

--WjW