Date: Thu, 22 May 2014 07:52:03 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-fs@freebsd.org
Subject: Re: Turn off RAID read and write caching with ZFS?
Message-ID: <537DF2F3.10604@denninger.net>
In-Reply-To: <719056985.20140522033824@supranet.net>
References: <719056985.20140522033824@supranet.net>
On 5/22/2014 5:38 AM, Jeff Chan wrote:
> As mentioned before, we have a server with the LSI 2208 RAID chip which
> apparently doesn't seem to have HBA firmware available.  (If anyone
> knows of one, please let me know.)  Therefore we are running each drive
> as a separate, individual RAID0, and we've turned off the RAID hardware
> read and write caching on the claim that it performs better with ZFS,
> such as:
>
> http://forums.freenas.org/index.php?threads/disable-cache-flush.12253/
>
>   "cyberjock, Apr 7, 2013
>
>   AAh.  You have a RAID controller with on-card RAM.  Based on my
>   testing with 3 different RAID controllers that had RAM, plus benchmark
>   and real-world tests, here are my recommended settings for ZFS users:
>
>   1. Disable your on-card write cache.  Believe it or not, this
>   improves write performance significantly.  I was very disappointed
>   with this choice, but it seems to be a universal truth.  I upgraded
>   one of the cards to 4GB of cache a few months before going to ZFS and
>   I'm disappointed that I wasted my money.  It helped a LOT on the
>   Windows server, but in FreeBSD it's a performance killer.  :(
>
>   2. If your RAID controller supports read-ahead cache, you should set
>   it to either "disabled", the most "conservative" (smallest read-ahead)
>   or "normal" (medium-size read-ahead).  I found that "conservative" was
>   better for random reads from lots of users and "normal" was better for
>   things where you were constantly reading a file in order (such as
>   copying a single very large file).  If you choose anything else for
>   the read-ahead size, the latency of your zpool will go way up, because
>   any read by the zpool will be multiplied by 100x since the RAID card
>   is constantly reading a bunch of sectors before and after the one
>   sector or area requested."
>
> Does anyone have any comments or test results about this?  I have not
> attempted to test it independently.  Should we run with RAID hardware
> caching on or off?

That's mostly right.

Write caching is very evil in a ZFS world, because ZFS checksums each
block.  If the filesystem gets back an "OK" for a block that is not
actually on the disk, ZFS will presume the checksum is OK.  If that
assumption proves to be false down the road, you're going to have a very
bad day.

READ caching is not so simple.

The problem is that in order to obtain the best speed from a spinning
piece of rust you must read whole tracks.  If you don't, you take a
latency penalty every time you want a sector, because you must wait for
the rust to pass under the head.  If you read a single sector and then
come back for a second one, inter-sector gap sync is lost and you get to
wait for another rotation.

Therefore what you WANT for spinning rust, in virtually all cases, is
for all reads coming off the rust to be one full **TRACK** in size.  If
you wind up using only one sector of that track you still aren't hurt
materially, because you had to pay the rotational latency anyway as soon
as you moved the head.
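To put rough numbers on that, here is a back-of-the-envelope
calculation.  All of the drive parameters are assumed, purely
illustrative figures (7200 RPM, 4K sectors, ~150 MB/sec off the media);
real drives will differ:

    # Back-of-the-envelope cost of one sector vs. one whole track, once
    # the head is already over the target track.  All parameters below
    # are assumed/illustrative, not measured from any particular drive.

    RPM = 7200                    # spindle speed
    SECTOR_BYTES = 4096           # assumed physical sector size
    MEDIA_RATE = 150e6            # assumed sustained media rate, bytes/sec

    rotation_ms = 60.0 / RPM * 1000.0        # one revolution: ~8.33 ms
    avg_latency_ms = rotation_ms / 2.0       # average wait for the target sector

    sector_ms = SECTOR_BYTES / MEDIA_RATE * 1000.0   # ~0.03 ms of transfer

    one_sector_read = avg_latency_ms + sector_ms     # ~4.2 ms for one sector
    whole_track_read = avg_latency_ms + rotation_ms  # ~12.5 ms, and you now
                                                     # hold every sector on
                                                     # the track
    two_separate_reads = 2 * one_sector_read         # ~8.4 ms for just two

    print(f"one sector:           {one_sector_read:.2f} ms")
    print(f"whole track:          {whole_track_read:.2f} ms")
    print(f"two separate sectors: {two_separate_reads:.2f} ms")

The marginal cost of the rest of the track is a single extra rotation,
which is why picking up the whole thing is nearly free once you've paid
to get the head there.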
Unfortunately, this stopped being easy to figure out quite a long time
ago in the disk-drive world, at least with the sort of certainty you
need to best-optimize a workload.  It used to be that ST506-style drives
had 17 sectors per track and RLL 2,7 ones had 26.  Then areal density
became the limit and variable geometry showed up, frustrating any
operating system (or disk controller!) that tried to issue, at the
driver level, one DMA command per physical track in an attempt to
capitalize on the fact that all but the first sector read in a given
rotation were essentially "free".

Modern drives typically try to compensate for their variable geometry
through their own read-ahead cache, but the exact details of the
algorithm are typically not exposed.

What I would love to find is a "buffered" controller that recognizes all
of this and works as follows:

1. Writes, when committed, are committed, and no return is made until
   the storage has actually written the data and claims it's on the
   disk.  If the sector(s) written are in the buffer memory (from a
   previous read, per 2 below), the write physically alters both the
   disk AND the buffer.

2. Reads are always one full track in size and go into the buffer memory
   on an LRU basis.  A read for a sector already in the buffer memory
   results in no physical I/O taking place.  The controller does not
   store sectors per se in the buffer; it stores tracks.

This requires that the adapter be able to discern the *actual*
underlying geometry of the drive so that it knows where the track
boundaries are.  Yes, I know drive caches themselves try to do this, but
how well do they manage?  The evidence suggests they're not particularly
effective.

Without that, read cache is a crapshoot that is difficult to tune and
very workload-dependent in terms of what delivers the best performance.
All you can do is tune (if you're able to with a given controller) and
test.
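For what it's worth, the buffering scheme in points 1 and 2 above
amounts to something like the following minimal sketch.  The fixed
geometry and the read_track()/write_sectors() backing interface are made
up for illustration only; a real controller would have to discover the
actual track boundaries from the drive:

    from collections import OrderedDict

    # Sketch of the "buffered controller" idea: the cache holds whole
    # tracks (not individual sectors), reads fill it on an LRU basis,
    # and writes go to the disk first and then patch any cached track.
    # Geometry and backing-store interface are made up for this sketch.

    SECTORS_PER_TRACK = 256   # assumed fixed geometry; real drives vary by zone

    class TrackBuffer:
        def __init__(self, backing, max_tracks):
            # 'backing' is assumed to provide read_track(track_no) -> list of
            # sector buffers, and write_sectors(lba, data) that returns only
            # after the data is actually on the platter.
            self.backing = backing
            self.max_tracks = max_tracks       # whole tracks that fit in buffer RAM
            self.tracks = OrderedDict()        # track number -> list of sectors

        def _track_of(self, lba):
            return lba // SECTORS_PER_TRACK

        def read(self, lba):
            t = self._track_of(lba)
            if t not in self.tracks:
                # Miss: read the entire track in one pass of the platter.
                self.tracks[t] = self.backing.read_track(t)
                if len(self.tracks) > self.max_tracks:
                    self.tracks.popitem(last=False)   # evict least recently used
            self.tracks.move_to_end(t)                # mark most recently used
            return self.tracks[t][lba % SECTORS_PER_TRACK]

        def write(self, lba, data):
            # Write-through: do not acknowledge until the backing store
            # says the data is physically on the disk.
            self.backing.write_sectors(lba, data)
            # Keep any buffered copy coherent rather than discarding it.
            t = self._track_of(lba)
            if t in self.tracks:
                self.tracks[t][lba % SECTORS_PER_TRACK] = data

The essential properties are that eviction is by whole track on an LRU
basis, a write is never acknowledged before the backing store claims it
is on the platter, and a write updates rather than invalidates any
buffered copy of its track.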
--
-- Karl
karl@denninger.net

Want to link to this message? Use this URL:
<https://mail-archive.FreeBSD.org/cgi/mid.cgi?537DF2F3.10604>
