Date: Fri, 11 Oct 2013 15:39:53 -0700
From: Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
To: John-Mark Gurney <jmg@funkthat.com>
Cc: Maksim Yevmenkin <emax@freebsd.org>, "current@freebsd.org" <current@freebsd.org>
Subject: Re: [rfc] small bioq patch
Message-ID: <72DA2C4F-44F0-456D-8679-A45CE617F8E6@gmail.com>
In-Reply-To: <20131011215210.GY56872@funkthat.com>
References: <CAFPOs6pXhDjj1JTY0JNaw8g=zvtw9NgDVeJTQW-=31jwj321mQ@mail.gmail.com> <20131011215210.GY56872@funkthat.com>
> On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <jmg@funkthat.com> wrote:
>
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> i would like to submit the attached bioq patch for review and
>> comments. this is a proof of concept. it helps with smoothing disk read
>> service times and appears to eliminate outliers. please see the attached
>> pictures (about a week's worth of data)
>>
>> - c034 "control" unmodified system
>> - c044 patched system
>
> Can you describe how you got this data? Were you using the gstat
> code or some other code?

Yes, it's basically gstat data.

> Also, was your control system w/ the patch, but w/ the sysctl set to
> zero to possibly eliminate any code alignment issues?

Both systems use the same code base and build. The patched system has the patch included; the "control" system does not have the patch. I can rerun my tests with the sysctl set to zero and use that as the "control". So, the answer to your question is "no".

>> graphs show max/avg disk read service times for both systems across 36
>> spinning drives. both systems are relatively busy serving production
>> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> represent time when systems are refreshing their content, i.e. disks
>> are both reading and writing at the same time.
>
> Can you describe why you think this change makes an improvement? Unless
> you're running 10k or 15k RPM drives, 128 seems like a large number.. as
> that's about half the number of IOPs that a normal HD handles in a second..

Our (Netflix) load is basically random disk io. We have tweaked the system to ensure that our io path is "wide" enough, i.e. we read 1 MB per disk io for the majority of requests. However, the offsets we read from are all over the place. It appears that we are getting into a situation where larger offsets are getting delayed because smaller offsets keep "jumping" ahead of them. Forcing a bioq insert tail operation and effectively moving the insertion point seems to help us avoid getting into this situation (see the sketch at the end of this mail). And, no, we don't use 10k or 15k drives; just regular enterprise 7200 rpm sata drives.

> I assume you must be regularly seeing queue depths of 128+ for this
> code to make a difference, do you see that w/ gstat?

No, we don't see large (128+) queue sizes in the gstat data. The way I see it, we don't have to have a deep queue here. We could just have a steady stream of io requests where new, smaller offsets consistently "jump" ahead of older, larger offsets. In fact, the gstat data show a shallow queue of 5 or fewer items.

> Also, do you see a similar throughput of the system?

Yes, we see almost identical throughput from both systems. I have not pushed the system to its limit yet, but having much smoother disk read service times is important for us because we use them as one of the components of our system health metrics. We also need to ensure that a disk io request is actually dispatched to the disk in a timely manner.

Thanks,
Max
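
P.S. For anyone who hasn't looked at the bioq code recently, below is a rough user-space sketch of the general idea. It is not the patch itself: the real code works on struct bio_queue_head via bioq_disksort() / bioq_insert_tail(), while the structure and function names and the batch-size value here (struct ioq, ioq_insert(), batch_size = 4) are made up purely for illustration. The point is only that once a tunable batch count is exceeded, new requests are appended at the tail instead of being sorted in by offset, so a steady stream of small offsets cannot keep jumping ahead of an older, larger offset.

/*
 * Illustrative sketch only -- not the actual bioq patch.  It models a
 * one-way elevator queue keyed on offset plus a tunable "batch size"
 * (cf. the sysctl mentioned above; the value is hypothetical) after
 * which new requests are forced to the tail of the queue.
 */
#include <stdio.h>
#include <stdlib.h>

struct ioreq {
    long           offset;      /* byte offset of the request */
    struct ioreq  *next;
};

struct ioq {
    struct ioreq  *head;
    int            batch;       /* sorted inserts since last dispatch */
    int            batch_size;  /* 0 = always sort (stock behaviour) */
};

static void
ioq_insert(struct ioq *q, struct ioreq *r)
{
    struct ioreq **pp = &q->head;

    if (q->batch_size != 0 && q->batch >= q->batch_size) {
        /* Forced tail insert: walk to the end of the queue. */
        while (*pp != NULL)
            pp = &(*pp)->next;
    } else {
        /* Elevator insert: keep the queue sorted by offset. */
        while (*pp != NULL && (*pp)->offset <= r->offset)
            pp = &(*pp)->next;
        q->batch++;
    }
    r->next = *pp;
    *pp = r;
}

static struct ioreq *
ioq_takefirst(struct ioq *q)
{
    struct ioreq *r = q->head;

    if (r != NULL) {
        q->head = r->next;
        q->batch = 0;   /* dispatching resets the batch counter */
    }
    return (r);
}

int
main(void)
{
    struct ioq q = { NULL, 0, 4 };  /* batch_size of 4, purely illustrative */
    long offsets[] = { 900, 100, 200, 300, 400, 50, 60 };
    struct ioreq *r;
    size_t i;

    for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
        r = malloc(sizeof(*r));
        r->offset = offsets[i];
        ioq_insert(&q, r);
    }
    /* With batch_size = 4, offsets 50 and 60 no longer jump ahead of 900. */
    while ((r = ioq_takefirst(&q)) != NULL) {
        printf("%ld\n", r->offset);
        free(r);
    }
    return (0);
}

Compiled and run as-is, the queue drains as 100 200 300 900 400 50 60; with batch_size set to 0 you get the fully sorted order 50 60 100 200 300 400 900, i.e. the small late arrivals jump ahead of the 900 request.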