Date: Tue, 16 Apr 2013 00:31:40 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: freebsd-hackers@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: Re: devstat overhead VS precision
Message-ID: <516C71BC.4000902@FreeBSD.org>
In-Reply-To: <38496.1366058586@critter.freebsd.dk>
References: <51692C95.3010901@FreeBSD.org> <20130415184203.GA1839@garage.freebsd.pl> <516C515A.9090602@FreeBSD.org> <38496.1366058586@critter.freebsd.dk>

On 15.04.2013 23:43, Poul-Henning Kamp wrote:
> In message <516C515A.9090602@FreeBSD.org>, Alexander Motin writes:
>
>>>> I propose to switch that
>>>> statistics from using binuptime() to getbinuptime() to solve the problem
>>>> globally.
>
>>> No objections here, but I wonder if you were able to compare the results
>>> somehow before and after the change so we have some hard numbers to show
>>> that we don't lose much by applying the change.
>>
>> I haven't tested it statistically, but I haven't noticed any visual
>> difference in gstat output with its 0.1ms displayed resolution.
>
> I have tested it statistically, back when I wrote GEOM: It leads
> to very significant statistical bias.
>
> Just about the only thing in devstat that has any predictive power
> with respect to filesystem performance, is the latency, which measures
> how long time it takes to satisfy each I/O request.
>
> If you run gstat(8), this is the "ms/*" numbers: milliseconds per
> this or that.
>
> The rest of what's in devstat, with the exception of the queue-length
> ("L(q)") has almost no predictive power, and is IMO, practically
> pointless. In particular the %busy is totally misleading and I
> deeply regret that I didn't fight to kill it back then.
>
> If you switch to getbinuptime(), the latency measurements will only
> be precise if the I/O operations take much longer than the timecounter
> update period, which is not guaranteed to be 1000 Hz btw.
>
> For measuring how much USB-sticks suck, that will work fine.
>
> For tuning anything on a non-ridiculous SSD device or modern
> harddisks, it will be useless because of the bias you introduce is
> *not* one which averages out over many operations.

Could you please explain why? Unless disk I/O is somehow aliased to
hardclock(), each of them should get a random error from 0 to
max(1ms, 1s/HZ). With a large number of I/Os that error should be hidden
when calculating the average time.
I am not talking about microseconds, but I think a fraction of a
millisecond should be realistic to get.

> The fundamental problem is that on a busy system, getbinuptime()
> does not get called at random times, it will be heavily affected
> by the I/O traffic, because of the interrupts, the bus-traffic
> itself, the cache-effects of I/O transfers and the context-switches
> by the processes causing the I/O.

I'm sorry, but I am not sure I understand the above paragraphs. Do you
mean that in some realistic conditions (not counting entering the
debugger with disabled interrupts, etc.) hardclock() can be delayed by
more than some significant percentage of its period, and that this
depends on the I/O traffic itself? Or do you mean that disk I/Os are
somehow aliased with hardclock(), making it impossible to hide the error
by averaging?

> So yes, you can switch to getbinuptime(), but the only statistical
> responsible way to do so, would be to suppress latency measurements
> on all I/O operations which complete in less than 5-10 timecounter
> interrupts.

Sure, getbinuptime() won't allow answering how many requests completed
within 0.5ms, but the present API doesn't allow calculating that anyway,
providing only total/average times. And why "_5-10_ timecounter
interrupts"?

> Apart from some practical issues implementing it, the numbers
> that came out would be pretty useless.
>
> The right idea is probably to bucketize the latencies, so that
> rather than having to keep track of devstat in real time to find
> out, you could get a histogram at any time showing past
> performance something like:
>
>	Latency distribution:
>
>	<5msec:		92.12 %
>	<10msec:	 0.17 %
>	<20msec:	 1.34 %
>	<50msec:	 6.37 %
>	>50msec:	 0.00 %
>
> Doing that with getbinuptime() would be statistically defensible
> provided the top bucket is "<5msec" and it would very clearly tell
> people if they have I/O trouble or not, which IMO is what people
> want to know.
>
> The cost 20 64bit counters in struct devstat (N|R|W|E)*5*8 = 160
> bytes, but since devstat is already 288 bytes, that isn't a major
> catastrophe.

I agree that such functionality could be interesting. The only worry is
which buckets should be there. For modern HDDs the above buckets could
be fine. For high-end SSDs it may be about microseconds rather than
milliseconds. I doubt that 5 buckets will be universal enough, unless
they are separated by a factor of 5-10.

> The ability to measure latency precisely should be retained, but it
> could be made a sysctl enabled debugging facility.
>
> The %busy crap should be killed, all it does is confuse people.

I agree that it heavily lies, especially for cached writes, but at least
it allows making some very basic estimates. The value has a valid
explanation, and the only problem is that users are misinterpreting it.

-- 
Alexander Motin
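
[Editorial sketch, not part of the original message.] The disagreement
above is about whether the error of a tick-granular clock averages out.
The toy simulation below may help to see both positions. It is only a
model under stated assumptions: the clock is assumed to update exactly
every 1 ms, every request is assumed to take exactly 0.3 ms, and the
names coarse() and mean_measured_ns() are made up for this illustration.
It does not call or reproduce any FreeBSD code. When request start times
are uniformly spread within a tick, the averaged measurement comes out
close to the true latency; when start times cluster just after a tick
(one possible form of aliasing), the average is biased.

```python
import random

TICK_NS = 1_000_000   # assumption: clock value is refreshed every 1 ms
TRUE_NS = 300_000     # assumption: every request really takes 0.3 ms


def coarse(t_ns, tick_ns=TICK_NS):
    """Timestamp as a tick-granular clock would report it: the value
    cached at the most recent tick boundary."""
    return (t_ns // tick_ns) * tick_ns


def mean_measured_ns(n, phase_span_ns, rng):
    """Average measured latency over n requests whose start offsets
    inside a tick are uniform in [0, phase_span_ns)."""
    total = 0
    for i in range(n):
        # Space requests 10 ticks apart; only the offset within a tick matters.
        start = i * 10 * TICK_NS + rng.randrange(phase_span_ns)
        total += coarse(start + TRUE_NS) - coarse(start)
    return total / n


rng = random.Random(1)
N = 200_000

# Start times unrelated to the tick: offsets cover the whole tick.
unaliased = mean_measured_ns(N, TICK_NS, rng)

# Start times clustered in the first 5% of each tick: a request that
# lasts 0.3 ms never crosses a tick boundary, so it is measured as 0.
aliased = mean_measured_ns(N, TICK_NS // 20, rng)

print(f"true latency      : {TRUE_NS} ns")
print(f"unaliased average : {unaliased:.0f} ns")
print(f"aliased average   : {aliased:.0f} ns")
```

Integer nanoseconds are used so that the tick arithmetic is exact. The
sketch says nothing about how strongly real I/O is correlated with
hardclock(); that is exactly the open question in the thread.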