From owner-freebsd-hackers@FreeBSD.ORG  Mon Apr 15 20:43:15 2013
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id F3E1ABC1;
 Mon, 15 Apr 2013 20:43:14 +0000 (UTC)
 (envelope-from phk@phk.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
 by mx1.freebsd.org (Postfix) with ESMTP id 825BA1AA1;
 Mon, 15 Apr 2013 20:43:14 +0000 (UTC)
Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3])
 by phk.freebsd.dk (Postfix) with ESMTP id EF36E89FBE;
 Mon, 15 Apr 2013 20:43:06 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
 by critter.freebsd.dk (8.14.6/8.14.6) with ESMTP id r3FKh6bm038497;
 Mon, 15 Apr 2013 20:43:06 GMT (envelope-from phk@phk.freebsd.dk)
To: Alexander Motin <mav@FreeBSD.org>
Subject: Re: devstat overhead VS precision
In-reply-to: <516C515A.9090602@FreeBSD.org>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
References: <51692C95.3010901@FreeBSD.org>
 <20130415184203.GA1839@garage.freebsd.pl> <516C515A.9090602@FreeBSD.org>
Content-Type: text/plain; charset=ISO-8859-1
Date: Mon, 15 Apr 2013 20:43:06 +0000
Message-ID: <38496.1366058586@critter.freebsd.dk>
X-Mailman-Approved-At: Mon, 15 Apr 2013 20:55:39 +0000
Cc: freebsd-hackers@FreeBSD.org, freebsd-geom@FreeBSD.org
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 15 Apr 2013 20:43:15 -0000

In message <516C515A.9090602@FreeBSD.org>, Alexander Motin writes:

>>> I propose to switch that
>>> statistics from using binuptime() to getbinuptime() to solve the problem
>>> globally.

>> No objections here, but I wonder if you were able to compare the results
>> somehow before and after the change so we have some hard numbers to show
>> that we don't lose much by applying the change.
>
>I haven't tested it statistically, but I haven't noticed any visual 
>difference in gstat output with its 0.1ms displayed resolution.

I have tested it statistically, back when I wrote GEOM:  It leads
to very significant statistical bias.

Just about the only thing in devstat that has any predictive power
with respect to filesystem performance is the latency, which measures
how long it takes to satisfy each I/O request.

If you run gstat(8), these are the "ms/*" numbers:  milliseconds per
this or that.

The rest of what's in devstat, with the exception of the queue-length
("L(q)"), has almost no predictive power and is, IMO, practically
pointless.  In particular the %busy is totally misleading and I
deeply regret that I didn't fight to kill it back then.

If you switch to getbinuptime(), the latency measurements will only
be precise if the I/O operations take much longer than the timecounter
update period, which is not guaranteed to be 1000 Hz btw.
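
To make the precision argument concrete, here is a minimal userland
sketch, not kernel code.  It assumes a 1 ms update period and models
the cached clock as returning the time of its most recent update; the
names coarse_now_us() and measured_latency_us() are invented for
illustration only.

```c
#include <stdint.h>

/* Hypothetical update period of the cached timestamp, in microseconds. */
#define TICK_US 1000

/* Model of a cached clock: returns the time of the most recent update. */
static uint64_t
coarse_now_us(uint64_t true_now_us)
{
	return (true_now_us - true_now_us % TICK_US);
}

/* Latency as seen when both endpoints are read from the cached clock. */
static uint64_t
measured_latency_us(uint64_t start_us, uint64_t true_latency_us)
{
	return (coarse_now_us(start_us + true_latency_us) -
	    coarse_now_us(start_us));
}
```

Under these assumptions a 200 usec I/O is reported as either 0 or
1000 usec depending on where in the tick it starts, while a 50 msec
I/O is off by at most one tick.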

For measuring how much USB-sticks suck, that will work fine.

For tuning anything on a non-ridiculous SSD device or modern
harddisks, it will be useless because the bias you introduce is
*not* one which averages out over many operations.

The fundamental problem is that on a busy system, getbinuptime()
does not get called at random times; when it is called is heavily
affected by the I/O traffic, because of the interrupts, the
bus-traffic itself, the cache-effects of I/O transfers and the
context-switches by the processes causing the I/O.
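
A small deterministic sketch of why non-random call times matter.
This is a toy model, not a measurement: it assumes a 1 ms update
period and compares I/O start times spread evenly over the whole tick
with start times clustered in the first part of the tick.  The
function name and the clustering model are assumptions made for
illustration.

```c
#include <stdint.h>

#define TICK_US 1000	/* hypothetical update period, microseconds */

/*
 * Mean latency reported by a cached clock for I/Os lasting true_us
 * whose start phases within a tick are spread evenly over
 * [0, phase_limit_us).  Requires 0 < phase_limit_us <= TICK_US.
 */
static double
mean_measured_us(uint64_t true_us, uint64_t phase_limit_us)
{
	uint64_t phase, end, sum = 0;

	for (phase = 0; phase < phase_limit_us; phase++) {
		end = phase + true_us;
		/* The start reads as 0; the end reads as the last tick. */
		sum += end - end % TICK_US;
	}
	return ((double)sum / (double)phase_limit_us);
}
```

With start phases covering the whole tick, the mean for a 200 usec
I/O comes out at 200 usec, so the quantization error averages away.
With start phases confined to the first half of the tick, every
sample reads 0 and no amount of averaging recovers the true value.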

So yes, you can switch to getbinuptime(), but the only statistically
responsible way to do so would be to suppress latency measurements
on all I/O operations which complete in less than 5-10 timecounter
interrupts.

Apart from some practical issues implementing it, the numbers
that come out would be pretty useless.

The right idea is probably to bucketize the latencies, so that
rather than having to keep track of devstat in real time to find
out, you could get a histogram at any time showing past
performance, something like:

	Latency distribution:

		<5msec:		92.12 %
		<10msec:	 0.17 %
		<20msec:	 1.34 %
		<50msec:	 6.37 %
		>50msec:	 0.00 %

Doing that with getbinuptime() would be statistically defensible
provided the top bucket is "<5msec" and it would very clearly tell
people if they have I/O trouble or not, which IMO is what people
want to know.
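
A minimal sketch of the bucketing, using the bucket limits from the
example output above.  The names lat_bucket(), lat_record() and
lat_bound_us are hypothetical, latencies are taken in microseconds,
and a latency of exactly 50 msec is assumed to fall in the last
bucket.

```c
#include <stddef.h>
#include <stdint.h>

#define LAT_NBUCKETS 5

/* Upper bounds of the first four buckets, in microseconds. */
static const uint64_t lat_bound_us[LAT_NBUCKETS - 1] = {
	5000, 10000, 20000, 50000
};

/* Return the index of the bucket a latency belongs to. */
static size_t
lat_bucket(uint64_t latency_us)
{
	size_t i;

	for (i = 0; i < LAT_NBUCKETS - 1; i++)
		if (latency_us < lat_bound_us[i])
			return (i);
	return (LAT_NBUCKETS - 1);
}

/* Count one completed I/O in the histogram. */
static void
lat_record(uint64_t hist[LAT_NBUCKETS], uint64_t latency_us)
{
	hist[lat_bucket(latency_us)]++;
}
```

Since the narrowest bucket is 5 msec wide, an error of one update
period in the timestamp rarely moves a sample into a different
bucket, assuming the update period is around 1 msec.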

The cost is 20 64-bit counters in struct devstat, (N|R|W|E)*5*8 = 160
bytes, but since devstat is already 288 bytes, that isn't a major
catastrophe.
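
The size arithmetic, spelled out as a sketch.  The struct and enum
names are hypothetical, and reading (N|R|W|E) as no-data, read, write
and erase is my assumption; nothing here is the actual struct devstat
layout.

```c
#include <stdint.h>

/* Assumed meaning of the four transaction types (N|R|W|E). */
enum lat_type {
	LAT_NODATA,
	LAT_READ,
	LAT_WRITE,
	LAT_ERASE,
	LAT_NTYPES
};

#define LAT_NBUCKETS 5

/* One 64-bit counter per transaction type and latency bucket. */
struct lat_hist {
	uint64_t count[LAT_NTYPES][LAT_NBUCKETS];
};

/* 4 types * 5 buckets * 8 bytes = 160 bytes. */
_Static_assert(sizeof(struct lat_hist) == 160,
    "latency histogram should be 160 bytes");
```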

The ability to measure latency precisely should be retained, but it
could be made a sysctl-enabled debugging facility.

The %busy crap should be killed, all it does is confuse people.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.