From owner-freebsd-hackers@FreeBSD.ORG Tue Apr 16 06:24:15 2013
Return-Path:
Delivered-To: freebsd-hackers@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
	by hub.freebsd.org (Postfix) with ESMTP id 5B9E13BE;
	Tue, 16 Apr 2013 06:24:15 +0000 (UTC)
	(envelope-from phk@freebsd.org)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 244FCEF8;
	Tue, 16 Apr 2013 06:24:14 +0000 (UTC)
Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id F205389FBE;
	Tue, 16 Apr 2013 06:24:13 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.6/8.14.6) with ESMTP id r3G6OD8N040153;
	Tue, 16 Apr 2013 06:24:13 GMT (envelope-from phk@freebsd.org)
To: Alexander Motin
Subject: Re: devstat overhead VS precision
In-reply-to: <516C71BC.4000902@FreeBSD.org>
From: "Poul-Henning Kamp"
References: <51692C95.3010901@FreeBSD.org>
	<20130415184203.GA1839@garage.freebsd.pl>
	<516C515A.9090602@FreeBSD.org>
	<38496.1366058586@critter.freebsd.dk>
	<516C71BC.4000902@FreeBSD.org>
Content-Type: text/plain; charset=ISO-8859-1
Date: Tue, 16 Apr 2013 06:24:13 +0000
Message-ID: <40152.1366093453@critter.freebsd.dk>
Cc: freebsd-hackers@FreeBSD.org, freebsd-geom@FreeBSD.org
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
X-List-Received-Date: Tue, 16 Apr 2013 06:24:15 -0000

In message <516C71BC.4000902@FreeBSD.org>, Alexander Motin writes:
>On 15.04.2013 23:43, Poul-Henning Kamp wrote:
>> In message <516C515A.9090602@FreeBSD.org>, Alexander Motin writes:
>>
>> For tuning anything on a non-ridiculous SSD device or modern
>> harddisks, it will be useless, because the bias you introduce is
>> *not* one which averages out over many operations.
>
>Could you please explain why?
>
>> The fundamental problem is that on a busy system, getbinuptime()
>> does not get called at random times; it will be heavily affected
>> by the I/O traffic, because of the interrupts, the bus traffic
>> itself, the cache effects of I/O transfers and the context switches
>> by the processes causing the I/O.
>
>I'm sorry, but I am not sure I understand the above paragraphs.

That was the exact explanation you asked for, and I'm not sure I can
find a better way to explain it, but I'll try:

Your assumption that the error will cancel out implicitly assumes
that the timestamp returned from getbinuptime() is updated at times
which are totally independent of the I/O traffic whose latency you
are trying to measure.

That is not the case.

The interrupt which updates getbinuptime()'s cached timestamp is
affected a lot by the I/O traffic, for the various reasons I mention
above.

>Sure, getbinuptime() won't allow to answer how many requests completed
>within 0.5ms, but the present API doesn't allow to calculate that
>anyway, providing only total/average times.  And why "_5-10_
>timecounter interrupts"?

A: Yes, it actually does:  A userland application running on a
dedicated CPU core can poll the shared-memory devstat structure at
a very high rate and get very useful information about short
latencies.  Most people don't do that, because they don't care about
the difference between 0.5 and 0.45 milliseconds.

B: To get the systematic bias down to 10-20% of the measured
interval:  Each measurement is quantized to whole timecounter ticks,
so an interval spanning 5-10 ticks carries roughly one tick, i.e.
10-20%, of error.

>> Latency distribution:
>>
>>	 <5msec:  92.12 %
>>	<10msec:   0.17 %
>>	<20msec:   1.34 %
>>	<50msec:   6.37 %
>>	>50msec:   0.00 %
>
>I agree that such functionality could be interesting.  The only worry
>is which buckets should be there.  For modern HDDs the buckets above
>could be fine.  For high-end SSDs it may be microseconds first, then
>milliseconds.  I doubt that 5 buckets will be universal enough, unless
>they are separated by a factor of 5-10.
Remember what people use this for:  Answering the question "Does my
disk subsystem suck, and if so, how much?"

Buckets like the ones proposed will tell you that.

>> The %busy crap should be killed, all it does is confuse people.
>
>I agree that it heavily lies, especially for cached writes, but at
>least it allows making some very basic estimates.

For rotating disks:  It always lies.

For SSD:  It almost always lies.

Kill it.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.