Date: Tue, 13 Oct 2009 14:14:15 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Ivan Voras <ivoras@freebsd.org>
In-Reply-To: <hb1qs0$qjd$1@ger.gmane.org>
Message-ID: <alpine.BSF.2.00.0910131406340.26071@fledge.watson.org>
References: <E316139E-FFCF-432F-8DCE-62B120C38E55@exscape.org>
	<CC16B639-7A75-4016-A8A8-5C59E9CD5E95@exscape.org>
	<hb1qs0$qjd$1@ger.gmane.org>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-stable@freebsd.org
Subject: Re: Extreme console latency during disk IO (8.0-RC1,
 previous releases also affected according to others)


On Tue, 13 Oct 2009, Ivan Voras wrote:

> Thomas Backman wrote:
>> I'm copying this over from the freebsd-performance list, as I'm looking for 
>> a few more opinions - not on the problems *I* am having, but rather to 
>> check whether the problem is universal or not, and if not, find a possible 
>> common factor. In other words: I want to hear about your experiences, *good 
>> or bad*!
>> 
>> Here's the original thread (not from the beginning, though): 
>> http://lists.freebsd.org/pipermail/freebsd-performance/2009-October/003843.html
>> 
>> Long story short, my version: when the disk is stressed hard enough, 
>> console IO becomes COMPLETELY unbearable. 10+ seconds to switch between 
>> windows in screen(1), running (or even typing) simple commands, etc. This 
>> happens both via SSH and the serial console.
>
> Hmm, this looks familiar - I've noticed it before on the physical (VGA) 
> console and I notice it all the time under VMWare. It sort of looks like 
> disk IO really blocks network IO in this case - I use the VMs over ssh.

Real hardware and virtual hardware have vastly different performance 
properties, so I'd be careful not to assume that the issue described by the 
original reporter and the issue you're experiencing are the same.  In our 
kernel, low-level network protocols will essentially always take precedence 
over disk I/O activity.  So, taken at face value, "disk IO really blocks 
network IO" is highly unlikely.

There are two much more likely possibilities: (1) a poor VM implementation 
causes the virtual CPU to be suspended behind synchronous host OS I/O, or (2) 
the network stack is running fine but the interactive user application is 
getting its I/O or locks scheduled behind a bulk process.

A useful diagnostic here is to compare the behavior of three kinds of network 
latency tests:

(1) ping from the host OS to the guest OS
(2) netperf TCP_RR from the host OS to the guest OS
(3) ssh interactive latency
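
To make "highly variable" concrete, one option is to collect round-trip 
samples for each test with the disk idle and again under load, then compare 
the spread.  The sketch below is only an illustration: the helper name 
latency_spread, the max/median ratio as a measure of spread, and the sample 
values are all assumptions, not something any of the tools above produce 
directly.

```python
import statistics

def latency_spread(samples_ms):
    """Summarize round-trip samples (in milliseconds).

    Hypothetical helper: returns the median, the worst sample, and the
    ratio of worst to median as a crude indicator of variability.
    """
    if not samples_ms:
        raise ValueError("need at least one sample")
    median = statistics.median(samples_ms)
    worst = max(samples_ms)
    # A ratio near 1 means steady latency; a large ratio means stalls.
    ratio = worst / median if median > 0 else float("inf")
    return {"median": median, "max": worst, "ratio": ratio}

# Made-up samples, purely for illustration.
idle = latency_spread([0.4, 0.5, 0.4, 0.6, 0.5])
loaded = latency_spread([0.5, 0.4, 950.0, 0.6, 1200.0])
```

Comparing the ratio for the idle and loaded runs of each test shows which 
of (1), (2) and (3) actually degrades during disk I/O.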

If (1) is highly variable during I/O, it's almost certainly a property of the 
VM technology you're using, and there's nought to be done about it in the 
guest OS.

If (2) but not (1) is highly variable, it may well be a scheduling issue, 
although under high memory pressure you couldn't rule out latency caused by 
netserver pages (etc.) being paged out.

If (3) but not (1) or (2) is highly variable, it's most likely an I/O 
scheduling issue, perhaps caused by priority inversion on lockmgr locks on a 
vnode, disk I/O scheduling leading to starvation, etc.
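
The three cases above amount to a small decision procedure, which could be 
sketched as follows.  The function name and the boolean inputs are 
hypothetical; deciding whether a given test counts as "highly variable" is 
left to whoever runs the measurements.

```python
def likely_cause(ping_variable, tcp_rr_variable, ssh_variable):
    """Map which tests show highly variable latency to the likely cause.

    Hypothetical helper mirroring the three cases described above;
    the tests are checked in order (1), (2), (3).
    """
    if ping_variable:
        # Case (1): even ping suffers, so suspect the VM technology.
        return "property of the VM technology; not fixable in the guest OS"
    if tcp_rr_variable:
        # Case (2): netperf TCP_RR suffers but ping does not.
        return "probably scheduling, or paging under high memory pressure"
    if ssh_variable:
        # Case (3): only the interactive ssh session suffers.
        return "probably I/O scheduling (priority inversion, starvation)"
    return "no latency variation seen in these tests"

# Example: ping and TCP_RR are steady, but ssh stalls during disk I/O.
diagnosis = likely_cause(False, False, True)
```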

Robert