From owner-freebsd-current@FreeBSD.ORG  Tue Apr 20 06:57:24 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 15430106564A
	for <freebsd-current@freebsd.org>; Tue, 20 Apr 2010 06:57:24 +0000 (UTC)
	(envelope-from ehrmann@gmail.com)
Received: from mxout-08.mxes.net (mxout-08.mxes.net [216.86.168.183])
	by mx1.freebsd.org (Postfix) with ESMTP id E27438FC17
	for <freebsd-current@freebsd.org>; Tue, 20 Apr 2010 06:57:23 +0000 (UTC)
Received: from [10.0.0.171] (unknown [64.9.241.228])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by smtp.mxes.net (Postfix) with ESMTPSA id 88EE0509B4
	for <freebsd-current@freebsd.org>; Tue, 20 Apr 2010 02:57:22 -0400 (EDT)
Message-ID: <4BCD5049.8030408@gmail.com>
Date: Mon, 19 Apr 2010 23:57:13 -0700
From: David Ehrmann <ehrmann@gmail.com>
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
To: freebsd-current@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Strange disk problem
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 20 Apr 2010 06:57:24 -0000

Initially, I noticed a problem where reading a file on this machine 
seemed to stop--something like a video would just stop playing.  At 
first, I thought it was the machine, but a new motherboard, CPU, and RAM 
later, the problem persists.  The network card uses a different chipset, 
too.

The files are on zfs, but scrubs are fine, and zpool status lists no 
errors of any kind.  Trying to reproduce the problem, I set up a script 
that reads a random 1M block every 60 seconds off the drive backing 
zfs.  That's when I noticed something: one disk seems to be causing the 
problems.  I logged the dd times, and some of them were huge--more than 
a minute.  The times on the other disk in the mirrored vdev were low.
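
For anyone who wants to try the same kind of test, here is a rough Python 
sketch of such a probe.  It is not the script I actually ran (that one used 
dd); the function names, the device path, and the idea of passing the device 
size in by hand are all assumptions made for illustration.

```python
import os
import random
import time

BLOCK = 1 << 20  # 1 MiB, matching the 1M block size described above


def time_random_read(path, size_bytes, block=BLOCK):
    """Read one block at a random block-aligned offset.

    Returns (offset, elapsed_seconds). size_bytes is supplied by the
    caller; this sketch does not try to detect the device size itself.
    """
    blocks = size_bytes // block
    if blocks < 1:
        raise ValueError("target is smaller than one block")
    offset = random.randrange(blocks) * block
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.monotonic()
        data = os.pread(fd, block, offset)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    if len(data) != block:
        raise OSError(f"short read at offset {offset}: {len(data)} bytes")
    return offset, elapsed


def probe(path, size_bytes, interval=60, count=None):
    """Log one timed random read every `interval` seconds.

    Runs forever when count is None, otherwise stops after `count` reads.
    """
    n = 0
    while count is None or n < count:
        offset, elapsed = time_random_read(path, size_bytes)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {path} offset={offset} {elapsed:.3f}s", flush=True)
        n += 1
        if count is None or n < count:
            time.sleep(interval)
```

Running probe() against each disk in the mirror (for example a /dev/adX 
path, with the size taken from something like diskinfo output) and comparing 
the logged times should show whether one disk stands out.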

I've only seen the problem when I have a VM's disk image hosted on the 
machine.  That said, the network interface is configured at 100 Mbps, so 
there's no reason for that to saturate the disk's throughput.  Top 
reports that almost 20% of the CPU is going towards interrupts.  I can 
read a file off the zfs pool at over 50MB/s, so that shouldn't be a 
problem.  One thing I'm wondering is why the disk read doesn't time out 
quickly?  At least that way zfs could try to use the other drive in the 
mirrored vdev.

Any ideas?  One thing I should try is swapping the drives to see whether the 
problem follows the disk or stays with the lowest /dev/adX device.  I'm 
using geli, but the read problems happen with both /dev/adX and 
/dev/adX.eli, so I don't think that's it.  I've seen the problem with 
Samba, NFS, and dd.

Thanks in advance.