Date: Tue, 14 Jun 2005 22:20:03 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Glenn Dawson
Cc: freebsd-performance@freebsd.org
Subject: Re: vn(4) performance on 4.11 versus md(4) on 5.4

On Sat, 4 Jun 2005, Glenn Dawson wrote:

> I have a number of systems running 4.11 that have file-backed virtual
> disks, each of which contains a jail.  I need to start using 5.4 for
> new servers.  The catch is, file-backed virtual disks using md(4) seem
> to be much slower than similar virtual disks on 4.11 using vn(4).
> vn(4) on 4.11 is about 2.24 times faster than the equivalent setup
> using md(4) on 5.4.
>
> I've posted the results of some tests that I ran at
> http://www.antimatter.net/md-versus-vn.txt
>
> Is this decrease in performance known?  Is there something I can do in
> order to come close to the performance that 4.11 has?  I've tried
> changing some of the parameters of the filesystem on the virtual disk,
> but the performance didn't change.

Writes by md are now synchronous.  Try turning this off using
"mdconfig -o async ...", though that is probably too dangerous to use
in production -- the sync writes are a hack to work around hangs, and
my system hung almost instantly while testing this.

For copying a cached copy of /usr/src/sys/ (~100MB) on an old de-GEOMed
version of -current, with all file systems mounted -async -noatime, I
got the following times:

# ffs1 fs on ad2s2d
        6.21 real         0.52 user         3.39 sys
# ffs2 fs on md2 (default) on file zz on the previous fs
       63.83 real         0.56 user         3.34 sys
# ffs2 fs on md3 (-o async) on the same file (after detaching md2)
       16.10 real         0.50 user         3.40 sys

Syncing of the last fs deadlocked the file systems on md3 and ad2s2d
:-( but not the others.

For dd'ing /dev/zero to a large file, the sync writes gave a loss of
performance of almost exactly your factor of 2.24 relative to the
non-md fs: the raw disk speed is about 55MB/sec; writing to the native
ffs gave 54MB/sec by mostly writing with a physical block size of 64K,
while writing via md2 gave only 25MB/sec by always writing with a
physical block size of 16K.  The size of 64K results from clustering,
and the size of 16K results from sync writes breaking clustering (md
always writes the fs block size, which is 16K in my tests, and since
the writes are sync they must be done individually, so they cannot be
clustered).
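For concreteness, the two md setups above correspond roughly to
commands like these (a sketch only; the backing file path and mount
point are invented here, and remember that "-o async" risks
deadlocking the entire kernel):

    # file-backed md with the default (sync) write behaviour
    mdconfig -a -t vnode -f /var/tmp/zz -u 2
    newfs /dev/md2
    mount -o async,noatime /dev/md2 /mnt

    # detach, then reattach the same file with sync writes disabled
    umount /mnt
    mdconfig -d -u 2
    mdconfig -a -t vnode -f /var/tmp/zz -u 3 -o async
    newfs /dev/md3
    mount -o async,noatime /dev/md3 /mnt

    # simple sequential-write throughput test
    dd if=/dev/zero of=/mnt/bigfile bs=1m count=1024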
From mdconfig(1):

% -o [no]option
%         Set or reset options.
%
%         [no]async
%                 For vnode backed devices: avoid IO_SYNC for increased
%                 performance but at the risk of deadlocking the entire
%                 kernel.
% ...
%         [no]cluster
%                 Enable clustering on this disk.

A nearby bug in md is that "-o cluster" has always been silently
ignored.  I think we decided that it is the user's responsibility to
mount md-backed file systems (and other file systems on non-physical or
memory-like devices) with -o noclusterw -o noclusterr to prevent
wasteful clustering (see the P.S. below for an example invocation).
This is easy to forget, however.

vn used to turn off clustering unconditionally to avoid some deadlock
problems, but this was removed long before 4.11, when the deadlock
problems were supposed to be fixed, so turning off clustering was
supposed to be only a small optimization.  Try turning it off to see
if it reduces deadlocks.

From md.c's cvs history:

% RCS file: /home/ncvs/src/sys/dev/md/md.c,v
% Working file: md.c
% head: 1.124
% ...
% ----------------------------
% revision 1.115
% date: 2004/03/10 20:41:08;  author: phk;  state: Exp;  lines: +5 -3
% Fix a long-standing deadlock issue with vnode backed md(4) devices:
%
% On vnode backed md(4) devices over a certain, currently undetermined
% size relative to the buffer cache our "lemming-syncer" can provoke
% a buffer starvation which puts the md thread to sleep on wdrain.
%
% This generally tends to grind the entire system to a stop because the
% event that is supposed to wake up the thread will not happen until a
% fair bit of the piled up I/O requests in the system finish, and since
% a lot of those are on a md(4) vnode backed device which is currently
% waiting on wdrain until a fair amount of the piled up ... you get the
% picture.
%
% The cure is to issue all VOP_WRITEs on the vnode backing the device
% with IO_SYNC.
%
% In addition to more closely emulating a real disk device with a
% non-lying write-cache, this makes the writes exempt from rate-limiting
% (there to avoid starving the buffer cache) and consequently prevents
% the deadlock.
%
% Unfortunately performance takes a hit.
%
% Add "async" option to give people who know what they are doing the
% old behaviour.
% ----------------------------

Bruce
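P.S.  The noclusterr/noclusterw mount mentioned above would look
something like this (device and mount point again invented for
illustration):

    mount -o noatime,noclusterr,noclusterw /dev/md2 /mnt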