From owner-freebsd-geom@FreeBSD.ORG Tue Jun  3 20:48:16 2014
Date: Tue, 3 Jun 2014 13:48:11 -0700
From: John-Mark Gurney
To: Frank Broniewski
Cc: freebsd-geom@freebsd.org
Subject: Re: Geom stripe bottleneck
Message-ID: <20140603204811.GJ31367@funkthat.com>
In-Reply-To: <538D9BC3.6040509@metrico.lu>
References: <538D9BC3.6040509@metrico.lu>
List-Id: GEOM-specific discussions and implementations

Frank Broniewski wrote this message on Tue, Jun 03, 2014 at 11:56 +0200:
> I have a stripe (RAID0) geom setup for my database's data. Currently I
> am applying some large updates on the data and I think the performance
> of my stripe could be better. But I am uncertain, so I thought I'd
> request some interpretation help from the community :)
>
> The stripe consists of two disks (WD VelociRaptor, 10,000 rpm):
>
> >diskinfo -v ada2
> ada2
>         512             # sectorsize
>         600127266816    # mediasize in bytes (558G)
>         1172123568      # mediasize in sectors
>         0               # stripesize
>         0               # stripeoffset
>         1162821         # Cylinders according to firmware.
>         16              # Heads according to firmware.
>         63              # Sectors according to firmware.
>         WD-WXH1E61ASNX9 # Disk ident.
>
> and /var/log/dmesg.boot
> # snip
> ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
> ada2: ATA-8 SATA 3.x device
> ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada2: Command Queueing enabled
> ada2: 572325MB (1172123568 512 byte sectors: 16H 63S/T 16383C)
> ada2: Previously was known as ad8
> ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
> ada3: ATA-8 SATA 3.x device
> ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada3: Command Queueing enabled
> ada3: 572325MB (1172123568 512 byte sectors: 16H 63S/T 16383C)
> ada3: Previously was known as ad10
> # snap
>
> And here's some iostat -d -w 10 ada0 ada1 ada2 ada3 example output
> # snip
>             ada0              ada1              ada2              ada3
>   KB/t  tps  MB/s    KB/t  tps  MB/s    KB/t  tps  MB/s    KB/t  tps  MB/s
>   0.00    0  0.00    0.00    0  0.00   19.33  176  3.32   19.33  176  3.32
>  16.25    0  0.01   16.25    0  0.01   16.87  133  2.20   16.87  133  2.20
>   0.00    0  0.00    0.00    0  0.00   16.77  146  2.40   16.77  147  2.40
>   0.00    0  0.00    0.00    0  0.00   19.46  170  3.24   19.45  170  3.23
>  21.50    0  0.01   21.50    0  0.01   17.00  125  2.08   17.00  125  2.08
>   0.50    0  0.00    0.50    0  0.00   16.88  145  2.38   16.88  145  2.38
>   0.00    0  0.00    0.00    0  0.00   16.96  125  2.07   16.97  125  2.07
>   0.00    0  0.00    0.00    0  0.00   19.82  158  3.06   19.81  158  3.07
>  28.77    1  0.03   28.77    1  0.03   16.83  133  2.19   16.82  133  2.19
> # snap

The key here is the tps... Spinning drives can only do a limited number of
transactions per second: first the heads have to move, which on average takes
~4ms, then you have to wait on average half a rotation, which for a 10k RPM
drive is ~3ms, so each random seek costs roughly 7ms. As you can see, your
best number above is 176 tps, i.e. ~5.7ms per transaction, so your drives are
performing about as well as they should...

> I think the MB/s output is rather low for such a disk. To gain further
> insight I started gstat:
> dT: 1.001s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0     27      0      0    0.0     27   2226    4.8    7.0| ada0
>     0     28      1     32   23.9     27   2226    1.3    3.9| ada1
>     2    120    115   1838    6.4      5     96    0.2   74.3| ada2
>     2    121    116   1854    6.3      5     96    0.4   72.9| ada3
>     0     28      1     32   24.0     27   2226    5.0    8.7| mirror/gm
>     2    121    116   3708    7.9      5    192    0.4   92.2| stripe/gs
>     0     28      1     32   24.0     27   2226    5.0    8.7| mirror/gms1
>     0     12      0      0    0.0     12   1343    9.1    6.9| mirror/gms1a
>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1b
>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1d
>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1e
>     0     16      1     32   24.0     15    883    1.7    2.9| mirror/gms1f
>
> What bothers me here is that stripe/gs is 92% busy while the disks
> themselves are only at 74/72%. This led me to post here and seek
> some advice, since I don't know enough about the mechanics and so I
> can't really find the problem, if there is any at all.

This is because the stripe has to wait for both drives to return data before
it can pass the data up... If you're only running a single-threaded workload,
there aren't multiple I/Os in flight, and therefore the remaining time is
spent in your application before it sends the next request down to the
stripe... The difference between the stripe and the drives is that each drive
is sometimes faster than the other, so the one that finishes first has no
work to do until the next I/O is submitted...

Try sending more I/O at it, for example by running 4 or more dd reads at
once, so that while one I/O is waiting out its latency there is other I/O for
the drives to serve (a rough sketch of this follows at the end of this
message)...

Also, make sure that you're using NCQ, so the OS can submit multiple I/Os to
the drives at once; this should improve things, but it won't change the
results you see above, because it too requires multiple outstanding I/Os...

--
  John-Mark Gurney                              Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."
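
P.S. A minimal sketch of the "more I/O in flight" idea, assuming the stripe
shows up as /dev/stripe/gs and that reading raw data from a few different
offsets is acceptable for a quick test (the device name, block size and
offsets are placeholders, adjust them to your setup):

    #!/bin/sh
    # Start four dd readers at different offsets on the stripe so the member
    # disks always have another request queued while one is being serviced.
    DEV=/dev/stripe/gs      # assumed device name, change to match your stripe

    for i in 0 1 2 3; do
        # each reader starts ~25 GB further into the device (1 MB blocks)
        dd if=$DEV of=/dev/null bs=1m skip=$((i * 25000)) count=10000 &
    done
    wait

While this runs, watching gstat in another terminal should show L(q) on
ada2/ada3 staying above 1 and the %busy of the member disks and of stripe/gs
converging, which is the behaviour described above.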