From: John-Mark Gurney
To: Frank Broniewski
Date: Wed, 4 Jun 2014 09:30:40 -0700
Subject: Re: Geom stripe bottleneck
Message-ID: <20140604163040.GQ31367@funkthat.com>
In-Reply-To: <538EDB11.6090507@metrico.lu>
References: <538D9BC3.6040509@metrico.lu> <20140603204811.GJ31367@funkthat.com> <538EDB11.6090507@metrico.lu>
Cc: freebsd-geom@freebsd.org
List-Id: GEOM-specific discussions and implementations

Frank Broniewski wrote this message on Wed, Jun 04, 2014 at 10:38 +0200:
> thank you very much for your verbose and very helpful answer! I think
> that clears things out for me.

You're welcome...

> I've got a question concerning NCQ though:
>
> # grep ahci /var/run/dmesg.boot
> ahci0: port
> 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f
> mem 0xfaffe400-0xfaffe7ff irq 22 at device 17.0 on pci0
> ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
> ahcich0: at channel 0 on ahci0
> ahcich1: at channel 1 on ahci0
> ahcich2: at channel 2 on ahci0
> ahcich3: at channel 3 on ahci0
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
> ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
> ada3 at ahcich3 bus 0 scbus3 target 0 lun 0

try doing a grep ada0, as mine shows:
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: ATA-9 SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad0

You should probably see something similar...
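
If you'd rather check every disk in one pass instead of grepping one device at a time, something like the following could do it. This is only a sketch: it assumes the dmesg wording shown in my ada0 output above ("Command Queueing enabled"), and the helper name `queueing_devices` is made up for illustration, not part of any FreeBSD tool.

```python
import re

# Sample text copied from the ada0 output above; on a real system you would
# pass in the contents of /var/run/dmesg.boot instead (assumption: the
# wording of the "Command Queueing enabled" line is the same there).
SAMPLE = """\
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: ATA-9 SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad0
"""

# Matches lines such as "ada0: Command Queueing enabled".
_QUEUEING = re.compile(r"^(\w+): Command Queueing enabled\s*$")


def queueing_devices(dmesg_text):
    """Return the sorted device names that report command queueing."""
    found = set()
    for line in dmesg_text.splitlines():
        match = _QUEUEING.match(line.strip())
        if match:
            found.add(match.group(1))
    return sorted(found)


if __name__ == "__main__":
    print(queueing_devices(SAMPLE))
```

A device that is missing from the returned list is not necessarily running without NCQ; it only means that this particular line was not found in the text you fed in.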
> and:
>
> # camcontrol identify ada3
> pass3: ATA-8 SATA 3.x device
> pass3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
>
> protocol              ATA/ATAPI-8 SATA 3.x
> device model          WDC WD6000HLHX-01JJPV0
> firmware revision     04.05G04
> serial number         WD-WXL1E61PWAL2
> WWN                   50014ee7aaab0118
> cylinders             16383
> heads                 16
> sectors/track         63
> sector size           logical 512, physical 512, offset 0
> LBA supported         268435455 sectors
> LBA48 supported       1172123568 sectors
> PIO supported         PIO4
> DMA supported         WDMA2 UDMA6
> media RPM             10000
>
> Feature                      Support  Enabled   Value           Vendor
> read ahead                     yes      yes
> write cache                    yes      yes
> flush cache                    yes      yes
> overlap                        no
> Tagged Command Queuing (TCQ)   no       no
> Native Command Queuing (NCQ)   yes                32 tags
> SMART                          yes      yes
> microcode download             yes      yes
> security                       yes      no
> power management               yes      yes
> advanced power management      yes      yes       128/0x80
> automatic acoustic management  no       no
> media status notification      no       no
> power-up in Standby            yes      no
> write-read-verify              no       no
> unload                         yes      yes
> free-fall                      no       no
> Data Set Management (DSM/TRIM) no
> Host Protected Area (HPA)      yes      no        1172123568/1172123568
> HPA - Security                 no
>
>
> is NCQ now enabled? The corresponding line in the camcontrol identify
> output doesn't tell me that explicitly but also doesn't deny that ...
> but the dmesg.boot may hint that the ahci module is loaded ... I'm
> confused :-)
>
> I do not have a ahci_load=YES in /boot/loader.conf (this is on FreeBSD
> 9.2-p6) and I don't know if that's still necessary or not. Searching the
> internet turned up mostly rather old (2010,2011) results.
>
>
> On 2014-06-03 22:48, John-Mark Gurney wrote:
> > Frank Broniewski wrote this message on Tue, Jun 03, 2014 at 11:56 +0200:
> >> I have a stripe (RAID0) geom setup for my database's data. Currently I
> >> am applying some large updates on the data and I think the performance
> >> of my stripe could be better.
> >> But I am uncertain and so I thought I'd
> >> request some interpretation help from the community :)
> >>
> >> The stripe consists of two disks (WD Velociraptor with 10.000 rpm):
> >>> diskinfo -v ada2
> >> ada2
> >>         512             # sectorsize
> >>         600127266816    # mediasize in bytes (558G)
> >>         1172123568      # mediasize in sectors
> >>         0               # stripesize
> >>         0               # stripeoffset
> >>         1162821         # Cylinders according to firmware.
> >>         16              # Heads according to firmware.
> >>         63              # Sectors according to firmware.
> >>         WD-WXH1E61ASNX9 # Disk ident.
> >>
> >>
> >> and /var/log/dmesg.boot
> >> # snip
> >> ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
> >> ada2: ATA-8 SATA 3.x device
> >> ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> >> ada2: Command Queueing enabled
> >> ada2: 572325MB (1172123568 512 byte sectors: 16H 63S/T 16383C)
> >> ada2: Previously was known as ad8
> >> ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
> >> ada3: ATA-8 SATA 3.x device
> >> ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> >> ada3: Command Queueing enabled
> >> ada3: 572325MB (1172123568 512 byte sectors: 16H 63S/T 16383C)
> >> ada3: Previously was known as ad10
> >> #snap
> >>
> >>
> >> And here's some iostat -d -w 10 ada0 ada1 ada2 ada3 example output
> >> #snip
> >>        ada0              ada1              ada2              ada3
> >>  KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s
> >>  0.00    0  0.00   0.00    0  0.00  19.33  176  3.32  19.33  176  3.32
> >> 16.25    0  0.01  16.25    0  0.01  16.87  133  2.20  16.87  133  2.20
> >>  0.00    0  0.00   0.00    0  0.00  16.77  146  2.40  16.77  147  2.40
> >>  0.00    0  0.00   0.00    0  0.00  19.46  170  3.24  19.45  170  3.23
> >> 21.50    0  0.01  21.50    0  0.01  17.00  125  2.08  17.00  125  2.08
> >>  0.50    0  0.00   0.50    0  0.00  16.88  145  2.38  16.88  145  2.38
> >>  0.00    0  0.00   0.00    0  0.00  16.96  125  2.07  16.97  125  2.07
> >>  0.00    0  0.00   0.00    0  0.00  19.82  158  3.06  19.81  158  3.07
> >> 28.77    1  0.03  28.77    1  0.03  16.83  133  2.19  16.82  133  2.19
> >> #snap
> >
> > The key here is the tps... Spinning drives have a limited number of
> > tps...
> > first you have to move the heads, which on average will take ~4ms,
> > then you have to wait, on average, half a rotation, which for a 10k RPM
> > drive is ~3ms, so each seek will take around 7ms. As you can see,
> > your best number is 176 TPS, or ~6ms/transaction, and your slowest is
> > 125 TPS, or 8ms/transaction... so, it looks like
> > your drives are performing as they should...
> >
> >> I think the MB/s output is rather low for such a disk. To gain further
> >> insight I started gstat:
> >> dT: 1.001s  w: 1.000s
> >>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
> >>     0     27      0      0    0.0     27   2226    4.8    7.0| ada0
> >>     0     28      1     32   23.9     27   2226    1.3    3.9| ada1
> >>     2    120    115   1838    6.4      5     96    0.2   74.3| ada2
> >>     2    121    116   1854    6.3      5     96    0.4   72.9| ada3
> >>     0     28      1     32   24.0     27   2226    5.0    8.7| mirror/gm
> >>     2    121    116   3708    7.9      5    192    0.4   92.2| stripe/gs
> >>     0     28      1     32   24.0     27   2226    5.0    8.7| mirror/gms1
> >>     0     12      0      0    0.0     12   1343    9.1    6.9| mirror/gms1a
> >>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1b
> >>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1d
> >>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1e
> >>     0     16      1     32   24.0     15    883    1.7    2.9| mirror/gms1f
> >>
> >>
> >> What bothers me here is that the stripe/gs is 92% busy while the disks
> >> themselves are only at 74/72%. This led me to post here and seek
> >> some advice, since I don't know enough about the mechanics and so I
> >> can't really find the problem, if there is any at all.
> >
> > This is because the stripe has to wait for both drives to return data
> > before moving the data up... If you're just running a single-threaded
> > benchmark, there aren't multiple IOs in flight, and therefore the
> > remaining time is spent in your application before it sends another
> > request down to the stripe... the difference between the stripe and the
> > drives comes from the fact that each drive is sometimes faster than the
> > other, so, again, it won't have work to do until another IO is submitted...
> >
> > Try sending more IO at it, like doing 4 or more dd reads, such that
> > during the latency of one IO, there is other IO to serve...
> >
> > Also, make sure that you're using NCQ, where the OS can submit multiple
> > IOs to the drives at once. This should improve things, but won't
> > change the results you see above, as it requires multiple IOs
> > outstanding...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."
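
The seek estimate quoted above (average head movement plus half a rotation) can be worked through numerically. This is a rough sketch, assuming the ~4ms average seek and 10k RPM figures from the thread; the function names are made up for illustration, and real drives will vary with seek distance and caching.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def service_time_ms(avg_seek_ms, rpm):
    """Average time per random IO: head movement plus half a rotation."""
    return avg_seek_ms + rotation_ms(rpm) / 2.0


def max_tps(per_io_ms):
    """Transactions per second with a single IO outstanding at a time."""
    return 1000.0 / per_io_ms


def ms_per_transaction(tps):
    """Convert an observed iostat tps figure back to ms per transaction."""
    return 1000.0 / tps


if __name__ == "__main__":
    per_io = service_time_ms(avg_seek_ms=4.0, rpm=10_000)
    print(f"estimated service time: {per_io:.1f} ms")
    print(f"estimated ceiling:      {max_tps(per_io):.0f} tps")
    # tps samples taken from the iostat output quoted in the thread
    for tps in (125, 133, 145, 158, 170, 176):
        print(f"{tps} tps -> {ms_per_transaction(tps):.1f} ms/transaction")
```

The estimate only covers one IO outstanding at a time, which is why queueing several requests (NCQ, or several readers in parallel) is the way to get past it.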