From: Alexander Motin
Date: Fri, 25 Dec 2009 12:58:07 +0200
To: freebsd-arch@freebsd.org, FreeBSD-Current
Subject: File system blocks alignment
Message-ID: <4B349ABF.2070800@FreeBSD.org>
List-Id: Discussion related to FreeBSD architecture

Hi.

WD recently released its first series of ATA disks with an increased physical sector size. On these disks, writes that do not match the 4K blocks are inefficient. So I propose to return to the question of optimal FS block alignment. This topic is also important for most RAIDs of a striped nature, such as RAID0/3/5/..., and for flash drives with simple controllers (such as MMC/SD cards).

As I do not have any of those WD disks yet, I ran a series of tests with RAID0, built with geom_stripe, to check the general idea. I tested the most illustrative case: a 2-disk RAID0 with a 16K stripe, a 16K FS block and many random 16K I/Os (reads in this test, to avoid FS locking). I have seen the same load pattern, but with writes, on my busy disk-bound MySQL servers, so it is quite realistic.

Test one, default partitioning.

%gstripe label -s 16384 data /dev/ada1 /dev/ada2
%fdisk -I /dev/stripe/data
%disklabel -w /dev/stripe/datas1
%disklabel /dev/stripe/datas1
# /dev/stripe/datas1:
8 partitions:
#         size   offset    fstype   [fsize bsize bps/cpg]
  a: 1250274611       16    unused        0     0
  c: 1250274627        0    unused        0     0         # "raw" part, don't edit
%diskinfo -v /dev/stripe/datas1a
/dev/stripe/datas1a
        512             # sectorsize
        640140600832    # mediasize in bytes (596G)
        1250274611      # mediasize in sectors
        16384           # stripesize
        7680            # stripeoffset
        77825           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

As you can see, fdisk aligned the partition to the "track length" of 63 sectors and disklabel added an offset of 16 sectors. As a result, the file system starts at a rather odd place in the RAID stripe (stripeoffset 7680 instead of 0).
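The reported stripeoffset can be reproduced with a little arithmetic. The following is only a sketch, not part of any FreeBSD tool: it assumes the slice begins at sector 63 of the stripe device and that stripeoffset is the partition's starting byte taken modulo the stripe size. The function names are hypothetical and exist only for illustration.

```python
SECTOR = 512        # bytes per sector, as reported by diskinfo
STRIPE = 16384      # stripe size in bytes, from "gstripe label -s 16384"
SLICE_START = 63    # sectors; assumed start of slice s1 ("track length")


def stripe_offset(label_offset, slice_start=SLICE_START,
                  sector=SECTOR, stripe=STRIPE):
    """Byte distance from the preceding stripe boundary to the
    start of a partition at label_offset sectors into the slice."""
    return (slice_start + label_offset) * sector % stripe


def next_aligned_offset(min_offset, slice_start=SLICE_START,
                        sector=SECTOR, stripe=STRIPE):
    """Smallest label offset (in sectors) >= min_offset that puts the
    partition start on a stripe boundary. Assumes the stripe size is
    a multiple of the sector size, so the search always terminates."""
    offset = min_offset
    while stripe_offset(offset, slice_start, sector, stripe) != 0:
        offset += 1
    return offset


if __name__ == "__main__":
    # Default layout: disklabel offset of 16 sectors.
    print(stripe_offset(16))
    # First offset at or after 16 sectors that lands on a boundary.
    print(next_aligned_offset(16))
```

With the default offset of 16 sectors the partition begins (63 + 16) * 512 = 40448 bytes into the device, which is 7680 bytes past a 16K boundary, in agreement with the diskinfo output above.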
I created a UFS file system, pre-wrote a 4GB file and ran the tests (raidtest was patched to generate only 16K requests):

%raidtest test -d /mnt/qqq -n 1
Requests per second: 112
%raidtest test -d /mnt/qqq -n 64
Requests per second: 314

Before each test the FS was unmounted to flush caches.

Test two, FS manually aligned with disklabel.

%disklabel /dev/stripe/datas1
# /dev/stripe/datas1:
8 partitions:
#         size   offset    fstype   [fsize bsize bps/cpg]
  a: 1250274578       33    unused        0     0
  c: 1250274627        0    unused        0     0         # "raw" part, don't edit
%diskinfo -v /dev/stripe/datas1a
/dev/stripe/datas1a
        512             # sectorsize
        640140583936    # mediasize in bytes (596G)
        1250274578      # mediasize in sectors
        16384           # stripesize
        0               # stripeoffset
        77825           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

The file system is now aligned with the stripe.

%raidtest test -d /mnt/qqq -n 1
Requests per second: 133
%raidtest test -d /mnt/qqq -n 64
Requests per second: 594

The difference is quite significant. An unaligned RAID0 access involves both disks in its handling, while an aligned one leaves one of the disks free for another request, nearly doubling throughput under concurrent load (594 vs. 314 requests per second with -n 64).

As we now have a mechanism for reporting the stripe size and offset of any partition to user level, it should be easy to make disk partitioning and file system creation tools use it automatically. Stripe size/offset reporting is now supported by the ada and mmcsd disk drivers and by most GEOM modules. It would be nice to fetch that information from hardware RAIDs as well, where possible.

-- 
Alexander Motin