From owner-freebsd-hackers@FreeBSD.ORG  Fri Jan 18 08:53:50 2013
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 6B04B309
 for <freebsd-hackers@freebsd.org>; Fri, 18 Jan 2013 08:53:50 +0000 (UTC)
 (envelope-from se@freebsd.org)
Received: from nm21-vm6.bullet.mail.ird.yahoo.com
 (nm21-vm6.bullet.mail.ird.yahoo.com [212.82.109.246])
 by mx1.freebsd.org (Postfix) with ESMTP id 712F975D
 for <freebsd-hackers@freebsd.org>; Fri, 18 Jan 2013 08:53:49 +0000 (UTC)
Received: from [212.82.105.247] by nm21.bullet.mail.ird.yahoo.com with NNFMP;
 18 Jan 2013 08:47:17 -0000
Received: from [217.146.188.167] by tm19.bullet.mail.ird.yahoo.com with NNFMP;
 18 Jan 2013 08:47:16 -0000
Received: from [127.0.0.1] by smtp135.mail.ird.yahoo.com with NNFMP;
 18 Jan 2013 08:47:16 -0000
Received: from [192.168.119.26] (se@87.158.25.147 with plain)
 by smtp135.mail.ird.yahoo.com with SMTP; 18 Jan 2013 00:47:16 -0800 PST
Message-ID: <50F90C0F.5010604@freebsd.org>
Date: Fri, 18 Jan 2013 09:47:11 +0100
From: Stefan Esser <se@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130107 Thunderbird/17.0.2
MIME-Version: 1.0
To: freebsd-hackers@freebsd.org
Subject: Re: stupid UFS behaviour on random writes
References: <103826787.2103620.1358463687244.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <103826787.2103620.1358463687244.JavaMail.root@erie.cs.uoguelph.ca>
X-Enigmail-Version: 1.5
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 18 Jan 2013 08:53:50 -0000

On 18.01.2013 00:01, Rick Macklem wrote:
> Wojciech Puchar wrote:
>> create 10GB file (on 2GB RAM machine, with some swap used to make sure
>> little cache would be available for filesystem).
>>
>> dd if=/dev/zero of=file bs=1m count=10k
>>
>> block size is 32KB, fragment size 4k
>>
>>
>> now test random read access to it (10 threads)
>>
>> randomio test 10 0 0 4096
>>
>> normal result on such not so fast disk in my laptop.
>>
>> 118.5 | 118.5 5.8 82.3 383.2 85.6 | 0.0 inf nan 0.0 nan
>> 138.4 | 138.4 3.9 72.2 499.7 76.1 | 0.0 inf nan 0.0 nan
>> 142.9 | 142.9 5.4 69.9 297.7 60.9 | 0.0 inf nan 0.0 nan
>> 133.9 | 133.9 4.3 74.1 480.1 75.1 | 0.0 inf nan 0.0 nan
>> 138.4 | 138.4 5.1 72.1 380.0 71.3 | 0.0 inf nan 0.0 nan
>> 145.9 | 145.9 4.7 68.8 419.3 69.6 | 0.0 inf nan 0.0 nan
>>
>>
>> systat shows 4kB I/O size. all is fine.
>>
>> BUT random 4kB writes
>>
>> randomio test 10 1 0 4096
>>
>> total | read: latency (ms) | write: latency (ms)
>> iops | iops min avg max sdev | iops min avg max sdev
>> --------+-----------------------------------+----------------------------------
>> 38.5 | 0.0 inf nan 0.0 nan | 38.5 9.0 166.5 1156.8 261.5
>> 44.0 | 0.0 inf nan 0.0 nan | 44.0 0.1 251.2 2616.7 492.7
>> 44.0 | 0.0 inf nan 0.0 nan | 44.0 7.6 178.3 1895.4 330.0
>> 45.0 | 0.0 inf nan 0.0 nan | 45.0 0.0 239.8 3457.4 522.3
>> 45.5 | 0.0 inf nan 0.0 nan | 45.5 0.1 249.8 5126.7 621.0
>>
>>
>>
>> results are horrific. systat shows 32kB I/O, gstat shows half are
>> reads, half are writes.
>>
>> Why does UFS need to read the full block, change one 4kB part and then
>> write it back, instead of just writing the 4kB part?
> 
> Because that's the way the buffer cache works. It writes an entire buffer
> cache block (unless at the end of file), so it must read the rest of the block into
> the buffer, so it doesn't write garbage (the rest of the block) out.

Without having looked at the code or testing:

I assume using O_DIRECT when opening the file should help for that
particular test (on kernels compiled with "options DIRECTIO").
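
A minimal sketch of what I mean, untested against UFS. The helper names
and FRAG_SIZE are my own for illustration; only open(2), pwrite(2) and
posix_memalign(3) are real interfaces. Whether O_DIRECT actually avoids
the read-modify-write here is my assumption, as said above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes O_DIRECT on Linux/glibc; harmless elsewhere */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* fallback so the sketch compiles where the flag is hidden */
#endif

#define FRAG_SIZE 4096       /* assumed: the 4k fragment size from the test above */

/* Allocate a zeroed buffer aligned to FRAG_SIZE; direct I/O usually wants aligned memory. */
static void *alloc_frag_buffer(void)
{
    void *p = NULL;

    if (posix_memalign(&p, FRAG_SIZE, FRAG_SIZE) != 0)
        return NULL;
    memset(p, 0, FRAG_SIZE);
    return p;
}

/* Open an existing file for writes that are meant to bypass the buffer cache. */
static int open_direct(const char *path)
{
    return open(path, O_WRONLY | O_DIRECT);
}

/* Write exactly one fragment; rejects unaligned offsets or buffers before any I/O. */
static ssize_t write_frag_direct(int fd, off_t offset, const void *buf)
{
    if (offset % FRAG_SIZE != 0 || (uintptr_t)buf % FRAG_SIZE != 0)
        return -1;
    return pwrite(fd, buf, FRAG_SIZE, offset);
}
```

A caller would do fd = open_direct("file"), fill a buffer obtained from
alloc_frag_buffer(), and call write_frag_direct(fd, n * FRAG_SIZE, buf)
for a random fragment index n.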

> I'd argue that using an I/O size smaller than the file system block size is
> simply sub-optimal and that most apps. don't do random I/O of blocks.
> OR
> If you had an app. that does random I/O of 4K blocks (at 4K byte offsets),
> then using a 4K/1K file system would be better.

A 4k/1k file system has higher overhead (more indirect blocks) and
is clearly sub-optimal for most general uses, today.
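
To put a rough number on the indirect block overhead, here is a small
calculation. It assumes the UFS2 layout as I remember it (12 direct
block pointers per inode, 8-byte block pointers, up to three levels of
indirection) and ignores fragments and all other metadata; the function
name is made up for this mail.

```c
#include <stdint.h>

#define NDADDR   12   /* assumed: direct block pointers in a UFS2 inode */
#define PTR_SIZE 8    /* assumed: bytes per UFS2 block pointer */

static uint64_t ceil_div(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}

/* Count the indirect blocks needed to map a fully allocated file. */
static uint64_t indirect_blocks(uint64_t file_size, uint64_t block_size)
{
    uint64_t ptrs = block_size / PTR_SIZE;      /* pointers per indirect block */
    uint64_t nblocks = ceil_div(file_size, block_size);
    uint64_t remaining, capacity = 1, total = 0;

    if (nblocks <= NDADDR)
        return 0;                               /* direct pointers suffice */
    remaining = nblocks - NDADDR;

    for (int level = 1; level <= 3 && remaining > 0; level++) {
        capacity *= ptrs;                       /* data blocks reachable at this level */
        uint64_t n = remaining < capacity ? remaining : capacity;
        uint64_t nodes = n;

        /* Walk up the tree: leaf indirect blocks first, then their parents. */
        for (int l = 0; l < level; l++) {
            nodes = ceil_div(nodes, ptrs);
            total += nodes;
        }
        remaining -= n;
    }
    return total;
}
```

Under those assumptions, the 10GB test file needs 81 indirect blocks
with 32K blocks (indirect_blocks(10ULL << 30, 32768)) and 5131 with 4K
blocks (indirect_blocks(10ULL << 30, 4096)), and the 4K case has to go
through triple indirection.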

> NFS is the exception, in that it keeps track of a dirty byte range within
> a buffer cache block and writes that byte range. (NFS writes are byte granular,
> unlike a disk.)

It should be easy to add support for a fragment mask to the buffer
cache, which would allow valid fragments to be identified. Such a mask
would be set to 0xff for all current uses of the buffer cache (meaning
the full block is valid), but a special case could then be added for
writes of exactly one or multiple fragments, where only the
corresponding valid flag bits are set. In addition, a possible later
read from disk must obviously skip fragments for which the valid mask
bits are already set.
This bit mask could then be used to update the affected fragments
only, without a read-modify-write of the containing block.
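
A user-space model of that idea, only to make the bookkeeping concrete.
None of this is the actual FreeBSD struct buf; the struct, the dirty
mask, and all function names are hypothetical, and the sizes are the
32K/4K values from the test (8 fragments, hence one byte of mask).

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 32768
#define FRAG_SIZE  4096
#define NFRAGS     (BLOCK_SIZE / FRAG_SIZE)   /* 8 fragments, one mask byte */
#define ALL_VALID  0xffu                      /* full block valid, as today */

/* Hypothetical buffer: one block of data plus per-fragment masks. */
struct frag_buf {
    uint8_t data[BLOCK_SIZE];
    uint8_t valid;   /* bit i set: fragment i holds current data */
    uint8_t dirty;   /* bit i set: fragment i must be written back */
};

/* Mask covering 'count' fragments starting at fragment 'first'. */
static uint8_t frag_mask(unsigned first, unsigned count)
{
    return (uint8_t)(((1u << count) - 1u) << first);
}

/* Store whole fragments without reading the rest of the block first. */
static int buf_write_frags(struct frag_buf *bp, unsigned first,
                           unsigned count, const uint8_t *src)
{
    uint8_t m;

    if (count == 0 || first + count > NFRAGS)
        return -1;
    memcpy(bp->data + first * FRAG_SIZE, src, count * FRAG_SIZE);
    m = frag_mask(first, count);
    bp->valid |= m;
    bp->dirty |= m;
    return 0;
}

/* Fragments a later read from disk still has to fetch (valid ones are skipped). */
static uint8_t buf_frags_to_read(const struct frag_buf *bp)
{
    return (uint8_t)(~bp->valid & ALL_VALID);
}

/* Fragments a flush has to write; everything else stays untouched on disk. */
static uint8_t buf_frags_to_flush(const struct frag_buf *bp)
{
    return bp->dirty;
}
```

Writing fragments 2 and 3 into an empty buffer leaves valid == 0x0c, so
a flush touches 8K instead of reading and rewriting the full 32K block.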

But I doubt that such a change would improve performance in the
general case, just in random update scenarios (which might still
be relevant, in case of a DBMS knowing the fragment size and using
it for DB files).

Regards, STefan