Message-ID: <50F90C0F.5010604@freebsd.org>
Date: Fri, 18 Jan 2013 09:47:11 +0100
From: Stefan Esser
To: freebsd-hackers@freebsd.org
Subject: Re: stupid UFS behaviour on random writes
In-Reply-To: <103826787.2103620.1358463687244.JavaMail.root@erie.cs.uoguelph.ca>

On 18.01.2013 00:01, Rick Macklem wrote:
> Wojciech Puchar wrote:
>> create 10GB file (on 2GB RAM machine, with some swap used to make sure
>> little cache would be available for filesystem.
>>
>> dd if=/dev/zero of=file bs=1m count=10k
>>
>> block size is 32KB, fragment size 4k
>>
>>
>> now test random read access to it (10 threads)
>>
>> randomio test 10 0 0 4096
>>
>> normal result on such not so fast disk in my laptop.
>>
>>  118.5 |  118.5   5.8   82.3   383.2   85.6 |   0.0   inf   nan   0.0   nan
>>  138.4 |  138.4   3.9   72.2   499.7   76.1 |   0.0   inf   nan   0.0   nan
>>  142.9 |  142.9   5.4   69.9   297.7   60.9 |   0.0   inf   nan   0.0   nan
>>  133.9 |  133.9   4.3   74.1   480.1   75.1 |   0.0   inf   nan   0.0   nan
>>  138.4 |  138.4   5.1   72.1   380.0   71.3 |   0.0   inf   nan   0.0   nan
>>  145.9 |  145.9   4.7   68.8   419.3   69.6 |   0.0   inf   nan   0.0   nan
>>
>>
>> systat shows 4kB I/O size. all is fine.
>>
>> BUT random 4kB writes
>>
>> randomio test 10 1 0 4096
>>
>>   total |  read:        latency (ms)       |  write:       latency (ms)
>>    iops |   iops   min   avg   max   sdev  |   iops   min    avg     max   sdev
>> --------+-----------------------------------+----------------------------------
>>    38.5 |    0.0   inf   nan   0.0    nan  |   38.5   9.0  166.5  1156.8  261.5
>>    44.0 |    0.0   inf   nan   0.0    nan  |   44.0   0.1  251.2  2616.7  492.7
>>    44.0 |    0.0   inf   nan   0.0    nan  |   44.0   7.6  178.3  1895.4  330.0
>>    45.0 |    0.0   inf   nan   0.0    nan  |   45.0   0.0  239.8  3457.4  522.3
>>    45.5 |    0.0   inf   nan   0.0    nan  |   45.5   0.1  249.8  5126.7  621.0
>>
>>
>>
>> results are horrific. systat shows 32kB I/O, gstat shows half are reads
>> half are writes.
>>
>> Why UFS need to read full block, change one 4kB part and then write
>> back, instead of just writing 4kB part?
>
> Because that's the way the buffer cache works. It writes an entire buffer
> cache block (unless at the end of file), so it must read the rest of the block into
> the buffer, so it doesn't write garbage (the rest of the block) out.

Without having looked at the code or tested it: I assume that opening
the file with O_DIRECT should help for that particular test (on kernels
compiled with "options DIRECTIO").

> I'd argue that using an I/O size smaller than the file system block size is
> simply sub-optimal and that most apps. don't do random I/O of blocks.
> OR
> If you had an app. that does random I/O of 4K blocks (at 4K byte offsets),
> then using a 4K/1K file system would be better.

A 4k/1k file system has higher overhead (more indirect blocks) and is
clearly sub-optimal for most general uses today.

> NFS is the exception, in that it keeps track of a dirty byte range within
> a buffer cache block and writes that byte range. (NFS writes are byte granular,
> unlike a disk.)

It should be easy to add support for a fragment mask to the buffer
cache, which would allow valid fragments to be identified.
Such a mask should be set to 0xff for all current uses of the buffer
cache (meaning the full block is valid), but a special case could then
be added for writes of exactly one or more fragments, where only the
corresponding valid bits are set. In addition, a later read from disk
must obviously skip fragments whose valid bits are already set.

This bit mask could then be used to update only the affected fragments,
without a read-modify-write of the containing block.

But I doubt that such a change would improve performance in the general
case, only in random update scenarios (which might still be relevant,
for example for a DBMS that knows the fragment size and uses it for its
DB files).

Regards, Stefan
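
PS: a rough userland sketch of the fragment mask idea, only to make the
bookkeeping concrete. This is not kernel code and does not use the real
struct buf; struct toy_buf, frag_mask(), toy_read_missing() and
toy_write() are made-up names, the "disk" is just a byte array, and the
32K/4K sizes are assumed from the test quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes from the test in this thread: 32K blocks, 4K fragments. */
#define BLOCK_SIZE      32768u
#define FRAG_SIZE       4096u
#define FRAGS_PER_BLOCK (BLOCK_SIZE / FRAG_SIZE)  /* 8, so one byte of mask */

/* Toy stand-in for a buffer cache block (hypothetical, not struct buf). */
struct toy_buf {
	uint8_t data[BLOCK_SIZE];
	uint8_t valid;		/* bit i set: fragment i holds valid data */
};

/*
 * Mask of the fragments covered by [off, off + len).
 * Returns 0 if the range is empty, not fragment aligned, or leaves the block.
 */
static uint8_t
frag_mask(size_t off, size_t len)
{
	if (len == 0 || off % FRAG_SIZE != 0 || len % FRAG_SIZE != 0 ||
	    off + len > BLOCK_SIZE)
		return (0);
	unsigned first = (unsigned)(off / FRAG_SIZE);
	unsigned count = (unsigned)(len / FRAG_SIZE);
	return ((uint8_t)(((1u << count) - 1u) << first));
}

/*
 * Fill the buffer from "disk", skipping fragments that are already valid.
 * Returns the number of fragments that had to be read.
 */
static unsigned
toy_read_missing(struct toy_buf *bp, const uint8_t *disk)
{
	unsigned nread = 0;

	for (unsigned i = 0; i < FRAGS_PER_BLOCK; i++) {
		if (bp->valid & (1u << i))
			continue;	/* newer data already in the buffer */
		memcpy(bp->data + i * FRAG_SIZE, disk + i * FRAG_SIZE,
		    FRAG_SIZE);
		bp->valid |= (uint8_t)(1u << i);
		nread++;
	}
	return (nread);
}

/*
 * Write into the buffer. Fragment aligned writes only set valid bits;
 * anything else falls back to reading the missing fragments first.
 * Returns the number of fragments read from "disk" (0 = no read needed).
 */
static unsigned
toy_write(struct toy_buf *bp, const uint8_t *disk, size_t off,
    const void *src, size_t len)
{
	uint8_t mask = frag_mask(off, len);
	unsigned nread = 0;

	assert(len > 0 && off + len <= BLOCK_SIZE);
	if (mask == 0)
		nread = toy_read_missing(bp, disk);
	memcpy(bp->data + off, src, len);
	bp->valid |= mask;
	return (nread);
}
```

In this toy model a 4K write at a 4K offset into an empty buffer returns
0 (nothing read), and a later toy_read_missing() fetches only the other
seven fragments. Writing back only the fragments whose bits are set
would be the remaining piece, which I have left out here.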