From: "Leo Huang"
To: "Bruce Evans"
Cc: freebsd-fs@freebsd.org, Ivan Voras
Date: Thu, 29 Jun 2006 10:20:33 +0800
Subject: Re: Is the fsync() fake on FreeBSD6.1?
In-Reply-To: <20060628230439.M75051@delplex.bde.org>
References: <44A1B958.4030204@fer.hr> <20060628230439.M75051@delplex.bde.org>

Hi,

> >> OS          Clients  Result (queries per second)  TPS (from iostat)
> >> FreeBSD6.1  50       516.1                        about 2000
>
> Seems normal for drives that do write caching.

I disabled the disk write caching as Bjorn Gronvall suggested, and the
TPS came down to about 200. So I think you and Bjorn Gronvall are
right: it is the disk write caching that makes the TPS so high.

> >> Debian3.1   50       49.8                         about 200
>
> Seems too slow for disks that do write caching.  Maybe Debian does something
> to force the drive to complete its i/o, or just does a full sync() like
> someone mentioned Linux doing.

I used sginfo and found that disk write caching is also enabled by
default. After disabling it, the TPS came down further, from 200 to 110.
This really puzzles me. Can you give me more information about it?

Regards,
Leo Huang

2006/6/28, Bruce Evans:
> On Wed, 28 Jun 2006, Ivan Voras wrote:
>
> > Leo Huang wrote:
> >
> >> The results are as follows:
> >> OS          Clients  Result (queries per second)  TPS (from iostat)
> >> FreeBSD6.1  50       516.1                        about 2000
>
> Seems normal for drives that do write caching.
>
> >> Debian3.1   50       49.8                         about 200
>
> Seems too slow for disks that do write caching.  Maybe Debian does something
> to force the drive to complete its i/o, or just does a full sync() like
> someone mentioned Linux doing.
>
> >> I know that MySQL uses fsync() to flush both the data and log files at
> >
> > I tried to see the effects from fsync() with this little program:
> >
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> >
> > #define BUF_SIZE 512
> > #define COUNT 50000
> >
> > int main() {
> >     int fd;
> >     char buf[BUF_SIZE];
> >     int i;
> >
> >     fd = open("test.file", O_CREAT|O_TRUNC|O_WRONLY, 0600);
> >     if (fd < 0) {
> >         printf("cannot open\n");
> >         exit(1);
> >     }
> >
> >     for (i = 0; i < COUNT; i++) {
> >         if (write(fd, buf, BUF_SIZE) != BUF_SIZE) {
> >             printf("error writing\n");
> >             exit(1);
> >         }
>
> The results are much clearer with BUF_SIZE == 1 and COUNT <= fs_blocksize.
> Then the file system keeps writing the same block and inode, and drives
> with write caching are limited mainly by their software overhead.  With
> a program similar to the above, I get the following times on a 7200 RPM
> ATA drive:
>
> COUNT = fs_blocksize = 8192
> to regular file in /tmp (mount options: none, no soft updates):
>     7.76 seconds (iostat 500-3500 tps 4.5-7.7 KB/t)
> to /dev/null on a devfs-free system:
>     9.67 seconds (iostat 450-2200 tps 8.0 KB/t)
> to /dev/ttyv on a devfs-free system:
>     16.30 seconds (iostat 500-550 tps 8.0 KB/t) (yes, /dev/ttyv0 is slowest!)
>
> Er, the results were clear.  In a previous run, with different mount options
> (-async and maybe -noatime), and COUNT = 1000, I got 4000+ tps 4.5 KB/t
> consistently for the regular file.  4.5 is the average of 8 and 1 (which I
> thought was 1 8K data block and 1 1K inode block, but now think was 1 1K
> data block and 1 8K inode block).  Changing COUNT back to 1000 now gives
> a consistent 4.5 KB/t but only about 500 tps.  The variation in the block
> size is caused by 8192 being larger than 1000 -- the file usually consists
> of 1-7 fragments, except at the limits, where it is empty or 1 block.
>
> fsync()ing /dev/null and /dev/ttyv1 is apparently slow because I (or
> someone at my request) prematurely removed the hack for not syncing
> file times for device files.  IN_LAZYMOD was supposed to make the
> hack unnecessary, but I never got around to making IN_LAZYMOD apply
> more generally.  In -current, it only applies to device files that are
> not in devfs and on ffs without soft updates, but there are no such
> files so it never applies.  In my kernel, it applies to all files but
> still only for atimes and not for soft updates.
>
> It is strange that fsync()ing /dev/null is slower than fsync()ing
> a regular file, and especially strange that fsync()ing /dev/ttyv1
> is much slower than fsync()ing /dev/null.  Both should be about twice
> as fast since only 1 block needs to be written (an inode block).
>
> >         if (fsync(fd) < 0) {
> >             printf("error in fsync\n");
> >             exit(1);
> >         }
> >     }
> >
> >     close(fd);
> >     unlink("test.file");
> >
> >     return 0;
> > }
> >
> > But I see strange results with iostat. It shows 16KB transactions, ~2900 tps
> > and 46 MB/s. On the other hand, the program runs for ~36 seconds, which gives
> > ~1390 tps (this is a single desktop drive). Since 36 seconds of 46MB/s would
> > result in a file 1.6 GB in size, while it's clearly 50000*512=25MB, iostat is
> > lying.
>
> This is because you fsync() every 512 bytes.  The file system then writes
> a 16K inode block and a 16K data block, giving 64 times as much i/o as
> necessary.
>
> > I think it's too valuable a tool to be lying. For what it's worth, gstat is
> > also lying in the same way.
>
> iostat and gstat just report whatever is recorded by devstat(9).  The
> recording is done at a fairly low level but not low enough to be
> correct.  Recorders lie mainly for block sizes larger than 64K.  E.g.,
> geom claims that all (?) disk devices can handle block sizes up to
> MAXPHYS (128K) and then splits up i/o's into whatever sizes the disk
> device drivers handle.  Most disk device drivers claim to handle
> DFLTPHYS (64K) whether or not the disk drive can handle that, and may
> further split up the i/o as necessary.  This makes it hard to see the
> sizes that actually reach the hardware.
>
> Bruce
>
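
The arithmetic behind the iostat discrepancy quoted above can be checked
with a short C sketch. It only restates numbers from the thread and
assumes, following Bruce's explanation, that every fsync() pushes out one
16K data block and one 16K inode block. The function names are
hypothetical and not part of any FreeBSD interface.

```c
#include <stdio.h>

/* Bytes the test program asked to write: count write() calls of buf_size bytes. */
double app_bytes(long count, long buf_size)
{
    return (double)count * (double)buf_size;
}

/*
 * Bytes reaching the device if each fsync() writes blocks_per_fsync whole
 * file system blocks (assumption: 2 blocks of 16K, one data and one inode).
 */
double disk_bytes(long count, long block_size, int blocks_per_fsync)
{
    return (double)count * (double)block_size * (double)blocks_per_fsync;
}

/* Device traffic per fsync() divided by the payload written before it. */
double amplification(long buf_size, long block_size, int blocks_per_fsync)
{
    return (double)block_size * (double)blocks_per_fsync / (double)buf_size;
}

/* Device transfers per second implied by count fsync() calls in `seconds`. */
double device_tps(long count, double seconds, int blocks_per_fsync)
{
    return (double)count * (double)blocks_per_fsync / seconds;
}

/* Print the figures for the run quoted in the thread (512-byte writes, 50000 times, ~36 s). */
void print_thread_numbers(void)
{
    printf("application bytes: %.0f\n", app_bytes(50000, 512));
    printf("device bytes:      %.0f\n", disk_bytes(50000, 16384, 2));
    printf("amplification:     %.0fx\n", amplification(512, 16384, 2));
    printf("device tps:        %.0f\n", device_tps(50000, 36.0, 2));
}
```

With the thread's numbers, the application writes 25,600,000 bytes (about
25 MB) while the device would see 1,638,400,000 bytes (about 1.6 GB), a
factor of 64. Two transfers per fsync() over roughly 36 seconds comes to
about 2780 transfers per second, close to the ~2900 tps that iostat showed,
which would be consistent with iostat reporting real device traffic in this
case.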