Date: Fri, 19 Apr 2013 11:22:40 -0700
From: Matthew Ahrens <matthew.ahrens@delphix.com>
To: Adam Nowacki
Cc: "freebsd-fs@freebsd.org", illumos-zfs, Andriy Gapon
Subject: Re: ZFS slow reads for unallocated blocks

Sorry I'm late to the game here, just saw this email now.

Yes, this is also a problem on illumos, though much less so on my system:
only about 2x slower. It looks like the difference is due to the fact that
the zeroed dbufs are not cached, so we have to zero the entire dbuf (e.g.
128k) for every read syscall (e.g. 8k). Increasing the read size to match
the recordsize results in performance parity between reading cached data
and reading sparse zeros.

You can see this behavior in the following dtrace output, which shows that
we initialize the dbuf in dbuf_read_impl() as many times as we do syscalls:

  sudo dtrace -n 'dbuf_read_impl:entry/pid==$target/{@[probefunc] = count()}' \
      -c 'dd if=t100m of=/dev/null bs=8k'
  dtrace: description 'dbuf_read_impl:entry' matched 1 probe
  12800+0 records in
  12800+0 records out
  dtrace: pid 29419 has exited

  dbuf_read_impl                                              12800

--matt

On Sat, Apr 13, 2013 at 12:24 PM, Adam Nowacki wrote:

> Including zfs@illumos on this. To recap:
>
> Reads from sparse files are slow, with speed proportional to the ratio of
> read size to filesystem recordsize. There is no physical disk I/O.
>
> # zfs create -o atime=off -o recordsize=128k -o compression=off \
>     -o sync=disabled -o mountpoint=/home/testfs home/testfs
> # dd if=/dev/random of=/home/testfs/random10m bs=1024k count=10
> # truncate -s 10m /home/testfs/trunc10m
> # dd if=/home/testfs/random10m of=/dev/null bs=512
> 10485760 bytes transferred in 0.078637 secs (133344041 bytes/sec)
> # dd if=/home/testfs/trunc10m of=/dev/null bs=512
> 10485760 bytes transferred in 1.011500 secs (10366544 bytes/sec)
>
> # zfs create -o atime=off -o recordsize=8M -o compression=off \
>     -o sync=disabled -o mountpoint=/home/testfs home/testfs
> # dd if=/home/testfs/random10m of=/dev/null bs=512
> 10485760 bytes transferred in 0.080430 secs (130371205 bytes/sec)
> # dd if=/home/testfs/trunc10m of=/dev/null bs=512
> 10485760 bytes transferred in 72.465486 secs (144700 bytes/sec)
>
> This is from FreeBSD 9.1. A possible solution is at
> http://tepeserwery.pl/nowak/freebsd/zfs_sparse_optimization_v2.patch.txt
> - untested yet, the system will be busy building packages for a few more
> days.
>
>
> On 2013-04-13 19:11, Will Andrews wrote:
>
>> Hi,
>>
>> I think the idea of using a pre-zeroed region as the 'source' is a good
>> one, but it would probably be better to set a special flag on a hole
>> dbuf than to require caller flags. That way, ZFS can lazily evaluate
>> the hole dbuf (i.e. avoid zeroing db_data until it has to). However,
>> that could be complicated by the fact that there are many potential
>> users of hole dbufs that would want to write to the dbuf.
>>
>> This sort of optimization should be brought to the illumos zfs list. As
>> it stands, your patch is also FreeBSD-specific, since 'zero_region' only
>> exists in vm/vm_kern.c. Given the frequency of zero-copying, however,
>> it's quite possible there are other versions of this region elsewhere.
>>
>> --Will.
>>
>>
>> On Sat, Apr 13, 2013 at 6:04 AM, Adam Nowacki wrote:
>>
>> Temporary dbufs are created for each missing (unallocated on disk)
>> record, including indirects if the hole is large enough. Those dbufs
>> never find their way into the ARC and are freed at the end of
>> dmu_read_uio.
>>
>> A small read (from a hole) would in the best case bzero 128KiB
>> (recordsize; more if indirects are missing) ... and I'm running a
>> modified ZFS with record sizes up to 8MiB.
>>
>> # zfs create -o atime=off -o recordsize=8M -o compression=off \
>>     -o mountpoint=/home/testfs home/testfs
>> # truncate -s 8m /home/testfs/trunc8m
>> # dd if=/dev/zero of=/home/testfs/zero8m bs=8m count=1
>> 1+0 records in
>> 1+0 records out
>> 8388608 bytes transferred in 0.010193 secs (822987745 bytes/sec)
>>
>> # time cat /home/testfs/trunc8m > /dev/null
>> 0.000u 6.111s 0:06.11 100.0% 15+2753k 0+0io 0pf+0w
>>
>> # time cat /home/testfs/zero8m > /dev/null
>> 0.000u 0.010s 0:00.01 100.0% 12+2168k 0+0io 0pf+0w
>>
>> A 600x increase in system time and close to 1MB/s - insanity.
>>
>> The fix - a lot of the code to efficiently handle this was already
>> there. dbuf_hold_impl has an int fail_sparse argument to return ENOENT
>> for holes. I just had to get there and somehow back to dmu_read_uio,
>> where zeroing can happen at byte granularity.
>>
>> ... didn't have time to actually test it yet.
>>
>>
>> On 2013-04-13 12:24, Andriy Gapon wrote:
>>
>> on 13/04/2013 02:35 Adam Nowacki said the following:
>>
>>> http://tepeserwery.pl/nowak/freebsd/zfs_sparse_optimization.patch.txt
>>>
>>> Does it look sane?
>>
>> It's hard to tell from a quick look since the change is not small.
>> What is your idea of the problem and the fix?
>>
>> On 2013-04-12 09:03, Andriy Gapon wrote:
>>
>> ENOTIME to really investigate, but here is a basic profile result for
>> those interested:
>>
>>   kernel`bzero+0xa
>>   kernel`dmu_buf_hold_array_by_dnode+0x1cf
>>   kernel`dmu_read_uio+0x66
>>   kernel`zfs_freebsd_read+0x3c0
>>   kernel`VOP_READ_APV+0x92
>>   kernel`vn_read+0x1a3
>>   kernel`vn_io_fault+0x23a
>>   kernel`dofileread+0x7b
>>   kernel`sys_read+0x9e
>>   kernel`amd64_syscall+0x238
>>   kernel`0xffffffff80747e4b
>>
>> That's where >99% of the time is spent.