From owner-freebsd-questions@FreeBSD.ORG Sun Apr 25 22:04:01 2010
Date: Mon, 26 Apr 2010 01:03:53 +0300
From: Dan Naumov <dan.naumov@gmail.com>
To: amsibamsi@gmail.com, freebsd-questions@freebsd.org
Subject: RE: ZFS scheduling

> Hi,
>
> I noticed that my system gets very slow when I'm doing some simple but
> intense ZFS operations. For example, I move about 20 gigabytes of data
> from one dataset to another on the same pool, which is a RAIDZ of three
> 500 GB SATA disks. The operation itself runs fast, but meanwhile other
> things get really slow. E.g. opening an application takes 5 times as
> long as before. Also, simple operations like 'ls' stall for a few
> seconds, which they never did before. It already changed a lot when I
> switched from RAIDZ to a mirror with only 2 disks. Memory and CPU don't
> seem to be the issue, I have a quad-core CPU and 8 GB RAM.
>
> I can't get rid of the idea that this has something to do with
> scheduling. The system is absolutely stable and fast. Somehow small I/O
> operations on ZFS seem to have a very hard time making it through when
> other, bigger ones are running. Maybe this has something to do with
> tuning?
>
> I know my system information is very incomplete, and there could be a
> lot of causes. But does anybody know if this could be an issue with ZFS
> itself?

Hello,

As you mention yourself, your system information is indeed very
incomplete, making your problem rather hard to diagnose :) Scheduling,
in the traditional sense, is unlikely to be the cause of your problems,
but here are a few things you could look into:

The first one is obviously the pool layout: heavy-duty writing to a
pool consisting of a single raidz vdev is slow (slower than writing to
a mirror, as you already discovered), period. Such is the nature of
raidz.
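If you want to see what the individual drives are doing while one of
these big copies is running, the stock tools are enough. Something
along these lines (with "tank" standing in for whatever your pool is
actually called) should show all three raidz members busy with both
reads and writes at once:

  # per-vdev and per-disk I/O statistics, refreshed every 5 seconds
  zpool iostat -v tank 5

  # per-provider disk activity, only showing devices that are busy
  gstat -a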
Additionally, your problem is magnified by the fact that you have reads
competing with writes, since you are (I assume) reading from the same
pool. One approach to alleviating the problem would be to use a pool
consisting of 2 or more raidz vdevs in a stripe, like this (a rough
zpool create sketch for this layout is in the PS below):

  pool
    raidz
      disc1
      disc2
      disc3
    raidz
      disc4
      disc5
      disc6

The second potential cause of your issues is the system wrongly
guesstimating your optimal TXG commit size. ZFS works in such a fashion
that it commits data to disk in chunks; how big a chunk it writes at a
time is something it tries to optimize by evaluating your pool's I/O
bandwidth over time and the available RAM. The TXG commits happen at an
interval of 5-30 seconds. The worst-case scenario is that if the system
misguesses the optimal TXG size, then under heavy write load it keeps
deferring the commit for up to the 30-second timeout, and when it hits
that cap it frantically commits it ALL at once. This can, and most
likely will, completely starve your read I/O on the pool for as long as
the drives choke on committing the TXG.

If you are on 8.0-RELEASE, you could try playing with the
vfs.zfs.txg.timeout variable in /boot/loader.conf; generally sane
values are 5-30, with 30 being the default. You could also try
adjusting vfs.zfs.vdev.max_pending down from the default of 35 to a
lower value and see if that helps (examples of both are in the PS).
AFAIK, 8-STABLE and -HEAD have a sysctl variable that directly allows
you to manually set the preferred TXG size, and I'm pretty sure I've
seen some patches on the mailing lists to add this functionality to
8.0.

Hope this helps.

- Sincerely,
Dan Naumov
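PS: purely as a sketch, and with the pool name "tank" and the disk
names (ad4, ad6, ...) being placeholders for whatever your hardware
actually uses, a stripe of two raidz vdevs would be created along these
lines (note that this needs 6 disks and rebuilding the pool from
scratch, since you cannot convert an existing raidz vdev in place):

  zpool create tank \
      raidz ad4 ad6 ad8 \
      raidz ad10 ad12 ad14

And the tunables mentioned above would look something like this in
/boot/loader.conf; the exact values are something you will have to
experiment with on your own workload:

  # commit TXGs more often than the default 30 seconds
  vfs.zfs.txg.timeout="5"

  # fewer outstanding I/Os queued per vdev, default is 35
  vfs.zfs.vdev.max_pending="10"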