From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 08:36:15 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 44483885 for ; Sun, 27 Jan 2013 08:36:15 +0000 (UTC) (envelope-from grarpamp@gmail.com) Received: from mail-ve0-f176.google.com (mail-ve0-f176.google.com [209.85.128.176]) by mx1.freebsd.org (Postfix) with ESMTP id E6B926EF for ; Sun, 27 Jan 2013 08:36:14 +0000 (UTC) Received: by mail-ve0-f176.google.com with SMTP id jz10so820786veb.21 for ; Sun, 27 Jan 2013 00:36:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:date:message-id:subject:from:to :content-type; bh=q+uUE05NiEyIeIIx+aEuROqd5fZh/rgvROYilMBRroU=; b=lE7ff3CBXRhn0XbbkYTq/jKO+30C05ihIKteqoTnUIkUj1UWuLXo1m+S/SD8Z4aSOu mltHcAHlJS0RnHfP4tnWl/rk/KIA2Y+rvLRqNkq95ut8pfumrjRxPES1iyWRa7bmVLSe KHUSEa595EYDBraTMoHdFe4kaGhUdLMxmFI+ZMipT82+O8GqdI7x7pSrxq/nIoBtUgNF 3Ko+B7ICn4HvRcSUWIW9Unrf7meuXsbTqspE9SmM5dKaiGS5LDCBoJIDTeOwSN0ugDYR i01F3vcvu1a0wLCcirH0WtMk2u9zBKMoNz+X4HZ/NHIs5+Muv6XfgubbHVsatY/OHve8 XFAQ== MIME-Version: 1.0 X-Received: by 10.52.67.75 with SMTP id l11mr10033428vdt.29.1359275768423; Sun, 27 Jan 2013 00:36:08 -0800 (PST) Received: by 10.220.219.79 with HTTP; Sun, 27 Jan 2013 00:36:08 -0800 (PST) Date: Sun, 27 Jan 2013 03:36:08 -0500 Message-ID: Subject: ZFS slackspace, grepping it for data From: grarpamp To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 08:36:15 -0000 Say there's a 100GB zpool over a single vdev (one drive). It's got a few datasets carved out of it. How best to stroll through only the 10GB of slackspace (aka: df 'Avail') that is present? I tried making a zvol out of it but only got 10mb of zeros, which makes sense because zfs isn't managing anything written there in that empty zvol yet. I could troll the entire drive, but that's 10x the data and I don't really want the current 90gb of data in the results. There is zdb -R, but I don't know the offsets of the slack, unless they are somehow tied to the pathname hierarchy. Any ideas? 
From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 09:03:14 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 16697B43 for ; Sun, 27 Jan 2013 09:03:14 +0000 (UTC) (envelope-from nowakpl@platinum.linux.pl) Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4]) by mx1.freebsd.org (Postfix) with ESMTP id CB2D27AB for ; Sun, 27 Jan 2013 09:03:13 +0000 (UTC) Received: by platinum.linux.pl (Postfix, from userid 87) id EF21247E11; Sun, 27 Jan 2013 10:03:05 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl X-Spam-Level: X-Spam-Status: No, score=-1.4 required=3.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.3.2 Received: from [10.255.0.2] (unknown [83.151.38.73]) by platinum.linux.pl (Postfix) with ESMTPA id 5410947DE6 for ; Sun, 27 Jan 2013 10:03:03 +0100 (CET) Message-ID: <5104ED41.8020800@platinum.linux.pl> Date: Sun, 27 Jan 2013 10:02:57 +0100 From: Adam Nowacki User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: ZFS slackspace, grepping it for data References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 09:03:14 -0000 On 2013-01-27 09:36, grarpamp wrote: > Say there's a 100GB zpool over a single vdev (one drive). > It's got a few datasets carved out of it. > How best to stroll through only the 10GB of slackspace > (aka: df 'Avail') that is present? > I tried making a zvol out of it but only got 10mb of zeros, > which makes sense because zfs isn't managing anything > written there in that empty zvol yet. > I could troll the entire drive, but that's 10x the data and > I don't really want the current 90gb of data in the results. > There is zdb -R, but I don't know the offsets of the slack, > unless they are somehow tied to the pathname hierarchy. > Any ideas? zdb -mmm pool_name for on-disk offset add 0x400000 If i remember correctly. 
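A minimal sketch of that arithmetic, run from /bin/sh (the $(( )) arithmetic is not csh syntax), assuming the pool's only vdev is the whole disk at /dev/ada0 and that the 0x400000 front-label offset is right; the segment offset 0x1a0000000 and size 0x8000 are made-up values standing in for one free segment taken from the zdb -mmm listing:

   # zdb -mmm tank
   # dd if=/dev/ada0 bs=512 skip=$(( (0x1a0000000 + 0x400000) / 512 )) \
       count=$(( 0x8000 / 512 )) 2>/dev/null | strings | grep -i 'pattern'

If the vdev sits on a partition rather than the raw disk, point dd at the partition device instead; the 0x400000 offset is then relative to the start of that partition.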
From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 10:36:14 2013 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C2CF3901; Sun, 27 Jan 2013 10:36:14 +0000 (UTC) (envelope-from uqs@FreeBSD.org) Received: from acme.spoerlein.net (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by mx1.freebsd.org (Postfix) with ESMTP id 4CD9DA06; Sun, 27 Jan 2013 10:36:14 +0000 (UTC) Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by acme.spoerlein.net (8.14.6/8.14.6) with ESMTP id r0RAaC2Y099978 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Sun, 27 Jan 2013 11:36:12 +0100 (CET) (envelope-from uqs@FreeBSD.org) Date: Sun, 27 Jan 2013 11:36:12 +0100 From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= To: current@FreeBSD.org, fs@FreeBSD.org Subject: Zpool surgery Message-ID: <20130127103612.GB38645@acme.spoerlein.net> Mail-Followup-To: current@FreeBSD.org, fs@FreeBSD.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 10:36:14 -0000 Hey all, I have a slight problem with transplanting a zpool, maybe this is not possible the way I like to do it, maybe I need to fuzz some identifiers... I want to transplant my old zpool tank from a 1TB drive to a new 2TB drive, but *not* use dd(1) or any other cloning mechanism, as the pool was very full very often and is surely severely fragmented. So, I have tank (the old one), the new one, let's call it tank' and then there's the archive pool where snapshots from tank are sent to, and these should now come from tank' in the future. I have: tank -> sending snapshots to archive I want: tank' -> sending snapshots to archive Ideally I would want archive to not even know that tank and tank' are different, so as to not have to send a full snapshot again, but continue the incremental snapshots. So I did zfs send -R tank | ssh otherhost "zfs recv -d tank" and that worked well, this contained a snapshot A that was also already on archive. Then I made a final snapshot B on tank, before turning down that pool and sent it to tank' as well. Now I have snapshot A on tank, tank' and archive and they are virtually identical. I have snapshot B on tank and tank' and would like to send this from tank' to archive, but it complains: cannot receive incremental stream: most recent snapshot of archive does not match incremental source Is there a way to tweak the identity of tank' to be *really* the same as tank, so that archive can accept that incremental stream? Or should I use dd(1) after all to transplant tank to tank'? My other option would be to turn on dedup on archive and send another full stream of tank', 99.9% of which would hopefully be deduped and not consume precious space on archive. Any ideas? 
Cheers, Uli From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 13:01:42 2013 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id CF35BFA8 for ; Sun, 27 Jan 2013 13:01:42 +0000 (UTC) (envelope-from nowakpl@platinum.linux.pl) Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4]) by mx1.freebsd.org (Postfix) with ESMTP id 94985F4A for ; Sun, 27 Jan 2013 13:01:42 +0000 (UTC) Received: by platinum.linux.pl (Postfix, from userid 87) id 6918947E11; Sun, 27 Jan 2013 14:01:40 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl X-Spam-Level: X-Spam-Status: No, score=-1.4 required=3.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.3.2 Received: from [10.255.0.2] (unknown [83.151.38.73]) by platinum.linux.pl (Postfix) with ESMTPA id E537F47DE6 for ; Sun, 27 Jan 2013 14:01:39 +0100 (CET) Message-ID: <5105252D.6060502@platinum.linux.pl> Date: Sun, 27 Jan 2013 14:01:33 +0100 From: Adam Nowacki User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 MIME-Version: 1.0 To: fs@FreeBSD.org Subject: RAID-Z wasted space - asize roundups to nparity +1 Content-Type: text/plain; charset=ISO-8859-2; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 13:01:42 -0000 I've just found something very weird in the ZFS code. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:504 in HEAD Can someone explain the reason behind this line of code? What it does is align on-disk record size to a multiple of number of parity disks + 1 ... this really doesn't make any sense. So far as I can tell those extra sectors are just padding - completely unused. For the array I'm using this results in 4.8% of wasted disk space - 1.7TB. It's a 12x 3TB disk RAID-Z2. From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 13:48:11 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id A2B037F3 for ; Sun, 27 Jan 2013 13:48:11 +0000 (UTC) (envelope-from pawel@dawidek.net) Received: from mail.dawidek.net (garage.dawidek.net [91.121.88.72]) by mx1.freebsd.org (Postfix) with ESMTP id 6FF2FE4 for ; Sun, 27 Jan 2013 13:48:10 +0000 (UTC) Received: from localhost (89-73-195-149.dynamic.chello.pl [89.73.195.149]) by mail.dawidek.net (Postfix) with ESMTPSA id 12CC2B0A; Sun, 27 Jan 2013 14:45:31 +0100 (CET) Date: Sun, 27 Jan 2013 14:48:46 +0100 From: Pawel Jakub Dawidek To: Laurence Gill Subject: Re: HAST performance overheads? 
Message-ID: <20130127134845.GC1346@garage.freebsd.pl> References: <20130125121044.1afac72e@googlemail.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="WplhKdTI2c8ulnbP" Content-Disposition: inline In-Reply-To: <20130125121044.1afac72e@googlemail.com> X-OS: FreeBSD 10.0-CURRENT amd64 User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 13:48:11 -0000 --WplhKdTI2c8ulnbP Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Jan 25, 2013 at 12:10:44PM +0000, Laurence Gill wrote: > If I create ZFS raidz2 on these... >=20 > - # zpool create pool raidz2 da0 da1 da2 da3 da4 da5 >=20 > Then run a dd test, a sample output is... >=20 > - # dd if=3D/dev/zero of=3Dtest.dat bs=3D1M count=3D1024 > 1073741824 bytes transferred in 7.689634 secs (139634974 bytes/sec) >=20 > - # dd if=3D/dev/zero of=3Dtest.dat bs=3D16k count=3D65535 > 1073725440 bytes transferred in 1.909157 secs (562408130 bytes/sec) >=20 > This is much faster than compared to running hast, I would expect an > overhead, but not this much. For example: >=20 > - # hastctl create disk0/disk1/disk2/disk3/disk4/disk5 > - # hastctl role primary all > - # zpool create pool raidz2 disk0 disk1 disk2 disk3 disk4 disk5 >=20 > Run a dd test, and the speed is... >=20 > - # dd if=3D/dev/zero of=3Dtest.dat bs=3D1M count=3D1024 > 1073741824 bytes transferred in 40.908153 secs (26247624 bytes/sec) >=20 > - # dd if=3D/dev/zero of=3Dtest.dat bs=3D16k count=3D65535 > 1073725440 bytes transferred in 42.017997 secs (25553942 bytes/sec) Let's try to test one step at a time. Can you try to compare sequential performance of regular disk vs. HAST with no secondary configured? By no secondary configured I mean 'remote' set to 'none'. Just do: # dd if=3D/dev/zero of=3D/dev/da0 bs=3D1m count=3D10240 then configure HAST and: # dd if=3D/dev/zero of=3D/dev/hast/disk0 bs=3D1m count=3D10240 Which FreeBSD version is it? PS. Your ZFS tests are pretty meaningless, because it is possible that everything will end up in memory. I'm sure this is what happens in 'bs=3D16k count=3D65535' case. Let try raw providers first. --=20 Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! 
http://tupytaj.pl --WplhKdTI2c8ulnbP Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlEFMD0ACgkQForvXbEpPzRMigCglS8ZP9RggVl0MfVk+A25xgd2 29wAnigH5gA4RXxKI/4XLfKT8sW9eoPP =D2zj -----END PGP SIGNATURE----- --WplhKdTI2c8ulnbP-- From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 14:00:27 2013 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id AD3958F3; Sun, 27 Jan 2013 14:00:27 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from smtprelay02.ispgateway.de (smtprelay02.ispgateway.de [80.67.18.14]) by mx1.freebsd.org (Postfix) with ESMTP id 37874132; Sun, 27 Jan 2013 14:00:26 +0000 (UTC) Received: from [78.35.163.65] (helo=fabiankeil.de) by smtprelay02.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128) (Exim 4.68) (envelope-from ) id 1TzSmQ-0004Q3-NA; Sun, 27 Jan 2013 15:00:18 +0100 Date: Sun, 27 Jan 2013 14:56:01 +0100 From: Fabian Keil To: Ulrich =?UTF-8?B?U3DDtnJsZWlu?= Subject: Re: Zpool surgery Message-ID: <20130127145601.7f650d3c@fabiankeil.de> In-Reply-To: <20130127103612.GB38645@acme.spoerlein.net> References: <20130127103612.GB38645@acme.spoerlein.net> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/FJX6_h0WkAB0cAJZ1ipHtOB"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 Cc: current@FreeBSD.org, fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 14:00:27 -0000 --Sig_/FJX6_h0WkAB0cAJZ1ipHtOB Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Ulrich Sp=C3=B6rlein wrote: > I have a slight problem with transplanting a zpool, maybe this is not > possible the way I like to do it, maybe I need to fuzz some > identifiers... >=20 > I want to transplant my old zpool tank from a 1TB drive to a new 2TB > drive, but *not* use dd(1) or any other cloning mechanism, as the pool > was very full very often and is surely severely fragmented. >=20 > So, I have tank (the old one), the new one, let's call it tank' and > then there's the archive pool where snapshots from tank are sent to, and > these should now come from tank' in the future. >=20 > I have: > tank -> sending snapshots to archive >=20 > I want: > tank' -> sending snapshots to archive >=20 > Ideally I would want archive to not even know that tank and tank' are > different, so as to not have to send a full snapshot again, but > continue the incremental snapshots. >=20 > So I did zfs send -R tank | ssh otherhost "zfs recv -d tank" and that > worked well, this contained a snapshot A that was also already on > archive. Then I made a final snapshot B on tank, before turning down that > pool and sent it to tank' as well. >=20 > Now I have snapshot A on tank, tank' and archive and they are virtually > identical. I have snapshot B on tank and tank' and would like to send > this from tank' to archive, but it complains: >=20 > cannot receive incremental stream: most recent snapshot of archive does > not match incremental source In general this should work, so I'd suggest that you double check that you are indeed sending the correct incremental. > Is there a way to tweak the identity of tank' to be *really* the same as > tank, so that archive can accept that incremental stream? 
Or should I > use dd(1) after all to transplant tank to tank'? My other option would > be to turn on dedup on archive and send another full stream of tank', > 99.9% of which would hopefully be deduped and not consume precious space > on archive. The pools don't have to be the same. I wouldn't consider dedup as you'll have to recreate the pool if it turns out the the dedup performance is pathetic. On a system that hasn't been created with dedup in mind that seems rather likely. > Any ideas? Your whole procedure seems a bit complicated to me. Why don't you use "zpool replace"? Fabian --Sig_/FJX6_h0WkAB0cAJZ1ipHtOB Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlEFMfgACgkQBYqIVf93VJ11YQCgst43rQ0fEPedB1gaEUIocoQS I/IAni9cEfESXBY5DZOO+mJ44csGHkYN =nniE -----END PGP SIGNATURE----- --Sig_/FJX6_h0WkAB0cAJZ1ipHtOB-- From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 14:31:27 2013 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 6528BEC3; Sun, 27 Jan 2013 14:31:27 +0000 (UTC) (envelope-from prvs=1739a0aae4=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id B66D4258; Sun, 27 Jan 2013 14:31:25 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001877953.msg; Sun, 27 Jan 2013 14:31:18 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Sun, 27 Jan 2013 14:31:18 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1739a0aae4=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> From: "Steven Hartland" To: =?iso-8859-1?Q?Ulrich_Sp=F6rlein?= , , References: <20130127103612.GB38645@acme.spoerlein.net> Subject: Re: Zpool surgery Date: Sun, 27 Jan 2013 14:31:56 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 8bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 14:31:27 -0000 ----- Original Message ----- From: "Ulrich Spörlein" > I have a slight problem with transplanting a zpool, maybe this is not > possible the way I like to do it, maybe I need to fuzz some > identifiers... > > I want to transplant my old zpool tank from a 1TB drive to a new 2TB > drive, but *not* use dd(1) or any other cloning mechanism, as the pool > was very full very often and is surely severely fragmented. > Cant you just drop the disk in the original machine, set it as a mirror then once the mirror process has completed break the mirror and remove the 1TB disk. If this is a boot disk don't forget to set the boot block as well. Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. 
In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 14:54:28 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id E06D1738; Sun, 27 Jan 2013 14:54:28 +0000 (UTC) (envelope-from utisoft@gmail.com) Received: from mail-ie0-x22f.google.com (ie-in-x022f.1e100.net [IPv6:2607:f8b0:4001:c03::22f]) by mx1.freebsd.org (Postfix) with ESMTP id 748A5348; Sun, 27 Jan 2013 14:54:28 +0000 (UTC) Received: by mail-ie0-f175.google.com with SMTP id c12so9231ieb.20 for ; Sun, 27 Jan 2013 06:54:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=+pKTfnRqzbWS09RCaOG86XD13koGRYnf2xF/rpCzF0A=; b=mpaAzVheFC3qWalXxCCrGMuIIFGN6EEd+Tybe9VwGaJ1hfzWDOsp8XL2dNNK25g88b 0CpPVENQO/qyIW3cBFH4XdhHR2w1CzllCH/IXJv+mDCsZaKjcBacieTHolK/i6b5YvGL hB6Wy3OetxbOwrGMMd4sw0NTMnXK4Hw+Mc2yXBflff616eyUGs6La8dew5KjdVrYWmHu fKBCbQ2TPEw1Tvte3A3OCUTK9mnl7Rpo/G8bkjHpwXv018hcXA9H2KHGsGhX/4gV9yyk 3d+DPm/noKY3FluVNWlq1a/x+Vw39eVMU4CHhvd7tPnSdhjd0wj+dhFX33aSfg5EL2+T C4vA== MIME-Version: 1.0 X-Received: by 10.50.214.10 with SMTP id nw10mr3061777igc.15.1359298468005; Sun, 27 Jan 2013 06:54:28 -0800 (PST) Received: by 10.64.16.73 with HTTP; Sun, 27 Jan 2013 06:54:27 -0800 (PST) Received: by 10.64.16.73 with HTTP; Sun, 27 Jan 2013 06:54:27 -0800 (PST) In-Reply-To: <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> References: <20130127103612.GB38645@acme.spoerlein.net> <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> Date: Sun, 27 Jan 2013 14:54:27 +0000 Message-ID: Subject: Re: Zpool surgery From: Chris Rees To: Steven Hartland Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: current@freebsd.org, fs@freebsd.org, =?ISO-8859-1?Q?Ulrich_Sp=F6rlein?= X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 14:54:29 -0000 On 27 Jan 2013 14:31, "Steven Hartland" wrote: > > ----- Original Message ----- From: "Ulrich Sp=F6rlein" > > >> I have a slight problem with transplanting a zpool, maybe this is not >> possible the way I like to do it, maybe I need to fuzz some >> identifiers... >> >> I want to transplant my old zpool tank from a 1TB drive to a new 2TB >> drive, but *not* use dd(1) or any other cloning mechanism, as the pool >> was very full very often and is surely severely fragmented. >> > > Cant you just drop the disk in the original machine, set it as a mirror > then once the mirror process has completed break the mirror and remove > the 1TB disk. > > If this is a boot disk don't forget to set the boot block as well. I managed to replace a drive this way without even rebooting. I believe it's the same as a zpool replace. 
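For reference, a minimal sketch of that attach/resilver/detach route; the device names are hypothetical (ada1 for the old 1TB disk, ada2 for the new 2TB one) and the pool is assumed to be a plain single-disk vdev named tank:

   # zpool attach tank ada1 ada2
   ... wait for the resilver to complete (watch zpool status tank) ...
   # zpool detach tank ada1
   # zpool set autoexpand=on tank
   # zpool online -e tank ada2

The last two steps let the pool grow into the extra space of the larger disk. If it is a boot disk, also install the boot code on the new disk (e.g. with gpart bootcode), as Steven notes. Keep in mind that a mirror resilver preserves the existing on-disk block layout, so, as pointed out later in the thread, it carries the old pool's fragmentation over to the new disk.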
Chris From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 14:56:19 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 250C1A41; Sun, 27 Jan 2013 14:56:19 +0000 (UTC) (envelope-from prvs=1739a0aae4=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 976F1382; Sun, 27 Jan 2013 14:56:18 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001878220.msg; Sun, 27 Jan 2013 14:56:16 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Sun, 27 Jan 2013 14:56:16 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1739a0aae4=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: <16B555759C2041ED8185DF478193A59D@multiplay.co.uk> From: "Steven Hartland" To: "Vladislav Prodan" References: <13391.1359029978.3957795939058384896@ffe16.ukr.net> <221B307551154F489452F89E304CA5F7@multiplay.co.uk> <93308.1359297551.14145052969567453184@ffe15.ukr.net> Subject: Re: Re[2]: AHCI timeout when using ZFS + AIO + NCQ Date: Sun, 27 Jan 2013 14:56:52 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="Windows-1251"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 14:56:19 -0000 ----- Original Message ----- From: "Vladislav Prodan" >> Is it always the same disk, of so replace it SMART helps identify issues >> but doesn't tell you 100% there's no problem. > > > Now it has fallen off a different HDD - ada0. > I'm 99% sure that MHDD will not find problems in HDD - ada0 and ada2. > I still have three servers with similar chipsets that have similar problems > with blade ahci times out. I notice your disks are connecting at SATA 3.x, which rings bells. We had a very similar issue on a new Supermicro machine here and after much testing we proved to our satisfaction that the problem was the HW. Essentially the combination of SATA 3 speeds the midplane / backplane degraded the connection between the MB and HDD enough to cause the disks to randomly drop when under load. If we connected the disks directly to the MB with SATA cables the problem went away. In the end we had midplanes changed from an AHCI pass-through to active LSI controller. So if you have any sort of midplane / backplane connecting your disks try connecting them direct to the MB / controller via known SATA 3.x compliant cables and see if that stops the drops. Another test you can do is to force the disks to connect at SATA 2.x this also fixed it in our case, but wasn't something we wanted to put into production hence the controller swap. To force SATA 2 speeds you can use the following in /boot/loader.conf where 'X' is disk identifier e.g. for ada0 X = 0:- hint.ahcich.X.sata_rev=2 Hope this helps. Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. 
In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 14:56:52 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id BC70FBB8; Sun, 27 Jan 2013 14:56:52 +0000 (UTC) (envelope-from universite@ukr.net) Received: from ffe15.ukr.net (ffe15.ukr.net [195.214.192.50]) by mx1.freebsd.org (Postfix) with ESMTP id 44A9939A; Sun, 27 Jan 2013 14:56:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net; s=ffe; h=Date:Message-Id:From:To:References:In-Reply-To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version; bh=BJYHeKR7eBEVXWimtxiFQHBLwTNZ7ccMMCeS+OY67SI=; b=tT/t+Tzvfv2tHCj10pm29lc8jWiEiPQQeDvc4vUpp9k+P5Cy4vxy75XYr2wt1DWHKMPWdItBCoDrWYbm4f2EOTyy+2yAFk5ih22MqegzTJSSGQIcuuvFtVjGAkQfI31ZlDwe6ohRGikCgki6of0NnQWid+6Ypxo3jMWSam2D3F0=; Received: from mail by ffe15.ukr.net with local ID 1TzTO3-000OUJ-Cy ; Sun, 27 Jan 2013 16:39:11 +0200 MIME-Version: 1.0 Content-Disposition: inline Content-Transfer-Encoding: binary Content-Type: text/plain; charset="windows-1251" Subject: Re[2]: AHCI timeout when using ZFS + AIO + NCQ In-Reply-To: <221B307551154F489452F89E304CA5F7@multiplay.co.uk> References: <13391.1359029978.3957795939058384896@ffe16.ukr.net> <221B307551154F489452F89E304CA5F7@multiplay.co.uk> To: "Steven Hartland" From: "Vladislav Prodan" X-Mailer: freemail.ukr.net 4.0 Message-Id: <93308.1359297551.14145052969567453184@ffe15.ukr.net> X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0 Date: Sun, 27 Jan 2013 16:39:11 +0200 Cc: current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 14:56:52 -0000 > Is it always the same disk, of so replace it SMART helps identify issues > but doesn't tell you 100% there's no problem. Now it has fallen off a different HDD - ada0. I'm 99% sure that MHDD will not find problems in HDD - ada0 and ada2. I still have three servers with similar chipsets that have similar problems with blade ahci times out. > ----- Original Message ----- > From: "Vladislav Prodan" > To: > Cc: > Sent: Thursday, January 24, 2013 12:19 PM > Subject: AHCI timeout when using ZFS + AIO + NCQ > > > >I have the server: > > > > FreeBSD 9.1-PRERELEASE #0: Wed Jul 25 01:40:56 EEST 2012 > > > > Jan 24 12:53:01 vesuvius kernel: atapci0: port > > 0xc040-0xc047,0xc030-0xc033,0xc020-0xc027,0xc010-0xc013,0xc000-0xc00f mem 0xfe210000-0xfe2101ff irq 51 at device 0.0 on pci3 > > ... > > Jan 24 12:53:01 vesuvius kernel: ahci0: port > > 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem 0xfe307000-0xfe3073ff irq 19 at device 17.0 on pci0 > > Jan 24 12:53:01 vesuvius kernel: ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported > > ... 
> > Jan 24 12:53:01 vesuvius kernel: ada2 at ahcich2 bus 0 scbus4 target 0 lun 0 > > Jan 24 12:53:01 vesuvius kernel: ada2: ATA-8 SATA 3.x device > > Jan 24 12:53:01 vesuvius kernel: ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) > > Jan 24 12:53:01 vesuvius kernel: ada2: Command Queueing enabled > > Jan 24 12:53:01 vesuvius kernel: ada2: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C) > > Jan 24 12:53:01 vesuvius kernel: ada2: Previously was known as ad12 > > ... > > I use 4 HDD in RAID10 via ZFS. > > > > With a very irregular intervals fall off HDD drives. As a result, the server stops. > > > > Jan 24 06:48:06 vesuvius kernel: ahcich2: Timeout on slot 6 port 0 > > Jan 24 06:48:06 vesuvius kernel: ahcich2: is 00000000 cs 00000000 ss 000000c0 rs 000000c0 tfd 40 serr 00000000 cmd 0000e817 > > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 4c 4e 1e 40 68 00 00 01 00 00 > > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout > > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): Retrying command > > Jan 24 06:51:11 vesuvius kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080) > > Jan 24 06:51:11 vesuvius kernel: ahcich2: Timeout on slot 8 port 0 > > Jan 24 06:51:11 vesuvius kernel: ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 00 serr 00000000 cmd 0000e817 > > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 > > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout > > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked > > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 > > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 > > Jan 24 06:51:11 vesuvius kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080) > > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 > > Jan 24 06:51:11 vesuvius kernel: ahcich2: Timeout on slot 8 port 0 > > Jan 24 06:51:11 vesuvius kernel: ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 00 serr 00000000 cmd 0000e817 > > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 > > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout > > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked > > Jan 24 06:51:11 vesuvius kernel: swap_pager: I/O error - pagein failed; blkno 4227133,size 8192, error 6 > > Jan 24 06:51:11 vesuvius kernel: (ada2:(pass2:vm_fault: pager read error, pid 1943 (named) > > Jan 24 06:51:11 vesuvius kernel: ahcich2:0:ahcich2:0:0:0:0): lost device > > Jan 24 06:51:11 vesuvius kernel: 0): passdevgonecb: devfs entry is gone > > Jan 24 06:51:11 vesuvius kernel: pid 1943 (named), uid 53: exited on signal 11 > > ... > > > > Helps only restart by pressing Power. > > Judging by the state of SMART, HDD have no problems. SATA data cable changed. > > > > > > I found a similar problem: > > > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html > > PR: amd64/165547: NVIDIA MCP67 AHCI SATA controller timeout > > > > -- > > Vladislav V. 
Prodan > > System & Network Administrator > > http://support.od.ua > > +380 67 4584408, +380 99 4060508 > > VVP88-RIPE -- Vladislav V. Prodan System & Network Administrator http://support.od.ua +380 67 4584408, +380 99 4060508 VVP88-RIPE From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 15:29:05 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A3FE55B8; Sun, 27 Jan 2013 15:29:05 +0000 (UTC) (envelope-from universite@ukr.net) Received: from ffe11.ukr.net (ffe11.ukr.net [195.214.192.31]) by mx1.freebsd.org (Postfix) with ESMTP id 5511A734; Sun, 27 Jan 2013 15:29:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net; s=ffe; h=Date:Message-Id:From:To:References:In-Reply-To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version; bh=KA3pILGNPBJkWcpAOJAYYa+9vxKGxcXhrIZ+SsImMNo=; b=C/fyFj5QdF2pqtrKTZinJdNeCy1WcGK2v1aRtcfK8JlWdZ66n2w1VaslcpA2T3rskP6ez496vmWq14Y/kIdfYMbfj1M6GmyQg5Q431bDrf99VTN7cpJWdeqjLKhHPPTsh2UBMmdj0ASbG+X/Sv8Z+KsSFo4rNlE7HirG30yUXxY=; Received: from mail by ffe11.ukr.net with local ID 1TzTvB-000Idu-Kk ; Sun, 27 Jan 2013 17:13:25 +0200 MIME-Version: 1.0 Content-Disposition: inline Content-Transfer-Encoding: binary Content-Type: text/plain; charset="windows-1251" Subject: Re[2]: Re[2]: AHCI timeout when using ZFS + AIO + NCQ In-Reply-To: <16B555759C2041ED8185DF478193A59D@multiplay.co.uk> References: <16B555759C2041ED8185DF478193A59D@multiplay.co.uk> <93308.1359297551.14145052969567453184@ffe15.ukr.net> <13391.1359029978.3957795939058384896@ffe16.ukr.net> <221B307551154F489452F89E304CA5F7@multiplay.co.uk> To: "Steven Hartland" From: "Vladislav Prodan" X-Mailer: freemail.ukr.net 4.0 Message-Id: <70362.1359299605.3196836531757973504@ffe11.ukr.net> X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0 Date: Sun, 27 Jan 2013 17:13:25 +0200 Cc: current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 15:29:05 -0000 > ----- Original Message ----- > From: "Vladislav Prodan" > > >> Is it always the same disk, of so replace it SMART helps identify issues > >> but doesn't tell you 100% there's no problem. > > > > > > Now it has fallen off a different HDD - ada0. > > I'm 99% sure that MHDD will not find problems in HDD - ada0 and ada2. > > I still have three servers with similar chipsets that have similar problems > > with blade ahci times out. > > I notice your disks are connecting at SATA 3.x, which rings bells. We had > a very similar issue on a new Supermicro machine here and after much > testing we proved to our satisfaction that the problem was the HW. I have a motherboard ASUS M5A97 PRO http://www.asus.com/Motherboard/M5A97_PRO/#specifications Has replacement SATA data cables. Putting hard RAID controller does not guarantee data recovery at his death. > Essentially the combination of SATA 3 speeds the midplane / backplane > degraded the connection between the MB and HDD enough to cause > the disks to randomly drop when under load. > > If we connected the disks directly to the MB with SATA cables the > problem went away. In the end we had midplanes changed from an > AHCI pass-through to active LSI controller. 
> > So if you have any sort of midplane / backplane connecting your disks > try connecting them direct to the MB / controller via known SATA 3.x > compliant cables and see if that stops the drops. > > Another test you can do is to force the disks to connect at SATA 2.x > this also fixed it in our case, but wasn't something we wanted to > put into production hence the controller swap. > > To force SATA 2 speeds you can use the following in /boot/loader.conf > where 'X' is disk identifier e.g. for ada0 X = 0:- > hint.ahcich.X.sata_rev=2 > > Hope this helps. > > Regards > Steve > -- Vladislav V. Prodan System & Network Administrator http://support.od.ua +380 67 4584408, +380 99 4060508 VVP88-RIPE From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 18:44:04 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id ADC8F485; Sun, 27 Jan 2013 18:44:04 +0000 (UTC) (envelope-from prvs=1739a0aae4=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 17EA3FC1; Sun, 27 Jan 2013 18:44:03 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001881137.msg; Sun, 27 Jan 2013 18:44:01 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Sun, 27 Jan 2013 18:44:01 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1739a0aae4=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: <917933DB5C9A490D93A739058C2507A1@multiplay.co.uk> From: "Steven Hartland" To: "Vladislav Prodan" References: <16B555759C2041ED8185DF478193A59D@multiplay.co.uk> <93308.1359297551.14145052969567453184@ffe15.ukr.net> <13391.1359029978.3957795939058384896@ffe16.ukr.net> <221B307551154F489452F89E304CA5F7@multiplay.co.uk> <70362.1359299605.3196836531757973504@ffe11.ukr.net> Subject: Re: AHCI timeout when using ZFS + AIO + NCQ Date: Sun, 27 Jan 2013 18:44:37 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 18:44:04 -0000 ----- Original Message ----- From: "Vladislav Prodan" To: "Steven Hartland" Cc: ; Sent: Sunday, January 27, 2013 3:13 PM Subject: Re[2]: Re[2]: AHCI timeout when using ZFS + AIO + NCQ > > >> ----- Original Message ----- >> From: "Vladislav Prodan" >> >> >> Is it always the same disk, of so replace it SMART helps identify issues >> >> but doesn't tell you 100% there's no problem. >> > >> > >> > Now it has fallen off a different HDD - ada0. >> > I'm 99% sure that MHDD will not find problems in HDD - ada0 and ada2. >> > I still have three servers with similar chipsets that have similar problems >> > with blade ahci times out. >> >> I notice your disks are connecting at SATA 3.x, which rings bells. We had >> a very similar issue on a new Supermicro machine here and after much >> testing we proved to our satisfaction that the problem was the HW. 
> > > I have a motherboard ASUS M5A97 PRO > http://www.asus.com/Motherboard/M5A97_PRO/#specifications > Has replacement SATA data cables. > Putting hard RAID controller does not guarantee data recovery at his death. Not sure what that has to do with cable / track lengths via things like a backplane? Do you or do you not have a hotswap backplane? >> Essentially the combination of SATA 3 speeds the midplane / backplane >> degraded the connection between the MB and HDD enough to cause >> the disks to randomly drop when under load. >> >> If we connected the disks directly to the MB with SATA cables the >> problem went away. In the end we had midplanes changed from an >> AHCI pass-through to active LSI controller. >> >> So if you have any sort of midplane / backplane connecting your disks >> try connecting them direct to the MB / controller via known SATA 3.x >> compliant cables and see if that stops the drops. >> >> Another test you can do is to force the disks to connect at SATA 2.x >> this also fixed it in our case, but wasn't something we wanted to >> put into production hence the controller swap. >> >> To force SATA 2 speeds you can use the following in /boot/loader.conf >> where 'X' is disk identifier e.g. for ada0 X = 0:- >> hint.ahcich.X.sata_rev=2 This is still worth trying as it could still indicate a problem with your controller, cables or disks. Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. 
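To make that concrete, here is what /boot/loader.conf would contain to cap all six ports of the ahci0 controller from the log at SATA 2 speeds; the channel numbers 0 through 5 are an assumption, so check dmesg for the ahcichN instances the disks actually attach to:

   hint.ahcich.0.sata_rev=2
   hint.ahcich.1.sata_rev=2
   hint.ahcich.2.sata_rev=2
   hint.ahcich.3.sata_rev=2
   hint.ahcich.4.sata_rev=2
   hint.ahcich.5.sata_rev=2

After a reboot the ada devices should report 300.000MB/s transfers (SATA 2.x) in dmesg instead of the 600.000MB/s (SATA 3.x) shown earlier; if the timeouts disappear at the lower link speed, that points at the link (cables, backplane, controller) rather than the disks themselves.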
From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 19:02:08 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 5E7EE6BF; Sun, 27 Jan 2013 19:02:08 +0000 (UTC) (envelope-from universite@ukr.net) Received: from ffe16.ukr.net (ffe16.ukr.net [195.214.192.51]) by mx1.freebsd.org (Postfix) with ESMTP id 06DD7B1; Sun, 27 Jan 2013 19:02:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net; s=ffe; h=Date:Message-Id:From:To:References:In-Reply-To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version; bh=sgAv80HRGHZrDFfsP8s1dqUbMJPy4blkmHgI4So/RBw=; b=TBW+YxoNuAsbRBw0DTEacjNpHoqgd2/MiI5S12C8LYbpvjyK0JDiIwlIAYq2BJ/URRD6LK0rdE4ssuq66gE/kbB9N84f3JWlBn3se8FzCrEARihxnjm58tGxIbArgLaf6WJ7lkc5r5VQ7hhjPOUGDRxI1MZTLfJjv7oyuXypDno=; Received: from mail by ffe16.ukr.net with local ID 1TzXUN-000IPl-KG ; Sun, 27 Jan 2013 21:01:59 +0200 MIME-Version: 1.0 Content-Disposition: inline Content-Transfer-Encoding: binary Content-Type: text/plain; charset="windows-1251" Subject: Re[2]: AHCI timeout when using ZFS + AIO + NCQ In-Reply-To: <917933DB5C9A490D93A739058C2507A1@multiplay.co.uk> References: <917933DB5C9A490D93A739058C2507A1@multiplay.co.uk> <16B555759C2041ED8185DF478193A59D@multiplay.co.uk> <93308.1359297551.14145052969567453184@ffe15.ukr.net> <13391.1359029978.3957795939058384896@ffe16.ukr.net> <221B307551154F489452F89E304CA5F7@multiplay.co.uk> <70362.1359299605.3196836531757973504@ffe11.ukr.net> To: "Steven Hartland" From: "Vladislav Prodan" X-Mailer: freemail.ukr.net 4.0 Message-Id: <70578.1359313319.18126575192049975296@ffe16.ukr.net> X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0 Date: Sun, 27 Jan 2013 21:01:59 +0200 Cc: current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 19:02:08 -0000 > >> Essentially the combination of SATA 3 speeds the midplane / backplane > >> degraded the connection between the MB and HDD enough to cause > >> the disks to randomly drop when under load. > >> > >> If we connected the disks directly to the MB with SATA cables the > >> problem went away. In the end we had midplanes changed from an > >> AHCI pass-through to active LSI controller. > >> > >> So if you have any sort of midplane / backplane connecting your disks > >> try connecting them direct to the MB / controller via known SATA 3.x > >> compliant cables and see if that stops the drops. > >> > >> Another test you can do is to force the disks to connect at SATA 2.x > >> this also fixed it in our case, but wasn't something we wanted to > >> put into production hence the controller swap. > >> > >> To force SATA 2 speeds you can use the following in /boot/loader.conf > >> where 'X' is disk identifier e.g. for ada0 X = 0:- > >> hint.ahcich.X.sata_rev=2 > > This is still worth trying as it could still indicate a problem > with your controller, cables or disks. > Or, simply disable the ahci kernel module and use only ata? -- Vladislav V. 
Prodan System & Network Administrator http://support.od.ua +380 67 4584408, +380 99 4060508 VVP88-RIPE From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 19:08:08 2013 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 9D773C4A; Sun, 27 Jan 2013 19:08:08 +0000 (UTC) (envelope-from uqs@FreeBSD.org) Received: from acme.spoerlein.net (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by mx1.freebsd.org (Postfix) with ESMTP id 2DFC5127; Sun, 27 Jan 2013 19:08:08 +0000 (UTC) Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by acme.spoerlein.net (8.14.6/8.14.6) with ESMTP id r0RJ86S1009795 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Sun, 27 Jan 2013 20:08:06 +0100 (CET) (envelope-from uqs@FreeBSD.org) Date: Sun, 27 Jan 2013 20:08:06 +0100 From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= To: Fabian Keil Subject: Re: Zpool surgery Message-ID: <20130127190806.GQ35868@acme.spoerlein.net> Mail-Followup-To: Fabian Keil , current@FreeBSD.org, fs@FreeBSD.org References: <20130127103612.GB38645@acme.spoerlein.net> <20130127145601.7f650d3c@fabiankeil.de> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20130127145601.7f650d3c@fabiankeil.de> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: current@FreeBSD.org, fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 19:08:08 -0000 On Sun, 2013-01-27 at 14:56:01 +0100, Fabian Keil wrote: > Ulrich Spörlein wrote: > > > I have a slight problem with transplanting a zpool, maybe this is not > > possible the way I like to do it, maybe I need to fuzz some > > identifiers... > > > > I want to transplant my old zpool tank from a 1TB drive to a new 2TB > > drive, but *not* use dd(1) or any other cloning mechanism, as the pool > > was very full very often and is surely severely fragmented. > > > > So, I have tank (the old one), the new one, let's call it tank' and > > then there's the archive pool where snapshots from tank are sent to, and > > these should now come from tank' in the future. > > > > I have: > > tank -> sending snapshots to archive > > > > I want: > > tank' -> sending snapshots to archive > > > > Ideally I would want archive to not even know that tank and tank' are > > different, so as to not have to send a full snapshot again, but > > continue the incremental snapshots. > > > > So I did zfs send -R tank | ssh otherhost "zfs recv -d tank" and that > > worked well, this contained a snapshot A that was also already on > > archive. Then I made a final snapshot B on tank, before turning down that > > pool and sent it to tank' as well. > > > > Now I have snapshot A on tank, tank' and archive and they are virtually > > identical. I have snapshot B on tank and tank' and would like to send > > this from tank' to archive, but it complains: > > > > cannot receive incremental stream: most recent snapshot of archive does > > not match incremental source > > In general this should work, so I'd suggest that you double check > that you are indeed sending the correct incremental. > > > Is there a way to tweak the identity of tank' to be *really* the same as > > tank, so that archive can accept that incremental stream? Or should I > > use dd(1) after all to transplant tank to tank'? 
My other option would > > be to turn on dedup on archive and send another full stream of tank', > > 99.9% of which would hopefully be deduped and not consume precious space > > on archive. > > The pools don't have to be the same. > > I wouldn't consider dedup as you'll have to recreate the pool if > it turns out the the dedup performance is pathetic. On a system > that hasn't been created with dedup in mind that seems rather > likely. > > > Any ideas? > > Your whole procedure seems a bit complicated to me. > > Why don't you use "zpool replace"? Ehhh, .... "zpool replace", eh? I have to say I didn't know that option was available, but also because this is on a newer machine, I needed some way to do this over the network, so a direct zpool replace is not that easy. I dug out an old ATA-to-USB case and will use that to attach the old tank to the new machine and then have a try at this zpool replace thing. How will that affect the fragmentation level of the new pool? Will the resilver do something sensible wrt. keeping files together for better read-ahead performance? Cheers, Uli From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 19:12:13 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4F7EDEE0; Sun, 27 Jan 2013 19:12:13 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe02.c2i.net [212.247.154.34]) by mx1.freebsd.org (Postfix) with ESMTP id 2BE98155; Sun, 27 Jan 2013 19:12:11 +0000 (UTC) X-T2-Spam-Status: No, hits=-1.0 required=5.0 tests=ALL_TRUSTED Received: from [176.74.213.204] (account mc467741@c2i.net HELO laptop015.hselasky.homeunix.org) by mailfe02.swip.net (CommuniGate Pro SMTP 5.4.4) with ESMTPA id 372818926; Sun, 27 Jan 2013 20:12:10 +0100 From: Hans Petter Selasky To: freebsd-current@freebsd.org Subject: Re: Zpool surgery Date: Sun, 27 Jan 2013 20:13:24 +0100 User-Agent: KMail/1.13.7 (FreeBSD/9.1-STABLE; KDE/4.8.4; amd64; ; ) References: <20130127103612.GB38645@acme.spoerlein.net> <20130127145601.7f650d3c@fabiankeil.de> <20130127190806.GQ35868@acme.spoerlein.net> In-Reply-To: <20130127190806.GQ35868@acme.spoerlein.net> X-Face: ?p&W)c( =?iso-8859-1?q?+80hU=3B=27=7B=2E=245K+zq=7BoC6y=7C=0A=09/D=27an*6mw?=>j'f:eBsex\Gi, Cc: fs@freebsd.org, current@freebsd.org, Ulrich =?utf-8?q?Sp=C3=B6rlein?= X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 19:12:13 -0000 On Sunday 27 January 2013 20:08:06 Ulrich Sp=C3=B6rlein wrote: > I dug out an old ATA-to-USB case and will use that to attach the old > tank to the new machine and then have a try at this zpool replace thing. 
If you are using -current you might want this patch first: http://svnweb.freebsd.org/changeset/base/245995 =2D-HPS From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 20:11:57 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 6B452D00; Sun, 27 Jan 2013 20:11:57 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id 031062F8; Sun, 27 Jan 2013 20:11:56 +0000 (UTC) Received: from server.rulingia.com (c220-239-246-167.belrs5.nsw.optusnet.com.au [220.239.246.167]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id r0RKBltH004480 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Mon, 28 Jan 2013 07:11:47 +1100 (EST) (envelope-from peter@rulingia.com) X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.14.5/8.14.5) with ESMTP id r0RKBgLt092002 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 28 Jan 2013 07:11:42 +1100 (EST) (envelope-from peter@server.rulingia.com) Received: (from peter@localhost) by server.rulingia.com (8.14.5/8.14.5/Submit) id r0RKBffX092001; Mon, 28 Jan 2013 07:11:41 +1100 (EST) (envelope-from peter) Date: Mon, 28 Jan 2013 07:11:40 +1100 From: Peter Jeremy To: Steven Hartland Subject: Re: Zpool surgery Message-ID: <20130127201140.GD29105@server.rulingia.com> References: <20130127103612.GB38645@acme.spoerlein.net> <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="QRj9sO5tAVLaXnSD" Content-Disposition: inline In-Reply-To: <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Cc: current@freebsd.org, fs@freebsd.org, Ulrich =?iso-8859-1?Q?Sp=F6rlein?= X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 20:11:57 -0000 --QRj9sO5tAVLaXnSD Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On 2013-Jan-27 14:31:56 -0000, Steven Hartland wr= ote: >----- Original Message -----=20 >From: "Ulrich Sp=F6rlein" >> I want to transplant my old zpool tank from a 1TB drive to a new 2TB >> drive, but *not* use dd(1) or any other cloning mechanism, as the pool >> was very full very often and is surely severely fragmented. > >Cant you just drop the disk in the original machine, set it as a mirror >then once the mirror process has completed break the mirror and remove >the 1TB disk. That will replicate any fragmentation as well. "zfs send | zfs recv" is the only (current) way to defragment a ZFS pool. 
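For completeness, a minimal sketch of doing the tank -> tank' move itself via send/recv instead of a resilver; the pool name newtank and the snapshot name are placeholders, and the flags mirror what Ulrich already used for the archive copy:

   # zfs snapshot -r tank@migrate
   # zfs send -R tank@migrate | zfs recv -Fdu newtank
   # zpool export tank
   # zpool export newtank
   # zpool import newtank tank

The final import simply gives the new pool the old name. Snapshots replicated with send -R keep their GUIDs, which is what an incremental receive matches on, so the archive host should be able to keep accepting increments taken on the copy regardless of which pool name it carries.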
--=20 Peter Jeremy --QRj9sO5tAVLaXnSD Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlEFifwACgkQ/opHv/APuIeuUACgqCCNXfxYUs6MF9RcFnRvANg3 T+AAnAsdg/RXxe7Y9nCPRFmKWizYzuKB =Y809 -----END PGP SIGNATURE----- --QRj9sO5tAVLaXnSD-- From owner-freebsd-fs@FreeBSD.ORG Sun Jan 27 21:34:22 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 16381190 for ; Sun, 27 Jan 2013 21:34:22 +0000 (UTC) (envelope-from grarpamp@gmail.com) Received: from mail-vb0-f54.google.com (mail-vb0-f54.google.com [209.85.212.54]) by mx1.freebsd.org (Postfix) with ESMTP id BF7AF817 for ; Sun, 27 Jan 2013 21:34:21 +0000 (UTC) Received: by mail-vb0-f54.google.com with SMTP id l1so1467342vba.13 for ; Sun, 27 Jan 2013 13:34:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:content-type; bh=rfX5ThjrzcCQ5Nj9fKH9eJ2RK1KR0C5ZkUxSG+k61QA=; b=ntvQ6JnnzhGD+l2MF+swYgZr/CpoxS5lQeSzAXkB4hdQHJD+hjJQxWHWFyMX2lVn0r h4mA2WGWTnuXtofQWJgypEr7xIwT/XbLRKo9z/EIlvKMJIyBFL6468IpGNHxqk10Datd 3IhKF52npE6ghu3rR3XofucLNIJ9+fJR72JonFmJHMDRkIwzLjoodJKzIchvYZdknpN+ A0PYQ0Xe8U/BW3UC8K1lQGGWMr3h538iH4OLED/x95+DtKDB5vMrcEtQBf+EOkhfYFTs ymTrlYGzGB9avSjLa8Nd5dpJstXWq99kfW7nCMLcaUX/vvZAZEDwvD9hpmywGPwVTtsB fUpg== MIME-Version: 1.0 X-Received: by 10.220.150.136 with SMTP id y8mr12785591vcv.34.1359322461178; Sun, 27 Jan 2013 13:34:21 -0800 (PST) Received: by 10.220.219.79 with HTTP; Sun, 27 Jan 2013 13:34:20 -0800 (PST) In-Reply-To: References: Date: Sun, 27 Jan 2013 16:34:20 -0500 Message-ID: Subject: Re: ZFS slackspace, grepping it for data From: grarpamp To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Jan 2013 21:34:22 -0000 > zdb -mmm pool_name Ahh, I saw this later too, thanks. Seems I've got 425k free ranges to scan among 25k free txg's. This will take a while but it's still a nice feature. I doubt it was meant for this purpose though. More likely for debugging zfs structures and data issues. > for on-disk offset add 0x400000 If i remember correctly. I could check for it with a string search near the head of data. Does that fs to disk offset stay the same throughout the fs? The minimum range size appears to be 4KiB (245k worth), with another 75k at 8KiB and 100k more on up to 32KiB. So not sure yet whether using zdb to collect the slack will perform any worse than supplying the list to dd, or even trying to write some C to avoid the shell overhead and further to read the disk direct. I occaisionally get failed assertions and core dumps with various zdb operations. Is there interest in ticketing them? Assertion failed: (object_count == usedobjs (0x0 == 0x1e33ec)), file /re8/src/cddl/usr.sbin/zdb/../../../cddl/contrib/opensolaris/cmd/zdb/zdb.c, line 1649. 
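For what it's worth, the "supply the list to dd" idea could look roughly like this in /bin/sh; ranges.txt (one "offset size" pair per line, in bytes, massaged out of the zdb -mmm output beforehand) is hypothetical, as are the device name and search pattern, and the 0x400000 label offset is the same assumption as earlier in the thread:

   #!/bin/sh
   # grep every free range of a single-vdev pool for a pattern
   DEV=/dev/ada0
   PATTERN='needle'
   while read off len; do
           echo "scanning offset $off length $len" >&2
           dd if="$DEV" bs=512 skip=$(( (off + 0x400000) / 512 )) \
               count=$(( (len + 511) / 512 )) 2>/dev/null
   done < ranges.txt | strings | grep "$PATTERN"

With roughly 425k mostly 4KiB ranges the per-range dd fork overhead will dominate, so merging adjacent ranges first, or replacing the loop with a small C program doing pread(2) over the same list, would likely be much faster, which matches the concern above about shell overhead.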
From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 06:35:45 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C902ED8C; Mon, 28 Jan 2013 06:35:45 +0000 (UTC) (envelope-from pyunyh@gmail.com) Received: from mail-pb0-f47.google.com (mail-pb0-f47.google.com [209.85.160.47]) by mx1.freebsd.org (Postfix) with ESMTP id 6E955BF2; Mon, 28 Jan 2013 06:35:45 +0000 (UTC) Received: by mail-pb0-f47.google.com with SMTP id rp8so514298pbb.34 for ; Sun, 27 Jan 2013 22:35:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:from:date:to:cc:subject:message-id:reply-to:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=2RacGay4nKiXjswap+LYj/1R3BklExGECrJVUYi2YcE=; b=jfDdnNRfagARNXe8mlBcGP6m59yqAXmAUhck8r3GUBB9aKYCz/AS2sLDQBQLPyTFAi RR2Ha8pOTsd4haaQKclcY/u/eIv/r99tR6CS5Dn4jNx3VKS/LM8U3cd9Q/5CKEVGptQ2 qqQCNGXBz2bcpn408+9hpu7F8CJRK5Ls97wwK7XYofCyTJjQbOhlI59q7udckpA7SmIn ktvckWbbowjY7eqpI7NoE7RcOgTPRuZna7igRLmHQVk7qR2L8SuVSMBduKg233e7JuDz ihjvOEv7Yu1n0INf2iNCcKKp65xNmO6MGZR2CmWNS3KwUFdJlofE/KOqGVVhL/6L+LDA pMBg== X-Received: by 10.66.84.195 with SMTP id b3mr33785573paz.30.1359354939703; Sun, 27 Jan 2013 22:35:39 -0800 (PST) Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249]) by mx.google.com with ESMTPS id x6sm6157347paw.0.2013.01.27.22.35.35 (version=TLSv1 cipher=RC4-SHA bits=128/128); Sun, 27 Jan 2013 22:35:38 -0800 (PST) Received: by pyunyh@gmail.com (sSMTP sendmail emulation); Mon, 28 Jan 2013 15:35:31 +0900 From: YongHyeon PYUN Date: Mon, 28 Jan 2013 15:35:31 +0900 To: Christian Gusenbauer Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory Message-ID: <20130128063531.GC1447@michelle.cdnetworks.com> References: <201301241805.57623.c47g@gmx.at> <20130125043043.GA1429@michelle.cdnetworks.com> <20130125045048.GB1429@michelle.cdnetworks.com> <201301251809.50929.c47g@gmx.at> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201301251809.50929.c47g@gmx.at> User-Agent: Mutt/1.4.2.3i Cc: freebsd-fs@freebsd.org, net@freebsd.org, yongari@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: pyunyh@gmail.com List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 06:35:45 -0000 On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote: > On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote: > > On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote: > > > On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote: > > > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote: > > > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote: > > > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer > wrote: > > > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote: > > > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov > wrote: > > > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian > Gusenbauer wrote: > > > > > > > > > > > Hi! 
> > > > > > > > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the > > > > > > > > > > > panic below if I execute the following commands (as > > > > > > > > > > > single user): > > > > > > > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > > > > # mount -u / > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt > > > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll attach > > > > > > > > > > > the stack trace. > > > > > > > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit > > > > > > > > > > > network, maybe that's the cause for the panic, because > > > > > > > > > > > the bcopy (see stack frame #15) fails. > > > > > > > > > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of > > > > > > > > > > rsize=32768 and mtu 6144, but the machine runs HEAD and em > > > > > > > > > > instead of age. I was unable to reproduce the panic on the > > > > > > > > > > copy of the 5GB file from nfs mount. > > > > > > > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just > > > > > > > > configuring age0 with > > > > > > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > > > > > > > then I can copy all files from the mounted directory without > > > > > > > > any problems, too. So it's probably age0 related? > > > > > > > > > > > > > > From your backtrace and the buffer printout, I see somewhat > > > > > > > strange thing. The buffer data address is 0xffffff8171418000, > > > > > > > while kernel faulted at the attempt to write at > > > > > > > 0xffffff8171413000, which is is lower then the buffer data > > > > > > > pointer, at the attempt to bcopy to the buffer. > > > > > > > > > > > > > > The other data suggests that there were no overflow of the data > > > > > > > from the server response. So it might be that mbuf_len(mp) > > > > > > > returned negative number ? I am not sure is it possible at all. > > > > > > > > > > > > > > Try this debugging patch, please. You need to add INVARIANTS etc > > > > > > > to the kernel config. > > > > > > > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c > > > > > > > b/sys/fs/nfs/nfs_commonsubs.c index efc0786..9a6bda5 100644 > > > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > > > struct uio *uiop, int siz) } > > > > > > > > > > > > > > mbufcp = NFSMTOD(mp, caddr_t); > > > > > > > len = mbuf_len(mp); > > > > > > > > > > > > > > + KASSERT(len > 0, ("len %d", len)); > > > > > > > > > > > > > > } > > > > > > > xfer = (left > len) ? 
len : left; > > > > > > > > > > > > > > #ifdef notdef > > > > > > > > > > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > > > struct uio *uiop, int siz) uiop->uio_resid -= xfer; > > > > > > > > > > > > > > } > > > > > > > if (uiop->uio_iov->iov_len <= siz) { > > > > > > > > > > > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > > > > + uiop->uio_iovcnt)); > > > > > > > > > > > > > > uiop->uio_iovcnt--; > > > > > > > uiop->uio_iov++; > > > > > > > > > > > > > > } else { > > > > > > > > > > > > > > I thought that server have returned too long response, but it > > > > > > > seems to be not the case from your data. Still, I think the > > > > > > > patch below might be due. > > > > > > > > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > > > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio > > > > > > > *uiop, struct ucred *cred, NFSM_DISSECT(tl, u_int32_t *, > > > > > > > NFSX_UNSIGNED); > > > > > > > > > > > > > > eof = fxdr_unsigned(int, *tl); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > - NFSM_STRSIZ(retlen, rsize); > > > > > > > + NFSM_STRSIZ(retlen, len); > > > > > > > > > > > > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > > > > > if (error) > > > > > > > > > > > > > > goto nfsmout; > > > > > > > > > > > > I applied your patches and now I get a > > > > > > > > > > > > panic: len -4 > > > > > > cpuid = 1 > > > > > > KDB: enter: panic > > > > > > Dumping 377 out of 6116 > > > > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > > > > > > > This means that the age driver either produced corrupted mbuf chain, > > > > > or filled wrong negative value into the mbuf len field. I am quite > > > > > certain that the issue is in the driver. > > > > > > > > > > I added the net@ to Cc:, hopefully you could get help there. > > > > > > > > And I've cc'd Pyun who has written most of this driver and is likely > > > > the one most familiar with its handling of jumbo frames. > > > > > > Try attached one and let me know how it goes. > > > Note, I don't have age(4) anymore so it wasn't tested at all. > > > > Sorry, ignore previous patch and use this one(age.diff2) instead. > > Thanks for the patch! I ignored the first and applied only the second one, but > unfortunately that did not change anything. I still get the "panic: len -4" > :-(. Ok, I contacted QAC and got a hint for its descriptor usage and I realized the controller does not work as I initially expected! When I wrote age(4) for the controller, the hardware was available only for a couple of weeks so I may have not enough time to test it. Sorry about that. I'll let you know when experimental patch is available. Due to lack of hardware, it would take more time than it used to be. Thanks for reporting! > > Ciao, > Christian. 
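For anyone following along, a sketch of how a debugging diff like the one quoted above is typically applied and booted with INVARIANTS enabled; the kernel config name and patch file name are made up for the example:

  cd /usr/src
  patch -p1 < /tmp/nfs-debug.diff

  printf 'include GENERIC\nident DEBUG\noptions INVARIANTS\noptions INVARIANT_SUPPORT\n' \
      > sys/amd64/conf/DEBUG

  make buildkernel KERNCONF=DEBUG && make installkernel KERNCONF=DEBUG
  shutdown -r now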
From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 08:58:25 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id EB380A82; Mon, 28 Jan 2013 08:58:24 +0000 (UTC) (envelope-from uqs@FreeBSD.org) Received: from acme.spoerlein.net (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by mx1.freebsd.org (Postfix) with ESMTP id 786A9256; Mon, 28 Jan 2013 08:58:24 +0000 (UTC) Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by acme.spoerlein.net (8.14.6/8.14.6) with ESMTP id r0S8wLP2029200 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Mon, 28 Jan 2013 09:58:22 +0100 (CET) (envelope-from uqs@FreeBSD.org) Date: Mon, 28 Jan 2013 09:58:20 +0100 From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= To: Peter Jeremy Subject: Re: Zpool surgery Message-ID: <20130128085820.GR35868@acme.spoerlein.net> Mail-Followup-To: Peter Jeremy , Steven Hartland , current@freebsd.org, fs@freebsd.org References: <20130127103612.GB38645@acme.spoerlein.net> <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> <20130127201140.GD29105@server.rulingia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20130127201140.GD29105@server.rulingia.com> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 08:58:25 -0000 On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote: > On 2013-Jan-27 14:31:56 -0000, Steven Hartland wrote: > >----- Original Message ----- > >From: "Ulrich Spörlein" > >> I want to transplant my old zpool tank from a 1TB drive to a new 2TB > >> drive, but *not* use dd(1) or any other cloning mechanism, as the pool > >> was very full very often and is surely severely fragmented. > > > >Cant you just drop the disk in the original machine, set it as a mirror > >then once the mirror process has completed break the mirror and remove > >the 1TB disk. > > That will replicate any fragmentation as well. "zfs send | zfs recv" > is the only (current) way to defragment a ZFS pool. But are you then also supposed to be able send incremental snapshots to a third pool from the pool that you just cloned? I did the zpool replace now over night, and it did not remove the old device yet, as it found cksum errors on the pool: root@coyote:~# zpool status -v pool: tank state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. 
see: http://illumos.org/msg/ZFS-8000-8A scan: resilvered 873G in 11h33m with 24 errors on Mon Jan 28 09:45:32 2013 config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 27 replacing-0 ONLINE 0 0 61 da0.eli ONLINE 0 0 61 ada1.eli ONLINE 0 0 61 errors: Permanent errors have been detected in the following files: tank/src@2013-01-17:/.svn/pristine/8e/8ed35772a38e0fec00bc1cbc2f05480f4fd4759b.svn-base tank/src@2013-01-17:/.svn/pristine/4f/4febd82f50bd408f958d4412ceea50cef48fe8f7.svn-base tank/src@2013-01-17:/sys/dev/mvs/mvs_soc.c tank/src@2013-01-17:/secure/usr.bin/openssl/man/pkcs8.1 tank/src@2013-01-17:/.svn/pristine/ab/ab1efecf2c0a8f67162b2ed760772337017c5a64.svn-base tank/src@2013-01-17:/.svn/pristine/90/907580a473b00f09b01815a52251fbdc3e34e8f6.svn-base tank/src@2013-01-17:/sys/dev/agp/agpreg.h tank/src@2013-01-17:/sys/dev/isci/scil/scic_sds_remote_node_context.h tank/src@2013-01-17:/.svn/pristine/a8/a8dfc65edca368c5d2af3d655859f25150795bc5.svn-base tank/src@2013-01-17:/contrib/llvm/utils/TableGen/DAGISelMatcher.cpp tank/src@2013-01-17:/contrib/tcpdump/print-babel.c tank/src@2013-01-17:/.svn/pristine/30/30ef0f53aa09a5185f55f4ecac842dbc13dab8fd.svn-base tank/src@2013-01-17:/.svn/pristine/cb/cb32411a6873621a449b24d9127305b2ee6630e9.svn-base tank/src@2013-01-17:/.svn/pristine/03/030d211b1e95f703f9a61201eed63efdbb8e41c0.svn-base tank/src@2013-01-17:/.svn/pristine/27/27f1181d33434a72308de165c04202b6159d6ac2.svn-base tank/src@2013-01-17:/lib/libpam/modules/pam_exec/pam_exec.c tank/src@2013-01-17:/contrib/llvm/include/llvm/PassSupport.h tank/src@2013-01-17:/.svn/pristine/90/90f818b5f897f26c7b301c1ac2d0ce0d3eaef28d.svn-base tank/src@2013-01-17:/sys/vm/vm_pager.c tank/src@2013-01-17:/.svn/pristine/5e/5e9331052e8c2e0fa5fd8c74c4edb04058e3b95f.svn-base tank/src@2013-01-17:/.svn/pristine/1d/1d5d6e75cfb77e48e4711ddd10148986392c4fae.svn-base tank/src@2013-01-17:/.svn/pristine/c5/c55e964c62ed759089c4bf5e49adf6e49eb59108.svn-base tank/src@2013-01-17:/crypto/openssl/crypto/cms/cms_lcl.h tank/ncvs@2013-01-17:/ports/textproc/uncrustify/distinfo,v Interestingly, these only seem to affect the snapshot, and I'm now wondering if that is the problem why the backup pool did not accept the next incremental snapshot from the new pool. How does the receiving pool known that it has the correct snapshot to store an incremental one anyway? Is there a toplevel checksum, like for git commits? How can I display and compare that? 
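One way to check is to compare the guid property of the snapshots on both pools, since the guid survives send/recv; a sketch, with the backup pool name assumed:

  # On the source pool:
  zfs get -H -o value guid tank/src@2013-01-17
  # On the backup pool (name "backup" is assumed here):
  zfs get -H -o value guid backup/src@2013-01-17
  # An incremental "zfs send -i A B" is only accepted if the destination's
  # newest snapshot carries the same guid as snapshot A on the sender.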
Cheers, Uli From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 11:06:43 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 71370913 for ; Mon, 28 Jan 2013 11:06:43 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id 62FAECCF for ; Mon, 28 Jan 2013 11:06:43 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0SB6hYb034536 for ; Mon, 28 Jan 2013 11:06:43 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0SB6hLk034534 for freebsd-fs@FreeBSD.org; Mon, 28 Jan 2013 11:06:43 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 28 Jan 2013 11:06:43 GMT Message-Id: <201301281106.r0SB6hLk034534@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-fs@FreeBSD.org Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 11:06:43 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/175179 fs [zfs] ZFS may attach wrong device on move o kern/175071 fs [ufs] [panic] softdep_deallocate_dependencies: unrecov o kern/174372 fs [zfs] Pagefault appears to be related to ZFS o kern/174315 fs [zfs] chflags uchg not supported o kern/174310 fs [zfs] root point mounting broken on CURRENT with multi o kern/174279 fs [ufs] UFS2-SU+J journal and filesystem corruption o kern/174060 fs [ext2fs] Ext2FS system crashes (buffer overflow?) 
o kern/173830 fs [zfs] Brain-dead simple change to ZFS error descriptio o kern/173718 fs [zfs] phantom directory in zraid2 pool f kern/173657 fs [nfs] strange UID map with nfsuserd o kern/173363 fs [zfs] [panic] Panic on 'zpool replace' on readonly poo o kern/173136 fs [unionfs] mounting above the NFS read-only share panic o kern/172348 fs [unionfs] umount -f of filesystem in use with readonly o kern/172334 fs [unionfs] unionfs permits recursive union mounts; caus o kern/171626 fs [tmpfs] tmpfs should be noisier when the requested siz o kern/171415 fs [zfs] zfs recv fails with "cannot receive incremental o kern/170945 fs [gpt] disk layout not portable between direct connect o bin/170778 fs [zfs] [panic] FreeBSD panics randomly o kern/170680 fs [nfs] Multiple NFS Client bug in the FreeBSD 7.4-RELEA o kern/170497 fs [xfs][panic] kernel will panic whenever I ls a mounted o kern/169945 fs [zfs] [panic] Kernel panic while importing zpool (afte o kern/169480 fs [zfs] ZFS stalls on heavy I/O o kern/169398 fs [zfs] Can't remove file with permanent error o kern/169339 fs panic while " : > /etc/123" o kern/169319 fs [zfs] zfs resilver can't complete o kern/168947 fs [nfs] [zfs] .zfs/snapshot directory is messed up when o kern/168942 fs [nfs] [hang] nfsd hangs after being restarted (not -HU o kern/168158 fs [zfs] incorrect parsing of sharenfs options in zfs (fs o kern/167979 fs [ufs] DIOCGDINFO ioctl does not work on 8.2 file syste o kern/167977 fs [smbfs] mount_smbfs results are differ when utf-8 or U o kern/167688 fs [fusefs] Incorrect signal handling with direct_io o kern/167685 fs [zfs] ZFS on USB drive prevents shutdown / reboot o kern/167612 fs [portalfs] The portal file system gets stuck inside po o kern/167272 fs [zfs] ZFS Disks reordering causes ZFS to pick the wron o kern/167260 fs [msdosfs] msdosfs disk was mounted the second time whe o kern/167109 fs [zfs] [panic] zfs diff kernel panic Fatal trap 9: gene o kern/167105 fs [nfs] mount_nfs can not handle source exports wiht mor o kern/167067 fs [zfs] [panic] ZFS panics the server o kern/167065 fs [zfs] boot fails when a spare is the boot disk o kern/167048 fs [nfs] [patch] RELEASE-9 crash when using ZFS+NULLFS+NF o kern/166912 fs [ufs] [panic] Panic after converting Softupdates to jo o kern/166851 fs [zfs] [hang] Copying directory from the mounted UFS di o kern/166477 fs [nfs] NFS data corruption. 
o kern/165950 fs [ffs] SU+J and fsck problem o kern/165923 fs [nfs] Writing to NFS-backed mmapped files fails if flu o kern/165521 fs [zfs] [hang] livelock on 1 Gig of RAM with zfs when 31 o kern/165392 fs Multiple mkdir/rmdir fails with errno 31 o kern/165087 fs [unionfs] lock violation in unionfs o kern/164472 fs [ufs] fsck -B panics on particular data inconsistency o kern/164370 fs [zfs] zfs destroy for snapshot fails on i386 and sparc o kern/164261 fs [nullfs] [patch] fix panic with NFS served from NULLFS o kern/164256 fs [zfs] device entry for volume is not created after zfs o kern/164184 fs [ufs] [panic] Kernel panic with ufs_makeinode o kern/163801 fs [md] [request] allow mfsBSD legacy installed in 'swap' o kern/163770 fs [zfs] [hang] LOR between zfs&syncer + vnlru leading to o kern/163501 fs [nfs] NFS exporting a dir and a subdir in that dir to o kern/162944 fs [coda] Coda file system module looks broken in 9.0 o kern/162860 fs [zfs] Cannot share ZFS filesystem to hosts with a hyph o kern/162751 fs [zfs] [panic] kernel panics during file operations o kern/162591 fs [nullfs] cross-filesystem nullfs does not work as expe o kern/162519 fs [zfs] "zpool import" relies on buggy realpath() behavi o kern/162362 fs [snapshots] [panic] ufs with snapshot(s) panics when g o kern/161968 fs [zfs] [hang] renaming snapshot with -r including a zvo o kern/161864 fs [ufs] removing journaling from UFS partition fails on o bin/161807 fs [patch] add option for explicitly specifying metadata o kern/161579 fs [smbfs] FreeBSD sometimes panics when an smb share is o kern/161533 fs [zfs] [panic] zfs receive panic: system ioctl returnin o kern/161438 fs [zfs] [panic] recursed on non-recursive spa_namespace_ o kern/161424 fs [nullfs] __getcwd() calls fail when used on nullfs mou o kern/161280 fs [zfs] Stack overflow in gptzfsboot o kern/161205 fs [nfs] [pfsync] [regression] [build] Bug report freebsd o kern/161169 fs [zfs] [panic] ZFS causes kernel panic in dbuf_dirty o kern/161112 fs [ufs] [lor] filesystem LOR in FreeBSD 9.0-BETA3 o kern/160893 fs [zfs] [panic] 9.0-BETA2 kernel panic o kern/160860 fs [ufs] Random UFS root filesystem corruption with SU+J o kern/160801 fs [zfs] zfsboot on 8.2-RELEASE fails to boot from root-o o kern/160790 fs [fusefs] [panic] VPUTX: negative ref count with FUSE o kern/160777 fs [zfs] [hang] RAID-Z3 causes fatal hang upon scrub/impo o kern/160706 fs [zfs] zfs bootloader fails when a non-root vdev exists o kern/160591 fs [zfs] Fail to boot on zfs root with degraded raidz2 [r o kern/160410 fs [smbfs] [hang] smbfs hangs when transferring large fil o kern/160283 fs [zfs] [patch] 'zfs list' does abort in make_dataset_ha o kern/159930 fs [ufs] [panic] kernel core o kern/159402 fs [zfs][loader] symlinks cause I/O errors o kern/159357 fs [zfs] ZFS MAXNAMELEN macro has confusing name (off-by- o kern/159356 fs [zfs] [patch] ZFS NAME_ERR_DISKLIKE check is Solaris-s o kern/159351 fs [nfs] [patch] - divide by zero in mountnfs() o kern/159251 fs [zfs] [request]: add FLETCHER4 as DEDUP hash option o kern/159077 fs [zfs] Can't cd .. with latest zfs version o kern/159048 fs [smbfs] smb mount corrupts large files o kern/159045 fs [zfs] [hang] ZFS scrub freezes system o kern/158839 fs [zfs] ZFS Bootloader Fails if there is a Dead Disk o kern/158802 fs amd(8) ICMP storm and unkillable process. 
o kern/158231 fs [nullfs] panic on unmounting nullfs mounted over ufs o f kern/157929 fs [nfs] NFS slow read o kern/157399 fs [zfs] trouble with: mdconfig force delete && zfs strip o kern/157179 fs [zfs] zfs/dbuf.c: panic: solaris assert: arc_buf_remov o kern/156797 fs [zfs] [panic] Double panic with FreeBSD 9-CURRENT and o kern/156781 fs [zfs] zfs is losing the snapshot directory, p kern/156545 fs [ufs] mv could break UFS on SMP systems o kern/156193 fs [ufs] [hang] UFS snapshot hangs && deadlocks processes o kern/156039 fs [nullfs] [unionfs] nullfs + unionfs do not compose, re o kern/155615 fs [zfs] zfs v28 broken on sparc64 -current o kern/155587 fs [zfs] [panic] kernel panic with zfs p kern/155411 fs [regression] [8.2-release] [tmpfs]: mount: tmpfs : No o kern/155199 fs [ext2fs] ext3fs mounted as ext2fs gives I/O errors o bin/155104 fs [zfs][patch] use /dev prefix by default when importing o kern/154930 fs [zfs] cannot delete/unlink file from full volume -> EN o kern/154828 fs [msdosfs] Unable to create directories on external USB o kern/154491 fs [smbfs] smb_co_lock: recursive lock for object 1 p kern/154228 fs [md] md getting stuck in wdrain state o kern/153996 fs [zfs] zfs root mount error while kernel is not located o kern/153753 fs [zfs] ZFS v15 - grammatical error when attempting to u o kern/153716 fs [zfs] zpool scrub time remaining is incorrect o kern/153695 fs [patch] [zfs] Booting from zpool created on 4k-sector o kern/153680 fs [xfs] 8.1 failing to mount XFS partitions o kern/153418 fs [zfs] [panic] Kernel Panic occurred writing to zfs vol o kern/153351 fs [zfs] locking directories/files in ZFS o bin/153258 fs [patch][zfs] creating ZVOLs requires `refreservation' s kern/153173 fs [zfs] booting from a gzip-compressed dataset doesn't w o bin/153142 fs [zfs] ls -l outputs `ls: ./.zfs: Operation not support o kern/153126 fs [zfs] vdev failure, zpool=peegel type=vdev.too_small o kern/152022 fs [nfs] nfs service hangs with linux client [regression] o kern/151942 fs [zfs] panic during ls(1) zfs snapshot directory o kern/151905 fs [zfs] page fault under load in /sbin/zfs o bin/151713 fs [patch] Bug in growfs(8) with respect to 32-bit overfl o kern/151648 fs [zfs] disk wait bug o kern/151629 fs [fs] [patch] Skip empty directory entries during name o kern/151330 fs [zfs] will unshare all zfs filesystem after execute a o kern/151326 fs [nfs] nfs exports fail if netgroups contain duplicate o kern/151251 fs [ufs] Can not create files on filesystem with heavy us o kern/151226 fs [zfs] can't delete zfs snapshot o kern/150503 fs [zfs] ZFS disks are UNAVAIL and corrupted after reboot o kern/150501 fs [zfs] ZFS vdev failure vdev.bad_label on amd64 o kern/150390 fs [zfs] zfs deadlock when arcmsr reports drive faulted o kern/150336 fs [nfs] mountd/nfsd became confused; refused to reload n o kern/149208 fs mksnap_ffs(8) hang/deadlock o kern/149173 fs [patch] [zfs] make OpenSolaris installa o kern/149015 fs [zfs] [patch] misc fixes for ZFS code to build on Glib o kern/149014 fs [zfs] [patch] declarations in ZFS libraries/utilities o kern/149013 fs [zfs] [patch] make ZFS makefiles use the libraries fro o kern/148504 fs [zfs] ZFS' zpool does not allow replacing drives to be o kern/148490 fs [zfs]: zpool attach - resilver bidirectionally, and re o kern/148368 fs [zfs] ZFS hanging forever on 8.1-PRERELEASE o kern/148138 fs [zfs] zfs raidz pool commands freeze o kern/147903 fs [zfs] [panic] Kernel panics on faulty zfs device o kern/147881 fs [zfs] [patch] ZFS "sharenfs" doesn't allow different " o 
kern/147420 fs [ufs] [panic] ufs_dirbad, nullfs, jail panic (corrupt o kern/146941 fs [zfs] [panic] Kernel Double Fault - Happens constantly o kern/146786 fs [zfs] zpool import hangs with checksum errors o kern/146708 fs [ufs] [panic] Kernel panic in softdep_disk_write_compl o kern/146528 fs [zfs] Severe memory leak in ZFS on i386 o kern/146502 fs [nfs] FreeBSD 8 NFS Client Connection to Server s kern/145712 fs [zfs] cannot offline two drives in a raidz2 configurat o kern/145411 fs [xfs] [panic] Kernel panics shortly after mounting an f bin/145309 fs bsdlabel: Editing disk label invalidates the whole dev o kern/145272 fs [zfs] [panic] Panic during boot when accessing zfs on o kern/145246 fs [ufs] dirhash in 7.3 gratuitously frees hashes when it o kern/145238 fs [zfs] [panic] kernel panic on zpool clear tank o kern/145229 fs [zfs] Vast differences in ZFS ARC behavior between 8.0 o kern/145189 fs [nfs] nfsd performs abysmally under load o kern/144929 fs [ufs] [lor] vfs_bio.c + ufs_dirhash.c p kern/144447 fs [zfs] sharenfs fsunshare() & fsshare_main() non functi o kern/144416 fs [panic] Kernel panic on online filesystem optimization s kern/144415 fs [zfs] [panic] kernel panics on boot after zfs crash o kern/144234 fs [zfs] Cannot boot machine with recent gptzfsboot code o kern/143825 fs [nfs] [panic] Kernel panic on NFS client o bin/143572 fs [zfs] zpool(1): [patch] The verbose output from iostat o kern/143212 fs [nfs] NFSv4 client strange work ... o kern/143184 fs [zfs] [lor] zfs/bufwait LOR o kern/142878 fs [zfs] [vfs] lock order reversal o kern/142597 fs [ext2fs] ext2fs does not work on filesystems with real o kern/142489 fs [zfs] [lor] allproc/zfs LOR o kern/142466 fs Update 7.2 -> 8.0 on Raid 1 ends with screwed raid [re o kern/142306 fs [zfs] [panic] ZFS drive (from OSX Leopard) causes two o kern/142068 fs [ufs] BSD labels are got deleted spontaneously o kern/141897 fs [msdosfs] [panic] Kernel panic. 
msdofs: file name leng o kern/141463 fs [nfs] [panic] Frequent kernel panics after upgrade fro o kern/141305 fs [zfs] FreeBSD ZFS+sendfile severe performance issues ( o kern/141091 fs [patch] [nullfs] fix panics with DIAGNOSTIC enabled o kern/141086 fs [nfs] [panic] panic("nfs: bioread, not dir") on FreeBS o kern/141010 fs [zfs] "zfs scrub" fails when backed by files in UFS2 o kern/140888 fs [zfs] boot fail from zfs root while the pool resilveri o kern/140661 fs [zfs] [patch] /boot/loader fails to work on a GPT/ZFS- o kern/140640 fs [zfs] snapshot crash o kern/140068 fs [smbfs] [patch] smbfs does not allow semicolon in file o kern/139725 fs [zfs] zdb(1) dumps core on i386 when examining zpool c o kern/139715 fs [zfs] vfs.numvnodes leak on busy zfs p bin/139651 fs [nfs] mount(8): read-only remount of NFS volume does n o kern/139407 fs [smbfs] [panic] smb mount causes system crash if remot o kern/138662 fs [panic] ffs_blkfree: freeing free block o kern/138421 fs [ufs] [patch] remove UFS label limitations o kern/138202 fs mount_msdosfs(1) see only 2Gb o kern/136968 fs [ufs] [lor] ufs/bufwait/ufs (open) o kern/136945 fs [ufs] [lor] filedesc structure/ufs (poll) o kern/136944 fs [ffs] [lor] bufwait/snaplk (fsync) o kern/136873 fs [ntfs] Missing directories/files on NTFS volume o kern/136865 fs [nfs] [patch] NFS exports atomic and on-the-fly atomic p kern/136470 fs [nfs] Cannot mount / in read-only, over NFS o kern/135546 fs [zfs] zfs.ko module doesn't ignore zpool.cache filenam o kern/135469 fs [ufs] [panic] kernel crash on md operation in ufs_dirb o kern/135050 fs [zfs] ZFS clears/hides disk errors on reboot o kern/134491 fs [zfs] Hot spares are rather cold... o kern/133676 fs [smbfs] [panic] umount -f'ing a vnode-based memory dis p kern/133174 fs [msdosfs] [patch] msdosfs must support multibyte inter o kern/132960 fs [ufs] [panic] panic:ffs_blkfree: freeing free frag o kern/132397 fs reboot causes filesystem corruption (failure to sync b o kern/132331 fs [ufs] [lor] LOR ufs and syncer o kern/132237 fs [msdosfs] msdosfs has problems to read MSDOS Floppy o kern/132145 fs [panic] File System Hard Crashes o kern/131441 fs [unionfs] [nullfs] unionfs and/or nullfs not combineab o kern/131360 fs [nfs] poor scaling behavior of the NFS server under lo o kern/131342 fs [nfs] mounting/unmounting of disks causes NFS to fail o bin/131341 fs makefs: error "Bad file descriptor" on the mount poin o kern/130920 fs [msdosfs] cp(1) takes 100% CPU time while copying file o kern/130210 fs [nullfs] Error by check nullfs o kern/129760 fs [nfs] after 'umount -f' of a stale NFS share FreeBSD l o kern/129488 fs [smbfs] Kernel "bug" when using smbfs in smbfs_smb.c: o kern/129231 fs [ufs] [patch] New UFS mount (norandom) option - mostly o kern/129152 fs [panic] non-userfriendly panic when trying to mount(8) o kern/127787 fs [lor] [ufs] Three LORs: vfslock/devfs/vfslock, ufs/vfs o bin/127270 fs fsck_msdosfs(8) may crash if BytesPerSec is zero o kern/127029 fs [panic] mount(8): trying to mount a write protected zi o kern/126287 fs [ufs] [panic] Kernel panics while mounting an UFS file o kern/125895 fs [ffs] [panic] kernel: panic: ffs_blkfree: freeing free s kern/125738 fs [zfs] [request] SHA256 acceleration in ZFS o kern/123939 fs [msdosfs] corrupts new files o kern/122380 fs [ffs] ffs_valloc:dup alloc (Soekris 4801/7.0/USB Flash o bin/122172 fs [fs]: amd(8) automount daemon dies on 6.3-STABLE i386, o bin/121898 fs [nullfs] pwd(1)/getcwd(2) fails with Permission denied o bin/121072 fs [smbfs] mount_smbfs(8) cannot 
normally convert the cha o kern/120483 fs [ntfs] [patch] NTFS filesystem locking changes o kern/120482 fs [ntfs] [patch] Sync style changes between NetBSD and F o kern/118912 fs [2tb] disk sizing/geometry problem with large array o kern/118713 fs [minidump] [patch] Display media size required for a k o kern/118318 fs [nfs] NFS server hangs under special circumstances o bin/118249 fs [ufs] mv(1): moving a directory changes its mtime o kern/118126 fs [nfs] [patch] Poor NFS server write performance o kern/118107 fs [ntfs] [panic] Kernel panic when accessing a file at N o kern/117954 fs [ufs] dirhash on very large directories blocks the mac o bin/117315 fs [smbfs] mount_smbfs(8) and related options can't mount o kern/117158 fs [zfs] zpool scrub causes panic if geli vdevs detach on o bin/116980 fs [msdosfs] [patch] mount_msdosfs(8) resets some flags f o conf/116931 fs lack of fsck_cd9660 prevents mounting iso images with o kern/116583 fs [ffs] [hang] System freezes for short time when using o bin/115361 fs [zfs] mount(8) gets into a state where it won't set/un o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114676 fs [ufs] snapshot creation panics: snapacct_ufs2: bad blo o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o kern/113852 fs [smbfs] smbfs does not properly implement DFS referral o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/111843 fs [msdosfs] Long Names of files are incorrectly created o kern/111782 fs [ufs] dump(8) fails horribly for large filesystems s bin/111146 fs [2tb] fsck(8) fails on 6T filesystem o bin/107829 fs [2TB] fdisk(8): invalid boundary checking in fdisk / w o kern/106107 fs [ufs] left-over fsck_snapshot after unfinished backgro o kern/104406 fs [ufs] Processes get stuck in "ufs" state under persist o kern/104133 fs [ext2fs] EXT2FS module corrupts EXT2/3 filesystems o kern/103035 fs [ntfs] Directories in NTFS mounted disc images appear o kern/101324 fs [smbfs] smbfs sometimes not case sensitive when it's s o kern/99290 fs [ntfs] mount_ntfs ignorant of cluster sizes s bin/97498 fs [request] newfs(8) has no option to clear the first 12 o kern/97377 fs [ntfs] [patch] syntax cleanup for ntfs_ihash.c o kern/95222 fs [cd9660] File sections on ISO9660 level 3 CDs ignored o kern/94849 fs [ufs] rename on UFS filesystem is not atomic o bin/94810 fs fsck(8) incorrectly reports 'file system marked clean' o kern/94769 fs [ufs] Multiple file deletions on multi-snapshotted fil o kern/94733 fs [smbfs] smbfs may cause double unlock o kern/93942 fs [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D o kern/92272 fs [ffs] [hang] Filling a filesystem while creating a sna o kern/91134 fs [smbfs] [patch] Preserve access and modification time a kern/90815 fs [smbfs] [patch] SMBFS with character conversions somet o kern/88657 fs [smbfs] windows client hang when browsing a samba shar o kern/88555 fs [panic] ffs_blkfree: freeing free frag on AMD 64 o bin/87966 fs [patch] newfs(8): introduce -A flag for newfs to enabl o kern/87859 fs [smbfs] System reboot while umount smbfs. o kern/86587 fs [msdosfs] rm -r /PATH fails with lots of small files o bin/85494 fs fsck_ffs: unchecked use of cg_inosused macro etc. 
o kern/80088 fs [smbfs] Incorrect file time setting on NTFS mounted vi o bin/74779 fs Background-fsck checks one filesystem twice and omits o kern/73484 fs [ntfs] Kernel panic when doing `ls` from the client si o bin/73019 fs [ufs] fsck_ufs(8) cannot alloc 607016868 bytes for ino o kern/71774 fs [ntfs] NTFS cannot "see" files on a WinXP filesystem o bin/70600 fs fsck(8) throws files away when it can't grow lost+foun o kern/68978 fs [panic] [ufs] crashes with failing hard disk, loose po o kern/65920 fs [nwfs] Mounted Netware filesystem behaves strange o kern/65901 fs [smbfs] [patch] smbfs fails fsx write/truncate-down/tr o kern/61503 fs [smbfs] mount_smbfs does not work as non-root o kern/55617 fs [smbfs] Accessing an nsmb-mounted drive via a smb expo o kern/51685 fs [hang] Unbounded inode allocation causes kernel to loc o kern/36566 fs [smbfs] System reboot with dead smb mount and umount o bin/27687 fs fsck(8) wrapper is not properly passing options to fsc o kern/18874 fs [2TB] 32bit NFS servers export wrong negative values t 296 problems total. From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 12:00:10 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 2DB5D23F; Mon, 28 Jan 2013 12:00:10 +0000 (UTC) (envelope-from laurencesgill@googlemail.com) Received: from mail-we0-x22b.google.com (we-in-x022b.1e100.net [IPv6:2a00:1450:400c:c03::22b]) by mx1.freebsd.org (Postfix) with ESMTP id 987C3292; Mon, 28 Jan 2013 12:00:09 +0000 (UTC) Received: by mail-we0-f171.google.com with SMTP id u54so1398103wey.2 for ; Mon, 28 Jan 2013 04:00:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=20120113; h=x-received:date:from:to:cc:subject:message-id:in-reply-to :references:x-mailer:mime-version:content-type :content-transfer-encoding; bh=DWoix26UxXcqQk/DIlK7sgYcybDwgf+SEv1R/pUwWB4=; b=b3/gXcZEXPWAOOL5m1X2Ah1aYdOolv5QYWyEBfvYvLYtgflsH4rcm4IklTF3tIryrR S1v1iMpvZZ8jz/D7SQ4A5Om9cXTqs3zKbJmW7bL/LNAqcNS5RTiUFE/vvyI+MxtOfOFL Pbujq9GshVUwudwMCNr+FAdP3qx2pnmqsKMEdS9x90sCIkSq9DP7OZwK4L73+/JeENsr oOve6nKyoSW0NzJwuDc7ONC6t9fooMakEk+9Shk6QLVaXD4Nn6R/QMl3I7FKlxR+65To UL26CX4/tkMX/cfaLuEfbrc4d05V/Bg6djTIecZGRvhLMgDGVZ/SieFgzzd5n5gAhtPw 5PeA== X-Received: by 10.180.80.170 with SMTP id s10mr9158900wix.27.1359374408538; Mon, 28 Jan 2013 04:00:08 -0800 (PST) Received: from localhost (gateway.ash.thebunker.net. [213.129.64.4]) by mx.google.com with ESMTPS id ge2sm5859532wib.4.2013.01.28.04.00.08 (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Mon, 28 Jan 2013 04:00:08 -0800 (PST) Date: Mon, 28 Jan 2013 12:00:55 +0000 From: Laurence Gill To: Pawel Jakub Dawidek Subject: Re: HAST performance overheads? 
Message-ID: <20130128120055.6ca7c734@googlemail.com> In-Reply-To: <20130127134845.GC1346@garage.freebsd.pl> References: <20130125121044.1afac72e@googlemail.com> <20130127134845.GC1346@garage.freebsd.pl> X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.12; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: base64 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 12:00:10 -0000 LS0tLS1CRUdJTiBQR1AgU0lHTkVEIE1FU1NBR0UtLS0tLQ0KSGFzaDogU0hBMQ0KDQpPbiBTdW4s IDI3IEphbiAyMDEzIDE0OjQ4OjQ2ICswMTAwDQpQYXdlbCBKYWt1YiBEYXdpZGVrIDxwamRARnJl ZUJTRC5vcmc+IHdyb3RlOg0KDQo+IE9uIEZyaSwgSmFuIDI1LCAyMDEzIGF0IDEyOjEwOjQ0UE0g KzAwMDAsIExhdXJlbmNlIEdpbGwgd3JvdGU6DQo+ID4gSWYgSSBjcmVhdGUgWkZTIHJhaWR6MiBv biB0aGVzZS4uLg0KPiA+IA0KPiA+ICAtICMgenBvb2wgY3JlYXRlIHBvb2wgcmFpZHoyIGRhMCBk YTEgZGEyIGRhMyBkYTQgZGE1DQo+ID4gDQo+ID4gVGhlbiBydW4gYSBkZCB0ZXN0LCBhIHNhbXBs ZSBvdXRwdXQgaXMuLi4NCj4gPiANCj4gPiAgLSAjIGRkIGlmPS9kZXYvemVybyBvZj10ZXN0LmRh dCBicz0xTSBjb3VudD0xMDI0DQo+ID4gICAgICAxMDczNzQxODI0IGJ5dGVzIHRyYW5zZmVycmVk IGluIDcuNjg5NjM0IHNlY3MgKDEzOTYzNDk3NA0KPiA+IGJ5dGVzL3NlYykNCj4gPiANCj4gPiAg LSAjIGRkIGlmPS9kZXYvemVybyBvZj10ZXN0LmRhdCBicz0xNmsgY291bnQ9NjU1MzUNCj4gPiAg ICAgIDEwNzM3MjU0NDAgYnl0ZXMgdHJhbnNmZXJyZWQgaW4gMS45MDkxNTcgc2VjcyAoNTYyNDA4 MTMwDQo+ID4gYnl0ZXMvc2VjKQ0KPiA+IA0KPiA+IFRoaXMgaXMgbXVjaCBmYXN0ZXIgdGhhbiBj b21wYXJlZCB0byBydW5uaW5nIGhhc3QsIEkgd291bGQgZXhwZWN0IGFuDQo+ID4gb3ZlcmhlYWQs IGJ1dCBub3QgdGhpcyBtdWNoLiAgRm9yIGV4YW1wbGU6DQo+ID4gDQo+ID4gIC0gIyBoYXN0Y3Rs IGNyZWF0ZSBkaXNrMC9kaXNrMS9kaXNrMi9kaXNrMy9kaXNrNC9kaXNrNQ0KPiA+ICAtICMgaGFz dGN0bCByb2xlIHByaW1hcnkgYWxsDQo+ID4gIC0gIyB6cG9vbCBjcmVhdGUgcG9vbCByYWlkejIg ZGlzazAgZGlzazEgZGlzazIgZGlzazMgZGlzazQgZGlzazUNCj4gPiANCj4gPiBSdW4gYSBkZCB0 ZXN0LCBhbmQgdGhlIHNwZWVkIGlzLi4uDQo+ID4gDQo+ID4gIC0gIyBkZCBpZj0vZGV2L3plcm8g b2Y9dGVzdC5kYXQgYnM9MU0gY291bnQ9MTAyNA0KPiA+ICAgICAgMTA3Mzc0MTgyNCBieXRlcyB0 cmFuc2ZlcnJlZCBpbiA0MC45MDgxNTMgc2VjcyAoMjYyNDc2MjQNCj4gPiBieXRlcy9zZWMpDQo+ ID4gDQo+ID4gIC0gIyBkZCBpZj0vZGV2L3plcm8gb2Y9dGVzdC5kYXQgYnM9MTZrIGNvdW50PTY1 NTM1DQo+ID4gICAgICAxMDczNzI1NDQwIGJ5dGVzIHRyYW5zZmVycmVkIGluIDQyLjAxNzk5NyBz ZWNzICgyNTU1Mzk0Mg0KPiA+IGJ5dGVzL3NlYykNCj4gDQo+IExldCdzIHRyeSB0byB0ZXN0IG9u ZSBzdGVwIGF0IGEgdGltZS4gQ2FuIHlvdSB0cnkgdG8gY29tcGFyZQ0KPiBzZXF1ZW50aWFsIHBl cmZvcm1hbmNlIG9mIHJlZ3VsYXIgZGlzayB2cy4gSEFTVCB3aXRoIG5vIHNlY29uZGFyeQ0KPiBj b25maWd1cmVkPw0KPiANCj4gQnkgbm8gc2Vjb25kYXJ5IGNvbmZpZ3VyZWQgSSBtZWFuICdyZW1v dGUnIHNldCB0byAnbm9uZScuDQo+IA0KPiBKdXN0IGRvOg0KPiANCj4gCSMgZGQgaWY9L2Rldi96 ZXJvIG9mPS9kZXYvZGEwIGJzPTFtIGNvdW50PTEwMjQwDQo+IA0KPiB0aGVuIGNvbmZpZ3VyZSBI QVNUIGFuZDoNCj4gDQo+IAkjIGRkIGlmPS9kZXYvemVybyBvZj0vZGV2L2hhc3QvZGlzazAgYnM9 MW0gY291bnQ9MTAyNDANCj4gDQo+IFdoaWNoIEZyZWVCU0QgdmVyc2lvbiBpcyBpdD8NCj4gDQo+ IFBTLiBZb3VyIFpGUyB0ZXN0cyBhcmUgcHJldHR5IG1lYW5pbmdsZXNzLCBiZWNhdXNlIGl0IGlz IHBvc3NpYmxlIHRoYXQNCj4gICAgIGV2ZXJ5dGhpbmcgd2lsbCBlbmQgdXAgaW4gbWVtb3J5LiBJ J20gc3VyZSB0aGlzIGlzIHdoYXQgaGFwcGVucyBpbg0KPiAgICAgJ2JzPTE2ayBjb3VudD02NTUz NScgY2FzZS4gTGV0IHRyeSByYXcgcHJvdmlkZXJzIGZpcnN0Lg0KPiANCg0KVGhhbmtzIGZvciB0 aGUgcmVwbHkuICBJJ20gdXNpbmcgRnJlZUJTRCA5LjEtUkVMRUFTRS4gSGVyZSBhcmUgdGhlDQpy ZXN1bHRzOg0KDQogIyBkZCBpZj0vZGV2L3plcm8gb2Y9L2Rldi9kYTAgYnM9MW0gY291bnQ9MTAy NDANCiAxMDczNzQxODI0MCBieXRlcyB0cmFuc2ZlcnJlZCBpbiA3NTUuMTQ0NjQ0IHNlY3MgKDE0 MjE5MDIyIGJ5dGVzL3NlYykNCg0KICMgZGQgaWY9L2Rldi96ZXJvIG9mPS9kZXYvaGFzdC9kaXNr 
MCBicz0xbSBjb3VudD0xMDI0MA0KIDEwNzM3NDE4MjQwIGJ5dGVzIHRyYW5zZmVycmVkIGluIDg0 NC4xNjc2MDIgc2VjcyAoMTI3MTk1MzQgYnl0ZXMvc2VjKQ0KDQoNCldoaWNoIGluZGljYXRlcyBh IHZlcnkgc21hbGwgb3ZlcmhlYWQsIGhtbW0uLi4NCg0KDQotIC0tIA0KTGF1cmVuY2UgR2lsbA0K DQpmOiAwODcyMSAxNTcgNjY1DQpza3lwZTogbGF1cmVuY2VnZw0KZTogbGF1cmVuY2VzZ2lsbEBn b29nbGVtYWlsLmNvbQ0KUEdQIG9uIEtleSBTZXJ2ZXJzDQotLS0tLUJFR0lOIFBHUCBTSUdOQVRV UkUtLS0tLQ0KVmVyc2lvbjogR251UEcgdjIuMC4xOSAoR05VL0xpbnV4KQ0KDQppRVlFQVJFQ0FB WUZBbEVHYUg0QUNna1F5Z1Z0OFNxMFBmOFFhUUNmWDQvU0FHbndZWGZDeEorRkZuRTFPaVJ2DQpS M01BbjIyYnhqaFhuQ081QXFzeDc0R3hxNVplbVVqWA0KPTdkZ1INCi0tLS0tRU5EIFBHUCBTSUdO QVRVUkUtLS0tLQ0K From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 15:44:56 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id E83479B7 for ; Mon, 28 Jan 2013 15:44:56 +0000 (UTC) (envelope-from c47g@gmx.at) Received: from mout.gmx.net (mout.gmx.net [212.227.15.18]) by mx1.freebsd.org (Postfix) with ESMTP id 96A85680 for ; Mon, 28 Jan 2013 15:44:56 +0000 (UTC) Received: from mailout-de.gmx.net ([10.1.76.12]) by mrigmx.server.lan (mrigmx002) with ESMTP (Nemesis) id 0MTdbK-1UQLx92daS-00QQGP for ; Mon, 28 Jan 2013 16:44:55 +0100 Received: (qmail invoked by alias); 28 Jan 2013 15:44:55 -0000 Received: from cm56-168-232.liwest.at (EHLO bones.gusis.at) [86.56.168.232] by mail.gmx.net (mp012) with SMTP; 28 Jan 2013 16:44:55 +0100 X-Authenticated: #9978462 X-Provags-ID: V01U2FsdGVkX1+RNOEagoaR8Aj+CRm8gJ6VegtnFdM0y+SAncGTcB 2AVQOib+LqwOIQ From: Christian Gusenbauer To: pyunyh@gmail.com Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory Date: Mon, 28 Jan 2013 16:46:43 +0100 User-Agent: KMail/1.13.7 (FreeBSD/9.1-STABLE; KDE/4.8.4; amd64; ; ) References: <201301241805.57623.c47g@gmx.at> <201301251809.50929.c47g@gmx.at> <20130128063531.GC1447@michelle.cdnetworks.com> In-Reply-To: <20130128063531.GC1447@michelle.cdnetworks.com> MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: <201301281646.43551.c47g@gmx.at> X-Y-GMX-Trusted: 0 Cc: freebsd-fs@freebsd.org, net@freebsd.org, yongari@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 15:44:57 -0000 On Monday 28 January 2013 07:35:31 YongHyeon PYUN wrote: > On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote: > > On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote: > > > On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote: > > > > On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote: > > > > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote: > > > > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote: > > > > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > > > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian > > > > > > > > Gusenbauer > > > > wrote: > > > > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote: > > > > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin > > > > > > > > > > Belousov > > > > wrote: > > > > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian > > > > Gusenbauer wrote: > > > > > > > > > > > > Hi! 
> > > > > > > > > > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get > > > > > > > > > > > > the panic below if I execute the following commands > > > > > > > > > > > > (as single user): > > > > > > > > > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > > > > > # mount -u / > > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt > > > > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll > > > > > > > > > > > > attach the stack trace. > > > > > > > > > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a > > > > > > > > > > > > 1Gbit network, maybe that's the cause for the panic, > > > > > > > > > > > > because the bcopy (see stack frame #15) fails. > > > > > > > > > > > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of > > > > > > > > > > > rsize=32768 and mtu 6144, but the machine runs HEAD and > > > > > > > > > > > em instead of age. I was unable to reproduce the panic > > > > > > > > > > > on the copy of the 5GB file from nfs mount. > > > > > > > > > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so > > > > > > > > > just configuring age0 with > > > > > > > > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > > > > > > > > > then I can copy all files from the mounted directory > > > > > > > > > without any problems, too. So it's probably age0 related? > > > > > > > > > > > > > > > > From your backtrace and the buffer printout, I see somewhat > > > > > > > > strange thing. The buffer data address is 0xffffff8171418000, > > > > > > > > while kernel faulted at the attempt to write at > > > > > > > > 0xffffff8171413000, which is is lower then the buffer data > > > > > > > > pointer, at the attempt to bcopy to the buffer. > > > > > > > > > > > > > > > > The other data suggests that there were no overflow of the > > > > > > > > data from the server response. So it might be that > > > > > > > > mbuf_len(mp) returned negative number ? I am not sure is it > > > > > > > > possible at all. > > > > > > > > > > > > > > > > Try this debugging patch, please. You need to add INVARIANTS > > > > > > > > etc to the kernel config. > > > > > > > > > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c > > > > > > > > b/sys/fs/nfs/nfs_commonsubs.c index efc0786..9a6bda5 100644 > > > > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > > > > struct uio *uiop, int siz) } > > > > > > > > > > > > > > > > mbufcp = NFSMTOD(mp, caddr_t); > > > > > > > > len = mbuf_len(mp); > > > > > > > > > > > > > > > > + KASSERT(len > 0, ("len %d", len)); > > > > > > > > > > > > > > > > } > > > > > > > > xfer = (left > len) ? 
len : left; > > > > > > > > > > > > > > > > #ifdef notdef > > > > > > > > > > > > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > > > > struct uio *uiop, int siz) uiop->uio_resid -= xfer; > > > > > > > > > > > > > > > > } > > > > > > > > if (uiop->uio_iov->iov_len <= siz) { > > > > > > > > > > > > > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > > > > > + uiop->uio_iovcnt)); > > > > > > > > > > > > > > > > uiop->uio_iovcnt--; > > > > > > > > uiop->uio_iov++; > > > > > > > > > > > > > > > > } else { > > > > > > > > > > > > > > > > I thought that server have returned too long response, but it > > > > > > > > seems to be not the case from your data. Still, I think the > > > > > > > > patch below might be due. > > > > > > > > > > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 > > > > > > > > 100644 --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio > > > > > > > > *uiop, struct ucred *cred, NFSM_DISSECT(tl, u_int32_t *, > > > > > > > > NFSX_UNSIGNED); > > > > > > > > > > > > > > > > eof = fxdr_unsigned(int, *tl); > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > - NFSM_STRSIZ(retlen, rsize); > > > > > > > > + NFSM_STRSIZ(retlen, len); > > > > > > > > > > > > > > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > > > > > > if (error) > > > > > > > > > > > > > > > > goto nfsmout; > > > > > > > > > > > > > > I applied your patches and now I get a > > > > > > > > > > > > > > panic: len -4 > > > > > > > cpuid = 1 > > > > > > > KDB: enter: panic > > > > > > > Dumping 377 out of 6116 > > > > > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > > > > > > > > > This means that the age driver either produced corrupted mbuf > > > > > > chain, or filled wrong negative value into the mbuf len field. I > > > > > > am quite certain that the issue is in the driver. > > > > > > > > > > > > I added the net@ to Cc:, hopefully you could get help there. > > > > > > > > > > And I've cc'd Pyun who has written most of this driver and is > > > > > likely the one most familiar with its handling of jumbo frames. > > > > > > > > Try attached one and let me know how it goes. > > > > Note, I don't have age(4) anymore so it wasn't tested at all. > > > > > > Sorry, ignore previous patch and use this one(age.diff2) instead. > > > > Thanks for the patch! I ignored the first and applied only the second > > one, but unfortunately that did not change anything. I still get the > > "panic: len -4" > > > > :-(. > > Ok, I contacted QAC and got a hint for its descriptor usage and I > realized the controller does not work as I initially expected! > When I wrote age(4) for the controller, the hardware was available > only for a couple of weeks so I may have not enough time to test > it. Sorry about that. > I'll let you know when experimental patch is available. Due to lack > of hardware, it would take more time than it used to be. > > Thanks for reporting! Thanks for investing your time! I'm looking forward to test your next patch(es) :-)! Ciao, Christian. 
From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 16:11:48 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 78D5CD06; Mon, 28 Jan 2013 16:11:48 +0000 (UTC) (envelope-from laurencesgill@googlemail.com) Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by mx1.freebsd.org (Postfix) with ESMTP id BFD5C8D9; Mon, 28 Jan 2013 16:11:47 +0000 (UTC) Received: by mail-wi0-f179.google.com with SMTP id o1so1587895wic.12 for ; Mon, 28 Jan 2013 08:11:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=20120113; h=x-received:date:from:to:cc:subject:message-id:in-reply-to :references:x-mailer:mime-version:content-type :content-transfer-encoding; bh=rg9H/7I/0w+W7kHM8c5+sIRDAUeRSogB+FSwAWsitUs=; b=1KWjJpJXQGe93+JPOWBoEPmjYV5OF5ctAz1j8cfrcJbE4vrH5xJp/3K39rGOQHRVyJ /oMBZC8u8fDQexWst0lvfLibcrNI6lkUhlaFnyTiShfXPgDgOEUMFn9WW7mX2cekP0mU RifRbRoNI1JhSfvEQKJnPn0T9yog1ph3MxbVRBkrFyjt3hfpWbE2RUs7+9tf0p0EzFE5 HhdskdjoSkZf6RZKVYlIx5i/U6xz+tj8Wc4tFayu+8/lo5LlJNfV6OTbSZ/jpXUcmEgy cTO68Ok8J8GWt8T192p61FjC+xJjrbXxr5KQQQt6vQ56YY2dmywUfWM5diKNmQi26zem gC3Q== X-Received: by 10.180.81.39 with SMTP id w7mr10873810wix.15.1359389501290; Mon, 28 Jan 2013 08:11:41 -0800 (PST) Received: from localhost (gateway.ash.thebunker.net. [213.129.64.4]) by mx.google.com with ESMTPS id bd7sm14112933wib.8.2013.01.28.08.11.40 (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Mon, 28 Jan 2013 08:11:41 -0800 (PST) Date: Mon, 28 Jan 2013 16:12:28 +0000 From: Laurence Gill To: freebsd-fs@freebsd.org Subject: Re: HAST performance overheads? Message-ID: <20130128161228.477ce174@googlemail.com> In-Reply-To: <20130128120055.6ca7c734@googlemail.com> References: <20130125121044.1afac72e@googlemail.com> <20130127134845.GC1346@garage.freebsd.pl> <20130128120055.6ca7c734@googlemail.com> X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.12; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: base64 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 16:11:48 -0000 LS0tLS1CRUdJTiBQR1AgU0lHTkVEIE1FU1NBR0UtLS0tLQ0KSGFzaDogU0hBMQ0KDQpPbiBNb24s IDI4IEphbiAyMDEzIDEyOjAwOjU1ICswMDAwDQpMYXVyZW5jZSBHaWxsIDxsYXVyZW5jZXNnaWxs QGdvb2dsZW1haWwuY29tPiB3cm90ZToNCj4gT24gU3VuLCAyNyBKYW4gMjAxMyAxNDo0ODo0NiAr MDEwMA0KPiBQYXdlbCBKYWt1YiBEYXdpZGVrIDxwamRARnJlZUJTRC5vcmc+IHdyb3RlOg0KPiA+ IA0KPiA+IExldCdzIHRyeSB0byB0ZXN0IG9uZSBzdGVwIGF0IGEgdGltZS4gQ2FuIHlvdSB0cnkg dG8gY29tcGFyZQ0KPiA+IHNlcXVlbnRpYWwgcGVyZm9ybWFuY2Ugb2YgcmVndWxhciBkaXNrIHZz LiBIQVNUIHdpdGggbm8gc2Vjb25kYXJ5DQo+ID4gY29uZmlndXJlZD8NCj4gPiANCj4gPiBCeSBu byBzZWNvbmRhcnkgY29uZmlndXJlZCBJIG1lYW4gJ3JlbW90ZScgc2V0IHRvICdub25lJy4NCj4g PiANCj4gPiBKdXN0IGRvOg0KPiA+IA0KPiA+IAkjIGRkIGlmPS9kZXYvemVybyBvZj0vZGV2L2Rh MCBicz0xbSBjb3VudD0xMDI0MA0KPiA+IA0KPiA+IHRoZW4gY29uZmlndXJlIEhBU1QgYW5kOg0K PiA+IA0KPiA+IAkjIGRkIGlmPS9kZXYvemVybyBvZj0vZGV2L2hhc3QvZGlzazAgYnM9MW0gY291 bnQ9MTAyNDANCj4gPiANCj4gPiBXaGljaCBGcmVlQlNEIHZlcnNpb24gaXMgaXQ/DQo+ID4gDQo+ ID4gUFMuIFlvdXIgWkZTIHRlc3RzIGFyZSBwcmV0dHkgbWVhbmluZ2xlc3MsIGJlY2F1c2UgaXQg aXMgcG9zc2libGUNCj4gPiB0aGF0IGV2ZXJ5dGhpbmcgd2lsbCBlbmQgdXAgaW4gbWVtb3J5LiBJ J20gc3VyZSB0aGlzIGlzIHdoYXQNCj4gPiBoYXBwZW5zIGluICdicz0xNmsgY291bnQ9NjU1MzUn 
IGNhc2UuIExldCB0cnkgcmF3IHByb3ZpZGVycyBmaXJzdC4NCj4gPiANCj4gDQo+IFRoYW5rcyBm b3IgdGhlIHJlcGx5LiAgSSdtIHVzaW5nIEZyZWVCU0QgOS4xLVJFTEVBU0UuIEhlcmUgYXJlIHRo ZQ0KPiByZXN1bHRzOg0KPiANCj4gICMgZGQgaWY9L2Rldi96ZXJvIG9mPS9kZXYvZGEwIGJzPTFt IGNvdW50PTEwMjQwDQo+ICAxMDczNzQxODI0MCBieXRlcyB0cmFuc2ZlcnJlZCBpbiA3NTUuMTQ0 NjQ0IHNlY3MgKDE0MjE5MDIyIGJ5dGVzL3NlYykNCj4gDQo+ICAjIGRkIGlmPS9kZXYvemVybyBv Zj0vZGV2L2hhc3QvZGlzazAgYnM9MW0gY291bnQ9MTAyNDANCj4gIDEwNzM3NDE4MjQwIGJ5dGVz IHRyYW5zZmVycmVkIGluIDg0NC4xNjc2MDIgc2VjcyAoMTI3MTk1MzQgYnl0ZXMvc2VjKQ0KPiAN Cj4gDQo+IFdoaWNoIGluZGljYXRlcyBhIHZlcnkgc21hbGwgb3ZlcmhlYWQsIGhtbW0uLi4NCj4g DQoNCkZ1cnRoZXIgdG8gdGhpcywgc3RpY2tpbmcgd2l0aCB0aGUgMSBkaXNrIGZvciB0ZXN0aW5n LCBJIHNlZSB0aGUNCmZvbGxvd2luZzoNCg0KIC0gVUZTIG9uIGRhMA0KICMgZGQgaWY9L2Rldi96 ZXJvIG9mPXRlc3QuZGF0IGJzPTFtIGNvdW50PTEwMjQwDQogMTA3Mzc0MTgyNDAgYnl0ZXMgdHJh bnNmZXJyZWQgaW4gNzYuMTEyODczIHNlY3MgKDE0MTA3MjMwMiBieXRlcy9zZWMpDQoNCiAtIFVG UyBvbiBoYXN0L2Rpc2swDQogIyAgZGQgaWY9L2Rldi96ZXJvIG9mPXRlc3QuZGF0ICBicz0xbSBj b3VudD0xMDI0MA0KIDEwNzM3NDE4MjQwIGJ5dGVzIHRyYW5zZmVycmVkIGluIDg1NS43MjA5ODUg c2VjcyAoMTI1NDc4MDMgYnl0ZXMvc2VjKQ0KDQpXaGljaCBpcyByb3VnaGx5IHRoZSBzYW1lIGFz IHVzaW5nIHRoZSByYXcgaGFzdCBwcm92aWRlci4NCg0KDQogLSB6ZnMgb24gZGEwDQogIyBkZCBp Zj0vZGV2L3plcm8gb2Y9dGVzdC5kYXQgYnM9MW0gY291bnQ9MTAyNDANCiAxMDczNzQxODI0MCBi eXRlcyB0cmFuc2ZlcnJlZCBpbiAxMTQuMzM4OTAwIHNlY3MgKDkzOTA4NzA3IGJ5dGVzL3NlYykN Cg0KIC0gemZzIG9uIGhhc3QvZGlzazANCiAjIGRkIGlmPS9kZXYvemVybyBvZj10ZXN0LmRhdCBi cz0xbSBjb3VudD0xMDI0MA0KIDEwNzM3NDE4MjQwIGJ5dGVzIHRyYW5zZmVycmVkIGluIDEyODcu MDg4NDE2IHNlY3MgKDgzNDI0MDkgYnl0ZXMvc2VjKQ0KDQpXaGljaCBzZWVtcyBzbG93ZXIgdGhh biB0aGUgcmF3IHByb3ZpZGVyIGJ5IGFwcHJveCA0TUIvcy4NCg0KU28gSSdtIHN0aWxsIHRyeWlu ZyB0byB3b3JrIG91dCB3aHkgdGhlIGV4dHJhICJkcm9wIiB3aGVuIHVzaW5nIFpGUyBvbg0KaGFz dC4uLg0KDQoNCg0KLSAtLSANCkxhdXJlbmNlIEdpbGwNCg0KZjogMDg3MjEgMTU3IDY2NQ0Kc2t5 cGU6IGxhdXJlbmNlZ2cNCmU6IGxhdXJlbmNlc2dpbGxAZ29vZ2xlbWFpbC5jb20NClBHUCBvbiBL ZXkgU2VydmVycw0KLS0tLS1CRUdJTiBQR1AgU0lHTkFUVVJFLS0tLS0NClZlcnNpb246IEdudVBH IHYyLjAuMTkgKEdOVS9MaW51eCkNCg0KaUVZRUFSRUNBQVlGQWxFR28zUUFDZ2tReWdWdDhTcTBQ ZjhLM1FDZlZBK25vZklnUkhNL2dZaUF6aXM2VEY1Kw0KVnZZQW4ya0VPVnRHeVNSMGVadGVnR3J2 VWFwNUJWaHgNCj05ZkN2DQotLS0tLUVORCBQR1AgU0lHTkFUVVJFLS0tLS0NCg== From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 18:06:12 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 0245A7CD; Mon, 28 Jan 2013 18:06:12 +0000 (UTC) (envelope-from hag@linnaean.org) Received: from perdition.linnaean.org (perdition.linnaean.org [IPv6:2001:470:8917:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id CF5B6FA3; Mon, 28 Jan 2013 18:06:11 +0000 (UTC) Received: by perdition.linnaean.org (Postfix, from userid 31013) id EC36E884; Mon, 28 Jan 2013 13:06:10 -0500 (EST) From: Daniel Hagerty To: Ulrich =?utf-8?Q?Sp=C3=B6rlein?= Subject: Re: Zpool surgery References: <20130127103612.GB38645@acme.spoerlein.net> <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> <20130127201140.GD29105@server.rulingia.com> <20130128085820.GR35868@acme.spoerlein.net> Sender: Daniel Hagerty Date: Mon, 28 Jan 2013 13:06:10 -0500 In-Reply-To: <20130128085820.GR35868@acme.spoerlein.net> (Ulrich Sp's message of "Mon, 28 Jan 2013 09:58:20 +0100") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: 
list Reply-To: Daniel Hagerty List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 18:06:12 -0000

Ulrich Spörlein writes:

> But are you then also supposed to be able send incremental snapshots to
> a third pool from the pool that you just cloned?

I can't speak to your problems, but I did recently do what you seem
to be doing, without incident. That is, I had a pool and an archive.
I copied datasets from pool to a new pool', and pool' could send to
the archive as if it were the original pool.

Two possible differences in what I do that leap to mind:

1. I only send select snapshots to archive; the synchronization
   snapshots are not among them.

2. I use receive -F.

> How does the receiving pool known that it has the correct snapshot to
> store an incremental one anyway? Is there a toplevel checksum, like for
> git commits? How can I display and compare that?

I don't know for sure, but I'd hazard a guess that:

$ zfs get -p guid pool/home@daily-2013-01-28
NAME                        PROPERTY  VALUE               SOURCE
pool/home@daily-2013-01-28  guid      259258190084829958  -

plays a part.

Good luck!

From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 20:04:51 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 0F7E7138; Mon, 28 Jan 2013 20:04:51 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from smtprelay03.ispgateway.de (smtprelay03.ispgateway.de [80.67.29.28]) by mx1.freebsd.org (Postfix) with ESMTP id 7D2D584A; Mon, 28 Jan 2013 20:04:50 +0000 (UTC) Received: from [78.35.168.72] (helo=fabiankeil.de) by smtprelay03.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128) (Exim 4.68) (envelope-from ) id 1TzuwH-0001SW-MR; Mon, 28 Jan 2013 21:04:21 +0100 Date: Mon, 28 Jan 2013 20:58:02 +0100 From: Fabian Keil To: Ulrich Spörlein Subject: Re: Zpool surgery Message-ID: <20130128205802.1ffab53e@fabiankeil.de> In-Reply-To: <20130128085820.GR35868@acme.spoerlein.net> References: <20130127103612.GB38645@acme.spoerlein.net> <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> <20130127201140.GD29105@server.rulingia.com> <20130128085820.GR35868@acme.spoerlein.net> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/NoSyaoazf+aPmp=rJ5E9Umz"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 Cc: current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 20:04:51 -0000

Ulrich Spörlein wrote:

> On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote:
> > On 2013-Jan-27 14:31:56 -0000, Steven Hartland wrote:
> > >----- Original Message -----
> > >From: "Ulrich Spörlein"
> > >> I want to transplant my old zpool tank from a 1TB drive to a new 2TB
> > >> drive, but *not* use dd(1) or any other cloning mechanism, as the pool
> > >> was very full very often and is surely severely fragmented.
> > >
> > >Cant you just drop the disk in the original machine, set it as a mirror
> > >then once the mirror process has completed break the mirror and remove
> > >the 1TB disk.
> >
> > That will replicate any fragmentation as well.
> > "zfs send | zfs recv" is the only (current) way to defragment a ZFS pool.

It's not obvious to me why "zpool replace" (or doing it manually)
would replicate the fragmentation.

> But are you then also supposed to be able send incremental snapshots to
> a third pool from the pool that you just cloned?

Yes.

> I did the zpool replace now over night, and it did not remove the old
> device yet, as it found cksum errors on the pool:
>
> root@coyote:~# zpool status -v
>   pool: tank
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: resilvered 873G in 11h33m with 24 errors on Mon Jan 28 09:45:32 2013
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0    27
>           replacing-0  ONLINE       0     0    61
>             da0.eli    ONLINE       0     0    61
>             ada1.eli   ONLINE       0     0    61
>
> errors: Permanent errors have been detected in the following files:
>
>         tank/src@2013-01-17:/.svn/pristine/8e/8ed35772a38e0fec00bc1cbc2f05480f4fd4759b.svn-base
[...]
>         tank/ncvs@2013-01-17:/ports/textproc/uncrustify/distinfo,v
>
> Interestingly, these only seem to affect the snapshot, and I'm now
> wondering if that is the problem why the backup pool did not accept the
> next incremental snapshot from the new pool.

I doubt that. My expectation would be that it only prevents the
"zfs send" to finish successfully.

BTW, you could try reading the files to be sure that the checksum
problems are permanent and not just temporary USB issues.

> How does the receiving pool known that it has the correct snapshot to
> store an incremental one anyway? Is there a toplevel checksum, like for
> git commits? How can I display and compare that?
Try zstreamdump:

fk@r500 ~ $sudo zfs send -i @2013-01-24_20:48 tank/etc@2013-01-26_21:14 | zstreamdump | head -11
BEGIN record
        hdrtype = 1
        features = 4
        magic = 2f5bacbac
        creation_time = 5104392a
        type = 2
        flags = 0x0
        toguid = a1eb3cfe794e675c
        fromguid = 77fb8881b19cb41f
        toname = tank/etc@2013-01-26_21:14
END checksum = 1047a3f2dceb/67c999f5e40ecf9/442237514c1120ed/efd508ab5203c91c

fk@r500 ~ $sudo zfs send lexmark/backup/r500/tank/etc@2013-01-24_20:48 | zstreamdump | head -11
BEGIN record
        hdrtype = 1
        features = 4
        magic = 2f5bacbac
        creation_time = 51018ff4
        type = 2
        flags = 0x0
        toguid = 77fb8881b19cb41f
        fromguid = 0
        toname = lexmark/backup/r500/tank/etc@2013-01-24_20:48
END checksum = 1c262b5ffe935/78d8a68e0eb0c8e7/eb1dde3bd923d153/9e0829103649ae22

Fabian

From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 21:44:28 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4B5FCD8B; Mon, 28 Jan 2013 21:44:28 +0000 (UTC) (envelope-from dan@dan.emsphone.com) Received: from email2.allantgroup.com (email2.emsphone.com [199.67.51.116]) by mx1.freebsd.org (Postfix) with ESMTP id D7B1EE60; Mon, 28 Jan 2013 21:44:27 +0000 (UTC) Received: from dan.emsphone.com (dan.emsphone.com [172.17.17.101]) by email2.allantgroup.com (8.14.5/8.14.5) with ESMTP id r0SLfD5F000686 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 28 Jan 2013 15:41:13 -0600 (CST) (envelope-from dan@dan.emsphone.com) Received: from dan.emsphone.com (smmsp@localhost [127.0.0.1]) by dan.emsphone.com (8.14.6/8.14.6) with ESMTP id r0SLfCYg060240 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 28 Jan 2013 15:41:12 -0600 (CST) (envelope-from dan@dan.emsphone.com) Received: (from dan@localhost) by dan.emsphone.com (8.14.6/8.14.6/Submit) id r0SLfBWT060239; Mon, 28 Jan 2013 15:41:11 -0600 (CST) (envelope-from dan) Date: Mon, 28 Jan 2013 15:41:11 -0600 From: Dan Nelson To: Fabian Keil Subject: Re: Zpool surgery Message-ID: <20130128214111.GA14888@dan.emsphone.com> References: <20130127103612.GB38645@acme.spoerlein.net> <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> <20130127201140.GD29105@server.rulingia.com> <20130128085820.GR35868@acme.spoerlein.net> <20130128205802.1ffab53e@fabiankeil.de> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20130128205802.1ffab53e@fabiankeil.de> X-OS: FreeBSD 9.1-STABLE User-Agent: Mutt/1.5.21 (2010-09-15) X-Virus-Scanned: clamav-milter 0.97.6 at email2.allantgroup.com X-Virus-Status: Clean X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (email2.allantgroup.com [172.17.19.78]); Mon, 28 Jan 2013 15:41:13 -0600 (CST) X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00, RP_MATCHES_RCVD autolearn=ham version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on email2.allantgroup.com X-Scanned-By: MIMEDefang 2.73 Cc: current@freebsd.org, fs@freebsd.org, Ulrich Spörlein X-BeenThere:
freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 21:44:28 -0000 In the last episode (Jan 28), Fabian Keil said: > Ulrich Spörlein wrote: > > On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote: > > > On 2013-Jan-27 14:31:56 -0000, Steven Hartland wrote: > > > >----- Original Message ----- > > > >From: "Ulrich Spörlein" > > > >> I want to transplant my old zpool tank from a 1TB drive to a new > > > >> 2TB drive, but *not* use dd(1) or any other cloning mechanism, as > > > >> the pool was very full very often and is surely severely > > > >> fragmented. > > > > > > > >Cant you just drop the disk in the original machine, set it as a > > > >mirror then once the mirror process has completed break the mirror > > > >and remove the 1TB disk. > > > > > > That will replicate any fragmentation as well. "zfs send | zfs recv" > > > is the only (current) way to defragment a ZFS pool. > > It's not obvious to me why "zpool replace" (or doing it manually) > would replicate the fragmentation. "zpool replace" essentially adds your new disk as a mirror to the parent vdev, then deletes the original disk when the resilver is done. Since mirrors are block-identical copies of each other, the new disk will contain an exact copy of the original disk, followed by 1TB of freespace. -- Dan Nelson dnelson@allantgroup.com From owner-freebsd-fs@FreeBSD.ORG Mon Jan 28 21:55:55 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 5781C314 for ; Mon, 28 Jan 2013 21:55:55 +0000 (UTC) (envelope-from matthew.ahrens@delphix.com) Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by mx1.freebsd.org (Postfix) with ESMTP id E760BF38 for ; Mon, 28 Jan 2013 21:55:54 +0000 (UTC) Received: by mail-wi0-f179.google.com with SMTP id o1so1888499wic.12 for ; Mon, 28 Jan 2013 13:55:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=delphix.com; s=google; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=y8OpT1j5CjrT1VzbLMRWCOk/mby9FAJJsjTdnEF+mso=; b=GetgPVQ1HdVIyAKc2Y0cIvhFjd1SYtTXxCxpjfZ+5TGSDfrrNJATnur35toaZ+m/IR igVfI1WHVvOtqJxLCtGyWb9Sy21rr1R1ZkSo7vtPcIp6NqdzfBbV79DV0knGbdhsYCGz Q/e55tVaGJmdrMxqO08g4DFU8GnYDIIYR6+dI= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type:x-gm-message-state; bh=y8OpT1j5CjrT1VzbLMRWCOk/mby9FAJJsjTdnEF+mso=; b=CGj0/cZH3l5LPaMnsMi6V4myPuTYyjBBY9ZLHRKGS+j8nmDyH0xYzIkQY1DJAahz6/ KTnp9kArchpdngvUfqs02yj5RRNBygYtZi7Hh1GcYke8C95xOmAQBEudkKXle8MJCFSE fRAZXLlsDGbNb3d9ypD5Qel3lUBEa4gtasdD4J0/8MOnh926JN6VMDxcDMRs7lK31Zep SSOFCy8qeZ7oF9mIOtFGxh1Un8z/qvp88GXtQ5LniRuCVnuOSSOoAN7F4rG7WtRC17Jh /yciM1RWtWmGa12TBTszk6v4J6PL55Oc1bzOexGrHt5kFksAGhc51Cf8Q2hAcbMx2bSe IDNw== MIME-Version: 1.0 X-Received: by 10.194.123.105 with SMTP id lz9mr23895914wjb.43.1359410153419; Mon, 28 Jan 2013 13:55:53 -0800 (PST) Received: by 10.194.32.168 with HTTP; Mon, 28 Jan 2013 13:55:53 -0800 (PST) In-Reply-To: <5105252D.6060502@platinum.linux.pl> References: <5105252D.6060502@platinum.linux.pl> Date: Mon, 28 Jan 2013 13:55:53 -0800 Message-ID: Subject: Re: RAID-Z wasted space - asize roundups to nparity +1 From: Matthew Ahrens To: Adam Nowacki 
X-Gm-Message-State: ALoCoQmFjZxrlyGjrcnEgskfc5TytCpM4alOcN+vLYpof6HbkJ02aF5UI2FElmGFk0H7/wXAi3Z+ Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2013 21:55:55 -0000 This is so that we won't end up with small, unallocatable segments. E.g. if you are using RAIDZ2, the smallest usable segment would be 3 sectors (1 sector data + 2 sectors parity). If we left a 1 or 2 sector free segment, it would be unusable and you'd be able to get into strange accounting situations where you have free space but can't write because you're "out of space". The amount of waste due to this can be minimized by using larger blocksizes (e.g. the default recordsize of 128k and files larger than 128k), and by using smaller sector sizes (e.g. 512b sector disks rather than 4k sector disks). In your case these techniques would limit the waste to 0.6%. --matt On Sun, Jan 27, 2013 at 5:01 AM, Adam Nowacki wrote: > I've just found something very weird in the ZFS code. > > sys/cddl/contrib/opensolaris/**uts/common/fs/zfs/vdev_raidz.**c:504 in > HEAD > > Can someone explain the reason behind this line of code? What it does is > align on-disk record size to a multiple of number of parity disks + 1 ... > this really doesn't make any sense. So far as I can tell those extra > sectors are just padding - completely unused. > > For the array I'm using this results in 4.8% of wasted disk space - 1.7TB. > It's a 12x 3TB disk RAID-Z2. > ______________________________**_________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/**mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@**freebsd.org > " > From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 00:21:42 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id DE47BB90 for ; Tue, 29 Jan 2013 00:21:42 +0000 (UTC) (envelope-from dumbbell@FreeBSD.org) Received: from mail.made4.biz (unknown [IPv6:2001:41d0:1:7018::1:3]) by mx1.freebsd.org (Postfix) with ESMTP id A4B1A995 for ; Tue, 29 Jan 2013 00:21:42 +0000 (UTC) Received: from [2a01:e35:8b20:ae00:290:f5ff:fe9d:b78c] (helo=magellan.dumbbell.fr) by mail.made4.biz with esmtpsa (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) (Exim 4.80.1 (FreeBSD)) (envelope-from ) id 1TzyxI-000IZ1-MC; Tue, 29 Jan 2013 01:21:41 +0100 Message-ID: <5107160F.9000008@FreeBSD.org> Date: Tue, 29 Jan 2013 01:21:35 +0100 From: =?ISO-8859-1?Q?Jean-S=E9bastien_P=E9dron?= User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2 MIME-Version: 1.0 To: Will DeVries Subject: Re: Read-only port of NetBSD's UDF filesystem. 
References: In-Reply-To: X-Enigmail-Version: 1.4.6 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig32A5BA3EC9C1CE5C726A29F3" Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 00:21:42 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig32A5BA3EC9C1CE5C726A29F3 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On 21.01.2013 00:30, Will DeVries wrote: > I have been working on a read-only port of NetBSD's UDF file > system implementation, which I now believe to be complete except for an= y > bug related fixes that may arise. This file system supports UDF versio= ns > through 2.60 on CDs, DVDs and Blu-rays. >=20 > While it could use more testing, it seems to be stable and working well= , > and now seems like a good time to publish it for review. At the very > least, I can judge interest and get advice on aspects that perhaps need= > more work. Hi Will! I just tested your port and it's working for me! I was able to mount a Blu-Ray disc and play the movie using VLC. However, it seems limited to 3 MB/s, which prevents a smooth read of the movie. Running dd(1) confirms that. I didn't investigate further for now and fear I won't have the time to do it in the short term... Have you tested the speed on NetBSD? If you have any ideas, I'll gladly test them! Thanks for your work! --=20 Jean-S=E9bastien P=E9dron --------------enig32A5BA3EC9C1CE5C726A29F3 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iEYEARECAAYFAlEHFhQACgkQa+xGJsFYOlOGvQCeOOMne4aACOYkv9kv5G+6XuIk r00An2ZLjGeC/Ck3O5IMVM6KnPQx9+eP =R+SK -----END PGP SIGNATURE----- --------------enig32A5BA3EC9C1CE5C726A29F3-- From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 03:21:20 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 28860CFF; Tue, 29 Jan 2013 03:21:20 +0000 (UTC) (envelope-from wollman@hergotha.csail.mit.edu) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) by mx1.freebsd.org (Postfix) with ESMTP id 4E4701B7; Tue, 29 Jan 2013 03:21:19 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.5/8.14.5) with ESMTP id r0T3LHOh080812; Mon, 28 Jan 2013 22:21:17 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.5/8.14.4/Submit) id r0T3LHvB080809; Mon, 28 Jan 2013 22:21:17 -0500 (EST) (envelope-from wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <20743.16429.97668.569869@hergotha.csail.mit.edu> Date: Mon, 28 Jan 2013 22:21:17 -0500 From: Garrett Wollman To: freebsd-stable@freebsd.org, freebsd-fs@freebsd.org Subject: ZFS deadlock on rrl->rr_ -- look familiar to anyone? 
X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (hergotha.csail.mit.edu [127.0.0.1]); Mon, 28 Jan 2013 22:21:17 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 03:21:20 -0000 I just had a big fileserver deadlock in an odd way. I was investigating a user's problem, and decided for various reasons to restart mountd. It had been complaining like this: Jan 28 21:06:43 nfs-prod-1 mountd[1108]: can't delete exports for /usr/local/.zfs/snapshot/monthly-2013-01: Invalid argument for a while, which is odd because /usr/local was never exported. When I restarted mountd, it hung waiting on rrl->rr_, but the system may already have been deadlocked at that point. procstat reported: 87678 104365 mountd - mi_switch sleepq_wait _cv_wait rrw_enter zfs_root lookup namei vfs_donmount sys_nmount amd64_syscall Xfast_syscall I was able to run shutdown, and the rc scripts eventually hung in sync(1) and timed out. The kernel then hung trying to do the same thing, but I was able to break into the debugger. The debugger interrupted an idle thread, which was not particularly helpful, but I was able to quickly gather the following information before I had to reset the machine to restore normal service. Locked vnodes 0xfffffe00536383c0: 0xfffffe00536383c0: tag syncer, type VNON tag syncer, type VNON usecount 1, writecount 0, refcount 2 mountedhere 0 usecount 1, writecount 0, refcount 2 mountedhere 0 flags (VI(0x200)) flags (VI(0x200)) lock type syncer: EXCL by thread 0xfffffe00348cc470 (pid 22) lock type syncer: EXCL by thread 0xfffffe00348cc470 (pid 22) db> ps pid ppid pgrp uid state wmesg wchan cmd 87996 1 87994 65534 D rrl->rr_ 0xfffffe0048ff8108 df 87976 1 87726 0 D+ rrl->rr_ 0xfffffe0048ff8108 sync 87707 1 87705 65534 D rrl->rr_ 0xfffffe0048ff8108 df 87700 1 87698 65534 D rrl->rr_ 0xfffffe0048ff8108 df 87678 1 87657 0 D+ rrl->rr_ 0xfffffe0048ff8108 mountd 87531 1 87529 65534 D rrl->rr_ 0xfffffe0048ff8108 df 87387 1 87385 65534 D rrl->rr_ 0xfffffe0048ff8108 df 87380 1 87378 65534 D rrl->rr_ 0xfffffe0048ff8108 df 87103 1 87101 65534 D rrl->rr_ 0xfffffe0048ff8108 df 87096 1 87094 65534 D rrl->rr_ 0xfffffe0048ff8108 df 85193 1 85192 0 D zio->io_ 0xfffffe10d3e75320 zfs 24 0 0 0 DL sdflush 0xffffffff80e50878 [softdepflush] 23 0 0 0 DL vlruwt 0xfffffe0048c0a940 [vnlru] 22 0 0 0 DL rrl->rr_ 0xfffffe0048ff8108 [syncer] 21 0 0 0 DL psleep 0xffffffff80e3c048 [bufdaemon] 20 0 0 0 DL pgzero 0xffffffff80e5a81c [pagezero] 19 0 0 0 DL psleep 0xffffffff80e599e8 [vmdaemon] 18 0 0 0 DL psleep 0xffffffff80e599ac [pagedaemon] 17 0 0 0 DL gkt:wait 0xffffffff80de6c0c [g_mp_kt] 16 0 0 0 DL ipmireq 0xfffffe00347400b8 [ipmi0: kcs] 9 0 0 0 DL ccb_scan 0xffffffff80dc1360 [xpt_thrd] 8 0 0 0 DL waiting_ 0xffffffff80e41e80 [sctp_iterator] 7 0 0 0 DL (threaded) [zfskern] 101355 D tx->tx_s 0xfffffe0050342e10 [txg_thread_enter] 101354 D tx->tx_q 0xfffffe0050342e30 [txg_thread_enter] 100989 D tx->tx_s 0xfffffe004fd27a10 [txg_thread_enter] 100988 D tx->tx_q 0xfffffe004fd27a30 [txg_thread_enter] 100593 D tx->tx_s 0xfffffe004a8c0a10 [txg_thread_enter] 100592 D tx->tx_q 0xfffffe004a8c0a30 [txg_thread_enter] 
100216 D l2arc_fe 0xffffffff81228bc0 [l2arc_feed_thread] 100215 D arc_recl 0xffffffff81218d20 [arc_reclaim_thread] 15 0 0 0 DL (threaded) [usb] [32 uninteresting and identical threads deleted] 6 0 0 0 DL mps_scan 0xfffffe00276816a8 [mps_scan2] 5 0 0 0 DL mps_scan 0xfffffe0027612ca8 [mps_scan1] 4 0 0 0 DL mps_scan 0xfffffe00274ef4a8 [mps_scan0] 14 0 0 0 DL - 0xffffffff80ded764 [yarrow] 3 0 0 0 DL crypto_r 0xffffffff80e4e0a0 [crypto returns] 2 0 0 0 DL crypto_w 0xffffffff80e4e060 [crypto] 13 0 0 0 DL (threaded) [geom] 100055 D - 0xffffffff80de6b90 [g_down] 100054 D - 0xffffffff80de6b88 [g_up] 100053 D - 0xffffffff80de6b78 [g_event] 12 0 0 0 WL (threaded) [intr] 100189 I [irq1: atkbd0] 100188 I [swi0: uart uart] 100187 I [irq19: atapci1] 100186 I [irq18: atapci0+] 100169 I [irq294: igb1:link] 100167 I [irq293: igb1:que 7] 100165 I [irq292: igb1:que 6] 100163 I [irq291: igb1:que 5] 100161 I [irq290: igb1:que 4] 100159 I [irq289: igb1:que 3] 100157 I [irq288: igb1:que 2] 100155 I [irq287: igb1:que 1] 100153 I [irq286: igb1:que 0] 100152 I [irq285: igb0:link] 100150 I [irq284: igb0:que 7] 100148 I [irq283: igb0:que 6] 100146 I [irq282: igb0:que 5] 100144 I [irq281: igb0:que 4] 100142 I [irq280: igb0:que 3] 100140 I [irq279: igb0:que 2] 100138 I [irq278: igb0:que 1] 100136 I [irq277: igb0:que 0] 100131 I [irq20: hpet0 ehci0] 100126 I [irq21: uhci2 uhci5] 100121 I [irq22: uhci1 uhci4] 100116 I [irq23: uhci0 uhci3+] 100115 I [irq276: mps2] 100112 I [irq275: mps1] 100108 I [irq274: ix1:link] 100106 I [irq273: ix1:que 7] 100104 I [irq272: ix1:que 6] 100102 I [irq271: ix1:que 5] 100100 I [irq270: ix1:que 4] 100098 I [irq269: ix1:que 3] 100096 I [irq268: ix1:que 2] 100094 I [irq267: ix1:que 1] 100092 I [irq266: ix1:que 0] 100090 I [irq265: ix0:link] 100088 I [irq264: ix0:que 7] 100086 I [irq263: ix0:que 6] 100084 I [irq262: ix0:que 5] 100082 I [irq261: ix0:que 4] 100080 I [irq260: ix0:que 3] 100078 I [irq259: ix0:que 2] 100076 I [irq258: ix0:que 1] 100074 I [irq257: ix0:que 0] 100073 I [irq256: mps0] 100065 I [swi2: cambio] 100064 I [swi6: task queue] 100063 I [swi6: Giant taskq] 100060 I [swi5: +] [24 identical [swi4: clock] threads deleted] 100028 I [swi1: netisr 0] 100027 I [swi3: vm] 11 0 0 0 RL (threaded) [idle] [24 identical idle threads deleted] 1 0 1 0 DLs rrl->rr_ 0xfffffe0048ff8108 [init] 10 0 0 0 DL audit_wo 0xffffffff80e4f7f0 [audit] 0 0 0 0 DLs (threaded) [kernel] 420220 D - 0xfffffe07bf578380 [zil_clean] [66 similar zil_clean threads deleted] 101353 D - 0xfffffe004a481c80 [zfs_vn_rele_taskq] 101352 D - 0xfffffe005324fa80 [zio_ioctl_intr] 101351 D - 0xfffffe005324fb00 [zio_ioctl_issue] 101350 D - 0xfffffe005324fb80 [zio_claim_intr] 101349 D - 0xfffffe005324fc00 [zio_claim_issue] 101348 D - 0xfffffe005324fc80 [zio_free_intr] 101347 D - 0xfffffe005324fd00 [zio_free_issue_99] [99 similar zio_free_issue_* threads deleted] 101247 D - 0xfffffe005324fd80 [zio_write_intr_high] 101246 D - 0xfffffe005324fd80 [zio_write_intr_high] 101245 D - 0xfffffe005324fd80 [zio_write_intr_high] 101244 D - 0xfffffe005324fd80 [zio_write_intr_high] 101243 D - 0xfffffe005324fd80 [zio_write_intr_high] 101242 D - 0xfffffe005324fe00 [zio_write_intr_7] 101241 D - 0xfffffe005324fe00 [zio_write_intr_6] 101240 D - 0xfffffe005324fe00 [zio_write_intr_5] 101239 D - 0xfffffe005324fe00 [zio_write_intr_4] 101238 D - 0xfffffe005324fe00 [zio_write_intr_3] 101237 D - 0xfffffe005324fe00 [zio_write_intr_2] 101236 D - 0xfffffe005324fe00 [zio_write_intr_1] 101235 D - 0xfffffe005324fe00 [zio_write_intr_0] 101234 D - 0xfffffe0053250000 
[zio_write_issue_hig] 101233 D - 0xfffffe0053250000 [zio_write_issue_hig] 101232 D - 0xfffffe0053250000 [zio_write_issue_hig] 101231 D - 0xfffffe0053250000 [zio_write_issue_hig] 101230 D - 0xfffffe0053250000 [zio_write_issue_hig] 101229 D - 0xfffffe0053250080 [zio_write_issue_23] 101228 D - 0xfffffe0053250080 [zio_write_issue_22] 101227 D - 0xfffffe0053250080 [zio_write_issue_21] 101226 D - 0xfffffe0053250080 [zio_write_issue_20] 101225 D - 0xfffffe0053250080 [zio_write_issue_19] 101224 D - 0xfffffe0053250080 [zio_write_issue_18] 101223 D - 0xfffffe0053250080 [zio_write_issue_17] 101222 D - 0xfffffe0053250080 [zio_write_issue_16] 101221 D - 0xfffffe0053250080 [zio_write_issue_15] 101220 D - 0xfffffe0053250080 [zio_write_issue_14] 101219 D - 0xfffffe0053250080 [zio_write_issue_13] 101218 D - 0xfffffe0053250080 [zio_write_issue_12] 101217 D - 0xfffffe0053250080 [zio_write_issue_11] 101216 D - 0xfffffe0053250080 [zio_write_issue_10] 101215 D - 0xfffffe0053250080 [zio_write_issue_9] 101214 D - 0xfffffe0053250080 [zio_write_issue_8] 101213 D - 0xfffffe0053250080 [zio_write_issue_7] 101212 D - 0xfffffe0053250080 [zio_write_issue_6] 101211 D - 0xfffffe0053250080 [zio_write_issue_5] 101210 D - 0xfffffe0053250080 [zio_write_issue_4] 101209 D - 0xfffffe0053250080 [zio_write_issue_3] 101208 D - 0xfffffe0053250080 [zio_write_issue_2] 101207 D - 0xfffffe0053250080 [zio_write_issue_1] 101206 D - 0xfffffe0053250080 [zio_write_issue_0] 101205 D - 0xfffffe0053250100 [zio_read_intr_23] 101204 D - 0xfffffe0053250100 [zio_read_intr_22] 101203 D - 0xfffffe0053250100 [zio_read_intr_21] 101202 D - 0xfffffe0053250100 [zio_read_intr_20] 101201 D - 0xfffffe0053250100 [zio_read_intr_19] 101200 D - 0xfffffe0053250100 [zio_read_intr_18] 101199 D - 0xfffffe0053250100 [zio_read_intr_17] 101198 D - 0xfffffe0053250100 [zio_read_intr_16] 101197 D - 0xfffffe0053250100 [zio_read_intr_15] 101196 D - 0xfffffe0053250100 [zio_read_intr_14] 101195 D - 0xfffffe0053250100 [zio_read_intr_13] 101194 D - 0xfffffe0053250100 [zio_read_intr_12] 101193 D - 0xfffffe0053250100 [zio_read_intr_11] 101192 D - 0xfffffe0053250100 [zio_read_intr_10] 101191 D - 0xfffffe0053250100 [zio_read_intr_9] 101190 D - 0xfffffe0053250100 [zio_read_intr_8] 101189 D - 0xfffffe0053250100 [zio_read_intr_7] 101188 D - 0xfffffe0053250100 [zio_read_intr_6] 101187 D - 0xfffffe0053250100 [zio_read_intr_5] 101186 D - 0xfffffe0053250100 [zio_read_intr_4] 101185 D - 0xfffffe0053250100 [zio_read_intr_3] 101184 D - 0xfffffe0053250100 [zio_read_intr_2] 101183 D - 0xfffffe0053250100 [zio_read_intr_1] 101182 D - 0xfffffe0053250100 [zio_read_intr_0] 101181 D - 0xfffffe0053250180 [zio_read_issue_7] 101180 D - 0xfffffe0053250180 [zio_read_issue_6] 101179 D - 0xfffffe0053250180 [zio_read_issue_5] 101178 D - 0xfffffe0053250180 [zio_read_issue_4] 101177 D - 0xfffffe0053250180 [zio_read_issue_3] 101176 D - 0xfffffe0053250180 [zio_read_issue_2] 101175 D - 0xfffffe0053250180 [zio_read_issue_1] 101174 D - 0xfffffe0053250180 [zio_read_issue_0] 101173 D - 0xfffffe0053250200 [zio_null_intr] 101172 D - 0xfffffe0053250280 [zio_null_issue] 100987 D - 0xfffffe0048cc9500 [zfs_vn_rele_taskq] 100986 D - 0xfffffe0048c72280 [zio_ioctl_intr] 100985 D - 0xfffffe0048c71a00 [zio_ioctl_issue] 100984 D - 0xfffffe0048dd0d00 [zio_claim_intr] 100983 D - 0xfffffe0048dd0680 [zio_claim_issue] 100982 D - 0xfffffe004a949080 [zio_free_intr] 100981 D - 0xfffffe0048b77d80 [zio_free_issue_99] [99 more zio_free_issue_* threads deleted] 100881 D - 0xfffffe004a94a480 [zio_write_intr_high] 100880 D - 
0xfffffe004a94a480 [zio_write_intr_high] 100879 D - 0xfffffe004a94a480 [zio_write_intr_high] 100878 D - 0xfffffe004a94a480 [zio_write_intr_high] 100877 D - 0xfffffe004a94a480 [zio_write_intr_high] 100876 D - 0xfffffe0048dd1180 [zio_write_intr_7] 100875 D - 0xfffffe0048dd1180 [zio_write_intr_6] 100874 D - 0xfffffe0048dd1180 [zio_write_intr_5] 100873 D - 0xfffffe0048dd1180 [zio_write_intr_4] 100872 D - 0xfffffe0048dd1180 [zio_write_intr_3] 100871 D - 0xfffffe0048dd1180 [zio_write_intr_2] 100870 D - 0xfffffe0048dd1180 [zio_write_intr_1] 100869 D - 0xfffffe0048dd1180 [zio_write_intr_0] 100868 D - 0xfffffe0048dd1100 [zio_write_issue_hig] 100867 D - 0xfffffe0048dd1100 [zio_write_issue_hig] 100866 D - 0xfffffe0048dd1100 [zio_write_issue_hig] 100865 D - 0xfffffe0048dd1100 [zio_write_issue_hig] 100864 D - 0xfffffe0048dd1100 [zio_write_issue_hig] 100863 D - 0xfffffe0048dd1080 [zio_write_issue_23] 100862 D - 0xfffffe0048dd1080 [zio_write_issue_22] 100861 D - 0xfffffe0048dd1080 [zio_write_issue_21] 100860 D - 0xfffffe0048dd1080 [zio_write_issue_20] 100859 D - 0xfffffe0048dd1080 [zio_write_issue_19] 100858 D - 0xfffffe0048dd1080 [zio_write_issue_18] 100857 D - 0xfffffe0048dd1080 [zio_write_issue_17] 100856 D - 0xfffffe0048dd1080 [zio_write_issue_16] 100855 D - 0xfffffe0048dd1080 [zio_write_issue_15] 100854 D - 0xfffffe0048dd1080 [zio_write_issue_14] 100853 D - 0xfffffe0048dd1080 [zio_write_issue_13] 100852 D - 0xfffffe0048dd1080 [zio_write_issue_12] 100851 D - 0xfffffe0048dd1080 [zio_write_issue_11] 100850 D - 0xfffffe0048dd1080 [zio_write_issue_10] 100849 D - 0xfffffe0048dd1080 [zio_write_issue_9] 100848 D - 0xfffffe0048dd1080 [zio_write_issue_8] 100847 D - 0xfffffe0048dd1080 [zio_write_issue_7] 100846 D - 0xfffffe0048dd1080 [zio_write_issue_6] 100845 D - 0xfffffe0048dd1080 [zio_write_issue_5] 100844 D - 0xfffffe0048dd1080 [zio_write_issue_4] 100843 D - 0xfffffe0048dd1080 [zio_write_issue_3] 100842 D - 0xfffffe0048dd1080 [zio_write_issue_2] 100841 D - 0xfffffe0048dd1080 [zio_write_issue_1] 100840 D - 0xfffffe0048dd1080 [zio_write_issue_0] 100839 D - 0xfffffe0048dd1000 [zio_read_intr_23] 100838 D - 0xfffffe0048dd1000 [zio_read_intr_22] 100837 D - 0xfffffe0048dd1000 [zio_read_intr_21] 100836 D - 0xfffffe0048dd1000 [zio_read_intr_20] 100835 D - 0xfffffe0048dd1000 [zio_read_intr_19] 100834 D - 0xfffffe0048dd1000 [zio_read_intr_18] 100833 D - 0xfffffe0048dd1000 [zio_read_intr_17] 100832 D - 0xfffffe0048dd1000 [zio_read_intr_16] 100831 D - 0xfffffe0048dd1000 [zio_read_intr_15] 100830 D - 0xfffffe0048dd1000 [zio_read_intr_14] 100829 D - 0xfffffe0048dd1000 [zio_read_intr_13] 100828 D - 0xfffffe0048dd1000 [zio_read_intr_12] 100827 D - 0xfffffe0048dd1000 [zio_read_intr_11] 100826 D - 0xfffffe0048dd1000 [zio_read_intr_10] 100825 D - 0xfffffe0048dd1000 [zio_read_intr_9] 100824 D - 0xfffffe0048dd1000 [zio_read_intr_8] 100823 D - 0xfffffe0048dd1000 [zio_read_intr_7] 100822 D - 0xfffffe0048dd1000 [zio_read_intr_6] 100821 D - 0xfffffe0048dd1000 [zio_read_intr_5] 100820 D - 0xfffffe0048dd1000 [zio_read_intr_4] 100819 D - 0xfffffe0048dd1000 [zio_read_intr_3] 100818 D - 0xfffffe0048dd1000 [zio_read_intr_2] 100817 D - 0xfffffe0048dd1000 [zio_read_intr_1] 100816 D - 0xfffffe0048dd1000 [zio_read_intr_0] 100815 D - 0xfffffe0048dd0e00 [zio_read_issue_7] 100814 D - 0xfffffe0048dd0e00 [zio_read_issue_6] 100813 D - 0xfffffe0048dd0e00 [zio_read_issue_5] 100812 D - 0xfffffe0048dd0e00 [zio_read_issue_4] 100811 D - 0xfffffe0048dd0e00 [zio_read_issue_3] 100810 D - 0xfffffe0048dd0e00 [zio_read_issue_2] 100809 D - 0xfffffe0048dd0e00 
[zio_read_issue_1] 100808 D - 0xfffffe0048dd0e00 [zio_read_issue_0] 100807 D - 0xfffffe0048dd0600 [zio_null_intr] 100806 D - 0xfffffe0048dd0180 [zio_null_issue] 100594 D - 0xfffffe004a3bcc80 [zil_clean] 100591 D - 0xfffffe0048c65100 [zfs_vn_rele_taskq] 100590 D - 0xfffffe0048d5c280 [zio_ioctl_intr] 100589 D - 0xfffffe0048d5c300 [zio_ioctl_issue] 100588 D - 0xfffffe0048d5c380 [zio_claim_intr] 100587 D - 0xfffffe0048d5c400 [zio_claim_issue] 100586 D - 0xfffffe0048d5c480 [zio_free_intr] 100585 D - 0xfffffe0048d5c500 [zio_free_issue_99] [99 more zio_free_issue_* threads deleted] 100485 D - 0xfffffe0048d5c580 [zio_write_intr_high] 100484 D - 0xfffffe0048d5c580 [zio_write_intr_high] 100483 D - 0xfffffe0048d5c580 [zio_write_intr_high] 100482 D - 0xfffffe0048d5c580 [zio_write_intr_high] 100481 D - 0xfffffe0048d5c580 [zio_write_intr_high] 100480 D - 0xfffffe0048d5c600 [zio_write_intr_7] 100479 D - 0xfffffe0048d5c600 [zio_write_intr_6] 100478 D - 0xfffffe0048d5c600 [zio_write_intr_5] 100477 D - 0xfffffe0048d5c600 [zio_write_intr_4] 100476 D - 0xfffffe0048d5c600 [zio_write_intr_3] 100475 D - 0xfffffe0048d5c600 [zio_write_intr_2] 100474 D - 0xfffffe0048d5c600 [zio_write_intr_1] 100473 D - 0xfffffe0048d5c600 [zio_write_intr_0] 100472 D - 0xfffffe0048d5c680 [zio_write_issue_hig] 100471 D - 0xfffffe0048d5c680 [zio_write_issue_hig] 100470 D - 0xfffffe0048d5c680 [zio_write_issue_hig] 100469 D - 0xfffffe0048d5c680 [zio_write_issue_hig] 100468 D - 0xfffffe0048d5c680 [zio_write_issue_hig] 100467 D - 0xfffffe0048d5c700 [zio_write_issue_23] 100466 D - 0xfffffe0048d5c700 [zio_write_issue_22] 100465 D - 0xfffffe0048d5c700 [zio_write_issue_21] 100464 D - 0xfffffe0048d5c700 [zio_write_issue_20] 100463 D - 0xfffffe0048d5c700 [zio_write_issue_19] 100462 D - 0xfffffe0048d5c700 [zio_write_issue_18] 100461 D - 0xfffffe0048d5c700 [zio_write_issue_17] 100460 D - 0xfffffe0048d5c700 [zio_write_issue_16] 100459 D - 0xfffffe0048d5c700 [zio_write_issue_15] 100458 D - 0xfffffe0048d5c700 [zio_write_issue_14] 100457 D - 0xfffffe0048d5c700 [zio_write_issue_13] 100456 D - 0xfffffe0048d5c700 [zio_write_issue_12] 100455 D - 0xfffffe0048d5c700 [zio_write_issue_11] 100454 D - 0xfffffe0048d5c700 [zio_write_issue_10] 100453 D - 0xfffffe0048d5c700 [zio_write_issue_9] 100452 D - 0xfffffe0048d5c700 [zio_write_issue_8] 100451 D - 0xfffffe0048d5c700 [zio_write_issue_7] 100450 D - 0xfffffe0048d5c700 [zio_write_issue_6] 100449 D - 0xfffffe0048d5c700 [zio_write_issue_5] 100448 D - 0xfffffe0048d5c700 [zio_write_issue_4] 100447 D - 0xfffffe0048d5c700 [zio_write_issue_3] 100446 D - 0xfffffe0048d5c700 [zio_write_issue_2] 100445 D - 0xfffffe0048d5c700 [zio_write_issue_1] 100444 D - 0xfffffe0048d5c700 [zio_write_issue_0] 100443 D - 0xfffffe0048d5c780 [zio_read_intr_23] 100442 D - 0xfffffe0048d5c780 [zio_read_intr_22] 100441 D - 0xfffffe0048d5c780 [zio_read_intr_21] 100440 D - 0xfffffe0048d5c780 [zio_read_intr_20] 100439 D - 0xfffffe0048d5c780 [zio_read_intr_19] 100438 D - 0xfffffe0048d5c780 [zio_read_intr_18] 100437 D - 0xfffffe0048d5c780 [zio_read_intr_17] 100436 D - 0xfffffe0048d5c780 [zio_read_intr_16] 100435 D - 0xfffffe0048d5c780 [zio_read_intr_15] 100434 D - 0xfffffe0048d5c780 [zio_read_intr_14] 100433 D - 0xfffffe0048d5c780 [zio_read_intr_13] 100432 D - 0xfffffe0048d5c780 [zio_read_intr_12] 100431 D - 0xfffffe0048d5c780 [zio_read_intr_11] 100430 D - 0xfffffe0048d5c780 [zio_read_intr_10] 100429 D - 0xfffffe0048d5c780 [zio_read_intr_9] 100428 D - 0xfffffe0048d5c780 [zio_read_intr_8] 100427 D - 0xfffffe0048d5c780 [zio_read_intr_7] 100426 D - 
0xfffffe0048d5c780 [zio_read_intr_6] 100425 D - 0xfffffe0048d5c780 [zio_read_intr_5] 100424 D - 0xfffffe0048d5c780 [zio_read_intr_4] 100423 D - 0xfffffe0048d5c780 [zio_read_intr_3] 100422 D - 0xfffffe0048d5c780 [zio_read_intr_2] 100421 D - 0xfffffe0048d5c780 [zio_read_intr_1] 100420 D - 0xfffffe0048d5c780 [zio_read_intr_0] 100419 D - 0xfffffe0048d5c800 [zio_read_issue_7] 100418 D - 0xfffffe0048d5c800 [zio_read_issue_6] 100417 D - 0xfffffe0048d5c800 [zio_read_issue_5] 100416 D - 0xfffffe0048d5c800 [zio_read_issue_4] 100415 D - 0xfffffe0048d5c800 [zio_read_issue_3] 100414 D - 0xfffffe0048d5c800 [zio_read_issue_2] 100413 D - 0xfffffe0048d5c800 [zio_read_issue_1] 100412 D - 0xfffffe0048d5c800 [zio_read_issue_0] 100411 D - 0xfffffe0048d5c880 [zio_null_intr] 100410 D - 0xfffffe0048d5c900 [zio_null_issue] 100214 D - 0xfffffe00348bbc00 [system_taskq_23] [23 more system_taskq_* threads deleted] 100190 D - 0xfffffe00348bbc80 [mca taskq] 100168 D - 0xfffffe0034092b00 [igb1 que] 100166 D - 0xfffffe0034092c80 [igb1 que] 100164 D - 0xfffffe0034092e00 [igb1 que] 100162 D - 0xfffffe0034092180 [igb1 que] 100160 D - 0xfffffe0034092300 [igb1 que] 100158 D - 0xfffffe0034092480 [igb1 que] 100156 D - 0xfffffe003408b300 [igb1 que] 100154 D - 0xfffffe003408b480 [igb1 que] 100151 D - 0xfffffe003405b500 [igb0 que] 100149 D - 0xfffffe003405b680 [igb0 que] 100147 D - 0xfffffe0034054580 [igb0 que] 100145 D - 0xfffffe003404d400 [igb0 que] 100143 D - 0xfffffe003404d580 [igb0 que] 100141 D - 0xfffffe003404d700 [igb0 que] 100139 D - 0xfffffe003404d880 [igb0 que] 100137 D - 0xfffffe003404da00 [igb0 que] 100113 D - 0xfffffe0027807300 [mps2 taskq] 100110 D - 0xfffffe0027697a80 [mps1 taskq] 100109 D - 0xfffffe002768a700 [ix1 linkq] 100107 D - 0xfffffe002768a800 [ix1 que] 100105 D - 0xfffffe002768a980 [ix1 que] 100103 D - 0xfffffe002768ab00 [ix1 que] 100101 D - 0xfffffe002768ac80 [ix1 que] 100099 D - 0xfffffe0027680b80 [ix1 que] 100097 D - 0xfffffe0027680d00 [ix1 que] 100095 D - 0xfffffe0027623680 [ix1 que] 100093 D - 0xfffffe0027623380 [ix1 que] 100091 D - 0xfffffe0027600480 [ix0 linkq] 100089 D - 0xfffffe0027600580 [ix0 que] 100087 D - 0xfffffe0027600700 [ix0 que] 100085 D - 0xfffffe0027527300 [ix0 que] 100083 D - 0xfffffe0027527480 [ix0 que] 100081 D - 0xfffffe0027527600 [ix0 que] 100079 D - 0xfffffe0027527780 [ix0 que] 100077 D - 0xfffffe0027527900 [ix0 que] 100075 D - 0xfffffe0027527a80 [ix0 que] 100071 D - 0xfffffe00274ffe00 [mps0 taskq] 100070 D - 0xfffffe00273fdb00 [kqueue taskq] 100069 D - 0xfffffe00273fdb80 [ffs_trim taskq] 100068 D - 0xfffffe00273fdc00 [acpi_task_2] 100067 D - 0xfffffe00273fdc00 [acpi_task_1] 100066 D - 0xfffffe00273fdc00 [acpi_task_0] 100062 D - 0xfffffe002743b280 [aiod_bio taskq] 100061 D zfsvfs-> 0xfffffe0048ff8138 [thread taskq] 100056 D - 0xfffffe002732d600 [firmware taskq] 100000 D sched 0xffffffff80de6d80 [swapper] The stuck df(1) processes running as nobody were undoubtedly started by munin-node, and seem to be related to my user's symptom (munin graphs show no response for about half an hour after the user's problem starts). It may not be a *true* deadlock, because over the past few days, munin has been showing this problem at about the same time of day, but the system always comes back (without a reboot) in a little over half an hour. Does anyone recognize this? If it happens again, which threads' stack traces would be useful in diagnosing this? 
-GAWollman From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 08:20:54 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 020B76DB; Tue, 29 Jan 2013 08:20:54 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id DD910F10; Tue, 29 Jan 2013 08:20:52 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id KAA10855; Tue, 29 Jan 2013 10:20:50 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1U06Qz-000M5B-TM; Tue, 29 Jan 2013 10:20:49 +0200 Message-ID: <51078660.8000004@FreeBSD.org> Date: Tue, 29 Jan 2013 10:20:48 +0200 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130121 Thunderbird/17.0.2 MIME-Version: 1.0 To: Garrett Wollman Subject: Re: ZFS deadlock on rrl->rr_ -- look familiar to anyone? References: <20743.16429.97668.569869@hergotha.csail.mit.edu> In-Reply-To: <20743.16429.97668.569869@hergotha.csail.mit.edu> X-Enigmail-Version: 1.4.6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 08:20:54 -0000 on 29/01/2013 05:21 Garrett Wollman said the following: > When > I restarted mountd, it hung waiting on rrl->rr_, but the system may > already have been deadlocked at that point. procstat reported: > > 87678 104365 mountd - mi_switch sleepq_wait _cv_wait rrw_enter zfs_root lookup namei vfs_donmount sys_nmount amd64_syscall Xfast_syscall ... 
> If it happens again procstat -kk -a -- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 10:51:50 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 734F87B8 for ; Tue, 29 Jan 2013 10:51:50 +0000 (UTC) (envelope-from nowakpl@platinum.linux.pl) Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4]) by mx1.freebsd.org (Postfix) with ESMTP id 1D428F25 for ; Tue, 29 Jan 2013 10:51:49 +0000 (UTC) Received: by platinum.linux.pl (Postfix, from userid 87) id E089147E16; Tue, 29 Jan 2013 11:51:41 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl X-Spam-Level: X-Spam-Status: No, score=-1.4 required=3.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.3.2 Received: from [10.255.0.2] (unknown [83.151.38.73]) by platinum.linux.pl (Postfix) with ESMTPA id 4297647E0F; Tue, 29 Jan 2013 11:51:38 +0100 (CET) Message-ID: <5107A9B7.5030803@platinum.linux.pl> Date: Tue, 29 Jan 2013 11:51:35 +0100 From: Adam Nowacki User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 MIME-Version: 1.0 To: Matthew Ahrens Subject: Re: RAID-Z wasted space - asize roundups to nparity +1 References: <5105252D.6060502@platinum.linux.pl> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 10:51:50 -0000 On 2013-01-28 22:55, Matthew Ahrens wrote: > This is so that we won't end up with small, unallocatable segments. > E.g. if you are using RAIDZ2, the smallest usable segment would be 3 > sectors (1 sector data + 2 sectors parity). If we left a 1 or 2 sector > free segment, it would be unusable and you'd be able to get into strange > accounting situations where you have free space but can't write because > you're "out of space". Sounds reasonable. > The amount of waste due to this can be minimized by using larger > blocksizes (e.g. the default recordsize of 128k and files larger than > 128k), and by using smaller sector sizes (e.g. 512b sector disks rather > than 4k sector disks). In your case these techniques would limit the > waste to 0.6%. This brings another issue - recordsize capped at 128KiB. We are using the pool for off-line storage of large files (from 50MB to 20GB). Files are stored and read sequentially as a whole. With 12 disks in RAID-Z2, 4KiB sectors, 128KiB record size and the padding above 9.4% of disk space goes completely unused - one whole disk. Increasing recordsize cap seems trivial enough. On-disk structures and kernel code support it already - a single of code had to be changed (#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes. This of course breaks compatibility with any other system without this modification. With Suns cooperation this could be handled in safe and compatible manner via pool version upgrade. Recordsize of 128KiB would remain the default but anyone could increase it with zfs set. Pool appears to work just fine with 15TB copied so far from another pool. Wasted disk space drops down to 0.7%. Sequential read speed increased from ~400MB/s to ~600MB/s. Writes stay about the same at ~300MB/s. So far however I was not able to boot from that pool. 
gptzfsboot required a heap size increase and appears to work. zfsloader crashes and I've become lost in the code. I've also identified another problem with ZFS wasting disk space. When compression is off allocations are always a multiple of record size. With the default recordsize of 128KiB a 129KiB file would use 256KiB of disk space (+ parity and other inefficiencies mentioned above). This may be there to help with fragmentation but then it would be good to have a setting to turn it off - even if by means of a no-op compression that would count zeroes backwards and return short psize. > > --matt > > On Sun, Jan 27, 2013 at 5:01 AM, Adam Nowacki > wrote: > > I've just found something very weird in the ZFS code. > > sys/cddl/contrib/opensolaris/__uts/common/fs/zfs/vdev_raidz.__c:504 > in HEAD > > Can someone explain the reason behind this line of code? What it > does is align on-disk record size to a multiple of number of parity > disks + 1 ... this really doesn't make any sense. So far as I can > tell those extra sectors are just padding - completely unused. > > For the array I'm using this results in 4.8% of wasted disk space - > 1.7TB. It's a 12x 3TB disk RAID-Z2. > _________________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/__mailman/listinfo/freebsd-fs > > To unsubscribe, send any mail to > "freebsd-fs-unsubscribe@__freebsd.org > " > > From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 10:58:06 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id E4422AC2 for ; Tue, 29 Jan 2013 10:58:06 +0000 (UTC) (envelope-from prvs=1741a054e2=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 87E82F8C for ; Tue, 29 Jan 2013 10:58:06 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001906002.msg for ; Tue, 29 Jan 2013 10:57:58 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Tue, 29 Jan 2013 10:57:58 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1741a054e2=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk X-MDaemon-Deliver-To: fs@freebsd.org Message-ID: <2FD375DC62B24754B8945BF0A1E26B78@multiplay.co.uk> From: "Steven Hartland" To: "Adam Nowacki" , "Matthew Ahrens" References: <5105252D.6060502@platinum.linux.pl> <5107A9B7.5030803@platinum.linux.pl> Subject: Re: RAID-Z wasted space - asize roundups to nparity +1 Date: Tue, 29 Jan 2013 10:58:40 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=response Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 10:58:07 -0000 ----- Original Message ----- From: "Adam Nowacki" > On 2013-01-28 22:55, Matthew Ahrens wrote: >> This is so that we won't end up with small, unallocatable segments. >> E.g. if you are using RAIDZ2, the smallest usable segment would be 3 >> sectors (1 sector data + 2 sectors parity). 
If we left a 1 or 2 sector >> free segment, it would be unusable and you'd be able to get into strange >> accounting situations where you have free space but can't write because >> you're "out of space". > > Sounds reasonable. > >> The amount of waste due to this can be minimized by using larger >> blocksizes (e.g. the default recordsize of 128k and files larger than >> 128k), and by using smaller sector sizes (e.g. 512b sector disks rather >> than 4k sector disks). In your case these techniques would limit the >> waste to 0.6%. > > This brings another issue - recordsize capped at 128KiB. We are using > the pool for off-line storage of large files (from 50MB to 20GB). Files > are stored and read sequentially as a whole. With 12 disks in RAID-Z2, > 4KiB sectors, 128KiB record size and the padding above 9.4% of disk > space goes completely unused - one whole disk. This is something thats being worked on upstream, its not as trivial as it first looks unfortuantely. Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 11:06:29 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A9419D4A for ; Tue, 29 Jan 2013 11:06:29 +0000 (UTC) (envelope-from olivier@gid0.org) Received: from mail-ea0-f170.google.com (mail-ea0-f170.google.com [209.85.215.170]) by mx1.freebsd.org (Postfix) with ESMTP id 485BF68 for ; Tue, 29 Jan 2013 11:06:28 +0000 (UTC) Received: by mail-ea0-f170.google.com with SMTP id a11so133773eaa.15 for ; Tue, 29 Jan 2013 03:06:27 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type:x-gm-message-state; bh=lJY/gvBy7AioQdkXF6YAjmup0aVH8VnDEEeCuAKC6PQ=; b=Y81RnZeVOUTSflQDeHTHE3fxUJxvKmuzvHxFjVO9d1DHSR23Gt3/FAz3aIvbjib9Rp UxZ4e8ZYDFDKmuoKa2eMP/3xD1nDmgUl/khRBr3UtMJRsNopFqxtOHpEy3oeR9O83JTk vYL07KI9psaGG15Y9V4lzNXYlPW4JEL5fm+VAQRs2hsvYmgs9EgB7Kp3bT6g93572Mcj ueVMuS3hdON9bqlauT4c/jKNYkVrrohGCY0UgLx9/8M6pwUmNXp/fgOv28ymtoj8CeP8 CcZM/uPmL24qAIr+oZdzL0OGUwZgwSpzrrJtIxecVajP8iH/QFO7RS32TiIszIHgN/E8 eGDQ== MIME-Version: 1.0 X-Received: by 10.14.220.1 with SMTP id n1mr2369333eep.16.1359457587713; Tue, 29 Jan 2013 03:06:27 -0800 (PST) Received: by 10.14.189.5 with HTTP; Tue, 29 Jan 2013 03:06:27 -0800 (PST) In-Reply-To: <5107A9B7.5030803@platinum.linux.pl> References: <5105252D.6060502@platinum.linux.pl> <5107A9B7.5030803@platinum.linux.pl> Date: Tue, 29 Jan 2013 12:06:27 +0100 Message-ID: Subject: Re: RAID-Z wasted space - asize roundups to nparity +1 From: Olivier Smedts To: Adam Nowacki Content-Type: text/plain; charset=ISO-8859-1 X-Gm-Message-State: ALoCoQmgNjtoHUGCvXPsiAl/x+UqiYPo50O4eIWShX8ViX1E9EDjjdqcqRvXRpwDBeJXJEdWRleI Cc: Matthew Ahrens , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 
2013 11:06:29 -0000 2013/1/29 Adam Nowacki : > This brings another issue - recordsize capped at 128KiB. We are using the > pool for off-line storage of large files (from 50MB to 20GB). Files are > stored and read sequentially as a whole. With 12 disks in RAID-Z2, 4KiB > sectors, 128KiB record size and the padding above 9.4% of disk space goes > completely unused - one whole disk. > > Increasing recordsize cap seems trivial enough. On-disk structures and > kernel code support it already - a single of code had to be changed (#define > SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes. This of > course breaks compatibility with any other system without this modification. > With Suns cooperation this could be handled in safe and compatible manner > via pool version upgrade. Recordsize of 128KiB would remain the default but > anyone could increase it with zfs set. One MB blocksize is already implemented by Oracle with zpool version 32. -- Olivier Smedts _ ASCII ribbon campaign ( ) e-mail: olivier@gid0.org - against HTML email & vCards X www: http://www.gid0.org - against proprietary attachments / \ "Il y a seulement 10 sortes de gens dans le monde : ceux qui comprennent le binaire, et ceux qui ne le comprennent pas." From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 11:18:54 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 95D167BD for ; Tue, 29 Jan 2013 11:18:54 +0000 (UTC) (envelope-from prvs=1741a054e2=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 381FB208 for ; Tue, 29 Jan 2013 11:18:53 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001906339.msg for ; Tue, 29 Jan 2013 11:18:53 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Tue, 29 Jan 2013 11:18:53 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1741a054e2=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk X-MDaemon-Deliver-To: fs@freebsd.org Message-ID: <32655B893F594E9BB0CBDD88C186E27E@multiplay.co.uk> From: "Steven Hartland" To: "Olivier Smedts" , "Adam Nowacki" References: <5105252D.6060502@platinum.linux.pl> <5107A9B7.5030803@platinum.linux.pl> Subject: Re: RAID-Z wasted space - asize roundups to nparity +1 Date: Tue, 29 Jan 2013 11:19:31 -0000 MIME-Version: 1.0 X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: Matthew Ahrens , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 11:18:54 -0000 ----- Original Message -----=20 From: "Olivier Smedts" > 2013/1/29 Adam Nowacki : >> This brings another issue - recordsize capped at 128KiB. We are using the >> pool for off-line storage of large files (from 50MB to 20GB). Files are >> stored and read sequentially as a whole. With 12 disks in RAID-Z2, 4KiB >> sectors, 128KiB record size and the padding above 9.4% of disk space goes >> completely unused - one whole disk. 
>> >> Increasing recordsize cap seems trivial enough. On-disk structures and >> kernel code support it already - a single of code had to be changed (#define >> SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes. This of >> course breaks compatibility with any other system without this modification. >> With Suns cooperation this could be handled in safe and compatible manner >> via pool version upgrade. Recordsize of 128KiB would remain the default but >> anyone could increase it with zfs set. >=20 > One MB blocksize is already implemented by Oracle with zpool version 32. Oracle is not the upstream, since they went closed source, illumos is our new upstream. It you want to follow the discussion see the thread titled "128K max blocksize in zfs" on developer@lists.illumos.org. Regards Steve =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it.=20 In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 14:58:23 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id A350E5D2; Tue, 29 Jan 2013 14:58:23 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from smtprelay05.ispgateway.de (smtprelay05.ispgateway.de [80.67.31.98]) by mx1.freebsd.org (Postfix) with ESMTP id 3746586; Tue, 29 Jan 2013 14:58:23 +0000 (UTC) Received: from [78.35.166.2] (helo=fabiankeil.de) by smtprelay05.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128) (Exim 4.68) (envelope-from ) id 1U0Cdb-0002ju-CX; Tue, 29 Jan 2013 15:58:15 +0100 Date: Tue, 29 Jan 2013 15:52:50 +0100 From: Fabian Keil To: Dan Nelson Subject: Re: Zpool surgery Message-ID: <20130129155250.29d8f764@fabiankeil.de> In-Reply-To: <20130128214111.GA14888@dan.emsphone.com> References: <20130127103612.GB38645@acme.spoerlein.net> <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> <20130127201140.GD29105@server.rulingia.com> <20130128085820.GR35868@acme.spoerlein.net> <20130128205802.1ffab53e@fabiankeil.de> <20130128214111.GA14888@dan.emsphone.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/WXg2ahZC0rmXbVAQa_iy9g/"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 Cc: current@freebsd.org, fs@freebsd.org, Ulrich =?UTF-8?B?U3DDtnJsZWlu?= X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 14:58:23 -0000 --Sig_/WXg2ahZC0rmXbVAQa_iy9g/ Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Dan Nelson wrote: > In the last episode (Jan 28), Fabian Keil said: > > Ulrich Sp=C3=B6rlein wrote: > > > On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote: > > > > On 2013-Jan-27 14:31:56 -0000, Steven Hartland wrote: > > > > >----- Original Message -----=20 > > > > >From: "Ulrich Sp=C3=B6rlein" > > > > >> I want to transplant my old zpool tank from a 1TB drive to a new > > > > >> 2TB drive, but *not* use dd(1) or any 
other cloning mechanism, as > > > > >> the pool was very full very often and is surely severely > > > > >> fragmented. > > > > > > > > > >Cant you just drop the disk in the original machine, set it as a > > > > >mirror then once the mirror process has completed break the mirror > > > > >and remove the 1TB disk. > > > >=20 > > > > That will replicate any fragmentation as well. "zfs send | zfs rec= v" > > > > is the only (current) way to defragment a ZFS pool. > >=20 > > It's not obvious to me why "zpool replace" (or doing it manually) > > would replicate the fragmentation. >=20 > "zpool replace" essentially adds your new disk as a mirror to the parent > vdev, then deletes the original disk when the resilver is done. Since > mirrors are block-identical copies of each other, the new disk will conta= in > an exact copy of the original disk, followed by 1TB of freespace. Thanks for the explanation. I was under the impression that zfs mirrors worked at a higher level than traditional mirrors like gmirror but there seems to be indeed less magic than I expected. Fabian --Sig_/WXg2ahZC0rmXbVAQa_iy9g/ Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlEH4kgACgkQBYqIVf93VJ1Z4ACgsP2gJkFDDqwImnab1rnKF5Xu gc8AoJuwpBMZrXVyX8ZSboeS6co0PHOk =8PGU -----END PGP SIGNATURE----- --Sig_/WXg2ahZC0rmXbVAQa_iy9g/-- From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 15:12:10 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id CF585933; Tue, 29 Jan 2013 15:12:10 +0000 (UTC) (envelope-from gergely.czuczy@harmless.hu) Received: from marvin.harmless.hu (marvin.harmless.hu [195.56.55.204]) by mx1.freebsd.org (Postfix) with ESMTP id 6D4B012D; Tue, 29 Jan 2013 15:12:10 +0000 (UTC) Received: from gprs4f7a62e4.pool.t-umts.hu ([79.122.98.228] helo=unknown) by marvin.harmless.hu with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.75 (FreeBSD)) (envelope-from ) id 1U0CUZ-000HPm-On; Tue, 29 Jan 2013 15:48:55 +0100 Date: Tue, 29 Jan 2013 15:48:52 +0100 From: Gergely CZUCZY To: Nicolas Rachinsky Subject: Re: slowdown of zfs (tx->tx) Message-ID: <20130129154852.000021f1@unknown> In-Reply-To: <20130117093259.GA83951@mid.pc5.i.0x5.de> References: <20130114195148.GA20540@mid.pc5.i.0x5.de> <20130114214652.GA76779@mid.pc5.i.0x5.de> <20130115224556.GA41774@mid.pc5.i.0x5.de> <50F67551.5020704@FreeBSD.org> <20130116095009.GA36867@mid.pc5.i.0x5.de> <50F69788.2040506@FreeBSD.org> <20130117093259.GA83951@mid.pc5.i.0x5.de> Organization: Harmless Digital X-Mailer: Claws Mail 3.7.6 (GTK+ 2.16.0; i586-pc-mingw32msvc) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: freebsd-fs , Andriy Gapon X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 15:12:10 -0000 Hello, Once think we've noticed on our systems, might be unrelated, but still. After heavy usage of dedup, our ZFS pools tended to slow down drastically. The solution was to deallocate and reallocate dedup-enabled filesystems (copying or send/recieving data back and forth). Just an idea, might be unrelated in your case. 
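[A minimal sketch of the "deallocate and reallocate" step described above, assuming a hypothetical dataset tank/data and enough free space to hold a second copy while the old one is destroyed (the dataset names and snapshot label are placeholders):

zfs snapshot tank/data@rewrite
zfs send tank/data@rewrite | zfs receive tank/data.new   # receive rewrites every block afresh
zfs destroy -r tank/data                                 # drop the old, DDT-heavy copy
zfs rename tank/data.new tank/data

Depending on the goal, dedup could also be switched off on the new dataset before copying, so the rewritten blocks no longer go through the DDT at all.]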
Best regards, Gergely On Thu, 17 Jan 2013 10:32:59 +0100 Nicolas Rachinsky wrote: > * Andriy Gapon [2013-01-16 14:05 +0200]: > > on 16/01/2013 12:14 Steven Hartland said the following: > > > You only have ~11% free so yer it is pretty full ;-) > > > > just in case, Steve is not kidding. > > > > Those free hundreds of gigabytes could be spread over the terabytes > > and could be quite fragmented if the pool has a history of adding > > and removing lots of files. ZFS could be spending quite a lot of > > time in that case when it looks for some free space and tries to > > minimize further fragmentation. > > > > Empirical/anecdotal safe limit on pool utilization is said to be > > about 70-80%. > > > > You can test if this guess is true by doing the following: > > kgdb -w > > (kgdb) set metaslab_min_alloc_size=4096 > > > > If performance noticeably improves after that, then this is your > > problem indeed. > > I tried this, but I didn't notice any difference in performance. > > Next I'll try the update Artem suggested. > > Thanks > > Nicolas From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 15:45:12 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 0EC92475 for ; Tue, 29 Jan 2013 15:45:12 +0000 (UTC) (envelope-from romain@blogreen.org) Received: from marvin.blogreen.org (unknown [IPv6:2001:470:1f12:b9c::2]) by mx1.freebsd.org (Postfix) with ESMTP id B822231C for ; Tue, 29 Jan 2013 15:45:11 +0000 (UTC) Received: by marvin.blogreen.org (Postfix, from userid 1001) id 57AE31B1C1; Tue, 29 Jan 2013 16:45:07 +0100 (CET) Date: Tue, 29 Jan 2013 16:45:07 +0100 From: Romain =?iso-8859-1?Q?Tarti=E8re?= To: freebsd-fs@freebsd.org Subject: Re: ZFS deduplication Message-ID: <20130129154507.GA53833@blogreen.org> References: <20130123143728.GA84218@blogreen.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="pWyiEgJYm5f9v55/" Content-Disposition: inline In-Reply-To: <20130123143728.GA84218@blogreen.org> X-PGP-Key: http://romain.blogreen.org/pubkey.asc User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 15:45:12 -0000 --pWyiEgJYm5f9v55/ Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Jan 23, 2013 at 03:37:29PM +0100, Romain Tarti=E8re wrote: > However, `zpool list` reports an inconsistent deduplication value (it > used to be ~1.4 AFAICR): >=20 > > zpool list data > > NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT > > data 460G 101G 359G 21% 1386985.39x ONLINE - Looks like this was some kind of corruption: the system crashed (it has crashed a few times over the last few months but I could not get details because of a failing serial port on the machine and the X server was running) and was then unable to reboot: just after importing the zpool, the kernel panicked in ddt_phys_decref() trying to dereference a NULL pointer (because of the serial port I don't have a text backtrace, however I took a few shots just in case). I replaced the disks of the pool with new ones, reinstalled FreeBSD and restored from backup. 
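[For anyone who sees a similarly implausible DEDUP figure, the dedup table can be inspected directly before concluding the pool is damaged -- a sketch, assuming the pool is still importable and named "data" as above; zdb output on a live, busy pool may be inconsistent:

zdb -D data     # summary of DDT entries on disk / in core and the computed dedup ratio
zdb -DD data    # adds a histogram of DDT entries by reference count]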
I keep the old disks untouched in case some FreeBSD developer involved in ZFS is interested about this corruption and needs a real-life corrupted filesystem for analysis. Please let me know by private mail. Thanks, Romain --=20 Romain Tarti=E8re http://people.FreeBSD.org/~romain/ pgp: 8234 9A78 E7C0 B807 0B59 80FF BA4D 1D95 5112 336F (ID: 0x5112336F) (plain text =3Dnon-HTML=3D PGP/GPG encrypted/signed e-mail much appreciated) --pWyiEgJYm5f9v55/ Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQGcBAEBAgAGBQJRB+6DAAoJELpNHZVREjNvuiwMAI9c94X5rN00GgKkPbPdF9KX WlOv67UEXrdN6GfsjKXEnC9BTPOCGjs3ZCw3esllQqyJEoUL+nzZVyxU9IrLdVPA v9AHX523u8d1SqpsGZKnhCd+JWWMuOa6CXK5GgoVtSajxZXurt1CSpXnyRw6yxQZ PDM8oPXWMbtT+mx+AJseZKyAF2TSDDkYCoKVx1NaaZFAWwVZw4cFpBZHcvGrzvHh SqLsiYnSAJ5watETrowrNZrpOK75EKOVDaPCpTesvY+Yhzu2gKMeOIlyNqYodssT j9ezEKjYNYFkFMXxvUS9QP0BtIOUq/O4rd42Bu6HwzX8WCoe9Dyj1iIr8X+/A65o DP4zVSIW6u3y4haEd0ZmhZNUid4S1vsBixiYSKc8B69uqkzmDgYrruCoTFOKzlFB h5BmB1JpGa2tml8Kq18+3+KEmEAgyY+Qy6AF4rdXvpfizUibATKg1Lvh94uZSq+O 2cZZRvLu4nfcT7AuA+/BvFdns9Cwz1ampM4RHGEfdQ== =7M8k -----END PGP SIGNATURE----- --pWyiEgJYm5f9v55/-- From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 16:44:53 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 85C4D8C3; Tue, 29 Jan 2013 16:44:53 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-bk0-f53.google.com (mail-bk0-f53.google.com [209.85.214.53]) by mx1.freebsd.org (Postfix) with ESMTP id EAA1D888; Tue, 29 Jan 2013 16:44:52 +0000 (UTC) Received: by mail-bk0-f53.google.com with SMTP id j10so394834bkw.26 for ; Tue, 29 Jan 2013 08:44:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=jiTPKyMmtazJoleQdasofoHQVMUTWO2oBZAxEyP661o=; b=SPFkVH2aw57Y0QTUA0U9yVcGAYvdXWg3WNlmq9dWJdkxm1U2Fto9KDeS7F9+Vz95+m 2feVB5zUVRJGJw62WCSqz3f5FzezkaWHBxZIGnVBdSAPHKa5I2Ob0f34LHaqEUpjHiQ8 WQBEEARUnBwjYah7aT+qomXFywfVVZHQIrdItOapLYGI7tK3IsqN1d7Eaiq3Tgb66CR+ cUrlZmMFfaLOkU+wuoiPpB+jQ5IZlljcNWp2CmVGp1DfZQLOjR3CBbM0Ku9vbK1VT9WX obL/Vbex4EFw8dIsdJmn0689TiX8MCoOGO5Naou19RVng/z9oA/dzTtYba3HgUCuF/JF kNyA== X-Received: by 10.204.150.134 with SMTP id y6mr112305bkv.15.1359477891771; Tue, 29 Jan 2013 08:44:51 -0800 (PST) Received: from mavbook.mavhome.dp.ua ([91.198.175.1]) by mx.google.com with ESMTPS id gy3sm4878528bkc.16.2013.01.29.08.44.49 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 29 Jan 2013 08:44:50 -0800 (PST) Sender: Alexander Motin Message-ID: <5107FC7E.8070108@FreeBSD.org> Date: Tue, 29 Jan 2013 18:44:46 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130125 Thunderbird/17.0.2 MIME-Version: 1.0 To: Jeremy Chadwick Subject: Re: disk "flipped" - a known problem? 
References: <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> <20130125211232.GA3037@icarus.home.lan> In-Reply-To: <20130125211232.GA3037@icarus.home.lan> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 16:44:53 -0000 On 25.01.2013 23:12, Jeremy Chadwick wrote: > Now about cam_periph_alloc -- I wanted to provide proof that I have seen > this message before / proving Andriy isn't crazy. :-) This is from > when I was messing about with this bad disk the day I received it: > > Jan 18 19:54:57 icarus kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0 > Jan 18 19:54:57 icarus kernel: ada5: ATA-7 SATA 1.x device > Jan 18 19:54:57 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes) > Jan 18 19:54:57 icarus kernel: ada5: Command Queueing enabled > Jan 18 19:54:57 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C) > Jan 18 19:54:57 icarus kernel: ada5: Previously was known as ad14 > Jan 18 19:54:57 icarus kernel: cam_periph_alloc: attempt to re-allocate valid device pass5 rejected flags 0x18 refcount 1 > Jan 18 19:54:57 icarus kernel: passasync: Unable to attach new device due to status 0x6: CCB request was invalid > Jan 18 19:54:57 icarus kernel: GEOM_RAID: NVIDIA-6: Array NVIDIA-6 created. > Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Force array start due to timeout. > Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Disk ada5 state changed from NONE to ACTIVE. > Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Subdisk RAID 0+1 279.47G:3-ada5 state changed from NONE to REBUILD. > Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Array started. > Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Volume RAID 0+1 279.47G state changed from STARTING to BROKEN. > Jan 18 19:55:39 icarus kernel: GEOM_RAID: NVIDIA-6: Volume RAID 0+1 279.47G state changed from BROKEN to STOPPED. > Jan 18 19:55:49 icarus kernel: GEOM_RAID: NVIDIA-6: Array NVIDIA-6 destroyed. > > So why didn't I see this message today? On January 20th I rebuild > world/kernel after removing GEOM_RAID from my kernel config. The reason > I removed GEOM_RAID is that, as you can see, that bad disk** was > previously in a system (not my own) with an nVidia SATA chipset with > their RAID option ROM enabled (my system is Intel, hence "array timeout" > since there's no nVidia option ROM, I believe). Array timeout means that within defined timeout GEOM RAID failed to detect all of array components. GEOM RAID doesn't depend on option ROM presence to access the data. Array was finally marked as BROKEN because it was one disk of RAID0+1's four. > I got sick and tired of having to "fight" with the kernel. The last two > messages were a result of me doing "graid stop ada5". And of course "dd > if=/dev/zero of=/dev/ada5 bs=64k" will cause GEOM to re-taste, causing > the RAID metadata to get re-read, "NVIDIA-7" created, rinse lather > repeat. But there's already a thread on this: > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016292.html > > Just easier for me to remove the option, that's all. I would personally prefer to erase unwanted stale metadata with `graid delete NVIDIA-7`. 
It erases only one sector where metadata stored and doesn't corrupt any other data on the disk. -- Alexander Motin From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 17:05:34 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 69B6F260 for ; Tue, 29 Jan 2013 17:05:34 +0000 (UTC) (envelope-from fjwcash@gmail.com) Received: from mail-qe0-f41.google.com (mail-qe0-f41.google.com [209.85.128.41]) by mx1.freebsd.org (Postfix) with ESMTP id 3016B974 for ; Tue, 29 Jan 2013 17:05:33 +0000 (UTC) Received: by mail-qe0-f41.google.com with SMTP id 7so285570qeb.14 for ; Tue, 29 Jan 2013 09:05:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=piWKtDuiAP8LtdFRC0cHlCzkmupVT9dc7P0zadttVnk=; b=fhArPGdkLTZ5zix6lZB72ZATc7Bo77DRCRMJSpFK4R8RdQyIxp32Z7QdsyZbCTJSu0 vBdCJ5axcDXvY0pyiGSdzm4ZkRjLWg5e4IHF4UsJ6XI6YY/81rx//aH1QJ7oIal5p6QX o33JJ3WyYQixBL3fLsmlhxvNBAVXCbE0spQr7z/Xjmd+M6SWA1UArTTc2GDG1AXlz0DP gQESB+EDHUrRAX2tjDngl33rJpviK9T5SwKWzX60tQs74H++QFqUxytwnTr2W/A47bPz CGeofD5BsUE2nbiFb6rTK0Mg9Q6PffU/B9LY94k2YnMEXcOm0aOQ5sCAPQ0mH9nRWZL1 iELw== MIME-Version: 1.0 X-Received: by 10.224.177.10 with SMTP id bg10mr1846578qab.78.1359479133312; Tue, 29 Jan 2013 09:05:33 -0800 (PST) Received: by 10.49.106.233 with HTTP; Tue, 29 Jan 2013 09:05:33 -0800 (PST) Received: by 10.49.106.233 with HTTP; Tue, 29 Jan 2013 09:05:33 -0800 (PST) In-Reply-To: <5107A9B7.5030803@platinum.linux.pl> References: <5105252D.6060502@platinum.linux.pl> <5107A9B7.5030803@platinum.linux.pl> Date: Tue, 29 Jan 2013 09:05:33 -0800 Message-ID: Subject: Re: RAID-Z wasted space - asize roundups to nparity +1 From: Freddie Cash To: Adam Nowacki Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: Matthew Ahrens , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 17:05:34 -0000 On Jan 29, 2013 2:52 AM, "Adam Nowacki" wrote: > > On 2013-01-28 22:55, Matthew Ahrens wrote: >> >> This is so that we won't end up with small, unallocatable segments. >> E.g. if you are using RAIDZ2, the smallest usable segment would be 3 >> sectors (1 sector data + 2 sectors parity). If we left a 1 or 2 sector >> free segment, it would be unusable and you'd be able to get into strange >> accounting situations where you have free space but can't write because >> you're "out of space". > > > Sounds reasonable. > > >> The amount of waste due to this can be minimized by using larger >> blocksizes (e.g. the default recordsize of 128k and files larger than >> 128k), and by using smaller sector sizes (e.g. 512b sector disks rather >> than 4k sector disks). In your case these techniques would limit the >> waste to 0.6%. > > > This brings another issue - recordsize capped at 128KiB. We are using the pool for off-line storage of large files (from 50MB to 20GB). Files are stored and read sequentially as a whole. With 12 disks in RAID-Z2, 4KiB sectors, 128KiB record size and the padding above 9.4% of disk space goes completely unused - one whole disk. > > Increasing recordsize cap seems trivial enough. 
On-disk structures and kernel code support it already - a single of code had to be changed (#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes. This of course breaks compatibility with any other system without this modification. With Suns cooperation this could be handled in safe and compatible manner via pool version upgrade. Recordsize of 128KiB would remain the default but anyone could increase it with zfs set. There's work upstream (Illumos, I believe, maybe Delphix?) to add support for recordings above 128 KB. It'll be added ad a feature flag, so only compatible with open-source ZFS. From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 18:14:40 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 0EDB7287 for ; Tue, 29 Jan 2013 18:14:40 +0000 (UTC) (envelope-from matthew.ahrens@delphix.com) Received: from mail-la0-x235.google.com (mail-la0-x235.google.com [IPv6:2a00:1450:4010:c03::235]) by mx1.freebsd.org (Postfix) with ESMTP id 89A16E49 for ; Tue, 29 Jan 2013 18:14:39 +0000 (UTC) Received: by mail-la0-f53.google.com with SMTP id fr10so533902lab.12 for ; Tue, 29 Jan 2013 10:14:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=delphix.com; s=google; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=ysQRgwYXD0w0xg518IqGyFdrPwrB5w6+A+jKT8g4WyU=; b=NLTFL0iOEbRABik60N5vh/eqmaXioQQzFDgovUPfrUkpF6MBvSSQR+fGFk4xkVZiyd ML94sFJF1cA9do8TZevBfgyFzxu5+y8bs0rOs3ryHh2pfFGLsYK4ur0L55NlPLKPlgXq 0SEmbMlsn69KMB70sX/5UrYxAB/AhidwROc88= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type:x-gm-message-state; bh=ysQRgwYXD0w0xg518IqGyFdrPwrB5w6+A+jKT8g4WyU=; b=QDqMyIzY9q6V0wt/9I6qFCOBIjvmqdtFvuhzvNsdzQaNjzvHGjEeQA9PRMIDoJ1kWW iQ8ya2vp/R8YcD4zh7Pw40DPcst/JuVlwKNuWbneQAiJfgb8eussFKrQjvSSxT4y9qBM I7oktKVg/HOty3UU36oEz6fjgKFpv1FzpZR5rFaVkz0SDPDnhCwzQQlQzD9AUxIhumIq J1gtKSSNWguqPSt9/429maLNSnW3hxqc1kvYeKt+YzPgp/Ld/cU4wto6p6x30BPDs+oO 0bxI5bxzbIXF3lQh701KPQNooo2125KGgN2b1edzU9e9utSK7WkN4dTi719ssN0dWZok 4IOg== MIME-Version: 1.0 X-Received: by 10.152.144.202 with SMTP id so10mr1976797lab.9.1359483278471; Tue, 29 Jan 2013 10:14:38 -0800 (PST) Received: by 10.114.68.109 with HTTP; Tue, 29 Jan 2013 10:14:38 -0800 (PST) In-Reply-To: <5107A9B7.5030803@platinum.linux.pl> References: <5105252D.6060502@platinum.linux.pl> <5107A9B7.5030803@platinum.linux.pl> Date: Tue, 29 Jan 2013 10:14:38 -0800 Message-ID: Subject: Re: RAID-Z wasted space - asize roundups to nparity +1 From: Matthew Ahrens To: Adam Nowacki X-Gm-Message-State: ALoCoQn3bsZdbLdhB6W7iNDz3krWVpHcLic/HP8dTdlpiNiyeSTVvA/l5kIN0m4ZOMkX7RUU/6ft Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 18:14:40 -0000 On Tue, Jan 29, 2013 at 2:51 AM, Adam Nowacki wrote: > I've also identified another problem with ZFS wasting disk space. When > compression is off allocations are always a multiple of record size. With > the default recordsize of 128KiB a 129KiB file would use 256KiB of disk > space (+ parity and other inefficiencies mentioned above). 
This may be > there to help with fragmentation but then it would be good to have a > setting to turn it off - even if by means of a no-op compression that would > count zeroes backwards and return short psize. > The most straightforward way to do this would be, as you alluded, to always compress the last block of the file, even if no compression has been selected. For maximum speed, we could use the already-implemented zle (zero-length encoding) algorithm. --matt From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 19:00:02 2013 Return-Path: Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 0F8B9D99 for ; Tue, 29 Jan 2013 19:00:02 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id E70046D for ; Tue, 29 Jan 2013 19:00:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0TJ01IU093310 for ; Tue, 29 Jan 2013 19:00:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0TJ01Vt093309; Tue, 29 Jan 2013 19:00:01 GMT (envelope-from gnats) Date: Tue, 29 Jan 2013 19:00:01 GMT Message-Id: <201301291900.r0TJ01Vt093309@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org Cc: From: Jeremy Chadwick Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: Jeremy Chadwick List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 19:00:02 -0000 The following reply was made to PR kern/169480; it has been noted by GNATS. From: Jeremy Chadwick To: Harry Coin Cc: bug-followup@FreeBSD.org, levent.serinol@mynet.com Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O Date: Tue, 29 Jan 2013 10:50:28 -0800 Re 1,2: that transfer speed (183MBytes/second) sounds much better/much more accurate for what's going on. The speed-limiting factors were certainly a small blocksize (512 bytes) used by dd, and using /dev/random rather than /dev/zero. I realise you're probably expecting to see something like 480MBytes/second (4 drives * 120MB/sec), but that's probably not going to happen on that model of system and with that CPU. For example, on my Q9550 system described earlier, I can get about this: $ dd if=/dev/zero of=testfile bs=64k ^C27148+0 records in 27147+0 records out 1779105792 bytes transferred in 6.935566 secs (256519186 bytes/sec) While "gstat -I500ms" shows each disk going between 60MBytes/sec and 140MBytes/sec. "zpool iostat -v data 1" shows between 120-220MBytes/sec at the pool level, and showing around 65-110MBytes/sec on a per-disk level. Anyway, point being, things are faster with a large bs and from a source that doesn't churn interrupts. But don't necessarily "pull a Linux" and start doing things like bs=1m -- as I said before, Linux dd is different, because the I/O is cached (without --direct), while on FreeBSD dd is always direct. Re 3: That sounds a bit on the slow side. I would expect those disks, at least during writes, to do more. If **all** the drives show this behaviour consistently in gstat, then you know the issue IS NOT with an individual disk, and is instead the issue lies elsewhere. That rules out one piece of the puzzle, and that's good. 
Re 5: Did you mean to type 14MBytes/second, not 14mbits/second? If so, yes, I would agree that's slow. Scrubbing is not necessarily a good way to "benchmark" disks, but I understand for "benchmarking" ZFS it's the best you've got to some degree. Regarding dd'ing and 512 bytes -- as I described to you in my previous mail: > This speed will be "bursty" and "sporadic" due to the how ZFS ARC > works. The interval at which "things are flushed to disk" is based on > the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should > default to 5 (5 seconds). This is where your "4 secs or so" magic value comes from. Please do not change this sysctl/value; keep it at 5. Finally, your vmstat -i output shows something of concern, UNLESS you did this WHILE you had the dd (doesn't matter what block size) going, and are using /dev/random or /dev/urandom (same thing on FreeBSD): > irq20: hpet0 620136 328 > irq259: ahci1 849746 450 These interrupt rates are quite high. hpet0 refers to your event timer/clock timer (see kern.eventtimer.choice and kern.eventtimer.timer) being HPET, and ahci1 refers to your Intel ICH7 AHCI controller. Basically what's happening here is that you're generating a ton of interrupts doing dd if=/dev/urandom bs=512. And it makes perfect sense to me why: because /dev/urandom has to harvest entropy from interrupt sources (please see random(4) man page), and you're generating a lot of interrupts to your AHCI controller for each individual 512-byte write. When you say "move a video from one dataset to another", please explain what it is you're moving from and to. Specifically: what filesystems, and output from "zfs list". If you're moving a file from a ZFS filesystem to another ZFS filesystem on the same pool, then please state that. That may help kernel folks figure out where your issue lies. At this stage, a kernel developer is going to need to step in and try to help you figure out where the actual bottleneck is occurring. This is going to be very difficult/complex/very likely not possible with you using nas4free, because you will almost certainly be asked to rebuild world/kernel to include some new options and possibly asked to include DTrace/CTF support (for real-time debugging). The situation is tricky. It would really help if you would/could remove nas4free from the picture and instead just run stock FreeBSD, because as I said, if there are some kind of kernel tunings or adjustment values the nas4free folks put in place that stock FreeBSD doesn't, those could be harming you. I can't be of more help here, I'm sorry to say. The good news is that your disks sound fine. Kernel developers will need to take this up. P.S. -- I would strongly recommend updating your nas4free forum post with a link to this conversation in this PR. IMO, the nas4free people need to step up and take responsibility (and that almost certainly means talking/working with the FreeBSD folks). -- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977. 
PGP 4BD6C0CB | From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 23:20:19 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id A87AFCC5 for ; Tue, 29 Jan 2013 23:20:19 +0000 (UTC) (envelope-from toasty@dragondata.com) Received: from mail-ia0-x232.google.com (mail-ia0-x232.google.com [IPv6:2607:f8b0:4001:c02::232]) by mx1.freebsd.org (Postfix) with ESMTP id 79612EB9 for ; Tue, 29 Jan 2013 23:20:19 +0000 (UTC) Received: by mail-ia0-f178.google.com with SMTP id y26so1457887iab.9 for ; Tue, 29 Jan 2013 15:20:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dragondata.com; s=google; h=x-received:from:content-type:content-transfer-encoding:subject :message-id:date:to:mime-version:x-mailer; bh=mV634ayqjMBOx7SILnuvSR+59ucUPOAyAQgnk69w53U=; b=cvKZdBK8TuaL52Hpw3gkErDkhUMxpGbMyTCsWlwT0D6Y6nu4/sbHZBbS8sae0X7hHE AEJXfjMNsW+TQlu1LKOxKt/h32HDXEey6Q/eWXQkf89+XamYEzmGn0Q8nlccuuEtlc2Y NkR0rU8aR545WcEn2GJ3U8wcLxPViDhggoQUk= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:from:content-type:content-transfer-encoding:subject :message-id:date:to:mime-version:x-mailer:x-gm-message-state; bh=mV634ayqjMBOx7SILnuvSR+59ucUPOAyAQgnk69w53U=; b=TJv8YdYIq32YMay5tCTDPsnDGW+yb/+QWvEl0fNw7AG/6gWZEW1I1tx6xW/5fe0ePE PKRss0mdOP5mBFnN4xHMec7uNwpH/Y8E3J7kuy5r1jh8qQtS/+FGZocH4KKwC5xFPihN 4utlKCSglWZoQmmK/UVrBxcEkVYvuU/v6NCpMFuHIYUXQSfGhMd8Clfn2BItYH0jrEp6 d9MlfLHT/9UT3gqR5pI24a6BCt67iM03Id9XYEUxyt7oBUJU5Qb8yZOJ67zfcbdm5c1/ z/JVuGY5hots0/fUmcM5EBiesKRfqNKXM3L/CgaE5Pe5AKLOzBCgJP3wNdRqQd1QgieM cklw== X-Received: by 10.42.30.132 with SMTP id v4mr1808396icc.34.1359501619173; Tue, 29 Jan 2013 15:20:19 -0800 (PST) Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132]) by mx.google.com with ESMTPS id vq4sm2912997igb.10.2013.01.29.15.20.17 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 29 Jan 2013 15:20:18 -0800 (PST) From: Kevin Day Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Subject: Improving ZFS performance for large directories Message-Id: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> Date: Tue, 29 Jan 2013 17:20:15 -0600 To: FreeBSD Filesystems Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) X-Mailer: Apple Mail (2.1499) X-Gm-Message-State: ALoCoQlKSPeuKIj2/xZpRMC2K963YVXcKPluY/0NMQ+f5nOEn8076UECfrirKNoNH0eFdz2/QQWg X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 23:20:19 -0000 I'm trying to improve performance when using ZFS in large (>60000 files) = directories. A common activity is to use "getdirentries" to enumerate = all the files in the directory, then "lstat" on each one to get = information about it. Doing an "ls -l" in a large directory like this = can take 10-30 seconds to complete. 
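[One cheap check of whether those seconds are going to metadata that keeps falling out of the ARC is to watch the metadata counters around a single listing -- a sketch, assuming the stock kstat.zfs.misc.arcstats sysctls are available; the path is just a placeholder:

sysctl kstat.zfs.misc.arcstats.arc_meta_used kstat.zfs.misc.arcstats.arc_meta_limit
sysctl kstat.zfs.misc.arcstats.demand_metadata_misses    # note the value
ls -l /path/to/large/directory > /dev/null
sysctl kstat.zfs.misc.arcstats.demand_metadata_misses    # a jump of roughly the entry count means the dnodes were not cached]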
Trying to figure out why, I did: ktrace ls -l /path/to/large/directory kdump -R |sort -rn |more to see what sys calls were taking the most time, I ended up with: 69247 ls 0.190729 STRU struct stat {dev=3D846475008, = ino=3D46220085, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333196714, stime=3D1201004393, = ctime=3D1333196714.547566024, birthtime=3D1333196714.547566024, = size=3D30784, blksize=3D31232, blocks=3D62, flags=3D0x0 } 69247 ls 0.180121 STRU struct stat {dev=3D846475008, = ino=3D46233417, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333197088, stime=3D1209814737, = ctime=3D1333197088.913571042, birthtime=3D1333197088.913571042, = size=3D3162220, blksize=3D131072, blocks=3D6409, flags=3D0x0 } 69247 ls 0.152370 RET getdirentries 4088/0xff8 69247 ls 0.139939 CALL stat(0x800d8f598,0x7fffffffcca0) 69247 ls 0.130411 RET __acl_get_link 0 69247 ls 0.121602 RET __acl_get_link 0 69247 ls 0.105799 RET getdirentries 4064/0xfe0 69247 ls 0.105069 RET getdirentries 4068/0xfe4 69247 ls 0.096862 RET getdirentries 4028/0xfbc 69247 ls 0.085012 RET getdirentries 4088/0xff8 69247 ls 0.082722 STRU struct stat {dev=3D846475008, = ino=3D72941319, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1348686155, stime=3D1348347621, = ctime=3D1348686155.768875422, birthtime=3D1348686155.768875422, = size=3D6686225, blksize=3D131072, blocks=3D13325, flags=3D0x0 } 69247 ls 0.070318 STRU struct stat {dev=3D846475008, = ino=3D46211679, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333196475, stime=3D1240230314, = ctime=3D1333196475.038567672, birthtime=3D1333196475.038567672, = size=3D829895, blksize=3D131072, blocks=3D1797, flags=3D0x0 } 69247 ls 0.068060 RET getdirentries 4048/0xfd0 69247 ls 0.065118 RET getdirentries 4088/0xff8 69247 ls 0.062536 RET getdirentries 4096/0x1000 69247 ls 0.061118 RET getdirentries 4020/0xfb4 69247 ls 0.055038 STRU struct stat {dev=3D846475008, = ino=3D46220358, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333196720, stime=3D1274282669, = ctime=3D1333196720.972567345, birthtime=3D1333196720.972567345, = size=3D382344, blksize=3D131072, blocks=3D773, flags=3D0x0 } 69247 ls 0.054948 STRU struct stat {dev=3D846475008, = ino=3D75025952, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1351071350, stime=3D1349726805, = ctime=3D1351071350.800873870, birthtime=3D1351071350.800873870, = size=3D2575559, blksize=3D131072, blocks=3D5127, flags=3D0x0 } 69247 ls 0.054828 STRU struct stat {dev=3D846475008, = ino=3D65021883, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1335730367, stime=3D1332843230, = ctime=3D1335730367.541567371, birthtime=3D1335730367.541567371, = size=3D226347, blksize=3D131072, blocks=3D517, flags=3D0x0 } 69247 ls 0.053743 STRU struct stat {dev=3D846475008, = ino=3D46222016, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333196765, stime=3D1257110706, = ctime=3D1333196765.206574132, birthtime=3D1333196765.206574132, = size=3D62112, blksize=3D62464, blocks=3D123, flags=3D0x0 } 69247 ls 0.052015 RET getdirentries 4060/0xfdc 69247 ls 0.051388 RET getdirentries 4068/0xfe4 69247 ls 0.049875 RET getdirentries 4088/0xff8 69247 ls 0.049156 RET getdirentries 4032/0xfc0 69247 ls 0.048609 RET getdirentries 4040/0xfc8 69247 ls 0.048279 RET getdirentries 4032/0xfc0 69247 ls 0.048062 RET getdirentries 4064/0xfe0 69247 ls 0.047577 RET 
getdirentries 4076/0xfec (snip) the STRU are returns from calling lstat(). It looks like both getdirentries and lstat are taking quite a while to = return. The shortest return for any lstat() call is 0.000004 seconds, = the maximum is 0.190729 and the average is around 0.0004. Just from = lstat() alone, that makes "ls" take over 20 seconds. I'm prepared to try an L2arc cache device (with = secondarycache=3Dmetadata), but I'm having trouble determining how big = of a device I'd need. We've got >30M inodes now on this filesystem, = including some files with extremely long names. Is there some way to = determine the amount of metadata on a ZFS filesystem? From owner-freebsd-fs@FreeBSD.ORG Tue Jan 29 23:42:30 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 5A31C248 for ; Tue, 29 Jan 2013 23:42:30 +0000 (UTC) (envelope-from matthew.ahrens@delphix.com) Received: from mail-lb0-f179.google.com (mail-lb0-f179.google.com [209.85.217.179]) by mx1.freebsd.org (Postfix) with ESMTP id AF4BFF98 for ; Tue, 29 Jan 2013 23:42:29 +0000 (UTC) Received: by mail-lb0-f179.google.com with SMTP id j14so1406798lbo.24 for ; Tue, 29 Jan 2013 15:42:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=delphix.com; s=google; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=4dOb06WmFFmGv//BJ/aIa1Nknh5ufuQM5mi+Ga3VnSE=; b=VyQiUBfj5DJ0MCTh5ethc/8eZZuncYteMgWu0H8X7PCl78Y7pzMfYyXwZ+l7O40Vse iZgMM7fbnRR08NnOmm5vS9wpHgJcZ9CwOvTLTTSluKwfCiOF7mcuCz6txlsO9Y0FKTXf bCRm4ogQG+8dhgn9AMjeYySEXJEdpgLPIIn1E= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type:x-gm-message-state; bh=4dOb06WmFFmGv//BJ/aIa1Nknh5ufuQM5mi+Ga3VnSE=; b=PnT4pYXsrcWQeDvcvIszTQ9ElSCtOau6utyEFYLaFrXm7LkVhy9u961J+LHl/LDtJE MS6qTL6x4DtbstAAJkawT1Zf6wv6gfQxAc7yhFRW9TpU9EPgDA0jboYaLlYL1lbsdGfK oG8GNZuSS6eg+sxn2kXm1V9Jqbel5X7YQ5J0aah8mD89ml5oDjAvR9QFjxf5kL05RpCj EfY4DgJw4DWxOvfrUyx91eSe4c+qz0VRhN7wZjznIaVznyU6/YRGtOeIg0vEQfx836z6 QpJ/S43a1EYgBgh2OnCuHEJUlCfhsRndM9IdPIn1zXfiO2cIt9GXtwahJHNWVUhaLIyW DNGg== MIME-Version: 1.0 X-Received: by 10.152.144.202 with SMTP id so10mr2721142lab.9.1359502948362; Tue, 29 Jan 2013 15:42:28 -0800 (PST) Received: by 10.114.68.109 with HTTP; Tue, 29 Jan 2013 15:42:28 -0800 (PST) In-Reply-To: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> Date: Tue, 29 Jan 2013 15:42:28 -0800 Message-ID: Subject: Re: Improving ZFS performance for large directories From: Matthew Ahrens To: Kevin Day X-Gm-Message-State: ALoCoQmoJx68nM8xURko1bIRspmB/I2cz6arNabSxA3deWbDWGQe5KWlDHVA5r1lc+FVZokLVdpR Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jan 2013 23:42:30 -0000 On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day wrote: > I'm prepared to try an L2arc cache device (with secondarycache=metadata), You might first see how long it takes when everything is cached. E.g. by doing this in the same directory several times. 
This will give you a lower bound on the time it will take (or put another way, an upper bound on the improvement available from a cache device). > but I'm having trouble determining how big of a device I'd need. We've got > >30M inodes now on this filesystem, including some files with extremely > long names. Is there some way to determine the amount of metadata on a ZFS > filesystem? For a specific filesystem, nothing comes to mind, but I'm sure you could cobble something together with zdb. There are several tools to determine the amount of metadata in a ZFS storage pool: - "zdb -bbb " but this is unreliable on pools that are in use - "zpool scrub ; ; echo '::walk spa|::zfs_blkstats' | mdb -k" the scrub is slow, but this can be mitigated by setting the global variable zfs_no_scrub_io to 1. If you don't have mdb or equivalent debugging tools on freebsd, you can manually look at ->spa_dsl_pool->dp_blkstats. In either case, the "LSIZE" is the size that's required for caching (in memory or on a l2arc cache device). At a minimum you will need 512 bytes for each file, to cache the dnode_phys_t. --matt From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 00:06:06 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 16348650 for ; Wed, 30 Jan 2013 00:06:06 +0000 (UTC) (envelope-from toasty@dragondata.com) Received: from mail-ia0-x22e.google.com (mail-ia0-x22e.google.com [IPv6:2607:f8b0:4001:c02::22e]) by mx1.freebsd.org (Postfix) with ESMTP id D7221FE for ; Wed, 30 Jan 2013 00:06:05 +0000 (UTC) Received: by mail-ia0-f174.google.com with SMTP id o25so1453847iad.5 for ; Tue, 29 Jan 2013 16:06:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dragondata.com; s=google; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:message-id:references:to:x-mailer; bh=8sboX/ToX1azSjD1R2MMSsZH/dhn5fvGFOkTgIZv4jY=; b=jbE8aMdNgBQK7S8V2vlZbNCt1iQ9dHQRYxk4BiBv0kiz/0scoxWzMdVHrAummJl5Ts mUcdXnNm1dVmUmNbFLRiCCQrrVTZ/4FTsfc1/EkykJrdDpblGlfAYY9lVwqOqwKd4vf9 TyDIq7fK7n+sNx+mc3O7SQ7v/wYuFNpDpL/uY= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:message-id:references:to:x-mailer:x-gm-message-state; bh=8sboX/ToX1azSjD1R2MMSsZH/dhn5fvGFOkTgIZv4jY=; b=G+laN224gQKTIm2dkhUxiJbxhFZ7zQe13nP/iOThgaF2Zonr1LlnEL+9/rcAXYEa0/ gvatwHT0Nj0NWGsr5eJboVTpK17WRCkpCI+nwpykit98Iee503M4r4msS6z1cTrmWKhq yFiOnrWM83YJvYmq5ab0qpMTAr70jIVCdpLTVyPVumcW63GpGVnbLVlj6eEZx6DhEHXs IZ9cslDVDqNGJRgkgws2BAQ7fRDEU3k5ElvhfL3NCDUrRopBvUBMQbb14QGr/kQbZfc7 GNdFVtov9te0XIyNe8Lspzalo0rknuVxYdw1W3IwfJi2jVaRymyWYfESAc7RPEv46k71 5lSA== X-Received: by 10.42.11.203 with SMTP id v11mr1911977icv.28.1359504365528; Tue, 29 Jan 2013 16:06:05 -0800 (PST) Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. 
[204.9.51.132]) by mx.google.com with ESMTPS id uj6sm3844598igb.4.2013.01.29.16.06.03 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 29 Jan 2013 16:06:04 -0800 (PST) Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: Improving ZFS performance for large directories From: Kevin Day In-Reply-To: Date: Tue, 29 Jan 2013 18:06:01 -0600 Message-Id: References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> To: Matthew Ahrens X-Mailer: Apple Mail (2.1499) X-Gm-Message-State: ALoCoQn+7di3aQMcA75dIfGldt7pYAFItZYBgEiliyBw3tVFZFyVAdaNgRG692kEa9iKr5URNyr/ Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 00:06:06 -0000 On Jan 29, 2013, at 5:42 PM, Matthew Ahrens wrote: > On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day = wrote: > I'm prepared to try an L2arc cache device (with = secondarycache=3Dmetadata), >=20 > You might first see how long it takes when everything is cached. E.g. = by doing this in the same directory several times. This will give you a = lower bound on the time it will take (or put another way, an upper bound = on the improvement available from a cache device). > =20 Doing it twice back-to-back makes a bit of difference but it's still = slow either way. After not touching this directory for about 30 minutes: # time ls -l >/dev/null 0.773u 2.665s 0:18.21 18.8% 35+2749k 3012+0io 0pf+0w Immediately again: # time ls -l > /dev/null 0.665u 1.077s 0:08.60 20.1% 35+2719k 556+0io 0pf+0w 18.2 vs 8.6 seconds is an improvement, but even the 8.6 seconds is = longer than what I was expecting. >=20 > For a specific filesystem, nothing comes to mind, but I'm sure you = could cobble something together with zdb. There are several tools to = determine the amount of metadata in a ZFS storage pool: >=20 > - "zdb -bbb " > but this is unreliable on pools that are in use I tried this and it consumed >16GB of memory after about 5 minutes so I = had to kill it. I'll try it again during our next maintenance window = where it can be the only thing running. > - "zpool scrub ; ; echo '::walk = spa|::zfs_blkstats' | mdb -k" > the scrub is slow, but this can be mitigated by setting the global = variable zfs_no_scrub_io to 1. If you don't have mdb or equivalent = debugging tools on freebsd, you can manually look at = ->spa_dsl_pool->dp_blkstats. >=20 > In either case, the "LSIZE" is the size that's required for caching = (in memory or on a l2arc cache device). At a minimum you will need 512 = bytes for each file, to cache the dnode_phys_t. Okay, thanks a bunch. I'll try this on the next chance I get too. I think some of the issue is that nothing is being allowed to stay = cached long. We have several parallel rsyncs running at once that are = basically scanning every directory as fast as they can, combined with a = bunch of rsync, http and ftp clients. I'm guessing with all that = activity things are getting shoved out pretty quickly. 
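[To put a rough floor under the L2ARC sizing question while the zdb/mdb numbers are pending -- a back-of-the-envelope sketch only, using the 512-byte-per-dnode minimum mentioned above and the ~30M inode count; directories, indirect blocks and spill blocks for the very long file names all come on top of this:

echo "30000000 * 512 / 1024 / 1024" | bc      # ~14648 MiB, i.e. roughly 15GiB for the dnodes alone
zpool add tank cache da0                      # 'tank' and 'da0' are placeholders for the real pool and SSD
zfs set secondarycache=metadata tank

So even a modest SSD in the tens-of-GB range would leave headroom for the metadata-only caching plan described above.]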
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 01:28:58 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id CA8297CD for ; Wed, 30 Jan 2013 01:28:58 +0000 (UTC) (envelope-from prvs=1742e413e9=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 679793FF for ; Wed, 30 Jan 2013 01:28:58 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001920725.msg for ; Wed, 30 Jan 2013 01:28:56 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Wed, 30 Jan 2013 01:28:56 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1742e413e9=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk X-MDaemon-Deliver-To: freebsd-fs@freebsd.org Message-ID: <9792709BF58143EFBDAABE638F769775@multiplay.co.uk> From: "Steven Hartland" To: "Kevin Day" , "Matthew Ahrens" References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> Subject: Re: Improving ZFS performance for large directories Date: Wed, 30 Jan 2013 01:29:35 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 01:28:58 -0000 ----- Original Message ----- From: "Kevin Day" > I think some of the issue is that nothing is being allowed to stay cached long. We have several parallel rsyncs running at once > that are basically scanning every directory as fast as they can, combined with a bunch of rsync, http and ftp clients. I'm > guessing with all that activity things are getting shoved out pretty quickly. zfs send / recv a possible replacements for the rsyncs? Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. 
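[In practice that suggestion would amount to periodic snapshot replication roughly like this -- a sketch only, with invented pool, dataset, host and snapshot names, and assuming ZFS on both ends, which the follow-up explains is not the case here:

zfs snapshot -r tank/mirror@2013-01-30
zfs send -R -i 2013-01-29 tank/mirror@2013-01-30 | ssh replica zfs receive -F -d tank]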
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 02:24:58 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id E10623DF for ; Wed, 30 Jan 2013 02:24:58 +0000 (UTC) (envelope-from toasty@dragondata.com) Received: from mail-ie0-x235.google.com (mail-ie0-x235.google.com [IPv6:2607:f8b0:4001:c03::235]) by mx1.freebsd.org (Postfix) with ESMTP id 737257C3 for ; Wed, 30 Jan 2013 02:24:58 +0000 (UTC) Received: by mail-ie0-f181.google.com with SMTP id 17so919412iea.12 for ; Tue, 29 Jan 2013 18:24:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dragondata.com; s=google; h=x-received:subject:mime-version:content-type:from:x-priority :in-reply-to:date:cc:content-transfer-encoding:message-id:references :to:x-mailer; bh=8Zzcq+JZK8DfDrybaLFcueH81JvX4Y2StmRgnRHs5C4=; b=YG6yuEdOTGr0886IDC2CRkM79p8JwMoNcX5wHp2qNdipOapVWvmCxaaSTXcQPbweIu 62CmrPOFr+riuPrmEA3btybecCpdWuHqLiF4U4BFD6krE+aHmAkpQZQOoHkuTpRKS1HK i3gnJxaBgNGagFa/uAuuuGIwB91GKzydBSJhE= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:subject:mime-version:content-type:from:x-priority :in-reply-to:date:cc:content-transfer-encoding:message-id:references :to:x-mailer:x-gm-message-state; bh=8Zzcq+JZK8DfDrybaLFcueH81JvX4Y2StmRgnRHs5C4=; b=EuCe7QedueA8qTP/b5u1UencayIhMgGavcLPECP6/3IY8azBUf9cpF9oUwwsYH2ump OWRJzQ1YdX7ufpY/kCjGiaQAyaxTBTvg5go6UyzeoFMXfB/WnbRJ0u6cyCXth84x85AD 6PqN8/LTwlh+x7fs+WyvxaMczHNMP3gysZPHcXRbnNSaon3vemaQcp0hI40DbVzxWqve zW/fHI10q8MrQdY3hCgrknbQoEUcxwZv8DaYCOGjFkJc7WR2Iy64YPHn6DINbUOAWhLj /+WgqPmI332EWEaN5TXWNBpQDnOomJnZi6m9/GNpiqCQD1vKAFHtajKR/pMKAlEnuhds PJmA== X-Received: by 10.42.27.74 with SMTP id i10mr2064818icc.47.1359512698139; Tue, 29 Jan 2013 18:24:58 -0800 (PST) Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132]) by mx.google.com with ESMTPS id bg10sm3322632igc.6.2013.01.29.18.24.55 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 29 Jan 2013 18:24:57 -0800 (PST) Subject: Re: Improving ZFS performance for large directories Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Content-Type: text/plain; charset=iso-8859-1 From: Kevin Day X-Priority: 3 In-Reply-To: <9792709BF58143EFBDAABE638F769775@multiplay.co.uk> Date: Tue, 29 Jan 2013 20:24:54 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> <9792709BF58143EFBDAABE638F769775@multiplay.co.uk> To: "Steven Hartland" X-Mailer: Apple Mail (2.1499) X-Gm-Message-State: ALoCoQmoEPhlFq2Ko0MEwkfXrVoWjeJSF3zKexdL1H5iWwXSQ6a8mS6EPsPU9xyPK4jSzM1dVNW8 Cc: FreeBSD Filesystems , Matthew Ahrens X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 02:24:58 -0000 On Jan 29, 2013, at 7:29 PM, "Steven Hartland" = wrote: >=20 > ----- Original Message ----- From: "Kevin Day" >=20 >> I think some of the issue is that nothing is being allowed to stay = cached long. We have several parallel rsyncs running at once that are = basically scanning every directory as fast as they can, combined with a = bunch of rsync, http and ftp clients. I'm guessing with all that = activity things are getting shoved out pretty quickly. >=20 > zfs send / recv a possible replacements for the rsyncs? Unfortunately not. 
We're pulling these files from a host that we do not = control, and isn't running ZFS. We're also serving these files up via a = public rsync daemon, and the vast majority of the clients receiving = files from it are not running ZFS either. Total data size is about 125TB now, growing to ~300TB in the near = future. It's just a ton of data that really isn't being stored in the = best manner for this kind of system, but we don't control the layout. -- Kevin From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 09:43:34 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id CAA2F5F1; Wed, 30 Jan 2013 09:43:34 +0000 (UTC) (envelope-from uqs@FreeBSD.org) Received: from acme.spoerlein.net (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by mx1.freebsd.org (Postfix) with ESMTP id 3A1EAAC8; Wed, 30 Jan 2013 09:43:34 +0000 (UTC) Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by acme.spoerlein.net (8.14.6/8.14.6) with ESMTP id r0U9hQMG090453 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Wed, 30 Jan 2013 10:43:26 +0100 (CET) (envelope-from uqs@FreeBSD.org) Date: Wed, 30 Jan 2013 10:43:26 +0100 From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= To: Fabian Keil Subject: Re: Zpool surgery Message-ID: <20130130094326.GT35868@acme.spoerlein.net> Mail-Followup-To: Fabian Keil , Dan Nelson , Peter Jeremy , current@freebsd.org, fs@freebsd.org References: <20130127103612.GB38645@acme.spoerlein.net> <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk> <20130127201140.GD29105@server.rulingia.com> <20130128085820.GR35868@acme.spoerlein.net> <20130128205802.1ffab53e@fabiankeil.de> <20130128214111.GA14888@dan.emsphone.com> <20130129155250.29d8f764@fabiankeil.de> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20130129155250.29d8f764@fabiankeil.de> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Dan Nelson , current@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 09:43:34 -0000 On Tue, 2013-01-29 at 15:52:50 +0100, Fabian Keil wrote: > Dan Nelson wrote: > > > In the last episode (Jan 28), Fabian Keil said: > > > Ulrich Spörlein wrote: > > > > On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote: > > > > > On 2013-Jan-27 14:31:56 -0000, Steven Hartland wrote: > > > > > >----- Original Message ----- > > > > > >From: "Ulrich Spörlein" > > > > > >> I want to transplant my old zpool tank from a 1TB drive to a new > > > > > >> 2TB drive, but *not* use dd(1) or any other cloning mechanism, as > > > > > >> the pool was very full very often and is surely severely > > > > > >> fragmented. > > > > > > > > > > > >Cant you just drop the disk in the original machine, set it as a > > > > > >mirror then once the mirror process has completed break the mirror > > > > > >and remove the 1TB disk. > > > > > > > > > > That will replicate any fragmentation as well. "zfs send | zfs recv" > > > > > is the only (current) way to defragment a ZFS pool. > > > > > > It's not obvious to me why "zpool replace" (or doing it manually) > > > would replicate the fragmentation. > > > > "zpool replace" essentially adds your new disk as a mirror to the parent > > vdev, then deletes the original disk when the resilver is done. 
Since > > mirrors are block-identical copies of each other, the new disk will contain > > an exact copy of the original disk, followed by 1TB of freespace. > > Thanks for the explanation. > > I was under the impression that zfs mirrors worked at a higher > level than traditional mirrors like gmirror but there seems to > be indeed less magic than I expected. > > Fabian To wrap this up, while the zpool replace worked for the disk, I played around with it some more, and using snapshots instead *did* work the second time. I'm not sure what I did wrong the first time ... So basically this: # zfs send -R oldtank@2013-01-22 | zfs recv -F -d newtank (takes ages, then do a final snapshot before unmounting and send the incremental) # zfs send -R -i 2013-01-22 oldtank@2013-01-29 | zfs recv -F -d newtank Allows me to send snapshots up to 2013-01-29 to the "archive" pool from either oldtank or newtank. Yay! Cheers, Uli From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 10:20:05 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 9BC97CB6 for ; Wed, 30 Jan 2013 10:20:05 +0000 (UTC) (envelope-from ronald-freebsd8@klop.yi.org) Received: from smarthost1.greenhost.nl (smarthost1.greenhost.nl [195.190.28.78]) by mx1.freebsd.org (Postfix) with ESMTP id 3739CD31 for ; Wed, 30 Jan 2013 10:20:04 +0000 (UTC) Received: from smtp.greenhost.nl ([213.108.104.138]) by smarthost1.greenhost.nl with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.69) (envelope-from ) id 1U0Ulu-0001bp-Hx for freebsd-fs@freebsd.org; Wed, 30 Jan 2013 11:20:03 +0100 Received: from [81.21.138.17] (helo=ronaldradial.versatec.local) by smtp.greenhost.nl with esmtpsa (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from ) id 1U0Ulu-0007jg-Ew for freebsd-fs@freebsd.org; Wed, 30 Jan 2013 11:20:02 +0100 Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes To: freebsd-fs@freebsd.org Subject: Re: Improving ZFS performance for large directories References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> Date: Wed, 30 Jan 2013 11:20:02 +0100 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit From: "Ronald Klop" Message-ID: In-Reply-To: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> User-Agent: Opera Mail/12.13 (Win32) X-Virus-Scanned: by clamav at smarthost1.samage.net X-Spam-Level: / X-Spam-Score: 0.8 X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.1 X-Scan-Signature: a9e4b997d6a751f3e45cb47a3c2b1d2c X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 10:20:05 -0000 On Wed, 30 Jan 2013 00:20:15 +0100, Kevin Day wrote: > > I'm trying to improve performance when using ZFS in large (>60000 files) > directories. A common activity is to use "getdirentries" to enumerate > all the files in the directory, then "lstat" on each one to get > information about it. Doing an "ls -l" in a large directory like this > can take 10-30 seconds to complete. Trying to figure out why, I did: > > ktrace ls -l /path/to/large/directory > kdump -R |sort -rn |more Does ls -lf /pat/to/large/directory make a difference. It makes ls not to sort the directory so it can use a more efficient way of traversing the directory. Ronald. 
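[To see how much of the wall-clock time the sort itself accounts for versus the per-entry lstat calls, the unsorted variants can be timed side by side -- same placeholder path as in the original report; on FreeBSD -f disables sorting and implies -a:

/usr/bin/time ls -f  /path/to/large/directory > /dev/null    # directory read only, should avoid per-entry stat
/usr/bin/time ls -lf /path/to/large/directory > /dev/null    # adds one lstat per entry, no sort
/usr/bin/time ls -l  /path/to/large/directory > /dev/null    # the original, sorted case]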
> > to see what sys calls were taking the most time, I ended up with: > > 69247 ls 0.190729 STRU struct stat {dev=846475008, ino=46220085, > mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, > atime=1333196714, stime=1201004393, ctime=1333196714.547566024, > birthtime=1333196714.547566024, size=30784, blksize=31232, blocks=62, > flags=0x0 } > 69247 ls 0.180121 STRU struct stat {dev=846475008, ino=46233417, > mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, > atime=1333197088, stime=1209814737, ctime=1333197088.913571042, > birthtime=1333197088.913571042, size=3162220, blksize=131072, > blocks=6409, flags=0x0 } > 69247 ls 0.152370 RET getdirentries 4088/0xff8 > 69247 ls 0.139939 CALL stat(0x800d8f598,0x7fffffffcca0) > 69247 ls 0.130411 RET __acl_get_link 0 > 69247 ls 0.121602 RET __acl_get_link 0 > 69247 ls 0.105799 RET getdirentries 4064/0xfe0 > 69247 ls 0.105069 RET getdirentries 4068/0xfe4 > 69247 ls 0.096862 RET getdirentries 4028/0xfbc > 69247 ls 0.085012 RET getdirentries 4088/0xff8 > 69247 ls 0.082722 STRU struct stat {dev=846475008, ino=72941319, > mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, > atime=1348686155, stime=1348347621, ctime=1348686155.768875422, > birthtime=1348686155.768875422, size=6686225, blksize=131072, > blocks=13325, flags=0x0 } > 69247 ls 0.070318 STRU struct stat {dev=846475008, ino=46211679, > mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, > atime=1333196475, stime=1240230314, ctime=1333196475.038567672, > birthtime=1333196475.038567672, size=829895, blksize=131072, > blocks=1797, flags=0x0 } > 69247 ls 0.068060 RET getdirentries 4048/0xfd0 > 69247 ls 0.065118 RET getdirentries 4088/0xff8 > 69247 ls 0.062536 RET getdirentries 4096/0x1000 > 69247 ls 0.061118 RET getdirentries 4020/0xfb4 > 69247 ls 0.055038 STRU struct stat {dev=846475008, ino=46220358, > mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, > atime=1333196720, stime=1274282669, ctime=1333196720.972567345, > birthtime=1333196720.972567345, size=382344, blksize=131072, blocks=773, > flags=0x0 } > 69247 ls 0.054948 STRU struct stat {dev=846475008, ino=75025952, > mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, > atime=1351071350, stime=1349726805, ctime=1351071350.800873870, > birthtime=1351071350.800873870, size=2575559, blksize=131072, > blocks=5127, flags=0x0 } > 69247 ls 0.054828 STRU struct stat {dev=846475008, ino=65021883, > mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, > atime=1335730367, stime=1332843230, ctime=1335730367.541567371, > birthtime=1335730367.541567371, size=226347, blksize=131072, blocks=517, > flags=0x0 } > 69247 ls 0.053743 STRU struct stat {dev=846475008, ino=46222016, > mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, > atime=1333196765, stime=1257110706, ctime=1333196765.206574132, > birthtime=1333196765.206574132, size=62112, blksize=62464, blocks=123, > flags=0x0 } > 69247 ls 0.052015 RET getdirentries 4060/0xfdc > 69247 ls 0.051388 RET getdirentries 4068/0xfe4 > 69247 ls 0.049875 RET getdirentries 4088/0xff8 > 69247 ls 0.049156 RET getdirentries 4032/0xfc0 > 69247 ls 0.048609 RET getdirentries 4040/0xfc8 > 69247 ls 0.048279 RET getdirentries 4032/0xfc0 > 69247 ls 0.048062 RET getdirentries 4064/0xfe0 > 69247 ls 0.047577 RET getdirentries 4076/0xfec > (snip) > > the STRU are returns from calling lstat(). > > It looks like both getdirentries and lstat are taking quite a while to > return. 
The shortest return for any lstat() call is 0.000004 seconds, > the maximum is 0.190729 and the average is around 0.0004. Just from > lstat() alone, that makes "ls" take over 20 seconds. > > I'm prepared to try an L2arc cache device (with > secondarycache=metadata), but I'm having trouble determining how big of > a device I'd need. We've got >30M inodes now on this filesystem, > including some files with extremely long names. Is there some way to > determine the amount of metadata on a ZFS filesystem? > > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 10:36:42 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4900D351 for ; Wed, 30 Jan 2013 10:36:42 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-bk0-f50.google.com (mail-bk0-f50.google.com [209.85.214.50]) by mx1.freebsd.org (Postfix) with ESMTP id BEC36DEE for ; Wed, 30 Jan 2013 10:36:41 +0000 (UTC) Received: by mail-bk0-f50.google.com with SMTP id jg9so742690bkc.37 for ; Wed, 30 Jan 2013 02:36:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:subject:mime-version:content-type:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer; bh=XwHHaBNRncPTtC58MSyZcTj2pLTLfUFOxLhKoPVoOyk=; b=fiJn30P9vkLKlpzRL2/tToD0BPwYJcAb0HbHvf9eQO4E3IgA/svj5KZcqzDixPUdit sozWw/Iz/IGkeLTBps/SR2CQQXVqZactSDQhte7thzxuu7S9+uIFE8m+5VylGm7xEQJf 0jKamcJ6GQGm+dINt/LMFvqpzMZcNZxO8xgA6Qal2OyAZ9fp0xem6G/yHcIX4ueoeuPq 9VZzcqCwbXj7q2gKafOIDdZanufwrbidtW+MwCp5KUGRZ/49AN5A7msKqjLkb/P+q6Q0 kzQj0KIa0O8i3YNHEc9imX2CxIZ+pFGZOWsFwmda5hzQ+AQODLjQeipFHR3XN9ojlGSJ Cj6A== X-Received: by 10.204.12.206 with SMTP id y14mr1081502bky.132.1359542194602; Wed, 30 Jan 2013 02:36:34 -0800 (PST) Received: from ndenevsa.sf.moneybookers.net (g1.moneybookers.com. [217.18.249.148]) by mx.google.com with ESMTPS id z5sm383371bkv.11.2013.01.30.02.36.33 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 30 Jan 2013 02:36:34 -0800 (PST) Subject: Re: Improving ZFS performance for large directories Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> Date: Wed, 30 Jan 2013 12:36:35 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com> References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> To: Kevin Day X-Mailer: Apple Mail (2.1499) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 10:36:42 -0000 On Jan 30, 2013, at 1:20 AM, Kevin Day wrote: >=20 > I'm trying to improve performance when using ZFS in large (>60000 = files) directories. A common activity is to use "getdirentries" to = enumerate all the files in the directory, then "lstat" on each one to = get information about it. Doing an "ls -l" in a large directory like = this can take 10-30 seconds to complete. 
Trying to figure out why, I = did: >=20 > ktrace ls -l /path/to/large/directory > kdump -R |sort -rn |more >=20 > to see what sys calls were taking the most time, I ended up with: >=20 > 69247 ls 0.190729 STRU struct stat {dev=3D846475008, = ino=3D46220085, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333196714, stime=3D1201004393, = ctime=3D1333196714.547566024, birthtime=3D1333196714.547566024, = size=3D30784, blksize=3D31232, blocks=3D62, flags=3D0x0 } > 69247 ls 0.180121 STRU struct stat {dev=3D846475008, = ino=3D46233417, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333197088, stime=3D1209814737, = ctime=3D1333197088.913571042, birthtime=3D1333197088.913571042, = size=3D3162220, blksize=3D131072, blocks=3D6409, flags=3D0x0 } > 69247 ls 0.152370 RET getdirentries 4088/0xff8 > 69247 ls 0.139939 CALL stat(0x800d8f598,0x7fffffffcca0) > 69247 ls 0.130411 RET __acl_get_link 0 > 69247 ls 0.121602 RET __acl_get_link 0 > 69247 ls 0.105799 RET getdirentries 4064/0xfe0 > 69247 ls 0.105069 RET getdirentries 4068/0xfe4 > 69247 ls 0.096862 RET getdirentries 4028/0xfbc > 69247 ls 0.085012 RET getdirentries 4088/0xff8 > 69247 ls 0.082722 STRU struct stat {dev=3D846475008, = ino=3D72941319, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1348686155, stime=3D1348347621, = ctime=3D1348686155.768875422, birthtime=3D1348686155.768875422, = size=3D6686225, blksize=3D131072, blocks=3D13325, flags=3D0x0 } > 69247 ls 0.070318 STRU struct stat {dev=3D846475008, = ino=3D46211679, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333196475, stime=3D1240230314, = ctime=3D1333196475.038567672, birthtime=3D1333196475.038567672, = size=3D829895, blksize=3D131072, blocks=3D1797, flags=3D0x0 } > 69247 ls 0.068060 RET getdirentries 4048/0xfd0 > 69247 ls 0.065118 RET getdirentries 4088/0xff8 > 69247 ls 0.062536 RET getdirentries 4096/0x1000 > 69247 ls 0.061118 RET getdirentries 4020/0xfb4 > 69247 ls 0.055038 STRU struct stat {dev=3D846475008, = ino=3D46220358, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333196720, stime=3D1274282669, = ctime=3D1333196720.972567345, birthtime=3D1333196720.972567345, = size=3D382344, blksize=3D131072, blocks=3D773, flags=3D0x0 } > 69247 ls 0.054948 STRU struct stat {dev=3D846475008, = ino=3D75025952, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1351071350, stime=3D1349726805, = ctime=3D1351071350.800873870, birthtime=3D1351071350.800873870, = size=3D2575559, blksize=3D131072, blocks=3D5127, flags=3D0x0 } > 69247 ls 0.054828 STRU struct stat {dev=3D846475008, = ino=3D65021883, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1335730367, stime=3D1332843230, = ctime=3D1335730367.541567371, birthtime=3D1335730367.541567371, = size=3D226347, blksize=3D131072, blocks=3D517, flags=3D0x0 } > 69247 ls 0.053743 STRU struct stat {dev=3D846475008, = ino=3D46222016, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, = rdev=3D4294967295, atime=3D1333196765, stime=3D1257110706, = ctime=3D1333196765.206574132, birthtime=3D1333196765.206574132, = size=3D62112, blksize=3D62464, blocks=3D123, flags=3D0x0 } > 69247 ls 0.052015 RET getdirentries 4060/0xfdc > 69247 ls 0.051388 RET getdirentries 4068/0xfe4 > 69247 ls 0.049875 RET getdirentries 4088/0xff8 > 69247 ls 0.049156 RET getdirentries 4032/0xfc0 > 69247 ls 0.048609 RET getdirentries 4040/0xfc8 > 69247 ls 0.048279 RET getdirentries 
4032/0xfc0 > 69247 ls 0.048062 RET getdirentries 4064/0xfe0 > 69247 ls 0.047577 RET getdirentries 4076/0xfec > (snip) >=20 > the STRU are returns from calling lstat(). >=20 > It looks like both getdirentries and lstat are taking quite a while to = return. The shortest return for any lstat() call is 0.000004 seconds, = the maximum is 0.190729 and the average is around 0.0004. Just from = lstat() alone, that makes "ls" take over 20 seconds. >=20 > I'm prepared to try an L2arc cache device (with = secondarycache=3Dmetadata), but I'm having trouble determining how big = of a device I'd need. We've got >30M inodes now on this filesystem, = including some files with extremely long names. Is there some way to = determine the amount of metadata on a ZFS filesystem? >=20 > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" What are your : vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used = sysctls? Maybe increasing the limit can help? Regards, From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 15:15:10 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 799B41B8 for ; Wed, 30 Jan 2013 15:15:10 +0000 (UTC) (envelope-from toasty@dragondata.com) Received: from mail-ie0-x22c.google.com (ie-in-x022c.1e100.net [IPv6:2607:f8b0:4001:c03::22c]) by mx1.freebsd.org (Postfix) with ESMTP id 490C9F94 for ; Wed, 30 Jan 2013 15:15:10 +0000 (UTC) Received: by mail-ie0-f172.google.com with SMTP id c10so1338113ieb.17 for ; Wed, 30 Jan 2013 07:15:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dragondata.com; s=google; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer; bh=4jStnLPiLJlFlXWlcVd2u39urNx9ecuHBzR1WKnGeAY=; b=duvti6SSYFdBAFGyj87Qf3PSjr9+9PgNgdCCDIVygZjzzEunXnEYmsQum1cwd6iHmS hxaifO0JuzuzVGzG3suqMY+pYVVrvhXeCBQEUP07lo6YxguAMFciAqerdqFf3rBGwXda 1ugktxPbzKf/p3np0p0Hsakib/9Uf4SZzGjUo= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer :x-gm-message-state; bh=4jStnLPiLJlFlXWlcVd2u39urNx9ecuHBzR1WKnGeAY=; b=oiSUN8jI/7GqNIRyYkCUIoLAyygk/qwDOYYkAM2hsjW/smXQaqv9d1EowrzQYJ7sOK 9oxQaGph4jiSpE9xDXliaiikZRI8ht2XjGnT+7dGny1Cd5imThT8Dxl/WnvZEiXANAzz lnxV+uM3S2DvfC+TtDgHX8B6UeT6fTfNUdkW8Iylylmd/ohX/XekYTQmk5Z4r9ZuAOMq /DiWoJXuj2A4DCuauu7kBOndTGXDmhkg0PNRRHc81c21jVjbceCLu8bCLDZW3H+KJLtX VKgeTfh8Sx8Jo42fhlVCCzar3ZAvablXw6b95w1hxWcVYSxzf0E/01hyD1A1FVXXJG3c thrg== X-Received: by 10.50.47.200 with SMTP id f8mr3951203ign.98.1359558909991; Wed, 30 Jan 2013 07:15:09 -0800 (PST) Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. 
[204.9.51.132]) by mx.google.com with ESMTPS id bg10sm4495947igc.6.2013.01.30.07.15.07 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 30 Jan 2013 07:15:08 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: Improving ZFS performance for large directories From: Kevin Day In-Reply-To: Date: Wed, 30 Jan 2013 09:15:04 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> To: "Ronald Klop" X-Mailer: Apple Mail (2.1499) X-Gm-Message-State: ALoCoQkruxYgS03kacjD54zkMx4V3qRKoiNaTXmQywWTjh4H6FRcoDDHNGsh5gUccTzxo4AX8ktz Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 15:15:10 -0000 On Jan 30, 2013, at 4:20 AM, "Ronald Klop" = wrote: > On Wed, 30 Jan 2013 00:20:15 +0100, Kevin Day = wrote: >=20 >>=20 >> I'm trying to improve performance when using ZFS in large (>60000 = files) directories. A common activity is to use "getdirentries" to = enumerate all the files in the directory, then "lstat" on each one to = get information about it. Doing an "ls -l" in a large directory like = this can take 10-30 seconds to complete. Trying to figure out why, I = did: >>=20 >> ktrace ls -l /path/to/large/directory >> kdump -R |sort -rn |more >=20 > Does ls -lf /pat/to/large/directory make a difference. It makes ls not = to sort the directory so it can use a more efficient way of traversing = the directory. >=20 > Ronald. Nope, the sort seems to add a trivial amount of extra time to the entire = operation. Nearly all the time is spent in lstat() or getdirentries(). = Good idea though! 
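For what it's worth, a rough way to total where the time goes in such a trace, assuming the kdump -R column layout shown earlier (relative timestamp in the third field, record type in the fourth), is something like:

# ktrace ls -l /path/to/large/directory > /dev/null
# kdump -R | awk '$4 == "RET" || $4 == "STRU" { t[$4 " " $5] += $3 } END { for (k in t) printf "%12.6f %s\n", t[k], k }' | sort -rn

Here the "STRU struct" bucket is effectively the lstat() returns described above and "RET getdirentries" the directory reads, so the two can be compared directly.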
-- Kevin From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 15:19:59 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 8559B308 for ; Wed, 30 Jan 2013 15:19:59 +0000 (UTC) (envelope-from toasty@dragondata.com) Received: from mail-ie0-x22c.google.com (ie-in-x022c.1e100.net [IPv6:2607:f8b0:4001:c03::22c]) by mx1.freebsd.org (Postfix) with ESMTP id 515C7FDA for ; Wed, 30 Jan 2013 15:19:59 +0000 (UTC) Received: by mail-ie0-f172.google.com with SMTP id c10so1369847ieb.3 for ; Wed, 30 Jan 2013 07:19:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dragondata.com; s=google; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer; bh=r8f3Yu1HhqeZ9wHXx0xyu9zShuD73vpJQwFBGe3oISw=; b=qUbIKur38wWMrpJww/gOOCIYLFt90vPmxsMEGUiEsZPXgDgEuytOkpc3a7oGj3bnhM 1/hP4ihpYbVWlSAMv5Y5q5CiG+Xj+HbL5zVmEM+WmqAxVO22TuxlPkxz2boseY6fjr5I 9QHeX464fAAbOcFCpiWySrA/03PMOICKl5Uek= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer :x-gm-message-state; bh=r8f3Yu1HhqeZ9wHXx0xyu9zShuD73vpJQwFBGe3oISw=; b=PavDKmiLf/Sn1qOmSrsbmGHPgeoQS5LROkHLQLEEjBaGTAxDahWGqnkF6JvnkAVgGI S7A/gdGsI00skwd2VmVjZapgNlTywx5JdduWCrDWqcBtLzZJ/G+gkXhWhoU3iwb+JHUH H5NbctKDY+9xSqer5YcyiL3jVuHfHK3Z46NalGaZ6+T/op10FLh5hiHesi0bSen2z+Jv HaMmepWyojXOeiqWLXbz11gfqwfutPI8JIrcCIlwrWJhygizB9RON3NgI//Trq65MUOy +DxoEr0Yh6ZsfNIqUnmU9hdjjVWg7GLBBZSR9AAcQbx9kg5IAy9DcJ8k8Z+kLBcNJJMp 3e+w== X-Received: by 10.50.13.208 with SMTP id j16mr3837750igc.73.1359559199008; Wed, 30 Jan 2013 07:19:59 -0800 (PST) Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132]) by mx.google.com with ESMTPS id fb10sm2077564igb.1.2013.01.30.07.19.55 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 30 Jan 2013 07:19:57 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: Improving ZFS performance for large directories From: Kevin Day In-Reply-To: <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com> Date: Wed, 30 Jan 2013 09:19:52 -0600 Content-Transfer-Encoding: 7bit Message-Id: <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com> References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com> To: Nikolay Denev X-Mailer: Apple Mail (2.1499) X-Gm-Message-State: ALoCoQkYnbcKjzrRuMXTbOa1TnXlK7LV6CYRTKJE5zTGGBvs7cMBaMH2lTMQXKldb04O4ZSDB54d Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 15:19:59 -0000 On Jan 30, 2013, at 4:36 AM, Nikolay Denev wrote: > > > What are your : vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used sysctls? > Maybe increasing the limit can help? 
vfs.zfs.arc_meta_limit: 8199079936 vfs.zfs.arc_meta_used: 13965744408 Full output of zfs-stats: ------------------------------------------------------------------------ ZFS Subsystem Report Wed Jan 30 15:16:54 2013 ------------------------------------------------------------------------ System Information: Kernel Version: 901000 (osreldate) Hardware Platform: amd64 Processor Architecture: amd64 ZFS Storage pool Version: 28 ZFS Filesystem Version: 5 FreeBSD 9.1-RC2 #1: Tue Oct 30 20:37:38 UTC 2012 root 3:16PM up 19 days, 19:44, 2 users, load averages: 0.91, 0.80, 0.68 ------------------------------------------------------------------------ System Memory: 12.44% 7.72 GiB Active, 6.04% 3.75 GiB Inact 77.33% 48.01 GiB Wired, 2.25% 1.40 GiB Cache 1.94% 1.21 GiB Free, 0.00% 1.21 MiB Gap Real Installed: 64.00 GiB Real Available: 99.97% 63.98 GiB Real Managed: 97.04% 62.08 GiB Logical Total: 64.00 GiB Logical Used: 90.07% 57.65 GiB Logical Free: 9.93% 6.35 GiB Kernel Memory: 22.62 GiB Data: 99.91% 22.60 GiB Text: 0.09% 21.27 MiB Kernel Memory Map: 54.28 GiB Size: 34.75% 18.86 GiB Free: 65.25% 35.42 GiB ------------------------------------------------------------------------ ARC Summary: (HEALTHY) Memory Throttle Count: 0 ARC Misc: Deleted: 430.91m Recycle Misses: 111.27m Mutex Misses: 2.49m Evict Skips: 647.25m ARC Size: 87.63% 26.77 GiB Target Size: (Adaptive) 87.64% 26.77 GiB Min Size (Hard Limit): 12.50% 3.82 GiB Max Size (High Water): 8:1 30.54 GiB ARC Size Breakdown: Recently Used Cache Size: 58.64% 15.70 GiB Frequently Used Cache Size: 41.36% 11.07 GiB ARC Hash Breakdown: Elements Max: 2.19m Elements Current: 86.15% 1.89m Collisions: 344.47m Chain Max: 17 Chains: 552.47k ------------------------------------------------------------------------ ARC Efficiency: 21.94b Cache Hit Ratio: 97.00% 21.28b Cache Miss Ratio: 3.00% 657.23m Actual Hit Ratio: 73.15% 16.05b Data Demand Efficiency: 98.94% 1.32b Data Prefetch Efficiency: 14.83% 299.44m CACHE HITS BY CACHE LIST: Anonymously Used: 23.03% 4.90b Most Recently Used: 6.12% 1.30b Most Frequently Used: 69.29% 14.75b Most Recently Used Ghost: 0.50% 105.94m Most Frequently Used Ghost: 1.07% 226.92m CACHE HITS BY DATA TYPE: Demand Data: 6.11% 1.30b Prefetch Data: 0.21% 44.42m Demand Metadata: 69.29% 14.75b Prefetch Metadata: 24.38% 5.19b CACHE MISSES BY DATA TYPE: Demand Data: 2.12% 13.90m Prefetch Data: 38.80% 255.02m Demand Metadata: 30.97% 203.56m Prefetch Metadata: 28.11% 184.75m ------------------------------------------------------------------------ L2ARC is disabled ------------------------------------------------------------------------ File-Level Prefetch: (HEALTHY) DMU Efficiency: 24.08b Hit Ratio: 66.02% 15.90b Miss Ratio: 33.98% 8.18b Colinear: 8.18b Hit Ratio: 0.01% 560.82k Miss Ratio: 99.99% 8.18b Stride: 15.23b Hit Ratio: 99.98% 15.23b Miss Ratio: 0.02% 2.62m DMU Misc: Reclaim: 8.18b Successes: 0.08% 6.31m Failures: 99.92% 8.17b Streams: 663.44m +Resets: 0.06% 397.18k -Resets: 99.94% 663.04m Bogus: 0 ------------------------------------------------------------------------ VDEV cache is disabled ------------------------------------------------------------------------ ZFS Tunables (sysctl): kern.maxusers 384 vm.kmem_size 66662760448 vm.kmem_size_scale 1 vm.kmem_size_min 0 vm.kmem_size_max 329853485875 vfs.zfs.l2c_only_size 0 vfs.zfs.mfu_ghost_data_lsize 2121007104 vfs.zfs.mfu_ghost_metadata_lsize 7876605440 vfs.zfs.mfu_ghost_size 9997612544 vfs.zfs.mfu_data_lsize 10160539648 vfs.zfs.mfu_metadata_lsize 17161216 vfs.zfs.mfu_size 11163991040 
vfs.zfs.mru_ghost_data_lsize 7235079680 vfs.zfs.mru_ghost_metadata_lsize 11107812352 vfs.zfs.mru_ghost_size 18342892032 vfs.zfs.mru_data_lsize 4406255616 vfs.zfs.mru_metadata_lsize 3924364288 vfs.zfs.mru_size 8893582336 vfs.zfs.anon_data_lsize 0 vfs.zfs.anon_metadata_lsize 0 vfs.zfs.anon_size 999424 vfs.zfs.l2arc_norw 1 vfs.zfs.l2arc_feed_again 1 vfs.zfs.l2arc_noprefetch 1 vfs.zfs.l2arc_feed_min_ms 200 vfs.zfs.l2arc_feed_secs 1 vfs.zfs.l2arc_headroom 2 vfs.zfs.l2arc_write_boost 8388608 vfs.zfs.l2arc_write_max 8388608 vfs.zfs.arc_meta_limit 8199079936 vfs.zfs.arc_meta_used 14161977912 vfs.zfs.arc_min 4099539968 vfs.zfs.arc_max 32796319744 vfs.zfs.dedup.prefetch 1 vfs.zfs.mdcomp_disable 0 vfs.zfs.write_limit_override 0 vfs.zfs.write_limit_inflated 206088929280 vfs.zfs.write_limit_max 8587038720 vfs.zfs.write_limit_min 33554432 vfs.zfs.write_limit_shift 3 vfs.zfs.no_write_throttle 0 vfs.zfs.zfetch.array_rd_sz 1048576 vfs.zfs.zfetch.block_cap 256 vfs.zfs.zfetch.min_sec_reap 2 vfs.zfs.zfetch.max_streams 8 vfs.zfs.prefetch_disable 0 vfs.zfs.mg_alloc_failures 12 vfs.zfs.check_hostid 1 vfs.zfs.recover 0 vfs.zfs.txg.synctime_ms 1000 vfs.zfs.txg.timeout 5 vfs.zfs.vdev.cache.bshift 16 vfs.zfs.vdev.cache.size 0 vfs.zfs.vdev.cache.max 16384 vfs.zfs.vdev.write_gap_limit 4096 vfs.zfs.vdev.read_gap_limit 32768 vfs.zfs.vdev.aggregation_limit 131072 vfs.zfs.vdev.ramp_rate 2 vfs.zfs.vdev.time_shift 6 vfs.zfs.vdev.min_pending 4 vfs.zfs.vdev.max_pending 10 vfs.zfs.vdev.bio_flush_disable 0 vfs.zfs.cache_flush_disable 0 vfs.zfs.zil_replay_disable 0 vfs.zfs.zio.use_uma 0 vfs.zfs.snapshot_list_prefetch 0 vfs.zfs.version.zpl 5 vfs.zfs.version.spa 28 vfs.zfs.version.acl 1 vfs.zfs.debug 0 vfs.zfs.super_owner 0 ------------------------------------------------------------------------ From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 16:34:48 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 8D63B27E for ; Wed, 30 Jan 2013 16:34:48 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-bk0-f51.google.com (mail-bk0-f51.google.com [209.85.214.51]) by mx1.freebsd.org (Postfix) with ESMTP id E7728774 for ; Wed, 30 Jan 2013 16:34:47 +0000 (UTC) Received: by mail-bk0-f51.google.com with SMTP id ik5so922431bkc.38 for ; Wed, 30 Jan 2013 08:34:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:subject:mime-version:content-type:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer; bh=8RTCp/ASRpz9O9eFd1VM9MVuj+AlGo/ex9Alocsrpy0=; b=P0iibY5ikgCzzIMWXojRZpi357R608ZbSCwecaIvvR6Z6TO6BzbzMfJOkGq6HBKKM0 SwaMEO/Q0T1U1nEkWz5ujhv+5GzuO2dG0vWcoPcq+JptER9vrgTBttVEjakDLp8vI/BH YuThDDe7qcw8MKSeXNrmPqXDM6bblgDlL728ZKAnqqUhfGRmU920keaNDAmIyHtstEfO 7lLulsLYq/Sl4foceTHANPoEG7Zc0o0jvx8NuU32ewdA/lIfzWBlgha5vozRTo16vCH7 0D3dO5jt4UVrfkAZSBjIsxhDzhtwQwnV7hn7wnzACBdggU9zkmg/aWbJo/r2IlCSWo2p NeaA== X-Received: by 10.204.11.78 with SMTP id s14mr1431340bks.118.1359563681583; Wed, 30 Jan 2013 08:34:41 -0800 (PST) Received: from imba-brutale-3.totalterror.net ([93.152.184.10]) by mx.google.com with ESMTPS id z5sm819545bkv.11.2013.01.30.08.34.39 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 30 Jan 2013 08:34:40 -0800 (PST) Subject: Re: Improving ZFS performance for large directories Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Content-Type: text/plain; charset=windows-1252 From: Nikolay Denev In-Reply-To: 
<47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com> Date: Wed, 30 Jan 2013 18:34:41 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <23E6691A-F30C-4731-9F78-FD8ADDDA09AE@gmail.com> References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com> <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com> To: Kevin Day X-Mailer: Apple Mail (2.1499) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 16:34:48 -0000 On Jan 30, 2013, at 5:19 PM, Kevin Day wrote: >=20 > On Jan 30, 2013, at 4:36 AM, Nikolay Denev wrote: >>=20 >>=20 >> What are your : vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used = sysctls? >> Maybe increasing the limit can help? >=20 >=20 > vfs.zfs.arc_meta_limit: 8199079936 > vfs.zfs.arc_meta_used: 13965744408 >=20 > Full output of zfs-stats: [=85snipped=85] Looks like you can try to increase arc_meta_limit to be let's say : half = of arc_max. (16398159872 in your case). From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 20:59:19 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 56498EAD for ; Wed, 30 Jan 2013 20:59:19 +0000 (UTC) (envelope-from toasty@dragondata.com) Received: from mail-ie0-x22f.google.com (mail-ie0-x22f.google.com [IPv6:2607:f8b0:4001:c03::22f]) by mx1.freebsd.org (Postfix) with ESMTP id F221E6AC for ; Wed, 30 Jan 2013 20:59:18 +0000 (UTC) Received: by mail-ie0-f175.google.com with SMTP id c12so1687626ieb.6 for ; Wed, 30 Jan 2013 12:59:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dragondata.com; s=google; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer; bh=cnUN69pWlUjCGxQg9jE+TzJzR5O+4orLIVM/h6z/A7U=; b=Cc3tS0JuX+tkIqPeXky9R/5+AQB69kVOHLPA8szy6+4ThBEScrvCd3F3g1ker8WB3q GWAnc5o2yIEnJo6aQGK6Ssks6g7lDUDLVLV1GeJ0qm2LUXCrNtLvG2w1PQ8B5rcmrSXL e76RZyZbm3+jLpmPTsJ4VG5iSWlUiwXSciDpA= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer :x-gm-message-state; bh=cnUN69pWlUjCGxQg9jE+TzJzR5O+4orLIVM/h6z/A7U=; b=GTt2W43YcjtaORYRsleZg45/xnFXvNselPfeI6Nr2QbZTKnFDXmrwM5z8PFRkvLENx kE1WoPku4buWmvrJxdaYWqivYAlm6Sq6exPVedwhpPkFcmQ/rOT9W3NisAzcPOES3Tzj /egkFpIqtCWaFPIX4qzsHZe1mUJiywX8sXUgAMzbCIjkhwDBM7BX6LAa/lo6NM612NPf XVMwBzb2ezCrqCD0tHeP8uDPlWVRw1U1mONq4I+YL7Uyt6OOMfv+kzmDqvr0Uyci7lGA vzuOTL3dUCmVN4TfQfgtLH6QViXN6czCqQMtvphUeI9ZKqBgBRHp/CoQxDVS1VsY1Dbd GufA== X-Received: by 10.50.192.197 with SMTP id hi5mr4638121igc.45.1359579558523; Wed, 30 Jan 2013 12:59:18 -0800 (PST) Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. 
[204.9.51.132]) by mx.google.com with ESMTPS id fa6sm3343316igb.2.2013.01.30.12.59.16 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 30 Jan 2013 12:59:17 -0800 (PST) Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: Improving ZFS performance for large directories From: Kevin Day In-Reply-To: <23E6691A-F30C-4731-9F78-FD8ADDDA09AE@gmail.com> Date: Wed, 30 Jan 2013 14:59:15 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com> <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com> <23E6691A-F30C-4731-9F78-FD8ADDDA09AE@gmail.com> To: Nikolay Denev X-Mailer: Apple Mail (2.1499) X-Gm-Message-State: ALoCoQnkS+6kCreYGM448c9En3dGEp/g3iUGqVrmWf1fActc0Mdnx6jek0faAnU6k7bN8BfFSNBO Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 20:59:19 -0000 On Jan 30, 2013, at 10:34 AM, Nikolay Denev wrote: >>=20 >> vfs.zfs.arc_meta_limit: 8199079936 >> vfs.zfs.arc_meta_used: 13965744408 >>=20 >> Full output of zfs-stats: [=85snipped=85] >=20 > Looks like you can try to increase arc_meta_limit to be let's say : = half of arc_max. (16398159872 in your case). >=20 >=20 Okay, will give this a shot on the next reboot too. Does anyone here understand the significance of "used" being higher than = "limit"? Is the limit only a suggestion, or are there cases where = there'a certain metadata that must be in arc, and it's particularly = large here? -- Kevin From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 21:56:10 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 6C643145 for ; Wed, 30 Jan 2013 21:56:10 +0000 (UTC) (envelope-from artemb@gmail.com) Received: from mail-vb0-f51.google.com (mail-vb0-f51.google.com [209.85.212.51]) by mx1.freebsd.org (Postfix) with ESMTP id 30961940 for ; Wed, 30 Jan 2013 21:56:10 +0000 (UTC) Received: by mail-vb0-f51.google.com with SMTP id fq11so1298286vbb.38 for ; Wed, 30 Jan 2013 13:56:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=wpSwv3N4jMIGGcP8n72/lz2cALhn/QdLAKhy7UP+hEs=; b=mtwbJOBxtUdF6vKUzQssPpCv1g6Z66uP9VaNfEUCyGfzsv4+X6xwWeU+g+NwXi9GCH nhiPiH3r56Iz9kjTx+ZM7dCGMTkHPbEcrXYl1RmoqhW+qI0TbUCMNoOIhYaY4HCagrWW nJonqyNbla7eJkX+qNE7k2BAbzxFEJV6vs2bAIafXk6L5QzWT4xWd8LDhM/73dlRZX8g P7mVU3G4ysL6UzyyB62ewF37cHpVWR9w2ld4hy3seAJmCFIPyCTJ+hdQiT/FOg6zYjIt aXH5ZPqZJUhaxsGHOCphYhdZfsLouNMglO39h/XufIL+eRoXj2rvJqWPRS7/ORSpnx1V +Q2w== MIME-Version: 1.0 X-Received: by 10.220.108.2 with SMTP id d2mr6090053vcp.60.1359582969557; Wed, 30 Jan 2013 13:56:09 -0800 (PST) Sender: artemb@gmail.com Received: by 10.220.123.2 with HTTP; Wed, 30 Jan 2013 13:56:09 -0800 (PST) In-Reply-To: References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com> <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com> <23E6691A-F30C-4731-9F78-FD8ADDDA09AE@gmail.com> Date: Wed, 30 Jan 2013 13:56:09 -0800 X-Google-Sender-Auth: UFa8nGEIxkYKzwrgcUrOaIq3oic Message-ID: Subject: Re: Improving ZFS performance for large 
directories From: Artem Belevich To: Kevin Day Content-Type: text/plain; charset=ISO-8859-1 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 21:56:10 -0000 On Wed, Jan 30, 2013 at 12:59 PM, Kevin Day wrote: > > Does anyone here understand the significance of "used" being higher than "limit"? Is the limit only a suggestion, or are there cases where there'a certain metadata that must be in arc, and it's particularly large here? arc_meta_limit is a soft limit which basically tells ARC to attempt evicting metadata entries and reuse their buffers as opposed to allocating new memory and growing ARC. According to the comment next to arc_evict() function, it's a best-effort attempt and eviction is not guaranteed. That could potentially allow meta_size to remain above meta_limit. --Artem From owner-freebsd-fs@FreeBSD.ORG Wed Jan 30 22:16:19 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 93A218A7; Wed, 30 Jan 2013 22:16:19 +0000 (UTC) (envelope-from universite@ukr.net) Received: from ffe17.ukr.net (ffe17.ukr.net [195.214.192.83]) by mx1.freebsd.org (Postfix) with ESMTP id 225C0A11; Wed, 30 Jan 2013 22:16:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net; s=ffe; h=Date:Message-Id:From:To:References:In-Reply-To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version; bh=fV+Bh0N8JNjGFvYk/bb45iFUqDQptup1FImZtFB/6Y4=; b=Q0ixxUY3cZf4anPbmhMYsDqjv1snjiiijqH2XruWglCDasyj6SNa9gw5XRB1FxfAadL4Izu8PVH+dwDODFd8CEUmO0JLUu5G3eZAtb7lYI3BmRCzYdzYDqul4dlpcTyCDc7S1ny+Vy0Cp5+/vP8OXYFb6/rk7zs2Vg8V1m4tamc=; Received: from mail by ffe17.ukr.net with local ID 1U0fZJ-000N1z-3y ; Wed, 30 Jan 2013 23:51:45 +0200 MIME-Version: 1.0 Content-Disposition: inline Content-Transfer-Encoding: binary Content-Type: text/plain; charset="windows-1251" Subject: Re[2]: Re[2]: AHCI timeout when using ZFS + AIO + NCQ In-Reply-To: <1359317924363-5781425.post@n5.nabble.com> References: <70362.1359299605.3196836531757973504@ffe11.ukr.net> <16B555759C2041ED8185DF478193A59D@multiplay.co.uk> <917933DB5C9A490D93A739058C2507A1@multiplay.co.uk> <93308.1359297551.14145052969567453184@ffe15.ukr.net> <13391.1359029978.3957795939058384896@ffe16.ukr.net> <70578.1359313319.18126575192049975296@ffe16.ukr.net> <221B307551154F489452F89E304CA5F7@multiplay.co.uk> <1359317924363-5781425.post@n5.nabble.com> To: "Beeblebrox" From: "Vladislav Prodan" X-Mailer: freemail.ukr.net 4.0 Message-Id: <87448.1359582705.624376220320202752@ffe17.ukr.net> X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0 Date: Wed, 30 Jan 2013 23:51:45 +0200 Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Jan 2013 22:16:19 -0000 > I once ran into a very severe AHCI timeout problem. After months of trying to > figure it out and insane "Hardware_ECC_Recovered" error values, I found that > the error was with the power connector plug / sata HDD interface. All errors > disappeared after replacing that cable. Since you have error on more than 1 > HDD, I suggest: > 1. 
Check smartctl output for each AND all HDD > 2. Check whether your power supply unit is still healthy or if it is > supplying inconsistent power. > 3. Check the main power supply line and whether it shows any voltage > fluctuations or if there is a new heavy consumer of amps on the same power > line as the server is plugged to. > > I've deliberately chose a different server that has a different chipset, and that there were no problems with the HDD. Added kernel support: device ahci # AHCI-compatible SATA controllers And now, after 2.5 days fell off one HDD. [3:14]beastie:root->/root# zpool status pool: tank state: DEGRADED status: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Online the device using 'zpool online' or replace the device with 'zpool replace'. scan: none requested config: NAME STATE READ WRITE CKSUM tank DEGRADED 0 0 0 mirror-0 ONLINE 0 0 0 gpt/disk0 ONLINE 0 0 0 gpt/disk2 ONLINE 0 0 0 mirror-1 DEGRADED 0 0 0 gpt/disk1 ONLINE 0 0 0 4931885954389536913 REMOVED 0 0 0 was /dev/gpt/disk3 errors: No known data errors Jan 30 09:49:28 beastie kernel: ahcich3: Timeout on slot 29 port 0 Jan 30 09:49:28 beastie kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd c0 serr 00000000 cmd 0004dd17 Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00 Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): Retrying command Jan 30 09:51:31 beastie kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Jan 30 09:51:31 beastie kernel: ahcich3: Timeout on slot 29 port 0 Jan 30 09:51:31 beastie kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17 Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked Jan 30 09:51:31 beastie kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Jan 30 09:51:31 beastie kernel: ahcich3: Timeout on slot 29 port 0 Jan 30 09:51:31 beastie kernel: ahcich3: is 00000000 cs 00000000 ss 00000000 rs 20000000 tfd 58 serr 00000000 cmd 0004dd17 Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked Jan 30 09:51:31 beastie kernel: (ada3:ahcich3:0:0:0): lost device Jan 30 09:51:31 beastie kernel: (pass3:ahcich3:0:0:0): passdevgonecb: devfs entry is gone -- Vladislav V. 
Prodan System & Network Administrator http://support.od.ua +380 67 4584408, +380 99 4060508 VVP88-RIPE From owner-freebsd-fs@FreeBSD.ORG Thu Jan 31 03:51:26 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 9D8C6183; Thu, 31 Jan 2013 03:51:26 +0000 (UTC) (envelope-from wollman@hergotha.csail.mit.edu) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) by mx1.freebsd.org (Postfix) with ESMTP id 3678A804; Thu, 31 Jan 2013 03:51:25 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.5/8.14.5) with ESMTP id r0V3pOUj092930; Wed, 30 Jan 2013 22:51:24 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.5/8.14.4/Submit) id r0V3pOAr092927; Wed, 30 Jan 2013 22:51:24 -0500 (EST) (envelope-from wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <20745.59964.60447.379943@hergotha.csail.mit.edu> Date: Wed, 30 Jan 2013 22:51:24 -0500 From: Garrett Wollman To: freebsd-stable@freebsd.org, freebsd-fs@freebsd.org Subject: More on odd ZFS not-quite-deadlock X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (hergotha.csail.mit.edu [127.0.0.1]); Wed, 30 Jan 2013 22:51:24 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Jan 2013 03:51:26 -0000 I posted a few days ago about what I thought was a ZFS-related almost-deadlock. I have a bit more information now, but I'm still puzzled. Hopefully someone else has seen this before. While things are in the hung state, a "zfs recv" is running. It's receiving an empty snapshot to one of the many datasets on this file server. "zfs recv" reports that receiving this particular empty snapshot takes just about half an hour. When it finally completes, everything starts working normally again. (This particular replication job will no longer be operational in a few hours, so this may be the last time I can collect information about the issue for a while.) The same "zfs recv" takes only a few seconds 23 hours out of 24. 
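For reference, the per-thread kernel stacks below are in the layout printed by procstat, so a command along the lines of

# procstat -kk -a

run while the hang is in progress (and trimmed to the threads of interest) should give a comparable listing on other systems.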
The kstacks of the processes that appear to possibly be involved look like this: PID TID COMM TDNAME KSTACK 0 100061 kernel thread taskq mi_switch+0x196 sleepq_wait+0x42 _sx_slock_hard+0x3bb _sx_slock+0x3d zfs_reclaim_complete+0x38 taskqueue_run_locked+0x85 taskqueue_thread_loop+0x46 fork_exit+0x11f fork_trampoline+0xe 7 100215 zfskern arc_reclaim_thre mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c arc_reclaim_thread+0x29d fork_exit+0x11f fork_trampoline+0xe 7 100216 zfskern l2arc_feed_threa mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c l2arc_feed_thread+0x1a8 fork_exit+0x11f fork_trampoline+0xe 7 100592 zfskern txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 7 100593 zfskern txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 7 100989 zfskern txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 7 100990 zfskern txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 7 101355 zfskern txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 7 101356 zfskern txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 13 100053 geom g_event mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_run_events+0x430 fork_exit+0x11f fork_trampoline+0xe 13 100054 geom g_up mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_io_schedule_up+0xd8 g_up_procbody+0x5c fork_exit+0x11f fork_trampoline+0xe 13 100055 geom g_down mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_io_schedule_down+0x20e g_down_procbody+0x5c fork_exit+0x11f fork_trampoline+0xe 22 100225 syncer - mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 rrw_enter+0xdb zfs_sync+0x63 sync_fsync+0x19d VOP_FSYNC_APV+0x4a sync_vnode+0x15e sched_sync+0x1c5 fork_exit+0x11f fork_trampoline+0xe 93224 102554 zfs - mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 zio_wait+0x61 dbuf_read+0x5e5 dnode_next_offset_level+0x28d dnode_next_offset+0xb9 dmu_object_next+0x3e dsl_dataset_destroy+0x164 dmu_recv_end+0x184 zfs_ioc_recv+0x9f4 zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xf0 amd64_syscall+0x5ea Xfast_syscall+0xf7 [This is the zfs recv process that is applying the replication package with an empty snapshot.] 93320 102479 df - mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 rrw_enter+0xdb zfs_root+0x40 lookup+0xaa6 namei+0x535 kern_statfs+0xa4 sys_statfs+0x37 amd64_syscall+0x5ea Xfast_syscall+0xf7 [7 more like this] (I've deleted all of the threads that are clearly waiting for some unrelated event, such as nanosleep() and select().) 
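One extra data point that may be worth collecting while it is wedged, with the pool name filled in as appropriate, is whether the receive is actually grinding through reads (the dbuf_read/zio_wait in the zfs recv stack suggests it is):

# zpool iostat -v <poolname> 1

together with re-running procstat -kk 93224 a few times to see whether that stack ever moves.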
-GAWollman From owner-freebsd-fs@FreeBSD.ORG Thu Jan 31 08:44:37 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 37701254; Thu, 31 Jan 2013 08:44:37 +0000 (UTC) (envelope-from lev@FreeBSD.org) Received: from onlyone.friendlyhosting.spb.ru (onlyone.friendlyhosting.spb.ru [46.4.40.135]) by mx1.freebsd.org (Postfix) with ESMTP id DD28A2EE; Thu, 31 Jan 2013 08:44:36 +0000 (UTC) Received: from lion.home.serebryakov.spb.ru (unknown [IPv6:2001:470:923f:1:2577:cf36:d0d4:4986]) (Authenticated sender: lev@serebryakov.spb.ru) by onlyone.friendlyhosting.spb.ru (Postfix) with ESMTPA id B6CD34ACC7; Thu, 31 Jan 2013 12:44:28 +0400 (MSK) Date: Thu, 31 Jan 2013 12:44:19 +0400 From: Lev Serebryakov Organization: FreeBSD X-Priority: 3 (Normal) Message-ID: <1291867.20130131124419@serebryakov.spb.ru> To: freebsd-fs@FreeBSD.org, freebsd-stable@freebsd.org Subject: 9.1-STABLE, live lock up, seems that it is ZFS lockup in "zfskern{txg_thread_enter}" state "tx->tx" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: lev@FreeBSD.org List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Jan 2013 08:44:37 -0000 Hello, freebsd-fs. I have 9.1-STABLE (r244958) system, amd64, 8GiB memory. Two SATA disks, 750Gb each. Disks are partitoned into 7 (BSD) partitons (exactly the same), 5 of these pairs are joined into gmirrors for "system" FSes (UFS2), one pair is used for swaps and 7th pair is used as zmirror for /usr/home. Tonight system becomes unusable, as every process which try to read directories in /usr/home (like "ls ~" or "find /usr/home -type f") hangs forever. I could login to system, login shell starts, but if I run "ls" right after -- it hangs. Every periodic process, which try to read home FS (directories, not files!) hangs. It looks, like stat() calls on this FS hangs, but not open()/read()/write()/close(). One thing I fins suspicious in different system diagnostics, is kernel thread "zfskern{txg_thread_enter}" which is shown in state "tx->tx" forever. Disks looks completely OK according to smartd/smartctl, no hardware errors in dmesg, etc. =============================================== # zpool status pool: pool state: ONLINE status: The pool is formatted using a legacy on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on software that does not support feature flags. 
scan: resilvered 32.1G in 0h34m with 0 errors on Sat Jun 2 16:22:59 2012 config: NAME STATE READ WRITE CKSUM pool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0s1h ONLINE 0 0 0 ada1s1h ONLINE 0 0 0 errors: No known data errors ================================================ -- // Black Lion AKA Lev Serebryakov From owner-freebsd-fs@FreeBSD.ORG Thu Jan 31 16:03:30 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 2EDBAA64 for ; Thu, 31 Jan 2013 16:03:30 +0000 (UTC) (envelope-from pluknet@gmail.com) Received: from mail-qe0-f52.google.com (mail-qe0-f52.google.com [209.85.128.52]) by mx1.freebsd.org (Postfix) with ESMTP id D7DCFFD6 for ; Thu, 31 Jan 2013 16:03:29 +0000 (UTC) Received: by mail-qe0-f52.google.com with SMTP id 6so1349003qeb.11 for ; Thu, 31 Jan 2013 08:03:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=x/z3mYX3Z1sloFJ+we21eSXbSGsC6DXMvBFRlRI8018=; b=j4w3TgznRpGt8LSD7eFli+GGKzVLlbAntX6h81gYpbG5E7sqrJyXnYvLj+kqJqzqEA 9yYMdmOFfayRDaE44GL0CRHfPHjvK+K9vAHv+BpjrUJWL+Aih29vbLxcxOZVBQEIb2eA cT36kzn0qRKFhS/fxhV+VZ/5HsWbg7ekwCjB1L6jesTUIKFuzHDmdAWeKj2PvAWLr4vo K8wm/OnP51Hsl8Hrikp9x75XqQqlIKpz2VzQPN2ylq+9u5V3UhJY00nxXRVp1OZY4gHQ KzKNaRGrWNzO2Aw3iWpLlV2aNxQhwtiXV0nWm6TlBHrZV2DdZBAeJT68SrWI+P7Z+Wtm MpAg== MIME-Version: 1.0 X-Received: by 10.229.78.97 with SMTP id j33mr2252518qck.107.1359648202844; Thu, 31 Jan 2013 08:03:22 -0800 (PST) Received: by 10.229.78.96 with HTTP; Thu, 31 Jan 2013 08:03:22 -0800 (PST) In-Reply-To: <1171241649.2066788.1358383377496.JavaMail.root@erie.cs.uoguelph.ca> References: <1171241649.2066788.1358383377496.JavaMail.root@erie.cs.uoguelph.ca> Date: Thu, 31 Jan 2013 19:03:22 +0300 Message-ID: Subject: Re: getcwd lies on/under nfs4-mounted zfs dataset From: Sergey Kandaurov To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Jan 2013 16:03:30 -0000 On 17 January 2013 04:42, Rick Macklem wrote: > pluknet@gmail.com wrote: >> Hi. >> >> We stuck with the problem getting wrong current directory path >> when sitting on/under zfs dataset filesystem mounted over NFSv4. >> Both nfs server and client are 10.0-CURRENT from December or so. >> >> The component path "user3" unexpectedly appears to be "." (dot). >> nfs-client:/home/user3 # pwd >> /home/. >> nfs-client:/home/user3/var/run # pwd >> /home/./var/run >> > Ok, I've figured out what is going on. The algorithm in libc > works, but vn_fullpath1() doesn't. The latter assumes that > "mount points" are marked with VV_ROOT etc. For the > "pseudo mount points" (which are mount points within the > directory tree on the NFSv4 server), this isn't the case. > > If you: > sysctl debug.disablecwd=1 > sysctl debug.disablefullpath=1 > > it works. (At least for the UFS case I tested.) Thank you very much, Rick! As an interim solution, we've decided to go that way. > > I can't see how this can be made to work correctly > for vn_fullpath1() unless it was re-written to use the > same algorithm that lib/libc/gen/getcwd.c implements. > > I was pretty sure this used to work. 
Maybe the syscalls > used to be disabled by default or weren't used by the > libc functions? > > Anyhow, sorry about the cofusing posts while I figured > out what was going on, rick > ps: Don't use the patch I posted. It isn't needed and > will break stuff. > >> nfs-client:~ # procstat -f 3225 >> PID COMM FD T V FLAGS REF OFFSET PRO NAME >> 3225 a.out text v r r-------- - - - /home/./var/a.out >> 3225 a.out ctty v c rw------- - - - /dev/pts/2 >> 3225 a.out cwd v d r-------- - - - /home/./var >> 3225 a.out root v d r-------- - - - / >> >> The used setup follows. >> >> 1. NFS Server with local ZFS: >> # cat /etc/exports >> V4: / -sec=sys >> >> # zfs list >> pool1 10.4M 122G 580K /pool1 >> pool1/user3 on /pool1/user3 (zfs, NFS exported, local, nfsv4acls) >> >> Exports list on localhost: >> /pool1/user3 109.70.28.0 >> /pool1 109.70.28.0 >> >> # zfs get sharenfs pool1/user3 >> NAME PROPERTY VALUE SOURCE >> pool1/user3 sharenfs -alldirs -maproot=root -network=109.70.28.0/24 >> local >> >> 2. pool1 is mounted on NFSv4 client: >> nfs-server:/pool1 on /home (nfs, noatime, nfsv4acls) >> >> So that on NFS client the "pool1/user3" dataset comes at /home/user3. >> / - ufs >> /home - zpool-over-nfsv4 >> /home/user3 - zfs dataset "pool1/user3" >> >> At the same time it works as expected when we're not on zfs dataset, >> but directly on its parent zfs pool (also over NFSv4), e.g. >> nfs-client:/home/non_dataset_dir # pwd >> /home/non_dataset_dir >> >> The ls command works as expected: >> nfs-client:/# ls -dl /home/user3/var/ >> drwxrwxrwt+ 6 root wheel 6 Jan 10 16:19 /home/user3/var/ >> -- wbr, pluknet From owner-freebsd-fs@FreeBSD.ORG Fri Feb 1 07:44:05 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 543E88EC for ; Fri, 1 Feb 2013 07:44:05 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 8AD26E8 for ; Fri, 1 Feb 2013 07:44:04 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id JAA16602 for ; Fri, 01 Feb 2013 09:43:56 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1U1BHw-0000i0-9B for freebsd-fs@FreeBSD.org; Fri, 01 Feb 2013 09:43:56 +0200 Message-ID: <510B723A.2090404@FreeBSD.org> Date: Fri, 01 Feb 2013 09:43:54 +0200 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130121 Thunderbird/17.0.2 MIME-Version: 1.0 To: freebsd-fs@FreeBSD.org Subject: zfs hang/deadlock - what to do - how to report X-Enigmail-Version: 1.4.6 Content-Type: text/plain; charset=x-viet-vps Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Feb 2013 07:44:05 -0000 Please first read the following https://wiki.freebsd.org/AvgZfsDeadlockDebug Please follow the advices and do suggested preliminary analysis. Please report accordingly. Thank you. 
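As a bare minimum while the machine is still in the hung state, the kind of data that is typically useful to attach (the wiki describes the full procedure) is along the lines of:

# procstat -kk -a > /var/tmp/procstat-kka.txt
# zpool status -v > /var/tmp/zpool-status.txt
# sysctl vfs.zfs kstat.zfs > /var/tmp/zfs-sysctls.txt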
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Fri Feb 1 09:09:08 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 68218B5A for ; Fri, 1 Feb 2013 09:09:08 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au [211.29.132.187]) by mx1.freebsd.org (Postfix) with ESMTP id 25654698 for ; Fri, 1 Feb 2013 09:09:06 +0000 (UTC) Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106]) by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r1198suE025477 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Fri, 1 Feb 2013 20:08:57 +1100 Date: Fri, 1 Feb 2013 20:08:54 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: fs@freebsd.org Subject: some fixes for msdosfs Message-ID: <20130201182606.A1492@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=MscKcBme c=1 sm=1 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=IPjz-GnoPKUA:10 a=Nf6N-zW9uRpywVz5buAA:9 a=CjuIK1q_8ugA:10 a=ADzlCoBTbut52P-S:21 a=qp1OgrSAsFp3MxqR:21 a=TEtd8y5WR3g2ypngnwZWYw==:117 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Feb 2013 09:09:08 -0000 Please commit some of these fixes. 1. The directory entry for dotdot was corrupted in the FAT32 case when moving a directory to a subdir of the root directory from somewhere else. For all directory moves that change the parent directory, the dotdot entry must be fixed up. For msdosfs, the root directory is magic for non-FAT32. It is less magic for FAT32, but needs the same magic for the dotdot fixup. It didn't have it. chkdsk and fsck_msdosfs fix the corrupt directory entries with no problems. The fix is simple -- use the same magic for dotdot in msdosfs_rename() as in msdosfs_mkdir(). But the patch is large due to related cleanups and unrelated changes that -current already has. @ Index: msdosfs_vnops.c @ =================================================================== @ RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_vnops.c,v @ retrieving revision 1.147 @ diff -u -2 -r1.147 msdosfs_vnops.c @ --- msdosfs_vnops.c 4 Feb 2004 21:52:53 -0000 1.147 @ +++ msdosfs_vnops.c 31 Jan 2013 17:36:21 -0000 @ @@ -926,5 +978,5 @@ @ int doingdirectory = 0, newparent = 0; @ int error; @ - u_long cn; @ + u_long cn, pcl; @ daddr_t bn; @ struct denode *fddep; /* from file's parent directory */ This is in msdosfs_rename(). Use the same variable name as in mkdir(), instead of reusing cn. @ @@ -1199,9 +1251,13 @@ @ } @ dotdotp = (struct direntry *)bp->b_data + 1; @ - putushort(dotdotp->deStartCluster, dp->de_StartCluster); @ + pcl = dp->de_StartCluster; @ + if (FAT32(pmp) && pcl == pmp->pm_rootdirblk) @ + pcl = MSDOSFSROOT; @ + putushort(dotdotp->deStartCluster, pcl); @ if (FAT32(pmp)) @ - putushort(dotdotp->deHighClust, dp->de_StartCluster >> 16); @ - error = bwrite(bp); @ - if (error) { @ + putushort(dotdotp->deHighClust, pcl >> 16); Use the same code as in mkdir(). Don't comment on it again. 
@ + if (fvp->v_mount->mnt_flag & MNT_ASYNC) @ + bdwrite(bp); @ + else if ((error = bwrite(bp)) != 0) { @ /* XXX should really panic here, fs is corrupt */ @ VOP_UNLOCK(fvp, 0, td); Unrelated changes that -current already has. @ @@ -1313,6 +1369,11 @@ @ putushort(denp[0].deMTime, ndirent.de_MTime); @ pcl = pdep->de_StartCluster; @ + /* @ + * Although the root directory has a non-magic starting cluster @ + * number for FAT32, chkdsk and fsck_msdosfs still require @ + * references to it in dotdot entries to be magic. @ + */ @ if (FAT32(pmp) && pcl == pmp->pm_rootdirblk) @ - pcl = 0; @ + pcl = MSDOSFSROOT; @ putushort(denp[1].deStartCluster, pcl); @ putushort(denp[1].deCDate, ndirent.de_CDate); This is in msdosfs_mkdir(). Document the magic there. Don't hard-code 0. @ @@ -1324,9 +1385,10 @@ @ if (FAT32(pmp)) { @ putushort(denp[0].deHighClust, newcluster >> 16); @ - putushort(denp[1].deHighClust, pdep->de_StartCluster >> 16); @ + putushort(denp[1].deHighClust, pcl >> 16); @ } Don't depend on magic soft-coding of 0. For the FAT32 root directory, pdep->de_StartCluster is usually < 65536. Perhaps it is always small. The old code depended on this to get a result of 0 when the value is shifted. @ @ - error = bwrite(bp); @ - if (error) @ + if (ap->a_dvp->v_mount->mnt_flag & MNT_ASYNC) @ + bdwrite(bp); @ + else if ((error = bwrite(bp)) != 0) @ goto bad; @ Unrelated changes that -current already has. Further cleanups: the condition for being the root directory should probably be written as (DETOV(pdep)->v_vflag & VV_ROOT). msdosfs uses this in some places, but it prefers to test if a denode's first cluster number is MSDOSFSROOT. The latter is simpler and was equivalent before FAT32 existed. msdosfs still uses it a lot, but it now means that the denode is for a non-FAT32 root directory. This is quite confusing. The old test often gives the correct classification because for the FAT32 case where it differs, the root directory is not magic, and the code under the test is really handling the magic case. Comments add to the confusion because they are mostly unchanged and still say that the root directory is always magic. Cases where the root directory is magic for FAT32 are mostly classified using the (FAT32(pmp) && cn == pmp->pm_rootdirblk) condition. Apparently there is some magic that requires the FAT32(pmp) condition before pmp->pm_rootdirblk can be trusted. The VV_ROOT condition seems better for these cases. ============ 2. mountmsdosfs() had an insane sanity test. While testing the above, I tried FAT32 on a small partition. This failed to mount because pmp->pm_Sectors was nonzero. Normally, FAT32 file systems are so large that the 16-bit pm_Sectors can't hold the size. This is indicated by setting it to 0 and using only pm_HugeSectors. But at least old versions of newfs_msdos use the 16-bit field if possible, and msdosfs supports this except for breaking its own support in the sanity check. This is quite different from the handling of pm_FATsecs -- now the 16-bit value is always ignored for FAT32 except for checking that it is 0, and newfs_msdos doesn't use the 16-bit value for FAT32. I just removed the sanity test. 
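For context, the fallback being broken here is roughly the following (a from-memory sketch of mountmsdosfs(), not part of the patch): the BPB carries both a 16-bit and a 32-bit total-sector count, and the 16-bit one wins whenever it is non-zero:

	pmp->pm_Sectors = getushort(b50->bpbSectors);		/* 16-bit count, 0 if the size does not fit */
	pmp->pm_HugeSectors = getulong(b50->bpbHugeSectors);	/* 32-bit count */
	if (pmp->pm_Sectors != 0)				/* small volume: 16-bit field wins */
		pmp->pm_HugeSectors = pmp->pm_Sectors;

so a small FAT32 volume with a non-zero bpbSectors is otherwise handled fine; only the removed check rejected it.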
@ Index: msdosfs_vfsops.c @ =================================================================== @ RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_vfsops.c,v @ retrieving revision 1.120 @ diff -u -2 -r1.120 msdosfs_vfsops.c @ --- msdosfs_vfsops.c 16 Jun 2004 09:47:03 -0000 1.120 @ +++ msdosfs_vfsops.c 31 Jan 2013 17:53:25 -0000 @ @@ -431,5 +459,4 @@ @ if (bsp->bs710.bsBootSectSig2 != BOOTSIG2 @ || bsp->bs710.bsBootSectSig3 != BOOTSIG3 @ - || pmp->pm_Sectors @ || pmp->pm_FATsecs @ || getushort(b710->bpbFSVers)) { The sanity tests of the signatures have already been removed in -current. ============ 3. Backup FATs were sometimes marked dirty by copying their first block from the primary FAT, and then they were not marked clean on unmount. This bug has been known for a long time, and always happened while testing (1), so I fixed it. My tests were mostly to create a new file system, 1 mkdir and move the new directory forth and back from the root partition to corrupt it. Since all the FAT entries are in the first block of the FAT, backing this up always marks the backups as unclean. chkdsk and fsck_msdosfs fix this, but it gives them extra work and uninspires confidence in the backups. @ Index: msdosfs_fat.c @ =================================================================== @ RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_fat.c,v @ retrieving revision 1.50 @ diff -u -2 -r1.50 msdosfs_fat.c @ --- msdosfs_fat.c 1 Sep 2008 13:18:16 -0000 1.50 @ +++ msdosfs_fat.c 31 Jan 2013 15:07:41 -0000 @ @@ -337,6 +338,6 @@ @ u_long fatbn; @ { @ - int i; @ struct buf *bpn; @ + int cleanfat, i; @ @ #ifdef MSDOSFS_DEBUG @ @@ -378,4 +356,10 @@ @ * bwrite()'s and really slow things down. @ */ @ + if (fatbn != pmp->pm_fatblk || FAT12(pmp)) @ + cleanfat = 0; @ + else if (FAT16(pmp)) @ + cleanfat = 16; @ + else @ + cleanfat = 32; @ for (i = 1; i < pmp->pm_FATs; i++) { @ fatbn += pmp->pm_FATsecs; @ @@ -384,5 +368,10 @@ @ 0, 0, 0); @ bcopy(bp->b_data, bpn->b_data, bp->b_bcount); @ - if (pmp->pm_flags & MSDOSFSMNT_WAITONFAT) @ + /* Force the clean bit on in the other copies. */ @ + if (cleanfat == 16) @ + ((u_int8_t *)bpn->b_data)[3] |= 0x80; @ + else if (cleanfat == 32) @ + ((u_int8_t *)bpn->b_data)[7] |= 0x08; @ + if (pmp->pm_mountp->mnt_flag & MNT_SYNCHRONOUS) @ bwrite(bpn); @ else Unrelated change for the bwrite() condition. The MSDOSFSMNT_WAITONFAT flag is bogus and broken. It does less than track the MNT_SYNCHRONOUS flag. It is set to the latter at mount time but not updated by MNT_UPDATE. You could exploit this to set it to a different value than the current MNT_SYNCHRONOUS setting, but this is undocumented and fragile. (FAT updates should be sync by default, but this is too slow, so the default is async (delayed) FAT, async (delayed or async) file data and sync metadata for denodes, which is probably a worse combination than async everything. But you could change the FAT write policy to sync by mounting with sync and then MNT_UPDATEing with nosync to get nosync (default) for file data.) @ @@ -394,11 +383,10 @@ @ * Write out the first (or current) fat last. @ */ @ - if (pmp->pm_flags & MSDOSFSMNT_WAITONFAT) @ + if (pmp->pm_mountp->mnt_flag & MNT_SYNCHRONOUS) @ bwrite(bp); @ else @ bdwrite(bp); Fixing the condition is more important for the primary FAT. The backups should probably always be written with async or even delayed writes. (async would be not much different from sync, since we don't check for success. It would still be very slow.) @ - /* @ - * Maybe update fsinfo sector here? 
@ -   /*
@ -    * Maybe update fsinfo sector here?
@ -    */
@ +
@ +   pmp->pm_fmod |= 1;

Unrelated changes. (I moved all fsinfo updates from here, and use
pm_fmod as a set of flags, with the 1 flag indicating the old (unused)
condition of a modified FAT.)

@ }
@ 
@ @@ -1097,5 +1085,5 @@
@  * manipulating the upper bit of the FAT entry for cluster 1. Note that
@  * this bit is not defined for FAT12 volumes, which are always assumed to
@ - * be dirty.
@ + * be clean.
@  *
@  * The fatentry() routine only works on cluster numbers that a file could

Vaguely related -- fix a backwards comment in markvoldirty().

markvoldirty() is too specialized. The bug wouldn't have existed if
updatefats() had been used instead of the direct bread()/bwrite() of a
single block in markvoldirty(). There are also some locking and
blocksize problems with this special i/o. However, I prefer not to write
all the backups for marking the volume dirty, and it was easy to keep
them marked clean in updatefats(). Writing the primary FAT when only it
is dirty is already 1 too many writes. So it is a feature that
markvoldirty() only writes 1 block.

Bruce

From owner-freebsd-fs@FreeBSD.ORG Fri Feb 1 19:24:38 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 5C6C27E6 for ; Fri, 1 Feb 2013 19:24:38 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id E7B77A73 for ; Fri, 1 Feb 2013 19:24:37 +0000 (UTC) Received: from server.rulingia.com (c220-239-236-213.belrs5.nsw.optusnet.com.au [220.239.236.213]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id r11JONrl039865 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sat, 2 Feb 2013 06:24:24 +1100 (EST) (envelope-from peter@rulingia.com) X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.14.5/8.14.5) with ESMTP id r11JOIN8027080 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 2 Feb 2013 06:24:18 +1100 (EST) (envelope-from peter@server.rulingia.com) Received: (from peter@localhost) by server.rulingia.com (8.14.5/8.14.5/Submit) id r11JOGLh027079; Sat, 2 Feb 2013 06:24:16 +1100 (EST) (envelope-from peter) Date: Sat, 2 Feb 2013 06:24:16 +1100 From: Peter Jeremy To: Kevin Day Subject: Re: Improving ZFS performance for large directories Message-ID: <20130201192416.GA76461@server.rulingia.com> References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="envbJBWh7q8WU6mo" Content-Disposition: inline In-Reply-To: X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Feb 2013 19:24:38 -0000 --envbJBWh7q8WU6mo Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On 2013-Jan-29 18:06:01 -0600, Kevin Day wrote:
>On Jan 29, 2013, at 5:42 PM, Matthew Ahrens wrote:
>> On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day wrote:
>> I'm prepared to try an L2arc cache device (with secondarycache=metadata),
>>
>> You might first see how long it takes when
>> everything is cached. E.g. by doing this in the same directory several
>> times. This will give you a lower bound on the time it will take (or
>> put another way, an upper bound on the improvement available from a
>> cache device).
>>
>
>Doing it twice back-to-back makes a bit of difference but it's still
>slow either way.

ZFS can be very conservative about caching data and twice might not be
enough. I suggest you try 8-10 times, or until the time stops reducing.

>I think some of the issue is that nothing is being allowed to stay
>cached long.

Well ZFS doesn't do any time-based eviction so if things aren't staying
in the cache, it's because they are being evicted by things that ZFS
considers more deserving. Looking at the zfs-stats you posted, it looks
like your workload has very low locality of reference (the data hitrate
is very low). If this is not what you expect then you need more RAM.

OTOH, your vfs.zfs.arc_meta_used being above vfs.zfs.arc_meta_limit
suggests that ZFS really wants to cache more metadata (by default ZFS
has a 25% metadata, 75% data split in ARC to prevent metadata caching
starving data caching). I would go even further than the 50:50 split
suggested later and try 75:25 (ie, triple the current
vfs.zfs.arc_meta_limit).

Note that if there is basically no locality of reference in your
workload (as I suspect), you can even turn off data caching for specific
filesystems with zfs set primarycache=metadata tank/foo (note that you
still need to increase vfs.zfs.arc_meta_limit to allow ZFS to use the
ARC to cache metadata).

-- 
Peter Jeremy

--envbJBWh7q8WU6mo
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlEMFmAACgkQ/opHv/APuIecWACgn5H+MWNyBmOSD6dCkZOrkIF7
mUgAn0tVC7elSQq2Z22FqQ5/wNi+0Fvn
=u4yZ
-----END PGP SIGNATURE-----

--envbJBWh7q8WU6mo--