From owner-freebsd-hackers@FreeBSD.ORG Sun Oct 7 10:40:56 2012 Return-Path: Delivered-To: hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 818B3106566C; Sun, 7 Oct 2012 10:40:56 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe09.c2i.net [212.247.155.2]) by mx1.freebsd.org (Postfix) with ESMTP id 864688FC08; Sun, 7 Oct 2012 10:40:54 +0000 (UTC) X-T2-Spam-Status: No, hits=-1.0 required=5.0 tests=ALL_TRUSTED Received: from [176.74.213.204] (account mc467741@c2i.net HELO laptop015.hselasky.homeunix.org) by mailfe09.swip.net (CommuniGate Pro SMTP 5.4.4) with ESMTPA id 154285814; Sun, 07 Oct 2012 12:40:46 +0200 From: Hans Petter Selasky To: freebsd-usb@freebsd.org, hackers@freebsd.org Date: Sun, 7 Oct 2012 12:42:11 +0200 User-Agent: KMail/1.13.7 (FreeBSD/9.1-PRERELEASE; KDE/4.8.4; amd64; ; ) References: <5070C1E9.5000405@sbcglobal.net> In-Reply-To: <5070C1E9.5000405@sbcglobal.net> X-Face: 'mmZ:T{)),Oru^0c+/}w'`gU1$ubmG?lp!=R4Wy\ELYo2)@'UZ24N@d2+AyewRX}mAm; Yp |U[@, _z/([?1bCfM{_"B<.J>mICJCHAzzGHI{y7{%JVz%R~yJHIji`y>Y}k1C4TfysrsUI -%GU9V5]iUZF&nRn9mJ'?&>O MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201210071242.11932.hselasky@c2i.net> Cc: bugs@freebsd.org, Jin Guojun , current@freebsd.org Subject: Re: 9.1-RCs issues X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Oct 2012 10:40:56 -0000 On Sunday 07 October 2012 01:42:33 Jin Guojun wrote: > 1) moused stops functioning on 9.1-RC2. Neither PS2 nor USB mouse can work. > 9.1-RC1 has no such problem. > > 2) All i386 / amd64 of 9.1-RC1/RC2 have USB read failure -- see dmesg > output at end of this email. > ada0 is internal SATA drive for system disk -- s# partitions: /, /tmp, > /var, /usr > s1 -- 6.4-Release > s2 -- 8.3-Release > s3 -- 9.1-RC2 amd64 > s4 -- 9.1-RC2 i386 -- This slice also contains /home > da0 is external USB2 drive (300GB) plugged in USB2 port -- mounted on /mnt > Regarding USB, it might be some patches did not reach it for the RC's. Have you tried 9-stable, or any 10-current snapshots? 
--HPS From owner-freebsd-hackers@FreeBSD.ORG Sun Oct 7 12:17:36 2012 Return-Path: Delivered-To: freebsd-hackers@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 42CA4106566C for ; Sun, 7 Oct 2012 12:17:36 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 930F38FC18 for ; Sun, 7 Oct 2012 12:17:34 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id PAA14196 for ; Sun, 07 Oct 2012 15:17:32 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1TKpnY-000873-Gy for freebsd-hackers@freebsd.org; Sun, 07 Oct 2012 15:17:32 +0300 Message-ID: <507172DA.2020309@FreeBSD.org> Date: Sun, 07 Oct 2012 15:17:30 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:15.0) Gecko/20120913 Thunderbird/15.0.1 MIME-Version: 1.0 To: freebsd-hackers X-Enigmail-Version: 1.4.3 Content-Type: text/plain; charset=X-VIET-VPS Content-Transfer-Encoding: 7bit Cc: Subject: machine/cpu.h in userland X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Oct 2012 12:17:36 -0000 I noticed that couple of our userland file include machine/cpu.h for no apparent good reason. It looks like at least amd64 cpu.h has no userland serviceable parts inside. Maybe its content should be put under _KERNEL ? Also: --- a/sbin/adjkerntz/adjkerntz.c +++ b/sbin/adjkerntz/adjkerntz.c @@ -51,7 +51,6 @@ __FBSDID("$FreeBSD$"); #include #include #include -#include #include #include "pathnames.h" diff --git a/usr.bin/w/w.c b/usr.bin/w/w.c index 8441ce5..9674fc2 100644 --- a/usr.bin/w/w.c +++ b/usr.bin/w/w.c @@ -57,7 +57,6 @@ static const char sccsid[] = "@(#)w.c 8.4 (Berkeley) 4/16/94"; #include #include -#include #include #include #include -- Andriy Gapon From owner-freebsd-hackers@FreeBSD.ORG Sun Oct 7 12:43:38 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id F27161065670; Sun, 7 Oct 2012 12:43:37 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe02.c2i.net [212.247.154.34]) by mx1.freebsd.org (Postfix) with ESMTP id 742E38FC12; Sun, 7 Oct 2012 12:43:35 +0000 (UTC) X-T2-Spam-Status: No, hits=-0.2 required=5.0 tests=ALL_TRUSTED, BAYES_50 Received: from [176.74.213.204] (account mc467741@c2i.net HELO laptop015.hselasky.homeunix.org) by mailfe02.swip.net (CommuniGate Pro SMTP 5.4.4) with ESMTPA id 329237632; Sun, 07 Oct 2012 14:43:27 +0200 From: Hans Petter Selasky To: freebsd-hackers@freebsd.org Date: Sun, 7 Oct 2012 14:44:52 +0200 User-Agent: KMail/1.13.7 (FreeBSD/9.1-PRERELEASE; KDE/4.8.4; amd64; ; ) References: <5070C1E9.5000405@sbcglobal.net> <201210071242.11932.hselasky@c2i.net> In-Reply-To: <201210071242.11932.hselasky@c2i.net> X-Face: 'mmZ:T{)),Oru^0c+/}w'`gU1$ubmG?lp!=R4Wy\ELYo2)@'UZ24N@d2+AyewRX}mAm; Yp |U[@, _z/([?1bCfM{_"B<.J>mICJCHAzzGHI{y7{%JVz%R~yJHIji`y>Y}k1C4TfysrsUI -%GU9V5]iUZF&nRn9mJ'?&>O MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: 
<201210071444.52845.hselasky@c2i.net> Cc: bugs@freebsd.org, hackers@freebsd.org, Jin Guojun , freebsd-usb@freebsd.org, current@freebsd.org Subject: Re: 9.1-RCs issues X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Oct 2012 12:43:38 -0000 On Sunday 07 October 2012 12:42:11 Hans Petter Selasky wrote: > On Sunday 07 October 2012 01:42:33 Jin Guojun wrote: > > 1) moused stops functioning on 9.1-RC2. Neither PS2 nor USB mouse can > > work. > > > > 9.1-RC1 has no such problem. > > > > 2) All i386 / amd64 of 9.1-RC1/RC2 have USB read failure -- see dmesg > > output at end of this email. > > ada0 is internal SATA drive for system disk -- s# partitions: /, /tmp, > > /var, /usr > > > > s1 -- 6.4-Release > > s2 -- 8.3-Release > > s3 -- 9.1-RC2 amd64 > > s4 -- 9.1-RC2 i386 -- This slice also contains /home > > > > da0 is external USB2 drive (300GB) plugged in USB2 port -- mounted on > > /mnt > > Regarding USB, it might be some patches did not reach it for the RC's. Have > you tried 9-stable, or any 10-current snapshots? > > --HPS s/reach/make --HPS
From owner-freebsd-hackers@FreeBSD.ORG Sun Oct 7 14:11:23 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 75786106564A; Sun, 7 Oct 2012 14:11:23 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id CBE7D8FC0A; Sun, 7 Oct 2012 14:11:22 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q97EBUU8023980; Sun, 7 Oct 2012 17:11:30 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q97EBIuV004831; Sun, 7 Oct 2012 17:11:18 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q97EBIRA004830; Sun, 7 Oct 2012 17:11:18 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sun, 7 Oct 2012 17:11:18 +0300 From: Konstantin Belousov To: Andriy Gapon Message-ID: <20121007141118.GW35915@deviant.kiev.zoral.com.ua> References: <507172DA.2020309@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="CLy2MrvMHpW9mjhY" Content-Disposition: inline In-Reply-To: <507172DA.2020309@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-hackers Subject: Re: machine/cpu.h in userland X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Oct 2012 14:11:23 -0000 --CLy2MrvMHpW9mjhY Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Oct 07, 2012 at 03:17:30PM +0300, Andriy Gapon wrote: > > I noticed that couple of our userland file include machine/cpu.h for no apparent > good reason. It looks like at least amd64 cpu.h has no userland serviceable > parts inside. > Maybe its content should be put under _KERNEL ?
>=20 > Also: >=20 > --- a/sbin/adjkerntz/adjkerntz.c > +++ b/sbin/adjkerntz/adjkerntz.c > @@ -51,7 +51,6 @@ __FBSDID("$FreeBSD$"); > #include > #include > #include > -#include > #include >=20 > #include "pathnames.h" > diff --git a/usr.bin/w/w.c b/usr.bin/w/w.c > index 8441ce5..9674fc2 100644 > --- a/usr.bin/w/w.c > +++ b/usr.bin/w/w.c > @@ -57,7 +57,6 @@ static const char sccsid[] =3D "@(#)w.c 8.4 (Berkeley) = 4/16/94"; > #include > #include >=20 > -#include > #include > #include > #include Both proposals are obviously fine. Some research with git blame traces the include presence in the w.c to the 4.4 Lite import. What I do not understand is why do you spam lists instead of commmitting the obvious changes ? --CLy2MrvMHpW9mjhY Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAlBxjYYACgkQC3+MBN1Mb4gbZACgjj5XP4xZJPDVQkz/9t13TBA8 QDsAnR8v1n5IgnKkqOkgzq802v6pwA8m =e9Kw -----END PGP SIGNATURE----- --CLy2MrvMHpW9mjhY-- From owner-freebsd-hackers@FreeBSD.ORG Sun Oct 7 14:22:10 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DCD56106570F for ; Sun, 7 Oct 2012 14:22:10 +0000 (UTC) (envelope-from lists@eitanadler.com) Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by mx1.freebsd.org (Postfix) with ESMTP id A53368FC14 for ; Sun, 7 Oct 2012 14:22:10 +0000 (UTC) Received: by mail-pa0-f54.google.com with SMTP id bi1so3593847pad.13 for ; Sun, 07 Oct 2012 07:22:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=eitanadler.com; s=0xdeadbeef; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=63fq90DDONBRRVjHb2cXdj9MMe2ka3/CyO6LDH4bRAg=; b=sJ5M6Lf/OrFbr6oH0RwLsypPmnz+7rM9GcZbXnFa68aQhr+IYY4l6RWEhdypjq/akT juPZsLHA+Tw45fu1BGZuxYoQ1Q16CTbB2HalbueCPQEw9Hsu/81tK3GKrI21Lvqg3LHC alEUauzfpLnoLLFj/VWgf+SJnjPMq3Yf6dIhs= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:x-gm-message-state; bh=63fq90DDONBRRVjHb2cXdj9MMe2ka3/CyO6LDH4bRAg=; b=EGxHc3PUhOjbOsLYeCrOyk+5oZ9xSlggKMYwITPwRhzo1qKOpOioPW0b7Ck0MGhp51 bGax1OEXQT0YQd0DNBd5UvxDtgGk/nykcisKu3GTEWs1d4nXvYge4qOIa06e3ABecomw XMwwbWg1VujfK4NJ6ZZ/yGPhbFcYSLK6NjhuKJo7lwqqb33MbuHVj3QxKzPzRS+/By1I vlRXynOLnMX8j+pr0b7pxyu8Rb4KCGFI2yLnGLQbPURCwNR/LGZiuF9m+pg4MxCREoOV ZziApng9lO70PPuM8VHUWYigXGRFR/3ZUVEed3jsNnuUyo7mhBiXFDswbKkgOc9DBMcH SiIg== Received: by 10.66.86.2 with SMTP id l2mr36225740paz.70.1349619729740; Sun, 07 Oct 2012 07:22:09 -0700 (PDT) MIME-Version: 1.0 Received: by 10.66.161.163 with HTTP; Sun, 7 Oct 2012 07:21:38 -0700 (PDT) In-Reply-To: <20121007141118.GW35915@deviant.kiev.zoral.com.ua> References: <507172DA.2020309@FreeBSD.org> <20121007141118.GW35915@deviant.kiev.zoral.com.ua> From: Eitan Adler Date: Sun, 7 Oct 2012 10:21:38 -0400 Message-ID: To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 X-Gm-Message-State: ALoCoQlEapypI34uD0bu7BTbt6yRBnCicdOjcySGyBZ0UiyYUohLm6Iop5IzH0HiX40hvBWizslT Cc: freebsd-hackers , Andriy Gapon Subject: Re: machine/cpu.h in userland X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Oct 2012 14:22:11 -0000 On 7 October 2012 10:11, 
Konstantin Belousov wrote: > What I do not understand is why do you spam lists instead of > commmitting the obvious changes ? It is not spam to ask for review. He was uncertain, and asked for some clarification. -- Eitan Adler From owner-freebsd-hackers@FreeBSD.ORG Sun Oct 7 17:06:27 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E5C9B1065670 for ; Sun, 7 Oct 2012 17:06:27 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [89.206.35.99]) by mx1.freebsd.org (Postfix) with ESMTP id 19AEB8FC08 for ; Sun, 7 Oct 2012 17:06:26 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id q97H0RMq016006; Sun, 7 Oct 2012 19:00:27 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id q97H0QL4016003; Sun, 7 Oct 2012 19:00:27 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Sun, 7 Oct 2012 19:00:26 +0200 (CEST) From: Wojciech Puchar To: Brandon Falk In-Reply-To: <5069C9FC.6020400@brandonfa.lk> Message-ID: References: <5069C9FC.6020400@brandonfa.lk> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Sun, 07 Oct 2012 19:00:27 +0200 (CEST) Cc: freebsd-hackers@freebsd.org Subject: Re: SMP Version of tar X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Oct 2012 17:06:28 -0000 > I would be willing to work on a SMP version of tar (initially just gzip or > something). > > I don't have the best experience in compression, and how to multi-thread it, > but I think I would be able to learn and help out. gzip cannot - it is single stream. bzip2 - no idea grzip (from ports, i use it) - can be multithreaded as it packs using fixed large chunks. 
From owner-freebsd-hackers@FreeBSD.ORG Sun Oct 7 17:51:55 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DE47B1065728 for ; Sun, 7 Oct 2012 17:51:55 +0000 (UTC) (envelope-from tim@kientzle.com) Received: from monday.kientzle.com (99-115-135-74.uvs.sntcca.sbcglobal.net [99.115.135.74]) by mx1.freebsd.org (Postfix) with ESMTP id B86A48FC1C for ; Sun, 7 Oct 2012 17:51:55 +0000 (UTC) Received: (from root@localhost) by monday.kientzle.com (8.14.4/8.14.4) id q97HpkUQ017185; Sun, 7 Oct 2012 17:51:46 GMT (envelope-from tim@kientzle.com) Received: from [192.168.2.143] (CiscoE3000 [192.168.1.65]) by kientzle.com with SMTP id 64ke244kk3upjc7r2g7je68tq6; Sun, 07 Oct 2012 17:51:46 +0000 (UTC) (envelope-from tim@kientzle.com) Mime-Version: 1.0 (Apple Message framework v1278) Content-Type: text/plain; charset=us-ascii From: Tim Kientzle In-Reply-To: Date: Sun, 7 Oct 2012 10:52:27 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> References: <5069C9FC.6020400@brandonfa.lk> To: Wojciech Puchar X-Mailer: Apple Mail (2.1278) Cc: freebsd-hackers@freebsd.org, Brandon Falk Subject: Re: SMP Version of tar X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Oct 2012 17:51:56 -0000 On Oct 7, 2012, at 10:00 AM, Wojciech Puchar wrote: >> I would be willing to work on a SMP version of tar (initially just gzip or something). >> >> I don't have the best experience in compression, and how to multi-thread it, but I think I would be able to learn and help out. > > gzip cannot - it is single stream. gunzip commutes with cat, so gzip compression can be multi-threaded by compressing separate blocks and concatenating the result. For proof, look at Mark Adler's pigz program, which does exactly this. GZip decompression is admittedly trickier. > bzip2 - no idea bzip2 is block oriented and can be multi-threaded for both compression and decompression.
Tim From owner-freebsd-hackers@FreeBSD.ORG Mon Oct 8 06:36:40 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DCF6D1065672 for ; Mon, 8 Oct 2012 06:36:40 +0000 (UTC) (envelope-from trond@fagskolen.gjovik.no) Received: from smtp.fagskolen.gjovik.no (smtp.fagskolen.gjovik.no [IPv6:2001:700:1100:1:200:ff:fe00:b]) by mx1.freebsd.org (Postfix) with ESMTP id 5E5F88FC0A for ; Mon, 8 Oct 2012 06:36:38 +0000 (UTC) Received: from mail.fig.ol.no (localhost [127.0.0.1]) by mail.fig.ol.no (8.14.5/8.14.5) with ESMTP id q986aUKx069523 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 8 Oct 2012 08:36:30 +0200 (CEST) (envelope-from trond@fagskolen.gjovik.no) Received: from localhost (trond@localhost) by mail.fig.ol.no (8.14.5/8.14.5/Submit) with ESMTP id q986aTrZ069520; Mon, 8 Oct 2012 08:36:30 +0200 (CEST) (envelope-from trond@fagskolen.gjovik.no) X-Authentication-Warning: mail.fig.ol.no: trond owned process doing -bs Date: Mon, 8 Oct 2012 08:36:29 +0200 (CEST) From: =?ISO-8859-1?Q?Trond_Endrest=F8l?= Sender: Trond.Endrestol@fagskolen.gjovik.no To: Wojciech Puchar In-Reply-To: Message-ID: References: <5069C9FC.6020400@brandonfa.lk> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) Organization: Fagskolen Innlandet OpenPGP: url=http://fig.ol.no/~trond/trond.key MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="2055831798-1630032467-1349678190=:36463" X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on mail.fig.ol.no Cc: freebsd-hackers@freebsd.org, Brandon Falk Subject: Re: SMP Version of tar X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 08 Oct 2012 06:36:41 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --2055831798-1630032467-1349678190=:36463 Content-Type: TEXT/PLAIN; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT On Sun, 7 Oct 2012 19:00+0200, Wojciech Puchar wrote: > > I would be willing to work on a SMP version of tar (initially just gzip or > > something). > > > > I don't have the best experience in compression, and how to multi-thread it, > > but I think I would be able to learn and help out. > > gzip cannot - it is single stream. > > bzip2 - no idea Check out archivers/pbzip2. > grzip (from ports, i use it) - can be multithreaded as it packs using fixed > large chunks. -- +-------------------------------+------------------------------------+ | Vennlig hilsen, | Best regards, | | Trond Endrestøl, | Trond Endrestøl, | | IT-ansvarlig, | System administrator, | | Fagskolen Innlandet, | Gjøvik Technical College, Norway, | | tlf. mob. 952 62 567, | Cellular...: +47 952 62 567, | | sentralbord 61 14 54 00. | Switchboard: +47 61 14 54 00. 
| +-------------------------------+------------------------------------+ --2055831798-1630032467-1349678190=:36463-- From owner-freebsd-hackers@FreeBSD.ORG Mon Oct 8 06:38:44 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5099E106566B for ; Mon, 8 Oct 2012 06:38:44 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [89.206.35.99]) by mx1.freebsd.org (Postfix) with ESMTP id A56528FC0A for ; Mon, 8 Oct 2012 06:38:43 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id q986cY0E003668; Mon, 8 Oct 2012 08:38:34 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id q986cYpm003665; Mon, 8 Oct 2012 08:38:34 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Mon, 8 Oct 2012 08:38:33 +0200 (CEST) From: Wojciech Puchar To: Tim Kientzle In-Reply-To: <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> Message-ID: References: <5069C9FC.6020400@brandonfa.lk> <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Mon, 08 Oct 2012 08:38:34 +0200 (CEST) Cc: freebsd-hackers@freebsd.org, Brandon Falk Subject: Re: SMP Version of tar X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 08 Oct 2012 06:38:44 -0000 >> gzip cannot - it is single stream. > > gunzip commutes with cat, so gzip > compression can be multi-threaded > by compressing separate blocks and > concatenating the result. right. but resulting file format must be different. 
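A concrete illustration of the point above, not taken from the thread: with stock zlib, each block can be compressed into an independent gzip member (deflateInit2() with windowBits 15+16 selects the gzip wrapper), and the members are simply appended to one file. The result is still a file that plain gunzip(1) accepts, which is how pigz works, although the bytes differ from what a single-stream gzip would emit because the dictionary restarts at every member. The file name and helper below are invented for the example; in a pigz-style tool each compress_member() call would run in its own worker thread.

/*
 * Sketch: independent gzip members, concatenated into one .gz file.
 * Build with: cc -o gz_members gz_members.c -lz
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

static int
compress_member(unsigned char *in, size_t inlen,
    unsigned char *out, size_t outcap, size_t *outlen)
{
	z_stream zs;

	memset(&zs, 0, sizeof(zs));
	/* windowBits 15+16 asks zlib for a gzip wrapper, not raw zlib. */
	if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
	    15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
		return (-1);
	zs.next_in = in;
	zs.avail_in = (uInt)inlen;
	zs.next_out = out;
	zs.avail_out = (uInt)outcap;
	/* One-shot compression of this block into its own member. */
	if (deflate(&zs, Z_FINISH) != Z_STREAM_END) {
		(void)deflateEnd(&zs);
		return (-1);
	}
	*outlen = outcap - zs.avail_out;
	return (deflateEnd(&zs) == Z_OK ? 0 : -1);
}

int
main(void)
{
	/* In a pigz-style tool each block would be compressed by a
	 * worker thread; they are done sequentially here for brevity. */
	char block_a[] = "block one, could come from thread A\n";
	char block_b[] = "block two, could come from thread B\n";
	char *blocks[] = { block_a, block_b };
	unsigned char out[4096];	/* real code would size via deflateBound() */
	size_t outlen;
	FILE *fp;
	int i;

	if ((fp = fopen("members.gz", "wb")) == NULL)
		return (1);
	for (i = 0; i < 2; i++) {
		if (compress_member((unsigned char *)blocks[i],
		    strlen(blocks[i]), out, sizeof(out), &outlen) != 0)
			return (1);
		fwrite(out, 1, outlen, fp);	/* append the member as-is */
	}
	fclose(fp);
	/* "gunzip -c members.gz" prints both blocks, in order. */
	return (0);
}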
From owner-freebsd-hackers@FreeBSD.ORG Mon Oct 8 08:45:29 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A32A1106564A for ; Mon, 8 Oct 2012 08:45:29 +0000 (UTC) (envelope-from roam@ringlet.net) Received: from nimbus.fccf.net (nimbus.fccf.net [77.77.144.35]) by mx1.freebsd.org (Postfix) with ESMTP id 4BE678FC1B for ; Mon, 8 Oct 2012 08:45:29 +0000 (UTC) Received: from straylight.m.ringlet.net (unknown [78.90.13.150]) by nimbus.fccf.net (Postfix) with ESMTPSA id 03BC1952 for ; Mon, 8 Oct 2012 11:38:16 +0300 (EEST) Received: from roam (uid 1000) (envelope-from roam@ringlet.net) id 316002 by straylight.m.ringlet.net (DragonFly Mail Agent); Mon, 08 Oct 2012 11:38:14 +0300 Date: Mon, 8 Oct 2012 11:38:14 +0300 From: Peter Pentchev To: Wojciech Puchar Message-ID: <20121008083814.GA5830@straylight.m.ringlet.net> Mail-Followup-To: Wojciech Puchar , Tim Kientzle , freebsd-hackers@freebsd.org, Brandon Falk References: <5069C9FC.6020400@brandonfa.lk> <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="n8g4imXOkfNTN/H1" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Brandon Falk , freebsd-hackers@freebsd.org Subject: Re: SMP Version of tar X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 08 Oct 2012 08:45:29 -0000 --n8g4imXOkfNTN/H1 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Oct 08, 2012 at 08:38:33AM +0200, Wojciech Puchar wrote: > >>gzip cannot - it is single stream. > > > >gunzip commutes with cat, so gzip > >compression can be multi-threaded > >by compressing separate blocks and > >concatenating the result. >=20 > right. but resulting file format must be different. Not necessarily. If I understand correctly what Tim means, he's talking about an in-memory compression of several blocks by several separate threads, and then - after all the threads have compressed their respective blocks - writing out the result to the output file in order. Of course, this would incur a small penalty in that the dictionary would not be reused between blocks, but it might still be worth it. G'luck, Peter --=20 Peter Pentchev roam@ringlet.net roam@FreeBSD.org peter@packetscale.com PGP key: http://people.FreeBSD.org/~roam/roam.key.asc Key fingerprint FDBA FD79 C26F 3C51 C95E DF9E ED18 B68D 1619 4553 Hey, out there - is it *you* reading me, or is it someone else? 
--n8g4imXOkfNTN/H1 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: Digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAEBCAAGBQJQcpDwAAoJEGUe77AlJ98TLBMP/jQ74ESXef5g/Uedklzi/PXI wsgP8BFBzHwldymZnH/lRMYKLUbjYka+HIrf/hrdLBRVu4/uyYP5+3aYD2DuFxHP gONtqrBo9FSuXVxk9fB8tfldoM4rudovgBZbUHkm+mONRtMkyQ4diBEvLnJHUKmL oiphw/QjOUveuxssnFiOBVu9x07yWORNNarVT4xl7otjhL+G7aapvU+NqVvSidzG aq8ftYAgo1npyoZubSVb0KHHASRAryLz3iMSW3tJSg9mMbReZbxZ60no0X3X0c8Y 9fs8gP3eH2T2R8rxh/A9+ursgC/gSDNsSIQo3ta0eJ+Rp9U+7il3Y3K7BlsltmNg yxdhQjF6PRDCpt3KGS10oijNdHpmKrOGBH0pY9nJoDUlSYGIjScHlqX7dY4vbtLO R+3w9f33iowMWG1skY0fcbCZnljpQyqIwRiC1iCLDn/qpPAyG9bw4ZAdfbF27P7d sEUaFe2Sj5hEoDkLuArXOIcOokLNQhGcf5nZmg9uCgbnHibfk65d053L7zeexGqQ oxBl63HHx/Xh25qEzndfVrDahDgxS8+vsU5BKlA12VPBq7Kg1CB+pFKme7jHaFcW JjtVU39/ml/pkINEMw5HL/T79HdrN2I4jkiWKlCsq3jsySKVH8pcEA8+Og82nvcD lGHdNT7Zd3X0qM90dix9 =yNTU -----END PGP SIGNATURE----- --n8g4imXOkfNTN/H1-- From owner-freebsd-hackers@FreeBSD.ORG Mon Oct 8 10:22:09 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5779A1065673 for ; Mon, 8 Oct 2012 10:22:09 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [89.206.35.99]) by mx1.freebsd.org (Postfix) with ESMTP id AA82F8FC12 for ; Mon, 8 Oct 2012 10:22:08 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id q98ALxHI004680; Mon, 8 Oct 2012 12:21:59 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id q98ALx0w004677; Mon, 8 Oct 2012 12:21:59 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Mon, 8 Oct 2012 12:21:59 +0200 (CEST) From: Wojciech Puchar To: Peter Pentchev In-Reply-To: <20121008083814.GA5830@straylight.m.ringlet.net> Message-ID: References: <5069C9FC.6020400@brandonfa.lk> <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> <20121008083814.GA5830@straylight.m.ringlet.net> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Mon, 08 Oct 2012 12:21:59 +0200 (CEST) Cc: freebsd-hackers@freebsd.org, Brandon Falk Subject: Re: SMP Version of tar X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 08 Oct 2012 10:22:09 -0000 > Not necessarily. If I understand correctly what Tim means, he's talking > about an in-memory compression of several blocks by several separate > threads, and then - after all the threads have compressed their but gzip format is single stream. dictionary IMHO is not reset every X kilobytes. parallel gzip is possible but not with same data format. By the way in my opinion grzip (in ports/archivers/grzip) is one of the most under-valued software. almost unknown. compresses faster than bzip2, with better results, it is very simple and making parallel version is trivial - there is just a procedure to compress single block. 
Personally i use gzip for fast compression and grzip for strong compression, and don't use bzip2 at all From owner-freebsd-hackers@FreeBSD.ORG Mon Oct 8 16:11:35 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A6E96106566B; Mon, 8 Oct 2012 16:11:35 +0000 (UTC) (envelope-from marcel@xcllnt.net) Received: from mail.xcllnt.net (mail.xcllnt.net [70.36.220.4]) by mx1.freebsd.org (Postfix) with ESMTP id 5DFBB8FC0A; Mon, 8 Oct 2012 16:11:35 +0000 (UTC) Received: from sa-nc-cs-116.static.jnpr.net (natint3.juniper.net [66.129.224.36]) (authenticated bits=0) by mail.xcllnt.net (8.14.5/8.14.5) with ESMTP id q98GBW7O082253 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO); Mon, 8 Oct 2012 09:11:33 -0700 (PDT) (envelope-from marcel@xcllnt.net) Content-Type: text/plain; charset=iso-8859-1 Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) From: Marcel Moolenaar In-Reply-To: Date: Mon, 8 Oct 2012 09:11:29 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <127FA63D-8EEE-4616-AE1E-C39469DDCC6A@xcllnt.net> References: <201210020750.23358.jhb@freebsd.org> <201210021037.27762.jhb@freebsd.org> To: Garrett Cooper X-Mailer: Apple Mail (2.1499) Cc: freebsd-hackers@freebsd.org, freebsd-arch@freebsd.org, "Simon J. Gerraty" Subject: Re: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 08 Oct 2012 16:11:35 -0000 On Oct 4, 2012, at 9:42 AM, Garrett Cooper wrote: >>> Both parties (Isilon/Juniper) are converging on the ATF porting work >>> that Giorgos/myself have done after talking at the FreeBSD = Foundation >>> meet-n-greet. I have contributed all of the patches that I have = other >>> to marcel for feedback. >>=20 >> This is very non-obvious to the public at large (e.g. there was no = public >> response to one group's inquiry about the second ATF import for = example). >> Also, given that you had no idea that sgf@ and obrien@ were working = on >> importing NetBSD's bmake as a prerequisite for ATF, it seems that = whatever >> discussions were held were not very detailed at best. I think it = would be >> good to have the various folks working on ATF to at least summarize = the >> current state of things and sketch out some sort of plan or roadmap = for future >> work in a public forum (such as atf@, though a summary mail would be = quite >> appropriate for arch@). >=20 > I'm in part to blame for this. There was some discussion -- but not at > length; unfortunately no one from Juniper was present at the meet and > greet; the information I got was second hand; I didn't follow up to > figure out the exact details / clarify what I had in mind with the > appropriate parties. Hang on. I want in on the blame part! :-) Seriously: no-one is really to blame as far as I can see. We just had two independent efforts (ATF & bmake) and there was no indication that one would be greatly benefitted from the other. At least not to the point of creating a dependency. I just committed the bmake bits. It not only adds bmake to the build, but also includes the changes necessary to use bmake. With that in place it's easier to decide whether we want the dependency or not. 
Before we can switch permanently to bmake, we need to do the following first: 1. Request an EXP ports build with bmake as make(1). This should tell us the "damage" of switching to bmake for ports. 2. In parallel with 1: build www & docs with bmake and assess the damage 3. Fix all the damage Then: 4. Switch. It could be a while (many weeks) before we get to 4, so the question really is whether the people working on ATF are willing and able to build and install FreeBSD using WITH_BMAKE? --=20 Marcel Moolenaar marcel@xcllnt.net From owner-freebsd-hackers@FreeBSD.ORG Mon Oct 8 16:17:39 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7F083106580A for ; Mon, 8 Oct 2012 16:17:39 +0000 (UTC) (envelope-from lists@eitanadler.com) Received: from mail-da0-f54.google.com (mail-da0-f54.google.com [209.85.210.54]) by mx1.freebsd.org (Postfix) with ESMTP id 4D9C58FC12 for ; Mon, 8 Oct 2012 16:17:39 +0000 (UTC) Received: by mail-da0-f54.google.com with SMTP id z9so1794842dad.13 for ; Mon, 08 Oct 2012 09:17:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=eitanadler.com; s=0xdeadbeef; h=mime-version:from:date:message-id:subject:to:content-type; bh=HQ4PKGPFVfuTDVQcvy/m/Rq+8EL54vq2oyC4HGCRtKs=; b=FVA0/RRTpMIKAPD0MeB2ofw9J0UDeHjjP+miu8dmLMPAEMOfs1DjWsLGh7MWS8YA64 /mpJvjDoM2RYV2py8rVlUPU8qnBsqqs4/h+xa/ok4afsj+kufFJvvvjyi13B4T90KfrW /ufUAPB/1HuRIMfrV/hvnQTXXJxTajPYhIalI= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:from:date:message-id:subject:to:content-type :x-gm-message-state; bh=HQ4PKGPFVfuTDVQcvy/m/Rq+8EL54vq2oyC4HGCRtKs=; b=l+gUlrB92tXpnZ+Gz/mu7InkfNtShIHRAaWuX5nL30swUR2ULsxLHXImZexIX/JzFy YZpdIueqYS6YEyN7Hs60glkbZjZXdBiYXmVu2a4PSRlCUahEGlXsKHRg5v40XANXJp+L Mzu6hw2llIv0UGxoiEWDNXuQWrPsz8DonJkE7ZO9MlgORazip2PCbbvQSK8NRVn65zQ2 0JaiGVkiLouAGDAJ5HXMcrEyF1xXID31IZMwTbFtjKhejNv+ezxCgPmqkKaSzphsErr4 98w7eLks8idAqqdFEEqtoXf/XhH7NNh7/pJ70N+S1oLwJNEog12A3VrwUT3Rqcidp1ME /cig== Received: by 10.68.200.72 with SMTP id jq8mr54316898pbc.38.1349713058528; Mon, 08 Oct 2012 09:17:38 -0700 (PDT) MIME-Version: 1.0 Received: by 10.66.161.163 with HTTP; Mon, 8 Oct 2012 09:17:08 -0700 (PDT) From: Eitan Adler Date: Mon, 8 Oct 2012 12:17:08 -0400 Message-ID: To: FreeBSD Hackers Content-Type: text/plain; charset=UTF-8 X-Gm-Message-State: ALoCoQnBsbR5BYdqjUYL9cz9fesooBQZzaDJMG8R+KY7j43Rth0wXNA0MQVf7jHMh7w5I0FKX3Xs Subject: -lpthread vs -pthread: does -D_REENTRANT matter? X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 08 Oct 2012 16:17:39 -0000 The only difference between -lpthread and -pthread that I could see is that the latter also sets -D_REENTRANT. However, I can't find any uses of _REENTRANT anywhere outside of a few utilities that seem to define it manually. Testing with various manually written pthread programs resulted in identical binaries, let alone identical results. Is there an actual difference between -pthread and -lpthread or is this just a historical artifact? 
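One way to check this empirically is to build a trivial pthread program both ways and compare; the sketch below is not from the thread and its file name is invented. It prints whether _REENTRANT was defined at compile time and exercises pthread_create(), so the two link lines can be diffed for both preprocessor and binary differences.

/*
 * Hypothetical test file pthr_test.c. Build it both ways and compare:
 *   cc -O2 -pthread -o t1 pthr_test.c
 *   cc -O2 -o t2 pthr_test.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>

static void *
worker(void *arg)
{
	(void)arg;
	printf("hello from the worker thread\n");
	return (NULL);
}

int
main(void)
{
	pthread_t tid;

#ifdef _REENTRANT
	printf("_REENTRANT is defined\n");
#else
	printf("_REENTRANT is not defined\n");
#endif
	if (pthread_create(&tid, NULL, worker, NULL) != 0)
		return (1);
	pthread_join(tid, NULL);
	return (0);
}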
-- Eitan Adler From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 06:30:41 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id EC1381019 for ; Tue, 9 Oct 2012 06:30:41 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [89.206.35.99]) by mx1.freebsd.org (Postfix) with ESMTP id 296758FC19 for ; Tue, 9 Oct 2012 06:30:40 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id q996R9CM013488; Tue, 9 Oct 2012 08:27:09 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id q996R9XU013485; Tue, 9 Oct 2012 08:27:09 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Tue, 9 Oct 2012 08:27:09 +0200 (CEST) From: Wojciech Puchar To: Peter Pentchev Subject: Re: SMP Version of tar In-Reply-To: <20121008083814.GA5830@straylight.m.ringlet.net> Message-ID: References: <5069C9FC.6020400@brandonfa.lk> <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> <20121008083814.GA5830@straylight.m.ringlet.net> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Tue, 09 Oct 2012 08:27:10 +0200 (CEST) Cc: freebsd-hackers@freebsd.org, Brandon Falk X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 06:30:42 -0000 > Not necessarily. If I understand correctly what Tim means, he's talking > about an in-memory compression of several blocks by several separate > threads, and then - after all the threads have compressed their > respective blocks - writing out the result to the output file in order. > Of course, this would incur a small penalty in that the dictionary would > not be reused between blocks, but it might still be worth it. all fine. i just wanted to point out that ungzipping normal standard gzip file cannot be multithreaded, and multithreaded-compressed gzip output would be different. From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 10:46:55 2012 Return-Path: Delivered-To: hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 596ACF5C for ; Tue, 9 Oct 2012 10:46:55 +0000 (UTC) (envelope-from danny@cs.huji.ac.il) Received: from kabab.cs.huji.ac.il (kabab.cs.huji.ac.il [132.65.16.84]) by mx1.freebsd.org (Postfix) with ESMTP id 00AE18FC18 for ; Tue, 9 Oct 2012 10:46:54 +0000 (UTC) Received: from pampa.cs.huji.ac.il ([132.65.80.32]) by kabab.cs.huji.ac.il with esmtp id 1TLXKv-0003ON-5r for hackers@freebsd.org; Tue, 09 Oct 2012 12:46:53 +0200 X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.3 To: hackers@freebsd.org Subject: Re: problem cross-compiling 9.1 In-reply-to: References: Comments: In-reply-to Warner Losh message dated "Mon, 08 Oct 2012 14:55:51 -0600." 
Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Tue, 09 Oct 2012 12:46:53 +0200 From: Daniel Braniss Message-ID: X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 10:46:55 -0000 [snip] > any fix? > > You have found the fix. Remove the WITHOUT_XXXX options from the build that keep it from completing. You'll be able to add them at installworld time w/o a hassle. nanobsd uses this to keep things down, while still being able to build the system. > > Warner > where can I find the with/without list? btw, I did look at nanobsd in the past and have borrowed some ideas :-) thanks, danny From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 13:36:07 2012 Return-Path: Delivered-To: hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9F6BF8C4 for ; Tue, 9 Oct 2012 13:36:07 +0000 (UTC) (envelope-from yanegomi@gmail.com) Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by mx1.freebsd.org (Postfix) with ESMTP id 6A0798FC19 for ; Tue, 9 Oct 2012 13:36:07 +0000 (UTC) Received: by mail-pa0-f54.google.com with SMTP id bi1so5699437pad.13 for ; Tue, 09 Oct 2012 06:36:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=references:mime-version:in-reply-to:content-type :content-transfer-encoding:message-id:cc:x-mailer:from:subject:date :to; bh=pGmQa7qJmi2BEJYIbTgTgzLwFlXfA65dzZPixR8ihWU=; b=TulAXWf4deVAlI/twHshHtwkvK2HyWlwUBLouC7wDHMt5M/Obf4mjLTbmDezcHXr12 DR3dDGYTCdt1TRc+eUlKXlkbK/ZMy8FjJz8IKdWx/0rmS4YOx1hqimNibOA9DYv/AU6P Nf5E5/NtGspSbkwLydDonHVfJdnbHaP7d3WTJgCqApDQ4rgzOKPQbrTZ3mxGSVQjG6x/ azlbqT2OXcQ7F4MB5miuz5OrgkcygWxqEncBVAFuqVbsIyscHicpOdWmxXW6pTmGwbks HTy2EVUUfCrECv2xc505hgmQDJM1vePqnl1u/X/jw/YUl+V77n4wLTvuW0rSzoyhVV04 GBCw== Received: by 10.68.234.7 with SMTP id ua7mr64410041pbc.91.1349789761138; Tue, 09 Oct 2012 06:36:01 -0700 (PDT) Received: from [192.168.20.12] (c-24-19-191-56.hsd1.wa.comcast.net. [24.19.191.56]) by mx.google.com with ESMTPS id oi2sm11178812pbb.62.2012.10.09.06.35.59 (version=SSLv3 cipher=OTHER); Tue, 09 Oct 2012 06:35:59 -0700 (PDT) References: Mime-Version: 1.0 (1.0) In-Reply-To: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Message-Id: <9036FCFF-AAEA-4D20-AD73-10726F81440A@gmail.com> X-Mailer: iPhone Mail (10A403) From: Garrett Cooper Subject: Re: problem cross-compiling 9.1 Date: Tue, 9 Oct 2012 06:35:57 -0700 To: Daniel Braniss Cc: "hackers@freebsd.org" X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 13:36:07 -0000 On Oct 9, 2012, at 3:46 AM, Daniel Braniss wrote: > [snip] >> any fix? >>> You have found the fix. Remove the WITHOUT_XXXX options from the build t= hat keep it from completing. You'll be able to add them at installworld tim= e w/o a hassle. nanobsd uses this to keep things down, while still being ab= le to build the system. >>> Warner > where can I find the with/without list? > btw, I did look at nanobsd in the past and have borrowed some ideas :-) man make.conf and man src.conf, then read through bsd.own.mk if interested i= n knowing what exactly can be used. HTH! 
-Garrett= From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 14:12:48 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id AF7B86D1; Tue, 9 Oct 2012 14:12:48 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by mx1.freebsd.org (Postfix) with ESMTP id C6A3B8FC0C; Tue, 9 Oct 2012 14:12:47 +0000 (UTC) Received: by mail-wi0-f178.google.com with SMTP id hr7so4072576wib.13 for ; Tue, 09 Oct 2012 07:12:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; bh=j14Cl+Whi217hvrLClyuARGwY6JBTXtO+V8PPUZFYbM=; b=WOeCJHK9XhIClbeeqshTnOfQME2/diIiIVemFITKROgexWSI4Kst4Z9PIl19qW7Aax Yio2uKsESDfqM3XkQgix6c856w0o4Ll/Qg3cBX86b25x7vE6dEcs/whksgmS3z+jHaT8 X4GmtcaNFZr/xfFfgcyX3Xg6vxlVPMEfmecLzR8ibOMRHsbWd+UIzI5sLLdk45ch3Pbs 5k8ByelmxiyBIpujDsnuE8OnI9o8aGHOHcBlWXhaWDZLhbc9V7JYrT41fbDXqh2i7J6s Bnjoy9cgyNRO9WLfmtuqAZWUBiwY3ynRKlr3enPVCFuMW0j8H01mNcxJsMJIxg+sNDbo YH1g== Received: by 10.217.2.146 with SMTP id p18mr12374882wes.198.1349791966518; Tue, 09 Oct 2012 07:12:46 -0700 (PDT) Received: from ndenevsa.sf.moneybookers.net (g1.moneybookers.com. [217.18.249.148]) by mx.google.com with ESMTPS id gg4sm28310910wib.6.2012.10.09.07.12.44 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 09 Oct 2012 07:12:45 -0700 (PDT) Subject: Re: NFS server bottlenecks Mime-Version: 1.0 (Mac OS X Mail 6.1 \(1498\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca> Date: Tue, 9 Oct 2012 17:12:43 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: References: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca> To: freebsd-hackers@freebsd.org X-Mailer: Apple Mail (2.1498) Cc: rmacklem@freebsd.org, Garrett Wollman X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 14:12:48 -0000 On Oct 4, 2012, at 12:36 AM, Rick Macklem wrote: > Garrett Wollman wrote: >> <> said: >>=20 >>>> Simple: just use a sepatate mutex for each list that a cache entry >>>> is on, rather than a global lock for everything. This would reduce >>>> the mutex contention, but I'm not sure how significantly since I >>>> don't have the means to measure it yet. >>>>=20 >>> Well, since the cache trimming is removing entries from the lists, I >>> don't >>> see how that can be done with a global lock for list updates? >>=20 >> Well, the global lock is what we have now, but the cache trimming >> process only looks at one list at a time, so not locking the list = that >> isn't being iterated over probably wouldn't hurt, unless there's some >> mechanism (that I didn't see) for entries to move from one list to >> another. Note that I'm considering each hash bucket a separate >> "list". (One issue to worry about in that case would be cache-line >> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE >> ought to be increased to reduce that.) >>=20 > Yea, a separate mutex for each hash list might help. There is also the > LRU list that all entries end up on, that gets used by the trimming = code. > (I think? 
I wrote this stuff about 8 years ago, so I haven't looked at > it in a while.) >=20 > Also, increasing the hash table size is probably a good idea, = especially > if you reduce how aggressively the cache is trimmed. >=20 >>> Only doing it once/sec would result in a very large cache when >>> bursts of >>> traffic arrives. >>=20 >> My servers have 96 GB of memory so that's not a big deal for me. >>=20 > This code was originally "production tested" on a server with 1Gbyte, > so times have changed a bit;-) >=20 >>> I'm not sure I see why doing it as a separate thread will improve >>> things. >>> There are N nfsd threads already (N can be bumped up to 256 if you >>> wish) >>> and having a bunch more "cache trimming threads" would just increase >>> contention, wouldn't it? >>=20 >> Only one cache-trimming thread. The cache trim holds the (global) >> mutex for much longer than any individual nfsd service thread has any >> need to, and having N threads doing that in parallel is why it's so >> heavily contended. If there's only one thread doing the trim, then >> the nfsd service threads aren't spending time either contending on = the >> mutex (it will be held less frequently and for shorter periods). >>=20 > I think the little drc2.patch which will keep the nfsd threads from > acquiring the mutex and doing the trimming most of the time, might be > sufficient. I still don't see why a separate trimming thread will be > an advantage. I'd also be worried that the one cache trimming thread > won't get the job done soon enough. >=20 > When I did production testing on a 1Gbyte server that saw a peak > load of about 100RPCs/sec, it was necessary to trim aggressively. > (Although I'd be tempted to say that a server with 1Gbyte is no > longer relevant, I recently recall someone trying to run FreeBSD > on a i486, although I doubt they wanted to run the nfsd on it.) >=20 >>> The only negative effect I can think of w.r.t. having the nfsd >>> threads doing it would be a (I believe negligible) increase in RPC >>> response times (the time the nfsd thread spends trimming the cache). >>> As noted, I think this time would be negligible compared to disk I/O >>> and network transit times in the total RPC response time? >>=20 >> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G >> network connectivity, spinning on a contended mutex takes a >> significant amount of CPU time. (For the current design of the NFS >> server, it may actually be a win to turn off adaptive mutexes -- I >> should give that a try once I'm able to do more testing.) >>=20 > Have fun with it. Let me know when you have what you think is a good = patch. >=20 > rick >=20 >> -GAWollman >> _______________________________________________ >> freebsd-hackers@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >> To unsubscribe, send any mail to >> "freebsd-hackers-unsubscribe@freebsd.org" > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests = over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO = involved. 
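The per-bucket locking that Garrett Wollman suggests above can be pictured with a short userland sketch. This is not the FreeBSD nfsd DRC code: every identifier below is invented, pthread mutexes stand in for the kernel's mtx(9), and only the lookup path is shown. The point is simply that each hash list carries its own mutex, so nfsd threads touching different buckets never contend on one global lock, and a larger hash table (cf. NFSRVCACHE_HASHSIZE) spreads the remaining contention further.

/* Userland sketch of per-hash-bucket locking for a request cache. */
#include <pthread.h>
#include <stdint.h>
#include <sys/queue.h>

#define	DRC_HASHSIZE	500	/* illustrative; bigger table, less contention */

struct drc_entry {
	uint32_t		xid;
	LIST_ENTRY(drc_entry)	hash_link;
};

static struct drc_bucket {
	pthread_mutex_t		lock;	/* protects only this one list */
	LIST_HEAD(, drc_entry)	head;
} buckets[DRC_HASHSIZE];

static struct drc_bucket *
drc_bucket_for(uint32_t xid)
{
	return (&buckets[xid % DRC_HASHSIZE]);
}

static void
drc_init(void)
{
	int i;

	for (i = 0; i < DRC_HASHSIZE; i++) {
		pthread_mutex_init(&buckets[i].lock, NULL);
		LIST_INIT(&buckets[i].head);
	}
}

/* Look up an entry while holding only its bucket's mutex. */
static struct drc_entry *
drc_lookup(uint32_t xid)
{
	struct drc_bucket *b = drc_bucket_for(xid);
	struct drc_entry *e, *found = NULL;

	pthread_mutex_lock(&b->lock);
	LIST_FOREACH(e, &b->head, hash_link) {
		if (e->xid == xid) {
			found = e;
			break;
		}
	}
	pthread_mutex_unlock(&b->lock);
	return (found);
}

int
main(void)
{
	drc_init();
	(void)drc_lookup(42);	/* empty cache: returns NULL */
	return (0);
}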
I've snatched some sample DTrace script from the net : [ = http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes = ] And modified it for our new NFS server : #!/usr/sbin/dtrace -qs=20 fbt:kernel:nfsrvd_*:entry { self->ts =3D timestamp;=20 @counts[probefunc] =3D count(); } fbt:kernel:nfsrvd_*:return / self->ts > 0 / { this->delta =3D (timestamp-self->ts)/1000000; } fbt:kernel:nfsrvd_*:return / self->ts > 0 && this->delta > 100 / { @slow[probefunc, "ms"] =3D lquantize(this->delta, 100, 500, 50); } fbt:kernel:nfsrvd_*:return / self->ts > 0 / { @dist[probefunc, "ms"] =3D quantize(this->delta); self->ts =3D 0; } END { printf("\n"); printa("function %-20s %@10d\n", @counts); printf("\n"); printa("function %s(), time in %s:%@d\n", @dist); printf("\n"); printa("function %s(), time in %s for >=3D 100 ms:%@d\n", = @slow); } And here's a sample output from one or two minutes during the run of = Oracle's ORION benchmark tool from a Linux machine, on a 32G file on NFS mount over 10G ethernet: [16:01]root@goliath:/home/ndenev# ./nfsrvd.d =20 ^C function nfsrvd_access 4 function nfsrvd_statfs 10 function nfsrvd_getattr 14 function nfsrvd_commit 76 function nfsrvd_sentcache 110048 function nfsrvd_write 110048 function nfsrvd_read 283648 function nfsrvd_dorpc 393800 function nfsrvd_getcache 393800 function nfsrvd_rephead 393800 function nfsrvd_updatecache 393800 function nfsrvd_access(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 =20 1 | 0 =20 function nfsrvd_statfs(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 =20 1 | 0 =20 function nfsrvd_getattr(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 =20 1 | 0 =20 function nfsrvd_sentcache(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 =20 1 | 0 =20 function nfsrvd_rephead(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 =20 1 | 0 =20 function nfsrvd_updatecache(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 =20 1 | 0 =20 function nfsrvd_getcache(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 =20 1 | 1 =20 2 | 0 =20 4 | 1 =20 8 | 0 =20 function nfsrvd_write(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 =20 1 | 5 =20 2 | 4 =20 4 | 0 =20 function nfsrvd_read(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 =20 1 | 19 =20 2 | 3 =20 4 | 2 =20 8 | 0 =20 16 | 1 =20 32 | 0 =20 64 | 0 =20 128 | 0 =20 256 | 1 =20 512 | 0 =20 function nfsrvd_commit(), time in ms: value ------------- Distribution ------------- count =20 -1 | 0 =20 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 =20 1 |@@@@@@@ 14 =20 2 | 0 =20 4 |@ 1 =20 8 |@ 1 =20 16 | 0 =20 32 |@@@@@@@ 14 =20 64 |@ 2 =20 128 | 0 =20 function nfsrvd_commit(), time in ms for >=3D 100 ms: value ------------- Distribution ------------- count =20 < 100 | 0 =20 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 =20 150 | 0 =20 function nfsrvd_read(), time in ms for >=3D 100 
ms: value ------------- Distribution ------------- count =20 250 | 0 =20 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 =20 350 | 0 =20 Looks like the nfs server cache functions are quite fast, but extremely = frequently called. I hope someone can find this information useful. From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 15:23:44 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E60A8ED5; Tue, 9 Oct 2012 15:23:44 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-wg0-f50.google.com (mail-wg0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id 1D2BE8FC1B; Tue, 9 Oct 2012 15:23:43 +0000 (UTC) Received: by mail-wg0-f50.google.com with SMTP id 16so4398578wgi.31 for ; Tue, 09 Oct 2012 08:23:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; bh=UGqIpUVIh4maA0DSdOTvB4Q5EkRwYYo7xq2/jBodgGo=; b=KoF9yXqS88pNipeojuA0HCRgJcmQXBhHm9mmf/FyM+y4Bk24X8TXAGCcgGOyQRo9DI +npJbCaLGyZae0QpQe4C5l8hNVMorEJOkeTGrMgT70ve+KOi9sSbH9BpsA5jCB1D/FWm juALvKsIlkDnoiqr+G4ZliTuIVc21HrzsWpXzWx17dDj0REY5Bu8yaIUyn8AXVQgXeMz Q11yDhzfeMfKwdFJ/HzTGOhPeSZhAG6zaMo+yw8glM1gsK5MmOYfOs1nIE8pmQ4BqFTu fZVf+bJ1QlI5wzBr/yuJ6pH1AlLNK0JuEtM5C/fS6KFOOzlTen/RebUoUM50C33xZQ5G RLag== Received: by 10.216.201.80 with SMTP id a58mr12447484weo.15.1349796222818; Tue, 09 Oct 2012 08:23:42 -0700 (PDT) Received: from ndenevsa.sf.moneybookers.net (g1.moneybookers.com. [217.18.249.148]) by mx.google.com with ESMTPS id cu1sm25356406wib.6.2012.10.09.08.23.39 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 09 Oct 2012 08:23:40 -0700 (PDT) Subject: Re: NFS server bottlenecks Mime-Version: 1.0 (Mac OS X Mail 6.1 \(1498\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: Date: Tue, 9 Oct 2012 18:23:37 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: <59B593A0-DA96-4395-B6B9-196F73A1415C@gmail.com> References: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca> To: freebsd-hackers@freebsd.org X-Mailer: Apple Mail (2.1498) Cc: rmacklem@freebsd.org, Garrett Wollman X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 15:23:45 -0000 On Oct 9, 2012, at 5:12 PM, Nikolay Denev wrote: >=20 > On Oct 4, 2012, at 12:36 AM, Rick Macklem = wrote: >=20 >> Garrett Wollman wrote: >>> <>> said: >>>=20 >>>>> Simple: just use a sepatate mutex for each list that a cache entry >>>>> is on, rather than a global lock for everything. This would reduce >>>>> the mutex contention, but I'm not sure how significantly since I >>>>> don't have the means to measure it yet. >>>>>=20 >>>> Well, since the cache trimming is removing entries from the lists, = I >>>> don't >>>> see how that can be done with a global lock for list updates? >>>=20 >>> Well, the global lock is what we have now, but the cache trimming >>> process only looks at one list at a time, so not locking the list = that >>> isn't being iterated over probably wouldn't hurt, unless there's = some >>> mechanism (that I didn't see) for entries to move from one list to >>> another. Note that I'm considering each hash bucket a separate >>> "list". 
(One issue to worry about in that case would be cache-line >>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE >>> ought to be increased to reduce that.) >>>=20 >> Yea, a separate mutex for each hash list might help. There is also = the >> LRU list that all entries end up on, that gets used by the trimming = code. >> (I think? I wrote this stuff about 8 years ago, so I haven't looked = at >> it in a while.) >>=20 >> Also, increasing the hash table size is probably a good idea, = especially >> if you reduce how aggressively the cache is trimmed. >>=20 >>>> Only doing it once/sec would result in a very large cache when >>>> bursts of >>>> traffic arrives. >>>=20 >>> My servers have 96 GB of memory so that's not a big deal for me. >>>=20 >> This code was originally "production tested" on a server with 1Gbyte, >> so times have changed a bit;-) >>=20 >>>> I'm not sure I see why doing it as a separate thread will improve >>>> things. >>>> There are N nfsd threads already (N can be bumped up to 256 if you >>>> wish) >>>> and having a bunch more "cache trimming threads" would just = increase >>>> contention, wouldn't it? >>>=20 >>> Only one cache-trimming thread. The cache trim holds the (global) >>> mutex for much longer than any individual nfsd service thread has = any >>> need to, and having N threads doing that in parallel is why it's so >>> heavily contended. If there's only one thread doing the trim, then >>> the nfsd service threads aren't spending time either contending on = the >>> mutex (it will be held less frequently and for shorter periods). >>>=20 >> I think the little drc2.patch which will keep the nfsd threads from >> acquiring the mutex and doing the trimming most of the time, might be >> sufficient. I still don't see why a separate trimming thread will be >> an advantage. I'd also be worried that the one cache trimming thread >> won't get the job done soon enough. >>=20 >> When I did production testing on a 1Gbyte server that saw a peak >> load of about 100RPCs/sec, it was necessary to trim aggressively. >> (Although I'd be tempted to say that a server with 1Gbyte is no >> longer relevant, I recently recall someone trying to run FreeBSD >> on a i486, although I doubt they wanted to run the nfsd on it.) >>=20 >>>> The only negative effect I can think of w.r.t. having the nfsd >>>> threads doing it would be a (I believe negligible) increase in RPC >>>> response times (the time the nfsd thread spends trimming the = cache). >>>> As noted, I think this time would be negligible compared to disk = I/O >>>> and network transit times in the total RPC response time? >>>=20 >>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G >>> network connectivity, spinning on a contended mutex takes a >>> significant amount of CPU time. (For the current design of the NFS >>> server, it may actually be a win to turn off adaptive mutexes -- I >>> should give that a try once I'm able to do more testing.) >>>=20 >> Have fun with it. Let me know when you have what you think is a good = patch. 
>>=20 >> rick >>=20 >>> -GAWollman >>> _______________________________________________ >>> freebsd-hackers@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>> To unsubscribe, send any mail to >>> "freebsd-hackers-unsubscribe@freebsd.org" >> _______________________________________________ >> freebsd-fs@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-fs >> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" >=20 > My quest for IOPS over NFS continues :) > So far I'm not able to achieve more than about 3000 8K read requests = over NFS, > while the server locally gives much more. > And this is all from a file that is completely in ARC cache, no disk = IO involved. >=20 > I've snatched some sample DTrace script from the net : [ = http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes = ] >=20 > And modified it for our new NFS server : >=20 > #!/usr/sbin/dtrace -qs=20 >=20 > fbt:kernel:nfsrvd_*:entry > { > self->ts =3D timestamp;=20 > @counts[probefunc] =3D count(); > } >=20 > fbt:kernel:nfsrvd_*:return > / self->ts > 0 / > { > this->delta =3D (timestamp-self->ts)/1000000; > } >=20 > fbt:kernel:nfsrvd_*:return > / self->ts > 0 && this->delta > 100 / > { > @slow[probefunc, "ms"] =3D lquantize(this->delta, 100, 500, 50); > } >=20 > fbt:kernel:nfsrvd_*:return > / self->ts > 0 / > { > @dist[probefunc, "ms"] =3D quantize(this->delta); > self->ts =3D 0; > } >=20 > END > { > printf("\n"); > printa("function %-20s %@10d\n", @counts); > printf("\n"); > printa("function %s(), time in %s:%@d\n", @dist); > printf("\n"); > printa("function %s(), time in %s for >=3D 100 ms:%@d\n", = @slow); > } >=20 > And here's a sample output from one or two minutes during the run of = Oracle's ORION benchmark > tool from a Linux machine, on a 32G file on NFS mount over 10G = ethernet: >=20 > [16:01]root@goliath:/home/ndenev# ./nfsrvd.d =20 > ^C >=20 > function nfsrvd_access 4 > function nfsrvd_statfs 10 > function nfsrvd_getattr 14 > function nfsrvd_commit 76 > function nfsrvd_sentcache 110048 > function nfsrvd_write 110048 > function nfsrvd_read 283648 > function nfsrvd_dorpc 393800 > function nfsrvd_getcache 393800 > function nfsrvd_rephead 393800 > function nfsrvd_updatecache 393800 >=20 > function nfsrvd_access(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 =20 > 1 | 0 =20 >=20 > function nfsrvd_statfs(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 =20 > 1 | 0 =20 >=20 > function nfsrvd_getattr(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 =20 > 1 | 0 =20 >=20 > function nfsrvd_sentcache(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 =20 > 1 | 0 =20 >=20 > function nfsrvd_rephead(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 =20 > 1 | 0 =20 >=20 > function nfsrvd_updatecache(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 =20 > 1 | 0 =20 >=20 > function nfsrvd_getcache(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 =20 > 1 | 1 =20 > 2 | 0 =20 > 4 | 1 =20 > 8 | 0 =20 >=20 > function nfsrvd_write(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 =20 > 1 | 5 =20 > 2 | 4 =20 > 4 | 0 =20 >=20 > function nfsrvd_read(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 =20 > 1 | 19 =20 > 2 | 3 =20 > 4 | 2 =20 > 8 | 0 =20 > 16 | 1 =20 > 32 | 0 =20 > 64 | 0 =20 > 128 | 0 =20 > 256 | 1 =20 > 512 | 0 =20 >=20 > function nfsrvd_commit(), time in ms: > value ------------- Distribution ------------- count =20 > -1 | 0 =20 > 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 =20 > 1 |@@@@@@@ 14 =20 > 2 | 0 =20 > 4 |@ 1 =20 > 8 |@ 1 =20 > 16 | 0 =20 > 32 |@@@@@@@ 14 =20 > 64 |@ 2 =20 > 128 | 0 =20 >=20 >=20 > function nfsrvd_commit(), time in ms for >=3D 100 ms: > value ------------- Distribution ------------- count =20 > < 100 | 0 =20 > 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 =20 > 150 | 0 =20 >=20 > function nfsrvd_read(), time in ms for >=3D 100 ms: > value ------------- Distribution ------------- count =20 > 250 | 0 =20 > 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 =20 > 350 | 0 =20 >=20 >=20 > Looks like the nfs server cache functions are quite fast, but = extremely frequently called. >=20 > I hope someone can find this information useful. >=20 Here's another quick one : #!/usr/sbin/dtrace -qs=20 #pragma D option quiet fbt:kernel:nfsrvd_*:entry { self->trace =3D 1; } fbt:kernel:nfsrvd_*:return / self->trace / { @calls[probefunc] =3D count(); } tick-1sec { printf("%40s | %s\n", "function", "calls per second"); printa("%40s %10@d\n", @calls); clear(@calls); printf("\n"); } Showing the number of calls per second to the nfsrvd_* functions. From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 15:35:15 2012 Return-Path: Delivered-To: hackers@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B81404E8 for ; Tue, 9 Oct 2012 15:35:15 +0000 (UTC) (envelope-from erik@cederstrand.dk) Received: from csmtp2.one.com (csmtp2.one.com [91.198.169.22]) by mx1.freebsd.org (Postfix) with ESMTP id 757198FC08 for ; Tue, 9 Oct 2012 15:35:14 +0000 (UTC) Received: from [192.168.1.18] (unknown [217.157.7.221]) by csmtp2.one.com (Postfix) with ESMTPA id B44F93018818 for ; Tue, 9 Oct 2012 15:35:07 +0000 (UTC) From: Erik Cederstrand Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Subject: time_t when used as timedelta Message-Id: <787F09EF-E3F7-467E-B023-B7846509D2A1@cederstrand.dk> Date: Tue, 9 Oct 2012 17:35:09 +0200 To: FreeBSD Hackers Mime-Version: 1.0 (Mac OS X Mail 6.0 \(1486\)) X-Mailer: Apple Mail (2.1486) X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 15:35:15 -0000 Hi list, I'm looking at this possible divide-by zero in dhclient: = http://scan.freebsd.your.org/freebsd-head/WORLD/2012-10-07-amd64/report-nB= hqE2.html.gz#EndPath In this specific case, it's obvious from the intention of the code that = ip->client->interval is always >0, but it's not obvious to me in the = code. 
I could add an assert before the possible divide-by-zero: assert(ip->client->interval > 0); But looking at the code, I'm not sure it's very elegant. = ip->client->interval is defined as time_t (see = src/sbin/dhclient/dhcpd.h), which is a signed integer type, if I'm = correct. However, some time_t members of struct client_state and struct = client_config (see said header file) are assumed in the code to be = positive and possibly non-null. Instead of plastering the code with = asserts, is there something like an utime_t type? Or are there better = ways to enforce the invariant? Thanks, Erik= From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 16:02:38 2012 Return-Path: Delivered-To: hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BBCE2F25 for ; Tue, 9 Oct 2012 16:02:38 +0000 (UTC) (envelope-from freebsd@damnhippie.dyndns.org) Received: from duck.symmetricom.us (duck.symmetricom.us [206.168.13.214]) by mx1.freebsd.org (Postfix) with ESMTP id C5E898FC16 for ; Tue, 9 Oct 2012 16:02:31 +0000 (UTC) Received: from damnhippie.dyndns.org (daffy.symmetricom.us [206.168.13.218]) by duck.symmetricom.us (8.14.5/8.14.5) with ESMTP id q99G2OrY096571 for ; Tue, 9 Oct 2012 10:02:24 -0600 (MDT) (envelope-from freebsd@damnhippie.dyndns.org) Received: from [172.22.42.240] (revolution.hippie.lan [172.22.42.240]) by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id q99G2LCZ080554; Tue, 9 Oct 2012 10:02:21 -0600 (MDT) (envelope-from freebsd@damnhippie.dyndns.org) Subject: Re: time_t when used as timedelta From: Ian Lepore To: Erik Cederstrand In-Reply-To: <787F09EF-E3F7-467E-B023-B7846509D2A1@cederstrand.dk> References: <787F09EF-E3F7-467E-B023-B7846509D2A1@cederstrand.dk> Content-Type: text/plain; charset="us-ascii" Date: Tue, 09 Oct 2012 10:02:21 -0600 Message-ID: <1349798541.1123.6.camel@revolution.hippie.lan> Mime-Version: 1.0 X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit Cc: FreeBSD Hackers X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 16:02:38 -0000 On Tue, 2012-10-09 at 17:35 +0200, Erik Cederstrand wrote: > Hi list, > > I'm looking at this possible divide-by zero in dhclient: http://scan.freebsd.your.org/freebsd-head/WORLD/2012-10-07-amd64/report-nBhqE2.html.gz#EndPath > > In this specific case, it's obvious from the intention of the code that ip->client->interval is always >0, but it's not obvious to me in the code. I could add an assert before the possible divide-by-zero: > > assert(ip->client->interval > 0); > > But looking at the code, I'm not sure it's very elegant. ip->client->interval is defined as time_t (see src/sbin/dhclient/dhcpd.h), which is a signed integer type, if I'm correct. However, some time_t members of struct client_state and struct client_config (see said header file) are assumed in the code to be positive and possibly non-null. Instead of plastering the code with asserts, is there something like an utime_t type? Or are there better ways to enforce the invariant? > It looks to me like the place where enforcement is really needed is in parse_lease_time() which should ensure at the very least that negative values never get through, and in some cases that zeroes don't sneak in from config files. 
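A centralized check of that sort might look like the following sketch; the helper name and its call site are purely illustrative assumptions, not actual sbin/dhclient code:

/*
 * Illustrative sketch only: validate a parsed lease-time value once,
 * at parse time, so later divisions can rely on the invariant instead
 * of scattered assert()s.  Helper name and call site are invented.
 */
#include <limits.h>
#include <time.h>

static int
interval_ok(time_t val, time_t min)
{
	/* min would be 1 for values later used as divisors, 0 otherwise. */
	return (val >= min && val <= INT_MAX);
}

Applied where the configuration values are read in, a check like this would let the later uses of ip->client->interval and backoff_cutoff assume a sane, positive value.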
If it were ensured that ip->client->config->backoff_cutoff could never be less than 1 (and it appears any value less than 1 would be insane), then the division by zero case could never happen. However, at least one of the config statements handled by parse_lease_time() allows a value of zero. Since nothing seems to ensure that backoff_cutoff is non-zero, it seems like a potential source of div-by-zero errors too, in that same function. -- Ian From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 17:52:23 2012 Return-Path: Delivered-To: hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0180FA7A for ; Tue, 9 Oct 2012 17:52:23 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from mail-qa0-f54.google.com (mail-qa0-f54.google.com [209.85.216.54]) by mx1.freebsd.org (Postfix) with ESMTP id A33788FC1A for ; Tue, 9 Oct 2012 17:52:22 +0000 (UTC) Received: by mail-qa0-f54.google.com with SMTP id y23so3560183qad.13 for ; Tue, 09 Oct 2012 10:52:21 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer :x-gm-message-state; bh=54YCas1zo0O0B8GzkoSjXAR0Pc0wLxtRxKpGn2KMijE=; b=ROv4t79k0RMZAYLWo8dkzlfxERCI+mSEQKEDHbZFrMAyv/SwlifQbcz6mt8ch2zFGn 15GNmagwMpgJ8QWmLWgs16238dim5wAuckw0c2YvZKkVidRbSTzUhS2Qn/7u1tHxuW5W Rm48gApPOiT8gPsxIfCBD+v1HwqDo3BcffUJV62ua9yHfBnRU47hkvtHLpk4ujtWYdFU UubMO+ol27e4+JJykSts4pv5D79SaqCb0SWq7Leb2/+/vsjHAECPlYbtcHdpsUW8L6CY V27aov+cT+Ks6M2nyr9Yt7x8B6bR8m/JdFpZ6oauaFOP0uNx0y8Bz6JIXfCWh69s2Csi 7nDg== Received: by 10.224.199.2 with SMTP id eq2mr36071159qab.55.1349805141702; Tue, 09 Oct 2012 10:52:21 -0700 (PDT) Received: from [10.30.101.53] ([209.117.142.2]) by mx.google.com with ESMTPS id j3sm21055821qek.7.2012.10.09.10.52.18 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 09 Oct 2012 10:52:19 -0700 (PDT) Sender: Warner Losh Subject: Re: problem cross-compiling 9.1 Mime-Version: 1.0 (Apple Message framework v1084) Content-Type: text/plain; charset=us-ascii From: Warner Losh In-Reply-To: Date: Tue, 9 Oct 2012 11:52:15 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: References: To: Daniel Braniss X-Mailer: Apple Mail (2.1084) X-Gm-Message-State: ALoCoQmsXOwFo5V6QFrCfWNHsys9oUDK22708lgE5fRN1rofjs2Onb/eFxi2bb4siMMeQDG8G1fp Cc: hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 17:52:23 -0000 On Oct 9, 2012, at 4:46 AM, Daniel Braniss wrote: > [snip] >> any fix? >>> You have found the fix. Remove the WITHOUT_XXXX options from the = build that keep it from completing. You'll be able to add them at = installworld time w/o a hassle. nanobsd uses this to keep things down, = while still being able to build the system. >>> Warner >>=20 > where can I find the with/without list? 
> btw, I did look at nanobsd in the past and have borrowed some ideas = :-) bsd.own.mk Warner From owner-freebsd-hackers@FreeBSD.ORG Tue Oct 9 18:25:48 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3FCEBACE for ; Tue, 9 Oct 2012 18:25:48 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 1210F8FC1D for ; Tue, 9 Oct 2012 18:25:48 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 6333FB91E; Tue, 9 Oct 2012 14:25:47 -0400 (EDT) From: John Baldwin To: Warner Losh Subject: Re: No bus_space_read_8 on x86 ? Date: Tue, 9 Oct 2012 11:54:15 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p20; KDE/4.5.5; amd64; ; ) References: <506DC574.9010300@intel.com> <201210051208.45550.jhb@freebsd.org> <8BC4C95F-2D10-46A5-89C8-74801BB4E23A@bsdimp.com> In-Reply-To: <8BC4C95F-2D10-46A5-89C8-74801BB4E23A@bsdimp.com> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201210091154.15873.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 09 Oct 2012 14:25:47 -0400 (EDT) Cc: freebsd-hackers@freebsd.org, Carl Delsey X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Oct 2012 18:25:48 -0000 On Monday, October 08, 2012 4:59:24 pm Warner Losh wrote: > > On Oct 5, 2012, at 10:08 AM, John Baldwin wrote: > > > On Thursday, October 04, 2012 1:20:52 pm Carl Delsey wrote: > >> I noticed that the bus_space_*_8 functions are unimplemented for x86. > >> Looking at the code, it seems this is intentional. > >> > >> Is this done because on 32-bit systems we don't know, in the general > >> case, whether to read the upper or lower 32-bits first? > >> > >> If that's the reason, I was thinking we could provide two > >> implementations for i386: bus_space_read_8_upper_first and > >> bus_space_read_8_lower_first. For amd64 we would just have bus_space_read_8 > >> > >> Anybody who wants to use bus_space_read_8 in their file would do > >> something like: > >> #define BUS_SPACE_8_BYTES LOWER_FIRST > >> or > >> #define BUS_SPACE_8_BYTES UPPER_FIRST > >> whichever is appropriate for their hardware. > >> > >> This would go in their source file before including bus.h and we would > >> take care of mapping to the correct implementation. > >> > >> With the prevalence of 64-bit registers these days, if we don't provide > >> an implementation, I expect many drivers will end up rolling their own. > >> > >> If this seems like a good idea, I'll happily whip up a patch and submit it. > > > > I think cxgb* already have an implementation. For amd64 we should certainly > > have bus_space_*_8(), at least for SYS_RES_MEMORY. I think they should fail > > for SYS_RES_IOPORT. I don't think we can force a compile-time error though, > > would just have to return -1 on reads or some such? > > I believe it was because bus reads weren't guaranteed to be atomic on i386. > don't know if that's still the case or a concern, but it was an intentional omission. True. 
If you are on a 32-bit system you can read the two 4 byte values and then build a 64-bit value. For 64-bit platforms we should offer bus_read_8() however. -- John Baldwin From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 00:18:07 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C6551FAD; Wed, 10 Oct 2012 00:18:07 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 376EE8FC1E; Wed, 10 Oct 2012 00:18:06 +0000 (UTC) Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 09 Oct 2012 20:18:00 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 3900DB4056; Tue, 9 Oct 2012 20:18:00 -0400 (EDT) Date: Tue, 9 Oct 2012 20:18:00 -0400 (EDT) From: Rick Macklem To: Nikolay Denev Message-ID: <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: Subject: Re: NFS server bottlenecks MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692) Cc: rmacklem@freebsd.org, Garrett Wollman , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 00:18:07 -0000 Nikolay Denev wrote: > On Oct 4, 2012, at 12:36 AM, Rick Macklem > wrote: > > > Garrett Wollman wrote: > >> < >> said: > >> > >>>> Simple: just use a sepatate mutex for each list that a cache > >>>> entry > >>>> is on, rather than a global lock for everything. This would > >>>> reduce > >>>> the mutex contention, but I'm not sure how significantly since I > >>>> don't have the means to measure it yet. > >>>> > >>> Well, since the cache trimming is removing entries from the lists, > >>> I > >>> don't > >>> see how that can be done with a global lock for list updates? > >> > >> Well, the global lock is what we have now, but the cache trimming > >> process only looks at one list at a time, so not locking the list > >> that > >> isn't being iterated over probably wouldn't hurt, unless there's > >> some > >> mechanism (that I didn't see) for entries to move from one list to > >> another. Note that I'm considering each hash bucket a separate > >> "list". (One issue to worry about in that case would be cache-line > >> contention in the array of hash buckets; perhaps > >> NFSRVCACHE_HASHSIZE > >> ought to be increased to reduce that.) > >> > > Yea, a separate mutex for each hash list might help. There is also > > the > > LRU list that all entries end up on, that gets used by the trimming > > code. > > (I think? I wrote this stuff about 8 years ago, so I haven't looked > > at > > it in a while.) > > > > Also, increasing the hash table size is probably a good idea, > > especially > > if you reduce how aggressively the cache is trimmed. > > > >>> Only doing it once/sec would result in a very large cache when > >>> bursts of > >>> traffic arrives. > >> > >> My servers have 96 GB of memory so that's not a big deal for me. 
> >> > > This code was originally "production tested" on a server with > > 1Gbyte, > > so times have changed a bit;-) > > > >>> I'm not sure I see why doing it as a separate thread will improve > >>> things. > >>> There are N nfsd threads already (N can be bumped up to 256 if you > >>> wish) > >>> and having a bunch more "cache trimming threads" would just > >>> increase > >>> contention, wouldn't it? > >> > >> Only one cache-trimming thread. The cache trim holds the (global) > >> mutex for much longer than any individual nfsd service thread has > >> any > >> need to, and having N threads doing that in parallel is why it's so > >> heavily contended. If there's only one thread doing the trim, then > >> the nfsd service threads aren't spending time either contending on > >> the > >> mutex (it will be held less frequently and for shorter periods). > >> > > I think the little drc2.patch which will keep the nfsd threads from > > acquiring the mutex and doing the trimming most of the time, might > > be > > sufficient. I still don't see why a separate trimming thread will be > > an advantage. I'd also be worried that the one cache trimming thread > > won't get the job done soon enough. > > > > When I did production testing on a 1Gbyte server that saw a peak > > load of about 100RPCs/sec, it was necessary to trim aggressively. > > (Although I'd be tempted to say that a server with 1Gbyte is no > > longer relevant, I recently recall someone trying to run FreeBSD > > on a i486, although I doubt they wanted to run the nfsd on it.) > > > >>> The only negative effect I can think of w.r.t. having the nfsd > >>> threads doing it would be a (I believe negligible) increase in RPC > >>> response times (the time the nfsd thread spends trimming the > >>> cache). > >>> As noted, I think this time would be negligible compared to disk > >>> I/O > >>> and network transit times in the total RPC response time? > >> > >> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G > >> network connectivity, spinning on a contended mutex takes a > >> significant amount of CPU time. (For the current design of the NFS > >> server, it may actually be a win to turn off adaptive mutexes -- I > >> should give that a try once I'm able to do more testing.) > >> > > Have fun with it. Let me know when you have what you think is a good > > patch. > > > > rick > > > >> -GAWollman > >> _______________________________________________ > >> freebsd-hackers@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >> To unsubscribe, send any mail to > >> "freebsd-hackers-unsubscribe@freebsd.org" > > _______________________________________________ > > freebsd-fs@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > > To unsubscribe, send any mail to > > "freebsd-fs-unsubscribe@freebsd.org" > > My quest for IOPS over NFS continues :) > So far I'm not able to achieve more than about 3000 8K read requests > over NFS, > while the server locally gives much more. > And this is all from a file that is completely in ARC cache, no disk > IO involved. > Just out of curiousity, why do you use 8K reads instead of 64K reads. Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large rsize/wsize values generate too large a burst of traffic for a network interface to handle and then the rsize/wsize has to be decreased to avoid this issue.) 
And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most "real" NFS servers will be doing disk I/O. > I've snatched some sample DTrace script from the net : [ > http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes > ] > > And modified it for our new NFS server : > > #!/usr/sbin/dtrace -qs > > fbt:kernel:nfsrvd_*:entry > { > self->ts = timestamp; > @counts[probefunc] = count(); > } > > fbt:kernel:nfsrvd_*:return > / self->ts > 0 / > { > this->delta = (timestamp-self->ts)/1000000; > } > > fbt:kernel:nfsrvd_*:return > / self->ts > 0 && this->delta > 100 / > { > @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50); > } > > fbt:kernel:nfsrvd_*:return > / self->ts > 0 / > { > @dist[probefunc, "ms"] = quantize(this->delta); > self->ts = 0; > } > > END > { > printf("\n"); > printa("function %-20s %@10d\n", @counts); > printf("\n"); > printa("function %s(), time in %s:%@d\n", @dist); > printf("\n"); > printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow); > } > > And here's a sample output from one or two minutes during the run of > Oracle's ORION benchmark > tool from a Linux machine, on a 32G file on NFS mount over 10G > ethernet: > > [16:01]root@goliath:/home/ndenev# ./nfsrvd.d > ^C > > function nfsrvd_access 4 > function nfsrvd_statfs 10 > function nfsrvd_getattr 14 > function nfsrvd_commit 76 > function nfsrvd_sentcache 110048 > function nfsrvd_write 110048 > function nfsrvd_read 283648 > function nfsrvd_dorpc 393800 > function nfsrvd_getcache 393800 > function nfsrvd_rephead 393800 > function nfsrvd_updatecache 393800 > > function nfsrvd_access(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 > 1 | 0 > > function nfsrvd_statfs(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 > 1 | 0 > > function nfsrvd_getattr(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 > 1 | 0 > > function nfsrvd_sentcache(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 > 1 | 0 > > function nfsrvd_rephead(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > 1 | 0 > > function nfsrvd_updatecache(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > 1 | 0 > > function nfsrvd_getcache(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 > 1 | 1 > 2 | 0 > 4 | 1 > 8 | 0 > > function nfsrvd_write(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 > 1 | 5 > 2 | 4 > 4 | 0 > > function nfsrvd_read(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 > 1 | 19 > 2 | 3 > 4 | 2 > 8 | 0 > 16 | 1 > 32 | 0 > 64 | 0 > 128 | 0 > 256 | 1 > 512 | 0 > > function nfsrvd_commit(), time in ms: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 > 1 |@@@@@@@ 14 > 2 | 0 > 4 |@ 1 > 8 |@ 1 > 16 | 0 > 32 |@@@@@@@ 14 > 64 |@ 2 > 128 | 0 > > > function nfsrvd_commit(), time in ms for >= 100 ms: > value ------------- 
Distribution ------------- count > < 100 | 0 > 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > 150 | 0 > > function nfsrvd_read(), time in ms for >= 100 ms: > value ------------- Distribution ------------- count > 250 | 0 > 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > 350 | 0 > > > Looks like the nfs server cache functions are quite fast, but > extremely frequently called. > Yep, they are called for every RPC. I may try coding up a patch that replaces the single mutex with one for each hash bucket, for TCP. I'll post if/when I get this patch to a testing/review stage, rick > I hope someone can find this information useful. > > _______________________________________________ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to > "freebsd-hackers-unsubscribe@freebsd.org" From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 04:53:31 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 10339C17 for ; Wed, 10 Oct 2012 04:53:31 +0000 (UTC) (envelope-from tim@kientzle.com) Received: from monday.kientzle.com (99-115-135-74.uvs.sntcca.sbcglobal.net [99.115.135.74]) by mx1.freebsd.org (Postfix) with ESMTP id D6CA08FC08 for ; Wed, 10 Oct 2012 04:53:30 +0000 (UTC) Received: (from root@localhost) by monday.kientzle.com (8.14.4/8.14.4) id q9A4rM96032111; Wed, 10 Oct 2012 04:53:22 GMT (envelope-from tim@kientzle.com) Received: from [192.168.2.143] (CiscoE3000 [192.168.1.65]) by kientzle.com with SMTP id b7sp22idag68zchregtgjzrste; Wed, 10 Oct 2012 04:53:22 +0000 (UTC) (envelope-from tim@kientzle.com) Subject: Re: SMP Version of tar Mime-Version: 1.0 (Apple Message framework v1278) Content-Type: text/plain; charset=us-ascii From: Tim Kientzle In-Reply-To: Date: Tue, 9 Oct 2012 21:54:03 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <15DBA1A9-A4B6-4F7D-A9DC-3412C4BE3517@kientzle.com> References: <5069C9FC.6020400@brandonfa.lk> <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> <20121008083814.GA5830@straylight.m.ringlet.net> To: Wojciech Puchar X-Mailer: Apple Mail (2.1278) Cc: freebsd-hackers@freebsd.org, Brandon Falk X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 04:53:31 -0000 On Oct 8, 2012, at 3:21 AM, Wojciech Puchar wrote: >> Not necessarily. If I understand correctly what Tim means, he's = talking >> about an in-memory compression of several blocks by several separate >> threads, and then - after all the threads have compressed their >=20 > but gzip format is single stream. dictionary IMHO is not reset every X = kilobytes. >=20 > parallel gzip is possible but not with same data format. Yes, it is. The following creates a compressed file that is completely compatible with the standard gzip/gunzip tools: * Break file into blocks * Compress each block into a gzip file (with gzip header and trailer = information) * Concatenate the result. This can be correctly decoded by gunzip. In theory, you get slightly worse compression. In practice, if your = blocks are reasonably large (a megabyte or so each), the difference is = negligible. 
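A minimal sketch of that recipe using zlib: gzopen() in append mode starts a fresh gzip member for each block of input. File names and the 1 MB block size are placeholders, and a parallel implementation would hand the blocks to worker threads instead of looping like this; the point here is only the on-disk format.

/*
 * Compress a file as a sequence of independent gzip members, one per
 * block, appended to the same output file.  gunzip reads the
 * concatenation as a single stream.  Error handling kept minimal.
 */
#include <stdio.h>
#include <zlib.h>

#define BLOCKSIZE (1024 * 1024)	/* ~1 MB blocks keep the loss negligible */

int
main(int argc, char **argv)
{
	FILE *in;
	gzFile out;
	static char buf[BLOCKSIZE];
	size_t n;

	if (argc != 3) {
		fprintf(stderr, "usage: %s infile outfile.gz\n", argv[0]);
		return (1);
	}
	if ((in = fopen(argv[1], "rb")) == NULL)
		return (1);
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
		/* "ab": append a fresh gzip member for this block. */
		if ((out = gzopen(argv[2], "ab")) == NULL)
			return (1);
		if (gzwrite(out, buf, (unsigned)n) != (int)n)
			return (1);
		gzclose(out);	/* finishes this member's trailer */
	}
	fclose(in);
	return (0);
}

Built with "cc -o blockgz blockgz.c -lz", the output decompresses with plain gunzip exactly as described above, at the cost of a reset dictionary per block.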
Tim From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 12:08:46 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 540F6294 for ; Wed, 10 Oct 2012 12:08:46 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 0A9C28FC14 for ; Wed, 10 Oct 2012 12:08:45 +0000 (UTC) Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 10 Oct 2012 08:08:44 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id CD252B4035; Wed, 10 Oct 2012 08:08:44 -0400 (EDT) Date: Wed, 10 Oct 2012 08:08:44 -0400 (EDT) From: Rick Macklem To: Garrett Wollman Message-ID: <461825404.1975816.1349870924809.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <20596.52616.867711.175010@hergotha.csail.mit.edu> Subject: Re: NFS server bottlenecks MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692) Cc: Nikolay Denev , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 12:08:46 -0000 Garrett Wollman wrote: > < said: > > > And, although this experiment seems useful for testing patches that > > try > > and reduce DRC CPU overheads, most "real" NFS servers will be doing > > disk > > I/O. > > We don't always have control over what the user does. I think the > worst-case for my users involves a third-party program (that they're > not willing to modify) that does line-buffered writes in append mode. > This uses nearly all of the CPU on per-RPC overhead (each write is > three RPCs: GETATTR, WRITE, COMMIT). > Yes. My comment was simply meant to imply that his testing isn't a realistic load for most NFS servers. It was not meant to imply that reducing the CPU overhead/lock contention of the DRC is a useless exercise. 
rick > -GAWollman From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 01:21:14 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id AA266A5 for ; Wed, 10 Oct 2012 01:21:14 +0000 (UTC) (envelope-from wollman@hergotha.csail.mit.edu) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) by mx1.freebsd.org (Postfix) with ESMTP id 585718FC14 for ; Wed, 10 Oct 2012 01:21:14 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.5/8.14.5) with ESMTP id q9A1LD2m043208; Tue, 9 Oct 2012 21:21:13 -0400 (EDT) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.5/8.14.4/Submit) id q9A1LDoI043205; Tue, 9 Oct 2012 21:21:13 -0400 (EDT) (envelope-from wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <20596.52616.867711.175010@hergotha.csail.mit.edu> Date: Tue, 9 Oct 2012 21:21:12 -0400 From: Garrett Wollman To: Rick Macklem Subject: Re: NFS server bottlenecks In-Reply-To: <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca> References: <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca> X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (hergotha.csail.mit.edu [127.0.0.1]); Tue, 09 Oct 2012 21:21:13 -0400 (EDT) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu X-Mailman-Approved-At: Wed, 10 Oct 2012 12:30:45 +0000 Cc: Nikolay Denev , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 01:21:14 -0000 < said: > And, although this experiment seems useful for testing patches that try > and reduce DRC CPU overheads, most "real" NFS servers will be doing disk > I/O. We don't always have control over what the user does. I think the worst-case for my users involves a third-party program (that they're not willing to modify) that does line-buffered writes in append mode. This uses nearly all of the CPU on per-RPC overhead (each write is three RPCs: GETATTR, WRITE, COMMIT). 
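The third-party program itself isn't shown, but a client with that I/O pattern could be as simple as the following sketch (the mount path is a placeholder):

/*
 * One possible shape of the behaviour described above: an append-mode,
 * line-buffered writer, so every line of output becomes its own small
 * write(2) on the NFS mount.
 */
#include <stdio.h>

int
main(void)
{
	FILE *fp;
	char line[256];

	if ((fp = fopen("/mnt/nfs/app.log", "a")) == NULL)
		return (1);
	setlinebuf(fp);			/* flush on every newline */
	while (fgets(line, sizeof(line), stdin) != NULL)
		fputs(line, fp);	/* each line -> a separate small write */
	fclose(fp);
	return (0);
}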
-GAWollman From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 14:33:18 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 351D74A6 for ; Wed, 10 Oct 2012 14:33:18 +0000 (UTC) (envelope-from lidl@hydra.pix.net) Received: from hydra.pix.net (hydra.pix.net [IPv6:2001:470:e254:10::3c]) by mx1.freebsd.org (Postfix) with ESMTP id F3E418FC0A for ; Wed, 10 Oct 2012 14:33:17 +0000 (UTC) Received: from hydra.pix.net (localhost [127.0.0.1]) by hydra.pix.net (8.14.5/8.14.5) with ESMTP id q9AEXGuA008619; Wed, 10 Oct 2012 10:33:16 -0400 (EDT) (envelope-from lidl@hydra.pix.net) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.97.5 at mail.pix.net Received: (from lidl@localhost) by hydra.pix.net (8.14.5/8.14.5/Submit) id q9AEXEl9008618; Wed, 10 Oct 2012 10:33:14 -0400 (EDT) (envelope-from lidl) Date: Wed, 10 Oct 2012 10:33:14 -0400 From: Kurt Lidl To: Tim Kientzle Subject: Re: SMP Version of tar Message-ID: <20121010143314.GA8402@pix.net> References: <5069C9FC.6020400@brandonfa.lk> <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> <20121008083814.GA5830@straylight.m.ringlet.net> <15DBA1A9-A4B6-4F7D-A9DC-3412C4BE3517@kientzle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <15DBA1A9-A4B6-4F7D-A9DC-3412C4BE3517@kientzle.com> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Wojciech Puchar , Brandon Falk , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 14:33:18 -0000 On Tue, Oct 09, 2012 at 09:54:03PM -0700, Tim Kientzle wrote: > > On Oct 8, 2012, at 3:21 AM, Wojciech Puchar wrote: > > >> Not necessarily. If I understand correctly what Tim means, he's talking > >> about an in-memory compression of several blocks by several separate > >> threads, and then - after all the threads have compressed their > > > > but gzip format is single stream. dictionary IMHO is not reset every X kilobytes. > > > > parallel gzip is possible but not with same data format. > > Yes, it is. > > The following creates a compressed file that > is completely compatible with the standard > gzip/gunzip tools: > > * Break file into blocks > * Compress each block into a gzip file (with gzip header and trailer information) > * Concatenate the result. > > This can be correctly decoded by gunzip. > > In theory, you get slightly worse compression. In practice, if your blocks are reasonably large (a megabyte or so each), the difference is negligible. I am not sure, but I think this conversation might have a slight misunderstanding due to imprecisely specified language, while the technical part is in agreement. Tim is correct in that gzip datastream allows for concatenation of compressed blocks of data, so you might break the input stream into a bunch of blocks [A, B, C, etc], and then can append those together into [A.gz, B.gz, C.gz, etc], and when uncompressed, you will get the original input stream. I think that Wojciech's point is that the compressed data stream for for the single datastream is different than the compressed data stream of [A.gz, B.gz, C.gz, etc]. Both will decompress to the same thing, but the intermediate compressed representation will be different. 
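That equivalence can also be checked mechanically: with a reasonably recent zlib (the rewritten gz* code), gzread() decompresses any number of concatenated gzip members in sequence, so comparing the decoded bytes against the original input shows the two representations are interchangeable. A sketch, with placeholder file names:

/*
 * Verify that a concatenated per-block gzip file decodes to the same
 * bytes as the original input.  Assumes a zlib whose gzread() walks
 * concatenated gzip members (documented behaviour in current zlib).
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int
main(int argc, char **argv)
{
	FILE *orig;
	gzFile gz;
	char a[8192], b[8192];
	size_t n;

	if (argc != 3)
		return (1);
	if ((orig = fopen(argv[1], "rb")) == NULL ||
	    (gz = gzopen(argv[2], "rb")) == NULL)
		return (1);
	while ((n = fread(a, 1, sizeof(a), orig)) > 0) {
		if (gzread(gz, b, (unsigned)n) != (int)n ||
		    memcmp(a, b, n) != 0) {
			fprintf(stderr, "mismatch\n");
			return (1);
		}
	}
	printf("identical\n");
	return (0);
}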
-Kurt From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 14:42:18 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E5C50771; Wed, 10 Oct 2012 14:42:18 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by mx1.freebsd.org (Postfix) with ESMTP id 1AB1E8FC12; Wed, 10 Oct 2012 14:42:17 +0000 (UTC) Received: by mail-wi0-f178.google.com with SMTP id hr7so642396wib.13 for ; Wed, 10 Oct 2012 07:42:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; bh=0uvAubi10Zr77AqIE0leBfCOdJWQKIhIhklmMQFQfMI=; b=Rb4RRMd5uRFPdwtkobaijb/QcZ2uvBLo0dxLXHI6omth/FE7RRQ+WXekWWjd6U92mG nVc5pUpSIu16Hwn8Syp5VRDS6qzl36Y0oOSxKKd0RnoySOQngApdf3HmtZVRcSf5CGLS Oeh5nahMLQTC3LRav0vvqrnTwZfUV6inw2VuqC0uIRJ5FmBFg6LM9wNW5n9sReR7wQRT yyPRRsCKKoGaJkNdwq2ox1LRJ9U03Z+UOSwCjP42t9mY5DEHjN5GjgLgKl0+sRaixFyn 7zTqvvrR5IR6V5FGT1UO749psZ3PveGUZE6epLyfVcJQ7azeO6Vz/7sDYAhaQDx2avwA ZYDA== Received: by 10.216.218.105 with SMTP id j83mr8876869wep.164.1349880136909; Wed, 10 Oct 2012 07:42:16 -0700 (PDT) Received: from ndenevsa.sf.moneybookers.net (g1.moneybookers.com. [217.18.249.148]) by mx.google.com with ESMTPS id b7sm30432075wiz.3.2012.10.10.07.42.15 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 10 Oct 2012 07:42:16 -0700 (PDT) Subject: Re: NFS server bottlenecks Mime-Version: 1.0 (Mac OS X Mail 6.1 \(1498\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca> Date: Wed, 10 Oct 2012 17:42:15 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: References: <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca> To: Rick Macklem X-Mailer: Apple Mail (2.1498) Cc: rmacklem@freebsd.org, Garrett Wollman , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 14:42:19 -0000 On Oct 10, 2012, at 3:18 AM, Rick Macklem wrote: > Nikolay Denev wrote: >> On Oct 4, 2012, at 12:36 AM, Rick Macklem >> wrote: >>=20 >>> Garrett Wollman wrote: >>>> <>>> said: >>>>=20 >>>>>> Simple: just use a sepatate mutex for each list that a cache >>>>>> entry >>>>>> is on, rather than a global lock for everything. This would >>>>>> reduce >>>>>> the mutex contention, but I'm not sure how significantly since I >>>>>> don't have the means to measure it yet. >>>>>>=20 >>>>> Well, since the cache trimming is removing entries from the lists, >>>>> I >>>>> don't >>>>> see how that can be done with a global lock for list updates? >>>>=20 >>>> Well, the global lock is what we have now, but the cache trimming >>>> process only looks at one list at a time, so not locking the list >>>> that >>>> isn't being iterated over probably wouldn't hurt, unless there's >>>> some >>>> mechanism (that I didn't see) for entries to move from one list to >>>> another. Note that I'm considering each hash bucket a separate >>>> "list". (One issue to worry about in that case would be cache-line >>>> contention in the array of hash buckets; perhaps >>>> NFSRVCACHE_HASHSIZE >>>> ought to be increased to reduce that.) 
>>>>=20 >>> Yea, a separate mutex for each hash list might help. There is also >>> the >>> LRU list that all entries end up on, that gets used by the trimming >>> code. >>> (I think? I wrote this stuff about 8 years ago, so I haven't looked >>> at >>> it in a while.) >>>=20 >>> Also, increasing the hash table size is probably a good idea, >>> especially >>> if you reduce how aggressively the cache is trimmed. >>>=20 >>>>> Only doing it once/sec would result in a very large cache when >>>>> bursts of >>>>> traffic arrives. >>>>=20 >>>> My servers have 96 GB of memory so that's not a big deal for me. >>>>=20 >>> This code was originally "production tested" on a server with >>> 1Gbyte, >>> so times have changed a bit;-) >>>=20 >>>>> I'm not sure I see why doing it as a separate thread will improve >>>>> things. >>>>> There are N nfsd threads already (N can be bumped up to 256 if you >>>>> wish) >>>>> and having a bunch more "cache trimming threads" would just >>>>> increase >>>>> contention, wouldn't it? >>>>=20 >>>> Only one cache-trimming thread. The cache trim holds the (global) >>>> mutex for much longer than any individual nfsd service thread has >>>> any >>>> need to, and having N threads doing that in parallel is why it's so >>>> heavily contended. If there's only one thread doing the trim, then >>>> the nfsd service threads aren't spending time either contending on >>>> the >>>> mutex (it will be held less frequently and for shorter periods). >>>>=20 >>> I think the little drc2.patch which will keep the nfsd threads from >>> acquiring the mutex and doing the trimming most of the time, might >>> be >>> sufficient. I still don't see why a separate trimming thread will be >>> an advantage. I'd also be worried that the one cache trimming thread >>> won't get the job done soon enough. >>>=20 >>> When I did production testing on a 1Gbyte server that saw a peak >>> load of about 100RPCs/sec, it was necessary to trim aggressively. >>> (Although I'd be tempted to say that a server with 1Gbyte is no >>> longer relevant, I recently recall someone trying to run FreeBSD >>> on a i486, although I doubt they wanted to run the nfsd on it.) >>>=20 >>>>> The only negative effect I can think of w.r.t. having the nfsd >>>>> threads doing it would be a (I believe negligible) increase in RPC >>>>> response times (the time the nfsd thread spends trimming the >>>>> cache). >>>>> As noted, I think this time would be negligible compared to disk >>>>> I/O >>>>> and network transit times in the total RPC response time? >>>>=20 >>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G >>>> network connectivity, spinning on a contended mutex takes a >>>> significant amount of CPU time. (For the current design of the NFS >>>> server, it may actually be a win to turn off adaptive mutexes -- I >>>> should give that a try once I'm able to do more testing.) >>>>=20 >>> Have fun with it. Let me know when you have what you think is a good >>> patch. 
>>>=20 >>> rick >>>=20 >>>> -GAWollman >>>> _______________________________________________ >>>> freebsd-hackers@freebsd.org mailing list >>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>>> To unsubscribe, send any mail to >>>> "freebsd-hackers-unsubscribe@freebsd.org" >>> _______________________________________________ >>> freebsd-fs@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs >>> To unsubscribe, send any mail to >>> "freebsd-fs-unsubscribe@freebsd.org" >>=20 >> My quest for IOPS over NFS continues :) >> So far I'm not able to achieve more than about 3000 8K read requests >> over NFS, >> while the server locally gives much more. >> And this is all from a file that is completely in ARC cache, no disk >> IO involved. >>=20 > Just out of curiousity, why do you use 8K reads instead of 64K reads. > Since the RPC overhead (including the DRC functions) is per RPC, doing > fewer larger RPCs should usually work better. (Sometimes large = rsize/wsize > values generate too large a burst of traffic for a network interface = to > handle and then the rsize/wsize has to be decreased to avoid this = issue.) >=20 > And, although this experiment seems useful for testing patches that = try > and reduce DRC CPU overheads, most "real" NFS servers will be doing = disk > I/O. >=20 This is the default blocksize the Oracle and probably most databases = use. It uses also larger blocks, but for small random reads in OLTP = applications this is what is used. >> I've snatched some sample DTrace script from the net : [ >> = http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes >> ] >>=20 >> And modified it for our new NFS server : >>=20 >> #!/usr/sbin/dtrace -qs >>=20 >> fbt:kernel:nfsrvd_*:entry >> { >> self->ts =3D timestamp; >> @counts[probefunc] =3D count(); >> } >>=20 >> fbt:kernel:nfsrvd_*:return >> / self->ts > 0 / >> { >> this->delta =3D (timestamp-self->ts)/1000000; >> } >>=20 >> fbt:kernel:nfsrvd_*:return >> / self->ts > 0 && this->delta > 100 / >> { >> @slow[probefunc, "ms"] =3D lquantize(this->delta, 100, 500, 50); >> } >>=20 >> fbt:kernel:nfsrvd_*:return >> / self->ts > 0 / >> { >> @dist[probefunc, "ms"] =3D quantize(this->delta); >> self->ts =3D 0; >> } >>=20 >> END >> { >> printf("\n"); >> printa("function %-20s %@10d\n", @counts); >> printf("\n"); >> printa("function %s(), time in %s:%@d\n", @dist); >> printf("\n"); >> printa("function %s(), time in %s for >=3D 100 ms:%@d\n", @slow); >> } >>=20 >> And here's a sample output from one or two minutes during the run of >> Oracle's ORION benchmark >> tool from a Linux machine, on a 32G file on NFS mount over 10G >> ethernet: >>=20 >> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d >> ^C >>=20 >> function nfsrvd_access 4 >> function nfsrvd_statfs 10 >> function nfsrvd_getattr 14 >> function nfsrvd_commit 76 >> function nfsrvd_sentcache 110048 >> function nfsrvd_write 110048 >> function nfsrvd_read 283648 >> function nfsrvd_dorpc 393800 >> function nfsrvd_getcache 393800 >> function nfsrvd_rephead 393800 >> function nfsrvd_updatecache 393800 >>=20 >> function nfsrvd_access(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 >> 1 | 0 >>=20 >> function nfsrvd_statfs(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 >> 1 | 0 >>=20 >> function nfsrvd_getattr(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 >> 1 | 0 >>=20 >> function nfsrvd_sentcache(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 >> 1 | 0 >>=20 >> function nfsrvd_rephead(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 >> 1 | 0 >>=20 >> function nfsrvd_updatecache(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 >> 1 | 0 >>=20 >> function nfsrvd_getcache(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 >> 1 | 1 >> 2 | 0 >> 4 | 1 >> 8 | 0 >>=20 >> function nfsrvd_write(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 >> 1 | 5 >> 2 | 4 >> 4 | 0 >>=20 >> function nfsrvd_read(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 >> 1 | 19 >> 2 | 3 >> 4 | 2 >> 8 | 0 >> 16 | 1 >> 32 | 0 >> 64 | 0 >> 128 | 0 >> 256 | 1 >> 512 | 0 >>=20 >> function nfsrvd_commit(), time in ms: >> value ------------- Distribution ------------- count >> -1 | 0 >> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 >> 1 |@@@@@@@ 14 >> 2 | 0 >> 4 |@ 1 >> 8 |@ 1 >> 16 | 0 >> 32 |@@@@@@@ 14 >> 64 |@ 2 >> 128 | 0 >>=20 >>=20 >> function nfsrvd_commit(), time in ms for >=3D 100 ms: >> value ------------- Distribution ------------- count >> < 100 | 0 >> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 >> 150 | 0 >>=20 >> function nfsrvd_read(), time in ms for >=3D 100 ms: >> value ------------- Distribution ------------- count >> 250 | 0 >> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 >> 350 | 0 >>=20 >>=20 >> Looks like the nfs server cache functions are quite fast, but >> extremely frequently called. >>=20 > Yep, they are called for every RPC. >=20 > I may try coding up a patch that replaces the single mutex with > one for each hash bucket, for TCP. >=20 > I'll post if/when I get this patch to a testing/review stage, rick >=20 Cool. I've readjusted the precision of the dtrace script a bit, and I can see now the following three functions as taking most of the time : = nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache() This was recorded during a oracle benchmark run called SLOB, which = caused 99% cpu load on the NFS server. >> I hope someone can find this information useful. 
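For readers following the thread, the change Rick mentions -- one mutex per hash chain instead of a single global cache mutex -- would look roughly like this generic sketch. All names are invented; this is not the actual NFS request-cache code or Rick's patch:

/*
 * Per-hash-chain locking: nfsd threads touching different buckets no
 * longer serialize on one global mutex.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define	DRC_HASHSIZE	500		/* a larger table also shortens chains */

struct drcentry {
	LIST_ENTRY(drcentry)	de_hash;
	uint32_t		de_xid;
	/* ... cached reply, timestamps, LRU linkage ... */
};

struct drcbucket {
	struct mtx		db_mtx;
	LIST_HEAD(, drcentry)	db_head;
};

static struct drcbucket drc_table[DRC_HASHSIZE];

static void
drc_init(void)
{
	int i;

	for (i = 0; i < DRC_HASHSIZE; i++) {
		mtx_init(&drc_table[i].db_mtx, "drcbucket", NULL, MTX_DEF);
		LIST_INIT(&drc_table[i].db_head);
	}
}

/*
 * Lookup takes only the lock for the chain the xid hashes to.  A real
 * implementation would reference-count the entry before dropping the
 * bucket lock, and still needs a strategy for the shared LRU list used
 * by the trimming code.
 */
static struct drcentry *
drc_lookup(uint32_t xid)
{
	struct drcbucket *db;
	struct drcentry *de;

	db = &drc_table[xid % DRC_HASHSIZE];
	mtx_lock(&db->db_mtx);
	LIST_FOREACH(de, &db->db_head, de_hash)
		if (de->de_xid == xid)
			break;
	mtx_unlock(&db->db_mtx);
	return (de);
}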
>>=20 >> _______________________________________________ >> freebsd-hackers@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >> To unsubscribe, send any mail to >> "freebsd-hackers-unsubscribe@freebsd.org" From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 16:42:47 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 45374CA4 for ; Wed, 10 Oct 2012 16:42:47 +0000 (UTC) (envelope-from rysto32@gmail.com) Received: from mail-vc0-f182.google.com (mail-vc0-f182.google.com [209.85.220.182]) by mx1.freebsd.org (Postfix) with ESMTP id ECC948FC08 for ; Wed, 10 Oct 2012 16:42:46 +0000 (UTC) Received: by mail-vc0-f182.google.com with SMTP id fw7so1296566vcb.13 for ; Wed, 10 Oct 2012 09:42:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=WQtYJzGsYuj7puhxEnEntVwMPFE4PvRkDWMajnzmoW0=; b=WwM6x0eDL30fF7ZpmFz3UUknEWNh38+fqwaNrPFVFBF4ccIAQDnPFzo745/vUmVrct nAW/QcFp6PSjUyVz/qTuXEuxKqtKv662D7Zjl2IoK1EwyO8ZBjJz9qvhFpVgexYjNH2o J8fc+ddajWGS8ahRyn0kEFZ/sSsQ3SDAK64G6JKFVPsBaz+5nzpWwNR0kVoatJgdNbor fgzDVjzs0moBbv4EkIJfOOcDIfnRmMBg6C2QQQHDOrox7Nzb6tYt8v6jSyWm0VKw3oJm ncpgTzBrQSB+fJChEEPIAApRkCikqMqm2S7HC+eN1Rfy6wzVAO67duUj1UtrEw/lWlB1 +VPg== MIME-Version: 1.0 Received: by 10.52.29.74 with SMTP id i10mr5850740vdh.40.1349887366025; Wed, 10 Oct 2012 09:42:46 -0700 (PDT) Received: by 10.58.207.114 with HTTP; Wed, 10 Oct 2012 09:42:45 -0700 (PDT) In-Reply-To: <1349746003.10434.YahooMailClassic@web181706.mail.ne1.yahoo.com> References: <1349746003.10434.YahooMailClassic@web181706.mail.ne1.yahoo.com> Date: Wed, 10 Oct 2012 12:42:45 -0400 Message-ID: Subject: Re: Kernel memory usage From: Ryan Stone To: Sushanth Rai Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 16:42:47 -0000 On Mon, Oct 8, 2012 at 9:26 PM, Sushanth Rai wrote= : > I was trying to co-relate the o/p from "top" to that I get from vmstat -z= . I don't have any user programs that wires memory. Given that, I'm assumin= g the wired memory count shown by "top" is memory used by kernel. Now I wou= ld like find out how the kernel is using this "wired" memory. So, I look at= dynamic memory allocated by kernel using "vmstat -z". I think memory alloc= ated via malloc() is serviced by zones if the allocation size is <4k. So, I= 'm not sure how useful "vmstat -m" is. I also add up memory used by buffer = cache. Is there any other significant chunk I'm missing ? Does vmstat -m sh= ow memory that is not accounted for in vmstat -z. All allocations by malloc that are larger than a single page are served by uma_large_malloc, and as far as I can tell these allocations will not be accounted for in vmstat -z (they will, of course, be accounted for in vmstat -m). Similarly, all allocations through contigmalloc will not be accounted for in vmstat -z. 
From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 20:46:29 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A53583E4 for ; Wed, 10 Oct 2012 20:46:29 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [89.206.35.99]) by mx1.freebsd.org (Postfix) with ESMTP id CA9E38FC08 for ; Wed, 10 Oct 2012 20:46:28 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id q9AKkCHl002228; Wed, 10 Oct 2012 22:46:12 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id q9AKkBU6002225; Wed, 10 Oct 2012 22:46:11 +0200 (CEST) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Wed, 10 Oct 2012 22:46:11 +0200 (CEST) From: Wojciech Puchar To: Kurt Lidl Subject: Re: SMP Version of tar In-Reply-To: <20121010143314.GA8402@pix.net> Message-ID: References: <5069C9FC.6020400@brandonfa.lk> <324B736D-8961-4E44-A212-2ECF3E60F2A0@kientzle.com> <20121008083814.GA5830@straylight.m.ringlet.net> <15DBA1A9-A4B6-4F7D-A9DC-3412C4BE3517@kientzle.com> <20121010143314.GA8402@pix.net> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Wed, 10 Oct 2012 22:46:12 +0200 (CEST) Cc: Brandon Falk , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 20:46:29 -0000
> > Tim is correct in that gzip datastream allows for concatenation of > compressed blocks of data, so you might break the input stream into > a bunch of blocks [A, B, C, etc], and then can append those together > into [A.gz, B.gz, C.gz, etc], and when uncompressed, you will get > the original input stream. > I think that Wojciech's point is that the compressed data stream > for the single datastream is different than the compressed data > stream of [A.gz, B.gz, C.gz, etc]. Both will decompress to the same > thing, but the intermediate compressed representation will be different.
So - after your response it is clear that a parallel-generated tar.gz will be different and have slightly worse compression (which can be ignored), and WILL be compatible with standard gzip, since gzip can decompress multiple concatenated streams - which I wasn't aware of. That's good.
At the same time, a parallel tar will fall back to a single thread when unpacking a standard .tar.gz - not a big deal, as gzip decompression is ultra-fast and I/O is usually the limit.
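The multi-member property Kurt describes is easy to check with zlib. A minimal sketch, with invented file names, assuming a reasonably recent zlib in which gzread() continues across member boundaries:

/*
 * Sketch: compress two blocks independently, concatenate the results,
 * and read them back as one stream -- the property a parallel tar.gz
 * writer relies on.
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

static void
write_member(const char *path, const char *data)
{
        gzFile g = gzopen(path, "wb");

        gzwrite(g, data, (unsigned)strlen(data));
        gzclose(g);
}

static void
append_file(FILE *out, const char *path)
{
        FILE *in = fopen(path, "rb");
        char buf[4096];
        size_t n;

        while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
                fwrite(buf, 1, n, out);
        fclose(in);
}

int
main(void)
{
        char buf[64];
        int n;
        FILE *out;
        gzFile g;

        write_member("a.gz", "hello ");
        write_member("b.gz", "world\n");

        out = fopen("ab.gz", "wb");     /* same effect as: cat a.gz b.gz > ab.gz */
        append_file(out, "a.gz");
        append_file(out, "b.gz");
        fclose(out);

        g = gzopen("ab.gz", "rb");      /* gzread() crosses the member boundary */
        n = gzread(g, buf, sizeof(buf) - 1);
        buf[n > 0 ? n : 0] = '\0';
        gzclose(g);

        printf("%s", buf);              /* prints "hello world\n" */
        return (0);
}

gzip(1) behaves the same way when decompressing, which is why the concatenated output stays compatible with the standard tools.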
From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 21:44:17 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0843B8F3; Wed, 10 Oct 2012 21:44:17 +0000 (UTC) (envelope-from carl.r.delsey@intel.com) Received: from mga03.intel.com (mga03.intel.com [143.182.124.21]) by mx1.freebsd.org (Postfix) with ESMTP id C72F88FC14; Wed, 10 Oct 2012 21:44:16 +0000 (UTC) Received: from azsmga002.ch.intel.com ([10.2.17.35]) by azsmga101.ch.intel.com with ESMTP; 10 Oct 2012 14:44:10 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.80,565,1344236400"; d="scan'208";a="154794507" Received: from crdelsey-fbsd.ch.intel.com (HELO [10.2.105.127]) ([10.2.105.127]) by AZSMGA002.ch.intel.com with ESMTP; 10 Oct 2012 14:44:09 -0700 Message-ID: <5075EC29.1010907@intel.com> Date: Wed, 10 Oct 2012 14:44:09 -0700 From: Carl Delsey User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120724 Thunderbird/13.0.1 MIME-Version: 1.0 To: John Baldwin Subject: Re: No bus_space_read_8 on x86 ? References: <506DC574.9010300@intel.com> <201210051208.45550.jhb@freebsd.org> <8BC4C95F-2D10-46A5-89C8-74801BB4E23A@bsdimp.com> <201210091154.15873.jhb@freebsd.org> In-Reply-To: <201210091154.15873.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 21:44:17 -0000 Sorry for the slow response. I was dealing with a bit of a family emergency. Responses inline below. On 10/09/12 08:54, John Baldwin wrote: > On Monday, October 08, 2012 4:59:24 pm Warner Losh wrote: >> On Oct 5, 2012, at 10:08 AM, John Baldwin wrote: >>> I think cxgb* already have an implementation. For amd64 we should certainly >>> have bus_space_*_8(), at least for SYS_RES_MEMORY. I think they should fail >>> for SYS_RES_IOPORT. I don't think we can force a compile-time error though, >>> would just have to return -1 on reads or some such? Yes. Exactly what I was thinking. >> I believe it was because bus reads weren't guaranteed to be atomic on i386. >> don't know if that's still the case or a concern, but it was an intentional omission. > True. If you are on a 32-bit system you can read the two 4 byte values and > then build a 64-bit value. For 64-bit platforms we should offer bus_read_8() > however. I believe there is still no way to perform a 64-bit read on a i386 (or at least without messing with SSE instructions), but if you have to read a 64-bit register, you are stuck with doing two 32-bit reads and concatenating them. I figure we may as well provide an implementation for those who have to do that as well as the implementation for 64-bit. Anyhow, it sounds like we are basically in agreement. I'll put together a patch and send it out for review. 
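A minimal sketch of the "two 32-bit reads, then concatenate" fallback being discussed; the function name, word order and latching behaviour are assumptions, not the interface of the eventual patch:

/*
 * Sketch only: build a 64-bit value from two 32-bit MMIO reads, for
 * platforms with no native 64-bit load.  Assumes a little-endian layout
 * with the low word first; a real device may require a specific read
 * order (e.g. low word first to latch the high word).
 */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t
read_8_split(volatile uint32_t *reg)
{
        uint32_t lo = reg[0];   /* low 32 bits  */
        uint32_t hi = reg[1];   /* high 32 bits */

        /* Not atomic: the register may change between the two reads. */
        return ((uint64_t)hi << 32) | lo;
}

int
main(void)
{
        uint32_t fake_reg[2] = { 0xdeadbeef, 0x00000012 };      /* stand-in for MMIO */

        printf("0x%016llx\n", (unsigned long long)read_8_split(fake_reg));
        return (0);
}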
Thanks, Carl From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 22:09:15 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C92823D5; Wed, 10 Oct 2012 22:09:15 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 0FE5E8FC14; Wed, 10 Oct 2012 22:09:14 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAJ+LclCDaFvO/2dsb2JhbABFFoV7uhmCIAEBAQQBAQEgBCcgBgUbDgoCAg0ZAikBCSYGCAcEARwBA4dkC6ZJkXWBIYouGoRkgRIDkz6CLYEVjxmDCYFHNA X-IronPort-AV: E=Sophos;i="4.80,567,1344225600"; d="scan'208";a="185836181" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu-pri.mail.uoguelph.ca with ESMTP; 10 Oct 2012 18:09:07 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id EC7FAB3F62; Wed, 10 Oct 2012 18:09:07 -0400 (EDT) Date: Wed, 10 Oct 2012 18:09:07 -0400 (EDT) From: Rick Macklem To: Nikolay Denev Message-ID: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: Subject: Re: NFS server bottlenecks MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692) Cc: rmacklem@freebsd.org, Garrett Wollman , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Oct 2012 22:09:16 -0000 Nikolay Denev wrote: > On Oct 10, 2012, at 3:18 AM, Rick Macklem > wrote: > > > Nikolay Denev wrote: > >> On Oct 4, 2012, at 12:36 AM, Rick Macklem > >> wrote: > >> > >>> Garrett Wollman wrote: > >>>> < >>>> said: > >>>> > >>>>>> Simple: just use a sepatate mutex for each list that a cache > >>>>>> entry > >>>>>> is on, rather than a global lock for everything. This would > >>>>>> reduce > >>>>>> the mutex contention, but I'm not sure how significantly since > >>>>>> I > >>>>>> don't have the means to measure it yet. > >>>>>> > >>>>> Well, since the cache trimming is removing entries from the > >>>>> lists, > >>>>> I > >>>>> don't > >>>>> see how that can be done with a global lock for list updates? > >>>> > >>>> Well, the global lock is what we have now, but the cache trimming > >>>> process only looks at one list at a time, so not locking the list > >>>> that > >>>> isn't being iterated over probably wouldn't hurt, unless there's > >>>> some > >>>> mechanism (that I didn't see) for entries to move from one list > >>>> to > >>>> another. Note that I'm considering each hash bucket a separate > >>>> "list". (One issue to worry about in that case would be > >>>> cache-line > >>>> contention in the array of hash buckets; perhaps > >>>> NFSRVCACHE_HASHSIZE > >>>> ought to be increased to reduce that.) > >>>> > >>> Yea, a separate mutex for each hash list might help. There is also > >>> the > >>> LRU list that all entries end up on, that gets used by the > >>> trimming > >>> code. > >>> (I think? I wrote this stuff about 8 years ago, so I haven't > >>> looked > >>> at > >>> it in a while.) 
> >>> > >>> Also, increasing the hash table size is probably a good idea, > >>> especially > >>> if you reduce how aggressively the cache is trimmed. > >>> > >>>>> Only doing it once/sec would result in a very large cache when > >>>>> bursts of > >>>>> traffic arrives. > >>>> > >>>> My servers have 96 GB of memory so that's not a big deal for me. > >>>> > >>> This code was originally "production tested" on a server with > >>> 1Gbyte, > >>> so times have changed a bit;-) > >>> > >>>>> I'm not sure I see why doing it as a separate thread will > >>>>> improve > >>>>> things. > >>>>> There are N nfsd threads already (N can be bumped up to 256 if > >>>>> you > >>>>> wish) > >>>>> and having a bunch more "cache trimming threads" would just > >>>>> increase > >>>>> contention, wouldn't it? > >>>> > >>>> Only one cache-trimming thread. The cache trim holds the (global) > >>>> mutex for much longer than any individual nfsd service thread has > >>>> any > >>>> need to, and having N threads doing that in parallel is why it's > >>>> so > >>>> heavily contended. If there's only one thread doing the trim, > >>>> then > >>>> the nfsd service threads aren't spending time either contending > >>>> on > >>>> the > >>>> mutex (it will be held less frequently and for shorter periods). > >>>> > >>> I think the little drc2.patch which will keep the nfsd threads > >>> from > >>> acquiring the mutex and doing the trimming most of the time, might > >>> be > >>> sufficient. I still don't see why a separate trimming thread will > >>> be > >>> an advantage. I'd also be worried that the one cache trimming > >>> thread > >>> won't get the job done soon enough. > >>> > >>> When I did production testing on a 1Gbyte server that saw a peak > >>> load of about 100RPCs/sec, it was necessary to trim aggressively. > >>> (Although I'd be tempted to say that a server with 1Gbyte is no > >>> longer relevant, I recently recall someone trying to run FreeBSD > >>> on a i486, although I doubt they wanted to run the nfsd on it.) > >>> > >>>>> The only negative effect I can think of w.r.t. having the nfsd > >>>>> threads doing it would be a (I believe negligible) increase in > >>>>> RPC > >>>>> response times (the time the nfsd thread spends trimming the > >>>>> cache). > >>>>> As noted, I think this time would be negligible compared to disk > >>>>> I/O > >>>>> and network transit times in the total RPC response time? > >>>> > >>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and > >>>> 10G > >>>> network connectivity, spinning on a contended mutex takes a > >>>> significant amount of CPU time. (For the current design of the > >>>> NFS > >>>> server, it may actually be a win to turn off adaptive mutexes -- > >>>> I > >>>> should give that a try once I'm able to do more testing.) > >>>> > >>> Have fun with it. Let me know when you have what you think is a > >>> good > >>> patch. 
> >>> > >>> rick > >>> > >>>> -GAWollman > >>>> _______________________________________________ > >>>> freebsd-hackers@freebsd.org mailing list > >>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >>>> To unsubscribe, send any mail to > >>>> "freebsd-hackers-unsubscribe@freebsd.org" > >>> _______________________________________________ > >>> freebsd-fs@freebsd.org mailing list > >>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs > >>> To unsubscribe, send any mail to > >>> "freebsd-fs-unsubscribe@freebsd.org" > >> > >> My quest for IOPS over NFS continues :) > >> So far I'm not able to achieve more than about 3000 8K read > >> requests > >> over NFS, > >> while the server locally gives much more. > >> And this is all from a file that is completely in ARC cache, no > >> disk > >> IO involved. > >> > > Just out of curiousity, why do you use 8K reads instead of 64K > > reads. > > Since the RPC overhead (including the DRC functions) is per RPC, > > doing > > fewer larger RPCs should usually work better. (Sometimes large > > rsize/wsize > > values generate too large a burst of traffic for a network interface > > to > > handle and then the rsize/wsize has to be decreased to avoid this > > issue.) > > > > And, although this experiment seems useful for testing patches that > > try > > and reduce DRC CPU overheads, most "real" NFS servers will be doing > > disk > > I/O. > > > > This is the default blocksize the Oracle and probably most databases > use. > It uses also larger blocks, but for small random reads in OLTP > applications this is what is used. > If the client is doing 8K reads, you could increase the read ahead "readahead=N" (N up to 16), to try and increase the bandwidth. (But if the CPU is 99% busy, then I don't think it will matter.) 
> > >> I've snatched some sample DTrace script from the net : [ > >> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes > >> ] > >> > >> And modified it for our new NFS server : > >> > >> #!/usr/sbin/dtrace -qs > >> > >> fbt:kernel:nfsrvd_*:entry > >> { > >> self->ts = timestamp; > >> @counts[probefunc] = count(); > >> } > >> > >> fbt:kernel:nfsrvd_*:return > >> / self->ts > 0 / > >> { > >> this->delta = (timestamp-self->ts)/1000000; > >> } > >> > >> fbt:kernel:nfsrvd_*:return > >> / self->ts > 0 && this->delta > 100 / > >> { > >> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50); > >> } > >> > >> fbt:kernel:nfsrvd_*:return > >> / self->ts > 0 / > >> { > >> @dist[probefunc, "ms"] = quantize(this->delta); > >> self->ts = 0; > >> } > >> > >> END > >> { > >> printf("\n"); > >> printa("function %-20s %@10d\n", @counts); > >> printf("\n"); > >> printa("function %s(), time in %s:%@d\n", @dist); > >> printf("\n"); > >> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow); > >> } > >> > >> And here's a sample output from one or two minutes during the run > >> of > >> Oracle's ORION benchmark > >> tool from a Linux machine, on a 32G file on NFS mount over 10G > >> ethernet: > >> > >> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d > >> ^C > >> > >> function nfsrvd_access 4 > >> function nfsrvd_statfs 10 > >> function nfsrvd_getattr 14 > >> function nfsrvd_commit 76 > >> function nfsrvd_sentcache 110048 > >> function nfsrvd_write 110048 > >> function nfsrvd_read 283648 > >> function nfsrvd_dorpc 393800 > >> function nfsrvd_getcache 393800 > >> function nfsrvd_rephead 393800 > >> function nfsrvd_updatecache 393800 > >> > >> function nfsrvd_access(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 > >> 1 | 0 > >> > >> function nfsrvd_statfs(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 > >> 1 | 0 > >> > >> function nfsrvd_getattr(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 > >> 1 | 0 > >> > >> function nfsrvd_sentcache(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 > >> 1 | 0 > >> > >> function nfsrvd_rephead(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > >> 1 | 0 > >> > >> function nfsrvd_updatecache(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > >> 1 | 0 > >> > >> function nfsrvd_getcache(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 > >> 1 | 1 > >> 2 | 0 > >> 4 | 1 > >> 8 | 0 > >> > >> function nfsrvd_write(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 > >> 1 | 5 > >> 2 | 4 > >> 4 | 0 > >> > >> function nfsrvd_read(), time in ms: > >> value ------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 > >> 1 | 19 > >> 2 | 3 > >> 4 | 2 > >> 8 | 0 > >> 16 | 1 > >> 32 | 0 > >> 64 | 0 > >> 128 | 0 > >> 256 | 1 > >> 512 | 0 > >> > >> function nfsrvd_commit(), time in ms: > >> value 
------------- Distribution ------------- count > >> -1 | 0 > >> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 > >> 1 |@@@@@@@ 14 > >> 2 | 0 > >> 4 |@ 1 > >> 8 |@ 1 > >> 16 | 0 > >> 32 |@@@@@@@ 14 > >> 64 |@ 2 > >> 128 | 0 > >> > >> > >> function nfsrvd_commit(), time in ms for >= 100 ms: > >> value ------------- Distribution ------------- count > >> < 100 | 0 > >> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > >> 150 | 0 > >> > >> function nfsrvd_read(), time in ms for >= 100 ms: > >> value ------------- Distribution ------------- count > >> 250 | 0 > >> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > >> 350 | 0 > >> > >> > >> Looks like the nfs server cache functions are quite fast, but > >> extremely frequently called. > >> > > Yep, they are called for every RPC. > > > > I may try coding up a patch that replaces the single mutex with > > one for each hash bucket, for TCP. > > > > I'll post if/when I get this patch to a testing/review stage, rick > > > > Cool. > > I've readjusted the precision of the dtrace script a bit, and I can > see > now the following three functions as taking most of the time : > nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache() > > This was recorded during a oracle benchmark run called SLOB, which > caused 99% cpu load on the NFS server. > Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater? (Assuming the mounts are TCP ones.) Have fun with it, rick > > >> I hope someone can find this information useful. > >> > >> _______________________________________________ > >> freebsd-hackers@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >> To unsubscribe, send any mail to > >> "freebsd-hackers-unsubscribe@freebsd.org" From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 05:46:55 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0E7AA9B; Thu, 11 Oct 2012 05:46:55 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-we0-f182.google.com (mail-we0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id 3897A8FC18; Thu, 11 Oct 2012 05:46:53 +0000 (UTC) Received: by mail-we0-f182.google.com with SMTP id x43so1026256wey.13 for ; Wed, 10 Oct 2012 22:46:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; bh=3fx+jBrdvTe9PkuSAKy238hn/Sx5nfvqwmYbIRsQOZU=; b=r3hd8iJP0VR5cEDzd0gSP+m28tBm5cYtlLiK9SnVplj7rkOpiwuiOia9xNizCrAJIr J5gOYpXnKficKdT15eK2JwhWzYh9eZtO25TPbGUroqYf6iyIseTQ9q7l/NSHc9abKsVg OMHVf85UWPwSKoqFK8JGNoZOG/2Zh8eTRdT2ADo8zFC0NdjB0/8g03WQ//l3gEZHRnpi SQeYcpSEKxtslYKMtCcFD5cEalMtMs+2Po9K+f3JxryWdbZq4O/89vj+goURarLmjR1r Q6h0F3ohvhU+WsCbgTBARAfKZpTLezwZAN4Q+2mxrwewcHjTKXt5uuNvLeGvJsCp1GT0 qDIw== Received: by 10.180.97.35 with SMTP id dx3mr18002577wib.14.1349934412989; Wed, 10 Oct 2012 22:46:52 -0700 (PDT) Received: from [10.0.0.86] ([93.152.184.10]) by mx.google.com with ESMTPS id b3sm33623528wie.0.2012.10.10.22.46.50 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 10 Oct 2012 22:46:52 -0700 (PDT) Subject: Re: NFS server bottlenecks Mime-Version: 1.0 (Mac OS X Mail 6.1 \(1498\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> Date: Thu, 11 Oct 2012 08:46:49 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: 
<19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com> References: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> To: Rick Macklem X-Mailer: Apple Mail (2.1498) Cc: rmacklem@freebsd.org, Garrett Wollman , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 05:46:55 -0000 On Oct 11, 2012, at 1:09 AM, Rick Macklem wrote: > Nikolay Denev wrote: >> On Oct 10, 2012, at 3:18 AM, Rick Macklem >> wrote: >>=20 >>> Nikolay Denev wrote: >>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem >>>> wrote: >>>>=20 >>>>> Garrett Wollman wrote: >>>>>> <>>>>> said: >>>>>>=20 >>>>>>>> Simple: just use a sepatate mutex for each list that a cache >>>>>>>> entry >>>>>>>> is on, rather than a global lock for everything. This would >>>>>>>> reduce >>>>>>>> the mutex contention, but I'm not sure how significantly since >>>>>>>> I >>>>>>>> don't have the means to measure it yet. >>>>>>>>=20 >>>>>>> Well, since the cache trimming is removing entries from the >>>>>>> lists, >>>>>>> I >>>>>>> don't >>>>>>> see how that can be done with a global lock for list updates? >>>>>>=20 >>>>>> Well, the global lock is what we have now, but the cache trimming >>>>>> process only looks at one list at a time, so not locking the list >>>>>> that >>>>>> isn't being iterated over probably wouldn't hurt, unless there's >>>>>> some >>>>>> mechanism (that I didn't see) for entries to move from one list >>>>>> to >>>>>> another. Note that I'm considering each hash bucket a separate >>>>>> "list". (One issue to worry about in that case would be >>>>>> cache-line >>>>>> contention in the array of hash buckets; perhaps >>>>>> NFSRVCACHE_HASHSIZE >>>>>> ought to be increased to reduce that.) >>>>>>=20 >>>>> Yea, a separate mutex for each hash list might help. There is also >>>>> the >>>>> LRU list that all entries end up on, that gets used by the >>>>> trimming >>>>> code. >>>>> (I think? I wrote this stuff about 8 years ago, so I haven't >>>>> looked >>>>> at >>>>> it in a while.) >>>>>=20 >>>>> Also, increasing the hash table size is probably a good idea, >>>>> especially >>>>> if you reduce how aggressively the cache is trimmed. >>>>>=20 >>>>>>> Only doing it once/sec would result in a very large cache when >>>>>>> bursts of >>>>>>> traffic arrives. >>>>>>=20 >>>>>> My servers have 96 GB of memory so that's not a big deal for me. >>>>>>=20 >>>>> This code was originally "production tested" on a server with >>>>> 1Gbyte, >>>>> so times have changed a bit;-) >>>>>=20 >>>>>>> I'm not sure I see why doing it as a separate thread will >>>>>>> improve >>>>>>> things. >>>>>>> There are N nfsd threads already (N can be bumped up to 256 if >>>>>>> you >>>>>>> wish) >>>>>>> and having a bunch more "cache trimming threads" would just >>>>>>> increase >>>>>>> contention, wouldn't it? >>>>>>=20 >>>>>> Only one cache-trimming thread. The cache trim holds the (global) >>>>>> mutex for much longer than any individual nfsd service thread has >>>>>> any >>>>>> need to, and having N threads doing that in parallel is why it's >>>>>> so >>>>>> heavily contended. If there's only one thread doing the trim, >>>>>> then >>>>>> the nfsd service threads aren't spending time either contending >>>>>> on >>>>>> the >>>>>> mutex (it will be held less frequently and for shorter periods). 
>>>>>>=20 >>>>> I think the little drc2.patch which will keep the nfsd threads >>>>> from >>>>> acquiring the mutex and doing the trimming most of the time, might >>>>> be >>>>> sufficient. I still don't see why a separate trimming thread will >>>>> be >>>>> an advantage. I'd also be worried that the one cache trimming >>>>> thread >>>>> won't get the job done soon enough. >>>>>=20 >>>>> When I did production testing on a 1Gbyte server that saw a peak >>>>> load of about 100RPCs/sec, it was necessary to trim aggressively. >>>>> (Although I'd be tempted to say that a server with 1Gbyte is no >>>>> longer relevant, I recently recall someone trying to run FreeBSD >>>>> on a i486, although I doubt they wanted to run the nfsd on it.) >>>>>=20 >>>>>>> The only negative effect I can think of w.r.t. having the nfsd >>>>>>> threads doing it would be a (I believe negligible) increase in >>>>>>> RPC >>>>>>> response times (the time the nfsd thread spends trimming the >>>>>>> cache). >>>>>>> As noted, I think this time would be negligible compared to disk >>>>>>> I/O >>>>>>> and network transit times in the total RPC response time? >>>>>>=20 >>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and >>>>>> 10G >>>>>> network connectivity, spinning on a contended mutex takes a >>>>>> significant amount of CPU time. (For the current design of the >>>>>> NFS >>>>>> server, it may actually be a win to turn off adaptive mutexes -- >>>>>> I >>>>>> should give that a try once I'm able to do more testing.) >>>>>>=20 >>>>> Have fun with it. Let me know when you have what you think is a >>>>> good >>>>> patch. >>>>>=20 >>>>> rick >>>>>=20 >>>>>> -GAWollman >>>>>> _______________________________________________ >>>>>> freebsd-hackers@freebsd.org mailing list >>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>>>>> To unsubscribe, send any mail to >>>>>> "freebsd-hackers-unsubscribe@freebsd.org" >>>>> _______________________________________________ >>>>> freebsd-fs@freebsd.org mailing list >>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs >>>>> To unsubscribe, send any mail to >>>>> "freebsd-fs-unsubscribe@freebsd.org" >>>>=20 >>>> My quest for IOPS over NFS continues :) >>>> So far I'm not able to achieve more than about 3000 8K read >>>> requests >>>> over NFS, >>>> while the server locally gives much more. >>>> And this is all from a file that is completely in ARC cache, no >>>> disk >>>> IO involved. >>>>=20 >>> Just out of curiousity, why do you use 8K reads instead of 64K >>> reads. >>> Since the RPC overhead (including the DRC functions) is per RPC, >>> doing >>> fewer larger RPCs should usually work better. (Sometimes large >>> rsize/wsize >>> values generate too large a burst of traffic for a network interface >>> to >>> handle and then the rsize/wsize has to be decreased to avoid this >>> issue.) >>>=20 >>> And, although this experiment seems useful for testing patches that >>> try >>> and reduce DRC CPU overheads, most "real" NFS servers will be doing >>> disk >>> I/O. >>>=20 >>=20 >> This is the default blocksize the Oracle and probably most databases >> use. >> It uses also larger blocks, but for small random reads in OLTP >> applications this is what is used. >>=20 > If the client is doing 8K reads, you could increase the read ahead > "readahead=3DN" (N up to 16), to try and increase the bandwidth. > (But if the CPU is 99% busy, then I don't think it will matter.) 
I'll try to check if this is possible to be set, as we are testing not = only with the Linux NFS client, but also with the Oracle's built in so called DirectNFS client that is = built in to the app. >=20 >>=20 >>>> I've snatched some sample DTrace script from the net : [ >>>> = http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes >>>> ] >>>>=20 >>>> And modified it for our new NFS server : >>>>=20 >>>> #!/usr/sbin/dtrace -qs >>>>=20 >>>> fbt:kernel:nfsrvd_*:entry >>>> { >>>> self->ts =3D timestamp; >>>> @counts[probefunc] =3D count(); >>>> } >>>>=20 >>>> fbt:kernel:nfsrvd_*:return >>>> / self->ts > 0 / >>>> { >>>> this->delta =3D (timestamp-self->ts)/1000000; >>>> } >>>>=20 >>>> fbt:kernel:nfsrvd_*:return >>>> / self->ts > 0 && this->delta > 100 / >>>> { >>>> @slow[probefunc, "ms"] =3D lquantize(this->delta, 100, 500, 50); >>>> } >>>>=20 >>>> fbt:kernel:nfsrvd_*:return >>>> / self->ts > 0 / >>>> { >>>> @dist[probefunc, "ms"] =3D quantize(this->delta); >>>> self->ts =3D 0; >>>> } >>>>=20 >>>> END >>>> { >>>> printf("\n"); >>>> printa("function %-20s %@10d\n", @counts); >>>> printf("\n"); >>>> printa("function %s(), time in %s:%@d\n", @dist); >>>> printf("\n"); >>>> printa("function %s(), time in %s for >=3D 100 ms:%@d\n", @slow); >>>> } >>>>=20 >>>> And here's a sample output from one or two minutes during the run >>>> of >>>> Oracle's ORION benchmark >>>> tool from a Linux machine, on a 32G file on NFS mount over 10G >>>> ethernet: >>>>=20 >>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d >>>> ^C >>>>=20 >>>> function nfsrvd_access 4 >>>> function nfsrvd_statfs 10 >>>> function nfsrvd_getattr 14 >>>> function nfsrvd_commit 76 >>>> function nfsrvd_sentcache 110048 >>>> function nfsrvd_write 110048 >>>> function nfsrvd_read 283648 >>>> function nfsrvd_dorpc 393800 >>>> function nfsrvd_getcache 393800 >>>> function nfsrvd_rephead 393800 >>>> function nfsrvd_updatecache 393800 >>>>=20 >>>> function nfsrvd_access(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 >>>> 1 | 0 >>>>=20 >>>> function nfsrvd_statfs(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 >>>> 1 | 0 >>>>=20 >>>> function nfsrvd_getattr(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 >>>> 1 | 0 >>>>=20 >>>> function nfsrvd_sentcache(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 >>>> 1 | 0 >>>>=20 >>>> function nfsrvd_rephead(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 >>>> 1 | 0 >>>>=20 >>>> function nfsrvd_updatecache(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 >>>> 1 | 0 >>>>=20 >>>> function nfsrvd_getcache(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 >>>> 1 | 1 >>>> 2 | 0 >>>> 4 | 1 >>>> 8 | 0 >>>>=20 >>>> function nfsrvd_write(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 >>>> 1 | 5 >>>> 2 | 4 >>>> 4 | 0 >>>>=20 >>>> function nfsrvd_read(), time in ms: >>>> value 
------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 >>>> 1 | 19 >>>> 2 | 3 >>>> 4 | 2 >>>> 8 | 0 >>>> 16 | 1 >>>> 32 | 0 >>>> 64 | 0 >>>> 128 | 0 >>>> 256 | 1 >>>> 512 | 0 >>>>=20 >>>> function nfsrvd_commit(), time in ms: >>>> value ------------- Distribution ------------- count >>>> -1 | 0 >>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 >>>> 1 |@@@@@@@ 14 >>>> 2 | 0 >>>> 4 |@ 1 >>>> 8 |@ 1 >>>> 16 | 0 >>>> 32 |@@@@@@@ 14 >>>> 64 |@ 2 >>>> 128 | 0 >>>>=20 >>>>=20 >>>> function nfsrvd_commit(), time in ms for >=3D 100 ms: >>>> value ------------- Distribution ------------- count >>>> < 100 | 0 >>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 >>>> 150 | 0 >>>>=20 >>>> function nfsrvd_read(), time in ms for >=3D 100 ms: >>>> value ------------- Distribution ------------- count >>>> 250 | 0 >>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 >>>> 350 | 0 >>>>=20 >>>>=20 >>>> Looks like the nfs server cache functions are quite fast, but >>>> extremely frequently called. >>>>=20 >>> Yep, they are called for every RPC. >>>=20 >>> I may try coding up a patch that replaces the single mutex with >>> one for each hash bucket, for TCP. >>>=20 >>> I'll post if/when I get this patch to a testing/review stage, rick >>>=20 >>=20 >> Cool. >>=20 >> I've readjusted the precision of the dtrace script a bit, and I can >> see >> now the following three functions as taking most of the time : >> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache() >>=20 >> This was recorded during a oracle benchmark run called SLOB, which >> caused 99% cpu load on the NFS server. >>=20 > Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater? > (Assuming the mounts are TCP ones.) >=20 > Have fun with it, rick >=20 I had upped it, but probably not enough. I'm now running with = vfs.nfsd.tcphighwater set to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500. So far it looks like good improvement as those functions no longer show = up in the dtrace script output. I'll run some more benchmarks and testing today. Thanks! >>=20 >>>> I hope someone can find this information useful. 
>>>>=20 >>>> _______________________________________________ >>>> freebsd-hackers@freebsd.org mailing list >>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>>> To unsubscribe, send any mail to >>>> "freebsd-hackers-unsubscribe@freebsd.org" From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 11:49:44 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0F75F5F1 for ; Thu, 11 Oct 2012 11:49:44 +0000 (UTC) (envelope-from kwiat3k@panic.pl) Received: from mail.panic.pl (mail.panic.pl [188.116.33.105]) by mx1.freebsd.org (Postfix) with ESMTP id BB66E8FC14 for ; Thu, 11 Oct 2012 11:49:43 +0000 (UTC) Received: from mail.panic.pl (unknown [188.116.33.105]) by mail.panic.pl (Postfix) with ESMTP id 73C789B794 for ; Thu, 11 Oct 2012 13:40:56 +0200 (CEST) X-Virus-Scanned: amavisd-new at panic.pl Received: from mail.panic.pl ([188.116.33.105]) by mail.panic.pl (mail.panic.pl [188.116.33.105]) (amavisd-new, port 10024) with ESMTP id vx+vnmdr1cJG for ; Thu, 11 Oct 2012 13:40:56 +0200 (CEST) Received: from localhost.localdomain (admin.wp-sa.pl [212.77.105.137]) by mail.panic.pl (Postfix) with ESMTPSA id 1E64A9B78D for ; Thu, 11 Oct 2012 13:40:56 +0200 (CEST) Date: Thu, 11 Oct 2012 13:41:08 +0200 From: Mateusz Kwiatkowski To: freebsd-hackers@freebsd.org Subject: truss kills process? Message-ID: <20121011134108.65fd11ba@panic.pl> Organization: Panic.PL X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.13; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 11:49:44 -0000 Hello, We noticed quite strange behaviour of truss: # sleep 100 & [1] 93212 # truss -p 93212 sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0) sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0) sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0) sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0) process exit, rval = 0 [1] + done sleep 100 Sleep ends immediately instead of waiting for desired number of seconds. I wonder if this is normal behavior or maybe a bug? Checked under 8.2-RELEASE-p3 and 9.0-RELEASE. -- Mateusz Kwiatkowski From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 13:06:44 2012 Return-Path: Delivered-To: hackers@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 069DA3C6 for ; Thu, 11 Oct 2012 13:06:44 +0000 (UTC) (envelope-from erik@cederstrand.dk) Received: from csmtp3.one.com (csmtp3.one.com [91.198.169.23]) by mx1.freebsd.org (Postfix) with ESMTP id B7DFB8FC18 for ; Thu, 11 Oct 2012 13:06:43 +0000 (UTC) Received: from [192.168.1.18] (unknown [217.157.7.221]) by csmtp3.one.com (Postfix) with ESMTPA id 66F052404978 for ; Thu, 11 Oct 2012 13:06:42 +0000 (UTC) From: Erik Cederstrand Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Subject: curcpu false positive? 
Message-Id: <3A22DF7A-00BB-408C-8F76-C1E119E0E48C@cederstrand.dk> Date: Thu, 11 Oct 2012 15:06:41 +0200 To: FreeBSD Hackers Mime-Version: 1.0 (Mac OS X Mail 6.0 \(1486\)) X-Mailer: Apple Mail (2.1486) X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 13:06:44 -0000 Hello, I'm looking at some Clang Static Analyzer reports in the kernel, and a = lot of them point back to a null pointer dereference in __pcpu_type = (sys/amd64/include/pcpu.h:102) which is defined as: 102 /* 103 * Evaluates to the type of the per-cpu variable name. 104 */ 105 #define __pcpu_type(name) = \ 106 __typeof(((struct pcpu *)0)->name) Which indeed looks like a NULL pointer dereference. Looking at the = latest commit message there, I'm sure the code is correct, but I'm = unsure why the null pointer is OK. I'd appreciate an explanation :-) Thanks, Erik= From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 13:11:23 2012 Return-Path: Delivered-To: hackers@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A92486CD for ; Thu, 11 Oct 2012 13:11:23 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 00B648FC12 for ; Thu, 11 Oct 2012 13:11:21 +0000 (UTC) Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua [212.40.38.101]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id QAA03299; Thu, 11 Oct 2012 16:11:18 +0300 (EEST) (envelope-from avg@FreeBSD.org) Message-ID: <5076C576.3020306@FreeBSD.org> Date: Thu, 11 Oct 2012 16:11:18 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Erik Cederstrand Subject: Re: curcpu false positive? References: <3A22DF7A-00BB-408C-8F76-C1E119E0E48C@cederstrand.dk> In-Reply-To: <3A22DF7A-00BB-408C-8F76-C1E119E0E48C@cederstrand.dk> X-Enigmail-Version: 1.4.3 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: FreeBSD Hackers X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 13:11:23 -0000 on 11/10/2012 16:06 Erik Cederstrand said the following: > Hello, > > I'm looking at some Clang Static Analyzer reports in the kernel, and a lot of them point back to a null pointer dereference in __pcpu_type (sys/amd64/include/pcpu.h:102) which is defined as: > > 102 /* > 103 * Evaluates to the type of the per-cpu variable name. > 104 */ > 105 #define __pcpu_type(name) \ > 106 __typeof(((struct pcpu *)0)->name) > > > Which indeed looks like a NULL pointer dereference. Looking at the latest commit message there, I'm sure the code is correct, but I'm unsure why the null pointer is OK. I'd appreciate an explanation :-) Read about __typeof [1]. It's evaluated at compile time, so actual value of an expression does not matter at all. 
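A small userland illustration of why this is a false positive: __typeof only inspects types at compile time, so the (struct pcpu *)0 is never dereferenced at run time. The fields below are made up; only the macro idiom matches the header quoted above:

/*
 * Illustration only: __typeof is resolved entirely at compile time, so
 * the null pointer in the macro is never evaluated.
 */
#include <stdio.h>

struct pcpu {
        int     pc_cpuid;
        long    pc_example;     /* invented field for the demo */
};

#define __pcpu_type(name)       __typeof(((struct pcpu *)0)->name)

int
main(void)
{
        __pcpu_type(pc_cpuid) id = 3;   /* declares an int; no pointer is touched */

        printf("sizeof(pc_example's type) = %zu, id = %d\n",
            sizeof(__pcpu_type(pc_example)), id);
        return (0);
}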
[1] http://gcc.gnu.org/onlinedocs/gcc/Typeof.html -- Andriy Gapon From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 16:21:05 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B6B92913; Thu, 11 Oct 2012 16:21:05 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-we0-f182.google.com (mail-we0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id D74538FC14; Thu, 11 Oct 2012 16:21:04 +0000 (UTC) Received: by mail-we0-f182.google.com with SMTP id x43so1461794wey.13 for ; Thu, 11 Oct 2012 09:20:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; bh=qAI24G5yefhBz4Gpif8LkmBCnqIDNTpEReTQU/t9kNg=; b=YXrCDRfbYNAUo5rvqQnRsKryJF+23pgBPpa1HZtwA1oTy1eBxIciw47Q1sPd+yZcJp EYvAcnpuDyIviFetjRNoI8aIqD/nnVAtlv0kMi6uoMBv/cQ0DAZ0Xfr6EWcinvHPzJD3 zdqL6ZAKgGXzf70tijHwFimHBGm980IT2vpK11qErQ8j+9VH8W1MLq8XzmyaKKPzufrs Ych54ZHZW0jmydsZXm7CGVHguYkWVgGz4IQwzClFe34pCxKmN2qxkoMFZbMvQCm8gByU 80q4CK64BY6Gu1kspo6QL+m4HQLPhQNJOvRFEJTheINHy09ufduIZin7T/LRcILfjg/n O86Q== Received: by 10.180.86.202 with SMTP id r10mr3394767wiz.12.1349972457860; Thu, 11 Oct 2012 09:20:57 -0700 (PDT) Received: from ndenevsa.sf.moneybookers.net (g1.moneybookers.com. [217.18.249.148]) by mx.google.com with ESMTPS id b3sm35970004wie.0.2012.10.11.09.20.54 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 11 Oct 2012 09:20:56 -0700 (PDT) Subject: Re: NFS server bottlenecks Mime-Version: 1.0 (Mac OS X Mail 6.1 \(1498\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com> Date: Thu, 11 Oct 2012 19:20:53 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com> References: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com> To: "freebsd-hackers@freebsd.org" X-Mailer: Apple Mail (2.1498) Cc: rmacklem@freebsd.org, Garrett Wollman X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 16:21:05 -0000 On Oct 11, 2012, at 8:46 AM, Nikolay Denev wrote: >=20 > On Oct 11, 2012, at 1:09 AM, Rick Macklem = wrote: >=20 >> Nikolay Denev wrote: >>> On Oct 10, 2012, at 3:18 AM, Rick Macklem >>> wrote: >>>=20 >>>> Nikolay Denev wrote: >>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem >>>>> wrote: >>>>>=20 >>>>>> Garrett Wollman wrote: >>>>>>> <>>>>>> said: >>>>>>>=20 >>>>>>>>> Simple: just use a sepatate mutex for each list that a cache >>>>>>>>> entry >>>>>>>>> is on, rather than a global lock for everything. This would >>>>>>>>> reduce >>>>>>>>> the mutex contention, but I'm not sure how significantly since >>>>>>>>> I >>>>>>>>> don't have the means to measure it yet. >>>>>>>>>=20 >>>>>>>> Well, since the cache trimming is removing entries from the >>>>>>>> lists, >>>>>>>> I >>>>>>>> don't >>>>>>>> see how that can be done with a global lock for list updates? 
>>>>>>>=20 >>>>>>> Well, the global lock is what we have now, but the cache = trimming >>>>>>> process only looks at one list at a time, so not locking the = list >>>>>>> that >>>>>>> isn't being iterated over probably wouldn't hurt, unless there's >>>>>>> some >>>>>>> mechanism (that I didn't see) for entries to move from one list >>>>>>> to >>>>>>> another. Note that I'm considering each hash bucket a separate >>>>>>> "list". (One issue to worry about in that case would be >>>>>>> cache-line >>>>>>> contention in the array of hash buckets; perhaps >>>>>>> NFSRVCACHE_HASHSIZE >>>>>>> ought to be increased to reduce that.) >>>>>>>=20 >>>>>> Yea, a separate mutex for each hash list might help. There is = also >>>>>> the >>>>>> LRU list that all entries end up on, that gets used by the >>>>>> trimming >>>>>> code. >>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't >>>>>> looked >>>>>> at >>>>>> it in a while.) >>>>>>=20 >>>>>> Also, increasing the hash table size is probably a good idea, >>>>>> especially >>>>>> if you reduce how aggressively the cache is trimmed. >>>>>>=20 >>>>>>>> Only doing it once/sec would result in a very large cache when >>>>>>>> bursts of >>>>>>>> traffic arrives. >>>>>>>=20 >>>>>>> My servers have 96 GB of memory so that's not a big deal for me. >>>>>>>=20 >>>>>> This code was originally "production tested" on a server with >>>>>> 1Gbyte, >>>>>> so times have changed a bit;-) >>>>>>=20 >>>>>>>> I'm not sure I see why doing it as a separate thread will >>>>>>>> improve >>>>>>>> things. >>>>>>>> There are N nfsd threads already (N can be bumped up to 256 if >>>>>>>> you >>>>>>>> wish) >>>>>>>> and having a bunch more "cache trimming threads" would just >>>>>>>> increase >>>>>>>> contention, wouldn't it? >>>>>>>=20 >>>>>>> Only one cache-trimming thread. The cache trim holds the = (global) >>>>>>> mutex for much longer than any individual nfsd service thread = has >>>>>>> any >>>>>>> need to, and having N threads doing that in parallel is why it's >>>>>>> so >>>>>>> heavily contended. If there's only one thread doing the trim, >>>>>>> then >>>>>>> the nfsd service threads aren't spending time either contending >>>>>>> on >>>>>>> the >>>>>>> mutex (it will be held less frequently and for shorter periods). >>>>>>>=20 >>>>>> I think the little drc2.patch which will keep the nfsd threads >>>>>> from >>>>>> acquiring the mutex and doing the trimming most of the time, = might >>>>>> be >>>>>> sufficient. I still don't see why a separate trimming thread will >>>>>> be >>>>>> an advantage. I'd also be worried that the one cache trimming >>>>>> thread >>>>>> won't get the job done soon enough. >>>>>>=20 >>>>>> When I did production testing on a 1Gbyte server that saw a peak >>>>>> load of about 100RPCs/sec, it was necessary to trim aggressively. >>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no >>>>>> longer relevant, I recently recall someone trying to run FreeBSD >>>>>> on a i486, although I doubt they wanted to run the nfsd on it.) >>>>>>=20 >>>>>>>> The only negative effect I can think of w.r.t. having the nfsd >>>>>>>> threads doing it would be a (I believe negligible) increase in >>>>>>>> RPC >>>>>>>> response times (the time the nfsd thread spends trimming the >>>>>>>> cache). >>>>>>>> As noted, I think this time would be negligible compared to = disk >>>>>>>> I/O >>>>>>>> and network transit times in the total RPC response time? 
>>>>>>>=20 >>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and >>>>>>> 10G >>>>>>> network connectivity, spinning on a contended mutex takes a >>>>>>> significant amount of CPU time. (For the current design of the >>>>>>> NFS >>>>>>> server, it may actually be a win to turn off adaptive mutexes -- >>>>>>> I >>>>>>> should give that a try once I'm able to do more testing.) >>>>>>>=20 >>>>>> Have fun with it. Let me know when you have what you think is a >>>>>> good >>>>>> patch. >>>>>>=20 >>>>>> rick >>>>>>=20 >>>>>>> -GAWollman >>>>>>> _______________________________________________ >>>>>>> freebsd-hackers@freebsd.org mailing list >>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>>>>>> To unsubscribe, send any mail to >>>>>>> "freebsd-hackers-unsubscribe@freebsd.org" >>>>>> _______________________________________________ >>>>>> freebsd-fs@freebsd.org mailing list >>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs >>>>>> To unsubscribe, send any mail to >>>>>> "freebsd-fs-unsubscribe@freebsd.org" >>>>>=20 >>>>> My quest for IOPS over NFS continues :) >>>>> So far I'm not able to achieve more than about 3000 8K read >>>>> requests >>>>> over NFS, >>>>> while the server locally gives much more. >>>>> And this is all from a file that is completely in ARC cache, no >>>>> disk >>>>> IO involved. >>>>>=20 >>>> Just out of curiousity, why do you use 8K reads instead of 64K >>>> reads. >>>> Since the RPC overhead (including the DRC functions) is per RPC, >>>> doing >>>> fewer larger RPCs should usually work better. (Sometimes large >>>> rsize/wsize >>>> values generate too large a burst of traffic for a network = interface >>>> to >>>> handle and then the rsize/wsize has to be decreased to avoid this >>>> issue.) >>>>=20 >>>> And, although this experiment seems useful for testing patches that >>>> try >>>> and reduce DRC CPU overheads, most "real" NFS servers will be doing >>>> disk >>>> I/O. >>>>=20 >>>=20 >>> This is the default blocksize the Oracle and probably most databases >>> use. >>> It uses also larger blocks, but for small random reads in OLTP >>> applications this is what is used. >>>=20 >> If the client is doing 8K reads, you could increase the read ahead >> "readahead=3DN" (N up to 16), to try and increase the bandwidth. >> (But if the CPU is 99% busy, then I don't think it will matter.) >=20 > I'll try to check if this is possible to be set, as we are testing not = only with the Linux NFS client, > but also with the Oracle's built in so called DirectNFS client that is = built in to the app. 
>=20 >>=20 >>>=20 >>>>> I've snatched some sample DTrace script from the net : [ >>>>> = http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes >>>>> ] >>>>>=20 >>>>> And modified it for our new NFS server : >>>>>=20 >>>>> #!/usr/sbin/dtrace -qs >>>>>=20 >>>>> fbt:kernel:nfsrvd_*:entry >>>>> { >>>>> self->ts =3D timestamp; >>>>> @counts[probefunc] =3D count(); >>>>> } >>>>>=20 >>>>> fbt:kernel:nfsrvd_*:return >>>>> / self->ts > 0 / >>>>> { >>>>> this->delta =3D (timestamp-self->ts)/1000000; >>>>> } >>>>>=20 >>>>> fbt:kernel:nfsrvd_*:return >>>>> / self->ts > 0 && this->delta > 100 / >>>>> { >>>>> @slow[probefunc, "ms"] =3D lquantize(this->delta, 100, 500, 50); >>>>> } >>>>>=20 >>>>> fbt:kernel:nfsrvd_*:return >>>>> / self->ts > 0 / >>>>> { >>>>> @dist[probefunc, "ms"] =3D quantize(this->delta); >>>>> self->ts =3D 0; >>>>> } >>>>>=20 >>>>> END >>>>> { >>>>> printf("\n"); >>>>> printa("function %-20s %@10d\n", @counts); >>>>> printf("\n"); >>>>> printa("function %s(), time in %s:%@d\n", @dist); >>>>> printf("\n"); >>>>> printa("function %s(), time in %s for >=3D 100 ms:%@d\n", @slow); >>>>> } >>>>>=20 >>>>> And here's a sample output from one or two minutes during the run >>>>> of >>>>> Oracle's ORION benchmark >>>>> tool from a Linux machine, on a 32G file on NFS mount over 10G >>>>> ethernet: >>>>>=20 >>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d >>>>> ^C >>>>>=20 >>>>> function nfsrvd_access 4 >>>>> function nfsrvd_statfs 10 >>>>> function nfsrvd_getattr 14 >>>>> function nfsrvd_commit 76 >>>>> function nfsrvd_sentcache 110048 >>>>> function nfsrvd_write 110048 >>>>> function nfsrvd_read 283648 >>>>> function nfsrvd_dorpc 393800 >>>>> function nfsrvd_getcache 393800 >>>>> function nfsrvd_rephead 393800 >>>>> function nfsrvd_updatecache 393800 >>>>>=20 >>>>> function nfsrvd_access(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 >>>>> 1 | 0 >>>>>=20 >>>>> function nfsrvd_statfs(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 >>>>> 1 | 0 >>>>>=20 >>>>> function nfsrvd_getattr(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 >>>>> 1 | 0 >>>>>=20 >>>>> function nfsrvd_sentcache(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 >>>>> 1 | 0 >>>>>=20 >>>>> function nfsrvd_rephead(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 >>>>> 1 | 0 >>>>>=20 >>>>> function nfsrvd_updatecache(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 >>>>> 1 | 0 >>>>>=20 >>>>> function nfsrvd_getcache(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 >>>>> 1 | 1 >>>>> 2 | 0 >>>>> 4 | 1 >>>>> 8 | 0 >>>>>=20 >>>>> function nfsrvd_write(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 >>>>> 1 | 5 >>>>> 2 | 4 >>>>> 4 | 0 >>>>>=20 >>>>> function nfsrvd_read(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 >>>>> 1 | 19 >>>>> 2 | 3 >>>>> 4 | 2 >>>>> 8 | 0 >>>>> 16 | 1 >>>>> 32 | 0 >>>>> 64 | 0 >>>>> 128 | 0 >>>>> 256 | 1 >>>>> 512 | 0 >>>>>=20 >>>>> function nfsrvd_commit(), time in ms: >>>>> value ------------- Distribution ------------- count >>>>> -1 | 0 >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 >>>>> 1 |@@@@@@@ 14 >>>>> 2 | 0 >>>>> 4 |@ 1 >>>>> 8 |@ 1 >>>>> 16 | 0 >>>>> 32 |@@@@@@@ 14 >>>>> 64 |@ 2 >>>>> 128 | 0 >>>>>=20 >>>>>=20 >>>>> function nfsrvd_commit(), time in ms for >=3D 100 ms: >>>>> value ------------- Distribution ------------- count >>>>> < 100 | 0 >>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 >>>>> 150 | 0 >>>>>=20 >>>>> function nfsrvd_read(), time in ms for >=3D 100 ms: >>>>> value ------------- Distribution ------------- count >>>>> 250 | 0 >>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 >>>>> 350 | 0 >>>>>=20 >>>>>=20 >>>>> Looks like the nfs server cache functions are quite fast, but >>>>> extremely frequently called. >>>>>=20 >>>> Yep, they are called for every RPC. >>>>=20 >>>> I may try coding up a patch that replaces the single mutex with >>>> one for each hash bucket, for TCP. >>>>=20 >>>> I'll post if/when I get this patch to a testing/review stage, rick >>>>=20 >>>=20 >>> Cool. >>>=20 >>> I've readjusted the precision of the dtrace script a bit, and I can >>> see >>> now the following three functions as taking most of the time : >>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache() >>>=20 >>> This was recorded during a oracle benchmark run called SLOB, which >>> caused 99% cpu load on the NFS server. >>>=20 >> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater? >> (Assuming the mounts are TCP ones.) >>=20 >> Have fun with it, rick >>=20 >=20 > I had upped it, but probably not enough. I'm now running with = vfs.nfsd.tcphighwater set > to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500. > So far it looks like good improvement as those functions no longer = show up in the dtrace script output. > I'll run some more benchmarks and testing today. >=20 > Thanks! >=20 >>>=20 >>>>> I hope someone can find this information useful. >>>>>=20 >>>>> _______________________________________________ >>>>> freebsd-hackers@freebsd.org mailing list >>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>>>> To unsubscribe, send any mail to >>>>> "freebsd-hackers-unsubscribe@freebsd.org" >=20 I haven't had the opportunity today to run more DB tests over NFS as the = DBA was busy with something else, however I tested a bit the large file transfers. And I'm seeing something strange probably not only NFS but also ZFS and = ARC related. When I first tested the drc2.patch I reported a huge bandwidth = improvement, but now I think that this was probably because of the machine freshly = rebooted instead of just the patch. The patch surely improved things, especially CPU utilization combined = with the increased cache. But today I'm again having a file completely cached in ZFS's ARC cache, = which when transferred over NFS reaches about 300MB/s, when at some tests it reached 900+MB/s (as = reported in my first email). The file locally can be read at about 3GB/s as reported by dd. 
Local: [17:51]root@goliath:/tank/spa_db/undo# dd if=3Ddata.dbf of=3D/dev/null = bs=3D1M =20 30720+1 records in 30720+1 records out 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec) Over NFS: [17:48]root@spa:/mnt/spa_db/undo# dd if=3Ddata.dbf of=3D/dev/null bs=3D1M = =20 30720+1 records in 30720+1 records out 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s The machines are almost idle during this transfer and I can't see a = reason why it can't reach the full bandwith when it's just reading it from RAM. I've tried again tracing with DTrace to see what's happening with this = script : fbt:kernel:nfs*:entry { this->ts =3D timestamp; @counts[probefunc] =3D count(); } fbt:kernel:nfs*:return / this->ts > 0 / { @time[probefunc] =3D avg(timestamp - this->ts); } END { trunc(@counts, 10); trunc(@time, 10); printf("Top 10 called functions\n\n");=09 printa(@counts); printf("\n\nTop 10 slowest functions\n\n"); printa(@time); } And here's the result (several seconds during the dd test): Top 10 called functions nfsrc_freecache 88849 nfsrc_wanted 88849 nfsrv_fillattr 88849 nfsrv_postopattr 88849 nfsrvd_read 88849 nfsrvd_rephead 88849 nfsrvd_updatecache 88849 nfsvno_testexp 88849 nfsrc_trimcache 177697 nfsvno_getattr 177698 Top 10 slowest functions nfsd_excred 5673 nfsrc_freecache 5674 nfsrv_postopattr 5970 nfsrv_servertimer 6327 nfssvc_nfscommon 6596 nfsd_fhtovp 8000 nfsrvd_read 8380 nfssvc_program 92752 nfsvno_read 124979 nfsvno_fhtovp 1789523 I might try now to trace what nfsvno_fhtovp() is doing and where is = spending it's time. Any other ideas are welcome :) From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 16:47:17 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 10ED71A1; Thu, 11 Oct 2012 16:47:17 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-wg0-f50.google.com (mail-wg0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id 38EFA8FC12; Thu, 11 Oct 2012 16:47:15 +0000 (UTC) Received: by mail-wg0-f50.google.com with SMTP id 16so1507684wgi.31 for ; Thu, 11 Oct 2012 09:47:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; bh=42qujbjVn/cwcUzL/CgDbTaz1h3ZzyWh3DjKBwwM4uY=; b=iorcJB7paRfvs+FzuLp2MO7nQ4qKgUXLQ5wElkdpcqI6JYum3fR2EmxlRv8iHAljCI 774wMi3FVyuOf26Pe28tCUCrkoH+tC5LlJsDf/P7PfRfvwATCOMdG9XNVzBj6Vp5CJgu iWRdx41qemywW+Bb0l11DodMtmR5OrvdGEXBNvH172g4Veldl2dxkE6Ca8ZpTNhrWUVR cQ9DXZOkDQzV4XopgdWEpiunM+b3ehAtEhbPL8Sv2wRcHjcdZ5NJuZA7bzbLYmH9HEcq KnJSv0F1tKYD+ips391ZdKeB5Hs+ccDD1POcAM6mEaIWZVRTmdpVGfjGnfkEgvB4kJET 2+4A== Received: by 10.180.87.132 with SMTP id ay4mr3583965wib.5.1349974034907; Thu, 11 Oct 2012 09:47:14 -0700 (PDT) Received: from ndenevsa.sf.moneybookers.net (g1.moneybookers.com. 
[217.18.249.148]) by mx.google.com with ESMTPS id gm7sm9563488wib.10.2012.10.11.09.47.13 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 11 Oct 2012 09:47:14 -0700 (PDT) Subject: Re: NFS server bottlenecks Mime-Version: 1.0 (Mac OS X Mail 6.1 \(1498\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com> Date: Thu, 11 Oct 2012 19:47:12 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: References: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com> <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com> To: "freebsd-hackers@freebsd.org" X-Mailer: Apple Mail (2.1498) Cc: rmacklem@freebsd.org, Garrett Wollman X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 16:47:17 -0000 On Oct 11, 2012, at 7:20 PM, Nikolay Denev wrote: > On Oct 11, 2012, at 8:46 AM, Nikolay Denev wrote: >=20 >>=20 >> On Oct 11, 2012, at 1:09 AM, Rick Macklem = wrote: >>=20 >>> Nikolay Denev wrote: >>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem >>>> wrote: >>>>=20 >>>>> Nikolay Denev wrote: >>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem >>>>>> wrote: >>>>>>=20 >>>>>>> Garrett Wollman wrote: >>>>>>>> <>>>>>>> said: >>>>>>>>=20 >>>>>>>>>> Simple: just use a sepatate mutex for each list that a cache >>>>>>>>>> entry >>>>>>>>>> is on, rather than a global lock for everything. This would >>>>>>>>>> reduce >>>>>>>>>> the mutex contention, but I'm not sure how significantly = since >>>>>>>>>> I >>>>>>>>>> don't have the means to measure it yet. >>>>>>>>>>=20 >>>>>>>>> Well, since the cache trimming is removing entries from the >>>>>>>>> lists, >>>>>>>>> I >>>>>>>>> don't >>>>>>>>> see how that can be done with a global lock for list updates? >>>>>>>>=20 >>>>>>>> Well, the global lock is what we have now, but the cache = trimming >>>>>>>> process only looks at one list at a time, so not locking the = list >>>>>>>> that >>>>>>>> isn't being iterated over probably wouldn't hurt, unless = there's >>>>>>>> some >>>>>>>> mechanism (that I didn't see) for entries to move from one list >>>>>>>> to >>>>>>>> another. Note that I'm considering each hash bucket a separate >>>>>>>> "list". (One issue to worry about in that case would be >>>>>>>> cache-line >>>>>>>> contention in the array of hash buckets; perhaps >>>>>>>> NFSRVCACHE_HASHSIZE >>>>>>>> ought to be increased to reduce that.) >>>>>>>>=20 >>>>>>> Yea, a separate mutex for each hash list might help. There is = also >>>>>>> the >>>>>>> LRU list that all entries end up on, that gets used by the >>>>>>> trimming >>>>>>> code. >>>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't >>>>>>> looked >>>>>>> at >>>>>>> it in a while.) >>>>>>>=20 >>>>>>> Also, increasing the hash table size is probably a good idea, >>>>>>> especially >>>>>>> if you reduce how aggressively the cache is trimmed. >>>>>>>=20 >>>>>>>>> Only doing it once/sec would result in a very large cache when >>>>>>>>> bursts of >>>>>>>>> traffic arrives. >>>>>>>>=20 >>>>>>>> My servers have 96 GB of memory so that's not a big deal for = me. 
>>>>>>>>=20 >>>>>>> This code was originally "production tested" on a server with >>>>>>> 1Gbyte, >>>>>>> so times have changed a bit;-) >>>>>>>=20 >>>>>>>>> I'm not sure I see why doing it as a separate thread will >>>>>>>>> improve >>>>>>>>> things. >>>>>>>>> There are N nfsd threads already (N can be bumped up to 256 if >>>>>>>>> you >>>>>>>>> wish) >>>>>>>>> and having a bunch more "cache trimming threads" would just >>>>>>>>> increase >>>>>>>>> contention, wouldn't it? >>>>>>>>=20 >>>>>>>> Only one cache-trimming thread. The cache trim holds the = (global) >>>>>>>> mutex for much longer than any individual nfsd service thread = has >>>>>>>> any >>>>>>>> need to, and having N threads doing that in parallel is why = it's >>>>>>>> so >>>>>>>> heavily contended. If there's only one thread doing the trim, >>>>>>>> then >>>>>>>> the nfsd service threads aren't spending time either contending >>>>>>>> on >>>>>>>> the >>>>>>>> mutex (it will be held less frequently and for shorter = periods). >>>>>>>>=20 >>>>>>> I think the little drc2.patch which will keep the nfsd threads >>>>>>> from >>>>>>> acquiring the mutex and doing the trimming most of the time, = might >>>>>>> be >>>>>>> sufficient. I still don't see why a separate trimming thread = will >>>>>>> be >>>>>>> an advantage. I'd also be worried that the one cache trimming >>>>>>> thread >>>>>>> won't get the job done soon enough. >>>>>>>=20 >>>>>>> When I did production testing on a 1Gbyte server that saw a peak >>>>>>> load of about 100RPCs/sec, it was necessary to trim = aggressively. >>>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no >>>>>>> longer relevant, I recently recall someone trying to run FreeBSD >>>>>>> on a i486, although I doubt they wanted to run the nfsd on it.) >>>>>>>=20 >>>>>>>>> The only negative effect I can think of w.r.t. having the nfsd >>>>>>>>> threads doing it would be a (I believe negligible) increase in >>>>>>>>> RPC >>>>>>>>> response times (the time the nfsd thread spends trimming the >>>>>>>>> cache). >>>>>>>>> As noted, I think this time would be negligible compared to = disk >>>>>>>>> I/O >>>>>>>>> and network transit times in the total RPC response time? >>>>>>>>=20 >>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and >>>>>>>> 10G >>>>>>>> network connectivity, spinning on a contended mutex takes a >>>>>>>> significant amount of CPU time. (For the current design of the >>>>>>>> NFS >>>>>>>> server, it may actually be a win to turn off adaptive mutexes = -- >>>>>>>> I >>>>>>>> should give that a try once I'm able to do more testing.) >>>>>>>>=20 >>>>>>> Have fun with it. Let me know when you have what you think is a >>>>>>> good >>>>>>> patch. >>>>>>>=20 >>>>>>> rick >>>>>>>=20 >>>>>>>> -GAWollman >>>>>>>> _______________________________________________ >>>>>>>> freebsd-hackers@freebsd.org mailing list >>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>>>>>>> To unsubscribe, send any mail to >>>>>>>> "freebsd-hackers-unsubscribe@freebsd.org" >>>>>>> _______________________________________________ >>>>>>> freebsd-fs@freebsd.org mailing list >>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs >>>>>>> To unsubscribe, send any mail to >>>>>>> "freebsd-fs-unsubscribe@freebsd.org" >>>>>>=20 >>>>>> My quest for IOPS over NFS continues :) >>>>>> So far I'm not able to achieve more than about 3000 8K read >>>>>> requests >>>>>> over NFS, >>>>>> while the server locally gives much more. 
>>>>>> And this is all from a file that is completely in ARC cache, no >>>>>> disk >>>>>> IO involved. >>>>>>=20 >>>>> Just out of curiousity, why do you use 8K reads instead of 64K >>>>> reads. >>>>> Since the RPC overhead (including the DRC functions) is per RPC, >>>>> doing >>>>> fewer larger RPCs should usually work better. (Sometimes large >>>>> rsize/wsize >>>>> values generate too large a burst of traffic for a network = interface >>>>> to >>>>> handle and then the rsize/wsize has to be decreased to avoid this >>>>> issue.) >>>>>=20 >>>>> And, although this experiment seems useful for testing patches = that >>>>> try >>>>> and reduce DRC CPU overheads, most "real" NFS servers will be = doing >>>>> disk >>>>> I/O. >>>>>=20 >>>>=20 >>>> This is the default blocksize the Oracle and probably most = databases >>>> use. >>>> It uses also larger blocks, but for small random reads in OLTP >>>> applications this is what is used. >>>>=20 >>> If the client is doing 8K reads, you could increase the read ahead >>> "readahead=3DN" (N up to 16), to try and increase the bandwidth. >>> (But if the CPU is 99% busy, then I don't think it will matter.) >>=20 >> I'll try to check if this is possible to be set, as we are testing = not only with the Linux NFS client, >> but also with the Oracle's built in so called DirectNFS client that = is built in to the app. >>=20 >>>=20 >>>>=20 >>>>>> I've snatched some sample DTrace script from the net : [ >>>>>> = http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes >>>>>> ] >>>>>>=20 >>>>>> And modified it for our new NFS server : >>>>>>=20 >>>>>> #!/usr/sbin/dtrace -qs >>>>>>=20 >>>>>> fbt:kernel:nfsrvd_*:entry >>>>>> { >>>>>> self->ts =3D timestamp; >>>>>> @counts[probefunc] =3D count(); >>>>>> } >>>>>>=20 >>>>>> fbt:kernel:nfsrvd_*:return >>>>>> / self->ts > 0 / >>>>>> { >>>>>> this->delta =3D (timestamp-self->ts)/1000000; >>>>>> } >>>>>>=20 >>>>>> fbt:kernel:nfsrvd_*:return >>>>>> / self->ts > 0 && this->delta > 100 / >>>>>> { >>>>>> @slow[probefunc, "ms"] =3D lquantize(this->delta, 100, 500, 50); >>>>>> } >>>>>>=20 >>>>>> fbt:kernel:nfsrvd_*:return >>>>>> / self->ts > 0 / >>>>>> { >>>>>> @dist[probefunc, "ms"] =3D quantize(this->delta); >>>>>> self->ts =3D 0; >>>>>> } >>>>>>=20 >>>>>> END >>>>>> { >>>>>> printf("\n"); >>>>>> printa("function %-20s %@10d\n", @counts); >>>>>> printf("\n"); >>>>>> printa("function %s(), time in %s:%@d\n", @dist); >>>>>> printf("\n"); >>>>>> printa("function %s(), time in %s for >=3D 100 ms:%@d\n", @slow); >>>>>> } >>>>>>=20 >>>>>> And here's a sample output from one or two minutes during the run >>>>>> of >>>>>> Oracle's ORION benchmark >>>>>> tool from a Linux machine, on a 32G file on NFS mount over 10G >>>>>> ethernet: >>>>>>=20 >>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d >>>>>> ^C >>>>>>=20 >>>>>> function nfsrvd_access 4 >>>>>> function nfsrvd_statfs 10 >>>>>> function nfsrvd_getattr 14 >>>>>> function nfsrvd_commit 76 >>>>>> function nfsrvd_sentcache 110048 >>>>>> function nfsrvd_write 110048 >>>>>> function nfsrvd_read 283648 >>>>>> function nfsrvd_dorpc 393800 >>>>>> function nfsrvd_getcache 393800 >>>>>> function nfsrvd_rephead 393800 >>>>>> function nfsrvd_updatecache 393800 >>>>>>=20 >>>>>> function nfsrvd_access(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 >>>>>> 1 | 0 >>>>>>=20 >>>>>> function nfsrvd_statfs(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 >>>>>> 1 | 0 >>>>>>=20 >>>>>> function nfsrvd_getattr(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 >>>>>> 1 | 0 >>>>>>=20 >>>>>> function nfsrvd_sentcache(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 >>>>>> 1 | 0 >>>>>>=20 >>>>>> function nfsrvd_rephead(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 >>>>>> 1 | 0 >>>>>>=20 >>>>>> function nfsrvd_updatecache(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 >>>>>> 1 | 0 >>>>>>=20 >>>>>> function nfsrvd_getcache(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 >>>>>> 1 | 1 >>>>>> 2 | 0 >>>>>> 4 | 1 >>>>>> 8 | 0 >>>>>>=20 >>>>>> function nfsrvd_write(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 >>>>>> 1 | 5 >>>>>> 2 | 4 >>>>>> 4 | 0 >>>>>>=20 >>>>>> function nfsrvd_read(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 >>>>>> 1 | 19 >>>>>> 2 | 3 >>>>>> 4 | 2 >>>>>> 8 | 0 >>>>>> 16 | 1 >>>>>> 32 | 0 >>>>>> 64 | 0 >>>>>> 128 | 0 >>>>>> 256 | 1 >>>>>> 512 | 0 >>>>>>=20 >>>>>> function nfsrvd_commit(), time in ms: >>>>>> value ------------- Distribution ------------- count >>>>>> -1 | 0 >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 >>>>>> 1 |@@@@@@@ 14 >>>>>> 2 | 0 >>>>>> 4 |@ 1 >>>>>> 8 |@ 1 >>>>>> 16 | 0 >>>>>> 32 |@@@@@@@ 14 >>>>>> 64 |@ 2 >>>>>> 128 | 0 >>>>>>=20 >>>>>>=20 >>>>>> function nfsrvd_commit(), time in ms for >=3D 100 ms: >>>>>> value ------------- Distribution ------------- count >>>>>> < 100 | 0 >>>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 >>>>>> 150 | 0 >>>>>>=20 >>>>>> function nfsrvd_read(), time in ms for >=3D 100 ms: >>>>>> value ------------- Distribution ------------- count >>>>>> 250 | 0 >>>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 >>>>>> 350 | 0 >>>>>>=20 >>>>>>=20 >>>>>> Looks like the nfs server cache functions are quite fast, but >>>>>> extremely frequently called. >>>>>>=20 >>>>> Yep, they are called for every RPC. >>>>>=20 >>>>> I may try coding up a patch that replaces the single mutex with >>>>> one for each hash bucket, for TCP. >>>>>=20 >>>>> I'll post if/when I get this patch to a testing/review stage, rick >>>>>=20 >>>>=20 >>>> Cool. >>>>=20 >>>> I've readjusted the precision of the dtrace script a bit, and I can >>>> see >>>> now the following three functions as taking most of the time : >>>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache() >>>>=20 >>>> This was recorded during a oracle benchmark run called SLOB, which >>>> caused 99% cpu load on the NFS server. >>>>=20 >>> Even with the drc2.patch and a large value for = vfs.nfsd.tcphighwater? >>> (Assuming the mounts are TCP ones.) >>>=20 >>> Have fun with it, rick >>>=20 >>=20 >> I had upped it, but probably not enough. I'm now running with = vfs.nfsd.tcphighwater set >> to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500. 
>> So far it looks like good improvement as those functions no longer = show up in the dtrace script output. >> I'll run some more benchmarks and testing today. >>=20 >> Thanks! >>=20 >>>>=20 >>>>>> I hope someone can find this information useful. >>>>>>=20 >>>>>> _______________________________________________ >>>>>> freebsd-hackers@freebsd.org mailing list >>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>>>>> To unsubscribe, send any mail to >>>>>> "freebsd-hackers-unsubscribe@freebsd.org" >>=20 >=20 > I haven't had the opportunity today to run more DB tests over NFS as = the DBA was busy with something else, > however I tested a bit the large file transfers. > And I'm seeing something strange probably not only NFS but also ZFS = and ARC related. >=20 > When I first tested the drc2.patch I reported a huge bandwidth = improvement, > but now I think that this was probably because of the machine freshly = rebooted instead of just the patch. > The patch surely improved things, especially CPU utilization combined = with the increased cache. > But today I'm again having a file completely cached in ZFS's ARC = cache, which when transferred over NFS > reaches about 300MB/s, when at some tests it reached 900+MB/s (as = reported in my first email). > The file locally can be read at about 3GB/s as reported by dd. >=20 > Local: > [17:51]root@goliath:/tank/spa_db/undo# dd if=3Ddata.dbf of=3D/dev/null = bs=3D1M =20 > 30720+1 records in > 30720+1 records out > 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec) >=20 > Over NFS: > [17:48]root@spa:/mnt/spa_db/undo# dd if=3Ddata.dbf of=3D/dev/null = bs=3D1M =20 > 30720+1 records in > 30720+1 records out > 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s >=20 > The machines are almost idle during this transfer and I can't see a = reason why it can't reach the full bandwith when it's > just reading it from RAM. >=20 > I've tried again tracing with DTrace to see what's happening with this = script : >=20 > fbt:kernel:nfs*:entry > { > this->ts =3D timestamp; > @counts[probefunc] =3D count(); > } >=20 > fbt:kernel:nfs*:return > / this->ts > 0 / > { > @time[probefunc] =3D avg(timestamp - this->ts); > } >=20 > END > { > trunc(@counts, 10); > trunc(@time, 10); > printf("Top 10 called functions\n\n");=09 > printa(@counts); > printf("\n\nTop 10 slowest functions\n\n"); > printa(@time); > } >=20 > And here's the result (several seconds during the dd test): >=20 > Top 10 called functions > nfsrc_freecache 88849 > nfsrc_wanted 88849 > nfsrv_fillattr 88849 > nfsrv_postopattr 88849 > nfsrvd_read 88849 > nfsrvd_rephead 88849 > nfsrvd_updatecache 88849 > nfsvno_testexp 88849 > nfsrc_trimcache 177697 > nfsvno_getattr 177698 >=20 > Top 10 slowest functions > nfsd_excred 5673 > nfsrc_freecache 5674 > nfsrv_postopattr 5970 > nfsrv_servertimer 6327 > nfssvc_nfscommon 6596 > nfsd_fhtovp 8000 > nfsrvd_read 8380 > nfssvc_program 92752 > nfsvno_read 124979 > nfsvno_fhtovp 1789523 >=20 > I might try now to trace what nfsvno_fhtovp() is doing and where is = spending it's time. 
>=20 > Any other ideas are welcome :) >=20 To take the network out of the equation I redid the test by mounting the = same filesystem over NFS on the server: [18:23]root@goliath:~# mount -t nfs -o = rw,hard,intr,tcp,nfsv3,rsize=3D1048576,wsize=3D1048576 = localhost:/tank/spa_db/undo /mnt [18:24]root@goliath:~# dd if=3D/mnt/data.dbf of=3D/dev/null bs=3D1M=20 30720+1 records in 30720+1 records out 32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec) [18:25]root@goliath:~# dd if=3D/mnt/data.dbf of=3D/dev/null bs=3D1M 30720+1 records in 30720+1 records out 32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec) During the first run I saw several nfsd threads in top, along with dd = and again zero disk I/O. There was increase in memory usage because of the double buffering = ARC->buffercahe. The second run was with all of the nfsd threads totally idle, and read = directly from the buffercache. From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 21:04:36 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 85FC8A1A; Thu, 11 Oct 2012 21:04:36 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id D55278FC0C; Thu, 11 Oct 2012 21:04:35 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAN6MclCDaFvO/2dsb2JhbABFFoV7uhmCIAEBAQMBAQEBIAQnIAYFBRYOCgICDRkCIwYBCSYGCAIFBAEcAQOHUgMJBgumTIgWDYlUgSGJSGYahGSBEgOTPliBVYEVig6FC4MJgUc0 X-IronPort-AV: E=Sophos;i="4.80,574,1344225600"; d="scan'208";a="183270578" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 11 Oct 2012 17:04:28 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 1F7CCB40FB; Thu, 11 Oct 2012 17:04:28 -0400 (EDT) Date: Thu, 11 Oct 2012 17:04:28 -0400 (EDT) From: Rick Macklem To: Nikolay Denev Message-ID: <1517976814.2112914.1349989468096.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com> Subject: Re: NFS server bottlenecks MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692) Cc: rmacklem@freebsd.org, Garrett Wollman , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 21:04:36 -0000 Nikolay Denev wrote: > On Oct 11, 2012, at 8:46 AM, Nikolay Denev wrote: > > > > > On Oct 11, 2012, at 1:09 AM, Rick Macklem > > wrote: > > > >> Nikolay Denev wrote: > >>> On Oct 10, 2012, at 3:18 AM, Rick Macklem > >>> wrote: > >>> > >>>> Nikolay Denev wrote: > >>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem > >>>>> wrote: > >>>>> > >>>>>> Garrett Wollman wrote: > >>>>>>> < >>>>>>> said: > >>>>>>> > >>>>>>>>> Simple: just use a sepatate mutex for each list that a cache > >>>>>>>>> entry > >>>>>>>>> is on, rather than a global lock for everything. 
This would > >>>>>>>>> reduce > >>>>>>>>> the mutex contention, but I'm not sure how significantly > >>>>>>>>> since > >>>>>>>>> I > >>>>>>>>> don't have the means to measure it yet. > >>>>>>>>> > >>>>>>>> Well, since the cache trimming is removing entries from the > >>>>>>>> lists, > >>>>>>>> I > >>>>>>>> don't > >>>>>>>> see how that can be done with a global lock for list updates? > >>>>>>> > >>>>>>> Well, the global lock is what we have now, but the cache > >>>>>>> trimming > >>>>>>> process only looks at one list at a time, so not locking the > >>>>>>> list > >>>>>>> that > >>>>>>> isn't being iterated over probably wouldn't hurt, unless > >>>>>>> there's > >>>>>>> some > >>>>>>> mechanism (that I didn't see) for entries to move from one > >>>>>>> list > >>>>>>> to > >>>>>>> another. Note that I'm considering each hash bucket a separate > >>>>>>> "list". (One issue to worry about in that case would be > >>>>>>> cache-line > >>>>>>> contention in the array of hash buckets; perhaps > >>>>>>> NFSRVCACHE_HASHSIZE > >>>>>>> ought to be increased to reduce that.) > >>>>>>> > >>>>>> Yea, a separate mutex for each hash list might help. There is > >>>>>> also > >>>>>> the > >>>>>> LRU list that all entries end up on, that gets used by the > >>>>>> trimming > >>>>>> code. > >>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't > >>>>>> looked > >>>>>> at > >>>>>> it in a while.) > >>>>>> > >>>>>> Also, increasing the hash table size is probably a good idea, > >>>>>> especially > >>>>>> if you reduce how aggressively the cache is trimmed. > >>>>>> > >>>>>>>> Only doing it once/sec would result in a very large cache > >>>>>>>> when > >>>>>>>> bursts of > >>>>>>>> traffic arrives. > >>>>>>> > >>>>>>> My servers have 96 GB of memory so that's not a big deal for > >>>>>>> me. > >>>>>>> > >>>>>> This code was originally "production tested" on a server with > >>>>>> 1Gbyte, > >>>>>> so times have changed a bit;-) > >>>>>> > >>>>>>>> I'm not sure I see why doing it as a separate thread will > >>>>>>>> improve > >>>>>>>> things. > >>>>>>>> There are N nfsd threads already (N can be bumped up to 256 > >>>>>>>> if > >>>>>>>> you > >>>>>>>> wish) > >>>>>>>> and having a bunch more "cache trimming threads" would just > >>>>>>>> increase > >>>>>>>> contention, wouldn't it? > >>>>>>> > >>>>>>> Only one cache-trimming thread. The cache trim holds the > >>>>>>> (global) > >>>>>>> mutex for much longer than any individual nfsd service thread > >>>>>>> has > >>>>>>> any > >>>>>>> need to, and having N threads doing that in parallel is why > >>>>>>> it's > >>>>>>> so > >>>>>>> heavily contended. If there's only one thread doing the trim, > >>>>>>> then > >>>>>>> the nfsd service threads aren't spending time either > >>>>>>> contending > >>>>>>> on > >>>>>>> the > >>>>>>> mutex (it will be held less frequently and for shorter > >>>>>>> periods). > >>>>>>> > >>>>>> I think the little drc2.patch which will keep the nfsd threads > >>>>>> from > >>>>>> acquiring the mutex and doing the trimming most of the time, > >>>>>> might > >>>>>> be > >>>>>> sufficient. I still don't see why a separate trimming thread > >>>>>> will > >>>>>> be > >>>>>> an advantage. I'd also be worried that the one cache trimming > >>>>>> thread > >>>>>> won't get the job done soon enough. > >>>>>> > >>>>>> When I did production testing on a 1Gbyte server that saw a > >>>>>> peak > >>>>>> load of about 100RPCs/sec, it was necessary to trim > >>>>>> aggressively. 
> >>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no > >>>>>> longer relevant, I recently recall someone trying to run > >>>>>> FreeBSD > >>>>>> on a i486, although I doubt they wanted to run the nfsd on it.) > >>>>>> > >>>>>>>> The only negative effect I can think of w.r.t. having the > >>>>>>>> nfsd > >>>>>>>> threads doing it would be a (I believe negligible) increase > >>>>>>>> in > >>>>>>>> RPC > >>>>>>>> response times (the time the nfsd thread spends trimming the > >>>>>>>> cache). > >>>>>>>> As noted, I think this time would be negligible compared to > >>>>>>>> disk > >>>>>>>> I/O > >>>>>>>> and network transit times in the total RPC response time? > >>>>>>> > >>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and > >>>>>>> 10G > >>>>>>> network connectivity, spinning on a contended mutex takes a > >>>>>>> significant amount of CPU time. (For the current design of the > >>>>>>> NFS > >>>>>>> server, it may actually be a win to turn off adaptive mutexes > >>>>>>> -- > >>>>>>> I > >>>>>>> should give that a try once I'm able to do more testing.) > >>>>>>> > >>>>>> Have fun with it. Let me know when you have what you think is a > >>>>>> good > >>>>>> patch. > >>>>>> > >>>>>> rick > >>>>>> > >>>>>>> -GAWollman > >>>>>>> _______________________________________________ > >>>>>>> freebsd-hackers@freebsd.org mailing list > >>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >>>>>>> To unsubscribe, send any mail to > >>>>>>> "freebsd-hackers-unsubscribe@freebsd.org" > >>>>>> _______________________________________________ > >>>>>> freebsd-fs@freebsd.org mailing list > >>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs > >>>>>> To unsubscribe, send any mail to > >>>>>> "freebsd-fs-unsubscribe@freebsd.org" > >>>>> > >>>>> My quest for IOPS over NFS continues :) > >>>>> So far I'm not able to achieve more than about 3000 8K read > >>>>> requests > >>>>> over NFS, > >>>>> while the server locally gives much more. > >>>>> And this is all from a file that is completely in ARC cache, no > >>>>> disk > >>>>> IO involved. > >>>>> > >>>> Just out of curiousity, why do you use 8K reads instead of 64K > >>>> reads. > >>>> Since the RPC overhead (including the DRC functions) is per RPC, > >>>> doing > >>>> fewer larger RPCs should usually work better. (Sometimes large > >>>> rsize/wsize > >>>> values generate too large a burst of traffic for a network > >>>> interface > >>>> to > >>>> handle and then the rsize/wsize has to be decreased to avoid this > >>>> issue.) > >>>> > >>>> And, although this experiment seems useful for testing patches > >>>> that > >>>> try > >>>> and reduce DRC CPU overheads, most "real" NFS servers will be > >>>> doing > >>>> disk > >>>> I/O. > >>>> > >>> > >>> This is the default blocksize the Oracle and probably most > >>> databases > >>> use. > >>> It uses also larger blocks, but for small random reads in OLTP > >>> applications this is what is used. > >>> > >> If the client is doing 8K reads, you could increase the read ahead > >> "readahead=N" (N up to 16), to try and increase the bandwidth. > >> (But if the CPU is 99% busy, then I don't think it will matter.) > > > > I'll try to check if this is possible to be set, as we are testing > > not only with the Linux NFS client, > > but also with the Oracle's built in so called DirectNFS client that > > is built in to the app. 
> > > >> > >>> > >>>>> I've snatched some sample DTrace script from the net : [ > >>>>> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes > >>>>> ] > >>>>> > >>>>> And modified it for our new NFS server : > >>>>> > >>>>> #!/usr/sbin/dtrace -qs > >>>>> > >>>>> fbt:kernel:nfsrvd_*:entry > >>>>> { > >>>>> self->ts = timestamp; > >>>>> @counts[probefunc] = count(); > >>>>> } > >>>>> > >>>>> fbt:kernel:nfsrvd_*:return > >>>>> / self->ts > 0 / > >>>>> { > >>>>> this->delta = (timestamp-self->ts)/1000000; > >>>>> } > >>>>> > >>>>> fbt:kernel:nfsrvd_*:return > >>>>> / self->ts > 0 && this->delta > 100 / > >>>>> { > >>>>> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50); > >>>>> } > >>>>> > >>>>> fbt:kernel:nfsrvd_*:return > >>>>> / self->ts > 0 / > >>>>> { > >>>>> @dist[probefunc, "ms"] = quantize(this->delta); > >>>>> self->ts = 0; > >>>>> } > >>>>> > >>>>> END > >>>>> { > >>>>> printf("\n"); > >>>>> printa("function %-20s %@10d\n", @counts); > >>>>> printf("\n"); > >>>>> printa("function %s(), time in %s:%@d\n", @dist); > >>>>> printf("\n"); > >>>>> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow); > >>>>> } > >>>>> > >>>>> And here's a sample output from one or two minutes during the > >>>>> run > >>>>> of > >>>>> Oracle's ORION benchmark > >>>>> tool from a Linux machine, on a 32G file on NFS mount over 10G > >>>>> ethernet: > >>>>> > >>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d > >>>>> ^C > >>>>> > >>>>> function nfsrvd_access 4 > >>>>> function nfsrvd_statfs 10 > >>>>> function nfsrvd_getattr 14 > >>>>> function nfsrvd_commit 76 > >>>>> function nfsrvd_sentcache 110048 > >>>>> function nfsrvd_write 110048 > >>>>> function nfsrvd_read 283648 > >>>>> function nfsrvd_dorpc 393800 > >>>>> function nfsrvd_getcache 393800 > >>>>> function nfsrvd_rephead 393800 > >>>>> function nfsrvd_updatecache 393800 > >>>>> > >>>>> function nfsrvd_access(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 > >>>>> 1 | 0 > >>>>> > >>>>> function nfsrvd_statfs(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 > >>>>> 1 | 0 > >>>>> > >>>>> function nfsrvd_getattr(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 > >>>>> 1 | 0 > >>>>> > >>>>> function nfsrvd_sentcache(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 > >>>>> 1 | 0 > >>>>> > >>>>> function nfsrvd_rephead(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > >>>>> 1 | 0 > >>>>> > >>>>> function nfsrvd_updatecache(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > >>>>> 1 | 0 > >>>>> > >>>>> function nfsrvd_getcache(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 > >>>>> 1 | 1 > >>>>> 2 | 0 > >>>>> 4 | 1 > >>>>> 8 | 0 > >>>>> > >>>>> function nfsrvd_write(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 > >>>>> 1 | 5 > >>>>> 2 
| 4 > >>>>> 4 | 0 > >>>>> > >>>>> function nfsrvd_read(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 > >>>>> 1 | 19 > >>>>> 2 | 3 > >>>>> 4 | 2 > >>>>> 8 | 0 > >>>>> 16 | 1 > >>>>> 32 | 0 > >>>>> 64 | 0 > >>>>> 128 | 0 > >>>>> 256 | 1 > >>>>> 512 | 0 > >>>>> > >>>>> function nfsrvd_commit(), time in ms: > >>>>> value ------------- Distribution ------------- count > >>>>> -1 | 0 > >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 > >>>>> 1 |@@@@@@@ 14 > >>>>> 2 | 0 > >>>>> 4 |@ 1 > >>>>> 8 |@ 1 > >>>>> 16 | 0 > >>>>> 32 |@@@@@@@ 14 > >>>>> 64 |@ 2 > >>>>> 128 | 0 > >>>>> > >>>>> > >>>>> function nfsrvd_commit(), time in ms for >= 100 ms: > >>>>> value ------------- Distribution ------------- count > >>>>> < 100 | 0 > >>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > >>>>> 150 | 0 > >>>>> > >>>>> function nfsrvd_read(), time in ms for >= 100 ms: > >>>>> value ------------- Distribution ------------- count > >>>>> 250 | 0 > >>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > >>>>> 350 | 0 > >>>>> > >>>>> > >>>>> Looks like the nfs server cache functions are quite fast, but > >>>>> extremely frequently called. > >>>>> > >>>> Yep, they are called for every RPC. > >>>> > >>>> I may try coding up a patch that replaces the single mutex with > >>>> one for each hash bucket, for TCP. > >>>> > >>>> I'll post if/when I get this patch to a testing/review stage, > >>>> rick > >>>> > >>> > >>> Cool. > >>> > >>> I've readjusted the precision of the dtrace script a bit, and I > >>> can > >>> see > >>> now the following three functions as taking most of the time : > >>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache() > >>> > >>> This was recorded during a oracle benchmark run called SLOB, which > >>> caused 99% cpu load on the NFS server. > >>> > >> Even with the drc2.patch and a large value for > >> vfs.nfsd.tcphighwater? > >> (Assuming the mounts are TCP ones.) > >> > >> Have fun with it, rick > >> > > > > I had upped it, but probably not enough. I'm now running with > > vfs.nfsd.tcphighwater set > > to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500. > > So far it looks like good improvement as those functions no longer > > show up in the dtrace script output. > > I'll run some more benchmarks and testing today. > > > > Thanks! > > > >>> > >>>>> I hope someone can find this information useful. > >>>>> > >>>>> _______________________________________________ > >>>>> freebsd-hackers@freebsd.org mailing list > >>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >>>>> To unsubscribe, send any mail to > >>>>> "freebsd-hackers-unsubscribe@freebsd.org" > > > > I haven't had the opportunity today to run more DB tests over NFS as > the DBA was busy with something else, > however I tested a bit the large file transfers. > And I'm seeing something strange probably not only NFS but also ZFS > and ARC related. > > When I first tested the drc2.patch I reported a huge bandwidth > improvement, > but now I think that this was probably because of the machine freshly > rebooted instead of just the patch. > The patch surely improved things, especially CPU utilization combined > with the increased cache. > But today I'm again having a file completely cached in ZFS's ARC > cache, which when transferred over NFS > reaches about 300MB/s, when at some tests it reached 900+MB/s (as > reported in my first email). > The file locally can be read at about 3GB/s as reported by dd. 
> > Local: > [17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null > bs=1M > 30720+1 records in > 30720+1 records out > 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec) > > Over NFS: > [17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M > 30720+1 records in > 30720+1 records out > 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s > > The machines are almost idle during this transfer and I can't see a > reason why it can't reach the full bandwith when it's > just reading it from RAM. > Well, one thing is that you must fill the network pipe with bits. That means that "rsize * readahead" must be >= "data rate * transit delay" (both in same units, such as bytes/sec). For example (given a 10Gb/s network): 65536 * 16 = 1Mbyte - a 10Gbps rate is about 1200Mbytes/sec --> this should fill the network "pipe" so long as the RPC transit delay is less than 1/1200th of a sec (or somewhat less than 1msec) If you were using rsize=8192 and the default readahead=2: 8192 * 2 = 16Kbytes - a 10Gbps rate is about 1200000Kbytes/sec --> the network pipe would be full if the RPC transit delay is less than 16/1200000 (or a little over 10usec, unlikely for any real network, I think?) I don't know what your RPC transit delay is (half of a NULL RPC RTT, but I don't know that, either;-). That was why I suggested "readahead=16". I'd also use "rsize=65536", but if that worked very poorly, I'd try decreasing rsize by half (32768, 16384..) to see what was optimal. rick ps: I'm assuming your dd runs are from a FreeBSD client and not the Oracle one. pss: Did I actually get the above calculations correct without a calculator;-) > I've tried again tracing with DTrace to see what's happening with this > script : > > fbt:kernel:nfs*:entry > { > this->ts = timestamp; > @counts[probefunc] = count(); > } > > fbt:kernel:nfs*:return > / this->ts > 0 / > { > @time[probefunc] = avg(timestamp - this->ts); > } > > END > { > trunc(@counts, 10); > trunc(@time, 10); > printf("Top 10 called functions\n\n"); > printa(@counts); > printf("\n\nTop 10 slowest functions\n\n"); > printa(@time); > } > > And here's the result (several seconds during the dd test): > > Top 10 called functions > nfsrc_freecache 88849 > nfsrc_wanted 88849 > nfsrv_fillattr 88849 > nfsrv_postopattr 88849 > nfsrvd_read 88849 > nfsrvd_rephead 88849 > nfsrvd_updatecache 88849 > nfsvno_testexp 88849 > nfsrc_trimcache 177697 > nfsvno_getattr 177698 > > Top 10 slowest functions > nfsd_excred 5673 > nfsrc_freecache 5674 > nfsrv_postopattr 5970 > nfsrv_servertimer 6327 > nfssvc_nfscommon 6596 > nfsd_fhtovp 8000 > nfsrvd_read 8380 > nfssvc_program 92752 > nfsvno_read 124979 > nfsvno_fhtovp 1789523 > > I might try now to trace what nfsvno_fhtovp() is doing and where is > spending it's time. 
> > Any other ideas are welcome :) > > _______________________________________________ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to > "freebsd-hackers-unsubscribe@freebsd.org" From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 21:33:10 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E932D5C1; Thu, 11 Oct 2012 21:33:10 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 229128FC16; Thu, 11 Oct 2012 21:33:09 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAJ+LclCDaFvO/2dsb2JhbABFFoV7uhmCIAEBAQMBAQEBIAQnIAYFBRYOCgICDRkCIwYBCSYGCAIFBAEcAQOHUgMJBgumSYgUDYlUgSGJSGYahGSBEgOTPliBVYEVig6FC4MJgUc0 X-IronPort-AV: E=Sophos;i="4.80,574,1344225600"; d="scan'208";a="186026601" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu-pri.mail.uoguelph.ca with ESMTP; 11 Oct 2012 17:33:08 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 567D3B402D; Thu, 11 Oct 2012 17:33:08 -0400 (EDT) Date: Thu, 11 Oct 2012 17:33:08 -0400 (EDT) From: Rick Macklem To: Nikolay Denev Message-ID: <314705086.2114438.1349991188290.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: Subject: Re: NFS server bottlenecks MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692) Cc: rmacklem@freebsd.org, Garrett Wollman , freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 21:33:11 -0000 Nikolay Denev wrote: > On Oct 11, 2012, at 7:20 PM, Nikolay Denev wrote: > > > On Oct 11, 2012, at 8:46 AM, Nikolay Denev wrote: > > > >> > >> On Oct 11, 2012, at 1:09 AM, Rick Macklem > >> wrote: > >> > >>> Nikolay Denev wrote: > >>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem > >>>> wrote: > >>>> > >>>>> Nikolay Denev wrote: > >>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>>> Garrett Wollman wrote: > >>>>>>>> < >>>>>>>> said: > >>>>>>>> > >>>>>>>>>> Simple: just use a sepatate mutex for each list that a > >>>>>>>>>> cache > >>>>>>>>>> entry > >>>>>>>>>> is on, rather than a global lock for everything. This would > >>>>>>>>>> reduce > >>>>>>>>>> the mutex contention, but I'm not sure how significantly > >>>>>>>>>> since > >>>>>>>>>> I > >>>>>>>>>> don't have the means to measure it yet. > >>>>>>>>>> > >>>>>>>>> Well, since the cache trimming is removing entries from the > >>>>>>>>> lists, > >>>>>>>>> I > >>>>>>>>> don't > >>>>>>>>> see how that can be done with a global lock for list > >>>>>>>>> updates? 
> >>>>>>>> > >>>>>>>> Well, the global lock is what we have now, but the cache > >>>>>>>> trimming > >>>>>>>> process only looks at one list at a time, so not locking the > >>>>>>>> list > >>>>>>>> that > >>>>>>>> isn't being iterated over probably wouldn't hurt, unless > >>>>>>>> there's > >>>>>>>> some > >>>>>>>> mechanism (that I didn't see) for entries to move from one > >>>>>>>> list > >>>>>>>> to > >>>>>>>> another. Note that I'm considering each hash bucket a > >>>>>>>> separate > >>>>>>>> "list". (One issue to worry about in that case would be > >>>>>>>> cache-line > >>>>>>>> contention in the array of hash buckets; perhaps > >>>>>>>> NFSRVCACHE_HASHSIZE > >>>>>>>> ought to be increased to reduce that.) > >>>>>>>> > >>>>>>> Yea, a separate mutex for each hash list might help. There is > >>>>>>> also > >>>>>>> the > >>>>>>> LRU list that all entries end up on, that gets used by the > >>>>>>> trimming > >>>>>>> code. > >>>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't > >>>>>>> looked > >>>>>>> at > >>>>>>> it in a while.) > >>>>>>> > >>>>>>> Also, increasing the hash table size is probably a good idea, > >>>>>>> especially > >>>>>>> if you reduce how aggressively the cache is trimmed. > >>>>>>> > >>>>>>>>> Only doing it once/sec would result in a very large cache > >>>>>>>>> when > >>>>>>>>> bursts of > >>>>>>>>> traffic arrives. > >>>>>>>> > >>>>>>>> My servers have 96 GB of memory so that's not a big deal for > >>>>>>>> me. > >>>>>>>> > >>>>>>> This code was originally "production tested" on a server with > >>>>>>> 1Gbyte, > >>>>>>> so times have changed a bit;-) > >>>>>>> > >>>>>>>>> I'm not sure I see why doing it as a separate thread will > >>>>>>>>> improve > >>>>>>>>> things. > >>>>>>>>> There are N nfsd threads already (N can be bumped up to 256 > >>>>>>>>> if > >>>>>>>>> you > >>>>>>>>> wish) > >>>>>>>>> and having a bunch more "cache trimming threads" would just > >>>>>>>>> increase > >>>>>>>>> contention, wouldn't it? > >>>>>>>> > >>>>>>>> Only one cache-trimming thread. The cache trim holds the > >>>>>>>> (global) > >>>>>>>> mutex for much longer than any individual nfsd service thread > >>>>>>>> has > >>>>>>>> any > >>>>>>>> need to, and having N threads doing that in parallel is why > >>>>>>>> it's > >>>>>>>> so > >>>>>>>> heavily contended. If there's only one thread doing the trim, > >>>>>>>> then > >>>>>>>> the nfsd service threads aren't spending time either > >>>>>>>> contending > >>>>>>>> on > >>>>>>>> the > >>>>>>>> mutex (it will be held less frequently and for shorter > >>>>>>>> periods). > >>>>>>>> > >>>>>>> I think the little drc2.patch which will keep the nfsd threads > >>>>>>> from > >>>>>>> acquiring the mutex and doing the trimming most of the time, > >>>>>>> might > >>>>>>> be > >>>>>>> sufficient. I still don't see why a separate trimming thread > >>>>>>> will > >>>>>>> be > >>>>>>> an advantage. I'd also be worried that the one cache trimming > >>>>>>> thread > >>>>>>> won't get the job done soon enough. > >>>>>>> > >>>>>>> When I did production testing on a 1Gbyte server that saw a > >>>>>>> peak > >>>>>>> load of about 100RPCs/sec, it was necessary to trim > >>>>>>> aggressively. > >>>>>>> (Although I'd be tempted to say that a server with 1Gbyte is > >>>>>>> no > >>>>>>> longer relevant, I recently recall someone trying to run > >>>>>>> FreeBSD > >>>>>>> on a i486, although I doubt they wanted to run the nfsd on > >>>>>>> it.) > >>>>>>> > >>>>>>>>> The only negative effect I can think of w.r.t. 
having the > >>>>>>>>> nfsd > >>>>>>>>> threads doing it would be a (I believe negligible) increase > >>>>>>>>> in > >>>>>>>>> RPC > >>>>>>>>> response times (the time the nfsd thread spends trimming the > >>>>>>>>> cache). > >>>>>>>>> As noted, I think this time would be negligible compared to > >>>>>>>>> disk > >>>>>>>>> I/O > >>>>>>>>> and network transit times in the total RPC response time? > >>>>>>>> > >>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, > >>>>>>>> and > >>>>>>>> 10G > >>>>>>>> network connectivity, spinning on a contended mutex takes a > >>>>>>>> significant amount of CPU time. (For the current design of > >>>>>>>> the > >>>>>>>> NFS > >>>>>>>> server, it may actually be a win to turn off adaptive mutexes > >>>>>>>> -- > >>>>>>>> I > >>>>>>>> should give that a try once I'm able to do more testing.) > >>>>>>>> > >>>>>>> Have fun with it. Let me know when you have what you think is > >>>>>>> a > >>>>>>> good > >>>>>>> patch. > >>>>>>> > >>>>>>> rick > >>>>>>> > >>>>>>>> -GAWollman > >>>>>>>> _______________________________________________ > >>>>>>>> freebsd-hackers@freebsd.org mailing list > >>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >>>>>>>> To unsubscribe, send any mail to > >>>>>>>> "freebsd-hackers-unsubscribe@freebsd.org" > >>>>>>> _______________________________________________ > >>>>>>> freebsd-fs@freebsd.org mailing list > >>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs > >>>>>>> To unsubscribe, send any mail to > >>>>>>> "freebsd-fs-unsubscribe@freebsd.org" > >>>>>> > >>>>>> My quest for IOPS over NFS continues :) > >>>>>> So far I'm not able to achieve more than about 3000 8K read > >>>>>> requests > >>>>>> over NFS, > >>>>>> while the server locally gives much more. > >>>>>> And this is all from a file that is completely in ARC cache, no > >>>>>> disk > >>>>>> IO involved. > >>>>>> > >>>>> Just out of curiousity, why do you use 8K reads instead of 64K > >>>>> reads. > >>>>> Since the RPC overhead (including the DRC functions) is per RPC, > >>>>> doing > >>>>> fewer larger RPCs should usually work better. (Sometimes large > >>>>> rsize/wsize > >>>>> values generate too large a burst of traffic for a network > >>>>> interface > >>>>> to > >>>>> handle and then the rsize/wsize has to be decreased to avoid > >>>>> this > >>>>> issue.) > >>>>> > >>>>> And, although this experiment seems useful for testing patches > >>>>> that > >>>>> try > >>>>> and reduce DRC CPU overheads, most "real" NFS servers will be > >>>>> doing > >>>>> disk > >>>>> I/O. > >>>>> > >>>> > >>>> This is the default blocksize the Oracle and probably most > >>>> databases > >>>> use. > >>>> It uses also larger blocks, but for small random reads in OLTP > >>>> applications this is what is used. > >>>> > >>> If the client is doing 8K reads, you could increase the read ahead > >>> "readahead=N" (N up to 16), to try and increase the bandwidth. > >>> (But if the CPU is 99% busy, then I don't think it will matter.) > >> > >> I'll try to check if this is possible to be set, as we are testing > >> not only with the Linux NFS client, > >> but also with the Oracle's built in so called DirectNFS client that > >> is built in to the app. 
> >> > >>> > >>>> > >>>>>> I've snatched some sample DTrace script from the net : [ > >>>>>> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes > >>>>>> ] > >>>>>> > >>>>>> And modified it for our new NFS server : > >>>>>> > >>>>>> #!/usr/sbin/dtrace -qs > >>>>>> > >>>>>> fbt:kernel:nfsrvd_*:entry > >>>>>> { > >>>>>> self->ts = timestamp; > >>>>>> @counts[probefunc] = count(); > >>>>>> } > >>>>>> > >>>>>> fbt:kernel:nfsrvd_*:return > >>>>>> / self->ts > 0 / > >>>>>> { > >>>>>> this->delta = (timestamp-self->ts)/1000000; > >>>>>> } > >>>>>> > >>>>>> fbt:kernel:nfsrvd_*:return > >>>>>> / self->ts > 0 && this->delta > 100 / > >>>>>> { > >>>>>> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50); > >>>>>> } > >>>>>> > >>>>>> fbt:kernel:nfsrvd_*:return > >>>>>> / self->ts > 0 / > >>>>>> { > >>>>>> @dist[probefunc, "ms"] = quantize(this->delta); > >>>>>> self->ts = 0; > >>>>>> } > >>>>>> > >>>>>> END > >>>>>> { > >>>>>> printf("\n"); > >>>>>> printa("function %-20s %@10d\n", @counts); > >>>>>> printf("\n"); > >>>>>> printa("function %s(), time in %s:%@d\n", @dist); > >>>>>> printf("\n"); > >>>>>> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow); > >>>>>> } > >>>>>> > >>>>>> And here's a sample output from one or two minutes during the > >>>>>> run > >>>>>> of > >>>>>> Oracle's ORION benchmark > >>>>>> tool from a Linux machine, on a 32G file on NFS mount over 10G > >>>>>> ethernet: > >>>>>> > >>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d > >>>>>> ^C > >>>>>> > >>>>>> function nfsrvd_access 4 > >>>>>> function nfsrvd_statfs 10 > >>>>>> function nfsrvd_getattr 14 > >>>>>> function nfsrvd_commit 76 > >>>>>> function nfsrvd_sentcache 110048 > >>>>>> function nfsrvd_write 110048 > >>>>>> function nfsrvd_read 283648 > >>>>>> function nfsrvd_dorpc 393800 > >>>>>> function nfsrvd_getcache 393800 > >>>>>> function nfsrvd_rephead 393800 > >>>>>> function nfsrvd_updatecache 393800 > >>>>>> > >>>>>> function nfsrvd_access(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_statfs(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_getattr(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_sentcache(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_rephead(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_updatecache(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_getcache(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 > >>>>>> 1 | 1 > >>>>>> 2 | 0 > >>>>>> 4 | 1 > >>>>>> 8 | 0 > >>>>>> > >>>>>> function nfsrvd_write(), time in ms: > >>>>>> value ------------- Distribution 
------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 > >>>>>> 1 | 5 > >>>>>> 2 | 4 > >>>>>> 4 | 0 > >>>>>> > >>>>>> function nfsrvd_read(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 > >>>>>> 1 | 19 > >>>>>> 2 | 3 > >>>>>> 4 | 2 > >>>>>> 8 | 0 > >>>>>> 16 | 1 > >>>>>> 32 | 0 > >>>>>> 64 | 0 > >>>>>> 128 | 0 > >>>>>> 256 | 1 > >>>>>> 512 | 0 > >>>>>> > >>>>>> function nfsrvd_commit(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 > >>>>>> 1 |@@@@@@@ 14 > >>>>>> 2 | 0 > >>>>>> 4 |@ 1 > >>>>>> 8 |@ 1 > >>>>>> 16 | 0 > >>>>>> 32 |@@@@@@@ 14 > >>>>>> 64 |@ 2 > >>>>>> 128 | 0 > >>>>>> > >>>>>> > >>>>>> function nfsrvd_commit(), time in ms for >= 100 ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> < 100 | 0 > >>>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > >>>>>> 150 | 0 > >>>>>> > >>>>>> function nfsrvd_read(), time in ms for >= 100 ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> 250 | 0 > >>>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > >>>>>> 350 | 0 > >>>>>> > >>>>>> > >>>>>> Looks like the nfs server cache functions are quite fast, but > >>>>>> extremely frequently called. > >>>>>> > >>>>> Yep, they are called for every RPC. > >>>>> > >>>>> I may try coding up a patch that replaces the single mutex with > >>>>> one for each hash bucket, for TCP. > >>>>> > >>>>> I'll post if/when I get this patch to a testing/review stage, > >>>>> rick > >>>>> > >>>> > >>>> Cool. > >>>> > >>>> I've readjusted the precision of the dtrace script a bit, and I > >>>> can > >>>> see > >>>> now the following three functions as taking most of the time : > >>>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache() > >>>> > >>>> This was recorded during a oracle benchmark run called SLOB, > >>>> which > >>>> caused 99% cpu load on the NFS server. > >>>> > >>> Even with the drc2.patch and a large value for > >>> vfs.nfsd.tcphighwater? > >>> (Assuming the mounts are TCP ones.) > >>> > >>> Have fun with it, rick > >>> > >> > >> I had upped it, but probably not enough. I'm now running with > >> vfs.nfsd.tcphighwater set > >> to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500. > >> So far it looks like good improvement as those functions no longer > >> show up in the dtrace script output. > >> I'll run some more benchmarks and testing today. > >> > >> Thanks! > >> > >>>> > >>>>>> I hope someone can find this information useful. > >>>>>> > >>>>>> _______________________________________________ > >>>>>> freebsd-hackers@freebsd.org mailing list > >>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >>>>>> To unsubscribe, send any mail to > >>>>>> "freebsd-hackers-unsubscribe@freebsd.org" > >> > > > > I haven't had the opportunity today to run more DB tests over NFS as > > the DBA was busy with something else, > > however I tested a bit the large file transfers. > > And I'm seeing something strange probably not only NFS but also ZFS > > and ARC related. > > > > When I first tested the drc2.patch I reported a huge bandwidth > > improvement, > > but now I think that this was probably because of the machine > > freshly rebooted instead of just the patch. > > The patch surely improved things, especially CPU utilization > > combined with the increased cache. 
> > But today I'm again having a file completely cached in ZFS's ARC > > cache, which when transferred over NFS > > reaches about 300MB/s, when at some tests it reached 900+MB/s (as > > reported in my first email). > > The file locally can be read at about 3GB/s as reported by dd. > > > > Local: > > [17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null > > bs=1M > > 30720+1 records in > > 30720+1 records out > > 32212262912 bytes transferred in 10.548485 secs (3053733573 > > bytes/sec) > > > > Over NFS: > > [17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M > > 30720+1 records in > > 30720+1 records out > > 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s > > > > The machines are almost idle during this transfer and I can't see a > > reason why it can't reach the full bandwith when it's > > just reading it from RAM. > > > > I've tried again tracing with DTrace to see what's happening with > > this script : > > > > fbt:kernel:nfs*:entry > > { > > this->ts = timestamp; > > @counts[probefunc] = count(); > > } > > > > fbt:kernel:nfs*:return > > / this->ts > 0 / > > { > > @time[probefunc] = avg(timestamp - this->ts); > > } > > > > END > > { > > trunc(@counts, 10); > > trunc(@time, 10); > > printf("Top 10 called functions\n\n"); > > printa(@counts); > > printf("\n\nTop 10 slowest functions\n\n"); > > printa(@time); > > } > > > > And here's the result (several seconds during the dd test): > > > > Top 10 called functions > > nfsrc_freecache 88849 > > nfsrc_wanted 88849 > > nfsrv_fillattr 88849 > > nfsrv_postopattr 88849 > > nfsrvd_read 88849 > > nfsrvd_rephead 88849 > > nfsrvd_updatecache 88849 > > nfsvno_testexp 88849 > > nfsrc_trimcache 177697 > > nfsvno_getattr 177698 > > > > Top 10 slowest functions > > nfsd_excred 5673 > > nfsrc_freecache 5674 > > nfsrv_postopattr 5970 > > nfsrv_servertimer 6327 > > nfssvc_nfscommon 6596 > > nfsd_fhtovp 8000 > > nfsrvd_read 8380 > > nfssvc_program 92752 > > nfsvno_read 124979 > > nfsvno_fhtovp 1789523 > > > > I might try now to trace what nfsvno_fhtovp() is doing and where is > > spending it's time. > > > > Any other ideas are welcome :) > > > > To take the network out of the equation I redid the test by mounting > the same filesystem over NFS on the server: > > [18:23]root@goliath:~# mount -t nfs -o > rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576 Just fyi, the maximum rsize,wsize is MAXBSIZE, which is 65536 for FreeBSD currently. As I noted in the other email, I'd suggest "rsize=65536,wsize=65536,readahead=16,...". > localhost:/tank/spa_db/undo /mnt > [18:24]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M > 30720+1 records in > 30720+1 records out > 32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec) > [18:25]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M > 30720+1 records in > 30720+1 records out > 32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec) > > During the first run I saw several nfsd threads in top, along with dd > and again zero disk I/O. > There was increase in memory usage because of the double buffering > ARC->buffercahe. > The second run was with all of the nfsd threads totally idle, and read > directly from the buffercache. 
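To make the locking change discussed in this thread concrete, here is a minimal sketch of the per-hash-bucket scheme Rick says he may code up: one mutex per bucket, so nfsd threads contend only when they hash to the same chain instead of all serializing on the single global DRC mutex. It is written as self-contained userland C with pthreads purely to show the shape of the structure; the real change would use struct mtx inside the kernel's nfsrvcache code. Every name here (cache_bucket, cache_lookup, CACHE_HASHSIZE, the xid key) is made up for illustration; this is not the drc2.patch.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_HASHSIZE 500              /* e.g. the larger NFSRVCACHE_HASHSIZE tried above */

struct cache_entry {
    struct cache_entry *ce_next;
    uint32_t            ce_xid;         /* RPC transaction id, used as the hash key here */
    /* ... the cached reply would hang off the entry ... */
};

/* One bucket = one chain + one lock, so different buckets never contend. */
struct cache_bucket {
    pthread_mutex_t     cb_lock;
    struct cache_entry *cb_head;
};

static struct cache_bucket cache[CACHE_HASHSIZE];

static void
cache_init(void)
{
    for (int i = 0; i < CACHE_HASHSIZE; i++) {
        pthread_mutex_init(&cache[i].cb_lock, NULL);
        cache[i].cb_head = NULL;
    }
}

/* Lookup-or-insert takes only the bucket lock; no global mutex is involved. */
static struct cache_entry *
cache_lookup(uint32_t xid)
{
    struct cache_bucket *cb = &cache[xid % CACHE_HASHSIZE];
    struct cache_entry *ce;

    pthread_mutex_lock(&cb->cb_lock);
    for (ce = cb->cb_head; ce != NULL; ce = ce->ce_next)
        if (ce->ce_xid == xid)
            break;
    if (ce == NULL) {
        ce = calloc(1, sizeof(*ce));    /* error handling omitted in this sketch */
        ce->ce_xid = xid;
        ce->ce_next = cb->cb_head;
        cb->cb_head = ce;
    }
    pthread_mutex_unlock(&cb->cb_lock);
    return (ce);
}

The part the sketch glosses over is the one Rick flags in the quoted discussion: the cache also keeps a global LRU list used by the trim code, and splitting the hash-chain lock buys little unless the trimming path stops taking a global lock as well.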
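Rick's pipe-filling arithmetic a few messages up boils down to one comparison: the data a client keeps outstanding per file, rsize * (readahead + 1) (the + 1 being the read itself, per Rick's follow-up correction below), has to cover the link's bandwidth-delay product, i.e. data rate * RPC transit delay. Here is a tiny back-of-the-envelope program for that check; the 10Gb/s rate, the assumed 1 ms transit delay, and the rsize=65536/readahead=16 mount settings are just the example numbers from this thread, not measurements.

#include <stdio.h>

int
main(void)
{
    double rate = 10e9 / 8.0;        /* 10Gb/s link, roughly 1.25 Gbyte/s */
    double delay = 0.001;            /* assumed RPC transit delay, in seconds */
    unsigned rsize = 65536;          /* MAXBSIZE, the current FreeBSD maximum */
    unsigned readahead = 16;         /* mount option; N readaheads plus the read itself */

    double pipe = rate * delay;                             /* bandwidth-delay product */
    double outstanding = (double)rsize * (readahead + 1);   /* data in flight per file */

    printf("bandwidth-delay product: %.0f bytes\n", pipe);
    printf("outstanding read data:   %.0f bytes (%s)\n", outstanding,
        outstanding >= pipe ? "link can stay full" : "link will go idle between RPCs");
    return (0);
}

With these numbers the outstanding data (about 1.1 Mbyte) falls just short of the 1.25 Mbyte pipe, which is the same conclusion Rick reaches: the RPC transit delay has to stay somewhat under a millisecond for 64K reads to keep a 10Gb/s link full.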
From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 22:02:54 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2DAD2D35; Thu, 11 Oct 2012 22:02:54 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id BC0AF8FC17; Thu, 11 Oct 2012 22:02:53 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap0EAJ+LclCDaFvO/2dsb2JhbABFhhG6GYJKVhsODAINGQJfiBimVJF1gSGPLIESA5VrkC6DCYF7 X-IronPort-AV: E=Sophos;i="4.80,574,1344225600"; d="scan'208";a="186031155" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu-pri.mail.uoguelph.ca with ESMTP; 11 Oct 2012 18:02:52 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id C0136B4096; Thu, 11 Oct 2012 18:02:52 -0400 (EDT) Date: Thu, 11 Oct 2012 18:02:52 -0400 (EDT) From: Rick Macklem To: Nikolay Denev Message-ID: <608951636.2115684.1349992972756.JavaMail.root@erie.cs.uoguelph.ca> Subject: Re: NFS server bottlenecks MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692) Cc: FreeBSD Hackers , Garrett Wollman X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Oct 2012 22:02:54 -0000 Oops, I didn't get the "readahead" option description quite right in the last post. The default read ahead is 1, which does result in "rsize * 2", since there is the read + 1 readahead. "rsize * 16" would actually be for the option "readahead=15" and for "readahead=16" the calculation would be "rsize * 17". However, the example was otherwise ok, I think? rick From owner-freebsd-hackers@FreeBSD.ORG Fri Oct 12 15:54:58 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4F5DE15D for ; Fri, 12 Oct 2012 15:54:58 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 227718FC14 for ; Fri, 12 Oct 2012 15:54:58 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 864E0B9BC; Fri, 12 Oct 2012 11:54:57 -0400 (EDT) From: John Baldwin To: Carl Delsey Subject: Re: No bus_space_read_8 on x86 ? 
Date: Fri, 12 Oct 2012 11:31:46 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p20; KDE/4.5.5; amd64; ; ) References: <506DC574.9010300@intel.com> <201210091154.15873.jhb@freebsd.org> <5075EC29.1010907@intel.com> In-Reply-To: <5075EC29.1010907@intel.com> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201210121131.46373.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Fri, 12 Oct 2012 11:54:57 -0400 (EDT) Cc: freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 12 Oct 2012 15:54:58 -0000 On Wednesday, October 10, 2012 5:44:09 pm Carl Delsey wrote: > Sorry for the slow response. I was dealing with a bit of a family > emergency. Responses inline below. > > On 10/09/12 08:54, John Baldwin wrote: > > On Monday, October 08, 2012 4:59:24 pm Warner Losh wrote: > >> On Oct 5, 2012, at 10:08 AM, John Baldwin wrote: > > >>> I think cxgb* already have an implementation. For amd64 we should certainly > >>> have bus_space_*_8(), at least for SYS_RES_MEMORY. I think they should fail > >>> for SYS_RES_IOPORT. I don't think we can force a compile-time error though, > >>> would just have to return -1 on reads or some such? > > Yes. Exactly what I was thinking. > > >> I believe it was because bus reads weren't guaranteed to be atomic on i386. > >> don't know if that's still the case or a concern, but it was an intentional omission. > > True. If you are on a 32-bit system you can read the two 4 byte values and > > then build a 64-bit value. For 64-bit platforms we should offer bus_read_8() > > however. > > I believe there is still no way to perform a 64-bit read on a i386 (or > at least without messing with SSE instructions), but if you have to read > a 64-bit register, you are stuck with doing two 32-bit reads and > concatenating them. I figure we may as well provide an implementation > for those who have to do that as well as the implementation for 64-bit. I think the problem though is that the way you should glue those two 32-bit reads together is device dependent. I don't think you can provide a completely device-neutral bus_read_8() on i386. We should certainly have it on 64-bit platforms, but I think drivers that want to work on 32-bit platforms need to explicitly merge the two words themselves. -- John Baldwin From owner-freebsd-hackers@FreeBSD.ORG Fri Oct 12 16:04:28 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3382D935; Fri, 12 Oct 2012 16:04:28 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 08D398FC14; Fri, 12 Oct 2012 16:04:28 +0000 (UTC) Received: from fledge.watson.org (fledge.watson.org [65.122.17.41]) by cyrus.watson.org (Postfix) with ESMTPS id B5C3C46B09; Fri, 12 Oct 2012 12:04:27 -0400 (EDT) Date: Fri, 12 Oct 2012 17:04:27 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: John Baldwin Subject: Re: No bus_space_read_8 on x86 ? 
In-Reply-To: <201210121131.46373.jhb@freebsd.org> Message-ID: References: <506DC574.9010300@intel.com> <201210091154.15873.jhb@freebsd.org> <5075EC29.1010907@intel.com> <201210121131.46373.jhb@freebsd.org> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-hackers@freebsd.org, Carl Delsey X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 12 Oct 2012 16:04:28 -0000 On Fri, 12 Oct 2012, John Baldwin wrote: >>>> I believe it was because bus reads weren't guaranteed to be atomic on >>>> i386. don't know if that's still the case or a concern, but it was an >>>> intentional omission. >>> True. If you are on a 32-bit system you can read the two 4 byte values >>> and then build a 64-bit value. For 64-bit platforms we should offer >>> bus_read_8() however. >> >> I believe there is still no way to perform a 64-bit read on a i386 (or at >> least without messing with SSE instructions), but if you have to read a >> 64-bit register, you are stuck with doing two 32-bit reads and >> concatenating them. I figure we may as well provide an implementation for >> those who have to do that as well as the implementation for 64-bit. > > I think the problem though is that the way you should glue those two 32-bit > reads together is device dependent. I don't think you can provide a > completely device-neutral bus_read_8() on i386. We should certainly have it > on 64-bit platforms, but I think drivers that want to work on 32-bit > platforms need to explicitly merge the two words themselves. Indeed -- and on non-x86, where there are uncached direct map segments, and TLB entries that disable caching, reading 2x 32-bit vs 1x 64-bit have quite different effects in terms of atomicity. Where uncached I/Os are being used, those differences may affect semantics significantly -- e.g., if your device has a 64-bit memory-mapped FIFO or registers, 2x 32-bit gives you two halves of two different 64-bit values, rather than two halves of the same value. As device drivers depend on those atomicity semantics, we should (at the busspace level) offer only the exactly expected semantics, rather than trying to patch things up. If a device driver accessing 64-bit fields wants to support doing it using two 32-bit reads, it can figure out how to splice it together following bus_space_read_region_4(). 
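As an aside, one widely used workaround for exactly this torn-read problem, when the 64-bit register happens to be a monotonically increasing counter exposed as two 32-bit halves, is the read-high/read-low/re-read-high retry loop sketched below. It is only an illustration built on the existing bus_space_read_4() call, it assumes the register has no read side effects, and it is not something the bus_space KPI itself offers.

#include <sys/param.h>
#include <machine/bus.h>

/*
 * Sketch: read a 64-bit, monotonically advancing counter that a device
 * exposes as two 32-bit registers.  If the high word changed while the
 * low word was being read, a carry occurred and we retry.  Not valid
 * for FIFOs or any register with read side effects.
 */
static __inline uint64_t
read_split_counter(bus_space_tag_t t, bus_space_handle_t h,
    bus_size_t lo_off, bus_size_t hi_off)
{
	uint32_t hi1, hi2, lo;

	do {
		hi1 = bus_space_read_4(t, h, hi_off);
		lo = bus_space_read_4(t, h, lo_off);
		hi2 = bus_space_read_4(t, h, hi_off);
	} while (hi1 != hi2);

	return (((uint64_t)hi1 << 32) | lo);
}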
Robert From owner-freebsd-hackers@FreeBSD.ORG Fri Oct 12 17:46:11 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B19CE535; Fri, 12 Oct 2012 17:46:11 +0000 (UTC) (envelope-from carl.r.delsey@intel.com) Received: from mga14.intel.com (mga14.intel.com [143.182.124.37]) by mx1.freebsd.org (Postfix) with ESMTP id 72D258FC14; Fri, 12 Oct 2012 17:46:11 +0000 (UTC) Received: from azsmga001.ch.intel.com ([10.2.17.19]) by azsmga102.ch.intel.com with ESMTP; 12 Oct 2012 10:46:05 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.80,577,1344236400"; d="scan'208";a="203822158" Received: from crdelsey-mobl2.amr.corp.intel.com (HELO [10.255.71.218]) ([10.255.71.218]) by azsmga001.ch.intel.com with ESMTP; 12 Oct 2012 10:46:04 -0700 Message-ID: <5078575B.2020808@intel.com> Date: Fri, 12 Oct 2012 10:46:03 -0700 From: Carl Delsey User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120824 Thunderbird/15.0 MIME-Version: 1.0 To: Robert Watson Subject: Re: No bus_space_read_8 on x86 ? References: <506DC574.9010300@intel.com> <201210091154.15873.jhb@freebsd.org> <5075EC29.1010907@intel.com> <201210121131.46373.jhb@freebsd.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 12 Oct 2012 17:46:11 -0000 On 10/12/2012 9:04 AM, Robert Watson wrote: > > On Fri, 12 Oct 2012, John Baldwin wrote: > >>>>> I believe it was because bus reads weren't guaranteed to be atomic >>>>> on i386. don't know if that's still the case or a concern, but it >>>>> was an intentional omission. >>>> True. If you are on a 32-bit system you can read the two 4 byte >>>> values and then build a 64-bit value. For 64-bit platforms we >>>> should offer bus_read_8() however. >>> >>> I believe there is still no way to perform a 64-bit read on a i386 >>> (or at least without messing with SSE instructions), but if you have >>> to read a 64-bit register, you are stuck with doing two 32-bit reads >>> and concatenating them. I figure we may as well provide an >>> implementation for those who have to do that as well as the >>> implementation for 64-bit. >> >> I think the problem though is that the way you should glue those two >> 32-bit reads together is device dependent. I don't think you can >> provide a completely device-neutral bus_read_8() on i386. We should >> certainly have it on 64-bit platforms, but I think drivers that want >> to work on 32-bit platforms need to explicitly merge the two words >> themselves. > > Indeed -- and on non-x86, where there are uncached direct map > segments, and TLB entries that disable caching, reading 2x 32-bit vs > 1x 64-bit have quite different effects in terms of atomicity. Where > uncached I/Os are being used, those differences may affect semantics > significantly -- e.g., if your device has a 64-bit memory-mapped FIFO > or registers, 2x 32-bit gives you two halves of two different 64-bit > values, rather than two halves of the same value. As device drivers > depend on those atomicity semantics, we should (at the busspace level) > offer only the exactly expected semantics, rather than trying to patch > things up. 
If a device driver accessing 64-bit fields wants to > support doing it using two 32-bit reads, it can figure out how to > splice it together following bus_space_read_region_4(). I wouldn't make any default behaviour for bus_space_read_8 on i386, just amd64. My assumption (which may be unjustified) is that by far the most common implementations to read a 64-bit register on i386 would be to read the lower 4 bytes first, followed by the upper 4 bytes (or vice versa) and then stitch them together. I think we should provide helper functions for these two cases, otherwise I fear our code base will be littered with multiple independent implementations of this. Some driver writer who wants to take advantage of these helper functions would do something like #ifdef i386 #define bus_space_read_8 bus_space_read_8_lower_first #endif otherwise, using bus_space_read_8 won't compile for i386 builds. If these implementations won't work for their case, they are free to write their own implementation or take whatever action is necessary. I guess my question is, are these cases common enough that it is worth helping developers by providing functions that do the double read and shifts for them, or do we leave them to deal with it on their own at the risk of possibly some duplicated code. Thanks, Carl From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 02:06:01 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D942BA8; Sat, 13 Oct 2012 02:06:01 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 3059A8FC17; Sat, 13 Oct 2012 02:06:00 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAN6MclCDaFvO/2dsb2JhbAA9CBaFe7oZgiABAQEEAQEBIAQnIAsFFg4KERkCBCUBCSYGCAcEARwEh2QLpkyRd4tlBIRkgRIDjm6EUIItgRWPGYMJgUc0 X-IronPort-AV: E=Sophos;i="4.80,580,1344225600"; d="scan'208";a="183450178" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 12 Oct 2012 22:05:54 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 168AEB4034; Fri, 12 Oct 2012 22:05:54 -0400 (EDT) Date: Fri, 12 Oct 2012 22:05:54 -0400 (EDT) From: Rick Macklem To: Nikolay Denev Message-ID: <937460294.2185822.1350093954059.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <608951636.2115684.1349992972756.JavaMail.root@erie.cs.uoguelph.ca> Subject: Re: NFS server bottlenecks MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_Part_2185821_351431992.1350093954057" X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692) Cc: FreeBSD Hackers , Garrett Wollman X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 02:06:01 -0000 ------=_Part_2185821_351431992.1350093954057 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit I wrote: > Oops, I didn't get the "readahead" option description > quite right in the last post. The default read ahead > is 1, which does result in "rsize * 2", since there is > the read + 1 readahead. 
> > "rsize * 16" would actually be for the option "readahead=15" > and for "readahead=16" the calculation would be "rsize * 17". > > However, the example was otherwise ok, I think? rick I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. These patches are also at: http://people.freebsd.org/~rmacklem/drc2.patch http://people.freebsd.org/~rmacklem/drc3.patch in case the attachments don't get through. rick ps: I haven't tested drc3.patch a lot, but I think it's ok? > _______________________________________________ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to > "freebsd-hackers-unsubscribe@freebsd.org" ------=_Part_2185821_351431992.1350093954057 Content-Type: text/x-patch; name=drc2.patch Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename=drc2.patch LS0tIGZzL25mc3NlcnZlci9uZnNfbmZzZGNhY2hlLmMub3JpZwkyMDEyLTAyLTI5IDIxOjA3OjUz LjAwMDAwMDAwMCAtMDUwMAorKysgZnMvbmZzc2VydmVyL25mc19uZnNkY2FjaGUuYwkyMDEyLTEw LTAzIDA4OjIzOjI0LjAwMDAwMDAwMCAtMDQwMApAQCAtMTY0LDggKzE2NCwxOSBAQCBORlNDQUNI RU1VVEVYOwogaW50IG5mc3JjX2Zsb29kbGV2ZWwgPSBORlNSVkNBQ0hFX0ZMT09ETEVWRUwsIG5m c3JjX3RjcHNhdmVkcmVwbGllcyA9IDA7CiAjZW5kaWYJLyogIUFQUExFS0VYVCAqLwogCitTWVND VExfREVDTChfdmZzX25mc2QpOworCitzdGF0aWMgaW50CW5mc3JjX3RjcGhpZ2h3YXRlciA9IDA7 CitTWVNDVExfSU5UKF92ZnNfbmZzZCwgT0lEX0FVVE8sIHRjcGhpZ2h3YXRlciwgQ1RMRkxBR19S VywKKyAgICAmbmZzcmNfdGNwaGlnaHdhdGVyLCAwLAorICAgICJIaWdoIHdhdGVyIG1hcmsgZm9y IFRDUCBjYWNoZSBlbnRyaWVzIik7CitzdGF0aWMgaW50CW5mc3JjX3VkcGhpZ2h3YXRlciA9IE5G U1JWQ0FDSEVfVURQSElHSFdBVEVSOworU1lTQ1RMX0lOVChfdmZzX25mc2QsIE9JRF9BVVRPLCB1 ZHBoaWdod2F0ZXIsIENUTEZMQUdfUlcsCisgICAgJm5mc3JjX3VkcGhpZ2h3YXRlciwgMCwKKyAg ICAiSGlnaCB3YXRlciBtYXJrIGZvciBVRFAgY2FjaGUgZW50cmllcyIpOworCiBzdGF0aWMgaW50 IG5mc3JjX3RjcG5vbmlkZW1wb3RlbnQgPSAxOwotc3RhdGljIGludCBuZnNyY191ZHBoaWdod2F0 ZXIgPSBORlNSVkNBQ0hFX1VEUEhJR0hXQVRFUiwgbmZzcmNfdWRwY2FjaGVzaXplID0gMDsKK3N0 YXRpYyBpbnQgbmZzcmNfdWRwY2FjaGVzaXplID0gMDsKIHN0YXRpYyBUQUlMUV9IRUFEKCwgbmZz cnZjYWNoZSkgbmZzcnZ1ZHBscnU7CiBzdGF0aWMgc3RydWN0IG5mc3J2aGFzaGhlYWQgbmZzcnZo YXNodGJsW05GU1JWQ0FDSEVfSEFTSFNJWkVdLAogICAgIG5mc3J2dWRwaGFzaHRibFtORlNSVkNB Q0hFX0hBU0hTSVpFXTsKQEAgLTc4MSw4ICs3OTIsMTUgQEAgbmZzcmNfdHJpbWNhY2hlKHVfaW50 NjRfdCBzb2NrcmVmLCBzdHJ1YwogewogCXN0cnVjdCBuZnNydmNhY2hlICpycCwgKm5leHRycDsK IAlpbnQgaTsKKwlzdGF0aWMgdGltZV90IGxhc3R0cmltID0gMDsKIAorCWlmIChORlNEX01PTk9T RUMgPT0gbGFzdHRyaW0gJiYKKwkgICAgbmZzcmNfdGNwc2F2ZWRyZXBsaWVzIDwgbmZzcmNfdGNw aGlnaHdhdGVyICYmCisJICAgIG5mc3JjX3VkcGNhY2hlc2l6ZSA8IChuZnNyY191ZHBoaWdod2F0 ZXIgKworCSAgICBuZnNyY191ZHBoaWdod2F0ZXIgLyAyKSkKKwkJcmV0dXJuOwogCU5GU0xPQ0tD QUNIRSgpOworCWxhc3R0cmltID0gTkZTRF9NT05PU0VDOwogCVRBSUxRX0ZPUkVBQ0hfU0FGRShy cCwgJm5mc3J2dWRwbHJ1LCByY19scnUsIG5leHRycCkgewogCQlpZiAoIShycC0+cmNfZmxhZyAm IChSQ19JTlBST0d8UkNfTE9DS0VEfFJDX1dBTlRFRCkpCiAJCSAgICAgJiYgcnAtPnJjX3JlZmNu dCA9PSAwCg== ------=_Part_2185821_351431992.1350093954057 Content-Type: text/x-patch; name=drc3.patch Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename=drc3.patch LS0tIGZzL25mc3NlcnZlci9uZnNfbmZzZGNhY2hlLmMuc2F2CTIwMTItMTAtMTAgMTg6NTY6MDEu MDAwMDAwMDAwIC0wNDAwCisrKyBmcy9uZnNzZXJ2ZXIvbmZzX25mc2RjYWNoZS5jCTIwMTItMTAt MTIgMjE6MDQ6MjEuMDAwMDAwMDAwIC0wNDAwCkBAIC0xNjAsNyArMTYwLDggQEAgX19GQlNESUQo IiRGcmVlQlNEOiBoZWFkL3N5cy9mcy9uZnNzZXJ2ZQogI2luY2x1ZGUgPGZzL25mcy9uZnNwb3J0 
Lmg+CiAKIGV4dGVybiBzdHJ1Y3QgbmZzc3RhdHMgbmV3bmZzc3RhdHM7Ci1ORlNDQUNIRU1VVEVY OworZXh0ZXJuIHN0cnVjdCBtdHggbmZzcmNfdGNwbXR4W05GU1JWQ0FDSEVfSEFTSFNJWkVdOwor ZXh0ZXJuIHN0cnVjdCBtdHggbmZzcmNfdWRwbXR4OwogaW50IG5mc3JjX2Zsb29kbGV2ZWwgPSBO RlNSVkNBQ0hFX0ZMT09ETEVWRUwsIG5mc3JjX3RjcHNhdmVkcmVwbGllcyA9IDA7CiAjZW5kaWYJ LyogIUFQUExFS0VYVCAqLwogCkBAIC0yMDgsMTAgKzIwOSwxMSBAQCBzdGF0aWMgaW50IG5ld25m c3YyX3Byb2NpZFtORlNfVjNOUFJPQ1NdCiAJTkZTVjJQUk9DX05PT1AsCiB9OwogCisjZGVmaW5l CW5mc3JjX2hhc2goeGlkKQkoKCh4aWQpICsgKCh4aWQpID4+IDI0KSkgJSBORlNSVkNBQ0hFX0hB U0hTSVpFKQogI2RlZmluZQlORlNSQ1VEUEhBU0goeGlkKSBcCi0JKCZuZnNydnVkcGhhc2h0Ymxb KCh4aWQpICsgKCh4aWQpID4+IDI0KSkgJSBORlNSVkNBQ0hFX0hBU0hTSVpFXSkKKwkoJm5mc3J2 dWRwaGFzaHRibFtuZnNyY19oYXNoKHhpZCldKQogI2RlZmluZQlORlNSQ0hBU0goeGlkKSBcCi0J KCZuZnNydmhhc2h0YmxbKCh4aWQpICsgKCh4aWQpID4+IDI0KSkgJSBORlNSVkNBQ0hFX0hBU0hT SVpFXSkKKwkoJm5mc3J2aGFzaHRibFtuZnNyY19oYXNoKHhpZCldKQogI2RlZmluZQlUUlVFCTEK ICNkZWZpbmUJRkFMU0UJMAogI2RlZmluZQlORlNSVkNBQ0hFX0NIRUNLTEVOCTEwMApAQCAtMjYy LDYgKzI2NCwxOCBAQCBzdGF0aWMgaW50IG5mc3JjX2dldGxlbmFuZGNrc3VtKG1idWZfdCBtCiBz dGF0aWMgdm9pZCBuZnNyY19tYXJrc2FtZXRjcGNvbm4odV9pbnQ2NF90KTsKIAogLyoKKyAqIFJl dHVybiB0aGUgY29ycmVjdCBtdXRleCBmb3IgdGhpcyBjYWNoZSBlbnRyeS4KKyAqLworc3RhdGlj IF9faW5saW5lIHN0cnVjdCBtdHggKgorbmZzcmNfY2FjaGVtdXRleChzdHJ1Y3QgbmZzcnZjYWNo ZSAqcnApCit7CisKKwlpZiAoKHJwLT5yY19mbGFnICYgUkNfVURQKSAhPSAwKQorCQlyZXR1cm4g KCZuZnNyY191ZHBtdHgpOworCXJldHVybiAoJm5mc3JjX3RjcG10eFtuZnNyY19oYXNoKHJwLT5y Y194aWQpXSk7Cit9CisKKy8qCiAgKiBJbml0aWFsaXplIHRoZSBzZXJ2ZXIgcmVxdWVzdCBjYWNo ZSBsaXN0CiAgKi8KIEFQUExFU1RBVElDIHZvaWQKQEAgLTMzNiwxMCArMzUwLDEyIEBAIG5mc3Jj X2dldHVkcChzdHJ1Y3QgbmZzcnZfZGVzY3JpcHQgKm5kLCAKIAlzdHJ1Y3Qgc29ja2FkZHJfaW42 ICpzYWRkcjY7CiAJc3RydWN0IG5mc3J2aGFzaGhlYWQgKmhwOwogCWludCByZXQgPSAwOworCXN0 cnVjdCBtdHggKm11dGV4OwogCisJbXV0ZXggPSBuZnNyY19jYWNoZW11dGV4KG5ld3JwKTsKIAlo cCA9IE5GU1JDVURQSEFTSChuZXdycC0+cmNfeGlkKTsKIGxvb3A6Ci0JTkZTTE9DS0NBQ0hFKCk7 CisJbXR4X2xvY2sobXV0ZXgpOwogCUxJU1RfRk9SRUFDSChycCwgaHAsIHJjX2hhc2gpIHsKIAkg ICAgaWYgKG5ld3JwLT5yY194aWQgPT0gcnAtPnJjX3hpZCAmJgogCQluZXdycC0+cmNfcHJvYyA9 PSBycC0+cmNfcHJvYyAmJgpAQCAtMzQ3LDggKzM2Myw4IEBAIGxvb3A6CiAJCW5mc2FkZHJfbWF0 Y2goTkVURkFNSUxZKHJwKSwgJnJwLT5yY19oYWRkciwgbmQtPm5kX25hbSkpIHsKIAkJCWlmICgo cnAtPnJjX2ZsYWcgJiBSQ19MT0NLRUQpICE9IDApIHsKIAkJCQlycC0+cmNfZmxhZyB8PSBSQ19X QU5URUQ7Ci0JCQkJKHZvaWQpbXR4X3NsZWVwKHJwLCBORlNDQUNIRU1VVEVYUFRSLAotCQkJCSAg ICAoUFpFUk8gLSAxKSB8IFBEUk9QLCAibmZzcmMiLCAxMCAqIGh6KTsKKwkJCQkodm9pZCltdHhf c2xlZXAocnAsIG11dGV4LCAoUFpFUk8gLSAxKSB8IFBEUk9QLAorCQkJCSAgICAibmZzcmMiLCAx MCAqIGh6KTsKIAkJCQlnb3RvIGxvb3A7CiAJCQl9CiAJCQlpZiAocnAtPnJjX2ZsYWcgPT0gMCkK QEAgLTM1OCwxNCArMzc0LDE0IEBAIGxvb3A6CiAJCQlUQUlMUV9JTlNFUlRfVEFJTCgmbmZzcnZ1 ZHBscnUsIHJwLCByY19scnUpOwogCQkJaWYgKHJwLT5yY19mbGFnICYgUkNfSU5QUk9HKSB7CiAJ CQkJbmV3bmZzc3RhdHMuc3J2Y2FjaGVfaW5wcm9naGl0cysrOwotCQkJCU5GU1VOTE9DS0NBQ0hF KCk7CisJCQkJbXR4X3VubG9jayhtdXRleCk7CiAJCQkJcmV0ID0gUkNfRFJPUElUOwogCQkJfSBl bHNlIGlmIChycC0+cmNfZmxhZyAmIFJDX1JFUFNUQVRVUykgewogCQkJCS8qCiAJCQkJICogVjIg b25seS4KIAkJCQkgKi8KIAkJCQluZXduZnNzdGF0cy5zcnZjYWNoZV9ub25pZGVtZG9uZWhpdHMr KzsKLQkJCQlORlNVTkxPQ0tDQUNIRSgpOworCQkJCW10eF91bmxvY2sobXV0ZXgpOwogCQkJCW5m c3J2ZF9yZXBoZWFkKG5kKTsKIAkJCQkqKG5kLT5uZF9lcnJwKSA9IHJwLT5yY19zdGF0dXM7CiAJ CQkJcmV0ID0gUkNfUkVQTFk7CkBAIC0zNzMsNyArMzg5LDcgQEAgbG9vcDoKIAkJCQkJTkZTUlZD QUNIRV9VRFBUSU1FT1VUOwogCQkJfSBlbHNlIGlmIChycC0+cmNfZmxhZyAmIFJDX1JFUE1CVUYp IHsKIAkJCQluZXduZnNzdGF0cy5zcnZjYWNoZV9ub25pZGVtZG9uZWhpdHMrKzsKLQkJCQlORlNV TkxPQ0tDQUNIRSgpOworCQkJCW10eF91bmxvY2sobXV0ZXgpOwogCQkJCW5kLT5uZF9tcmVxID0g 
bV9jb3B5bShycC0+cmNfcmVwbHksIDAsCiAJCQkJCU1fQ09QWUFMTCwgTV9XQUlUKTsKIAkJCQly ZXQgPSBSQ19SRVBMWTsKQEAgLTQwMyw3ICs0MTksNyBAQCBsb29wOgogCX0KIAlMSVNUX0lOU0VS VF9IRUFEKGhwLCBuZXdycCwgcmNfaGFzaCk7CiAJVEFJTFFfSU5TRVJUX1RBSUwoJm5mc3J2dWRw bHJ1LCBuZXdycCwgcmNfbHJ1KTsKLQlORlNVTkxPQ0tDQUNIRSgpOworCW10eF91bmxvY2sobXV0 ZXgpOwogCW5kLT5uZF9ycCA9IG5ld3JwOwogCXJldCA9IFJDX0RPSVQ7CiAKQEAgLTQyMSwxMiAr NDM3LDE0IEBAIG5mc3J2ZF91cGRhdGVjYWNoZShzdHJ1Y3QgbmZzcnZfZGVzY3JpcHQKIAlzdHJ1 Y3QgbmZzcnZjYWNoZSAqcnA7CiAJc3RydWN0IG5mc3J2Y2FjaGUgKnJldHJwID0gTlVMTDsKIAlt YnVmX3QgbTsKKwlzdHJ1Y3QgbXR4ICptdXRleDsKIAogCXJwID0gbmQtPm5kX3JwOwogCWlmICgh cnApCiAJCXBhbmljKCJuZnNydmRfdXBkYXRlY2FjaGUgbnVsbCBycCIpOwogCW5kLT5uZF9ycCA9 IE5VTEw7Ci0JTkZTTE9DS0NBQ0hFKCk7CisJbXV0ZXggPSBuZnNyY19jYWNoZW11dGV4KHJwKTsK KwltdHhfbG9jayhtdXRleCk7CiAJbmZzcmNfbG9jayhycCk7CiAJaWYgKCEocnAtPnJjX2ZsYWcg JiBSQ19JTlBST0cpKQogCQlwYW5pYygibmZzcnZkX3VwZGF0ZWNhY2hlIG5vdCBpbnByb2ciKTsK QEAgLTQ0MSw3ICs0NTksNyBAQCBuZnNydmRfdXBkYXRlY2FjaGUoc3RydWN0IG5mc3J2X2Rlc2Ny aXB0CiAJICovCiAJaWYgKG5kLT5uZF9yZXBzdGF0ID09IE5GU0VSUl9SRVBMWUZST01DQUNIRSkg ewogCQluZXduZnNzdGF0cy5zcnZjYWNoZV9ub25pZGVtZG9uZWhpdHMrKzsKLQkJTkZTVU5MT0NL Q0FDSEUoKTsKKwkJbXR4X3VubG9jayhtdXRleCk7CiAJCW5kLT5uZF9yZXBzdGF0ID0gMDsKIAkJ aWYgKG5kLT5uZF9tcmVxKQogCQkJbWJ1Zl9mcmVlbShuZC0+bmRfbXJlcSk7CkBAIC00NzQsMjEg KzQ5MiwyMSBAQCBuZnNydmRfdXBkYXRlY2FjaGUoc3RydWN0IG5mc3J2X2Rlc2NyaXB0CiAJCSAg ICBuZnN2Ml9yZXBzdGF0W25ld25mc3YyX3Byb2NpZFtuZC0+bmRfcHJvY251bV1dKSB7CiAJCQly cC0+cmNfc3RhdHVzID0gbmQtPm5kX3JlcHN0YXQ7CiAJCQlycC0+cmNfZmxhZyB8PSBSQ19SRVBT VEFUVVM7Ci0JCQlORlNVTkxPQ0tDQUNIRSgpOworCQkJbXR4X3VubG9jayhtdXRleCk7CiAJCX0g ZWxzZSB7CiAJCQlpZiAoIShycC0+cmNfZmxhZyAmIFJDX1VEUCkpIHsKLQkJCSAgICBuZnNyY190 Y3BzYXZlZHJlcGxpZXMrKzsKKwkJCSAgICBhdG9taWNfYWRkX2ludCgmbmZzcmNfdGNwc2F2ZWRy ZXBsaWVzLCAxKTsKIAkJCSAgICBpZiAobmZzcmNfdGNwc2F2ZWRyZXBsaWVzID4KIAkJCQluZXdu ZnNzdGF0cy5zcnZjYWNoZV90Y3BwZWFrKQogCQkJCW5ld25mc3N0YXRzLnNydmNhY2hlX3RjcHBl YWsgPQogCQkJCSAgICBuZnNyY190Y3BzYXZlZHJlcGxpZXM7CiAJCQl9Ci0JCQlORlNVTkxPQ0tD QUNIRSgpOworCQkJbXR4X3VubG9jayhtdXRleCk7CiAJCQltID0gbV9jb3B5bShuZC0+bmRfbXJl cSwgMCwgTV9DT1BZQUxMLCBNX1dBSVQpOwotCQkJTkZTTE9DS0NBQ0hFKCk7CisJCQltdHhfbG9j ayhtdXRleCk7CiAJCQlycC0+cmNfcmVwbHkgPSBtOwogCQkJcnAtPnJjX2ZsYWcgfD0gUkNfUkVQ TUJVRjsKLQkJCU5GU1VOTE9DS0NBQ0hFKCk7CisJCQltdHhfdW5sb2NrKG11dGV4KTsKIAkJfQog CQlpZiAocnAtPnJjX2ZsYWcgJiBSQ19VRFApIHsKIAkJCXJwLT5yY190aW1lc3RhbXAgPSBORlNE X01PTk9TRUMgKwpAQCAtNTA0LDcgKzUyMiw3IEBAIG5mc3J2ZF91cGRhdGVjYWNoZShzdHJ1Y3Qg bmZzcnZfZGVzY3JpcHQKIAkJfQogCX0gZWxzZSB7CiAJCW5mc3JjX2ZyZWVjYWNoZShycCk7Ci0J CU5GU1VOTE9DS0NBQ0hFKCk7CisJCW10eF91bmxvY2sobXV0ZXgpOwogCX0KIAogb3V0OgpAQCAt NTIwLDE0ICs1MzgsMTYgQEAgb3V0OgogQVBQTEVTVEFUSUMgdm9pZAogbmZzcnZkX2RlbGNhY2hl KHN0cnVjdCBuZnNydmNhY2hlICpycCkKIHsKKwlzdHJ1Y3QgbXR4ICptdXRleDsKIAorCW11dGV4 ID0gbmZzcmNfY2FjaGVtdXRleChycCk7CiAJaWYgKCEocnAtPnJjX2ZsYWcgJiBSQ19JTlBST0cp KQogCQlwYW5pYygibmZzcnZkX2RlbGNhY2hlIG5vdCBpbiBwcm9nIik7Ci0JTkZTTE9DS0NBQ0hF KCk7CisJbXR4X2xvY2sobXV0ZXgpOwogCXJwLT5yY19mbGFnICY9IH5SQ19JTlBST0c7CiAJaWYg KHJwLT5yY19yZWZjbnQgPT0gMCAmJiAhKHJwLT5yY19mbGFnICYgUkNfTE9DS0VEKSkKIAkJbmZz cmNfZnJlZWNhY2hlKHJwKTsKLQlORlNVTkxPQ0tDQUNIRSgpOworCW10eF91bmxvY2sobXV0ZXgp OwogfQogCiAvKgpAQCAtNTM5LDcgKzU1OSw5IEBAIEFQUExFU1RBVElDIHZvaWQKIG5mc3J2ZF9z ZW50Y2FjaGUoc3RydWN0IG5mc3J2Y2FjaGUgKnJwLCBzdHJ1Y3Qgc29ja2V0ICpzbywgaW50IGVy cikKIHsKIAl0Y3Bfc2VxIHRtcF9zZXE7CisJc3RydWN0IG10eCAqbXV0ZXg7CiAKKwltdXRleCA9 IG5mc3JjX2NhY2hlbXV0ZXgocnApOwogCWlmICghKHJwLT5yY19mbGFnICYgUkNfTE9DS0VEKSkK IAkJcGFuaWMoIm5mc3J2ZF9zZW50Y2FjaGUgbm90IGxvY2tlZCIpOwogCWlmICghZXJyKSB7CkBA 
IC01NDgsMTAgKzU3MCwxMCBAQCBuZnNydmRfc2VudGNhY2hlKHN0cnVjdCBuZnNydmNhY2hlICpy cCwgCiAJCSAgICAgc28tPnNvX3Byb3RvLT5wcl9wcm90b2NvbCAhPSBJUFBST1RPX1RDUCkKIAkJ CXBhbmljKCJuZnMgc2VudCBjYWNoZSIpOwogCQlpZiAobmZzcnZfZ2V0c29ja3NlcW51bShzbywg JnRtcF9zZXEpKSB7Ci0JCQlORlNMT0NLQ0FDSEUoKTsKKwkJCW10eF9sb2NrKG11dGV4KTsKIAkJ CXJwLT5yY190Y3BzZXEgPSB0bXBfc2VxOwogCQkJcnAtPnJjX2ZsYWcgfD0gUkNfVENQU0VROwot CQkJTkZTVU5MT0NLQ0FDSEUoKTsKKwkJCW10eF91bmxvY2sobXV0ZXgpOwogCQl9CiAJfQogCW5m c3JjX3VubG9jayhycCk7CkBAIC01NzAsMTEgKzU5MiwxMyBAQCBuZnNyY19nZXR0Y3Aoc3RydWN0 IG5mc3J2X2Rlc2NyaXB0ICpuZCwgCiAJc3RydWN0IG5mc3J2Y2FjaGUgKmhpdHJwOwogCXN0cnVj dCBuZnNydmhhc2hoZWFkICpocCwgbmZzcmNfdGVtcGxpc3Q7CiAJaW50IGhpdCwgcmV0ID0gMDsK KwlzdHJ1Y3QgbXR4ICptdXRleDsKIAorCW11dGV4ID0gbmZzcmNfY2FjaGVtdXRleChuZXdycCk7 CiAJaHAgPSBORlNSQ0hBU0gobmV3cnAtPnJjX3hpZCk7CiAJbmV3cnAtPnJjX3JlcWxlbiA9IG5m c3JjX2dldGxlbmFuZGNrc3VtKG5kLT5uZF9tcmVwLCAmbmV3cnAtPnJjX2Nrc3VtKTsKIHRyeWFn YWluOgotCU5GU0xPQ0tDQUNIRSgpOworCW10eF9sb2NrKG11dGV4KTsKIAloaXQgPSAxOwogCUxJ U1RfSU5JVCgmbmZzcmNfdGVtcGxpc3QpOwogCS8qCkBAIC02MzIsOCArNjU2LDggQEAgdHJ5YWdh aW46CiAJCXJwID0gaGl0cnA7CiAJCWlmICgocnAtPnJjX2ZsYWcgJiBSQ19MT0NLRUQpICE9IDAp IHsKIAkJCXJwLT5yY19mbGFnIHw9IFJDX1dBTlRFRDsKLQkJCSh2b2lkKW10eF9zbGVlcChycCwg TkZTQ0FDSEVNVVRFWFBUUiwKLQkJCSAgICAoUFpFUk8gLSAxKSB8IFBEUk9QLCAibmZzcmMiLCAx MCAqIGh6KTsKKwkJCSh2b2lkKW10eF9zbGVlcChycCwgbXV0ZXgsIChQWkVSTyAtIDEpIHwgUERS T1AsCisJCQkgICAgIm5mc3JjIiwgMTAgKiBoeik7CiAJCQlnb3RvIHRyeWFnYWluOwogCQl9CiAJ CWlmIChycC0+cmNfZmxhZyA9PSAwKQpAQCAtNjQxLDcgKzY2NSw3IEBAIHRyeWFnYWluOgogCQly cC0+cmNfZmxhZyB8PSBSQ19MT0NLRUQ7CiAJCWlmIChycC0+cmNfZmxhZyAmIFJDX0lOUFJPRykg ewogCQkJbmV3bmZzc3RhdHMuc3J2Y2FjaGVfaW5wcm9naGl0cysrOwotCQkJTkZTVU5MT0NLQ0FD SEUoKTsKKwkJCW10eF91bmxvY2sobXV0ZXgpOwogCQkJaWYgKG5ld3JwLT5yY19zb2NrcmVmID09 IHJwLT5yY19zb2NrcmVmKQogCQkJCW5mc3JjX21hcmtzYW1ldGNwY29ubihycC0+cmNfc29ja3Jl Zik7CiAJCQlyZXQgPSBSQ19EUk9QSVQ7CkBAIC02NTAsNyArNjc0LDcgQEAgdHJ5YWdhaW46CiAJ CQkgKiBWMiBvbmx5LgogCQkJICovCiAJCQluZXduZnNzdGF0cy5zcnZjYWNoZV9ub25pZGVtZG9u ZWhpdHMrKzsKLQkJCU5GU1VOTE9DS0NBQ0hFKCk7CisJCQltdHhfdW5sb2NrKG11dGV4KTsKIAkJ CWlmIChuZXdycC0+cmNfc29ja3JlZiA9PSBycC0+cmNfc29ja3JlZikKIAkJCQluZnNyY19tYXJr c2FtZXRjcGNvbm4ocnAtPnJjX3NvY2tyZWYpOwogCQkJcmV0ID0gUkNfUkVQTFk7CkBAIC02NjAs NyArNjg0LDcgQEAgdHJ5YWdhaW46CiAJCQkJTkZTUlZDQUNIRV9UQ1BUSU1FT1VUOwogCQl9IGVs c2UgaWYgKHJwLT5yY19mbGFnICYgUkNfUkVQTUJVRikgewogCQkJbmV3bmZzc3RhdHMuc3J2Y2Fj aGVfbm9uaWRlbWRvbmVoaXRzKys7Ci0JCQlORlNVTkxPQ0tDQUNIRSgpOworCQkJbXR4X3VubG9j ayhtdXRleCk7CiAJCQlpZiAobmV3cnAtPnJjX3NvY2tyZWYgPT0gcnAtPnJjX3NvY2tyZWYpCiAJ CQkJbmZzcmNfbWFya3NhbWV0Y3Bjb25uKHJwLT5yY19zb2NrcmVmKTsKIAkJCXJldCA9IFJDX1JF UExZOwpAQCAtNjg1LDcgKzcwOSw3IEBAIHRyeWFnYWluOgogCW5ld3JwLT5yY19jYWNoZXRpbWUg PSBORlNEX01PTk9TRUM7CiAJbmV3cnAtPnJjX2ZsYWcgfD0gUkNfSU5QUk9HOwogCUxJU1RfSU5T RVJUX0hFQUQoaHAsIG5ld3JwLCByY19oYXNoKTsKLQlORlNVTkxPQ0tDQUNIRSgpOworCW10eF91 bmxvY2sobXV0ZXgpOwogCW5kLT5uZF9ycCA9IG5ld3JwOwogCXJldCA9IFJDX0RPSVQ7CiAKQEAg LTY5NiwxNiArNzIwLDE3IEBAIG91dDoKIAogLyoKICAqIExvY2sgYSBjYWNoZSBlbnRyeS4KLSAq IEFsc28gcHV0cyBhIG11dGV4IGxvY2sgb24gdGhlIGNhY2hlIGxpc3QuCiAgKi8KIHN0YXRpYyB2 b2lkCiBuZnNyY19sb2NrKHN0cnVjdCBuZnNydmNhY2hlICpycCkKIHsKLQlORlNDQUNIRUxPQ0tS RVFVSVJFRCgpOworCXN0cnVjdCBtdHggKm11dGV4OworCisJbXV0ZXggPSBuZnNyY19jYWNoZW11 dGV4KHJwKTsKKwltdHhfYXNzZXJ0KG11dGV4LCBNQV9PV05FRCk7CiAJd2hpbGUgKChycC0+cmNf ZmxhZyAmIFJDX0xPQ0tFRCkgIT0gMCkgewogCQlycC0+cmNfZmxhZyB8PSBSQ19XQU5URUQ7Ci0J CSh2b2lkKW10eF9zbGVlcChycCwgTkZTQ0FDSEVNVVRFWFBUUiwgUFpFUk8gLSAxLAotCQkgICAg Im5mc3JjIiwgMCk7CisJCSh2b2lkKW10eF9zbGVlcChycCwgbXV0ZXgsIFBaRVJPIC0gMSwgIm5m 
c3JjIiwgMCk7CiAJfQogCXJwLT5yY19mbGFnIHw9IFJDX0xPQ0tFRDsKIH0KQEAgLTcxNiwxMSAr NzQxLDEzIEBAIG5mc3JjX2xvY2soc3RydWN0IG5mc3J2Y2FjaGUgKnJwKQogc3RhdGljIHZvaWQK IG5mc3JjX3VubG9jayhzdHJ1Y3QgbmZzcnZjYWNoZSAqcnApCiB7CisJc3RydWN0IG10eCAqbXV0 ZXg7CiAKLQlORlNMT0NLQ0FDSEUoKTsKKwltdXRleCA9IG5mc3JjX2NhY2hlbXV0ZXgocnApOwor CW10eF9sb2NrKG11dGV4KTsKIAlycC0+cmNfZmxhZyAmPSB+UkNfTE9DS0VEOwogCW5mc3JjX3dh bnRlZChycCk7Ci0JTkZTVU5MT0NLQ0FDSEUoKTsKKwltdHhfdW5sb2NrKG11dGV4KTsKIH0KIAog LyoKQEAgLTc0Myw3ICs3NzAsNiBAQCBzdGF0aWMgdm9pZAogbmZzcmNfZnJlZWNhY2hlKHN0cnVj dCBuZnNydmNhY2hlICpycCkKIHsKIAotCU5GU0NBQ0hFTE9DS1JFUVVJUkVEKCk7CiAJTElTVF9S RU1PVkUocnAsIHJjX2hhc2gpOwogCWlmIChycC0+cmNfZmxhZyAmIFJDX1VEUCkgewogCQlUQUlM UV9SRU1PVkUoJm5mc3J2dWRwbHJ1LCBycCwgcmNfbHJ1KTsKQEAgLTc1Myw3ICs3NzksNyBAQCBu ZnNyY19mcmVlY2FjaGUoc3RydWN0IG5mc3J2Y2FjaGUgKnJwKQogCWlmIChycC0+cmNfZmxhZyAm IFJDX1JFUE1CVUYpIHsKIAkJbWJ1Zl9mcmVlbShycC0+cmNfcmVwbHkpOwogCQlpZiAoIShycC0+ cmNfZmxhZyAmIFJDX1VEUCkpCi0JCQluZnNyY190Y3BzYXZlZHJlcGxpZXMtLTsKKwkJCWF0b21p Y19hZGRfaW50KCZuZnNyY190Y3BzYXZlZHJlcGxpZXMsIC0xKTsKIAl9CiAJRlJFRSgoY2FkZHJf dClycCwgTV9ORlNSVkNBQ0hFKTsKIAluZXduZnNzdGF0cy5zcnZjYWNoZV9zaXplLS07CkBAIC03 NjgsMjAgKzc5NCwyMiBAQCBuZnNydmRfY2xlYW5jYWNoZSh2b2lkKQogCXN0cnVjdCBuZnNydmNh Y2hlICpycCwgKm5leHRycDsKIAlpbnQgaTsKIAotCU5GU0xPQ0tDQUNIRSgpOwogCWZvciAoaSA9 IDA7IGkgPCBORlNSVkNBQ0hFX0hBU0hTSVpFOyBpKyspIHsKKwkJbXR4X2xvY2soJm5mc3JjX3Rj cG10eFtpXSk7CiAJCUxJU1RfRk9SRUFDSF9TQUZFKHJwLCAmbmZzcnZoYXNodGJsW2ldLCByY19o YXNoLCBuZXh0cnApIHsKIAkJCW5mc3JjX2ZyZWVjYWNoZShycCk7CiAJCX0KKwkJbXR4X3VubG9j aygmbmZzcmNfdGNwbXR4W2ldKTsKIAl9CisJbXR4X2xvY2soJm5mc3JjX3VkcG10eCk7CiAJZm9y IChpID0gMDsgaSA8IE5GU1JWQ0FDSEVfSEFTSFNJWkU7IGkrKykgewogCQlMSVNUX0ZPUkVBQ0hf U0FGRShycCwgJm5mc3J2dWRwaGFzaHRibFtpXSwgcmNfaGFzaCwgbmV4dHJwKSB7CiAJCQluZnNy Y19mcmVlY2FjaGUocnApOwogCQl9CiAJfQogCW5ld25mc3N0YXRzLnNydmNhY2hlX3NpemUgPSAw OworCW10eF91bmxvY2soJm5mc3JjX3VkcG10eCk7CiAJbmZzcmNfdGNwc2F2ZWRyZXBsaWVzID0g MDsKLQlORlNVTkxPQ0tDQUNIRSgpOwogfQogCiAvKgpAQCAtNzkyLDM0ICs4MjAsNDIgQEAgbmZz cmNfdHJpbWNhY2hlKHVfaW50NjRfdCBzb2NrcmVmLCBzdHJ1YwogewogCXN0cnVjdCBuZnNydmNh Y2hlICpycCwgKm5leHRycDsKIAlpbnQgaTsKLQlzdGF0aWMgdGltZV90IGxhc3R0cmltID0gMDsK KwlzdGF0aWMgdGltZV90IHVkcF9sYXN0dHJpbSA9IDAsIHRjcF9sYXN0dHJpbSA9IDA7CiAKLQlp ZiAoTkZTRF9NT05PU0VDID09IGxhc3R0cmltICYmCi0JICAgIG5mc3JjX3RjcHNhdmVkcmVwbGll cyA8IG5mc3JjX3RjcGhpZ2h3YXRlciAmJgotCSAgICBuZnNyY191ZHBjYWNoZXNpemUgPCAobmZz cmNfdWRwaGlnaHdhdGVyICsKLQkgICAgbmZzcmNfdWRwaGlnaHdhdGVyIC8gMikpCi0JCXJldHVy bjsKLQlORlNMT0NLQ0FDSEUoKTsKLQlsYXN0dHJpbSA9IE5GU0RfTU9OT1NFQzsKLQlUQUlMUV9G T1JFQUNIX1NBRkUocnAsICZuZnNydnVkcGxydSwgcmNfbHJ1LCBuZXh0cnApIHsKLQkJaWYgKCEo cnAtPnJjX2ZsYWcgJiAoUkNfSU5QUk9HfFJDX0xPQ0tFRHxSQ19XQU5URUQpKQotCQkgICAgICYm IHJwLT5yY19yZWZjbnQgPT0gMAotCQkgICAgICYmICgocnAtPnJjX2ZsYWcgJiBSQ19SRUZDTlQp IHx8Ci0JCQkgTkZTRF9NT05PU0VDID4gcnAtPnJjX3RpbWVzdGFtcCB8fAotCQkJIG5mc3JjX3Vk cGNhY2hlc2l6ZSA+IG5mc3JjX3VkcGhpZ2h3YXRlcikpCi0JCQluZnNyY19mcmVlY2FjaGUocnAp OwotCX0KLQlmb3IgKGkgPSAwOyBpIDwgTkZTUlZDQUNIRV9IQVNIU0laRTsgaSsrKSB7Ci0JCUxJ U1RfRk9SRUFDSF9TQUZFKHJwLCAmbmZzcnZoYXNodGJsW2ldLCByY19oYXNoLCBuZXh0cnApIHsK KwlpZiAoTkZTRF9NT05PU0VDICE9IHVkcF9sYXN0dHJpbSB8fAorCSAgICBuZnNyY191ZHBjYWNo ZXNpemUgPj0gKG5mc3JjX3VkcGhpZ2h3YXRlciArCisJICAgIG5mc3JjX3VkcGhpZ2h3YXRlciAv IDIpKSB7CisJCW10eF9sb2NrKCZuZnNyY191ZHBtdHgpOworCQl1ZHBfbGFzdHRyaW0gPSBORlNE X01PTk9TRUM7CisJCVRBSUxRX0ZPUkVBQ0hfU0FGRShycCwgJm5mc3J2dWRwbHJ1LCByY19scnUs IG5leHRycCkgewogCQkJaWYgKCEocnAtPnJjX2ZsYWcgJiAoUkNfSU5QUk9HfFJDX0xPQ0tFRHxS Q19XQU5URUQpKQogCQkJICAgICAmJiBycC0+cmNfcmVmY250ID09IDAKIAkJCSAgICAgJiYgKChy 
cC0+cmNfZmxhZyAmIFJDX1JFRkNOVCkgfHwKIAkJCQkgTkZTRF9NT05PU0VDID4gcnAtPnJjX3Rp bWVzdGFtcCB8fAotCQkJCSBuZnNyY19hY3RpdmVzb2NrZXQocnAsIHNvY2tyZWYsIHNvKSkpCisJ CQkJIG5mc3JjX3VkcGNhY2hlc2l6ZSA+IG5mc3JjX3VkcGhpZ2h3YXRlcikpCiAJCQkJbmZzcmNf ZnJlZWNhY2hlKHJwKTsKIAkJfQorCQltdHhfdW5sb2NrKCZuZnNyY191ZHBtdHgpOworCX0KKwlp ZiAoTkZTRF9NT05PU0VDICE9IHRjcF9sYXN0dHJpbSB8fAorCSAgICBuZnNyY190Y3BzYXZlZHJl cGxpZXMgPj0gbmZzcmNfdGNwaGlnaHdhdGVyKSB7CisJCWZvciAoaSA9IDA7IGkgPCBORlNSVkNB Q0hFX0hBU0hTSVpFOyBpKyspIHsKKwkJCW10eF9sb2NrKCZuZnNyY190Y3BtdHhbaV0pOworCQkJ aWYgKGkgPT0gMCkKKwkJCQl0Y3BfbGFzdHRyaW0gPSBORlNEX01PTk9TRUM7CisJCQlMSVNUX0ZP UkVBQ0hfU0FGRShycCwgJm5mc3J2aGFzaHRibFtpXSwgcmNfaGFzaCwKKwkJCSAgICBuZXh0cnAp IHsKKwkJCQlpZiAoIShycC0+cmNfZmxhZyAmCisJCQkJICAgICAoUkNfSU5QUk9HfFJDX0xPQ0tF RHxSQ19XQU5URUQpKQorCQkJCSAgICAgJiYgcnAtPnJjX3JlZmNudCA9PSAwCisJCQkJICAgICAm JiAoKHJwLT5yY19mbGFnICYgUkNfUkVGQ05UKSB8fAorCQkJCQkgTkZTRF9NT05PU0VDID4gcnAt PnJjX3RpbWVzdGFtcCB8fAorCQkJCQkgbmZzcmNfYWN0aXZlc29ja2V0KHJwLCBzb2NrcmVmLCBz bykpKQorCQkJCQluZnNyY19mcmVlY2FjaGUocnApOworCQkJfQorCQkJbXR4X3VubG9jaygmbmZz cmNfdGNwbXR4W2ldKTsKKwkJfQogCX0KLQlORlNVTkxPQ0tDQUNIRSgpOwogfQogCiAvKgpAQCAt ODI4LDEyICs4NjQsMTQgQEAgbmZzcmNfdHJpbWNhY2hlKHVfaW50NjRfdCBzb2NrcmVmLCBzdHJ1 YwogQVBQTEVTVEFUSUMgdm9pZAogbmZzcnZkX3JlZmNhY2hlKHN0cnVjdCBuZnNydmNhY2hlICpy cCkKIHsKKwlzdHJ1Y3QgbXR4ICptdXRleDsKIAotCU5GU0xPQ0tDQUNIRSgpOworCW11dGV4ID0g bmZzcmNfY2FjaGVtdXRleChycCk7CisJbXR4X2xvY2sobXV0ZXgpOwogCWlmIChycC0+cmNfcmVm Y250IDwgMCkKIAkJcGFuaWMoIm5mcyBjYWNoZSByZWZjbnQiKTsKIAlycC0+cmNfcmVmY250Kys7 Ci0JTkZTVU5MT0NLQ0FDSEUoKTsKKwltdHhfdW5sb2NrKG11dGV4KTsKIH0KIAogLyoKQEAgLTg0 MiwxNCArODgwLDE2IEBAIG5mc3J2ZF9yZWZjYWNoZShzdHJ1Y3QgbmZzcnZjYWNoZSAqcnApCiBB UFBMRVNUQVRJQyB2b2lkCiBuZnNydmRfZGVyZWZjYWNoZShzdHJ1Y3QgbmZzcnZjYWNoZSAqcnAp CiB7CisJc3RydWN0IG10eCAqbXV0ZXg7CiAKLQlORlNMT0NLQ0FDSEUoKTsKKwltdXRleCA9IG5m c3JjX2NhY2hlbXV0ZXgocnApOworCW10eF9sb2NrKG11dGV4KTsKIAlpZiAocnAtPnJjX3JlZmNu dCA8PSAwKQogCQlwYW5pYygibmZzIGNhY2hlIGRlcmVmY250Iik7CiAJcnAtPnJjX3JlZmNudC0t OwogCWlmIChycC0+cmNfcmVmY250ID09IDAgJiYgIShycC0+cmNfZmxhZyAmIChSQ19MT0NLRUQg fCBSQ19JTlBST0cpKSkKIAkJbmZzcmNfZnJlZWNhY2hlKHJwKTsKLQlORlNVTkxPQ0tDQUNIRSgp OworCW10eF91bmxvY2sobXV0ZXgpOwogfQogCiAvKgotLS0gZnMvbmZzc2VydmVyL25mc19uZnNk cG9ydC5jLnNhdgkyMDEyLTEwLTExIDE3OjM4OjI2LjAwMDAwMDAwMCAtMDQwMAorKysgZnMvbmZz c2VydmVyL25mc19uZnNkcG9ydC5jCTIwMTItMTAtMTEgMTc6NDM6MTYuMDAwMDAwMDAwIC0wNDAw CkBAIC02MCw3ICs2MCw4IEBAIGV4dGVybiBTVkNQT09MCSpuZnNydmRfcG9vbDsKIGV4dGVybiBz dHJ1Y3QgbmZzdjRsb2NrIG5mc2Rfc3VzcGVuZF9sb2NrOwogc3RydWN0IHZmc29wdGxpc3QgbmZz djRyb290X29wdCwgbmZzdjRyb290X25ld29wdDsKIE5GU0RMT0NLTVVURVg7Ci1zdHJ1Y3QgbXR4 IG5mc19jYWNoZV9tdXRleDsKK3N0cnVjdCBtdHggbmZzcmNfdGNwbXR4W05GU1JWQ0FDSEVfSEFT SFNJWkVdOworc3RydWN0IG10eCBuZnNyY191ZHBtdHg7CiBzdHJ1Y3QgbXR4IG5mc192NHJvb3Rf bXV0ZXg7CiBzdHJ1Y3QgbmZzcnZmaCBuZnNfcm9vdGZoLCBuZnNfcHViZmg7CiBpbnQgbmZzX3B1 YmZoc2V0ID0gMCwgbmZzX3Jvb3RmaHNldCA9IDA7CkBAIC0zMjg4LDcgKzMyODksNyBAQCBleHRl cm4gaW50ICgqbmZzZF9jYWxsX25mc2QpKHN0cnVjdCB0aHJlCiBzdGF0aWMgaW50CiBuZnNkX21v ZGV2ZW50KG1vZHVsZV90IG1vZCwgaW50IHR5cGUsIHZvaWQgKmRhdGEpCiB7Ci0JaW50IGVycm9y ID0gMDsKKwlpbnQgZXJyb3IgPSAwLCBpOwogCXN0YXRpYyBpbnQgbG9hZGVkID0gMDsKIAogCXN3 aXRjaCAodHlwZSkgewpAQCAtMzI5Niw3ICszMjk3LDEwIEBAIG5mc2RfbW9kZXZlbnQobW9kdWxl X3QgbW9kLCBpbnQgdHlwZSwgdm8KIAkJaWYgKGxvYWRlZCkKIAkJCWdvdG8gb3V0OwogCQluZXdu ZnNfcG9ydGluaXQoKTsKLQkJbXR4X2luaXQoJm5mc19jYWNoZV9tdXRleCwgIm5mc19jYWNoZV9t dXRleCIsIE5VTEwsIE1UWF9ERUYpOworCQlmb3IgKGkgPSAwOyBpIDwgTkZTUlZDQUNIRV9IQVNI U0laRTsgaSsrKQorCQkJbXR4X2luaXQoJm5mc3JjX3RjcG10eFtpXSwgIm5mc190Y3BjYWNoZV9t 
dXRleCIsIE5VTEwsCisJCQkgICAgTVRYX0RFRik7CisJCW10eF9pbml0KCZuZnNyY191ZHBtdHgs ICJuZnNfdWRwY2FjaGVfbXV0ZXgiLCBOVUxMLCBNVFhfREVGKTsKIAkJbXR4X2luaXQoJm5mc192 NHJvb3RfbXV0ZXgsICJuZnNfdjRyb290X211dGV4IiwgTlVMTCwgTVRYX0RFRik7CiAJCW10eF9p bml0KCZuZnN2NHJvb3RfbW50Lm1udF9tdHgsICJzdHJ1Y3QgbW91bnQgbXR4IiwgTlVMTCwKIAkJ ICAgIE1UWF9ERUYpOwpAQCAtMzM0MCw3ICszMzQ0LDkgQEAgbmZzZF9tb2RldmVudChtb2R1bGVf dCBtb2QsIGludCB0eXBlLCB2bwogCQkJc3ZjcG9vbF9kZXN0cm95KG5mc3J2ZF9wb29sKTsKIAog CQkvKiBhbmQgZ2V0IHJpZCBvZiB0aGUgbG9ja3MgKi8KLQkJbXR4X2Rlc3Ryb3koJm5mc19jYWNo ZV9tdXRleCk7CisJCWZvciAoaSA9IDA7IGkgPCBORlNSVkNBQ0hFX0hBU0hTSVpFOyBpKyspCisJ CQltdHhfZGVzdHJveSgmbmZzcmNfdGNwbXR4W2ldKTsKKwkJbXR4X2Rlc3Ryb3koJm5mc3JjX3Vk cG10eCk7CiAJCW10eF9kZXN0cm95KCZuZnNfdjRyb290X211dGV4KTsKIAkJbXR4X2Rlc3Ryb3ko Jm5mc3Y0cm9vdF9tbnQubW50X210eCk7CiAJCWxvY2tkZXN0cm95KCZuZnN2NHJvb3RfbW50Lm1u dF9leHBsb2NrKTsKLS0tIGZzL25mcy9uZnNwb3J0Lmguc2F2CTIwMTItMTAtMTAgMjA6NTY6MjYu MDAwMDAwMDAwIC0wNDAwCisrKyBmcy9uZnMvbmZzcG9ydC5oCTIwMTItMTAtMTAgMjA6NTY6NDIu MDAwMDAwMDAwIC0wNDAwCkBAIC01NDYsMTEgKzU0Niw2IEBAIHZvaWQgbmZzcnZkX3JjdihzdHJ1 Y3Qgc29ja2V0ICosIHZvaWQgKiwKICNkZWZpbmUJTkZTUkVRU1BJTkxPQ0sJCWV4dGVybiBzdHJ1 Y3QgbXR4IG5mc19yZXFfbXV0ZXgKICNkZWZpbmUJTkZTTE9DS1JFUSgpCQltdHhfbG9jaygmbmZz X3JlcV9tdXRleCkKICNkZWZpbmUJTkZTVU5MT0NLUkVRKCkJCW10eF91bmxvY2soJm5mc19yZXFf bXV0ZXgpCi0jZGVmaW5lCU5GU0NBQ0hFTVVURVgJCWV4dGVybiBzdHJ1Y3QgbXR4IG5mc19jYWNo ZV9tdXRleAotI2RlZmluZQlORlNDQUNIRU1VVEVYUFRSCSgmbmZzX2NhY2hlX211dGV4KQotI2Rl ZmluZQlORlNMT0NLQ0FDSEUoKQkJbXR4X2xvY2soJm5mc19jYWNoZV9tdXRleCkKLSNkZWZpbmUJ TkZTVU5MT0NLQ0FDSEUoKQltdHhfdW5sb2NrKCZuZnNfY2FjaGVfbXV0ZXgpCi0jZGVmaW5lCU5G U0NBQ0hFTE9DS1JFUVVJUkVEKCkJbXR4X2Fzc2VydCgmbmZzX2NhY2hlX211dGV4LCBNQV9PV05F RCkKICNkZWZpbmUJTkZTU09DS01VVEVYCQlleHRlcm4gc3RydWN0IG10eCBuZnNfc2xvY2tfbXV0 ZXgKICNkZWZpbmUJTkZTU09DS01VVEVYUFRSCQkoJm5mc19zbG9ja19tdXRleCkKICNkZWZpbmUJ TkZTTE9DS1NPQ0soKQkJbXR4X2xvY2soJm5mc19zbG9ja19tdXRleCkKLS0tIGZzL25mcy9uZnNy dmNhY2hlLmguc2F2CTIwMTItMTAtMTIgMjA6MDM6NDIuMDAwMDAwMDAwIC0wNDAwCisrKyBmcy9u ZnMvbmZzcnZjYWNoZS5oCTIwMTItMTAtMTIgMjA6MDM6NTUuMDAwMDAwMDAwIC0wNDAwCkBAIC00 MSw3ICs0MSw3IEBACiAjZGVmaW5lCU5GU1JWQ0FDSEVfTUFYX1NJWkUJMjA0OAogI2RlZmluZQlO RlNSVkNBQ0hFX01JTl9TSVpFCSAgNjQKIAotI2RlZmluZQlORlNSVkNBQ0hFX0hBU0hTSVpFCTIw CisjZGVmaW5lCU5GU1JWQ0FDSEVfSEFTSFNJWkUJMjAwCiAKIHN0cnVjdCBuZnNydmNhY2hlIHsK IAlMSVNUX0VOVFJZKG5mc3J2Y2FjaGUpIHJjX2hhc2g7CQkvKiBIYXNoIGNoYWluICovCg== ------=_Part_2185821_351431992.1350093954057-- From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 04:55:43 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 153A634A for ; Sat, 13 Oct 2012 04:55:43 +0000 (UTC) (envelope-from wollman@hergotha.csail.mit.edu) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) by mx1.freebsd.org (Postfix) with ESMTP id B74128FC0A for ; Sat, 13 Oct 2012 04:55:42 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.5/8.14.5) with ESMTP id q9D4tf2s037126; Sat, 13 Oct 2012 00:55:41 -0400 (EDT) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.5/8.14.4/Submit) id q9D4tfcG037123; Sat, 13 Oct 2012 00:55:41 -0400 (EDT) (envelope-from wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <20600.62541.243673.307571@hergotha.csail.mit.edu> Date: Sat, 13 Oct 2012 00:55:41 -0400 From: 
Garrett Wollman To: Rick Macklem Subject: Re: NFS server bottlenecks In-Reply-To: <937460294.2185822.1350093954059.JavaMail.root@erie.cs.uoguelph.ca> References: <608951636.2115684.1349992972756.JavaMail.root@erie.cs.uoguelph.ca> <937460294.2185822.1350093954059.JavaMail.root@erie.cs.uoguelph.ca> X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (hergotha.csail.mit.edu [127.0.0.1]); Sat, 13 Oct 2012 00:55:41 -0400 (EDT) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu Cc: Nikolay Denev , FreeBSD Hackers X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 04:55:43 -0000 < said: > I've attached the patch drc3.patch (it assumes drc2.patch has already been > applied) that replaces the single mutex with one for each hash list > for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. I haven't tested this at all, but I think putting all of the mutexes in an array like that is likely to cause cache-line ping-ponging. It may be better to use a pool mutex, or to put the mutexes adjacent in memory to the list heads that they protect. (But I probably won't be able to do the performance testing on any of these for a while. I have a server running the "drc2" code but haven't gotten my users to put a load on it yet.) -GAWollman From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 10:26:44 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 60591A7F; Sat, 13 Oct 2012 10:26:44 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 118148FC12; Sat, 13 Oct 2012 10:26:44 +0000 (UTC) Received: from fledge.watson.org (fledge.watson.org [65.122.17.41]) by cyrus.watson.org (Postfix) with ESMTPS id 10A2546B09; Sat, 13 Oct 2012 06:26:43 -0400 (EDT) Date: Sat, 13 Oct 2012 11:26:42 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Carl Delsey Subject: Re: No bus_space_read_8 on x86 ? In-Reply-To: <5078575B.2020808@intel.com> Message-ID: References: <506DC574.9010300@intel.com> <201210091154.15873.jhb@freebsd.org> <5075EC29.1010907@intel.com> <201210121131.46373.jhb@freebsd.org> <5078575B.2020808@intel.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 10:26:44 -0000 On Fri, 12 Oct 2012, Carl Delsey wrote: >> Indeed -- and on non-x86, where there are uncached direct map segments, and >> TLB entries that disable caching, reading 2x 32-bit vs 1x 64-bit have quite >> different effects in terms of atomicity. 
Where uncached I/Os are being >> used, those differences may affect semantics significantly -- e.g., if your >> device has a 64-bit memory-mapped FIFO or registers, 2x 32-bit gives you >> two halves of two different 64-bit values, rather than two halves of the >> same value. As device drivers depend on those atomicity semantics, we >> should (at the busspace level) offer only the exactly expected semantics, >> rather than trying to patch things up. If a device driver accessing 64-bit >> fields wants to support doing it using two 32-bit reads, it can figure out >> how to splice it together following bus_space_read_region_4(). > I wouldn't make any default behaviour for bus_space_read_8 on i386, just > amd64. My assumption (which may be unjustified) is that by far the most > common implementations to read a 64-bit register on i386 would be to read the > lower 4 bytes first, followed by the upper 4 bytes (or vice versa) and then > stitch them together. I think we should provide helper functions for these > two cases, otherwise I fear our code base will be littered with multiple > independent implementations of this. > > Some driver writer who wants to take advantage of these helper functions > would do something like > #ifdef i386 > #define bus_space_read_8 bus_space_read_8_lower_first > #endif > otherwise, using bus_space_read_8 won't compile for i386 builds. > If these implementations won't work for their case, they are free to write > their own implementation or take whatever action is necessary. > > I guess my question is, are these cases common enough that it is worth > helping developers by providing functions that do the double read and shifts > for them, or do we leave them to deal with it on their own at the risk of > possibly some duplicated code. I was thinking we might suggest to developers that they use a KPI that specifically captures the underlying semantics, so it's clear they understand them. Untested example: uint64_t v; /* * On 32-bit systems, read the 64-bit statistic using two 32-bit * reads. * * XXX: This will sometimes lead to a race. * * XXX: Gosh, I wonder if some word-swapping is needed in the merge? */ #ifdef 32-bit bus_space_read_region_4(space, handle, offset, (uint32_t *)&v, 2; #else bus_space_read_8(space, handle, offset, &v); #endif The potential need to word swap, however, suggests that you may be right about the error-prone nature of manual merging. 
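For what it's worth, a cleaned-up sketch of the kind of helper Carl floats above. The name bus_space_read_8_lower_first is his proposal, not an existing KPI, and the word order (low half at offset 0, high half at offset 4) is an assumption; a device that latches its halves the other way around, or a byte-swapped mapping, would need the opposite order or a swap, which is the word-swap worry mentioned above.

#include <sys/param.h>
#include <machine/bus.h>

/*
 * Hypothetical helper, not an existing bus_space routine: synthesize a
 * 64-bit read from two 32-bit reads, low word first.  As discussed in
 * this thread, the two halves are not read atomically, so the caller
 * must know its device tolerates that.
 */
static __inline uint64_t
bus_space_read_8_lower_first(bus_space_tag_t t, bus_space_handle_t h,
    bus_size_t o)
{
	uint64_t lo, hi;

	lo = bus_space_read_4(t, h, o);
	hi = bus_space_read_4(t, h, o + 4);

	return (lo | (hi << 32));
}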
Robert From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 13:03:37 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7021E91B; Sat, 13 Oct 2012 13:03:37 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id DC2F28FC12; Sat, 13 Oct 2012 13:03:36 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap8EAN6MclCDaFvO/2dsb2JhbABFhhG6GYIgAQEBBAEBASArIAsbGAICDRkCKQEJJgYIBwQBHASHZAumTJF3gSGKLhqEZIESA5M+gi2BFY8ZgwmBRzQ X-IronPort-AV: E=Sophos;i="4.80,581,1344225600"; d="scan'208";a="183488893" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 13 Oct 2012 09:03:23 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id F1267B41C2; Sat, 13 Oct 2012 09:03:22 -0400 (EDT) Date: Sat, 13 Oct 2012 09:03:22 -0400 (EDT) From: Rick Macklem To: Garrett Wollman Message-ID: <611092759.2189637.1350133402953.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <20600.62541.243673.307571@hergotha.csail.mit.edu> Subject: Re: NFS server bottlenecks MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - IE7 (Win)/6.0.10_GA_2692) Cc: Nikolay Denev , FreeBSD Hackers X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 13:03:37 -0000 Garrett Wollman wrote: > < said: > > > I've attached the patch drc3.patch (it assumes drc2.patch has > > already been > > applied) that replaces the single mutex with one for each hash list > > for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. > > I haven't tested this at all, but I think putting all of the mutexes > in an array like that is likely to cause cache-line ping-ponging. It > may be better to use a pool mutex, or to put the mutexes adjacent in > memory to the list heads that they protect. Well, I'll admit I don't know how to do this. What the code does need is a "set of mutexes", where any of the mutexes can be referred to by an "index". I could easily define a structure that has: struct nfsrc_hashhead { struct nfsrvcachehead head; struct mtx mutex; } nfsrc_hashhead[NFSRVCACHE_HASHSIZE]; - but all that does is leave a small structure between each "struct mtx" and I wouldn't have thought that would make much difference. (How big is a typical hardware cache line these days? I have no idea.) - I suppose I could "waste space" and define a glob of unused space between them, like: struct nfsrc_hashhead { struct nfsrvcachehead head; char garbage[N]; struct mtx mutex; } nfsrc_hashhead[NFSRVCACHE_HASHSIZE]; - If this makes sense, how big should N be? (Somewhat less that the length of a cache line, I'd guess. It seems that the structure should be at least a cache line length in size.) All this seems "kinda hokey" to me and beyond what code at this level should be worrying about, but I'm game to make changes, if others think it's appropriate. I've never use mtx_pool(9) mutexes, but it doesn't sound like they would be the right fit, from reading the man page. 
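On the "how big should N be" question a few sentences up: rather than hand-picking a pad, the structure can be padded by the compiler to a whole cache line. Typical x86 hardware lines are 64 bytes, and FreeBSD already provides CACHE_LINE_SIZE via <sys/param.h> as a safe upper bound. A minimal sketch, reusing names from the drc3.patch discussion (this is not the committed code, and NFSRVCACHE_HASHSIZE/struct nfsrvcache really live in fs/nfs/nfsrvcache.h):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

/* In the real code these come from fs/nfs/nfsrvcache.h. */
struct nfsrvcache;
#ifndef NFSRVCACHE_HASHSIZE
#define	NFSRVCACHE_HASHSIZE	200
#endif

/*
 * Pair each hash chain with the mutex that protects it and let the
 * compiler pad the pair out to a multiple of CACHE_LINE_SIZE, so that
 * locking bucket i never bounces the cache line holding bucket i+1.
 */
struct nfsrc_hashbucket {
	struct mtx		mtx;	/* protects head only */
	LIST_HEAD(, nfsrvcache)	head;	/* hash chain for this bucket */
} __aligned(CACHE_LINE_SIZE);

static struct nfsrc_hashbucket nfsrc_tcphash[NFSRVCACHE_HASHSIZE];

Whether the extra memory is worth it compared with mtx_pool(9) is exactly the open question here.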
(Assuming the mtx_pool_find() is guaranteed to return the same mutex for the same address passed in as an argument, it would seem that they would work, since I can pass &nfsrvcachehead[i] in as the pointer arg to "index" a mutex.) Hopefully jhb@ can say if using mtx_pool(9) for this would be better than an array: struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE]; Does anyone conversant with mutexes know what the best coding approach is? >(But I probably won't be > able to do the performance testing on any of these for a while. I > have a server running the "drc2" code but haven't gotten my users to > put a load on it yet.) > No rush. At this point, the earliest I could commit something like this to head would be December. rick ps: I hope John doesn't mind being added to the cc list yet again. It's just that I suspect he knows a fair bit about mutex implementation and possible hardware cache line effects. > -GAWollman > _______________________________________________ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to > "freebsd-hackers-unsubscribe@freebsd.org" From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 15:22:55 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 482E6D3F; Sat, 13 Oct 2012 15:22:55 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-wg0-f50.google.com (mail-wg0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id A42558FC17; Sat, 13 Oct 2012 15:22:54 +0000 (UTC) Received: by mail-wg0-f50.google.com with SMTP id 16so2959150wgi.31 for ; Sat, 13 Oct 2012 08:22:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; bh=/olCYPBEWcL7a7m43HBPAkiGp1N2wqa0K+xYBKri0LI=; b=eKLvkyZ+8ZV26Oi+8RVYt75xYScP2qlO3T3X1APvryZ2xfuTckMVvC77WUBqz+AmLW OGXi/JU/PgLlRGjExhfQe41B2BGp/4xE+wtLI4a2HhcaMSzwTNMhxFDijCWTf+LZ624M OKR54RzfgkWPrpbeOj8Pg4/JZOSlGXpVaP0rczNVS+3UVKFvSojNe8CHCsJKsqx8OBo1 54Q7WY1L/YNDZfAjfjA3MuNWYCRPYPgmYCe+C3MLoUjWxZvwSaFOImFwUHrHnpeb8zUx iq2IwtJyVG4mkD+ea7Chi70Qck9Px11NPVtwZTN3IEnEp1OALxaXi53LiYz83TjOY9IQ 03rw== Received: by 10.180.19.71 with SMTP id c7mr12865089wie.2.1350141767964; Sat, 13 Oct 2012 08:22:47 -0700 (PDT) Received: from [10.181.156.211] ([213.226.63.148]) by mx.google.com with ESMTPS id dm3sm4093716wib.3.2012.10.13.08.22.45 (version=TLSv1/SSLv3 cipher=OTHER); Sat, 13 Oct 2012 08:22:47 -0700 (PDT) Subject: Re: NFS server bottlenecks Mime-Version: 1.0 (Mac OS X Mail 6.1 \(1498\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: <937460294.2185822.1350093954059.JavaMail.root@erie.cs.uoguelph.ca> Date: Sat, 13 Oct 2012 18:22:50 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: <302BF685-4B9D-49C8-8000-8D0F6540C8F7@gmail.com> References: <937460294.2185822.1350093954059.JavaMail.root@erie.cs.uoguelph.ca> To: Rick Macklem X-Mailer: Apple Mail (2.1498) Cc: FreeBSD Hackers , Garrett Wollman X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 15:22:55 -0000 On Oct 13, 2012, at 5:05 AM, Rick Macklem wrote: > I wrote: >> Oops, I didn't get the "readahead" 
option description >> quite right in the last post. The default read ahead >> is 1, which does result in "rsize * 2", since there is >> the read + 1 readahead. >>=20 >> "rsize * 16" would actually be for the option "readahead=3D15" >> and for "readahead=3D16" the calculation would be "rsize * 17". >>=20 >> However, the example was otherwise ok, I think? rick >=20 > I've attached the patch drc3.patch (it assumes drc2.patch has already = been > applied) that replaces the single mutex with one for each hash list > for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. >=20 > These patches are also at: > http://people.freebsd.org/~rmacklem/drc2.patch > http://people.freebsd.org/~rmacklem/drc3.patch > in case the attachments don't get through. >=20 > rick > ps: I haven't tested drc3.patch a lot, but I think it's ok? drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the = Linux host. drc2.pach (but with NFSRVCACHE_HASHSIZE=3D500) TEST WITH 8K = --------------------------------------------------------------------------= ----------------------- Auto Mode Using Minimum Record Size 8 KB Using Maximum Record Size 8 KB Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. O_DIRECT feature enabled SYNC Mode.=20 OPS Mode. Output is in operations per second. Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O = -i 0 -i 1 -i 2 Time Resolution =3D 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random = random bkwd record stride =20 KB reclen write rewrite read reread read = write read rewrite read fwrite frewrite fread freread 2097152 8 1919 1914 2356 2321 2335 = 1706 =20 TEST WITH 1M = --------------------------------------------------------------------------= ----------------------- Auto Mode Using Minimum Record Size 1024 KB Using Maximum Record Size 1024 KB Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. O_DIRECT feature enabled SYNC Mode.=20 OPS Mode. Output is in operations per second. Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O = -i 0 -i 1 -i 2 Time Resolution =3D 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random = random bkwd record stride =20 KB reclen write rewrite read reread read = write read rewrite read fwrite frewrite fread freread 2097152 1024 73 64 477 486 496 = 61 =20 drc3.patch TEST WITH 8K = --------------------------------------------------------------------------= ----------------------- Auto Mode Using Minimum Record Size 8 KB Using Maximum Record Size 8 KB Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. O_DIRECT feature enabled SYNC Mode.=20 OPS Mode. Output is in operations per second. Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O = -i 0 -i 1 -i 2 Time Resolution =3D 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. 
random = random bkwd record stride =20 KB reclen write rewrite read reread read = write read rewrite read fwrite frewrite fread freread 2097152 8 2108 2397 3001 3013 3010 = 2389 =20 TEST WITH 1M = --------------------------------------------------------------------------= ----------------------- Auto Mode Using Minimum Record Size 1024 KB Using Maximum Record Size 1024 KB Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. O_DIRECT feature enabled SYNC Mode.=20 OPS Mode. Output is in operations per second. Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O = -i 0 -i 1 -i 2 Time Resolution =3D 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random = random bkwd record stride =20 KB reclen write rewrite read reread read = write read rewrite read fwrite frewrite fread freread 2097152 1024 80 79 521 536 528 = 75 =20 Also with drc3 the CPU usage on the server is noticeably lower. Most of = the time I could see only the geom{g_up}/{g_down} threads, and a few nfsd threads, before that nfsd's were much more prominent. I guess under bigger load the performance improvement can be bigger. I'll run some more tests with heavier loads this week. Thanks, Nikolay From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 19:16:01 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id F40D9788; Sat, 13 Oct 2012 19:16:00 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from vps.hungerhost.com (vps.hungerhost.com [216.38.53.176]) by mx1.freebsd.org (Postfix) with ESMTP id A8B888FC08; Sat, 13 Oct 2012 19:16:00 +0000 (UTC) Received: from pool-96-250-5-62.nycmny.fios.verizon.net ([96.250.5.62]:60073 helo=minion.home) by vps.hungerhost.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.80) (envelope-from ) id 1TN7Bo-0005JZ-2M; Sat, 13 Oct 2012 15:16:00 -0400 Content-Type: text/plain; charset=iso-8859-1 Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program From: George Neville-Neil In-Reply-To: <127FA63D-8EEE-4616-AE1E-C39469DDCC6A@xcllnt.net> Date: Sat, 13 Oct 2012 15:15:59 -0400 Content-Transfer-Encoding: quoted-printable Message-Id: <1340AB5D-F824-4E7D-9D6C-F7E5489AE870@neville-neil.com> References: <201210020750.23358.jhb@freebsd.org> <201210021037.27762.jhb@freebsd.org> <127FA63D-8EEE-4616-AE1E-C39469DDCC6A@xcllnt.net> To: Marcel Moolenaar X-Mailer: Apple Mail (2.1499) X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - vps.hungerhost.com X-AntiAbuse: Original Domain - freebsd.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - neville-neil.com X-Mailman-Approved-At: Sat, 13 Oct 2012 19:26:47 +0000 Cc: Garrett Cooper , freebsd-hackers@freebsd.org, "Simon J. 
Gerraty" , freebsd-arch@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 19:16:01 -0000

On Oct 8, 2012, at 12:11, Marcel Moolenaar wrote:
>
> On Oct 4, 2012, at 9:42 AM, Garrett Cooper wrote:
>>>> Both parties (Isilon/Juniper) are converging on the ATF porting work
>>>> that Giorgos/myself have done after talking at the FreeBSD Foundation
>>>> meet-n-greet. I have contributed all of the patches that I have over
>>>> to marcel for feedback.
>>>
>>> This is very non-obvious to the public at large (e.g. there was no public
>>> response to one group's inquiry about the second ATF import, for example).
>>> Also, given that you had no idea that sgf@ and obrien@ were working on
>>> importing NetBSD's bmake as a prerequisite for ATF, it seems that whatever
>>> discussions were held were not very detailed at best. I think it would be
>>> good to have the various folks working on ATF at least summarize the
>>> current state of things and sketch out some sort of plan or roadmap for
>>> future work in a public forum (such as atf@, though a summary mail would
>>> be quite appropriate for arch@).
>>
>> I'm in part to blame for this. There was some discussion -- but not at
>> length; unfortunately no one from Juniper was present at the meet and
>> greet; the information I got was second hand; I didn't follow up to
>> figure out the exact details / clarify what I had in mind with the
>> appropriate parties.
>
> Hang on. I want in on the blame part! :-)
>
> Seriously: no one is really to blame as far as I can see. We just had
> two independent efforts (ATF & bmake) and there was no indication that
> one would greatly benefit from the other. At least not to the
> point of creating a dependency.
>
> I just committed the bmake bits. It not only adds bmake to the build,
> but also includes the changes necessary to use bmake.
>
> With that in place it's easier to decide whether we want the dependency
> or not.
>
> Before we can switch permanently to bmake, we need to do the following
> first:
> 1. Request an EXP ports build with bmake as make(1). This should tell
>    us the "damage" of switching to bmake for ports.
> 2. In parallel with 1: build www & docs with bmake and assess the
>    damage
> 3. Fix all the damage
>
> Then:
>
> 4. Switch.
>
> It could be a while (many weeks) before we get to 4, so the question
> really is whether the people working on ATF are willing and able to
> build and install FreeBSD using WITH_BMAKE?
>

I think that's a small price to pay for getting going with the ATF
stuff now rather than in 4 weeks. What's the right way to do this
now with HEAD?
Best,
George

From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 20:52:42 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id F3AA87BA; Sat, 13 Oct 2012 20:52:41 +0000 (UTC) (envelope-from marcel@xcllnt.net) Received: from mail.xcllnt.net (mail.xcllnt.net [70.36.220.4]) by mx1.freebsd.org (Postfix) with ESMTP id C63AC8FC14; Sat, 13 Oct 2012 20:52:41 +0000 (UTC) Received: from marcelm-sslvpn-nc.jnpr.net (natint3.juniper.net [66.129.224.36]) (authenticated bits=0) by mail.xcllnt.net (8.14.5/8.14.5) with ESMTP id q9DKqdBr086942 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO); Sat, 13 Oct 2012 13:52:40 -0700 (PDT) (envelope-from marcel@xcllnt.net) Content-Type: text/plain; charset=iso-8859-1 Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program From: Marcel Moolenaar In-Reply-To: <1340AB5D-F824-4E7D-9D6C-F7E5489AE870@neville-neil.com> Date: Sat, 13 Oct 2012 13:52:34 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <201210020750.23358.jhb@freebsd.org> <201210021037.27762.jhb@freebsd.org> <127FA63D-8EEE-4616-AE1E-C39469DDCC6A@xcllnt.net> <1340AB5D-F824-4E7D-9D6C-F7E5489AE870@neville-neil.com> To: George Neville-Neil X-Mailer: Apple Mail (2.1499) Cc: Garrett Cooper , freebsd-hackers@freebsd.org, "Simon J. Gerraty" , freebsd-arch@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 20:52:42 -0000

On Oct 13, 2012, at 12:15 PM, George Neville-Neil wrote:
>
> I think that's a small price to pay for getting going with the ATF
> stuff now rather than in 4 weeks. What's the right way to do this
> now with HEAD?

Set WITH_BMAKE=yes in /etc/src.conf or /etc/make.conf and you're
good to go.

One caveat: manually rebuild and re-install usr.bin/bmake after the
buildworld & installworld with WITH_BMAKE=yes set. The one created as
part of the buildworld has a bug due to being built by FreeBSD's make.
A fix is known and will be committed soon, but until then, the manual
step is needed.

That's it...
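In practice that boils down to something like the sketch below, assuming a
stock head checkout in /usr/src; the exact make invocations can vary with
your setup, so treat it as illustrative rather than as the authoritative
procedure:

    # enable bmake as make(1) for the source tree
    echo 'WITH_BMAKE=yes' >> /etc/src.conf

    # rebuild and install world with bmake enabled
    cd /usr/src
    make buildworld
    make installworld

    # work around the known bug in the bmake produced by buildworld:
    # rebuild and reinstall usr.bin/bmake by hand
    cd /usr/src/usr.bin/bmake
    make clean all install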
-- 
Marcel Moolenaar
marcel@xcllnt.net

From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 20:15:37 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6E715BCA; Sat, 13 Oct 2012 20:15:37 +0000 (UTC) (envelope-from sjg@juniper.net) Received: from exprod7og113.obsmtp.com (exprod7og113.obsmtp.com [64.18.2.179]) by mx1.freebsd.org (Postfix) with ESMTP id D6BBF8FC0C; Sat, 13 Oct 2012 20:15:36 +0000 (UTC) Received: from P-EMHUB03-HQ.jnpr.net ([66.129.224.36]) (using TLSv1) by exprod7ob113.postini.com ([64.18.6.12]) with SMTP ID DSNKUHnL4vuHMYYtf2/iwemqs81nA5iY7LWn@postini.com; Sat, 13 Oct 2012 13:15:37 PDT Received: from magenta.juniper.net (172.17.27.123) by P-EMHUB03-HQ.jnpr.net (172.24.192.33) with Microsoft SMTP Server (TLS) id 8.3.213.0; Sat, 13 Oct 2012 13:13:30 -0700 Received: from chaos.jnpr.net (chaos.jnpr.net [172.24.29.229]) by magenta.juniper.net (8.11.3/8.11.3) with ESMTP id q9DKDUh83856; Sat, 13 Oct 2012 13:13:30 -0700 (PDT) (envelope-from sjg@juniper.net) Received: from chaos.jnpr.net (localhost [127.0.0.1]) by chaos.jnpr.net (Postfix) with ESMTP id 68EBB58094; Sat, 13 Oct 2012 13:13:30 -0700 (PDT) To: George Neville-Neil Subject: Re: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program In-Reply-To: <1340AB5D-F824-4E7D-9D6C-F7E5489AE870@neville-neil.com> References: <201210020750.23358.jhb@freebsd.org> <201210021037.27762.jhb@freebsd.org> <127FA63D-8EEE-4616-AE1E-C39469DDCC6A@xcllnt.net> <1340AB5D-F824-4E7D-9D6C-F7E5489AE870@neville-neil.com> Comments: In-reply-to: George Neville-Neil message dated "Sat, 13 Oct 2012 15:15:59 -0400." From: "Simon J. Gerraty" X-Mailer: MH-E 7.82+cvs; nmh 1.3; GNU Emacs 22.3.1 Date: Sat, 13 Oct 2012 13:13:30 -0700 Message-ID: <20121013201330.68EBB58094@chaos.jnpr.net> MIME-Version: 1.0 Content-Type: text/plain X-Mailman-Approved-At: Sat, 13 Oct 2012 21:33:28 +0000 Cc: Garrett Cooper , freebsd-hackers@freebsd.org, freebsd-arch@freebsd.org, Marcel Moolenaar X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 20:15:37 -0000

On Sat, 13 Oct 2012 15:15:59 -0400, George Neville-Neil writes:
>> It could be a while (many weeks) before we get to 4, so the question
>> really is whether the people working on ATF are willing and able to
>> build and install FreeBSD using WITH_BMAKE?
>>
>
>I think that's a small price to pay for getting going with the ATF
>stuff now rather than in 4 weeks. What's the right way to do this
>now with HEAD?

We can add bsd.progs.mk (if you have the devel/bmake port installed you
have it as /usr/local/share/mk/progs.mk) and atf.test.mk, and people can
just "go for it"?
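As a rough illustration of what that buys you, a directory that builds more
than one program might end up with a Makefile along these lines; the PROGS
and per-program SRCS.<prog> variables are assumed from the progs.mk shipped
with the devel/bmake port, and the program names are made up, so check
progs.mk itself for the real knobs:

    # hypothetical two-program Makefile using progs.mk
    PROGS=           frob unfrob
    SRCS.frob=       frob.c common.c
    SRCS.unfrob=     unfrob.c common.c

    .include <progs.mk>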
From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 13 22:37:11 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 002C62BA; Sat, 13 Oct 2012 22:37:10 +0000 (UTC) (envelope-from yanegomi@gmail.com) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 97AFD8FC0C; Sat, 13 Oct 2012 22:37:10 +0000 (UTC) Received: by mail-ob0-f182.google.com with SMTP id wc20so5176500obb.13 for ; Sat, 13 Oct 2012 15:37:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=fmxPqBHeXKhnIVGRjFlZpQ3NcNpdYVC8PT/Dd0LdV+c=; b=UkxFa9IkbISyPnGpUhh7h1WHJ7BcmQAFw2F7ZWprPCqhMci33zIsuuZohlubX0U2Yu YJevPtd6OBBNDQehq3dG0DXquIYW71up+/krGcAF29xSSM+enLTt14iJ23sttI5GwIro e2pNHQ28v4jrKTkFfAL/nqh2USR74Pbyafl38CXlwKUuGbMLvGrarTXLQK8XEPNfnlEr 5/Ny+8SyzOO9qNWVkaSEEFcojEOSlVLFWem9uuJ7xhI/Z0X+p+SFVad4oJPVElPWRS/e ZPQ/8f0q4jmKtql9knO+YWKSBD08X73eGM7k2flhEd9nUlijA7k+XqRjF+3wIiZey+E0 14uA== MIME-Version: 1.0 Received: by 10.182.31.50 with SMTP id x18mr6317895obh.56.1350167830150; Sat, 13 Oct 2012 15:37:10 -0700 (PDT) Received: by 10.76.167.202 with HTTP; Sat, 13 Oct 2012 15:37:09 -0700 (PDT) In-Reply-To: <20121013201330.68EBB58094@chaos.jnpr.net> References: <201210020750.23358.jhb@freebsd.org> <201210021037.27762.jhb@freebsd.org> <127FA63D-8EEE-4616-AE1E-C39469DDCC6A@xcllnt.net> <1340AB5D-F824-4E7D-9D6C-F7E5489AE870@neville-neil.com> <20121013201330.68EBB58094@chaos.jnpr.net> Date: Sat, 13 Oct 2012 15:37:09 -0700 Message-ID: Subject: Re: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program From: Garrett Cooper To: "Simon J. Gerraty" Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-hackers@freebsd.org, George Neville-Neil , freebsd-arch@freebsd.org, Marcel Moolenaar X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Oct 2012 22:37:11 -0000 On Sat, Oct 13, 2012 at 1:13 PM, Simon J. Gerraty wrote: > > On Sat, 13 Oct 2012 15:15:59 -0400, George Neville-Neil writes: >>> It could be a while (many weeks) before we get to 4, so the question >>> really is whether the people working on ATF are willing and able to >>> build and install FreeBSD using WITH_BMAKE? >> >>I think that's a small price to pay for getting going with the ATF >>stuff now rather than in 4 weeks. What's the right way to do this >>now with HEAD? > > We can add bsd.progs.mk (if you have devel/bmake port installed you > have it as /usr/local/share/mk/progs.mk) > and atf.test.mk and people can just "go for it" ? As long as it can function sanely in a NetBSD-like manner and I can start writing tests, I don't mind... Thanks! -Garrett
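
For anyone who wants to start writing tests right away, a minimal test
program against the atf-c(3) interface looks roughly like the sketch below;
the test case name and the checks are invented for illustration, so see the
atf-c(3) manual from the ATF port for the full API:

    /* minimal ATF test program sketch using the atf-c(3) interface */
    #include <atf-c.h>

    ATF_TC(addition);
    ATF_TC_HEAD(addition, tc)
    {
            atf_tc_set_md_var(tc, "descr", "Checks that integer addition works");
    }
    ATF_TC_BODY(addition, tc)
    {
            ATF_CHECK_EQ(2 + 2, 4);     /* records a failure but continues */
            ATF_REQUIRE(1 + 1 == 2);    /* aborts the test case on failure */
    }

    ATF_TP_ADD_TCS(tp)
    {
            ATF_TP_ADD_TC(tp, addition);
            return atf_no_error();
    }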