From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 06:27:51 2012
Date: Sat, 22 Dec 2012 22:27:49 -0800
From: Adrian Chadd
To: Ian Lepore
Cc: Tim Kientzle, Jason Evans, freebsd-arch@freebsd.org
Subject: Re: jemalloc enhancement for small-memory systems
In-Reply-To: <1356217474.1129.40.camel@revolution.hippie.lan>
References: <1356204505.1129.21.camel@revolution.hippie.lan> <75ECE5AB-9276-44BA-84D7-56EF6BDC3984@kientzle.com> <1356217474.1129.40.camel@revolution.hippie.lan>

.. is (how we currently implement) virtual address space (today) really
that cheap on a machine with 16MB of RAM?

Adrian

From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 17:16:42 2012
Date: Sun, 23 Dec 2012 09:10:19 -0800
From: Jason Evans
To: Tim Kientzle
Cc: Ian Lepore, freebsd-arch@freebsd.org
Subject: Re: jemalloc enhancement for small-memory systems
Message-Id: <2698981A-EA71-41BD-A9B3-FCD130EB3832@freebsd.org>
In-Reply-To: <75ECE5AB-9276-44BA-84D7-56EF6BDC3984@kientzle.com>
References: <1356204505.1129.21.camel@revolution.hippie.lan> <75ECE5AB-9276-44BA-84D7-56EF6BDC3984@kientzle.com>

On Dec 22, 2012, at 2:40 PM, Tim Kientzle wrote:
> Would it be feasible for jemalloc to initially allocate
> small blocks (to not over-allocate for small programs and
> systems with small RAM) and then allocate successively
> larger blocks as the program requires more memory?

All chunks must be the same size in jemalloc, so it's not possible to
increase chunk size over the lifetime of an application.  As Ian said,
chunk size isn't a major factor in physical memory usage unless
mlockall(2) enters the picture.

Jason

From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 17:16:42 2012
Date: Sun, 23 Dec 2012 09:10:14 -0800
From: Jason Evans
To: Ian Lepore
Cc: freebsd-arch@freebsd.org
Subject: Re: jemalloc enhancement for small-memory systems
In-Reply-To: <1356204505.1129.21.camel@revolution.hippie.lan>
References: <1356204505.1129.21.camel@revolution.hippie.lan>

On Dec 22, 2012, at 11:28 AM, Ian Lepore wrote:
> When a daemon such as watchdogd uses mlockall(2) on a small-memory
> embedded system, it can end up wiring much of the available ram because
> jemalloc allocates large chunks of vmspace by default.  More background
> info on this can be found in this thread:
>
> http://lists.freebsd.org/pipermail/freebsd-embedded/2012-November/001679.html
>
> It's hard to tune jemalloc's allocation behavior for this in a
> machine-independent way because the minimum chunk size depends on
> PAGE_SIZE and other factors internal to jemalloc.  I've created a patch
> that addresses this by defining that lg_chunk:0 is implicitly a request
> to set the chunk size to the smallest value allowable for the machine
> it's running on.  The patch is attached to this PR...
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=174641
>
> Jason, could you please review this and consider incorporating it into
> jemalloc?  Or let us know if there's a better way to handle this
> situation.

Your approach looks good to me.  I just checked in a slightly simpler
patch to the upstream jemalloc repository:

http://www.canonware.com/cgi-bin/gitweb.cgi?p=jemalloc.git;a=commitdiff;h=1bf2743e08ba66cc141e296812839947223e4370

The only real difference is that no warnings are generated for lg_chunk
values between 0 and the minimum supported value.

I don't have any short-term plans to update jemalloc in FreeBSD, so feel
free to either commit your patch as is, or update it to merge from the
upstream patch (which will make the next update a bit easier for me).
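
For readers following the thread, the lg_chunk:0 behaviour described
above (a value below the supported minimum is silently raised to that
minimum) can be sketched in a few lines of C.  This is only an
illustration and is not taken from either patch: the helper names
clamp_lg_chunk and chunk_size are hypothetical, and the lower bound of
lg(page size) + 1 is an assumption, since the real minimum depends on
factors internal to jemalloc.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of jemalloc: map a requested lg_chunk
 * value onto a supported range.  The lower bound used here
 * (lg_page + 1) is an assumption made for illustration only.
 */
static size_t
clamp_lg_chunk(size_t requested, size_t lg_page)
{
	size_t lg_min = lg_page + 1;             /* assumed minimum */
	size_t lg_max = sizeof(size_t) * 8 - 1;  /* largest usable shift */

	assert(lg_min <= lg_max);
	if (requested < lg_min)
		return (lg_min);   /* covers lg_chunk:0, no warning issued */
	if (requested > lg_max)
		return (lg_max);
	return (requested);
}

/* Chunk size in bytes for an already clamped lg_chunk value. */
static size_t
chunk_size(size_t lg_chunk)
{
	return ((size_t)1 << lg_chunk);
}
```

With 4 KiB pages (lg_page = 12), clamp_lg_chunk(0, 12) gives 13 under
this assumed minimum, i.e. an 8 KiB chunk, while a request of 22 passes
through unchanged.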

Thanks,
Jason

From owner-freebsd-arch@FreeBSD.ORG Sun Dec 23 20:42:12 2012
Date: Sun, 23 Dec 2012 21:00:03 +0100
From: Eike Dierks
To: freebsd-arch@freebsd.org
Cc: jasone@canonware.com
Subject: Re: jemalloc enhancement for small-memory systems
Message-Id: <9EDB4DF7-6763-407C-BBCE-02915CDE7005@inter.net>
In-Reply-To: <1356204505.1129.21.camel@revolution.hippie.lan>
References: <1356204505.1129.21.camel@revolution.hippie.lan>

Hi Ian,

I'm trying to understand the underlying problem.
Looks like you already investigated that.
Please tell us more about that.

As far as I understand, jemalloc is not that bad at all,
but it seems to conflict with the use of mlockall in some situations.

We should get Jason Evans in the boat to sort this out.

malloc is not an easy task ...

I once had the idea that the VM in FreeBSD was somehow built upon the
Mach VM.  Is this still true today?
How do they cope with this kind of problem in Darwin?

~eike

On Dec 22, 2012, at 20:28 , Ian Lepore wrote:

> When a daemon such as watchdogd uses mlockall(2) on a small-memory
> embedded system, it can end up wiring much of the available ram because
> jemalloc allocates large chunks of vmspace by default.  More background
> info on this can be found in this thread:
>
> http://lists.freebsd.org/pipermail/freebsd-embedded/2012-November/001679.html
>
> It's hard to tune jemalloc's allocation behavior for this in a
> machine-independent way because the minimum chunk size depends on
> PAGE_SIZE and other factors internal to jemalloc.  I've created a patch
> that addresses this by defining that lg_chunk:0 is implicitly a request
> to set the chunk size to the smallest value allowable for the machine
> it's running on.  The patch is attached to this PR...
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=174641
>
> Jason, could you please review this and consider incorporating it into
> jemalloc?  Or let us know if there's a better way to handle this
> situation.
>
> -- Ian

From owner-freebsd-arch@FreeBSD.ORG Tue Dec 25 20:42:30 2012
Date: Tue, 25 Dec 2012 20:42:27 +0000 (GMT)
From: Robert Watson
To: Alan Cox
Cc: alc@freebsd.org, Konstantin Belousov, arch@freebsd.org
Subject: Re: Unmapped I/O
In-Reply-To: <50D22EA6.1040501@rice.edu>
References: <20121219135451.GU71906@kib.kiev.ua> <50D22EA6.1040501@rice.edu>

On Wed, 19 Dec 2012, Alan Cox wrote:

>> Are the machines that don't have a direct map performance critical?  My
>> expectation is that they are legacy or embedded.  This seems like a great
>> project to do when the rest of the pieces are stable and fast.  Until then
>> they could just use something like pbufs?
>
> I think the answer to your first question depends entirely on who you are.
> :-)  Also, at the low-end of the server space, there are many people trying
> to promote arm-based systems.  While FreeBSD may never run on your arm-based
> phone, I think that ceding the arm-based server market to others will be a
> strategic mistake.
>
> Alan
>
> P.S. I think we're moving the discussion too far away from kib's original, so
> I suggest changing the subject line on any follow ups.

Despite moving the discussion a little further away: on MIPS-based
systems, a direct-mapped segment (e.g., kseg, xkphys, etc.) is part of
the underlying design and doesn't rely on any TLB entries at all.  We
run much of the kernel from direct map regions to avoid causing TLB
pressure.

Robert

From owner-freebsd-arch@FreeBSD.ORG Tue Dec 25 20:49:02 2012
Date: Tue, 25 Dec 2012 22:48:54 +0200
From: Konstantin Belousov
To: Robert Watson
Cc: alc@freebsd.org, arch@freebsd.org, Alan Cox
Subject: Re: Unmapped I/O
Message-ID: <20121225204854.GE82219@kib.kiev.ua>
References: <20121219135451.GU71906@kib.kiev.ua> <50D22EA6.1040501@rice.edu>

On Tue, Dec 25, 2012 at 08:42:27PM +0000, Robert Watson wrote:
> On Wed, 19 Dec 2012, Alan Cox wrote:
>
> >> Are the machines that don't have a direct map performance critical?  My
> >> expectation is that they are legacy or embedded.  This seems like a great
> >> project to do when the rest of the pieces are stable and fast.  Until then
> >> they could just use something like pbufs?
> >
> > I think the answer to your first question depends entirely on who you are.
> > :-)  Also, at the low-end of the server space, there are many people trying
> > to promote arm-based systems.  While FreeBSD may never run on your arm-based
> > phone, I think that ceding the arm-based server market to others will be a
> > strategic mistake.
> >
> > Alan
> >
> > P.S. I think we're moving the discussion too far away from kib's original, so
> > I suggest changing the subject line on any follow ups.
>
> Despite moving the discussion a little further away: on MIPS-based
> systems, a direct-mapped segment (e.g., kseg, xkphys, etc.) is part
> of the underlying design and doesn't rely on any TLB entries at all.
> We run much of the kernel from direct map regions to avoid causing TLB
> pressure.

Yes, as was noted already, the 32-bit MIPS kseg is not very usable on
MIPS systems with more than 1GB of RAM.  But another of Alan's patches,
with, I believe, a small modification, could provide the gain there too.

From owner-freebsd-arch@FreeBSD.ORG Tue Dec 25 23:21:34 2012
Date: Wed, 26 Dec 2012 00:21:27 +0100
From: Marius Strobl
To: Alexander Motin
Cc: Davide Italiano, FreeBSD Current, freebsd-arch@freebsd.org
Subject: Re: [RFC/RFT] calloutng
Message-ID: <20121225232126.GA47692@alchemy.franken.de>
In-Reply-To: <50D03173.9080904@FreeBSD.org>
References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> <50D03173.9080904@FreeBSD.org>

On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
> Experiments with dummynet showed ineffective support for very short
> tick-based callouts.  The new version fixes that, allowing as many
> tick-based callout events as the hz value permits, while still being
> able to aggregate events and generate a minimum of interrupts.
>
> Also, this version modifies the system load average calculation to fix
> some cases existing in HEAD and 9 branches that could be fixed with the
> new direct callout functionality.
>
> http://people.freebsd.org/~mav/calloutng_12_17.patch
>
> With several important changes made last time, I am going to delay the
> commit to HEAD for another week to do more testing.  Comments and new
> test cases are welcome.  Thanks for staying tuned and commenting.

FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
try on sparc64 and it at least survives a buildworld there.  However,
with the patched kernels, buildworld times seem to increase slightly but
reproducibly by 1-2% (I only did four runs, but typically buildworld
times are rather stable and don't vary more than a minute for the
same kernel and source here).  Is this an expected trade-off (system
time as such doesn't seem to increase)?

Is there anything specific to test?

Marius

From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 01:19:19 2012
Date: Tue, 25 Dec 2012 18:19:15 -0700
From: Warner Losh
To: Konstantin Belousov
Cc: alc@freebsd.org, arch@freebsd.org, Robert Watson, Alan Cox
Subject: Re: Unmapped I/O
Message-Id: <58BA8294-A610-4474-A8B4-9AF69AD44779@bsdimp.com>
In-Reply-To: <20121225204854.GE82219@kib.kiev.ua>
References: <20121219135451.GU71906@kib.kiev.ua> <50D22EA6.1040501@rice.edu> <20121225204854.GE82219@kib.kiev.ua>

On Dec 25, 2012, at 1:48 PM, Konstantin Belousov wrote:

> On Tue, Dec 25, 2012 at 08:42:27PM +0000, Robert Watson wrote:
>> On Wed, 19 Dec 2012, Alan Cox wrote:
>>
>>>> Are the machines that don't have a direct map performance critical?  My
>>>> expectation is that they are legacy or embedded.  This seems like a great
>>>> project to do when the rest of the pieces are stable and fast.  Until then
>>>> they could just use something like pbufs?
>>>
>>> I think the answer to your first question depends entirely on who you are.
>>> :-)  Also, at the low-end of the server space, there are many people trying
>>> to promote arm-based systems.  While FreeBSD may never run on your arm-based
>>> phone, I think that ceding the arm-based server market to others will be a
>>> strategic mistake.
>>>
>>> Alan
>>>
>>> P.S. I think we're moving the discussion too far away from kib's original, so
>>> I suggest changing the subject line on any follow ups.
>>
>> Despite moving the discussion a little further away: on MIPS-based
>> systems, a direct-mapped segment (e.g., kseg, xkphys, etc.) is part
>> of the underlying design and doesn't rely on any TLB entries at all.
>> We run much of the kernel from direct map regions to avoid causing TLB
>> pressure.
> Yes, as was noted already, the 32-bit MIPS kseg is not very usable on
> MIPS systems with more than 1GB of RAM.  But another of Alan's patches,
> with, I believe, a small modification, could provide the gain there too.

Most mips move to 64-bit when they have more than 512MB or 1GB because
the direct map is just too handy...

Warner

From owner-freebsd-arch@FreeBSD.ORG Wed Dec 26 19:24:57 2012
Message-ID: <50DB4EFE.2020600@FreeBSD.org>
Date: Wed, 26 Dec 2012 21:24:46 +0200
From: Alexander Motin
To: Marius Strobl
Cc: Davide Italiano, FreeBSD Current, freebsd-arch@freebsd.org
Subject: Re: [RFC/RFT] calloutng
In-Reply-To: <20121225232126.GA47692@alchemy.franken.de>
References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> <50D03173.9080904@FreeBSD.org> <20121225232126.GA47692@alchemy.franken.de>

On 26.12.2012 01:21, Marius Strobl wrote:
> On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
>> Experiments with dummynet showed ineffective support for very short
>> tick-based callouts.  The new version fixes that, allowing as many
>> tick-based callout events as the hz value permits, while still being
>> able to aggregate events and generate a minimum of interrupts.
>>
>> Also, this version modifies the system load average calculation to fix
>> some cases existing in HEAD and 9 branches that could be fixed with the
>> new direct callout functionality.
>>
>> http://people.freebsd.org/~mav/calloutng_12_17.patch
>>
>> With several important changes made last time, I am going to delay the
>> commit to HEAD for another week to do more testing.  Comments and new
>> test cases are welcome.  Thanks for staying tuned and commenting.
>
> FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
> try on sparc64 and it at least survives a buildworld there.  However,
> with the patched kernels, buildworld times seem to increase slightly but
> reproducibly by 1-2% (I only did four runs, but typically buildworld
> times are rather stable and don't vary more than a minute for the
> same kernel and source here).  Is this an expected trade-off (system
> time as such doesn't seem to increase)?

I don't think the build process uses a significant enough number of
callouts to affect results directly.  I think this additional time could
be the result of the deeper next-event lookup done by the new code,
which is practically useless on sparc64, since it effectively has no
cpu_idle() routine.  It wouldn't affect system time and wouldn't show up
in any statistics (except PMC or something alike) because it is executed
inside the timer hardware interrupt handler.  If my guess is right, that
is a part that probably could still be optimized.  I'll look into it.
Thanks.

> Is there anything specific to test?

Since most of the code is MI, for sparc64 I would mostly look at the
related MD parts (eventtimers and timecounters) to make sure they work
reliably under more stressful conditions.  I still have some worries
about a possible deadlock on hardware where IPIs are used to fetch the
present time from another CPU.

Here is a small tool we are using to test the correctness and
performance of different user-level APIs:
http://people.freebsd.org/~mav/testsleep.c

-- 
Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 04:00:36 2012
Message-ID: <50DBC7E2.1070505@mu.org>
Date: Wed, 26 Dec 2012 20:00:34 -0800
From: Alfred Perlstein
To: Alfred Perlstein
Cc: "arch@freebsd.org", Adrian Chadd, Rui Paulo
Subject: UPDATE Re: making use of userland dtrace on FreeBSD
In-Reply-To: <50D49DFF.3060803@ixsystems.com>
References: <50D49DFF.3060803@ixsystems.com>

Putting the following into /etc/src.conf builds a world that is userland
dtrace'able:

> /usr/home/alfred # cat /etc/src.conf
> WITH_CTF=1
> CFLAGS+=-fno-omit-frame-pointer
> STRIP=

(yes, STRIP is intentionally left blank)

Then, following the wiki, just do this:

> # kldload dtraceall

Then you can test it by doing this:

> # dtruss -s ls |& head -20

If you get stack traces along with syscalls, you're doing pretty well.

Can any of our build gurus look at this?

What would be the drawbacks?  I don't want to hurt freebsd for heavy
performance, but I think this functionality should work out of the box
for most people.  It allows an expert to diagnose some pretty gnarly
bugs.

My preliminary thoughts are:

  hack gcc to turn off -fomit-frame-pointer unless -O3 or greater is
  specified.
  turn on WITH_CTF in the Mk files or top level Makefiles.
  turn off stripping by default.

What other options are there?  Is there a better way that is expedient?
I really do not think we should wait for code that optimizes this,
unless that code is already sitting on a branch and ready to commit, or
unless there is a strong reason.

Can someone list some strong reasons not to?  Or give encouragement to
go this way?

-Alfred

On 12/21/12 9:35 AM, Alfred Perlstein wrote:
> Hey folks,
>
> We have had userland dtrace for a while now.  However it's not really
> hooked up into the build, nor as far as I can tell are ports nor
> shared libs.
>
> Dtrace can be immensely useful for tracking down hard to find bugs,
> memory leaks, performance problems and a lot more.
>
> What are the thoughts on making this available by default on FreeBSD
> going forward?
>
> What would need to happen?
>
> Supposedly we can do this by just adding
> "CFLAGS=-fno-omit-frame-pointer" and not completely stripping
> installed tools/libraries.
>
> Would it make sense to set this as default for the whole system?  Just
> libs+ports?  Or do people think that the performance gain of
> omit-frame-pointer (which I am unsure of) is worth the loss of
> debug-ability (like a certain arctic bird based OS)?
>
> I have also factored in the size of binaries into this, and I really
> am not sure why it would be a problem other than if we didn't offer an
> "easy button" to make things "small".
>
> Let's figure this out, because it seems to me that we should be
> offering this to our users if possible.
> > -Alfred > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 04:21:56 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 131B9544 for ; Thu, 27 Dec 2012 04:21:56 +0000 (UTC) (envelope-from peter@wemm.org) Received: from mail-vb0-f50.google.com (mail-vb0-f50.google.com [209.85.212.50]) by mx1.freebsd.org (Postfix) with ESMTP id A8A2D8FC0A for ; Thu, 27 Dec 2012 04:21:55 +0000 (UTC) Received: by mail-vb0-f50.google.com with SMTP id fr13so9483058vbb.9 for ; Wed, 26 Dec 2012 20:21:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wemm.org; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=KKLPa+rkzvCfjcHmMGOIJym+4FCcYJoAGbBSDQDV358=; b=Y5dOhUvTKAOqm1qeJlPktt/cOjnDY1icVZLtpbKHU5wLKzvOECp/nXOw5FWyoOSVcG A26i2CAvfH+WHZBjGW01KpU1gi0GyoX0D+oUnPcn+NwPs1fvZlNtF+7L+N9DptN3jDc2 G3ghmpweKqUe31AEjJcIhEhiTriyaccUDZ+l4= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:x-gm-message-state; bh=KKLPa+rkzvCfjcHmMGOIJym+4FCcYJoAGbBSDQDV358=; b=iZhmXGLZGYEafLXB9Qs0d3doCjjhtfFnT2ttuvFTs9Ds4rBpMxeQTXd4KryHrKZfPQ 2ne+hSNFTM1fcrKee3JmlNpuE1xaE/VIwVMJvRaqxZNI6muAxyGxuIceTpydZzLUe3GG 3hqL7V8417AeE6+06hJDbnI2dh8j9dDXjIR4/8VdCIUqL4uyf/WccUAK3aE7qO4fmdx0 8GXXjiH+VJs21iBfxnIJoC419ijTNMr8ZlMqhzUcQEn/+zfNKX6gGEAM6jbEvgRnXl4D 0Jx7W+4lMpt8tfeYs3kpAGMP7QytXJ5W8OShiy3/6uxnRF5ifmudvEkM8Wp8TOqEiv81 slNw== MIME-Version: 1.0 Received: by 10.52.27.138 with SMTP id t10mr40027392vdg.81.1356582114710; Wed, 26 Dec 2012 20:21:54 -0800 (PST) Received: by 10.220.205.6 with HTTP; Wed, 26 Dec 2012 
20:21:54 -0800 (PST) In-Reply-To: <50DBC7E2.1070505@mu.org> References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> Date: Wed, 26 Dec 2012 20:21:54 -0800 Message-ID: Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD From: Peter Wemm To: Alfred Perlstein Content-Type: text/plain; charset=ISO-8859-1 X-Gm-Message-State: ALoCoQkyiGbMAcELkXtnYeP9rRGsA2r0+15ELlimoEalWxOEBy8TXpg3xqHyvDUsFGYzHtP1+5CN Cc: "arch@freebsd.org" , Adrian Chadd , Rui Paulo , Alfred Perlstein X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 04:21:56 -0000

On Wed, Dec 26, 2012 at 8:00 PM, Alfred Perlstein wrote:
> What would be the drawbacks? I don't want to hurt freebsd for heavy
> performance, but I think this functionality should work out of the box for
> most people.

The drawbacks are mostly performance related. It defeats certain hardware optimizations for call/return on leaf functions. It'll mostly affect things like math, crypto, compression and multimedia libraries (that's ffmpeg, bzip2/gzip/libarchive, openssl, etc.) but we generally don't seem to care about that sort of performance anyway, so what's one more loss?

Of course it wouldn't be required with dwarf unwinding awareness, but we don't have that.

We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging is compiled in because there's no unwinder for doing stack traces. We need a dwarf2+ unwinder and somebody to instrument the call frame state through the remaining assembler code.

-- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution."
-- Robert Sewell bitcoin:188ZjyYLFJiEheQZw4UtU27e2FMLmuRBUE From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 04:41:56 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7B5F5A18; Thu, 27 Dec 2012 04:41:56 +0000 (UTC) (envelope-from bright@mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 593408FC08; Thu, 27 Dec 2012 04:41:56 +0000 (UTC) Received: from Alfreds-MacBook-Pro-9.local (c-67-180-208-218.hsd1.ca.comcast.net [67.180.208.218]) by elvis.mu.org (Postfix) with ESMTPSA id BCC8C1A3C1A; Wed, 26 Dec 2012 20:41:55 -0800 (PST) Message-ID: <50DBD193.7080505@mu.org> Date: Wed, 26 Dec 2012 20:41:55 -0800 From: Alfred Perlstein User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Peter Wemm Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "arch@freebsd.org" , Adrian Chadd , Rui Paulo , Alfred Perlstein X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 04:41:56 -0000 On 12/26/12 8:21 PM, Peter Wemm wrote: > On Wed, Dec 26, 2012 at 8:00 PM, Alfred Perlstein wrote: > >> What would be the drawbacks? I don't want to hurt freebsd for heavy >> performance, but I think this functionality should work out of the box for >> most people. > The drawbacks are mostly performance related. It defeats a certain > hardware optimizations for call/return on leaf functions. 
It'll > mostly affect things like math, crypto, compression and multimedia > libraries (that's ffmpeg, bzip2/gzip/libarchive, openssl, etc) but, we > generally don't seem to care about that sort of performance anyway, so > what's one more loss? Can you clarify some? If it was somewhat easy to re-add -fomit-frame-pointer to critical libraries like this, then that would be OK? To be honest, I'm not sure if you're serious about "generally don't seem to care" or just feel defeated on the issue and we should care. What do you think? > > Of course it wouldn't be required with dwarf unwinding awareness, but > we don't have that. > > We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging > is compiled in because there's no unwinder for doing stack traces. We > need a dwarf2+ unwinder and somebody to instrument the call frame > state through the remaining assembler code. > How much work is that exactly? I've only been a gdb user, not a hacker. What is the right call here? Is the increased functionality worth the performance hit until we get the dwarf2+ unwinder? Or not worth it until we get the unwinder? There's no numbers, but does anyone care to provide some, or give me some things to test to bring numbers back? 
-Alfred From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 05:32:47 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5D769DB8 for ; Thu, 27 Dec 2012 05:32:47 +0000 (UTC) (envelope-from peter@wemm.org) Received: from mail-vc0-f172.google.com (mail-vc0-f172.google.com [209.85.220.172]) by mx1.freebsd.org (Postfix) with ESMTP id 02B0B8FC12 for ; Thu, 27 Dec 2012 05:32:46 +0000 (UTC) Received: by mail-vc0-f172.google.com with SMTP id fw7so9466280vcb.17 for ; Wed, 26 Dec 2012 21:32:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wemm.org; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=Kwk7KiguamN48NhQdvO3KPUCKMMCYoldMI7rKhP2cHM=; b=qXZ9d/7WZmaeHn0oVCrEuHqHlsIMobQfHwVZonkLDWsHkKwa8mbL+1pXmxK43GGUCY JpkIG7R/NRAMkmOkcKVL1HTrFhALQGvh4KpZb7y9Z8wjB7oI4UsCr8yZVmjNDly1aMgC lrm/v/QPsa4yvEuSeo4WmxGqiWteigBzGf9IU= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:x-gm-message-state; bh=Kwk7KiguamN48NhQdvO3KPUCKMMCYoldMI7rKhP2cHM=; b=JunDumg+Dt+yvKGVu6xBcj69mhHyP6hyjDtJ3hh3L+eiEOOpJbDMOttMsnvH9AUrCh EIeDhZCypqifYKEoUJ+8P9EfdMMmI3zvw7i1d+bKAPkprjpJCys505SU8r/wcBeC/naq 5ZqyA74Cv6y0zW6Hru/XMIqmGOVKKcMNatFEao7V9/fVfDuikrBp+7c/Dg1QrWrNFVdK IzLnPg2diuIqJBJ1aQsk5TCxW3roEJ4RVLGuFweGYT+0zdN/KJ5C61KOXIOSAKvTEsb7 a3keygN2XBOp90riStSl0Dt09gYXgK1Dz3j2UqdBi1kp0z0hlFGqvQpooB5v7FHfxOhv fo8A== MIME-Version: 1.0 Received: by 10.52.69.201 with SMTP id g9mr38718166vdu.98.1356586366021; Wed, 26 Dec 2012 21:32:46 -0800 (PST) Received: by 10.220.205.6 with HTTP; Wed, 26 Dec 2012 21:32:45 -0800 (PST) In-Reply-To: <50DBD193.7080505@mu.org> References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <50DBD193.7080505@mu.org> Date: Wed, 26 Dec 2012 21:32:45 -0800 Message-ID: Subject: Re: UPDATE 
Re: making use of userland dtrace on FreeBSD From: Peter Wemm To: Alfred Perlstein Content-Type: text/plain; charset=ISO-8859-1 X-Gm-Message-State: ALoCoQl0VuKfrYS2D88Z5ZN/z0xch/cGbCG70yfOpbHWjtS8PZSkKz1lWEqJD6+8fLUqK0GYd8ZO Cc: "arch@freebsd.org" , Adrian Chadd , Rui Paulo , Alfred Perlstein X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 05:32:47 -0000 On Wed, Dec 26, 2012 at 8:41 PM, Alfred Perlstein wrote: > On 12/26/12 8:21 PM, Peter Wemm wrote: >> >> On Wed, Dec 26, 2012 at 8:00 PM, Alfred Perlstein wrote: >> >>> What would be the drawbacks? I don't want to hurt freebsd for heavy >>> performance, but I think this functionality should work out of the box >>> for >>> most people. >> >> The drawbacks are mostly performance related. It defeats a certain >> hardware optimizations for call/return on leaf functions. It'll >> mostly affect things like math, crypto, compression and multimedia >> libraries (that's ffmpeg, bzip2/gzip/libarchive, openssl, etc) but, we >> generally don't seem to care about that sort of performance anyway, so >> what's one more loss? > > > Can you clarify some? If it was somewhat easy to re-add > -fomit-frame-pointer to critical libraries like this, then that would be OK? No, you can't add MD flags like this. The way to do it is see things like PIC, WARNS, etc where you can do overrides of defaults on a directory basis, and respect the system-wide user overrides. Remember, -fno-omit-frame-pointer is the default on i386 (except at high -O levels with gcc, I dont know where clang, the default compiler, draws the line). Other platforms don't even have frame pointers. You can't just scatter that switch around the place. 
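As a rough sketch of the per-directory override described above, modelled on how WARNS is handled: the knob name FRAME_POINTER is hypothetical (no such variable exists in the tree), and putting the fragment in a shared mk file is likewise an assumption made only for illustration.

```make
# Hypothetical shared fragment, e.g. pulled in from a file under share/mk.
# FRAME_POINTER is an invented knob name, used here only for illustration.

# Default: keep frame pointers so stack walks work without an unwinder.
# "?=" leaves alone any value already set by the user or by a Makefile.
FRAME_POINTER?=	yes

# The flag is machine dependent, so only apply it where it is meaningful.
.if ${MACHINE_CPUARCH} == "i386" || ${MACHINE_CPUARCH} == "amd64"
.if ${FRAME_POINTER} == "yes"
CFLAGS+=	-fno-omit-frame-pointer
.else
CFLAGS+=	-fomit-frame-pointer
.endif
.endif
```

Under this sketch a hot library would opt out by putting "FRAME_POINTER?= no" in its own Makefile before including bsd.lib.mk, the same way directories set WARNS?=. Because that is also a ?= assignment, a value given on the make command line or set earlier in make.conf would still take precedence.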
> To be honest, I'm not sure if you're serious about "generally don't seem to
> care" or just feel defeated on the issue and we should care.

We took quite a performance beating because of not using the tuned-by-perl assembler code in openssl on amd64, for example. This flows through to benchmarks on things like apache throughput with mod_ssl. Or throughput on stunnel(1).

My drive-by comment about not seeming to care any more is that people (except for Bruce) generally don't actually measure the performance impact of their changes any more. The last time this was widespread was when Kris Kennaway used to be constantly abusing machines and reporting the effects as measured by ministat(1).

If somebody were to say "this change makes world take 15% longer to compile but has no meaningful effect on things like bzip2, openssl throughput etc" and posted the actual ministat output to back it up, then there wouldn't even be a question on performance at all. It'd only be "is 15% more build time worth ubiquitous dtrace?" And that's a far easier thing to answer.

A hand-wave leads to bikesheds. Actual numbers are bikeshed repellent.

I myself have killed patches that turned out to be premature optimizations because they actually didn't make any difference. For example, I never committed the lazy TLB shootdown to AMD64 because it made things slower on the hardware of the day - opteron silicon had *hardware* address space tags on their TLB and the lazy shootdown code just added more synchronization work that just added overhead, e.g. buildworld was around 2% slower with the patches.

Another example was the mtxpool code that caused cache line thrashing. If we cared about performance that would never have gone in. Sure, it compiled and worked, but the costs weren't quantified till much later and we realized how much trouble they were beyond a certain usage level.

What's 2%? It multiplies out.. 2% here, 1% there.. 3% over there, 0.5% somewhere else..
before you know it, there's a pretty big overall hit. >> Of course it wouldn't be required with dwarf unwinding awareness, but >> we don't have that. >> >> We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging >> is compiled in because there's no unwinder for doing stack traces. We >> need a dwarf2+ unwinder and somebody to instrument the call frame >> state through the remaining assembler code. >> > How much work is that exactly? I've only been a gdb user, not a hacker. gdb has a stack unwinder. kdb/ddb/stack(9) do not. There's well established GPL code to do it, as well as libunwind and variants. Basically what this code has to do is run the dwarf2+ state machine to find all the call/return frames instead of assuming the compiler did it. Heck, even glibc has a dwarf2 unwinder built into it as part of their exception processing system. I'm not entirely sure what more work src/lib/libelf and src/lib/libdwarf need. It looks like its got just enough implemented to support the ctfconvert etc and doesn't have an unwinder in it. -- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." 
-- Robert Sewell bitcoin:188ZjyYLFJiEheQZw4UtU27e2FMLmuRBUE
From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 05:43:02 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D4C71E9F; Thu, 27 Dec 2012 05:43:02 +0000 (UTC) (envelope-from masked@internode.on.net) Received: from ipmail06.adl6.internode.on.net (ipmail06.adl6.internode.on.net [IPv6:2001:44b8:8060:ff02:300:1:6:6]) by mx1.freebsd.org (Postfix) with ESMTP id 157048FC12; Thu, 27 Dec 2012 05:43:00 +0000 (UTC) Received: from ppp221-140.static.internode.on.net (HELO forexamplePC) ([150.101.221.140]) by ipmail06.adl6.internode.on.net with SMTP; 27 Dec 2012 16:12:46 +1030 Message-ID: From: "Michael Vale" To: , , Subject: Cross Compiling of ports Makefiles. Date: Thu, 27 Dec 2012 16:42:48 +1100 MIME-Version: 1.0 X-Priority: 3 X-MSMail-Priority: Normal Importance: Normal X-Mailer: Microsoft Windows Live Mail 16.4.3505.912 X-MimeOLE: Produced By Microsoft MimeOLE V16.4.3505.912 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 05:43:02 -0000

Hi,

For those of you who are aware, I’ve been implementing a complete cross-compiling series of functions for ports Makefiles.

I had a good 3+ week break since my last email with a patch to show, and I’ve totally rewritten it and have started from scratch, not including any of Ray’s Zrouter code either.

While it’s still a work in progress, I have outlined the entire system to produce target installs into the same staging directory as a BSD system ready to be flashed onto NAND for embedded, complete with pkg registry and ldconfig; everything has been thought of. The reason I have chosen this method for the ports to be installed into a tree is so they can be compiled after build/install kernel/world and be combined into one firmware image seamlessly. Some ports won’t just be optional applications for future embedded firmware images, they’ll be an integral part of it. The goal here is to be able to build complete firmware images in one fell swoop. Perhaps beyond the scope of most of you out there, but I may wish to pick and choose, exclude required parts of the BSD system and replace them with the busybox port, and replace libc with Google’s Bionic, uClibc or even musl. This cannot currently be achieved with the likes of tinderbox and poudriere. It will still be possible to build packages though.

Due to the nature of cross building, first I’ll lay out the options and then tell you which one I am implementing first, as there are reasons for having different build environments/toolchains.

OK, firstly I was going to give you all the detail of all possible cross-compiling scenarios as I outline them, but I’ll have you know it’s much of a muchness: there are pros and cons to each and every different step. The one I’m about to put to you now is the most feature complete and quickest to implement. That doesn’t mean building without a DESTDIR JAIL in the future and just using the build system and its tools without a new toolchain doesn’t make sense (sometimes it does!), nor that I’m not going to do it, or that I’m not going to do a full ‘Canadian Cross’.

Ultimately, as a goal, the minimal command to invoke cross compilation is TARGET(_ARCH)=${ARCH} make.

This could go on for hours, so after just deleting two extra paragraphs, I’m going to summarise.

First we check for CLANG (as the x-compiler) or whether we need to install xdev (bsd make of gcc compiled for target arch).

(OK, so some of this won’t be in Makefile order (upside down and back to front), but I’m just spitting it out as it comes.)

If GNU configure is used, it is usually pretty good at detecting the compiler’s executable path from the TARGET triple alone; for the worst-case scenario, also put ${CC}’s path at the beginning of the global env ${PATH} to override any subsequent one.

pre-chroot: is mostly used to declare global env variables to keep the build from failing and to make sure the install will complete.

do-chroot: we first have to install any BUILD_DEPENDS. Remember these can be libraries too, and they have to be built with the build machine’s usual stuff and installed in their usual place (lucky we are using a CHROOTED JAIL here! We could easily make a mess otherwise), remembering that sometimes a depend can be both a BUILD dep AND a RUN dep of the TARGET. That’s okay; they should always be declared correctly, and we should never have to cross-compile a BUILD depend. However, a BUILD depend can be built twice: once for the build system, and again for the TARGET as a RUN depend.

The beauty of doing this work is we can now treat the lib and run depends more suitably. During this process we can strip the libs, exclude the headers and change the directory structure: one, to save on inodes, and second, because pkg register, libtool and ld require that the files are installed into the root tree correctly in order for them to build valid databases and register them. Now, the BUILD/HOST system has already had its tail cut off by DESTDIR.
Now there are plenty of ways we can install everything into a valid sub-directory and have DESTDIR still considered ROOT, so that PREFIX or LOCALDIR doesn’t have some obscure prepended directory that doesn’t exist in the CROSS_STAGING_ROOT. Some ways include adding a variable in bsd.lib.mk and in every single one of make’s install targets between ${DESTDIR} and ${LOCALBASE} or ${PREFIX}. We could also include if statements for cross; this would leave it at that and we could go ahead and simply install into a sub-directory before pkg, ldconfig and firmware image packing occur. But I’d rather keep all cross-building in bsd.cross.mk and include it in bsd.port.mk, and instead, within DESTDIR do-chroot:, re-define ${DESTDIR} as ${_bldroot}${DESTDIR} and have all TARGET_LIBS, RUN_DEPENDS and TARGET install in a CHROOTED=no chroot.

Doing the same thing could also prevent the need for a DESTDIR JAIL install at all and just use the real build machine’s build env, rather than a jail. Regardless, we still have to install these targets, and their DESTDIR is skewed. There are a few options.

One is to have a MAKEOBJDIRPREFIX-like option, and redefine every target’s DESTDIR as ${makeobjDESTDIR} before running do-install. Now, I’ve yet to complete this stage, but I believe this is the way to do it.

There are other options but they aren’t as elegant/will make baby jesus cry.

Now the install of these targets won’t require a chroot. A chroot could be done, and that would be okay for one port, but if there is already a cross compiled system in there ready for flashing to disk, there’s no way to chroot without moving files temporarily from the existing target system and copying or building programs like /bin/sh that will execute on the build machine and allow chroot to run.
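
The DESTDIR re-definition described above could look roughly like the following. This is only a sketch under assumptions: bsd.cross.mk, CHROOTED and _bldroot are the names used in this mail, while the default path given to _bldroot and the exact test on TARGET_ARCH are invented for illustration, and ARCH is assumed to have been set already by bsd.port.mk.

```make
# bsd.cross.mk (sketch only; would be included from bsd.port.mk)

# Only act when a target architecture different from the build host's
# was requested, e.g. "TARGET_ARCH=mips make".
.if defined(TARGET_ARCH) && ${TARGET_ARCH} != "${ARCH}"

# Root of the cross staging tree.  This default path is an assumption.
_bldroot?=	/usr/obj/cross-stage/${TARGET_ARCH}

# Make sure DESTDIR is defined (possibly empty) so the ":=" below
# expands it instead of leaving a self-reference behind.
DESTDIR?=

# Outside the chroot, re-root installs: everything that honours DESTDIR
# lands under ${_bldroot}, while PREFIX and LOCALBASE stay unchanged.
.if !defined(CHROOTED) || ${CHROOTED} == "no"
DESTDIR:=	${_bldroot}${DESTDIR}
.endif

.endif
```

Keeping the logic in one included file, as proposed, means no install target in bsd.lib.mk has to learn about an extra path component; the cost is that every port must honour DESTDIR consistently.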

We can patch/sed PLIST files for pkg register to work, patch/sed/edit ldconfig’s db, and some other steps. But I don’t like that idea.

That’s why I’m opting for the other option, and that is to create some INSTALL_DEPENDS or CROSS_COMPILING_CHROOT_INSTALL_DEPENDS, if you will: just /bin/sh and another few.

build TARGET port in jailed DESTDIR/CHROOTED=yes.

This is achieved by installing all build dependencies first...

Sorry, I’m too tired to continue on any further!

I wanted to wait until the initial plan works, shoot an email off, then get into the good stuff. But it’s taking me longer than I thought even just to describe all the processes.

I didn’t want to submit half-baked Makefiles that don’t work, but I can only write about half of one anyway haha!

Anyway, I’m going to spend some time working on them in the next few days, so please expect an update.

From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 05:47:09 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 74FA21EA; Thu, 27 Dec 2012 05:47:09 +0000 (UTC) (envelope-from alfred@ixsystems.com) Received: from mail.iXsystems.com (newknight.ixsystems.com [206.40.55.70]) by mx1.freebsd.org (Postfix) with ESMTP id 4A4E08FC08; Thu, 27 Dec 2012 05:47:08 +0000 (UTC) Received: from localhost (mail.ixsystems.com [10.2.55.1]) by mail.iXsystems.com (Postfix) with ESMTP id B592860ECA; Wed, 26 Dec 2012 21:47:08 -0800 (PST) Received: from mail.iXsystems.com ([10.2.55.1]) by localhost (mail.ixsystems.com [10.2.55.1]) (maiad, port 10024) with ESMTP id 43528-04; Wed, 26 Dec 2012 21:47:08 -0800 (PST) Received: from Alfreds-MacBook-Pro-9.local (unknown [10.8.0.26]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mail.iXsystems.com (Postfix) with ESMTPSA id 157A160EC7; Wed, 26 Dec 2012
21:47:08 -0800 (PST) Message-ID: <50DBE0DB.6090804@ixsystems.com> Date: Wed, 26 Dec 2012 21:47:07 -0800 From: Alfred Perlstein User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Peter Wemm Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <50DBD193.7080505@mu.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "arch@freebsd.org" , Adrian Chadd , Alfred Perlstein , Rui Paulo X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 05:47:09 -0000 On 12/26/12 9:32 PM, Peter Wemm wrote: > On Wed, Dec 26, 2012 at 8:41 PM, Alfred Perlstein wrote: >> On 12/26/12 8:21 PM, Peter Wemm wrote: >>> On Wed, Dec 26, 2012 at 8:00 PM, Alfred Perlstein wrote: >>> >>>> What would be the drawbacks? I don't want to hurt freebsd for heavy >>>> performance, but I think this functionality should work out of the box >>>> for >>>> most people. >>> The drawbacks are mostly performance related. It defeats a certain >>> hardware optimizations for call/return on leaf functions. It'll >>> mostly affect things like math, crypto, compression and multimedia >>> libraries (that's ffmpeg, bzip2/gzip/libarchive, openssl, etc) but, we >>> generally don't seem to care about that sort of performance anyway, so >>> what's one more loss? >> >> Can you clarify some? If it was somewhat easy to re-add >> -fomit-frame-pointer to critical libraries like this, then that would be OK? > No, you can't add MD flags like this. The way to do it is see things > like PIC, WARNS, etc where you can do overrides of defaults on a > directory basis, and respect the system-wide user overrides. 
> > Remember, -fno-omit-frame-pointer is the default on i386 (except at > high -O levels with gcc, I dont know where clang, the default > compiler, draws the line). Other platforms don't even have frame > pointers. You can't just scatter that switch around the place. Agreed! It seems that -fno-omit-frame-pointer documentation is a bit strange, the manual page indicates: > -O also turns on -fomit-frame-pointer on machines where > doing so > does not interfere with debugging. Then goes on to specify that under the actual option that it's turned on under -O, -O2, -O3, etc. > >> To be honest, I'm not sure if you're serious about "generally don't seem to >> care" or just feel defeated on the issue and we should care. > We took quite a performance beating because of not using the > tuned-by-perl assembler code in openssl on amd64, for example. This > flows through to benchmarks on things like apache throughput with > mod_ssl. Or throughput on stunnel(1). I don't recall if I was involved in that discussion, but that is troubling. > > My drive-by comment about not seeming to care any more is that people > (except for Bruce) generally don't actually measure the performance > impact of their changes any more. The last time this was widespread > was when Kris Kennaway used to be constantly abusing machines and > reporting the effects as measured by ministat(1). > > If somebody were to say "this change makes world take 15% longer to > compile but makes no meaningful affect on things like bzip2, openssl > throughput etc" and posted the actual ministat output to back it up > then there wouldn't even be a question on performance at all. It'd > only be "is 15% more build time worth ubiquitous dtrace?" And thats a > far easier thing to answer. > > A hand-wave leads to bikesheds. Actual numbers are bikeshed repellant. > > I myself have killed patches that turned out to be premature > optimizations because it actually didn't make any difference. 
For > example, I never committed the lazy tlb shootdown to AMD64 because it > made things slower on the hardware of the day - opteron silicon had > *hardware* address space tags on their TLB and the lazy shootdown code > just added more synchronization work that just added overhead.. eg: > buildworld was around 2% slower with the patches. > > Another example was the mtxpool code that caused cache line thrashing. > If we cared about performance that would never have gone in. Sure, it > compiled and worked, but the costs weren't quantified till much later > and we realized how much trouble they were beyond a certain usage > level. > > What's 2%? It multiplies out.. 2% here, 1% there.. 3% over there, > 0.5% somewhere else.. before you know it, there's a pretty big overall > hit. I see, well I will run some numbers and report back. > >>> Of course it wouldn't be required with dwarf unwinding awareness, but >>> we don't have that. >>> >>> We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging >>> is compiled in because there's no unwinder for doing stack traces. We >>> need a dwarf2+ unwinder and somebody to instrument the call frame >>> state through the remaining assembler code. >>> >> How much work is that exactly? I've only been a gdb user, not a hacker. > gdb has a stack unwinder. kdb/ddb/stack(9) do not. There's well > established GPL code to do it, as well as libunwind and variants. > Basically what this code has to do is run the dwarf2+ state machine to > find all the call/return frames instead of assuming the compiler did > it. Heck, even glibc has a dwarf2 unwinder built into it as part of > their exception processing system. > > I'm not entirely sure what more work src/lib/libelf and > src/lib/libdwarf need. It looks like its got just enough implemented > to support the ctfconvert etc and doesn't have an unwinder in it. > This really seems beyond my skill level / time allotment. 
Let's see where the numbers put us in terms of system performance and then we can make a call on it. I'd rather take a few % of perf for the power of dtrace, but not if that % is double digits. -Alfred From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 11:10:17 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CC1D46F6; Thu, 27 Dec 2012 11:10:17 +0000 (UTC) (envelope-from dim@FreeBSD.org) Received: from tensor.andric.com (cl-327.ede-01.nl.sixxs.net [IPv6:2001:7b8:2ff:146::2]) by mx1.freebsd.org (Postfix) with ESMTP id 834228FC08; Thu, 27 Dec 2012 11:10:17 +0000 (UTC) Received: from [IPv6:2001:7b8:3a7:0:84c:b2ef:124a:7931] (unknown [IPv6:2001:7b8:3a7:0:84c:b2ef:124a:7931]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by tensor.andric.com (Postfix) with ESMTPSA id 01ED65C5A; Thu, 27 Dec 2012 12:10:15 +0100 (CET) Message-ID: <50DC2C9B.4030802@FreeBSD.org> Date: Thu, 27 Dec 2012 12:10:19 +0100 From: Dimitry Andric Organization: The FreeBSD Project User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20121128 Thunderbird/18.0 MIME-Version: 1.0 To: Peter Wemm Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <50DBD193.7080505@mu.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "arch@freebsd.org" , Adrian Chadd , Alfred Perlstein , Rui Paulo , Alfred Perlstein X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 11:10:17 -0000 On 2012-12-27 06:32, Peter Wemm wrote: > On Wed, Dec 26, 2012 at 8:41 PM, Alfred Perlstein wrote: ... >> Can you clarify some? 
If it was somewhat easy to re-add
>> -fomit-frame-pointer to critical libraries like this, then that would be OK?
>
> No, you can't add MD flags like this. The way to do it is see things
> like PIC, WARNS, etc where you can do overrides of defaults on a
> directory basis, and respect the system-wide user overrides.
>
> Remember, -fno-omit-frame-pointer is the default on i386 (except at
> high -O levels with gcc, I dont know where clang, the default
> compiler, draws the line). Other platforms don't even have frame
> pointers. You can't just scatter that switch around the place.

Just for reference:

- gcc versions < 4.6 always use -fno-omit-frame-pointer for i386, and enable -fomit-frame-pointer for amd64, when optimization is enabled (-O1 or higher).
- gcc versions >= 4.6 enable -fomit-frame-pointer for both i386 and amd64, when optimization is enabled (-O1 or higher).
- clang enables -fomit-frame-pointer only when explicitly specified.

I will submit a patch to upstream to make it mimic the behaviour of our gcc in base, i.e. enable -fomit-frame-pointer only for amd64, when optimization is enabled (-O1 or higher).
From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 12:39:59 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D4E63312; Thu, 27 Dec 2012 12:39:59 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx07.syd.optusnet.com.au (fallbackmx07.syd.optusnet.com.au [211.29.132.9]) by mx1.freebsd.org (Postfix) with ESMTP id 8A92B8FC0A; Thu, 27 Dec 2012 12:39:58 +0000 (UTC) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by fallbackmx07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBRCduo0007764; Thu, 27 Dec 2012 23:39:56 +1100 Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBRCdivE000463 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 27 Dec 2012 23:39:46 +1100 Date: Thu, 27 Dec 2012 23:39:44 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Alfred Perlstein Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD In-Reply-To: <50DBE0DB.6090804@ixsystems.com> Message-ID: <20121227214354.V965@besplex.bde.org> References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <50DBD193.7080505@mu.org> <50DBE0DB.6090804@ixsystems.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=FoRdZBXq c=1 sm=1 a=EG0SoA9ZrYwA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=BXM4HPcYP8wA:10 a=ZmIiC_jQAAAA:8 a=MZcKraZQlfwKwULZjgEA:9 a=CjuIK1q_8ugA:10 a=PXSvYFfLtboA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: "arch@freebsd.org" , Adrian Chadd , Alfred Perlstein , Rui Paulo X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 12:39:59 -0000

On Wed, 26 Dec 2012, Alfred Perlstein wrote:

> On 12/26/12 9:32 PM, Peter Wemm wrote:
>> On Wed, Dec 26, 2012 at 8:41 PM, Alfred Perlstein wrote:
>>> On 12/26/12 8:21 PM, Peter Wemm wrote:
>>>> On Wed, Dec 26, 2012 at 8:00 PM, Alfred Perlstein wrote:
>>>>
>>>>> What would be the drawbacks? I don't want to hurt freebsd for heavy
>>>>> performance, but I think this functionality should work out of the box
>>>>> for
>>>>> most people.

It might cost as much as 0.1% performance (more on pre-PentiumPro x86). Frame pointer use is very parallelizable so it is often free.

>>>> The drawbacks are mostly performance related. It defeats a certain
>>>> hardware optimizations for call/return on leaf functions. It'll
>>>> mostly affect things like math, crypto, compression and multimedia
>>>> libraries (that's ffmpeg, bzip2/gzip/libarchive, openssl, etc) but, we
>>>> generally don't seem to care about that sort of performance anyway, so
>>>> what's one more loss?
>>>
>>> Can you clarify some? If it was somewhat easy to re-add
>>> -fomit-frame-pointer to critical libraries like this, then that would be
>>> OK?
>> No, you can't add MD flags like this. The way to do it is see things
>> like PIC, WARNS, etc where you can do overrides of defaults on a
>> directory basis, and respect the system-wide user overrides.
>>
>> Remember, -fno-omit-frame-pointer is the default on i386 (except at
>> high -O levels with gcc,

Not except at high -O levels. gcc -O666 doesn't omit the frame pointer even for leaf functions.

>> I dont know where clang, the default
>> compiler, draws the line). Other platforms don't even have frame
>> pointers. You can't just scatter that switch around the place.

clang is very incompatible. It omits the frame pointer even for non-leaf functions, even for -O1.
For example, it does tail-call optimization for 'void bar(void); void foo(void) { bar(); }' and reduces this to a single jmp instruction, while gcc generates a call to bar plus a return, plus 3 instructions for frame pointer initialization and finalization, plus 1 instruction for its stack alignment pessimization. This might explain why debugging is even more broken with clang than with gdb.

Here is a slightly larger program to test the optimization.

% volatile int b;
% volatile int f;
%
% void
% bar(void)
% {
% 	b++;
% }
%
% void
% foo(void)
% {
% 	f++;
% 	bar();
% }
%
% int
% main(void)
% {
% 	int i;
%
% 	for (i = 0; i < 100000000; i++)
% 		foo();
% 	return (0);
% }

I had to put the volatiles in to prevent the function calls being optimized away (gcc broke its promise not to optimize away loops like this in gcc-4).

This program takes 0.34 seconds on freefall with clang -O and 0.65 seconds with gcc -O. But the faster speed with clang has nothing to do with -fomit-frame-pointer. It is because clang inlines everything, so that the main loop does just 'f++; b++;'. clang produces this loop even with -g, so debugging is completely broken with clang (breakpoints in the functions don't work). Debugging works correctly with gcc.

Profiling is even more broken than debugging with clang. clang generates calls to .mcount in the places where it inlines functions, but this cannot work in FreeBSD (and in fact just wastes time to make a mess), since in FreeBSD .mcount is optimized to not take an explicit arg identifying the caller, so for the above it always identifies the wrong caller for foo().

-finstrument-functions seems to be less broken, but doesn't work for either clang or gcc. Both clang and gcc generate calls to __cyg_profile_func_enter/exit() for both actual functions and for inlined functions.
__cyg_profile_func_enter() corresponds to .mcount, and __cyg_profile_func_exit() corresponds to FreeBSD's (my) .mexitcount feature (.mexitcount is broken (null) in gcc-4.2 and broken (nonexistent) in clang), except the __cyg* functions are pessimized to take 2 explicit args identifying the caller, so they can work; they don't actually work since they are nonexistent in FreeBSD (except in x86 kernels in old versions, where they were used transiently to work around breakage of .mexitcount).

After working around these bugs by putting the functions in separate files (and removing the now-unneeded volatiles):

main.c:
% void foo(void);
%
% int
% main(void)
% {
% 	int i;
%
% 	for (i = 0; i < 100000000; i++)
% 		foo();
% }

foo.c:
% void bar(void);
%
% void
% foo(void)
% {
% 	bar();
% }

bar.c:
% void
% bar(void)
% {
% }

we can see how much the frame pointer optimization is saving: this now takes 0.43 seconds with clang and 0.87 seconds with gcc. It is weird that the gcc time increased from 0.65 seconds to 0.87 despite doing less. After adding back the volatiles, the times are 0.43 seconds with clang and 0.85 seconds with gcc -- doing more gave a small optimization, but didn't recover 0.65 seconds. There is apparently some magic alignment or misalignment which costs or saves about the same as omitting the frame pointer. Finally, with gcc -O -fomit-frame-pointer, the program takes 0.60 seconds, and with gcc -O2 -fomit-frame-pointer, it takes 0.49 seconds, and with gcc -O2, it takes 0.49 seconds (this really doesn't omit frame pointers, so omitting the frame pointer saves nothing). With cc -O -fno-omit-frame-pointer, it takes 0.43 seconds, but this case is just broken -- the -fno-omit-frame-pointer is silently ignored :-(.

% > Agreed! It seems that -fno-omit-frame-pointer documentation is a bit
% > strange, the manual page indicates:
% >> -O also turns on -fomit-frame-pointer on machines where doing so
% >> does not interfere with debugging.
% > Then goes on to specify that under the actual option that it's turned on % > under -O, -O2, -O3, etc. The latter is just wrong for i386 (see above). The former may be correct and differ for amd64 because amd64 has better debugging info and thus can afford to omit the frame pointer more often. However, I've seen anomalies for debugging. I forget the details, but remember that one of i386 and amd64 worked better for debugging libm. Another floating point debugging strangeness is that gdb understands XMM registers better on i386 than on amd64! To see this, try 'gdb /bin/cat'. Run the program and stop it with ^C. Then p $xmm0 shows deficient info for amd64. But amd64 actually uses XMM registers for floating point on amd64 (clang with certain -march also bogusly uses them on i386 too). Thus displaying of XMM is broken where it is most needed. >>>> Of course it wouldn't be required with dwarf unwinding awareness, but >>>> we don't have that. Perhaps the clang optimizations depend on this. >>>> We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging >>>> is compiled in because there's no unwinder for doing stack traces. We Hmm, I didn't notice that. It is also done unconditionally in kmod.mk. It is also done conditionally for powerpc kernels and unconditionally for powerpc kmods. -fno-inline-functions-called-once should be done under the same conditions, to unbreak the stack trace for such functions. Unfortunately, this only works for gcc. For gcc, it is only needed for static functions. The above shows that it is needed even more for clang, since clang inlines non-static functions in the same file (perhaps if they are called more than once?). But -fno-inline-functions-called-once is broken (unsupported, and a warning) for clang. >>>> need a dwarf2+ unwinder and somebody to instrument the call frame >>>> state through the remaining assembler code. I wouldn't want it for ddb. ddb doesn't have access to any debug info except the symbol table. 
>>>> >>> How much work is that exactly? I've only been a gdb user, not a hacker. >> gdb has a stack unwinder. kdb/ddb/stack(9) do not. There's well >> established GPL code to do it, as well as libunwind and variants. >> Basically what this code has to do is run the dwarf2+ state machine to >> find all the call/return frames instead of assuming the compiler did >> it. Heck, even glibc has a dwarf2 unwinder built into it as part of >> their exception processing system. >> >> I'm not entirely sure what more work src/lib/libelf and >> src/lib/libdwarf need. It looks like its got just enough implemented >> to support the ctfconvert etc and doesn't have an unwinder in it. >> > This really seems beyond my skill level / time allotment. Let's see where > the numbers put us in terms of system performance and then we can make a call > on it. > > I'd rather take a few % of perf for the power of dtrace, but not if that % is > double digits. Since -fno-omit-frame-pointer is broken (silently ignored) for clang, using it won't make any difference. The only ways I could find to get frame pointers with clang were -pg (profiling) and -finstrument-functions. 
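For reference, a minimal sketch of the two hooks that code built with -finstrument-functions calls. The hook names and their two-pointer signature are the ones gcc documents for this option; the counters, the last_fn variable and the sanity check are purely illustrative and are not anything FreeBSD provides (as noted above, the base system does not supply these symbols, so a program has to bring its own).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bookkeeping only; not part of any compiler or FreeBSD API. */
static unsigned long enter_count;
static unsigned long exit_count;
static void *last_fn;

/*
 * The hooks must not be instrumented themselves, otherwise every call
 * to a hook would call the hook again.
 */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));

/* Called on entry to each instrumented function. */
void
__cyg_profile_func_enter(void *this_fn, void *call_site)
{
	(void)call_site;	/* address of the call; unused in this sketch */
	last_fn = this_fn;
	enter_count++;
}

/* Called just before each instrumented function returns. */
void
__cyg_profile_func_exit(void *this_fn, void *call_site)
{
	(void)this_fn;
	(void)call_site;
	/* An exit without a matching entry would indicate broken bookkeeping. */
	assert(exit_count < enter_count);
	exit_count++;
}
```

Because both the function address and the call site are passed explicitly, the hooks can attribute a call to the right caller, which is the property the implicit-argument .mcount lacks for inlined code.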
Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 18:19:50 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 248B073F; Thu, 27 Dec 2012 18:19:50 +0000 (UTC) (envelope-from sjg@juniper.net) Received: from exprod7og123.obsmtp.com (exprod7og123.obsmtp.com [64.18.2.24]) by mx1.freebsd.org (Postfix) with ESMTP id C7DBB8FC0C; Thu, 27 Dec 2012 18:19:49 +0000 (UTC) Received: from P-EMHUB03-HQ.jnpr.net ([66.129.224.36]) (using TLSv1) by exprod7ob123.postini.com ([64.18.6.12]) with SMTP ID DSNKUNyRRfx8jGDkcP6nEZ6NSV8YljduXqtB@postini.com; Thu, 27 Dec 2012 10:19:49 PST Received: from magenta.juniper.net (172.17.27.123) by P-EMHUB03-HQ.jnpr.net (172.24.192.33) with Microsoft SMTP Server (TLS) id 8.3.213.0; Thu, 27 Dec 2012 10:00:47 -0800 Received: from chaos.jnpr.net (chaos.jnpr.net [172.24.29.229]) by magenta.juniper.net (8.11.3/8.11.3) with ESMTP id qBRI0i301557; Thu, 27 Dec 2012 10:00:46 -0800 (PST) (envelope-from sjg@juniper.net) Received: from chaos.jnpr.net (localhost [127.0.0.1]) by chaos.jnpr.net (Postfix) with ESMTP id 7947F58094; Thu, 27 Dec 2012 10:00:44 -0800 (PST) To: Michael Vale Subject: Re: Cross Compiling of ports Makefiles. In-Reply-To: References: Comments: In-reply-to: "Michael Vale" message dated "Thu, 27 Dec 2012 16:42:48 +1100." From: "Simon J. 
Gerraty" X-Mailer: MH-E 7.82+cvs; nmh 1.3; GNU Emacs 22.3.1 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Date: Thu, 27 Dec 2012 10:00:44 -0800 Message-ID: <20121227180044.7947F58094@chaos.jnpr.net> Cc: freebsd-hackers@freebsd.org, freebsd-ports@freebsd.org, freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 18:19:50 -0000

>Doing the same thing could also prevent the need for a DESTDIR JAIL
>install at all and just use the real build machine’s build env, rather
>than a jail. Regardless. We still have to install these targets and
>their DESTDIR is skewed. There is a few options,

I think I know what you mean, but not clear on the "their DESTDIR is skewed" bit.

>One is to have a MAKEOBJDIRPREFIX like option, and redefine every
>target’s DESTDIR ${makeobjDESTDIR} before running do-install. Now i’ve
>yet to complete this stage, but I believe this is the way to do it.

Would it be sufficient to have an INSTALL_PREFIX and/or INSTALL_DESTDIR so that DESTDIR can be different during install ? [I was recently experimenting with something similar...]
From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 18:47:15 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 02AA1EB6 for ; Thu, 27 Dec 2012 18:47:15 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 4BE4B8FC15 for ; Thu, 27 Dec 2012 18:47:14 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBRIl5wx046042; Thu, 27 Dec 2012 20:47:05 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBRIl5wx046042 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBRIl4Dh046041; Thu, 27 Dec 2012 20:47:04 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 27 Dec 2012 20:47:04 +0200 From: Konstantin Belousov To: Peter Wemm Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD Message-ID: <20121227184704.GK82219@kib.kiev.ua> References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <50DBD193.7080505@mu.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="YZQs1kEQY307C4ut" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: "arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 18:47:15 -0000 --YZQs1kEQY307C4ut 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Wed, Dec 26, 2012 at 09:32:45PM -0800, Peter Wemm wrote:
> I myself have killed patches that turned out to be premature
> optimizations because it actually didn't make any difference. For
> example, I never committed the lazy tlb shootdown to AMD64 because it
> made things slower on the hardware of the day - opteron silicon had
> *hardware* address space tags on their TLB and the lazy shootdown code
> just added more synchronization work that just added overhead.. eg:
> buildworld was around 2% slower with the patches.

Recent Intel CPUs have hardware TLB tags too, but there it is explicit. I have a patch for amd64 pmap which starts using PCID, but the measured effect was non-existent. My guess was that there were two factors. One is the need to do a large number of shootdowns due to buffer creation, hopefully fixed soon. The other is that Intel only added instructions to shoot down a TLB entry in a non-current PCID starting with IvyBridge, so the Sandy Bridge machine which I used for the benchmark had to do: switch PCID; invlpg; switch PCID back to the current one.

> >> Of course it wouldn't be required with dwarf unwinding awareness, but
> >> we don't have that.
> >>
> >> We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging
> >> is compiled in because there's no unwinder for doing stack traces. We
> >> need a dwarf2+ unwinder and somebody to instrument the call frame
> >> state through the remaining assembler code.
> >>
> > How much work is that exactly? I've only been a gdb user, not a hacker.
>
> gdb has a stack unwinder. kdb/ddb/stack(9) do not. There's well
> established GPL code to do it, as well as libunwind and variants.
> Basically what this code has to do is run the dwarf2+ state machine to
> find all the call/return frames instead of assuming the compiler did
> it.
Heck, even glibc has a dwarf2 unwinder built into it as part of > their exception processing system. Our libc also calls libgcc unwinder for cancellation. >=20 > I'm not entirely sure what more work src/lib/libelf and > src/lib/libdwarf need. It looks like its got just enough implemented > to support the ctfconvert etc and doesn't have an unwinder in it. BTW, I once sit and did the annotations for the amd64 asm. I had no time to test all the bits, and I know that my hack in siglongjmp to handle the switch to the new stack seamlessly does not work, at least for gdb. http://people.freebsd.org/~kib/misc/amd64_unwind_annotations.1.patch --YZQs1kEQY307C4ut Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJQ3JenAAoJEJDCuSvBvK1B0QsP/1O6QwzqydH+nN1pcztGEnvB NjXxgTzW5IXb5bN8FuJ4EUNR1zfDIVT6A+i5VYmJO+PmhQScro9B60noIp2Yzuq6 jWJPgsf/ggvcruSpaV95cqdZjGruJ7zlgoa2cf05xven+5kXymV2Wnrmc87TlnbW xWKSjtvteEgtGzKadbdF7m6+iipvX3B1G2OVh5bMO7geQUAjysLj2sT08dvGU6eW v2vGbKcMe+O20EHLGHpBOrTBts0JzwW+vXI1tCPFcUXtVpRmV9Sg2+JbJyGLUk61 9lGoKGv2k1BAjRH/5dOf4TKxqWYOKCp1xa2ZK4uEnf9giVnLgcl5WHFefdpRKA6L +fOuqHtxMnln1BI2vs8MYjtsIeQlyA2XaESK/CKTw5vG3ln3jy1a2sW1EvGEpvt8 CkaHPzaghMpj2ZBKuKSTujKIRibB+oVlFh06AnEJoZyfdHoCjqrs8SCd/IwA7PBY h0KDsNBJlcKeRR8eLaAoTTa85FkdB4yTP8PTV6F1t2TvKl0pcng7yE8i1aDcbX6U gAH5L7vLAOmkTB+WBMayxvcCd4FA7n7kwszdXaw656hKGoTXiFsnDJ1ooJaZUAnu rKihq0zt5UUmZGb2UJeuoklnkaKph7apq6k4MHFM8sCuLbN7IhVwjdG4Eha8iHKl M5V9n/W7gUAaMJPR9ZO4 =s1n7 -----END PGP SIGNATURE----- --YZQs1kEQY307C4ut-- From owner-freebsd-arch@FreeBSD.ORG Thu Dec 27 19:09:15 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 35562103 for ; Thu, 27 Dec 2012 19:09:15 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 7EE6C8FC0A for ; Thu, 27 Dec 2012 
19:09:14 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBRJ95dg047998; Thu, 27 Dec 2012 21:09:05 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBRJ95dg047998 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBRJ94vK047997; Thu, 27 Dec 2012 21:09:04 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 27 Dec 2012 21:09:04 +0200 From: Konstantin Belousov To: Bruce Evans Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD Message-ID: <20121227190904.GL82219@kib.kiev.ua> References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <50DBD193.7080505@mu.org> <50DBE0DB.6090804@ixsystems.com> <20121227214354.V965@besplex.bde.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="kaF1vgn83Aa7CiXN" Content-Disposition: inline In-Reply-To: <20121227214354.V965@besplex.bde.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: "arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Dec 2012 19:09:15 -0000 --kaF1vgn83Aa7CiXN Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Dec 27, 2012 at 11:39:44PM +1100, Bruce Evans wrote: > After working around these bugs by putting the functions in separate files > (and removing the now-unneeded volatiles): >=20 > main.c: > % void foo(void); > %=20 > 
% int
> % main(void)
> % {
> % 	int i;
> %
> % 	for (i = 0; i < 100000000; i++)
> % 		foo();
> % }
>
> foo.c:
> % void bar(void);
> %
> % void
> % foo(void)
> % {
> % 	bar();
> % }
>
> bar.c:
> % void
> % bar(void)
> % {
> % }
>
> we can seem how much the frame pointer optimization is saving: this
> now takes 0.43 seconds with clang and 0.87 seconds with gcc. It
> is weird that the gcc time increased from 0.65 seconds to 0.87
> despite doing less. After adding back the volatiles, the times
> are 0.43 seconds with clang and 0.85 seconds with gcc -- doing
> more gave a small optimization, but didn't recover 0.65 seconds.
> There is apparently some magic alignment or misalignment which
> costs or saves about the same as omitting the frame pointer.
> Finally, with gcc -O -fomit-frame-pointer, the program takes 0.60
> seconds, and with gcc -O2 -fomit-frame-pointer, it takes 0.49
> seconds, and with gcc -O2, it takes 0.49 seconds (this really doesn't
> omit frame pointers, so omitting the frame pointer saves nothing),
> With cc -O -fno-omit-frame-pointer, it takes 0.43 seconds, but this
> case is just broken -- the -fno-omit-frame-pointer is silently ignored :-(.

I do not believe this measurement is indicative. i386 is a register-starved architecture. Using the frame pointer means that you are left with only 6 registers instead of 7. For PIC code, it is 5 vs. 6. It is real code, doing something more than incrementing the same variable, which could take the performance hit from -fno-omit-frame-pointer on i386. But on i386 use of the frame pointer is ABI mandated.

For amd64, the pressure on the register file is not so high, but I do not know of many debugging tools which expect the frame pointer on amd64 or could detect and use it if present. It is only ddb for our kernel and dtrace for Solaris and FreeBSD; gdb definitely does not.
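To make concrete what "expecting the frame pointer" means for a tool such as ddb or dtrace: when frame pointers are kept, each x86 frame begins with the saved caller frame pointer followed by the return address, so a backtrace is just a linked-list walk. The following is only a sketch over a synthetic chain; struct frame, walk_frames() and the ordering check are illustrative names and choices, and a real unwinder reads live stack memory and has to validate every pointer it follows.

```c
#include <assert.h>
#include <stddef.h>

/*
 * What sits at the frame pointer on x86 when frame pointers are kept:
 * the caller's saved frame pointer, then the return address.
 */
struct frame {
	struct frame *prev;	/* caller's frame; NULL at the outermost frame */
	void *ret;		/* return address into the caller */
};

/*
 * Follow the saved-frame-pointer chain starting at fp and store up to
 * max return addresses in out[]. Returns the number stored.
 */
static size_t
walk_frames(const struct frame *fp, void **out, size_t max)
{
	size_t n = 0;

	assert(out != NULL || max == 0);
	while (fp != NULL && n < max) {
		out[n++] = fp->ret;
		/*
		 * The stack grows down, so a caller's frame must lie at a
		 * higher address. Stop on anything else instead of looping
		 * forever over a corrupt chain. (This sketch assumes all
		 * frames live in one array, as they do on a real stack.)
		 */
		if (fp->prev != NULL && fp->prev <= fp)
			break;
		fp = fp->prev;
	}
	return (n);
}
```

When the frame pointer is omitted there is no prev link to follow, and the same information has to be recovered from the unwind tables instead.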
> >>>> need a dwarf2+ unwinder and somebody to instrument the call frame > >>>> state through the remaining assembler code. >=20 > I wouldn't want it for ddb. ddb doesn't have access to any debug info > except the symbol table. The unwind tables are not debugging. They, if requested, are put into the loadable segments. The dwarf unwind is required by the ABI on amd64, and is specified for all other architectures. --kaF1vgn83Aa7CiXN Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJQ3JzQAAoJEJDCuSvBvK1ByjUP/0UMeSEc4y+znGd8MKkOCGSN jT0IQVP/AHelgzfCA9dbnAGlCTF+W5b1VBZHh98ampB37oXJZAYbKlm04vt9uVDF eInFtn8XUhOgUJGMb4cYOtOIjiYHzoOa818pyD2oHkZ++TX7swJsnXGVMVciyUfh VnSKf2gwi0kIjTWbRnPB3JqbEqJYD8JcwFvechId+hlr4Mzm5cX20cc2IFjgPVgE pwt8b41dgyX7UDKAWwvribySbiZkvbwIucZYehx21AS9XvEyl55n38KHK+SiVtCb MS8fdm8VD7WaHqXHwnfAywWmOp1ahsvnf7rzWuxqYxXftrIUsYM8GGCEtddJcGR5 ab7lALUzuzOKnveLMEhBfZqUg38xYUkiqKG86L3n2Zvyry5wA9xdfq4NyiCmXtBI 0AXdgyo+8aUAH20YqB2nRbd/g7Y76a/aqkCauzQ38H1jkqwUW9DI2SM8K2rElZYL W077i/+SF5C+aVRInEP4tQkKIdP2/lM+z2tzkV4cWEIMMCDJ+3owVxj22YhWIhlJ erX/+HUv727gMZdnLPodbXqeGm6MdQovfUJQo4HsmXQQFrAnywgvRWqOX2ZBs36O OtFSzcIlS5pWMATBXLG3kzsTccGqE+iJ6qst33zd442oTPrBEYawAPf8JXIMA4Z+ 9AA78vbTrXdKID3b0x7m =h0nY -----END PGP SIGNATURE----- --kaF1vgn83Aa7CiXN-- From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 13:23:07 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 72F13D32 for ; Fri, 28 Dec 2012 13:23:07 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail12.syd.optusnet.com.au (mail12.syd.optusnet.com.au [211.29.132.193]) by mx1.freebsd.org (Postfix) with ESMTP id D250A8FC0A for ; Fri, 28 Dec 2012 13:23:06 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail12.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP 
id qBSDMv22002882 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 29 Dec 2012 00:22:58 +1100 Date: Sat, 29 Dec 2012 00:22:57 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD In-Reply-To: <20121227190904.GL82219@kib.kiev.ua> Message-ID: <20121228224312.X1054@besplex.bde.org> References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <50DBD193.7080505@mu.org> <50DBE0DB.6090804@ixsystems.com> <20121227214354.V965@besplex.bde.org> <20121227190904.GL82219@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=e5de0tV/ c=1 sm=1 a=EG0SoA9ZrYwA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=BXM4HPcYP8wA:10 a=3e9C-FsJo0l-C2I13FEA:9 a=CjuIK1q_8ugA:10 a=2yTQJ0OkpDuyj7eE:21 a=vRPkp040kyyWXpux:21 a=1gajL0UBtqThFe74:21 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: "arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2012 13:23:07 -0000 On Thu, 27 Dec 2012, Konstantin Belousov wrote: > On Thu, Dec 27, 2012 at 11:39:44PM +1100, Bruce Evans wrote: >> After working around these bugs by putting the functions in separate files >> (and removing the now-unneeded volatiles): >> >> main.c: >> % void foo(void); >> % >> % int >> % main(void) >> % { >> % int i; >> % >> % for (i = 0; i < 100000000; i++) >> % foo(); >> % } >> >> foo.c: >> % void bar(void); >> % >> % void >> % foo(void) >> % { >> % bar(); >> % } >> >> bar.c: >> % void >> % bar(void) >> % { >> % } >> >> we can seem how much the frame pointer optimization is saving: this >> now takes 0.43 seconds with clang and 0.87 seconds with gcc. 
It
>> is weird that the gcc time increased from 0.65 seconds to 0.87
>> despite doing less. After adding back the volatiles, the times
>> are 0.43 seconds with clang and 0.85 seconds with gcc -- doing
>> more gave a small optimization, but didn't recover 0.65 seconds.
>> There is apparently some magic alignment or misalignment which
>> costs or saves about the same as omitting the frame pointer.
>> Finally, with gcc -O -fomit-frame-pointer, the program takes 0.60
>> seconds, and with gcc -O2 -fomit-frame-pointer, it takes 0.49
>> seconds, and with gcc -O2, it takes 0.49 seconds (this really doesn't
>> omit frame pointers, so omitting the frame pointer saves nothing),
>> With cc -O -fno-omit-frame-pointer, it takes 0.43 seconds, but this
>> case is just broken -- the -fno-omit-frame-pointer is silently ignored :-(.
> I do not believe this measurement is indicative.

Yes, since this program is too simple to be representative.

> i386 is
> register-starved architecture. Using the frame pointer means that
> you are left with only 6 registers instead of 7. For the PIC code,
> there are 5 vs. 6. It is real code that does something more than
> incrementing the same variable which could get the performance hit with
> -fno-omit-frame-pointer for i386. But on i386 use of the frame pointer
> is ABI mandated.

Register starvation is another thing that makes very little difference. But here is another non-representative program that goes to the opposite extreme to get register starvation.
The result is that omitting the frame pointer is a small pessimization on i386 and makes no difference on amd64:

main.c:
% int foo(int, int, int, int, int, int, int, int, int, int);
%
% volatile int mf;
%
% int
% main(void)
% {
% 	int i;
%
% 	for (i = 0; i < 100; i++)
% 		mf += foo(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
% }

bar.c:
% int
% bar(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
% {
% 	int n, r;
%
% 	r = 0;
% 	for (n = 0; n < 1000; n++) {
% 		r += a + b + c + d + e + f + g + h + i + j;
% 		a += b;
% 		b = -b;
% 		c += d;
% 		d = -d;
% 		e += f;
% 		f = -f;
% 		g += h;
% 		h = -h;
% 		i += j;
% 		j = -j;
% 	}
% 	return (r);
% }

foo.c:
% int bar(int, int, int, int, int, int, int, int, int, int);
%
% int
% foo(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
% {
% 	int n, r;
%
% 	r = 0;
% 	for (n = 0; n < 1000; n++)
% 		r += bar(a, b, c, d, e, f, g, h, i, j);
% 	return (r);
% }

i386 times on ref10-i386:
gcc -O2 -o f main.c bar.c foo.c:                          0.81 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  0.81 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      0.83 seconds
cc  -O2 -o f main.c bar.c foo.c:                          1.11 seconds
cc  -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  1.11 seconds
cc  -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      1.11 seconds

0.81 seconds is 15.08 cycles/iteration in the inner loop. 0.83 seconds is 15.45 cycles/iteration. 1.11 seconds is 20.67 cycles/iteration.

The inner loop has 12 variables. Since i386 has only 6 or 7 integer registers, these can't be kept in registers. Checking the generated code in the inner loop in bar() shows that the source code is complicated enough to prevent significant optimizations.
It has the following number of memory references:

gcc -O2 -o f main.c bar.c foo.c:                          7(r) 1(w) 4(r+w) = 16
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  not counted
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      7(r) 1(w) 6(r+w) = 20
cc  -O2 -o f main.c bar.c foo.c:                          13(r) 13(w) 2(r+w) = 30
cc  -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  not counted
cc  -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      not counted

Although gcc -fomit-frame-pointer gives 4 fewer memory references, it runs 0.37 cycles/iteration slower. Apparently the extra memory references are always executed in parallel, so they are free. References relative to %ebp are 1 byte smaller than ones relative to %esp. Apparently this is enough to avoid 0.37 cycles of penalties for the larger instructions.

cc (clang) generates remarkably bad code for this. This is shown by all metrics:
- more instructions
- less use of read-modify-write instructions. Where gcc generates lots of addl's to memory variables, clang likes to load the memory variables, add to them, and write them back. In theory this is no slower unless the larger instruction space is too large, since it takes the same number of memory references and should reduce to almost the same micro-ops. But the way clang does it somehow also gives almost twice as many memory references (30 instead of 16).
- many more memory references
- lots of spills which are carefully annotated by clang. It documents 15 spills and 13 reloads.
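The cycles/iteration figures quoted for the i386 runs can be reproduced from the wall-clock times. A small sketch, assuming 100 * 1000 * 1000 executions of the inner loop in bar() (the three loop counts in main.c, foo.c and bar.c) and a clock of roughly 1.8617 GHz. The clock rate is not stated anywhere in the message; it is inferred here from the ratio 15.08 cycles / 0.81 seconds and should be treated as an assumption, as should the names used below.

```c
#include <assert.h>

/* Assumed, not from the message: clock rate inferred from 15.08 cycles / 0.81 s. */
#define ASSUMED_HZ		1.8617e9
/* main() loops 100 times, foo() 1000 times, bar() 1000 times. */
#define INNER_ITERATIONS	(100.0 * 1000.0 * 1000.0)

/* Convert a run time in seconds to cycles per inner-loop iteration. */
static double
cycles_per_iteration(double seconds, double hz, double iterations)
{
	assert(iterations > 0.0);
	return (seconds * hz / iterations);
}
```

Under these assumptions, feeding in 0.81, 0.83 and 1.11 seconds gives values that agree with the 15.08, 15.45 and 20.67 cycles/iteration above to within rounding, so the 0.02 second difference between the two gcc builds is the 0.37 cycles/iteration being discussed.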
amd64 times on ref10-amd64:
gcc -O2 -o f main.c bar.c foo.c:                          0.37 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  0.37 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      0.37 seconds
cc  -O2 -o f main.c bar.c foo.c:                          0.41 seconds
cc  -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  0.41 seconds
cc  -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      0.41 seconds

Now both compilers generate very nice code for bar(), with the variables in the inner loop all in registers, and not so nice code for foo() (it takes just 1 more instruction for all the arithmetic in the inner loop in bar() than for calling bar()). clang somehow finds a way to be slower here too.

-fomit-frame-pointer now makes no difference at all, since it only takes about 3 extra instructions for every call to bar(). I made the inner loop in bar() too heavyweight to test it (oops). Another oops is that I forgot to modify the modification of the variables in bar(). It gets optimized away, and the code in bar() is only so very nice because bar() reduces to adding up the 10 variables.

Now with the inner loop in bar() removed and the number of iterations in foo() multiplied by 1000 to compensate:

amd64 times on ref10-amd64:
gcc -O2 -o f main.c bar.c foo.c:                          0.64 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  0.76 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      0.64 seconds
cc  -O2 -o f main.c bar.c foo.c:                          0.67 seconds
cc  -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  0.67 seconds
cc  -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      0.61 seconds

-fomit-frame-pointer is finally giving an optimization. It does so even for clang, and this is weird because for clang it only affects the non-leaf functions main() and foo() for which there are only 1+100 frame pointer initializations and finalizations. Having a frame pointer is apparently pessimizing foo() by changing the instructions that it uses to call bar().
i386 times on ref10-i386:

gcc -O2 -o f main.c bar.c foo.c:                          1.24 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  1.24 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer      1.16 seconds
cc -O2 -o f main.c bar.c foo.c:                           1.11 seconds
cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:   1.11 seconds
cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer       1.23 seconds

Now the frame pointer in bar() has obvious costs for gcc.  But the
change in the time for clang is even weirder than the changes above:
for clang, the frame pointer is always omitted in the leaf function
bar(); there are only 1+100 frame pointer initializations and
finalizations, as for clang on amd64.  On amd64, omitting the frame
pointer in foo() gave an optimization from 0.67 to 0.61 seconds, but on
i386 it gives a pessimization from 1.11 seconds to 1.23 seconds.

> For amd64, there is no so high pressure on the register file, but I do
> not know that much debugging tools which expect the frame pointer on
> amd64 or could detect and use it if present. It is only ddb for our
> kernel and dtrace for solaris and freebsd, gdb definitely does not.

I was originally going to say that the number of registers is almost as
irrelevant for performance as -fomit-frame-pointer :-).  amd64 is
typically slightly slower than i386 despite its extra registers and
"optimized" ABI with parameters passed in registers, since the
optimizations aren't quite enough to recover from the bloat of 64-bit
pointers and function parameters.  But my non-representative examples
made amd64 about twice as fast.  And to demonstrate loss due to the
frame pointer increasing register pressure significantly, I should have
tried an example where everything fits in registers with
-fomit-frame-pointer but not without.  Such an example would be even
more non-representative.
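To put the clang numbers above on a common scale, the relative change
can be worked out as (after - before) / before.  A small sketch (the
helper name is invented for this note): 0.67 to 0.61 seconds on amd64
is roughly 9% faster, 1.11 to 1.23 seconds on i386 is roughly 11%
slower, and gcc's 0.64 to 0.76 seconds with -fno-omit-frame-pointer on
amd64 is 18.75% slower.

```c
#include <assert.h>

/*
 * Relative change from `before` to `after`, in percent.
 * Negative means the run got faster, positive means slower.
 */
double
pct_change(double before, double after)
{
	assert(before > 0.0);
	return ((after - before) / before * 100.0);
}
```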
Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 22:37:15 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 09F3F2DD for ; Fri, 28 Dec 2012 22:37:15 +0000 (UTC) (envelope-from jroberson@jroberson.net) Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by mx1.freebsd.org (Postfix) with ESMTP id C470E8FC0A for ; Fri, 28 Dec 2012 22:37:14 +0000 (UTC) Received: by mail-pa0-f51.google.com with SMTP id fb11so6270490pad.10 for ; Fri, 28 Dec 2012 14:37:13 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:date:from:x-x-sender:to:cc:subject:in-reply-to :message-id:references:user-agent:mime-version:content-type :x-gm-message-state; bh=FLwYSxoHAzfEqqu0WrvFSqimNY4UlElNbOq/e5phFII=; b=U7WPXowGykQq0Hv26PkjWoJFhF3iYe2Vk5oGO1QeZ3wpsgAJzAVSEB5ON0fNBVS8hO qmCOpeZWsfYHeYi2O3jWStLv/c5e5nCq6ZlAmU2ULS9o7+8liAs/EzOXJ3yWI9l3JCk5 /Pkfh0YeSAKS1f8xuoYptZjlR4xtVtqIzwrjfcP5LQqigLe45bZ1FlM32nC3ycgpdYB9 nG1lHNM8/+NCWw+Ezmhse6YFBQSg8UTIIyB0uVAmtGZpdOwO+rt7JQEfxk4t6r0XjIqE vmTkk+7IwU7hWMHspARG7x0EvliO98YyMyxuCUeF1U3FP0pM+Lg8ZzkOJ2/X+Apc4xep 3M7A== X-Received: by 10.68.218.97 with SMTP id pf1mr107656048pbc.96.1356734233860; Fri, 28 Dec 2012 14:37:13 -0800 (PST) Received: from rrcs-66-91-135-210.west.biz.rr.com (rrcs-66-91-135-210.west.biz.rr.com. 
[66.91.135.210]) by mx.google.com with ESMTPS id gu5sm20343859pbc.10.2012.12.28.14.37.11 (version=SSLv3 cipher=OTHER); Fri, 28 Dec 2012 14:37:12 -0800 (PST) Date: Fri, 28 Dec 2012 12:36:57 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Konstantin Belousov Subject: Re: Unmapped I/O In-Reply-To: <20121219135451.GU71906@kib.kiev.ua> Message-ID: References: <20121219135451.GU71906@kib.kiev.ua> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Gm-Message-State: ALoCoQleIVNP0PEikXGRGKgFC9u0OfAgmWUlXh86kUD/ajlZ0lJ8DUFOtYfAmrar3UbJT/Q01Bf4 Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2012 22:37:15 -0000 On Wed, 19 Dec 2012, Konstantin Belousov wrote: > One of the known FreeBSD I/O path performance bootleneck is the > neccessity to map each I/O buffer pages into KVA. The problem is that > on the multi-core machines, the mapping must flush TLB on all cores, > due to the global mapping of the buffer pages into the kernel. This > means that buffer creation and destruction disrupts execution of all > other cores to perform TLB shootdown through IPI, and the thread > initiating the shootdown must wait for all other cores to execute and > report. > > The patch at > http://people.freebsd.org/~kib/misc/unmapped.4.patch > implements the 'unmapped buffers'. It means an ability to create the > VMIO struct buf, which does not point to the KVA mapping the buffer > pages to the kernel addresses. Since there is no mapping, kernel does > not need to clear TLB. The unmapped buffers are marked with the new > B_NOTMAPPED flag, and should be requested explicitely using the > GB_NOTMAPPED flag to the buffer allocation routines. 
If the mapped > buffer is requested but unmapped buffer already exists, the buffer > subsystem automatically maps the pages. > > The clustering code is also made aware of the not-mapped buffers, but > this required the KPI change that accounts for the diff in the non-UFS > filesystems. > > UFS is adopted to request not mapped buffers when kernel does not need > to access the content, i.e. mostly for the file data. New helper > function vn_io_fault_pgmove() operates on the unmapped array of pages. > It calls new pmap method pmap_copy_pages() to do the data move to and > from usermode. > > Besides not mapped buffers, not mapped BIOs are introduced, marked > with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated > to unmapped BIOs. Geom providers may indicate an acceptance of the > unmapped BIOs. If provider does not handle unmapped i/o requests, > geom now automatically establishes transient mapping for the i/o > pages. > > Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The > gpart providers indicate the unmapped BIOs support if the underlying > provider can do unmapped i/o. I also hacked ahci(4) to handle > unmapped i/o, but this should be changed after the Jeff' physbio patch > is committed, to use proper busdma interface. > > Besides, the swap pager does unmapped swapping if the swap partition > indicated that it can do unmapped i/o. By Jeff request, a buffer > allocation code may reserve the KVA for unmapped buffer in advance. > The unmapped page-in for the vnode pager is also implemented if > filesystem supports it, but the page out is not. The page-out, as well > as the vnode-backed md(4), currently require mappings, mostly due to > the use of VOP_WRITE(). > > As such, the patch worked in my test environment, where I used > ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no > statistically significant difference in the buildworld -j 10 times on > the 4-core machine with HT. 
On the other hand, when doing sha1 over > the 5GB file, the system time was reduced by 30%. > > Unfinished items: > - Integration with the physbio, will be done after physbio is > committed to HEAD. > - The key per-architecture function needed for the unmapped i/o is the > pmap_copy_pages(). I implemented it for amd64 and i386 right now, it > shall be done for all other architectures. > - The sizing of the submap used for transient mapping of the BIOs is > naive. Should be adjusted, esp. for KVA-lean architectures. > - Conversion of the other filesystems. Low priority. > > I am interested in reviews, tests and suggestions. Note that this > only works now for md(4) and ahci(4), for other drivers the patched > kernel should fall back to the mapped i/o. > > sys/amd64/amd64/pmap.c | 24 +++ > sys/cam/ata/ata_da.c | 5 +- > sys/cam/cam_ccb.h | 30 ++++ > sys/dev/ahci/ahci.c | 53 +++++- > sys/dev/md/md.c | 255 ++++++++++++++++++++++++----- > sys/fs/cd9660/cd9660_vnops.c | 2 +- > sys/fs/ext2fs/ext2_balloc.c | 2 +- > sys/fs/ext2fs/ext2_vnops.c | 9 +- > sys/fs/msdosfs/msdosfs_vnops.c | 4 +- > sys/fs/udf/udf_vnops.c | 5 +- > sys/geom/geom.h | 1 + > sys/geom/geom_disk.c | 2 + > sys/geom/geom_disk.h | 1 + > sys/geom/geom_io.c | 44 ++++- > sys/geom/geom_vfs.c | 10 +- > sys/geom/part/g_part.c | 1 + > sys/i386/i386/pmap.c | 42 +++++ > sys/kern/vfs_bio.c | 356 +++++++++++++++++++++++++++++++++-------- > sys/kern/vfs_cluster.c | 118 +++++++------- > sys/kern/vfs_vnops.c | 39 +++++ > sys/sys/bio.h | 7 + > sys/sys/buf.h | 22 ++- > sys/sys/mount.h | 1 + > sys/sys/vnode.h | 2 + > sys/ufs/ffs/ffs_alloc.c | 10 +- > sys/ufs/ffs/ffs_balloc.c | 58 ++++--- > sys/ufs/ffs/ffs_vfsops.c | 3 +- > sys/ufs/ffs/ffs_vnops.c | 35 ++-- > sys/ufs/ufs/ufs_extern.h | 1 + > sys/vm/pmap.h | 2 + > sys/vm/swap_pager.c | 43 +++-- > sys/vm/swap_pager.h | 1 + > sys/vm/vm.h | 2 + > sys/vm/vm_init.c | 6 +- > sys/vm/vm_kern.c | 9 +- > sys/vm/vnode_pager.c | 30 +++- > 36 files changed, 989 insertions(+), 246 
deletions(-)
>
>

A few comments:

1) If the BIO is mapped at all you have to pass the virtual address to
busdma. We can not lose the fact that it is mapped somewhere or virtual
cache architectures will fail. I think when we integrate with physbio we
can do this properly by implementing a load routine on the bio.

2) Why does mdstart_swap() need a cpu_flush_dcache? Shouldn't sf bufs on
the appropriate architecture do the right flushing when they are unmapped?

3) I find the NOTMAPPED negative flag awkward grammatically. UNMAPPED
seems more natural. Or a positive MAPPED flag would be better. Minor
concern, bikeshedding, etc.

4) It would be better to have some wrapper functions around the bio
transient map and/or sf buf handling. I will need it to map unmapped cam
ccbs in device drivers. We need to come to some agreement on this API.
There should be a fast page-by-page version and a potentially blocking
all-at-once linear version.

5) Should the transient bio space not come from the buffer_map? We don't
have more KVA on 32bit architectures. Or did that come from pbufs? We
could consolidate here some but there are a lot of potential deadlock
issues.

6) Thank you for adding the KVAALLOC flag. Isilon needs this internally.

7) All of that code that exploded into getblk() should be refactored into
some support functions. It's hard to read and getblk() is too big
already.

All in all it looks like we have the right pieces coming together. Just
needs a little refactoring and then a lot of testing. I'm almost ready to
commit the first phase of physbio. It doesn't yet have the code necessary
to temporarily map io for things like ATA PIO. I'm hoping that you'll
provide the functions to do that. It does abstract out most of the
details of the memory formats so we can change them at all. And adds
support to busdma for physical addresses so bounce pages, alignment,
sizes, and boundaries work.
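The fallback described in the quoted text (a provider that does not
accept unmapped requests gets a transient mapping established for it)
can be modelled in user space roughly as follows.  This is a sketch
only: every name here (model_bio, model_provider, MODEL_BIO_UNMAPPED,
model_dispatch) is hypothetical, nothing is taken from the patch, and
the "mapping" is faked by returning the first page pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BIO_UNMAPPED 0x1	/* hypothetical flag: no kernel address */

struct model_bio {
	int	 flags;
	void	*data;		/* mapped address, NULL while unmapped */
	void	**pages;	/* page array, always present */
	size_t	 npages;
};

struct model_provider {
	bool	accepts_unmapped;	/* driver can work from the pages */
	int	transient_maps;		/* counter: how often we fell back */
};

/* Stand-in for establishing a transient mapping of the page run. */
static void *
model_transient_map(struct model_bio *bp)
{
	return (bp->npages > 0 ? bp->pages[0] : NULL);
}

/*
 * Hand a request to a provider, mapping it first if the provider
 * needs an address.  Returns true if it was delivered unmapped.
 */
bool
model_dispatch(struct model_provider *pp, struct model_bio *bp)
{
	assert(pp != NULL && bp != NULL);
	if ((bp->flags & MODEL_BIO_UNMAPPED) != 0 && !pp->accepts_unmapped) {
		bp->data = model_transient_map(bp);
		bp->flags &= ~MODEL_BIO_UNMAPPED;
		pp->transient_maps++;
	}
	return ((bp->flags & MODEL_BIO_UNMAPPED) != 0);
}
```

The counter mirrors the idea of watching how often the fallback path
runs; a driver-facing wrapper API as asked for in 4) would sit where
model_transient_map() is in this model.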
Thanks, Jeff From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 22:43:24 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A2E063B2 for ; Fri, 28 Dec 2012 22:43:24 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 5F4598FC12 for ; Fri, 28 Dec 2012 22:43:24 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 7EF4B8A50F; Fri, 28 Dec 2012 22:43:16 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBSMhFIf048514; Fri, 28 Dec 2012 22:43:16 GMT (envelope-from phk@phk.freebsd.dk) To: Jeff Roberson Subject: Re: Unmapped I/O In-reply-to: From: "Poul-Henning Kamp" References: <20121219135451.GU71906@kib.kiev.ua> Date: Fri, 28 Dec 2012 22:43:15 +0000 Message-ID: <48513.1356734595@critter.freebsd.dk> Cc: Konstantin Belousov , arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2012 22:43:24 -0000 -------- In message , Jeff Roberson writes: >3) I find the NOTMAPPED negative flag awkward grammatically. UNMAPPED >seems more natural. Or a positive MAPPED flag would be better. Minor >concern, bikeshedding, etc. Given that down the road, MAPPED should be the exceptional case, I think this should not be a negative option. >4) It would be better to have some wrapper functions around the bio >transient map and or sf buf handling. I will need it to map unmapped cam >ccbs in device drivers. We need to come to some agreement on this API. >There should be a fast page-by-page version and a potentially blocking >all-at-once linear version. 
Do we have credible relevant use-cases for the "map all linear at once" case ? I think it would be better to leave it out, and force those obscure (already pessimized) cornercases to deal with page by page, rather than have them cramp our style WRT to max I/O size. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 23:00:52 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 20F08C26 for ; Fri, 28 Dec 2012 23:00:52 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 3D38A8FC15 for ; Fri, 28 Dec 2012 23:00:51 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBSN0faF010190; Sat, 29 Dec 2012 01:00:41 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBSN0faF010190 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBSN0fp4010189; Sat, 29 Dec 2012 01:00:41 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 29 Dec 2012 01:00:41 +0200 From: Konstantin Belousov To: Jeff Roberson Subject: Re: Unmapped I/O Message-ID: <20121228230041.GY82219@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="eUqGrSc0O7wKBRnC" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no 
version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2012 23:00:52 -0000

--eUqGrSc0O7wKBRnC Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Fri, Dec 28, 2012 at 12:36:57PM -1000, Jeff Roberson wrote:
> On Wed, 19 Dec 2012, Konstantin Belousov wrote:
>
> > One of the known FreeBSD I/O path performance bootleneck is the
> > neccessity to map each I/O buffer pages into KVA. The problem is that
> > on the multi-core machines, the mapping must flush TLB on all cores,
> > due to the global mapping of the buffer pages into the kernel. This
> > means that buffer creation and destruction disrupts execution of all
> > other cores to perform TLB shootdown through IPI, and the thread
> > initiating the shootdown must wait for all other cores to execute and
> > report.
> >
> > The patch at
> > http://people.freebsd.org/~kib/misc/unmapped.4.patch
> > implements the 'unmapped buffers'. It means an ability to create the
> > VMIO struct buf, which does not point to the KVA mapping the buffer
> > pages to the kernel addresses. Since there is no mapping, kernel does
> > not need to clear TLB. The unmapped buffers are marked with the new
> > B_NOTMAPPED flag, and should be requested explicitely using the
> > GB_NOTMAPPED flag to the buffer allocation routines. If the mapped
> > buffer is requested but unmapped buffer already exists, the buffer
> > subsystem automatically maps the pages.
> >
> > The clustering code is also made aware of the not-mapped buffers, but
> > this required the KPI change that accounts for the diff in the non-UFS
> > filesystems.
> > > > UFS is adopted to request not mapped buffers when kernel does not need > > to access the content, i.e. mostly for the file data. New helper > > function vn_io_fault_pgmove() operates on the unmapped array of pages. > > It calls new pmap method pmap_copy_pages() to do the data move to and > > from usermode. > > > > Besides not mapped buffers, not mapped BIOs are introduced, marked > > with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated > > to unmapped BIOs. Geom providers may indicate an acceptance of the > > unmapped BIOs. If provider does not handle unmapped i/o requests, > > geom now automatically establishes transient mapping for the i/o > > pages. > > > > Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The > > gpart providers indicate the unmapped BIOs support if the underlying > > provider can do unmapped i/o. I also hacked ahci(4) to handle > > unmapped i/o, but this should be changed after the Jeff' physbio patch > > is committed, to use proper busdma interface. > > > > Besides, the swap pager does unmapped swapping if the swap partition > > indicated that it can do unmapped i/o. By Jeff request, a buffer > > allocation code may reserve the KVA for unmapped buffer in advance. > > The unmapped page-in for the vnode pager is also implemented if > > filesystem supports it, but the page out is not. The page-out, as well > > as the vnode-backed md(4), currently require mappings, mostly due to > > the use of VOP_WRITE(). > > > > As such, the patch worked in my test environment, where I used > > ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no > > statistically significant difference in the buildworld -j 10 times on > > the 4-core machine with HT. On the other hand, when doing sha1 over > > the 5GB file, the system time was reduced by 30%. > > > > Unfinished items: > > - Integration with the physbio, will be done after physbio is > > committed to HEAD. 
> > - The key per-architecture function needed for the unmapped i/o is the
> >   pmap_copy_pages(). I implemented it for amd64 and i386 right now, it
> >   shall be done for all other architectures.
> > - The sizing of the submap used for transient mapping of the BIOs is
> >   naive. Should be adjusted, esp. for KVA-lean architectures.
> > - Conversion of the other filesystems. Low priority.
> >
> > I am interested in reviews, tests and suggestions. Note that this
> > only works now for md(4) and ahci(4), for other drivers the patched
> > kernel should fall back to the mapped i/o.
> >
> > sys/amd64/amd64/pmap.c | 24 +++
> > sys/cam/ata/ata_da.c | 5 +-
> > sys/cam/cam_ccb.h | 30 ++++
> > sys/dev/ahci/ahci.c | 53 +++++-
> > sys/dev/md/md.c | 255 ++++++++++++++++++++++++-----
> > sys/fs/cd9660/cd9660_vnops.c | 2 +-
> > sys/fs/ext2fs/ext2_balloc.c | 2 +-
> > sys/fs/ext2fs/ext2_vnops.c | 9 +-
> > sys/fs/msdosfs/msdosfs_vnops.c | 4 +-
> > sys/fs/udf/udf_vnops.c | 5 +-
> > sys/geom/geom.h | 1 +
> > sys/geom/geom_disk.c | 2 +
> > sys/geom/geom_disk.h | 1 +
> > sys/geom/geom_io.c | 44 ++++-
> > sys/geom/geom_vfs.c | 10 +-
> > sys/geom/part/g_part.c | 1 +
> > sys/i386/i386/pmap.c | 42 +++++
> > sys/kern/vfs_bio.c | 356 +++++++++++++++++++++++++++++++++--------
> > sys/kern/vfs_cluster.c | 118 +++++++-------
> > sys/kern/vfs_vnops.c | 39 +++++
> > sys/sys/bio.h | 7 +
> > sys/sys/buf.h | 22 ++-
> > sys/sys/mount.h | 1 +
> > sys/sys/vnode.h | 2 +
> > sys/ufs/ffs/ffs_alloc.c | 10 +-
> > sys/ufs/ffs/ffs_balloc.c | 58 ++++---
> > sys/ufs/ffs/ffs_vfsops.c | 3 +-
> > sys/ufs/ffs/ffs_vnops.c | 35 ++--
> > sys/ufs/ufs/ufs_extern.h | 1 +
> > sys/vm/pmap.h | 2 +
> > sys/vm/swap_pager.c | 43 +++--
> > sys/vm/swap_pager.h | 1 +
> > sys/vm/vm.h | 2 +
> > sys/vm/vm_init.c | 6 +-
> > sys/vm/vm_kern.c | 9 +-
> > sys/vm/vnode_pager.c | 30 +++-
> > 36 files changed, 989 insertions(+), 246 deletions(-)
> >
> >
>
> A few comments:
>
> 1) If the BIO is mapped at all you have to pass the virtual
address to
> busdma. We can not lose the fact that it is mapped somewhere or virtual
> cache architectures will fail. I think when we integrate with physbio we
> can do this properly by implementing a load routine on the bio.
This is weird. How does the buffer cache work for such architectures then?
VMIO buffer pages could be mapped by the kernel and by any usermode process.

Anyway, I believe I never pass BIO_NOTMAPPED down if the kernel mapping
for the buffer exists. I might document it somewhere.

Would you prefer to have both the mapping address and the vm_page array
passed down, if KVA is available?

>
> 2) Why does mdstart_swap() need a cpu_flush_dcache? Shouldn't sf bufs on
> the appropriate architecture do the right flushing when they are unmapped?
I believe it was added by Marcel. The sfbuf code cannot know if the
[d]cache should be flushed, since there is no indication from the caller
on the kind of access to the mapping, ro or rw.

>
> 3) I find the NOTMAPPED negative flag awkward grammatically. UNMAPPED
> seems more natural. Or a positive MAPPED flag would be better. Minor
> concern, bikeshedding, etc.
So you prefer BIO/B_UNMAPPED? I will change this.

>
> 4) It would be better to have some wrapper functions around the bio
> transient map and or sf buf handling. I will need it to map unmapped cam
> ccbs in device drivers. We need to come to some agreement on this API.
> There should be a fast page-by-page version and a potentially blocking
> all-at-once linear version.
No, I do not think this is needed. I thought about this, and decided
to handle it centrally in g_down instead of requiring driver code
to be changed.

The driver should indicate the acceptance of the unmapped BIOs. If it
does not accept them, the patch already contains code in the g_down
path to create the transient mapping. The PIO-mode ATA drivers should
not set the flag on the disk.
There might be some need to be able to
dynamically clear the flag when the channel mode is switched to PIO.
In fact, I do not see why we should not simply require always-mapped
BIOs for legacy ATA. See DISKFLAG_NOTMAPPED_BIO.

>
> 5) Should the transient bio space not come from the buffer_map? We don't
> have more KVA on 32bit architectures. Or did that come from pbufs? We
> could consolidate here some but there are a lot of potential deadlock
> issues.
No, precisely because it is impossible to request buffer_map KVA
defragmentation from g_down. The worst that can happen with the
current code is pauses during intensive i/o when the transient map is
exhausted or fragmented: g_down is paused until enough in-flight
transactions have completed. Peter has tested it under a load where
a lot of parallel i/o was issued and the transient map was constantly
exhausted, with the ad(4) driver requiring fallback mappings. I added
counters to watch the situation.

>
> 6) Thank you for adding the KVAALLOC flag. Isilon needs this internally.
Please note that I have not tested this yet :/.

>
> 7) All of that code that exploded into getblk() should be refactored into
> some support functions. It's hard to read and getblk() is too big
> already.
Agreed.

>
> All in all it looks like we have the right pieces coming together. Just
> needs a little refactoring and then a lot of test. I'm almost ready to
> commit the first phase of physbio. It doesn't yet have the code necessary
> to temporarily map io for things like ATA PIO. I'm hoping that you'll
It is there in the central place.

> provide the functions to do that. It does abstract out most of the
> details of the memory formats so we can change them at all. And adds
> support to busdma for physical addresses so bounce pages, alignment,
> sizes, and boundaries work.
>
> Thanks,
> Jeff

--eUqGrSc0O7wKBRnC--

From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 23:20:32 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id AC6AB1BC for ; Fri, 28 Dec 2012 23:20:32 +0000 (UTC) (envelope-from jroberson@jroberson.net) Received: from mail-da0-f49.google.com (mail-da0-f49.google.com [209.85.210.49]) by mx1.freebsd.org (Postfix) with ESMTP id 734358FC0A for ; Fri, 28 Dec 2012 23:20:32 +0000 (UTC) Received: by mail-da0-f49.google.com with SMTP id v40so4951145dad.8 for ; Fri, 28 Dec 2012 15:20:26 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:date:from:x-x-sender:to:cc:subject:in-reply-to :message-id:references:user-agent:mime-version:content-type :x-gm-message-state; bh=/QmLskEBRBC7P2BysEeE8VV7PCsh+xjZehibUXU24Lo=; b=St8nCPW6x7556/uM+DFoVPZz4b4KB87ePGvhDP6dhf66VBGBPYRf2L8lozgqGqyRdu eZyAiYuznnmaJT1YKQzsEHQohWP2KW+D5tQvysSywS3mYt3fvRb1m3oN3p037AuAGpyU
gCo3y1NySPgtkv2KaFzvAJQWJBu4Gx9PZxexO5WrhARFW7WzM83JwzzzmLhJ45jy1jhx WNsp18GoquJu83bCFzISLgieP7UDi7Fgt7iqVHk/2tZGWJlYRapGW7/215rK18SwYX9c jbuzdsnEwDG+44gKKZeozdLCrPClxdO+beFIChH/m3fuRPGtYOrkFekcLYp8a9ElLAlz 1R9w== X-Received: by 10.68.241.232 with SMTP id wl8mr107604582pbc.144.1356736826133; Fri, 28 Dec 2012 15:20:26 -0800 (PST) Received: from rrcs-66-91-135-210.west.biz.rr.com (rrcs-66-91-135-210.west.biz.rr.com. [66.91.135.210]) by mx.google.com with ESMTPS id gq10sm20381191pbc.54.2012.12.28.15.20.24 (version=SSLv3 cipher=OTHER); Fri, 28 Dec 2012 15:20:25 -0800 (PST) Date: Fri, 28 Dec 2012 13:20:09 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Konstantin Belousov Subject: Re: Unmapped I/O In-Reply-To: <20121228230041.GY82219@kib.kiev.ua> Message-ID: References: <20121219135451.GU71906@kib.kiev.ua> <20121228230041.GY82219@kib.kiev.ua> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Gm-Message-State: ALoCoQmBpWQISqGg8Hl6mKplx5TSvnaPIQHg2RoxPPdZdgGoOVxQWA8Dhfc8ff+lcharJMUEEq70 Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2012 23:20:32 -0000 On Sat, 29 Dec 2012, Konstantin Belousov wrote: > On Fri, Dec 28, 2012 at 12:36:57PM -1000, Jeff Roberson wrote: >> On Wed, 19 Dec 2012, Konstantin Belousov wrote: >> >>> One of the known FreeBSD I/O path performance bootleneck is the >>> neccessity to map each I/O buffer pages into KVA. The problem is that >>> on the multi-core machines, the mapping must flush TLB on all cores, >>> due to the global mapping of the buffer pages into the kernel. 
This >>> means that buffer creation and destruction disrupts execution of all >>> other cores to perform TLB shootdown through IPI, and the thread >>> initiating the shootdown must wait for all other cores to execute and >>> report. >>> >>> The patch at >>> http://people.freebsd.org/~kib/misc/unmapped.4.patch >>> implements the 'unmapped buffers'. It means an ability to create the >>> VMIO struct buf, which does not point to the KVA mapping the buffer >>> pages to the kernel addresses. Since there is no mapping, kernel does >>> not need to clear TLB. The unmapped buffers are marked with the new >>> B_NOTMAPPED flag, and should be requested explicitely using the >>> GB_NOTMAPPED flag to the buffer allocation routines. If the mapped >>> buffer is requested but unmapped buffer already exists, the buffer >>> subsystem automatically maps the pages. >>> >>> The clustering code is also made aware of the not-mapped buffers, but >>> this required the KPI change that accounts for the diff in the non-UFS >>> filesystems. >>> >>> UFS is adopted to request not mapped buffers when kernel does not need >>> to access the content, i.e. mostly for the file data. New helper >>> function vn_io_fault_pgmove() operates on the unmapped array of pages. >>> It calls new pmap method pmap_copy_pages() to do the data move to and >>> from usermode. >>> >>> Besides not mapped buffers, not mapped BIOs are introduced, marked >>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated >>> to unmapped BIOs. Geom providers may indicate an acceptance of the >>> unmapped BIOs. If provider does not handle unmapped i/o requests, >>> geom now automatically establishes transient mapping for the i/o >>> pages. >>> >>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The >>> gpart providers indicate the unmapped BIOs support if the underlying >>> provider can do unmapped i/o. 
I also hacked ahci(4) to handle >>> unmapped i/o, but this should be changed after the Jeff' physbio patch >>> is committed, to use proper busdma interface. >>> >>> Besides, the swap pager does unmapped swapping if the swap partition >>> indicated that it can do unmapped i/o. By Jeff request, a buffer >>> allocation code may reserve the KVA for unmapped buffer in advance. >>> The unmapped page-in for the vnode pager is also implemented if >>> filesystem supports it, but the page out is not. The page-out, as well >>> as the vnode-backed md(4), currently require mappings, mostly due to >>> the use of VOP_WRITE(). >>> >>> As such, the patch worked in my test environment, where I used >>> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no >>> statistically significant difference in the buildworld -j 10 times on >>> the 4-core machine with HT. On the other hand, when doing sha1 over >>> the 5GB file, the system time was reduced by 30%. >>> >>> Unfinished items: >>> - Integration with the physbio, will be done after physbio is >>> committed to HEAD. >>> - The key per-architecture function needed for the unmapped i/o is the >>> pmap_copy_pages(). I implemented it for amd64 and i386 right now, it >>> shall be done for all other architectures. >>> - The sizing of the submap used for transient mapping of the BIOs is >>> naive. Should be adjusted, esp. for KVA-lean architectures. >>> - Conversion of the other filesystems. Low priority. >>> >>> I am interested in reviews, tests and suggestions. Note that this >>> only works now for md(4) and ahci(4), for other drivers the patched >>> kernel should fall back to the mapped i/o. 
>>>
>>> sys/amd64/amd64/pmap.c         |  24 +++
>>> sys/cam/ata/ata_da.c           |   5 +-
>>> sys/cam/cam_ccb.h              |  30 ++++
>>> sys/dev/ahci/ahci.c            |  53 +++++-
>>> sys/dev/md/md.c                | 255 ++++++++++++++++++++++++-----
>>> sys/fs/cd9660/cd9660_vnops.c   |   2 +-
>>> sys/fs/ext2fs/ext2_balloc.c    |   2 +-
>>> sys/fs/ext2fs/ext2_vnops.c     |   9 +-
>>> sys/fs/msdosfs/msdosfs_vnops.c |   4 +-
>>> sys/fs/udf/udf_vnops.c         |   5 +-
>>> sys/geom/geom.h                |   1 +
>>> sys/geom/geom_disk.c           |   2 +
>>> sys/geom/geom_disk.h           |   1 +
>>> sys/geom/geom_io.c             |  44 ++++-
>>> sys/geom/geom_vfs.c            |  10 +-
>>> sys/geom/part/g_part.c         |   1 +
>>> sys/i386/i386/pmap.c           |  42 +++++
>>> sys/kern/vfs_bio.c             | 356 +++++++++++++++++++++++++++++++++--------
>>> sys/kern/vfs_cluster.c         | 118 +++++++-------
>>> sys/kern/vfs_vnops.c           |  39 +++++
>>> sys/sys/bio.h                  |   7 +
>>> sys/sys/buf.h                  |  22 ++-
>>> sys/sys/mount.h                |   1 +
>>> sys/sys/vnode.h                |   2 +
>>> sys/ufs/ffs/ffs_alloc.c        |  10 +-
>>> sys/ufs/ffs/ffs_balloc.c       |  58 ++++---
>>> sys/ufs/ffs/ffs_vfsops.c       |   3 +-
>>> sys/ufs/ffs/ffs_vnops.c        |  35 ++--
>>> sys/ufs/ufs/ufs_extern.h       |   1 +
>>> sys/vm/pmap.h                  |   2 +
>>> sys/vm/swap_pager.c            |  43 +++--
>>> sys/vm/swap_pager.h            |   1 +
>>> sys/vm/vm.h                    |   2 +
>>> sys/vm/vm_init.c               |   6 +-
>>> sys/vm/vm_kern.c               |   9 +-
>>> sys/vm/vnode_pager.c           |  30 +++-
>>> 36 files changed, 989 insertions(+), 246 deletions(-)
>>>
>>>
>>
>> A few comments:
>>
>> 1) If the BIO is mapped at all you have to pass the virtual address to
>> busdma. We can not lose the fact that it is mapped somewhere or virtual
>> cache architectures will fail. I think when we integrate with physbio we
>> can do this properly by implementing a load routine on the bio.
> This is weird. How does the buffer cache work for such architectures, then?
> VMIO buffer pages could be mapped by kernel and any usermode.

pmap_remove_write in vfs_busy_pages() handles this. No new writes are
allowed until the IO is complete.

>
> Anyway, I believe I never pass BIO_NOTMAPPED down if the kernel mapping
> for the buffer exists.
> I might document it somewhere.

Ok, we just need to be sure that at any place where we make the transition
from unmapped to mapped we use the right pointer.

>
> Would you prefer to have both mapping address and vm_page array passed
> down, if KVA is available ?

Yes, I think that's best.

>>
>> 2) Why does mdstart_swap() need a cpu_flush_dcache? Shouldn't sf bufs on
>> the appropriate architecture do the right flushing when they are unmapped?
> I believe it was added by Marcel. sfbuf code cannot know if the
> [d]cache should be flushed, since there is no indication from the caller
> on the kind of access to the mapping, ro or rw.

Maybe we should change the free function to indicate. This seems sloppy.
Not your fault though.

>
>>
>> 3) I find the NOTMAPPED negative flag awkward grammatically. UNMAPPED
>> seems more natural. Or a positive MAPPED flag would be better. Minor
>> concern, bikeshedding, etc.
> So you prefer BIO/B_UNMAPPED ? I will change this.

I think I'd prefer the positive flag BIO_MAPPED but I recognize that's a lot
of work to switch the sense in your entire patch now so feel free to ignore
me.

>
>>
>> 4) It would be better to have some wrapper functions around the bio
>> transient map and/or sf buf handling. I will need it to map unmapped cam
>> ccbs in device drivers. We need to come to some agreement on this API.
>> There should be a fast page-by-page version and a potentially blocking
>> all-at-once linear version.
> No, I do not think this is needed. I thought about this, and decided
> to handle it centrally in g_down instead of requiring driver code
> to be changed there.
>
> The driver should indicate the acceptance of the unmapped BIOs. If it
> does not accept them, the patch already contains code in the g_down
> path to create the transient mapping. The PIO-mode ATA drivers should
> not set the flag on the disk. There might be some need to be able to
> dynamically clear the flag when the channel mode is switched to PIO.
>
> In fact, I do not see why not to require the always mapped BIOs for
> legacy ATA. See DISKFLAG_NOTMAPPED_BIO.

I guess this would be acceptable for a first cut but I don't think it will
fly long-term. I audited all of the cam and block devices recently. There
are cases that are difficult to predict ahead of time which would mean some
devices would always be operating in a slower mode. Also my primary test
machines are 16-core boxes that still use legacy ATA. It is not that
uncommon. It would be a shame if many of our users saw no benefit from this.
Although for ATA PIO we could make an iterator that used the directmap or
sf_bufs. That may be sufficient. It would be nice if it was encapsulated in
a simple api so every user didn't have to know the details of pinning, sf
bufs, and the memory layout. What would you think of that as the in-between
compromise?

>
>>
>> 5) Should the transient bio space not come from the buffer_map? We don't
>> have more KVA on 32bit architectures. Or did that come from pbufs? We
>> could consolidate here some but there are a lot of potential deadlock
>> issues.
> No, exactly because it is impossible to request the buffer_map KVA
> defragmentation from the g_down. The worst which could happen with the
> current code is the pauses during intensive i/o when transient map is
> exhausted or fragmented. The g_down is paused until enough in-flight
> transactions are completed.
>

How is there room for this new transient map? Can't pbufs completely give up
their space to it now? Shouldn't you always allocate BKVASIZE to avoid
fragmentation? And speed up the search.

> Peter has tested it under the load where a lot of the parallel i/o were
> issued and transient map was constantly exhausted, for the ad(4) driver
> requiring fallback mappings. I added the counters to watch the situation.
>
>>
>> 6) Thank you for adding the KVAALLOC flag. Isilon needs this internally.
> Please note that I have not tested this yet :/.
> >> >> 7) All of that code that exploded into getblk() should be refactored into >> some support functions. It's hard to read and getblk() is too big >> already. > Agreed. > >> >> All in all it looks like we have the right pieces coming together. Just >> needs a little refactoring and then a lot of test. I'm almost ready to >> commit the first phase of physbio. It doesn't yet have the code necessary >> to temporarily map io for things like ATA PIO. I'm hoping that you'll > It is there in the central place. > >> provide the functions to do that. It does abstract out most of the >> details of the memory formats so we can change them at all. And adds >> support to busdma for physical addresses so bounce pages, alignment, >> sizes, and boundaries work. >> >> Thanks, >> Jeff > Jeff From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 23:23:11 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 72923428 for ; Fri, 28 Dec 2012 23:23:11 +0000 (UTC) (envelope-from jroberson@jroberson.net) Received: from mail-da0-f48.google.com (mail-da0-f48.google.com [209.85.210.48]) by mx1.freebsd.org (Postfix) with ESMTP id 395BF8FC08 for ; Fri, 28 Dec 2012 23:23:11 +0000 (UTC) Received: by mail-da0-f48.google.com with SMTP id k18so4984104dae.21 for ; Fri, 28 Dec 2012 15:23:05 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:date:from:x-x-sender:to:cc:subject:in-reply-to :message-id:references:user-agent:mime-version:content-type :x-gm-message-state; bh=fiHLKz30apwHkRZ9aj8H4MqAobTI+bFO7vpBoctKkTg=; b=DB4jw50fEHSfTrq8mBpfNfwcZo77Py+3zJkwcGsXpVHCRx004LDlZ4OY50HjgSJfsU LojI5VjSfawPn17xuEXJyBO8JwZzrk07m/CJ/Ed0KcdaMd8EpEC7jb6UK4ruA+sD2/Q4 v64P4tM5dqZTrDRUcfKJ/7ERgo0B2BJpUQ9rN6oAz3wM3kb61sNDoe40L9fBDaJEAJOu n8TWHu04NIASOXa3xPBC4X3WgraHnsy+4jPa9DHqNQ7ra3c6iHH2v5LWNpa2HfT83B+7 
qBh2QLP8eBr7ZFvL3CBLpxhk3p/0HVrqd3z+njxeFBRWg97x6QPyzHznt1DXH3zyDm1J ZWqA== X-Received: by 10.68.191.200 with SMTP id ha8mr107773295pbc.51.1356736985224; Fri, 28 Dec 2012 15:23:05 -0800 (PST) Received: from rrcs-66-91-135-210.west.biz.rr.com (rrcs-66-91-135-210.west.biz.rr.com. [66.91.135.210]) by mx.google.com with ESMTPS id sk1sm20398596pbc.0.2012.12.28.15.23.03 (version=SSLv3 cipher=OTHER); Fri, 28 Dec 2012 15:23:04 -0800 (PST) Date: Fri, 28 Dec 2012 13:22:48 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Poul-Henning Kamp Subject: Re: Unmapped I/O In-Reply-To: <48513.1356734595@critter.freebsd.dk> Message-ID: References: <20121219135451.GU71906@kib.kiev.ua> <48513.1356734595@critter.freebsd.dk> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Gm-Message-State: ALoCoQmCJDgbAyjqtYOiQx3LKE4k6yOoC46QW0Dd2PivQcxLMJ5Qu3vaN5nWNI2B+83z4w9QeclX Cc: Konstantin Belousov , arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2012 23:23:11 -0000 On Fri, 28 Dec 2012, Poul-Henning Kamp wrote: > -------- > In message , Jeff Roberson writes: > >> 3) I find the NOTMAPPED negative flag awkward grammatically. UNMAPPED >> seems more natural. Or a positive MAPPED flag would be better. Minor >> concern, bikeshedding, etc. > > Given that down the road, MAPPED should be the exceptional case, I > think this should not be a negative option. > >> 4) It would be better to have some wrapper functions around the bio >> transient map and or sf buf handling. I will need it to map unmapped cam >> ccbs in device drivers. We need to come to some agreement on this API. >> There should be a fast page-by-page version and a potentially blocking >> all-at-once linear version. 
>
> Do we have credible relevant use-cases for the "map all linear at
> once" case ?
>
> I think it would be better to leave it out, and force those obscure
> (already pessimized) cornercases to deal with page by page, rather
> than have them cramp our style WRT max I/O size.
>

Well some would require significant stack rewrites. The usb code, for
example, assumes linear buffers. So the usb mass storage driver can always
run in mapped compat mode. That's not a high performance case. The other
place seems to be scsi target code and a few other consumers who actually
want to poke at the block data. When we send various scsi commands down
like identify and read format etc. we always use malloc'd buffers. For now
I think it's safe to leave that assumption in place. The PIO-like cases can
do page by page pretty easily if we give them a nice API.

Thanks,
Jeff

> --
> Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG | TCP/IP since RFC 956
> FreeBSD committer | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.
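
For illustration only, here is a minimal userland sketch of what a
page-by-page walk over an unmapped buffer could look like. Nothing below is
taken from the patch or from physbio: the names sketch_pagelist,
sketch_foreach_page and sketch_gather_frag are hypothetical, plain pointers
stand in for vm_page_t, and the 4096-byte page size is an assumption. The
point is only that a consumer sees one page fragment at a time and never
needs a linear mapping of the whole transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for an unmapped buffer: the pages are not contiguous. */
struct sketch_pagelist {
	unsigned char **pages;	/* each entry points at one page-sized chunk */
	size_t npages;
};

/* Callback invoked once per page fragment. */
typedef void (*sketch_frag_fn)(unsigned char *frag, size_t len, void *arg);

/*
 * Visit the byte range [offset, offset + len) one page fragment at a time.
 * Returns the number of fragments handed to the callback.
 */
static size_t
sketch_foreach_page(const struct sketch_pagelist *pl, size_t offset,
    size_t len, sketch_frag_fn fn, void *arg)
{
	size_t nfrags = 0;

	assert(offset + len <= pl->npages * SKETCH_PAGE_SIZE);
	while (len > 0) {
		size_t idx = offset / SKETCH_PAGE_SIZE;
		size_t pgoff = offset % SKETCH_PAGE_SIZE;
		size_t chunk = SKETCH_PAGE_SIZE - pgoff;

		/* Never run past the end of the requested range. */
		if (chunk > len)
			chunk = len;
		fn(pl->pages[idx] + pgoff, chunk, arg);
		offset += chunk;
		len -= chunk;
		nfrags++;
	}
	return (nfrags);
}

/* Example consumer: gather the fragments into a linear destination. */
struct sketch_gather {
	unsigned char *dst;
	size_t done;
};

static void
sketch_gather_frag(unsigned char *frag, size_t len, void *arg)
{
	struct sketch_gather *g = arg;

	memcpy(g->dst + g->done, frag, len);
	g->done += len;
}
```

A PIO-style consumer would pass its own callback in place of
sketch_gather_frag and push each fragment to the device. In a kernel the
per-fragment step would also have to make the page addressable (directmap or
an sf_buf, as discussed above), which this sketch deliberately leaves out.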
> From owner-freebsd-arch@FreeBSD.ORG Sat Dec 29 16:59:07 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CCF7F994 for ; Sat, 29 Dec 2012 16:59:07 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 47CEB8FC18 for ; Sat, 29 Dec 2012 16:59:07 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBTGwxHW031561; Sat, 29 Dec 2012 18:58:59 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBTGwxHW031561 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBTGwwBV031560; Sat, 29 Dec 2012 18:58:58 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 29 Dec 2012 18:58:58 +0200 From: Konstantin Belousov To: Bruce Evans Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD Message-ID: <20121229165858.GE82219@kib.kiev.ua> References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <50DBD193.7080505@mu.org> <50DBE0DB.6090804@ixsystems.com> <20121227214354.V965@besplex.bde.org> <20121227190904.GL82219@kib.kiev.ua> <20121228224312.X1054@besplex.bde.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="vDpvzslK0qRw06MN" Content-Disposition: inline In-Reply-To: <20121228224312.X1054@besplex.bde.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: "arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: 
Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Dec 2012 16:59:07 -0000

--vDpvzslK0qRw06MN
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Dec 29, 2012 at 12:22:57AM +1100, Bruce Evans wrote:
> i386 times on ref10-i386:
> gcc -O2 -o f main.c bar.c foo.c: 0.81 seconds
> gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 0.81 seconds
> gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer: 0.83 seconds
> cc -O2 -o f main.c bar.c foo.c: 1.11 seconds
> cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 1.11 seconds
> cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer: 1.11 seconds
>
> 0.81 seconds is 15.08 cycles/iteration in the inner loop. 0.83 seconds
> is 15.45 cycles/iteration. 1.11 seconds is 20.67 cycles/iteration.

I increased the number of iterations in the main() from 100 to 1000, and
I get 34.1s for the -fomit-frame-pointer version vs. 44.1s for -fno-omit
with i7 930 and gcc 4.7.2 -O3. I say it is a very significant difference.
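
To put the timings quoted in this message on a common scale, the following
small C helper computes by what percentage one run is slower than another.
The function name pct_slower is made up for this sketch; the inputs are only
the wall-clock figures quoted above.

```c
#include <assert.h>

/*
 * Percentage by which `slow` exceeds `fast`. Both values must be in the
 * same unit (seconds here). A result of 25.0 means "25% slower".
 */
static double
pct_slower(double fast, double slow)
{
	assert(fast > 0.0);
	return ((slow - fast) / fast * 100.0);
}
```

With the numbers above, pct_slower(34.1, 44.1) is about 29.3, so the
-fno-omit build took roughly 29% longer on the i7 930. For the quoted
ref10-i386 gcc figures, pct_slower(0.81, 0.83) is about 2.5, and the 1.11
second cc runs are about 37% slower than the 0.81 second gcc runs.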
--vDpvzslK0qRw06MN--