From owner-freebsd-arch@FreeBSD.ORG Sun Sep 23 02:32:08 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D9C5516A468 for ; Sun, 23 Sep 2007 02:32:08 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.189]) by mx1.freebsd.org (Postfix) with ESMTP id 6E67B13C46A for ; Sun, 23 Sep 2007 02:32:08 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: by nf-out-0910.google.com with SMTP id b2so963355nfb for ; Sat, 22 Sep 2007 19:32:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:cc:mime-version:content-type:content-transfer-encoding:content-disposition:x-google-sender-auth; bh=iiUvjAMtMydCs3r99oSiL8ZVSK+ENH6vXGacQdqDpQM=; b=g1iP1YJJkN0bZX0zG6kpRUL0G28h99XNXoc29uUxIEd/IjbnDP8ftcKd8USxTcqu+JhvIwjQl1Jrm3UL/FR+ItWu/f2LB7wZ8avmP1BZ3vwtB3mwzLrpBB6WxmFr1U90iktaxg/zEDFc8Y+cWCGp9cHbwoh03+LXgh9iTNNBlnY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:sender:to:subject:cc:mime-version:content-type:content-transfer-encoding:content-disposition:x-google-sender-auth; b=iWgivIQ7NXNN0XY5Jkaqr5xq/lcUGOjxvROnqBlWyXWPktg3E5q1lwgalmZVsy1CyN4hnqzpdKb7lnuc7C9zvoO73MlMsNChlfbb8kp0+o96ISr41ys1SK5Tas7lpY4RBvXQZGV3vcHPATRvTCChrT2a77oj7j+4pn0D6LowUkM= Received: by 10.78.149.15 with SMTP id w15mr2190939hud.1190514726280; Sat, 22 Sep 2007 19:32:06 -0700 (PDT) Received: by 10.78.97.18 with HTTP; Sat, 22 Sep 2007 19:32:06 -0700 (PDT) Message-ID: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> Date: Sun, 23 Sep 2007 04:32:06 +0200 From: "Attilio Rao" Sender: asmrookie@gmail.com To: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-Disposition: inline X-Google-Sender-Auth: 2e21c3221234266c Cc: Subject: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Sep 2007 02:32:08 -0000

Recently several people have reported starvation problems with rwlocks. In particular, users who tried rwlocks in big SMP environments (16+ CPUs) found them rather prone to poor performance and to starvation of waiters.

Inspecting the code, something strange about adaptive spinning popped up: basically, for rwlocks, the adaptive spinning stubs seem to be placed too far down in the decision loop. The placement of the stub lets a thread that would adaptively spin first set the respective (read or write) waiters flag, which means that the owner of the lock will go down the hard path of the unlock function and perform a full wakeup even if the waiter queues may turn out to be empty. This is a big penalty for adaptive spinning and can make it completely useless. In addition to this, adaptive spinning only runs in the turnstile spinlock path, which is not ideal.

This patch ports the approach already used for adaptive spinning in sx locks to rwlocks: http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff

With sx locks it is unlikely to show big benefits because they are held for longer times, but for rwlocks the situation is rather different.
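The idea, in rough pseudo-code (this is only a sketch of the approach, not the patch itself), is to spin on the owner before touching the lock word at all, so that an uncontested release never has to leave the fast path:

        while (!_obtain_lock) {
                owner = current lock owner;
                if (owner != NULL && owner is running on another CPU) {
                        while (lock still held && owner still running)
                                cpu_spinwait();
                        continue;       /* retry, no waiters flag was set */
                }
                /* only now set the waiters flag and block via the turnstile */
                ...
        }

With this shape the waiters flags are only set by threads that are really going to block.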
I would like to see if people can do benchmarks with this patch (maybe in private environments?) as I'm not able to do them in short times. Adaptive spinning in rwlocks can be improved further with other tricks (like adding a backoff counter, for example, or trying to spin with the lock held in read mode too), but we first should be sure to start with a solid base. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Mon Sep 24 16:13:26 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 41E5F16A41A; Mon, 24 Sep 2007 16:13:26 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id D5FD813C4B2; Mon, 24 Sep 2007 16:13:25 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8p) with ESMTP id 211141650-1834499 for multiple; Mon, 24 Sep 2007 12:11:55 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id l8OGD3hJ098542; Mon, 24 Sep 2007 12:13:07 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: "Attilio Rao" Date: Mon, 24 Sep 2007 11:52:41 -0400 User-Agent: KMail/1.9.6 References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> In-Reply-To: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709241152.41660.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Mon, 24 Sep 2007 12:13:07 -0400 (EDT) X-Virus-Scanned: ClamAV 0.88.3/4378/Mon Sep 24 08:25:35 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 24 Sep 2007 16:13:26 -0000 On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote: > Recently several people have reported problems of starvation with rwlocks. > In particular, users which tried to use rwlock on big SMP environment > (16+ CPUs) found them rather subjected to poor performances and to > starvation of waiters. > > Inspecting the code, something strange about adaptive spinning popped > up: basically, for rwlocks, adaptive spinning stubs seem to be > customed too down in the decisioning-loop. > The desposition of the stub will let the thread that would adaptively > spin, to set the respecitve (both read or write) waiters flag on, > which means that the owner of the lock will go down in the hard path > of locking functions and will performe a full wakeup even if the > waiters queues can result empty. This is a big penalty for adaptive > spinning which can make it completely useless. 
> In addiction to this, adaptive spinning only runs in the turnstile > spinlock path which is not ideal. > This patch ports the approach alredy used for adaptive spinning in sx > locks to rwlocks: > http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff > > In sx it is unlikely to see big benefits because they are held for too > long times, but for rwlocks situation is rather different. > I would like to see if people can do benchmarks with this patch (maybe > in private environments?) as I'm not able to do them in short times. > > Adaptive spinning in rwlocks can be improved further with other tricks > (like adding a backoff counter, for example, or trying to spin with > the lock held in read mode too), but we first should be sure to start > with a solid base. I did this for mutexes and rwlocks over a year ago and Kris found it was slower in benchmarks. www.freebsd.org/~jhb/patches/lock_adapt.patch is the last thing I sent kris@ to test (it only has the mutex changes). This might be more optimal post-thread_lock since thread_lock seems to have heavily pessimized adaptive spinning because it now enqueues the thread and then dequeues it again before doing the adaptive spin. I liked the approach orginially because it simplifies the code a lot. A separate issue is that writers don't spin at all if a reader holds the lock, and I think one thing to test for that would be an adaptive spin with a static timeout. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Sep 24 20:54:27 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8B32C16A419; Mon, 24 Sep 2007 20:54:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 213EF13C48D; Mon, 24 Sep 2007 20:54:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.103] (c-67-160-44-208.hsd1.wa.comcast.net [67.160.44.208]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l8OKsO9I012940 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Mon, 24 Sep 2007 16:54:25 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Mon, 24 Sep 2007 13:57:06 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: John Baldwin In-Reply-To: <200709241152.41660.jhb@freebsd.org> Message-ID: <20070924135554.F547@10.0.0.1> References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> <200709241152.41660.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Attilio Rao , freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 24 Sep 2007 20:54:27 -0000 On Mon, 24 Sep 2007, John Baldwin wrote: > On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote: >> Recently several people have reported problems of starvation with rwlocks. >> In particular, users which tried to use rwlock on big SMP environment >> (16+ CPUs) found them rather subjected to poor performances and to >> starvation of waiters. 
>> >> Inspecting the code, something strange about adaptive spinning popped >> up: basically, for rwlocks, adaptive spinning stubs seem to be >> customed too down in the decisioning-loop. >> The desposition of the stub will let the thread that would adaptively >> spin, to set the respecitve (both read or write) waiters flag on, >> which means that the owner of the lock will go down in the hard path >> of locking functions and will performe a full wakeup even if the >> waiters queues can result empty. This is a big penalty for adaptive >> spinning which can make it completely useless. >> In addiction to this, adaptive spinning only runs in the turnstile >> spinlock path which is not ideal. >> This patch ports the approach alredy used for adaptive spinning in sx >> locks to rwlocks: >> http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff >> >> In sx it is unlikely to see big benefits because they are held for too >> long times, but for rwlocks situation is rather different. >> I would like to see if people can do benchmarks with this patch (maybe >> in private environments?) as I'm not able to do them in short times. >> >> Adaptive spinning in rwlocks can be improved further with other tricks >> (like adding a backoff counter, for example, or trying to spin with >> the lock held in read mode too), but we first should be sure to start >> with a solid base. > > I did this for mutexes and rwlocks over a year ago and Kris found it was > slower in benchmarks. www.freebsd.org/~jhb/patches/lock_adapt.patch is the > last thing I sent kris@ to test (it only has the mutex changes). This might > be more optimal post-thread_lock since thread_lock seems to have heavily > pessimized adaptive spinning because it now enqueues the thread and then > dequeues it again before doing the adaptive spin. I liked the approach > orginially because it simplifies the code a lot. A separate issue is that > writers don't spin at all if a reader holds the lock, and I think one thing > to test for that would be an adaptive spin with a static timeout. We don't enqueue the thread until the same place. We just acquire an extra spinlock. The thread is not enqueued until turnstile_wait() as before. 
Jeff > > -- > John Baldwin > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Mon Sep 24 21:23:22 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4F1CA16A417; Mon, 24 Sep 2007 21:23:22 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id D749613C447; Mon, 24 Sep 2007 21:23:21 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8p) with ESMTP id 211191859-1834499 for multiple; Mon, 24 Sep 2007 17:21:18 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id l8OLMZef001128; Mon, 24 Sep 2007 17:22:35 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Jeff Roberson Date: Mon, 24 Sep 2007 17:22:28 -0400 User-Agent: KMail/1.9.6 References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> <200709241152.41660.jhb@freebsd.org> <20070924135554.F547@10.0.0.1> In-Reply-To: <20070924135554.F547@10.0.0.1> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709241722.28670.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Mon, 24 Sep 2007 17:22:35 -0400 (EDT) X-Virus-Scanned: ClamAV 0.88.3/4381/Mon Sep 24 13:20:51 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Attilio Rao , freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 24 Sep 2007 21:23:22 -0000 On Monday 24 September 2007 04:57:06 pm Jeff Roberson wrote: > On Mon, 24 Sep 2007, John Baldwin wrote: > > > On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote: > >> Recently several people have reported problems of starvation with rwlocks. > >> In particular, users which tried to use rwlock on big SMP environment > >> (16+ CPUs) found them rather subjected to poor performances and to > >> starvation of waiters. > >> > >> Inspecting the code, something strange about adaptive spinning popped > >> up: basically, for rwlocks, adaptive spinning stubs seem to be > >> customed too down in the decisioning-loop. > >> The desposition of the stub will let the thread that would adaptively > >> spin, to set the respecitve (both read or write) waiters flag on, > >> which means that the owner of the lock will go down in the hard path > >> of locking functions and will performe a full wakeup even if the > >> waiters queues can result empty. This is a big penalty for adaptive > >> spinning which can make it completely useless. 
> >> In addiction to this, adaptive spinning only runs in the turnstile > >> spinlock path which is not ideal. > >> This patch ports the approach alredy used for adaptive spinning in sx > >> locks to rwlocks: > >> http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff > >> > >> In sx it is unlikely to see big benefits because they are held for too > >> long times, but for rwlocks situation is rather different. > >> I would like to see if people can do benchmarks with this patch (maybe > >> in private environments?) as I'm not able to do them in short times. > >> > >> Adaptive spinning in rwlocks can be improved further with other tricks > >> (like adding a backoff counter, for example, or trying to spin with > >> the lock held in read mode too), but we first should be sure to start > >> with a solid base. > > > > I did this for mutexes and rwlocks over a year ago and Kris found it was > > slower in benchmarks. www.freebsd.org/~jhb/patches/lock_adapt.patch is the > > last thing I sent kris@ to test (it only has the mutex changes). This might > > be more optimal post-thread_lock since thread_lock seems to have heavily > > pessimized adaptive spinning because it now enqueues the thread and then > > dequeues it again before doing the adaptive spin. I liked the approach > > orginially because it simplifies the code a lot. A separate issue is that > > writers don't spin at all if a reader holds the lock, and I think one thing > > to test for that would be an adaptive spin with a static timeout. > > We don't enqueue the thread until the same place. We just acquire an > extra spinlock. The thread is not enqueued until turnstile_wait() as > before. Oh. That's what I get for assuming what trywait() and cancel() did based on their names. It is still more overhead than before though, so simplifying adaptive spinning might still be a win now as opposed to before. 
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Sep 25 17:17:50 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 618D216A418; Tue, 25 Sep 2007 17:17:50 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id D0CD713C459; Tue, 25 Sep 2007 17:17:49 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8p) with ESMTP id 211331053-1834499 for multiple; Tue, 25 Sep 2007 13:15:47 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id l8PHGvJV011655; Tue, 25 Sep 2007 13:17:00 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Jeff Roberson Date: Tue, 25 Sep 2007 13:14:42 -0400 User-Agent: KMail/1.9.6 References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> <200709241152.41660.jhb@freebsd.org> <20070924135554.F547@10.0.0.1> In-Reply-To: <20070924135554.F547@10.0.0.1> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709251314.42707.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Tue, 25 Sep 2007 13:17:01 -0400 (EDT) X-Virus-Scanned: ClamAV 0.88.3/4391/Tue Sep 25 11:10:50 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Attilio Rao , freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Sep 2007 17:17:50 -0000 On Monday 24 September 2007 04:57:06 pm Jeff Roberson wrote: > On Mon, 24 Sep 2007, John Baldwin wrote: > > > On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote: > >> Recently several people have reported problems of starvation with rwlocks. > >> In particular, users which tried to use rwlock on big SMP environment > >> (16+ CPUs) found them rather subjected to poor performances and to > >> starvation of waiters. > >> > >> Inspecting the code, something strange about adaptive spinning popped > >> up: basically, for rwlocks, adaptive spinning stubs seem to be > >> customed too down in the decisioning-loop. > >> The desposition of the stub will let the thread that would adaptively > >> spin, to set the respecitve (both read or write) waiters flag on, > >> which means that the owner of the lock will go down in the hard path > >> of locking functions and will performe a full wakeup even if the > >> waiters queues can result empty. This is a big penalty for adaptive > >> spinning which can make it completely useless. > >> In addiction to this, adaptive spinning only runs in the turnstile > >> spinlock path which is not ideal. 
> >> This patch ports the approach alredy used for adaptive spinning in sx
> >> locks to rwlocks:
> >> http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff
> >>
> >> In sx it is unlikely to see big benefits because they are held for too
> >> long times, but for rwlocks situation is rather different.
> >> I would like to see if people can do benchmarks with this patch (maybe
> >> in private environments?) as I'm not able to do them in short times.
> >>
> >> Adaptive spinning in rwlocks can be improved further with other tricks
> >> (like adding a backoff counter, for example, or trying to spin with
> >> the lock held in read mode too), but we first should be sure to start
> >> with a solid base.
> >
> > I did this for mutexes and rwlocks over a year ago and Kris found it was
> > slower in benchmarks. www.freebsd.org/~jhb/patches/lock_adapt.patch is the
> > last thing I sent kris@ to test (it only has the mutex changes). This might
> > be more optimal post-thread_lock since thread_lock seems to have heavily
> > pessimized adaptive spinning because it now enqueues the thread and then
> > dequeues it again before doing the adaptive spin. I liked the approach
> > orginially because it simplifies the code a lot. A separate issue is that
> > writers don't spin at all if a reader holds the lock, and I think one thing
> > to test for that would be an adaptive spin with a static timeout.
>
> We don't enqueue the thread until the same place. We just acquire an
> extra spinlock. The thread is not enqueued until turnstile_wait() as
> before.

So I looked at this some more and now I don't understand why the trywait() and cancel() changes were done. Some background:

When the initial turnstile code was split out of the mutex code, the main loop in mtx_lock_sleep() looked like this:

        while (!_obtain_lock) {
                ts = turnstile_lookup();        /* (A) locked ts chain and did
                                                   O(n) lookup to find existing ts */

                if (lock unlocked) {                    (B)
                        turnstile_release();
                        continue;
                }

                if (failed to set contested flag) {     (C)
                        turnstile_release();
                        continue;
                }

                if (owner is running) {                 (D)
                        turnstile_release();
                        ... spin ...
                        continue;
                }

                turnstile_wait(ts, ...);                (E)
        }

So one thing I noticed after a while is that every iteration of this loop was doing the O(n) lookup to find the turnstile even though we only used that in (E). The lookup was wasted overhead for the (B), (C), and (D) cases. Hence this commit a while later:

  jhb         2004-10-12 18:36:20 UTC

    FreeBSD src repository

    Modified files:
      sys/kern    kern_condvar.c kern_mutex.c kern_synch.c
                  subr_sleepqueue.c subr_turnstile.c
      sys/sys     sleepqueue.h turnstile.h
    Log:
    Refine the turnstile and sleep queue interfaces just a bit:
    - Add a new _lock() call to each API that locks the associated chain lock
      for a lock_object pointer or wait channel. The _lookup() functions now
      require that the chain lock be locked via _lock() when they are called.
    - Change sleepq_add(), turnstile_wait() and turnstile_claim() to lookup
      the associated queue structure internally via _lookup() rather than
      accepting a pointer from the caller. For turnstiles, this means that
      the actual lookup of the turnstile in the hash table is only done when
      the thread actually blocks rather than being done on each loop iteration
      in _mtx_lock_sleep(). For sleep queues, this means that sleepq_lookup()
      is no longer used outside of the sleep queue code except to implement an
      assertion in cv_destroy().
    - Change sleepq_broadcast() and sleepq_signal() to require that the chain
      lock is already required. For condition variables, this lets the
      cv_broadcast() and cv_signal() functions lock the sleep queue chain lock
      while testing the waiters count. This means that the waiters count
      internal to condition variables is no longer protected by the interlock
      mutex and cv_broadcast() and cv_signal() now no longer require that the
      interlock be held when they are called. This lets consumers of condition
      variables drop the lock before waking other threads which can result in
      fewer context switches.

    MFC after:    1 month

    Revision  Changes    Path
    1.52      +16 -15    src/sys/kern/kern_condvar.c
    1.151     +4 -5      src/sys/kern/kern_mutex.c
    1.263     +4 -3      src/sys/kern/kern_synch.c
    1.13      +31 -14    src/sys/kern/subr_sleepqueue.c
    1.150     +34 -12    src/sys/kern/subr_turnstile.c
    1.5       +11 -12    src/sys/sys/sleepqueue.h
    1.5       +17 -16    src/sys/sys/turnstile.h

After that commit the mutex loop looked like this:

        while (!_obtain_lock) {
                turnstile_lock();               /* (A) locked ts chain only */

                if (lock unlocked) {                    (B)
                        turnstile_release();
                        continue;
                }

                if (failed to set contested flag) {     (C)
                        turnstile_release();
                        continue;
                }

                if (owner is running) {                 (D)
                        turnstile_release();
                        ... spin ...
                        continue;
                }

                turnstile_wait(ts, ...);                /* (E) Does O(n) lookup */
        }

That is, after the above commit, we only did the O(n) lookup if we needed the actual turnstile object in (E). This decreased the overhead for the (B), (C), and (D) cases.

The changes to add trywait() and cancel() basically seem to have reverted the above commit by going back to doing the O(n) lookup in (A), thus re-adding the overhead to the (B), (C), and (D) cases. In addition, trywait() / cancel() actually maintain some additional state and acquire another lock, so (B), (C), and (D) have even more overhead than the first cut of the turnstile code, including having to back out some of the changes in cancel() rather than just doing a single spinlock unlock.

This is actually why I had assumed you had enqueued the turnstile in trywait(), as I thought you had dropped the chain lock in trywait() (which would require enqueuing the thread in trywait(), but would also greatly complicate cancel()). Otherwise, the current trywait() / cancel() changes seem like a step backwards to me.

Do you really think it is necessary to do the O(n) lookup even when we don't need it (B, C, D) rather than just doing it in turnstile_wait() (E)?
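For comparison, the loop shape with trywait() / cancel() that I am describing is roughly the following (a sketch in the same style as above, not the literal code):

        while (!_obtain_lock) {
                ts = turnstile_trywait(&lock_object);   /* locks ts chain, does the
                                                            hash lookup and locks the
                                                            turnstile spin lock */

                if (lock unlocked) {                    (B)
                        turnstile_cancel(ts);           /* backs all of that out */
                        continue;
                }

                if (failed to set contested flag) {     (C)
                        turnstile_cancel(ts);
                        continue;
                }

                if (owner is running) {                 (D)
                        turnstile_cancel(ts);
                        ... spin ...
                        continue;
                }

                turnstile_wait(ts, owner, queue);       (E)
        }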
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Sep 25 23:00:54 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CF99416A418; Tue, 25 Sep 2007 23:00:54 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 7C0C813C44B; Tue, 25 Sep 2007 23:00:54 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.103] (c-67-160-44-208.hsd1.wa.comcast.net [67.160.44.208]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l8PN0qhh066072 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 25 Sep 2007 19:00:53 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 25 Sep 2007 16:03:41 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: John Baldwin In-Reply-To: <200709251314.42707.jhb@freebsd.org> Message-ID: <20070925155504.W556@10.0.0.1> References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> <200709241152.41660.jhb@freebsd.org> <20070924135554.F547@10.0.0.1> <200709251314.42707.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Attilio Rao , freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Sep 2007 23:00:54 -0000 On Tue, 25 Sep 2007, John Baldwin wrote: > On Monday 24 September 2007 04:57:06 pm Jeff Roberson wrote: >> On Mon, 24 Sep 2007, John Baldwin wrote: >> >>> On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote: >>>> Recently several people have reported problems of starvation with > rwlocks. >>>> In particular, users which tried to use rwlock on big SMP environment >>>> (16+ CPUs) found them rather subjected to poor performances and to >>>> starvation of waiters. >>>> >>>> Inspecting the code, something strange about adaptive spinning popped >>>> up: basically, for rwlocks, adaptive spinning stubs seem to be >>>> customed too down in the decisioning-loop. >>>> The desposition of the stub will let the thread that would adaptively >>>> spin, to set the respecitve (both read or write) waiters flag on, >>>> which means that the owner of the lock will go down in the hard path >>>> of locking functions and will performe a full wakeup even if the >>>> waiters queues can result empty. This is a big penalty for adaptive >>>> spinning which can make it completely useless. >>>> In addiction to this, adaptive spinning only runs in the turnstile >>>> spinlock path which is not ideal. >>>> This patch ports the approach alredy used for adaptive spinning in sx >>>> locks to rwlocks: >>>> http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff >>>> >>>> In sx it is unlikely to see big benefits because they are held for too >>>> long times, but for rwlocks situation is rather different. >>>> I would like to see if people can do benchmarks with this patch (maybe >>>> in private environments?) as I'm not able to do them in short times. 
>>>> >>>> Adaptive spinning in rwlocks can be improved further with other tricks >>>> (like adding a backoff counter, for example, or trying to spin with >>>> the lock held in read mode too), but we first should be sure to start >>>> with a solid base. >>> >>> I did this for mutexes and rwlocks over a year ago and Kris found it was >>> slower in benchmarks. www.freebsd.org/~jhb/patches/lock_adapt.patch is > the >>> last thing I sent kris@ to test (it only has the mutex changes). This > might >>> be more optimal post-thread_lock since thread_lock seems to have heavily >>> pessimized adaptive spinning because it now enqueues the thread and then >>> dequeues it again before doing the adaptive spin. I liked the approach >>> orginially because it simplifies the code a lot. A separate issue is that >>> writers don't spin at all if a reader holds the lock, and I think one > thing >>> to test for that would be an adaptive spin with a static timeout. >> >> We don't enqueue the thread until the same place. We just acquire an >> extra spinlock. The thread is not enqueued until turnstile_wait() as >> before. > > So I looked at this some more and now I don't understand why the trywait() and > cancel() changes were done. Some background: I wanted the turnstile chain lock to protect only the association between the turnstile and lock. This way it further reduces the contention domain for the hard lock/unlock cases. We sometimes see a lot of contention on the turnstile chain that may be due to hash collisions. If this reduced contention doesn't help I think it would be possible to revert this change. I originally had more of the code running without the turnstile chain held but while trying to reduce diffs to understand the problem better it eventually returned to cover almost all of the hard paths. > > When the initial turnstile code was split out of the the mutex code, the main > loop in mtx_lock_sleep() looked like this: > > while (!_obtain_lock) { > ts = turnstile_lookup(); /* (A) locked ts chain > and did O(n) lookup > to find existing ts */ I'm not sure why you're calling this O(n). It's O(n) with the number of hash collisions, but it's a hash lookup. We should consider using a real hash algorithm here if the distribution is bad. Thanks, Jeff > > if (lock unlocked) { (B) > turnstile_release(); > continue; > } > > if (failed to set contested flag) { (C) > turnstile_release(); > continue; > } > > if (owner is running) { (D) > turnstile_release(); > ... spin ... > continue; > } > > turnstile_wait(ts, ...); (E) > } > > So one thing I noticed after a while is that every iteration of this loop was > doing the O(n) lookup to find the turnstile even though we only used that in > (E). The lookup was wasted overhead for the (B), (C), and (D) cases. Hence > this commit a while later: > > jhb 2004-10-12 18:36:20 UTC > > FreeBSD src repository > > Modified files: > sys/kern kern_condvar.c kern_mutex.c kern_synch.c > subr_sleepqueue.c subr_turnstile.c > sys/sys sleepqueue.h turnstile.h > Log: > Refine the turnstile and sleep queue interfaces just a bit: > - Add a new _lock() call to each API that locks the associated chain lock > for a lock_object pointer or wait channel. The _lookup() functions now > require that the chain lock be locked via _lock() when they are called. > - Change sleepq_add(), turnstile_wait() and turnstile_claim() to lookup > the associated queue structure internally via _lookup() rather than > accepting a pointer from the caller. 
For turnstiles, this means that > the actual lookup of the turnstile in the hash table is only done when > the thread actually blocks rather than being done on each loop iteration > in _mtx_lock_sleep(). For sleep queues, this means that sleepq_lookup() > is no longer used outside of the sleep queue code except to implement an > assertion in cv_destroy(). > - Change sleepq_broadcast() and sleepq_signal() to require that the chain > lock is already required. For condition variables, this lets the > cv_broadcast() and cv_signal() functions lock the sleep queue chain lock > while testing the waiters count. This means that the waiters count > internal to condition variables is no longer protected by the interlock > mutex and cv_broadcast() and cv_signal() now no longer require that the > interlock be held when they are called. This lets consumers of condition > variables drop the lock before waking other threads which can result in > fewer context switches. > > MFC after: 1 month > > Revision Changes Path > 1.52 +16 -15 src/sys/kern/kern_condvar.c > 1.151 +4 -5 src/sys/kern/kern_mutex.c > 1.263 +4 -3 src/sys/kern/kern_synch.c > 1.13 +31 -14 src/sys/kern/subr_sleepqueue.c > 1.150 +34 -12 src/sys/kern/subr_turnstile.c > 1.5 +11 -12 src/sys/sys/sleepqueue.h > 1.5 +17 -16 src/sys/sys/turnstile.h > > After that commit the mutex loop looked like this: > > while (!_obtain_lock) { > turnstile_lock(); /* (A) locked ts chain only */ > > if (lock unlocked) { (B) > turnstile_release(); > continue; > } > > if (failed to set contested flag) { (C) > turnstile_release(); > continue; > } > > if (owner is running) { (D) > turnstile_release(); > ... spin ... > continue; > } > > turnstile_wait(ts, ...); /* (E) Does O(n) lookup */ > } > > That is, after the above commit, we only did the O(n) lookup if we needed the > actual turnstile object in (E). This decreased the overhead for the (B), > (C), and (D) cases. > > The changes to add trywait() and cancel() basically seem to have reverted the > above commit by going back to doing the O(n) lookup in (A) thus re-adding the > overhead to the (B), (C), and (D) cases. In addition trywait() / cancel() > actually maintain some additional state and acquire another lock so that (B), > (C), and (D) have even more overhead than the first cut of the turnstile code > including having to back out some of the changes in cancel() rather than just > a single spinlock unlock. > > This is actually why I had assumed you had enqueued the turnstile in trywait() > as I thought you had dropped the chain lock in trywait() (which would require > enqueing the thread in trywait(), but would also greatly complicate > cancel()). Otherwise, the current trywait() / cancel() changes seem like a > step backwards to me. > > Do you really think it is necessary to do the O(n) lookup even when we don't > need it (B, C, D) rather than just doing it in turnstile_wait() (E)? 
> > -- > John Baldwin > From owner-freebsd-arch@FreeBSD.ORG Wed Sep 26 05:09:24 2007 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0E17016A468; Wed, 26 Sep 2007 05:09:24 +0000 (UTC) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (gate.funkthat.com [69.17.45.168]) by mx1.freebsd.org (Postfix) with ESMTP id A9FC813C48A; Wed, 26 Sep 2007 05:09:23 +0000 (UTC) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (gkvdzsioz46jvzap@localhost.funkthat.com [127.0.0.1]) by hydrogen.funkthat.com (8.13.6/8.13.3) with ESMTP id l8Q4s261083635; Tue, 25 Sep 2007 21:54:02 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: (from jmg@localhost) by hydrogen.funkthat.com (8.13.6/8.13.3/Submit) id l8Q4s2hA083634; Tue, 25 Sep 2007 21:54:02 -0700 (PDT) (envelope-from jmg) Date: Tue, 25 Sep 2007 21:54:01 -0700 From: John-Mark Gurney To: Hans Petter Selasky Message-ID: <20070926045401.GB47467@funkthat.com> Mail-Followup-To: Hans Petter Selasky , freebsd-arch@FreeBSD.org References: <200709260131.49156.hselasky@c2i.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200709260131.49156.hselasky@c2i.net> User-Agent: Mutt/1.4.2.1i X-Operating-System: FreeBSD 5.4-RELEASE-p6 i386 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-3.0 (hydrogen.funkthat.com [127.0.0.1]); Tue, 25 Sep 2007 21:54:03 -0700 (PDT) Cc: freebsd-arch@FreeBSD.org Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Sep 2007 05:09:24 -0000 Hans Petter Selasky wrote this message on Wed, Sep 26, 2007 at 01:31 +0200: > Please keep me CC'ed, hence I'm not on all these lists. > > In the kernel we currently have two different data backstores: > > struct mbuf > > and > > struct buf > > These two backstores serve two different device types. "mbufs" are for network > devices and "buf" is for disk devices. I don't see how this relates to the rest of your email, but even though they are used similarly, their normal size is quite different... mbufs normally contain 64-256 byte packets, w/ large file transfers attaching a 2k cluster (which comes from a different pool than the core mbuf) to the mbuf... buf is usually something like 16k-64k... > Problem: > > The current backstores are loaded into DMA by using the BUS-DMA framework. > This appears not to be too fast according to Kip Macy. See: > > http://perforce.freebsd.org/chv.cgi?CH=126455 This only works on x86/amd64 because of the direct mapped memory that they support.. This would complete break arches like sparc64 that require an iommu to translate the addresses... and also doesn't address keeping the buffers in sync on arches like arm... sparc64 may have many gigs of memory, but only a 2GB window for mapping main memory... It sounds like the x86/amd64 bus_dma implementation needs to be improved to run more quickly... As w/ all things, you can hardcode stuff, but then you loose portability... 
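For reference, the portable pattern being talked about here is the usual bus_dma tag/map/load sequence, where the callback is handed whatever bus addresses the platform decided on (direct-mapped physical addresses, IOMMU translations, or bounce pages). A minimal sketch, with illustrative names (example_dma_cb, example_load_page) and error handling trimmed:

        #include <sys/param.h>
        #include <sys/bus.h>
        #include <machine/bus.h>

        static void
        example_dma_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
        {
                /* segs[i].ds_addr is the address the device must use; it is
                   only valid until the map is unloaded. */
                if (error == 0)
                        *(bus_addr_t *)arg = segs[0].ds_addr;
        }

        static bus_addr_t
        example_load_page(bus_dma_tag_t parent, void *buf)
        {
                bus_dma_tag_t tag;
                bus_dmamap_t map;
                bus_addr_t busaddr = 0;

                /* 32-bit addressable, one segment of at most PAGE_SIZE. */
                bus_dma_tag_create(parent, 1, 0, BUS_SPACE_MAXADDR_32BIT,
                    BUS_SPACE_MAXADDR, NULL, NULL, PAGE_SIZE, 1, PAGE_SIZE, 0,
                    NULL, NULL, &tag);
                bus_dmamap_create(tag, 0, &map);
                bus_dmamap_load(tag, map, buf, PAGE_SIZE, example_dma_cb,
                    &busaddr, BUS_DMA_NOWAIT);
                bus_dmamap_sync(tag, map, BUS_DMASYNC_PREWRITE);
                return (busaddr);
        }

On x86/amd64 the callback usually just sees the physical address, on sparc64 it sees an IOMMU translation, and on a constrained device the framework may substitute a bounce page; the driver code stays the same either way.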
> Some ideas I have:
>
> When a buffer is out out of range for a hardware device and a data-copy is
> needed I want to simply copy that data in smaller parts to/from a
> pre-allocated bounce buffer. I want to avoid allocating this buffer
> when "bus_dmamap_load()" is called.
>
> For pre-allocated USB DMA memory I currently have:
>
> struct usbd_page
>
> struct usbd_page {
>   void *buffer;             // virtual address
>   bus_size_t physaddr;      // as seen by one of my devices
>   bus_dma_tag_t tag;
>   bus_dmamap_t map;
>   uint32_t length;
> };
>
> Mostly only "length == PAGE_SIZE" is allowed. When USB allocates DMA memory
> it allocates the same size all the way and that is PAGE_SIZE bytes.

I could see attaching preallocated memory to a tag, and having maps
that attempt to use this memory, but that's something else...

> If two different PCI controllers want to communicate directly passing DMA
> buffers, technically one would need to translate the physical address for
> device 1 to the physical address as seen by device 2. If this translation
> table is sorted, the search will be rather quick. Another approach is to
> limit the number of translations:
>
> #define N_MAX_PCI_TRANSLATE 4
>
> struct usbd_page {
>   void *buffer;             // virtual address
>   bus_size_t physaddr[N_MAX_PCI_TRANSLATE];
>   bus_dma_tag_t tag;
>   bus_dmamap_t map;
>   uint32_t length;
> };
>
> Then PCI device 1 on bus X can use physaddr[0] and PCI device 2 on bus Y can
> use physaddr[1]. If the physaddr[] is equal to some magic then the DMA buffer
> is not reachable and must be bounced.
>
> Then when two PCI devices talk together all they need to pass is a structure
> like this:
>
> struct usbd_page_cache {
>   struct usbd_page *page_start;
>   uint32_t page_offset_buf;
>   uint32_t page_offset_end;
> };
>
> And the required DMA address is looked up in some nanos.
>
> Has someone been thinking about this topic before ?

There is no infrastructure to support passing dma addresses between hardware
devices, and it is completely unrelated to the issues raised above... This
requires the ability to pass in a map to a tag and create a new map... It is
possible, as on the sun4v where you have two iommu's.. You'd have to program
one iommu to point to the other one, to support that... But it is rare to see
devices dma directly to each other... You usually end up dma'ing to main
memory, and then having the other device dma it out of memory.. The only time
you need to dma between devices is if one has local memory, and the other
device is able to sanely populate it... This is very rare...

Also, the PCI bus list can get quite long.. With PCIe, each device is now
its own PCI bus, so you're starting to see PCI bus counts in the 10's and
20's, if not higher.. Having an array of all of those, and calculating them
and filling them out, sounds like a huge expense...

I'm a bit puzzled as to what you wanted to solve, as the problem you stated
doesn't relate to the solutions you were thinking about... Maybe I'm missing
something? Can you give me an example of where cxgb is writing to the memory
on another pci bus, and not main memory?

P.S. I redirected to -arch as this seems more related than the other lists...

-- 
John-Mark Gurney                              Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."
From owner-freebsd-arch@FreeBSD.ORG Wed Sep 26 06:59:40 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6779816A420 for ; Wed, 26 Sep 2007 06:59:40 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.189]) by mx1.freebsd.org (Postfix) with ESMTP id F0C0613C467 for ; Wed, 26 Sep 2007 06:59:39 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: by nf-out-0910.google.com with SMTP id b2so1632871nfb for ; Tue, 25 Sep 2007 23:59:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; bh=Yyd0Kt0q2A1GRkUwT62/Rn5k/Erc5kwRjhXvfl2yw4c=; b=MiM9PpjFpF5kG7MEi3NRo3G26GIsXyekFqjPGVXDxrb3agBk5Yyl0SpIZ2JO66Dh2QlU9dHk1t5LozGpue4vgVZf4zadJWWiBPeiz7g9DcKNg6gbYMF7PE6IuAzVZd6ybhG/eOHTWYTnGMmNFyY/rIP1oRiPmShtdjoZhYs2UxA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=Oj+kNUOsyEwDXAQHLqSKUqT6aM5UjPrZV5Z0oHPI4N/NwwJGe2XbIqh3J8FB2D5bgB9N8nwTfstrjzK8V98x7yTuUWL3GzmgO6GUKHyMxXpgbjruIDbQ08XJvbSKqRS3ODpIJoQNWndQmrYmPc4lOJTkcO1WQFY5XY35AmGgPAw= Received: by 10.78.131.8 with SMTP id e8mr151374hud.1190789977617; Tue, 25 Sep 2007 23:59:37 -0700 (PDT) Received: by 10.78.97.18 with HTTP; Tue, 25 Sep 2007 23:59:37 -0700 (PDT) Message-ID: <3bbf2fe10709252359h7812ff7bsce9d393f80b65c06@mail.gmail.com> Date: Wed, 26 Sep 2007 08:59:37 +0200 From: "Attilio Rao" Sender: asmrookie@gmail.com To: "Jeff Roberson" In-Reply-To: <20070925155504.W556@10.0.0.1> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> <200709241152.41660.jhb@freebsd.org> <20070924135554.F547@10.0.0.1> <200709251314.42707.jhb@freebsd.org> <20070925155504.W556@10.0.0.1> X-Google-Sender-Auth: 12c5eeab311d1215 Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Sep 2007 06:59:40 -0000 2007/9/26, Jeff Roberson : > On Tue, 25 Sep 2007, John Baldwin wrote: > > > > > So I looked at this some more and now I don't understand why the trywait() and > > cancel() changes were done. Some background: > > I wanted the turnstile chain lock to protect only the association between > the turnstile and lock. This way it further reduces the contention domain > for the hard lock/unlock cases. We sometimes see a lot of contention on > the turnstile chain that may be due to hash collisions. If this reduced > contention doesn't help I think it would be possible to revert this > change. Well, the change jhb is suggesting won't reduce tc_lock contention for sure, but it will prevent the turnstile lookup/locking if not necessary. Mainly, we need a spinlock held in order to protect mtx_lock bitfield (as well as rw_lock) by setting waiters flags. 
While in the past code this lock was only tc_lock, currently mtx_lock is protected by both tc_lock and ts_lock, which doesn't seem strictly necessary. If, for example, you lock tc_lock, then you scan the chain and lock ts_lock (basically what turnstile_trywait() does), and later you find the lock is released or that the state of the lock changed so you cannot set a waiters flag, you did the ts_lock locking and hash table scanning even though it wasn't needed.

> I'm not sure why you're calling this O(n). It's O(n) with the number of
> hash collisions, but it's a hash lookup. We should consider using a real
> hash algorithm here if the distribution is bad.

Well, the hash lookup (worst case) has O(n) complexity. This linear complexity comes from the O(1) * O(n) which is the access to the list and the scanning of it, so it reduces to the same complexity as a linked list.

However, some time ago I saw empirically that, starting from a 'struct thread' allocated address, the highest level of entropy in the word came after a right shift of 10 bits (for ia32) and 12 bits for amd64. Currently, turnstile and sleepqueues only use 8 bits of right shifting.

Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG Wed Sep 26 15:52:12 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7B0C116A476; Wed, 26 Sep 2007 15:52:12 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 5157413C469; Wed, 26 Sep 2007 15:52:12 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.103] (c-67-160-44-208.hsd1.wa.comcast.net [67.160.44.208]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l8QFq9hX054745 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Wed, 26 Sep 2007 11:52:10 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Wed, 26 Sep 2007 08:54:57 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Attilio Rao In-Reply-To: <3bbf2fe10709252359h7812ff7bsce9d393f80b65c06@mail.gmail.com> Message-ID: <20070926084705.T556@10.0.0.1> References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> <200709241152.41660.jhb@freebsd.org> <20070924135554.F547@10.0.0.1> <200709251314.42707.jhb@freebsd.org> <20070925155504.W556@10.0.0.1> <3bbf2fe10709252359h7812ff7bsce9d393f80b65c06@mail.gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Sep 2007 15:52:12 -0000 On Wed, 26 Sep 2007, Attilio Rao wrote: > 2007/9/26, Jeff Roberson : >> On Tue, 25 Sep 2007, John Baldwin wrote: >> >>> >>> So I looked at this some more and now I don't understand why the trywait() and >>> cancel() changes were done. Some background: >> >> I wanted the turnstile chain lock to protect only the association between >> the turnstile and lock. This way it further reduces the contention domain >> for the hard lock/unlock cases.
We sometimes see a lot of contention on >> the turnstile chain that may be due to hash collisions. If this reduced >> contention doesn't help I think it would be possible to revert this >> change. > > Well, the change jhb is suggesting won't reduce tc_lock contention for > sure, but it will prevent the turnstile lookup/locking if not > necessary. > Mainly, we need a spinlock held in order to protect mtx_lock bitfield > (as well as rw_lock) by setting waiters flags. > While in the past code this lock was only tc_lock, currently mtx_lock > is both protected by tc_lock and ts_lock, which it doesn't seem > strictly necessary. > if, for example, you lock tc_lock than you scan the chain and lock > ts_lock (basically what turnstile_trywait() does) and later you find > lock is released or that the state of lock changed and you cannot set > a waiters flag, you did the ts_locking and hash table scanning even if > it wasn't needed. Yes I understand, the question is whether the decreased complexity is a bigger win than the reduced contention. The contention is only reduced if there are collisions in the table. So probably jhb is right, we should revert this part of the change. I need to make sure there was not some other reason for it. I had convinced myself at the time that it was necessary, however, I can't recall why. > >> I'm not sure why you're calling this O(n). It's O(n) with the number of >> hash collisions, but it's a hash lookup. We should consider using a real >> hash algorithm here if the distribution is bad. > > Well, the hash lookup (worst case) has O(n) complexity. > This linear complexity cames from the O(1) * O(n) which is the access > to the list and the scanning of this one. So this is reduced to the > same complexity of a linked list. sure, I just don't think it's common to call a hash O(n). > However, some time ago I saw empirically that starting from a 'struct > thread' allocated address the highest level of entropy in the word > came after a right shift of 10 bits (for ia32) and 12 bits for amd64. > Currently, turnstile and sleepqueues only uses 8 bits of right > shifting. I haven't studied the hash distribution in turnstiles, I know there is some debugging code in there for it so jhb must have. Perhaps he can say how it does. > > Thanks, > Attilio > > > -- > Peace can only be achieved by understanding - A. 
Einstein > From owner-freebsd-arch@FreeBSD.ORG Wed Sep 26 17:44:39 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BC96D16A417 for ; Wed, 26 Sep 2007 17:44:39 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe05.swip.net [212.247.154.129]) by mx1.freebsd.org (Postfix) with ESMTP id 41E7413C447 for ; Wed, 26 Sep 2007 17:44:38 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [85.19.218.45] (account mc467741@c2i.net [85.19.218.45] verified) by mailfe05.swip.net (CommuniGate Pro SMTP 5.1.10) with ESMTPA id 527612352; Wed, 26 Sep 2007 18:44:35 +0200 From: Hans Petter Selasky To: freebsd-arch@freebsd.org, John-Mark Gurney Date: Wed, 26 Sep 2007 18:44:55 +0200 User-Agent: KMail/1.9.7 References: <200709260131.49156.hselasky@c2i.net> <20070926045401.GB47467@funkthat.com> In-Reply-To: <20070926045401.GB47467@funkthat.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709261844.56182.hselasky@c2i.net> Cc: Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Sep 2007 17:44:39 -0000 Hi John-Mark, See my comments below. On Wednesday 26 September 2007, John-Mark Gurney wrote: > Hans Petter Selasky wrote this message on Wed, Sep 26, 2007 at 01:31 +0200: > > Please keep me CC'ed, hence I'm not on all these lists. > > > > In the kernel we currently have two different data backstores: > > > > struct mbuf > > > > and > > > > struct buf > > > > These two backstores serve two different device types. "mbufs" are for > > network devices and "buf" is for disk devices. > > I don't see how this relates to the rest of your email, but even though > they are used similarly, their normal size is quite different... mbufs > normally contain 64-256 byte packets, w/ large file transfers attaching > a 2k cluster (which comes from a different pool than the core mbuf) to > the mbuf... buf is usually something like 16k-64k... > > > Problem: > > > > The current backstores are loaded into DMA by using the BUS-DMA > > framework. This appears not to be too fast according to Kip Macy. See: > > > > http://perforce.freebsd.org/chv.cgi?CH=126455 > > This only works on x86/amd64 because of the direct mapped memory that > they support.. This would complete break arches like sparc64 that > require an iommu to translate the addresses... and also doesn't address > keeping the buffers in sync on arches like arm... sparc64 may have many > gigs of memory, but only a 2GB window for mapping main memory... > > It sounds like the x86/amd64 bus_dma implementation needs to be improved > to run more quickly... As w/ all things, you can hardcode stuff, but then > you loose portability... Correct. > > > Some ideas I have: > > > > When a buffer is out out of range for a hardware device and a data-copy > > is needed I want to simply copy that data in smaller parts to/from a > > pre-allocated bounce buffer. I want to avoid allocating this buffer when > > "bus_dmamap_load()" is called. 
> > > > For pre-allocated USB DMA memory I currently have: > > > > struct usbd_page > > > > struct usbd_page { > > void *buffer; // virtual address > > bus_size_t physaddr; // as seen by one of my devices > > bus_dma_tag_t tag; > > bus_dmamap_t map; > > uint32_t length; > > }; > > > > Mostly only "length == PAGE_SIZE" is allowed. When USB allocates DMA > > memory it allocates the same size all the way and that is PAGE_SIZE > > bytes. > > I could see attaching preallocated memory to a tag, and having maps > that attempt to use this memory, but that's something else... > > > If two different PCI controllers want to communicate directly passing DMA > > buffers, technically one would need to translate the physical address for > > device 1 to the physical address as seen by device 2. If this translation > > table is sorted, the search will be rather quick. Another approach is to > > limit the number of translations: > > > > #define N_MAX_PCI_TRANSLATE 4 > > > > struct usbd_page { > > void *buffer; // virtual address > > bus_size_t physaddr[N_MAX_PCI_TRANSLATE]; > > bus_dma_tag_t tag; > > bus_dmamap_t map; > > uint32_t length; > > }; > > > > Then PCI device 1 on bus X can use physaddr[0] and PCI device 2 on bus Y > > can use physaddr[1]. If the physaddr[] is equal to some magic then the > > DMA buffer is not reachable and must be bounced. > > > > Then when two PCI devices talk together all they need to pass is a > > structure like this: > > > > struct usbd_page_cache { > > struct usbd_page *page_start; > > uint32_t page_offset_buf; > > uint32_t page_offset_end; > > }; > > > > And the required DMA address is looked up in some nanos. > > > > Has someone been thinking about this topic before ? > > There is no infastructure to support passing dma address between hardware > devices, and is complete unrelated to the issues raised above... This > requires the ability to pass in a map to a tag and create a new map... > It is possible, as on the sun4v where you have two iommu's.. You'd have > to program on iommu to point to the other one, to support that... But > it is rare to see devices to dma directly to each other... You usually > end up dma'ing to main memory, and then having the other device dma it > out of memory.. The only time you need to dma between devices is if one > has local memory, and the other device is able to sanely populate it... > This is very rare... What I meant was that USB dma directly into main memory. But then another PCI device like an Ethernet device might want to forward that data by dma'ing it out of main memory. The dma address as seen by the two different PCI devices might not be the same. I admit that I'm not an expert on how DMA is done on the Sparc, but could you explain a little bit more how a mbuf is loaded into DMA for a network card on the Sparc ? What I'm looking for is a function that transforms a virtual memory address and a bus-DMA tag into a physical address without blocking. If a mapping is not possible I want that an error be returned so that I can bounce the data using a pre-allocate buffer, and not a buffer allocated by bus_dma on the fly. > > Also, the PCI bus length can get quite long.. With PCIe, each device is > now it's own PCI bus, so you're starting to see PCI bus counts in the > 10's and 20's, if not higher.. having an area of all of those, and > calculating them and filling them out sounds like a huge expense... 
> > I'm a bit puzzeled as to what you wanted to solve, as the problem you > stated doesn't relate to the solutions you were thinking about... Maybe > I'm missing something? Can you give me an example of where cxgb is > writing to the memory on another pci bus, and not main memory? This is not the case with USB. The only example I know if is some TV-cards that write directly into the frame buffer of the video-card. --HPS From owner-freebsd-arch@FreeBSD.ORG Wed Sep 26 18:32:52 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9AB8C16A418 for ; Wed, 26 Sep 2007 18:32:52 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 50AAA13C45D for ; Wed, 26 Sep 2007 18:32:52 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8p) with ESMTP id 211528369-1834499 for multiple; Wed, 26 Sep 2007 14:31:28 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id l8QIWPRj024716; Wed, 26 Sep 2007 14:32:37 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-arch@freebsd.org Date: Wed, 26 Sep 2007 14:03:57 -0400 User-Agent: KMail/1.9.6 References: <200709260131.49156.hselasky@c2i.net> <20070926045401.GB47467@funkthat.com> <200709261844.56182.hselasky@c2i.net> In-Reply-To: <200709261844.56182.hselasky@c2i.net> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709261403.58349.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Wed, 26 Sep 2007 14:32:37 -0400 (EDT) X-Virus-Scanned: ClamAV 0.88.3/4405/Wed Sep 26 12:29:18 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: John-Mark Gurney , Hans Petter Selasky Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Sep 2007 18:32:52 -0000 On Wednesday 26 September 2007 12:44:55 pm Hans Petter Selasky wrote: > What I meant was that USB dma directly into main memory. But then another PCI > device like an Ethernet device might want to forward that data by dma'ing it > out of main memory. The dma address as seen by the two different PCI devices > might not be the same. > > I admit that I'm not an expert on how DMA is done on the Sparc, but could you > explain a little bit more how a mbuf is loaded into DMA for a network card on > the Sparc ? > > What I'm looking for is a function that transforms a virtual memory address > and a bus-DMA tag into a physical address without blocking. If a mapping is > not possible I want that an error be returned so that I can bounce the data > using a pre-allocate buffer, and not a buffer allocated by bus_dma on the > fly. bus_dma already preallocates bounce pages when you create a map. 
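For example, a tag restricted to 32-bit addresses has whatever bounce pages it may need reserved as soon as the map is created, roughly like this (sketched from memory of the API; the alignment and size limits are only examples):

#include <sys/param.h>
#include <machine/bus.h>

static bus_dma_tag_t	example_tag;
static bus_dmamap_t	example_map;

static int
example_dma_setup(void)
{
	int error;

	error = bus_dma_tag_create(NULL,	/* parent */
	    1, 0,				/* alignment, boundary */
	    BUS_SPACE_MAXADDR_32BIT,		/* lowaddr */
	    BUS_SPACE_MAXADDR,			/* highaddr */
	    NULL, NULL,				/* filter, filterarg */
	    MAXPHYS, 1, MAXPHYS,		/* maxsize, nsegments, maxsegsz */
	    0,					/* flags */
	    NULL, NULL,				/* lockfunc, lockarg */
	    &example_tag);
	if (error != 0)
		return (error);

	/*
	 * Any bounce pages this tag/map combination may ever need are
	 * reserved here, not later in bus_dmamap_load().
	 */
	return (bus_dmamap_create(example_tag, 0, &example_map));
}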
I would just try using the existing bus_dma first w/o trying to use private buffers, etc. That said, there is one case where this could be useful: for disk dumps it might be nice to pre-allocate a known "safe" set of pages to use for bouncing pages (such as on i386 with PAE with a controller that does 32-bit addressing) that can't be mapped directly. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Sep 26 18:33:10 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A6FBC16A418; Wed, 26 Sep 2007 18:33:10 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail6.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 4519E13C4AA; Wed, 26 Sep 2007 18:33:10 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8p) with ESMTP id 211528358-1834499 for multiple; Wed, 26 Sep 2007 14:31:23 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id l8QIWPRi024716; Wed, 26 Sep 2007 14:32:32 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Jeff Roberson Date: Wed, 26 Sep 2007 13:40:07 -0400 User-Agent: KMail/1.9.6 References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> <3bbf2fe10709252359h7812ff7bsce9d393f80b65c06@mail.gmail.com> <20070926084705.T556@10.0.0.1> In-Reply-To: <20070926084705.T556@10.0.0.1> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709261340.08344.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Wed, 26 Sep 2007 14:32:32 -0400 (EDT) X-Virus-Scanned: ClamAV 0.88.3/4405/Wed Sep 26 12:29:18 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Attilio Rao , freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Sep 2007 18:33:10 -0000 On Wednesday 26 September 2007 11:54:57 am Jeff Roberson wrote: > On Wed, 26 Sep 2007, Attilio Rao wrote: > > > 2007/9/26, Jeff Roberson : > >> On Tue, 25 Sep 2007, John Baldwin wrote: > >> > >>> > >>> So I looked at this some more and now I don't understand why the trywait() and > >>> cancel() changes were done. Some background: > >> > >> I wanted the turnstile chain lock to protect only the association between > >> the turnstile and lock. This way it further reduces the contention domain > >> for the hard lock/unlock cases. We sometimes see a lot of contention on > >> the turnstile chain that may be due to hash collisions. If this reduced > >> contention doesn't help I think it would be possible to revert this > >> change. > > > > Well, the change jhb is suggesting won't reduce tc_lock contention for > > sure, but it will prevent the turnstile lookup/locking if not > > necessary. 
> > Mainly, we need a spinlock held in order to protect the mtx_lock bitfield > > (as well as rw_lock) by setting waiters flags. > > While in the past code this lock was only tc_lock, currently mtx_lock > > is protected by both tc_lock and ts_lock, which doesn't seem > > strictly necessary. > > If, for example, you lock tc_lock, then you scan the chain and lock > > ts_lock (basically what turnstile_trywait() does) and later you find > > the lock is released or that the state of the lock changed and you cannot set > > a waiters flag, you did the ts_locking and hash table scanning even if > > it wasn't needed. > > Yes, I understand, the question is whether the decreased complexity is a > bigger win than the reduced contention. The contention is only reduced if > there are collisions in the table. So probably jhb is right, we should > revert this part of the change. I need to make sure there was not some > other reason for it. I had convinced myself at the time that it was > necessary, however, I can't recall why. Actually, since you still hold the tc_lock the entire time (trywait() doesn't drop it), I don't think the trywait() / cancel() changes affect spin lock contention at all. I think it's more just a matter of overhead. If you do want to drop the tc_lock after the lookup in trywait() then I think the added complexity you would need to add (queueing the thread in trywait() to avoid a race with concurrent contesters on the "real" lock and then backing all that out in cancel()) would be very hairy. > >> I'm not sure why you're calling this O(n). It's O(n) with the number of > >> hash collisions, but it's a hash lookup. We should consider using a real > >> hash algorithm here if the distribution is bad. > > > > Well, the hash lookup (worst case) has O(n) complexity. > > This linear complexity comes from the O(1) * O(n) which is the access > > to the list and the scanning of it. So this is reduced to the > > same complexity as a linked list. > > sure, I just don't think it's common to call a hash O(n). Well, the lookup of the chain in the bucket is what I call O(n). Getting to the bucket (which you always have to do to lock the turnstile chain) is O(1). The change I'm suggesting would only defer walking the chain to turnstile_wait(), not finding the index in the bucket. It would also defer grabbing the turnstile spin lock, which probably dwarfs the cost of the chain walk in practice. > > However, some time ago I saw empirically that starting from a 'struct > > thread' allocated address the highest level of entropy in the word > > came after a right shift of 10 bits (for ia32) and 12 bits (for amd64). > > Currently, turnstiles and sleepqueues only use 8 bits of right > > shifting. > > I haven't studied the hash distribution in turnstiles, I know there is > some debugging code in there for it so jhb must have. Perhaps he can say > how it does. I don't think I ever saw it go over 10. Note that the comments about 'struct thread' aren't quite relevant, as sleepqueues and turnstiles don't use thread pointers for hashing, but wait channels and lock pointers. That said, I only chose those values based on what tsleep() in 4.x used for its hash for the 4.x sleep queues.
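For reference, the bucket lookup being discussed is roughly the following (quoted from memory, so the exact constants may differ): hashing is by lock pointer, so finding the bucket is O(1) and only the walk of the colliding entries in that chain is O(n).

#define TC_TABLESIZE	128			/* must be a power of two */
#define TC_MASK		(TC_TABLESIZE - 1)
#define TC_SHIFT	8
#define TC_HASH(lock)	(((uintptr_t)(lock) >> TC_SHIFT) & TC_MASK)
#define TC_LOOKUP(lock)	&turnstile_chains[TC_HASH(lock)]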
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Sep 26 18:51:09 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B3C2716A419; Wed, 26 Sep 2007 18:51:09 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 8DC6A13C457; Wed, 26 Sep 2007 18:51:09 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.103] (c-67-160-44-208.hsd1.wa.comcast.net [67.160.44.208]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l8QIp60G013788 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Wed, 26 Sep 2007 14:51:07 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Wed, 26 Sep 2007 11:53:54 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: John Baldwin In-Reply-To: <200709261340.08344.jhb@freebsd.org> Message-ID: <20070926114943.X556@10.0.0.1> References: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com> <3bbf2fe10709252359h7812ff7bsce9d393f80b65c06@mail.gmail.com> <20070926084705.T556@10.0.0.1> <200709261340.08344.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Attilio Rao , freebsd-smp@freebsd.org, freebsd-arch@freebsd.org Subject: Re: rwlocks: poor performance with adaptive spinning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Sep 2007 18:51:09 -0000 On Wed, 26 Sep 2007, John Baldwin wrote: > On Wednesday 26 September 2007 11:54:57 am Jeff Roberson wrote: >> On Wed, 26 Sep 2007, Attilio Rao wrote: >> >>> 2007/9/26, Jeff Roberson : >>>> On Tue, 25 Sep 2007, John Baldwin wrote: >>>> >>>>> >>>>> So I looked at this some more and now I don't understand why the > trywait() and >>>>> cancel() changes were done. Some background: >>>> >>>> I wanted the turnstile chain lock to protect only the association between >>>> the turnstile and lock. This way it further reduces the contention > domain >>>> for the hard lock/unlock cases. We sometimes see a lot of contention on >>>> the turnstile chain that may be due to hash collisions. If this reduced >>>> contention doesn't help I think it would be possible to revert this >>>> change. >>> >>> Well, the change jhb is suggesting won't reduce tc_lock contention for >>> sure, but it will prevent the turnstile lookup/locking if not >>> necessary. >>> Mainly, we need a spinlock held in order to protect mtx_lock bitfield >>> (as well as rw_lock) by setting waiters flags. >>> While in the past code this lock was only tc_lock, currently mtx_lock >>> is both protected by tc_lock and ts_lock, which it doesn't seem >>> strictly necessary. >>> if, for example, you lock tc_lock than you scan the chain and lock >>> ts_lock (basically what turnstile_trywait() does) and later you find >>> lock is released or that the state of lock changed and you cannot set >>> a waiters flag, you did the ts_locking and hash table scanning even if >>> it wasn't needed. >> >> Yes I understand, the question is whether the decreased complexity is a >> bigger win than the reduced contention. The contention is only reduced if >> there are collisions in the table. 
So probably jhb is right, we should >> revert this part of the change. I need to make sure there was not some >> other reason for it. I had convinced myself at the time that it was >> necessary, however, I can't recall why. > > Actually, since you still hold the tc_lock the entire time (trywait() doesn't > drop it), I don't think the trywait() / cancel() changes affect spin lock > contention at all. I think it's more just a matter of overhead. > > If you do want to drop the tc_lock after the lookup in trywait() then I think > the added complexity you would need to add (queue'ing the thread in trywait() > to avoid a race with concurrent contesters on the "real" lock and then > backing all that out in cancel()) would be very hairy. Yes it's not for concurrency in lock, but in unlock. If the chain gets as deep as 10 there may be some practical value in reducing contention on that chain. We should look at this more before we rush into anything. Right now unlock holds the chain lock for almost the entire duration but it's not really necessary. I think more experimentation is warranted especially since I did not actually measure a slowdown in heavily contended workloads. If we can reduce the number of hash collisions significantly then your approach is definitely a win. However it may be less so if we do not. Thanks, Jeff > >>>> I'm not sure why you're calling this O(n). It's O(n) with the number of >>>> hash collisions, but it's a hash lookup. We should consider using a real >>>> hash algorithm here if the distribution is bad. >>> >>> Well, the hash lookup (worst case) has O(n) complexity. >>> This linear complexity cames from the O(1) * O(n) which is the access >>> to the list and the scanning of this one. So this is reduced to the >>> same complexity of a linked list. >> >> sure, I just don't think it's common to call a hash O(n). > > Well, the lookup of the chain in the bucket is what I call O(n). Getting to > the bucket (which you always have to do to lock the turnstile chain) is O(1). > The change I'm suggesting would only defer walking the chain to > turnstile_wait(), not finding the index in the bucket. It would also defer > grabbing the turnstile spin lock which probably dwarfs the cost of the chain > walk in practice. > >>> However, some time ago I saw empirically that starting from a 'struct >>> thread' allocated address the highest level of entropy in the word >>> came after a right shift of 10 bits (for ia32) and 12 bits for amd64. >>> Currently, turnstile and sleepqueues only uses 8 bits of right >>> shifting. >> >> I haven't studied the hash distribution in turnstiles, I know there is >> some debugging code in there for it so jhb must have. Perhaps he can say >> how it does. > > I don't think I ever saw it go over 10. Note that the comments about 'struct > thread' aren't quite relevant as sleepqueue's and turnstile's don't use > thread pointers for hashing, but wait channels and lock pointers. That said, > I only used the values they used based on what tsleep() in 4.x used for its > hash for the 4.x sleep queues. 
> > -- > John Baldwin > From owner-freebsd-arch@FreeBSD.ORG Wed Sep 26 19:56:57 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9A11716A419; Wed, 26 Sep 2007 19:56:57 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe04.swip.net [212.247.154.97]) by mx1.freebsd.org (Postfix) with ESMTP id 7ADDF13C4A6; Wed, 26 Sep 2007 19:56:56 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [85.19.218.45] (account mc467741@c2i.net [85.19.218.45] verified) by mailfe04.swip.net (CommuniGate Pro SMTP 5.1.10) with ESMTPA id 626638699; Wed, 26 Sep 2007 21:56:43 +0200 From: Hans Petter Selasky To: Scott Long Date: Wed, 26 Sep 2007 21:57:00 +0200 User-Agent: KMail/1.9.7 References: <200709260131.49156.hselasky@c2i.net> <46FA9C04.9060603@samsco.org> In-Reply-To: <46FA9C04.9060603@samsco.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709262157.02712.hselasky@c2i.net> Cc: freebsd-scsi@freebsd.org, freebsd-arch@freebsd.org, freebsd-usb@freebsd.org, freebsd-net@freebsd.org Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Sep 2007 19:56:57 -0000 Hi Scott, The discussion has been moved to "freebsd-arch@freebsd.org". Please only reply there next time. On Wednesday 26 September 2007, Scott Long wrote: > Hans Petter Selasky wrote: > > Hi, > > > > Please keep me CC'ed, hence I'm not on all these lists. > > > > In the kernel we currently have two different data backstores: > > > > struct mbuf > > > > and > > > > struct buf > > > > These two backstores serve two different device types. "mbufs" are for > > network devices and "buf" is for disk devices. > > > > Problem: > > > > The current backstores are loaded into DMA by using the BUS-DMA > > framework. This appears not to be too fast according to Kip Macy. See: > > > > http://perforce.freebsd.org/chv.cgi?CH=126455 > > Busdma isn't fast enough for 1Gb and 10Gb network drivers that are > trying to max out their packet rates. It's still fine for storage > drivers and other slow/medium speed device drivers, like USB and > Firewire. I am working on techniques to make the API easier to use > and the implementation fast enough for the aforementioned devices. That's great! > > > Some ideas I have: > > > > When a buffer is out out of range for a hardware device and a data-copy > > is needed I want to simply copy that data in smaller parts to/from a > > pre-allocated bounce buffer. I want to avoid allocating this buffer when > > "bus_dmamap_load()" is called. > > I think you've missed the point of having architecture portable drivers. > John-Mark describes this further. I use the bus_dma framework to allocate and sync all DMA memory, and I assume that is correct. > It also makes little sense to push > the responsibility for handling bounce buffers out of a central > subsystem and back into every driver. 
I'm thinking about putting some wrappers around the "bus_dmamap_load()" function like: void usbd_rx_buf_load(struct usbd_xfer *xfer, void *buf, uint32_t framebuffer_offset, uint32_t framebuffer_index, uint32_t len); void usbd_tx_buf_load(struct usbd_xfer *xfer, void *buf, uint32_t framebuffer_offset, uint32_t framebuffer_index, uint32_t len); But I haven't figured out all the details yet. The "usbd_xxx_load()" functions should automagically figure out what is best to do and it won't be more than a few lines of code. --HPS From owner-freebsd-arch@FreeBSD.ORG Thu Sep 27 02:17:13 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 960F316A417 for ; Thu, 27 Sep 2007 02:17:13 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.freebsd.org (Postfix) with ESMTP id EBBB113C458 for ; Thu, 27 Sep 2007 02:17:12 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from phobos.samsco.home (phobos.samsco.home [192.168.254.11]) (authenticated bits=0) by pooker.samsco.org (8.13.8/8.13.8) with ESMTP id l8R1DPix065986; Wed, 26 Sep 2007 19:13:26 -0600 (MDT) (envelope-from scottl@samsco.org) Message-ID: <46FB03B1.6020100@samsco.org> Date: Wed, 26 Sep 2007 19:13:21 -0600 From: Scott Long User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4 MIME-Version: 1.0 To: Hans Petter Selasky References: <200709260131.49156.hselasky@c2i.net> <46FA9C04.9060603@samsco.org> <200709262157.02712.hselasky@c2i.net> In-Reply-To: <200709262157.02712.hselasky@c2i.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (pooker.samsco.org [168.103.85.57]); Wed, 26 Sep 2007 19:13:26 -0600 (MDT) X-Spam-Status: No, score=-1.4 required=5.5 tests=ALL_TRUSTED autolearn=failed version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on pooker.samsco.org Cc: freebsd-scsi@freebsd.org, freebsd-arch@freebsd.org, freebsd-usb@freebsd.org, freebsd-net@freebsd.org Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Sep 2007 02:17:13 -0000 Hans Petter Selasky wrote: > Hi Scott, > > The discussion has been moved to "freebsd-arch@freebsd.org". Please only reply > there next time. > > On Wednesday 26 September 2007, Scott Long wrote: >> Hans Petter Selasky wrote: >>> Hi, >>> >>> Please keep me CC'ed, hence I'm not on all these lists. >>> >>> In the kernel we currently have two different data backstores: >>> >>> struct mbuf >>> >>> and >>> >>> struct buf >>> >>> These two backstores serve two different device types. "mbufs" are for >>> network devices and "buf" is for disk devices. >>> >>> Problem: >>> >>> The current backstores are loaded into DMA by using the BUS-DMA >>> framework. This appears not to be too fast according to Kip Macy. See: >>> >>> http://perforce.freebsd.org/chv.cgi?CH=126455 >> Busdma isn't fast enough for 1Gb and 10Gb network drivers that are >> trying to max out their packet rates. It's still fine for storage >> drivers and other slow/medium speed device drivers, like USB and >> Firewire. 
I am working on techniques to make the API easier to use >> and the implementation fast enough for the aforementioned devices. > > That's great! > Well, the point is that I'm not sure why you're so worried about performance issues with USB and busdma. Do you have any test data that shows that it's a problem? >>> Some ideas I have: >>> >>> When a buffer is out out of range for a hardware device and a data-copy >>> is needed I want to simply copy that data in smaller parts to/from a >>> pre-allocated bounce buffer. I want to avoid allocating this buffer when >>> "bus_dmamap_load()" is called. >> I think you've missed the point of having architecture portable drivers. >> John-Mark describes this further. > > I use the bus_dma framework to allocate and sync all DMA memory, and I assume > that is correct. > That is one of the uses of busdma, yes. The other is to handle buffers that have been allocated elsewhere in the system. >> It also makes little sense to push >> the responsibility for handling bounce buffers out of a central >> subsystem and back into every driver. > > I'm thinking about putting some wrappers around the "bus_dmamap_load()" > function like: > > void usbd_rx_buf_load(struct usbd_xfer *xfer, void *buf, > uint32_t framebuffer_offset, uint32_t framebuffer_index, uint32_t len); > > void usbd_tx_buf_load(struct usbd_xfer *xfer, void *buf, > uint32_t framebuffer_offset, uint32_t framebuffer_index, uint32_t len); > > But I haven't figured out all the details yet. The "usbd_xxx_load()" functions > should automagically figure out what is best to do and it won't be more than > a few lines of code. Can you describe what these two wrappers are supposed to do? Are you expecting bus_dmamap_load() to operate synchronously? Scott From owner-freebsd-arch@FreeBSD.ORG Fri Sep 28 15:53:17 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6A30616A41A for ; Fri, 28 Sep 2007 15:53:17 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe07.swip.net [212.247.154.193]) by mx1.freebsd.org (Postfix) with ESMTP id CE4F813C459 for ; Fri, 28 Sep 2007 15:53:16 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [85.19.218.45] (account mc467741@c2i.net [85.19.218.45] verified) by mailfe07.swip.net (CommuniGate Pro SMTP 5.1.10) with ESMTPA id 631171706; Fri, 28 Sep 2007 17:53:14 +0200 From: Hans Petter Selasky To: freebsd-arch@freebsd.org Date: Fri, 28 Sep 2007 17:53:29 +0200 User-Agent: KMail/1.9.7 References: <200709260131.49156.hselasky@c2i.net> <200709262157.02712.hselasky@c2i.net> <46FB03B1.6020100@samsco.org> In-Reply-To: <46FB03B1.6020100@samsco.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709281753.30367.hselasky@c2i.net> Cc: Scott Long Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Sep 2007 15:53:17 -0000 On Thursday 27 September 2007, Scott Long wrote: > Hans Petter Selasky wrote: > > Hi Scott, > > > > The discussion has been moved to "freebsd-arch@freebsd.org". Please only > > reply there next time. 
> > > > On Wednesday 26 September 2007, Scott Long wrote: > >> Hans Petter Selasky wrote: > >>> Hi, > >>> > >>> Please keep me CC'ed, hence I'm not on all these lists. > >>> > >>> In the kernel we currently have two different data backstores: > >>> > >>> struct mbuf > >>> > >>> and > >>> > >>> struct buf > >>> > >>> These two backstores serve two different device types. "mbufs" are for > >>> network devices and "buf" is for disk devices. > >>> > >>> Problem: > >>> > >>> The current backstores are loaded into DMA by using the BUS-DMA > >>> framework. This appears not to be too fast according to Kip Macy. See: > >>> > >>> http://perforce.freebsd.org/chv.cgi?CH=126455 > >> > >> Busdma isn't fast enough for 1Gb and 10Gb network drivers that are > >> trying to max out their packet rates. It's still fine for storage > >> drivers and other slow/medium speed device drivers, like USB and > >> Firewire. I am working on techniques to make the API easier to use > >> and the implementation fast enough for the aforementioned devices. > > > > That's great! > > Well, the point is that I'm not sure why you're so worried about > performance issues with USB and busdma. Do you have any test data that > shows that it's a problem? It is a problem on embedded devices. Typically not for an ordinary PC user. > > >>> Some ideas I have: > >>> > >>> When a buffer is out out of range for a hardware device and a data-copy > >>> is needed I want to simply copy that data in smaller parts to/from a > >>> pre-allocated bounce buffer. I want to avoid allocating this buffer > >>> when "bus_dmamap_load()" is called. > >> > >> I think you've missed the point of having architecture portable drivers. > >> John-Mark describes this further. > > > > I use the bus_dma framework to allocate and sync all DMA memory, and I > > assume that is correct. > > That is one of the uses of busdma, yes. The other is to handle buffers > that have been allocated elsewhere in the system. Ok. > > >> It also makes little sense to push > >> the responsibility for handling bounce buffers out of a central > >> subsystem and back into every driver. > > > > I'm thinking about putting some wrappers around the "bus_dmamap_load()" > > function like: > > > > void usbd_rx_buf_load(struct usbd_xfer *xfer, void *buf, > > uint32_t framebuffer_offset, uint32_t framebuffer_index, uint32_t len); > > > > void usbd_tx_buf_load(struct usbd_xfer *xfer, void *buf, > > uint32_t framebuffer_offset, uint32_t framebuffer_index, uint32_t len); > > > > But I haven't figured out all the details yet. The "usbd_xxx_load()" > > functions should automagically figure out what is best to do and it won't > > be more than a few lines of code. > > Can you describe what these two wrappers are supposed to do? Are you > expecting bus_dmamap_load() to operate synchronously? I will come back to this question after the weekend, if you don't mind. Right now I'm in a hurry. 
--HPS From owner-freebsd-arch@FreeBSD.ORG Fri Sep 28 20:15:26 2007 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E721416A419 for ; Fri, 28 Sep 2007 20:15:26 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 857DC13C45A for ; Fri, 28 Sep 2007 20:15:26 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.1/8.13.4) with ESMTP id l8SKDjwO001274; Fri, 28 Sep 2007 14:13:46 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Fri, 28 Sep 2007 14:13:45 -0600 (MDT) Message-Id: <20070928.141345.78789158.imp@bsdimp.com> To: hselasky@c2i.net From: Warner Losh In-Reply-To: <200709281753.30367.hselasky@c2i.net> References: <200709262157.02712.hselasky@c2i.net> <46FB03B1.6020100@samsco.org> <200709281753.30367.hselasky@c2i.net> X-Mailer: Mew version 3.3 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0 (harmony.bsdimp.com [127.0.0.1]); Fri, 28 Sep 2007 14:13:47 -0600 (MDT) Cc: scottl@samsco.org, freebsd-arch@FreeBSD.org Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Sep 2007 20:15:27 -0000 Hans Petter Selasky wrote: > > Well, the point is that I'm not sure why you're so worried about > > performance issues with USB and busdma. Do you have any test data that > > shows that it's a problem? > > It is a problem on embedded devices. Typically not for an ordinary PC user. I'm curious. What's the problem? 
Warner From owner-freebsd-arch@FreeBSD.ORG Sat Sep 29 11:08:45 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CF47316A420 for ; Sat, 29 Sep 2007 11:08:45 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe13.swipnet.se [212.247.155.129]) by mx1.freebsd.org (Postfix) with ESMTP id 6FC7F13C457 for ; Sat, 29 Sep 2007 11:08:45 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [193.217.102.3] (account mc467741@c2i.net HELO [10.0.0.249]) by mailfe13.swip.net (CommuniGate Pro SMTP 5.1.10) with ESMTPA id 238201938; Sat, 29 Sep 2007 13:08:43 +0200 From: Hans Petter Selasky To: Warner Losh Date: Sat, 29 Sep 2007 13:09:04 +0200 User-Agent: KMail/1.9.7 References: <200709262157.02712.hselasky@c2i.net> <200709281753.30367.hselasky@c2i.net> <20070928.141345.78789158.imp@bsdimp.com> In-Reply-To: <20070928.141345.78789158.imp@bsdimp.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709291309.05747.hselasky@c2i.net> Cc: scottl@samsco.org, freebsd-arch@freebsd.org Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Sep 2007 11:08:45 -0000 On Friday 28 September 2007, Warner Losh wrote: > Hans Petter Selasky wrote: > > > Well, the point is that I'm not sure why you're so worried about > > > performance issues with USB and busdma. Do you have any test data that > > > shows that it's a problem? > > > > It is a problem on embedded devices. Typically not for an ordinary PC > > user. > > I'm curious. What's the problem? Technically, if you load a buffer pointer and length that is not cache aligned nor host aligned then you should bounce all the data no matter what. If you do not bounce under those conditions you might get trouble. struct my_softc { uint8_t hdr[2]; uint8_t eth_pkt[1500]; uint8_t other_important_stuff[546]; } __aligned(PAGE_SIZE); Can "eth_pkt" be loaded into DMA ? No, it has an offset of 2. struct my_softc { uint8_t eth_pkt[1500]; uint8_t other_important_stuff[548]; } __aligned(PAGE_SIZE); Can "eth_pkt" be loaded into DMA ? No, "other_important_stuff" might get lost when the cache is invalidated. struct my_softc { uint8_t eth_pkt[1500]; } __aligned(PAGE_SIZE); Can "eth_pkt" be loaded into DMA ? Yes, I see no problem. Is it guaranteed that the "m_data" field of a mbuf will always be correctly aligned for the CPU/DMA cache in addition to the host alignment ? Does bus_dma support bouncing using multiple pages instead of using one contiguous segment ? Please correct me if I am wrong. 
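A layout that avoids the second problem might look like this, assuming a 32-byte D-cache line; the constant is only a placeholder, the real value would come from the platform:

#include <sys/param.h>		/* roundup() */

#define MY_CACHE_LINE	32	/* assumed D-cache line size */

struct my_softc {
	/*
	 * The DMA buffer is aligned to, and padded out to, whole cache
	 * lines, so a writeback/invalidate of the buffer can never touch
	 * the fields that follow it.
	 */
	uint8_t	eth_pkt[roundup(1500, MY_CACHE_LINE)] __aligned(MY_CACHE_LINE);
	uint8_t	other_important_stuff[546];
};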
--HPS From owner-freebsd-arch@FreeBSD.ORG Sat Sep 29 15:12:26 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 05DEB16A421 for ; Sat, 29 Sep 2007 15:12:26 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 96F9913C47E for ; Sat, 29 Sep 2007 15:12:25 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.1/8.13.4) with ESMTP id l8TF9rXE034409; Sat, 29 Sep 2007 09:09:54 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Sat, 29 Sep 2007 09:09:56 -0600 (MDT) Message-Id: <20070929.090956.-1889954517.imp@bsdimp.com> To: hselasky@c2i.net From: "M. Warner Losh" In-Reply-To: <200709291309.05747.hselasky@c2i.net> References: <200709281753.30367.hselasky@c2i.net> <20070928.141345.78789158.imp@bsdimp.com> <200709291309.05747.hselasky@c2i.net> X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0 (harmony.bsdimp.com [127.0.0.1]); Sat, 29 Sep 2007 09:09:54 -0600 (MDT) Cc: scottl@samsco.org, freebsd-arch@freebsd.org Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Sep 2007 15:12:26 -0000 In message: <200709291309.05747.hselasky@c2i.net> Hans Petter Selasky writes: : On Friday 28 September 2007, Warner Losh wrote: : > Hans Petter Selasky wrote: : > > > Well, the point is that I'm not sure why you're so worried about : > > > performance issues with USB and busdma. Do you have any test data that : > > > shows that it's a problem? : > > : > > It is a problem on embedded devices. Typically not for an ordinary PC : > > user. : > : > I'm curious. What's the problem? : : Technically, if you load a buffer pointer and length that is not cache aligned : nor host aligned then you should bounce all the data no matter what. If you : do not bounce under those conditions you might get trouble. Why do you say this? It doesn't match my experience. : struct my_softc { : uint8_t hdr[2]; : uint8_t eth_pkt[1500]; : uint8_t other_important_stuff[546]; : } __aligned(PAGE_SIZE); : : Can "eth_pkt" be loaded into DMA ? No, it has an offset of 2. This may or may not be able to load. It all depends on ethernet controller's requirements for alignment. : struct my_softc { : uint8_t eth_pkt[1500]; : uint8_t other_important_stuff[548]; : } __aligned(PAGE_SIZE); : : Can "eth_pkt" be loaded into DMA ? No, "other_important_stuff" might get lost : when the cache is invalidated. On MIPS and ARM you can invalidate/flush specific ranges of memory without the need for them to be aligned to a cache-line. There may be other reasons why you can't, but mere alignment isn't it. : struct my_softc { : uint8_t eth_pkt[1500]; : } __aligned(PAGE_SIZE); : : Can "eth_pkt" be loaded into DMA ? Yes, I see no problem. Maybe. Some designs require that you do DMA only to/from a specific range of memory. So while it meets alignment requirements, it may not be allocated from the proper range of memory. 
In these designs, bouncing is almost forced on the driver writer. : Is it guaranteed that the "m_data" field of a mbuf will always be correctly : aligned for the CPU/DMA cache in addition to the host alignment ? : : Does bus_dma support bouncing using multiple pages instead of using one : contiguous segment ? : : Please correct me if I am wrong. The busdma interface is designed to hide these issues. There are some problems with the busdma interface, but they are minor in comparison to the portability it buys. I use the busdma interface for my Ethernet driver on the AT91RM9200. There's some weirdness for that part because it had alignment requirements that are bad for IP: the ethernet header is longword aligned, necessarily forcing the ip header not to be. Warner From owner-freebsd-arch@FreeBSD.ORG Sat Sep 29 15:35:31 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 92CF416A46B for ; Sat, 29 Sep 2007 15:35:31 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe04.swip.net [212.247.154.97]) by mx1.freebsd.org (Postfix) with ESMTP id 00C0513C481 for ; Sat, 29 Sep 2007 15:35:30 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [193.217.102.3] (account mc467741@c2i.net HELO [10.0.0.249]) by mailfe04.swip.net (CommuniGate Pro SMTP 5.1.10) with ESMTPA id 630145779; Sat, 29 Sep 2007 17:35:27 +0200 From: Hans Petter Selasky To: "M. Warner Losh" Date: Sat, 29 Sep 2007 17:35:47 +0200 User-Agent: KMail/1.9.7 References: <200709281753.30367.hselasky@c2i.net> <200709291309.05747.hselasky@c2i.net> <20070929.090956.-1889954517.imp@bsdimp.com> In-Reply-To: <20070929.090956.-1889954517.imp@bsdimp.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200709291735.48451.hselasky@c2i.net> Cc: scottl@samsco.org, freebsd-arch@freebsd.org Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Sep 2007 15:35:31 -0000 Hi Warner, On Saturday 29 September 2007, M. Warner Losh wrote: > In message: <200709291309.05747.hselasky@c2i.net> > > Hans Petter Selasky writes: > : On Friday 28 September 2007, Warner Losh wrote: > : > Hans Petter Selasky wrote: > : > > > Well, the point is that I'm not sure why you're so worried about > : > > > performance issues with USB and busdma. Do you have any test data > : > > > that shows that it's a problem? > : > > > : > > It is a problem on embedded devices. Typically not for an ordinary PC > : > > user. > : > > : > I'm curious. What's the problem? > : > : Technically, if you load a buffer pointer and length that is not cache > : aligned nor host aligned then you should bounce all the data no matter > : what. If you do not bounce under those conditions you might get trouble. > > Why do you say this? It doesn't match my experience. Because if the buffer start is not cache aligned, you will do cache operations on data before the actual address. If the length is not aligned, you will do do cache operations data after the end of the buffer. You cannot do cache operations on one byte at a time, nor 4-bytes. You have to do cache-line bytes at a time ? Or am I wrong ? 
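In other words, a line-granular writeback or invalidate of an unaligned buffer necessarily covers a rounded-out range, something like this sketch (assuming a 32-byte line):

#include <sys/param.h>		/* roundup2() */

#define CACHE_LINE	32	/* assumed D-cache line size */

/* The span a line-granular cache operation on [buf, buf + len) really covers. */
static void
cache_op_span(const void *buf, size_t len, uintptr_t *start, uintptr_t *end)
{
	*start = (uintptr_t)buf & ~((uintptr_t)CACHE_LINE - 1);
	*end = roundup2((uintptr_t)buf + len, CACHE_LINE);
	/*
	 * Any bytes between *start and *end that are not part of buf get
	 * written back or invalidated along with it.
	 */
}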
> > : struct my_softc { > : uint8_t hdr[2]; > : uint8_t eth_pkt[1500]; > : uint8_t other_important_stuff[546]; > : } __aligned(PAGE_SIZE); > : > : Can "eth_pkt" be loaded into DMA ? No, it has an offset of 2. > > This may or may not be able to load. It all depends on ethernet > controller's requirements for alignment. > > : struct my_softc { > : uint8_t eth_pkt[1500]; > : uint8_t other_important_stuff[548]; > : } __aligned(PAGE_SIZE); > : > : Can "eth_pkt" be loaded into DMA ? No, "other_important_stuff" might get > : lost when the cache is invalidated. > > On MIPS and ARM you can invalidate/flush specific ranges of memory > without the need for them to be aligned to a cache-line. There may be > other reasons why you can't, but mere alignment isn't it. Does the cache instruction use byte granularity ? > > : struct my_softc { > : uint8_t eth_pkt[1500]; > : } __aligned(PAGE_SIZE); > : > : Can "eth_pkt" be loaded into DMA ? Yes, I see no problem. > > Maybe. Some designs require that you do DMA only to/from a specific > range of memory. So while it meets alignment requirements, it may not > be allocated from the proper range of memory. In these designs, > bouncing is almost forced on the driver writer. Yes, that might be correct. > > : Is it guaranteed that the "m_data" field of a mbuf will always be > : correctly aligned for the CPU/DMA cache in addition to the host alignment > : ? > : > : Does bus_dma support bouncing using multiple pages instead of using one > : contiguous segment ? > : > : Please correct me if I am wrong. > > The busdma interface is designed to hide these issues. There are some > problems with the busdma interface, but they are minor in comparison > to the portability it buys. I use the busdma interface for my > Ethernet driver on the AT91RM9200. There's some weirdness for that > part because it had alignment requirements that are bad for IP: the > ethernet header is longword aligned, necessarily forcing the ip header > not to be. BTW: I'm going to add support for the USB device part on this chip in not too long. 
--HPS From owner-freebsd-arch@FreeBSD.ORG Sat Sep 29 16:20:50 2007 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CEA6B16A419 for ; Sat, 29 Sep 2007 16:20:50 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.freebsd.org (Postfix) with ESMTP id 60B2713C45D for ; Sat, 29 Sep 2007 16:20:50 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from [192.168.254.15] (ydesk.samsco.home [192.168.254.15]) (authenticated bits=0) by pooker.samsco.org (8.13.8/8.13.8) with ESMTP id l8TGKfhN084918; Sat, 29 Sep 2007 10:20:42 -0600 (MDT) (envelope-from scottl@samsco.org) Message-ID: <46FE7B59.7080003@samsco.org> Date: Sat, 29 Sep 2007 10:20:41 -0600 From: Scott Long User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.7) Gecko/20050416 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Hans Petter Selasky References: <200709262157.02712.hselasky@c2i.net> <200709281753.30367.hselasky@c2i.net> <20070928.141345.78789158.imp@bsdimp.com> <200709291309.05747.hselasky@c2i.net> In-Reply-To: <200709291309.05747.hselasky@c2i.net> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (pooker.samsco.org [192.168.254.1]); Sat, 29 Sep 2007 10:20:42 -0600 (MDT) X-Spam-Status: No, score=-1.4 required=5.5 tests=ALL_TRUSTED autolearn=failed version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on pooker.samsco.org Cc: freebsd-arch@freebsd.org Subject: Re: Request for feedback on common data backstore in the kernel X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Sep 2007 16:20:50 -0000 Hans Petter Selasky wrote: > On Friday 28 September 2007, Warner Losh wrote: > >>Hans Petter Selasky wrote: >> >>>>Well, the point is that I'm not sure why you're so worried about >>>>performance issues with USB and busdma. Do you have any test data that >>>>shows that it's a problem? >>> >>>It is a problem on embedded devices. Typically not for an ordinary PC >>>user. >> >>I'm curious. What's the problem? > > > Technically, if you load a buffer pointer and length that is not cache aligned > nor host aligned then you should bounce all the data no matter what. If you > do not bounce under those conditions you might get trouble. > > struct my_softc { > uint8_t hdr[2]; > uint8_t eth_pkt[1500]; > uint8_t other_important_stuff[546]; > } __aligned(PAGE_SIZE); > > Can "eth_pkt" be loaded into DMA ? No, it has an offset of 2. > > struct my_softc { > uint8_t eth_pkt[1500]; > uint8_t other_important_stuff[548]; > } __aligned(PAGE_SIZE); > > Can "eth_pkt" be loaded into DMA ? No, "other_important_stuff" might get lost > when the cache is invalidated. > > struct my_softc { > uint8_t eth_pkt[1500]; > } __aligned(PAGE_SIZE); > > Can "eth_pkt" be loaded into DMA ? Yes, I see no problem. > > Is it guaranteed that the "m_data" field of a mbuf will always be correctly > aligned for the CPU/DMA cache in addition to the host alignment ? > As Warner said, a lot of this is dependent on the device hardware, not the CPU, and it's a problem that is already well understood and addressed in those areas in FreeBSD. 
> Does bus_dma support bouncing using multiple pages instead of using one > contiguous segment ? Yes, busdma bounces on a page-by-page basis. In fact, when doing re-alignments, it'll attempt to bounce only the beginning segment instead of the entire buffer. > > Please correct me if I am wrong. busdma is something that is both woefully underdocumented and much more powerful and featureful than what is available in most other OS's, so please feel free to ask more questions like these. Scott
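To close with the usual pattern this thread keeps coming back to, sketched from memory of the API with error handling and the deferred-callback case omitted: the callback receives the device-visible segment addresses, and the PREWRITE sync is where any bounce copying actually happens.

#include <sys/param.h>
#include <machine/bus.h>

struct xfer {
	bus_dma_tag_t	tag;
	bus_dmamap_t	map;
	bus_addr_t	paddr;		/* first segment, device-visible */
};

static void
xfer_load_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
{
	struct xfer *xp = arg;

	if (error == 0 && nseg > 0)
		xp->paddr = segs[0].ds_addr;	/* may point at a bounce page */
}

static int
xfer_start(struct xfer *xp, void *buf, bus_size_t len)
{
	int error;

	error = bus_dmamap_load(xp->tag, xp->map, buf, len,
	    xfer_load_cb, xp, BUS_DMA_NOWAIT);
	if (error != 0)
		return (error);
	/* Copy into bounce pages (if any) and make the data device-visible. */
	bus_dmamap_sync(xp->tag, xp->map, BUS_DMASYNC_PREWRITE);
	/* ... hand xp->paddr to the hardware here ... */
	return (0);
}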