From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 07:27:56 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6876D1065697; Fri, 17 Sep 2010 07:27:56 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 420198FC0A; Fri, 17 Sep 2010 07:27:54 +0000 (UTC) Received: from porto.topspin.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id KAA17650; Fri, 17 Sep 2010 10:27:53 +0300 (EEST) (envelope-from avg@freebsd.org) Received: from localhost.topspin.kiev.ua ([127.0.0.1]) by porto.topspin.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1OwVMP-00087M-0t; Fri, 17 Sep 2010 10:27:53 +0300 Message-ID: <4C931878.803@freebsd.org> Date: Fri, 17 Sep 2010 10:27:52 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.2.9) Gecko/20100912 Lightning/1.0b2 Thunderbird/3.1.3 MIME-Version: 1.0 To: John Baldwin References: <4C4DB2B8.9080404@freebsd.org> <201007270935.52082.jhb@freebsd.org> <4C531ED7.9010601@cs.rice.edu> <201007301614.40768.jhb@freebsd.org> In-Reply-To: <201007301614.40768.jhb@freebsd.org> X-Enigmail-Version: 1.1.2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: alc@freebsd.org, Alan Cox , freebsd-arch@freebsd.org Subject: Re: amd64: change VM_KMEM_SIZE_SCALE to 1? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Sep 2010 07:27:56 -0000 on 30/07/2010 23:14 John Baldwin said the following: > I think this is much better. My strawman was rather hackish in that it was > layering a hack on top of the existing calculations. I prefer your approach. 
> I do not think penalizing amd64 machines with less than 1.5GB is a big worry > as most x86 machines with a small amount of memory are probably running as > i386 anyway. Given that, I would probably lean towards 1/8 instead of 1/7, > but I would be happy with either one. Alan, John, are you planning to commit the vnodes limit patch or a version of it? -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 07:42:59 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 71E9B1065670 for ; Fri, 17 Sep 2010 07:42:59 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id BBD8A8FC08 for ; Fri, 17 Sep 2010 07:42:58 +0000 (UTC) Received: from porto.topspin.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id KAA17791; Fri, 17 Sep 2010 10:42:57 +0300 (EEST) (envelope-from avg@freebsd.org) Received: from localhost.topspin.kiev.ua ([127.0.0.1]) by porto.topspin.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1OwVay-00088h-VV; Fri, 17 Sep 2010 10:42:56 +0300 Message-ID: <4C931C00.3070803@freebsd.org> Date: Fri, 17 Sep 2010 10:42:56 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.2.9) Gecko/20100912 Lightning/1.0b2 Thunderbird/3.1.3 MIME-Version: 1.0 To: freebsd-arch@freebsd.org References: <4C4DB2B8.9080404@freebsd.org> <4C88944C.5060603@freebsd.org> In-Reply-To: <4C88944C.5060603@freebsd.org> X-Enigmail-Version: 1.1.2 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-currrent@freebsd.org Subject: amd64: VM_KMEM_SIZE_SCALE changed to 1 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Fri, 17 Sep 2010 07:42:59 -0000 on 09/09/2010 11:01 Andriy Gapon said the following: > on 26/07/2010 19:07 Andriy Gapon said the following: >> >> Anyone knows any reason why VM_KMEM_SIZE_SCALE on amd64 should not be set to 1? >> I mean things potentially breaking, or some unpleasant surprise for an >> administrator/user... > > So, after having the discussion, what is our collective conclusion? > a) Go for it! > or > b) Don't do it, fool! > or > c) Let's wait another year... Nobody said (b), so: http://svn.freebsd.org/viewvc/base?view=revision&revision=212784 This thread in Gmane for your convenience: http://thread.gmane.org/gmane.os.freebsd.architechture/13419/focus=13551 -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 08:16:51 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 67F6D1065672; Fri, 17 Sep 2010 08:16:51 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 823518FC19; Fri, 17 Sep 2010 08:16:50 +0000 (UTC) Received: from porto.topspin.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id LAA18728; Fri, 17 Sep 2010 11:16:49 +0300 (EEST) (envelope-from avg@freebsd.org) Received: from localhost.topspin.kiev.ua ([127.0.0.1]) by porto.topspin.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1OwW7k-0008BC-Ri; Fri, 17 Sep 2010 11:16:48 +0300 Message-ID: <4C9323F0.90500@freebsd.org> Date: Fri, 17 Sep 2010 11:16:48 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.2.9) Gecko/20100912 Lightning/1.0b2 Thunderbird/3.1.3 MIME-Version: 1.0 To: freebsd-arch@freebsd.org References: <4C4DB2B8.9080404@freebsd.org> <4C88944C.5060603@freebsd.org> In-Reply-To: <4C88944C.5060603@freebsd.org> 
X-Enigmail-Version: 1.1.2 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-current@freebsd.org Subject: amd64: VM_KMEM_SIZE_SCALE changed to 1 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Sep 2010 08:16:51 -0000 [re-post, my address book was polluted with cu_rrr_ent@ entry, sorry] on 09/09/2010 11:01 Andriy Gapon said the following: > on 26/07/2010 19:07 Andriy Gapon said the following: >> >> Anyone knows any reason why VM_KMEM_SIZE_SCALE on amd64 should not be set to 1? >> I mean things potentially breaking, or some unpleasant surprise for an >> administrator/user... > > So, after having the discussion, what is our collective conclusion? > a) Go for it! > or > b) Don't do it, fool! > or > c) Let's wait another year... Nobody said (b), so: http://svn.freebsd.org/viewvc/base?view=revision&revision=212784 This thread in Gmane for your convenience: http://thread.gmane.org/gmane.os.freebsd.architechture/13419/focus=13551 -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 15:23:47 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7BCC91065679; Fri, 17 Sep 2010 15:23:47 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 4BE6A8FC1B; Fri, 17 Sep 2010 15:23:47 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id F0CB646BC1; Fri, 17 Sep 2010 11:23:46 -0400 (EDT) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 031198A04E; Fri, 17 Sep 2010 11:23:46 -0400 (EDT) 
From: John Baldwin To: Andriy Gapon Date: Fri, 17 Sep 2010 09:00:41 -0400 User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20100819; KDE/4.4.5; amd64; ; ) References: <4C4DB2B8.9080404@freebsd.org> <201007301614.40768.jhb@freebsd.org> <4C931878.803@freebsd.org> In-Reply-To: <4C931878.803@freebsd.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201009170900.41476.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Fri, 17 Sep 2010 11:23:46 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: alc@freebsd.org, Alan Cox , freebsd-arch@freebsd.org Subject: Re: amd64: change VM_KMEM_SIZE_SCALE to 1? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Sep 2010 15:23:47 -0000 On Friday, September 17, 2010 3:27:52 am Andriy Gapon wrote: > on 30/07/2010 23:14 John Baldwin said the following: > > I think this is much better. My strawman was rather hackish in that it was > > layering a hack on top of the existing calculations. I prefer your approach. > > I do not think penalizing amd64 machines with less than 1.5GB is a big worry > > as most x86 machines with a small amount of memory are probably running as > > i386 anyway. Given that, I would probably lean towards 1/8 instead of 1/7, > > but I would be happy with either one. > > Alan, John, > > are you planning to commit the vnodes limit patch or a version of it? I thought Alan had committed it already? 
Author: alc Date: Mon Aug 2 21:33:36 2010 New Revision: 210782 URL: http://svn.freebsd.org/changeset/base/210782 Log: Update the "desiredvnodes" calculation. In particular, make the part of the calculation that is based on the kernel's heap size more conservative. Hopefully, this will eliminate the need for MAXVNODES_MAX, but for the time being set MAXVNODES_MAX to a large value. Reviewed by: jhb@ MFC after: 6 weeks Looks like its MFC timer has likely triggered even. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 15:23:49 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 85E35106567A for ; Fri, 17 Sep 2010 15:23:49 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 598808FC1A for ; Fri, 17 Sep 2010 15:23:49 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 0E57E46BC0 for ; Fri, 17 Sep 2010 11:23:49 -0400 (EDT) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 3650D8A050 for ; Fri, 17 Sep 2010 11:23:48 -0400 (EDT) From: John Baldwin To: arch@freebsd.org Date: Fri, 17 Sep 2010 11:23:39 -0400 User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20100819; KDE/4.4.5; amd64; ; ) MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: <201009171123.39382.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Fri, 17 Sep 2010 11:23:48 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on 
bigwig.baldwin.cx Cc: Subject: Interrupt Threads X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Sep 2010 15:23:49 -0000 I have wanted to rework some of the interrupt threads stuff and enable interrupt filters by default for a while. I finally sat down and hacked out a new ithreads implementation at BSDCan and the following week. The new ithreads stuff moves away from dedicated threads per handlers or irqs. Instead, it adopts a model more akin to what Solaris does (though probably not completely identical). Each CPU has a queue of "pending handlers". When an interrupt fires, all of the handlers for that interrupt are placed on to that CPU's queue. There is a pool of hardware interrupt threads. If the current CPU does not already have an active hardware interrupt thread, it grabs a free one from the pool, pins it to the current CPU, and schedules it. The ithread continues to drain interrupt handlers from its CPU's queue until the queue is empty. Once that happens it disassociates itself from the CPU and goes back into the free pool. The effect is that interrupt handlers are now sort of like DPCs in Windows. If an interrupt handler blocks on a turnstile and there are other handlers pending for this CPU, then the current ithread is divorced from the current CPU and a new ithread is allocated for the current CPU. If we ever fail to allocate an ithread for a given CPU, then a flag is set. All ithreads check that flag before going idle, and if it is set they find the first CPU that needs an ithread and move to that CPU and start draining events. The ithread pool can be dynamically resized at runtime via sysctl, but it can't be smaller than NCPU * 2 or larger than the total number of handlers. 
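
To make the queueing model described above concrete, here is a minimal single-threaded sketch in C. It is not taken from the patch: the names (struct ih, struct cpu_q, intr_post, ithread_drain, count_hit), the fixed NCPU, and the plain counter standing in for the ithread pool are all assumptions for illustration, and real kernel code would need locking or critical sections around the per-CPU queue.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU 2                      /* illustrative CPU count (assumption) */

/* One registered interrupt handler; every name here is hypothetical. */
struct ih {
    void (*func)(void *);
    void *arg;
    struct ih *next;                /* link while queued on a CPU */
    int queued;                     /* nonzero while on some CPU's queue */
};

/* Per-CPU FIFO of pending handlers plus an "ithread is bound" flag. */
struct cpu_q {
    struct ih *head, *tail;
    int has_ithread;
};

struct cpu_q cpus[NCPU];
int free_ithreads = NCPU * 2;       /* pool floor mentioned in the text */
int need_ithread;                   /* set when the pool is exhausted */

/* Example handler: bumps the counter that arg points to. */
void count_hit(void *arg)
{
    ++*(int *)arg;
}

/*
 * An interrupt fired on `cpu`: queue the handler and, if the CPU has no
 * active ithread, take one from the pool. Returns 1 if an ithread was
 * bound, 0 if one was already active, -1 if the pool was empty.
 */
int intr_post(int cpu, struct ih *h)
{
    struct cpu_q *q;

    assert(cpu >= 0 && cpu < NCPU);
    q = &cpus[cpu];
    if (!h->queued) {
        h->next = NULL;
        if (q->tail != NULL)
            q->tail->next = h;
        else
            q->head = h;
        q->tail = h;
        h->queued = 1;
    }
    if (q->has_ithread)
        return 0;
    if (free_ithreads == 0) {
        need_ithread = 1;           /* some idle ithread must pick this up */
        return -1;
    }
    free_ithreads--;
    q->has_ithread = 1;
    return 1;
}

/*
 * Body of the ithread bound to `cpu`: run handlers until the queue is
 * empty, then go back to the free pool. Returns the number of handlers run.
 */
int ithread_drain(int cpu)
{
    struct cpu_q *q = &cpus[cpu];
    struct ih *h;
    int ran = 0;

    if (!q->has_ithread)
        return 0;
    while ((h = q->head) != NULL) {
        q->head = h->next;
        if (q->head == NULL)
            q->tail = NULL;
        h->queued = 0;
        h->func(h->arg);
        ran++;
    }
    q->has_ithread = 0;
    free_ithreads++;
    return ran;
}
```

In this sketch, two handlers posted to CPU 0 cause only the first post to bind an ithread, and a single drain runs both before the thread returns to the pool. The blocking and migration cases are deliberately left out.
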
Interrupt filters fit into this nicely since this avoids the problem with old interrupt filters that if you fix its design bug it may need to schedule multiple ithreads. Now it still only schedules at most one ithread per interrupt. To handle masking the interrupt and unmasking it when filters w/o handlers complete, I use a simple reference count with atomic ops to keep track of the number of queued handlers that need the interrupt masked and unmask it once the count drops to 0. Software interrupts still use a dedicated ithread, but the queue of pending handlers lives in the ithread, not in the CPU. I've also added some extensions to the current ithreads stuff based on some tricks that existing drivers use. Specifically, an interrupt handler can now call hwi_sched() on itself to reschedule itself at the back of the current CPU's queue. Thus, you can have NIC interrupt handlers do cooperative timesharing by just punting after N packets and using hwi_sched() to reschedule themselves. I also added a new type of interrupt handler that is registered with INTR_MANUAL. It is never automatically scheduled, but a filter can schedule it. As a test, I've ported the igb(4) driver to this framework. It uses hwi_sched() and an INTR_MANUAL handler for link events to replace almost all of the taskqueue usage in igb(4). (The multiqueue transmit bits still need a task for one case, but all the interrupt handler stuff is now "simpler"). Some downsides to this approach include: 1) If you have two busy devices whose interrupts both go to the same CPU but via different IRQs, in the old model those threads could run concurrently on separate CPUs, but in the new model the handlers are tied to the same CPU and compete for CPU time on that CPU. In other words, the new model really wants interrupts to be evenly distributed amongst CPUs to work properly. Not entirely sure what I think about that. 
2) Many folks find the ability to see how much CPU IRQ N's thread has used in top useful, but this loses all of that since there is no longer a tight coupling between IRQs and threads.

One unresolved issue is that the cardbus code currently uses a filter that returns just FILTER_SCHEDULE_THREAD without FILTER_HANDLED. This is not supported in the new code. I have some ideas on how to fix the cardbus code (most likely using wrappers around the child interrupt handlers) but need to hash the details out with Warner.

A second unresolved issue is that interrupt storm detection is currently broken. I have some thoughts on how to re-add it, but it will likely be a bit tricky.

The code currently lives in p4 at //depot/user/jhb/intr/... I have also put up a patch at http://www.freebsd.org/~jhb/patches/intr_threads.patch. This patch includes the changes to the igb(4) driver.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 15:32:51 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 42FEF1065675; Fri, 17 Sep 2010 15:32:51 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 2057E8FC1C; Fri, 17 Sep 2010 15:32:47 +0000 (UTC) Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua [212.40.38.101]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id SAA24147; Fri, 17 Sep 2010 18:32:44 +0300 (EEST) (envelope-from avg@freebsd.org) Message-ID: <4C938A1C.40307@freebsd.org> Date: Fri, 17 Sep 2010 18:32:44 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.2.9) Gecko/20100909 Lightning/1.0b2 Thunderbird/3.1.3 MIME-Version: 1.0 To: John Baldwin References: <4C4DB2B8.9080404@freebsd.org> <201007301614.40768.jhb@freebsd.org> <4C931878.803@freebsd.org> <201009170900.41476.jhb@freebsd.org>
In-Reply-To: <201009170900.41476.jhb@freebsd.org> X-Enigmail-Version: 1.1.2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: alc@freebsd.org, Alan Cox , freebsd-arch@freebsd.org Subject: Re: amd64: change VM_KMEM_SIZE_SCALE to 1? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Sep 2010 15:32:51 -0000 on 17/09/2010 16:00 John Baldwin said the following: > On Friday, September 17, 2010 3:27:52 am Andriy Gapon wrote: >> on 30/07/2010 23:14 John Baldwin said the following: >>> I think this is much better. My strawman was rather hackish in that it was >>> layering a hack on top of the existing calculations. I prefer your approach. >>> I do not think penalizing amd64 machines with less than 1.5GB is a big worry >>> as most x86 machines with a small amount of memory are probably running as >>> i386 anyway. Given that, I would probably lean towards 1/8 instead of 1/7, >>> but I would be happy with either one. >> >> Alan, John, >> >> are you planning to commit the vnodes limit patch or a version of it? > > I thought Alan had committed it already? Oops, missed this one. Thanks a lot! > Author: alc > Date: Mon Aug 2 21:33:36 2010 > New Revision: 210782 > URL: http://svn.freebsd.org/changeset/base/210782 > > Log: > Update the "desiredvnodes" calculation. In particular, make the part of > the calculation that is based on the kernel's heap size more conservative. > Hopefully, this will eliminate the need for MAXVNODES_MAX, but for the > time being set MAXVNODES_MAX to a large value. > > Reviewed by: jhb@ > MFC after: 6 weeks > > Looks like its MFC timer has likely triggered even. 
> -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 16:02:46 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BFBDF1065670; Fri, 17 Sep 2010 16:02:46 +0000 (UTC) (envelope-from julianelischer@gmail.com) Received: from mail-px0-f182.google.com (mail-px0-f182.google.com [209.85.212.182]) by mx1.freebsd.org (Postfix) with ESMTP id 873B68FC1B; Fri, 17 Sep 2010 16:02:46 +0000 (UTC) Received: by pxi17 with SMTP id 17so808006pxi.13 for ; Fri, 17 Sep 2010 09:02:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :content-type:content-transfer-encoding; bh=bU0Tyr/0LTZ25wsHjyUJcpqulaPmDxqTNl0VOwi/hHY=; b=QkhSRTcPph+7fE+hEMJ37sGjwuqfc9psJLin3JmmdxaN6lG0hyd+yM/s5XYMQpbOic ALQjtJqKrfSn9ld45LVPly4JAG/09TwNwf03E4DDrpZuGGeRZBmOtvIH6RvDFHpoJtxl 72g0VoaV6u1D+Z1+FdTV3oBgwD3DH0IVGp9E8= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; b=Vk/Z+X7Hltno8Q/1mCX8odezyX9jZJTtg69a0kHyjerjxj56XKO2CsCOvvwERJyryU 7mttk764jQVpumps+qJckzACsz2ddXHVKak2vj2RDVJ+Q9JKFLERFXVzli6B0ujybUmC E3TqVVhg64y/PC5R+ZKcWh9tOs3ys1fxwa5/Q= Received: by 10.142.110.6 with SMTP id i6mr3852930wfc.276.1284737904968; Fri, 17 Sep 2010 08:38:24 -0700 (PDT) Received: from julian-mac.elischer.org (h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137]) by mx.google.com with ESMTPS id o9sm2203962wfd.16.2010.09.17.08.38.21 (version=TLSv1/SSLv3 cipher=RC4-MD5); Fri, 17 Sep 2010 08:38:22 -0700 (PDT) Sender: Julian Elischer Message-ID: <4C938B8E.8050901@elischer.com> Date: Fri, 17 Sep 2010 08:38:54 -0700 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; 
rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4 MIME-Version: 1.0 To: John Baldwin References: <201009171123.39382.jhb@freebsd.org> In-Reply-To: <201009171123.39382.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org Subject: Re: Interrupt Threads X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Sep 2010 16:02:46 -0000 On 9/17/10 8:23 AM, John Baldwin wrote: > I have wanted to rework some of the interrupt threads stuff and enable > interrupt filters by default for a while. I finally sat down and > hacked out a new ithreads implementation at BSDCan and the following week. > > The new ithreads stuff moves away from dedicated threads per handlers or irqs. > Instead, it adopts a model more akin to what Solaris does (though probably not > completely identical). Each CPU has a queue of "pending handlers". When an > interrupt fires, all of the handlers for that interrupt are placed on to that > CPU's queue. There is a pool of hardware interrupt threads. If the current > CPU does not already have an active hardware interrupt thread, it grabs a free > one from the pool, pins it to the current CPU, and schedules it. The ithread > continues to drain interrupt handlers from its CPU's queue until the queue is > empty. Once that happens it disassociates itself from the CPU and goes back > into the free pool. The effect is that interrupt handlers are now sort of > like DPCs in Windows. do you gain anything, other than having less threads, by allowing them to migrate? that means you need more locking between cpus I presume. 
> > If an interrupt handler blocks on a turnstile and there are other handlers > pending for this CPU, then the current ithread is divorced from the current > CPU and a new ithread is allocated for the current CPU. > > If we ever fail to allocate an ithread for a given CPU, then a flag is set. > All ithreads check that flag before going idle, and if it is set they find the > first CPU that needs an ithread and move to that CPU and start draining > events. > > The ithread pool can be dynamically resized at runtime via sysctl, but it > can't be smaller than NCPU * 2 or larger than the total number of handlers. > > Interrupt filters fit into this nicely since this avoids the problem with old > interrupt filters that if you fix its design bug it may need to schedule > multiple ithreads. Now it still only schedules at most one ithread per > interrupt. > > To handle masking the interrupt and unmasking it when filters w/o handlers > complete, I use a simple reference count with atomic ops to keep track of the > number of queued handlers that need the interrupt masked and unmask it once > the count drops to 0. > > Software interrupts still use a dedicated ithread, but the queue of pending > handlers lives in the ithread, not in the CPU. > > I've also added some extensions to the current ithreads stuff based on some > tricks that existing drivers use. Specifically, an interrupt handler can now > call hwi_sched() on itself to reschedule itself at the back of the current > CPU's queue. Thus, you can have NIC interrupt handlers do cooperative > timesharing by just punting after N packets and using hwi_sched() to > reschedule themselves. I also added a new type of interrupt > handler that is registered with INTR_MANUAL. It is never automatically > scheduled, but a filter can schedule it. > > As a test, I've ported the igb(4) driver to this framework. It uses > hwi_sched() and an INTR_MANUAL handler for link events to replace almost all > of the taskqueue usage in igb(4). 
> (The multiqueue transmit bits still need a
> task for one case, but all the interrupt handler stuff is now "simpler").
>
> Some downsides to this approach include:
>
> 1) If you have two busy devices whose interrupts both go to the same CPU but
> via different IRQs, in the old model those threads could run concurrently on
> separate CPUs, but in the new model the handlers are tied to the same CPU and
> compete for CPU time on that CPU. In other words, the new model really wants
> interrupts to be evenly distributed amongst CPUs to work properly. Not
> entirely sure what I think about that.
>
> 2) Many folks find the ability to see how much CPU IRQ N's thread has used in
> top useful, but this loses all of that since there is no longer a tight
> coupling between IRQs and threads.

Yeah, we faced this problem with KSEs and user threads. The lack of coupling also means you lose history, which may be useful.

>
> One unresolved issue is that the cardbus code currently uses a filter that
> returns just FILTER_SCHEDULE_THREAD without FILTER_HANDLED. This is not
> supported in the new code. I have some ideas on how to fix the cardbus code
> (most likely using wrappers around the child interrupt handlers) but need to
> hash the details out with Warner.
>
> A second unresolved issue is that interrupt storm detection is currently
> broken. I have some thoughts on how to re-add it, but it will likely be a bit
> tricky.
>
> The code currently lives in p4 at //depot/user/jhb/intr/... I have also put
> up a patch at http://www.freebsd.org/~jhb/patches/intr_threads.patch. This
> patch includes the changes to the igb(4) driver.
> From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 17:27:12 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9BD741065694 for ; Fri, 17 Sep 2010 17:27:12 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 6DD1D8FC0A for ; Fri, 17 Sep 2010 17:27:12 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 00A8546B64; Fri, 17 Sep 2010 13:27:12 -0400 (EDT) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 3C4298A03C; Fri, 17 Sep 2010 13:27:11 -0400 (EDT) From: John Baldwin To: Julian Elischer Date: Fri, 17 Sep 2010 13:26:11 -0400 User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20100819; KDE/4.4.5; amd64; ; ) References: <201009171123.39382.jhb@freebsd.org> <4C938B8E.8050901@elischer.com> In-Reply-To: <4C938B8E.8050901@elischer.com> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201009171326.11182.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Fri, 17 Sep 2010 13:27:11 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: arch@freebsd.org Subject: Re: Interrupt Threads X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Sep 2010 17:27:12 -0000 On Friday, September 17, 2010 11:38:54 am Julian 
Elischer wrote: > On 9/17/10 8:23 AM, John Baldwin wrote: > > I have wanted to rework some of the interrupt threads stuff and enable > > interrupt filters by default for a while. I finally sat down and > > hacked out a new ithreads implementation at BSDCan and the following week. > > > > The new ithreads stuff moves away from dedicated threads per handlers or irqs. > > Instead, it adopts a model more akin to what Solaris does (though probably not > > completely identical). Each CPU has a queue of "pending handlers". When an > > interrupt fires, all of the handlers for that interrupt are placed on to that > > CPU's queue. There is a pool of hardware interrupt threads. If the current > > CPU does not already have an active hardware interrupt thread, it grabs a free > > one from the pool, pins it to the current CPU, and schedules it. The ithread > > continues to drain interrupt handlers from its CPU's queue until the queue is > > empty. Once that happens it disassociates itself from the CPU and goes back > > into the free pool. The effect is that interrupt handlers are now sort of > > like DPCs in Windows. > > do you gain anything, other than having less threads, by allowing them > to migrate? that means you need more locking between cpus I presume. Filters really work and all the homegrown taskqueue stuff in many drivers goes away as it can now be done using the normal interrupt code. The locking between CPUs is not any more than it is currently and in fact can be less in certain cases. One thing I have not done is to add a notion of affinity so that a CPU would prefer the last ithread it used if it is still free when it needs a new ithread. Also, the code to add and remove handlers is actually simpler than the old code. > > 2) Many folks find the ability to see how much CPU IRQ N's thread has used in > > top useful, but this loses all of that since there is no longer a tight > > coupling between IRQs and threads. 
> > yeah we faced this problem with KSEs and user threads.. the lack of > coupling also means you lose history which may be useful. The only thing that is different in this case is that interrupt handlers have traditionally not had any coupling with anything since they are asynchronous event handlers similar to signal handlers. In the case of threads there is an expected coupling, but less so for signals and interrupts. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 22:00:22 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CFD601065670 for ; Fri, 17 Sep 2010 22:00:22 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 97F638FC08 for ; Fri, 17 Sep 2010 22:00:22 +0000 (UTC) Received: by iwn34 with SMTP id 34so2745913iwn.13 for ; Fri, 17 Sep 2010 15:00:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=woSfgqRAfkyGzlQMA7yezI9OVBCKmzcIETEPtJK0GNM=; b=AKWjiIPqKr3puMy5P5JiL+/QKsFbr69Udw9evn1Pe1iyYfzPri2JU1qiRwfOVIVX1c 4MZi0xquSLrcm0DzwRKeshWn7geXizXsLEBvfDu0R58IS8fan+PEOD9e1v34C4z1tmuL ZAOh6CISrbZkTO3ay1l9l+wbQyioZZy1w2tF0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type; b=rmD4elx5Y5H6Gf15hVnRwNGIIcks1SgnzCkACozTE8ZT5c+n4iXACT3PSmWojrmMrt 88ksPYHke/kFK0B1uHS/Wg1ZF41aPEwrN0cGS/3iXR3jwrtywSbmsf2VQSdD4WT0mpkr mE81Ycvds/X91EVdtYwkJTriD1Vo9Wcfp6dvg= MIME-Version: 1.0 Received: by 10.231.11.9 with SMTP id r9mr5855454ibr.47.1284760821712; Fri, 17 Sep 2010 15:00:21 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.231.187.71 with HTTP; Fri, 
17 Sep 2010 15:00:21 -0700 (PDT)
Date: Fri, 17 Sep 2010 15:00:21 -0700
X-Google-Sender-Auth: NPvV5G8STDOCYP4sd8Hu_Q-jPxQ
Message-ID:
From: mdf@FreeBSD.org
To: FreeBSD Arch
Content-Type: text/plain; charset=ISO-8859-1
Cc: Poul-Henning Kamp
Subject: Towards a One True Printf
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 17 Sep 2010 22:00:23 -0000

In an attempt to move towards a one true printf, I copied the base
printf(3) implementation and changed the parameters to be similar to
that of kvprintf(9), with a generic callback function on each print
group. The callback can easily be essentially the io_put() methods
used for printf(3) but have the possibility of being something else
too. I used two different callback signatures -- the first is an
optimized version that takes a char array (or presumably a wide char
array to implement wprintf(3)), and the second is identical to the
callback for kvprintf(9).

http://people.freebsd.org/~mdf/vcb_printf.c
http://people.freebsd.org/~mdf/vcb_printf2.c

With changes in hand, I wrote a small user-space utility to benchmark
the existing fprintf and sprintf versus the new one. Note that the
my_fprintf() function essentially borrows from the guts of
printfcommon.h. This

http://people.freebsd.org/~mdf/printf_test.c

The numbers I get I found rather interesting (also, I appear to be
incompetent at calculating standard deviations; I'm sure someone will
correct my code).

# ./printf_test
sprintf    : avg 0.090116202 sec, stdd 1.429e+10
my_sprintf : avg 0.069918215 sec, stdd 1.167e+10
my_sprintf2: avg 0.174028095 sec, stdd 1.167e+10

fprintf     /dev/null: avg 0.077871461 sec, stdd 1.65e+10
my_fprintf  /dev/null: avg 0.102162816 sec, stdd 8.25e+09
my_fprintf2 /dev/null: avg 0.307952770 sec, stdd 1.65e+10

fprintf     /tmp: avg 0.169936961 sec, stdd 1.167e+10
my_fprintf  /tmp: avg 0.199943344 sec, stdd 1.167e+10
my_fprintf2 /tmp: avg 0.399886075 sec, stdd 1.167e+10
my_fwrite   /tmp: avg 0.210000656 sec, stdd 1.167e+10

I am unsurprised that the character-by-character callback is slower
than bulk; in fact I didn't roll up the bulk callback until I saw how
miserable the character-by-character callback was. I put the code and
numbers up for both because this also indicates the likelihood of a
speedup in the kernel by doing a bulk callback for sprintf and sbuf
operations.

The new implementation is significantly faster when doing sprintf(3),
significantly slower when printing to /dev/null, slightly slower
when printing to a file using an iovec, and slightly slower still using a
naive fwrite(3) callback. In my case, /tmp is a UFS2 filesystem.

My thought would be that, if we have a core implementation like
cb_printf that can be used in both the kernel and libc/stdio, it would
be fewer sources to maintain. Also, the kernel cannot use the
existing FILE-based printf(3) but both sources can use a
callback-based printf.

I would like to discuss at some point after this adding a generic
printf format specifier that basically takes a function pointer and
argument and uses that to print. Implementing that for both kernel
and userspace would be easier with a single root printf
implementation.

So, thoughts? Is the performance loss here acceptable, and is there
something I missed in terms of making it run faster when printing to
files?
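
For illustration, the bulk versus character-by-character distinction can be sketched roughly as below. This is not the code from vcb_printf.c or vcb_printf2.c: the names (bulk_cb_t, char_cb_t, strsink, mini_cb_printf, bulk_via_char) and both callback signatures are assumptions made up for the sketch, and the formatter only understands %s, %d and %%.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Bulk callback: gets a whole run of bytes per call (hypothetical signature). */
typedef void (*bulk_cb_t)(const char *buf, size_t len, void *arg);
/* Per-character callback, one call per output byte (hypothetical signature). */
typedef void (*char_cb_t)(int ch, void *arg);

/* Bounded string sink passed to the callbacks as their opaque argument. */
struct strsink {
	char	*buf;
	size_t	 cap;	/* capacity, including room for the NUL */
	size_t	 len;	/* bytes stored so far */
};

void
strsink_bulk(const char *buf, size_t len, void *arg)
{
	struct strsink *s = arg;
	size_t room;

	assert(s->len < s->cap);	/* requires cap >= 1 */
	room = s->cap - 1 - s->len;
	if (len > room)
		len = room;		/* truncate, as snprintf(3) would */
	memcpy(s->buf + s->len, buf, len);
	s->len += len;
	s->buf[s->len] = '\0';
}

void
strsink_putc(int ch, void *arg)
{
	char c = (char)ch;

	strsink_bulk(&c, 1, arg);
}

/* Adapter: drive a per-character callback from the bulk interface. */
struct char_adapter {
	char_cb_t	 cb;
	void		*arg;
};

void
bulk_via_char(const char *buf, size_t len, void *arg)
{
	struct char_adapter *a = arg;

	for (size_t i = 0; i < len; i++)
		a->cb((unsigned char)buf[i], a->arg);
}

/* Tiny formatter: only %s, %d and %% are handled in this sketch. */
int
mini_cb_printf(bulk_cb_t cb, void *arg, const char *fmt, ...)
{
	char num[32];
	va_list ap;
	int total = 0;

	va_start(ap, fmt);
	while (*fmt != '\0') {
		size_t run = strcspn(fmt, "%");

		if (run > 0) {		/* literal text goes out in one call */
			cb(fmt, run, arg);
			total += (int)run;
			fmt += run;
			continue;
		}
		fmt++;			/* skip the '%' */
		if (*fmt == 's') {
			const char *s = va_arg(ap, const char *);

			cb(s, strlen(s), arg);
			total += (int)strlen(s);
		} else if (*fmt == 'd') {
			int n = snprintf(num, sizeof(num), "%d", va_arg(ap, int));

			cb(num, (size_t)n, arg);
			total += n;
		} else if (*fmt == '%') {
			cb("%", 1, arg);
			total++;
		} else if (*fmt == '\0')
			break;		/* lone trailing '%' */
		fmt++;			/* other conversions are skipped here */
	}
	va_end(ap);
	return (total);
}
```

With the bulk sink, a format such as "irq %d on %s" costs four callback invocations; routed through bulk_via_char the same output costs one invocation per byte, which is the overhead the my_sprintf2 and my_fprintf2 rows above appear to be paying.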
Thanks, matthew From owner-freebsd-arch@FreeBSD.ORG Fri Sep 17 22:09:02 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C14F11065672 for ; Fri, 17 Sep 2010 22:09:02 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 854408FC08 for ; Fri, 17 Sep 2010 22:09:02 +0000 (UTC) Received: by iwn34 with SMTP id 34so2753896iwn.13 for ; Fri, 17 Sep 2010 15:09:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received :in-reply-to:references:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=BG6j0fzGOTd990fSrVrXCF1skOc3Z5bXus9IVzFiZ9c=; b=rJWbqnQ5radleUB6cwSGabWEaFmyS8WkJhZZlnjNpS3hbZG1ufxAemKUbl47MQwv6y bzkNJHF+bmiQd2bBNziqkqtxWTF8u+TP94Cmspzw4e+OSiqcoDQQ98T4jobPfpF0v93d ZjmCO+Gx+AjWHjp5i3I9ThmVxuAB0t+eNz/c8= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=JUCmb6llMnXhzSJxEw98wRXZMKCOqWEKGcxfYm4Jk+Qxiofkr3WXwQKdENTGZF2WyF n+rEKeGpAKbiBB5nC4GUchZ7FPhVw8X3lq4EJRhgIp52/A8uIBDysZQKSRraBf9dioJW XuLZgcJGc+iBcF38prbf36D/U3YvAGQhnOraU= MIME-Version: 1.0 Received: by 10.231.31.129 with SMTP id y1mr5881282ibc.45.1284761341838; Fri, 17 Sep 2010 15:09:01 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.231.187.71 with HTTP; Fri, 17 Sep 2010 15:09:01 -0700 (PDT) In-Reply-To: References: Date: Fri, 17 Sep 2010 15:09:01 -0700 X-Google-Sender-Auth: pR0vIg6Rr2RVsWhboFV8qTu76vs Message-ID: From: mdf@FreeBSD.org To: FreeBSD Arch Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Poul-Henning Kamp Subject: Re: 
Towards a One True Printf
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 17 Sep 2010 22:09:02 -0000

On Fri, Sep 17, 2010 at 3:00 PM, wrote:
> In an attempt to move towards a one true printf, I copied the base
> printf(3) implementation and changed the parameters to be similar to
> that of kvprintf(9), with a generic callback function on each print
> group. The callback can easily be essentially the io_put() methods
> used for printf(3) but have the possibility of being something else
> too. I used two different callback signatures -- the first is an
> optimized version that takes a char array (or presumably a wide char
> array to implement wprintf(3)), and the second is identical to the
> callback for kvprintf(9).
>
> http://people.freebsd.org/~mdf/vcb_printf.c
> http://people.freebsd.org/~mdf/vcb_printf2.c
>
> With changes in hand, I wrote a small user-space utility to benchmark
> the existing fprintf and sprintf versus the new one. Note that the
> my_fprintf() function essentially borrows from the guts of
> printfcommon.h. This
>
> http://people.freebsd.org/~mdf/printf_test.c
>
> The numbers I get I found rather interesting (also, I appear to be
> incompetent at calculating standard deviations; I'm sure someone will
> correct my code).
>
> # ./printf_test
> sprintf    : avg 0.090116202 sec, stdd 1.429e+10
> my_sprintf : avg 0.069918215 sec, stdd 1.167e+10
> my_sprintf2: avg 0.174028095 sec, stdd 1.167e+10
>
> fprintf     /dev/null: avg 0.077871461 sec, stdd 1.65e+10
> my_fprintf  /dev/null: avg 0.102162816 sec, stdd 8.25e+09
> my_fprintf2 /dev/null: avg 0.307952770 sec, stdd 1.65e+10
>
> fprintf     /tmp: avg 0.169936961 sec, stdd 1.167e+10
> my_fprintf  /tmp: avg 0.199943344 sec, stdd 1.167e+10
> my_fprintf2 /tmp: avg 0.399886075 sec, stdd 1.167e+10
> my_fwrite   /tmp: avg 0.210000656 sec, stdd 1.167e+10
>
> I am unsurprised that the character-by-character callback is slower
> than bulk; in fact I didn't roll up the bulk callback until I saw how
> miserable the character-by-character callback was. I put the code and
> numbers up for both because this also indicates the likelihood of a
> speedup in the kernel by doing a bulk callback for sprintf and sbuf
> operations.
>
> The new implementation is significantly faster when doing sprintf(3),
> significantly slower when printing to /dev/null, slightly slower
> when printing to a file using an iovec, and slightly slower still using a
> naive fwrite(3) callback. In my case, /tmp is a UFS2 filesystem.
>
> My thought would be that, if we have a core implementation like
> cb_printf that can be used in both the kernel and libc/stdio, it would
> be fewer sources to maintain. Also, the kernel cannot use the
> existing FILE based printf(3) but both sources can use a
> callback-based printf.
>
> I would like to discuss at some point after this adding a generic
> printf format specifier that basically takes a function pointer and
> argument and uses that to print. Implementing that for both kernel
> and userspace would be easier with a single root printf
> implementation.
>
> So, thoughts? Is the performance loss here acceptable, and is there
> something I missed in terms of making it run faster when printing to
> files?
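
On the "stdd" column quoted above: averages below one second next to deviations around 1e+10 suggest the spread is being computed incorrectly, as the post itself suspects. A minimal sketch of one common way to accumulate mean and variance in a single pass (Welford's online algorithm) follows. The struct and function names are hypothetical and are not taken from printf_test.c; the standard deviation is the square root of the returned variance, for example via sqrt(3) from <math.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Running statistics accumulated with Welford's online algorithm. */
struct runstats {
	size_t	n;	/* number of samples seen */
	double	mean;	/* running mean */
	double	m2;	/* sum of squared deviations from the mean */
};

/* Fold one sample (e.g. one run's elapsed seconds) into the statistics. */
void
runstats_add(struct runstats *rs, double x)
{
	double delta;

	assert(rs != NULL);
	delta = x - rs->mean;
	rs->n++;
	rs->mean += delta / (double)rs->n;
	rs->m2 += delta * (x - rs->mean);
}

/* Unbiased sample variance; 0 when fewer than two samples were added. */
double
runstats_variance(const struct runstats *rs)
{
	if (rs->n < 2)
		return (0.0);
	return (rs->m2 / (double)(rs->n - 1));
}
```

Keeping the samples in seconds as doubles (rather than mixing seconds and nanoseconds) makes the result directly comparable to the averages in the table.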
Also, here is a diff from vfprintf.c to vcb_printf.c to show what really changed: http://people.freebsd.org/~mdf/vfprintf-to-vcb_printf.diff Last, to make all this compile I had to add [v]cb_printf to libc's Symbol.map and also I added __sfvwrite so my user-land performance utility would build. Thanks, matthew From owner-freebsd-arch@FreeBSD.ORG Sat Sep 18 00:10:14 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6D05910656A3 for ; Sat, 18 Sep 2010 00:10:14 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id A15878FC16 for ; Sat, 18 Sep 2010 00:10:13 +0000 (UTC) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 0629A45D8D; Sat, 18 Sep 2010 01:46:04 +0200 (CEST) Received: from localhost (chello089077043238.chello.pl [89.77.43.238]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id DC9B545C9B; Sat, 18 Sep 2010 01:45:58 +0200 (CEST) Date: Sat, 18 Sep 2010 01:45:42 +0200 From: Pawel Jakub Dawidek To: freebsd-arch@FreeBSD.org Message-ID: <20100917234542.GE1902@garage.freebsd.pl> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Q8BnQc91gJZX4vDc" Content-Disposition: inline User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 9.0-CURRENT amd64 Cc: freebsd-current@FreeBSD.org Subject: gptboot rewrite, bootonce, etc. 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 18 Sep 2010 00:10:14 -0000

--Q8BnQc91gJZX4vDc
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi.

My company was in need of functionality similar to nextboot(8), but on
boot loader level, so we can have two partitions we boot from where one
is known to be good and the other is used for upgrades. We upgrade by
dd(1)ing entire partition image onto unused partition, we mark it as
try-to-boot-from-it-but-only-once, reboot and if we fail to boot from
the new partition, we fall back to the old, good partition. If we
succeed on the other hand, we mark the new partition as our boot
partition and mark the other one as unused.

Well, how hard can it be?

After around two weeks of work, I ended up rewriting gptboot in large
parts, reorganizing a lot of code, improving and extending gpart a bit
and implementing the desired functionality.

Here is the patch for review and test:

	http://people.freebsd.org/~pjd/patches/gptboot.patch

The list of changes:

- Split code shared by almost any boot loader into separate files and
  clean up most layering violations:

	sys/boot/i386/common/rbx.h:

		RBX_* defines
		OPT_SET()
		OPT_CHECK()

	sys/boot/common/util.[ch]:

		memcpy()
		memset()
		memcmp()
		bcpy()
		bzero()
		bcmp()
		strcmp()
		strncmp() [new]
		strcpy()
		strcat()
		strchr()
		strlen()
		printf()

	sys/boot/i386/common/cons.[ch]:

		ioctrl
		putc()
		xputc()
		putchar()
		getc()
		xgetc()
		keyhit() [now takes number of seconds as an argument]
		getstr()

	sys/boot/i386/common/drv.[ch]:

		struct dsk
		drvread()
		drvwrite() [new]
		drvsize() [new]

	sys/boot/common/crc32.[ch] [new]

	sys/boot/common/gpt.[ch] [new]

- Teach gptboot and gptzfsboot about new files. I haven't touched the
  rest, but there is still a lot of code duplication to be removed.

- Implement full GPT support. Currently we just read primary header and
  partition table and don't care about checksums, etc. With the patch we
  verify checksums of primary header and primary partition table and if
  there is a problem we fall back to backup header and backup partition
  table.

- Clean up most messages to use prefix of boot program, so in case of an
  error we know where the error comes from, eg.:

	gptboot: unable to read primary GPT header

- If we can't boot, print boot prompt only once and not every five
  seconds.

- Introduce three new GPT attributes:

	bootme     - this is bootable partition
	bootonce   - try to boot from this partition only once
	bootfailed - we failed to boot from this partition

- Extend gpart to allow to manipulate new attributes:

	gpart set -a bootme -i 3 ada0
	gpart set -a bootonce -i 4 ada0
	gpart unset -a bootfailed -i 2 ada0

  Note, that setting 'bootonce' attribute automatically sets 'bootme'
  attribute.

- Change boot order of gptboot to the following:

	1. Try to boot from all the partitions that have both 'bootme'
	   and 'bootonce' attributes one by one.
	2. Try to boot from all the partitions that have only 'bootme'
	   attribute one by one.
	3. If there are no partitions with 'bootme' attribute, boot from
	   the first UFS partition.

- The 'bootonce' functionality is implemented in the following way:

	1. Walk through all the partitions and when 'bootonce'
	   attribute is found without 'bootme' attribute, remove
	   'bootonce' attribute and set 'bootfailed' attribute.
	   'bootonce' attribute alone means that we tried to boot from
	   this partition, but boot failed after leaving gptboot and
	   machine was restarted.
	2. Find partition with both 'bootme' and 'bootonce' attributes.
	3. Remove 'bootme' attribute.
	4. Try to execute /boot/loader or /boot/kernel/kernel from that
	   partition. If succeeded we stop here.
	5. If execution failed, remove 'bootonce' and set 'bootfailed'.
	6. Go to 2.
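
The boot order and the 'bootonce' steps above can be sketched as an in-memory state machine. This is only an illustration under stated assumptions, not code from gptboot.patch: the flag values, struct part, the loads field (a stand-in for "executing /boot/loader or the kernel succeeded") and the function names are all hypothetical, and the real code would also write each attribute change back to the GPT on disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits; the on-disk GPT attribute bit positions differ. */
enum {
	ATTR_BOOTME	= 1u << 0,
	ATTR_BOOTONCE	= 1u << 1,
	ATTR_BOOTFAILED	= 1u << 2
};

struct part {
	unsigned	attrs;
	bool		loads;	/* stand-in for "loader or kernel executed" */
};

/* Step 1: 'bootonce' alone means an earlier try died after gptboot. */
void
flag_stale_bootonce(struct part *p, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if ((p[i].attrs & (ATTR_BOOTME | ATTR_BOOTONCE)) ==
		    ATTR_BOOTONCE) {
			p[i].attrs &= ~(unsigned)ATTR_BOOTONCE;
			p[i].attrs |= ATTR_BOOTFAILED;
		}
	}
}

/*
 * Returns the index of the partition booted from, or -1 if nothing was
 * booted; what happens next (such as the first-UFS-partition fallback)
 * is left to the caller in this sketch.
 */
int
choose_boot_partition(struct part *p, size_t n)
{
	const unsigned both = ATTR_BOOTME | ATTR_BOOTONCE;

	assert(p != NULL || n == 0);
	flag_stale_bootonce(p, n);
	/* Steps 2-6: partitions with both 'bootme' and 'bootonce'. */
	for (size_t i = 0; i < n; i++) {
		if ((p[i].attrs & both) != both)
			continue;
		p[i].attrs &= ~(unsigned)ATTR_BOOTME;	/* step 3 */
		if (p[i].loads)
			return ((int)i);		/* step 4 */
		p[i].attrs &= ~(unsigned)ATTR_BOOTONCE;	/* step 5 */
		p[i].attrs |= ATTR_BOOTFAILED;
	}
	/* Then partitions that have only 'bootme'. */
	for (size_t i = 0; i < n; i++) {
		if ((p[i].attrs & both) == ATTR_BOOTME && p[i].loads)
			return ((int)i);
	}
	return (-1);
}
```

Note how a successful one-time boot leaves the partition with 'bootonce' but without 'bootme', which is the state a later stage can use to tell "booted successfully from the trial partition" apart from "never tried".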

  If whole boot succeeded there is new /etc/rc.d/gptboot script that
  will log all partitions that we failed to boot from (the ones with
  'bootfailed' attribute) and will remove this attribute. It will also
  find partition with 'bootonce' attribute - this is the partition we
  booted from successfully. The script will log success and remove the
  attribute.

  All the GPT updates we do here go to both primary and backup GPT if
  they are valid. We don't touch headers or partition tables when
  checksum doesn't match.

Any comments or suggestions? Be aware that at this point I'm soo full of
boot loaders and I'm not looking for much more work in this area, so
small tweaks are fine, but bigger things will have to wait until I can
sleep at nights again. Well, there is still dedup support that waits to
be implemented in gptzfsboot...

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--Q8BnQc91gJZX4vDc
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iEYEARECAAYFAkyT/aUACgkQForvXbEpPzTjJACfWIEFMstjYV+1bPilgCjx90pB
Cb8An3XfrBMtepSeQWX0IYnuLJrOIH2i
=yKLn
-----END PGP SIGNATURE-----

--Q8BnQc91gJZX4vDc--

From owner-freebsd-arch@FreeBSD.ORG Sat Sep 18 04:16:03 2010
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4B1C31065670; Sat, 18 Sep 2010 04:16:03 +0000 (UTC) (envelope-from julian@freebsd.org)
Received: from out-0.mx.aerioconnect.net (outn.internet-mail-service.net [216.240.47.237]) by mx1.freebsd.org (Postfix) with ESMTP id 2BEB28FC0A; Sat, 18 Sep 2010 04:16:02 +0000 (UTC)
Received: from idiom.com (postfix@mx0.idiom.com [216.240.32.160]) by out-0.mx.aerioconnect.net (8.13.8/8.13.8) with ESMTP id o8I456jF012205; Fri, 17 Sep 2010 21:05:07 -0700
X-Client-Authorized: MaGic Cook1e
X-Client-Authorized: MaGic
Cook1e X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137]) by idiom.com (Postfix) with ESMTP id 1695C2D6010; Fri, 17 Sep 2010 21:05:05 -0700 (PDT) Message-ID: <4C943A94.5020606@freebsd.org> Date: Fri, 17 Sep 2010 21:05:40 -0700 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4 MIME-Version: 1.0 To: Pawel Jakub Dawidek References: <20100917234542.GE1902@garage.freebsd.pl> In-Reply-To: <20100917234542.GE1902@garage.freebsd.pl> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.67 on 216.240.47.51 Cc: freebsd-current@freebsd.org, freebsd-arch@freebsd.org Subject: Re: gptboot rewrite, bootonce, etc. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 18 Sep 2010 04:16:03 -0000 On 9/17/10 4:45 PM, Pawel Jakub Dawidek wrote: > Hi. > > My company was in need for functionality similar to nextboot(8), but on > boot loader level, so we can have two partitions we boot from where one > is known to be good and the other is used for upgrades. We upgrade by > dd(1)ing entire partition image onto unused partition, we mark it as > try-to-boot-from-it-but-only-once, reboot and if we fail to boot from > the new partition, we fall back to the old, good partition. If we > succeed on the other hand, we mark the new partition as our boot > partition and mark the other one as unused. > > Well, how hard can it be? > > After around two weeks of work, I ended up rewriting gptboot in large > parts, reorganizing a lot of code, improving and extending gpart a bit > and implementing desire functionality. 
>
> Here is the patch for review and test:
>
> 	http://people.freebsd.org/~pjd/patches/gptboot.patch
>
> The list of changes:
>
> - Split code shared by almost any boot loader into separate files and
>   clean up most layering violations:
>
> 	sys/boot/i386/common/rbx.h:
>
> 		RBX_* defines
> 		OPT_SET()
> 		OPT_CHECK()
>
> 	sys/boot/common/util.[ch]:
>
> 		memcpy()
> 		memset()
> 		memcmp()
> 		bcpy()
> 		bzero()
> 		bcmp()
> 		strcmp()
> 		strncmp() [new]
> 		strcpy()
> 		strcat()
> 		strchr()
> 		strlen()
> 		printf()
>
> 	sys/boot/i386/common/cons.[ch]:
>
> 		ioctrl
> 		putc()
> 		xputc()
> 		putchar()
> 		getc()
> 		xgetc()
> 		keyhit() [now takes number of seconds as an argument]
> 		getstr()
>
> 	sys/boot/i386/common/drv.[ch]:
>
> 		struct dsk
> 		drvread()
> 		drvwrite() [new]
> 		drvsize() [new]
>
> 	sys/boot/common/crc32.[ch] [new]
>
> 	sys/boot/common/gpt.[ch] [new]
>
> - Teach gptboot and gptzfsboot about new files. I haven't touched the
>   rest, but there is still a lot of code duplication to be removed.
>
> - Implement full GPT support. Currently we just read primary header and
>   partition table and don't care about checksums, etc. With the patch we
>   verify checksums of primary header and primary partition table and if
>   there is a problem we fall back to backup header and backup partition
>   table.
>
> - Clean up most messages to use prefix of boot program, so in case of an
>   error we know where the error comes from, eg.:
>
> 	gptboot: unable to read primary GPT header
>
> - If we can't boot, print boot prompt only once and not every five
>   seconds.
>
> - Introduce three new GPT attributes:
>
> 	bootme     - this is bootable partition
> 	bootonce   - try to boot from this partition only once
> 	bootfailed - we failed to boot from this partition
>
> - Extend gpart to allow to manipulate new attributes:
>
> 	gpart set -a bootme -i 3 ada0
> 	gpart set -a bootonce -i 4 ada0
> 	gpart unset -a bootfailed -i 2 ada0
>
>   Note, that setting 'bootonce' attribute automatically sets 'bootme'
>   attribute.
>
> - Change boot order of gptboot to the following:
>
> 	1. Try to boot from all the partitions that have both 'bootme'
> 	   and 'bootonce' attributes one by one.
> 	2. Try to boot from all the partitions that have only 'bootme'
> 	   attribute one by one.
> 	3. If there are no partitions with 'bootme' attribute, boot from
> 	   the first UFS partition.
>
> - The 'bootonce' functionality is implemented in the following way:
>
> 	1. Walk through all the partitions and when 'bootonce'
> 	   attribute is found without 'bootme' attribute, remove
> 	   'bootonce' attribute and set 'bootfailed' attribute.
> 	   'bootonce' attribute alone means that we tried to boot from
> 	   this partition, but boot failed after leaving gptboot and
> 	   machine was restarted.
> 	2. Find partition with both 'bootme' and 'bootonce' attributes.
> 	3. Remove 'bootme' attribute.
> 	4. Try to execute /boot/loader or /boot/kernel/kernel from that
> 	   partition. If succeeded we stop here.
> 	5. If execution failed, remove 'bootonce' and set 'bootfailed'.
> 	6. Go to 2.
>
>   If whole boot succeeded there is new /etc/rc.d/gptboot script that
>   will log all partitions that we failed to boot from (the ones with
>   'bootfailed' attribute) and will remove this attribute. It will also
>   find partition with 'bootonce' attribute - this is the partition we
>   booted from successfully. The script will log success and remove the
>   attribute.
>
>   All the GPT updates we do here goes to both primary and backup GPT if
>   they are valid. We don't touch headers or partition tables when
>   checksum doesn't match.
>
> Any comments or suggestions? Be aware that at this point I'm soo full of
> boot loaders and I'm not looking for much more work in this area, so
> small tweaks are fine, but bigger things will have to wait until I can
> sleep at nights again. Well, there is still dedup support that waits to
> be implemented in gptzfsboot...

nextboot USED to work at the bootloader level, but it got broken^H^H^H^H^H^H^H changed by someone several years ago.
Ironport still use the old bootblock for that reason. It used to store the string for boot1 to use in the second block of the disk and boot0 would read it and write it back disabled using a bios command, so that the boot after that would not do it again if it failed. boot0 then passed it to boot1 in the stack to use. I did have a version that kept the boot string in a special partition. (of 1 block) Obviously what you are doing is much more fancy. From owner-freebsd-arch@FreeBSD.ORG Sat Sep 18 08:11:29 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C6D631065673 for ; Sat, 18 Sep 2010 08:11:29 +0000 (UTC) (envelope-from roam@ringlet.net) Received: from praag.hoster.bg (praag.hoster.bg [77.77.142.10]) by mx1.freebsd.org (Postfix) with ESMTP id 427E98FC14 for ; Sat, 18 Sep 2010 08:11:29 +0000 (UTC) Received: from middenheim.hoster.bg (middenheim.hoster.bg [77.77.142.11]) by praag.hoster.bg (Postfix) with ESMTP id 92B3F94018 for ; Sat, 18 Sep 2010 10:54:42 +0300 (EEST) Received: from straylight.ringlet.net (unknown [94.155.53.142]) (Authenticated sender: roam@hoster.bg) by mail.hoster.bg (Postfix) with ESMTP id 4D0E35C2C9 for ; Sat, 18 Sep 2010 10:54:39 +0300 (EEST) Received: from roam (uid 1000) (envelope-from roam@ringlet.net) id 41601e by straylight.ringlet.net (DragonFly Mail Agent) Sat, 18 Sep 2010 10:54:38 +0300 Date: Sat, 18 Sep 2010 10:54:38 +0300 From: Peter Pentchev To: mdf@FreeBSD.org Message-ID: <20100918075438.GA3739@straylight.ringlet.net> Mail-Followup-To: mdf@FreeBSD.org, FreeBSD Arch , Poul-Henning Kamp References: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="9jxsPFA5p3P2qPhR" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) X-MailScanner-ID: 4D0E35C2C9.283F0 X-hoster-MailScanner: Found to be clean 
X-hoster-MailScanner-SpamCheck: not spam, SpamAssassin (cached, score=0.001, required 10, autolearn=disabled, UNPARSEABLE_RELAY 0.00)
X-hoster-MailScanner-From: roam@ringlet.net
X-hoster-MailScanner-To: freebsd-arch@freebsd.org
X-Spam-Status: No
Cc: Poul-Henning Kamp , FreeBSD Arch
Subject: Re: Towards a One True Printf
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 18 Sep 2010 08:11:29 -0000

--9jxsPFA5p3P2qPhR
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Sep 17, 2010 at 03:00:21PM -0700, mdf@FreeBSD.org wrote:
[snip]
> With changes in hand, I wrote a small user-space utility to benchmark
> the existing fprintf and sprintf versus the new one. Note that the
> my_fprintf() function essentially borrows from the guts of
> printfcommon.h. This
>
> http://people.freebsd.org/~mdf/printf_test.c
>
> The numbers I get I found rather interesting (also, I appear to be
> incompetent at calculating standard deviations; I'm sure someone will
> correct my code).

Just a side note, but... have you tried outputting the raw numbers and
feeding them to ministat(1)? :)

G'luck,
Peter

-- 
Peter Pentchev	roam@space.bg roam@ringlet.net roam@FreeBSD.org
PGP key:	http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint	FDBA FD79 C26F 3C51 C95E DF9E ED18 B68D 1619 4553
Nostalgia ain't what it used to be.
--9jxsPFA5p3P2qPhR Content-Type: application/pgp-signature; name="signature.asc" Content-Description: Digital signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAEBCgAGBQJMlHA6AAoJEGUe77AlJ98T3KQQAMNjJxnG0xZSE8oGI781jHc5 wWfFMIv1MGtrbN4KLZmJKXleoAQN5umfJGyUcHpQz+lSLcR8bL/w1IfeGSVoCBoD 34rnpIs54H80W2iskiF1jESyoUf9Tzn9Q+3ht1snWwbdz3YEO2RFm16dYRc480QH Tm5yP6DONDmVxsL89y7QRacVor++vwXtWjzQxbzEoSqhzZ/Vueeplm1n9Q98KcBY 4teHqbSo80nSwVWKRrFpoEB7EVVQXpvGhwsSlVukV30ZlDX/tV1oLeEfAWcP7lcb 8tyOs6Djl0+BePC3ZxgKDTh7xPA18zHZmNMB8lm/i+FAkpFYdADGFsPOxG8nBHgE VIBQkuSlsgjEFB9zp9JEbKgXnpiO5JTxcx4xw3f3Z4HJdRCsN/qNBiUIMve/ggd0 laOxx3tbiAbAokh1EXkhQl6sMjskcHJZAbAgtyAAGwsHYnYvWNdpXEviOb80leVk pI0xRcfHoWh+QqUx6PHQTvi5DzD4qTN7dzvjY9U1x3IOdQ7fRz8uugbXKPacos2D eT6NE2z4PEfhHVPv/XkkpGhUdA6M2d2GtY94RA4UHT5JspFChKwqciT6lXcDnTDl Ynusoyd0LyxakSFDO3smLmgkOjlrxJc5yl+FMiWJGWV1KIMCm08fQNka2PksWmJL 829UpvTkBe6BE5fGfD14 =EBkD -----END PGP SIGNATURE----- --9jxsPFA5p3P2qPhR-- From owner-freebsd-arch@FreeBSD.ORG Sat Sep 18 09:41:36 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F2D471065672; Sat, 18 Sep 2010 09:41:36 +0000 (UTC) (envelope-from gljennjohn@googlemail.com) Received: from mail-bw0-f54.google.com (mail-bw0-f54.google.com [209.85.214.54]) by mx1.freebsd.org (Postfix) with ESMTP id 523C98FC0A; Sat, 18 Sep 2010 09:41:35 +0000 (UTC) Received: by bwz15 with SMTP id 15so4312695bwz.13 for ; Sat, 18 Sep 2010 02:41:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=domainkey-signature:received:received:date:from:to:cc:subject :message-id:in-reply-to:references:reply-to:x-mailer:mime-version :content-type:content-transfer-encoding; bh=ebAvEZH36jAmAPLGCLQnnGKRqXj2hzaz9tSbQnVqfS0=; b=j+OtozAATdTNQDyDFuBEG9uWTOhBQcP65JFD7KlL+Rs7IutBnetUzjASgSzg/fJDUg 
Rxu2D3/ibBjEtEiHZ90t20IyZ6Jz9Q2xpaSkPseFtP7VBnsLlKO09nHSpkuG0D1HqO/b HE8gAiIPySPlzbYTFHfidl1Kj0aMkocvtkADE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=googlemail.com; s=gamma; h=date:from:to:cc:subject:message-id:in-reply-to:references:reply-to :x-mailer:mime-version:content-type:content-transfer-encoding; b=lj2ORuKpMUADLiipDjqzzDWi6hyzv27jQl+0nmEV5x6gfJOgRloVJd5omrEwH1bNYn crAacWqBncNUBR0ZUI6CwhLKvJClFh75i3sVvWLx0E591rv7QfWFv5bc0y39fHdfdL2C 3Vw6X5HD3CNkf93rBnU5VR0ToVw8qmTciVTu8= Received: by 10.204.84.17 with SMTP id h17mr4805338bkl.101.1284801370642; Sat, 18 Sep 2010 02:16:10 -0700 (PDT) Received: from ernst.jennejohn.org (p578E351F.dip.t-dialin.net [87.142.53.31]) by mx.google.com with ESMTPS id f18sm4511030bkf.15.2010.09.18.02.16.08 (version=TLSv1/SSLv3 cipher=RC4-MD5); Sat, 18 Sep 2010 02:16:09 -0700 (PDT) Date: Sat, 18 Sep 2010 11:16:06 +0200 From: Gary Jennejohn To: John Baldwin Message-ID: <20100918111606.1c4390ff@ernst.jennejohn.org> In-Reply-To: <201009171123.39382.jhb@freebsd.org> References: <201009171123.39382.jhb@freebsd.org> X-Mailer: Claws Mail 3.7.6 (GTK+ 2.18.7; amd64-portbld-freebsd9.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org Subject: Re: Interrupt Threads X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: gljennjohn@googlemail.com List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 18 Sep 2010 09:41:37 -0000 On Fri, 17 Sep 2010 11:23:39 -0400 John Baldwin wrote: > The code currently lives in p4 at //depot/user/jhb/intr/... I have also put > up a patch at http://www.freebsd.org/~jhb/patches/intr_threads.patch. This > patch includes the changes to the igb(4) driver. > Doesn't compile without INVARIANTS because in line 928 of kern/kern_intr.c ihw is used, but its declaration is hidden behind #ifdef INVARIANTS. 
I just moved it outside the #ifdef to get it to compile, but I haven't tested
the resulting kernel yet, so I don't know whether that was the correct solution.

-- 
Gary Jennejohn