From owner-freebsd-arch@FreeBSD.ORG Mon Mar 10 17:50:47 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4A4A21065671 for ; Mon, 10 Mar 2008 17:50:47 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id E1F9E8FC31 for ; Mon, 10 Mar 2008 17:50:46 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 234968788-1834499 for multiple; Mon, 10 Mar 2008 13:51:55 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2AHoCCn087969; Mon, 10 Mar 2008 13:50:12 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Jeff Roberson Date: Mon, 10 Mar 2008 13:13:03 -0400 User-Agent: KMail/1.9.7 References: <20080307020626.G920@desktop> <20080307124038.I920@desktop> <20080307234452.U1091@desktop> In-Reply-To: <20080307234452.U1091@desktop> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803101313.03526.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Mon, 10 Mar 2008 13:50:13 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6192/Mon Mar 10 10:54:00 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: freebsd-arch@freebsd.org Subject: Re: Getting rid of the static msleep priority boost X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Mar 2008 17:50:47 -0000 On Saturday 08 March 2008 04:46:32 am Jeff Roberson wrote: > On Fri, 7 Mar 2008, Jeff Roberson wrote: > > > On Fri, 7 Mar 2008, John Baldwin wrote: > > > >> On Friday 07 March 2008 08:42:37 am John Baldwin wrote: > >>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote: > >>>> Hello, > >>>> > >>>> I've been studying some problems with recent scheduler improvements that > >>>> help a lot on some workloads and hurt on others. I've tracked the > >>>> problem down to static priority boosts handed out by > >>>> msleep/cv_broadcastpri. The basic problem is that a user thread will be > >>>> woken with a kernel priority thus allowing it to preempt a thread running > >>>> on any processor with a lesser priority. The lesser priority thread may > >>>> in fact hold some resource that the higher priority thread requires. > >>>> Thus we context switch several times and perhaps go through priority > >>>> propagation as well. > >>>> > >>>> I have verified that disabling these static priority boosts entirely > >>>> fixes the performance problem I've run into on at least one workload. > >>>> There are probably others that it helps and hopefully we can discover > >>>> that. > >>>> > >>>> I'd like to know if anyone has a strong preference to keep this feature. > >>>> It is likely that it helps in some interactive situations. I'm not sure > >>>> how much however. I propose that we make a sysctl that disables it and > >>>> turn it off by default. If we see complaints on current@ we can suggest > >>>> that they toggle the sysctl to see if it alleviates problems. > >>>> > >>>> Based on feedback from that experiment and some testing we can then > >>>> choose a few options: > >>>> > >>>> 1) Disable the static boosts entirely. Leave kernel priorities for > >>>> kernel threads and priority propagation. Most other kernels do this. > >>>> Would make my life in ULE much easier as well. > >>>> > >>>> 2) Leave the support for static boosts but remove it from all but a few > >>>> key locations. Leaving it in the api would give some flexibility but > >>>> might confuse developers. > >>>> > >>>> 3) Leave things as they are. undesirable. > >>>> > >>>> I'm leaning towards #2 based on the information I have presently. This > >>>> is almost a significant change to historic BSD behavior so we might want > >>>> to tread lightly. > >>> > >>> One thing to note is that we actually depend on the priority boost > >>> (evilly) > >>> to pick processes to swap out. (I think we check for <= PSOCK and don't > >>> swap those out). One thing that I've wanted to happen for a while is that > >>> the sleep priority for msleep() just be a parameter available to the > >>> scheduler that the scheduler can use to calculate the real internal > >>> priority rather than just being a set. That is, I imagine having: > >>> > >>> void sched_set_sleep_prio(struct thread *td, u_char pri); > >>> u_char sched_get_sleep_prio(struct thread *td); > >>> > >>> (The swap check would use the get call). The 4BSD scheduler's > >>> implementation of sched_set_sleep_prio would look like this: > >>> > >>> void > >>> sched_set_sleep_prio(struct thread *td, u_char pri) > >>> { > >>> > >>> td->td_sched->sleep_pri = pri; > >>> sched_prio(td, pri); > >>> } > >>> > >>> void > >>> sched_userret(..) > >>> { > >>> > >>> ... > >>> td->td_sched->sleep_pri = 0; /* not in the kernel anymore */ > >>> } > >>> > >>> but other schedulers may just save it and recalculate the priority where > >>> the priority calculation just considers the sleep priority as one among > >>> many factors. If nothing else, this allows it to be a scheduler decision > >>> to ignore it (so 4BSD could continue to do what it does now, but ULE may > >>> ignore it, or ignore certain levels, etc.) > >> > >> One thing to clarify: I'm not opposed to replacing the PSOCK check with > >> something more suitable in the swap code, (in fact, that would be > >> desirable), > >> but it might take a good bit of work to do that and is probably easier to > >> work on that as a separate change. I also think there can be some merit in > >> having code paths hint to the scheduler the relative interactivity/priority > >> of a sleep. > > > > Couple of notes.. > > > > The priority argument to sleep is a reasonable way for the code to hint at > > the relative priority/interactivity. So that argues for leaving these > > arguments in place and making them more advisory. I don't think we have to > > change the api to take advantage of that. > > > > I'll look more closely for places like the swap that care about the absolute > > priority of a process and see what I can come up with. Thanks for raising > > that concern. > > > > I'd like to avoid apis that require the sched lock in seperate steps like > > msleep does now to elevate the priority. So far all sched* apis require the > > thread lock on enter and I'd hate to deviate from that norm. But another > > option may be just to make a globally visible td_sleep_pri that doesn't > > require the lock for write but does for read. The other option is to bubble > > the argument down through the sleepq code and into sched_sleep() and > > sched_wakeup(). I like that the best but it's the most api churn. > > http://people.freebsd.org/~jeff/sleeppri.diff > > What do you think of this? I added another parameter to sleepq_add() and > sched_sleep(). So the scheduler is responsible for adjusting the > priority. We could do the same thing for wakeup time adjustments like > sleepq_broadcastpri() but we'd have to pass it through setrunnable() as > well. The cv_broadcastpri() thing is a hack and I wish there was a better way to do it. I.e., I don't like having wakeup setting the priority at all. I think it's a good idea to pass this to sched_sleep(), but I'd rather leave sched_sleep() where it is and pass the prio arg to the sleepq_wait() routines instead so you don't get a bump unless you actually sleep. I think it's probably a bug that we bump the prio on threads that may not sleep now. > I'd like to normalize the other pri arguments in sleepq to use the same 0 > is not set vs -1 that msleep did. I realize that 0 is a valid priority > but for practical purposes this makes things consistent and does not > really restrict the api. Sounds fine to me. I think we should even formally make 0 an invalid priority (via a comment or something). -- John Baldwin