From owner-freebsd-arch@FreeBSD.ORG Sat Mar 8 09:46:01 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C2374106566B; Sat, 8 Mar 2008 09:46:01 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 7B81B8FC12; Sat, 8 Mar 2008 09:46:01 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m289juW5025373; Sat, 8 Mar 2008 04:45:57 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Fri, 7 Mar 2008 23:46:32 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: John Baldwin In-Reply-To: <20080307124038.I920@desktop> Message-ID: <20080307234452.U1091@desktop> References: <20080307020626.G920@desktop> <200803070842.37248.jhb@freebsd.org> <200803071016.46944.jhb@freebsd.org> <20080307124038.I920@desktop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Getting rid of the static msleep priority boost X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 08 Mar 2008 09:46:01 -0000 On Fri, 7 Mar 2008, Jeff Roberson wrote: > On Fri, 7 Mar 2008, John Baldwin wrote: > >> On Friday 07 March 2008 08:42:37 am John Baldwin wrote: >>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote: >>>> Hello, >>>> >>>> I've been studying some problems with recent scheduler improvements that >>>> help a lot on some workloads and hurt on others. I've tracked the >>>> problem down to static priority boosts handed out by >>>> msleep/cv_broadcastpri. The basic problem is that a user thread will be >>>> woken with a kernel priority thus allowing it to preempt a thread running >>>> on any processor with a lesser priority. The lesser priority thread may >>>> in fact hold some resource that the higher priority thread requires. >>>> Thus we context switch several times and perhaps go through priority >>>> propagation as well. >>>> >>>> I have verified that disabling these static priority boosts entirely >>>> fixes the performance problem I've run into on at least one workload. >>>> There are probably others that it helps and hopefully we can discover >>>> that. >>>> >>>> I'd like to know if anyone has a strong preference to keep this feature. >>>> It is likely that it helps in some interactive situations. I'm not sure >>>> how much however. I propose that we make a sysctl that disables it and >>>> turn it off by default. If we see complaints on current@ we can suggest >>>> that they toggle the sysctl to see if it alleviates problems. >>>> >>>> Based on feedback from that experiment and some testing we can then >>>> choose a few options: >>>> >>>> 1) Disable the static boosts entirely. Leave kernel priorities for >>>> kernel threads and priority propagation. Most other kernels do this. >>>> Would make my life in ULE much easier as well. >>>> >>>> 2) Leave the support for static boosts but remove it from all but a few >>>> key locations. Leaving it in the api would give some flexibility but >>>> might confuse developers. >>>> >>>> 3) Leave things as they are. undesirable. >>>> >>>> I'm leaning towards #2 based on the information I have presently. This >>>> is almost a significant change to historic BSD behavior so we might want >>>> to tread lightly. >>> >>> One thing to note is that we actually depend on the priority boost >>> (evilly) >>> to pick processes to swap out. (I think we check for <= PSOCK and don't >>> swap those out). One thing that I've wanted to happen for a while is that >>> the sleep priority for msleep() just be a parameter available to the >>> scheduler that the scheduler can use to calculate the real internal >>> priority rather than just being a set. That is, I imagine having: >>> >>> void sched_set_sleep_prio(struct thread *td, u_char pri); >>> u_char sched_get_sleep_prio(struct thread *td); >>> >>> (The swap check would use the get call). The 4BSD scheduler's >>> implementation of sched_set_sleep_prio would look like this: >>> >>> void >>> sched_set_sleep_prio(struct thread *td, u_char pri) >>> { >>> >>> td->td_sched->sleep_pri = pri; >>> sched_prio(td, pri); >>> } >>> >>> void >>> sched_userret(..) >>> { >>> >>> ... >>> td->td_sched->sleep_pri = 0; /* not in the kernel anymore */ >>> } >>> >>> but other schedulers may just save it and recalculate the priority where >>> the priority calculation just considers the sleep priority as one among >>> many factors. If nothing else, this allows it to be a scheduler decision >>> to ignore it (so 4BSD could continue to do what it does now, but ULE may >>> ignore it, or ignore certain levels, etc.) >> >> One thing to clarify: I'm not opposed to replacing the PSOCK check with >> something more suitable in the swap code, (in fact, that would be >> desirable), >> but it might take a good bit of work to do that and is probably easier to >> work on that as a separate change. I also think there can be some merit in >> having code paths hint to the scheduler the relative interactivity/priority >> of a sleep. > > Couple of notes.. > > The priority argument to sleep is a reasonable way for the code to hint at > the relative priority/interactivity. So that argues for leaving these > arguments in place and making them more advisory. I don't think we have to > change the api to take advantage of that. > > I'll look more closely for places like the swap that care about the absolute > priority of a process and see what I can come up with. Thanks for raising > that concern. > > I'd like to avoid apis that require the sched lock in seperate steps like > msleep does now to elevate the priority. So far all sched* apis require the > thread lock on enter and I'd hate to deviate from that norm. But another > option may be just to make a globally visible td_sleep_pri that doesn't > require the lock for write but does for read. The other option is to bubble > the argument down through the sleepq code and into sched_sleep() and > sched_wakeup(). I like that the best but it's the most api churn. http://people.freebsd.org/~jeff/sleeppri.diff What do you think of this? I added another parameter to sleepq_add() and sched_sleep(). So the scheduler is responsible for adjusting the priority. We could do the same thing for wakeup time adjustments like sleepq_broadcastpri() but we'd have to pass it through setrunnable() as well. I'd like to normalize the other pri arguments in sleepq to use the same 0 is not set vs -1 that msleep did. I realize that 0 is a valid priority but for practical purposes this makes things consistent and does not really restrict the api. > > Thanks, > Jeff > >> >> -- >> John Baldwin >> > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" >