Date: Mon, 10 Mar 2008 12:22:54 -1000 (HST) From: Jeff Roberson <jroberson@chesapeake.net> To: John Baldwin <jhb@freebsd.org> Cc: freebsd-arch@freebsd.org Subject: Re: Getting rid of the static msleep priority boost Message-ID: <20080310121527.F1091@desktop> In-Reply-To: <200803101313.03526.jhb@freebsd.org> References: <20080307020626.G920@desktop> <20080307124038.I920@desktop> <20080307234452.U1091@desktop> <200803101313.03526.jhb@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 10 Mar 2008, John Baldwin wrote: > On Saturday 08 March 2008 04:46:32 am Jeff Roberson wrote: >> On Fri, 7 Mar 2008, Jeff Roberson wrote: >> >>> On Fri, 7 Mar 2008, John Baldwin wrote: >>> >>>> On Friday 07 March 2008 08:42:37 am John Baldwin wrote: >>>>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote: >>>>>> Hello, >>>>>> >>>>>> I've been studying some problems with recent scheduler improvements > that >>>>>> help a lot on some workloads and hurt on others. I've tracked the >>>>>> problem down to static priority boosts handed out by >>>>>> msleep/cv_broadcastpri. The basic problem is that a user thread will > be >>>>>> woken with a kernel priority thus allowing it to preempt a thread > running >>>>>> on any processor with a lesser priority. The lesser priority thread > may >>>>>> in fact hold some resource that the higher priority thread requires. >>>>>> Thus we context switch several times and perhaps go through priority >>>>>> propagation as well. >>>>>> >>>>>> I have verified that disabling these static priority boosts entirely >>>>>> fixes the performance problem I've run into on at least one workload. >>>>>> There are probably others that it helps and hopefully we can discover >>>>>> that. >>>>>> >>>>>> I'd like to know if anyone has a strong preference to keep this > feature. >>>>>> It is likely that it helps in some interactive situations. I'm not > sure >>>>>> how much however. I propose that we make a sysctl that disables it and >>>>>> turn it off by default. If we see complaints on current@ we can > suggest >>>>>> that they toggle the sysctl to see if it alleviates problems. >>>>>> >>>>>> Based on feedback from that experiment and some testing we can then >>>>>> choose a few options: >>>>>> >>>>>> 1) Disable the static boosts entirely. Leave kernel priorities for >>>>>> kernel threads and priority propagation. Most other kernels do this. >>>>>> Would make my life in ULE much easier as well. >>>>>> >>>>>> 2) Leave the support for static boosts but remove it from all but a > few >>>>>> key locations. Leaving it in the api would give some flexibility but >>>>>> might confuse developers. >>>>>> >>>>>> 3) Leave things as they are. undesirable. >>>>>> >>>>>> I'm leaning towards #2 based on the information I have presently. This >>>>>> is almost a significant change to historic BSD behavior so we might > want >>>>>> to tread lightly. >>>>> >>>>> One thing to note is that we actually depend on the priority boost >>>>> (evilly) >>>>> to pick processes to swap out. (I think we check for <= PSOCK and don't >>>>> swap those out). One thing that I've wanted to happen for a while is > that >>>>> the sleep priority for msleep() just be a parameter available to the >>>>> scheduler that the scheduler can use to calculate the real internal >>>>> priority rather than just being a set. That is, I imagine having: >>>>> >>>>> void sched_set_sleep_prio(struct thread *td, u_char pri); >>>>> u_char sched_get_sleep_prio(struct thread *td); >>>>> >>>>> (The swap check would use the get call). The 4BSD scheduler's >>>>> implementation of sched_set_sleep_prio would look like this: >>>>> >>>>> void >>>>> sched_set_sleep_prio(struct thread *td, u_char pri) >>>>> { >>>>> >>>>> td->td_sched->sleep_pri = pri; >>>>> sched_prio(td, pri); >>>>> } >>>>> >>>>> void >>>>> sched_userret(..) >>>>> { >>>>> >>>>> ... >>>>> td->td_sched->sleep_pri = 0; /* not in the kernel anymore */ >>>>> } >>>>> >>>>> but other schedulers may just save it and recalculate the priority where >>>>> the priority calculation just considers the sleep priority as one among >>>>> many factors. If nothing else, this allows it to be a scheduler > decision >>>>> to ignore it (so 4BSD could continue to do what it does now, but ULE may >>>>> ignore it, or ignore certain levels, etc.) >>>> >>>> One thing to clarify: I'm not opposed to replacing the PSOCK check with >>>> something more suitable in the swap code, (in fact, that would be >>>> desirable), >>>> but it might take a good bit of work to do that and is probably easier to >>>> work on that as a separate change. I also think there can be some merit > in >>>> having code paths hint to the scheduler the relative > interactivity/priority >>>> of a sleep. >>> >>> Couple of notes.. >>> >>> The priority argument to sleep is a reasonable way for the code to hint at >>> the relative priority/interactivity. So that argues for leaving these >>> arguments in place and making them more advisory. I don't think we have > to >>> change the api to take advantage of that. >>> >>> I'll look more closely for places like the swap that care about the > absolute >>> priority of a process and see what I can come up with. Thanks for raising >>> that concern. >>> >>> I'd like to avoid apis that require the sched lock in seperate steps like >>> msleep does now to elevate the priority. So far all sched* apis require > the >>> thread lock on enter and I'd hate to deviate from that norm. But another >>> option may be just to make a globally visible td_sleep_pri that doesn't >>> require the lock for write but does for read. The other option is to > bubble >>> the argument down through the sleepq code and into sched_sleep() and >>> sched_wakeup(). I like that the best but it's the most api churn. >> >> http://people.freebsd.org/~jeff/sleeppri.diff >> >> What do you think of this? I added another parameter to sleepq_add() and >> sched_sleep(). So the scheduler is responsible for adjusting the >> priority. We could do the same thing for wakeup time adjustments like >> sleepq_broadcastpri() but we'd have to pass it through setrunnable() as >> well. > > The cv_broadcastpri() thing is a hack and I wish there was a better way to do > it. I.e., I don't like having wakeup setting the priority at all. I think > it's a good idea to pass this to sched_sleep(), but I'd rather leave > sched_sleep() where it is and pass the prio arg to the sleepq_wait() routines > instead so you don't get a bump unless you actually sleep. I think it's > probably a bug that we bump the prio on threads that may not sleep now. Ok, I preferred not to move sched_sleep() as well but I also didn't want to add those arguments to the stack everywhere. I'll do that however. > >> I'd like to normalize the other pri arguments in sleepq to use the same 0 >> is not set vs -1 that msleep did. I realize that 0 is a valid priority >> but for practical purposes this makes things consistent and does not >> really restrict the api. > > Sounds fine to me. I think we should even formally make 0 an invalid priority > (via a comment or something). Ok, I'll consider that. I'm just going to commit this when it's tested and working. It's simple enough I don't think it warrents further review. Thanks, Jeff > > -- > John Baldwin >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080310121527.F1091>