Date: Sun, 22 Apr 2018 13:43:53 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: "freebsd-current@freebsd.org" <freebsd-current@freebsd.org>,
    "George Mitchell" <george+freebsd@m5p.com>,
    Peter <pmc@citylink.dinoex.sub.org>
Subject: Re: SCHED_ULE makes 256Mbyte i386 unusable
Message-ID: <YQBPR0101MB10421CFD2FA2C1A5356492CEDD8A0@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <20180422120241.GR6887@kib.kiev.ua>
References: <YQBPR0101MB1042F252A539E8D55EB44585DD8B0@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>
    <20180421201128.GO6887@kib.kiev.ua>
    <YQBPR0101MB10421529BB346952BCE7F20EDD8B0@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>
    <20180422120241.GR6887@kib.kiev.ua>
Konstantin Belousov wrote:
>On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
>> Konstantin Belousov wrote:
>> >On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
>> >> I decided to start a new thread on current related to SCHED_ULE, since I see
>> >> more than just performance degradation and on a recent current kernel.
>> >> (I cc'd a couple of the people discussing performance problems in freebsd-stable
>> >> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".)
>> >>
>> >> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
>> >> current/head kernel, I would see about a 30% performance degradation (elapsed
>> >> run time for a kernel build over NFSv4.1) when the server kernel was built with
>> >> options SCHED_ULE
>> >> instead of
>> >> options SCHED_4BSD
So, now that I have decreased the number of nfsd kernel threads to 32, it works
with both schedulers and with essentially the same performance. (ie. the 30%
performance degradation has disappeared.)
>> >>
>> >> Now, with a kernel from a couple of days ago, the
>> >> options SCHED_ULE
>> >> kernel becomes unusable shortly after starting testing.
>> >> I have seen two variants of this:
>> >> - Became essentially hung. All I could do was ping the machine from the network.
>> >> - Reported "vm_thread_new: kstack allocation failed",
>> >>   and then any attempt to do anything gets "No more processes".
>> >This is strange. It usually means that you get KVA either exhausted or
>> >severely fragmented.
>> Yes. I reduced the number of nfsd threads from 256 to 32 and the SCHED_ULE
>> kernel is working ok now. I haven't done enough to compare performance yet.
>> Maybe I'll post again when I have some numbers.
>>
>> >Enter ddb, it should be operational since pings are replied. Try to see
>> >where the threads are stuck.
>> I didn't do this, since reducing the number of kernel threads seems to have
>> fixed the problem.
>> For the pNFS server, the nfsd threads will spawn additional kernel
>> threads to do proxies to the mirrored DS servers.
>>
>> >> with the only difference being a kernel built with
>> >> options SCHED_4BSD
>> >> everything works and performs the same as the Dec 2017 kernel.
>> >>
>> >> I can try rolling back through the revisions, but it would be nice if someone
>> >> could suggest where to start, because it takes a couple of hours to build a
>> >> kernel on this system.
>> >>
>> >> So, something has made things worse for a head/current kernel this winter, rick
>> >
>> >There are at least two potentially relevant changes.
>> >
>> >First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
>> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
W.r.t. Rodney Grimes' comments about this (which didn't end up in the messages
in this thread): I didn't see any instability when using KSTACK_PAGES=4 until
this cropped up and seemed to be scheduler related (but not really, it seems).
I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
Server code.
Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
item getting allocated on the stack, but many moderate sized ones.
(A part of it is multiple instances of "struct vattr", some buried in "struct
nfsvattr", that NFS needs to use. I don't think these are large enough to
justify malloc/free, but it has to use several of them.)
One case I did try fixing was about 6 cases where "struct nfsstate" ended up
on the stack. I changed the code to malloc/free them and then, when testing, to
my surprise I had a 20% performance hit and shelved the patch.
Now that I know that the server was running near its limit, I might try this
one again, to see if the performance hit doesn't occur when the machine has
adequate memory.
If the performance hit goes away, I could commit this, but it wouldn't have
that much effect on the kstack usage. (It's interesting how this patch ended
up related to the issue this thread discussed.)
>>
>> >Second is r332489 Apr 13, which introduced the 4/4G KVA/UVA split.
>> Could this change have resulted in the system being able to allocate fewer
>> kernel threads/stacks for some reason?
>Well, it could, as anything can be buggy. But the intent of the change
>was to give 4G KVA, and it did.
Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
(see the performance issue that went away, noted above) and any change could
have pushed it across the line, I think.
>>
>> >Consequences of the first one are obvious, it is much harder to find
>> >the place to map the stack. Second change, on the other hand, provides
>> >almost full 4G for KVA and should have mostly compensated for the negative
>> >effects of the first.
>> >
>> >And, I cannot see how changing the scheduler would fix or even affect that
>> >behaviour.
>> My hunch is that the system was running near its limit for kernel
>> threads/stacks.
>> Then, somehow, the timing caused by SCHED_ULE resulted in the nfsd trying to
>> get to a higher peak number of threads and hit the limit.
>> SCHED_4BSD happened to result in timing such that it stayed just below the
>> limit and worked.
>> I can think of a couple of things that might affect this:
>> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly,
>>     then they wouldn't terminate and release their resources before more
>>     new ones are spawned.
>The scheduler has nothing to do with thread termination. It might
>select running threads in a way that causes the undesired pattern to
>appear, which might create some amount of backlog for termination, but
>I doubt it.
>
>> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the
>>     burst could try and spawn more mirror DS worker threads at about the
>>     same time.
>>
>> Anyhow, thanks for the help, rick
Have a good day, rick