From: "Rodney W. Grimes"
Subject: Re: SCHED_ULE makes 256Mbyte i386 unusable
To: Rick Macklem
CC: Konstantin Belousov, "freebsd-current@freebsd.org", George Mitchell, Peter
Date: Sun, 22 Apr 2018 07:36:09 -0700 (PDT)

> Konstantin Belousov wrote:
> >On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
> >> Konstantin Belousov wrote:
> >> >On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
> >> >> I decided to start a new thread on current related to SCHED_ULE, since I see
> >> >> more than just performance degradation, and on a recent current kernel.
> >> >> (I cc'd a couple of the people discussing performance problems in freebsd-stable
> >> >> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".)
> >> >>
> >> >> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
> >> >> current/head kernel, I would see about a 30% performance degradation (elapsed
> >> >> run time for a kernel build over NFSv4.1) when the server kernel was built with
> >> >> options SCHED_ULE
> >> >> instead of
> >> >> options SCHED_4BSD
> So, now that I have decreased the number of nfsd kernel threads to 32, it works
> with both schedulers and with essentially the same performance. (i.e., the 30%
> performance degradation has disappeared.)
>
> >> >>
> >> >> Now, with a kernel from a couple of days ago, the
> >> >> options SCHED_ULE
> >> >> kernel becomes unusable shortly after starting testing.
> >> >> I have seen two variants of this:
> >> >> - Became essentially hung. All I could do was ping the machine from the network.
> >> >> - Reported "vm_thread_new: kstack allocation failed"
> >> >> and then any attempt to do anything gets "No more processes".
> >> >This is strange. It usually means that you get KVA either exhausted or
> >> >severely fragmented.
> >> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
> >> kernel is working ok now. I haven't done enough to compare performance yet.
> >> Maybe I'll post again when I have some numbers.
> >>
> >> >Enter ddb, it should be operational since pings are replied. Try to see
> >> >where the threads are stuck.
> >> I didn't do this, since reducing the number of kernel threads seems to have fixed
> >> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
> >> threads to do proxies to the mirrored DS servers.
> >>
> >> >> with the only difference being a kernel built with
> >> >> options SCHED_4BSD
> >> >> everything works and performs the same as the Dec 2017 kernel.
> >> >>
> >> >> I can try rolling back through the revisions, but it would be nice if someone
> >> >> could suggest where to start, because it takes a couple of hours to build a
> >> >> kernel on this system.
> >> >>
> >> >> So, something has made things worse for a head/current kernel this winter, rick
> >> >
> >> >There are at least two potentially relevant changes.
> >> >
> >> >First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
> >> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
> W.r.t. Rodney Grimes' comments about this (which didn't end up in the messages
> in this thread):
> I didn't see any instability when using KSTACK_PAGES=4 for this until this cropped
> up and seemed to be scheduler related (but not really, it seems).
> I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
> Server code.
>
> Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
> item getting allocated on the stack, but many moderate sized ones.
> (A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
> that NFS needs to use. I don't think these are large enough to justify malloc/free,
> but it has to use several of them.)
>
> One case I did try fixing was about 6 cases where "struct nfsstate" ended up on
> the stack. I changed the code to malloc/free them and then, when testing, to
> my surprise I had a 20% performance hit and shelved the patch.
> Now that I know that the server was running near its limit, I might try this one
> again, to see whether the performance hit still occurs when the machine has adequate
> memory. If the performance hit goes away, I could commit this, but it wouldn't
> have that much effect on the kstack usage. (It's interesting how this patch ended
> up related to the issue this thread discussed.)
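For reference, the kind of change described in that last quoted paragraph, moving a
moderately sized structure such as "struct nfsstate" off the kernel stack and onto the
kernel heap with malloc(9)/free(9), looks roughly like the sketch below. The structure
contents, the function names, and the M_TEMP malloc type are placeholders for
illustration, not the actual NFS patch:

/*
 * Sketch only: "struct example_state" stands in for something like
 * struct nfsstate, and both functions are hypothetical.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>

struct example_state {
        char    es_owner[64];
        int     es_flags;
        /* ... */
};

/* Before: the structure consumes space on the small, fixed-size kernel stack. */
static void
example_op_stack(void)
{
        struct example_state st;

        bzero(&st, sizeof(st));
        /* ... use st ... */
}

/* After: the structure is allocated from the kernel heap instead. */
static void
example_op_heap(void)
{
        struct example_state *stp;

        stp = malloc(sizeof(*stp), M_TEMP, M_WAITOK | M_ZERO);
        /* ... use *stp ... */
        free(stp, M_TEMP);
}

Since an M_WAITOK allocation can sleep when memory is tight, some extra overhead on a
256MB machine running near its limit would not be surprising, which may be part of the
20% hit mentioned above.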
Anything we can do to help relieve KSTACK usage, especially on i386, is helpful.
There is a thread from quite some time back where someone came up with a compile
time static "this function uses X bytes of local stack" report, and a bit of cleanup
was done. We should pursue this issue further.

My experience with the i386/KSTACK issues was attempting to do installs from
snapshot .iso's; I usually had to change to a custom kernel without INVARIANTS
and WITNESS, or reduce KSTACK_PAGES to 2 and suffer the small stack problem (i.e.,
don't use NFS during the install). Neither was very pleasant.

I have found it impractical to run the 4 page KSTACK in production VMs using
i386 due to memory requirements. I run many very lean i386 VMs with 64MB of
memory. I suspect our user base also has many people doing this, and it would
be to our advantage to try to reduce our kernel stack needs.
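Much of that compile-time accounting can already be had from the toolchain: both gcc
and clang accept -Wframe-larger-than=<bytes>, which warns about every function whose
stack frame exceeds the given size (for a kernel build the flag can be added to the
compiler flags, for example via COPTFLAGS in /etc/make.conf), and gcc additionally
offers -fstack-usage, which writes a per-function stack usage report to a .su file.
A toy illustration of what the warning catches; the function name and the 1024-byte
threshold are arbitrary examples, not an existing kernel policy:

/*
 * Compiled with -Wframe-larger-than=1024, this function draws a warning
 * because its local buffer pushes the stack frame past the threshold.
 */
int
big_frame_example(void)
{
        char buf[2048];         /* deliberately larger than the 1024-byte limit */

        buf[0] = '\0';
        return (buf[0]);
}

Applied across the whole kernel, warnings like that amount to the per-function
"uses X bytes of local stack" report described above, and they surface stack
regressions at build time instead of as kstack allocation failures at run time.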
> >> >Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
> >> Could this change have resulted in the system being able to allocate fewer
> >> kernel threads/stacks for some reason?
> >Well, it could, as anything can be buggy. But the intent of the change
> >was to give 4G KVA, and it did.
> Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
> (see the performance issue that went away, noted above) and any change could
> have pushed it across the line, I think.
>
> >>
> >> >Consequences of the first one are obvious: it is much harder to find
> >> >the place to map the stack. The second change, on the other hand, provides
> >> >almost the full 4G for KVA and should have mostly compensated for the negative
> >> >effects of the first.
> >> >
> >> >And, I cannot see how changing the scheduler would fix or even affect that
> >> >behaviour.
> >> My hunch is that the system was running near its limit for kernel threads/stacks.
> >> Then, somehow, the timing caused by SCHED_ULE resulted in the nfsd trying to reach
> >> a higher peak number of threads and hitting the limit.
> >> SCHED_4BSD happened to result in timing such that it stayed just below the
> >> limit and worked.
> >> I can think of a couple of things that might affect this:
> >> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
> >> they wouldn't terminate and release their resources before more new ones
> >> are spawned.
> >The scheduler has nothing to do with thread termination. It might
> >select running threads in a way that causes the undesired pattern to
> >appear, which might create some amount of backlog for termination, but
> >I doubt it.
> >
> >> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
> >> could try to spawn more mirror DS worker threads at about the same time.
> >>
> >> Anyhow, thanks for the help, rick
>
> Have a good day, rick

-- 
Rod Grimes                                                 rgrimes@freebsd.org