Date:      Sun, 22 Apr 2018 13:43:53 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        "freebsd-current@freebsd.org" <freebsd-current@freebsd.org>, "George Mitchell" <george+freebsd@m5p.com>, Peter <pmc@citylink.dinoex.sub.org>
Subject:   Re: SCHED_ULE makes 256Mbyte i386 unusable
Message-ID:  <YQBPR0101MB10421CFD2FA2C1A5356492CEDD8A0@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <20180422120241.GR6887@kib.kiev.ua>
References:  <YQBPR0101MB1042F252A539E8D55EB44585DD8B0@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM> <20180421201128.GO6887@kib.kiev.ua> <YQBPR0101MB10421529BB346952BCE7F20EDD8B0@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>, <20180422120241.GR6887@kib.kiev.ua>

Konstantin Belousov wrote:
>On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
>> Konstantin Belousov wrote:
>> >On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
>> >> I decided to start a new thread on current related to SCHED_ULE, since I see
>> >> more than just performance degradation and on a recent current kernel.
>> >> (I cc'd a couple of the people discussing performance problems in freebsd-stable
>> >>  recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".)
>> >>
>> >> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
>> >> current/head kernel, I would see about a 30% performance degradation (elapsed
>> >> run time for a kernel build over NFSv4.1) when the server kernel was built with
>> >> options SCHED_ULE
>> >> instead of
>> >> options SCHED_4BSD
So, now that I have decreased the number of nfsd kernel threads to 32, it works
with both schedulers and with essentially the same performance. (ie. The 30%
performance degradation has disappeared.)
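(In case anyone wants to reproduce the setup: the thread count is just what nfsd
is started with via its -n argument, so something like the following in
/etc/rc.conf on the server should give the 32-thread configuration I'm running
now. The other flags here are only an example; adjust them for your own setup.)

    nfs_server_enable="YES"
    nfs_server_flags="-u -t -n 32"   # -n sets the number of nfsd kernel threads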

>> >>
>> >> Now, with a kernel from a couple of days ago, the
>> >> options SCHED_ULE
>> >> kernel becomes unusable shortly after starting testing.
>> >> I have seen two variants of this:
>> >> - Became essentially hung. All I could do was ping the machine from the network.
>> >> - Reported "vm_thread_new: kstack allocation failed"
>> >>   and then any attempt to do anything gets "No more processes".
>> >This is strange.  It usually means that you get KVA either exhausted or
>> >severely fragmented.
>> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
>> kernel is working ok now. I haven't done enough to compare performance yet.
>> Maybe I'll post again when I have some numbers.
>>
>> >Enter ddb, it should be operational since pings are replied.  Try to see
>> >where the threads are stuck.
>> I didn't do this, since reducing the number of kernel threads seems to have fixed
>> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
>> threads to do proxies to the mirrored DS servers.
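(For anyone else who hits this: if the machine wedges like that again and the
kernel was built with options DDB, I believe the useful incantation at the db>
prompt is roughly the following, although I haven't verified it for this case.)

    db> ps               # list processes/threads and their wait channels
    db> alltrace         # stack traces for all threads
    db> trace <pid>      # backtrace for one suspect process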
>>
>> >> with the only difference being a kernel built with
>> >> options SCHED_4BSD
>> >> everything works and performs the same as the Dec 2017 kernel.
>> >>
>> >> I can try rolling back through the revisions, but it would be nice if someone
>> >> could suggest where to start, because it takes a couple of hours to build a
>> >> kernel on this system.
>> >>
>> >> So, something has made things worse for a head/current kernel this winter, rick
>> >
>> >There are at least two potentially relevant changes.
>> >
>> >First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
>> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
W.r.t. Rodney Grimes' comments about this (which didn't end up in these messages
in the thread):
I didn't see any instability when using KSTACK_PAGES=4 for this until this cropped
up and seemed to be scheduler related (but not really, it seems).
I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
Server code.
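(For anyone following along, that's just the kernel config option; the relevant
lines in my test configs look something like this:)

    options         KSTACK_PAGES=4   # r326758 made 4 the i386 default
    options         SCHED_ULE        # or options SCHED_4BSD for the other test kernel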

Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
item getting allocated on the stack, but many moderate sized ones.
(A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
 that NFS needs to use. I don't think these are large enough to justify malloc/free,
 but it has to use several of them.)

One case I did try fixing was about 6 cases where "struct nfsstate" ended up on
the stack. I changed the code to malloc/free them and then, when testing, to
my surprise I had a 20% performance hit and shelved the patch.
Now that I know that the server was running near its limit, I might try this one
again, to see if the performance hit doesn't occur when the machine has adequate
memory. If the performance hit goes away, I could commit this, but it wouldn't
have that much effect on the kstack usage. (It's interesting how this patch ended
up related to the issue this thread discussed.)
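(Just to make it concrete, the patch is basically the usual stack-to-heap
conversion with malloc(9). The struct and function names below are made up for
illustration, not the actual nfsstate code.)

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>

    MALLOC_DEFINE(M_EXSTATE, "exstate", "hypothetical example state");

    struct example_state {              /* stand-in for something like nfsstate */
            uint32_t        flags;
            char            owner[64];
    };

    /* Before: the structure lives on the kernel stack and eats kstack space. */
    static void
    do_op_onstack(void)
    {
            struct example_state st;

            bzero(&st, sizeof(st));
            /* ... fill in and use st ... */
    }

    /*
     * After: allocate with malloc(9), trading kstack usage for a malloc/free
     * pair on every call (the suspected source of the 20% hit).
     */
    static void
    do_op_malloc(void)
    {
            struct example_state *st;

            st = malloc(sizeof(*st), M_EXSTATE, M_WAITOK | M_ZERO);
            /* ... fill in and use *st ... */
            free(st, M_EXSTATE);
    }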

>>
>> >Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
>> Could this change have resulted in the system being able to allocate fewer
>> kernel threads/stacks for some reason?
>Well, it could, as anything can be buggy. But the intent of the change
>was to give 4G KVA, and it did.
Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
(see performance issue that went away, noted above) and any change could
have pushed it across the line, I think.

>>
>> >Consequences of the first one are obvious, it is much harder to find
>> >the place to map the stack.  Second change, on the other hand, provides
>> >almost full 4G for KVA and should have mostly compensated for the negative
>> >effects of the first.
>> >
>> >And, I cannot see how changing the scheduler would fix or even affect that
>> >behaviour.
>> My hunch is that the system was running near its limit for kernel threads/stacks.
>> Then, somehow, the timing caused by SCHED_ULE resulted in the nfsd trying to get
>> to a higher peak number of threads and hitting the limit.
>> SCHED_4BSD happened to result in timing such that it stayed just below the
>> limit and worked.
>> I can think of a couple of things that might affect this:
>> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
>>       they wouldn't terminate and release their resources before more new ones
>>       are spawned.
>Scheduler has nothing to do with the threads termination.  It might
>select running threads in a way that causes the undesired pattern to
>appear which might create some amount of backlog for termination, but
>I doubt it.
>
>> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
>>       could try and spawn more mirror DS worker threads at about the same time.
>>
>> Anyhow, thanks for the help, rick

Have a good day, rick


