Date: Fri, 27 Sep 2019 13:52:58 -0700
From: Mark Millard <marklmi@yahoo.com>
To: Mark Johnston <markj@FreeBSD.org>
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Message-ID: <08CA4DA1-131C-4B14-BB57-EAA22A8CD5D9@yahoo.com>
In-Reply-To: <20190927192434.GA93180@raichu>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com> <20190925170255.GA43643@raichu> <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com> <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com> <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com> <20190926202936.GD5581@raichu> <2DE123BE-B0F8-43F6-B950-F41CF0DEC8AD@yahoo.com> <6BC5F6BE-5FC3-48FA-9873-B20141FEFDF5@yahoo.com> <20190927192434.GA93180@raichu>
On 2019-Sep-27, at 12:24, Mark Johnston <markj at FreeBSD.org> wrote:

> On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
>>
>> On 2019-Sep-26, at 17:05, Mark Millard <marklmi at yahoo.com> wrote:
>>
>>> On 2019-Sep-26, at 13:29, Mark Johnston <markj at FreeBSD.org> wrote:
>>>> One possibility is that these are kernel memory allocations occurring
>>>> in the context of the benchmark threads. Such allocations may not
>>>> respect the configured policy since they are not private to the
>>>> allocating thread. For instance, upon opening a file, the kernel may
>>>> allocate a vnode structure for that file. That vnode may be accessed
>>>> by threads from many processes over its lifetime, and may be recycled
>>>> many times before its memory is released back to the allocator.
>>>
>>> For -l0-15 -n prefer:1 :
>>>
>>> Looks like this reports sys_thr_new activity, sys_cpuset
>>> activity, and 0xffffffff80bc09bd activity (whatever that
>>> is). Mostly sys_thr_new activity, over 1300 of them . . .
>>>
>>> dtrace: pid 13553 has exited
>>>
>>>   kernel`uma_small_alloc+0x61
>>>   kernel`keg_alloc_slab+0x10b
>>>   kernel`zone_import+0x1d2
>>>   kernel`uma_zalloc_arg+0x62b
>>>   kernel`thread_init+0x22
>>>   kernel`keg_alloc_slab+0x259
>>>   kernel`zone_import+0x1d2
>>>   kernel`uma_zalloc_arg+0x62b
>>>   kernel`thread_alloc+0x23
>>>   kernel`thread_create+0x13a
>>>   kernel`sys_thr_new+0xd2
>>>   kernel`amd64_syscall+0x3ae
>>>   kernel`0xffffffff811b7600
>>>     2
>>>
>>>   kernel`uma_small_alloc+0x61
>>>   kernel`keg_alloc_slab+0x10b
>>>   kernel`zone_import+0x1d2
>>>   kernel`uma_zalloc_arg+0x62b
>>>   kernel`cpuset_setproc+0x65
>>>   kernel`sys_cpuset+0x123
>>>   kernel`amd64_syscall+0x3ae
>>>   kernel`0xffffffff811b7600
>>>     2
>>>
>>>   kernel`uma_small_alloc+0x61
>>>   kernel`keg_alloc_slab+0x10b
>>>   kernel`zone_import+0x1d2
>>>   kernel`uma_zalloc_arg+0x62b
>>>   kernel`uma_zfree_arg+0x36a
>>>   kernel`thread_reap+0x106
>>>   kernel`thread_alloc+0xf
>>>   kernel`thread_create+0x13a
>>>   kernel`sys_thr_new+0xd2
>>>   kernel`amd64_syscall+0x3ae
>>>   kernel`0xffffffff811b7600
>>>     6
>>>
>>>   kernel`uma_small_alloc+0x61
>>>   kernel`keg_alloc_slab+0x10b
>>>   kernel`zone_import+0x1d2
>>>   kernel`uma_zalloc_arg+0x62b
>>>   kernel`uma_zfree_arg+0x36a
>>>   kernel`vm_map_process_deferred+0x8c
>>>   kernel`vm_map_remove+0x11d
>>>   kernel`vmspace_exit+0xd3
>>>   kernel`exit1+0x5a9
>>>   kernel`0xffffffff80bc09bd
>>>   kernel`amd64_syscall+0x3ae
>>>   kernel`0xffffffff811b7600
>>>     6
>>>
>>>   kernel`uma_small_alloc+0x61
>>>   kernel`keg_alloc_slab+0x10b
>>>   kernel`zone_import+0x1d2
>>>   kernel`uma_zalloc_arg+0x62b
>>>   kernel`thread_alloc+0x23
>>>   kernel`thread_create+0x13a
>>>   kernel`sys_thr_new+0xd2
>>>   kernel`amd64_syscall+0x3ae
>>>   kernel`0xffffffff811b7600
>>>     22
>>>
>>>   kernel`vm_page_grab_pages+0x1b4
>>>   kernel`vm_thread_stack_create+0xc0
>>>   kernel`kstack_import+0x52
>>>   kernel`uma_zalloc_arg+0x62b
>>>   kernel`vm_thread_new+0x4d
>>>   kernel`thread_alloc+0x31
>>>   kernel`thread_create+0x13a
>>>   kernel`sys_thr_new+0xd2
>>>   kernel`amd64_syscall+0x3ae
>>>   kernel`0xffffffff811b7600
>>>     1324
>>
>> With sys_thr_new not respecting -n prefer:1 for
>> -l0-15 (especially for the thread stacks), I
>> looked some at the generated integration kernel
>> code, and it makes significant use of %rsp-based
>> memory accesses (read and write).
>>
>> That would get both memory controllers going in
>> parallel (the integration kernel's vector accesses
>> go to the preferred memory domain), so things would
>> not slow down as expected.
>>
>> If round-robin is not respected for thread stacks,
>> and if threads migrate cpus across memory domains
>> at times, there could be considerable variability
>> for that context as well. (This may not be the
>> only way to have different/extra variability for
>> this context.)
>>
>> Overall: I'd be surprised if this was not
>> contributing to what I thought was odd about
>> the benchmark results.
>
> Your tracing refers to kernel thread stacks though, not the stacks used
> by threads when executing in user mode. My understanding is that a HINT
> implementation would spend virtually all of its time in user mode, so it
> shouldn't matter much or at all if kernel thread stacks are backed by
> memory from the "wrong" domain.

Looks like I was trying to think about it when I should have been
sleeping. You are correct.

> This also doesn't really explain some of the disparities in the plots
> you sent me. For instance, you get a much higher peak QUIS on FreeBSD
> than on Fedora with 16 threads and an interleave/round-robin domain
> selection policy.

True. I suppose there is the possibility that steady_clock's now()
results are odd for some reason in this type of context, leading to the
durations measured between such calls being on the short side where
things look different.
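The sort of measurement in question is roughly of the following shape
(a minimal sketch with placeholder names and a placeholder workload,
not the actual benchmark source): the reported duration comes from two
steady_clock::now() calls around the integration kernel, so an odd
now() would rescale the figure directly.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Placeholder workload standing in for the HINT integration kernel;
// the real kernel is more involved.
static double integration_kernel(const std::vector<double> &v)
{
    double sum = 0.0;
    for (double x : v)
        sum += x;
    return sum;
}

// One single-thread trial for a given vector size: only the kernel
// itself sits between the two now() calls, so no thread creation or
// other setup is counted, for any size.
static double timed_run(std::size_t vector_elements)
{
    std::vector<double> v(vector_elements, 1.0); // the memory under test
    const auto t0 = std::chrono::steady_clock::now();
    const double result = integration_kernel(v);
    const auto t1 = std::chrono::steady_clock::now();
    (void)result;
    return std::chrono::duration<double>(t1 - t0).count(); // seconds
}

int main()
{
    std::printf("%.9f s\n", timed_run(std::size_t(1) << 20));
    return 0;
}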
But the left-hand side of the single-thread results (smaller memory
sizes for the vectors the integration kernel uses) does not show such a
rescaling. (The single-thread time measurements are taken strictly
inside the thread of execution; no thread creation or the like is
counted for any size.) The right-hand side of the single-thread results
(larger memory use, making the smaller cache levels fairly ineffective)
does generally show some rescaling, but not as drastically as the
multi-threaded results. Both round-robin and prefer:1 showed this for
the single-thread runs.
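The multi-threaded trials have a different shape: worker-thread
creation and joining land inside the measured span, which lines up with
the sys_thr_new activity in the traces above and is one place extra
variability could enter. Again a rough sketch with placeholder names
and workload, not the actual benchmark source (link with -pthread):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// One multi-threaded trial: each run spawns fresh worker threads
// (each a sys_thr_new underneath, with a kernel thread stack
// allocated), and the spawn and join overheads are inside the timed
// span, unlike the single-thread case.
static double timed_threaded_run(unsigned nthreads,
                                 std::size_t elements_per_thread)
{
    std::vector<std::vector<double>> data(
        nthreads, std::vector<double>(elements_per_thread, 1.0));
    std::vector<std::thread> workers;
    workers.reserve(nthreads);

    const auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < nthreads; ++i)
        workers.emplace_back([&data, i] {
            double sum = 0.0;            // placeholder workload, as in
            for (double x : data[i])     // the single-thread sketch
                sum += x;
            (void)sum;
        });
    for (auto &w : workers)
        w.join();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count(); // seconds
}

int main()
{
    std::printf("%.9f s\n",
                timed_threaded_run(16, std::size_t(1) << 18));
    return 0;
}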
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)