Date: Thu, 28 Mar 2024 16:53:35 +0100
From: tuexen@freebsd.org
To: Nuno Teixeira <eduardo@freebsd.org>
Cc: Drew Gallatin <gallatin@freebsd.org>, Konstantin Belousov <kib@freebsd.org>,
 rrs <rrs@lakerest.net>, Mike Karels <mike@karels.net>, garyj@gmx.de,
 current@freebsd.org, net@freebsd.org, Randall Stewart <rrs@freebsd.org>
Subject: Re: Request for Testing: TCP RACK
Message-ID: <5C9863F7-0F1C-4D02-9F6D-9DDC5FBEB368@freebsd.org>
In-Reply-To: <CAFDf7U+sd0Hx6scDOvMuxOQaqVb8wi1ODrwT=uKDCqMn_81QEw@mail.gmail.com>
References: <6e795e9c-8de4-4e02-9a96-8fabfaa4e66f@app.fastmail.com>
 <CAFDf7UKDWSnhm+TwP=ZZ9dkk0jmAgjGKPLpkX-CKuw3yH233gQ@mail.gmail.com>
 <CAFDf7UJq9SCnU-QYmS3t6EknP369w2LR0dNkQAc-NaRLvwVfoQ@mail.gmail.com>
 <A3F1FC0C-C199-4565-8E07-B233ED9E7B2E@freebsd.org>
 <6047C8EF-B1B0-4286-93FA-AA38F8A18656@karels.net>
 <ZfiI7GcbTwSG8kkO@kib.kiev.ua>
 <8031cd99-ded8-4b06-93b3-11cc729a8b2c@app.fastmail.com>
 <ZfiY-xUUM3wrBEz_@kib.kiev.ua>
 <38c54399-6c96-44d8-a3a2-3cc1bfbe50c2@app.fastmail.com>
 <27d8144f-0658-46f6-b8f3-35eb60061644@lakerest.net>
 <Zft8odA0s49eLhvk@kib.kiev.ua>
 <fc160c79-1884-4f68-8310-35e7ac0b9dd6@app.fastmail.com>
 <CAFDf7U+sd0Hx6scDOvMuxOQaqVb8wi1ODrwT=uKDCqMn_81QEw@mail.gmail.com>
> On 28. Mar 2024, at 15:00, Nuno Teixeira <eduardo@freebsd.org> wrote:
> 
> Hello all!
> 
> Running rack @b7b78c1c169 "Optimize HPTS..." very happy on my laptop (amd64)!
> 
> Thanks all!
Thanks for the feedback!

Best regards
Michael
> 
> Drew Gallatin <gallatin@freebsd.org> wrote (Thursday, 21/03/2024 at 12:58):
> The entire point is to *NOT* go through the overhead of scheduling something asynchronously, but to take advantage of the fact that a user/kernel transition is going to trash the cache anyway.
> 
> In the common case of a system which has fewer than the threshold number of connections, we access the tcp_hpts_softclock function pointer, make one function call, and access hpts_that_need_softclock, and then return. So that's 2 variables and a function call.
> 
> I think it would be preferable to avoid that call, and to move the declarations of tcp_hpts_softclock and hpts_that_need_softclock so that they are in the same cacheline. Then we'd be hitting just a single line in the common case. (I've made comments on the review to that effect.)
> 
> Also, I wonder if the threshold could get higher by default, so that hpts is never called in this context unless we're to the point where we're scheduling thousands of runs of the hpts thread (and taking all those clock interrupts).
> 
> Drew
> 
> On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote:
>> On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:
>>> Ok I have created
>>> 
>>> https://reviews.freebsd.org/D44420
>>> 
>>> to address the issue. I also attach a short version of the patch that Nuno
>>> can try and validate it works. Drew, you may want to try this and validate
>>> that the optimization does kick in, since I can only test that it does not
>>> on my local box :)
>> The patch still causes access to all CPUs' cachelines on each userret.
>> It would be much better to inc/check the threshold and only schedule the
>> call when exceeded. Then the call can occur in some dedicated context,
>> like a per-CPU thread, instead of userret.
>> 
>>> R
>>> 
>>> On 3/18/24 3:42 PM, Drew Gallatin wrote:
>>>> No. The goal is to run on every return to userspace for every thread.
>>>> 
>>>> Drew
>>>> 
>>>> On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrote:
>>>>> On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallatin wrote:
>>>>>> I got the idea from
>>>>>> https://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf
>>>>>> The gist is that the TCP pacing stuff needs to run frequently, and
>>>>>> rather than run it out of a clock interrupt, it's more efficient to run
>>>>>> it out of a system call context at just the point where we return to
>>>>>> userspace and the cache is trashed anyway. The current implementation
>>>>>> is fine for our workload, but probably not ideal for a generic system,
>>>>>> especially one where something is banging on system calls.
>>>>>> 
>>>>>> ASTs could be the right tool for this, but I'm super unfamiliar with
>>>>>> them, and I can't find any docs on them.
>>>>>> 
>>>>>> Would ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent to
>>>>>> what's happening here?
>>>>> This call would need some AST number added, and then it registers the
>>>>> ast to run on next return to userspace, for the current thread.
>>>>> 
>>>>> Is it enough?
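A rough sketch of what that AST route might look like follows; it is untested, TDA_MOD1 only stands in for whatever AST slot would actually be added, and the exact ast_register()/ast_sched() and handler signatures should be checked against sys/sys/proc.h:

/*
 * Untested sketch of the AST-based variant discussed above.  TDA_MOD1 is a
 * stand-in for a dedicated AST slot that would have to be added; verify the
 * handler type against the ast(9) framework before using any of this.
 */
static void
tcp_hpts_ast_handler(struct thread *td __unused, int tda __unused)
{
	/* Same work the userret() hook does today. */
	tcp_hpts_softclock();
}

static void
tcp_hpts_ast_init(void)
{
	/*
	 * ASTR_UNCOND, as in the call quoted above, would run the handler on
	 * every return to userspace.  A non-unconditional registration plus
	 * ast_sched(curthread, TDA_MOD1) from the TCP paths would instead
	 * limit the work to threads that actually touched networking, which
	 * is the trade-off being discussed here.
	 */
	ast_register(TDA_MOD1, ASTR_UNCOND, 0, tcp_hpts_ast_handler);
}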
>>>>>>=20 >>>>>> Drew >>>>>=20 >>>>>>=20 >>>>>> On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belousov wrote: >>>>>>> On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote: >>>>>>>> On 18 Mar 2024, at 7:04, tuexen@freebsd.org wrote: >>>>>>>>=20 >>>>>>>>>> On 18. Mar 2024, at 12:42, Nuno Teixeira >>>>> <eduardo@freebsd.org> wrote: >>>>>>>>>>=20 >>>>>>>>>> Hello all! >>>>>>>>>>=20 >>>>>>>>>> It works just fine! >>>>>>>>>> System performance is OK. >>>>>>>>>> Using patch on main-n268841-b0aaf8beb126(-dirty). >>>>>>>>>>=20 >>>>>>>>>> --- >>>>>>>>>> net.inet.tcp.functions_available: >>>>>>>>>> Stack D >>>>> Alias PCB count >>>>>>>>>> freebsd freebsd 0 >>>>>>>>>> rack * >>>>> rack 38 >>>>>>>>>> --- >>>>>>>>>>=20 >>>>>>>>>> It would be so nice that we can have a sysctl tunnable for >>>>> this patch >>>>>>>>>> so we could do more tests without recompiling kernel. >>>>>>>>> Thanks for testing! >>>>>>>>>=20 >>>>>>>>> @gallatin: can you come up with a patch that is acceptable >>>>> for Netflix >>>>>>>>> and allows to mitigate the performance regression. >>>>>>>>=20 >>>>>>>> Ideally, tcphpts could enable this automatically when it >>>>> starts to be >>>>>>>> used (enough?), but a sysctl could select auto/on/off. >>>>>>> There is already a well-known mechanism to request execution of = the >>>>>>> specific function on return to userspace, namely AST. The = difference >>>>>>> with the current hack is that the execution is requested for one >>>>> callback >>>>>>> in the context of the specific thread. >>>>>>>=20 >>>>>>> Still, it might be worth a try to use it; what is the reason to >>>>> hit a thread >>>>>>> that does not do networking, with TCP processing? >>>>>>>=20 >>>>>>>>=20 >>>>>>>> Mike >>>>>>>>=20 >>>>>>>>> Best regards >>>>>>>>> Michael >>>>>>>>>>=20 >>>>>>>>>> Thanks all! >>>>>>>>>> Really happy here :) >>>>>>>>>>=20 >>>>>>>>>> Cheers, >>>>>>>>>>=20 >>>>>>>>>> Nuno Teixeira <eduardo@freebsd.org> escreveu (domingo, >>>>> 17/03/2024 =C3=A0(s) 20:26): >>>>>>>>>>>=20 >>>>>>>>>>> Hello, >>>>>>>>>>>=20 >>>>>>>>>>>> I don't have the full context, but it seems like the >>>>> complaint is a performance regression in bonnie++ and perhaps = other >>>>> things when tcp_hpts is loaded, even when it is not used. Is that >>>>> correct? >>>>>>>>>>>>=20 >>>>>>>>>>>> If so, I suspect its because we drive the >>>>> tcp_hpts_softclock() routine from userret(), in order to avoid = tons >>>>> of timer interrupts and context switches. To test this theory, = you >>>>> could apply a patch like: >>>>>>>>>>>=20 >>>>>>>>>>> It's affecting overall system performance, bonnie was just >>>>> a way to >>>>>>>>>>> get some numbers to compare. >>>>>>>>>>>=20 >>>>>>>>>>> Tomorrow I will test patch. >>>>>>>>>>>=20 >>>>>>>>>>> Thanks! 
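A knob of the kind being asked for above might look roughly like the following, building on the hpts_that_need_softclock counter from Randall's patch below; the name from_userret and the 0/1/2 encoding are made up for illustration:

/*
 * Illustrative only: 0 = never run the hpts softclock from userret(),
 * 1 = always run it, 2 = auto (only once the connection-count threshold
 * has been crossed, using the counter from the patch below).
 */
static int tcp_hpts_from_userret = 2;
SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, from_userret, CTLFLAG_RW,
    &tcp_hpts_from_userret, 0,
    "Run hpts softclock on return to userspace: 0=off, 1=on, 2=auto");

void
__tcp_run_hpts(void)
{
	if (tcp_hpts_from_userret == 0)
		return;
	if (tcp_hpts_from_userret == 2 &&
	    counter_u64_fetch(hpts_that_need_softclock) == 0)
		return;
	/* ... existing body of __tcp_run_hpts() ... */
}

With something along these lines, the userret behaviour could be switched at runtime (for example, sysctl net.inet.tcp.hpts.from_userret=0) while benchmarking, instead of rebuilding the kernel.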
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> 
>>> diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c
>>> index 8c4d2d41a3eb..eadbee19f69c 100644
>>> --- a/sys/netinet/tcp_hpts.c
>>> +++ b/sys/netinet/tcp_hpts.c
>>> @@ -216,6 +216,7 @@ struct tcp_hpts_entry {
>>>  	void *ie_cookie;
>>>  	uint16_t p_num;		/* The hpts number one per cpu */
>>>  	uint16_t p_cpu;		/* The hpts CPU */
>>> +	uint8_t hit_callout_thresh;
>>>  	/* There is extra space in here */
>>>  	/* Cache line 0x100 */
>>>  	struct callout co __aligned(CACHE_LINE_SIZE);
>>> @@ -269,6 +270,11 @@ static struct hpts_domain_info {
>>>  	int cpu[MAXCPU];
>>>  } hpts_domains[MAXMEMDOM];
>>>  
>>> +counter_u64_t hpts_that_need_softclock;
>>> +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needsoftclock, CTLFLAG_RD,
>>> +    &hpts_that_need_softclock,
>>> +    "Number of hpts threads that need softclock");
>>> +
>>>  counter_u64_t hpts_hopelessly_behind;
>>>  
>>>  SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeless, CTLFLAG_RD,
>>> @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, precision, CTLFLAG_RW,
>>>      &tcp_hpts_precision, 120,
>>>      "Value for PRE() precision of callout");
>>>  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_RW,
>>> -    &conn_cnt_thresh, 0,
>>> +    &conn_cnt_thresh, DEFAULT_CONNECTION_THESHOLD,
>>>      "How many connections (below) make us use the callout based mechanism");
>>>  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW,
>>>      &hpts_does_tp_logging, 0,
>>> @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void)
>>>  	struct tcp_hpts_entry *hpts;
>>>  	int ticks_ran;
>>>  
>>> +	if (counter_u64_fetch(hpts_that_need_softclock) == 0)
>>> +		return;
>>> +
>>>  	hpts = tcp_choose_hpts_to_run();
>>>  
>>>  	if (hpts->p_hpts_active) {
>>> @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx)
>>>  	ticks_ran = tcp_hptsi(hpts, 1);
>>>  	tv.tv_sec = 0;
>>>  	tv.tv_usec = hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLOT;
>>> +	if ((hpts->p_on_queue_cnt > conn_cnt_thresh) && (hpts->hit_callout_thresh == 0)) {
>>> +		hpts->hit_callout_thresh = 1;
>>> +		counter_u64_add(hpts_that_need_softclock, 1);
>>> +	} else if ((hpts->p_on_queue_cnt <= conn_cnt_thresh) && (hpts->hit_callout_thresh == 1)) {
>>> +		hpts->hit_callout_thresh = 0;
>>> +		counter_u64_add(hpts_that_need_softclock, -1);
>>> +	}
>>>  	if (hpts->p_on_queue_cnt >= conn_cnt_thresh) {
>>>  		if(hpts->p_direct_wake == 0) {
>>>  			/*
>>> @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void)
>>>  	cpu_top = NULL;
>>>  #endif
>>>  	tcp_pace.rp_num_hptss = ncpus;
>>> +	hpts_that_need_softclock = counter_u64_alloc(M_WAITOK);
>>>  	hpts_hopelessly_behind = counter_u64_alloc(M_WAITOK);
>>>  	hpts_loops = counter_u64_alloc(M_WAITOK);
>>>  	back_tosleep = counter_u64_alloc(M_WAITOK);
>>> @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void)
>>>  	free(tcp_pace.grps, M_TCPHPTS);
>>>  #endif
>>>  
>>> +	counter_u64_free(hpts_that_need_softclock);
>>> +	counter_u64_free(hpts_hopelessly_behind);
>>>  	counter_u64_free(hpts_loops);
>>>  	counter_u64_free(back_tosleep);
>> 
>> 
> 
> 
> 
> -- 
> Nuno Teixeira
> FreeBSD Committer (ports)
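As for Drew's review comment about keeping the userret() fast-path data on a single cache line, a minimal sketch could look like the following; the struct and its field names are purely illustrative, and counter_u64_fetch() would still walk the per-CPU counters, which is the cost Konstantin points out above:

/*
 * Illustrative sketch of Drew's suggestion: declare the two objects the
 * userret() fast path touches next to each other and align the pair, so the
 * common case (below the connection threshold) hits a single cache line.
 */
struct tcp_hpts_userret_data {
	void		(*softclock)(void);	/* tcp_hpts_softclock */
	counter_u64_t	need_softclock;		/* hpts_that_need_softclock */
} __aligned(CACHE_LINE_SIZE);

static struct tcp_hpts_userret_data hpts_userret_data __read_mostly;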