Date: Thu, 28 Mar 2024 16:53:35 +0100
From: tuexen@freebsd.org
To: Nuno Teixeira <eduardo@freebsd.org>
Cc: Drew Gallatin <gallatin@freebsd.org>, Konstantin Belousov <kib@freebsd.org>,
 rrs <rrs@lakerest.net>, Mike Karels <mike@karels.net>, garyj@gmx.de,
 current@freebsd.org, net@freebsd.org, Randall Stewart <rrs@freebsd.org>
Subject: Re: Request for Testing: TCP RACK
Message-ID: <5C9863F7-0F1C-4D02-9F6D-9DDC5FBEB368@freebsd.org>
In-Reply-To: <CAFDf7U+sd0Hx6scDOvMuxOQaqVb8wi1ODrwT=uKDCqMn_81QEw@mail.gmail.com>
References: <6e795e9c-8de4-4e02-9a96-8fabfaa4e66f@app.fastmail.com>
 <CAFDf7UKDWSnhm+TwP=ZZ9dkk0jmAgjGKPLpkX-CKuw3yH233gQ@mail.gmail.com>
 <CAFDf7UJq9SCnU-QYmS3t6EknP369w2LR0dNkQAc-NaRLvwVfoQ@mail.gmail.com>
 <A3F1FC0C-C199-4565-8E07-B233ED9E7B2E@freebsd.org>
 <6047C8EF-B1B0-4286-93FA-AA38F8A18656@karels.net>
 <ZfiI7GcbTwSG8kkO@kib.kiev.ua>
 <8031cd99-ded8-4b06-93b3-11cc729a8b2c@app.fastmail.com>
 <ZfiY-xUUM3wrBEz_@kib.kiev.ua>
 <38c54399-6c96-44d8-a3a2-3cc1bfbe50c2@app.fastmail.com>
 <27d8144f-0658-46f6-b8f3-35eb60061644@lakerest.net>
 <Zft8odA0s49eLhvk@kib.kiev.ua>
 <fc160c79-1884-4f68-8310-35e7ac0b9dd6@app.fastmail.com>
 <CAFDf7U+sd0Hx6scDOvMuxOQaqVb8wi1ODrwT=uKDCqMn_81QEw@mail.gmail.com>
> On 28. Mar 2024, at 15:00, Nuno Teixeira <eduardo@freebsd.org> wrote:
> 
> Hello all!
> 
> Running rack @b7b78c1c169 "Optimize HPTS..." very happy on my laptop (amd64)!
> 
> Thanks all!
Thanks for the feedback!

Best regards
Michael
> 
> Drew Gallatin <gallatin@freebsd.org> wrote (Thursday, 21/03/2024 at 12:58):
> The entire point is to *NOT* go through the overhead of scheduling something asynchronously, but to take advantage of the fact that a user/kernel transition is going to trash the cache anyway.
> 
> In the common case of a system which has fewer than the threshold number of connections, we access the tcp_hpts_softclock function pointer, make one function call, and access hpts_that_need_softclock, and then return. So that's 2 variables and a function call.
> 
> I think it would be preferable to avoid that call, and to move the declarations of tcp_hpts_softclock and hpts_that_need_softclock so that they are in the same cacheline. Then we'd be hitting just a single line in the common case. (I've made comments on the review to that effect.)
> 
> Also, I wonder if the threshold could get higher by default, so that hpts is never called in this context unless we're to the point where we're scheduling thousands of runs of the hpts thread (and taking all those clock interrupts).
> 
> Drew
> 
> On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote:
>> On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:
>>> Ok I have created
>>> 
>>> https://reviews.freebsd.org/D44420
>>> 
>>> to address the issue. I also attach a short version of the patch that Nuno
>>> can try and validate it works. Drew, you may want to try this and validate
>>> that the optimization does kick in, since I can only test that it does not
>>> on my local box :)
>> The patch still causes access to all CPUs' cachelines on each userret.
>> It would be much better to inc/check the threshold and only schedule the
>> call when exceeded. Then the call can occur in some dedicated context,
>> like a per-CPU thread, instead of userret.
>> 
>>> R
>>> 
>>> On 3/18/24 3:42 PM, Drew Gallatin wrote:
>>>> No. The goal is to run on every return to userspace for every thread.
>>>> 
>>>> Drew
>>>> 
>>>> On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrote:
>>>>> On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallatin wrote:
>>>>>> I got the idea from
>>>>>> https://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf
>>>>>> The gist is that the TCP pacing stuff needs to run frequently, and
>>>>>> rather than run it out of a clock interrupt, it's more efficient to run
>>>>>> it out of a system call context at just the point where we return to
>>>>>> userspace and the cache is trashed anyway. The current implementation
>>>>>> is fine for our workload, but probably not ideal for a generic system,
>>>>>> especially one where something is banging on system calls.
>>>>>> 
>>>>>> ASTs could be the right tool for this, but I'm super unfamiliar with
>>>>>> them, and I can't find any docs on them.
>>>>>> 
>>>>>> Would ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent to
>>>>>> what's happening here?
>>>>> This call would need some AST number added, and then it registers the
>>>>> ast to run on next return to userspace, for the current thread.
>>>>> 
>>>>> Is it enough?
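A rough sketch of what that AST route might look like follows; it is untested, TDA_MOD1 only stands in for whatever AST slot would actually be added, and the exact ast_register()/ast_sched() and handler signatures should be checked against sys/sys/proc.h:

/*
 * Untested sketch of the AST-based variant discussed above.  TDA_MOD1 is a
 * stand-in for a dedicated AST slot that would have to be added; verify the
 * handler type against the ast(9) framework before using any of this.
 */
static void
tcp_hpts_ast_handler(struct thread *td __unused, int tda __unused)
{
	/* Same work the userret() hook does today. */
	tcp_hpts_softclock();
}

static void
tcp_hpts_ast_init(void)
{
	/*
	 * ASTR_UNCOND, as in the call quoted above, would run the handler on
	 * every return to userspace.  A non-unconditional registration plus
	 * ast_sched(curthread, TDA_MOD1) from the TCP paths would instead
	 * limit the work to threads that actually touched networking, which
	 * is the trade-off being discussed here.
	 */
	ast_register(TDA_MOD1, ASTR_UNCOND, 0, tcp_hpts_ast_handler);
}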
>>>>>>=20 >>>>>> Drew >>>>>=20 >>>>>>=20 >>>>>> On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belousov wrote: >>>>>>> On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote: >>>>>>>> On 18 Mar 2024, at 7:04, tuexen@freebsd.org wrote: >>>>>>>>=20 >>>>>>>>>> On 18. Mar 2024, at 12:42, Nuno Teixeira >>>>> <eduardo@freebsd.org> wrote: >>>>>>>>>>=20 >>>>>>>>>> Hello all! >>>>>>>>>>=20 >>>>>>>>>> It works just fine! >>>>>>>>>> System performance is OK. >>>>>>>>>> Using patch on main-n268841-b0aaf8beb126(-dirty). >>>>>>>>>>=20 >>>>>>>>>> --- >>>>>>>>>> net.inet.tcp.functions_available: >>>>>>>>>> Stack D >>>>> Alias PCB count >>>>>>>>>> freebsd freebsd 0 >>>>>>>>>> rack * >>>>> rack 38 >>>>>>>>>> --- >>>>>>>>>>=20 >>>>>>>>>> It would be so nice that we can have a sysctl tunnable for >>>>> this patch >>>>>>>>>> so we could do more tests without recompiling kernel. >>>>>>>>> Thanks for testing! >>>>>>>>>=20 >>>>>>>>> @gallatin: can you come up with a patch that is acceptable >>>>> for Netflix >>>>>>>>> and allows to mitigate the performance regression. >>>>>>>>=20 >>>>>>>> Ideally, tcphpts could enable this automatically when it >>>>> starts to be >>>>>>>> used (enough?), but a sysctl could select auto/on/off. >>>>>>> There is already a well-known mechanism to request execution of = the >>>>>>> specific function on return to userspace, namely AST. The = difference >>>>>>> with the current hack is that the execution is requested for one >>>>> callback >>>>>>> in the context of the specific thread. >>>>>>>=20 >>>>>>> Still, it might be worth a try to use it; what is the reason to >>>>> hit a thread >>>>>>> that does not do networking, with TCP processing? >>>>>>>=20 >>>>>>>>=20 >>>>>>>> Mike >>>>>>>>=20 >>>>>>>>> Best regards >>>>>>>>> Michael >>>>>>>>>>=20 >>>>>>>>>> Thanks all! >>>>>>>>>> Really happy here :) >>>>>>>>>>=20 >>>>>>>>>> Cheers, >>>>>>>>>>=20 >>>>>>>>>> Nuno Teixeira <eduardo@freebsd.org> escreveu (domingo, >>>>> 17/03/2024 =C3=A0(s) 20:26): >>>>>>>>>>>=20 >>>>>>>>>>> Hello, >>>>>>>>>>>=20 >>>>>>>>>>>> I don't have the full context, but it seems like the >>>>> complaint is a performance regression in bonnie++ and perhaps = other >>>>> things when tcp_hpts is loaded, even when it is not used. Is that >>>>> correct? >>>>>>>>>>>>=20 >>>>>>>>>>>> If so, I suspect its because we drive the >>>>> tcp_hpts_softclock() routine from userret(), in order to avoid = tons >>>>> of timer interrupts and context switches. To test this theory, = you >>>>> could apply a patch like: >>>>>>>>>>>=20 >>>>>>>>>>> It's affecting overall system performance, bonnie was just >>>>> a way to >>>>>>>>>>> get some numbers to compare. >>>>>>>>>>>=20 >>>>>>>>>>> Tomorrow I will test patch. >>>>>>>>>>>=20 >>>>>>>>>>> Thanks! 
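A knob of the kind being asked for above might look roughly like the following, building on the hpts_that_need_softclock counter from Randall's patch below; the name from_userret and the 0/1/2 encoding are made up for illustration:

/*
 * Illustrative only: 0 = never run the hpts softclock from userret(),
 * 1 = always run it, 2 = auto (only once the connection-count threshold
 * has been crossed, using the counter from the patch below).
 */
static int tcp_hpts_from_userret = 2;
SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, from_userret, CTLFLAG_RW,
    &tcp_hpts_from_userret, 0,
    "Run hpts softclock on return to userspace: 0=off, 1=on, 2=auto");

void
__tcp_run_hpts(void)
{
	if (tcp_hpts_from_userret == 0)
		return;
	if (tcp_hpts_from_userret == 2 &&
	    counter_u64_fetch(hpts_that_need_softclock) == 0)
		return;
	/* ... existing body of __tcp_run_hpts() ... */
}

With something along these lines, the userret behaviour could be switched at runtime (for example, sysctl net.inet.tcp.hpts.from_userret=0) while benchmarking, instead of rebuilding the kernel.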
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> 
>>> diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c
>>> index 8c4d2d41a3eb..eadbee19f69c 100644
>>> --- a/sys/netinet/tcp_hpts.c
>>> +++ b/sys/netinet/tcp_hpts.c
>>> @@ -216,6 +216,7 @@ struct tcp_hpts_entry {
>>>  	void *ie_cookie;
>>>  	uint16_t p_num;		/* The hpts number one per cpu */
>>>  	uint16_t p_cpu;		/* The hpts CPU */
>>> +	uint8_t hit_callout_thresh;
>>>  	/* There is extra space in here */
>>>  	/* Cache line 0x100 */
>>>  	struct callout co __aligned(CACHE_LINE_SIZE);
>>> @@ -269,6 +270,11 @@ static struct hpts_domain_info {
>>>  	int cpu[MAXCPU];
>>>  } hpts_domains[MAXMEMDOM];
>>>  
>>> +counter_u64_t hpts_that_need_softclock;
>>> +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needsoftclock, CTLFLAG_RD,
>>> +    &hpts_that_need_softclock,
>>> +    "Number of hpts threads that need softclock");
>>> +
>>>  counter_u64_t hpts_hopelessly_behind;
>>>  
>>>  SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeless, CTLFLAG_RD,
>>> @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, precision, CTLFLAG_RW,
>>>      &tcp_hpts_precision, 120,
>>>      "Value for PRE() precision of callout");
>>>  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_RW,
>>> -    &conn_cnt_thresh, 0,
>>> +    &conn_cnt_thresh, DEFAULT_CONNECTION_THESHOLD,
>>>      "How many connections (below) make us use the callout based mechanism");
>>>  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW,
>>>      &hpts_does_tp_logging, 0,
>>> @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void)
>>>  	struct tcp_hpts_entry *hpts;
>>>  	int ticks_ran;
>>>  
>>> +	if (counter_u64_fetch(hpts_that_need_softclock) == 0)
>>> +		return;
>>> +
>>>  	hpts = tcp_choose_hpts_to_run();
>>>  
>>>  	if (hpts->p_hpts_active) {
>>> @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx)
>>>  	ticks_ran = tcp_hptsi(hpts, 1);
>>>  	tv.tv_sec = 0;
>>>  	tv.tv_usec = hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLOT;
>>> +	if ((hpts->p_on_queue_cnt > conn_cnt_thresh) && (hpts->hit_callout_thresh == 0)) {
>>> +		hpts->hit_callout_thresh = 1;
>>> +		counter_u64_add(hpts_that_need_softclock, 1);
>>> +	} else if ((hpts->p_on_queue_cnt <= conn_cnt_thresh) && (hpts->hit_callout_thresh == 1)) {
>>> +		hpts->hit_callout_thresh = 0;
>>> +		counter_u64_add(hpts_that_need_softclock, -1);
>>> +	}
>>>  	if (hpts->p_on_queue_cnt >= conn_cnt_thresh) {
>>>  		if(hpts->p_direct_wake == 0) {
>>>  			/*
>>> @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void)
>>>  	cpu_top = NULL;
>>>  #endif
>>>  	tcp_pace.rp_num_hptss = ncpus;
>>> +	hpts_that_need_softclock = counter_u64_alloc(M_WAITOK);
>>>  	hpts_hopelessly_behind = counter_u64_alloc(M_WAITOK);
>>>  	hpts_loops = counter_u64_alloc(M_WAITOK);
>>>  	back_tosleep = counter_u64_alloc(M_WAITOK);
>>> @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void)
>>>  	free(tcp_pace.grps, M_TCPHPTS);
>>>  #endif
>>>  
>>> +	counter_u64_free(hpts_that_need_softclock);
>>> +	counter_u64_free(hpts_hopelessly_behind);
>>>  	counter_u64_free(hpts_loops);
>>>  	counter_u64_free(back_tosleep);
>> 
>> 
> 
> 
> 
> -- 
> Nuno Teixeira
> FreeBSD Committer (ports)
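As for Drew's review comment about keeping the userret() fast-path data on a single cache line, a minimal sketch could look like the following; the struct and its field names are purely illustrative, and counter_u64_fetch() would still walk the per-CPU counters, which is the cost Konstantin points out above:

/*
 * Illustrative sketch of Drew's suggestion: declare the two objects the
 * userret() fast path touches next to each other and align the pair, so the
 * common case (below the connection threshold) hits a single cache line.
 */
struct tcp_hpts_userret_data {
	void		(*softclock)(void);	/* tcp_hpts_softclock */
	counter_u64_t	need_softclock;		/* hpts_that_need_softclock */
} __aligned(CACHE_LINE_SIZE);

static struct tcp_hpts_userret_data hpts_userret_data __read_mostly;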