From: tuexen@freebsd.org
Date: Thu, 28 Mar 2024 16:53:35 +0100
Subject: Re: Request for Testing: TCP RACK
To: Nuno Teixeira
Cc: Drew Gallatin, Konstantin Belousov, rrs, Mike Karels, garyj@gmx.de,
    current@freebsd.org, net@freebsd.org, Randall Stewart
List-Id: Networking and TCP/IP with FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-net

> On 28. Mar 2024, at 15:00, Nuno Teixeira wrote:
>
> Hello all!
>
> Running rack @b7b78c1c169 "Optimize HPTS..." and very happy with it on my laptop (amd64)!
>
> Thanks all!
Thanks for the feedback!

Best regards
Michael

>
> Drew Gallatin wrote (Thursday, 21/03/2024 at 12:58):
> The entire point is to *NOT* go through the overhead of scheduling something
> asynchronously, but to take advantage of the fact that a user/kernel transition
> is going to trash the cache anyway.
>
> In the common case of a system that has fewer than the threshold number of
> connections, we access the tcp_hpts_softclock function pointer, make one
> function call, and access hpts_that_need_softclock, and then return. So that's
> two variables and a function call.
>
> I think it would be preferable to avoid that call, and to move the declarations
> of tcp_hpts_softclock and hpts_that_need_softclock so that they are in the same
> cacheline. Then we'd be hitting just a single line in the common case. (I've
> made comments on the review to that effect.)
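A minimal sketch of the single-cacheline layout being suggested above; the
wrapper struct and its names are purely illustrative and are not the change
actually proposed on the review:

    /*
     * Sketch only: keep the userret fast-path state together so the common
     * "nothing above the threshold" check touches a single cache line.
     * The struct and field names here are hypothetical.
     */
    struct tcp_hpts_userret_state {
            void    (*softclock)(void);    /* the tcp_hpts_softclock hook */
            uint64_t need_softclock;       /* hpts instances above the threshold */
    };

    static struct tcp_hpts_userret_state hpts_userret_state
        __aligned(CACHE_LINE_SIZE);

With a layout like that, the two loads in the common path (the hook pointer and
the gating counter) hit one cacheline instead of two.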
>
> Also, I wonder if the threshold could get higher by default, so that hpts is
> never called in this context unless we're to the point where we're scheduling
> thousands of runs of the hpts thread (and taking all those clock interrupts).
>
> Drew
>
> On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote:
>> On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:
>>> OK, I have created
>>>
>>> https://reviews.freebsd.org/D44420
>>>
>>> to address the issue. I also attach a short version of the patch that Nuno
>>> can try to validate that it works. Drew, you may want to try this and verify
>>> that the optimization does kick in, since right now I can only test that it
>>> does not on my local box :)
>> The patch still causes access to all CPUs' cachelines on each userret.
>> It would be much better to increment/check the threshold and only schedule
>> the call when it is exceeded. Then the call could occur in some dedicated
>> context, like a per-CPU thread, instead of userret.
>>
>>> R
>>>
>>> On 3/18/24 3:42 PM, Drew Gallatin wrote:
>>>> No. The goal is to run on every return to userspace for every thread.
>>>>
>>>> Drew
>>>>
>>>> On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrote:
>>>>> On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallatin wrote:
>>>>>> I got the idea from
>>>>>> https://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf
>>>>>> The gist is that the TCP pacing stuff needs to run frequently, and
>>>>>> rather than run it out of a clock interrupt, it's more efficient to run
>>>>>> it out of a system call context at just the point where we return to
>>>>>> userspace and the cache is trashed anyway. The current implementation
>>>>>> is fine for our workload, but probably not ideal for a generic system,
>>>>>> especially one where something is banging on system calls.
>>>>>>
>>>>>> ASTs could be the right tool for this, but I'm super unfamiliar with
>>>>>> them, and I can't find any docs on them.
>>>>>>
>>>>>> Would ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent to
>>>>>> what's happening here?
>>>>> This call would need some AST number added, and then it registers the
>>>>> AST to run on the next return to userspace, for the current thread.
>>>>>
>>>>> Is it enough?
>>>>>>
>>>>>> Drew
>>>>>
>>>>>> On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belousov wrote:
>>>>>>> On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote:
>>>>>>>> On 18 Mar 2024, at 7:04, tuexen@freebsd.org wrote:
>>>>>>>>
>>>>>>>>>> On 18. Mar 2024, at 12:42, Nuno Teixeira wrote:
>>>>>>>>>>
>>>>>>>>>> Hello all!
>>>>>>>>>>
>>>>>>>>>> It works just fine!
>>>>>>>>>> System performance is OK.
>>>>>>>>>> Using patch on main-n268841-b0aaf8beb126(-dirty).
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> net.inet.tcp.functions_available:
>>>>>>>>>> Stack    D Alias    PCB count
>>>>>>>>>> freebsd    freebsd  0
>>>>>>>>>> rack     * rack     38
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> It would be nice if we could have a sysctl tunable for this patch
>>>>>>>>>> so we could do more tests without recompiling the kernel.
>>>>>>>>> Thanks for testing!
>>>>>>>>>
>>>>>>>>> @gallatin: can you come up with a patch that is acceptable for Netflix
>>>>>>>>> and allows mitigating the performance regression?
>>>>>>>>
>>>>>>>> Ideally, tcphpts could enable this automatically when it starts to be
>>>>>>>> used (enough?), but a sysctl could select auto/on/off.
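A rough sketch of the ast_register() usage raised above; the AST slot
(TDA_MOD1) and the callback signature are assumptions made for illustration
and have not been checked against the tree; only ASTR_UNCOND and the
four-argument call come from the mail:

    /*
     * Sketch only: run the hpts softclock from the AST framework on every
     * return to userspace.  TDA_MOD1 and the callback signature are assumed;
     * the third argument is left 0 as in the example call above.
     */
    static void
    tcp_hpts_userret_ast(struct thread *td __unused, int tda __unused)
    {
            tcp_hpts_softclock();   /* the hook discussed above */
    }

    static void
    tcp_hpts_ast_init(void)
    {
            ast_register(TDA_MOD1, ASTR_UNCOND, 0, tcp_hpts_userret_ast);
    }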
>>>>>>> There is already a well-known mechanism to request execution of a
>>>>>>> specific function on return to userspace, namely AST. The difference
>>>>>>> from the current hack is that the execution is requested for one
>>>>>>> callback in the context of a specific thread.
>>>>>>>
>>>>>>> Still, it might be worth a try to use it; what is the reason to hit a
>>>>>>> thread that does not do networking, with TCP processing?
>>>>>>>
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>> Thanks all!
>>>>>>>>>> Really happy here :)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Nuno Teixeira wrote (Sunday, 17/03/2024 at 20:26):
>>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>>> I don't have the full context, but it seems like the complaint is
>>>>>>>>>>>> a performance regression in bonnie++ and perhaps other things when
>>>>>>>>>>>> tcp_hpts is loaded, even when it is not used. Is that correct?
>>>>>>>>>>>>
>>>>>>>>>>>> If so, I suspect it's because we drive the tcp_hpts_softclock()
>>>>>>>>>>>> routine from userret(), in order to avoid tons of timer interrupts
>>>>>>>>>>>> and context switches. To test this theory, you could apply a patch
>>>>>>>>>>>> like:
>>>>>>>>>>>
>>>>>>>>>>> It's affecting overall system performance; bonnie was just a way to
>>>>>>>>>>> get some numbers to compare.
>>>>>>>>>>>
>>>>>>>>>>> Tomorrow I will test the patch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nuno Teixeira
>>>>>>>>>>> FreeBSD Committer (ports)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nuno Teixeira
>>>>>>>>>> FreeBSD Committer (ports)
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>>> diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c
>>> index 8c4d2d41a3eb..eadbee19f69c 100644
>>> --- a/sys/netinet/tcp_hpts.c
>>> +++ b/sys/netinet/tcp_hpts.c
>>> @@ -216,6 +216,7 @@ struct tcp_hpts_entry {
>>>     void *ie_cookie;
>>>     uint16_t p_num;     /* The hpts number one per cpu */
>>>     uint16_t p_cpu;     /* The hpts CPU */
>>> +   uint8_t hit_callout_thresh;
>>>     /* There is extra space in here */
>>>     /* Cache line 0x100 */
>>>     struct callout co __aligned(CACHE_LINE_SIZE);
>>> @@ -269,6 +270,11 @@ static struct hpts_domain_info {
>>>     int cpu[MAXCPU];
>>>  } hpts_domains[MAXMEMDOM];
>>>
>>> +counter_u64_t hpts_that_need_softclock;
>>> +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needsoftclock, CTLFLAG_RD,
>>> +    &hpts_that_need_softclock,
>>> +    "Number of hpts threads that need softclock");
>>> +
>>>  counter_u64_t hpts_hopelessly_behind;
>>>
>>>  SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeless, CTLFLAG_RD,
>>> @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, precision, CTLFLAG_RW,
>>>      &tcp_hpts_precision, 120,
>>>      "Value for PRE() precision of callout");
>>>  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_RW,
>>> -    &conn_cnt_thresh, 0,
>>> +    &conn_cnt_thresh, DEFAULT_CONNECTION_THESHOLD,
>>>      "How many connections (below) make us use the callout based mechanism");
>>>  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW,
>>>      &hpts_does_tp_logging, 0,
>>> @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void)
>>>     struct tcp_hpts_entry *hpts;
>>>     int ticks_ran;
>>>
>>> +   if (counter_u64_fetch(hpts_that_need_softclock) == 0)
>>> +           return;
>>> +
>>>     hpts = tcp_choose_hpts_to_run();
>>>
>>>     if (hpts->p_hpts_active) {
>>> @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx)
>>>     ticks_ran = tcp_hptsi(hpts, 1);
>>>     tv.tv_sec = 0;
>>>     tv.tv_usec = hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLOT;
>>> +   if ((hpts->p_on_queue_cnt > conn_cnt_thresh) && (hpts->hit_callout_thresh == 0)) {
>>> +           hpts->hit_callout_thresh = 1;
>>> +           counter_u64_add(hpts_that_need_softclock, 1);
>>> +   } else if ((hpts->p_on_queue_cnt <= conn_cnt_thresh) && (hpts->hit_callout_thresh == 1)) {
>>> +           hpts->hit_callout_thresh = 0;
>>> +           counter_u64_add(hpts_that_need_softclock, -1);
>>> +   }
>>>     if (hpts->p_on_queue_cnt >= conn_cnt_thresh) {
>>>             if(hpts->p_direct_wake == 0) {
>>>                     /*
>>> @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void)
>>>     cpu_top = NULL;
>>>  #endif
>>>     tcp_pace.rp_num_hptss = ncpus;
>>> +   hpts_that_need_softclock = counter_u64_alloc(M_WAITOK);
>>>     hpts_hopelessly_behind = counter_u64_alloc(M_WAITOK);
>>>     hpts_loops = counter_u64_alloc(M_WAITOK);
>>>     back_tosleep = counter_u64_alloc(M_WAITOK);
>>> @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void)
>>>     free(tcp_pace.grps, M_TCPHPTS);
>>>  #endif
>>>
>>> +   counter_u64_free(hpts_that_need_softclock);
>>>     counter_u64_free(hpts_hopelessly_behind);
>>>     counter_u64_free(hpts_loops);
>>>     counter_u64_free(back_tosleep);
>>
>>
>
>
>
> --
> Nuno Teixeira
> FreeBSD Committer (ports)
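A quick way to see whether the gating in the quoted patch engages once it is
applied: read the new counter and the threshold via sysctl. The oid names
below are simply read off the SYSCTL declarations in the patch, so treat them
as unverified:

    # sysctl net.inet.tcp.hpts.stats.needsoftclock
    # sysctl net.inet.tcp.hpts.cnt_thresh
    # sysctl net.inet.tcp.functions_available

As long as no hpts instance has more than cnt_thresh connections queued,
needsoftclock should stay at 0 and __tcp_run_hpts() returns right after the
counter check, which is the cheap userret path discussed above.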