From nobody Wed Apr 10 11:40:16 2024 X-Original-To: net@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4VF1986KPBz5GxMl; Wed, 10 Apr 2024 11:40:28 +0000 (UTC) (envelope-from eduardo@freebsd.org) Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4VF1985JrBz4JTh; Wed, 10 Apr 2024 11:40:28 +0000 (UTC) (envelope-from eduardo@freebsd.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1712749228; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=l8vh42v6fmv7XhH8SXV5MwwLBBKllEko3xorlPVnaYw=; b=MDOBjzYfktSQk+MMYo2jokuffnXVLtbqgRQcES8s45ZMOcwvzRpZavXfBunXZXOOQrtbZL 4Hc95bCv9vSP5T4T12Ui4wG1GQfPYaWsjkxweFqYyDOj1PcnEgcS6582b8v7jAgzUHqtZN fixgZLSBkBRpjIE+zZ+fuWvpH8JwyS1v1SrP8xRAc+jB4W4g474TViRROzC6A8GFfUuvmZ a7gaCUl0kKqU8c3wRkXVpvXTmYsEWYMl/OKMiiZMDVlkjeP9pbxFrCpLVFBxX7KfbtMvFy h+wBobI1K7ImLdU7pIwAxDrkUMc60awUtQHhoKo8oBI5I8Vec/V+ShnV+hN8sA== ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1712749228; a=rsa-sha256; cv=none; b=h4t2wQB586gYwqpt9duoGHK8t7x1neb6IewtBZUwOhMaI/yOOnAHmIlsGjWinFt6nM74V0 dMbJgvaHPhJa2UVUneUG9Zq3eh5wD0LL7ZVSvhcLiDb1U2kswtOXCh3magOkMf19feL6vr wQ8kI/QUpbmOTCIWCj1kwVa6Um4+UNuWnHQluSJf6ZvAxf994I6JMam/kQwEnXfVJiGITI qH0+FhtonCweyfXO13EbtJa5C/xSdFU6hMTxt2v6PN1NU4iYzZoj3hBmnICY70whHhgRmD UUrA2uiBjQJ8L6yIBhZCKNXub23gBw9tkLRmEp5E+HIW7PpT6tVB7HqWnhOhMg== ARC-Authentication-Results: i=1; mx1.freebsd.org; none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1712749228; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=l8vh42v6fmv7XhH8SXV5MwwLBBKllEko3xorlPVnaYw=; b=VVz9aRM8f3ZHdVbFR/l8ETNeZdhJ4SeDNaAbINRYIMm+A4nbWqOhxgjDj7S+9NYM2Ai/m2 FmivSJfLOkg31ypgw4dd2ct2VkkHajPl+wuzvZWa93t+qgnAxkvblDfeGt5oyUNcMWk9c6 bi60GJyX+DVXKoPPlAoB0zMlbX78BU7ut6XH3jdUhm1OhZNvGsK7TyEZgmW9BVh1Zo9eqx CddWeGNvLPKSfMub+K44qabYCir9j5J/3aphg8lmebct2M4nuNu/5886wGfQuLkEZHUl2K 5KrifwBC4QVtdN/Eq8uWGarlcyhkJAcjNwNEN8IN72e98RbfNDDNb6sMiAodnQ== Received: from mail-yb1-f181.google.com (mail-yb1-f181.google.com [209.85.219.181]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) (Authenticated sender: eduardo) by smtp.freebsd.org (Postfix) with ESMTPSA id 4VF1984kkTzLQn; Wed, 10 Apr 2024 11:40:28 +0000 (UTC) (envelope-from eduardo@freebsd.org) Received: by mail-yb1-f181.google.com with SMTP id 3f1490d57ef6-dc236729a2bso6308233276.0; Wed, 10 Apr 2024 04:40:28 -0700 (PDT) X-Forwarded-Encrypted: i=1; AJvYcCUpEzH1bSXatb9OyFDiw2rmdOiCyQKZ3dap3T9svyfGL/ivBCLS183bJQZJfrnqdM+3CGyZd6TpMhWOavbM1mcWCoL4ftWmMTJYmTwhEqq/td+9dviUhQdDP7ii/noZMbGMBJ2z8SmcV8ELkkt1NBe/qrf0tzkn X-Gm-Message-State: AOJu0Yw8nglrE/rdqV/6XvLzStCInUG+FqpwODverstCQRlIESB7tRF7 5xZo1upotbQ8ueW8PdTbLa11djf3U7A0zAHI0+EvV4Hi0rjkaXMsyaMjsT+W42MlCBGvhObyARV 8QZSv09uzuvcCNxYEgFlD+t2gyZk= X-Google-Smtp-Source: AGHT+IE9OgPBrk3nu4n44r3WDLgcsQdARQYYW6dMCmekc+eLpEAQo51FYbRxDz80TYeEtiOgK5hh+52bpEh64hqiEpc= X-Received: by 2002:a25:d842:0:b0:dcf:bf94:bc30 with SMTP id p63-20020a25d842000000b00dcfbf94bc30mr2503007ybg.34.1712749227563; Wed, 10 Apr 2024 04:40:27 -0700 (PDT) List-Id: Networking and TCP/IP with FreeBSD List-Archive: https://lists.freebsd.org/archives/freebsd-net List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-net@freebsd.org MIME-Version: 1.0 References: <6e795e9c-8de4-4e02-9a96-8fabfaa4e66f@app.fastmail.com> <6047C8EF-B1B0-4286-93FA-AA38F8A18656@karels.net> <8031cd99-ded8-4b06-93b3-11cc729a8b2c@app.fastmail.com> <38c54399-6c96-44d8-a3a2-3cc1bfbe50c2@app.fastmail.com> <27d8144f-0658-46f6-b8f3-35eb60061644@lakerest.net> <5C9863F7-0F1C-4D02-9F6D-9DDC5FBEB368@freebsd.org> In-Reply-To: <5C9863F7-0F1C-4D02-9F6D-9DDC5FBEB368@freebsd.org> From: Nuno Teixeira Date: Wed, 10 Apr 2024 12:40:16 +0100 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: Request for Testing: TCP RACK To: tuexen@freebsd.org Cc: Drew Gallatin , Konstantin Belousov , rrs , Mike Karels , garyj@gmx.de, current@freebsd.org, net@freebsd.org, Randall Stewart Content-Type: multipart/alternative; boundary="0000000000004c6d020615bc81cf" --0000000000004c6d020615bc81cf Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hello all, @ current 1500018 and fetching torrents with net-p2p/qbittorrent finished ~2GB download and connection UP until the end: --- Apr 10 11:26:46 leg kernel: re0: watchdog timeout Apr 10 11:26:46 leg kernel: re0: link state changed to DOWN Apr 10 11:26:49 leg dhclient[58810]: New IP Address (re0): 192.168.1.67 Apr 10 11:26:49 leg dhclient[58814]: New Subnet Mask (re0): 255.255.255.0 Apr 10 11:26:49 leg dhclient[58818]: New Broadcast Address (re0): 192.168.1.255 Apr 10 11:26:49 leg kernel: re0: link state changed to UP Apr 10 11:26:49 leg dhclient[58822]: New Routers (re0): 192.168.1.1 --- In the past tests, I've got more watchdog timeouts, connection goes down and a reboot needed to put it back (`service netif restart` didn't work). Other way to reproduce this is using sysutils/restic (backup program) to read/check all files from a remote server via sftp: `restic -r sftp:user@remote:restic-repo check --read-data` from a 60GB compressed backup. --- watchdog timeout x3 as above --- restic check fail log @ 15% progress: --- Load(, 17310001, 0) returned error, retrying after 1.7670599s: connection lost Load(, 17456892, 0) returned error, retrying after 4.619104908s: connection lost Load(, 17310001, 0) returned error, retrying after 5.477648517s: connection lost List(lock) returned error, retrying after 293.057766ms: connection lost List(lock) returned error, retrying after 385.206693ms: connection lost List(lock) returned error, retrying after 1.577594281s: connection lost Connection continues UP. Cheers, escreveu (quinta, 28/03/2024 =C3=A0(s) 15:53): > > On 28. Mar 2024, at 15:00, Nuno Teixeira wrote: > > > > Hello all! > > > > Running rack @b7b78c1c169 "Optimize HPTS..." very happy on my laptop > (amd64)! > > > > Thanks all! > Thanks for the feedback! > > Best regards > Michael > > > > Drew Gallatin escreveu (quinta, 21/03/2024 =C3= =A0(s) > 12:58): > > The entire point is to *NOT* go through the overhead of scheduling > something asynchronously, but to take advantage of the fact that a > user/kernel transition is going to trash the cache anyway. > > > > In the common case of a system which has less than the threshold numbe= r > of connections , we access the tcp_hpts_softclock function pointer, make > one function call, and access hpts_that_need_softclock, and then return. > So that's 2 variables and a function call. > > > > I think it would be preferable to avoid that call, and to move the > declaration of tcp_hpts_softclock and hpts_that_need_softclock so that th= ey > are in the same cacheline. Then we'd be hitting just a single line in th= e > common case. (I've made comments on the review to that effect). > > > > Also, I wonder if the threshold could get higher by default, so that > hpts is never called in this context unless we're to the point where we'r= e > scheduling thousands of runs of the hpts thread (and taking all those clo= ck > interrupts). > > > > Drew > > > > On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote: > >> On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote: > >>> Ok I have created > >>> > >>> https://reviews.freebsd.org/D44420 > >>> > >>> > >>> To address the issue. I also attach a short version of the patch that > Nuno > >>> can try and validate > >>> > >>> it works. Drew you may want to try this and validate the optimization > does > >>> kick in since I can > >>> > >>> only now test that it does not on my local box :) > >> The patch still causes access to all cpu's cachelines on each userret. > >> It would be much better to inc/check the threshold and only schedule t= he > >> call when exceeded. Then the call can occur in some dedicated context= , > >> like per-CPU thread, instead of userret. > >> > >>> > >>> > >>> R > >>> > >>> > >>> > >>> On 3/18/24 3:42 PM, Drew Gallatin wrote: > >>>> No. The goal is to run on every return to userspace for every threa= d. > >>>> > >>>> Drew > >>>> > >>>> On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrote: > >>>>> On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallatin wrote: > >>>>>> I got the idea from > >>>>>> > https://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf > >>>>>> The gist is that the TCP pacing stuff needs to run frequently, and > >>>>>> rather than run it out of a clock interrupt, its more efficient to > run > >>>>>> it out of a system call context at just the point where we return = to > >>>>>> userspace and the cache is trashed anyway. The current > implementation > >>>>>> is fine for our workload, but probably not idea for a generic > system. > >>>>>> Especially one where something is banging on system calls. > >>>>>> > >>>>>> Ast's could be the right tool for this, but I'm super unfamiliar > with > >>>>>> them, and I can't find any docs on them. > >>>>>> > >>>>>> Would ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent = to > >>>>>> what's happening here? > >>>>> This call would need some AST number added, and then it registers t= he > >>>>> ast to run on next return to userspace, for the current thread. > >>>>> > >>>>> Is it enough? > >>>>>> > >>>>>> Drew > >>>>> > >>>>>> > >>>>>> On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belousov wrote: > >>>>>>> On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote: > >>>>>>>> On 18 Mar 2024, at 7:04, tuexen@freebsd.org wrote: > >>>>>>>> > >>>>>>>>>> On 18. Mar 2024, at 12:42, Nuno Teixeira > >>>>> wrote: > >>>>>>>>>> > >>>>>>>>>> Hello all! > >>>>>>>>>> > >>>>>>>>>> It works just fine! > >>>>>>>>>> System performance is OK. > >>>>>>>>>> Using patch on main-n268841-b0aaf8beb126(-dirty). > >>>>>>>>>> > >>>>>>>>>> --- > >>>>>>>>>> net.inet.tcp.functions_available: > >>>>>>>>>> Stack D > >>>>> Alias PCB count > >>>>>>>>>> freebsd freebsd 0 > >>>>>>>>>> rack * > >>>>> rack 38 > >>>>>>>>>> --- > >>>>>>>>>> > >>>>>>>>>> It would be so nice that we can have a sysctl tunnable for > >>>>> this patch > >>>>>>>>>> so we could do more tests without recompiling kernel. > >>>>>>>>> Thanks for testing! > >>>>>>>>> > >>>>>>>>> @gallatin: can you come up with a patch that is acceptable > >>>>> for Netflix > >>>>>>>>> and allows to mitigate the performance regression. > >>>>>>>> > >>>>>>>> Ideally, tcphpts could enable this automatically when it > >>>>> starts to be > >>>>>>>> used (enough?), but a sysctl could select auto/on/off. > >>>>>>> There is already a well-known mechanism to request execution of t= he > >>>>>>> specific function on return to userspace, namely AST. The > difference > >>>>>>> with the current hack is that the execution is requested for one > >>>>> callback > >>>>>>> in the context of the specific thread. > >>>>>>> > >>>>>>> Still, it might be worth a try to use it; what is the reason to > >>>>> hit a thread > >>>>>>> that does not do networking, with TCP processing? > >>>>>>> > >>>>>>>> > >>>>>>>> Mike > >>>>>>>> > >>>>>>>>> Best regards > >>>>>>>>> Michael > >>>>>>>>>> > >>>>>>>>>> Thanks all! > >>>>>>>>>> Really happy here :) > >>>>>>>>>> > >>>>>>>>>> Cheers, > >>>>>>>>>> > >>>>>>>>>> Nuno Teixeira escreveu (domingo, > >>>>> 17/03/2024 =C3=A0(s) 20:26): > >>>>>>>>>>> > >>>>>>>>>>> Hello, > >>>>>>>>>>> > >>>>>>>>>>>> I don't have the full context, but it seems like the > >>>>> complaint is a performance regression in bonnie++ and perhaps other > >>>>> things when tcp_hpts is loaded, even when it is not used. Is that > >>>>> correct? > >>>>>>>>>>>> > >>>>>>>>>>>> If so, I suspect its because we drive the > >>>>> tcp_hpts_softclock() routine from userret(), in order to avoid tons > >>>>> of timer interrupts and context switches. To test this theory, yo= u > >>>>> could apply a patch like: > >>>>>>>>>>> > >>>>>>>>>>> It's affecting overall system performance, bonnie was just > >>>>> a way to > >>>>>>>>>>> get some numbers to compare. > >>>>>>>>>>> > >>>>>>>>>>> Tomorrow I will test patch. > >>>>>>>>>>> > >>>>>>>>>>> Thanks! > >>>>>>>>>>> > >>>>>>>>>>> -- > >>>>>>>>>>> Nuno Teixeira > >>>>>>>>>>> FreeBSD Committer (ports) > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> Nuno Teixeira > >>>>>>>>>> FreeBSD Committer (ports) > >>>>>>>> > >>>>>>> > >>>>> > >>>> > >> > >>> diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c > >>> index 8c4d2d41a3eb..eadbee19f69c 100644 > >>> --- a/sys/netinet/tcp_hpts.c > >>> +++ b/sys/netinet/tcp_hpts.c > >>> @@ -216,6 +216,7 @@ struct tcp_hpts_entry { > >>> void *ie_cookie; > >>> uint16_t p_num; /* The hpts number one per cpu */ > >>> uint16_t p_cpu; /* The hpts CPU */ > >>> + uint8_t hit_callout_thresh; > >>> /* There is extra space in here */ > >>> /* Cache line 0x100 */ > >>> struct callout co __aligned(CACHE_LINE_SIZE); > >>> @@ -269,6 +270,11 @@ static struct hpts_domain_info { > >>> int cpu[MAXCPU]; > >>> } hpts_domains[MAXMEMDOM]; > >>> > >>> +counter_u64_t hpts_that_need_softclock; > >>> +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needsoftclock= , > CTLFLAG_RD, > >>> + &hpts_that_need_softclock, > >>> + "Number of hpts threads that need softclock"); > >>> + > >>> counter_u64_t hpts_hopelessly_behind; > >>> > >>> SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeless, > CTLFLAG_RD, > >>> @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, > precision, CTLFLAG_RW, > >>> &tcp_hpts_precision, 120, > >>> "Value for PRE() precision of callout"); > >>> SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_RW, > >>> - &conn_cnt_thresh, 0, > >>> + &conn_cnt_thresh, DEFAULT_CONNECTION_THESHOLD, > >>> "How many connections (below) make us use the callout based > mechanism"); > >>> SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW, > >>> &hpts_does_tp_logging, 0, > >>> @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void) > >>> struct tcp_hpts_entry *hpts; > >>> int ticks_ran; > >>> > >>> + if (counter_u64_fetch(hpts_that_need_softclock) =3D=3D 0) > >>> + return; > >>> + > >>> hpts =3D tcp_choose_hpts_to_run(); > >>> > >>> if (hpts->p_hpts_active) { > >>> @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx) > >>> ticks_ran =3D tcp_hptsi(hpts, 1); > >>> tv.tv_sec =3D 0; > >>> tv.tv_usec =3D hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLOT; > >>> + if ((hpts->p_on_queue_cnt > conn_cnt_thresh) && > (hpts->hit_callout_thresh =3D=3D 0)) { > >>> + hpts->hit_callout_thresh =3D 1; > >>> + counter_u64_add(hpts_that_need_softclock, 1); > >>> + } else if ((hpts->p_on_queue_cnt <=3D conn_cnt_thresh) && > (hpts->hit_callout_thresh =3D=3D 1)) { > >>> + hpts->hit_callout_thresh =3D 0; > >>> + counter_u64_add(hpts_that_need_softclock, -1); > >>> + } > >>> if (hpts->p_on_queue_cnt >=3D conn_cnt_thresh) { > >>> if(hpts->p_direct_wake =3D=3D 0) { > >>> /* > >>> @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void) > >>> cpu_top =3D NULL; > >>> #endif > >>> tcp_pace.rp_num_hptss =3D ncpus; > >>> + hpts_that_need_softclock =3D counter_u64_alloc(M_WAITOK); > >>> hpts_hopelessly_behind =3D counter_u64_alloc(M_WAITOK); > >>> hpts_loops =3D counter_u64_alloc(M_WAITOK); > >>> back_tosleep =3D counter_u64_alloc(M_WAITOK); > >>> @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void) > >>> free(tcp_pace.grps, M_TCPHPTS); > >>> #endif > >>> > >>> + counter_u64_free(hpts_that_need_softclock); > >>> counter_u64_free(hpts_hopelessly_behind); > >>> counter_u64_free(hpts_loops); > >>> counter_u64_free(back_tosleep); > >> > >> > > > > > > > > -- > > Nuno Teixeira > > FreeBSD Committer (ports) > > --=20 Nuno Teixeira FreeBSD Committer (ports) --0000000000004c6d020615bc81cf Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello all,

@ current 1500018= and fetching torrents with net-p2p/qbittorrent finished ~2GB download and = connection UP until the end:

---
Apr 10 11:26:46 leg kernel: re0: watchdog timeout
Apr 10 11:26:46 leg = kernel: re0: link state changed to DOWN
Apr 10 11:26:49 leg dhclient[588= 10]: New IP Address (re0): 192.168.1.67
Apr 10 11:26:49 leg dhclient[588= 14]: New Subnet Mask (re0): 255.255.255.0
Apr 10 11:26:49 leg dhclient[5= 8818]: New Broadcast Address (re0): 192.168.1.255
Apr 10 11:26:49 leg ke= rnel: re0: link state changed to UP
Apr 10 11:26:49 leg dhclient[58822]:= New Routers (re0): 192.168.1.1
---

In t= he past tests, I've got more watchdog timeouts, connection goes down an= d a reboot needed to put it back (`service netif restart` didn't work).=

Other way to reproduce this is using sysutils/res= tic (backup program) to read/check all files from a remote server via sftp:=

`restic -r sftp:user@remote:restic-repo check --r= ead-data` from a 60GB compressed backup.

---
=
watchdog timeout x3 as above
---

re= stic check fail log @ 15% progress:
---
<snip>= ;
Load(<data/52e2923dd6>, 17310001, 0) returned error, = retrying after 1.7670599s: connection lost
Load(<data/d27a0abe0f>,= 17456892, 0) returned error, retrying after 4.619104908s: connection lost<= br>Load(<data/52e2923dd6>, 17310001, 0) returned error, retrying afte= r 5.477648517s: connection lost
List(lock) returned error, retrying afte= r 293.057766ms: connection lost
List(lock) returned error, retrying afte= r 385.206693ms: connection lost
List(lock) returned error, retrying afte= r 1.577594281s: connection lost
<snip>

=
Connection continues UP.

Cheers,

<= ;tuexen@freebsd.org= > escreveu (quinta, 28/03/2024 =C3=A0(s) 15:53):
> On 28. Mar 2024, at 15:00, Nun= o Teixeira <edu= ardo@freebsd.org> wrote:
>
> Hello all!
>
> Running rack @b7b78c1c169 "Optimize HPTS..." very happy on m= y laptop (amd64)!
>
> Thanks all!
Thanks for the feedback!

Best regards
Michael
>
> Drew Gallatin <gallatin@freebsd.org> escreveu (quinta, 21/03/2024 =C3=A0(s) 1= 2:58):
> The entire point is to *NOT* go through the overhead of scheduling som= ething asynchronously, but to take advantage of the fact that a user/kernel= transition is going to trash the cache anyway.
>
> In the common case of a system which has less than the threshold=C2=A0= number of connections , we access the tcp_hpts_softclock function pointer,= make one function call, and access hpts_that_need_softclock, and then retu= rn.=C2=A0 So that's 2 variables and a function call.
>
> I think it would be preferable to avoid that call, and to move the dec= laration of tcp_hpts_softclock and hpts_that_need_softclock so that they ar= e in the same cacheline.=C2=A0 Then we'd be hitting just a single line = in the common case.=C2=A0 (I've made comments on the review to that eff= ect).
>
> Also, I wonder if the threshold could get higher by default, so that h= pts is never called in this context unless we're to the point where we&= #39;re scheduling thousands of runs of the hpts thread (and taking all thos= e clock interrupts).
>
> Drew
>
> On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote:
>> On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:
>>> Ok I have created
>>>
>>> https://reviews.freebsd.org/D44420
>>>
>>>
>>> To address the issue. I also attach a short version of the pat= ch that Nuno
>>> can try and validate
>>>
>>> it works. Drew you may want to try this and validate the optim= ization does
>>> kick in since I can
>>>
>>> only now test that it does not on my local box :)
>> The patch still causes access to all cpu's cachelines on each = userret.
>> It would be much better to inc/check the threshold and only schedu= le the
>> call when exceeded.=C2=A0 Then the call can occur in some dedicate= d context,
>> like per-CPU thread, instead of userret.
>>
>>>
>>>
>>> R
>>>
>>>
>>>
>>> On 3/18/24 3:42 PM, Drew Gallatin wrote:
>>>> No.=C2=A0 The goal is to run on every return to userspace = for every thread.
>>>>
>>>> Drew
>>>>
>>>> On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrot= e:
>>>>> On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallati= n wrote:
>>>>>> I got the idea from
>>>>>> https= ://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf
>>>>>> The gist is that the TCP pacing stuff needs to run= frequently, and
>>>>>> rather than run it out of a clock interrupt, its m= ore efficient to run
>>>>>> it out of a system call context at just the point = where we return to
>>>>>> userspace and the cache is trashed anyway. The cur= rent implementation
>>>>>> is fine for our workload, but probably not idea fo= r a generic system.
>>>>>> Especially one where something is banging on syste= m calls.
>>>>>>
>>>>>> Ast's could be the right tool for this, but I&= #39;m super unfamiliar with
>>>>>> them, and I can't find any docs on them.
>>>>>>
>>>>>> Would ast_register(0, ASTR_UNCOND, 0, func) be rou= ghly equivalent to
>>>>>> what's happening here?
>>>>> This call would need some AST number added, and then i= t registers the
>>>>> ast to run on next return to userspace, for the curren= t thread.
>>>>>
>>>>> Is it enough?
>>>>>>
>>>>>> Drew
>>>>>
>>>>>>
>>>>>> On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belou= sov wrote:
>>>>>>> On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike= Karels wrote:
>>>>>>>> On 18 Mar 2024, at 7:04, tuexen@freebsd.org wrote:
>>>>>>>>
>>>>>>>>>> On 18. Mar 2024, at 12:42, Nuno Te= ixeira
>>>>> <eduardo@freebsd.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello all!
>>>>>>>>>>
>>>>>>>>>> It works just fine!
>>>>>>>>>> System performance is OK.
>>>>>>>>>> Using patch on main-n268841-b0aaf8= beb126(-dirty).
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> net.inet.tcp.functions_available:<= br> >>>>>>>>>> Stack=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D
>>>>> Alias=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 PCB count
>>>>>>>>>> freebsd freebsd=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 0=
>>>>>>>>>> rack=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *
>>>>> rack=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A038
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> It would be so nice that we can ha= ve a sysctl tunnable for
>>>>> this patch
>>>>>>>>>> so we could do more tests without = recompiling kernel.
>>>>>>>>> Thanks for testing!
>>>>>>>>>
>>>>>>>>> @gallatin: can you come up with a patc= h that is acceptable
>>>>> for Netflix
>>>>>>>>> and allows to mitigate the performance= regression.
>>>>>>>>
>>>>>>>> Ideally, tcphpts could enable this automat= ically when it
>>>>> starts to be
>>>>>>>> used (enough?), but a sysctl could select = auto/on/off.
>>>>>>> There is already a well-known mechanism to req= uest execution of the
>>>>>>> specific function on return to userspace, name= ly AST.=C2=A0 The difference
>>>>>>> with the current hack is that the execution is= requested for one
>>>>> callback
>>>>>>> in the context of the specific thread.
>>>>>>>
>>>>>>> Still, it might be worth a try to use it; what= is the reason to
>>>>> hit a thread
>>>>>>> that does not do networking, with TCP processi= ng?
>>>>>>>
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>> Thanks all!
>>>>>>>>>> Really happy here :)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Nuno Teixeira <eduardo@freebsd.org> escrev= eu (domingo,
>>>>> 17/03/2024 =C3=A0(s) 20:26):
>>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>>> I don't have the full = context, but it seems like the
>>>>> complaint is a performance regression in bonnie++ and = perhaps other
>>>>> things when tcp_hpts is loaded, even when it is not us= ed.=C2=A0 Is that
>>>>> correct?
>>>>>>>>>>>>
>>>>>>>>>>>> If so, I suspect its becau= se we drive the
>>>>> tcp_hpts_softclock() routine from userret(), in order = to avoid tons
>>>>> of timer interrupts and context switches.=C2=A0 To tes= t this theory,=C2=A0 you
>>>>> could apply a patch like:
>>>>>>>>>>>
>>>>>>>>>>> It's affecting overall sys= tem performance, bonnie was just
>>>>> a way to
>>>>>>>>>>> get some numbers to compare. >>>>>>>>>>>
>>>>>>>>>>> Tomorrow I will test patch. >>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nuno Teixeira
>>>>>>>>>>> FreeBSD Committer (ports)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nuno Teixeira
>>>>>>>>>> FreeBSD Committer (ports)
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>>> diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c >>> index 8c4d2d41a3eb..eadbee19f69c 100644
>>> --- a/sys/netinet/tcp_hpts.c
>>> +++ b/sys/netinet/tcp_hpts.c
>>> @@ -216,6 +216,7 @@ struct tcp_hpts_entry {
>>> void *ie_cookie;
>>> uint16_t p_num; /* The hpts number one per cpu */
>>> uint16_t p_cpu; /* The hpts CPU */
>>> + uint8_t hit_callout_thresh;
>>> /* There is extra space in here */
>>> /* Cache line 0x100 */
>>> struct callout co __aligned(CACHE_LINE_SIZE);
>>> @@ -269,6 +270,11 @@ static struct hpts_domain_info {
>>> int cpu[MAXCPU];
>>> } hpts_domains[MAXMEMDOM];
>>>
>>> +counter_u64_t hpts_that_need_softclock;
>>> +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needso= ftclock, CTLFLAG_RD,
>>> +=C2=A0 =C2=A0 &hpts_that_need_softclock,
>>> +=C2=A0 =C2=A0 "Number of hpts threads that need softcloc= k");
>>> +
>>> counter_u64_t hpts_hopelessly_behind;
>>>
>>> SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeles= s, CTLFLAG_RD,
>>> @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, p= recision, CTLFLAG_RW,
>>>=C2=A0 =C2=A0 =C2=A0&tcp_hpts_precision, 120,
>>>=C2=A0 =C2=A0 =C2=A0"Value for PRE() precision of callout&= quot;);
>>> SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_R= W,
>>> -=C2=A0 =C2=A0 &conn_cnt_thresh, 0,
>>> +=C2=A0 =C2=A0 &conn_cnt_thresh, DEFAULT_CONNECTION_THESHO= LD,
>>>=C2=A0 =C2=A0 =C2=A0"How many connections (below) make us = use the callout based mechanism");
>>> SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW,<= br> >>>=C2=A0 =C2=A0 =C2=A0&hpts_does_tp_logging, 0,
>>> @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void)
>>> struct tcp_hpts_entry *hpts;
>>> int ticks_ran;
>>>
>>> + if (counter_u64_fetch(hpts_that_need_softclock) =3D=3D 0) >>> + return;
>>> +
>>> hpts =3D tcp_choose_hpts_to_run();
>>>
>>> if (hpts->p_hpts_active) {
>>> @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx)
>>> ticks_ran =3D tcp_hptsi(hpts, 1);
>>> tv.tv_sec =3D 0;
>>> tv.tv_usec =3D hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLO= T;
>>> + if ((hpts->p_on_queue_cnt > conn_cnt_thresh) &&= ; (hpts->hit_callout_thresh =3D=3D 0)) {
>>> + hpts->hit_callout_thresh =3D 1;
>>> + counter_u64_add(hpts_that_need_softclock, 1);
>>> + } else if ((hpts->p_on_queue_cnt <=3D conn_cnt_thresh)= && (hpts->hit_callout_thresh =3D=3D 1)) {
>>> + hpts->hit_callout_thresh =3D 0;
>>> + counter_u64_add(hpts_that_need_softclock, -1);
>>> + }
>>> if (hpts->p_on_queue_cnt >=3D conn_cnt_thresh) {
>>> if(hpts->p_direct_wake =3D=3D 0) {
>>> /*
>>> @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void)
>>> cpu_top =3D NULL;
>>> #endif
>>> tcp_pace.rp_num_hptss =3D ncpus;
>>> + hpts_that_need_softclock =3D counter_u64_alloc(M_WAITOK); >>> hpts_hopelessly_behind =3D counter_u64_alloc(M_WAITOK);
>>> hpts_loops =3D counter_u64_alloc(M_WAITOK);
>>> back_tosleep =3D counter_u64_alloc(M_WAITOK);
>>> @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void)
>>> free(tcp_pace.grps, M_TCPHPTS);
>>> #endif
>>>
>>> + counter_u64_free(hpts_that_need_softclock);
>>> counter_u64_free(hpts_hopelessly_behind);
>>> counter_u64_free(hpts_loops);
>>> counter_u64_free(back_tosleep);
>>
>>
>
>
>
> --
> Nuno Teixeira
> FreeBSD Committer (ports)



--
Nuno Teixeira
FreeBSD Committ= er (ports)
--0000000000004c6d020615bc81cf--