Date: Wed, 10 Apr 2024 13:44:59 +0100
From: Nuno Teixeira <eduardo@freebsd.org>
To: tuexen@freebsd.org
Cc: Drew Gallatin <gallatin@freebsd.org>, Konstantin Belousov <kib@freebsd.org>,
    rrs <rrs@lakerest.net>, Mike Karels <mike@karels.net>, garyj@gmx.de,
    current@freebsd.org, net@freebsd.org, Randall Stewart <rrs@freebsd.org>
Subject: Re: Request for Testing: TCP RACK
Message-ID: <CAFDf7UK68POsmm%2BO74=6--cx1O-t3jsBrjgiaMV2r81otJq4gA@mail.gmail.com>
In-Reply-To: <CAFDf7UK_6dZ1BRxqhLP0LmN%2BdBtX2=9OFRLZnc3Fv8b=nj-dGg@mail.gmail.com>
References: <6e795e9c-8de4-4e02-9a96-8fabfaa4e66f@app.fastmail.com>
    <CAFDf7UKDWSnhm%2BTwP=ZZ9dkk0jmAgjGKPLpkX-CKuw3yH233gQ@mail.gmail.com>
    <CAFDf7UJq9SCnU-QYmS3t6EknP369w2LR0dNkQAc-NaRLvwVfoQ@mail.gmail.com>
    <A3F1FC0C-C199-4565-8E07-B233ED9E7B2E@freebsd.org>
    <6047C8EF-B1B0-4286-93FA-AA38F8A18656@karels.net>
    <ZfiI7GcbTwSG8kkO@kib.kiev.ua>
    <8031cd99-ded8-4b06-93b3-11cc729a8b2c@app.fastmail.com>
    <ZfiY-xUUM3wrBEz_@kib.kiev.ua>
    <38c54399-6c96-44d8-a3a2-3cc1bfbe50c2@app.fastmail.com>
    <27d8144f-0658-46f6-b8f3-35eb60061644@lakerest.net>
    <Zft8odA0s49eLhvk@kib.kiev.ua>
    <fc160c79-1884-4f68-8310-35e7ac0b9dd6@app.fastmail.com>
    <CAFDf7U%2Bsd0Hx6scDOvMuxOQaqVb8wi1ODrwT=uKDCqMn_81QEw@mail.gmail.com>
    <5C9863F7-0F1C-4D02-9F6D-9DDC5FBEB368@freebsd.org>
    <CAFDf7U%2BPtNhVsbR=pMhGVnxW5thWmzYBq9%2Bd4rSbZD0zRnJjkg@mail.gmail.com>
    <52479AA6-04F6-4D4A-ABE0-7142B47E28DF@freebsd.org>
    <CAFDf7UK_6dZ1BRxqhLP0LmN%2BdBtX2=9OFRLZnc3Fv8b=nj-dGg@mail.gmail.com>
(...)

Backup server is https://www.rsync.net/ (free 500GB for FreeBSD developers).

Nuno Teixeira <eduardo@freebsd.org> wrote (Wednesday, 10/04/2024 at 13:39):

> With the base stack I can complete a restic check successfully,
> downloading/reading/checking all files from a "big" remote compressed
> backup.
> Changing it to the RACK stack, it fails.
>
> I run this command often because, in the past, compression corruption
> occurred, and this check is the equivalent of restoring the backup to
> verify its integrity.
>
> Maybe someone could run a restic test to check whether this is
> reproducible.
>
> Thanks,
>
>
>
> <tuexen@freebsd.org> wrote (Wednesday, 10/04/2024 at 13:12):
>
>>
>>
>> > On 10. Apr 2024, at 13:40, Nuno Teixeira <eduardo@freebsd.org> wrote:
>> >
>> > Hello all,
>> >
>> > @ current 1500018, fetching torrents with net-p2p/qbittorrent
>> finished a ~2GB download and the connection stayed UP until the end:
>> >
>> > ---
>> > Apr 10 11:26:46 leg kernel: re0: watchdog timeout
>> > Apr 10 11:26:46 leg kernel: re0: link state changed to DOWN
>> > Apr 10 11:26:49 leg dhclient[58810]: New IP Address (re0): 192.168.1.67
>> > Apr 10 11:26:49 leg dhclient[58814]: New Subnet Mask (re0): 255.255.255.0
>> > Apr 10 11:26:49 leg dhclient[58818]: New Broadcast Address (re0): 192.168.1.255
>> > Apr 10 11:26:49 leg kernel: re0: link state changed to UP
>> > Apr 10 11:26:49 leg dhclient[58822]: New Routers (re0): 192.168.1.1
>> > ---
>> >
>> > In past tests I got more watchdog timeouts; the connection went down
>> and a reboot was needed to bring it back (`service netif restart` didn't
>> work).
>> >
>> > Another way to reproduce this is using sysutils/restic (a backup program)
>> to read/check all files from a remote server via sftp:
>> >
>> > `restic -r sftp:user@remote:restic-repo check --read-data` from a 60GB
>> compressed backup.
>> >
>> > ---
>> > watchdog timeout x3 as above
>> > ---
>> >
>> > restic check fail log @ 15% progress:
>> > ---
>> > <snip>
>> > Load(<data/52e2923dd6>, 17310001, 0) returned error, retrying after
>> 1.7670599s: connection lost
>> > Load(<data/d27a0abe0f>, 17456892, 0) returned error, retrying after
>> 4.619104908s: connection lost
>> > Load(<data/52e2923dd6>, 17310001, 0) returned error, retrying after
>> 5.477648517s: connection lost
>> > List(lock) returned error, retrying after 293.057766ms: connection lost
>> > List(lock) returned error, retrying after 385.206693ms: connection lost
>> > List(lock) returned error, retrying after 1.577594281s: connection lost
>> > <snip>
>> >
>> > Connection continues UP.
>> Hi,
>>
>> I'm not sure what issue you are reporting. Could you state what
>> behavior you are experiencing with the base stack and with the RACK
>> stack? In particular, what is the difference?
>>
>> Best regards
>> Michael
>> >
>> > Cheers,
>> >
>> > <tuexen@freebsd.org> wrote (Thursday, 28/03/2024 at 15:53):
>> >> On 28. Mar 2024, at 15:00, Nuno Teixeira <eduardo@freebsd.org> wrote:
>> >>
>> >> Hello all!
>> >>
>> >> Running rack @b7b78c1c169 "Optimize HPTS..." — very happy on my laptop
>> (amd64)!
>> >>
>> >> Thanks all!
>> > Thanks for the feedback!
>> >
>> > Best regards
>> > Michael
>> >>
>> >> Drew Gallatin <gallatin@freebsd.org> wrote (Thursday, 21/03/2024
>> at 12:58):
>> >> The entire point is to *NOT* go through the overhead of scheduling
>> something asynchronously, but to take advantage of the fact that a
>> user/kernel transition is going to trash the cache anyway.
>> >>
>> >> In the common case of a system which has fewer than the threshold
>> number of connections, we access the tcp_hpts_softclock function pointer,
>> make one function call, and access hpts_that_need_softclock, and then
>> return. So that's 2 variables and a function call.
>> >>
>> >> I think it would be preferable to avoid that call, and to move the
>> declarations of tcp_hpts_softclock and hpts_that_need_softclock so that
>> they are in the same cacheline. Then we'd be hitting just a single line
>> in the common case. (I've made comments on the review to that effect.)
>> >>
>> >> Also, I wonder if the threshold could be raised by default, so that
>> hpts is never called in this context unless we're at the point where we're
>> scheduling thousands of runs of the hpts thread (and taking all those
>> clock interrupts).
>> >>
>> >> Drew
>> >>
>> >> On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote:
>> >>> On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:
>> >>>> Ok I have created
>> >>>>
>> >>>> https://reviews.freebsd.org/D44420
>> >>>>
>> >>>>
>> >>>> to address the issue. I also attach a short version of the patch
>> that Nuno
>> >>>> can try and validate that it works. Drew, you may want to try this
>> and validate that the optimization does
>> >>>> kick in, since right now I can
>> >>>> only test that it does not on my local box :)
>> >>> The patch still causes access to all CPUs' cachelines on each userret.
>> >>> It would be much better to inc/check the threshold and only schedule
>> the
>> >>> call when it is exceeded. Then the call can occur in some dedicated
>> context,
>> >>> like a per-CPU thread, instead of userret.
>> >>>
>> >>>>
>> >>>>
>> >>>> R
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 3/18/24 3:42 PM, Drew Gallatin wrote:
>> >>>>> No. The goal is to run on every return to userspace for every
>> thread.
>> >>>>>
>> >>>>> Drew
>> >>>>>
>> >>>>> On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrote:
>> >>>>>> On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallatin wrote:
>> >>>>>>> I got the idea from
>> https://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf
>> >>>>>>> The gist is that the TCP pacing stuff needs to run frequently, and
>> >>>>>>> rather than run it out of a clock interrupt, it's more efficient
>> to run
>> >>>>>>> it out of a system call context at just the point where we return
>> to
>> >>>>>>> userspace, where the cache is trashed anyway. The current
>> implementation
>> >>>>>>> is fine for our workload, but probably not ideal for a generic
>> system,
>> >>>>>>> especially one where something is banging on system calls.
>> >>>>>>>
>> >>>>>>> ASTs could be the right tool for this, but I'm super unfamiliar
>> with
>> >>>>>>> them, and I can't find any docs on them.
>> >>>>>>>
>> >>>>>>> Would ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent
>> to
>> >>>>>>> what's happening here?
>> >>>>>> This call would need some AST number added, and then it registers
>> the
>> >>>>>> AST to run on the next return to userspace, for the current thread.
>> >>>>>>
>> >>>>>> Is it enough?
>> >>>>>>>
>> >>>>>>> Drew
>> >>>>>>
>> >>>>>>>
>> >>>>>>> On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belousov wrote:
>> >>>>>>>> On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote:
>> >>>>>>>>> On 18 Mar 2024, at 7:04, tuexen@freebsd.org wrote:
>> >>>>>>>>>
>> >>>>>>>>>>> On 18. Mar 2024, at 12:42, Nuno Teixeira
>> >>>>>> <eduardo@freebsd.org> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hello all!
>> >>>>>>>>>>>
>> >>>>>>>>>>> It works just fine!
>> >>>>>>>>>>> System performance is OK.
>> >>>>>>>>>>> Using patch on main-n268841-b0aaf8beb126(-dirty).
>> >>>>>>>>>>>
>> >>>>>>>>>>> ---
>> >>>>>>>>>>> net.inet.tcp.functions_available:
>> >>>>>>>>>>> Stack       D Alias       PCB count
>> >>>>>>>>>>> freebsd       freebsd     0
>> >>>>>>>>>>> rack        * rack        38
>> >>>>>>>>>>> ---
>> >>>>>>>>>>>
>> >>>>>>>>>>> It would be so nice if we could have a sysctl tunable for
>> >>>>>> this patch
>> >>>>>>>>>>> so we could do more tests without recompiling the kernel.
>> >>>>>>>>>> Thanks for testing!
>> >>>>>>>>>>
>> >>>>>>>>>> @gallatin: can you come up with a patch that is acceptable
>> >>>>>> for Netflix
>> >>>>>>>>>> and allows mitigating the performance regression?
>> >>>>>>>>>
>> >>>>>>>>> Ideally, tcphpts could enable this automatically when it
>> >>>>>> starts to be
>> >>>>>>>>> used (enough?), but a sysctl could select auto/on/off.
>> >>>>>>>> There is already a well-known mechanism to request execution of
>> the
>> >>>>>>>> specific function on return to userspace, namely AST. The
>> difference
>> >>>>>>>> with the current hack is that the execution is requested for one
>> >>>>>> callback
>> >>>>>>>> in the context of the specific thread.
>> >>>>>>>>
>> >>>>>>>> Still, it might be worth a try to use it; what is the reason to
>> >>>>>> hit a thread
>> >>>>>>>> that does not do networking with TCP processing?
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Mike
>> >>>>>>>>>
>> >>>>>>>>>> Best regards
>> >>>>>>>>>> Michael
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks all!
>> >>>>>>>>>>> Really happy here :)
>> >>>>>>>>>>>
>> >>>>>>>>>>> Cheers,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Nuno Teixeira <eduardo@freebsd.org> wrote (Sunday,
>> >>>>>> 17/03/2024 at 20:26):
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Hello,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> I don't have the full context, but it seems like the
>> >>>>>> complaint is a performance regression in bonnie++ and perhaps other
>> >>>>>> things when tcp_hpts is loaded, even when it is not used. Is that
>> >>>>>> correct?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> If so, I suspect it's because we drive the
>> >>>>>> tcp_hpts_softclock() routine from userret(), in order to avoid tons
>> >>>>>> of timer interrupts and context switches. To test this theory, you
>> >>>>>> could apply a patch like:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> It's affecting overall system performance; bonnie was just
>> >>>>>> a way to
>> >>>>>>>>>>>> get some numbers to compare.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Tomorrow I will test the patch.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks!
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> --
>> >>>>>>>>>>>> Nuno Teixeira
>> >>>>>>>>>>>> FreeBSD Committer (ports)
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Nuno Teixeira
>> >>>>>>>>>>> FreeBSD Committer (ports)
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>
>> >>>>>
>> >>>
>> >>>> diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c
>> >>>> index 8c4d2d41a3eb..eadbee19f69c 100644
>> >>>> --- a/sys/netinet/tcp_hpts.c
>> >>>> +++ b/sys/netinet/tcp_hpts.c
>> >>>> @@ -216,6 +216,7 @@ struct tcp_hpts_entry {
>> >>>>  void *ie_cookie;
>> >>>>  uint16_t p_num; /* The hpts number one per cpu */
>> >>>>  uint16_t p_cpu; /* The hpts CPU */
>> >>>> + uint8_t hit_callout_thresh;
>> >>>>  /* There is extra space in here */
>> >>>>  /* Cache line 0x100 */
>> >>>>  struct callout co __aligned(CACHE_LINE_SIZE);
>> >>>> @@ -269,6 +270,11 @@ static struct hpts_domain_info {
>> >>>>  int cpu[MAXCPU];
>> >>>> } hpts_domains[MAXMEMDOM];
>> >>>>
>> >>>> +counter_u64_t hpts_that_need_softclock;
>> >>>> +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needsoftclock, CTLFLAG_RD,
>> >>>> +    &hpts_that_need_softclock,
>> >>>> +    "Number of hpts threads that need softclock");
>> >>>> +
>> >>>>  counter_u64_t hpts_hopelessly_behind;
>> >>>>
>> >>>>  SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeless, CTLFLAG_RD,
>> >>>> @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, precision, CTLFLAG_RW,
>> >>>>     &tcp_hpts_precision, 120,
>> >>>>     "Value for PRE() precision of callout");
>> >>>>  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_RW,
>> >>>> -    &conn_cnt_thresh, 0,
>> >>>> +    &conn_cnt_thresh, DEFAULT_CONNECTION_THESHOLD,
>> >>>>     "How many connections (below) make us use the callout based mechanism");
>> >>>>  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW,
>> >>>>     &hpts_does_tp_logging, 0,
>> >>>> @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void)
>> >>>>  struct tcp_hpts_entry *hpts;
>> >>>>  int ticks_ran;
>> >>>>
>> >>>> + if (counter_u64_fetch(hpts_that_need_softclock) == 0)
>> >>>> +  return;
>> >>>> +
>> >>>>  hpts = tcp_choose_hpts_to_run();
>> >>>>
>> >>>>  if (hpts->p_hpts_active) {
>> >>>> @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx)
>> >>>>  ticks_ran = tcp_hptsi(hpts, 1);
>> >>>>  tv.tv_sec = 0;
>> >>>>  tv.tv_usec = hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLOT;
>> >>>> + if ((hpts->p_on_queue_cnt > conn_cnt_thresh) && (hpts->hit_callout_thresh == 0)) {
>> >>>> +  hpts->hit_callout_thresh = 1;
>> >>>> +  counter_u64_add(hpts_that_need_softclock, 1);
>> >>>> + } else if ((hpts->p_on_queue_cnt <= conn_cnt_thresh) && (hpts->hit_callout_thresh == 1)) {
>> >>>> +  hpts->hit_callout_thresh = 0;
>> >>>> +  counter_u64_add(hpts_that_need_softclock, -1);
>> >>>> + }
>> >>>>  if (hpts->p_on_queue_cnt >= conn_cnt_thresh) {
>> >>>>   if (hpts->p_direct_wake == 0) {
>> >>>>    /*
>> >>>> @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void)
>> >>>>  cpu_top = NULL;
>> >>>> #endif
>> >>>>  tcp_pace.rp_num_hptss = ncpus;
>> >>>> + hpts_that_need_softclock = counter_u64_alloc(M_WAITOK);
>> >>>>  hpts_hopelessly_behind = counter_u64_alloc(M_WAITOK);
>> >>>>  hpts_loops = counter_u64_alloc(M_WAITOK);
>> >>>>  back_tosleep = counter_u64_alloc(M_WAITOK);
>> >>>> @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void)
>> >>>>  free(tcp_pace.grps, M_TCPHPTS);
>> >>>> #endif
>> >>>>
>> >>>> + counter_u64_free(hpts_that_need_softclock);
>> >>>>  counter_u64_free(hpts_hopelessly_behind);
>> >>>>  counter_u64_free(hpts_loops);
>> >>>>  counter_u64_free(back_tosleep);
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Nuno Teixeira
>> >> FreeBSD Committer (ports)
>> >
>> >
>> >
>> > --
>> > Nuno Teixeira
>> > FreeBSD Committer (ports)
>>
>
>
> --
> Nuno Teixeira
> FreeBSD Committer (ports)

--
Nuno Teixeira
FreeBSD Committer (ports)