Date:      Tue, 15 Sep 2020 16:22:04 -0400
From:      Liang Tian <l.tian.email@gmail.com>
To:        "Scheffenegger, Richard" <Richard.Scheffenegger@netapp.com>
Cc:        FreeBSD Transport <freebsd-transport@freebsd.org>
Subject:   Re: Fast recovery ssthresh value
Message-ID:  <CAJhigrgSwEJheiTvvt56-o+=rgHVy3RaGNzn+SFsNqzgroOOFA@mail.gmail.com>
In-Reply-To: <SN4PR0601MB372845BA807F8F70543848C486200@SN4PR0601MB3728.namprd06.prod.outlook.com>
References:  <CAJhigrhbguXQzeYGfMtPRK03fp6KR65q8gjB9e9L-5tGGsuyzQ@mail.gmail.com> <SN4PR0601MB3728D1F8ABC9C86972B6C53886590@SN4PR0601MB3728.namprd06.prod.outlook.com> <CAJhigrjdRzK5fKpE9jTQM5p-wzKUBALK7Cc34_Qbi7HCZ_NCXw@mail.gmail.com> <SN4PR0601MB372817A4C0D80D981B1CE52586270@SN4PR0601MB3728.namprd06.prod.outlook.com> <CAJhigrgZDE4TURO+LJPr5nK--O+PwV4-cPHYJXdk08_K8GBkwQ@mail.gmail.com> <SN4PR0601MB3728E7F53971E89A5821467C86240@SN4PR0601MB3728.namprd06.prod.outlook.com> <CAJhigriU-CRz=x3NxdCYcEvabTYq=v5q+Z51KAgRq6PKD4wfeQ@mail.gmail.com> <SN4PR0601MB372845BA807F8F70543848C486200@SN4PR0601MB3728.namprd06.prod.outlook.com>

Hi Richard,

Thanks. It works well now. We also observed that the majority of
tinygrams were caused by the t_maxseg usage in tcp_sack_partialack().
Now with PRR we rarely see tinygrams, because 1)
tcp_sack_partialack() is no longer called, and 2) the updated PRR patch.

Just one comment on the patch: line 2546 is redundant, because maxseg
is already defined and calculated at line 2477.

Regards,
Liang

On Tue, Sep 15, 2020 at 10:35 AM Scheffenegger, Richard
<Richard.Scheffenegger@netapp.com> wrote:
>
> Hi Liang,
>
> I was about to send out this email notifying you of the changes to the patch, where you uncovered the issues with TSopt enabled TCP flows.
>
> https://reviews.freebsd.org/D18892
>
> Can you please re-patch your test machine with this updated version? (I fixed one merge issue due to a recent whitespace cleanup too, so it should apply cleanly to HEAD now.)
>
> Please let us know and share any comments and criticism about this patch!
>
> Thanks again for testing - and finding the overlooked combination with timestamps.
>
>
> Richard Scheffenegger
>
> -----Original Message-----
> From: Liang Tian <l.tian.email@gmail.com>
> Sent: Friday, 11 September 2020 19:02
> To: Scheffenegger, Richard <Richard.Scheffenegger@netapp.com>
> Cc: FreeBSD Transport <freebsd-transport@freebsd.org>
> Subject: Re: Fast recovery ssthresh value
>
> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
>
>
>
> Hi Richard,
>
> Initial tests show PRR is doing quite well. See trace below showing the response to TSval 2713381916 and 2713381917.
> I have a comment on the patch: I think all the tp->t_maxseg should be replaced with maxseg in the diff (https://reviews.freebsd.org/D18892),
> where maxseg = tcp_maxseg(tp). This will take TCP options (timestamp in this case) into account and avoid sending the tinygrams with len 120 and 36 in the trace below.
> Interestingly, we were also chasing another issue where we see a lot of
> 12-byte segments when retransmission happens (before applying the PRR patch);
> we suspect the mixed usage of t_maxseg and maxseg = tcp_maxseg(tp) in the TCP code is causing this: the CCA modules are all using t_maxseg for CWND increase instead of the effective SMSS.
>
> [TCP Dup ACK 41541#3] 52466 > 80 [ACK] Seq=156 Ack=44596441 Win=3144704 Len=0 TSval=2713381914 TSecr=1636604730 SLE=46785317 SRE=46790869
> [TCP Dup ACK 41541#4] 52466 > 80 [ACK] Seq=156 Ack=44596441 Win=3144704 Len=0 TSval=2713381916 TSecr=1636604730 SLE=46785317 SRE=46804749
> [TCP Dup ACK 41541#5] 52466 > 80 [ACK] Seq=156 Ack=44596441 Win=3144704 Len=0 TSval=2713381917 TSecr=1636604730 SLE=46785317 SRE=46808913
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44597853 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44599241 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44600629 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44602017 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44603405 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44604793 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44606181 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44607569 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44608957 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44610345 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44611733 Ack=156 Win=1048576 Len=120 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44611853 Ack=156 Win=1048576 Len=1388 TSval=1636604905 TSecr=2713381917
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44613241 Ack=156 Win=1048576 Len=1388 TSval=1636604905 TSecr=2713381917
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44614629 Ack=156 Win=1048576 Len=1388 TSval=1636604905 TSecr=2713381917
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44616017 Ack=156 Win=1048576 Len=36 TSval=1636604905 TSecr=2713381917
> [TCP Dup ACK 41541#6] 52466 > 80 [ACK] Seq=156 Ack=44596441 Win=3144704 Len=0 TSval=2713381925 TSecr=1636604730 SLE=46785317 SRE=46867209
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44616053 Ack=156 Win=1048576 Len=1388 TSval=1636604912 TSecr=2713381925
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44617441 Ack=156 Win=1048576 Len=1388 TSval=1636604912 TSecr=2713381925
>
> Thanks,
> Liang
> ...
>
> On Fri, Sep 11, 2020 at 3:40 AM Scheffenegger, Richard <Richard.Scheffenegger@netapp.com> wrote:
> >
> > Perfect!
> >
> > Please share your findings then, as reviews (including informal ones) are needed prior to me committing this patch.
> >
> > Note that it builds upon D18624, which is currently in stable/12 and head, but not any released branches. So you may need to apply that too if you aren't using head.
> >
> > Best regards,
> >
> >
> > Richard Scheffenegger
> >
> > -----Original Message-----
> > From: Liang Tian <l.tian.email@gmail.com>
> > Sent: Friday, 11 September 2020 06:06
> > To: Scheffenegger, Richard <Richard.Scheffenegger@netapp.com>; FreeBSD
> > Transport <freebsd-transport@freebsd.org>
> > Subject: Re: Fast recovery ssthresh value
> >
> >
> >
> >
> >
> > Hi Richard,
> >
> > Thanks! I'm able to apply the patches. I'll test it.
> >
> > Regards,
> > Liang
> >
> >
> >
> > On Thu, Sep 10, 2020 at 5:49 AM Scheffenegger, Richard <Richard.Scheffenegger@netapp.com> wrote:
> > >
> > > Hi Liang,
> > >
> > > Yes, you are absolutely correct about this observation. The SACK loss recovery will only send one MSS per received ACK right now - and when there is ACK thinning present, it will fail to timely recover all the missing packets, eventually receiving no more ACKs to clock out more retransmissions...
> > >
> > > I have a Diff in review, to implement Proportional Rate Reduction:
> > >
> > > https://reviews.freebsd.org/D18892
> > >
> > > Which should address not only that issue with ACK thinning, but also the issue that current SACK loss recovery has to wait until pipe drops below ssthresh before the retransmissions are clocked out. And then, they would actually be clocked out at the same rate as the incoming ACKs. This would be the same rate as when the overload happened (barring any ACK thinning), and as a secondary effect, it was observed that this behavior too can lead to self-inflicted loss - of retransmissions.
> > >
> > > If you have the ability to patch your kernel with D18892 and observe how it reacts in your dramatic ACK thinning scenario, that would be good to know! The assumption of the patch was that - as per TCP RFC requirements - there is one ACK for each received out-of-sequence data segment, and that ACK drops / thinning are not happening on the massive scale you describe.
> > >
> > > Best regards,
> > >
> > > Richard Scheffenegger
> > >
> > > -----Original Message-----
> > > From: owner-freebsd-transport@freebsd.org
> > > <owner-freebsd-transport@freebsd.org> On Behalf Of Liang Tian
> > > Sent: Wednesday, 9 September 2020 19:16
> > > To: Scheffenegger, Richard <Richard.Scheffenegger@netapp.com>
> > > Cc: FreeBSD Transport <freebsd-transport@freebsd.org>
> > > Subject: Re: Fast recovery ssthresh value
> > >
> > > Hi Richard,
> > >
> > > Thanks for the explanation and sorry for the late reply.
> > > I've been investigating SACK loss recovery and I think I'm seeing an
> > > issue similar to the ABC L value issue that I reported
> > > previously (https://reviews.freebsd.org/D26120), and I do believe there is a deviation from RFC 3517:
> > > The issue happens when a DupAck is received during SACK loss recovery in the presence of ACK thinning or a receiver enabling LRO, which means the SACK block edges could expand by more than 1 SMSS (we've seen 30*SMSS), i.e. a single DupAck could decrement `pipe` by more than 1 SMSS.
> > > In RFC 3517:
> > > (C) If cwnd - pipe >= 1 SMSS, the sender SHOULD transmit one or more segments...
> > >         (C.5) If cwnd - pipe >= 1 SMSS, return to (C.1)
> > > So based on the RFC, the sender should be able to send more segments if such a DupAck is received, because of the big change to `pipe`.
> > >
> > > In the current implementation, the cwin variable, which controls the amount of data that can be transmitted based on the new information, is dictated by snd_cwnd. The snd_cwnd is incremented by 1 SMSS for each DupAck received. I believe this effectively limits the retransmission triggered by each DupAck to 1 SMSS - a deviation.
> > >  307         cwin =
> > >  308             imax(min(tp->snd_wnd, tp->snd_cwnd) - sack_bytes_rxmt, 0);
> > >
> > > As a result, SACK is not doing enough recovery in this scenario and loss has to be recovered by RTO.
> > > Again, I'd appreciate feedback from the community.
> > >
> > > Regards,
> > > Liang Tian
> > >
> > >
> > >
> > >
> > > On Sun, Aug 23, 2020 at 3:56 PM Scheffenegger, Richard <Richard.Scheffenegger@netapp.com> wrote:
> > > >
> > > > Hi Liang,
> > > >
> > > > In SACK loss recovery, you can recover up to ssthresh (prior cwnd/2 [or 70% in the case of cubic]) lost bytes - at least in theory.
> > > >
> > > > In comparison, (New)Reno can only recover one lost packet per window, and then keeps on transmitting new segments (ack + cwnd), even before the receipt of the retransmitted packet is acked.
> > > >
> > > > For historic reasons, the semantics of the variable cwnd are overloaded during loss recovery: it doesn't "really" indicate cwnd, but rather indicates if/when retransmissions can happen.
> > > >
> > > >
> > > > In both cases (also the simple one, with only one packet loss), cwnd should be equal (or near equal) to ssthresh by the time loss recovery is finished - but NOT before! While it may appear like slow-start, the value of the cwnd variable really increases by acked_bytes only per ACK (not acked_bytes + SMSS), since the left edge (snd_una) doesn't move right - unlike during slow-start. But numerically, these different phases (slow-start / sack loss-recovery) may appear very similar.
> > > >
> > > > You could check this using the (loadable) SIFTR module, which captures t_flags (indicating if cong/loss recovery is active), ssthresh, cwnd, and other parameters.
> > > >
> > > > That is at least how things are supposed to work; or have you investigated the timing and behavior of SACK loss recovery and found a deviation from RFC 3517? Note that FreeBSD currently has not fully implemented RFC 6675 support (which deviates slightly from 3517 under specific circumstances; I have a patch pending to implement the 6675 rescue retransmission, but haven't tweaked the other aspects of 6675 vs. 3517).
> > > >
> > > > BTW: While freebsd-net is not the wrong DL per se, TCP, UDP, and SCTP specific questions can also be posted to freebsd-transport, which is more narrowly focused.
> > > >
> > > > Best regards,
> > > >
> > > > Richard Scheffenegger
> > > >
> > > > -----Original Message-----
> > > > From: owner-freebsd-net@freebsd.org
> > > > <owner-freebsd-net@freebsd.org> On Behalf Of Liang Tian
> > > > Sent: Sunday, 23 August 2020 00:14
> > > > To: freebsd-net <freebsd-net@freebsd.org>
> > > > Subject: Fast recovery ssthresh value
> > > >
> > > > Hi all,
> > > >
> > > > When 3 dupacks are received and TCP enters fast recovery, if SACK is used, the CWND is set to maxseg:
> > > >
> > > > 2593                     if (tp->t_flags & TF_SACK_PERMIT) {
> > > > 2594                         TCPSTAT_INC(
> > > > 2595                             tcps_sack_recovery_episode);
> > > > 2596                         tp->snd_recover = tp->snd_nxt;
> > > > 2597                         tp->snd_cwnd = maxseg;
> > > > 2598                         (void) tp->t_fb->tfb_tcp_output(tp);
> > > > 2599                         goto drop;
> > > > 2600                     }
> > > >
> > > > Otherwise (SACK is not in use), CWND is set to maxseg before
> > > > tcp_output() and then set back to snd_ssthresh + inflation:
> > > > 2601                     tp->snd_nxt = th->th_ack;
> > > > 2602                     tp->snd_cwnd = maxseg;
> > > > 2603                     (void) tp->t_fb->tfb_tcp_output(tp);
> > > > 2604                     KASSERT(tp->snd_limited <= 2,
> > > > 2605                         ("%s: tp->snd_limited too big",
> > > > 2606                         __func__));
> > > > 2607                     tp->snd_cwnd = tp->snd_ssthresh +
> > > > 2608                          maxseg *
> > > > 2609                          (tp->t_dupacks - tp->snd_limited);
> > > > 2610                     if (SEQ_GT(onxt, tp->snd_nxt))
> > > > 2611                         tp->snd_nxt = onxt;
> > > > 2612                     goto drop;
> > > >
> > > > I'm wondering, in the SACK case, should CWND be set back to ssthresh (which has been slashed in cc_cong_signal() a few lines above) before line 2599, like in the non-SACK case, instead of doing slow start from maxseg?
> > > > I read RFC 6675 and a few others, and it looks like that's the case. I appreciate your opinion, again.
> > > >
> > > > Thanks,
> > > > Liang
> > > > _______________________________________________
> > > > freebsd-net@freebsd.org mailing list
> > > > https://lists.freebsd.org/mailman/listinfo/freebsd-net
> > > > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
> > > _______________________________________________
> > > freebsd-transport@freebsd.org mailing list
> > > https://lists.freebsd.org/mailman/listinfo/freebsd-transport
> > > To unsubscribe, send any mail to "freebsd-transport-unsubscribe@freebsd.org"


