Date:      Fri, 05 Feb 2021 09:11:45 +0100
From:      GomoR <freebsd-stable@gomor.org>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: Suspected mbuf leak with Nginx + sendfile + TLS in 12.2-STABLE
Message-ID:  <069535216479ce00859e4bcbf499f8a1@gomor.org>
In-Reply-To: <9c56bfda-725c-9c2a-9db3-4599abcfeaa0@FreeBSD.org>
References:  <f6118f40fcac0e938e4050fc36a1e05e@gomor.org> <9c56bfda-725c-9c2a-9db3-4599abcfeaa0@FreeBSD.org>

On 2021-02-04 19:33, John Baldwin wrote:
> None of the sendfile or KTLS changes from Netflix are in 12, they are only
> in 13 and later.

I thought about that possibility, thank you for the clarification.

>> Don't transmit mbufs that aren't yet ready on TOE sockets.
>> This includes mbufs waiting for data from sendfile() I/O requests, or
>> mbufs awaiting encryption for KTLS.
>> https://github.com/freebsd/freebsd-src/commit/14c77f30b201bf76119d59678e72051c093333c2
> 
> This patch only applies to Chelsio T5/T6 NICs when using TOE (TCP offload)
> and doesn't affect freeing mbufs, it just fixes a race when the NIC could
> potentially send random garbage if it sends the mbuf before the scheduled
> disk I/O to populate it with data from disk has completed.

Understood.

>> NIC is:
>> ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver>
>> 
>> What can we do to help you find the root cause?
> 
> The first step I would do if possible would be to bisect between the last
> known working version and the version that is known to be broken to
> determine which commit introduced the problem.  One thing that could help
> here is to see if you can reproduce the problem using a 12.2 kernel on a
> 12.1 world + ports.  If you can, then you can limit your bisecting to just
> building new kernels which will make that process quicker.

Thank you for the tip, I'll try that path and let you know.
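
For anyone following the thread, the search logic behind a kernel-only
bisection can be sketched as below. This is only an illustration, not our
actual procedure: the revision labels and the `leaks` predicate are
hypothetical placeholders. In practice each test means building and booting
a kernel at that revision and watching mbuf usage under load.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first revision for which is_bad() is true.

    Assumes revisions are ordered oldest to newest, revisions[0] is known
    good, revisions[-1] is known bad, and every revision after the first
    bad one is bad as well.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return revisions[hi]


if __name__ == "__main__":
    # Hypothetical revision labels standing in for kernel commits.
    revs = [f"r{i}" for i in range(100)]
    tested = []

    def leaks(rev: str) -> bool:
        # Placeholder for "build kernel, boot, run load, check mbufs".
        tested.append(rev)
        return int(rev[1:]) >= 42

    print(first_bad(revs, leaks), "found after", len(tested), "kernel builds")
```

The point is that the number of kernels to build grows with the logarithm of
the number of commits in the range, which is why keeping world + ports fixed
makes the whole exercise practical.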

> You might also see if using a different NIC shows the same problem.  If
> not, then it might point to a regression in the NIC driver (or perhaps in
> iflib as ix uses iflib I believe).

Unfortunately, not a possibility here.

I did some other tests and found where the problem arises. We use the
proxy_pass directive in Nginx: traffic comes in through one public
interface (ix0) and is proxied through a second interface (ix1) towards a
remote machine.

Changing the Nginx configuration so that everything goes through ix0 only
does not cause the issue. So the problem seems to be related to passing
packets between the two NICs.
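
To compare the two configurations (ix0 only versus ix0 + ix1), something like
the following could track the "mbufs in use" counter over time. It is a
minimal sketch under assumptions: it expects the `netstat -m` output to
contain a line of the form `N/N/N mbufs in use (current/cache/total)`, the
sample numbers are made up, and the helper names are hypothetical.

```python
import re
from typing import List, Optional

# Matches the "mbufs in use" summary line (assumed format, see above).
_MBUF_LINE = re.compile(
    r"^(\d+)/(\d+)/(\d+) mbufs in use \(current/cache/total\)", re.MULTILINE
)


def mbufs_in_use(netstat_m_output: str) -> Optional[int]:
    """Return the 'current' mbuf count, or None if the line is not found."""
    match = _MBUF_LINE.search(netstat_m_output)
    return int(match.group(1)) if match else None


def growth_per_sample(samples: List[int]) -> float:
    """Average change in the counter between consecutive samples."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / (len(samples) - 1)


# Made-up sample text, only to exercise the parser.
SAMPLE = """2056/3584/5640 mbufs in use (current/cache/total)
2048/1786/3834/1015426 mbuf clusters in use (current/cache/total/max)
"""

if __name__ == "__main__":
    print("current mbufs:", mbufs_in_use(SAMPLE))
    print("average growth:", growth_per_sample([100, 150, 200]))
```

Sampling at a fixed interval under the same load in both setups, a counter
that keeps growing only in the two-NIC case would support the observation
above.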

I'll keep you posted.

Regards,


