Date:      Tue, 13 May 2025 07:43:07 -0700
From:      Pete Wright <pete@nomadlogic.org>
To:        "Kiyanovski, Arthur" <akiyano@amazon.com>, Colin Percival <cperciva@tarsnap.com>, "freebsd-cloud@freebsd.org" <freebsd-cloud@freebsd.org>
Cc:        "Arinzon, David" <darinzon@amazon.com>
Subject:   Re: ena(4) tx timeout messages in dmesg
Message-ID:  <a6abb2fd-823b-49af-91f1-ecfc54e8ede8@nomadlogic.org>
In-Reply-To: <1c8e7c62067845ab9cd5fca6198a78e8@amazon.com>
References:  <fec4cb4f-2a36-4a3d-bf02-539fd1a1273c@nomadlogic.org> <01000196c5b6fa5f-b8ed430e-23ca-47fd-9dd9-374a1de9c67c-000000@email.amazonses.com> <527aa929-4083-4935-8147-e59b6416c3bf@nomadlogic.org> <01000196c5db82dc-cfa5bf54-9758-4125-bdca-f1794b76ac9f-000000@email.amazonses.com> <a5ef1194-a020-417c-b8a9-b82badfa3ca0@nomadlogic.org> <CAJXMMHGM3kWb6dL8-aJuUB-4xsCH=%2B8rLwUEv5nOM2kuh-0D8g@mail.gmail.com> <1c8e7c62067845ab9cd5fca6198a78e8@amazon.com>


On 5/12/25 19:52, Kiyanovski, Arthur wrote:
>> ---------- Forwarded message ---------
>> From: Pete Wright <pete@nomadlogic.org>
>> Date: Mon, 12 May 2025 at 12:30
>> Subject: Re: ena(4) tx timeout messages in dmesg
>> To: Colin Percival <cperciva@tarsnap.com>, <freebsd-cloud@freebsd.org>
>> Cc: Arthur Kiyanovski <akiyano@freebsd.org>
>>
>>
>>
>>
>> On 5/12/25 11:56, Colin Percival wrote:
>>> On 5/12/25 11:25, Pete Wright wrote:
>>>> On 5/12/25 11:17, Colin Percival wrote:
>>>>> On 5/12/25 11:04, Pete Wright wrote:
>>>>>> hey there - i have an ec2 instance that i'm using as a nfs server
>>>>>> and have noticed the following messages in my dmesg buffer:
>>>>>> [...]
>>>>>> ena0: Found a Tx that wasn't completed on time, qid 3, index 998. 1
>>>>>> msecs have passed since last cleanup. Missing Tx timeout value 5000
>>>>>> msecs.
>>>>>>
>>>>> I've heard that this can be caused by a thread being starved for
>>>>> CPU, possibly due to FreeBSD kernel scheduler issues, but that was
>>>>> on a far more heavily loaded system.  What instance type are you
>>>>> running on?
>>>>
>>>> oh of course, forgot to provide useful info:
>>>>
>>>> # uname -ar
>>>> FreeBSD airflow-nfs.q0.ringdna.net 14.2-RELEASE-p1 FreeBSD 14.2-RELEASE-p1 GENERIC amd64
>>>>
>>>> Instance type:
>>>> t3a.xlarge
>>>>
>>>> I also verified I have plenty of "burstable credit" available,
>>>> since this is a t class system (current balance is steady at
>>>> 2,300 credits).
>>>
>>> Ah, this won't necessarily help you -- T family instances are on
>>> shared hardware so even if you have burstable credits it's possible
>>> that you'll be unlucky with "noisy neighbours" and the sibling
>>> instances will all want CPU at the same time as you.  But I think
>>> there's probably something else going on as well.
>>>
>>
>>
>> oh that's a good point, since this is a pre-prod system that is less of a concern
>> as we want to limit spend when possible.  i'll be spinning up production
>> systems in the following week or so that will be on a "c"
>> class system, i'll keep an eye out to see if i see similar messages in that
>> environment.
>>
>> -pete
>>
>> --
>> Pete Wright
>> pete@nomadlogic.org
> 
> Hi Colin, Pete,
> 
> Your analysis regarding CPU being occupied is the classic explanation for this kind
> of print.
> 
> The prints are consistent with cpu not being available to the interrupt
> handler to run.
> Although you say you have burstable credits available, the fact that you are using
> T instance types does make you more susceptible to such issues.
> 
> Also, when you say you have 25% CPU usage, how did you check that?
> Are you using tools that give you an average over some period? If so, you may
> be at 0% CPU usage 75% of the time and at 100% CPU usage 25% of the time.
> 
> As you already suggested, the first thing we would like to eliminate is the T instance
> type.
> If all works - great!
> 
> If not, you may want to look into spreading interrupts over the different CPUs
> using https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#io-irq-affinity
> Also make sure that the CPU-heavy processes you have run on different CPUs than
> the ones that handle the interrupts.
> 
> Hope this helps,
> Arthur
> 
> 
> 

thanks for the context Arthur, I'll take a look at that sysctl knob.  as 
i said the box is only serving a python virtual environment to a pool of 
ec2 compute nodes, and the dataset resides in memory.  so nothing too 
crazy.  the load does have spikes but they are pretty brief and rarely 
over 70%.  i'm collecting metrics via telegraf, and also observe load 
via the usual suspects like top, systat etc.
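
To illustrate Arthur's point about averaged metrics, here is a minimal sketch in Python. The sample values and the 60-second window are made up for illustration and are not measurements from this box. It shows how a collector that reports one average per interval can read 25% while the CPU was fully busy for a quarter of that interval.

```python
def summarize(samples, window):
    """Return (per-window averages, overall peak) for per-second CPU samples."""
    averages = [
        sum(samples[i:i + window]) / len(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]
    return averages, max(samples)


# Hypothetical per-second CPU utilisation (percent):
# 15 seconds fully busy followed by 45 seconds idle.
samples = [100] * 15 + [0] * 45

averages, peak = summarize(samples, window=60)
print("per-window average:", averages)
print("peak sample:", peak)
```

A shorter window, or recording the per-interval maximum alongside the average, would make a burst like this visible.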

it sounds like ena(4) is particularly sensitive to cpu spikes 
though - at least with this vm configuration.  if i continue to see 
these messages in dmesg i'll test out distributing IRQs, otherwise i 
think i can chalk this up to a noisy neighbor or something similar.
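
One way to keep an eye on whether the messages continue is to count them per queue. The sketch below is in Python and assumes each message appears on a single line in the form quoted earlier in this thread. The regular expression and the helper name count_by_queue are illustrative, not part of the driver or any tool.

```python
import re
from collections import Counter

# Pattern assumed from the message quoted in this thread; the exact wording
# may differ between driver versions.
TX_TIMEOUT = re.compile(
    r"(?P<dev>ena\d+): Found a Tx that wasn't completed on time, "
    r"qid (?P<qid>\d+), index (?P<index>\d+)"
)


def count_by_queue(lines):
    """Count missed-Tx messages per (device, queue id)."""
    counts = Counter()
    for line in lines:
        match = TX_TIMEOUT.search(line)
        if match:
            counts[(match["dev"], int(match["qid"]))] += 1
    return counts


# Sample input: the message from this thread joined onto one line,
# plus an unrelated line that should be ignored.
sample = [
    "ena0: Found a Tx that wasn't completed on time, qid 3, index 998. "
    "1 msecs have passed since last cleanup. "
    "Missing Tx timeout value 5000 msecs.",
    "some unrelated kernel message",
]
print(count_by_queue(sample))
```

If the counts cluster on one or two queue ids, that may be a hint that the cpus servicing those queues are the busy ones.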

thanks!
-pete



-- 
Pete Wright
pete@nomadlogic.org



