Date:      Sun, 21 Jul 2019 16:32:04 -0400
From:      Patrick Kelsey <pkelsey@freebsd.org>
To:        Andriy Gapon <avg@freebsd.org>
Cc:        freebsd-net@freebsd.org, FreeBSD Current <freebsd-current@freebsd.org>
Subject:   Re: vmx0: watchdog timeout on queue 2, no interrupts on BSP
Message-ID:  <F89598D2-FEBD-4857-9734-350A077DF4C0@freebsd.org>
In-Reply-To: <dfb182e0-7512-cd48-142b-b98dfa4d3525@FreeBSD.org>
References:  <9c509f7b-8294-d2fe-ea3e-f10fd51f5736@FreeBSD.org> <CAD44qMUA_-vT7-374WGZH1bUFCA-sVo_UHi1uQjKkgpk9358bA@mail.gmail.com> <dfb182e0-7512-cd48-142b-b98dfa4d3525@FreeBSD.org>

> On Jul 21, 2019, at 4:17 PM, Andriy Gapon <avg@freebsd.org> wrote:
> 
>> On 20/07/2019 20:08, Patrick Kelsey wrote:
>> 
>> 
>> On Fri, Jul 19, 2019 at 10:07 AM Andriy Gapon <avg@freebsd.org> wrote:
>> 
>> 
>>    Recently we experienced a strange problem.
>>    We noticed a lot of these messages in the logs:
>>    vmx0: watchdog timeout on queue 2
>>    (always queue 2)
>>    Also, we noticed that connections to some end points did not work at all
>>    while others worked without problems.  I assume that that was because
>>    specific flows got assigned to that queue 2.
>> 
>>    Further investigation has shown that none of interrupts assigned to the
>>    BSP has ever fired (since boot, of course).  That included vmx0:rx2 and
>>    vmx0:tx2.  But also interrupts for other drivers as well.
>> 
>>    Trying to get more information I rebooted the system and the problem
>>    disappeared.
>> 
>>    Has anyone seen anything like that?
>>    Any thoughts on possible causes?
>>    Any suggestions what to check if/when the problem reoccurs?
>> 
>>    Thanks!
>> 
>> 
>> If you are running head at or after r347221 or stable/12 at or after
>> r349112, then this could be due to
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118 (see Comment 4
>> - short story is that an iflib change has broken the vmx driver).
> 
> I am not sure if that bug could lead to all interrupts on the core
> getting disabled (for all drivers), and right at the boot time.

I am not sure either, but it’s the kind of bug that breaks the design of the vmx driver badly enough that driver state can be corrupted to the point where the kernel can panic.  I haven’t fully analyzed the potential scope of the memory corruption / hardware state corruption that can occur (because the fix for the issue is already apparent), so for now I am not ruling out effects beyond the device and the driver itself.

If you are saying that zero vmx queue interrupts have occurred anywhere in the system, then I would rule out any connection to this bug, because at least one such interrupt is a prerequisite for the corruption to occur.
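
One way to check that condition is to look at the per-interrupt totals in `vmstat -i` output (I believe `vmstat -ia` is needed to list interrupts that have never fired). The following is only a rough sketch: the helper name `vmx_interrupt_totals` and the sample text are made up for illustration, and the sample is not output from the affected system.

```python
def vmx_interrupt_totals(vmstat_text, device="vmx0"):
    """Return {interrupt name: total count} for lines that mention `device`.

    Assumes the usual three-column layout: name, total, rate.
    """
    totals = {}
    for line in vmstat_text.splitlines():
        # Names may contain spaces, so split the two numeric columns off the right.
        parts = line.rsplit(None, 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # header line or something unexpected
        name, total, _rate = parts
        if device in name:
            totals[name.strip()] = int(total)
    return totals


# Hypothetical sample, shaped like `vmstat -ia` output.
SAMPLE = """\
interrupt                          total       rate
cpu0:timer                             0          0
irq256: vmx0:rx0                    5120          3
irq257: vmx0:tx0                    4096          2
irq260: vmx0:rx2                       0          0
irq261: vmx0:tx2                       0          0
Total                               9216          5
"""

totals = vmx_interrupt_totals(SAMPLE)
# True if at least one vmx queue interrupt has fired anywhere in the system.
any_fired = any(count > 0 for count in totals.values())
```

If `any_fired` is false for the real output, the prerequisite for the corruption is not met; if it is true, as in the sample, the bug cannot be ruled out on this basis alone.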

-Patrick

