Date:      Mon, 09 Mar 2020 08:24:40 +0000
From:      bugzilla-noreply@freebsd.org
To:        virtualization@FreeBSD.org
Subject:   [Bug 235856] FreeBSD freezes on AWS EC2 t3 machines
Message-ID:  <bug-235856-27103-aDbQZ0VGDs@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-235856-27103@https.bugs.freebsd.org/bugzilla/>
References:  <bug-235856-27103@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235856

--- Comment #36 from mail@rubenvos.com ---
Hi,

This weekend the issue manifested itself again on one of our 12.1 instances
(with an EBS volume attached):

Mar  7 03:05:47 zfs01 kernel: nvme1: cpl does not map to outstanding cmd
Mar  7 03:05:47 zfs01 kernel: cdw0:00000000 sqhd:001b sqid:0001 cid:001b p:0 sc:00 sct:0 m:0 dnr:0
Mar  7 03:05:47 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar  7 03:05:47 zfs01 kernel: nvme1: resetting controller
Mar  7 03:05:47 zfs01 kernel: nvme1: temperature threshold not supported
Mar  7 03:05:47 zfs01 kernel: nvme1: aborting outstanding i/o
Mar  7 03:06:18 zfs01 kernel: nvme1: Missing interrupt
Mar  7 03:06:48 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar  7 03:06:48 zfs01 kernel: nvme1: resetting controller
Mar  7 03:06:48 zfs01 kernel: nvme1: temperature threshold not supported
Mar  7 03:06:48 zfs01 kernel: nvme1: aborting outstanding i/o
Mar  7 03:06:48 zfs01 syslogd: last message repeated 5 times
Mar  7 03:07:20 zfs01 kernel: nvme1: VERIFY sqid:1 cid:27 nsid:0 lba:0 len:1
Mar  7 03:07:20 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar  7 03:07:20 zfs01 kernel: nvme1: Missing interrupt
Mar  7 03:07:20 zfs01 kernel: nvme1: VERIFY sqid:1 cid:27 nsid:0 lba:0 len:1
Mar  7 03:07:20 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar  7 03:08:23 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Mar  7 03:08:23 zfs01 kernel: ena0: Trigger reset is on
Mar  7 03:08:23 zfs01 kernel: ena0: device is going DOWN
Mar  7 03:08:23 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar  7 03:08:23 zfs01 kernel: nvme1: resetting controller
Mar  7 03:08:23 zfs01 dhclient[40936]: send_packet6: Network is down
Mar  7 03:08:23 zfs01 dhclient[40936]: dhc6: send_packet6() sent -1 of 52 bytes
Mar  7 03:08:23 zfs01 kernel: nvme1: aborting outstanding admin command
Mar  7 03:08:23 zfs01 kernel: nvme1: CREATE IO SQ (01) sqid:0 cid:24 nsid:1 cdw10:1cb72e58 cdw11:00000000
Mar  7 03:08:23 zfs01 kernel: nvme1: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
Mar  7 03:08:23 zfs01 kernel: nvme1: temperature threshold not supported
Mar  7 03:08:23 zfs01 kernel: nvme1: aborting outstanding i/o
Mar  7 03:08:23 zfs01 syslogd: last message repeated 2 times
Mar  7 03:08:53 zfs01 kernel: nvme1: WRITE sqid:1 cid:18 nsid:1 lba:3856318 len:64
Mar  7 03:08:53 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar  7 03:08:53 zfs01 kernel: nvme1: Missing interrupt
Mar  7 03:09:15 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x58
Mar  7 03:09:16 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Mar  7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Mar  7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Mar  7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Mar  7 03:09:16 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Mar  7 03:09:16 zfs01 kernel: ena0: device is going UP
Mar  7 03:09:16 zfs01 kernel: ena0: link is UP
Mar  7 03:10:30 zfs01 dhclient[40936]: send_packet6: Network is down
Mar  7 03:10:30 zfs01 dhclient[40936]: dhc6: send_packet6() sent -1 of 52 bytes
Mar  7 03:10:32 zfs01 dhclient[69248]: send_packet: Network is down
Mar  7 03:11:16 zfs01 syslogd: last message repeated 4 times
Mar  7 03:11:33 zfs01 syslogd: last message repeated 1 times
Mar  7 03:13:31 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Mar  7 03:13:31 zfs01 kernel: ena0: Trigger reset is on
Mar  7 03:13:31 zfs01 kernel: ena0: device is going DOWN
Mar  7 03:14:25 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x134
Mar  7 03:14:26 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Mar  7 03:14:26 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52

root@zfs01:/usr/home/ruben # ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x  1 root  wheel   1.0K Mar  7 03:01:00 2020 450.status-security
-rwxr-xr-x  1 root  wheel   1.4K Mar  7 03:01:00 2020 440.status-mailq
-rwxr-xr-x  1 root  wheel   705B Mar  7 03:01:00 2020 430.status-uptime
-rwxr-xr-x  1 root  wheel   611B Mar  7 03:01:00 2020 420.status-network
-rwxr-xr-x  1 root  wheel   684B Mar  7 03:01:00 2020 410.status-mfi
-rwxr-xr-x  1 root  wheel   590B Mar  7 03:01:00 2020 409.status-gconcat
-rwxr-xr-x  1 root  wheel   590B Mar  7 03:01:00 2020 408.status-gstripe
-rwxr-xr-x  1 root  wheel   591B Mar  7 03:01:00 2020 407.status-graid3
-rwxr-xr-x  1 root  wheel   596B Mar  7 03:01:00 2020 406.status-gmirror
-rwxr-xr-x  1 root  wheel   807B Mar  7 03:01:00 2020 404.status-zfs
-rwxr-xr-x  1 root  wheel   583B Mar  7 03:01:00 2020 401.status-graid
-rwxr-xr-x  1 root  wheel   773B Mar  7 03:01:00 2020 400.status-disks
-rwxr-xr-x  1 root  wheel   724B Mar  7 03:01:00 2020 330.news
-r-xr-xr-x  1 root  wheel   1.4K Mar  7 03:01:00 2020 310.accounting
-rwxr-xr-x  1 root  wheel   693B Mar  7 03:01:00 2020 300.calendar
-rwxr-xr-x  1 root  wheel   1.0K Mar  7 03:01:00 2020 210.backup-aliases
-rwxr-xr-x  1 root  wheel   1.7K Mar  7 03:01:00 2020 200.backup-passwd
-rwxr-xr-x  1 root  wheel   603B Mar  7 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x  1 root  wheel   1.0K Mar  7 03:01:00 2020 140.clean-rwho
-rwxr-xr-x  1 root  wheel   709B Mar  7 03:01:00 2020 130.clean-msgs
-rwxr-xr-x  1 root  wheel   1.1K Mar  7 03:01:00 2020 120.clean-preserve
-rwxr-xr-x  1 root  wheel   1.5K Mar  7 03:01:00 2020 110.clean-tmps
-rwxr-xr-x  1 root  wheel   1.3K Mar  7 03:01:00 2020 100.clean-disks
-rwxr-xr-x  1 root  wheel   811B Mar  5 03:21:29 2020 999.local
-rwxr-xr-x  1 root  wheel   2.8K Mar  5 03:21:29 2020 800.scrub-zfs
-rwxr-xr-x  1 root  wheel   845B Mar  5 03:21:29 2020 510.status-world-kernel
-rwxr-xr-x  1 root  wheel   737B Mar  5 03:21:29 2020 500.queuerun
-rwxr-xr-x  1 root  wheel   498B Mar  5 03:21:29 2020 480.status-ntpd
-rwxr-xr-x  1 root  wheel   451B Mar  5 03:03:36 2020 480.leapfile-ntpd
-rwxr-xr-x  1 root  wheel   2.0K Mar  5 03:03:18 2020 460.status-mail-rejects
drwxr-xr-x  2 root  wheel   1.0K Dec  7 06:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 06:23:36 2018 ..
root@zfs01:/usr/home/ruben #

root@zfs01:/usr/home/ruben # ls -lahtuT /etc/periodic/security/
total 68
-rwxr-xr-x  1 root  wheel   2.3K Mar  7 03:01:48 2020 900.tcpwrap
-rwxr-xr-x  1 root  wheel   2.3K Mar  7 03:01:48 2020 800.loginfail
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 700.kernelmsg
-r--r--r--  1 root  wheel   2.8K Mar  7 03:01:48 2020 security.functions
-rwxr-xr-x  1 root  wheel   2.0K Mar  7 03:01:48 2020 610.ipf6denied
-rwxr-xr-x  1 root  wheel   2.2K Mar  7 03:01:48 2020 550.ipfwlimit
-rwxr-xr-x  1 root  wheel   2.1K Mar  7 03:01:48 2020 520.pfdenied
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 510.ipfdenied
-rwxr-xr-x  1 root  wheel   2.0K Mar  7 03:01:48 2020 500.ipfwdenied
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 410.logincheck
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 400.passwdless
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 300.chkuid0
-rwxr-xr-x  1 root  wheel   2.3K Mar  7 03:01:48 2020 200.chkmounts
-rwxr-xr-x  1 root  wheel   2.2K Mar  7 03:01:25 2020 110.neggrpperm
-rwxr-xr-x  1 root  wheel   2.2K Mar  7 03:01:00 2020 100.chksetuid
drwxr-xr-x  2 root  wheel   512B Dec  7 06:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 06:23:36 2018 ..
root@zfs01:/usr/home/ruben #

The NIC kept going up and down ever since (for two days), until a coworker
rebooted the instance this morning.

There does seem to be a relationship with the periodic(8) framework: the issues
start around 03:05, while the access timestamps on the daily periodic scripts
were last updated around 03:01 ...
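
For reference, the daily periodic run is started from /etc/crontab; the entry
shown below is the stock FreeBSD default (an assumption, as this host's crontab
may have been customized), and its 03:01 start time matches the access times
listed above:

root@zfs01:/usr/home/ruben # grep 'periodic daily' /etc/crontab   # stock default shown; actual schedule may differ
1       3       *       *       *       root    periodic daily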

Will attach the verbose boot log. Feel free to request any additional details!

Kind regards,

Ruben

--
You are receiving this mail because:
You are the assignee for the bug.


