From owner-freebsd-bugs@freebsd.org Fri Mar 16 05:24:31 2018 Return-Path: Delivered-To: freebsd-bugs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 71526F51C4B for ; Fri, 16 Mar 2018 05:24:31 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from mxrelay.ysv.freebsd.org (mxrelay.ysv.freebsd.org [IPv6:2001:1900:2254:206a::19:3]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "mxrelay.ysv.freebsd.org", Issuer "Let's Encrypt Authority X3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 8AB986C56D for ; Fri, 16 Mar 2018 05:24:30 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mxrelay.ysv.freebsd.org (Postfix) with ESMTPS id 291691B207 for ; Fri, 16 Mar 2018 05:24:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id w2G5OS6c026937 for ; Fri, 16 Mar 2018 05:24:28 GMT (envelope-from bugzilla-noreply@freebsd.org) Received: (from bugzilla@localhost) by kenobi.freebsd.org (8.15.2/8.15.2/Submit) id w2G5OSUR026936 for freebsd-bugs@FreeBSD.org; Fri, 16 Mar 2018 05:24:28 GMT (envelope-from bugzilla-noreply@freebsd.org) X-Authentication-Warning: kenobi.freebsd.org: bugzilla set sender to bugzilla-noreply@freebsd.org using -f From: bugzilla-noreply@freebsd.org To: freebsd-bugs@FreeBSD.org Subject: [Bug 211713] NVME controller failure: resetting (Samsung SM961 SSD Drives) Date: Fri, 16 Mar 2018 05:24:29 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 10.3-RELEASE X-Bugzilla-Keywords: needs-qa X-Bugzilla-Severity: Affects Many People X-Bugzilla-Who: commit-hook@freebsd.org X-Bugzilla-Status: Open X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org X-Bugzilla-Flags: mfc-stable10? mfc-stable11? X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 16 Mar 2018 05:24:31 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D211713 --- Comment #50 from commit-hook@freebsd.org --- A commit references this bug: Author: imp Date: Fri Mar 16 05:23:49 UTC 2018 New revision: 331046 URL: https://svnweb.freebsd.org/changeset/base/331046 Log: Try polling the qpairs on timeout. On some systems, we're getting timeouts when we use multiple queues on drives that work perfectly well on other systems. On a hunch, Jim Harris suggested I poll the completion queue when we get a timeout. This patch polls the completion queue if no fatal status was indicated. If it had pending I/O, we complete that request and return. Otherwise, if aborts are enabled and no fatal status, we abort the command and return. Otherwise we reset the card. This may clear up the problem, or we may see it result in lots of timeouts and a performance problem. Either way, we'll know the next step. We may also need to pay attention to the fatal status bit of the controller. PR: 211713 Suggested by: Jim Harris Sponsored by: Netflix Changes: head/sys/dev/nvme/nvme_private.h head/sys/dev/nvme/nvme_qpair.c --=20 You are receiving this mail because: You are the assignee for the bug.=