Date: Tue, 13 Feb 2024 05:12:51 -0800 (PST)
From: Don Lewis
Subject: Re: nvme controller reset failures on recent -CURRENT
To: Warner Losh
cc: Maxim Sobolev, FreeBSD current, John Baldwin
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On 12 Feb, Warner Losh wrote:
> On Mon, Feb 12, 2024 at 9:15 PM Don Lewis wrote:
>
>> On 12 Feb, Maxim Sobolev wrote:
>> >
>> > Might be overheating. Today's nvme drives are notoriously flaky if you
>> > run them without a proper heat sink attached to them.
>>
>> I don't think it is a thermal problem.  According to the drive health
>> page, the device temperature has never reached Temperature 2, whatever
>> that is.  The room temperature is around 65F.  The system was stable
>> last summer when the room temperature spent a lot of time in the 80-85F
>> range.  The device temperature depends a lot on the I/O rate, and the
>> last panic happened when the I/O rate had been below 40tps for quite a
>> while.
>>
>
> It did reach temperature 1, though. That's the 'Warning this drive is too
> hot' temperature. It has spent 41213 minutes of your 19297 hours of up
> time, or an average of 2 minutes per hour. That's too much. Temperature
> 2 is critical error: we are about to shut down completely due to it
> being too hot. It's only a couple degrees below hardware power off
> due to temperature in many drives. Some really cheap ones don't really
> implement it at all. On my card with the bad heat sink, Warning temp is
> 70C while critical is 75C while IIRC thermal shutdown is 78C or 80C.
>
> I don't think we report these values in nvmecontrol identify. But you can
> do a raw dump with -x and look at bytes 266:267 for warning and 268:269
> for critical.
>
> In contrast, I have a few dozen drives, all of which have been
> abused in various ways, and only one of them has any heat issues,
> and that one is an engineering special / sample with what I think is
> a damaged heat sink. If your card has no heat sink, this could well
> be what's going on.
>
> This panic means "the nvme card lost its mind and stopped talking
> to the host". Its status registers read 0xff's, which means that the card
> isn't decoding bus signals. Usually this means that the firmware on the
> card has faulted and rebooted. If the card is overheating, then this could
> well be what's happening.
>
> There's a tiny chance that this could be something more exotic,
> but my money is on hardware gone bad after 2 years of service. I don't think
> this is 'wear out' of the NAND (it's only 15TB written, but it could be if
> this drive is really really crappy nand: first generation QLC maybe, but it
> seems too new). It might also be a connector problem that's developed over
> time. There might be a few other things too, but I don't think this is a
> U.2 drive with funky cables.

The system was probably idle the majority of those two years of power on
time.

It's one of these:
https://www.techpowerup.com/ssd-specs/intel-660p-512-gb.d437

I've seen comments that these generally don't need cooling.

I just ordered a heatsink with some nice big fins, but it will take a
week or more to arrive.

>
>> > On Mon, Feb 12, 2024, 4:28 PM Don Lewis wrote:
>> >
>> >> I just upgraded my package build machine to:
>> >>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
>> >> from:
>> >>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
>> >> and I've had two nvme-triggered panics in the last day.
>> >>
>> >> nvme is being used for swap and L2ARC.  I'm not able to get a crash
>> >> dump, probably because the nvme device has gone away and I get an error
>> >> about not having a dump device.  It looks like a low-memory panic
>> >> because free memory is low and zfs is calling malloc().
>> >>
>> >> This shows up in the log leading up to the panic:
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>> >>
>> >> The device looks healthy to me:
>> >> SMART/Health Information Log
>> >> ============================
>> >> Critical Warning State:         0x00
>> >>  Available spare:               0
>> >>  Temperature:                   0
>> >>  Device reliability:            0
>> >>  Read only:                     0
>> >>  Volatile memory backup:        0
>> >> Temperature:                    312 K, 38.85 C, 101.93 F
>> >> Available spare:                100
>> >> Available spare threshold:      10
>> >> Percentage used:                3
>> >> Data units (512,000 byte) read: 5761183
>> >> Data units written:             29911502
>> >> Host read commands:             471921188
>> >> Host write commands:            605394753
>> >> Controller busy time (minutes): 32359
>> >> Power cycles:                   110
>> >> Power on hours:                 19297
>> >> Unsafe shutdowns:               14
>> >> Media errors:                   0
>> >> No. error info log entries:     0
>> >> Warning Temp Composite Time:    0
>> >> Error Temp Composite Time:      0
>> >> Temperature 1 Transition Count: 5231
>> >> Temperature 2 Transition Count: 0
>> >> Total Time For Temperature 1:   41213
>> >> Total Time For Temperature 2:   0
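
The figures discussed in this thread can be cross-checked with a short
script. What follows is a minimal Python sketch, not something posted in
the original exchange. It recomputes the time spent above Temperature 1
per power-on hour, the total data written, and the Kelvin conversion from
the SMART log quoted above, and it decodes the warning and critical
thresholds from bytes 266:267 and 268:269 of a raw identify dump. It
assumes those two fields are little-endian 16-bit values in kelvin, as in
the NVMe Identify Controller layout. The function names and the synthetic
buffer are hypothetical, and the threshold values in that buffer are made
up for illustration rather than read from any drive.

```python
import struct

# Values copied from the SMART/Health log quoted in this thread.
POWER_ON_HOURS = 19297
TEMP1_TOTAL_MINUTES = 41213
DATA_UNITS_WRITTEN = 29911502
DATA_UNIT_BYTES = 512_000  # unit size as printed in the log


def kelvin_to_celsius(kelvin):
    """Convert a temperature in kelvin to degrees Celsius."""
    return kelvin - 273.15


def kelvin_to_fahrenheit(kelvin):
    """Convert a temperature in kelvin to degrees Fahrenheit."""
    return kelvin_to_celsius(kelvin) * 9 / 5 + 32


def decode_temp_thresholds(identify):
    """Return (warning_K, critical_K) from raw identify controller bytes.

    Assumption: bytes 266:267 and 268:269 hold little-endian uint16
    thresholds in kelvin.
    """
    if len(identify) < 270:
        raise ValueError("need at least 270 bytes of identify data")
    return struct.unpack_from("<HH", identify, 266)


# Average minutes above Temperature 1 per hour of power-on time.
minutes_per_hour = TEMP1_TOTAL_MINUTES / POWER_ON_HOURS
# Total data written, in decimal terabytes.
terabytes_written = DATA_UNITS_WRITTEN * DATA_UNIT_BYTES / 1e12

if __name__ == "__main__":
    print(f"Temperature 1 time: {minutes_per_hour:.2f} min per power-on hour")
    print(f"Data written: {terabytes_written:.1f} TB")
    print(f"312 K = {kelvin_to_celsius(312):.2f} C, "
          f"{kelvin_to_fahrenheit(312):.2f} F")

    # Synthetic identify buffer with made-up thresholds (hypothetical).
    fake_identify = bytearray(4096)
    struct.pack_into("<HH", fake_identify, 266, 343, 348)
    warn_k, crit_k = decode_temp_thresholds(bytes(fake_identify))
    print(f"warning {kelvin_to_celsius(warn_k):.2f} C, "
          f"critical {kelvin_to_celsius(crit_k):.2f} C")
```

To use it against a real drive, the bytes of the raw identify dump would
have to be collected into a bytes object first. That parsing step is left
out here because the exact dump format is not shown in the thread.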