Date: Sun, 30 Apr 2023 18:47:13 -0500
From: "Jason A. Harmening"
To: Konstantin Belousov
Cc: Dimitry Andric, src-committers@freebsd.org,
 dev-commits-src-all@freebsd.org, dev-commits-src-branches@freebsd.org
Subject: Re: git: 060699e91369 - stable/13 - Merge llvm-project
 release/15.x llvmorg-15.0.7-0-g8dfdcc7b7bf6
References: <202304092135.339LZMeJ081640@gitrepo.freebsd.org>
 <76DD2CB9-986B-4349-8F46-3B7BF63EB315@FreeBSD.org>
List-Id: Commit messages for all branches of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all

On Sun, Apr 30, 2023 at 08:09:16AM +0300, Konstantin Belousov wrote:
> On Sat, Apr 29, 2023 at 02:27:50PM -0500, Jason A. Harmening wrote:
> > On Sat, Apr 29, 2023 at 08:49:28PM +0200, Dimitry Andric wrote:
> > > On 29 Apr 2023, at 20:33, Jason A. Harmening wrote:
> > > >
> > > > On Sun, Apr 09, 2023 at 09:35:22PM +0000, Dimitry Andric wrote:
> > > >> The branch stable/13 has been updated by dim:
> > > >>
> > > >> URL: https://cgit.FreeBSD.org/src/commit/?id=060699e9136975d51d3f726b9785bdbac9a62ba6
> > > >>
> > > >> commit 060699e9136975d51d3f726b9785bdbac9a62ba6
> > > >> Author: Dimitry Andric
> > > >> AuthorDate: 2023-01-14 16:33:24 +0000
> > > >> Commit: Dimitry Andric
> > > >> CommitDate: 2023-04-09 14:54:52 +0000
> > > >>
> > > >> Merge llvm-project release/15.x llvmorg-15.0.7-0-g8dfdcc7b7bf6
> > > >>
> > > >> This updates llvm, clang, compiler-rt, libc++, libunwind, lld, lldb and
> > > >> openmp to llvmorg-15.0.7-0-g8dfdcc7b7bf6.
> > > >>
> > > >> PR: 265425
> > > >> MFC after: 2 weeks
> > > >
> > > > This MFC of llvm15 appears to have completely broken the Intel IOMMU
> > > > driver on my stable/13 machine.
> > > > After this series of commits, any
> > > > downstream DMA seems to produce an IOMMU translation fault, which
> > > > renders the machine completely unusable: no nvme boot disk, no usb
> > > > keyboard, etc.
> > > >
> > > > The faults I see look something like this:
> > > >
> > > > DMAR4: ahci0: pci0:17:5 sid 8d fault acc 0 adt 0x0 reason 0x3 addr 26000
> > > >
> > > > It's a bit surprising to see a toolchain upgrade produce breakage like
> > > > this, but that's what git bisect clearly tells me. I wonder if some of
> > > > the IOMMU control structures might be defined as C bitfields and the new
> > > > compiler is emitting them differently? Also, was any breakage like this
> > > > observed when -current was upgraded to llvm15 several months ago?
> > >
> > > I haven't heard anything about such breakage, no.
> > >
> > >
> > > > More generally, this is the second time in as many months I've had to
> > > > deal with IOMMU breakage on -stable. I can't imagine I'm the only
> > > > person who sees value in running with DMA remapping enabled; do we need
> > > > a dedicated DMAR-enabled machine in the cluster to smoke-test changes
> > > > like this? More generally, should we avoid MFCing high-risk changes
> > > > like this?
> > >
> > > Since there were very few bug reports, it was not deemed high risk.
> > >
> > > In any case, it would be good to get the bottom of what is causing the
> > > problem, so is there any way you can isolate which code seems to be
> > > going "bad"?
> > >
> > > For example, if this problem affects code in sys/dev/iommu, is there
> > > some way you can compile that part with -O1, or with an older version
> > > of clang (from ports), to see if the problem goes away?
> >
> > I did try removing all custom make.conf settings (previously I just had
> > CPUTYPE?=icelake-server), but that didn't change the behavior.
> >
> > Before I try further build tweaks, I'd like to ask if the IOMMU fault
> > report can provide guidance here?
> > AFAICT all the faults I'm getting
> > show "reason 0x3". If I'm reading the VT-d spec correctly, FR=0x3
> > indicates an invalid context entry, in other words there's something the
> > hardware doesn't like in the way the address width or pagetable base is
> > configured for the PCIe requestor.
>
> I would start looking at the other direction: might be, there are still some
> left shifts for int32 values with the shift count > 30, or uint32 with the
> count > 31.
>
> Also might be useful to dump each context entry on creation, it is kept
> constant after.

I did look over the constants in intel_reg.h, and didn't see anything
that looked as though it would be susceptible to sign-extension or
truncation bugs.

In the failing case it's much easier for me to catch the fault messages
than any initialization message, so I instrumented the fault handler to
get the context entry from the dmar_ctx object using the same logic as
dmar_map_ctx_entry(), and then dump out the ctx1 and ctx2 fields.

What I see are messages like:

... ctx1 0x10013b001 ctx2 0x103

At first glance these "look right": the P bit is set in ctx1, and the
rest of the field looks like a valid physical address. ctx2 also
doesn't have any of the reserved bits set, but in all cases it does have
AW=3, which would indicate 57-bit AGAW.

But when I boot the last working kernel, from the revision prior to the
llvm15 MFC, I see this in dmesg:

ahci0: dmar4 pci0:0:17:5 rid 8d domain 1 mgaw 48 agaw 48 re-mapped

...all reported devices show 48-bit MGAW/AGAW, so I would expect ctx2 to
have AW=2. I suspect this may be the source of the fault, but I'm not
sure how it's getting configured that way, whether it's an issue with
reading the capability register or something else.
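
For reference, the reading of the two dumped words above can be checked
with a small standalone sketch. It assumes the legacy-mode context-entry
layout as I understand it from the VT-d spec (P in bit 0 and the
page-table root in bits 63:12 of the low word; AW in bits 2:0 and the
domain id in bits 23:8 of the high word). The constant and function
names are made up for illustration and are not the identifiers used in
intel_reg.h, and only the three AW encodings relevant here are mapped.

```python
# Hypothetical names; layout assumed from the VT-d legacy context entry.
CTX1_P = 1 << 0                      # present bit
CTX1_PTROOT_MASK = ((1 << 64) - 1) & ~0xFFF  # page-table root, bits 63:12
CTX2_AW_MASK = 0x7                   # address width, bits 2:0
CTX2_DID_SHIFT = 8                   # domain id, bits 23:8
CTX2_DID_MASK = 0xFFFF

# AW encoding -> adjusted guest address width in bits.
# Other encodings are deliberately left out of this sketch.
AW_TO_AGAW = {1: 39, 2: 48, 3: 57}


def decode_ctx(ctx1, ctx2):
    """Split the low (ctx1) and high (ctx2) context-entry words into fields."""
    aw = ctx2 & CTX2_AW_MASK
    return {
        "present": bool(ctx1 & CTX1_P),
        "pt_root": ctx1 & CTX1_PTROOT_MASK,
        "aw": aw,
        "agaw_bits": AW_TO_AGAW.get(aw),  # None if not in the table above
        "domain": (ctx2 >> CTX2_DID_SHIFT) & CTX2_DID_MASK,
    }


if __name__ == "__main__":
    fields = decode_ctx(0x10013B001, 0x103)
    print({k: (hex(v) if k == "pt_root" else v) for k, v in fields.items()})
```

With the values from the fault handler this gives AW=3 (57-bit) and
domain 1, while a ctx2 of 0x102 would give the 48-bit AGAW that the
working kernel reports in dmesg.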