From nobody Mon May 1 00:34:45 2023 X-Original-To: dev-commits-src-all@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4Q8kkJ2Dr8z48WGD; Mon, 1 May 2023 00:34:48 +0000 (UTC) (envelope-from jah@freebsd.org) Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4Q8kkJ1TJJz4YFZ; Mon, 1 May 2023 00:34:48 +0000 (UTC) (envelope-from jah@freebsd.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1682901288; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=GgMCcZk5NtZQ2CmLkzaUejgO3Q+ZNczDvWWo6x0o++4=; b=jwidpCzPOh7nHQytADhpTSGxqqFsQz5I+dDsExAUIiVeizrH9WQJV40xEtcgBpzu6uRmxl NjkgY0wu6qgsMlYP5QGtUlta7ohMMprYZAS2oyZfS1U+VJgOEFV2Uhbx5jRjw8X4dEvc1Z 6zodjDgEITlyZwO6lsNKhg4C+5yH+/x5XxKWD5BL3FEcRJYuTV3r2HAjH9dN4NomCSbX57 n5IG4zf3lzf1uxgeKEqO62hQIVQB1BF4BSs0EtLHuPk4sgwi6QZ5AjJrC+h557CNVVV6JW sVQU+yj3EyK2v1DuOrnvrUGJMLGD5npNLhbgwTAp9W/tRN1SIoZLe0Mj1qOnBw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1682901288; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=GgMCcZk5NtZQ2CmLkzaUejgO3Q+ZNczDvWWo6x0o++4=; b=ZkBs/npCsc/pXT1qUwaYutZXrdLJ19BYDrjkDpoW9RouuudDKb5lf4z6oSdqwnAUIhKOfh WKVHMcDk14tIeZgYJAa9Rljrogfv54S8ZxPBGa9memZs9UopzB+7O63Wy4SQaoNU8dqSLh Er+O4WX1iA08BAq/iW/Y58RjSpqhsPUQq7ctCU06RvNSwWWJxvx/3OJIA9G92IFSSQ/u5X OAFxcPq6JPm3yQmmMFU+uI6YsygxJPJLhVjO+FaNmQHUDUeIaCB5iWm3/MTaFXuflF3GD7 mQGBj3Qg6jH5A52wMvvfDx5aQVbkpTyiPG0TyGSPd80BKjGKPUm+EvBhMS/fjQ== ARC-Authentication-Results: i=1; mx1.freebsd.org; none ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1682901288; a=rsa-sha256; cv=none; b=uIM9S/PVkEzrBjs0XYnP2vijpiY69+0aR/Deo/6jWGbaBIVqGb9wrJ5rw56wfFSyuRcG3h 9WkORaUUs7Pw74hr+ZWhAmjNA3ITh7ZfCLdxCikmpvq74D2IkofnkuDbni6ErkIygPgzVc uXShPTDh04qUniCyoAk/Tq/Ss+U4AJk8VtkQ5IQA2VzPkowpN2vFDY6SFjiXCShAx3/uDJ x6/UlwV9FDlwa5FgUJhmtwcB0qAwyQHXQrp5ZUkO6a9pHvz9+3CunpSWctw3wowyANFvbm X1p25ypNrIiZsLU89+VlvzB81Pza6uVlILjXOmlTOqrQUyu0fAoASW+oObGMRA== Received: from corona (047-232-115-243.res.spectrum.com [47.232.115.243]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) (Authenticated sender: jah) by smtp.freebsd.org (Postfix) with ESMTPSA id 4Q8kkH4zppzVVN; Mon, 1 May 2023 00:34:47 +0000 (UTC) (envelope-from jah@freebsd.org) Date: Sun, 30 Apr 2023 19:34:45 -0500 From: "Jason A. Harmening" To: Konstantin Belousov Cc: Dimitry Andric , src-committers@freebsd.org, dev-commits-src-all@freebsd.org, dev-commits-src-branches@freebsd.org Subject: Re: git: 060699e91369 - stable/13 - Merge llvm-project release/15.x llvmorg-15.0.7-0-g8dfdcc7b7bf6 Message-ID: References: <202304092135.339LZMeJ081640@gitrepo.freebsd.org> <76DD2CB9-986B-4349-8F46-3B7BF63EB315@FreeBSD.org> List-Id: Commit messages for all branches of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-dev-commits-src-all@freebsd.org X-BeenThere: dev-commits-src-all@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-ThisMailContainsUnwantedMimeParts: N On Sun, Apr 30, 2023 at 06:47:13PM -0500, Jason A. Harmening wrote: > On Sun, Apr 30, 2023 at 08:09:16AM +0300, Konstantin Belousov wrote: > > On Sat, Apr 29, 2023 at 02:27:50PM -0500, Jason A. Harmening wrote: > > > On Sat, Apr 29, 2023 at 08:49:28PM +0200, Dimitry Andric wrote: > > > > On 29 Apr 2023, at 20:33, Jason A. Harmening wrote: > > > > > > > > > > On Sun, Apr 09, 2023 at 09:35:22PM +0000, Dimitry Andric wrote: > > > > >> The branch stable/13 has been updated by dim: > > > > >> > > > > >> URL: https://cgit.FreeBSD.org/src/commit/?id=060699e9136975d51d3f726b9785bdbac9a62ba6 > > > > >> > > > > >> commit 060699e9136975d51d3f726b9785bdbac9a62ba6 > > > > >> Author: Dimitry Andric > > > > >> AuthorDate: 2023-01-14 16:33:24 +0000 > > > > >> Commit: Dimitry Andric > > > > >> CommitDate: 2023-04-09 14:54:52 +0000 > > > > >> > > > > >> Merge llvm-project release/15.x llvmorg-15.0.7-0-g8dfdcc7b7bf6 > > > > >> > > > > >> This updates llvm, clang, compiler-rt, libc++, libunwind, lld, lldb and > > > > >> openmp to llvmorg-15.0.7-0-g8dfdcc7b7bf6. > > > > >> > > > > >> PR: 265425 > > > > >> MFC after: 2 weeks > > > > > > > > > > This MFC of llvm15 appears to have completely broken the Intel IOMMU > > > > > driver on my stable/13 machine. After this series of commits, any > > > > > downstream DMA seems to produce an IOMMU translation fault, which > > > > > renders the machine completely unusable: no nvme boot disk, no usb > > > > > keyboard, etc. > > > > > > > > > > The faults I see look something like this: > > > > > > > > > > DMAR4: ahci0: pci0:17:5 sid 8d fault acc 0 adt 0x0 reason 0x3 addr 26000 > > > > > > > > > > It's a bit surprising to see a toolchain upgrade produce breakage like > > > > > this, but that's what git bisect clearly tells me. I wonder if some of > > > > > the IOMMU control structures might be defined as C bitfields and the new > > > > > compiler is emitting them differently? Also, was any breakage like this > > > > > observed when -current was upgraded to llvm15 several months ago? > > > > > > > > I haven't heard anything about such breakage, no. > > > > > > > > > > > > > More generally, this is the second time in as many months I've had to > > > > > deal with IOMMU breakage on -stable. I can't imagine I'm the only > > > > > person who sees value in running with DMA remapping enabled; do we need > > > > > a dedicated DMAR-enabled machine in the cluster to smoke-test changes > > > > > like this? More generally, should we avoid MFCing high-risk changes > > > > > like this? > > > > > > > > Since there were very few bug reports, it was not deemed high risk. > > > > > > > > In any case, it would be good to get the bottom of what is causing the > > > > problem, so is there any way you can isolate which code seems to be > > > > going "bad"? > > > > > > > > For example, if this problem affects code in sys/dev/iommu, is there > > > > some way you can compile that part with -O1, or with an older version > > > > of clang (from ports), to see if the problem goes away? > > > > > > I did try removing all custom make.conf settings (previously I just had > > > CPUTYPE?=icelake-server), but that didn't change the behavior. > > > > > > Before I try further build tweaks, I'd like to ask if the IOMMU fault > > > report can provide guidance here? AFAICT all the faults I'm getting > > > show "reason 0x3". If I'm reading the VT-d spec correctly, FR=0x3 > > > indicates an invalid context entry, in other words there's something the > > > hardware doesn't like in the way the address width or pagetable base is > > > configured for the PCIe requestor. > > > > I would start looking at the other direction: might be, there are still some > > left shifts for int32 values with the shift count > 30, or uint32 with the > > count > 31. > > > > Also might be useful to dump each context entry on creation, it is kept > > constant after. > > I did look over the constants in intel_reg.h, and didn't see anything > that looked as though it would be susceptible to sign-extension or > truncation bugs. In the failing case it's much easier for me to catch > the fault messages than any initialization message, so I instrumented > the fault handler to get the context entry from the dmar_ctx object > using the same logic as dmar_map_ctx_entry(), and then dump out the ctx1 > and ctx2 fields. What I see are messages like: > > ... ctx1 0x10013b001 ctx2 0x103 > > At first glance these "look right": the P bit is set in ctx1, and the > rest of the field looks like a valid physical address. ctx2 also > doesn't have any of the reserved bits set, but in all cases it does have > AW=3, which would indicate 57-bit AGAW. But when I boot the last > working kernel, from the revision prior to the llvm15 MFC, I see this in > dmesg: > > ahci0: dmar4 pci0:0:17:5 rid 8d domain 1 mgaw 48 agaw 48 re-mapped > > ...all reported devices show 48-bit MGAW/AGAW, so I would expect ctx2 to > have AW=2. I suspect this may be the source of the fault, but I'm not > sure how it's getting configured that way, whether it's an issue with > reading the capability register or something else. > I can confirm that hacking domain_set_agaw() to always use the settings from sagaw_bits[2] eliminates the faults and at least allows the machine to boot to single-user mode.