From: mike tancsa
To: FreeBSD-STABLE Mailing List
Date: Mon, 29 Jan 2024 10:45:01 -0500
Subject: Re: tracking down i386 mount issue between Aug 2023 and now-- RELENG_13
Message-ID: <8e819103-08f5-4f8b-a9f7-d0a872e256f5@sentex.net>
In-Reply-To: <1c266591-11b0-4f75-addf-6a02469441a8@sentex.net>
List-Id: Production branch of FreeBSD source code
List-Archive: https://lists.freebsd.org/archives/freebsd-stable

Still trying to track this issue down. It's not just one partition; often
the entire disk I/O locks up with processes stuck. The CF comes up as ada0,
and I don't see any commits that have touched that. The box is a single
Geode CPU, but I tried both SMP and UP kernels and it still seems to
happen. If I play with rtprio on some processes, that *seems* to trigger
the issue more often. I did try a RELENG_14 image on a couple of test
boxes, and so far those seem to have survived the weekend without lockups.
It doesn't seem to be memory pressure, as available RAM holds steady from
bootup to lockup.

    ---Mike

On 1/16/2024 9:48 AM, mike tancsa wrote:
> Not sure exactly where to start, but I noticed this recently on an
> i386 nanobsd image running on old PC Engines Alix devices that had
> been rock solid for years. We have a few dozen in the field running
> with RELENG_13 from Aug that have been very stable with STABLE over
> the years.  However, somewhere between Aug 2023 and now I am getting
> some lock ups that are difficult to diagnose as the devices are
> remote.
> I did manage to find one odd thing on a local test unit where
> a remount of a backup partition is hung.
>
> # ps -auxwwwwp 3443
> USER  PID %CPU %MEM  VSZ  RSS TT  STAT STARTED     TIME COMMAND
> root 3443  3.3  0.9 4708 2320  -  D<   20:18   34:55.20 /sbin/mount -ur /dev/ada0s4 /logs
>
> I dont have truss on the box to attach to the process and ktrace
> doesnt seem to show anything either.  Does this sort of hang ring a
> bell for anyone ? Looking back at the git logs, a coarse search for
> anything to do with mount, doesnt come up with much (2 below).   Also
> since then a new version of clang so not quite where to start.
>
> Any guidance appreciated. Testing is difficult as the hang doesnt
> always happen -- sometimes within a day, sometimes 5 days.  ssh is
> usually borked as well as some processes.  I have a scaled down
> telegraf agent collecting some basic stats, and the cpu is pegged at
> 100%. These are single core devices so not sure what is pegging the
> CPU.  RAM still shows some available so it doesnt seem to be memory
> pressures.
>
>
> commit 71fceff2480999b3fc921f47ec9adea9eff32041
> Author: Andrew Gierth
> Date:   Sun Dec 24 14:04:21 2023 +0200
>
>     vfs_domount_update(): correct fsidcmp() usage
>
>     (cherry picked from commit 2a1d50fc12f6e604da834fbaea961d412aae6e85)
>
> and
>
> commit 608ccfc29fb48d8edc59a97382936790c02d27f3
> Author: Konstantin Belousov
> Date:   Thu Nov 9 22:18:47 2023 +0200
>
>     vfs_domount_update(): ensure that 'goto end' works
>
>     PR:     274992
>
>     (cherry picked from commit ede4c412b3ea9289ef42c664b01b6b5ff7eac434)
>