From: Yamagi
Date: Wed, 17 Mar 2021 14:33:07 +0100
To: freebsd-current@freebsd.org
Cc: freebsd-stable@freebsd.org
Subject: 13.0-RC2 / 14-CURRENT: Processes getting stuck in vlruwk state

Hi,

some other users in the ##bsdforen.de IRC channel and I have the problem that processes get stuck in the 'vlruwk' state during poudriere runs.

For me it's fairly reproducible. The problems begin about 20 to 25 minutes after I've started poudriere. At first only some ccache processes hang in the 'vlruwk' state. After another 2 to 3 minutes nearly everything hangs and the total CPU load drops to about 5%. When I stop poudriere with Ctrl-C, it takes another 3 to 5 minutes until the system recovers.

First, the setup:

* poudriere runs in a bhyve VM on a zvol. The host runs 12.2-RELEASE-p2. The zvol has an 8k blocksize, and the guest's partitions are aligned to 8k. The guest has only one zpool, which was created with ashift=13. The VM has 16 E5-2620 vCPUs and 16 GB of RAM assigned to it.
* poudriere is configured with ccache and ALLOW_MAKE_JOBS=yes. Removing either of these options significantly lowers the probability that the problem shows up.

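As a quick sanity check of the alignment described above: ZFS's ashift is log2 of the pool's minimum block size, so ashift=13 means 2^13 = 8192 bytes, which matches the zvol's 8k blocksize. The Python sketch below checks that and whether a partition start offset is 8k-aligned. It assumes 512-byte logical sectors in the guest, and the start sectors used are hypothetical examples, not values from the affected VM.

```python
SECTOR_SIZE = 512  # assumption: the guest uses 512-byte logical sectors


def ashift_to_bytes(ashift: int) -> int:
    """ZFS ashift is log2 of the pool's minimum block size: 2**ashift bytes."""
    return 1 << ashift


def is_aligned(start_sector: int, block_bytes: int,
               sector_size: int = SECTOR_SIZE) -> bool:
    """True if a partition's starting byte offset is a multiple of block_bytes."""
    return (start_sector * sector_size) % block_bytes == 0


if __name__ == "__main__":
    volblocksize = 8 * 1024            # zvol blocksize on the host: 8k
    pool_block = ashift_to_bytes(13)   # guest pool created with ashift=13
    print(pool_block, pool_block == volblocksize)  # 8192 True
    for start in (40, 2048):           # hypothetical partition start sectors
        print(start, is_aligned(start, volblocksize))  # 40 False, 2048 True
```

A partition starting at sector 2048 (1 MiB) is a multiple of 8192 bytes, while one starting at sector 40 (20480 bytes) is not.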
I've tried several git revisions, starting with 14-CURRENT at 54ac6f721efccdba5a09aa9f38be0a1c4ef6cf14, hoping to find at least one known-good revision. No luck: even a kernel built from 0932ee9fa0d82b2998993b649f9fa4cc95ba77d6 (Wed Sep 2 19:18:27 2020 +0000) has the problem. The problem isn't reproducible with 12.2-RELEASE.

The kernel stack ('procstat -kk') of a hanging process is:

    mi_switch+0x155
    sleepq_switch+0x109
    sleepq_catch_signals+0x3f1
    sleepq_wait_sig+0x9
    _sleep+0x2aa
    kern_wait6+0x482
    sys_wait4+0x7d
    amd64_syscall+0x140
    fast_syscall_common+0xf8

The kernel stack of vnlru keeps changing, even while the processes are hanging:

* mi_switch+0x155
  sleepq_switch+0x109
  sleepq_timedwait+0x4b
  _sleep+0x29b
  vnlru_proc+0xa05
  fork_exit+0x80
  fork_trampoline+0xe
* fork_exit+0x80
  fork_trampoline+0xe

Since vnlru is accumulating CPU time, it looks like it's doing at least something. As an educated guess, I would say that vn_alloc_hard() is waiting a long time, or even forever, to allocate new vnodes.

I can provide more information; I just need to know what.

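For anyone who wants to collect the same data on their own machine, here is a rough Python sketch that lists processes whose wait channel is 'vlruwk', dumps their kernel stacks plus the stack of the vnlru kernel process with 'procstat -kk', and prints vnode usage. It assumes FreeBSD's ps(1) accepts '-o pid,wchan,comm', that the kernel process shows up under the command name vnlru (possibly in brackets), and that the vfs.numvnodes and kern.maxvnodes sysctls exist on the running kernel. The helper names are made up for this example, and it probably needs to run as root to read other processes' stacks.

```python
import subprocess
import sys
from typing import List, Tuple

Row = Tuple[int, str, str]  # (pid, wait channel, command name)


def parse_ps(output: str) -> List[Row]:
    """Parse `ps -ax -o pid,wchan,comm` output into (pid, wchan, comm) rows."""
    rows: List[Row] = []
    for line in output.splitlines():
        fields = line.split(None, 2)
        # Skip the header line and anything that does not start with a PID.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), fields[1], fields[2].strip()))
    return rows


def pids_in_wchan(rows: List[Row], wchan: str = "vlruwk") -> List[int]:
    """PIDs whose wait channel matches wchan (default: the vlruwk state)."""
    return [pid for pid, chan, _ in rows if chan == wchan]


def pids_named(rows: List[Row], name: str) -> List[int]:
    """PIDs whose command matches name; kernel processes may appear as [name]."""
    return [pid for pid, _, comm in rows if comm.strip("[]") == name]


def vnode_usage(numvnodes: int, maxvnodes: int) -> float:
    """Fraction of the vnode limit currently allocated."""
    if maxvnodes <= 0:
        raise ValueError("maxvnodes must be positive")
    return numvnodes / maxvnodes


def run(cmd: List[str]) -> str:
    # No check=True: a process may exit between `ps` and `procstat`.
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def read_sysctl(name: str) -> int:
    return int(run(["sysctl", "-n", name]).strip())


def main() -> int:
    rows = parse_ps(run(["ps", "-ax", "-o", "pid,wchan,comm"]))
    stuck = pids_in_wchan(rows)
    print(f"{len(stuck)} process(es) in vlruwk")
    # Kernel stacks of the stuck processes and of the vnlru kernel process.
    for pid in stuck + pids_named(rows, "vnlru"):
        print(f"--- procstat -kk {pid}")
        print(run(["procstat", "-kk", str(pid)]), end="")
    num = read_sysctl("vfs.numvnodes")
    cap = read_sysctl("kern.maxvnodes")
    print(f"vnodes in use: {num} of {cap} ({vnode_usage(num, cap):.1%})")
    return 0


# Only meaningful on FreeBSD; on other systems the script does nothing.
if __name__ == "__main__" and sys.platform.startswith("freebsd"):
    sys.exit(main())
```

If vfs.numvnodes sits at or near kern.maxvnodes while processes are stuck, that would be consistent with the guess that vn_alloc_hard() is waiting for vnodes, though it would not prove it.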
Regards,
Yamagi

-- 
Homepage: https://www.yamagi.org
Github: https://github.com/yamagi
GPG: 0x1D502515