From: Andreas Kempe
To: freebsd-net@freebsd.org, freebsd-infiniband@freebsd.org
Date: Sat, 11 Jan 2020 08:27:18 +0100
Subject: [PATCH] ipoib: Patch for crash in icmp_error, fault trap 12

Hello
everyone,

We have been using IP over InfiniBand (IPoIB) in connected mode between a machine running Void Linux and another machine running FreeBSD 12.1-STABLE. After initially transferring data at the expected speed, about 5 Gbit/s, and then leaving the computers idle for a while, the FreeBSD machine starts reporting transmission timeouts. When a new data transfer is started, the machine complains that a few packets are too large to send, and shortly afterwards the kernel panics. Example logs:

Timing out:

> ib0: timing out; 7 sends not completed

When starting new transfers:

> ib0: packet len 32812 (> 2044) too long to send, dropping
> ib0: packet len 8248 (> 2044) too long to send, dropping

Kernel crash:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 3; apic id = 03
> fault virtual address   = 0x28
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80d76edf
> stack pointer           = 0x28:0xfffffe008edbeb50
> frame pointer           = 0x28:0xfffffe008edbebb0
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 0 (ipoib)
> trap number             = 12
> panic: page fault
> cpuid = 3
> time = 1578710936
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008edbe7b0
> vpanic() at vpanic+0x17e/frame 0xfffffe008edbe810
> panic() at panic+0x43/frame 0xfffffe008edbe870
> trap_pfault() at trap_pfault/frame 0xfffffe008edbe8e0
> trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008edbe950
> trap() at trap+0x288/frame 0xfffffe008edbea80
> calltrap() at calltrap+0x8/frame 0xfffffe008edbea80
> --- trap 0xc, rip = 0xffffffff80d76edf, rsp = 0xfffffe008edbeb50, rbp = 0xfffffe008edbebb0 ---
> icmp_error() at icmp_error+0x2f/frame 0xfffffe008edbebb0
> ipoib_cm_mb_reap() at ipoib_cm_mb_reap+0x154/frame 0xfffffe008edbec00
> linux_work_fn() at linux_work_fn+0xfc/frame 0xfffffe008edbec60
> taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe008edbecc0
> taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe008edbecf0
> fork_exit() at fork_exit+0x7e/frame 0xfffffe008edbed30
> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe008edbed30
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> KDB: enter: panic

The read of address 0x28 that causes the trap comes from the error statistics update at the top of icmp_error in sys/netinet/ip_icmp.c:

> if (type != ICMP_REDIRECT)
>         ICMPSTAT_INC(icps_error);

ICMPSTAT_INC updates a per-VNET counter, so it needs the VIMAGE (VNET) context of the current thread to be set. Its caller, ipoib_cm_mb_reap in sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c, runs on a separate work queue thread (note the linux_work_fn and taskqueue frames in the backtrace), and that thread has no VNET set. It is reached as follows: when a packet is too large for the MTU, ipoib_cm_send calls ipoib_cm_mb_too_long, which in turn schedules ipoib_cm_mb_reap (both functions are also in ipoib_cm.c).

The attached patch fixes the issue by setting the VNET for the thread in ipoib_cm_mb_reap. We have not yet investigated why the packets are considered too large for the MTU (2044 bytes in the log above), but our machine stopped crashing after applying the patch.
Cordially,
Andreas Kempe

Attachment: set-vnet-in-ipoib_cm_mb_reap.patch

Index: sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c	(revision 356611)
+++ sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c	(working copy)
@@ -1265,6 +1265,8 @@
 
 	spin_lock_irqsave(&priv->lock, flags);
 
+	CURVNET_SET_QUIET(priv->dev->if_vnet);
+
 	for (;;) {
 		IF_DEQUEUE(&priv->cm.mb_queue, mb);
 		if (mb == NULL)
@@ -1291,6 +1293,8 @@
 		spin_lock_irqsave(&priv->lock, flags);
 	}
 
+	CURVNET_RESTORE();
+
 	spin_unlock_irqrestore(&priv->lock, flags);
 }
 

From: Hans Petter Selasky
To: Andreas Kempe, freebsd-net@freebsd.org, freebsd-infiniband@freebsd.org
Date: Sat, 11 Jan 2020 13:03:18 +0100
Subject: Re: [PATCH] ipoib: Patch for crash in icmp_error, fault trap 12

Thank you for your patch:
https://svnweb.freebsd.org/changeset/base/356633

--HPS