Date: Sat, 11 Jan 2020 08:27:18 +0100
From: Andreas Kempe <kempe@lysator.liu.se>
To: freebsd-net@freebsd.org, freebsd-infiniband@freebsd.org
Subject: [PATCH] ipoib: Patch for crash in icmp_error, fault trap 12
Message-ID: <20200111072718.GA14718@moira.hest-guild.se>

Hello everyone,

We have been using IP over IB in connected mode between a Linux machine
running Void Linux and another machine running FreeBSD 12.1 STABLE. After
initially transferring data at the expected speed, about 5 Gbit/s, and then
leaving the machines idle for a while, the FreeBSD machine reports
transmission timeout errors. When a new data transfer is then started, it
complains that it cannot send a few packets because they are too large, and
after that the kernel panics. See the example logs below.

Timing out:

> ib0: timing out; 7 sends not completed

When starting new transfers:

> ib0: packet len 32812 (> 2044) too long to send, dropping
> ib0: packet len 8248 (> 2044) too long to send, dropping

Kernel crash:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 3; apic id = 03
> fault virtual address   = 0x28
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80d76edf
> stack pointer           = 0x28:0xfffffe008edbeb50
> frame pointer           = 0x28:0xfffffe008edbebb0
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 0 (ipoib)
> trap number             = 12
> panic: page fault
> cpuid = 3
> time = 1578710936
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008edbe7b0
> vpanic() at vpanic+0x17e/frame 0xfffffe008edbe810
> panic() at panic+0x43/frame 0xfffffe008edbe870
> trap_pfault() at trap_pfault/frame 0xfffffe008edbe8e0
> trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008edbe950
> trap() at trap+0x288/frame 0xfffffe008edbea80
> calltrap() at calltrap+0x8/frame 0xfffffe008edbea80
> --- trap 0xc, rip = 0xffffffff80d76edf, rsp = 0xfffffe008edbeb50, rbp = 0xfffffe008edbebb0 ---
> icmp_error() at icmp_error+0x2f/frame 0xfffffe008edbebb0
> ipoib_cm_mb_reap() at ipoib_cm_mb_reap+0x154/frame 0xfffffe008edbec00
> linux_work_fn() at linux_work_fn+0xfc/frame 0xfffffe008edbec60
> taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe008edbecc0
> taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe008edbecf0
> fork_exit() at fork_exit+0x7e/frame 0xfffffe008edbed30
> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe008edbed30
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> KDB: enter: panic

The access to address 0x28 that triggers the trap comes from the error
statistics if statement at the top of icmp_error in sys/netinet/ip_icmp.c:

> if (type != ICMP_REDIRECT)
>         ICMPSTAT_INC(icps_error);

ICMPSTAT_INC needs the VIMAGE for the current thread to be set. Its caller,
ipoib_cm_mb_reap, runs in its own thread: when ipoib_cm_send finds that a
packet exceeds the MTU, it calls ipoib_cm_mb_too_long, which in turn
schedules ipoib_cm_mb_reap. Both ipoib_cm_mb_too_long and ipoib_cm_mb_reap
are located in sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c.

The attached patch fixes the issue by setting the VIMAGE for the thread in
ipoib_cm_mb_reap. We have not yet investigated why the packets are
considered too large for the MTU, but our machine stopped crashing after
applying the patch.

Cordially,
Andreas Kempe

Attachment: set-vnet-in-ipoib_cm_mb_reap.patch

Index: sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c	(revision 356611)
+++ sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c	(working copy)
@@ -1265,6 +1265,8 @@
 
 	spin_lock_irqsave(&priv->lock, flags);
 
+	CURVNET_SET_QUIET(priv->dev->if_vnet);
+
 	for (;;) {
 		IF_DEQUEUE(&priv->cm.mb_queue, mb);
 		if (mb == NULL)
@@ -1291,6 +1293,8 @@
 		spin_lock_irqsave(&priv->lock, flags);
 	}
 
+	CURVNET_RESTORE();
+
 	spin_unlock_irqrestore(&priv->lock, flags);
 }
 
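
For readers unfamiliar with VIMAGE, the failure mode described in this
message can be illustrated with a small user-space sketch in C. This is not
kernel code and not the patch itself. Every name in it (struct model_vnet,
model_curvnet, MODEL_CURVNET_SET, MODEL_CURVNET_RESTORE, model_icmpstat_inc,
model_reap) is hypothetical and only mimics the idea that a statistics
counter lives in a per-network-stack instance reached through a "current
vnet" pointer. The model uses a single global pointer where the kernel keeps
one per thread, and it returns -1 where the kernel would instead take a page
fault on the unset pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one network stack instance (not struct vnet). */
struct model_vnet {
	unsigned long icmp_errors;	/* plays the role of icps_error */
};

/*
 * Hypothetical stand-in for the current thread's vnet pointer.
 * It stays NULL until some code explicitly sets a context.
 */
static struct model_vnet *model_curvnet = NULL;

/* Save the previous context and switch to the given one. */
#define MODEL_CURVNET_SET(vnet)					\
	struct model_vnet *saved_vnet_ = model_curvnet;		\
	model_curvnet = (vnet)

/* Switch back to the context saved by MODEL_CURVNET_SET. */
#define MODEL_CURVNET_RESTORE()					\
	model_curvnet = saved_vnet_

/*
 * Models a per-vnet statistics increment. Returns 0 on success and -1
 * when no context is set (assumption: the real kernel faults here).
 */
static int
model_icmpstat_inc(void)
{
	if (model_curvnet == NULL)
		return (-1);
	model_curvnet->icmp_errors++;
	return (0);
}

/*
 * Models a deferred work function that runs without the context of the
 * code that scheduled it. With set_context != 0 it brackets the
 * statistics update with set/restore, as the patch does for the reaper.
 */
static int
model_reap(struct model_vnet *if_vnet, int set_context)
{
	int rc;

	assert(if_vnet != NULL);
	if (set_context) {
		MODEL_CURVNET_SET(if_vnet);
		rc = model_icmpstat_inc();
		MODEL_CURVNET_RESTORE();
	} else {
		rc = model_icmpstat_inc();
	}
	return (rc);
}
```

Calling model_reap(&vnet, 0) models the unpatched path and fails without
touching the counter, while model_reap(&vnet, 1) models the patched path,
increments the counter once, and leaves the context pointer as it found it.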