Date:      Sat, 11 Jan 2020 08:27:18 +0100
From:      Andreas Kempe <kempe@lysator.liu.se>
To:        freebsd-net@freebsd.org, freebsd-infiniband@freebsd.org
Subject:   [PATCH] ipoib: Patch for crash in icmp_error, fault trap 12
Message-ID:  <20200111072718.GA14718@moira.hest-guild.se>


--jho1yZJdad60DJr+
Content-Type: multipart/mixed; boundary="OgqxwSJOaUobr8KG"
Content-Disposition: inline


--OgqxwSJOaUobr8KG
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hello everyone,

We have been using IP over IB in connected mode between a Linux
machine running Void Linux and another machine running
FreeBSD 12.1-STABLE. After initially transferring data at the
expected speed, about 5 Gbit/s, and then letting the computers sit
idle for a while, the FreeBSD machine reports transmission timeout
errors. When a new data transfer is started, the machine complains
that it cannot send a few packets because they are too large. After
this, the kernel panics. See the example logs below:

Timing out:

> ib0: timing out; 7 sends not completed

When starting new transfers:

> ib0: packet len 32812 (> 2044) too long to send, dropping
> ib0: packet len 8248 (> 2044) too long to send, dropping

Kernel crash:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 3; apic id = 03
> fault virtual address   = 0x28
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80d76edf
> stack pointer           = 0x28:0xfffffe008edbeb50
> frame pointer           = 0x28:0xfffffe008edbebb0
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 0 (ipoib)
> trap number             = 12
> panic: page fault
> cpuid = 3
> time = 1578710936
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008edbe7b0
> vpanic() at vpanic+0x17e/frame 0xfffffe008edbe810
> panic() at panic+0x43/frame 0xfffffe008edbe870
> trap_pfault() at trap_pfault/frame 0xfffffe008edbe8e0
> trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008edbe950
> trap() at trap+0x288/frame 0xfffffe008edbea80
> calltrap() at calltrap+0x8/frame 0xfffffe008edbea80
> --- trap 0xc, rip = 0xffffffff80d76edf, rsp = 0xfffffe008edbeb50, rbp = 0xfffffe008edbebb0 ---
> icmp_error() at icmp_error+0x2f/frame 0xfffffe008edbebb0
> ipoib_cm_mb_reap() at ipoib_cm_mb_reap+0x154/frame 0xfffffe008edbec00
> linux_work_fn() at linux_work_fn+0xfc/frame 0xfffffe008edbec60
> taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe008edbecc0
> taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe008edbecf0
> fork_exit() at fork_exit+0x7e/frame 0xfffffe008edbed30
> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe008edbed30
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> KDB: enter: panic


The read of address 0x28 that triggers the trap comes from the error
statistics if statement at the top of icmp_error in
sys/netinet/ip_icmp.c:

> if (type != ICMP_REDIRECT)
>         ICMPSTAT_INC(icps_error);

ICMPSTAT_INC needs the VIMAGE (vnet) for the current thread to be set.
Its caller, ipoib_cm_mb_reap, runs in its own thread, where nothing
has set it. The chain is as follows: when ipoib_cm_send finds that a
packet is too large for the MTU, it calls ipoib_cm_mb_too_long, which
in turn schedules ipoib_cm_mb_reap (all of these functions are located
in sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c).
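
A fault address as small as 0x28 is what one would expect when a field
at a fixed offset is read through a NULL base pointer. The following is
a minimal userland sketch of that arithmetic only. The struct
layout_demo type and the field_address helper are hypothetical names
made up for illustration; this is not the real struct vnet, and the
exact field involved has not been checked here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the real struct vnet: 40 bytes (0x28) of
 * leading fields followed by the field a statistics macro would read.
 */
struct layout_demo {
	uint64_t	pad[5];		/* 5 * 8 = 40 bytes = 0x28 */
	uintptr_t	data_base;	/* first field touched by the reader */
};

/*
 * Address that is touched when data_base is read through a context
 * pointer whose numeric value is 'base'. With base == 0 (NULL), the
 * result is simply the offset of the field.
 */
static uintptr_t
field_address(uintptr_t base)
{
	return (base + offsetof(struct layout_demo, data_base));
}
```

With a base of 0, field_address returns the field offset itself, which
is the pattern the fault virtual address in the panic suggests.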

The attached patch fixes the issue by setting the VIMAGE for the
thread in ipoib_cm_mb_reap.
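
The set/restore pattern can be modelled in userland roughly as follows.
This is only a sketch under stated assumptions, not kernel code:
struct fake_vnet, cur_ctx, stat_inc and worker are hypothetical
stand-ins for the per-vnet counters, curvnet, ICMPSTAT_INC and
ipoib_cm_mb_reap, and a plain global stands in for the per-thread
pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-vnet ICMP statistics. */
struct fake_vnet {
	unsigned long	icps_error;
};

/* Stand-in for curvnet; NULL models a thread where nobody set it. */
static struct fake_vnet *cur_ctx = NULL;

/*
 * Model of ICMPSTAT_INC: the kernel macro would dereference the
 * current context unconditionally; here we return -1 where the real
 * code would take a page fault.
 */
static int
stat_inc(void)
{
	if (cur_ctx == NULL)
		return (-1);
	cur_ctx->icps_error++;
	return (0);
}

/*
 * Model of the worker: save the old context, optionally set the one
 * belonging to the interface, do the work, then restore. The
 * set_context flag switches between unpatched (0) and patched (1)
 * behaviour.
 */
static int
worker(struct fake_vnet *ifp_ctx, int set_context)
{
	struct fake_vnet *saved = cur_ctx;
	int error;

	if (set_context)
		cur_ctx = ifp_ctx;	/* analogous to CURVNET_SET_QUIET */
	error = stat_inc();
	cur_ctx = saved;		/* analogous to CURVNET_RESTORE */
	return (error);
}
```

Calling worker(&v, 0) models the unpatched path and reports the
failure, while worker(&v, 1) increments the counter and leaves cur_ctx
as it found it.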

We have not yet investigated why the packets are considered too large
for the MTU, but our machine stopped crashing after applying the patch.

Cordially,
Andreas Kempe

--OgqxwSJOaUobr8KG
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment;
	filename="set-vnet-in-ipoib_cm_mb_reap.patch"
Content-Transfer-Encoding: 7bit

Index: sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c	(revision 356611)
+++ sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c	(working copy)
@@ -1265,6 +1265,8 @@
 
 	spin_lock_irqsave(&priv->lock, flags);
 
+	CURVNET_SET_QUIET(priv->dev->if_vnet);
+
 	for (;;) {
 		IF_DEQUEUE(&priv->cm.mb_queue, mb);
 		if (mb == NULL)
@@ -1291,6 +1293,8 @@
 		spin_lock_irqsave(&priv->lock, flags);
 	}
 
+	CURVNET_RESTORE();
+
 	spin_unlock_irqrestore(&priv->lock, flags);
 }
 

--OgqxwSJOaUobr8KG--

--jho1yZJdad60DJr+--


