Date: Sun, 24 Feb 2019 01:23:11 +0100
From: Andreas Kempe <kempe@lysator.liu.se>
To: freebsd-net@freebsd.org
Subject: Infiniband: Mellanox MT26418 in ethernet mode causes crash on shutdown
Message-ID: <8763252f-d433-5e1e-9e3b-628e0545c8eb@lysator.liu.se>

Hello,

When running a Mellanox MT26418 in Ethernet mode, the kernel crashes with the following stack trace on system shutdown:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x0
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80e3f5f4
> stack pointer           = 0x28:0xfffffe064abec6e0
> frame pointer           = 0x28:0xfffffe064abec700
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 1 (init)
> trap number             = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> #0 0xffffffff80b4c5b7 at kdb_backtrace+0x67
> #1 0xffffffff80b05b57 at vpanic+0x177
> #2 0xffffffff80b059d3 at panic+0x43
> #3 0xffffffff8106efdf at trap_fatal+0x35f
> #4 0xffffffff8106f039 at trap_pfault+0x49
> #5 0xffffffff8106e807 at trap+0x2c7
> #6 0xffffffff8104f03c at calltrap+0x8
> #7 0xffffffff80e3fae2 at mlx4_en_stop_port+0x3d2
> #8 0xffffffff80e40ff6 at mlx4_en_destroy_netdev+0x1e6
> #9 0xffffffff80e3e47d at mlx4_en_remove+0xcd
> #10 0xffffffff80e1ab01 at mlx4_remove_device+0xb1
> #11 0xffffffff80e1b0b8 at mlx4_unregister_device+0x98
> #12 0xffffffff80e1c5c5 at mlx4_unload_one+0x85
> #13 0xffffffff80e23543 at mlx4_shutdown+0x83
> #14 0xffffffff80d6b6e9 at linux_pci_shutdown+0x39
> #15 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> #16 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> #17 0xffffffff80b4004a at bus_generic_shutdown+0x5a

I've traced the issue to the following lines of code in mlx4_en_destroy_netdev() in sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c:

> /* Unregister device - this will close the port if it was up */
> if (priv->registered) {
>         mutex_lock(&mdev->state_lock);
>         ether_ifdetach(dev);
>         mutex_unlock(&mdev->state_lock);
> }
>
> mutex_lock(&mdev->state_lock);
> mlx4_en_stop_port(dev);
> mutex_unlock(&mdev->state_lock);

The issue is that mlx4_en_stop_port() follows the call chain below and tries to fetch the MAC address of the device in mlx4_en_put_qp():

mlx4_en_destroy_netdev->mlx4_en_stop_port->mlx4_en_put_qp

This causes the kernel to fault because the MAC address has already been freed by the preceding call to ether_ifdetach(), in if_detach_internal(), through the following call chain:

mlx4_en_destroy_netdev->ether_ifdetach->if_detach->if_detach_internal

I've written a small workaround that works on our test machine, although I suspect it could cause issues, since we now destroy the port before we destroy the interface.

Please see the attached patch for the workaround.

Cordially,
Andreas Kempe
Lysator ACS

[Attachment: mlx_destroy_work_around.patch]

--- sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c.old	2019-02-24 01:01:54.759307000 +0100
+++ sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c	2019-02-24 01:04:07.872558000 +0100
@@ -1764,16 +1764,19 @@
 	if (priv->vlan_detach != NULL)
 		EVENTHANDLER_DEREGISTER(vlan_unconfig, priv->vlan_detach);
 
+	/* Bring the interface down before destroying the port. */
+	if_down(dev);
+
+	mutex_lock(&mdev->state_lock);
+	mlx4_en_stop_port(dev);
+	mutex_unlock(&mdev->state_lock);
+
 	/* Unregister device - this will close the port if it was up */
 	if (priv->registered) {
 		mutex_lock(&mdev->state_lock);
 		ether_ifdetach(dev);
 		mutex_unlock(&mdev->state_lock);
 	}
-
-	mutex_lock(&mdev->state_lock);
-	mlx4_en_stop_port(dev);
-	mutex_unlock(&mdev->state_lock);
 
 	if (priv->allocated)
 		mlx4_free_hwq_res(mdev->dev, &priv->res, MLX4_EN_PAGE_SIZE);
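
For readers who want to see the ordering problem in isolation, here is a minimal user-space sketch of it. This is not driver code: struct fake_ifnet and the fake_* and teardown_* functions are hypothetical stand-ins invented for illustration, and the model assumes only what the report states, namely that detaching frees the link-level address and that stopping the port afterwards still needs it. Where the real kernel dereferences a NULL pointer (fault virtual address 0x0), the sketch returns -1 instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_MAC_LEN 6

/* Hypothetical stand-in for the interface; NOT the real struct ifnet. */
struct fake_ifnet {
	unsigned char *lladdr;	/* link-level (MAC) address owned by the interface */
	int port_up;		/* non-zero while the port still holds resources */
};

/* Allocate the address and mark the port as up. Returns 0 on success. */
static int fake_attach(struct fake_ifnet *ifp)
{
	assert(ifp != NULL);
	ifp->lladdr = malloc(FAKE_MAC_LEN);
	if (ifp->lladdr == NULL)
		return -1;
	memset(ifp->lladdr, 0x02, FAKE_MAC_LEN);
	ifp->port_up = 1;
	return 0;
}

/* Models the effect of ether_ifdetach() described in the report:
 * the link-level address is released. */
static void fake_detach(struct fake_ifnet *ifp)
{
	assert(ifp != NULL);
	free(ifp->lladdr);
	ifp->lladdr = NULL;
}

/* Models mlx4_en_stop_port(): it needs the MAC address to release
 * the port's resources. Returns -1 where the kernel would fault. */
static int fake_stop_port(struct fake_ifnet *ifp)
{
	assert(ifp != NULL);
	if (!ifp->port_up)
		return 0;
	if (ifp->lladdr == NULL)
		return -1;	/* address already freed */
	ifp->port_up = 0;
	return 0;
}

/* Original order: detach first, then stop the port. */
static int teardown_detach_first(void)
{
	struct fake_ifnet ifp = { NULL, 0 };

	if (fake_attach(&ifp) != 0)
		return -2;
	fake_detach(&ifp);
	return fake_stop_port(&ifp);
}

/* Workaround order: stop the port while the address is still valid. */
static int teardown_stop_first(void)
{
	struct fake_ifnet ifp = { NULL, 0 };
	int rc;

	if (fake_attach(&ifp) != 0)
		return -2;
	rc = fake_stop_port(&ifp);
	fake_detach(&ifp);
	return rc;
}
```

Calling teardown_detach_first() reports the failure and teardown_stop_first() succeeds, which mirrors the reordering in the patch. The sketch says nothing about whether stopping the port before detaching is safe with respect to in-flight traffic or locking in the real driver.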