From owner-freebsd-net@freebsd.org Sun Feb 24 00:23:24 2019
To: freebsd-net@freebsd.org
From: Andreas Kempe
Subject: Infiniband: Mellanox MT26418 in ethernet mode causes crash on shutdown
Message-ID: <8763252f-d433-5e1e-9e3b-628e0545c8eb@lysator.liu.se>
Date: Sun, 24 Feb 2019 01:23:11 +0100
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------7E2ABC1F3783C5094D044441"
Content-Language: en-US
List-Id: Networking and TCP/IP with FreeBSD

This is a multi-part message in MIME format.
--------------7E2ABC1F3783C5094D044441
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hello,

When running a Mellanox MT26418 in ethernet mode, the kernel crashes
with the following stack trace on system shutdown:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x0
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80e3f5f4
> stack pointer           = 0x28:0xfffffe064abec6e0
> frame pointer           = 0x28:0xfffffe064abec700
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 1 (init)
> trap number             = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> #0 0xffffffff80b4c5b7 at kdb_backtrace+0x67
> #1 0xffffffff80b05b57 at vpanic+0x177
> #2 0xffffffff80b059d3 at panic+0x43
> #3 0xffffffff8106efdf at trap_fatal+0x35f
> #4 0xffffffff8106f039 at trap_pfault+0x49
> #5 0xffffffff8106e807 at trap+0x2c7
> #6 0xffffffff8104f03c at calltrap+0x8
> #7 0xffffffff80e3fae2 at mlx4_en_stop_port+0x3d2
> #8 0xffffffff80e40ff6 at mlx4_en_destroy_netdev+0x1e6
> #9 0xffffffff80e3e47d at mlx4_en_remove+0xcd
> #10 0xffffffff80e1ab01 at mlx4_remove_device+0xb1
> #11 0xffffffff80e1b0b8 at mlx4_unregister_device+0x98
> #12 0xffffffff80e1c5c5 at mlx4_unload_one+0x85
> #13 0xffffffff80e23543 at mlx4_shutdown+0x83
> #14 0xffffffff80d6b6e9 at linux_pci_shutdown+0x39
> #15 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> #16 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> #17 0xffffffff80b4004a at bus_generic_shutdown+0x5a

I've traced the issue to the following lines of code in
sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c in mlx4_en_destroy_netdev():

>     /* Unregister device - this will close the port if it was up */
>     if (priv->registered) {
>         mutex_lock(&mdev->state_lock);
>         ether_ifdetach(dev);
>         mutex_unlock(&mdev->state_lock);
>     }
>
>     mutex_lock(&mdev->state_lock);
>     mlx4_en_stop_port(dev);
>     mutex_unlock(&mdev->state_lock);

The issue is that mlx4_en_stop_port() follows the call chain below and
tries to fetch the MAC address of the device in mlx4_en_put_qp():

mlx4_en_destroy_netdev->mlx4_en_stop_port->mlx4_en_put_qp

The sequence above causes the kernel to choke because the MAC address
was already freed by the preceding call to ether_ifdetach(), in
if_detach_internal(), through the following call chain:

mlx4_en_destroy_netdev->ether_ifdetach->if_detach->if_detach_internal

I've written a small workaround that works on our test machine, although
I suspect it could cause issues, since we now destroy the port before
we destroy the interface.

Please see the attached patch for the workaround.

Cordially,
Andreas Kempe
Lysator ACS

--------------7E2ABC1F3783C5094D044441
Content-Type: text/x-patch; name="mlx_destroy_work_around.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="mlx_destroy_work_around.patch"

--- sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c.old	2019-02-24 01:01:54.759307000 +0100
+++ sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c	2019-02-24 01:04:07.872558000 +0100
@@ -1764,16 +1764,19 @@
 	if (priv->vlan_detach != NULL)
 		EVENTHANDLER_DEREGISTER(vlan_unconfig, priv->vlan_detach);
 
+	/* Bring the interface down before destroying the port. */
+	if_down(dev);
+
+	mutex_lock(&mdev->state_lock);
+	mlx4_en_stop_port(dev);
+	mutex_unlock(&mdev->state_lock);
+
 	/* Unregister device - this will close the port if it was up */
 	if (priv->registered) {
 		mutex_lock(&mdev->state_lock);
 		ether_ifdetach(dev);
 		mutex_unlock(&mdev->state_lock);
 	}
-
-	mutex_lock(&mdev->state_lock);
-	mlx4_en_stop_port(dev);
-	mutex_unlock(&mdev->state_lock);
 
 	if (priv->allocated)
 		mlx4_free_hwq_res(mdev->dev, &priv->res, MLX4_EN_PAGE_SIZE);

--------------7E2ABC1F3783C5094D044441--
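
The ordering problem described in the message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not driver code: `struct toy_if`, `toy_create()`, `toy_detach()`, `toy_stop_port()` and the two `teardown_*` functions are hypothetical names invented for the illustration. It also assumes that detaching clears the address pointer after freeing it, which is consistent with the reported fault address of 0x0 but is not confirmed by the message. Where the kernel would fault, the model returns -1 instead.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the interface; NOT the kernel's struct ifnet. */
struct toy_if {
	unsigned char *lladdr;	/* MAC address storage owned by the interface */
	int port_up;		/* nonzero while the port still holds its QP */
};

static struct toy_if *
toy_create(void)
{
	/* Locally administered example address, chosen arbitrarily. */
	static const unsigned char mac[6] = { 0x02, 0, 0, 0, 0, 0x01 };
	struct toy_if *ifp = calloc(1, sizeof(*ifp));

	assert(ifp != NULL);
	ifp->lladdr = malloc(sizeof(mac));
	assert(ifp->lladdr != NULL);
	memcpy(ifp->lladdr, mac, sizeof(mac));
	ifp->port_up = 1;
	return (ifp);
}

/* Models ether_ifdetach(): frees the address and clears the pointer (assumption). */
static void
toy_detach(struct toy_if *ifp)
{
	free(ifp->lladdr);
	ifp->lladdr = NULL;
}

/*
 * Models mlx4_en_stop_port() -> mlx4_en_put_qp(), which needs the MAC address.
 * Returns 0 on success and -1 at the point where the kernel would fault.
 */
static int
toy_stop_port(struct toy_if *ifp)
{
	if (!ifp->port_up)
		return (0);
	if (ifp->lladdr == NULL)
		return (-1);	/* the driver would read through the pointer here */
	ifp->port_up = 0;
	return (0);
}

/* Order in the unpatched code: detach first, then stop the port. */
int
teardown_detach_first(void)
{
	struct toy_if *ifp = toy_create();
	int rc;

	toy_detach(ifp);
	rc = toy_stop_port(ifp);
	free(ifp);
	return (rc);
}

/* Order in the workaround: stop the port while the address still exists. */
int
teardown_stop_first(void)
{
	struct toy_if *ifp = toy_create();
	int rc = toy_stop_port(ifp);

	toy_detach(ifp);
	free(ifp);
	return (rc);
}

#ifdef TOY_DEMO_MAIN
int
main(void)
{
	printf("detach, then stop: %d\n", teardown_detach_first());
	printf("stop, then detach: %d\n", teardown_stop_first());
	return (0);
}
#endif
```

Built with -DTOY_DEMO_MAIN, the first ordering reports -1 and the second reports 0. The model says nothing about whether stopping the port before detaching is safe in the real driver, which is the concern raised in the message.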