Subject: Re: Infiniband: Mellanox MT26418 in ethernet mode causes crash on shutdown
From: Hans Petter Selasky
To: Andreas Kempe, freebsd-net@freebsd.org, freebsd-drivers
Date: Sun, 24 Feb 2019 10:54:53 +0100
In-Reply-To: <8763252f-d433-5e1e-9e3b-628e0545c8eb@lysator.liu.se>

On 2/24/19 1:23 AM, Andreas Kempe wrote:
> Hello,
>
> When running a Mellanox MT26418 in Ethernet mode, the kernel crashes
> with the following stack trace on system shutdown:
>
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 0; apic id = 00
>> fault virtual address = 0x0
>> fault code = supervisor read data, page not present
>> instruction pointer = 0x20:0xffffffff80e3f5f4
>> stack pointer = 0x28:0xfffffe064abec6e0
>> frame pointer = 0x28:0xfffffe064abec700
>> code segment = base 0x0, limit 0xfffff, type 0x1b
>> = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags = interrupt enabled, resume, IOPL = 0
>> current process = 1 (init)
>> trap number = 12
>> panic: page fault
>> cpuid = 0
>> KDB: stack backtrace:
>> #0 0xffffffff80b4c5b7 at kdb_backtrace+0x67
>> #1 0xffffffff80b05b57 at vpanic+0x177
>> #2 0xffffffff80b059d3 at panic+0x43
>> #3 0xffffffff8106efdf at trap_fatal+0x35f
>> #4 0xffffffff8106f039 at trap_pfault+0x49
>> #5 0xffffffff8106e807 at trap+0x2c7
>> #6 0xffffffff8104f03c at calltrap+0x8
>> #7 0xffffffff80e3fae2 at mlx4_en_stop_port+0x3d2
>> #8 0xffffffff80e40ff6 at mlx4_en_destroy_netdev+0x1e6
>> #9 0xffffffff80e3e47d at mlx4_en_remove+0xcd
>> #10 0xffffffff80e1ab01 at mlx4_remove_device+0xb1
>> #11 0xffffffff80e1b0b8 at mlx4_unregister_device+0x98
>> #12 0xffffffff80e1c5c5 at mlx4_unload_one+0x85
>> #13 0xffffffff80e23543 at mlx4_shutdown+0x83
>> #14 0xffffffff80d6b6e9 at linux_pci_shutdown+0x39
>> #15 0xffffffff80b4004a at bus_generic_shutdown+0x5a
>> #16 0xffffffff80b4004a at bus_generic_shutdown+0x5a
>> #17 0xffffffff80b4004a at bus_generic_shutdown+0x5a
>
> I've traced the issue to the following lines of code in
> sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c in mlx4_en_destroy_netdev():
>
>> /* Unregister device - this will close the port if it was up */
>> if (priv->registered) {
>>         mutex_lock(&mdev->state_lock);
>>         ether_ifdetach(dev);
>>         mutex_unlock(&mdev->state_lock);
>> }
>>
>> mutex_lock(&mdev->state_lock);
>> mlx4_en_stop_port(dev);
>> mutex_unlock(&mdev->state_lock);
>
> The issue is that mlx4_en_stop_port() follows the call chain below and
> tries to fetch the MAC address of the device in mlx4_en_put_qp():
> mlx4_en_destroy_netdev->mlx4_en_stop_port->mlx4_en_put_qp
>
> The sequence above causes the kernel to choke because the MAC address
> was already freed by the preceding call to ether_ifdetach(), in
> if_detach_internal(), via the following call chain:
> mlx4_en_destroy_netdev->ether_ifdetach->if_detach->if_detach_internal
>
> I've written a small workaround that works on our test machine, although
> I suspect it could cause issues, as we're destroying the port before we
> destroy the interface. Please see the attached patch for the workaround.
>
> Cordially,
> Andreas Kempe
> Lysator ACS

CC'ing FreeBSD-drivers at Mellanox.

Thank you for your patch. We'll have a look at it.

--HPS
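
For readers following the thread, the ordering problem described in the quoted report can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct model_ifnet and the model_* functions are hypothetical stand-ins, not the real ifnet or mlx4 driver code, and the attached patch is not reproduced here. The sketch only shows why stopping the port after the detach fails, and why stopping it first (the order the workaround is described as using) does not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the interface; NOT the real struct ifnet. */
struct model_ifnet {
	unsigned char *lladdr;	/* link-level (MAC) address owned by the interface */
	int port_up;
};

/* Models the effect described in the report: detaching frees the MAC address. */
static void
model_ifdetach(struct model_ifnet *ifp)
{
	assert(ifp != NULL);
	free(ifp->lladdr);
	ifp->lladdr = NULL;
}

/*
 * Models stopping the port, which needs to read the MAC address.
 * Returns 0 on success, or -1 in the case where the real driver
 * would take the page fault (address already released).
 */
static int
model_stop_port(struct model_ifnet *ifp, unsigned char mac_out[6])
{
	assert(ifp != NULL);
	if (ifp->lladdr == NULL)
		return (-1);
	memcpy(mac_out, ifp->lladdr, 6);
	ifp->port_up = 0;
	return (0);
}

/*
 * Runs the teardown in either order and reports what stop_port saw.
 * stop_port_first == 0 models the order in the quoted driver code;
 * stop_port_first != 0 models the reordered sequence.
 */
int
model_destroy(int stop_port_first)
{
	struct model_ifnet ifp = { .lladdr = calloc(6, 1), .port_up = 1 };
	unsigned char mac[6];
	int error;

	if (ifp.lladdr == NULL)
		return (-2);	/* allocation failure in the model itself */

	if (stop_port_first) {
		error = model_stop_port(&ifp, mac);
		model_ifdetach(&ifp);
	} else {
		model_ifdetach(&ifp);
		error = model_stop_port(&ifp, mac);
	}
	return (error);
}
```

In this model, model_destroy(0) returns -1 because the address is gone by the time the port is stopped, while model_destroy(1) returns 0. Whether the reordering is safe in the real driver is exactly the open question raised in the report.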