Date:      Thu, 27 Feb 2020 00:00:12 +0100
From:      Andreas Kempe <kempe@lysator.liu.se>
To:        Hans Petter Selasky <hps@selasky.org>
Cc:        Konstantin Belousov <kib@freebsd.org>, Meny Yossefi <menyy@mellanox.com>, freebsd-infiniband@freebsd.org
Subject:   Re: [PATCH]: ipoib with mlx4 initialisation ordering
Message-ID:  <20200226230012.GA6559@moira.hest-guild.se>
In-Reply-To: <2226834e-4184-a581-87bb-3b8ce6c184da@selasky.org>
References:  <20200222004838.GA22659@moira.hest-guild.se> <9d76992b-6ba4-2419-61ff-5035aa45e597@selasky.org> <20200224194608.GC22659@moira.hest-guild.se> <16883d49-3cc0-d9cc-0877-46f811eeb8f1@selasky.org> <20200226210554.GE22659@moira.hest-guild.se> <e2bcedd2-f2db-4d5d-453f-b59f8d7e186e@selasky.org> <20200226213022.GG22659@moira.hest-guild.se> <2226834e-4184-a581-87bb-3b8ce6c184da@selasky.org>

On Wed, Feb 26, 2020 at 10:52:56PM +0100, Hans Petter Selasky wrote:
> On 2020-02-26 22:30, Andreas Kempe wrote:
> > Index: sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > ===================================================================
> > --- sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c	(revision 356611)
> > +++ sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c	(working copy)
> > @@ -1739,7 +1739,7 @@
> >   }
> >   module_init(ipoib_init_module);
> > -module_exit(ipoib_cleanup_module);
> > +module_exit_order(ipoib_cleanup_module, SI_ORDER_FOURTH);
> >   static int
> >   ipoib_evhand(module_t mod, int event, void *arg)
> >
>
> I haven't yet found time to reproduce this issue.
>

No worries, there is absolutely no rush from my side. We can patch our
machines ourselves with the initial patch until some sort of solution
gets adopted upstream.

> Possibly you're right that the list order matters.
>

I would also have guessed that the patch above would have solved the
issue. When the ipoib module is torn down, it should, as far as I can
tell from only reading the code, remove all the multicast groups.
Without hooking up the kernel debugger again, I can't say for sure why
it would still hang.

I'm providing the wall of text below in hopes it can help you or
anyone that wishes to debug this issue further.

The only reason I really said that the list ordering matters is that
mlx4_ib_remove calls ib_unregister_device which in turn walks the
client list in the reverse order. Printing each list element as the
list is iterated during shutdown yields the following client order
(the first client to be removed at the top of the list):

ib_unregister_device: ib_client->name = uverbs
ib_unregister_device: ib_client->name = ucm
ib_unregister_device: ib_client->name = umad
ib_unregister_device: ib_client->name = cm
ib_unregister_device: ib_client->name = ib_multicast
ib_unregister_device: ib_client->name = sa
ib_unregister_device: ib_client->name = mad
ib_unregister_device: ib_client->name = cma
ib_unregister_device: ib_client->name = ipoib
ib_unregister_device: ib_client->name = sdp
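
To make the ordering argument concrete, here is a minimal user-space
sketch of a client list that is torn down in the reverse of its
registration order. Everything named toy_* is hypothetical and only
for illustration; it is not the ib_core data structure, and the
assumption that removal is the exact reverse of registration is mine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CLIENTS 16

/* Hypothetical stand-in for the client list; not the real ib_core types. */
struct toy_registry {
    const char *names[MAX_CLIENTS];
    size_t count;
};

/* Append a client, the way a later-loaded module would register last. */
static void toy_register_client(struct toy_registry *reg, const char *name)
{
    assert(reg->count < MAX_CLIENTS);
    reg->names[reg->count++] = name;
}

/* Fill out[] with the names in removal order (last registered first). */
static size_t toy_removal_order(const struct toy_registry *reg,
                                const char **out)
{
    size_t n = 0;

    for (size_t i = reg->count; i > 0; i--)
        out[n++] = reg->names[i - 1];
    return n;
}

/* Zero-based position of a client in the removal order, or -1 if absent. */
static int toy_removal_index(const struct toy_registry *reg, const char *name)
{
    const char *order[MAX_CLIENTS];
    size_t n = toy_removal_order(reg, order);

    for (size_t i = 0; i < n; i++)
        if (strcmp(order[i], name) == 0)
            return (int)i;
    return -1;
}
```

Registering the ten clients in the reverse of the printed order (sdp
first, uverbs last) puts ib_multicast at removal index 4 and ipoib at
index 8 in this model, so ipoib would still be attached when
ib_multicast is removed.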

If the interface is up and running and has sent data when the machine
is shut down, it hangs on list index 4,
i.e. ib_unregister_device: ib_client->name = ib_multicast.
The reason it hangs is the wait in mcast_remove_one, see below:

sys/ofed/drivers/infiniband/core/ib_multicast.c:
> static void mcast_remove_one(struct ib_device *device, void *client_data)
> {
>     struct mcast_device *dev = client_data;
>     struct mcast_port *port;
>     int i;
>
>     if (!dev)
>         return;
>
>     ib_unregister_event_handler(&dev->event_handler);
>     flush_workqueue(mcast_wq);
>
>     for (i = 0; i <= dev->end_port - dev->start_port; i++) {
>         if (rdma_cap_ib_mcast(device, dev->start_port + i)) {
>             port = &dev->port[i];
>             deref_port(port);
>             wait_for_completion(&port->comp);
>         }
>     }
>
>     kfree(dev);
> }
>
> [...]
>
> static void deref_port(struct mcast_port *port)
> {
>     if (atomic_dec_and_test(&port->refcount))
>         complete(&port->comp);
> }

The crucial logic is in the deref_port(port) function call. It
decrements the reference counter for port and checks whether the
result is zero. If it is zero, complete is called on port->comp.

In the case where the reference count is larger than zero after the
decrement, complete is never called and we hang forever on
wait_for_completion in mcast_remove_one.

By moving the initialisation of ipoib, we got it to be removed before
the ib_multicast client. The reference count is then exactly 1 going
into mcast_remove_one, so deref_port drops it to zero and the wait
completes instead of hanging.
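
As a rough illustration of the two cases, here is a small user-space
model using C11 atomics. The toy_* names are hypothetical, a plain
bool stands in for the kernel completion, and I am assuming that the
multicast client itself holds one reference on the port and that
every remaining user (such as ipoib) holds at least one more.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the two mcast_port fields discussed above. */
struct toy_port {
    atomic_int refcount;
    bool completed; /* true once complete(&port->comp) would have run */
};

/*
 * Assumption: one reference belongs to the multicast client itself;
 * extra_refs models references still held by other users such as ipoib.
 */
static void toy_port_init(struct toy_port *port, int extra_refs)
{
    atomic_init(&port->refcount, 1 + extra_refs);
    port->completed = false;
}

/* Mirrors deref_port: signal completion only when the count hits zero. */
static void toy_deref_port(struct toy_port *port)
{
    /* atomic_fetch_sub returns the old value, so 1 means we reached 0. */
    if (atomic_fetch_sub(&port->refcount, 1) == 1)
        port->completed = true;
}

/*
 * Mirrors the per-port step of mcast_remove_one. Returns true if
 * wait_for_completion would return, false if it would block forever.
 */
static bool toy_remove_one_would_finish(struct toy_port *port)
{
    toy_deref_port(port);
    return port->completed;
}
```

With extra_refs set to 0 (ipoib already gone) the model reports that
the wait would finish; with extra_refs set to 1 it reports a hang
until someone else drops the last reference.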

Cordially,
Andreas Kempe



