Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 30 Nov 2018 14:27:45 +0200
From:      Andriy Gapon <avg@FreeBSD.org>
To:        freebsd-net@FreeBSD.org
Subject:   lagg: a race between ioctl and clone_destroy
Message-ID:  <8fcf453e-c326-eb14-367a-655cbee5d088@FreeBSD.org>

next in thread | raw e-mail | index | archive | help

I am working on analyzing a crash with the following stack trace:
_sx_slock_hard lagg_media_status ifmedia_ioctl lagg_ioctl ifioctl kern_ioctl
sys_ioctl
It appears that the crash happened because of a destroyed sx lock.
Please note that the code base is the CURRENT from just before the time when the
network stack started to use the epoch mechanism.

My theory is that the following race happened.
lagg_clone_destroy() and lagg_ioctl() were called concurrently.
lagg_clone_destroy() won a race to lock sc_sx while lagg_media_status() got
blocked on it.  I think that after some adaptive spinning the thread was placed
on a sleep queue.  Then, lagg_clone_destroy() unlocked the lock and proceeded to
destroy it.  After the lagg_media_status() thread was waken up it found the lock
in the destroyed state and crashed on it in a typical fashion (trying to
dereference a NULL pointer as a struct thread pointer).

My knowledge of the network stack is rather shallow, so I have a few questions.

Q1. Is the described race plausible?

Q2. If yes, then has this class[*] of races been fixed by the epoch mechanism?
I suspect that lagg_ioctl() can still have that race if it's called concurrently
with lagg_clone_destroy().

Q3. Is there a way to protect against this type of a race in the code from
before the epoch mechanism?
I think that the safest thing to do would be to free/destroy the softc only
after if_refcount goes to zero, but I could not find any callback for that.
That is, I think that this code in if_free() ensures that ifp stays valid as
long as it's referenced and it's being accessed under the epoch protection:
        if (refcount_release(&ifp->if_refcount))
                epoch_call(net_epoch_preempt, &ifp->if_epoch_ctx, if_destroy);

But the code in lagg_clone_destroy() would destroy the softc without waiting on
anything:
        LAGG_XUNLOCK(sc);

        ifmedia_removeall(&sc->sc_media);
        ether_ifdetach(ifp);
        if_free(ifp); <---- ifp can still be used and valid after this

        LAGG_LIST_LOCK();
        SLIST_REMOVE(&V_lagg_list, sc, lagg_softc, sc_entries);
        LAGG_LIST_UNLOCK();

        LAGG_SX_DESTROY(sc);
        free(sc, M_DEVBUF); <---- but sc is immediately destroyed in any case
}

[*] My impression is that the problem is (or, at least, was) not limited to
lagg.  I think that other drivers can also have a similar race between a
clone_destroy callback and an operation on an interface that needs to access a
softc.

-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?8fcf453e-c326-eb14-367a-655cbee5d088>