Date: Fri, 30 Nov 2018 14:27:45 +0200 From: Andriy Gapon <avg@FreeBSD.org> To: freebsd-net@FreeBSD.org Subject: lagg: a race between ioctl and clone_destroy Message-ID: <8fcf453e-c326-eb14-367a-655cbee5d088@FreeBSD.org>
next in thread | raw e-mail | index | archive | help
I am working on analyzing a crash with the following stack trace: _sx_slock_hard lagg_media_status ifmedia_ioctl lagg_ioctl ifioctl kern_ioctl sys_ioctl It appears that the crash happened because of a destroyed sx lock. Please note that the code base is the CURRENT from just before the time when the network stack started to use the epoch mechanism. My theory is that the following race happened. lagg_clone_destroy() and lagg_ioctl() were called concurrently. lagg_clone_destroy() won a race to lock sc_sx while lagg_media_status() got blocked on it. I think that after some adaptive spinning the thread was placed on a sleep queue. Then, lagg_clone_destroy() unlocked the lock and proceeded to destroy it. After the lagg_media_status() thread was waken up it found the lock in the destroyed state and crashed on it in a typical fashion (trying to dereference a NULL pointer as a struct thread pointer). My knowledge of the network stack is rather shallow, so I have a few questions. Q1. Is the described race plausible? Q2. If yes, then has this class[*] of races been fixed by the epoch mechanism? I suspect that lagg_ioctl() can still have that race if it's called concurrently with lagg_clone_destroy(). Q3. Is there a way to protect against this type of a race in the code from before the epoch mechanism? I think that the safest thing to do would be to free/destroy the softc only after if_refcount goes to zero, but I could not find any callback for that. That is, I think that this code in if_free() ensures that ifp stays valid as long as it's referenced and it's being accessed under the epoch protection: if (refcount_release(&ifp->if_refcount)) epoch_call(net_epoch_preempt, &ifp->if_epoch_ctx, if_destroy); But the code in lagg_clone_destroy() would destroy the softc without waiting on anything: LAGG_XUNLOCK(sc); ifmedia_removeall(&sc->sc_media); ether_ifdetach(ifp); if_free(ifp); <---- ifp can still be used and valid after this LAGG_LIST_LOCK(); SLIST_REMOVE(&V_lagg_list, sc, lagg_softc, sc_entries); LAGG_LIST_UNLOCK(); LAGG_SX_DESTROY(sc); free(sc, M_DEVBUF); <---- but sc is immediately destroyed in any case } [*] My impression is that the problem is (or, at least, was) not limited to lagg. I think that other drivers can also have a similar race between a clone_destroy callback and an operation on an interface that needs to access a softc. -- Andriy Gapon
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?8fcf453e-c326-eb14-367a-655cbee5d088>