Date: Sun, 11 Jan 2009 22:39:34 +0300 (MSK) From: Eygene Ryabinkin <rea-fbsd@codelabs.ru> To: FreeBSD-gnats-submit@freebsd.org Subject: kern/130386: [patch] add locking for generic interface address manipulations through ioctl path Message-ID: <20090111193934.103D2B8073@phoenix.codelabs.ru> Resent-Message-ID: <200901111940.n0BJe2Rf044780@freefall.freebsd.org>
next in thread | raw e-mail | index | archive | help
>Number: 130386 >Category: kern >Synopsis: [patch] add locking for generic interface address manipulations through ioctl path >Confidential: no >Severity: serious >Priority: medium >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Sun Jan 11 19:40:02 UTC 2009 >Closed-Date: >Last-Modified: >Originator: Eygene Ryabinkin >Release: FreeBSD 7.1-STABLE amd64 >Organization: Code Labs >Environment: System: FreeBSD 7.1-STABLE amd64 >Description: Manipulation of interface addresses inside generic ioctl() handler in_control() (sys/netinet/in.c) was not locked, so races from multiple sumiltaneous invocations of ioctl's ADD/DELETE actions could occur and the results will vary from incosistent data to the kernel panics. More verbose description is in the patch below. >How-To-Repeat: In my very case, the kernel panic is easily triggered just by restart of the dhclient on my ethernet interface: 'ifaddr' mutex gets destroyed and then used for locking (sleeping). Actually, the bug's reproducibility will vary from machine to machine. rik@ sees another strange effects on his PPPoE interfaces: sometimes the address list has repeating address/mask pairs. This could be also related to this race when two ioctsl ADD actions are racing or ADD/DELETE actions are racing differently from mine's case. >Fix: The patch below fixes the things for me. rik@ will try to test this patch on his system, but he runs 5.x on the host in question, so he needs some time to understand and possibly port the patch. --- Lock-network-interface-address-manipulations.diff begins here --- >From a1e44f4b174415363e999023e11874e5e24b6ed4 Mon Sep 17 00:00:00 2001 From: Eygene Ryabinkin <rea-fbsd@codelabs.ru> Date: Wed, 7 Jan 2009 19:56:54 +0300 Generic network interface ioctl handler (in_control() in sys/netinet/in.c) was not locked at all, so two invocations of this handler could race with each other. In my case it was the race between interface address deletion and interface address addition. The scenario was the following: - SIOCAIFADDR handler was invoked and it created new 'ifaddr' mutex; - then the SIOCDIFADDR handler was invoked for the same IP address, it found that the address is on the interface and tried to remove it and destroyed newly created 'ifaddr' mutex; - SIOCAIFADDR handler tried to add the interface address to the routing table and invoked rtinit1() with RTM_ADD that in turn invoked rtrequest1_fib() that called IFAREF() for the destroyed mutex; - IFAREF() impies the call of _mtx_lock_sleep that isn't prepared to see the mutex with MTX_DESTROYED: I am running ADAPTIVE_MUTEXES and the code in /sys/kern/kern_mutex.c tries to do the following: ----- /* * If the owner is running on another CPU, spin until the * owner stops running or the state of the lock changes. */ v = m->mtx_lock; if (v != MTX_UNOWNED) { owner = (struct thread *)(v & ~MTX_FLAGMASK); ----- Clearly for MTX_DESTROYED (== MTX_UNOWNED | MTX_CONTESTED) this will result in owner set to 0 and the kernel goes to hell. The situation made worse by existence of the network drivers that still need Giant for their locking: all interface-specific ioctls should be called with Giant, so we can't hold our own mutex for such code paths. We also can't release our own mutex prior to grabbing Giant, because this will introduce small race, so I just locked such code paths with Giant entirely. This was done only for interfaces that still need Giant locking, others will be locked with the local mutex, so the overhead for them is just the test for the 'grabbed_giant' variable on the exit paths, nothing more. There is only one code path for now: the one that handles SIOCSIFDSTADDR, other Giant invocations are dealt with differently. But due to the switch'ed code structure, other code paths are affected by the added overhead. Since we now run locked, I changed malloc flags to reflect this: M_NOWAIT is passed now. I had also added panic() to the _mtx_lock_sleep() to catch such situations early. I would do it via KASSERT(), but this requires INVARIANTS and default kernel isn't built with this option _and_ we know that the system will likely panic shortly after such call. Signed-off-by: Eygene Ryabinkin <rea-fbsd@codelabs.ru> --- sys/kern/kern_mutex.c | 7 ++- sys/netinet/in.c | 207 ++++++++++++++++++++++++++++++++++++++++--------- 2 files changed, 175 insertions(+), 39 deletions(-) diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c index 7016ecb..e4ccae9 100644 --- a/sys/kern/kern_mutex.c +++ b/sys/kern/kern_mutex.c @@ -309,7 +309,12 @@ _mtx_lock_sleep(struct mtx *m, uintptr_t tid, int opts, const char *file, int contested = 0; uint64_t waittime = 0; uintptr_t v; - + + if (mtx_destroyed(m)) { + panic("%s: tried to sleep on destroyed mutex %s @ %s:%d\n", + __func__, m->lock_object.lo_name, file, line); + } + if (mtx_owned(m)) { KASSERT((m->lock_object.lo_flags & LO_RECURSABLE) != 0, ("_mtx_lock_sleep: recursed on non-recursive mutex %s @ %s:%d\n", diff --git a/sys/netinet/in.c b/sys/netinet/in.c index c028edb..db4892b 100644 --- a/sys/netinet/in.c +++ b/sys/netinet/in.c @@ -73,6 +73,12 @@ SYSCTL_INT(_net_inet_ip, OID_AUTO, same_prefix_carp_only, CTLFLAG_RW, &sameprefixcarponly, 0, "Refuse to create same prefixes on different interfaces"); +/* Mutex used for locking of a general interface ioctl processing */ +static struct mtx in_control_mtx; + +static void in_setup(void *dummy); +SYSINIT(in_setup, SI_SUB_PROTO_IF, SI_ORDER_ANY, in_setup, NULL); + extern struct inpcbinfo ripcbinfo; extern struct inpcbinfo udbinfo; @@ -190,6 +196,62 @@ in_len2mask(struct in_addr *mask, int len) p[i] = (0xff00 >> (len % 8)) & 0xff; } +/* A load of macroses for in_control */ + +/* Grab/release our own local mutex */ +#define IN_CONTROL_GRAB_MTX \ +do { \ + mtx_lock(&in_control_mtx); \ + grabbed_mtx = 1; \ +} while(0) +#define IN_CONTROL_RELEASE_MTX \ +do { \ + if (grabbed_mtx) { \ + grabbed_mtx = 0; \ + mtx_unlock(&in_control_mtx); \ + } \ +} while(0) + +/* + * Acquires Giant if interface still runs under Giant lock or grabs + * our fine-grained mutex. This macro and the corresponding lock + * releaser should gone away when the last network interface will be + * converted from Giant to the own fine-grained locking. + */ +#define IN_CONTROL_GRAB_LOCK(ifp) \ +do { \ + if ((ifp)->if_flags & IFF_NEEDSGIANT) { \ + mtx_lock(&Giant); \ + grabbed_giant = 1; \ + } else { \ + IN_CONTROL_GRAB_MTX; \ + } \ +} while(0) +#define IN_CONTROL_RELEASE_LOCK \ +do { \ + IN_CONTROL_RELEASE_MTX; \ + if (grabbed_giant) { \ + grabbed_giant = 0; \ + mtx_unlock(&Giant); \ + } \ +} while(0) + +/* + * NB: exitval should have no side-effects or we will unlock prematurely. + * NB: exitval should be local variable or immediate value, global variables + * must not be used here or races will occur. + */ +#define IN_CONTROL_EXIT(exitval) \ +do { \ + IN_CONTROL_RELEASE_MTX; \ + return (exitval); \ +} while (0) +#define IN_CONTROL_EXIT_FROM_GIANT(exitval) \ +do { \ + IN_CONTROL_RELEASE_LOCK; \ + return (exitval); \ +} while (0) + /* * Generic internet control operations (ioctl's). * Ifp is 0 if not an interface-specific ioctl. @@ -209,36 +271,81 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, struct sockaddr_in oldaddr; int error, hostIsNew, iaIsNew, maskIsNew, s; int iaIsFirst; + int grabbed_mtx = 0, grabbed_giant = 0; iaIsFirst = 0; iaIsNew = 0; allhosts_addr.s_addr = htonl(INADDR_ALLHOSTS_GROUP); + /* + * Fast processing of some ioctls and locking is done here. + * + * Code paths that need locking: + * + * - those that are modifying the state of interface: this + * should be done to ensure the consistency of results; + * + * - those that are creating/deleting 'ifaddr' mutex: this + * should be done to avoid cases when one kernel thread + * tries to create new mutex and it races with the other + * one that creates new mutex or with one that deletes + * the just created mutex. + * + * Additional attention is put to the code paths that want + * to acquire Giant mutex for the interface-specific ioctl + * processing. Since no mutex should be active when we're + * trying to grab Giant, we are locking such code paths + * entirely with Giant, but only if interface really needs + * Giant locking; otherwise we will use our local mutex. + */ switch (cmd) { case SIOCALIFADDR: + IN_CONTROL_GRAB_MTX; if (td != NULL) { error = priv_check(td, PRIV_NET_ADDIFADDR); if (error) - return (error); + IN_CONTROL_EXIT (error); } if (!ifp) - return EINVAL; - return in_lifaddr_ioctl(so, cmd, data, ifp, td); + IN_CONTROL_EXIT (EINVAL); + error = in_lifaddr_ioctl(so, cmd, data, ifp, td); + IN_CONTROL_EXIT (error); case SIOCDLIFADDR: + IN_CONTROL_GRAB_MTX; if (td != NULL) { error = priv_check(td, PRIV_NET_DELIFADDR); if (error) - return (error); + IN_CONTROL_EXIT (error); } if (!ifp) - return EINVAL; - return in_lifaddr_ioctl(so, cmd, data, ifp, td); + IN_CONTROL_EXIT (EINVAL); + error = in_lifaddr_ioctl(so, cmd, data, ifp, td); + IN_CONTROL_EXIT (error); case SIOCGLIFADDR: if (!ifp) - return EINVAL; - return in_lifaddr_ioctl(so, cmd, data, ifp, td); + IN_CONTROL_EXIT (EINVAL); + error = in_lifaddr_ioctl(so, cmd, data, ifp, td); + IN_CONTROL_EXIT (error); + + /* + * Ioctl calls that possibly need Giant. Should test that + * ifp isn't NULL first. + */ + case SIOCSIFDSTADDR: + if (ifp == 0) + IN_CONTROL_EXIT (EADDRNOTAVAIL); + IN_CONTROL_GRAB_LOCK(ifp); + break; + + case SIOCAIFADDR: + case SIOCDIFADDR: + case SIOCSIFADDR: + case SIOCSIFNETMASK: + case SIOCSIFBRDADDR: + IN_CONTROL_GRAB_MTX; + break; } /* @@ -272,7 +379,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, case SIOCAIFADDR: case SIOCDIFADDR: if (ifp == 0) - return (EADDRNOTAVAIL); + IN_CONTROL_EXIT (EADDRNOTAVAIL); if (ifra->ifra_addr.sin_family == AF_INET) { for (oia = ia; ia; ia = TAILQ_NEXT(ia, ia_link)) { if (ia->ia_ifp == ifp && @@ -284,29 +391,30 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, && (cmd == SIOCAIFADDR) && (ifra->ifra_dstaddr.sin_addr.s_addr == INADDR_ANY)) { - return EDESTADDRREQ; + IN_CONTROL_EXIT (EDESTADDRREQ); } } if (cmd == SIOCDIFADDR && ia == 0) - return (EADDRNOTAVAIL); + IN_CONTROL_EXIT (EADDRNOTAVAIL); /* FALLTHROUGH */ case SIOCSIFADDR: case SIOCSIFNETMASK: + /* Possibly Giant-locked path */ case SIOCSIFDSTADDR: if (td != NULL) { - error = priv_check(td, (cmd == SIOCDIFADDR) ? + error = priv_check(td, (cmd == SIOCDIFADDR) ? PRIV_NET_DELIFADDR : PRIV_NET_ADDIFADDR); if (error) - return (error); + IN_CONTROL_EXIT_FROM_GIANT (error); } if (ifp == 0) - return (EADDRNOTAVAIL); + IN_CONTROL_EXIT_FROM_GIANT (EADDRNOTAVAIL); if (ia == (struct in_ifaddr *)0) { ia = (struct in_ifaddr *) - malloc(sizeof *ia, M_IFADDR, M_WAITOK | M_ZERO); + malloc(sizeof *ia, M_IFADDR, M_NOWAIT | M_ZERO); if (ia == (struct in_ifaddr *)NULL) - return (ENOBUFS); + IN_CONTROL_EXIT_FROM_GIANT (ENOBUFS); /* * Protect from ipintr() traversing address list * while we're modifying it. @@ -338,7 +446,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, if (td != NULL) { error = priv_check(td, PRIV_NET_ADDIFADDR); if (error) - return (error); + IN_CONTROL_EXIT (error); } /* FALLTHROUGH */ @@ -347,44 +455,47 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, case SIOCGIFDSTADDR: case SIOCGIFBRDADDR: if (ia == (struct in_ifaddr *)0) - return (EADDRNOTAVAIL); + IN_CONTROL_EXIT (EADDRNOTAVAIL); break; } switch (cmd) { case SIOCGIFADDR: *((struct sockaddr_in *)&ifr->ifr_addr) = ia->ia_addr; - return (0); + IN_CONTROL_EXIT (0); case SIOCGIFBRDADDR: if ((ifp->if_flags & IFF_BROADCAST) == 0) - return (EINVAL); + IN_CONTROL_EXIT (EINVAL); *((struct sockaddr_in *)&ifr->ifr_dstaddr) = ia->ia_broadaddr; - return (0); + IN_CONTROL_EXIT (0); case SIOCGIFDSTADDR: if ((ifp->if_flags & IFF_POINTOPOINT) == 0) - return (EINVAL); + IN_CONTROL_EXIT (EINVAL); *((struct sockaddr_in *)&ifr->ifr_dstaddr) = ia->ia_dstaddr; - return (0); + IN_CONTROL_EXIT (0); case SIOCGIFNETMASK: *((struct sockaddr_in *)&ifr->ifr_addr) = ia->ia_sockmask; - return (0); + IN_CONTROL_EXIT (0); + /* Possibly Giant-locked path */ case SIOCSIFDSTADDR: if ((ifp->if_flags & IFF_POINTOPOINT) == 0) - return (EINVAL); + IN_CONTROL_EXIT_FROM_GIANT (EINVAL); oldaddr = ia->ia_dstaddr; ia->ia_dstaddr = *(struct sockaddr_in *)&ifr->ifr_dstaddr; if (ifp->if_ioctl) { - IFF_LOCKGIANT(ifp); + /* + * Should be already Giant-locked if interface + * driver needs it. + */ error = (*ifp->if_ioctl)(ifp, SIOCSIFDSTADDR, (caddr_t)ia); - IFF_UNLOCKGIANT(ifp); if (error) { ia->ia_dstaddr = oldaddr; - return (error); + IN_CONTROL_EXIT_FROM_GIANT (error); } } if (ia->ia_flags & IFA_ROUTE) { @@ -394,13 +505,13 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, (struct sockaddr *)&ia->ia_dstaddr; rtinit(&(ia->ia_ifa), (int)RTM_ADD, RTF_HOST|RTF_UP); } - return (0); + IN_CONTROL_EXIT_FROM_GIANT (0); case SIOCSIFBRDADDR: if ((ifp->if_flags & IFF_BROADCAST) == 0) - return (EINVAL); + IN_CONTROL_EXIT (EINVAL); ia->ia_broadaddr = *(struct sockaddr_in *)&ifr->ifr_broadaddr; - return (0); + IN_CONTROL_EXIT (0); case SIOCSIFADDR: error = in_ifinit(ifp, ia, @@ -412,12 +523,12 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, in_addmulti(&allhosts_addr, ifp); EVENTHANDLER_INVOKE(ifaddr_event, ifp); } - return (0); + IN_CONTROL_EXIT (0); case SIOCSIFNETMASK: ia->ia_sockmask.sin_addr = ifra->ifra_addr.sin_addr; ia->ia_subnetmask = ntohl(ia->ia_sockmask.sin_addr.s_addr); - return (0); + IN_CONTROL_EXIT (0); case SIOCAIFADDR: maskIsNew = 0; @@ -459,7 +570,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, in_addmulti(&allhosts_addr, ifp); EVENTHANDLER_INVOKE(ifaddr_event, ifp); } - return (error); + IN_CONTROL_EXIT (error); case SIOCDIFADDR: /* @@ -479,14 +590,22 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, default: if (ifp == 0 || ifp->if_ioctl == 0) - return (EOPNOTSUPP); + IN_CONTROL_EXIT (EOPNOTSUPP); + IN_CONTROL_RELEASE_MTX; IFF_LOCKGIANT(ifp); error = (*ifp->if_ioctl)(ifp, cmd, data); IFF_UNLOCKGIANT(ifp); - return (error); + IN_CONTROL_EXIT (error); } /* + * This is the cleanup code: it is invoked when some of the + * above calls were not succeeded or when we are processing + * the SIOCDIFADDR ioctl. + * + * NB: we should release 'ifaddr' mutex while our code path + * is still locked, so we have multiple IFAFREE() invocations. + * * Protect from ipintr() traversing address list while we're modifying * it. */ @@ -495,6 +614,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, TAILQ_REMOVE(&in_ifaddrhead, ia, ia_link); if (ia->ia_addr.sin_family == AF_INET) { LIST_REMOVE(ia, ia_hash); + IFAFREE(&ia->ia_ifa); /* * If this is the last IPv4 address configured on this * interface, leave the all-hosts group. @@ -505,6 +625,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, if (oia == NULL) { struct in_multi *inm; + IN_CONTROL_RELEASE_MTX; IFF_LOCKGIANT(ifp); IN_MULTI_LOCK(); IN_LOOKUP_MULTI(allhosts_addr, ifp, inm); @@ -513,13 +634,18 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, IN_MULTI_UNLOCK(); IFF_UNLOCKGIANT(ifp); } + } else { + IFAFREE(&ia->ia_ifa); } - IFAFREE(&ia->ia_ifa); splx(s); - return (error); + IN_CONTROL_EXIT (error); } +#undef IN_CONTROL_EXIT +#undef IN_CONTROL_GRAB_MTX +#undef IN_CONTROL_RELEASE_MTX + /* * SIOC[GAD]LIFADDR. * SIOCGLIFADDR: get first address. (?!?) @@ -1002,3 +1128,8 @@ in_ifdetach(struct ifnet *ifp) in_pcbpurgeif0(&udbinfo, ifp); in_purgemaddrs(ifp); } + +static void in_setup(void *dummy) +{ + mtx_init(&in_control_mtx, "in_control", NULL, MTX_DEF); +} -- 1.6.0.5 --- Lock-network-interface-address-manipulations.diff ends here --- >Release-Note: >Audit-Trail: >Unformatted:
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20090111193934.103D2B8073>