Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 13 Feb 2015 17:08:12 -0500
From:      Randall Stewart <rrs@netflix.com>
To:        Gleb Smirnoff <glebius@freebsd.org>
Cc:        src-committers@freebsd.org, John Baldwin <jhb@freebsd.org>, Eric van Gyzen <eric_van_gyzen@dell.com>, svn-src-all@freebsd.org, Randall Stewart <rrs@freebsd.org>, svn-src-head@freebsd.org, zont@FreeBSD.org, rstone@FreeBSD.org
Subject:   Re: svn commit: r278472 - in head/sys: netinet netinet6
Message-ID:  <1655B9AB-AC55-49EE-ADB6-F355F578915A@netflix.com>
In-Reply-To: <20150213212128.GG15484@FreeBSD.org>
References:  <201502091928.t19JSC5P066293@svn.freebsd.org> <1903622.i3cFFNVYcL@ralph.baldwin.cx> <20150213212128.GG15484@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help

On Feb 13, 2015, at 4:21 PM, Gleb Smirnoff <glebius@freebsd.org> wrote:

> On Mon, Feb 09, 2015 at 03:11:21PM -0500, John Baldwin wrote:
> J> On Monday, February 09, 2015 07:28:12 PM Randall Stewart wrote:
> J> > Author: rrs
> J> > Date: Mon Feb  9 19:28:11 2015
> J> > New Revision: 278472
> J> > URL: https://svnweb.freebsd.org/changeset/base/278472
> J> >=20
> J> > Log:
> J> >   This fixes a bug in the way that the LLE timers for nd6
> J> >   and arp were being used. They basically would pass in the
> J> >   mutex to the callout_init. Because they used this method
> J> >   to the callout system, it was possible to "stop" the callout.
> J> >   When flushing the table and you stopped the running callout, =
the
> J> >   callout_stop code would return 1 indicating that it was going
> J> >   to stop the callout (that was about to run on the callout_wheel =
blocked
> J> >   by the function calling the stop). Now when 1 was returned, it =
would
> J> >   lower the reference count one extra time for the stopped timer, =
then
> J> >   a few lines later delete the memory. Of course the =
callout_wheel was
> J> >   stuck in the lock code and would then crash since it was =
accessing
> J> >   freed memory. By using callout_init(c, 1) we always get a 0 =
back
> J> >   and the reference counting bug does not rear its head. We do =
have
> J> >   to make a few adjustments to the callouts themselves though to =
make
> J> >   sure it does the proper thing if rescheduled as well as gets =
the lock.
> J> >=20
> J> >   Commented upon by hiren and sbruno
> J> >   See Phabricator D1777 for more details.
> J> >=20
> J> >   Commented upon by hiren and sbruno
> J> >   Reviewed by:	adrian, jhb and bz
> J> >   Sponsored by:	Netflix Inc.
> J>=20
> J> Eh, I looked at it, but I really, really don't like it.  I think=20
> J> callout_init_*() should be preferred to CALLOUT_MPSAFE whenever =
possible as it=20
> J> is less race-prone.  I think this should probably be fixed by =
adding Hans'=20
> J> callout_drain_async() instead, though this is fine as a temporary =
workaround.
>=20
> I second concerns. Please look at kern/165863 and r238990 that fixed =
it.
> Transition from CALLOUT_MPSAFE to callout_init_rw() was intentional
> and fixed races.
>=20
> I added to Cc guys who helped to track down that races. May be someone =
still
> has test scripts at hand. AFAIR, there were some that allowed to put a =
box
> down quite quickly.

Well without it we can also put a box down quickly.. at least Sbruno and =
Hiren seem to be
able to.. you end up with deleted memory being accessed by the callout =
code.

I can look at kern/165863 and 238990.. let me go see what I can see ;0

>=20
> --=20
> Totus tuus, Glebius.

--------
Randall Stewart
rrs@netflix.com
803-317-4952








Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1655B9AB-AC55-49EE-ADB6-F355F578915A>