Date: Tue, 20 Jan 2015 15:53:35 -0800 From: Sean Bruno <sbruno@ignoranthack.me> To: "K. Macy" <kmacy@freebsd.org> Cc: Hans Petter Selasky <hps@selasky.org>, Adrian Chadd <adrian@freebsd.org>, "src-committers@freebsd.org" <src-committers@freebsd.org>, Jason Wolfe <nitroboost@gmail.com>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com> Subject: Re: svn commit: r277213 - in head: share/man/man9 sys/kern sys/ofed/include/linux sys/sys Message-ID: <54BEEA7F.1070301@ignoranthack.me> In-Reply-To: <CAHM0Q_N_53BM-6RvXu8UpjfDzQHEn5oXZo1Nn8RO0cuOUhe8tg@mail.gmail.com> References: <201501151532.t0FFWV2Y037455@svn.freebsd.org> <CAJ-Vmok0GXZoojyi=jE=b5D-d338APztaf3Pw0_AAQ-173XSWw@mail.gmail.com> <54BDD9E1.6090505@selasky.org> <20150120075126.GA42409@kib.kiev.ua> <20150120211137.GY15484@FreeBSD.org> <54BED6FB.8060401@selasky.org> <54BEE62D.2060703@ignoranthack.me> <CAHM0Q_MDJN_8sTvTDXfqA7UtJVO3Y8S8%2BNRCs_=6Nj4dkTzjOA@mail.gmail.com> <54BEE8E6.3080009@ignoranthack.me> <CAHM0Q_N_53BM-6RvXu8UpjfDzQHEn5oXZo1Nn8RO0cuOUhe8tg@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 On 01/20/15 15:48, K. Macy wrote: > Are any other drivers hitting this? e.g. cxgb/cxgbe? > > -K > Unkown to me. Nor am I aware of anyone else who ever hit our panics either. Our environment, and the failure, was only seen in the Intel 10GE space (ixgbe). This is an artifact of our use cases, and hasn't been expanded nor tested in our environment with other vendor interfaces. sean > On Tue, Jan 20, 2015 at 3:46 PM, Sean Bruno > <sbruno@ignoranthack.me> wrote: On 01/20/15 15:40, K. Macy wrote: >>>> I think you're working around driver locking bugs by >>>> crippling the callout code. >>>> >>>> -K >>>> > > We had zero evidence of this. What leads you down that path? I'm > totally open to being wrong, e.g. "yeah, you slowed down things so > that you don't hit a race condition" > > sean > >>>> On Tue, Jan 20, 2015 at 3:35 PM, Sean Bruno >>>> <sbruno@ignoranthack.me> wrote: On 01/20/15 14:30, Hans >>>> Petter Selasky wrote: >>>>>>> On 01/20/15 22:11, Gleb Smirnoff wrote: >>>>>>>> On Tue, Jan 20, 2015 at 09:51:26AM +0200, Konstantin >>>>>>>> Belousov wrote: K> > Like stated in the manual page, >>>>>>>> callout_reset_curcpu/on() does not work K> > with >>>>>>>> MPSAFE callouts any more! K> I.e. you 'fixed' some >>>>>>>> undeterminate bugs in callout migration by not K> >>>>>>>> doing migration at all anymore. K> K> > K> > You need >>>>>>>> to use callout_init_{mtx,rm,rw} and remove the custom >>>>>>>> locking K> > inside the callback in the TCP stack to >>>>>>>> get it working like before! K> K> No, you need to do >>>>>>>> this, if you think that whole callout KPI must be K> >>>>>>>> rototiled. It is up to the person who modifies the >>>>>>>> KPI, to ensure that K> existing code is not broken. >>>>>>>> K> K> As I understand, currently we are back to the >>>>>>>> one-cpu callouts. K> Do other people consider this >>>>>>>> situation acceptable ? >>>>>>>> >>>>>>>> I think this isn't acceptable. The commit to a >>>>>>>> complex subsystem lacked a review from persons >>>>>>>> involved in the system before. The commit to >>>>>>>> subsystem broke consumers of the subsystem and this >>>>>>>> was even done not accidentially, but due to Hans not >>>>>>>> caring about it. >>>>>>>> >>>>>>>> As for me this is enough to request a backout, and >>>>>>>> let the change back in only after proper review. >>>>>>>> >>>>>>> >>>>>>> Hi Gleb, >>>>>>> >>>>>>> Backing out my callout API patch means we will for >>>>>>> sure re-introduce an unknown callout spinlock hang, as >>>>>>> noted to me by several people. What do you think about >>>>>>> that? dram Maybe "Jason Wolfe" CC'ed can add to >>>>>>> 10-stable w/o my patches: >>>>>>> >>>> >>>> Jason picked up this patch for work and it resolved our >>>> instability issues that had remained unsolved for quite some >>>> time as reported to freebsd-net: >>>> >>>> https://lists.freebsd.org/pipermail/freebsd-net/2015-January/040895.html >>>> >>>> >>>> This had gone undiagnosed for some time (even with the gracious >>>> help of jhb in offline emails, thanks btw!). >>>> >>>> There's some diagnostics in that email thread that may be of >>>> value to you folks for determination of the validity of >>>> changing the callout API or at least understanding why we >>>> were involved in diagnostics. >>>> >>>> While I'd sure love to tune performance, the fact that our >>>> machines were basically going out to lunch without these >>>> changes, probably means that others were seeing it and didn't >>>> know what else to do. As much as I enjoy a good "break out >>>> the pitch forks and torches" email thread, this increased >>>> stability for us and is allowing us to upgrade from freebsd8 >>>> to freebsd10. Bear this in mind when you throw your voice in >>>> favor of reverting. >>>> >>>>>>> int callout_reset_sbt_on(struct callout *c, sbintime_t >>>>>>> sbt, sbintime_t precision, void (*ftn)(void *), void >>>>>>> *arg, int cpu, int flags) { sbintime_t to_sbt, pr; >>>>>>> struct callout_cpu *cc; int cancelled, direct; >>>>>>> >>>>>>> + cpu = timeout_cpu; /* XXX test code XXX */ >>>>>>> >>>>>>> cancelled = 0; >>>>>>> >>>> >>>> Jason or I would have to run this in production, which would >>>> be problematic I fear. We never had a deterministic test >>>> case that would exhibit the reported failure. We merely >>>> "tested in production" and saw that panics ceased. We didn't >>>> note a dropoff in our traffic either, perhaps we are not as >>>> efficient as others in this corner case, but we were >>>> consistently seeing the spinlock hangs after a day or so of >>>> traffic. >>>> >>>>>>> And see if he observes a callout spinlock hang or not >>>>>>> on his test setup. The patch above should force all >>>>>>> callouts to the same thread basically. Then we could >>>>>>> maybe see if single threading the callouts has anything >>>>>>> to do with solving the spinlock hang. >>>>>>> >>>>>>> The "rewritten" callout API still has all the features >>>>>>> and capabilities the old one had, when used as >>>>>>> described in "man 9 callout". >>>>>>> >>>>>>> At the present moment I'm not technically convinced a >>>>>>> backout is correct. >>>> >>>> Neither am I, to be honest. Just based on *results*. >>>> >>>>>>> >>>>>>> Gleb: I think we would see far better results with >>>>>>> high speed internet links using TCP if we could extend >>>>>>> the LRO (large receive offload) code to accumulate more >>>>>>> than 64KBytes worth of data per call to the TCP stack >>>>>>> instead of complaining about some callouts ending up on >>>>>>> the same thread! Actually I have a patch for that. >>>>>>> >>>>>>> --HPS >>>>>>> >>>>>>> >>>>>>> >>>> >>>>> _______________________________________________ >>>>> svn-src-head@freebsd.org mailing list >>>>> http://lists.freebsd.org/mailman/listinfo/svn-src-head To >>>>> unsubscribe, send any mail to >>>>> "svn-src-head-unsubscribe@freebsd.org" >>>> >>>> > > > -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQF8BAEBCgBmBQJUvup4XxSAAAAAAC4AKGlzc3Vlci1mcHJAbm90YXRpb25zLm9w ZW5wZ3AuZmlmdGhob3JzZW1hbi5uZXRCQUFENDYzMkU3MTIxREU4RDIwOTk3REQx MjAxRUZDQTFFNzI3RTY0AAoJEBIB78oecn5kNdgH/iFU6kSAcuPJDwiNONLULu9c Nu9kJzQUgCpROor2CnrsjN0WILWTAfmWaQK1b7ClUIwKPmgzvX09DZdsdfQnIMFR aimC3uKW0rJeMDlQNbN/J2mI5+QOyREUozxb53jetDj9t+OmC08idne0AuYuMgwY NZoKt2llhJScjjEu2EJV9W3FoDCmg/ITA83IbzmmoZHHtXACdDVh0vwmNb1UPluA 0hcJv4rQrY5Khq08a4tonQxTB96Sgk/c+0zPpvqSctKEEmf6TiREbwr68SOM4dkY zLxpD/9IeExFTShJSGW+slKAuzCobr+iG8a9tDLV+STtoAX9i5jyLQLnwDcX7AY= =F1sR -----END PGP SIGNATURE-----
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?54BEEA7F.1070301>