From: Ben RUBSON
Date: Thu, 31 Aug 2017 18:04:27 +0200
To: Julien Charbon
Cc: Hans Petter Selasky, FreeBSD Net, hiren, Slawa Olhovchenkov, FreeBSD Stable
Subject: Re: mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
Message-Id: <82EFBD5E-8FC2-4156-A030-AF70D97A37BA@gmail.com>
In-Reply-To: <82e661b4-1bac-ff5b-f776-8dba44cac15e@freebsd.org>
List-Id: Production branch of FreeBSD source code

> On 28 Aug 2017, at 11:27, Julien Charbon wrote:
>
> On 8/28/17 10:25 AM, Ben RUBSON wrote:
>>> On 16 Aug 2017, at 11:02, Ben RUBSON wrote:
>>>
>>>> On 15 Aug 2017, at 23:33, Julien Charbon wrote:
>>>>
>>>> On 8/11/17 11:32 AM, Ben RUBSON wrote:
>>>>>> On 08 Aug 2017, at 13:33, Julien Charbon wrote:
>>>>>>
>>>>>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>>>>>
>>>>>>> Suggested fix attached.
>>>>>>
>>>>>> I agree with your conclusion. Just for the record, more precisely this
>>>>>> regression seems to have been introduced with:
>>>>>> (...)
>>>>>> Thus good catch, and your patch looks good. I am going to just verify
>>>>>> the other in_pcbrele_wlocked() calls in TCP stack.
>>>>>
>>>>> Julien, do you plan to make this fix reach 11.0-p12 ?
>>>>
>>>> I am checking if your issue is another flavor of the issue fixed by:
>>>>
>>>> https://svnweb.freebsd.org/base?view=revision&revision=307551
>>>> https://reviews.freebsd.org/D8211
>>>>
>>>> This fix is not in 11.0 but in 11.1. Currently I have not found how an
>>>> inp in INP_TIMEWAIT state can have been INP_FREED without having its tw
>>>> set to NULL already, except through the issue fixed by r307551.
>>>>
>>>> Thus could you try to apply this patch:
>>>>
>>>> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>>>>
>>>> and see if you can still reproduce this issue?
>>>
>>> Thank you for your answer Julien.
>>> Unfortunately, I'm not sure at all how to reproduce the issue.
>>> I have other servers which are 100% identical to this one, same workload,
>>> same uptime of several months, but they did not trigger the bug yet.
>>>
>>> If other network stack experts (I'm not) agree with your analysis,
>>> we could then certainly go further with D8211 / r307551.
>>>
>>> One thing that perhaps might help :
>>> # netstat -an | grep TIME_WAIT$ | wc -l
>>> 468
>>>
>>> Note that due to this running bug, sendmail has a lot of difficulty sending outgoing mails.
>>> As soon as I run the above netstat command, I receive a lot of stacked mails (more than 20 this time).
>>> As if netstat was able to somehow help...
>>>
>>> Number of TIME_WAIT connections however does not decrease, but increases.
>>>
>>>> And in the spirit of the r307551 fix and based on Hans' patch I will also
>>>> propose to add a kernel log describing the issue instead of starting an
>>>> infinite loop when INVARIANTS is not set.
>>>
>>> Which should then never be triggered :)
>>> Good idea I think !
>>
>> What about :
>> D8211/r307551
>> + Hans' patch
>> + Julien's idea of a kernel log (sort of "We should not be here but we are")
>
> I did this change and I am testing it

Good news !

> on your side did you try this patch applied on 11.0?
>
> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch

Yes, patch applied and running correctly,
however it is hard to say whether or not it solves this issue,
as there is no easy way to reproduce it.

>> And backporting all this to 11.0 (and so to 11.1 too) ?
>>
>> As this bug can impact every FreeBSD machine / server,
>> leading to an unavailable / unreachable system (this is how mine ended),
>> sounds like it could inevitably be a good thing, for production stability purposes.
>
> The main fix for your issue is (I believe):
>
> Fix a double-free when an inp transitions to INP_TIMEWAIT state
> after having been dropped.
> https://svnweb.freebsd.org/base?view=revision&revision=307551
>
> This fix has been MFC-ed on both stable/11 and stable/10 and is already
> included in 11.1 and will be in 10.4.
> To push in 11.0 release directly,
> I guess you have to promote this change to an Errata (never did that
> myself):
>
> https://www.freebsd.org/security/notices.html
> https://www.freebsd.org/security/security.html#reporting

Mail sent to FreeBSD Security Team !

Many thanks, let's stay tuned !

Ben
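
PS: since the issue is hard to reproduce, it may help to track the TIME_WAIT count quoted above over time on the identical servers. Below is a minimal sketch of that idea in Python, not something taken from the patches discussed in this thread. The function names (count_time_wait, sample_time_wait) are hypothetical, and the parsing only approximates `netstat -an | grep TIME_WAIT$ | wc -l` by counting lines whose text ends in TIME_WAIT once trailing whitespace is removed.

```python
import subprocess


def count_time_wait(netstat_output: str) -> int:
    """Count lines ending in TIME_WAIT, roughly like `grep TIME_WAIT$ | wc -l`.

    Assumption: the connection state is the last column of each line,
    as in the `netstat -an` output referred to in this thread.
    """
    count = 0
    for line in netstat_output.splitlines():
        # Trailing whitespace is stripped, which plain grep would not do.
        if line.rstrip().endswith("TIME_WAIT"):
            count += 1
    return count


def sample_time_wait() -> int:
    """Run `netstat -an` once and return the current TIME_WAIT count."""
    result = subprocess.run(
        ["netstat", "-an"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_time_wait(result.stdout)
```

Calling sample_time_wait() from a periodic job and logging the result with a timestamp would show whether the count keeps growing on the affected server while it stays flat on the others.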