From owner-freebsd-stable@freebsd.org Thu Sep 7 15:28:32 2017
Subject: Re: mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
To: Ben RUBSON
Cc: Hans Petter Selasky, FreeBSD Net, hiren, Slawa Olhovchenkov,
 FreeBSD Stable
References: <4C91C6E5-0725-42E7-9813-1F3ACF3DDD6E@gmail.com>
 <5840c25e-7472-3276-6df9-1ed4183078ad@selasky.org>
 <2ADA8C57-2C2D-4F97-9F0B-82D53EDDC649@gmail.com>
 <061cdf72-6285-8239-5380-58d9d19a1ef7@selasky.org>
 <92BEE83D-498F-47D5-A53C-39DCDC00A0FD@gmail.com>
 <5d8960d8-e1ff-8719-320f-d3ae84054714@selasky.org>
 <6B4A35F7-5694-4945-9575-19ADB678F9FA@gmail.com>
 <297a784a-3d80-b1a6-652e-a78621fe5a8b@selasky.org>
 <3ECCFBF1-18D9-4E33-8F39-0C366C3BB8B4@gmail.com>
 <0a5787c5-8a53-ab09-971a-dc1cd5f3aca0@freebsd.org>
 <645f2ee3-3eaa-660e-2a64-37d53e88322f@freebsd.org>
 <13DE4E6D-CE83-4B5D-BF88-0EFE65111311@gmail.com>
 <7B084207-062A-4529-B0DC-5BFEB6517780@gmail.com>
 <82e661b4-1bac-ff5b-f776-8dba44cac15e@freebsd.org>
 <82EFBD5E-8FC2-4156-A030-AF70D97A37BA@gmail.com>
From: Julien Charbon
Message-ID: <25938f82-a32e-e014-8a98-4f9452dca9b5@freebsd.org>
Date: Thu, 7 Sep 2017 11:22:19 -0400
In-Reply-To: <82EFBD5E-8FC2-4156-A030-AF70D97A37BA@gmail.com>
List-Id: Production branch of FreeBSD source code

Hi Ben,

On 8/31/17 12:04 PM, Ben RUBSON wrote:
>> On 28 Aug 2017, at 11:27, Julien Charbon wrote:
>>
>> On 8/28/17 10:25 AM, Ben RUBSON wrote:
>>>> On 16 Aug 2017, at 11:02, Ben RUBSON wrote:
>>>>
>>>>> On 15 Aug 2017, at 23:33, Julien Charbon wrote:
>>>>>
>>>>> On 8/11/17 11:32 AM, Ben RUBSON wrote:
>>>>>>> On 08 Aug 2017, at 13:33, Julien Charbon wrote:
>>>>>>>
>>>>>>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>>>>>>
>>>>>>>> Suggested fix attached.
>>>>>>>
>>>>>>> I agree with your conclusion. Just for the record, more precisely
>>>>>>> this regression seems to have been introduced with:
>>>>>>> (...)
>>>>>>> Thus good catch, and your patch looks good. I am going to just
>>>>>>> verify the other in_pcbrele_wlocked() calls in the TCP stack.
>>>>>>
>>>>>> Julien, do you plan to make this fix reach 11.0-p12?
>>>>>
>>>>> I am checking if your issue is another flavor of the issue fixed by:
>>>>>
>>>>> https://svnweb.freebsd.org/base?view=revision&revision=307551
>>>>> https://reviews.freebsd.org/D8211
>>>>>
>>>>> This fix is not in 11.0 but in 11.1. So far I have not found how an
>>>>> inp in INP_TIMEWAIT state can have been INP_FREED without already
>>>>> having its tw set to NULL, except through the issue fixed by r307551.
>>>>>
>>>>> Thus could you try to apply this patch:
>>>>>
>>>>> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>>>>>
>>>>> and see if you can still reproduce this issue?
>>>>
>>>> Thank you for your answer, Julien.
>>>> Unfortunately, I'm not sure at all how to reproduce the issue.
>>>> I have other servers which are 100% identical to this one, same
>>>> workload, same uptime of several months, but they have not triggered
>>>> the bug yet.
>>>>
>>>> If other network stack experts (I'm not one) agree with your analysis,
>>>> we could then certainly go further with D8211 / r307551.
>>>>
>>>> One thing that perhaps might help:
>>>> # netstat -an | grep TIME_WAIT$ | wc -l
>>>> 468
>>>>
>>>> Note that due to this running bug, sendmail has a lot of difficulty
>>>> sending outgoing mail.
>>>> As soon as I run the above netstat command, I receive a lot of stacked
>>>> mails (more than 20 this time).
>>>> As if netstat was able to somehow help...
>>>>
>>>> The number of TIME_WAIT connections, however, does not decrease but
>>>> increases.
>>>>
>>>>> And in the spirit of the r307551 fix and based on Hans' patch, I will
>>>>> also propose adding a kernel log describing the issue instead of
>>>>> starting an infinite loop when INVARIANTS is not set.
>>>>
>>>> Which should then never be triggered :)
>>>> Good idea I think!
>>>
>>> What about:
>>> D8211/r307551
>>> + Hans' patch
>>> + Julien's idea of a kernel log (sort of "We should not be here but we are")
>>
>> I made this change and I am testing it
>
> Good news!
>
>> on your side, did you try this patch applied on 11.0?
>>
>> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>
> Yes, patch applied and running correctly,
> however hard to say whether or not it solves this issue,
> as there is no easy way to reproduce it.

No problem, it is just a matter of not seeing the issue anymore during
a long enough period.

I created a review that includes Hans's patch and uses the same
log(LOG_ERR) logic as r307551:

https://reviews.freebsd.org/D12267

On my side, TCP smoke tests are ok. And I am going to launch our TCP
QA on it while receiving review comments.

> Mail sent to FreeBSD Security Team!
>
> Many thanks, let's stay tuned!

Thanks to you and Hans for reporting that issue. And in summary:

- Applying r307551 on top of 11.0 should prevent this case from happening
- D12267 will prevent the tcp_tw_2msl_scan() infinite loop while
  reporting the error, in case a regression defeating r307551 is
  introduced

Thanks.

--
Julien
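The second summary point (report the error instead of looping forever) can be illustrated with a small userspace sketch. This is not the code from D12267 or from tcp_tw_2msl_scan(); the struct, the field names and tw_scan() are hypothetical stand-ins. It assumes the general failure mode is a scan that keeps re-examining the head of a list: if the head entry cannot be processed and is left in place, the loop never ends. Here fprintf(stderr, ...) stands in for the kernel's log(LOG_ERR, ...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a TIME_WAIT entry; not the kernel's struct. */
struct tw_entry {
	struct tw_entry *next;
	bool expired;		/* the 2MSL timer has elapsed */
	bool releasable;	/* false models the "should not happen" state */
};

/*
 * Scan from the head of the list and release every expired entry.
 * Returns the number of entries released; *errors receives the number
 * of entries that were in an unexpected state and were only logged.
 */
static int
tw_scan(struct tw_entry **head, int *errors)
{
	int released = 0;

	assert(head != NULL && errors != NULL);
	*errors = 0;
	while (*head != NULL && (*head)->expired) {
		struct tw_entry *tw = *head;

		/* Unlink first, so the same entry is never examined twice. */
		*head = tw->next;
		tw->next = NULL;

		if (!tw->releasable) {
			/* Report the inconsistency and move on instead of spinning. */
			fprintf(stderr,
			    "tw_scan: entry %p in unexpected state, skipped\n",
			    (void *)tw);
			(*errors)++;
			continue;
		}
		released++;
	}
	return (released);
}
```

With a list of one good expired entry, one inconsistent expired entry and one entry that has not expired yet, the scan releases the first, logs the second, and stops at the third. The design point is only that every iteration makes progress: the loop is bounded by the list length even when an entry is in a state the code did not expect.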