From: David Christensen
To: freebsd-questions@freebsd.org
Subject: Re: Help:: Listen queue overflow killing servers
Date: Fri, 26 Jul 2019 13:23:50 -0700
Message-ID: <4620405e-7d5c-a6e7-d8cc-e94e1230c03f@holgerdanske.com>

On 7/26/19 12:34 PM, Paul Macdonald via freebsd-questions wrote:
>
> On 26/07/2019 19:56, David Christensen wrote:
>> On 7/26/19 9:57 AM, Paul Macdonald via freebsd-questions wrote:
>>>
>>> On 26/07/2019 17:11, David Christensen wrote:
>>>> On 7/26/19 4:58 AM, Paul Macdonald via freebsd-questions wrote:
>>>>> Over the past few months i've seen several boxes (4 or 5) become
>>>>> unresponsive as a result of a Listen queue overflow state.
>>>>
>>> so doesn't look like its load..... ( and that would
>>> have shown up in the logs anyway)
>>
>>
>> Is this server in production?  If so, it would be prudent to migrate
>> services and data to another computer while you troubleshoot.
>>
>>
> this has happened on 5 production boxes over the past few months, all
> with different hardware and load profiles.

Which tracks and versions of FreeBSD?

Are you running stock FreeBSD?  Packages?  Ports?  Custom?

Do you have automation to detect the symptom(s) and alert you?

>> I would turn on debugging and crank up logging everywhere -- kernel,
>> ZFS, Apache, MySQL, PHP, WP, app code, etc..
>> Make sure you have a big
>> and fast device/ virtual device for the logs and debug dumps.
>>
>>
> that's a big job, we run 110+ servers, i'd like to find something more
> specific

Pick a representative sample (say, 10%) and crank up debug/ logging.  As
you get clues, you can scale back depth and increase the sample size.

>> Are the stress tests hitting the server with "good" traffic?  Can you
>> send "bad" traffic?
>>
>>
> no idea how to send bad traffic!

Metasploit comes to mind.

>> Do you have test suites for any of the components?  If so, run them.
>> As you troubleshoot, write new test scripts.
>>
> components are not comparable across boxes, and one box that went down
> has only our custom code ( which has worked for a decade)

Did the other failing machines have your code?

>> Can you capture real traffic and replay it -- preferably traffic that
>> elicits the bug(s)?
>>
> the issue doesn't seem to be that reproducible, i'll check but i think
> only 1 of the boxes has gone down >1 times with same issue
>
> (i can't capture traffic on all boxes)

Again, perhaps start with a sample.

> I wish it was more reproducible, i'd downgrade that server down to 11.4
> in a heart beat ( i'm suspecting its 12.0 related)

I prefer to use the most mature and supported "production" release of
whatever FOSS I use -- BSD, Linux, whatever.  Newer stuff usually has
more "gremlins".

Similarly, I prefer "vendor official" binary software packages.  I have
destabilized plenty of machines with unofficial packages and/or source
distributions.

> ( have seen historic report of similar issues on imap boxes, which do
> have large queues anyway obv)
>
> weirdly our imap boxes have been fine, and they have 10k connections all
> the time.
>
> I siege tested the box that went down earlier today (16C/32T, 128GB
> RAM, 1TB NVMe) and it didn't break sweat after 300,000 connections.
>
> am at a bit of a loss.

Which tracks/ versions of FreeBSD are you running?
Is there any correlation between FreeBSD track/ version and the bug(s)?

Can you run 11.2-RELEASE?

Are you running/ can you run official FreeBSD binary packages?

Do you put your code into a FreeBSD package?

Do you use configuration management?


David
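
P.S.  On the question of automation to detect the symptom: below is a
minimal sketch, not a tested monitoring tool.  It assumes the kernel
reports the condition with a syslog line shaped like "sonewconn: pcb
0x...: Listen queue overflow: N already in queue awaiting acceptance
(M occurrences)"; check the exact wording on your own boxes before
relying on it.  The names find_overflows and should_alert, and the
threshold value, are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple

# ASSUMPTION: the kernel message looks roughly like
#   sonewconn: pcb 0xfffff8001b4d7c40: Listen queue overflow:
#   193 already in queue awaiting acceptance (2 occurrences)
# Adjust the pattern if your logs differ.
OVERFLOW_RE = re.compile(
    r"pcb (?P<pcb>0x[0-9a-fA-F]+): Listen queue overflow: "
    r"(?P<queued>\d+) already in queue awaiting acceptance"
    r"(?: \((?P<count>\d+) occurrences?\))?"
)


class Overflow(NamedTuple):
    pcb: str          # socket control block address as logged
    queued: int       # connections waiting when the overflow was logged
    occurrences: int  # repeat count reported on the line (1 if absent)


def find_overflows(lines: Iterable[str]) -> List[Overflow]:
    """Return one Overflow record per matching log line."""
    hits = []
    for line in lines:
        m = OVERFLOW_RE.search(line)
        if m:
            count = m.group("count")
            hits.append(
                Overflow(
                    pcb=m.group("pcb"),
                    queued=int(m.group("queued")),
                    occurrences=int(count) if count else 1,
                )
            )
    return hits


def should_alert(hits: List[Overflow], threshold: int = 10) -> bool:
    """Hypothetical policy: alert once total occurrences reach threshold."""
    return sum(h.occurrences for h in hits) >= threshold
```

To try it, feed it the lines of whatever file your syslog writes kernel
messages to (for example, open("/var/log/messages") if that is where
yours go) and wire should_alert() to your existing alerting.  Running
it from cron on the 10% sample first would fit the approach above.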