From: "Steven Hartland"
To: "Ivan Dimitrov"
Delivered-To: freebsd-fs@freebsd.org
References: <52821EEE.5040502@gmail.com>
Subject: Re: Strange lock/crash - 100% cpu with basic command line utils
Date: Tue, 12 Nov 2013 13:27:49 -0000
List-Id: Filesystems

----- Original Message -----
From: "Ivan Dimitrov"
Sent: Tuesday, November 12, 2013 12:28 PM
Subject: Strange lock/crash -
100% cpu with basic command line utils

> Hello list
>
> This is my first time reporting a problem, so please excuse me if this
> is not the right place or format. Also apology for my poor English.
>
> Last month we started experiencing strange locks on some of our servers.
> On semi-random occasions, when typing `cd`, `ls`, `pwd` the server would
> crash and start behave strangely. Sometimes the problem is reproducible,
> sometimes all commands work as expected.
> All servers are Intel or AMD CPUs with FreeBSD 9.2 that netboot the
> latest kernel and load the OS in RAM.
> All our servers are using zfs with ssd for cache. Here is an example
> server:
> Also we tested out with preempted and non preempted kernel.
>
> ==========================================
>
> [root@ph3storage5 ~]# zpool status -v
>   pool: zstorage5p1
>  state: ONLINE
>   scan: scrub repaired 0 in 39h36m with 0 errors on Mon Nov 4 05:11:48 2013
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         zstorage5p1    ONLINE       0     0     0
>           mirror-0     ONLINE       0     0     0
>             ada0       ONLINE       0     0     0
>             ada1       ONLINE       0     0     0
>         cache
>           ada4p1       ONLINE       0     0     0
>
> errors: No known data errors
>
>   pool: zstorage5p2
>  state: ONLINE
>   scan: scrub repaired 0 in 14h59m with 0 errors on Sun Nov 3 04:41:50 2013
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         zstorage5p2    ONLINE       0     0     0
>           mirror-0     ONLINE       0     0     0
>             ada2       ONLINE       0     0     0
>             ada3       ONLINE       0     0     0
>         cache
>           ada4p2       ONLINE       0     0     0
>
> errors: No known data errors
>
> ==========================================
> The typical lock would look like the following:
> cd ~userdir/ ; ls
> At this point, the ls command "freezes" and cannot be "ctrl+c".
> We open up another console and see that the `ls` command is using 100%
> CPU. Also, some disk operations randomly start taking 1 to 2 minutes to
> complete. For example, we used `camcontrol` a few times, and it freezed
> at one point.
> Also (while crashed) we used zpool to remove the ssd cache from the
> pool, than we re-added the cache back to the pool, but when we issued
> zpool status, the command freezed for a minute.
>
> We managed to collect some data from two different incidents
>
> Incident 1: http://pastebin.com/EkCeSwY9
> Incident 2: http://pastebin.com/5rj9BV68
>
> Since the problem is reproducible, we accept proposals how to do further
> tests.

This may be off the mark, as I've not seen 100% CPU, but we have seen
random unexplained hangs when connecting to some new machines here, and
it turned out to be a simple lack of mbufs caused by the fact that the
machines have 6 Intel igb NICs.

So the command wasn't hanging at all; it was the output over ssh that
was hanging, due to a lack of mbufs to send the output to the client.

If you run "netstat -m" you'll be able to check and confirm / eliminate
this as your problem.

My next check would be for a failing disk, so throw smartctl at them.

Finally memory, so memtest86+ or something similar.

    Regards
    Steve
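
For what it's worth, the "netstat -m" check can be scripted if you have a lot of machines to look at. The following is only a rough sketch in Python, not something taken from our setup: it assumes the counter lines look like the usual FreeBSD form "N/N/N requests for mbufs denied (...)", the function name mbuf_shortages is made up for illustration, and the sample text is invented rather than captured from any real server.

```python
import re

# Counter lines as printed by FreeBSD's `netstat -m`, for example:
#   0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
# The exact wording is an assumption about the output format.
COUNTER_RE = re.compile(
    r"^[ \t]*(\d+)/(\d+)/(\d+) requests for (mbufs|jumbo clusters) (denied|delayed)\b",
    re.MULTILINE,
)


def mbuf_shortages(netstat_output):
    """Return only the denied/delayed counters that contain a nonzero value.

    Keys look like "mbufs denied"; values are the three counts on that line.
    An empty dict means no allocation request was denied or delayed.
    """
    shortages = {}
    for match in COUNTER_RE.finditer(netstat_output):
        counts = tuple(int(n) for n in match.group(1, 2, 3))
        if any(counts):
            shortages[f"{match.group(4)} {match.group(5)}"] = counts
    return shortages


# Invented sample for illustration only; not output from the servers in this thread.
SAMPLE = """\
25600/0/25600 mbufs in use (current/cache/total)
25600/0/25600/25600 mbuf clusters in use (current/cache/total/max)
1342/87/1429 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
"""
```

On a live box you would feed it the real output, e.g. the stdout of subprocess.run(["netstat", "-m"], capture_output=True, text=True). Any nonzero "denied" or "delayed" counter suggests the mbuf shortage described above; all zeros would point towards the disk and memory checks instead.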