From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 12 13:28:06 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id A75F2DDF
 for <freebsd-fs@freebsd.org>; Tue, 12 Nov 2013 13:28:06 +0000 (UTC)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (using TLSv1 with cipher RC4-MD5 (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 493A82BED
 for <freebsd-fs@freebsd.org>; Tue, 12 Nov 2013 13:28:06 +0000 (UTC)
Received: from r2d2 ([82.69.141.170])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50006695317.msg
 for <freebsd-fs@freebsd.org>; Tue, 12 Nov 2013 13:27:56 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Tue, 12 Nov 2013 13:27:56 +0000
 (not processed: message from valid local sender)
X-MDDKIM-Result: neutral (mail1.multiplay.co.uk)
X-MDRemoteIP: 82.69.141.170
X-Return-Path: prvs=1028112938=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
X-MDaemon-Deliver-To: freebsd-fs@freebsd.org
Message-ID: <C3022716E0A6458690F541AB01A8975A@multiplay.co.uk>
From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Ivan Dimitrov" <zlobber@gmail.com>,
	<freebsd-fs@freebsd.org>
References: <52821EEE.5040502@gmail.com>
Subject: Re: Strange lock/crash - 100% cpu with basic command line utils
Date: Tue, 12 Nov 2013 13:27:49 -0000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
 reply-type=response
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.16
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Nov 2013 13:28:06 -0000


----- Original Message ----- 
From: "Ivan Dimitrov" <zlobber@gmail.com>
To: <freebsd-fs@freebsd.org>
Sent: Tuesday, November 12, 2013 12:28 PM
Subject: Strange lock/crash - 100% cpu with basic command line utils


> Hello list
> 
> This is my first time reporting a problem, so please excuse me if this 
> is not the right place or format. Apologies also for my poor English.
> 
> Last month we started experiencing strange locks on some of our servers. 
> On semi-random occasions, when typing `cd`, `ls` or `pwd`, the server would 
> crash and start behaving strangely. Sometimes the problem is reproducible, 
> sometimes all commands work as expected.
> All servers have Intel or AMD CPUs and run FreeBSD 9.2; they netboot the 
> latest kernel and load the OS into RAM.
> We have tested with kernels built both with and without preemption.
> All our servers use ZFS with an SSD for cache. Here is an example 
> server:
> 
> ==========================================
> 
> [root@ph3storage5 ~]# zpool status -v
>   pool: zstorage5p1
>  state: ONLINE
>   scan: scrub repaired 0 in 39h36m with 0 errors on Mon Nov  4 05:11:48 2013
> config:
> 
>     NAME        STATE     READ WRITE CKSUM
>     zstorage5p1  ONLINE       0     0     0
>       mirror-0  ONLINE       0     0     0
>         ada0    ONLINE       0     0     0
>         ada1    ONLINE       0     0     0
>     cache
>       ada4p1    ONLINE       0     0     0
> 
> errors: No known data errors
> 
>   pool: zstorage5p2
>  state: ONLINE
>   scan: scrub repaired 0 in 14h59m with 0 errors on Sun Nov  3 04:41:50 2013
> config:
> 
>     NAME        STATE     READ WRITE CKSUM
>     zstorage5p2  ONLINE       0     0     0
>       mirror-0  ONLINE       0     0     0
>         ada2    ONLINE       0     0     0
>         ada3    ONLINE       0     0     0
>     cache
>       ada4p2    ONLINE       0     0     0
> 
> errors: No known data errors
> 
> ==========================================
> A typical lock looks like the following:
> cd ~userdir/ ; ls
> At this point, the ls command "freezes" and cannot be interrupted with 
> Ctrl+C.
> We open up another console and see that the `ls` command is using 100% 
> CPU. Also, some disk operations randomly start taking 1 to 2 minutes to 
> complete. For example, we used `camcontrol` a few times, and it froze 
> at one point.
> Also (while crashed) we used zpool to remove the SSD cache from the 
> pool, then re-added the cache to the pool, but when we issued 
> zpool status, the command froze for a minute.
> 
> We managed to collect some data from two different incidents
> 
> Incident 1: http://pastebin.com/EkCeSwY9
> Incident 2: http://pastebin.com/5rj9BV68
> 
> Since the problem is reproducible, we welcome suggestions on further 
> tests to run.

This may be off the mark, as I've not seen 100% CPU, but we have
seen random unexplained hangs when connecting to some new machines
here, and it turned out to be a simple lack of mbufs caused by the
fact that the machines have 6 Intel igb NICs. So the command wasn't
hanging at all; it was the output over ssh which was hanging due
to a lack of mbufs to send the output to the client.

If you run "netstat -m" you'll be able to check and confirm /
eliminate this as your problem.
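
As a rough sketch of what to look for: on FreeBSD the "requests for
... denied" lines in the netstat -m output should normally show all
zeros. The helper below (the name mbuf_denied is made up for
illustration) reads netstat -m output on stdin and prints only those
"denied" lines whose counters are not all zero. It assumes the usual
"N/N/N requests for ... denied" line layout, so treat it as a
convenience filter rather than a definitive check.

```shell
# Hypothetical helper: print the "denied" counter lines from
# `netstat -m` output (read on stdin) whose counters are not all zero.
# Assumes each such line starts with slash-separated counters,
# e.g. "0/0/0 requests for mbufs denied (...)".
mbuf_denied() {
    awk '/denied/ {
        n = split($1, c, "/")          # counters are in the first field
        for (i = 1; i <= n; i++)
            if (c[i] + 0 > 0) { print; break }
    }'
}

# Usage on the affected machine:
#   netstat -m | mbuf_denied
```

No output suggests mbuf exhaustion is unlikely to be the cause; any
line printed means requests were refused at some point since boot.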


My next check would be for a failing disk, so throw smartctl at each of them.
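
A minimal sketch of that check, assuming smartctl is installed (it
comes from the smartmontools port, not the base system) and using the
five disk names shown in the zpool status output above. The function
name check_disks and the DRY_RUN switch are made up for illustration;
with DRY_RUN set, the commands are only printed.

```shell
# Hypothetical helper: run `smartctl -a` against each disk listed in
# the quoted zpool status output (ada0..ada3 plus the cache SSD ada4).
# Set DRY_RUN to any non-empty value to print the commands instead of
# running them.
check_disks() {
    for d in ada0 ada1 ada2 ada3 ada4; do
        ${DRY_RUN:+echo} smartctl -a "/dev/$d"
    done
}

# Usage (as root):
#   check_disks | less
```

In the output, attributes such as Reallocated_Sector_Ct and
Current_Pending_Sector with non-zero raw values are the usual first
signs of a disk on its way out.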

Finally, memory, so memtest86+ or something similar.

    Regards
    Steve
