From: Ivan Dimitrov
Date: Tue, 12 Nov 2013 14:28:30 +0200
To: freebsd-fs@freebsd.org
Subject: Strange lock/crash - 100% cpu with basic command line utils
Message-ID: <52821EEE.5040502@gmail.com>

Hello list

This is my first time reporting a problem, so please excuse me if this is not the right place or format. I also apologize for my poor English.

Last month we started experiencing strange locks on some of our servers. On semi-random occasions, when typing `cd`, `ls`, or `pwd`, the server would crash and start to behave strangely. Sometimes the problem is reproducible; sometimes all commands work as expected.

All servers have Intel or AMD CPUs and run FreeBSD 9.2; they netboot the latest kernel and load the OS into RAM. All our servers use ZFS with an SSD for cache. We also tested with both a preemptive and a non-preemptive kernel.

Here is an example server:

==========================================
[root@ph3storage5 ~]# zpool status -v
  pool: zstorage5p1
 state: ONLINE
  scan: scrub repaired 0 in 39h36m with 0 errors on Mon Nov 4 05:11:48 2013
config:

        NAME         STATE     READ WRITE CKSUM
        zstorage5p1  ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            ada0     ONLINE       0     0     0
            ada1     ONLINE       0     0     0
        cache
          ada4p1     ONLINE       0     0     0

errors: No known data errors

  pool: zstorage5p2
 state: ONLINE
  scan: scrub repaired 0 in 14h59m with 0 errors on Sun Nov 3 04:41:50 2013
config:

        NAME         STATE     READ WRITE CKSUM
        zstorage5p2  ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            ada2     ONLINE       0     0     0
            ada3     ONLINE       0     0     0
        cache
          ada4p2     ONLINE       0     0     0

errors: No known data errors
==========================================

The typical lock looks like the following:

cd ~userdir/ ; ls

At this point, the `ls` command "freezes" and cannot be interrupted with Ctrl+C. We open another console and see that the `ls` command is using 100% CPU. Also, some disk operations randomly start taking 1 to 2 minutes to complete.
For example, we used `camcontrol` a few times, and at one point it froze. Also, while the server was in the crashed state, we used zpool to remove the SSD cache from the pool and then re-added it, but when we issued zpool status, the command froze for a minute.

We managed to collect some data from two different incidents:

Incident 1: http://pastebin.com/EkCeSwY9
Incident 2: http://pastebin.com/5rj9BV68

Since the problem is reproducible, we welcome suggestions on how to run further tests.

Thanks in advance

Best Regards
Ivan Dimitrov