Date: Fri, 6 Mar 2015 19:13:22 -0800
Subject: ZFS Deadlock?
From: Nick Sivo
To: freebsd-fs

Hi,

One of our servers occasionally exhibits strange behavior under heavy IO load. I think, based on the output from procstat -kk -a, it may be a ZFS or VFS deadlock.

Certain operations, including anything involving ZFS commands like zfs and zpool, will hang. Running ls at the root of a ZFS filesystem will also hang. Trying to access snapshots in the .zfs/ folder will hang. None of these hung processes can be killed. Eventually the machine will panic if we don't reboot it first, but that can take days after we start seeing this issue. Strangely, our primary application (Hacker News) will keep running without interruption until the panic.

Details of three occurrences can be found at https://gist.github.com/kogir/acbd6d0e28ade0ee3aac

For the ones this month, it's on:

9.3-RELEASE-p10 FreeBSD 9.3-RELEASE-p10 #0: Tue Feb 24 21:28:03 UTC 2015

Those from October of last year were running an earlier 9.3 (exact version unknown). The same hardware running 9.2 was solid for months at a time. We never saw this issue on 9.2.

top output from the dying box right now:

last pid: 48083; load averages: 0.24, 0.31, 0.27
120 processes: 1 running, 119 sleeping
CPU: 5.6% user, 0.0% nice, 1.7% system, 0.2% interrupt, 92.5% idle
Mem: 5722M Active, 249M Inact, 67G Wired, 352K Cache, 51G Free
ARC: 32G Total, 14G MFU, 8824M MRU, 52M Anon, 1800M Header, 7962M Other
Swap:

I'd show you the zpool configuration, but that would hang. We're not using L2ARC or deduplication.
In any case, it's happening more frequently (twice this week), so I'd like to get to the bottom of it if I can. Does this look like it could be a filesystem issue? This will undoubtedly happen again. Is there more information I should try to collect?

Thanks for your time and any ideas or help you can throw my way :)

Best,
Nick
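
For anyone wanting to triage a capture like the procstat -kk -a output mentioned above, here is a minimal sketch, not a definitive tool. It assumes Python 3.7+ is available on the box and that each thread line starts with a numeric PID and TID followed by kernel frames written as name+0xoffset. The function names and the default list of substrings ("zfs", "zio", "spa_", "txg_") are hypothetical choices for illustration. A match only narrows the list: idle ZFS kernel threads will match too, so it does not prove a deadlock.

```python
import re
import subprocess

# Kernel stack frames in `procstat -kk` output look like name+0xoffset.
FRAME_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+")


def capture_kernel_stacks():
    """Run `procstat -kk -a` (FreeBSD only) and return its text output."""
    result = subprocess.run(
        ["procstat", "-kk", "-a"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_stacks(text):
    """Yield (pid, tid, frames) for every thread line in the capture.

    Lines whose first two fields are not numeric (the header, blank
    lines) are skipped. Frames are found by pattern rather than by
    column, because thread names may contain spaces.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        yield int(fields[0]), int(fields[1]), FRAME_RE.findall(line)


def stuck_candidates(text, needles=("zfs", "zio", "spa_", "txg_")):
    """Return threads with at least one frame containing a needle.

    The needle list is an assumption for illustration, not an
    authoritative set of ZFS symbols.
    """
    hits = []
    for pid, tid, frames in parse_stacks(text):
        if any(needle in frame for frame in frames for needle in needles):
            hits.append((pid, tid, frames))
    return hits
```

On the affected machine this could be called as stuck_candidates(capture_kernel_stacks()). Taking two captures a few minutes apart and comparing the frames for the same PID and TID would give a rough indication of which threads are not making progress. Any output should be written somewhere that does not live on the affected pool, since writes to it may hang as well.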