From: Paul
Date: Thu, 09 Oct 2014 15:36:16 +0300
To: freebsd-fs@freebsd.org
Subject: Question about metadata in ARC

Cheers.

Recently we discovered strange ARC behavior on our production servers. Until now we did not use readdir() often, only on occasion. Now some of our daemons periodically call readdir() on many thousands of directories. The average time between two readdir() calls on the same directory is small: ~10-15 seconds. But for some reason 99% of the calls miss the cache.

The directories never have more than 3 files in them. One of them, the lock file, is never removed and stays on the file system. The other one or two files are unlinked almost immediately after being created. A scan of the directory using readdir(), before unlinking those files, always takes roughly 10-12 milliseconds. I figure this is because directory metadata is getting pushed out of ARC.

The question is: why so quickly? Why, on the other hand, does the metadata needed for stat() stay in the cache much, much longer? Literally hours. stat()-ing a file once every few hours always hits the cache. And we stat() millions of different files per day.

So I imagine there are two kinds of metadata: directory data blocks (a hash table, according to the wiki) and data blocks where stat()'s metadata is stored. Why does stat() metadata live so much longer than directory metadata? Or, better said, why is directory metadata evicted so quickly? Is there a way to configure it otherwise?

I want to show you my little test case.

Environment:

The test is performed on a production server with 128G RAM and max ARC size set to 90G.
We are running FreeBSD 11.0-CURRENT #3 r260625.
Stats from top are:
ARC: 86G Total, 1884M MFU, 77G MRU, 54M Anon, 2160M Header, 5149M Other
Average disk busyness is 40%.
6 CPU cores with hyperthreading (12 virtual cores) are 25% busy on average.

Setup:

To set up the test case I created a test directory and spawned 5000 files using:

# for a in {1..5000}; do touch ${RANDOM}_test_file_name_${RANDOM}_$a; done

And saved their names to a temporary file:

# ls > /tmp/testfiles

Testing:

Then I waited an hour (for the cached metadata to expire from the cache) and did two things:

1) scan the directory using plain ls

# time ls >| /dev/null;
ls -G >| /dev/null 0,03s user 0,05s system 5% cpu 1,472 total

2) stat the files by their names, which were read from the temporary file created earlier

# time stat `cat /tmp/lstest`>| /dev/null
stat `cat /tmp/lstest` >| /dev/null 0,16s user 0,19s system 99% cpu 1,481 total

Then I waited one minute and repeated the two actions above:

1)
# time ls >| /dev/null;
ls -G >| /dev/null 0,03s user 0,04s system 5% cpu 1,327 total

2)
# time stat `cat /tmp/lstest`>| /dev/null
stat `cat /tmp/lstest` >| /dev/null 0,16s user 0,19s system 99% cpu 0,351 total

As you can see, in case (1) the majority of the time is spent waiting for the disk, i.e. a cache miss. In case (2) CPU time is the majority.

I did many more experiments and came to the conclusion that directory metadata is removed from the cache almost immediately. Sometimes it takes 5 seconds, sometimes even 1 second, and rarely 10 or more seconds. On the other hand, when I ran stat `cat /tmp/lstest` hours later, the total time was still around 350ms.

So, how can I configure ZFS to reduce cache misses when reading directories?

There is another problem that I think is related to my issue. A periodic unlink() of non-existent files also takes 10 to 12 milliseconds, while stat() + unlink() (if the file exists) always takes no more than tens of microseconds!

Paul,
Thanks.
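
P.S. In case anyone wants to reproduce the three measurements (directory scan, stat() by name, unlink() of a missing file) with finer resolution than the shell's time, here is a minimal sketch. It is not what produced the numbers above; the function names, the returned dictionary keys and the default missing file name are all made up for illustration. It measures wall-clock time only, so it shows whether a call was slow but not whether ARC was the reason.

```python
import os
import time


def timed(fn, *args):
    """Run fn(*args) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


def stat_all(path, names):
    """Call stat() on every name inside path, like `stat $(cat list)`."""
    for name in names:
        os.stat(os.path.join(path, name))


def unlink_missing(path, name):
    """Try to unlink a file that is expected not to exist.

    Returns True if a file was actually removed, False otherwise.
    """
    try:
        os.unlink(os.path.join(path, name))
    except FileNotFoundError:
        return False
    return True


def measure(path, missing_name="hypothetical_missing_file"):
    """Time one directory scan, one stat() pass and one failed unlink().

    missing_name is an illustrative placeholder; pick a name that
    really does not exist in path, otherwise that file is deleted.
    """
    readdir_s, names = timed(os.listdir, path)  # opendir()/readdir()
    stat_s, _ = timed(stat_all, path, names)
    unlink_s, removed = timed(unlink_missing, path, missing_name)
    return {
        "entries": len(names),
        "readdir_s": readdir_s,
        "stat_s": stat_s,
        "unlink_missing_s": unlink_s,
        "unlink_removed": removed,
    }
```

Calling measure() on the test directory once, and again after a pause of a few seconds, a minute and an hour, should give the same kind of comparison as the ls and stat runs above.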