Date:      Tue, 19 Feb 2013 14:10:47 -0600
From:      Kevin Day <toasty@dragondata.com>
To:        Peter Jeremy <peter@rulingia.com>
Cc:        FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject:   Re: Improving ZFS performance for large directories
Message-ID:  <19E0C908-79F1-43F8-899C-6B60F998D4A5@dragondata.com>
In-Reply-To: <20130201192416.GA76461@server.rulingia.com>
References:  <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> <CAJjvXiE+8OMu_yvdRAsWugH7W=fhFW7bicOLLyjEn8YrgvCwiw@mail.gmail.com> <F4420A8C-FB92-4771-B261-6C47A736CF7F@dragondata.com> <20130201192416.GA76461@server.rulingia.com>

Sorry for the late follow-up; I've been doing some testing with an L2ARC device.


>> Doing it twice back-to-back makes a bit of difference but it's still slow either way.
>
> ZFS can be very conservative about caching data and twice might not be enough.
> I suggest you try 8-10 times, or until the time stops reducing.
>

Timing an "ls" in large directories 20 times, the first run is the slowest, then all subsequent listings are roughly the same. There doesn't appear to be any gain after 20 repetitions.
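
(Concretely, the timing is just a loop along these lines; the directory path below is a placeholder, not one of the real directories:)

    # Repeat the listing 20 times and time each pass.
    # /tank/bigdir is a placeholder for one of the large directories.
    for i in $(seq 1 20); do
        /usr/bin/time -h ls /tank/bigdir > /dev/null
    done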


>> I think some of the issue is that nothing is being allowed to stay cached long.
>
> Well ZFS doesn't do any time-based eviction so if things aren't
> staying in the cache, it's because they are being evicted by things
> that ZFS considers more deserving.
>
> Looking at the zfs-stats you posted, it looks like your workload has
> very low locality of reference (the data hit rate is very low).  If
> this is not what you expect then you need more RAM.  OTOH, your
> vfs.zfs.arc_meta_used being above vfs.zfs.arc_meta_limit suggests that
> ZFS really wants to cache more metadata (by default ZFS has a 25%
> metadata, 75% data split in ARC to prevent metadata caching starving
> data caching).  I would go even further than the 50:50 split suggested
> later and try 75:25 (ie, triple the current vfs.zfs.arc_meta_limit).
>
> Note that if there is basically no locality of reference in your
> workload (as I suspect), you can even turn off data caching for
> specific filesystems with zfs set primarycache=metadata tank/foo
> (note that you still need to increase vfs.zfs.arc_meta_limit to
> allow ZFS to use the ARC to cache metadata).
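
(For reference, applying that suggestion here would look roughly like the sketch below; "tank/foo" is a placeholder dataset name and the limit value is only illustrative, about 75% of the ~30.5 GiB arc_max on this box:)

    # Cache only metadata in the ARC for a given filesystem:
    zfs set primarycache=metadata tank/foo
    # Raise the metadata limit at boot via /boot/loader.conf; on this
    # release it may only be adjustable as a boot-time tunable rather
    # than a live sysctl:
    echo 'vfs.zfs.arc_meta_limit="24597239808"' >> /boot/loader.conf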

Now that I've got an L2ARC device (250GB), I've been doing some experimenting. With the defaults (primarycache and secondarycache both set to all), I really didn't see much improvement. The SSD filled itself pretty quickly, but its hit rate was around 1%, even after 48 hours.

Thinking that making the primary cache metadata-only and the secondary cache "all" would improve things, I wiped the device (SATA secure erase to make sure) and tried again. This was much worse: I'm guessing because some amount of real file data was being read frequently, the SSD was basically getting hammered with reads at 100% utilization, and things were far slower.

I wiped the SSD and tried again with primarycache=all and secondarycache=metadata, and things have improved. Even with vfs.zfs.l2arc_write_max boosted, it took quite a while before things stabilized. I'm guessing there isn't a huge amount of data, but the locality is so poor and sweeping the entire filesystem takes so long that it's going to take a while before it decides what's worth caching. After about 20 hours in this configuration, it makes a HUGE difference in directory speeds, though. Before adding the SSD, an "ls" in a directory with 65k files would take 10-30 seconds; it's now down to about 0.2 seconds.
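
(In case it's useful to anyone else, the combination that's working here is roughly the sketch below; the dataset name is a placeholder, and the write values are simply the ones reported in the tunables dump further down:)

    # Placeholder dataset name; set per filesystem:
    zfs set primarycache=all tank/foo
    zfs set secondarycache=metadata tank/foo
    # L2ARC fill rate as currently reported in the sysctl dump below
    # (the stock defaults are lower); typically set via /boot/loader.conf:
    #   vfs.zfs.l2arc_write_max="26214400"     # 25 MiB per feed interval
    #   vfs.zfs.l2arc_write_boost="52428800"   # 50 MiB while the ARC is still cold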

So I'm guessing the theory was right: there was more metadata than would fit in the ARC, so it was constantly churning. I'm a bit surprised that continually doing an "ls" in a big directory didn't make it stick better, but these filesystems are HUGE, so there may be some inefficiencies happening here. There are roughly 29M files, growing at about 50k files/day. We recently upgraded and are now at 96 3TB drives in the pool.
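
(For what it's worth, a rough back-of-envelope suggests the metadata alone is in the same ballpark as arc_meta_limit; the ~0.5-1 KiB of cached metadata per file is only a guess, since the real per-file overhead depends on dnodes, directory entries and indirect blocks:)

    # 29M files at an assumed 512 B / 1 KiB of cached metadata each:
    echo $((29000000 * 512 / 1073741824))    # ~13 GiB at 512 B per file
    echo $((29000000 * 1024 / 1073741824))   # ~27 GiB at 1 KiB per file
    sysctl -n vfs.zfs.arc_meta_limit         # 16398159872 here, i.e. ~15 GiB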

What I also find surprising is this:

L2 ARC Size: (Adaptive)				22.70	GiB
	Header Size:			0.31%	71.49	MiB

L2 ARC Breakdown:				23.77m
	Hit Ratio:			34.26%	8.14m
	Miss Ratio:			65.74%	15.62m
	Feeds:					63.28k

It's a 250G drive, and only 22G is being used, yet there's still a ~66% miss rate. Is there any way to tell why more metadata isn't being pushed to the L2ARC? I see a pretty high count for "Passed Headroom" and "Tried Lock Failures", but I'm not sure if that's normal. I'm including the lengthy output of zfs-stats below in case anyone sees something that stands out as unusual.
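
(These are the L2ARC feed knobs I've been staring at, with their current values from the dump below; whether raising the headroom or letting prefetched buffers into the L2ARC would actually help is just a guess on my part:)

    sysctl vfs.zfs.l2arc_headroom       # 2    - scan depth per feed, as a multiple of write_max
    sysctl vfs.zfs.l2arc_noprefetch     # 1    - prefetched buffers are not fed to the L2ARC
    sysctl vfs.zfs.l2arc_feed_secs      # 1    - seconds between feeds
    sysctl vfs.zfs.l2arc_feed_min_ms    # 200  - minimum ms between feeds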

------------------------------------------------------------------------
ZFS Subsystem Report				Tue Feb 19 20:08:19 2013
------------------------------------------------------------------------

System Information:

	Kernel Version:				901000 (osreldate)
	Hardware Platform:			amd64
	Processor Architecture:			amd64

	ZFS Storage pool Version:		28
	ZFS Filesystem Version:			5

FreeBSD 9.1-RC2 #1: Tue Oct 30 20:37:38 UTC 2012 root
 8:08PM  up 20:40, 3 users, load averages: 0.47, 0.50, 0.52

------------------------------------------------------------------------

System Memory:

	8.41%	5.22	GiB Active,	10.18%	6.32	GiB Inact
	77.39%	48.05	GiB Wired,	1.52%	966.99	MiB Cache
	2.50%	1.55	GiB Free,	0.00%	888.00	KiB Gap

	Real Installed:				64.00	GiB
	Real Available:			99.97%	63.98	GiB
	Real Managed:			97.04%	62.08	GiB

	Logical Total:				64.00	GiB
	Logical Used:			86.22%	55.18	GiB
	Logical Free:			13.78%	8.82	GiB

Kernel Memory:					23.18	GiB
	Data:				99.91%	23.16	GiB
	Text:				0.09%	21.27	MiB

Kernel Memory Map:				52.10	GiB
	Size:				35.21%	18.35	GiB
	Free:				64.79%	33.75	GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
	Memory Throttle Count:			0

ARC Misc:
	Deleted:				10.24m
	Recycle Misses:				3.48m
	Mutex Misses:				24.85k
	Evict Skips:				12.79m

ARC Size:				92.50%	28.25	GiB
	Target Size: (Adaptive)		92.50%	28.25	GiB
	Min Size (Hard Limit):		25.00%	7.64	GiB
	Max Size (High Water):		4:1	30.54	GiB

ARC Size Breakdown:
	Recently Used Cache Size:	62.35%	17.62	GiB
	Frequently Used Cache Size:	37.65%	10.64	GiB

ARC Hash Breakdown:
	Elements Max:				1.99m
	Elements Current:		99.16%	1.98m
	Collisions:				8.97m
	Chain Max:				14
	Chains:					586.97k

------------------------------------------------------------------------

ARC Efficiency:					1.15b
	Cache Hit Ratio:		97.66%	1.12b
	Cache Miss Ratio:		2.34%	26.80m
	Actual Hit Ratio:		72.75%	833.30m

	Data Demand Efficiency:		98.39%	33.94m
	Data Prefetch Efficiency:	8.11%	7.60m

	CACHE HITS BY CACHE LIST:
	  Anonymously Used:		23.88%	267.15m
	  Most Recently Used:		4.70%	52.60m
	  Most Frequently Used:		69.79%	780.70m
	  Most Recently Used Ghost:	0.64%	7.13m
	  Most Frequently Used Ghost:	0.98%	10.99m

	CACHE HITS BY DATA TYPE:
	  Demand Data:			2.99%	33.40m
	  Prefetch Data:		0.06%	616.42k
	  Demand Metadata:		71.38%	798.44m
	  Prefetch Metadata:		25.58%	286.13m

	CACHE MISSES BY DATA TYPE:
	  Demand Data:			2.04%	546.67k
	  Prefetch Data:		26.07%	6.99m
	  Demand Metadata:		37.96%	10.18m
	  Prefetch Metadata:		33.93%	9.09m

------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
	Passed Headroom:			3.62m
	Tried Lock Failures:			3.17m
	IO In Progress:				21.18k
	Low Memory Aborts:			20
	Free on Write:				7.07k
	Writes While Full:			134
	R/W Clashes:				1.63k
	Bad Checksums:				0
	IO Errors:				0
	SPA Mismatch:				0

L2 ARC Size: (Adaptive)				22.70	GiB
	Header Size:			0.31%	71.02	MiB

L2 ARC Breakdown:				23.78m
	Hit Ratio:			34.25%	8.15m
	Miss Ratio:			65.75%	15.64m
	Feeds:					63.47k

L2 ARC Buffer:
	Bytes Scanned:				65.51	TiB
	Buffer Iterations:			63.47k
	List Iterations:			4.06m
	NULL List Iterations:			64.89k

L2 ARC Writes:
	Writes Sent:			100.00%	29.89k

------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency:					1.24b
	Hit Ratio:			64.29%	798.62m
	Miss Ratio:			35.71%	443.54m

	Colinear:				443.54m
	  Hit Ratio:			0.00%	20.45k
	  Miss Ratio:			100.00%	443.52m

	Stride:					772.29m
	  Hit Ratio:			99.99%	772.21m
	  Miss Ratio:			0.01%	81.30k

DMU Misc:
	Reclaim:				443.52m
	  Successes:			0.05%	220.47k
	  Failures:			99.95%	443.30m

	Streams:				26.42m
	  +Resets:			0.05%	12.73k
	  -Resets:			99.95%	26.41m
	  Bogus:				0

------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
	kern.maxusers                           384
	vm.kmem_size                            66662760448
	vm.kmem_size_scale                      1
	vm.kmem_size_min                        0
	vm.kmem_size_max                        329853485875
	vfs.zfs.l2c_only_size                   5242113536
	vfs.zfs.mfu_ghost_data_lsize            178520064
	vfs.zfs.mfu_ghost_metadata_lsize        6486959104
	vfs.zfs.mfu_ghost_size                  6665479168
	vfs.zfs.mfu_data_lsize                  11863127552
	vfs.zfs.mfu_metadata_lsize              123386368
	vfs.zfs.mfu_size                        12432947200
	vfs.zfs.mru_ghost_data_lsize            14095171584
	vfs.zfs.mru_ghost_metadata_lsize        8351076864
	vfs.zfs.mru_ghost_size                  22446248448
	vfs.zfs.mru_data_lsize                  2076449280
	vfs.zfs.mru_metadata_lsize              4655490560
	vfs.zfs.mru_size                        7074721792
	vfs.zfs.anon_data_lsize                 0
	vfs.zfs.anon_metadata_lsize             0
	vfs.zfs.anon_size                       1605632
	vfs.zfs.l2arc_norw                      1
	vfs.zfs.l2arc_feed_again                1
	vfs.zfs.l2arc_noprefetch                1
	vfs.zfs.l2arc_feed_min_ms               200
	vfs.zfs.l2arc_feed_secs                 1
	vfs.zfs.l2arc_headroom                  2
	vfs.zfs.l2arc_write_boost               52428800
	vfs.zfs.l2arc_write_max                 26214400
	vfs.zfs.arc_meta_limit                  16398159872
	vfs.zfs.arc_meta_used                   16398120264
	vfs.zfs.arc_min                         8199079936
	vfs.zfs.arc_max                         32796319744
	vfs.zfs.dedup.prefetch                  1
	vfs.zfs.mdcomp_disable                  0
	vfs.zfs.write_limit_override            0
	vfs.zfs.write_limit_inflated            206088929280
	vfs.zfs.write_limit_max                 8587038720
	vfs.zfs.write_limit_min                 33554432
	vfs.zfs.write_limit_shift               3
	vfs.zfs.no_write_throttle               0
	vfs.zfs.zfetch.array_rd_sz              1048576
	vfs.zfs.zfetch.block_cap                256
	vfs.zfs.zfetch.min_sec_reap             2
	vfs.zfs.zfetch.max_streams              8
	vfs.zfs.prefetch_disable                0
	vfs.zfs.mg_alloc_failures               12
	vfs.zfs.check_hostid                    1
	vfs.zfs.recover                         0
	vfs.zfs.txg.synctime_ms                 1000
	vfs.zfs.txg.timeout                     5
	vfs.zfs.vdev.cache.bshift               16
	vfs.zfs.vdev.cache.size                 0
	vfs.zfs.vdev.cache.max                  16384
	vfs.zfs.vdev.write_gap_limit            4096
	vfs.zfs.vdev.read_gap_limit             32768
	vfs.zfs.vdev.aggregation_limit          131072
	vfs.zfs.vdev.ramp_rate                  2
	vfs.zfs.vdev.time_shift                 6
	vfs.zfs.vdev.min_pending                4
	vfs.zfs.vdev.max_pending                128
	vfs.zfs.vdev.bio_flush_disable          0
	vfs.zfs.cache_flush_disable             0
	vfs.zfs.zil_replay_disable              0
	vfs.zfs.zio.use_uma                     0
	vfs.zfs.snapshot_list_prefetch          0
	vfs.zfs.version.zpl                     5
	vfs.zfs.version.spa                     28
	vfs.zfs.version.acl                     1
	vfs.zfs.debug                           0
	vfs.zfs.super_owner                     0



