From: Adrian Chadd
To: Alexander Motin
Cc: "freebsd-hackers@freebsd.org", "freebsd-current@freebsd.org", Jeff Roberson
Date: Mon, 18 Nov 2013 12:12:17 -0800
Subject: Re: UMA cache back pressure

Remember that for Netflix, we have a mostly non-cacheable workload (with
some very specific exceptions!) and thus we churn through VM pages at a
prodigious rate: 20 Gbit/s, or ~2.5 gigabytes a second, or roughly
600,000 4-kilobyte pages a second. It's quite frightening and it's only
likely to increase.

There's a lot of pressure from all over the place, so IIRC pools tend
not to stay very large for very long. That's why I'm interested in your
specific situations.

Doing an all-CPU TLB shootdown with 24 cores is costly. But after we
killed some incorrect KVA mapping flags for sendfile, we (Netflix)
totally stopped seeing the TLB shootdowns and IPIs in any of the
performance traces. Now, doing 24 cores' worth of ZFS when you let the
pools grow to the size you do is understandable, but I'd like to just
make sure that you aren't breaking performance for people doing
different workloads on fewer cores.

I'm a bit busy at work with other things so I can't spin up your patch
on a cache for another week or two. But I'll certainly get around to it
as I'd like to see this stuff catch on.

What I _can_ do in a reasonably immediate timeframe is update
vm0.freebsd.org to the latest -HEAD and stress test your patch. I'm
using vm0.freebsd.org to stress test -HEAD with ZFS doing concurrent
poudriere builds, so it gets very crowded on that box. The box currently
survives a couple of days before I hit some races to do with vnode
exhaustion and a lack of handling there, and ZFS deadlocks. I'll just
run this up to see if anything unexpected happens that causes it to blow
up in a different way.

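The throughput figures quoted at the top of this message are approximate. As a rough sketch of the conversion, assuming decimal gigabits (10^9 bits) and 4096-byte pages, the arithmetic can be written out as below. The helper names are made up for illustration and are not part of any FreeBSD API. Under those assumptions 20 Gbit/s comes to 2.5 GB/s and about 610,000 pages per second; other unit conventions shift the result somewhat.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a link rate in bits/s to bytes/s. */
static uint64_t
bits_to_bytes_per_sec(uint64_t bits_per_sec)
{
	return (bits_per_sec / 8);
}

/*
 * Hypothetical helper: number of whole pages per second needed to
 * carry bytes_per_sec, for a given page size in bytes.
 */
static uint64_t
pages_per_sec(uint64_t bytes_per_sec, uint64_t page_size)
{
	assert(page_size > 0);	/* guard against division by zero */
	return (bytes_per_sec / page_size);
}
```

Calling `pages_per_sec(bits_to_bytes_per_sec(20000000000ULL), 4096)` gives the page churn for a 20 Gbit/s stream under the stated assumptions.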
Thanks,

-adrian

On 18 November 2013 11:55, Alexander Motin wrote:
> On 18.11.2013 21:11, Jeff Roberson wrote:
>>
>> On Mon, 18 Nov 2013, Alexander Motin wrote:
>>>
>>> I've created a patch, based on earlier work by avg@, to add back
>>> pressure to UMA allocation caches. The problem of physical memory or
>>> KVA exhaustion has existed there for many years, and it is now quite
>>> critical for improving system performance while keeping stability.
>>> Changes made to memory allocation in recent years improved the
>>> situation, but haven't fixed it completely. My patch solves the
>>> remaining problems from two sides: a) reducing bucket sizes every
>>> time the system detects a low memory condition; and b) as a
>>> last-resort mechanism for a very low memory condition, cycling over
>>> all CPUs to purge their per-CPU UMA caches. The benefit of this
>>> approach is the absence of any additional hard-coded limits on cache
>>> sizes -- they are self-tuned, based on load and memory pressure.
>>>
>>> With this change I believe it should be safe enough to enable UMA
>>> allocation caches in ZFS via the vfs.zfs.zio.use_uma tunable (at
>>> least for amd64). I ran many tests on a machine with 24 logical
>>> cores (and, as a result, strong allocation cache effects), and can
>>> say that with 40GB of RAM, using UMA caches as allowed by this
>>> change doubles the results of the SPEC NFS benchmark on a ZFS pool
>>> of several SSDs. To test system stability I ran the same test with
>>> physical memory limited to just 2GB; the system successfully
>>> survived that, and even showed results 1.5 times better than with
>>> just the last-resort measures of b). In both cases tools/umastat no
>>> longer shows unbounded UMA cache growth, which makes me believe in
>>> the viability of this approach for longer runs.
>>>
>>> I would like to hear some comments about that:
>>> http://people.freebsd.org/~mav/uma_pressure.patch
>>
>>
>> Hey Mav,
>>
>> This is a great start and great results.
>> I think it could probably even go in as-is, but I have a few
>> suggestions.
>
>
> Hey! Thanks for your review. I appreciate it.
>
>
>> First, let's test this with something that is really super allocator
>> heavy and doesn't benefit much from bucket sizing. For example, a
>> network forwarding test. Or maybe you could get someone like Netflix
>> that is using it to push a lot of bits with less filesystem cost than
>> zfs and spec.
>
>
> I am not sure what simple forwarding may show in this case. Even on my
> workload, with ZFS creating strong memory pressure, I still have the
> mbuf* zones' buckets almost (some totally) maxed out. Without other
> major (or even any) pressure in the system they just can't become
> bigger than the maximum. But if you can propose some interesting test
> case with pressure that I can reproduce -- I am all ears.
>
>
>> Second, the cpu binding is a very costly and very high-latency
>> operation. It would make sense to do CPU_FOREACH and then
>> ZONE_FOREACH. You're also biasing the first zones in the list. The
>> low memory condition will more often clear after you check these
>> first zones. So you might just check it once and equally penalize
>> all zones. I'm concerned that doing CPU_FOREACH in every zone will
>> slow the pagedaemon more.
>
>
> I completely agree with all you said here. I took this part of the
> code as-is from the earlier work. It definitely can be improved. I'll
> take a look at that. But as I mentioned in one of my earlier
> responses, that code is used only in _very_ rare cases, unless the
> system is heavily overloaded on memory, like doing ZFS on a box with
> 24 cores and 2GB of RAM. During reasonable operation, soft back
> pressure is enough to keep the caches in shape and never call that.
>
>
>> We also have been working towards per-domain pagedaemons so perhaps
>> we should have a uma-reclaim taskqueue that we wake up to do the
>> work?
>
>
> VM is not my area so far, so please propose "the right way".
> I took on this task now only because I have to, given the huge
> performance bottleneck this problem causes and the years it has
> remained unsolved.
>
>
>> Third, using vm_page_count_min() will only trigger when the pageout
>> daemon can't keep up with the free target. Typically this should only
>> happen with a lot of dirty mmap'd pages or incredibly high system
>> load coupled with frequent allocations. So there may be many cases
>> where reclaiming the extra UMA memory is helpful but the pagedaemon
>> can still keep up while pushing out file pages that we'd prefer to
>> keep.
>
>
> As I said, that is indeed a last resort. It does not need to be done
> often. Per-CPU caches just should not grow, without real need, to the
> point where they have to be cleaned.
>
>
>> I think the perfect heuristic would have some idea of how likely the
>> UMA pages are to be re-used immediately so we can more effectively
>> trade off between file pages and kernel memory cache. As it is now we
>> limit the uma_reclaim() calls to every 10 seconds when there is
>> memory pressure. Perhaps we could keep a timestamp for when the last
>> slab was allocated to a zone and do the more expensive reclaim on
>> zones whose timestamps exceed some threshold? Then have a lower
>> threshold for reclaiming at all? Again, it doesn't need to be
>> perfect, but I believe we can catch a wider set of cases by carefully
>> scheduling this.
>
>
> I was thinking about that too. But I think timestamps should be set
> not on the slab, but on the bucket. The fact that a zone is not
> allocating new slabs does not mean it is not using its already
> allocated buckets. If we put the time of the last refill into each
> bucket, then we should be able to purge all buckets unused for a
> specified period of time. Additionally we could put a timestamp on the
> zone and update it every time the zone runs out of its cache. If a
> zone does not run out of cache for some time -- probably it has unused
> buckets.
> So when we need some RAM we should first look at zones that have a
> stale timestamp.
>
>
> --
> Alexander Motin
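
The per-bucket timestamp idea in the last quoted paragraph can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: `struct model_bucket`, its fields, and `purge_stale_buckets()` are hypothetical names invented for illustration. They are not the kernel's `struct uma_bucket` and not code from the patch under discussion, and locking and per-CPU handling are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a cache bucket; not the kernel structure. */
struct model_bucket {
	int	mb_items;	/* cached items currently held */
	time_t	mb_last_refill;	/* time the bucket was last refilled */
};

/*
 * Release the contents of every bucket that has not been refilled for
 * more than max_idle seconds. Buckets refilled more recently are left
 * alone, on the assumption that they are still in active use.
 * Returns the number of items released.
 */
static int
purge_stale_buckets(struct model_bucket *b, size_t n, time_t now,
    time_t max_idle)
{
	int freed = 0;

	assert(b != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (b[i].mb_items > 0 &&
		    now - b[i].mb_last_refill > max_idle) {
			freed += b[i].mb_items;
			b[i].mb_items = 0;
		}
	}
	return (freed);
}
```

A reclaim pass would call `purge_stale_buckets()` with the current time and an idle threshold. A zone-level timestamp, as also suggested in the thread, could be used to decide which zones to visit first, but that part is not modelled here.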