From owner-svn-src-head@freebsd.org Wed Sep 2 17:29:31 2015 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1C3B29C8B18; Wed, 2 Sep 2015 17:29:31 +0000 (UTC) (envelope-from imp@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2001:1900:2254:2068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 0094CA7E; Wed, 2 Sep 2015 17:29:31 +0000 (UTC) (envelope-from imp@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.70]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id t82HTU0V036120; Wed, 2 Sep 2015 17:29:30 GMT (envelope-from imp@FreeBSD.org) Received: (from imp@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id t82HTULW036119; Wed, 2 Sep 2015 17:29:30 GMT (envelope-from imp@FreeBSD.org) Message-Id: <201509021729.t82HTULW036119@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: imp set sender to imp@FreeBSD.org using -f From: Warner Losh Date: Wed, 2 Sep 2015 17:29:30 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: svn commit: r287405 - head/sys/geom X-SVN-Group: head MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Sep 2015 17:29:31 -0000 Author: imp Date: Wed Sep 2 17:29:30 2015 New Revision: 287405 URL: https://svnweb.freebsd.org/changeset/base/287405 Log: After the introduction of direct dispatch, the pacing code in g_down() broke in two ways. One, the pacing variable was accessed in multiple threads in an unsafe way. Two, since large numbers of I/O could come down from the buf layer at one time, large numbers of allocation failures could happen all at once, resulting in a huge pace value that would limit I/Os to 10 IOPS for minutes (or even hours) at a time. While a real solution to these problems requires substantial work (to go to a no-allocation after the first model, or to have some way to wait for more memory with some kind of reserve for pager and swapper requests), it is relatively easy to make this simplistic pacing less pathological. Move to using a volatile variable with loads and stores. While this is a little racy, losing the race is safe: either you get memory and proceed, or you don't and queue. Second, sleep for 1ms (or one tick, whichever is larger) instead of 100ms. This removes the artificial 10 IOPS limit while still easing up on new I/Os during memory shortages. Remove tying the amount of time we do this to the number of failed requests and do it only as long as we keep failing requests. Finally, to avoid needless recursion when memory is tight (start -> g_io_deliver() -> g_io_request() -> start -> ... until we use 1/2 the stack), don't do direct dispatch while pacing. This should be a rare event (not steady state) so the performance hit here is worth the extra safety of not starving g_down() with directly dispatched I/O. Differential Review: https://reviews.freebsd.org/D3546 Modified: head/sys/geom/geom_io.c Modified: head/sys/geom/geom_io.c ============================================================================== --- head/sys/geom/geom_io.c Wed Sep 2 16:50:49 2015 (r287404) +++ head/sys/geom/geom_io.c Wed Sep 2 17:29:30 2015 (r287405) @@ -71,7 +71,17 @@ static struct g_bioq g_bio_run_down; static struct g_bioq g_bio_run_up; static struct g_bioq g_bio_run_task; -static u_int pace; +/* + * Pace is a hint that we've had some trouble recently allocating + * bios, so we should back off trying to send I/O down the stack + * a bit to let the problem resolve. When pacing, we also turn + * off direct dispatch to also reduce memory pressure from I/Os + * there, at the expxense of some added latency while the memory + * pressures exist. See g_io_schedule_down() for more details + * and limitations. + */ +static volatile u_int pace; + static uma_zone_t biozone; /* @@ -521,7 +531,8 @@ g_io_request(struct bio *bp, struct g_co (pp->flags & G_PF_DIRECT_RECEIVE) != 0 && !g_is_geom_thread(curthread) && ((pp->flags & G_PF_ACCEPT_UNMAPPED) != 0 || - (bp->bio_flags & BIO_UNMAPPED) == 0 || THREAD_CAN_SLEEP()); + (bp->bio_flags & BIO_UNMAPPED) == 0 || THREAD_CAN_SLEEP()) && + pace == 0; if (direct) { /* Block direct execution if less then half of stack left. */ size_t st, su; @@ -688,7 +699,7 @@ g_io_deliver(struct bio *bp, int error) bp->bio_driver2 = NULL; bp->bio_pflags = 0; g_io_request(bp, cp); - pace++; + pace = 1; return; } @@ -777,10 +788,33 @@ g_io_schedule_down(struct thread *tp __u } CTR0(KTR_GEOM, "g_down has work to do"); g_bioq_unlock(&g_bio_run_down); - if (pace > 0) { - CTR1(KTR_GEOM, "g_down pacing self (pace %d)", pace); - pause("g_down", hz/10); - pace--; + if (pace != 0) { + /* + * There has been at least one memory allocation + * failure since the last I/O completed. Pause 1ms to + * give the system a chance to free up memory. We only + * do this once because a large number of allocations + * can fail in the direct dispatch case and there's no + * relationship between the number of these failures and + * the length of the outage. If there's still an outage, + * we'll pause again and again until it's + * resolved. Older versions paused longer and once per + * allocation failure. This was OK for a single threaded + * g_down, but with direct dispatch would lead to max of + * 10 IOPs for minutes at a time when transient memory + * issues prevented allocation for a batch of requests + * from the upper layers. + * + * XXX This pacing is really lame. It needs to be solved + * by other methods. This is OK only because the worst + * case scenario is so rare. In the worst case scenario + * all memory is tied up waiting for I/O to complete + * which can never happen since we can't allocate bios + * for that I/O. + */ + CTR0(KTR_GEOM, "g_down pacing self"); + pause("g_down", min(hz/1000, 1)); + pace = 0; } CTR2(KTR_GEOM, "g_down processing bp %p provider %s", bp, bp->bio_to->name);