From: Alan Somers <asomers@gmail.com>
Date: Fri, 10 Jul 2020 17:13:52 -0600
Subject: Re: Right-sizing the geli thread pool
To: Mark Johnston
Cc: Mateusz Guzik, FreeBSD Hackers

On Fri, Jul 10, 2020 at 11:55 AM Mark Johnston wrote:

> On Fri, Jul 10, 2020 at 05:55:50AM +0200, Mateusz Guzik wrote:
> > On 7/9/20, Alan Somers wrote:
> > > Currently, geli creates a separate thread pool for each provider, and
> > > by default each thread pool contains one thread per CPU.  On a large
> > > server with many encrypted disks, that can balloon into a very large
> > > number of threads!  I have a patch in progress that switches from
> > > per-provider thread pools to a single thread pool for the entire
> > > module.  Happily, I see read IOPs increase by up to 60%.  But to my
> > > surprise, write IOPs _decrease_ by up to 25%.  dtrace suggests that
> > > the CPU usage is dominated by the vmem_free call in biodone, as in
> > > the stack below.
> > >
> > >           kernel`lock_delay+0x32
> > >           kernel`biodone+0x88
> > >           kernel`g_io_deliver+0x214
> > >           geom_eli.ko`g_eli_write_done+0xf6
> > >           kernel`g_io_deliver+0x214
> > >           kernel`md_kthread+0x275
> > >           kernel`fork_exit+0x7e
> > >           kernel`0xffffffff8104784e
> > >
> > > I only have one idea for how to improve things from here.  The geli
> > > thread pool is still fed by a single global bio queue.  That could
> > > cause cache thrashing if bios get moved between cores too often.  I
> > > think a superior design would be to use a separate bio queue for each
> > > geli thread, and use work-stealing to balance them.  However,
> > >
> > > 1) That doesn't explain why this change benefits reads more than
> > >    writes, and
> > > 2) work-stealing is hard to get right, and I can't find any examples
> > >    in the kernel.
> > >
> > > Can anybody offer tips or code for implementing work stealing?  Or
> > > any other suggestions about why my write performance is suffering?
> > > I would like to get this change committed, but not without resolving
> > > that issue.
> >
> > I can't comment on revamping the design, but:
> >
> > void
> > vmem_free(vmem_t *vm, vmem_addr_t addr, vmem_size_t size)
> > {
> >         qcache_t *qc;
> >         MPASS(size > 0);
> >
> >         if (size <= vm->vm_qcache_max &&
> >             __predict_true(addr >= VMEM_ADDR_QCACHE_MIN)) {
> >                 qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
> >                 uma_zfree(qc->qc_cache, (void *)addr);
> >         } else
> >                 vmem_xfree(vm, addr, size);
> > }
> >
> > What sizes are being passed here?  Or more to the point, is it
> > feasible to bump qcache to stick to uma in this call?  If lock
> > contention is indeed coming from vmem_xfree this change would get rid
> > of the problem without having to rework anything.
>
> We would have to enable the quantum cache in the transient KVA arena.
> This itself should not have many downsides on platforms with plenty of
> KVA, but it only solves the immediate problem: before freeing the KVA,
> biodone() has to perform a global TLB shootdown, and the quantum cache
> doesn't help at all with that.
>
> kib's suggestion of using sf_buf(9) to transiently map crypto(9)
> payloads in software crypto drivers that require a mapping, and using
> unmapped cryptop requests for hardware drivers that do not, sounds like
> the right solution.  cryptop structures can already handle multiple
> data container types, like uios and mbufs, so it should be possible to
> also support vm_page arrays in OCF like we do for unmapped BIOs, and
> let crypto(9) drivers create transient mappings when necessary.
>
> > For read performance, while it is nice there is a win, it may still
> > be less than it should be.  I think it is prudent to get a flamegraph
> > from both cases.

And, it works!  Using sf_buf was a good suggestion, because the write
path only needs to access the user data for a short amount of time, in a
CPU-pinned thread.  So I can create my sf_buf and immediately free it.
I didn't even need to change crypto(9).  Writes are much faster now.  I
haven't incorporated sf_buf into the read path yet, though.

Preliminary benchmarks:

                                              Write IOPs    Read IOPs
baseline                                           66908        60351
kern.geom.eli.threads=1                           156719       144117
kern.geom.eli.threads=1, direct dispatch          205326       201867
direct dispatch, shared thread pool               156631       226874
direct dispatch, shared thread pool, sf_buf       432573       247275

-Alan
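
For reference, a minimal sketch of what the per-thread bio queues with
work stealing discussed above could look like.  The names (struct worker,
worker_enqueue, worker_dequeue) are made up for illustration; this is a
sketch of the idea, not the geli patch or its existing worker structures.

/*
 * Sketch only: per-worker bio queues with work stealing.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct worker {
	struct bio_queue_head	w_queue;	/* this worker's private queue */
	struct mtx		w_lock;		/* protects w_queue */
};

/* Producer side: queue a bio on the chosen worker and wake it. */
static void
worker_enqueue(struct worker *w, struct bio *bp)
{

	mtx_lock(&w->w_lock);
	bioq_insert_tail(&w->w_queue, bp);
	mtx_unlock(&w->w_lock);
	wakeup(w);
}

/*
 * Consumer side: drain the local queue first; only when it is empty, scan
 * the other workers and steal one bio.  Starting the scan just past the
 * local index spreads stealing pressure instead of hammering worker 0.
 */
static struct bio *
worker_dequeue(struct worker *self, struct worker *workers, int nworkers)
{
	struct worker *victim;
	struct bio *bp;
	int i, start;

	mtx_lock(&self->w_lock);
	bp = bioq_takefirst(&self->w_queue);
	mtx_unlock(&self->w_lock);
	if (bp != NULL)
		return (bp);

	start = (int)(self - workers) + 1;
	for (i = 0; i < nworkers - 1; i++) {
		victim = &workers[(start + i) % nworkers];
		mtx_lock(&victim->w_lock);
		bp = bioq_takefirst(&victim->w_queue);
		mtx_unlock(&victim->w_lock);
		if (bp != NULL)
			return (bp);
	}
	return (NULL);	/* caller sleeps until worker_enqueue() wakes it */
}

A mutex per queue keeps the common, non-stealing case down to one
uncontended lock acquisition; lock-free deques would avoid even that, but
as noted above they are much harder to get right.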
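
On the quantum-cache side question: an arena's cache is sized when the
arena is created, via the qcache_max argument to vmem_create(9).  A rough
sketch under that assumption follows; the arena name, base, and sizes are
placeholders, not the kernel's actual transient-arena setup.

/*
 * Sketch only: creating a vmem arena with the quantum cache enabled.
 */
#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/vmem.h>

static vmem_t *example_arena;

static void
example_arena_init(vmem_addr_t base, vmem_size_t size)
{
	/*
	 * A non-zero fifth argument (qcache_max) makes frees of up to that
	 * size go through per-size UMA zones in vmem_free() instead of
	 * taking the vmem_xfree() path and the arena lock.
	 */
	example_arena = vmem_create("example arena", base, size,
	    PAGE_SIZE, 16 * PAGE_SIZE, M_WAITOK);
}

As pointed out above, though, this only moves the free off the arena
lock; it does nothing about the TLB shootdown in biodone().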
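
Finally, a minimal sketch of the write-path pattern described above: map
a payload page with sf_buf(9) just long enough to copy out of it, then
free the mapping immediately.  The function name and surrounding details
are illustrative only, not the actual patch, and it assumes the calling
thread stays pinned to one CPU while the mapping exists, as the message
above notes is the case for the geli worker threads.

/*
 * Sketch only: transiently map one payload page with sf_buf(9), copy the
 * data, and drop the mapping right away.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/sf_buf.h>
#include <vm/vm.h>

static void
copy_from_page(vm_page_t m, void *dst, size_t len)
{
	struct sf_buf *sf;

	KASSERT(len <= PAGE_SIZE, ("copy_from_page: len %zu too large", len));

	/*
	 * SFB_CPUPRIVATE maps the page on the current CPU only, so tearing
	 * the mapping down does not require a cross-CPU TLB shootdown.
	 * That is only safe because the caller never migrates while the
	 * mapping exists.  On direct-map platforms (e.g. amd64) the
	 * mapping is effectively free.
	 */
	sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
	memcpy(dst, (const void *)sf_buf_kva(sf), len);
	sf_buf_free(sf);
}

Creating and destroying the mapping on a single CPU avoids the transient
KVA free and shootdown in biodone() for writes, which lines up with the
jump in write IOPs in the last row of the table above.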