From owner-freebsd-hackers@freebsd.org Tue Jul 14 16:37:12 2020
From: Alan Somers
Date: Tue, 14 Jul 2020 10:36:56 -0600
Subject: Re: Right-sizing the geli thread pool
Cc: Mark Johnston, Mateusz Guzik, FreeBSD Hackers

On Fri, Jul 10, 2020 at 5:13 PM
Alan Somers wrote:

> On Fri, Jul 10, 2020 at 11:55 AM Mark Johnston wrote:
>
>> On Fri, Jul 10, 2020 at 05:55:50AM +0200, Mateusz Guzik wrote:
>> > On 7/9/20, Alan Somers wrote:
>> > > Currently, geli creates a separate thread pool for each provider, and by default each thread pool contains one thread per cpu. On a large server with many encrypted disks, that can balloon into a very large number of threads! I have a patch in progress that switches from per-provider thread pools to a single thread pool for the entire module. Happily, I see read IOPs increase by up to 60%. But to my surprise, write IOPs _decreases_ by up to 25%. dtrace suggests that the CPU usage is dominated by the vmem_free call in biodone, as in the below stack.
>> > >
>> > >         kernel`lock_delay+0x32
>> > >         kernel`biodone+0x88
>> > >         kernel`g_io_deliver+0x214
>> > >         geom_eli.ko`g_eli_write_done+0xf6
>> > >         kernel`g_io_deliver+0x214
>> > >         kernel`md_kthread+0x275
>> > >         kernel`fork_exit+0x7e
>> > >         kernel`0xffffffff8104784e
>> > >
>> > > I only have one idea for how to improve things from here. The geli thread pool is still fed by a single global bio queue. That could cause cache thrashing, if bios get moved between cores too often. I think a superior design would be to use a separate bio queue for each geli thread, and use work-stealing to balance them. However,
>> > >
>> > > 1) That doesn't explain why this change benefits reads more than writes, and
>> > > 2) work-stealing is hard to get right, and I can't find any examples in the kernel.
>> > >
>> > > Can anybody offer tips or code for implementing work stealing? Or any other suggestions about why my write performance is suffering? I would like to get this change committed, but not without resolving that issue.
>> > >
>> >
>> > I can't comment on revamping the design, but:
>> >
>> > void
>> > vmem_free(vmem_t *vm, vmem_addr_t addr, vmem_size_t size)
>> > {
>> >         qcache_t *qc;
>> >         MPASS(size > 0);
>> >
>> >         if (size <= vm->vm_qcache_max &&
>> >             __predict_true(addr >= VMEM_ADDR_QCACHE_MIN)) {
>> >                 qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
>> >                 uma_zfree(qc->qc_cache, (void *)addr);
>> >         } else
>> >                 vmem_xfree(vm, addr, size);
>> > }
>> >
>> > What sizes are being passed here? Or more to the point, is it feasible to bump qcache to stick to uma in this call? If lock contention is indeed coming from vmem_xfree this change would get rid of the problem without having to rework anything.
>>
>> We would have to enable the quantum cache in the transient KVA arena. This itself should not have many downsides on platforms with plenty of KVA, but it only solves the immediate problem: before freeing the KVA biodone() has to perform a global TLB shootdown, and the quantum cache doesn't help at all with that.
>>
>> kib's suggestion of using sf_buf(9) to transiently map crypto(9) payloads in software crypto drivers that require a mapping, and using unmapped cryptop requests for hardware drivers that do not, sounds like the right solution. cryptop structures can already handle multiple data container types, like uios and mbufs, so it should be possible to also support vm_page arrays in OCF like we do for unmapped BIOs, and let crypto(9) drivers create transient mappings when necessary.
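To make that concrete for the archives, the transient mapping in the write path ends up looking roughly like the sketch below. This is illustrative only, not the actual patch: the function name is invented, the transform callback is a stand-in for geli's real per-page encryption step, and error handling is pared down.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/sf_buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Sketch: transiently map one page of an unmapped write BIO so a
 * software transform can read it, then drop the mapping right away.
 */
static int
g_eli_map_and_transform_page(vm_page_t m, size_t len,
    int (*transform)(void *kva, size_t len))
{
        struct sf_buf *sf;
        int error;

        /*
         * sf_buf_alloc(9) gives a short-lived KVA mapping without going
         * through the transient KVA arena, so biodone() never has to do
         * the global TLB shootdown that showed up in the profile.  With
         * flags == 0 this may sleep; SFB_NOWAIT makes it fail instead.
         */
        sf = sf_buf_alloc(m, 0);
        if (sf == NULL)
                return (ENOMEM);

        error = transform((void *)sf_buf_kva(sf), len);

        /* The mapping is only needed while the transform runs. */
        sf_buf_free(sf);
        return (error);
}

Since the worker thread is pinned to a CPU while it touches the data, SFB_CPUPRIVATE may also be worth trying, to keep the mapping CPU-private and avoid remote invalidations altogether.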
>> > For read performance, while it is nice there is a win, it may still be less than it should. I think it is prudent to get a flamegraph from both cases.
>
> And, it works! Using sf_buf was a good suggestion, because the write path only needs to access the user data for a short amount of time, in a CPU-pinned thread. So I can create my sf_buf and immediately free it. I didn't even need to change crypto(9). Writes are much faster now. I haven't incorporated sf_buf into the read path yet, though.
>
> Preliminary benchmarks:
>
>                                               Write IOPs   Read IOPs
> baseline                                           66908       60351
> kern.geom.eli.threads=1                           156719      144117
> kern.geom.eli.threads=1, direct dispatch          205326      201867
> direct dispatch, shared thread pool               156631      226874
> direct dispatch, shared thread pool, sf_buf       432573      247275
>
> -Alan

Ok, I got ahead of myself. Actually, it _doesn't_ work. I was initially elated by how much extra speed I got by using sf_bufs. But even so, writes are still slower (and reads faster) when I use a shared thread pool. I'll work on committing the sf_buf changes, then try to debug the shared thread pool performance degradation again.

-Alan
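P.S. For anyone who wants to experiment with the per-thread queue idea from upthread, the shape I have in mind is roughly the sketch below. It is only a sketch: struct g_eli_worker_queue, g_eli_wq, and g_eli_nworkers are invented names, there is no batching or clever victim selection, and a real patch would still have to handle worker wakeups and draining.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/* One bio queue per worker thread (invented layout for this sketch). */
struct g_eli_worker_queue {
        struct mtx              wq_lock;
        struct bio_queue_head   wq_queue;
} __aligned(CACHE_LINE_SIZE);

static struct g_eli_worker_queue *g_eli_wq;     /* g_eli_nworkers entries */
static u_int g_eli_nworkers;

/* Producer side: the start routine picks a queue, e.g. by current CPU. */
static void
g_eli_wq_enqueue(struct bio *bp, u_int idx)
{
        struct g_eli_worker_queue *wq;

        wq = &g_eli_wq[idx % g_eli_nworkers];
        mtx_lock(&wq->wq_lock);
        bioq_insert_tail(&wq->wq_queue, bp);
        mtx_unlock(&wq->wq_lock);
        /* A real implementation would wake the owning worker here. */
}

/*
 * Consumer side: worker "me" drains its own queue first and only scans
 * the other queues (steals) when its own is empty, so a bio is normally
 * processed on the CPU that queued it.
 */
static struct bio *
g_eli_wq_dequeue(u_int me)
{
        struct g_eli_worker_queue *wq;
        struct bio *bp;
        u_int i;

        for (i = 0; i < g_eli_nworkers; i++) {
                wq = &g_eli_wq[(me + i) % g_eli_nworkers];
                mtx_lock(&wq->wq_lock);
                bp = bioq_takefirst(&wq->wq_queue);
                mtx_unlock(&wq->wq_lock);
                if (bp != NULL)
                        return (bp);
        }
        return (NULL);          /* Caller goes back to sleep. */
}

The point is that stealing is the slow path: as long as every worker keeps up, bios stay on the queue (and CPU) where they were enqueued, which is exactly the cache behavior a single shared queue can't guarantee.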