From owner-freebsd-arch@freebsd.org Sat Nov 14 20:39:46 2020
From: Warner Losh
Date: Sat, 14 Nov 2020 13:39:32 -0700
Subject: Re: MAXPHYS bump for FreeBSD 13
To: Konstantin Belousov
Cc: Scott Long, Alexander Motin, "freebsd-arch@freebsd.org"
List-Id: Discussion related to FreeBSD architecture

On Sat, Nov 14, 2020, 12:43 PM Konstantin Belousov wrote:

> On Sat, Nov 14, 2020 at 11:48:34AM -0700, Scott Long wrote:
> >
> > > On Nov 14, 2020, at 11:37 AM, Konstantin Belousov wrote:
> > >
> > > On Sat, Nov 14, 2020 at 10:01:05AM -0500, Alexander Motin wrote:
> > >> On Fri, 13 Nov 2020 21:09:37 +0200 Konstantin Belousov wrote:
> > >>> To put the specific numbers, for struct buf it means an increase of 1792
> > >>> bytes.  For bio it does not, because it does not embed vm_page_t[] into
> > >>> the structure.
> > >>>
> > >>> Worse, the typical struct buf addend for the excess vm_page pointers is
> > >>> going to be unused, because the normal size of a UFS block is 32K.  It
> > >>> is going to be used only by clusters and physbufs.
> > >>>
> > >>> So I object to bumping this value without reworking the buffers'
> > >>> handling of b_pages[].  The most straightforward approach is to stop
> > >>> using MAXPHYS to size this array, and use an external array for
> > >>> clusters.  Pbufs can embed a large array.
> > >>
> > >> I am not very familiar with struct buf usage, so I'd appreciate some
> > >> help there.
> > >>
> > >> Quickly looking at pbuf, it seems trivial to allocate an external
> > >> b_pages array of any size in pbuf_init; that should easily satisfy all
> > >> of pbuf's descendants.  The cluster and vnode/swap pager code are pbuf
> > >> descendants also.  The vnode pager, I guess, may only need a
> > >> replacement for nitems(bp->b_pages) in a few places.
> > > I planned to look at making MAXPHYS a tunable.
> > >
> > > You are right, we would need:
> > > 1. Move b_pages to the end of struct buf and declare it as a flexible
> > > array.  This would make the KBI worse, because struct buf depends on
> > > some debugging options, and then the b_pages offset depends on the
> > > config.
> > >
> > > Another option could be to change b_pages to a pointer, if we are fine
> > > with one more indirection.  But in my plan the real array is always
> > > allocated past struct buf, so a flexible array is even more correct.
> > >
> > I like this, and I was in the middle of writing up an email that
> > described it.  There could be multiple malloc types or UMA zones of
> > different sizes, depending on the intended I/O size, or just a runtime
> > change to the size of a single allocation.
> I do not think we need new/many zones.
>
> Queued (getnewbuf()) bufs come from buf_zone, and pbufs are allocated
> from pbuf_zone.  That should be a fixed alloc size, with a small
> b_pages[] for buf_zone, and a large (MAXPHYS) one for pbuf.
>
> Everything else, if any, would need to pre-calculate the malloc size.

How will this affect clustered reads for things like read-ahead?

> > > 2. Preallocate both normal bufs and pbufs together with the arrays.
> > >
> > > 3. I considered adding a B_SMALLPAGES flag to b_flags and using it to
> > > indicate that the buffer has 'small' b_pages.  All buffers rotated
> > > through getnewbuf()/buf_alloc() should have it set.
> > >
> > This would work nicely with a variable-sized allocator, yes.
> >
> > > 4. There could be some places which either malloc() or allocate struct
> > > buf on the stack (I tend to believe that I converted all the latter
> > > places to the former).  They would need special handling.
> > >
> > I couldn't find any places that allocated a buf on the stack or embedded
> > it into another structure.
> As I said, I did a pass to eliminate stack allocations for bufs.
> As a result, for instance, flushbufqueues() mallocs a struct buf, but it
> does not use the b_pages[] of the allocated sentinel.

Yeah, I recall both the pass and looking for them later and not finding any
either...

> > > md(4) uses pbufs.
> > >
> > > 5. My larger concern is, in fact, CAM and drivers.
> > >
> > Can you describe your concern?
> My opinion is that during this work all uses of MAXPHYS should be
> reviewed, and there are a lot of drivers that reference the constant.
> From past experience, I expect some evil, ingenious (ab)use.
>
> Same for bufs, but apart from the application of pbufs in cam_periph.c, I
> do not think drivers have much use of it.

Do you have precise definitions for DFLTPHYS and MAXPHYS?  That might help
ferret out the differences between the two.  I have seen several places that
use one or the other of these in ways that seem incorrect, but I can't quite
articulate precisely why... having a good definition articulated would help.
There are some places that likely want a fixed constant that reflects the
hardware, not a FreeBSD tuning parameter.
As an aside, there are times I want to do transfers of arbitrary sizes for
certain pass-through commands that are vendor-specific and that have no way
to read the results in chunks.  Thankfully most newer drives don't have this
restriction, but it still comes up.  That's way below the buf layer, though,
and is handled today by cam_periph and the pbufs there.  These types of
operations are rare and typically happen when the system is mostly idle, so
low-memory situations can be ignored beyond error handling and retry in the
user program.  Would this work make those possible?  Or would MAXPHYS,
however it is set, still limit them?

Warner

> > >>
> > >> Could you or somebody help with the vfs/ffs code, where I suppose the
> > >> smaller page lists are used?
> > > Do you plan to work on this?  I can help, sure.
> > >
> > > Still, I wanted to make MAXPHYS (and the 'small' MAXPHYS, which is not
> > > the same as DFLTPHYS) a tunable, in the scope of this work.
> >
> > Sounds great, thank you for looking at it.
> >
> > Scott
>
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"