Date:      Sat, 14 Nov 2020 23:57:00 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Warner Losh <imp@bsdimp.com>
Cc:        Scott Long <scottl@samsco.org>, Alexander Motin <mav@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject:   Re: MAXPHYS bump for FreeBSD 13
Message-ID:  <X7BSrM9bCna33ks4@kib.kiev.ua>
In-Reply-To: <CANCZdfroqvHhwBxhbB1SqcZgM6o_pcAFxWqPQ3Y4Z+Zqd1FCTA@mail.gmail.com>
References:  <aac44f21-3a6d-c697-bb52-1a669b11aa3b@FreeBSD.org> <X7Aj0a6ZtIQfBscC@kib.kiev.ua> <634E35E9-9E2B-40CE-9C70-BB130BD9D614@samsco.org> <X7AzRMZ5FSRpzXqo@kib.kiev.ua> <CANCZdfroqvHhwBxhbB1SqcZgM6o_pcAFxWqPQ3Y4Z+Zqd1FCTA@mail.gmail.com>

On Sat, Nov 14, 2020 at 01:39:32PM -0700, Warner Losh wrote:
> On Sat, Nov 14, 2020, 12:43 PM Konstantin Belousov <kostikbel@gmail.com>
> wrote:
> 
> > On Sat, Nov 14, 2020 at 11:48:34AM -0700, Scott Long wrote:
> > >
> > >
> > > > On Nov 14, 2020, at 11:37 AM, Konstantin Belousov <kostikbel@gmail.com>
> > wrote:
> > > >
> > > > On Sat, Nov 14, 2020 at 10:01:05AM -0500, Alexander Motin wrote:
> > > >> On Fri, 13 Nov 2020 21:09:37 +0200 Konstantin Belousov wrote:
> > > >>> To put specific numbers on it: for struct buf it means an increase of
> > > >>> 1792 bytes.  For bio it does not, because bio does not embed a
> > > >>> vm_page_t[] into the structure.
> > > >>>
> > > >>> Worse, the typical struct buf addend for the excess vm_page pointers
> > > >>> is going to be unused, because the normal UFS block size is 32K.  It
> > > >>> would only be used by clusters and physbufs.
> > > >>>
> > > >>> So I object to bumping this value without reworking the buffers'
> > > >>> handling of b_pages[].  The most straightforward approach is to stop
> > > >>> using MAXPHYS to size this array and to use an external array for
> > > >>> clusters.  Pbufs can embed a large array.
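
(As a quick sanity check of the 1792-byte figure, assuming the proposal is to
bump MAXPHYS from the current 128K to 1M, with 4K pages and 8-byte pointers:
1M/4K - 128K/4K = 256 - 32 = 224 extra vm_page_t pointers, and 224 * 8 = 1792
bytes added to each struct buf.)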
> > > >>
> > > >> I am not very familiar with struct buf usage, so I'd appreciate some
> > > >> help there.
> > > >>
> > > >> Quickly looking at pbuf, it seems trivial to allocate an external
> > > >> b_pages array of any size in pbuf_init; that should easily satisfy
> > > >> all of the pbuf descendants.  The cluster and vnode/swap pager code
> > > >> are pbuf descendants as well.  The vnode pager, I guess, may only
> > > >> need a replacement for nitems(bp->b_pages) in a few places.
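
A minimal sketch of that suggestion (illustrative only: it assumes b_pages is
first changed from an embedded array to a pointer, and pbuf_pages_init() is a
made-up helper name, not existing code):

/*
 * Give each pbuf an externally allocated page array large enough for a
 * full MAXPHYS transfer, instead of the compile-time sized embedded
 * b_pages[].
 */
static void
pbuf_pages_init(struct buf *bp)
{
        bp->b_npages = 0;
        bp->b_pages = malloc(howmany(MAXPHYS, PAGE_SIZE) * sizeof(vm_page_t),
            M_TEMP, M_WAITOK | M_ZERO);
}
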
> > > > I planned to look at making MAXPHYS a tunable.
> > > >
> > > > You are right, we would need:
> > > > 1. Move b_pages to the end of struct buf and declare it as a flexible
> > > > array.  This would make the KBI worse, because struct buf depends on
> > > > some debugging options, and then the b_pages offset would depend on
> > > > the config.
> > > >
> > > > Another option would be to change b_pages to a pointer, if we are fine
> > > > with one more indirection.  But in my plan the real array is always
> > > > allocated past struct buf, so a flexible array is actually more
> > > > correct.
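
For illustration, the flexible-array layout would look roughly like this (a
toy userland sketch, not the real struct buf; only the two relevant fields
are shown):

#include <stdlib.h>

typedef struct vm_page *vm_page_t;      /* stand-in for the kernel type */

struct buf_sketch {
        int             b_npages;       /* pages currently installed */
        vm_page_t       b_pages[];      /* real array allocated past the struct */
};

static struct buf_sketch *
buf_sketch_alloc(int maxpages)
{
        /* one allocation covers the struct plus its page array */
        return (calloc(1, sizeof(struct buf_sketch) +
            maxpages * sizeof(vm_page_t)));
}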
> > > >
> > >
> > > I like this, and I was in the middle of writing up an email that
> > > described it.  There could be multiple malloc types or UMA zones of
> > > different sizes, depending on the intended I/O size, or just a runtime
> > > change to a single allocation size.
> > I do not think we need new/many zones.
> >
> > Queued (getnewbuf()) bufs come from buf_zone, and pbufs are allocated
> > from pbuf_zone.  Those should be fixed allocation sizes, with a small
> > b_pages[] for buf_zone and a large (MAXPHYS-sized) one for pbuf.
> >
> > Everything else, if any, would need to pre-calculate the malloc size.
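
Roughly, the sizing at zone-creation time could then look like this (a sketch
on top of the flexible-array layout above; maxbcachebuf stands for the normal
buffer-cache block limit, maxphys for the would-be tunable, and the ctor/dtor
details of the real zones are omitted):

size_t bufsize  = sizeof(struct buf) +
    howmany(maxbcachebuf, PAGE_SIZE) * sizeof(vm_page_t);
size_t pbufsize = sizeof(struct buf) +
    howmany(maxphys, PAGE_SIZE) * sizeof(vm_page_t);

buf_zone  = uma_zcreate("buf free cache", bufsize, NULL, NULL, NULL, NULL,
    UMA_ALIGN_PTR, 0);
pbuf_zone = uma_zcreate("pbuf", pbufsize, NULL, NULL, NULL, NULL,
    UMA_ALIGN_PTR, 0);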
> >
> 
> How will this affect clustered reads for things like read-ahead?
kern/vfs_cluster.c uses pbufs to create a temporary buf by combining pages
from the constituent normal (queued) buffers.  According to the discussion,
pbufs would keep b_pages[]/KVA reserved for the full MAXPHYS.  This allows
the cluster code to fill a large request for read-ahead or background write.
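
Informally, the cluster build does something like the following with the
pbuf's page array (a paraphrase of the idea, not the literal cluster_rbuild()
code):

/* Gather the component buffers' pages into the cluster pbuf, so that one
 * large I/O covers several smaller cache blocks. */
for (i = 0; i < ncomponents; i++) {
        tbp = component[i];
        for (j = 0; j < tbp->b_npages; j++)
                bp->b_pages[bp->b_npages++] = tbp->b_pages[j];
}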

> 
> >
> > > > 2. Preallocating both normal bufs and pbufs together with the arrays.
> > > >
> > > > 3. I considered adding a B_SMALLPAGES flag to b_flags and using it to
> > > > indicate that the buffer has a 'small' b_pages.  All buffers rotated
> > > > through getnewbuf()/buf_alloc() should have it set.
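
With such a flag, code that fills b_pages[] could assert that a 'small'
buffer never overflows its array, e.g. (B_SMALLPAGES and buf_pages_small are
hypothetical names from this proposal, not existing symbols):

KASSERT((bp->b_flags & B_SMALLPAGES) == 0 ||
    bp->b_npages <= buf_pages_small,
    ("buf %p overflows its small b_pages[]", bp));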
> > > >
> > >
> > > This would work nicely with a variable sized allocator, yes.
> > >
> > > > 4. There could be some places which either malloc() struct buf or
> > > > allocate it on the stack (I tend to believe that I converted all of
> > > > the latter to the former).  They would need special handling.
> > > >
> > >
> > > I couldn’t find any places that allocated a buf on the stack or embedded
> > > it into another structure.
> > As I said, I did a pass to eliminate stack allocations for bufs.
> > As a result, for instance, flushbufqueues() mallocs a struct buf, but it
> > does not use the b_pages[] of the allocated sentinel.
> >
> 
> Yea. I recall both the pass and looking for them later and not finding any
> either...
> 
> >
> > > > md(4) uses pbufs.
> > > >
> > > > 5. My larger concern is, in fact, CAM and the drivers.
> > > >
> > >
> > > Can you describe your concern?
> > My opinion is that during this work all uses of MAXPHYS should be
> > reviewed, and there are a lot of drivers that reference the constant.
> > From past experience, I expect some evil, ingenious (ab)use.
> >
> > Same for bufs, but apart from the use of pbufs in cam_periph.c, I do not
> > think drivers make much use of them.
> >
> 
> Do you have precise definitions for DFLTPHYS and MAXPHYS? That might help
> ferret out the differences between the two. I have seen several places that
> use one or the other of these in ways that seem incorrect, but I can't quite
> articulate precisely why... having a good definition spelled out would help.
> There are some places that likely want a fixed constant reflecting the
> hardware, not a FreeBSD tuning parameter.
Right now VFS guarantees that it never creates an I/O request (bio?) larger
than MAXPHYS.  In fact, VMIO buffers simply do not allow expressing such a
request, because there is no place to put more pages.

DFLTPHYS seems to be used only by drivers (and some geoms), and a typical
driver's use of it is to clamp the max I/O request further, below MAXPHYS.
I see that the dump code tries not to write more than DFLTPHYS at a time, to
ease the life of drivers, and physio() sanitizes maxio at DFLTPHYS, but that
is only for really broken drivers.
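
The typical driver-side use being described is just a clamp when the disk is
created, roughly like this (d_maxsize is the disk(9) field; sc->hw_maxio
stands in for whatever limit the hardware actually reports):

/* Advertise the smaller of the controller limit and DFLTPHYS. */
sc->disk->d_maxsize = MIN(sc->hw_maxio, DFLTPHYS);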

> 
> As an aside, there are times I want to do transfers of arbitrary sizes for
> certain pass-through commands that are vendor-specific and that have no way
> to read the results in chunks. Thankfully most newer drives don't have this
> restriction, but it still comes up. But that's way below the buf layer and
> is handled today by cam_periph and the pbufs there. These types of operations
> are rare and typically happen when the system is mostly idle, so low-memory
> situations can be ignored beyond error handling and retry in the user
> program. Would this work make those possible? Or would MAXPHYS, however it
> is set, still limit them?
MAXPHYS would still limit them, at least in the scope of work we are
discussing.

> 
> Warner
> 
> >
> > > >>
> > > >> Could you or somebody help with the vfs/ffs code, where I suppose
> > > >> the smaller page lists are used?
> > > > Do you plan to work on this?  I can help, sure.
> > > >
> > > > Still, in the scope of this work I wanted to make MAXPHYS (and the
> > > > 'small' MAXPHYS, which is not the same as DFLTPHYS) a tunable.
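
A sketch of how the tunable could look (the kern.maxphys knob name and the
variable are assumptions of this sketch, not committed code):

u_long maxphys = MAXPHYS;       /* compile-time default, overridable at boot */
TUNABLE_ULONG("kern.maxphys", &maxphys);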
> > >
> > > Sounds great, thank you for looking at it.
> > >
> > > Scott
> > >
> > _______________________________________________
> > freebsd-arch@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-arch
> > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
> >


