Date:      Thu, 20 Dec 2012 01:25:03 -0600
From:      Alan Cox <alc@rice.edu>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        alc@freebsd.org, arch@freebsd.org
Subject:   Re: Unmapped I/O
Message-ID:  <50D2BD4F.7010204@rice.edu>
In-Reply-To: <20121219192838.GZ71906@kib.kiev.ua>
References:  <20121219135451.GU71906@kib.kiev.ua> <CAJUyCcNuD_TWR6xxFxVqDi4-eBGx3Jjs21eBxaZYYVUERESbMw@mail.gmail.com> <20121219192838.GZ71906@kib.kiev.ua>

On 12/19/2012 13:28, Konstantin Belousov wrote:
> On Wed, Dec 19, 2012 at 12:58:41PM -0600, Alan Cox wrote:
>> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:
>>
>>> One of the known FreeBSD I/O path performance bottlenecks is the
>>> necessity to map each I/O buffer's pages into KVA.  The problem is that
>>> on the multi-core machines, the mapping must flush TLB on all cores,
>>> due to the global mapping of the buffer pages into the kernel.  This
>>> means that buffer creation and destruction disrupts execution of all
>>> other cores to perform TLB shootdown through IPI, and the thread
>>> initiating the shootdown must wait for all other cores to execute and
>>> report.
>>>
>>> The patch at
>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>>> implements 'unmapped buffers': the ability to create a VMIO struct
>>> buf which does not point to a KVA mapping of the buffer pages.  Since
>>> there is no mapping, the kernel does not need to flush the TLB.  The
>>> unmapped buffers are marked with the new
>>> B_NOTMAPPED flag, and should be requested explicitly using the
>>> GB_NOTMAPPED flag to the buffer allocation routines.  If the mapped
>>> buffer is requested but unmapped buffer already exists, the buffer
>>> subsystem automatically maps the pages.
>>>
>>> The clustering code is also made aware of the not-mapped buffers, but
>>> this required the KPI change that accounts for the diff in the non-UFS
>>> filesystems.
>>>
>>> UFS is adapted to request unmapped buffers when the kernel does not need
>>> to access the content, i.e. mostly for the file data.  New helper
>>> function vn_io_fault_pgmove() operates on the unmapped array of pages.
>>> It calls new pmap method pmap_copy_pages() to do the data move to and
>>> from usermode.
>>>
>>> Besides unmapped buffers, unmapped BIOs are introduced, marked
>>> with the flag BIO_NOTMAPPED.  Unmapped buffers are directly translated
>>> to unmapped BIOs.  Geom providers may indicate an acceptance of the
>>> unmapped BIOs.  If a provider does not handle unmapped i/o requests,
>>> geom now automatically establishes transient mapping for the i/o
>>> pages.
>>>
>>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The
>>> gpart providers indicate the unmapped BIOs support if the underlying
>>> provider can do unmapped i/o.  I also hacked ahci(4) to handle
>>> unmapped i/o, but this should be changed after Jeff's physbio patch
>>> is committed, to use proper busdma interface.
>>>
>>> Besides, the swap pager does unmapped swapping if the swap partition
>>> indicated that it can do unmapped i/o.  At Jeff's request, the buffer
>>> allocation code may reserve KVA for an unmapped buffer in advance.
>>> The unmapped page-in for the vnode pager is also implemented if the
>>> filesystem supports it, but the page-out is not.  The page-out, as well
>>> as the vnode-backed md(4), currently require mappings, mostly due to
>>> the use of VOP_WRITE().
>>>
>>> As such, the patch worked in my test environment, where I used
>>> ahci-attached SATA disks with gpt partitions, md(4) and UFS.  I see no
>>> statistically significant difference in the buildworld -j 10 times on
>>> the 4-core machine with HT.  On the other hand, when doing sha1 over
>>> the 5GB file, the system time was reduced by 30%.
>>>
>>> Unfinished items:
>>> - Integration with the physbio, will be done after physbio is
>>>   committed to HEAD.
>>> - The key per-architecture function needed for the unmapped i/o is the
>>>   pmap_copy_pages(). I implemented it for amd64 and i386 right now, it
>>>   shall be done for all other architectures.
>>> - The sizing of the submap used for transient mapping of the BIOs is
>>>   naive.  Should be adjusted, esp. for KVA-lean architectures.
>>> - Conversion of the other filesystems. Low priority.
>>>
>>> I am interested in reviews, tests and suggestions.  Note that this
>>> only works now for md(4) and ahci(4), for other drivers the patched
>>> kernel should fall back to the mapped i/o.
>>>
>>>
>> Here are a couple things for you to think about:
>>
>> 1. A while back, I developed the patch at
>> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to
>> reduce the number of TLB shootdowns by the buffer map.  The idea is simple:
>> Replace the calls to pmap_q{enter,remove}() with calls to a new
>> machine-dependent function that opportunistically sets the buffer's kernel
>> virtual address to the direct map for physically contiguous pages.
>> However, if the pages are not physically contiguous, it calls pmap_qenter()
>> with the kernel virtual address from the buffer map.
>>
>> This eliminated about half of the TLB shootdowns for a buildworld, because
>> there is a decent amount of physical contiguity that occurs by "accident".
>> Using a buddy allocator for physical page allocation tends to promote this
>> contiguity.  However, in a few places, it occurs by explicit action, e.g.,
>> mapped files, including large executables, using superpage reservations.
>>
>> So, how does this fit with what you've done?  You might think of using what
>> I describe above as a kind of "fast path".  As you can see from the patch,
>> it's very simple and non-intrusive.  If the pages aren't physically
>> contiguous, then instead of using pmap_qenter(), you fall back to whatever
>> approach for creating ephemeral mappings is appropriate to a given
>> architecture.
> I remember this.
>
> I did not measure the change in the number of IPIs issued during the
> buildworld, but I do account for the mapped/unmapped buffer space in
> the patch. For the buildworld load, mapped buffers make up 5-10% of
> all buffers, which coincides with the intuitive size of the metadata
> for the sources. Since unmapped buffers eliminate IPIs at creation and
> reuse, I can safely guess that the IPI reduction is of a comparable
> magnitude.
>
> The pmap_map_buf() patch is orthogonal to the work I did, and it should
> nicely reduce the overhead for the metadata buffers handling. I can finish
> it, if you want. I do not think that it should be added to the already
> large patch, but instead it could be done and committed separately.


I agree.  This patch should be completed and committed separately from
your patch.

I would be happy for you to complete the patch.  However, before doing
that, let me send you another patch that is an alternate implementation
of this same basic idea.  Essentially, I was trying to see if I could
come up with another way of doing the same thing that didn't require two
new pmap functions.  After you've had a chance to look at them both, we
can discuss the pros and cons of each, and decide which one to complete
and commit.

I'll dig up this alternate implementation and send it to you on Friday.


>> 2. As for managing the ephemeral mappings on machines that don't support a
>> direct map.  I would suggest an approach that is loosely inspired by
>> copying garbage collection (or the segment cleaners in log-structured file
>> systems).  Roughly, you manage the buffer map as a few spaces (or
>> segments).  When you create a new mapping in one of these spaces (or
>> segments), you simply install the PTEs.  When you decide to "garbage
>> collect" a space (or spaces), then you perform a global TLB flush.
>> Specifically, you do something like toggling the bit in the cr4 register
>> that enables/disables support for the PG_G bit.  If the spaces are
>> sufficiently large, then the number of such global TLB flushes should be
>> quite low.  Every space would have an epoch number (or flush number).  In
>> the buffer, you would record the epoch number alongside the kernel virtual
>> address.  On access to the buffer, if the epoch number was too old, then
>> you have to recreate the buffer's mapping in a new space.
> Could you please describe the idea in more detail?  For which mappings
> should the described mechanism be used?
>
> Do you mean the pmap_copy_pages() implementation, or the fallback mappings
> for BIOs ?
>
> Note that the pmap_copy_pages() implementation on i386 is shamelessly
> stolen
> from pmap_copy_page() and uses the per-cpu ephemeral mapping for copying.
>
> For BIOs, this might be used, but I am also quite satisfied with submap
> and pmap_qenter().


I'll try to answer your questions on Friday.

Alan



