Date:      Fri, 28 Dec 2012 13:20:09 -1000 (HST)
From:      Jeff Roberson <jroberson@jroberson.net>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        arch@freebsd.org
Subject:   Re: Unmapped I/O
Message-ID:  <alpine.BSF.2.00.1212281305020.2005@desktop>
In-Reply-To: <20121228230041.GY82219@kib.kiev.ua>
References:  <20121219135451.GU71906@kib.kiev.ua> <alpine.BSF.2.00.1212281217570.2005@desktop> <20121228230041.GY82219@kib.kiev.ua>

On Sat, 29 Dec 2012, Konstantin Belousov wrote:

> On Fri, Dec 28, 2012 at 12:36:57PM -1000, Jeff Roberson wrote:
>> On Wed, 19 Dec 2012, Konstantin Belousov wrote:
>>
>>> One of the known FreeBSD I/O path performance bottlenecks is the
>>> necessity to map each I/O buffer's pages into KVA.  The problem is
>>> that, on multi-core machines, the mapping must flush the TLB on all
>>> cores, due to the global mapping of the buffer pages into the
>>> kernel.  This means that buffer creation and destruction disrupt
>>> execution of all other cores to perform the TLB shootdown through
>>> IPIs, and the thread initiating the shootdown must wait for all
>>> other cores to execute and report back.
>>>
>>> The patch at
>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>>> implements 'unmapped buffers', i.e. the ability to create a VMIO
>>> struct buf which does not carry a KVA mapping of the buffer pages
>>> into the kernel address space.  Since there is no mapping, the
>>> kernel does not need to flush the TLB.  The unmapped buffers are
>>> marked with the new B_NOTMAPPED flag, and must be requested
>>> explicitly with the GB_NOTMAPPED flag to the buffer allocation
>>> routines.  If a mapped buffer is requested but an unmapped buffer
>>> already exists, the buffer subsystem maps the pages automatically.
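
As I read the patch, a consumer asks for an unmapped buffer roughly
like this (a sketch only, with the flag names taken from the
description above):

    struct buf *bp;

    /* Ask for the buffer without establishing a KVA mapping. */
    bp = getblk(vp, lblkno, size, 0, 0, GB_NOTMAPPED);
    if ((bp->b_flags & B_NOTMAPPED) != 0) {
            /* No KVA: b_data is not valid, operate on the
             * bp->b_pages[] array instead. */
    }
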
>>>
>>> The clustering code is also made aware of unmapped buffers, but
>>> this required a KPI change, which accounts for the diff in the
>>> non-UFS filesystems.
>>>
>>> UFS is adapted to request unmapped buffers when the kernel does not
>>> need to access the content, i.e. mostly for file data.  The new
>>> helper function vn_io_fault_pgmove() operates on the unmapped array
>>> of pages.  It calls the new pmap method pmap_copy_pages() to move
>>> the data to and from usermode.
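
I assume the amd64 pmap_copy_pages() is essentially a direct-map copy
loop, something like the sketch below (my reconstruction, not the
patch's code):

    void
    pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset,
        vm_page_t mb[], vm_offset_t b_offset, int xfersize)
    {
            void *a_cp, *b_cp;
            vm_offset_t a_pg_offset, b_pg_offset;
            int cnt;

            while (xfersize > 0) {
                    /* Copy at most up to the next page boundary on
                     * either side, going through the direct map so
                     * no transient KVA is needed. */
                    a_pg_offset = a_offset & PAGE_MASK;
                    cnt = min(xfersize, PAGE_SIZE - a_pg_offset);
                    a_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
                        ma[a_offset >> PAGE_SHIFT])) + a_pg_offset;
                    b_pg_offset = b_offset & PAGE_MASK;
                    cnt = min(cnt, PAGE_SIZE - b_pg_offset);
                    b_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
                        mb[b_offset >> PAGE_SHIFT])) + b_pg_offset;
                    bcopy(a_cp, b_cp, cnt);
                    a_offset += cnt;
                    b_offset += cnt;
                    xfersize -= cnt;
            }
    }

The i386 version presumably needs transient per-CPU mappings instead
of the direct map.
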
>>>
>>> Besides unmapped buffers, unmapped BIOs are introduced, marked with
>>> the flag BIO_NOTMAPPED.  Unmapped buffers are directly translated
>>> to unmapped BIOs.  GEOM providers may indicate acceptance of
>>> unmapped BIOs.  If a provider does not handle unmapped I/O
>>> requests, GEOM now automatically establishes a transient mapping
>>> for the I/O pages.
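
To make the driver-side contract concrete, my mental model is
something like the following sketch; xx_start_pages()/xx_start_kva()
are stand-ins for the hardware-specific code, and bio_ma/bio_ma_offset
are my guesses at the patch's field names:

    static void
    xx_strategy(struct bio *bp)
    {
            if ((bp->bio_flags & BIO_NOTMAPPED) != 0) {
                    /* No KVA: program the transfer straight from
                     * the page array. */
                    xx_start_pages(bp->bio_ma, bp->bio_ma_offset,
                        bp->bio_length);
            } else {
                    /* Classic path: bio_data points at mapped KVA. */
                    xx_start_kva(bp->bio_data, bp->bio_length);
            }
    }

A driver that implements the first branch would advertise it (with
DISKFLAG_NOTMAPPED_BIO, per your note below); otherwise g_down maps
the pages before the driver ever sees them.
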
>>>
>>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs.
>>> The gpart providers indicate unmapped BIO support if the underlying
>>> provider can do unmapped I/O.  I also hacked ahci(4) to handle
>>> unmapped I/O, but this should be changed to use the proper busdma
>>> interface after Jeff's physbio patch is committed.
>>>
>>> Besides that, the swap pager does unmapped swapping if the swap
>>> partition indicated that it can do unmapped I/O.  At Jeff's
>>> request, the buffer allocation code may reserve KVA for an unmapped
>>> buffer in advance.  Unmapped page-in for the vnode pager is also
>>> implemented, if the filesystem supports it, but page-out is not.
>>> The page-out path, as well as the vnode-backed md(4), currently
>>> requires mappings, mostly due to the use of VOP_WRITE().
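
The KVA reservation mentioned above is the thing I asked for; I
imagine the usage is roughly this (GB_KVAALLOC is my guess at the
exact flag spelling):

    /* Unmapped, but with KVA reserved so a later transition to
     * mapped costs no KVA allocation. */
    bp = getblk(vp, lblkno, size, 0, 0, GB_NOTMAPPED | GB_KVAALLOC);
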
>>>
>>> As it stands, the patch works in my test environment, where I used
>>> ahci-attached SATA disks with GPT partitions, md(4) and UFS.  I see
>>> no statistically significant difference in buildworld -j 10 times
>>> on a 4-core machine with HT.  On the other hand, when running sha1
>>> over a 5GB file, system time was reduced by 30%.
>>>
>>> Unfinished items:
>>> - Integration with physbio; will be done after physbio is
>>>  committed to HEAD.
>>> - The key per-architecture function needed for unmapped I/O is
>>>  pmap_copy_pages().  I have implemented it for amd64 and i386; it
>>>  still needs to be done for all other architectures.
>>> - The sizing of the submap used for transient mappings of the BIOs
>>>  is naive.  It should be adjusted, especially for KVA-lean
>>>  architectures.
>>> - Conversion of the other filesystems.  Low priority.
>>>
>>> I am interested in reviews, tests and suggestions.  Note that this
>>> currently only works for md(4) and ahci(4); for other drivers the
>>> patched kernel falls back to mapped I/O.
>>>
>>> sys/amd64/amd64/pmap.c         |  24 +++
>>> sys/cam/ata/ata_da.c           |   5 +-
>>> sys/cam/cam_ccb.h              |  30 ++++
>>> sys/dev/ahci/ahci.c            |  53 +++++-
>>> sys/dev/md/md.c                | 255 ++++++++++++++++++++++++-----
>>> sys/fs/cd9660/cd9660_vnops.c   |   2 +-
>>> sys/fs/ext2fs/ext2_balloc.c    |   2 +-
>>> sys/fs/ext2fs/ext2_vnops.c     |   9 +-
>>> sys/fs/msdosfs/msdosfs_vnops.c |   4 +-
>>> sys/fs/udf/udf_vnops.c         |   5 +-
>>> sys/geom/geom.h                |   1 +
>>> sys/geom/geom_disk.c           |   2 +
>>> sys/geom/geom_disk.h           |   1 +
>>> sys/geom/geom_io.c             |  44 ++++-
>>> sys/geom/geom_vfs.c            |  10 +-
>>> sys/geom/part/g_part.c         |   1 +
>>> sys/i386/i386/pmap.c           |  42 +++++
>>> sys/kern/vfs_bio.c             | 356 +++++++++++++++++++++++++++++++++--------
>>> sys/kern/vfs_cluster.c         | 118 +++++++-------
>>> sys/kern/vfs_vnops.c           |  39 +++++
>>> sys/sys/bio.h                  |   7 +
>>> sys/sys/buf.h                  |  22 ++-
>>> sys/sys/mount.h                |   1 +
>>> sys/sys/vnode.h                |   2 +
>>> sys/ufs/ffs/ffs_alloc.c        |  10 +-
>>> sys/ufs/ffs/ffs_balloc.c       |  58 ++++---
>>> sys/ufs/ffs/ffs_vfsops.c       |   3 +-
>>> sys/ufs/ffs/ffs_vnops.c        |  35 ++--
>>> sys/ufs/ufs/ufs_extern.h       |   1 +
>>> sys/vm/pmap.h                  |   2 +
>>> sys/vm/swap_pager.c            |  43 +++--
>>> sys/vm/swap_pager.h            |   1 +
>>> sys/vm/vm.h                    |   2 +
>>> sys/vm/vm_init.c               |   6 +-
>>> sys/vm/vm_kern.c               |   9 +-
>>> sys/vm/vnode_pager.c           |  30 +++-
>>> 36 files changed, 989 insertions(+), 246 deletions(-)
>>>
>>>
>>
>> A few comments:
>>
>> 1)  If the BIO is mapped at all you have to pass the virtual address to
>> busdma.  We cannot lose the fact that it is mapped somewhere or virtual
>> cache architectures will fail.  I think when we integrate with physbio we
>> can do this properly by implementing a load routine on the bio.
> This is weird.  How does the buffer cache work for such architectures
> then?  VMIO buffer pages can be mapped by the kernel and by any
> usermode process.

pmap_remove_write() in vfs_busy_pages() handles this.  No new writes are
allowed until the I/O is complete.
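
For reference, the fragment I mean looks roughly like this
(paraphrased from vfs_busy_pages() from memory, so details may be
off):

    vm_page_t m;
    off_t foff;
    int i;

    foff = bp->b_offset;
    for (i = 0; i < bp->b_npages; i++) {
            m = bp->b_pages[i];
            if (clear_modify) {
                    /* Write I/O: revoke every writeable mapping so
                     * the page contents are frozen until the I/O
                     * completes. */
                    pmap_remove_write(m);
                    vfs_page_set_validclean(bp, foff, m);
            }
            foff = (foff + PAGE_SIZE) & ~(off_t)PAGE_MASK;
    }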

>
> Anyway, I believe I never pass BIO_NOTMAPPED down if the kernel mapping
> for the buffer exists. I might document it somewhere.

Ok, we just need to be sure that at any place where we make the transition
from unmapped to mapped we use the right pointer.
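
In other words, wherever a buf becomes a bio we want the moral
equivalent of this (bio_ma is my guess at the patch's name for the
page-array field):

    if ((bp->b_flags & B_NOTMAPPED) != 0) {
            bip->bio_ma = bp->b_pages;      /* no KVA, hand over pages */
            bip->bio_flags |= BIO_NOTMAPPED;
    } else {
            /* A mapping exists; that address is the one busdma and
             * virtually-indexed caches must see. */
            bip->bio_data = bp->b_data;
    }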

>
> Would you prefer to have both the mapping address and the vm_page array
> passed down, if KVA is available?

Yes, I think that's best.

>>
>> 2)  Why does mdstart_swap() need a cpu_flush_dcache?  Shouldn't sf bufs on
>> the appropriate architecture do the right flushing when they are unmapped?
> I believe it was added by Marcel.  The sfbuf code cannot know whether
> the [d]cache should be flushed, since the caller gives no indication
> of the kind of access to the mapping, ro or rw.

Maybe we should change the free function to indicate this.  It seems
sloppy; not your fault, though.

>
>>
>> 3)  I find the NOTMAPPED negative flag awkward grammatically.  UNMAPPED
>> seems more natural.  Or a positive MAPPED flag would be better.  Minor
>> concern, bikeshedding, etc.
> So you prefer BIO/B_UNMAPPED?  I will change this.

I think I'd prefer the positive flag BIO_MAPPED, but I recognize that's a
lot of work to switch the sense in your entire patch now, so feel free to
ignore me.

>
>>
>> 4)  It would be better to have some wrapper functions around the bio
>> transient map and or sf buf handling.  I will need it to map unmapped cam
>> ccbs in device drivers.  We need to come to some agreement on this API.
>> There should be a fast page-by-page version and a potentially blocking
>> all-at-once linear version.
> No, I do not think this is needed.  I thought about this, and decided
> to handle it centrally in g_down instead of requiring driver code to
> be changed there.
>
> The driver should indicate its acceptance of unmapped BIOs.  If it
> does not accept them, the patch already contains code in the g_down
> path to create the transient mapping.  The PIO-mode ATA drivers should
> not set the flag on the disk.  There might be some need to be able to
> dynamically clear the flag when the channel mode is switched to PIO.
>
> In fact, I do not see why we should not simply require mapped BIOs
> for legacy ATA.  See DISKFLAG_NOTMAPPED_BIO.

I guess this would be acceptable for a first cut but I don't think it will
fly long-term.  I audited all of the cam and block devices recently.
There are cases that are difficult to predict ahead of time, which would
mean some devices would always be operating in a slower mode.

Also, my primary test machines are 16-core boxes that still use legacy
ATA.  It is not that uncommon.  It would be a shame if many of our users
saw no benefit from this.  Although for ATA PIO we could make an iterator
that used the direct map or sf_bufs.  That may be sufficient.  It would
be nice if it were encapsulated in a simple API so every user didn't have
to know the details of pinning, sf_bufs, and the memory layout.  What
would you think of that as the in-between compromise?
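
Concretely, I am picturing something like this (all names made up,
just to show the shape of the API):

    struct bio_pgiter {
            struct bio      *bio;   /* unmapped request being walked */
            int              idx;   /* current page index */
            struct sf_buf   *sf;    /* mapping for the current page */
    };

    /* Map the current page; may sleep waiting for an sf_buf. */
    static void *
    bio_pgiter_map(struct bio_pgiter *it)
    {
            it->sf = sf_buf_alloc(it->bio->bio_ma[it->idx], 0);
            return ((void *)sf_buf_kva(it->sf));
    }

    /* Release the mapping and advance to the next page. */
    static void
    bio_pgiter_next(struct bio_pgiter *it)
    {
            sf_buf_free(it->sf);
            it->idx++;
    }

A PIO loop would just map, transfer a page, and call next; since
sf_bufs collapse to the direct map on amd64, the mapping would only
cost anything on KVA-lean architectures.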

>
>>
>> 5)  Should the transient bio space not come from the buffer_map?  We don't
>> have more KVA on 32bit architectures.  Or did that come from pbufs?  We
>> could consolidate here some but there are a lot of potential deadlock
>> issues.
> No, exactly because it is impossible to request buffer_map KVA
> defragmentation from g_down.  The worst that can happen with the
> current code is pauses during intensive I/O when the transient map is
> exhausted or fragmented.  g_down is paused until enough in-flight
> transactions are completed.
>

How is there room for this new transient map?  Can't pbufs completely give
up their space to it now?  Shouldn't you always allocate BKVASIZE to avoid
fragmentation?  And speed up the search.

> Peter has tested it under a load where a lot of parallel I/O was
> issued and the transient map was constantly exhausted, with the ad(4)
> driver requiring fallback mappings.  I added counters to watch the
> situation.
>
>>
>> 6)  Thank you for adding the KVAALLOC flag.  Isilon needs this internally.
> Please note that I have not tested this yet :/.
>
>>
>> 7)  All of that code that exploded into getblk() should be refactored into
>> some support functions.  It's hard to read and getblk() is too big
>> already.
> Agreed.
>
>>
>> All in all it looks like we have the right pieces coming together.  It
>> just needs a little refactoring and then a lot of testing.  I'm almost
>> ready to commit the first phase of physbio.  It doesn't yet have the
>> code necessary to temporarily map I/O for things like ATA PIO.  I'm
>> hoping that you'll
> It is there in the central place.
>
>> provide the functions to do that.  It does abstract out most of the
>> details of the memory formats so that we can change them at all.  And
>> it adds support to busdma for physical addresses so bounce pages,
>> alignment, sizes, and boundaries work.
>>
>> Thanks,
>> Jeff
>

Jeff


