Date: Wed, 19 Dec 2012 09:28:46 -1000 (HST)
From: Jeff Roberson <jroberson@jroberson.net>
To: alc@freebsd.org
Cc: Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org
Subject: Re: Unmapped I/O
Message-ID: <alpine.BSF.2.00.1212190923170.2005@desktop>
In-Reply-To: <CAJUyCcNuD_TWR6xxFxVqDi4-eBGx3Jjs21eBxaZYYVUERESbMw@mail.gmail.com>
References: <20121219135451.GU71906@kib.kiev.ua> <CAJUyCcNuD_TWR6xxFxVqDi4-eBGx3Jjs21eBxaZYYVUERESbMw@mail.gmail.com>
On Wed, 19 Dec 2012, Alan Cox wrote:

> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:
>
>> One of the known FreeBSD I/O path performance bottlenecks is the
>> necessity of mapping each I/O buffer's pages into KVA. The problem is
>> that on multi-core machines, the mapping must flush the TLB on all
>> cores, due to the global mapping of the buffer pages into the kernel.
>> This means that buffer creation and destruction disrupt execution of
>> all other cores to perform the TLB shootdown through IPIs, and the
>> thread initiating the shootdown must wait for all other cores to
>> execute and report back.
>>
>> The patch at
>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>> implements 'unmapped buffers'. This means the ability to create a
>> VMIO struct buf which does not point to a KVA mapping of the buffer
>> pages at kernel addresses. Since there is no mapping, the kernel does
>> not need to flush the TLB. Unmapped buffers are marked with the new
>> B_NOTMAPPED flag, and must be requested explicitly by passing the
>> GB_NOTMAPPED flag to the buffer allocation routines. If a mapped
>> buffer is requested but an unmapped buffer already exists, the buffer
>> subsystem automatically maps the pages.
>>
>> The clustering code is also made aware of not-mapped buffers, but
>> this required a KPI change, which accounts for the diff in the
>> non-UFS filesystems.
>>
>> UFS is adapted to request not-mapped buffers when the kernel does not
>> need to access the content, i.e. mostly for file data. The new helper
>> function vn_io_fault_pgmove() operates on the unmapped array of
>> pages. It calls the new pmap method pmap_copy_pages() to move the
>> data to and from usermode.
>>
>> Besides not-mapped buffers, not-mapped BIOs are introduced, marked
>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated
>> to unmapped BIOs. GEOM providers may indicate acceptance of unmapped
>> BIOs. If a provider does not handle unmapped I/O requests, GEOM now
>> automatically establishes a transient mapping for the I/O pages.
>>
>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The
>> gpart providers indicate unmapped BIO support if the underlying
>> provider can do unmapped I/O. I also hacked ahci(4) to handle
>> unmapped I/O, but this should be changed after Jeff's physbio patch
>> is committed, to use the proper busdma interface.
>>
>> Besides that, the swap pager does unmapped swapping if the swap
>> partition indicated that it can do unmapped I/O. By Jeff's request,
>> the buffer allocation code may reserve the KVA for an unmapped buffer
>> in advance. Unmapped page-in for the vnode pager is also implemented
>> if the filesystem supports it, but page-out is not. Page-out, as well
>> as vnode-backed md(4), currently requires mappings, mostly due to the
>> use of VOP_WRITE().
>>
>> As is, the patch worked in my test environment, where I used
>> ahci-attached SATA disks with GPT partitions, md(4) and UFS. I see no
>> statistically significant difference in buildworld -j 10 times on a
>> 4-core machine with HT. On the other hand, when doing sha1 over a
>> 5GB file, the system time was reduced by 30%.
>>
>> Unfinished items:
>> - Integration with physbio; will be done after physbio is committed
>>   to HEAD.
>> - The key per-architecture function needed for unmapped I/O is
>>   pmap_copy_pages(). I implemented it for amd64 and i386; it remains
>>   to be done for all other architectures.
>> - The sizing of the submap used for transient mappings of the BIOs
>>   is naive. It should be adjusted, especially for KVA-lean
>>   architectures.
>> - Conversion of the other filesystems. Low priority.
>>
>> I am interested in reviews, tests and suggestions. Note that this
>> only works now for md(4) and ahci(4); for other drivers the patched
>> kernel should fall back to mapped I/O.
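[To make the flag flow above concrete, here is a minimal sketch of how a
filesystem might request an unmapped buffer. getblk() and brelse() are the
existing buffer-cache routines; GB_NOTMAPPED and B_NOTMAPPED are the flags
described in the summary above, and the surrounding function is invented
purely for illustration, not taken from the patch.]

/*
 * Illustrative only: requesting an unmapped buffer with the proposed
 * GB_NOTMAPPED flag.  The surrounding function is hypothetical.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/vnode.h>

static void
get_unmapped_block(struct vnode *vp, daddr_t lbn, int size)
{
	struct buf *bp;

	/*
	 * Ask for the buffer without a KVA mapping.  If the kernel
	 * never touches the data itself (plain file data destined for
	 * userspace), no TLB shootdown is needed to set the buffer up
	 * or tear it down.
	 */
	bp = getblk(vp, lbn, size, 0, 0, GB_NOTMAPPED);

	if ((bp->b_flags & B_NOTMAPPED) != 0) {
		/*
		 * bp->b_data is not usable here; the pages are only
		 * reachable through bp->b_pages[], so data movement
		 * must go through a page-array helper such as
		 * vn_io_fault_pgmove().
		 */
	} else {
		/*
		 * The buffer already existed with a mapping (or the
		 * subsystem chose to map it); bp->b_data is valid and
		 * the traditional code paths still apply.
		 */
	}
	brelse(bp);
}

[The key point is that a B_NOTMAPPED buffer carries its pages only in
b_pages[], never through b_data.]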
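[On the second unfinished item above: on an architecture with a direct
map, pmap_copy_pages() can avoid creating any transient mapping at all.
Below is a sketch of roughly what an amd64 implementation might look like;
the exact signature (two page arrays, byte offsets, a transfer size) is
assumed from the description rather than taken from the patch, while
PHYS_TO_DMAP() is the existing amd64 direct-map accessor.]

/*
 * Sketch of pmap_copy_pages() for a direct-map architecture such as
 * amd64.  Copies xfersize bytes between two arrays of vm_page_t.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <vm/vm.h>
#include <vm/vm_page.h>
#include <vm/pmap.h>
#include <machine/vmparam.h>	/* PHYS_TO_DMAP() */

void
pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset, vm_page_t mb[],
    vm_offset_t b_offset, int xfersize)
{
	void *a_cp, *b_cp;
	vm_offset_t a_pg_offset, b_pg_offset;
	int cnt;

	while (xfersize > 0) {
		/* Locate the source bytes through the direct map. */
		a_pg_offset = a_offset & PAGE_MASK;
		cnt = min(xfersize, PAGE_SIZE - a_pg_offset);
		a_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
		    ma[a_offset >> PAGE_SHIFT])) + a_pg_offset;
		/* Locate the destination bytes the same way. */
		b_pg_offset = b_offset & PAGE_MASK;
		cnt = min(cnt, PAGE_SIZE - b_pg_offset);
		b_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
		    mb[b_offset >> PAGE_SHIFT])) + b_pg_offset;
		/* Copy without creating or destroying any KVA mapping. */
		bcopy(a_cp, b_cp, cnt);
		a_offset += cnt;
		b_offset += cnt;
		xfersize -= cnt;
	}
}

[Architectures without a direct map would instead have to create and tear
down ephemeral mappings here, which is where Alan's second suggestion
below comes in.]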
>
> Here are a couple of things for you to think about:
>
> 1. A while back, I developed the patch at
> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying
> to reduce the number of TLB shootdowns caused by the buffer map. The
> idea is simple: Replace the calls to pmap_q{enter,remove}() with calls
> to a new machine-dependent function that opportunistically sets the
> buffer's kernel virtual address to the direct map for physically
> contiguous pages. However, if the pages are not physically contiguous,
> it calls pmap_qenter() with the kernel virtual address from the buffer
> map.
>
> This eliminated about half of the TLB shootdowns for a buildworld,
> because there is a decent amount of physical contiguity that occurs by
> "accident". Using a buddy allocator for physical page allocation tends
> to promote this contiguity. However, in a few places it occurs by
> explicit action, e.g., mapped files, including large executables,
> using superpage reservations.
>
> So, how does this fit with what you've done? You might think of using
> what I describe above as a kind of "fast path". As you can see from
> the patch, it's very simple and non-intrusive. If the pages aren't
> physically contiguous, then instead of using pmap_qenter(), you fall
> back to whatever approach for creating ephemeral mappings is
> appropriate to a given architecture.

I think these are complementary. Kib's patch gives us the fastest
possible path for user data. Alan's patch will improve metadata
performance for things that really require the buffer cache. I see no
reason not to clean up and commit both.

>
> 2. As for managing the ephemeral mappings on machines that don't
> support a direct map, I would suggest an approach that is loosely
> inspired by copying garbage collection (or the segment cleaners in
> log-structured file systems). Roughly, you manage the buffer map as a
> few spaces (or segments). When you create a new mapping in one of
> these spaces (or segments), you simply install the PTEs. When you
> decide to "garbage collect" a space (or spaces), you perform a global
> TLB flush. Specifically, you do something like toggling the bit in the
> cr4 register that enables/disables support for the PG_G bit. If the
> spaces are sufficiently large, then the number of such global TLB
> flushes should be quite low. Every space would have an epoch number
> (or flush number). In the buffer, you would record the epoch number
> alongside the kernel virtual address. On access to the buffer, if the
> epoch number is too old, then you have to recreate the buffer's
> mapping in a new space.

Are the machines that don't have a direct map performance critical? My
expectation is that they are legacy or embedded. This seems like a
great project to do once the rest of the pieces are stable and fast.
Until then, they could just use something like pbufs?

Jeff

> Alan
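[For readers who want to picture Alan's second suggestion, here is a
rough sketch of an epoch-numbered transient-mapping arena. Every name in
it (bkva_seg, bkva_map(), and so on) is hypothetical; nothing like this
exists in the tree, locking is omitted, and pmap_qenter() stands in for a
primitive that installs PTEs without issuing per-mapping shootdowns.]

#include <sys/param.h>
#include <sys/systm.h>
#include <vm/vm.h>
#include <vm/vm_page.h>
#include <vm/pmap.h>

#define	BKVA_NSEG	4	/* number of spaces (segments) */

struct bkva_seg {
	vm_offset_t	base;	/* start of this space's KVA */
	vm_offset_t	free;	/* next unused address in the space */
	vm_offset_t	end;	/* end of this space's KVA */
	uint64_t	epoch;	/* bumped each time the space is collected */
};

static struct bkva_seg bkva_segs[BKVA_NSEG];
static int bkva_cur;

/*
 * Stand-in for a global TLB flush.  On x86 this could toggle CR4.PGE,
 * e.g. load_cr4(rcr4() & ~CR4_PGE) followed by load_cr4(rcr4() | CR4_PGE),
 * so that even PG_G entries are invalidated, as Alan describes.
 */
static void
bkva_flush_global(void)
{
}

/*
 * Map a buffer's pages into the current space.  PTEs are simply
 * installed; no per-mapping shootdown IPIs are sent.  The caller
 * records the returned epoch next to the KVA so a later access can
 * tell whether the mapping survived a collection.
 */
static vm_offset_t
bkva_map(vm_page_t *pages, int npages, uint64_t *epochp)
{
	struct bkva_seg *seg;
	vm_offset_t kva;

	seg = &bkva_segs[bkva_cur];
	if (seg->free + ptoa(npages) > seg->end) {
		/*
		 * The space is full: advance to the next one and
		 * "garbage collect" it with a single global flush.
		 */
		bkva_cur = (bkva_cur + 1) % BKVA_NSEG;
		seg = &bkva_segs[bkva_cur];
		seg->free = seg->base;
		seg->epoch++;
		bkva_flush_global();
	}
	kva = seg->free;
	seg->free += ptoa(npages);
	pmap_qenter(kva, pages, npages);
	*epochp = seg->epoch;
	return (kva);
}

/*
 * On access to a buffer: if the recorded epoch no longer matches the
 * containing space's epoch, the mapping may have been collected and
 * must be recreated in the current space.
 */
static int
bkva_mapping_valid(vm_offset_t kva, uint64_t epoch)
{
	struct bkva_seg *seg;

	for (seg = bkva_segs; seg < &bkva_segs[BKVA_NSEG]; seg++)
		if (kva >= seg->base && kva < seg->end)
			return (epoch == seg->epoch);
	return (0);
}

[Whether this bookkeeping is worth it on the remaining non-direct-map
platforms, versus simply falling back to pbufs as Jeff suggests, is
exactly the open question.]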