Date: Wed, 19 Dec 2012 15:16:22 -0600
From: Alan Cox <alc@rice.edu>
To: Jeff Roberson <jroberson@jroberson.net>
Cc: alc@freebsd.org, Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org
Subject: Re: Unmapped I/O
Message-ID: <50D22EA6.1040501@rice.edu>
In-Reply-To: <alpine.BSF.2.00.1212190923170.2005@desktop>
References: <20121219135451.GU71906@kib.kiev.ua> <CAJUyCcNuD_TWR6xxFxVqDi4-eBGx3Jjs21eBxaZYYVUERESbMw@mail.gmail.com> <alpine.BSF.2.00.1212190923170.2005@desktop>
On 12/19/2012 13:28, Jeff Roberson wrote:
> On Wed, 19 Dec 2012, Alan Cox wrote:
>
>> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov
>> <kostikbel@gmail.com> wrote:
>>
>>> One of the known FreeBSD I/O path performance bottlenecks is the
>>> necessity to map each I/O buffer's pages into KVA. The problem is
>>> that on multi-core machines, the mapping must flush the TLB on all
>>> cores, due to the global mapping of the buffer pages into the
>>> kernel. This means that buffer creation and destruction disrupt
>>> execution on all other cores to perform TLB shootdowns through
>>> IPIs, and the thread initiating the shootdown must wait for all
>>> other cores to execute and report.
>>>
>>> The patch at
>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>>> implements 'unmapped buffers'. This means the ability to create a
>>> VMIO struct buf that does not point to a KVA mapping of the buffer
>>> pages at kernel addresses. Since there is no mapping, the kernel
>>> does not need to flush the TLB. Unmapped buffers are marked with
>>> the new B_NOTMAPPED flag, and must be requested explicitly by
>>> passing the GB_NOTMAPPED flag to the buffer allocation routines.
>>> If a mapped buffer is requested but an unmapped buffer already
>>> exists, the buffer subsystem automatically maps the pages.
>>>
>>> The clustering code is also made aware of the not-mapped buffers,
>>> but this required a KPI change that accounts for the diff in the
>>> non-UFS filesystems.
>>>
>>> UFS is adapted to request not-mapped buffers when the kernel does
>>> not need to access the content, i.e. mostly for file data. A new
>>> helper function, vn_io_fault_pgmove(), operates on the unmapped
>>> array of pages. It calls the new pmap method pmap_copy_pages() to
>>> move the data to and from usermode.
>>>
>>> Besides not-mapped buffers, not-mapped BIOs are introduced, marked
>>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly
>>> translated to unmapped BIOs.
>>> Geom providers may indicate acceptance of unmapped BIOs. If a
>>> provider does not handle unmapped I/O requests, geom now
>>> automatically establishes a transient mapping for the I/O pages.
>>>
>>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs.
>>> The gpart providers indicate unmapped BIO support if the underlying
>>> provider can do unmapped I/O. I also hacked ahci(4) to handle
>>> unmapped I/O, but this should be changed after Jeff's physbio patch
>>> is committed, to use the proper busdma interface.
>>>
>>> In addition, the swap pager does unmapped swapping if the swap
>>> partition indicated that it can do unmapped I/O. At Jeff's request,
>>> the buffer allocation code may reserve KVA for an unmapped buffer
>>> in advance. Unmapped page-in for the vnode pager is also
>>> implemented if the filesystem supports it, but page-out is not.
>>> Page-out, as well as vnode-backed md(4), currently requires
>>> mappings, mostly due to the use of VOP_WRITE().
>>>
>>> As such, the patch worked in my test environment, where I used
>>> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see
>>> no statistically significant difference in buildworld -j 10 times
>>> on a 4-core machine with HT. On the other hand, when doing sha1
>>> over a 5GB file, system time was reduced by 30%.
>>>
>>> Unfinished items:
>>> - Integration with physbio; will be done after physbio is
>>>   committed to HEAD.
>>> - The key per-architecture function needed for unmapped I/O is
>>>   pmap_copy_pages(). I implemented it for amd64 and i386 right
>>>   now; it shall be done for all other architectures.
>>> - The sizing of the submap used for transient mapping of the BIOs
>>>   is naive. It should be adjusted, especially for KVA-lean
>>>   architectures.
>>> - Conversion of the other filesystems. Low priority.
>>>
>>> I am interested in reviews, tests and suggestions.
>>> Note that this only works now for md(4) and ahci(4); for other
>>> drivers the patched kernel should fall back to mapped I/O.
>>
>> Here are a couple of things for you to think about:
>>
>> 1. A while back, I developed the patch at
>> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in
>> trying to reduce the number of TLB shootdowns caused by the buffer
>> map. The idea is simple: replace the calls to pmap_q{enter,remove}()
>> with calls to a new machine-dependent function that
>> opportunistically sets the buffer's kernel virtual address to the
>> direct map for physically contiguous pages. However, if the pages
>> are not physically contiguous, it calls pmap_qenter() with the
>> kernel virtual address from the buffer map.
>>
>> This eliminated about half of the TLB shootdowns for a buildworld,
>> because there is a decent amount of physical contiguity that occurs
>> by "accident": using a buddy allocator for physical page allocation
>> tends to promote this contiguity. In a few places, however, it
>> occurs by explicit action, e.g., mapped files, including large
>> executables, using superpage reservations.
>>
>> So, how does this fit with what you've done? You might think of
>> using what I describe above as a kind of "fast path". As you can
>> see from the patch, it's very simple and non-intrusive. If the
>> pages aren't physically contiguous, then instead of using
>> pmap_qenter(), you fall back to whatever approach for creating
>> ephemeral mappings is appropriate to a given architecture.
>
> I think these are complementary. Kib's patch gives us the fastest
> possible path for user data. Alan's patch will improve the metadata
> performance for things that really require the buffer cache. I see
> no reason not to clean up and commit both.
>
>> 2. As for managing the ephemeral mappings on machines that don't
>> support a direct map:
>> I would suggest an approach that is loosely inspired by copying
>> garbage collection (or the segment cleaners in log-structured file
>> systems). Roughly, you manage the buffer map as a few spaces (or
>> segments). When you create a new mapping in one of these spaces,
>> you simply install the PTEs. When you decide to "garbage collect" a
>> space (or spaces), you perform a global TLB flush. Specifically,
>> you do something like toggling the bit in the cr4 register that
>> enables/disables support for the PG_G bit. If the spaces are
>> sufficiently large, then the number of such global TLB flushes
>> should be quite low. Every space would have an epoch number (or
>> flush number). In the buffer, you would record the epoch number
>> alongside the kernel virtual address. On access to the buffer, if
>> the epoch number was too old, then you would have to recreate the
>> buffer's mapping in a new space.
>
> Are the machines that don't have a direct map performance critical?
> My expectation is that they are legacy or embedded. This seems like
> a great project to do when the rest of the pieces are stable and
> fast. Until then they could just use something like pbufs?

I think the answer to your first question depends entirely on who you
are. :-)  Also, at the low end of the server space, there are many
people trying to promote arm-based systems. While FreeBSD may never
run on your arm-based phone, I think that ceding the arm-based server
market to others would be a strategic mistake.

Alan

P.S. I think we're moving the discussion too far away from kib's
original, so I suggest changing the subject line on any follow-ups.