From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 18:02:45 2015
Date: Sun, 26 Apr 2015 13:04:00 -0500
From: Jason Harmening
To: Konstantin Belousov
Cc: Svatopluk Kraus, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On 04/25/15 15:14, Konstantin Belousov wrote:
> On Sat, Apr 25, 2015 at 01:47:07PM -0500, Jason Harmening wrote:
>> On 04/25/15 13:18, Konstantin Belousov wrote:
>>> On Sat, Apr 25, 2015 at 12:55:13PM -0500, Jason Harmening wrote:
>>>> Ah, that looks much better.  A few things though:
>>>> 1) _bus_dmamap_load_ma (note the underscore) is still part of the MI/MD
>>>> interface, which we tell drivers not to use.
>>>> It looks like it's implemented for every arch though.  Should there
>>>> be a public and documented bus_dmamap_load_ma ?
>>> Might be yes.  But at least one consumer of the KPI must appear before
>>> the facility is introduced.
>> Could some of the GART/GTT code consume that?
> Do you mean, by GEM/GTT code ?  Indeed, this is interesting and probably
> workable suggestion.  I thought that I would need to provide a special
> interface from DMAR for the GEM, but your proposal seems to fit.  Still,
> an issue is that the Linux code is structured significantly differently,
> and this code, although isolated, is significantly divergent from the
> upstream.

Yes, GEM/GTT.  I know it would be useful for i915, maybe other drm2
drivers too.

>>>> 3) Using bus_dmamap_load_ma would mean always using physcopy for bounce
>>>> buffers...seems like the sfbufs would slow things down ?
>>> For amd64, sfbufs are a nop, due to the direct map.  But, I doubt that
>>> we can combine bounce buffers and performance in the same sentence.
>> In fact the amd64 implementation of uiomove_fromphys doesn't use sfbufs
>> at all thanks to the direct map.  sparc64 seems to avoid sfbufs as much
>> as possible too.  I don't know what arm64/aarch64 will be able to use.
>> Those seem like the platforms where bounce buffering would be the most
>> likely, along with i386 + PAE.  Bounce buffers might still be used on
>> 32-bit platforms for alignment or for devices with < 32-bit address
>> width, but those are likely to be old and slow anyway.
>>
>> I'm still a bit worried about the slowness of waiting for an sfbuf if
>> one is needed, but in practice that might not be a big issue.

I noticed the following in vm_map_delete, which is called by sys_munmap:

2956                  * Wait for wiring or unwiring of an entry to complete.
2957                  * Also wait for any system wirings to disappear on
2958                  * user maps.
2959                  */
2960                 if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0 ||
2961                     (vm_map_pmap(map) != kernel_pmap &&
2962                     vm_map_entry_system_wired_count(entry) != 0)) {
...
2970                         (void) vm_map_unlock_and_wait(map, 0);

It looks like munmap does wait on wired pages (well, system-wired pages,
not mlock'ed pages).  The system-wire count on the map entry will be
non-zero if vslock()/vm_map_wire(...VM_MAP_WIRE_SYSTEM...) was called on
it.  Does that mean UIO_USERSPACE dmamaps are actually safe from getting
the UVA taken out from under them?  Obviously it doesn't make bcopy safe
to do in the wrong process context, but that seems easily fixable.
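For concreteness, the pattern I have in mind looks roughly like the
untested sketch below -- "sc" and its fields, "callback", and the
single-iovec uio are made-up names for brevity, and error handling is
omitted:

-----------------------------
/* Untested sketch: wire the user buffer with vslock(), then load and
 * sync in the context of the owning process.  Error handling omitted. */
error = vslock(uio->uio_iov[0].iov_base, uio->uio_iov[0].iov_len);
if (error != 0)
        return (error);
error = bus_dmamap_load_uio(sc->dmat, sc->map, uio, callback, sc,
    BUS_DMA_NOWAIT);
/* ... start the transfer, then sleep until the "DMA-finished"
 * interrupt wakes us; we are still in the owning process context. */
bus_dmamap_sync(sc->dmat, sc->map, BUS_DMASYNC_POSTREAD);
bus_dmamap_unload(sc->dmat, sc->map);
vsunlock(uio->uio_iov[0].iov_base, uio->uio_iov[0].iov_len);
-----------------------------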
From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 19:56:58 2015
Date: Sun, 26 Apr 2015 21:56:57 +0200
From: Svatopluk Kraus
To: Jason Harmening
Cc: FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Sat, Apr 25, 2015 at 12:50 AM, Jason Harmening wrote:
> A couple of comments:
>
> --POSTWRITE and POSTREAD are only asynchronous if you call them from an
> asynchronous context.
> For a driver that's already performing DMA operations on usermode memory,
> it seems likely that there's going to be *some* place where you can call
> bus_dmamap_sync() and be guaranteed to be executing in the context of the
> process that owns the memory.  Then a regular bcopy will be safe and
> inexpensive, assuming the pages have been properly vslock-ed/vm_map_wire-d.
> That's usually whatever read/write/ioctl operation spawned the DMA
> transfer in the first place.  So, in those cases can you not just move
> the POSTREAD/POSTWRITE sync from the "DMA-finished" interrupt to the
> d_read/d_write/d_ioctl that waits on the "DMA-finished" interrupt?

Yes, it could be possible in those cases.  However, it implies that the
DMA unload must be moved as well, and, to make it symmetric, the DMA
load too.  Then the DMA driver just programs the hardware, and every
client must do the DMA load, sync, wait for finish, sync, and unload
itself.  So (1) almost the same code will be spread over many places,
and (2) all the resources taken by the DMA load will be held in the
system much longer.

> --physcopyin/physcopyout aren't trivial.  They go through
> uiomove_fromphys, which often uses sfbufs to create temporary KVA
> mappings for the physical pages.  sf_buf_alloc() can sleep (unless
> SFB_NOWAIT is specified, which means it can fail, and which
> uiomove_fromphys does not specify for good reason); that makes it
> unsafe for use in either a threaded interrupt or a filter.  Perhaps the
> physcopyout path could be changed to use pmap_qenter directly in this
> case, but that can still be expensive in terms of TLB shootdowns.

I thought that unmapped buffers are used to save KVA space.  For such
buffers physcopyin/physcopyout must be used already, so if there is some
slowdown, it is already taken into account.  And if that is good enough
for unmapped buffers, it should be good enough for user buffers as well.
I'm not so afraid of TLB shootdowns on ARM.  On the contrary, the
architecture is not DMA cache coherent, so cache maintenance is the main
concern: it must always be done for cached memory, bouncing or not.

> Checking against VM_MIN_KERNEL_ADDRESS seems sketchy; it eliminates the
> chance to use a much-less-expensive bcopy in cases where the sync is
> happening in correct process context.

Right, but it's the simplest solution.

> Context-switching during bus_dmamap_sync() shouldn't be an issue.  In a
> filter interrupt, curproc will be completely arbitrary, but none of
> this stuff should be called in a filter anyway.  Otherwise, if you're
> in a kernel thread (including an ithread), curproc should be whatever
> proc was supplied when the thread was created.  That's usually proc0,
> which only has kernel address space.  IOW, even if a context switch
> happens sometime during bus_dmamap_sync, any pmap check or copy should
> have a consistent and non-arbitrary process context.

That is a correct analysis under the given presumptions.  But why are
you so sure that this stuff should not be done in an interrupt filter?

> I think something like your second solution would be workable to make
> UIO_USERSPACE syncs work in non-interrupt kernel threads, but given all
> the restrictions and extra cost of physcopy, I'm not sure how useful
> that would be.

That, or a KASSERT to catch the bad context.  In fact, the second
solution does not close any door: if the sync is called in the correct
context, bcopy is used anyway, and if it's called in a bad context, some
extra work is done due to physcopyin/physcopyout.

> I do think busdma.9 could at least use a note that bus_dmamap_sync() is
> only safe to call in the context of the owning process for user buffers.

At least for now.  However, I would be unhappy if it stayed that way
forever.

> On Fri, Apr 24, 2015 at 8:13 AM, Svatopluk Kraus wrote:
>>
>> DMA can be done on a client buffer from user address space, for
>> example thru bus_dmamap_load_uio() when uio->uio_segflg is
>> UIO_USERSPACE.
>> Such a client buffer can bounce, and then it must be copied to and
>> from the bounce buffer in bus_dmamap_sync().
>>
>> Current implementations (in all archs) do not take into account that
>> bus_dmamap_sync() is asynchronous for POSTWRITE and POSTREAD in
>> general.  It can be asynchronous for PREWRITE and PREREAD too, for
>> example in driver implementations where DMA client buffer operations
>> are buffered.  In those cases, a simple bcopy() is not correct.
>>
>> Demonstration of the current implementation (x86) is the following:
>>
>> -----------------------------
>> struct bounce_page {
>>         vm_offset_t     vaddr;          /* kva of bounce buffer */
>>         bus_addr_t      busaddr;        /* physical address */
>>         vm_offset_t     datavaddr;      /* kva of client data */
>>         bus_addr_t      dataaddr;       /* client physical address */
>>         bus_size_t      datacount;      /* client data count */
>>         STAILQ_ENTRY(bounce_page) links;
>> };
>>
>> if ((op & BUS_DMASYNC_PREWRITE) != 0) {
>>         while (bpage != NULL) {
>>                 if (bpage->datavaddr != 0) {
>>                         bcopy((void *)bpage->datavaddr,
>>                             (void *)bpage->vaddr,
>>                             bpage->datacount);
>>                 } else {
>>                         physcopyout(bpage->dataaddr,
>>                             (void *)bpage->vaddr,
>>                             bpage->datacount);
>>                 }
>>                 bpage = STAILQ_NEXT(bpage, links);
>>         }
>>         dmat->bounce_zone->total_bounced++;
>> }
>> -----------------------------
>>
>> There are two things:
>>
>> (1) datavaddr is not always the kva of client data; sometimes it can
>> be the uva of client data.
>> (2) bcopy() can be used only if datavaddr is a kva, or when map->pmap
>> is the current pmap.
>>
>> Note that there is an implication for bus_dmamap_load_uio() with
>> uio->uio_segflg set to UIO_USERSPACE: the physical pages used must be
>> in-core and wired.  See "man bus_dma".
>>
>> There is no public interface to check that map->pmap is the current
>> pmap, so one solution is the following:
>>
>> if (bpage->datavaddr >= VM_MIN_KERNEL_ADDRESS) {
>>         bcopy();
>> } else {
>>         physcopy();
>> }
>>
>> If there were a public pmap_is_current(), then another solution would
>> be the following:
>>
>> if ((bpage->datavaddr != 0) && pmap_is_current(map->pmap)) {
>>         bcopy();
>> } else {
>>         physcopy();
>> }
>>
>> The second solution implies that a context switch must not happen
>> during bus_dmamap_sync() called from an interrupt routine.  However,
>> IMO, that is granted.
>>
>> Note that map->pmap should always be kernel_pmap for datavaddr >=
>> VM_MIN_KERNEL_ADDRESS.
>>
>> Comments, different solutions, or objections?
>>
>> Svatopluk Kraus
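To make the proposal concrete: a public pmap_is_current() could be as
small as the following sketch.  This is not an existing MI interface,
just roughly what some pmaps already do internally:

-----------------------------
/* Sketch only: not an existing MI KPI.  True when UVAs in the given
 * pmap are directly dereferencable on the current CPU. */
static inline bool
pmap_is_current(pmap_t pmap)
{

        return (pmap == kernel_pmap ||
            pmap == vmspace_pmap(curthread->td_proc->p_vmspace));
}
-----------------------------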
From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 20:00:33 2015
Date: Sun, 26 Apr 2015 22:00:32 +0200
From: Svatopluk Kraus
To: Konstantin Belousov
Cc: Jason Harmening, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Sat, Apr 25, 2015 at 11:41 AM, Konstantin Belousov wrote:
> On Fri, Apr 24, 2015 at 05:50:15PM -0500, Jason Harmening wrote:
>> A couple of comments:
>> [... same comments as quoted in full above ...]
> UIO_USERSPACE for busdma is absolutely unsafe and cannot be used
> without making the kernel panic.  Even if you wire the userspace
> buffer, there is nothing to prevent another thread in the user process,
> or another process sharing the same address space, from calling
> munmap(2) on the range.

Using vslock() is the method proposed in the bus_dma man page.  IMO, the
function looks complex and can be a big time eater.  However, are you
saying that vslock() does not work for that?  Then for what reason does
that function exist?

> The only safe method to work with userspace regions is to
> vm_fault_quick_hold() them to get a hold on the pages, and then either
> pass the page array down, or remap the pages in the KVA with
> pmap_qenter().

So, even vm_fault_quick_hold() does not keep the user mapping valid?

>> On Fri, Apr 24, 2015 at 8:13 AM, Svatopluk Kraus wrote:
>> [... original posting quoted in full above ...]
From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 20:08:32 2015
Date: Sun, 26 Apr 2015 22:08:31 +0200
From: Svatopluk Kraus
To: Konstantin Belousov
Cc: Jason Harmening, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Sat, Apr 25, 2015 at 7:28 PM, Konstantin Belousov wrote:
> On Sat, Apr 25, 2015 at 12:07:29PM -0500, Jason Harmening wrote:
>> On 04/25/15 11:34, Konstantin Belousov wrote:
>> > I believe UIO_USERSPACE is almost unused, it might be there for some
>> > obscure (and buggy) driver.
>> It may be nearly unused, but we still document it in busdma.9, and we
>> still explicitly check for it when setting the pmap in
>> _bus_dmamap_load_uio.  If it's not safe to use, then it's not OK for
>> us to do that.
>> We need to either a) remove support for it, by adding a failure/KASSERT
>> on UIO_USERSPACE in _bus_dmamap_load_uio() and removing the paragraph
>> on it from busdma.9, or b) make it safe.
>> I'd be in favor of b), because I think it is still valid to support
>> some non-painful way of using DMA with userspace buffers.  Right now,
>> the only safe way to do that seems to be:
>> 1) vm_fault_quick_hold_pages
>> 2) kva_alloc
>> 3) pmap_qenter
>> 4) bus_dmamap_load
> 1. vm_fault_quick_hold
> 2. bus_dmamap_load_ma
>
>> That seems both unnecessarily complex and wasteful of KVA space.
> The above sequence does not need a KVA allocation.

But if the buffer bounces, then some KVA must be allocated temporarily
for physcopyin/physcopyout anyway.

FYI, we are in the following situation on ARM: (1) the DMA is not cache
coherent, and (2) cache maintenance operations are done on virtual
addresses.  That means cache maintenance must be done for cached memory;
moreover, it must be done even for unmapped buffers, so they must be
mapped for it.  Thus it could be of much help if we could use the UVA
for that when the context is correct.
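To spell out the four-step sequence quoted above, an untested sketch
(error handling and teardown mostly omitted; "sc", "uaddr", "len", and
"dma_done_cb" are made-up names):

-----------------------------
/* Untested sketch of: 1) hold the user pages, 2) kva_alloc,
 * 3) pmap_qenter, 4) bus_dmamap_load.  Error handling omitted. */
vm_page_t ma[btoc(MAXPHYS) + 1];
vm_offset_t kva;
int count, error;

count = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
    (vm_offset_t)uaddr, len, VM_PROT_READ | VM_PROT_WRITE,
    ma, nitems(ma));
kva = kva_alloc(ptoa(count));
pmap_qenter(kva, ma, count);
error = bus_dmamap_load(sc->dmat, sc->map,
    (void *)(kva + ((vm_offset_t)uaddr & PAGE_MASK)),
    len, dma_done_cb, sc, BUS_DMA_NOWAIT);
/* ... and pmap_qremove()/kva_free()/vm_page_unhold_pages() on
 * teardown. */
-----------------------------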
From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 20:30:53 2015
Date: Sun, 26 Apr 2015 13:30:51 -0700
From: Adrian Chadd
To: "freebsd-arch@freebsd.org"
Subject: Re: RFT: numa policy branch

Hi!

Another update:

* updated to recent -HEAD;
* numactl now can set memory policy and cpuset domain information - so
  it's easy to say "this runs in memory domain X and cpu domain Y" in
  one pass with it;
* the locality matrix is now available.

Here's an example from scott's 2x haswell v3, with cluster-on-die
enabled (SLIT-style distances; local access is normalized to 10):

vm.phys_locality:
0: 10 21 31 31
1: 21 10 31 31
2: 31 31 10 21
3: 31 31 21 10

And on the westmere-ex box, with no SLIT table (-1 means no locality
information is available):

vm.phys_locality:
0: -1 -1 -1 -1
1: -1 -1 -1 -1
2: -1 -1 -1 -1
3: -1 -1 -1 -1

* I've tested it on westmere-ex (4x socket), sandybridge, ivybridge,
  haswell v3, and haswell v3 cluster-on-die.
* I've discovered that our implementation of libgomp (from gcc-4.2) is
  very old and doesn't include some of the thread control environment
  variables, grr.
* .. and the gcc libgomp code doesn't have freebsd thread affinity
  routines at all, so I added them to gcc-4.8.

Testing with a local copy of stream - using gcc-4.9 and the updated
libgomp to support thread pinning - shows that yes, it all works as
expected, and yes, for NUMA workloads it's quite a big difference.

I'd appreciate any reviews / testing people are able to provide.  I'm
about at the functionality point where I'd like to submit it for formal
review and try to land it in -HEAD.

-adrian
From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 08:14:59 2015
Date: Mon, 27 Apr 2015 11:14:53 +0300
From: Konstantin Belousov
To: Jason Harmening
Cc: Svatopluk Kraus, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Sun, Apr 26, 2015 at 01:04:00PM -0500, Jason Harmening wrote:
> [... earlier discussion and the vm_map_delete excerpt quoted in full
> above ...]
> It looks like munmap does wait on wired pages (well, system-wired
> pages, not mlock'ed pages).  The system-wire count on the map entry
> will be non-zero if vslock()/vm_map_wire(...VM_MAP_WIRE_SYSTEM...) was
> called on it.
> Does that mean UIO_USERSPACE dmamaps are actually safe from getting
> the UVA taken out from under them?
> Obviously it doesn't make bcopy safe to do in the wrong process
> context, but that seems easily fixable.

vslock() indeed would prevent the unmap, but it also causes very serious
user address space fragmentation.  vslock() carves a map entry covering
the specified region which, for the typical application use of malloced
memory for buffers, could easily fragment the bss into per-page map
entries.  It is not very important for the current vslock() use by the
sysctl code, since apps usually do a bounded number of sysctls at
startup, but it would definitely be an issue if vslock() appeared on the
i/o path.
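What "carves a map entry" means, schematically (an illustration of the
effect, not the actual vm_map_wire() code):

-----------------------------
/* vm_map_wire() clips the containing entry at both ends so the wired
 * count applies only to the requested range.  One anonymous bss entry
 * becomes three after wiring a page-sized buffer inside it:
 *
 *   before: [bss_start ........................... bss_end)
 *   after:  [bss_start, buf)[buf, buf+PAGE_SIZE)[buf+PAGE_SIZE, bss_end)
 *
 * Repeat that for many malloc'ed buffers and the map degenerates into
 * per-page entries. */
-----------------------------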
From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 10:48:51 2015
Date: Mon, 27 Apr 2015 12:48:41 +0200
From: Milan Obuch
To: Adrian Chadd
Cc: freebsd-arch@freebsd.org
Subject: Re: using libgpio to bitbang LCDs!

On Sat, 11 Apr 2015 23:45:55 -0700 Adrian Chadd wrote:

> Hi,
>
> The library source code and a demo program is available here:
>
> https://github.com/erikarn/freebsd-liblcd
>
> It includes the wiring needed to hook the example OLED board up
> (http://www.adafruit.com/products/684) to a Carambola 2 evaluation
> board.
>
> Anything you can get 5v and 5 GPIO pins from will work. (Well, as long
> as there's also libgpio / gpio API support for your device..)
>
> -adrian

Hi,

I downloaded master.zip from github, and now I am trying to modify it
for my hardware - a Raspberry Pi and a TFT display with the ILI9341
chip:
https://learn.adafruit.com/adafruit-pitft-28-inch-resistive-touchscreen-display-raspberry-pi
My original attempt to use SPI failed, probably because the ILI9341 does
not use pure SPI - there is one extension, the DC pin.  I did not manage
to satisfy the chip's timing/bit sequence/whatever, so I would like to
try bit banging.

I found two small issues with the downloaded archive.  First,
freebsd-liblcd-master/src/beastie_ili9340c_320x240/Makefile seems to be
a copy of freebsd-liblcd-master/src/beastie_ssd1351_128x128/Makefile;
I think there should be a difference.  After my fix:

--- beastie_ili9340c_320x240/Makefile	2015-04-23 21:40:10.693847000 +0200
+++ beastie_ssd1351_128x128/Makefile	2015-04-13 00:58:06.000000000 +0200
@@ -3,7 +3,7 @@

 .include

-PROG=beastie_ili9340c_320x240
+PROG=beastie_ssd1351_128x128
 SRCS=main.c
 CFLAGS+=-I../../lib/liblcd
 LDFLAGS+=-L../../lib/liblcd

The second issue is that when doing 'make install', the binaries
produced are installed into the / directory.  While neither is fatal for
me, they are annoyances at least.

I found the following in the source files:

	/* Configured for the carambola 2 board */
	cfg.gpio_unit = 0;
	cfg.pin_cs = 19;
	cfg.pin_rst = 20;
	cfg.pin_dc = 21;
	cfg.pin_sck = 22;
	cfg.pin_mosi = 23;

I see no 'pin_miso' there, so does this mean only unidirectional
communication is used, with no status reading, or are both reading and
writing carried over one pin?
If the former, then all I should do is change the pin numbers in the
excerpt above to ones valid for the Raspberry Pi; if the latter, I can't
use this without more modifications.

Also, I have a simple monochrome display with a PCD8544 chip, which
should use basically the same bus design, so it could be used for this
too, with some modification.

Regards,
Milan

From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 12:13:12 2015
Date: Mon, 27 Apr 2015 14:13:07 +0200
From: Milan Obuch
To: Adrian Chadd
Cc: freebsd-arch@freebsd.org
Subject: Re: using libgpio to bitbang LCDs!

On Mon, 27 Apr 2015 12:48:41 +0200 Milan Obuch wrote:
> [... previous message quoted in full above ...]
I decided just to try it.  I copied
freebsd-liblcd-master/src/beastie_ili9340c_320x240 to
freebsd-liblcd-master/src/beastie_ili9340c_320x240-1, just to keep
things clean (and to be able to revert easily if I screw something up
too much), and changed the configuration lines mentioned above to

	/* Configured for the Raspberry Pi board */
	cfg.gpio_unit = 0;
	cfg.pin_cs = 8;
	cfg.pin_rst = 0;
	cfg.pin_dc = 25;
	cfg.pin_sck = 11;
	cfg.pin_mosi = 10;

and it works, a bit slowly, but that could be expected.  Also note that
there is no reset pin connected on my display, so I put 0 there, which
may not be the best value, but it works.

One more issue here - the picture is turned upside down.  I have four
buttons below the screen, and I need to turn the display to be normally
readable so that they are on top... but this is not hard to solve...

I can use this display now; it is slow, but works.
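For reference, what I understand the library's bit-banged transfer to
boil down to, as an untested sketch against gpio(3) - this is my
reading, not the actual liblcd code; "h" would come from
gpio_open(cfg.gpio_unit), and the cfg pin values are the ones above:

-----------------------------
#include <libgpio.h>

/* Untested sketch: clock one byte out MSB first, the way a bit-banged
 * SPI-with-DC display transfer works. */
static void
lcd_write_byte(gpio_handle_t h, int is_data, uint8_t b)
{
	int i;

	/* DC high selects data, low selects command - the
	 * ILI9341/SSD1351 extension on top of plain SPI. */
	if (is_data)
		gpio_pin_high(h, cfg.pin_dc);
	else
		gpio_pin_low(h, cfg.pin_dc);
	gpio_pin_low(h, cfg.pin_cs);		/* assert chip select */
	for (i = 7; i >= 0; i--) {
		if (b & (1 << i))
			gpio_pin_high(h, cfg.pin_mosi);
		else
			gpio_pin_low(h, cfg.pin_mosi);
		gpio_pin_high(h, cfg.pin_sck);	/* latch on rising edge */
		gpio_pin_low(h, cfg.pin_sck);
	}
	gpio_pin_high(h, cfg.pin_cs);		/* deassert chip select */
}
-----------------------------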
Regards,
Milan

From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 14:21:36 2015
Date: Mon, 27 Apr 2015 16:21:35 +0200
From: Svatopluk Kraus
To: Konstantin Belousov
Cc: Jason Harmening, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Mon, Apr 27, 2015 at 10:14 AM, Konstantin Belousov wrote:
> On Sun, Apr 26, 2015 at 01:04:00PM -0500, Jason Harmening wrote:
>> [... thread quoted in full above ...]
> vslock() indeed would prevent the unmap, but it also causes very
> serious user address space fragmentation.  [...]  It is not very
> important for the current vslock() use by the sysctl code, since apps
> usually do a bounded number of sysctls at startup, but it would
> definitely be an issue if vslock() appeared on the i/o path.

In the scope of this thread, there are two things which must be
fulfilled during DMA operations:

(1) The affected physical pages must be kept in the system at any cost.
That means no swapping and no freeing.
(2) A DMA sync must be doable.  That means the physical pages must be
mapped somewhere, even if only temporarily, when needed.

Point (1) must be fulfilled by the DMA client in whatever way is
suitable for it.  It should not be part of any DMA load or unload
method.  The subject of this thread was meant to be point (2).  I have
no problem that it was extended to point (1) too; in fact, I welcome
that.

However, there are still two proposed solutions here for fixing the
bouncing of user space buffers.

The first solution is very simple: user space buffers are treated like
unmapped ones.
If a mapping is needed, some temporary KVA is used.

The second solution is simple too: if a mapping is needed and the
context is correct, the UVA is used; otherwise, some temporary KVA is
used.  I prefer this solution because, in the cache-non-coherent DMA
case, cache maintenance operations must be performed, and the buffer
must always have a valid mapping for them in the DMA sync.

I think that support for DMA from/to user space buffers is important
for graphics adapters, fast data grabbers, and whatever else needs fast
user process interaction with a device.  IMHO, there is no way to drop
support for it.  Thus some fix for the bouncing must be done in all
archs.

From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 14:46:53 2015
Date: Mon, 27 Apr 2015 09:46:52 -0500
From: Jason Harmening
To: Svatopluk Kraus
Cc: Konstantin Belousov, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

> Using vslock() is the method proposed in the bus_dma man page.  IMO,
> the function looks complex and can be a big time eater.  However, are
> you saying that vslock() does not work for that?  Then for what reason
> does that function exist?

There's been some misunderstanding here, I think.  If you use vslock (or
vm_map_wire, which vslock wraps), then the UVAs should be safe from
teardown, and you should be able to use bcopy if you are in the correct
context.  See the post elsewhere in this thread where I dig through the
sys_munmap path and find vm_map_delete waiting on system-wired map
entries.
> > >> >> >> > The only safe method to work with the userspace regions is to >> > vm_fault_quick_hold() them to get hold on the pages, and then either >> > pass pages array down, or remap them in the KVA with pmap_qenter(). >> > >> >> >> So, even vm_fault_quick_hold() does not keep valid user mapping? >> > vm_fault_quick_hold_pages() doesn't do any bookkeeping on UVAs, only the underlying physical pages. That means it is possible for the UVA region to be munmap'ed if vm_fault_quick_hold_pages() has been used. So if you use vm_fault_quick_hold_pages() instead of vslock(), you can't use bus_dmamap_load_uio(UIO_USERSPACE) because that assumes valid UVA mappings. You must instead deal only with the underlying vm_page_t's, which means using _bus_dmamap_load_ma(). Here's my take on it: vslock(), as you mention, is very complex. It not only keeps the physical pages from being swapped out, but it also removes them from page queues (see https://lists.freebsd.org/pipermail/freebsd-current/2015-March/054890.html) and does a lot of bookkeeping on the UVA mappings for those pages. Part of that involves simulating a pagefault, which as kib mentions can lead to a lot of UVA fragmentation. vm_fault_quick_hold_pages() is much cheaper and seems mostly intended for short-term DMA operations. So, you might use vslock() + bus_dmamap_load_uio() for long-duration DMA transfers, like continuous streaming to a circular buffer that could last minutes or longer. Then, the extra cost of the vslock will be amortized over the long time of the transfer, and UVA fragmentation will be less of a concern since you presumably will have a limited number of vslock() calls over the lifetime of the process. Also, you will probably be keeping the DMA map for a long duration anyway, so it should be OK to wait and call bus_dmamap_sync() in the process context. Since vslock() removed the pages from the page queues, there will also be less work for pagedaemon to do during the long transfer. OTOH, vm_fault_quick_hold_pages() + _bus_dmamap_load_ma() seems much better to do for frequent short transfers to widely-varying buffers, such as block I/O. The extra pagedaemon work is inconsequential here, and since the DMA operations are frequent and you may have many in-flight at once, the reduced setup cost and fragmentation are much more important. 
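To make the trade-off above concrete, a short-transfer setup along the lines Jason describes could look roughly like this. It is an editor's sketch only: error handling is minimal, len is assumed to fit in MAXPAGES pages, and the _bus_dmamap_load_ma() contract used here (in/out *segp holding the index of the last segment written, segs pointing at a caller-supplied array) is taken from the tree of that era and should be treated as an assumption rather than documentation.

	#include <sys/param.h>
	#include <sys/errno.h>
	#include <sys/proc.h>
	#include <machine/bus.h>	/* busdma KPI */
	#include <vm/vm.h>
	#include <vm/vm_extern.h>	/* vm_fault_quick_hold_pages() */
	#include <vm/vm_map.h>
	#include <vm/vm_page.h>

	#define	MAXPAGES	btoc(MAXPHYS)

	static int
	load_user_buffer(bus_dma_tag_t tag, bus_dmamap_t map, void *uva,
	    size_t len, bus_dma_segment_t *segs, int *nsegs)
	{
		vm_page_t ma[MAXPAGES];
		int count, error, segp;

		/* Hold the backing pages; no UVA bookkeeping is done. */
		count = vm_fault_quick_hold_pages(
		    &curproc->p_vmspace->vm_map, (vm_offset_t)uva, len,
		    VM_PROT_READ | VM_PROT_WRITE, ma, MAXPAGES);
		if (count == -1)
			return (EFAULT);

		/*
		 * Load by vm_page_t; the UVA may be munmap'ed afterwards
		 * without invalidating the transfer or the later sync.
		 */
		segp = -1;
		error = _bus_dmamap_load_ma(tag, map, ma, len,
		    (vm_offset_t)uva & PAGE_MASK, BUS_DMA_NOWAIT, segs, &segp);
		if (error != 0)
			vm_page_unhold_pages(ma, count);
		else
			*nsegs = segp + 1;
		return (error);
	}

The pages stay held until the driver calls vm_page_unhold_pages() after the transfer and the final bus_dmamap_sync(); the vslock() + bus_dmamap_load_uio(UIO_USERSPACE) pairing would instead wire the region once up front and keep the UVA mapping valid for the lifetime of a long-lived map.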
From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 16:13:07 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D13F2DC2 for ; Mon, 27 Apr 2015 16:13:07 +0000 (UTC) Received: from nm9-vm0.bullet.mail.bf1.yahoo.com (nm9-vm0.bullet.mail.bf1.yahoo.com [98.139.213.154]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 84D691973 for ; Mon, 27 Apr 2015 16:13:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1430151180; bh=fAyR+eXP0AOnpPupWvjSOb6S8LbSoHw4zMZ0WP8zVCE=; h=Date:From:To:Subject:From:Subject; b=UD2o/zIX0TahjIi/5DySmOQW8koiv/JsJBVWyngGn+h/oMWd56yn3GHzPb2dOSCEISNV94blkjYUasArYAHwg10Bi69TFI77rtBK3F8i0P2fXtFs2SJaYoYU/8P1JYF+COsD5uR1w8hXAIoH42KuJ1cnbnPx1kCkZH9619X7c49tLYhnsUfzB9KWiNMpf8icAOrdlb9dTv1spSn3kkW/KhG+btt1cGGn5tn8A+oCaE5w4AHWBPN/CGO3avN+F8enp0xoxgzf/4aZPouqtCHAR/OaWmEdAkg1HvNBtjp/w4Qr6h8rOIdCy7Ezi7kzBhz/xa3XazYMhMdPL1jo34RuBg== Received: from [98.139.170.178] by nm9.bullet.mail.bf1.yahoo.com with NNFMP; 27 Apr 2015 16:13:00 -0000 Received: from [98.139.213.9] by tm21.bullet.mail.bf1.yahoo.com with NNFMP; 27 Apr 2015 16:13:00 -0000 Received: from [127.0.0.1] by smtp109.mail.bf1.yahoo.com with NNFMP; 27 Apr 2015 16:13:00 -0000 X-Yahoo-Newman-Id: 54358.14564.bm@smtp109.mail.bf1.yahoo.com X-Yahoo-Newman-Property: ymail-3 X-YMail-OSG: IL1kViAVM1l9Yr9_15h2mRnP.V_3gpiWMYxO8Jsu2vRRndM jrBWlbJ6ThFsICgFYoGYs66CulABKxfOf.fDbDmaaWcQsEIeMohMCGOPvNXl aAaUnmNhXDq.9qqA_r0Az3HH66YNNi5qOv6zGDtvUzCDsimjOYg5KnzR0_QL BhRomkLuwJmQu9TSQO3xJjATLEifXMk4sCyDWrfhpAUsTGkPoooYBQUs_E7O r6EpjlTnUD1XWxhp0ayrWZaiaRxK_qi5YeyFjAeJRYkWJj6BbiVoV5ay5td6 rUpfD6xhjn552A3IGABY0eNpWfpdOG0mGyGxqaEPw6cqQbJ0TEFHUd1J8wjd eFaWrW37HEWuYPyXIQx0EmN4GfgG5YLU_qeUQ39g8J53HIh4.S4YoFCKayLv 5q0lrZrnEu204dXwEKsNMBUCqud79Mufwl9d9IgcHARBbU_y1l4ubGsTN9J7 YnwEH0pzlgt__pAvhg3jxQEd5JI3e1KcKa0p656E1C9e2DfPAKWrYjI5eAvF 8rzrzOLE1gfeB6GsHFSee1hzmvV5cFjRJ X-Yahoo-SMTP: xcjD0guswBAZaPPIbxpWwLcp9Unf Message-ID: <553E600D.2000405@FreeBSD.org> Date: Mon, 27 Apr 2015 11:13:01 -0500 From: Pedro Giffuni User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Adrian Chadd , FreeBSD-arch list Subject: Re: RFT: numa policy branch Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Apr 2015 16:13:07 -0000 Hello; Well, I figure it may help the effort so I created a GPLv2 patch[1] to add the CPU affinity code to our aging libgomp. It is taken from GCC-pre43 branch, so no idea how well it works and you are basically on your own, but if it doesn't break anything I can commit it later today. Thanks for all the great work on NUMA! Pedro. 
[1] https://people.freebsd.org/~pfg/patches/libgomp-GCCr123494.diff From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 17:41:35 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 5B302871; Mon, 27 Apr 2015 17:41:35 +0000 (UTC) Received: from mail-ig0-x234.google.com (mail-ig0-x234.google.com [IPv6:2607:f8b0:4001:c05::234]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 27ABC1434; Mon, 27 Apr 2015 17:41:35 +0000 (UTC) Received: by igbhj9 with SMTP id hj9so68529556igb.1; Mon, 27 Apr 2015 10:41:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=6E0lhif+d/w1YUY5jLgZUzZVFKZnLINzX5SImMV2wiY=; b=0WRgKeBm2wADJdK/oLuX6A1R65XcORZ+8wagpH3M1/2dZeXfpIVeoo1a8IeVnoMvPY A3Jt1IbmoHwP3armsMP/PuJX9JOoRI959gu1ltj5O1io6BZpQGT9Zx1ougM6qjhLntuZ twFW8LAwzhGjyRtokiRIwg8pod4gNh5ftNgujmXcZ19fpR44BlgQHYXJlrVhdA9fERbq ODRkWXuATXBj+S32ZOgt3RO3zv+XTXa1+azPGsbT8AxDOsOnfWrbUzGBsZJfrLTjuLpW oeyU8EtUSeQOYlj0z0RCA8eaghgBmYva7EkQ0dQ0Shxm/TAC+St+qvZaO+uXHk2U5z6J A1dQ== MIME-Version: 1.0 X-Received: by 10.107.46.39 with SMTP id i39mr15362517ioo.8.1430156494606; Mon, 27 Apr 2015 10:41:34 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.36.38.133 with HTTP; Mon, 27 Apr 2015 10:41:34 -0700 (PDT) In-Reply-To: <553E600D.2000405@FreeBSD.org> References: <553E600D.2000405@FreeBSD.org> Date: Mon, 27 Apr 2015 10:41:34 -0700 X-Google-Sender-Auth: 6LGBa85RH-o9tUlh_U80M-aykTw Message-ID: Subject: Re: RFT: numa policy branch From: Adrian Chadd To: Pedro Giffuni Cc: FreeBSD-arch list Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Apr 2015 17:41:35 -0000 Hi! Would you mind seeing if we can do the proc bind option too? That's apparently quite popular. 
-a
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 02:34:08 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 488993E0; Tue, 28 Apr 2015 02:34:08 +0000 (UTC) Received: from mail-wg0-x229.google.com (mail-wg0-x229.google.com [IPv6:2a00:1450:400c:c00::229]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id D81F210CC; Tue, 28 Apr 2015 02:34:07 +0000 (UTC) Received: by wgso17 with SMTP id o17so135659293wgs.1; Mon, 27 Apr 2015 19:34:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id; bh=4GG0lUEEMlCsfoUhPKIrpbqLcfVpZ6YgUj9kAX+Q3j4=; b=EsUuRKNEamy+KrisMIA4Ax1PMjk91fzPWGqapdn2m7aeS6sKt87b+fNbrJoCYEDhtH SxmhTlOS6cbhatxldzQA1mMYJRszE9ky50sxnREfrEFlxET0x8A+qDNh80c/MqbI7wMF GT6WQZjiHUMs6N/rDZGI8zicBB7RHq/ZkNmpnN3JDeygB8dVftN1cPupx8e7QPz32IZS 2sf/u9mezRiagVhR9rbWVedYUjrJvLDbBtY+e0one0pnw1HZYGgjshr3hVBNOzr0rYgp JMNud+zoW97SYhZUAbJVXhOmeaNqrT77srFvdsPEweU/L7AmahDpZm1IKxswY4etQW1s AazQ== X-Received: by 10.194.222.197 with SMTP id qo5mr27446540wjc.142.1430188446430; Mon, 27 Apr 2015 19:34:06 -0700 (PDT) Received: from localhost.localdomain (ip-89-102-11-63.net.upcbroadband.cz. [89.102.11.63]) by mx.google.com with ESMTPSA id fo7sm14118352wic.1.2015.04.27.19.34.05 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 19:34:05 -0700 (PDT) From: Mateusz Guzik To: freebsd-arch@freebsd.org Cc: Mateusz Guzik Subject: [PATCH 0/2] generalised cow per-thread structs Date: Tue, 28 Apr 2015 04:34:01 +0200 Message-Id: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> X-Mailer: git-send-email 1.8.3.1 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 02:34:08 -0000 From: Mateusz Guzik

struct ucred is managed per thread as follows: setuid and the like update the pointer in struct proc; on the kernel<->userspace boundary it is checked whether the thread needs updating.

This scheme is useful for other structures as well, so this patch generalises it by introducing a counter which is compared instead. This prevents the introduction of further comparisons as such structures are added.

The first patch just adds convenience funcs and adjusts cred handling to use them. The second patch implements lockless resource limits.

The bigger goal concerns struct filedesc: the plan is to split it into an fd part and a vnode part. The latter is seldom modified, so it could be accessed locklessly, and with further effort we can save some refs/unrefs on vnodes, since we will be sure they cannot go away.

Mateusz Guzik (2):
  Generalised support for copy-on-write structures shared by threads.
  Implement lockless resource limits.
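Distilled to its essence, the change in the patches that follow replaces one boundary-time pointer comparison per per-thread structure with a single generation check. The fragments below are paraphrased from the diffs themselves, not new code:

	/* Before: one comparison per COW'd structure on the boundary. */
	if (td->td_ucred != p->p_ucred)
		cred_update_thread(td);

	/*
	 * After: a single counter, bumped under the proc lock whenever
	 * any COW'd structure is replaced, guards them all.
	 */
	if (td->td_cowgeneration != p->p_cowgeneration)
		thread_update_cow(td);	/* re-caches td_ucred, td_limit, ... */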
contrib/binutils/ld/emultempl/spu_ovl.o | Bin 1432 -> 0 bytes sys/amd64/amd64/trap.c | 4 +- sys/arm/arm/trap-v6.c | 4 +- sys/arm/arm/trap.c | 11 +++-- sys/i386/i386/trap.c | 4 +- sys/kern/imgact_elf.c | 13 +++--- sys/kern/init_main.c | 8 ++-- sys/kern/kern_descrip.c | 24 +++++----- sys/kern/kern_event.c | 6 +-- sys/kern/kern_exec.c | 4 +- sys/kern/kern_fork.c | 7 ++- sys/kern/kern_kthread.c | 2 +- sys/kern/kern_proc.c | 7 +-- sys/kern/kern_prot.c | 5 ++- sys/kern/kern_resource.c | 77 +++++++++++++++++++------------- sys/kern/kern_sig.c | 2 +- sys/kern/kern_syscalls.c | 3 ++ sys/kern/kern_thr.c | 6 +-- sys/kern/kern_thread.c | 49 ++++++++++++++++++-- sys/kern/subr_syscall.c | 4 +- sys/kern/subr_trap.c | 4 +- sys/kern/subr_uio.c | 4 +- sys/kern/sysv_shm.c | 4 +- sys/kern/tty_pts.c | 4 +- sys/kern/uipc_sockbuf.c | 4 +- sys/kern/vfs_vnops.c | 7 ++- sys/powerpc/powerpc/trap.c | 4 +- sys/sparc64/sparc64/trap.c | 4 +- sys/sys/proc.h | 14 +++++- sys/sys/resourcevar.h | 9 ++-- sys/sys/vnode.h | 2 +- sys/vm/swap_pager.c | 4 +- sys/vm/vm_map.c | 14 +++--- sys/vm/vm_mmap.c | 34 +++++++------- sys/vm/vm_pageout.c | 2 +- sys/vm/vm_unix.c | 8 ++-- 36 files changed, 208 insertions(+), 154 deletions(-) delete mode 100644 contrib/binutils/ld/emultempl/spu_ovl.o -- 2.3.6 From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 02:34:11 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 0446647B; Tue, 28 Apr 2015 02:34:11 +0000 (UTC) Received: from mail-wi0-x232.google.com (mail-wi0-x232.google.com [IPv6:2a00:1450:400c:c05::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 7923710CE; Tue, 28 Apr 2015 02:34:10 +0000 (UTC) Received: by wicmx19 with SMTP id mx19so97303024wic.1; Mon, 27 Apr 2015 19:34:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=nbz4ZCJ5M6wNvU2eAyQ4Xkizd8PPaGDx+1u9eMA4IC4=; b=AMaVomoBm71KsxoN4UaBZthsGCSjQbCrMwjQSgPNmgOW0J6fscliQEJ00WMWNO9ZDf tNkIt4n+uv6+7Im7nqyMZRpE8LvV6mm4/bUarhZrXAv9cTxzPBNoRhViYf/mUIGjBPAk rD8UlElTpxpE8IpaKNcmSGqEVd3j/cWYLVJpCqzGt56JY0wIdr0PBDtUR/0de+K8JW6h 0ToT7Y0YnAdKUNEZSOk4ZlT5WEp9MkLNfKP7jmClOdkg2TV+IvM/y48gWgo7hl2V6JiS ZeL0gn3fGnoxf64/f6yKpuoej0Z3WF/9gRdE4X52LkWVFw0KiVzp/JxBN/6Oe7mtILNz q83g== X-Received: by 10.195.11.202 with SMTP id ek10mr27213692wjd.12.1430188449019; Mon, 27 Apr 2015 19:34:09 -0700 (PDT) Received: from localhost.localdomain (ip-89-102-11-63.net.upcbroadband.cz. [89.102.11.63]) by mx.google.com with ESMTPSA id fo7sm14118352wic.1.2015.04.27.19.34.07 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 19:34:08 -0700 (PDT) From: Mateusz Guzik To: freebsd-arch@freebsd.org Cc: Mateusz Guzik Subject: [PATCH 2/2] Implement lockless resource limits. 
Date: Tue, 28 Apr 2015 04:34:03 +0200 Message-Id: <1430188443-19413-3-git-send-email-mjguzik@gmail.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> References: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 02:34:11 -0000 From: Mateusz Guzik Employ the same mechanism which is used to manage per-thread credentials. --- sys/kern/imgact_elf.c | 13 ++++---- sys/kern/kern_descrip.c | 24 +++++++-------- sys/kern/kern_event.c | 6 +--- sys/kern/kern_exec.c | 4 +-- sys/kern/kern_fork.c | 4 +-- sys/kern/kern_proc.c | 7 ++--- sys/kern/kern_resource.c | 77 ++++++++++++++++++++++++++++-------------------- sys/kern/kern_sig.c | 2 +- sys/kern/kern_syscalls.c | 1 + sys/kern/kern_thread.c | 6 ++++ sys/kern/subr_uio.c | 4 +-- sys/kern/sysv_shm.c | 4 +-- sys/kern/tty_pts.c | 4 +-- sys/kern/uipc_sockbuf.c | 4 +-- sys/kern/vfs_vnops.c | 7 ++--- sys/sys/proc.h | 3 +- sys/sys/resourcevar.h | 9 ++++-- sys/sys/vnode.h | 2 +- sys/vm/swap_pager.c | 4 +-- sys/vm/vm_map.c | 14 ++++----- sys/vm/vm_mmap.c | 34 +++++++++++---------- sys/vm/vm_pageout.c | 2 +- sys/vm/vm_unix.c | 8 ++--- 23 files changed, 122 insertions(+), 121 deletions(-) diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c index 39e4df3..ff3a371 100644 --- a/sys/kern/imgact_elf.c +++ b/sys/kern/imgact_elf.c @@ -900,13 +900,17 @@ __CONCAT(exec_, __elfN(imgact))(struct image_params *imgp) * limits after loading the segments since we do * not actually fault in all the segments pages. */ +#ifdef RACCT PROC_LOCK(imgp->proc); - if (data_size > lim_cur(imgp->proc, RLIMIT_DATA) || +#endif + if (data_size > lim_cur(curthread, RLIMIT_DATA) || text_size > maxtsiz || - total_size > lim_cur(imgp->proc, RLIMIT_VMEM) || + total_size > lim_cur(curthread, RLIMIT_VMEM) || racct_set(imgp->proc, RACCT_DATA, data_size) != 0 || racct_set(imgp->proc, RACCT_VMEM, total_size) != 0) { +#ifdef RACCT PROC_UNLOCK(imgp->proc); +#endif return (ENOMEM); } @@ -922,9 +926,8 @@ __CONCAT(exec_, __elfN(imgact))(struct image_params *imgp) * calculation is that it leaves room for the heap to grow to * its maximum allowed size. */ - addr = round_page((vm_offset_t)vmspace->vm_daddr + lim_max(imgp->proc, + addr = round_page((vm_offset_t)vmspace->vm_daddr + lim_max(curthread, RLIMIT_DATA)); - PROC_UNLOCK(imgp->proc); imgp->entry_addr = entry; @@ -1963,7 +1966,7 @@ note_procstat_rlimit(void *arg, struct sbuf *sb, size_t *sizep) sbuf_bcat(sb, &structsize, sizeof(structsize)); PROC_LOCK(p); for (i = 0; i < RLIM_NLIMITS; i++) - lim_rlimit(p, i, &rlim[i]); + lim_rlimit_proc(p, i, &rlim[i]); PROC_UNLOCK(p); sbuf_bcat(sb, rlim, sizeof(rlim)); } diff --git a/sys/kern/kern_descrip.c b/sys/kern/kern_descrip.c index f3f27bf..cc7b276 100644 --- a/sys/kern/kern_descrip.c +++ b/sys/kern/kern_descrip.c @@ -109,7 +109,7 @@ static void fdgrowtable(struct filedesc *fdp, int nfd); static void fdgrowtable_exp(struct filedesc *fdp, int nfd); static void fdunused(struct filedesc *fdp, int fd); static void fdused(struct filedesc *fdp, int fd); -static int getmaxfd(struct proc *p); +static int getmaxfd(struct thread *td); /* Flags for do_dup() */ #define DUP_FIXED 0x1 /* Force fixed allocation. 
*/ @@ -331,16 +331,19 @@ struct getdtablesize_args { int sys_getdtablesize(struct thread *td, struct getdtablesize_args *uap) { - struct proc *p = td->td_proc; +#ifdef RACCT uint64_t lim; +#endif - PROC_LOCK(p); td->td_retval[0] = - min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc); + min((int)lim_cur(td, RLIMIT_NOFILE), maxfilesperproc); +#ifdef RACCT + PROC_LOCK(p); lim = racct_get_limit(td->td_proc, RACCT_NOFILE); PROC_UNLOCK(p); if (lim < td->td_retval[0]) td->td_retval[0] = lim; +#endif return (0); } @@ -785,15 +788,10 @@ kern_fcntl(struct thread *td, int fd, int cmd, intptr_t arg) } static int -getmaxfd(struct proc *p) +getmaxfd(struct thread *td) { - int maxfd; - - PROC_LOCK(p); - maxfd = min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc); - PROC_UNLOCK(p); - return (maxfd); + return (min((int)lim_cur(td, RLIMIT_NOFILE), maxfilesperproc)); } /* @@ -821,7 +819,7 @@ do_dup(struct thread *td, int flags, int old, int new) return (EBADF); if (new < 0) return (flags & DUP_FCNTL ? EINVAL : EBADF); - maxfd = getmaxfd(p); + maxfd = getmaxfd(td); if (new >= maxfd) return (flags & DUP_FCNTL ? EINVAL : EBADF); @@ -1619,7 +1617,7 @@ fdalloc(struct thread *td, int minfd, int *result) if (fdp->fd_freefile > minfd) minfd = fdp->fd_freefile; - maxfd = getmaxfd(p); + maxfd = getmaxfd(td); /* * Search the bitmap for a free descriptor starting at minfd. diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index e01f12c..618a68e 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -747,14 +747,10 @@ sys_kqueue(struct thread *td, struct kqueue_args *uap) p = td->td_proc; cred = td->td_ucred; crhold(cred); - PROC_LOCK(p); - if (!chgkqcnt(cred->cr_ruidinfo, 1, lim_cur(td->td_proc, - RLIMIT_KQUEUES))) { - PROC_UNLOCK(p); + if (!chgkqcnt(cred->cr_ruidinfo, 1, lim_cur(td, RLIMIT_KQUEUES))) { crfree(cred); return (ENOMEM); } - PROC_UNLOCK(p); fdp = p->p_fd; error = falloc(td, &fp, &fd, 0); diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c index 9d893f8..751f153 100644 --- a/sys/kern/kern_exec.c +++ b/sys/kern/kern_exec.c @@ -1061,9 +1061,7 @@ exec_new_vmspace(imgp, sv) /* Allocate a new stack */ if (imgp->stack_sz != 0) { ssiz = trunc_page(imgp->stack_sz); - PROC_LOCK(p); - lim_rlimit(p, RLIMIT_STACK, &rlim_stack); - PROC_UNLOCK(p); + lim_rlimit(curthread, RLIMIT_STACK, &rlim_stack); if (ssiz > rlim_stack.rlim_max) ssiz = rlim_stack.rlim_max; if (ssiz > rlim_stack.rlim_cur) { diff --git a/sys/kern/kern_fork.c b/sys/kern/kern_fork.c index d04c3e3..6cde199 100644 --- a/sys/kern/kern_fork.c +++ b/sys/kern/kern_fork.c @@ -912,10 +912,8 @@ fork1(struct thread *td, int flags, int pages, struct proc **procp, if (error == 0) ok = chgproccnt(td->td_ucred->cr_ruidinfo, 1, 0); else { - PROC_LOCK(p1); ok = chgproccnt(td->td_ucred->cr_ruidinfo, 1, - lim_cur(p1, RLIMIT_NPROC)); - PROC_UNLOCK(p1); + lim_cur(td, RLIMIT_NPROC)); } if (ok) { do_fork(td, flags, newproc, td2, vm2, pdflags); diff --git a/sys/kern/kern_proc.c b/sys/kern/kern_proc.c index 505521d..0708d71 100644 --- a/sys/kern/kern_proc.c +++ b/sys/kern/kern_proc.c @@ -2597,11 +2597,8 @@ sysctl_kern_proc_rlimit(SYSCTL_HANDLER_ARGS) /* * Retrieve limit. 
*/ - if (req->oldptr != NULL) { - PROC_LOCK(p); - lim_rlimit(p, which, &rlim); - PROC_UNLOCK(p); - } + if (req->oldptr != NULL) + lim_rlimit(curthread, which, &rlim); error = SYSCTL_OUT(req, &rlim, sizeof(rlim)); if (error != 0) goto errout; diff --git a/sys/kern/kern_resource.c b/sys/kern/kern_resource.c index dac49cd..bc677dc 100644 --- a/sys/kern/kern_resource.c +++ b/sys/kern/kern_resource.c @@ -560,15 +560,11 @@ ogetrlimit(struct thread *td, register struct ogetrlimit_args *uap) { struct orlimit olim; struct rlimit rl; - struct proc *p; int error; if (uap->which >= RLIM_NLIMITS) return (EINVAL); - p = td->td_proc; - PROC_LOCK(p); - lim_rlimit(p, uap->which, &rl); - PROC_UNLOCK(p); + lim_rlimit(td, uap->which, &rl); /* * XXX would be more correct to convert only RLIM_INFINITY to the @@ -625,7 +621,7 @@ lim_cb(void *arg) } PROC_STATUNLOCK(p); if (p->p_rux.rux_runtime > p->p_cpulimit * cpu_tickrate()) { - lim_rlimit(p, RLIMIT_CPU, &rlim); + lim_rlimit_proc(p, RLIMIT_CPU, &rlim); if (p->p_rux.rux_runtime >= rlim.rlim_max * cpu_tickrate()) { killproc(p, "exceeded maximum CPU limit"); } else { @@ -667,29 +663,21 @@ kern_proc_setrlimit(struct thread *td, struct proc *p, u_int which, limp->rlim_max = RLIM_INFINITY; oldssiz.rlim_cur = 0; - newlim = NULL; + newlim = lim_alloc(); PROC_LOCK(p); - if (lim_shared(p->p_limit)) { - PROC_UNLOCK(p); - newlim = lim_alloc(); - PROC_LOCK(p); - } oldlim = p->p_limit; alimp = &oldlim->pl_rlimit[which]; if (limp->rlim_cur > alimp->rlim_max || limp->rlim_max > alimp->rlim_max) if ((error = priv_check(td, PRIV_PROC_SETRLIMIT))) { PROC_UNLOCK(p); - if (newlim != NULL) - lim_free(newlim); + lim_free(newlim); return (error); } if (limp->rlim_cur > limp->rlim_max) limp->rlim_cur = limp->rlim_max; - if (newlim != NULL) { - lim_copy(newlim, oldlim); - alimp = &newlim->pl_rlimit[which]; - } + lim_copy(newlim, oldlim); + alimp = &newlim->pl_rlimit[which]; switch (which) { @@ -739,11 +727,10 @@ kern_proc_setrlimit(struct thread *td, struct proc *p, u_int which, if (p->p_sysent->sv_fixlimit != NULL) p->p_sysent->sv_fixlimit(limp, which); *alimp = *limp; - if (newlim != NULL) - p->p_limit = newlim; + p->p_limit = newlim; + PROC_UPDATE_COW(p); PROC_UNLOCK(p); - if (newlim != NULL) - lim_free(oldlim); + lim_free(oldlim); if (which == RLIMIT_STACK && /* @@ -793,15 +780,11 @@ int sys_getrlimit(struct thread *td, register struct __getrlimit_args *uap) { struct rlimit rlim; - struct proc *p; int error; if (uap->which >= RLIM_NLIMITS) return (EINVAL); - p = td->td_proc; - PROC_LOCK(p); - lim_rlimit(p, uap->which, &rlim); - PROC_UNLOCK(p); + lim_rlimit(td, uap->which, &rlim); error = copyout(&rlim, uap->rlp, sizeof(struct rlimit)); return (error); } @@ -1172,11 +1155,11 @@ lim_copy(struct plimit *dst, struct plimit *src) * which parameter specifies the index into the rlimit array. */ rlim_t -lim_max(struct proc *p, int which) +lim_max(struct thread *td, int which) { struct rlimit rl; - lim_rlimit(p, which, &rl); + lim_rlimit(td, which, &rl); return (rl.rlim_max); } @@ -1185,11 +1168,11 @@ lim_max(struct proc *p, int which) * The which parameter which specifies the index into the rlimit array */ rlim_t -lim_cur(struct proc *p, int which) +lim_cur(struct thread *td, int which) { struct rlimit rl; - lim_rlimit(p, which, &rl); + lim_rlimit(td, which, &rl); return (rl.rlim_cur); } @@ -1198,7 +1181,23 @@ lim_cur(struct proc *p, int which) * specified by 'which' in the rlimit structure pointed to by 'rlp'. 
*/ void -lim_rlimit(struct proc *p, int which, struct rlimit *rlp) +lim_rlimit(struct thread *td, int which, struct rlimit *rlp) +{ + struct proc *p = td->td_proc; + + MPASS(td == curthread); + KASSERT(which >= 0 && which < RLIM_NLIMITS, + ("request for invalid resource limit")); + *rlp = td->td_limit->pl_rlimit[which]; + if (p->p_sysent->sv_fixlimit != NULL) + p->p_sysent->sv_fixlimit(rlp, which); +} + +/* + * Same as lim_rlimit but can be used with non-curthread. + */ +void +lim_rlimit_proc(struct proc *p, int which, struct rlimit *rlp) { PROC_LOCK_ASSERT(p, MA_OWNED); @@ -1441,3 +1440,17 @@ chgkqcnt(struct uidinfo *uip, int diff, rlim_t max) } return (1); } + +void +lim_update_thread(struct thread *td) +{ + struct proc *p; + struct plimit *lim; + + p = td->td_proc; + lim = td->td_limit; + PROC_LOCK_ASSERT(p, MA_OWNED); + td->td_limit = lim_hold(p->p_limit); + if (lim != NULL) + lim_free(lim); +} diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c index 154c250..07a586f 100644 --- a/sys/kern/kern_sig.c +++ b/sys/kern/kern_sig.c @@ -3304,7 +3304,7 @@ coredump(struct thread *td) * a corefile is truncated instead of not being created, * if it is larger than the limit. */ - limit = (off_t)lim_cur(p, RLIMIT_CORE); + limit = (off_t)lim_cur(td, RLIMIT_CORE); if (limit == 0 || racct_get_available(p, RACCT_CORE) == 0) { PROC_UNLOCK(p); return (EFBIG); diff --git a/sys/kern/kern_syscalls.c b/sys/kern/kern_syscalls.c index 3d3df01..15574be 100644 --- a/sys/kern/kern_syscalls.c +++ b/sys/kern/kern_syscalls.c @@ -33,6 +33,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include #include diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c index df8511b..79e9c50 100644 --- a/sys/kern/kern_thread.c +++ b/sys/kern/kern_thread.c @@ -386,6 +386,7 @@ thread_get_cow_proc(struct thread *newtd, struct proc *p) PROC_LOCK_ASSERT(p, MA_OWNED); newtd->td_ucred = crhold(p->p_ucred); + newtd->td_limit = lim_hold(p->p_limit); newtd->td_cowgeneration = p->p_cowgeneration; } @@ -394,6 +395,7 @@ thread_get_cow(struct thread *newtd, struct thread *td) { newtd->td_ucred = crhold(td->td_ucred); + newtd->td_limit = lim_hold(td->td_limit); newtd->td_cowgeneration = td->td_cowgeneration; } @@ -403,6 +405,8 @@ thread_free_cow(struct thread *td) if (td->td_ucred) crfree(td->td_ucred); + if (td->td_limit) + lim_free(td->td_limit); } void @@ -414,6 +418,8 @@ thread_update_cow(struct thread *td) PROC_LOCK(p); if (td->td_ucred != p->p_ucred) cred_update_thread(td); + if (td->td_limit != p->p_limit) + lim_update_thread(td); td->td_cowgeneration = p->p_cowgeneration; PROC_UNLOCK(p); } diff --git a/sys/kern/subr_uio.c b/sys/kern/subr_uio.c index 87892fd..570298f 100644 --- a/sys/kern/subr_uio.c +++ b/sys/kern/subr_uio.c @@ -409,10 +409,8 @@ copyout_map(struct thread *td, vm_offset_t *addr, size_t sz) /* * Map somewhere after heap in process memory. */ - PROC_LOCK(td->td_proc); *addr = round_page((vm_offset_t)vms->vm_daddr + - lim_max(td->td_proc, RLIMIT_DATA)); - PROC_UNLOCK(td->td_proc); + lim_max(td, RLIMIT_DATA)); /* round size up to page boundry */ size = (vm_size_t)round_page(sz); diff --git a/sys/kern/sysv_shm.c b/sys/kern/sysv_shm.c index 274deda..00e3c0a 100644 --- a/sys/kern/sysv_shm.c +++ b/sys/kern/sysv_shm.c @@ -380,10 +380,8 @@ kern_shmat_locked(struct thread *td, int shmid, const void *shmaddr, * This is just a hint to vm_map_find() about where to * put it. 
*/ - PROC_LOCK(p); attach_va = round_page((vm_offset_t)p->p_vmspace->vm_daddr + - lim_max(p, RLIMIT_DATA)); - PROC_UNLOCK(p); + lim_max(td, RLIMIT_DATA)); } vm_object_reference(shmseg->object); diff --git a/sys/kern/tty_pts.c b/sys/kern/tty_pts.c index 2d1e8fe..fcc9c47 100644 --- a/sys/kern/tty_pts.c +++ b/sys/kern/tty_pts.c @@ -741,7 +741,7 @@ pts_alloc(int fflags, struct thread *td, struct file *fp) PROC_UNLOCK(p); return (EAGAIN); } - ok = chgptscnt(cred->cr_ruidinfo, 1, lim_cur(p, RLIMIT_NPTS)); + ok = chgptscnt(cred->cr_ruidinfo, 1, lim_cur(td, RLIMIT_NPTS)); if (!ok) { racct_sub(p, RACCT_NPTS, 1); PROC_UNLOCK(p); @@ -795,7 +795,7 @@ pts_alloc_external(int fflags, struct thread *td, struct file *fp, PROC_UNLOCK(p); return (EAGAIN); } - ok = chgptscnt(cred->cr_ruidinfo, 1, lim_cur(p, RLIMIT_NPTS)); + ok = chgptscnt(cred->cr_ruidinfo, 1, lim_cur(td, RLIMIT_NPTS)); if (!ok) { racct_sub(p, RACCT_NPTS, 1); PROC_UNLOCK(p); diff --git a/sys/kern/uipc_sockbuf.c b/sys/kern/uipc_sockbuf.c index 88952ed..243450d 100644 --- a/sys/kern/uipc_sockbuf.c +++ b/sys/kern/uipc_sockbuf.c @@ -420,9 +420,7 @@ sbreserve_locked(struct sockbuf *sb, u_long cc, struct socket *so, if (cc > sb_max_adj) return (0); if (td != NULL) { - PROC_LOCK(td->td_proc); - sbsize_limit = lim_cur(td->td_proc, RLIMIT_SBSIZE); - PROC_UNLOCK(td->td_proc); + sbsize_limit = lim_cur(td, RLIMIT_SBSIZE); } else sbsize_limit = RLIM_INFINITY; if (!chgsbsize(so->so_cred->cr_uidinfo, &sb->sb_hiwat, cc, diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c index 01d448e..9db72c3 100644 --- a/sys/kern/vfs_vnops.c +++ b/sys/kern/vfs_vnops.c @@ -2098,19 +2098,18 @@ vn_vget_ino_gen(struct vnode *vp, vn_get_ino_t alloc, void *alloc_arg, int vn_rlimit_fsize(const struct vnode *vp, const struct uio *uio, - const struct thread *td) + struct thread *td) { if (vp->v_type != VREG || td == NULL) return (0); - PROC_LOCK(td->td_proc); if ((uoff_t)uio->uio_offset + uio->uio_resid > - lim_cur(td->td_proc, RLIMIT_FSIZE)) { + lim_cur(td, RLIMIT_FSIZE)) { + PROC_LOCK(td->td_proc); kern_psignal(td->td_proc, SIGXFSZ); PROC_UNLOCK(td->td_proc); return (EFBIG); } - PROC_UNLOCK(td->td_proc); return (0); } diff --git a/sys/sys/proc.h b/sys/sys/proc.h index f29d796..9d58550 100644 --- a/sys/sys/proc.h +++ b/sys/sys/proc.h @@ -247,6 +247,7 @@ struct thread { int td_intr_nesting_level; /* (k) Interrupt recursion. */ int td_pinned; /* (k) Temporary cpu pin count. */ struct ucred *td_ucred; /* (k) Reference to credentials. */ + struct plimit *td_limit; /* (k) Resource limits. */ u_int td_estcpu; /* (t) estimated cpu utilization */ int td_slptick; /* (t) Time at sleep. */ int td_blktick; /* (t) Time spent blocked. */ @@ -497,7 +498,7 @@ struct proc { struct filedesc *p_fd; /* (b) Open files. */ struct filedesc_to_leader *p_fdtol; /* (b) Tracking node */ struct pstats *p_stats; /* (b) Accounting/statistics (CPU). */ - struct plimit *p_limit; /* (c) Process limits. */ + struct plimit *p_limit; /* (c) Resource limits. */ struct callout p_limco; /* (c) Limit callout handle */ struct sigacts *p_sigacts; /* (x) Signal actions, state (CPU). 
*/ diff --git a/sys/sys/resourcevar.h b/sys/sys/resourcevar.h index a07fdf8..426a27a 100644 --- a/sys/sys/resourcevar.h +++ b/sys/sys/resourcevar.h @@ -130,13 +130,14 @@ int kern_proc_setrlimit(struct thread *td, struct proc *p, u_int which, struct plimit *lim_alloc(void); void lim_copy(struct plimit *dst, struct plimit *src); -rlim_t lim_cur(struct proc *p, int which); +rlim_t lim_cur(struct thread *td, int which); void lim_fork(struct proc *p1, struct proc *p2); void lim_free(struct plimit *limp); struct plimit *lim_hold(struct plimit *limp); -rlim_t lim_max(struct proc *p, int which); -void lim_rlimit(struct proc *p, int which, struct rlimit *rlp); +rlim_t lim_max(struct thread *td, int which); +void lim_rlimit(struct thread *td, int which, struct rlimit *rlp); +void lim_rlimit_proc(struct proc *p, int which, struct rlimit *rlp); void ruadd(struct rusage *ru, struct rusage_ext *rux, struct rusage *ru2, struct rusage_ext *rux2); void rucollect(struct rusage *ru, struct rusage *ru2); @@ -156,5 +157,7 @@ void ui_racct_foreach(void (*callback)(struct racct *racct, void *arg2, void *arg3), void *arg2, void *arg3); #endif +void lim_update_thread(struct thread *td); + #endif /* _KERNEL */ #endif /* !_SYS_RESOURCEVAR_H_ */ diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h index d70aa57..4aecd93 100644 --- a/sys/sys/vnode.h +++ b/sys/sys/vnode.h @@ -691,7 +691,7 @@ int vn_rdwr_inchunks(enum uio_rw rw, struct vnode *vp, void *base, struct ucred *active_cred, struct ucred *file_cred, size_t *aresid, struct thread *td); int vn_rlimit_fsize(const struct vnode *vn, const struct uio *uio, - const struct thread *td); + struct thread *td); int vn_stat(struct vnode *vp, struct stat *sb, struct ucred *active_cred, struct ucred *file_cred, struct thread *td); int vn_start_write(struct vnode *vp, struct mount **mpp, int flags); diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c index 55e02c4..bdf55c5 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -222,16 +222,14 @@ swap_reserve_by_cred(vm_ooffset_t incr, struct ucred *cred) mtx_unlock(&sw_dev_mtx); if (res) { - PROC_LOCK(curproc); UIDINFO_VMSIZE_LOCK(uip); if ((overcommit & SWAP_RESERVE_RLIMIT_ON) != 0 && - uip->ui_vmsize + incr > lim_cur(curproc, RLIMIT_SWAP) && + uip->ui_vmsize + incr > lim_cur(curthread, RLIMIT_SWAP) && priv_check(curthread, PRIV_VM_SWAP_NORLIMIT)) res = 0; else uip->ui_vmsize += incr; UIDINFO_VMSIZE_UNLOCK(uip); - PROC_UNLOCK(curproc); if (!res) { mtx_lock(&sw_dev_mtx); swap_reserved -= incr; diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c index b7e668b..225837f 100644 --- a/sys/vm/vm_map.c +++ b/sys/vm/vm_map.c @@ -3421,10 +3421,8 @@ vm_map_stack(vm_map_t map, vm_offset_t addrbos, vm_size_t max_ssize, growsize = sgrowsiz; init_ssize = (max_ssize < growsize) ? 
max_ssize : growsize; vm_map_lock(map); - PROC_LOCK(curproc); - lmemlim = lim_cur(curproc, RLIMIT_MEMLOCK); - vmemlim = lim_cur(curproc, RLIMIT_VMEM); - PROC_UNLOCK(curproc); + lmemlim = lim_cur(curthread, RLIMIT_MEMLOCK); + vmemlim = lim_cur(curthread, RLIMIT_VMEM); if (!old_mlock && map->flags & MAP_WIREFUTURE) { if (ptoa(pmap_wired_count(map->pmap)) + init_ssize > lmemlim) { rv = KERN_NO_SPACE; @@ -3553,12 +3551,10 @@ vm_map_growstack(struct proc *p, vm_offset_t addr) int error; #endif + lmemlim = lim_cur(curthread, RLIMIT_MEMLOCK); + stacklim = lim_cur(curthread, RLIMIT_STACK); + vmemlim = lim_cur(curthread, RLIMIT_VMEM); Retry: - PROC_LOCK(p); - lmemlim = lim_cur(p, RLIMIT_MEMLOCK); - stacklim = lim_cur(p, RLIMIT_STACK); - vmemlim = lim_cur(p, RLIMIT_VMEM); - PROC_UNLOCK(p); vm_map_lock_read(map); diff --git a/sys/vm/vm_mmap.c b/sys/vm/vm_mmap.c index 02634d6..adc7fba 100644 --- a/sys/vm/vm_mmap.c +++ b/sys/vm/vm_mmap.c @@ -325,14 +325,12 @@ sys_mmap(td, uap) * There should really be a pmap call to determine a reasonable * location. */ - PROC_LOCK(td->td_proc); if (addr == 0 || (addr >= round_page((vm_offset_t)vms->vm_taddr) && addr < round_page((vm_offset_t)vms->vm_daddr + - lim_max(td->td_proc, RLIMIT_DATA)))) + lim_max(td, RLIMIT_DATA)))) addr = round_page((vm_offset_t)vms->vm_daddr + - lim_max(td->td_proc, RLIMIT_DATA)); - PROC_UNLOCK(td->td_proc); + lim_max(td, RLIMIT_DATA)); } if (flags & MAP_ANON) { /* @@ -1112,13 +1110,9 @@ vm_mlock(struct proc *proc, struct ucred *cred, const void *addr0, size_t len) if (npages > vm_page_max_wired) return (ENOMEM); map = &proc->p_vmspace->vm_map; - PROC_LOCK(proc); nsize = ptoa(npages + pmap_wired_count(map->pmap)); - if (nsize > lim_cur(proc, RLIMIT_MEMLOCK)) { - PROC_UNLOCK(proc); + if (nsize > lim_cur(curthread, RLIMIT_MEMLOCK)) return (ENOMEM); - } - PROC_UNLOCK(proc); if (npages + vm_cnt.v_wire_count > vm_page_max_wired) return (EAGAIN); #ifdef RACCT @@ -1171,12 +1165,8 @@ sys_mlockall(td, uap) * a hard resource limit, return ENOMEM. 
*/ if (!old_mlock && uap->how & MCL_CURRENT) { - PROC_LOCK(td->td_proc); - if (map->size > lim_cur(td->td_proc, RLIMIT_MEMLOCK)) { - PROC_UNLOCK(td->td_proc); + if (map->size > lim_cur(td, RLIMIT_MEMLOCK)) return (ENOMEM); - } - PROC_UNLOCK(td->td_proc); } #ifdef RACCT PROC_LOCK(td->td_proc); @@ -1551,21 +1541,29 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, size = round_page(size); if (map == &td->td_proc->p_vmspace->vm_map) { +#ifdef RACCT PROC_LOCK(td->td_proc); - if (map->size + size > lim_cur(td->td_proc, RLIMIT_VMEM)) { +#endif + if (map->size + size > lim_cur(td, RLIMIT_VMEM)) { +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif return (ENOMEM); } if (racct_set(td->td_proc, RACCT_VMEM, map->size + size)) { +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif return (ENOMEM); } if (!old_mlock && map->flags & MAP_WIREFUTURE) { if (ptoa(pmap_wired_count(map->pmap)) + size > - lim_cur(td->td_proc, RLIMIT_MEMLOCK)) { + lim_cur(td, RLIMIT_MEMLOCK)) { racct_set_force(td->td_proc, RACCT_VMEM, map->size); +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif return (ENOMEM); } error = racct_set(td->td_proc, RACCT_MEMLOCK, @@ -1573,11 +1571,15 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, if (error != 0) { racct_set_force(td->td_proc, RACCT_VMEM, map->size); +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif return (error); } } +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif } /* diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c index 6f50053..8225522 100644 --- a/sys/vm/vm_pageout.c +++ b/sys/vm/vm_pageout.c @@ -1844,7 +1844,7 @@ again: /* * get a limit */ - lim_rlimit(p, RLIMIT_RSS, &rsslim); + lim_rlimit_proc(p, RLIMIT_RSS, &rsslim); limit = OFF_TO_IDX( qmin(rsslim.rlim_cur, rsslim.rlim_max)); diff --git a/sys/vm/vm_unix.c b/sys/vm/vm_unix.c index de9aa78..0e55ddf 100644 --- a/sys/vm/vm_unix.c +++ b/sys/vm/vm_unix.c @@ -83,11 +83,9 @@ sys_obreak(td, uap) int error = 0; boolean_t do_map_wirefuture; - PROC_LOCK(td->td_proc); - datalim = lim_cur(td->td_proc, RLIMIT_DATA); - lmemlim = lim_cur(td->td_proc, RLIMIT_MEMLOCK); - vmemlim = lim_cur(td->td_proc, RLIMIT_VMEM); - PROC_UNLOCK(td->td_proc); + datalim = lim_cur(td, RLIMIT_DATA); + lmemlim = lim_cur(td, RLIMIT_MEMLOCK); + vmemlim = lim_cur(td, RLIMIT_VMEM); do_map_wirefuture = FALSE; new = round_page((vm_offset_t)uap->nsize); -- 2.3.6 From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 02:34:09 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id AD5F13E2; Tue, 28 Apr 2015 02:34:09 +0000 (UTC) Received: from mail-wg0-x22c.google.com (mail-wg0-x22c.google.com [IPv6:2a00:1450:400c:c00::22c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 37B6710CD; Tue, 28 Apr 2015 02:34:09 +0000 (UTC) Received: by wgen6 with SMTP id n6so135281463wge.3; Mon, 27 Apr 2015 19:34:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=+5j929s7xJAmfYn5RKOs1RXBFUo1F4Y4UJayr5y/XkQ=; b=GxsFQ0Xe6qlRHS/55m5t28ft1GyynORaMAC7s6UdjmYILg/Pvbhw8SObetcwtChnT3 KDrTw3kK5VIiICCWdLqSqP4ZYrr32DhU7hxAUwpARWUHtWjGNsDnChauXM6dc4dUWsw1 
mOADaYPvWJSybWxQ8cpU6iE9B/e2h6KU72gZLHD647wxeTei/qDyfTCI+I37juiPdlMJ c6iQrRgAK74Ric7OEB7z33Dion8RXzLtPzlDbn6w2dN1bMrU2iBhoXg/34C7Dw421Oky aMtd9Pv2lhSoRaYuIuCePnFXLVIb5GVAYK+d4T2ikh9kD+pUEA6YXrQWzrXuQmmh8+uY WX8A== X-Received: by 10.194.184.10 with SMTP id eq10mr28223179wjc.147.1430188447676; Mon, 27 Apr 2015 19:34:07 -0700 (PDT) Received: from localhost.localdomain (ip-89-102-11-63.net.upcbroadband.cz. [89.102.11.63]) by mx.google.com with ESMTPSA id fo7sm14118352wic.1.2015.04.27.19.34.06 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 19:34:06 -0700 (PDT) From: Mateusz Guzik To: freebsd-arch@freebsd.org Cc: Mateusz Guzik Subject: [PATCH 1/2] Generalised support for copy-on-write structures shared by threads. Date: Tue, 28 Apr 2015 04:34:02 +0200 Message-Id: <1430188443-19413-2-git-send-email-mjguzik@gmail.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> References: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 02:34:09 -0000 From: Mateusz Guzik Previously td_ucred was managed by comparing it to struct proc's version on kernel<->userspace boundary. Now a dedicated counter is introduced instead which makes it possible to treat more structures this way without adding more tests for the common case (no change). --- sys/amd64/amd64/trap.c | 4 +-- sys/arm/arm/trap-v6.c | 4 +-- sys/arm/arm/trap.c | 11 ++++---- sys/i386/i386/trap.c | 4 +-- sys/kern/init_main.c | 8 +++--- sys/kern/kern_fork.c | 3 ++- sys/kern/kern_kthread.c | 2 +- sys/kern/kern_prot.c | 5 ++-- sys/kern/kern_syscalls.c | 2 ++ sys/kern/kern_thr.c | 6 ++--- sys/kern/kern_thread.c | 43 +++++++++++++++++++++++++++++--- sys/kern/subr_syscall.c | 4 +-- sys/kern/subr_trap.c | 4 +-- sys/powerpc/powerpc/trap.c | 4 +-- sys/sparc64/sparc64/trap.c | 4 +-- sys/sys/proc.h | 11 ++++++++ 17 files changed, 86 insertions(+), 33 deletions(-) diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c index 193d207..1883727 100644 --- a/sys/amd64/amd64/trap.c +++ b/sys/amd64/amd64/trap.c @@ -257,8 +257,8 @@ trap(struct trapframe *frame) td->td_pticks = 0; td->td_frame = frame; addr = frame->tf_rip; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); switch (type) { case T_PRIVINFLT: /* privileged instruction fault */ diff --git a/sys/arm/arm/trap-v6.c b/sys/arm/arm/trap-v6.c index abafa86..f521785 100644 --- a/sys/arm/arm/trap-v6.c +++ b/sys/arm/arm/trap-v6.c @@ -394,8 +394,8 @@ abort_handler(struct trapframe *tf, int prefetch) p = td->td_proc; if (usermode) { td->td_pticks = 0; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); } /* Invoke the appropriate handler, if necessary. 
*/ diff --git a/sys/arm/arm/trap.c b/sys/arm/arm/trap.c index 0f142ce..36faac2 100644 --- a/sys/arm/arm/trap.c +++ b/sys/arm/arm/trap.c @@ -214,9 +214,8 @@ abort_handler(struct trapframe *tf, int type) if (user) { td->td_pticks = 0; td->td_frame = tf; - if (td->td_ucred != td->td_proc->p_ucred) - cred_update_thread(td); - + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); } /* Grab the current pcb */ pcb = td->td_pcb; @@ -644,8 +643,8 @@ prefetch_abort_handler(struct trapframe *tf) if (TRAP_USERMODE(tf)) { td->td_frame = tf; - if (td->td_ucred != td->td_proc->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); } fault_pc = tf->tf_pc; if (td->td_md.md_spinlock_count == 0) { diff --git a/sys/i386/i386/trap.c b/sys/i386/i386/trap.c index d783a2b..41e62db 100644 --- a/sys/i386/i386/trap.c +++ b/sys/i386/i386/trap.c @@ -306,8 +306,8 @@ trap(struct trapframe *frame) td->td_pticks = 0; td->td_frame = frame; addr = frame->tf_eip; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); switch (type) { case T_PRIVINFLT: /* privileged instruction fault */ diff --git a/sys/kern/init_main.c b/sys/kern/init_main.c index b77b788..97e5878 100644 --- a/sys/kern/init_main.c +++ b/sys/kern/init_main.c @@ -522,8 +522,6 @@ proc0_init(void *dummy __unused) #ifdef MAC mac_cred_create_swapper(newcred); #endif - td->td_ucred = crhold(newcred); - /* Create sigacts. */ p->p_sigacts = sigacts_alloc(); @@ -555,6 +553,10 @@ proc0_init(void *dummy __unused) p->p_limit->pl_rlimit[RLIMIT_MEMLOCK].rlim_max = pageablemem; p->p_cpulimit = RLIM_INFINITY; + PROC_LOCK(p); + thread_get_cow_proc(td, p); + PROC_UNLOCK(p); + /* Initialize resource accounting structures. 
*/ racct_create(&p->p_racct); @@ -842,10 +844,10 @@ create_init(const void *udata __unused) audit_cred_proc1(newcred); #endif proc_set_cred(initproc, newcred); + cred_update_thread(FIRST_THREAD_IN_PROC(initproc)); PROC_UNLOCK(initproc); sx_xunlock(&proctree_lock); crfree(oldcred); - cred_update_thread(FIRST_THREAD_IN_PROC(initproc)); cpu_set_fork_handler(FIRST_THREAD_IN_PROC(initproc), start_init, NULL); } SYSINIT(init, SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL); diff --git a/sys/kern/kern_fork.c b/sys/kern/kern_fork.c index c3dd792..d04c3e3 100644 --- a/sys/kern/kern_fork.c +++ b/sys/kern/kern_fork.c @@ -496,7 +496,6 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2, p2->p_swtick = ticks; if (p1->p_flag & P_PROFIL) startprofclock(p2); - td2->td_ucred = crhold(p2->p_ucred); if (flags & RFSIGSHARE) { p2->p_sigacts = sigacts_hold(p1->p_sigacts); @@ -526,6 +525,8 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2, */ lim_fork(p1, p2); + thread_get_cow_proc(td2, p2); + pstats_fork(p1->p_stats, p2->p_stats); PROC_UNLOCK(p1); diff --git a/sys/kern/kern_kthread.c b/sys/kern/kern_kthread.c index ee94de0..0614d89 100644 --- a/sys/kern/kern_kthread.c +++ b/sys/kern/kern_kthread.c @@ -289,7 +289,7 @@ kthread_add(void (*func)(void *), void *arg, struct proc *p, cpu_set_fork_handler(newtd, func, arg); newtd->td_pflags |= TDP_KTHREAD; - newtd->td_ucred = crhold(p->p_ucred); + thread_get_cow_proc(newtd, p); /* this code almost the same as create_thread() in kern_thr.c */ p->p_flag |= P_HADTHREADS; diff --git a/sys/kern/kern_prot.c b/sys/kern/kern_prot.c index 9c49f71..b531763 100644 --- a/sys/kern/kern_prot.c +++ b/sys/kern/kern_prot.c @@ -1946,9 +1946,8 @@ cred_update_thread(struct thread *td) p = td->td_proc; cred = td->td_ucred; - PROC_LOCK(p); + PROC_LOCK_ASSERT(p, MA_OWNED); td->td_ucred = crhold(p->p_ucred); - PROC_UNLOCK(p); if (cred != NULL) crfree(cred); } @@ -1987,6 +1986,8 @@ proc_set_cred(struct proc *p, struct ucred *newcred) oldcred = p->p_ucred; p->p_ucred = newcred; + if (newcred != NULL) + PROC_UPDATE_COW(p); return (oldcred); } diff --git a/sys/kern/kern_syscalls.c b/sys/kern/kern_syscalls.c index dada746..3d3df01 100644 --- a/sys/kern/kern_syscalls.c +++ b/sys/kern/kern_syscalls.c @@ -31,6 +31,8 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include +#include #include #include #include diff --git a/sys/kern/kern_thr.c b/sys/kern/kern_thr.c index d5f1ce6..242e4dd 100644 --- a/sys/kern/kern_thr.c +++ b/sys/kern/kern_thr.c @@ -226,13 +226,13 @@ create_thread(struct thread *td, mcontext_t *ctx, bcopy(&td->td_startcopy, &newtd->td_startcopy, __rangeof(struct thread, td_startcopy, td_endcopy)); newtd->td_proc = td->td_proc; - newtd->td_ucred = crhold(td->td_ucred); + thread_get_cow(newtd, td); if (ctx != NULL) { /* old way to set user context */ error = set_mcontext(newtd, ctx); if (error != 0) { + thread_free_cow(newtd); thread_free(newtd); - crfree(td->td_ucred); goto fail; } } else { @@ -244,8 +244,8 @@ create_thread(struct thread *td, mcontext_t *ctx, /* Setup user TLS address and TLS pointer register. 
*/ error = cpu_set_user_tls(newtd, tls_base); if (error != 0) { + thread_free_cow(newtd); thread_free(newtd); - crfree(td->td_ucred); goto fail; } } diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c index 0a93dbd..df8511b 100644 --- a/sys/kern/kern_thread.c +++ b/sys/kern/kern_thread.c @@ -324,8 +324,7 @@ thread_reap(void) mtx_unlock_spin(&zombie_lock); while (td_first) { td_next = TAILQ_NEXT(td_first, td_slpq); - if (td_first->td_ucred) - crfree(td_first->td_ucred); + thread_free_cow(td_first); thread_free(td_first); td_first = td_next; } @@ -381,6 +380,44 @@ thread_free(struct thread *td) uma_zfree(thread_zone, td); } +void +thread_get_cow_proc(struct thread *newtd, struct proc *p) +{ + + PROC_LOCK_ASSERT(p, MA_OWNED); + newtd->td_ucred = crhold(p->p_ucred); + newtd->td_cowgeneration = p->p_cowgeneration; +} + +void +thread_get_cow(struct thread *newtd, struct thread *td) +{ + + newtd->td_ucred = crhold(td->td_ucred); + newtd->td_cowgeneration = td->td_cowgeneration; +} + +void +thread_free_cow(struct thread *td) +{ + + if (td->td_ucred) + crfree(td->td_ucred); +} + +void +thread_update_cow(struct thread *td) +{ + struct proc *p; + + p = td->td_proc; + PROC_LOCK(p); + if (td->td_ucred != p->p_ucred) + cred_update_thread(td); + td->td_cowgeneration = p->p_cowgeneration; + PROC_UNLOCK(p); +} + /* * Discard the current thread and exit from its context. * Always called with scheduler locked. @@ -518,7 +555,7 @@ thread_wait(struct proc *p) cpuset_rel(td->td_cpuset); td->td_cpuset = NULL; cpu_thread_clean(td); - crfree(td->td_ucred); + thread_free_cow(td); thread_reap(); /* check for zombie threads etc. */ } diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c index 1bf78b8..8fdb828 100644 --- a/sys/kern/subr_syscall.c +++ b/sys/kern/subr_syscall.c @@ -61,8 +61,8 @@ syscallenter(struct thread *td, struct syscall_args *sa) p = td->td_proc; td->td_pticks = 0; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); if (p->p_flag & P_TRACED) { traced = 1; PROC_LOCK(p); diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c index cfc3ed7..e055e54 100644 --- a/sys/kern/subr_trap.c +++ b/sys/kern/subr_trap.c @@ -219,8 +219,8 @@ ast(struct trapframe *framep) thread_unlock(td); PCPU_INC(cnt.v_trap); - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); if (td->td_pflags & TDP_OWEUPC && p->p_flag & P_PROFIL) { addupc_task(td, td->td_profil_addr, td->td_profil_ticks); td->td_profil_ticks = 0; diff --git a/sys/powerpc/powerpc/trap.c b/sys/powerpc/powerpc/trap.c index 0ceb170..007752c 100644 --- a/sys/powerpc/powerpc/trap.c +++ b/sys/powerpc/powerpc/trap.c @@ -196,8 +196,8 @@ trap(struct trapframe *frame) if (user) { td->td_pticks = 0; td->td_frame = frame; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); /* User Mode Traps */ switch (type) { diff --git a/sys/sparc64/sparc64/trap.c b/sys/sparc64/sparc64/trap.c index b4f0e27..54c1ebe 100644 --- a/sys/sparc64/sparc64/trap.c +++ b/sys/sparc64/sparc64/trap.c @@ -277,8 +277,8 @@ trap(struct trapframe *tf) td->td_pticks = 0; td->td_frame = tf; addr = tf->tf_tpc; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); switch (tf->tf_type) { case T_DATA_MISS: diff --git a/sys/sys/proc.h b/sys/sys/proc.h index 
64b99fc..f29d796 100644 --- a/sys/sys/proc.h +++ b/sys/sys/proc.h @@ -225,6 +225,7 @@ struct thread { /* Cleared during fork1() */ #define td_startzero td_flags int td_flags; /* (t) TDF_* flags. */ + u_int td_cowgeneration;/* (k) Generation of COW pointers. */ int td_inhibitors; /* (t) Why can not run. */ int td_pflags; /* (k) Private thread (TDP_*) flags. */ int td_dupfd; /* (k) Ret value from fdopen. XXX */ @@ -531,6 +532,7 @@ struct proc { pid_t p_oppid; /* (c + e) Save ppid in ptrace. XXX */ struct vmspace *p_vmspace; /* (b) Address space. */ u_int p_swtick; /* (c) Tick when swapped in or out. */ + u_int p_cowgeneration;/* (c) Generation of COW pointers. */ struct itimerval p_realtimer; /* (c) Alarm timer. */ struct rusage p_ru; /* (a) Exit information. */ struct rusage_ext p_rux; /* (cu) Internal resource usage. */ @@ -830,6 +832,11 @@ extern pid_t pid_max; KASSERT((p)->p_lock == 0, ("process held")); \ } while (0) +#define PROC_UPDATE_COW(p) do { \ + PROC_LOCK_ASSERT((p), MA_OWNED); \ + p->p_cowgeneration++; \ +} while (0) + /* Check whether a thread is safe to be swapped out. */ #define thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP) @@ -976,6 +983,10 @@ struct thread *thread_alloc(int pages); int thread_alloc_stack(struct thread *, int pages); void thread_exit(void) __dead2; void thread_free(struct thread *td); +void thread_get_cow_proc(struct thread *newtd, struct proc *p); +void thread_get_cow(struct thread *newtd, struct thread *td); +void thread_free_cow(struct thread *td); +void thread_update_cow(struct thread *td); void thread_link(struct thread *td, struct proc *p); void thread_reap(void); int thread_single(struct proc *p, int how); -- 2.3.6 From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 08:45:16 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E3DFCC1E; Tue, 28 Apr 2015 08:45:15 +0000 (UTC) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 75AEB1B4F; Tue, 28 Apr 2015 08:45:14 +0000 (UTC) Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 06F8D1040193; Tue, 28 Apr 2015 18:45:01 +1000 (AEST) Date: Tue, 28 Apr 2015 18:45:01 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Mateusz Guzik cc: freebsd-arch@freebsd.org, Mateusz Guzik Subject: Re: [PATCH 1/2] Generalised support for copy-on-write structures shared by threads. 
In-Reply-To: <1430188443-19413-2-git-send-email-mjguzik@gmail.com> Message-ID: <20150428181802.F1119@besplex.bde.org> References: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> <1430188443-19413-2-git-send-email-mjguzik@gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=dKqfxopb c=1 sm=1 tr=0 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8 a=1twMqG6x6PHpFXFWjvsA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 08:45:16 -0000 On Tue, 28 Apr 2015, Mateusz Guzik wrote: > diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c > index 193d207..1883727 100644 > --- a/sys/amd64/amd64/trap.c > +++ b/sys/amd64/amd64/trap.c > @@ -257,8 +257,8 @@ trap(struct trapframe *frame) > td->td_pticks = 0; > td->td_frame = frame; > addr = frame->tf_rip; > - if (td->td_ucred != p->p_ucred) > - cred_update_thread(td); > + if (td->td_cowgeneration != p->p_cowgeneration) > + thread_update_cow(td); > > switch (type) { > case T_PRIVINFLT: /* privileged instruction fault */ This seems reasonable, but I don't like verbose names like p_cowgeneration. It is especially bad to abbreviate "copy on write" to "cow" and then spell "generation" in full. "gen" would be a reasonable abbreviation, but "g" goes better with "cow". Old bad names visible in the patch include "thread" instead of "td". "td" is not such a good abbreviation for "thread pointer". > diff --git a/sys/kern/kern_thr.c b/sys/kern/kern_thr.c > index d5f1ce6..242e4dd 100644 > --- a/sys/kern/kern_thr.c > +++ b/sys/kern/kern_thr.c "thread" has too many different spellings. For just file names, there are kern_thr.c and kern_thread.c. For variable names, there is also "t" in "tid". "tid" is the best of all the names mentioned so far. > diff --git a/sys/sys/proc.h b/sys/sys/proc.h > index 64b99fc..f29d796 100644 > --- a/sys/sys/proc.h > +++ b/sys/sys/proc.h > @@ -225,6 +225,7 @@ struct thread { > /* Cleared during fork1() */ > #define td_startzero td_flags > int td_flags; /* (t) TDF_* flags. */ > + u_int td_cowgeneration;/* (k) Generation of COW pointers. */ > int td_inhibitors; /* (t) Why can not run. */ > int td_pflags; /* (k) Private thread (TDP_*) flags. */ > int td_dupfd; /* (k) Ret value from fdopen. XXX */ This name is so verbose that it messes up the comment indentation. > @@ -830,6 +832,11 @@ extern pid_t pid_max; > KASSERT((p)->p_lock == 0, ("process held")); \ > } while (0) > > +#define PROC_UPDATE_COW(p) do { \ > + PROC_LOCK_ASSERT((p), MA_OWNED); \ > + p->p_cowgeneration++; \ Missing parentheses. > +} while (0) > + > /* Check whether a thread is safe to be swapped out. */ > #define thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP) > > @@ -976,6 +983,10 @@ struct thread *thread_alloc(int pages); > int thread_alloc_stack(struct thread *, int pages); > void thread_exit(void) __dead2; > void thread_free(struct thread *td); > +void thread_get_cow_proc(struct thread *newtd, struct proc *p); > +void thread_get_cow(struct thread *newtd, struct thread *td); > +void thread_free_cow(struct thread *td); > +void thread_update_cow(struct thread *td); Insertion sort errors. Namespace errors. I don't like the style of naming things with objects first and verbs last, but it is good for sorting related objects. 
Here the verbs "get" and "free" are in the middle of the objects "thread_cow_proc" and "thread_cow". Also, shouldn't it be "thread_proc_cow" (but less verbose, maybe "tpcow"), not "thread_cow_proc", to indicate that the cow is hung off the proc? I didn't notice the details, but it makes no sense to hang a proc off a cow :-).

Bruce

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 13:45:10 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 1A28BD3B for ; Tue, 28 Apr 2015 13:45:10 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id E88CD1FD4 for ; Tue, 28 Apr 2015 13:45:09 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id CF0B4B93C; Tue, 28 Apr 2015 09:45:07 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Cc: Konstantin Belousov , Jason Harmening , Svatopluk Kraus Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Date: Tue, 28 Apr 2015 09:40:33 -0400 Message-ID: <1876382.0PQNo3Rp24@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: <20150425163444.GL2390@kib.kiev.ua> References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 28 Apr 2015 09:45:07 -0400 (EDT) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 13:45:10 -0000

On Saturday, April 25, 2015 07:34:44 PM Konstantin Belousov wrote: > On Sat, Apr 25, 2015 at 09:02:12AM -0500, Jason Harmening wrote: > > It seems like in general it is too hard for drivers using busdma to deal > > with usermode memory in a way that's both safe and efficient: > > --bus_dmamap_load_uio + UIO_USERSPACE is apparently really unsafe > > --if they do things the other way and allocate in the kernel, then > > they had better either be willing to do extra copying, or create and > > refcount their own vm_objects and use d_mmap_single (I still haven't > > seen a good example of that), or leak a bunch of memory (if they use > > d_mmap), because the old device pager is also really unsafe. > munmap(2) does not free the pages, it removes the mapping and dereferences > the backing vm object. If the region was wired, munmap would decrement > the wiring count for the pages. So if kernel code wired the region's > pages, they are kept wired, but no longer mapped into userspace. > So bcopy() still does not work. > > d_mmap_single() is used by GPU drivers, definitely by the GEM and TTM code, and possibly > by the proprietary nvidia driver. Yes, the nvidia driver uses it. I've also used it for some proprietary driver extensions. > I believe UIO_USERSPACE is almost unused, it might be there for some > obscure (and buggy) driver.
I believe it was added (and only ever used) in crypto drivers, and that they all did bus_dma operations in the context of the thread that passed in the uio. I definitely think it is fragile and should be replaced with something more reliable. -- John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 13:45:10 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 011F9D3A; Tue, 28 Apr 2015 13:45:10 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id CE1E01FD3; Tue, 28 Apr 2015 13:45:09 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 87851B95B; Tue, 28 Apr 2015 09:45:08 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Cc: Adrian Chadd , Davide Italiano Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes Date: Tue, 28 Apr 2015 09:35:10 -0400 Message-ID: <1832557.zVusTDjZUx@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: References: MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 28 Apr 2015 09:45:08 -0400 (EDT) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 13:45:10 -0000

On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote: > On 25 April 2015 at 11:18, Davide Italiano wrote: > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote: > >> Hi! > >> > >> I've been doing some NUMA testing on large boxes and I've found that > >> there's lock contention in the ACPI path. It's due to my change a > >> while ago to start using sleep states above ACPI C1 by default. The > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep > >> path that grabs a serialiser lock, and on an 80 thread box this is > >> costly. > >> > >> I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state > >> doesn't require the same register fiddling (to disable bus mastering, > >> if I'm reading it right) and so it doesn't enter that particular > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper > >> CPU sleep state (C6 on each of these). I think it is still a good default > >> for both servers and desktops. > >> > >> If no-one has a problem with this then I'll do it after the weekend. > >> > > > > This sounds to me just a way to hide a problem. > > Very few people nowadays run on NUMA and they can tune the machine as > > they like when they do testing. > > If there's a lock contention problem, it needs to be fixed and not > > hidden under another default. > The lock contention problem is inside ACPI and how it's designed/implemented.
> We're not going to easily be able to make ACPI lock "better" as we're > constrained by how ACPI implements things in the shared ACPICA code. Is the contention actually harmful? Note that this only happens when the CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle stuff uses heuristics to only drop into deeper sleep states if the CPU has recently been idle "more" so that if you are relatively busy you will only go into C1 instead. (I think this latter might have changed since eventtimers came in, it looks like we now choose the idle state based on how long until the next timer interrupt?) If the only consequence of this is that it adds noise to profiling, then hack your profiling results to ignore this lock. I think that is a better tradeoff than sacrificing power gains to reduce noise in profiling output. Alternatively, your machine may be better off using cpu_idle_mwait. There are already CPUs now that only advertise deeper sleep states for use with mwait but not ACPI, so we may certainly end up with defaulting to mwait instead of ACPI for certain CPUs anyway. -- John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 14:13:08 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id ED1AF5A9; Tue, 28 Apr 2015 14:13:08 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 747651300; Tue, 28 Apr 2015 14:13:08 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3SED2uW007586 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 28 Apr 2015 17:13:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3SED2uW007586 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3SED2Ft007585; Tue, 28 Apr 2015 17:13:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 28 Apr 2015 17:13:02 +0300 From: Konstantin Belousov To: John Baldwin Cc: freebsd-arch@freebsd.org, Davide Italiano , Adrian Chadd Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes Message-ID: <20150428141302.GH2390@kib.kiev.ua> References: <1832557.zVusTDjZUx@ralph.baldwin.cx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1832557.zVusTDjZUx@ralph.baldwin.cx> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 14:13:09 -0000

On Tue, Apr 28, 2015 at 09:35:10AM -0400, John Baldwin wrote: > On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote: > > On 25 April 2015 at 11:18, Davide Italiano wrote: > > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote: > > >> Hi! > > >> > > >> I've been doing some NUMA testing on large boxes and I've found that > > >> there's lock contention in the ACPI path. It's due to my change a > > >> while ago to start using sleep states above ACPI C1 by default. The > > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep > > >> path that grabs a serialiser lock, and on an 80 thread box this is > > >> costly. > > >> > > >> I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state > > >> doesn't require the same register fiddling (to disable bus mastering, > > >> if I'm reading it right) and so it doesn't enter that particular > > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge > > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper > > >> CPU sleep state (C6 on each of these). I think it is still a good default > > >> for both servers and desktops. > > >> > > >> If no-one has a problem with this then I'll do it after the weekend. > > >> > > > > > > This sounds to me just a way to hide a problem. > > > Very few people nowadays run on NUMA and they can tune the machine as > > > they like when they do testing. > > > If there's a lock contention problem, it needs to be fixed and not > > > hidden under another default. > > The lock contention problem is inside ACPI and how it's designed/implemented. > > We're not going to easily be able to make ACPI lock "better" as we're > > constrained by how ACPI implements things in the shared ACPICA code. > > Is the contention actually harmful? Note that this only happens when the > CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle > stuff uses heuristics to only drop into deeper sleep states if the CPU has > recently been idle "more" so that if you are relatively busy you will only go > into C1 instead. (I think this latter might have changed since eventtimers > came in, it looks like we now choose the idle state based on how long until > the next timer interrupt?) You have to spin, waiting for other cores, to get the right to reduce the power state. > > If the only consequence of this is that it adds noise to profiling, then hack > your profiling results to ignore this lock. I think that is a better tradeoff > than sacrificing power gains to reduce noise in profiling output. I suspect that it adds latency, since interrupts cannot stop the wait for the ACPI lock. Also, it probably increases the power usage since the CPU has to spend more time contending for the lock instead of sleeping. > > Alternatively, your machine may be better off using cpu_idle_mwait. There > are already CPUs now that only advertise deeper sleep states for use with > mwait but not ACPI, so we may certainly end up with defaulting to mwait > instead of ACPI for certain CPUs anyway. cpu_idle_mwait is quite useless, it only enters C1, which should be almost the same as hlt. mwait for C1 might reduce the latency of waking up, but definitely would not reduce power consumption on par with higher Cx. That said, I think that for non-laptop usage, limiting the lowest state to C2 is fine. For Haswells, Intel's recommendation for BIOS writers is to limit the announced states to C2 (eliminating BM avoidance altogether). Internally, ACPI C2 is mapped to CPU C6 or maybe even C7.
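For reference, the knob being discussed can already be pinned from userland; a typical non-laptop configuration along the proposed lines would be (values illustrative):

	# /etc/rc.conf
	performance_cx_lowest="C2"	# lowest C-state while on AC power
	economy_cx_lowest="C2"		# lowest C-state while on battery

	# equivalently, at runtime:
	# sysctl hw.acpi.cpu.cx_lowest=C2
	# sysctl dev.cpu.0.cx_lowest=C2	(per-CPU)

dev.cpu.N.cx_supported shows which states each CPU announces.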
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 14:47:50 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 5CB0A1BD; Tue, 28 Apr 2015 14:47:50 +0000 (UTC) Received: from mail-ob0-x234.google.com (mail-ob0-x234.google.com [IPv6:2607:f8b0:4003:c01::234]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 22876176C; Tue, 28 Apr 2015 14:47:50 +0000 (UTC) Received: by obbeb7 with SMTP id eb7so109723254obb.3; Tue, 28 Apr 2015 07:47:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type; bh=mKS1Vuc2xwpr85kVpktLCtktynSHmLjCBPWh5fmR2uw=; b=yA+AnwvOvYvgIEwmzNS9YbrhIscZH2N3BLeMBkS0FiunpIbEml04iOyQnd2bdHhSwj C+pnjNx2n8dfhUpCfH3sQFdcEUEtBaFQ99mTX6fy3546Pj4S3Wq2HGqjlKCtJGZl+q4S h6pptbbTS6S0OcmtHny6vV5JCutgygXh5wr+a+cZjhPUxSo3RHAsMyQ3HTSh+u2PJQw1 WzYH4tl2H7hU/8MWkN9afJ2JXe/7LtxJuxE3M2zhILREfSUH8ROG6gQGipPCiPJaxI3s nT/QN5u2eAtgEyfQD/zlRlRsTMZISz7fMGMmetuPQi1N2AdErdSXeLqByaXmMxaNb3IU 6mtQ== X-Received: by 10.202.225.65 with SMTP id y62mr13884239oig.78.1430232469407; Tue, 28 Apr 2015 07:47:49 -0700 (PDT) Received: from corona.austin.rr.com (cpe-72-177-6-10.austin.res.rr.com. [72.177.6.10]) by mx.google.com with ESMTPSA id ph19sm9959705oeb.9.2015.04.28.07.47.48 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 07:47:48 -0700 (PDT) Message-ID: <553F9DE2.5080908@gmail.com> Date: Tue, 28 Apr 2015 09:49:06 -0500 From: Jason Harmening User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: John Baldwin , freebsd-arch@freebsd.org CC: Konstantin Belousov , Svatopluk Kraus Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> <1876382.0PQNo3Rp24@ralph.baldwin.cx> In-Reply-To: <1876382.0PQNo3Rp24@ralph.baldwin.cx> Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="FdFxoWkPFWu4kEMfT2l2KniOuEbJax1dd" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 14:47:50 -0000

On 04/28/15 08:40, John Baldwin wrote: > On Saturday, April 25, 2015 07:34:44 PM Konstantin Belousov wrote: >> On Sat, Apr 25, 2015 at 09:02:12AM -0500, Jason Harmening wrote: >>> It seems like in general it is too hard for drivers using busdma to deal >>> with usermode memory in a way that's both safe and efficient: >>> --bus_dmamap_load_uio + UIO_USERSPACE is apparently really unsafe >>> --if they do things the other way and allocate in the kernel, then >>> they had better either be willing to do extra copying, or create and >>> refcount their own vm_objects and use d_mmap_single (I still haven't >>> seen a good example of that), or leak a bunch of memory (if they use >>> d_mmap), because the old device pager is also really unsafe. >> munmap(2) does not free the pages, it removes the mapping and dereferences >> the backing vm object. If the region was wired, munmap would decrement >> the wiring count for the pages. So if kernel code wired the region's >> pages, they are kept wired, but no longer mapped into userspace. >> So bcopy() still does not work. >> >> d_mmap_single() is used by GPU drivers, definitely by the GEM and TTM code, and possibly >> by the proprietary nvidia driver. > Yes, the nvidia driver uses it. I've also used it for some proprietary > driver extensions. I've seen d_mmap_single() used in the GPU code, but I haven't seen it used in conjunction with busdma (but maybe I'm not looking in the right place). > >> I believe UIO_USERSPACE is almost unused, it might be there for some >> obscure (and buggy) driver. > I believe it was added (and only ever used) in crypto drivers, and that they > all did bus_dma operations in the context of the thread that passed in the > uio. I definitely think it is fragile and should be replaced with something > more reliable. > I think it's useful to make the bounce-buffering logic more robust in cases where it's not executed in the owning process; it's also a really simple set of changes. Of course doing vslock beforehand is still going to be the only safe way to use that API, but that seems reasonable if it's documented and done sparingly (which it is). In the longer term, vm_fault_quick_hold_pages + _bus_dmamap_load_ma is probably better for user buffers, at least for short transfers (which I think is most of them). load_ma needs to at least be made a public and documented KPI though. I'd like to try moving some of the drm2 code to use it once I finally have a reasonably modern machine for testing -current. Either _bus_dmamap_load_ma or out-of-context UIO_USERSPACE bounce buffering could have issues with waiting on sfbufs on some arches, including arm. That could be fixed by making each unmapped bounce buffer set up a kva mapping for the data addr when it's created, but that fix might be worse than the problem it's trying to solve.
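To make that concrete, here is a rough sketch of the wire-then-load pattern (assumptions: a public bus_dmamap_load_ma() wrapper with a bus_dmamap_load()-style callback, which does not exist yet; the unload path is elided):

	#include <sys/param.h>
	#include <sys/proc.h>
	#include <machine/bus.h>
	#include <vm/vm.h>
	#include <vm/vm_extern.h>
	#include <vm/vm_map.h>
	#include <vm/vm_page.h>

	/* Page slots for one maximal transfer; sized for illustration. */
	#define	UBUF_MAXPAGES	(howmany(MAXPHYS, PAGE_SIZE) + 1)

	static int
	load_user_buffer(bus_dma_tag_t tag, bus_dmamap_t map, void *uaddr,
	    size_t len, bus_dmamap_callback_t *cb, void *cbarg)
	{
		vm_page_t ma[UBUF_MAXPAGES];
		int count, error;

		/*
		 * Fault in and wire the pages backing the user buffer.
		 * This is the only step that must run in the context of
		 * the owning process.
		 */
		count = vm_fault_quick_hold_pages(
		    &curproc->p_vmspace->vm_map, (vm_offset_t)uaddr, len,
		    VM_PROT_READ | VM_PROT_WRITE, ma, UBUF_MAXPAGES);
		if (count == -1)
			return (EFAULT);

		/*
		 * Hand the wired pages to busdma by physical address;
		 * after this, nothing depends on the user VA or pmap.
		 * bus_dmamap_load_ma() is the assumed public wrapper
		 * around today's _bus_dmamap_load_ma().
		 */
		error = bus_dmamap_load_ma(tag, map, ma, len,
		    (vm_offset_t)uaddr & PAGE_MASK, BUS_DMA_NOWAIT,
		    cb, cbarg);
		if (error != 0)
			vm_page_unhold_pages(ma, count);
		/* On success, unhold the pages after bus_dmamap_unload(). */
		return (error);
	}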
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 15:36:39 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id EE3C93ED; Tue, 28 Apr 2015 15:36:39 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id AC3AD1DBD; Tue, 28 Apr 2015 15:36:39 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 1D5B8B93A; Tue, 28 Apr 2015 11:36:38 -0400 (EDT) From: John Baldwin To: Konstantin Belousov Cc: freebsd-arch@freebsd.org, Davide Italiano , Adrian Chadd Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes Date: Tue, 28 Apr 2015 10:23:33 -0400 Message-ID: <3094092.O50xjOxef9@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: <20150428141302.GH2390@kib.kiev.ua> References: <1832557.zVusTDjZUx@ralph.baldwin.cx> <20150428141302.GH2390@kib.kiev.ua> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 28 Apr 2015 11:36:38 -0400 (EDT) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 15:36:40 -0000

On Tuesday, April 28, 2015 05:13:02 PM Konstantin Belousov wrote: > On Tue, Apr 28, 2015 at 09:35:10AM -0400, John Baldwin wrote: > > On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote: > > > On 25 April 2015 at 11:18, Davide Italiano wrote: > > > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote: > > > >> Hi! > > > >> > > > >> I've been doing some NUMA testing on large boxes and I've found that > > > >> there's lock contention in the ACPI path. It's due to my change a > > > >> while ago to start using sleep states above ACPI C1 by default. The > > > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep > > > >> path that grabs a serialiser lock, and on an 80 thread box this is > > > >> costly. > > > >> > > > >> I'd like to drop performance_cx_lowest to C2 in -HEAD.
ACPI C2 state > > > >> doesn't require the same register fiddling (to disable bus mastering, > > > >> if I'm reading it right) and so it doesn't enter that particular > > > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge > > > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper > > > >> CPU sleep state (C6 on each of these). I think it is still a good default > > > >> for both servers and desktops. > > > >> > > > >> If no-one has a problem with this then I'll do it after the weekend. > > > >> > > > > > > > > This sounds to me just a way to hide a problem. > > > > Very few people nowadays run on NUMA and they can tune the machine as > > > > they like when they do testing. > > > > If there's a lock contention problem, it needs to be fixed and not > > > > hidden under another default. > > > > > > The lock contention problem is inside ACPI and how it's designed/implemented. > > > We're not going to easily be able to make ACPI lock "better" as we're > > > constrained by how ACPI implements things in the shared ACPICA code. > > > > Is the contention actually harmful? Note that this only happens when the > > CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle > > stuff uses heuristics to only drop into deeper sleep states if the CPU has > > recently been idle "more" so that if you are relatively busy you will only go > > into C1 instead. (I think this latter might have changed since eventtimers > > came in, it looks like we now choose the idle state based on how long until > > the next timer interrupt?) > You have to spin, waiting for other cores, to get the right to reduce the > power state. Yes, normally spinning wouldn't do that, but the cpu idle hooks run with interrupts disabled. We could fix that, perhaps, though ACPI doesn't quite have what we would want (a single op that would disable interrupts after grabbing the lock, do the test and set of the bit in question and return its old value leaving interrupts disabled after dropping the lock). However, I would still like to know if the contention here is actually harmful in some measurable way aside from showing up in profiling output. > > Alternatively, your machine may be better off using cpu_idle_mwait. There > > are already CPUs now that only advertise deeper sleep states for use with > > mwait but not ACPI, so we may certainly end up with defaulting to mwait > > instead of ACPI for certain CPUs anyway. > > cpu_idle_mwait is quite useless, it only enters C1, which should be > almost the same as hlt. mwait for C1 might reduce the latency of waking up, > but definitely would not reduce power consumption on par with higher Cx. Mmm, it was your pending patch I was thinking of. Don't you use mwait with the hints to use deeper sleep states in your change? > That said, I think that for non-laptop usage, limiting the lowest state to C2 > is fine. For Haswells, Intel's recommendation for BIOS writers is to > limit the announced states to C2 (eliminating BM avoidance altogether). > Internally, ACPI C2 is mapped to CPU C6 or maybe even C7. The problem, of course, is detecting non-laptops. :-/ In my own crude measurements based on the power draw numbers in the BMC on recent SuperMicro X9 boards for SandyBridge servers, most of the gain you get is from C2; C3 doesn't add much difference once you are able to do C2. Also of note is the comment above the busmaster register in question about USB. I'm not sure if that is still true anymore.
If it were, systems would never go into C3, in which case this would be a moot point and there would be no need to enable C3. -- John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 15:42:50 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A2DE877C; Tue, 28 Apr 2015 15:42:50 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 199FD1EBE; Tue, 28 Apr 2015 15:42:49 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3SFgjlm028728 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 28 Apr 2015 18:42:45 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3SFgjlm028728 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3SFgjF8028727; Tue, 28 Apr 2015 18:42:45 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 28 Apr 2015 18:42:45 +0300 From: Konstantin Belousov To: Jason Harmening Cc: John Baldwin , freebsd-arch@freebsd.org, Svatopluk Kraus Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Message-ID: <20150428154245.GJ2390@kib.kiev.ua> References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> <1876382.0PQNo3Rp24@ralph.baldwin.cx> <553F9DE2.5080908@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <553F9DE2.5080908@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 15:42:50 -0000

On Tue, Apr 28, 2015 at 09:49:06AM -0500, Jason Harmening wrote: > > Either _bus_dmamap_load_ma or out-of-context UIO_USERSPACE bounce > buffering could have issues with waiting on sfbufs on some arches, > including arm. That could be fixed by making each unmapped bounce > buffer set up a kva mapping for the data addr when it's created, but > that fix might be worse than the problem it's trying to solve. I had an implementation of the sfbuf allocator which never sleeps. If an sfbuf was not available without sleeping, a callback is called later, when a reusable sf buf is freed. It was written to allow drivers like PIO ATA to take unmapped bios, but I never finished it; at least, I did not convert a single driver. I am not sure whether I can find the branch, or whether it is reasonable to try to rebase it, but the base idea may be useful for the UIO_USERSPACE case as well.
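The interface described above might look roughly like the following (the name and signature are invented here purely for illustration; the unfinished branch may differ):

	/*
	 * Map the page into KVA without sleeping.  Returns NULL if no
	 * sf_buf is free; in that case cb(arg, sf) is invoked later,
	 * from whatever context frees a reusable sf_buf.
	 */
	struct sf_buf *
	sf_buf_alloc_nowait(vm_page_t m,
	    void (*cb)(void *arg, struct sf_buf *sf), void *arg);

A bounce-buffer sync running outside the owning process could then defer the copy to the callback instead of sleeping for a mapping.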
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 16:19:23 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8E279B49 for ; Tue, 28 Apr 2015 16:19:23 +0000 (UTC) Received: from mail-ie0-f176.google.com (mail-ie0-f176.google.com [209.85.223.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 5849E1358 for ; Tue, 28 Apr 2015 16:19:23 +0000 (UTC) Received: by iejt8 with SMTP id t8so22157319iej.2 for ; Tue, 28 Apr 2015 09:19:22 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:subject:mime-version:content-type:from :in-reply-to:date:cc:message-id:references:to; bh=RlN8tnWkK8GsdOjWPJE7l64uVm/t2RPf2F/1sWXevFE=; b=lPaIKZjkVVe8Z0AIkNIwWMXT54sXQPpcTkjEM5hgmZ0+hWunGb/bHmR46PR/JL8JqW uw9zpIMKHTgvA3Tmzp5dJSMWMpgFOFWBR8I2Yh6YKXtl19Ke+TmG2WVpWVTuHOXuG7gC w+Sq1eEXiKUB+193fZxXLhY+TdlLpQHffqyezBpzASF29eh+q39smyI8HvyIBXBaRMAf 0IROf+6KPgl6cVCuc2UPlU0InvWi/sD+l5PrF0zIfnsnActf+LophR+EjNIuvH98xQQd 01oPMnjuRSGJwMYpEVWfj+x3NgKF05udpd6pHhBi9EQ8Wy4lWTgZQLbHF+kH3jJ4AShg +PbA== X-Gm-Message-State: ALoCoQlr+9z2yaipnDFafdtd4tGDx+YonlJVVt9x1OHaKSmxcZzwcTVPU2optrHc7OjIvcsuqjJq X-Received: by 10.50.61.234 with SMTP id t10mr14345968igr.19.1430237962233; Tue, 28 Apr 2015 09:19:22 -0700 (PDT) Received: from netflix-mac-wired.bsdimp.com ([50.253.99.174]) by mx.google.com with ESMTPSA id qo11sm7476281igb.17.2015.04.28.09.19.20 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Apr 2015 09:19:21 -0700 (PDT) Sender: Warner Losh Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2098\)) Content-Type: multipart/signed; boundary="Apple-Mail=_851502F5-21E5-4D4C-B196-6A58C0E7DE9E"; protocol="application/pgp-signature"; micalg=pgp-sha512 X-Pgp-Agent: GPGMail 2.5b6 From: Warner Losh In-Reply-To: <1876382.0PQNo3Rp24@ralph.baldwin.cx> Date: Tue, 28 Apr 2015 10:19:20 -0600 Cc: freebsd-arch , Konstantin Belousov , Jason Harmening , Svatopluk Kraus Message-Id: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> <1876382.0PQNo3Rp24@ralph.baldwin.cx> To: John Baldwin X-Mailer: Apple Mail (2.2098) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 16:19:23 -0000

> On Apr 28, 2015, at 7:40 AM, John Baldwin wrote: > >> I believe UIO_USERSPACE is almost unused, it might be there for some >> obscure (and buggy) driver. > > I believe it was added (and only ever used) in crypto drivers, and that they > all did bus_dma operations in the context of the thread that passed in the > uio. I definitely think it is fragile and should be replaced with something > more reliable.
Fusion I/O's SDK used this trick to allow mapping of userspace buffers down into the block layer after doing the requisite locking / pinning / etc of the buffers into memory. That's if memory serves correctly (the SDK did these things; I can't easily check on that detail since I'm no longer at FIO).

Warner

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 16:55:14 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id EF0C3BC7; Tue, 28 Apr 2015 16:55:14 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 7822D1882; Tue, 28 Apr 2015 16:55:14 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3SGt3Ji045172 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 28 Apr 2015 19:55:04 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3SGt3Ji045172 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3SGt3sn045168; Tue, 28 Apr 2015 19:55:03 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 28 Apr 2015 19:55:03 +0300 From: Konstantin Belousov To: John Baldwin Cc: freebsd-arch@freebsd.org, Davide Italiano , Adrian Chadd Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes Message-ID: <20150428165503.GK2390@kib.kiev.ua> References: <1832557.zVusTDjZUx@ralph.baldwin.cx> <20150428141302.GH2390@kib.kiev.ua> <3094092.O50xjOxef9@ralph.baldwin.cx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3094092.O50xjOxef9@ralph.baldwin.cx> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home X-BeenThere:
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 16:55:15 -0000

On Tue, Apr 28, 2015 at 10:23:33AM -0400, John Baldwin wrote: > On Tuesday, April 28, 2015 05:13:02 PM Konstantin Belousov wrote: > > On Tue, Apr 28, 2015 at 09:35:10AM -0400, John Baldwin wrote: > > > On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote: > > > > On 25 April 2015 at 11:18, Davide Italiano wrote: > > > > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote: > > > > >> Hi! > > > > >> > > > > >> I've been doing some NUMA testing on large boxes and I've found that > > > > >> there's lock contention in the ACPI path. It's due to my change a > > > > >> while ago to start using sleep states above ACPI C1 by default. The > > > > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep > > > > >> path that grabs a serialiser lock, and on an 80 thread box this is > > > > >> costly. > > > > >> > > > > >> I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state > > > > >> doesn't require the same register fiddling (to disable bus mastering, > > > > >> if I'm reading it right) and so it doesn't enter that particular > > > > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge > > > > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper > > > > >> CPU sleep state (C6 on each of these). I think it is still a good default > > > > >> for both servers and desktops. > > > > >> > > > > >> If no-one has a problem with this then I'll do it after the weekend. > > > > >> > > > > > > > > > > This sounds to me just a way to hide a problem. > > > > > Very few people nowadays run on NUMA and they can tune the machine as > > > > > they like when they do testing. > > > > > If there's a lock contention problem, it needs to be fixed and not > > > > > hidden under another default. > > > > > > > > The lock contention problem is inside ACPI and how it's designed/implemented. > > > > We're not going to easily be able to make ACPI lock "better" as we're > > > > constrained by how ACPI implements things in the shared ACPICA code. > > > > > > Is the contention actually harmful? Note that this only happens when the > > > CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle > > > stuff uses heuristics to only drop into deeper sleep states if the CPU has > > > recently been idle "more" so that if you are relatively busy you will only go > > > into C1 instead. (I think this latter might have changed since eventtimers > > > came in, it looks like we now choose the idle state based on how long until > > > the next timer interrupt?) > > You have to spin, waiting for other cores, to get the right to reduce the > > power state. > > Yes, normally spinning wouldn't do that, but the cpu idle hooks run with > interrupts disabled. We could fix that, perhaps, though ACPI doesn't quite > have what we would want (a single op that would disable interrupts after > grabbing the lock, do the test and set of the bit in question and return > its old value leaving interrupts disabled after dropping the lock). > > However, I would still like to know if the contention here is actually > harmful in some measurable way aside from showing up in profiling output. I think Adrian could run intel pmc on his box with C2 and C3 and compare the power reports.
> > > Alternatively, your machine may be better off using cpu_idle_mwait. There > > > are already CPUs now that only advertise deeper sleep states for use with > > > mwait but not ACPI, so we may certainly end up with defaulting to mwait > > > instead of ACPI for certain CPUs anyway. > > > > cpu_idle_mwait is quite useless, it only enters C1, which should be > > almost the same as hlt. mwait for C1 might reduce the latency of waking up, > > but definitely would not reduce power consumption on par with higher Cx. > > Mmm, it was your pending patch I was thinking of. Don't you use mwait with > the hints to use deeper sleep states in your change? Only in the acpi idle method. It is not safe to blindly enter states higher than C1 with mwait. Intel wrote a driver for Linux which does not rely on ACPI _CST tables for this. The driver has hard-coded tables for cores >= Nehalem which specify supported states, latency and cache behaviour. This is what I tried to mention in the original mail. If we write such a driver (and rip the tables from Linux), we could allow deeper states in cpu_idle_mwait. But I remember that avg did not like the approach, and I agree that this is not maintainable, if you are not Intel. > > > That said, I think that for non-laptop usage, limiting the lowest state to C2 > > is fine. For Haswells, Intel's recommendation for BIOS writers is to > > limit the announced states to C2 (eliminating BM avoidance altogether). > > Internally, ACPI C2 is mapped to CPU C6 or maybe even C7. > > The problem, of course, is detecting non-laptops. :-/ In my own crude > measurements based on the power draw numbers in the BMC on recent > SuperMicro X9 boards for SandyBridge servers, most of the gain you get is > from C2; C3 doesn't add much difference once you are able to do C2. Also of > note is the comment above the busmaster register in question about USB. I'm > not sure if that is still true anymore. If it were, systems would never go > into C3, in which case this would be a moot point and there would be no need to > enable C3. I remember turbo boost requires C3, and non-trivially deep package C states on older CPUs also require C3. This is an argument against Adrian's change, but I think it is not applicable on newer processors.
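For illustration, the hard-coded tables in question amount to one array per microarchitecture, roughly of this shape (a sketch modeled on Linux's intel_idle; the MWAIT hints are the real sub-state encodings, the latencies only indicative):

	struct cx_entry {
		const char *name;	/* "C1", "C3", "C6", ... */
		uint32_t mwait_hint;	/* EAX hint passed to MWAIT */
		int	exit_latency;	/* wakeup latency, usec */
		bool	flushes_cache;	/* is cache state lost? */
	};

	/* Example entries for Nehalem-class cores. */
	static const struct cx_entry nehalem_cstates[] = {
		{ "C1", 0x00,   3, false },
		{ "C3", 0x10,  20, true  },
		{ "C6", 0x20, 200, true  },
	};

Keeping such tables current for every new core is exactly the maintainability problem mentioned above.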
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 19:10:44 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 914DA82E; Tue, 28 Apr 2015 19:10:44 +0000 (UTC) Received: from mail-ig0-x235.google.com (mail-ig0-x235.google.com [IPv6:2607:f8b0:4001:c05::235]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 56962195F; Tue, 28 Apr 2015 19:10:44 +0000 (UTC) Received: by igblo3 with SMTP id lo3so29229362igb.0; Tue, 28 Apr 2015 12:10:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=YXXRMHR5TssIc7/XIinO/uAA5sROdLXc4Scm+HA05S8=; b=QgQ+MAmjok5hHzKoja4Fb4tNksyei5zkoB0bgU9mJG4BniruS7SNMR1bdfxLvSXBRM oRz5f76UmHSlmM5X3yKFFFo+yqqcpAGHckEh+NoAA67VdGnezBOkFW256vdO2Nj4duaq R9qZpvmMWAinVhAINX9aPrF4XzKbamEU2rIpYLIS54LN5kIkQZ7T6vj5Rll5/4s3MmS/ 45gc1VsrCb7Xx6X0o08oQRAfA5rxux+CpXoLMBQLmut1dN5Lc7FIaCu9x/Mq2/LyBHih g+TEmJPmIEaj7PryEiqLLy1JedboTPND60WcTp/XWm0ioUkfwzI2ct5HInz1hmfyswdt DAmw== MIME-Version: 1.0 X-Received: by 10.50.73.198 with SMTP id n6mr22560481igv.32.1430248243568; Tue, 28 Apr 2015 12:10:43 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.36.38.133 with HTTP; Tue, 28 Apr 2015 12:10:43 -0700 (PDT) In-Reply-To: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> <1876382.0PQNo3Rp24@ralph.baldwin.cx> <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> Date: Tue, 28 Apr 2015 12:10:43 -0700 X-Google-Sender-Auth: 9H9hpSigDX-70d3_tIqGxz5PMEk Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Adrian Chadd To: Warner Losh Cc: John Baldwin , Konstantin Belousov , Jason Harmening , Svatopluk Kraus , freebsd-arch Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 19:10:44 -0000

On 28 April 2015 at 09:19, Warner Losh wrote: > >> On Apr 28, 2015, at 7:40 AM, John Baldwin wrote: >> >>> I believe UIO_USERSPACE is almost unused, it might be there for some >>> obscure (and buggy) driver. >> >> I believe it was added (and only ever used) in crypto drivers, and that they >> all did bus_dma operations in the context of the thread that passed in the >> uio. I definitely think it is fragile and should be replaced with something >> more reliable. > > Fusion I/O's SDK used this trick to allow mapping of userspace buffers down > into the block layer after doing the requisite locking / pinning / etc of the buffers > into memory. That's if memory serves correctly (the SDK did these things, I can't > easily check on that detail since I'm no longer at FIO). This is a long-standing trick. physio() does it too, aio_read/aio_write does it for direct block accesses. Now that pbufs aren't involved anymore, it should scale rather well.
So I'd like to see more of it in the kernel and disk/net APIs and drivers. -adrian

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 22:27:44 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 117655C9; Tue, 28 Apr 2015 22:27:44 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DE5EA1F35; Tue, 28 Apr 2015 22:27:43 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id EDC3FB926; Tue, 28 Apr 2015 18:27:41 -0400 (EDT) From: John Baldwin To: Adrian Chadd Cc: Warner Losh , Konstantin Belousov , Jason Harmening , Svatopluk Kraus , freebsd-arch Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Date: Tue, 28 Apr 2015 18:27:34 -0400 Message-ID: <1761247.Bq816CMB8v@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 28 Apr 2015 18:27:42 -0400 (EDT) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 22:27:44 -0000

On Tuesday, April 28, 2015 12:10:43 PM Adrian Chadd wrote: > On 28 April 2015 at 09:19, Warner Losh wrote: > > > >> On Apr 28, 2015, at 7:40 AM, John Baldwin wrote: > >> > >>> I believe UIO_USERSPACE is almost unused, it might be there for some > >>> obscure (and buggy) driver. > >> > >> I believe it was added (and only ever used) in crypto drivers, and that they > >> all did bus_dma operations in the context of the thread that passed in the > >> uio. I definitely think it is fragile and should be replaced with something > >> more reliable. > > > > Fusion I/O's SDK used this trick to allow mapping of userspace buffers down > > into the block layer after doing the requisite locking / pinning / etc of the buffers > > into memory. That's if memory serves correctly (the SDK did these things, I can't > > easily check on that detail since I'm no longer at FIO). > > This is a long-standing trick. physio() does it too, > aio_read/aio_write does it for direct block accesses. Now that pbufs > aren't involved anymore, it should scale rather well. > > So I'd like to see more of it in the kernel and disk/net APIs and drivers. aio_read/write jump through gross hacks to create dedicated kthreads that "borrow" the address space of the requester. The fact is that we want to make unmapped I/O work in the general case and the same solutions for temporary mappings for that can be reused to temporarily map the wired pages backing a user request when needed. Reusing user mappings directly in the kernel isn't really the way forward.
-- John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 22:39:33 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9770997F; Tue, 28 Apr 2015 22:39:33 +0000 (UTC) Received: from st11p02mm-asmtp001.mac.com (st11p02mm-asmtpout001.mac.com [17.172.220.236]) (using TLSv1.2 with cipher DHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6AF69105B; Tue, 28 Apr 2015 22:39:33 +0000 (UTC) Received: from st11p02mm-spool001.mac.com ([17.172.220.246]) by st11p02mm-asmtp001.mac.com (Oracle Communications Messaging Server 7.0.5.35.0 64bit (built Dec 4 2014)) with ESMTP id <0NNJ000Z8DHMJ260@st11p02mm-asmtp001.mac.com>; Tue, 28 Apr 2015 21:39:25 +0000 (GMT) X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.13.68,1.0.33,0.0.0000 definitions=2015-04-28_07:2015-04-28,2015-04-28,1970-01-01 signatures=0 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 suspectscore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=7.0.1-1412110000 definitions=main-1504280242 MIME-version: 1.0 Received: from localhost ([17.172.220.163]) by st11p02mm-spool001.mac.com (Oracle Communications Messaging Server 7.0.5.33.0 64bit (built Aug 27 2014)) with ESMTP id <0NNJ00FF4DHMBP10@st11p02mm-spool001.mac.com>; Tue, 28 Apr 2015 21:39:22 +0000 (GMT) To: Adrian Chadd Cc: "freebsd-arch@freebsd.org" From: Rui Paulo Subject: Re: RFT: numa policy branch Date: Tue, 28 Apr 2015 21:39:22 +0000 (GMT) X-Mailer: iCloud MailClient15B.8196069 MailServer15B.18830 X-Originating-IP: [12.218.212.178] Message-id: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 22:39:33 -0000

On Apr 26, 2015, at 01:30 PM, Adrian Chadd wrote:

> Hi!
>
> Another update:
>
> * updated to recent -HEAD;
> * numactl now can set memory policy and cpuset domain information - so
> it's easy to say "this runs in memory domain X and cpu domain Y" in
> one pass with it;

That works, but --mempolicy=first-touch should ignore the --memdomain argument (or print an error) if it's present.

> * the locality matrix is now available. Here's an example from scott's
> 2x haswell v3, with cluster-on-die enabled:
>
> vm.phys_locality:
> 0: 10 21 31 31
> 1: 21 10 31 31
> 2: 31 31 10 21
> 3: 31 31 21 10
>
> And on the westmere-ex box, with no SLIT table:
>
> vm.phys_locality:
> 0: -1 -1 -1 -1
> 1: -1 -1 -1 -1
> 2: -1 -1 -1 -1
> 3: -1 -1 -1 -1

This worked for us on IvyBridge with a SLIT table.

> * I've tested it on westmere-ex (4x socket), sandybridge, ivybridge,
> haswell v3 and haswell v3 cluster on die.
> * I've discovered that our implementation of libgomp (from gcc-4.2) is
> very old and doesn't include some of the thread control environment
> variables, grr.
> * .. and that the gcc libgomp code doesn't at all have freebsd thread
> affinity routines, so I added them to gcc-4.8.

I used gcc 4.9.

> I'd appreciate any reviews / testing people are able to provide. I'm
> about at the functionality point where I'd like to submit it for
> formal review and try to land it in -HEAD.

There's a bug in the default sysctl policy. You're calling strcat on an uninitialised string, so it produces garbage output. We also hit a panic when our application starts allocating many GBs of memory. In this case, the memory is split between two sockets and I think it's crashing like you described on IRC.

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 23:32:30 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 7729A3A5 for ; Tue, 28 Apr 2015 23:32:30 +0000 (UTC) Received: from mail-ig0-x235.google.com (mail-ig0-x235.google.com [IPv6:2607:f8b0:4001:c05::235]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 422151688 for ; Tue, 28 Apr 2015 23:32:30 +0000 (UTC) Received: by igbyr2 with SMTP id yr2so105406484igb.0 for ; Tue, 28 Apr 2015 16:32:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=ZZs1ydaXaqCxtEC9pBx/o1ptA2BaBG+l7mud9JS8/u8=; b=kkFN0C/IP3L1GLFuSzIUyjJtMcg8tdS8egO/2RL5RmIoJowBV1hd6VguwcsqYEGpdE NSrb1fNWG5wQB4MUF5Agq6zV6HGwPCMqYuBqtq5MkmV1bA2PJNGtc0/aifcZiAXNJwGU G7FSW/EKx4oKFAq5OzcmexF+ePdmmZ9xng3D54Ap5SgbIpCX85qxPcLtF4i66/RRKpjP TcPVNmzV8AXufYfvYRZIZZgKWbfINukI9J9kZ+DY5EnC10Q/AMcHJAoarEwGC9zwNW6c A/kqmFOcn3d2bGZ2zcwhFthAz7EQhOLoAVxbFALd7bcSOcwB5g7BeBrV+P8NHtVxzVVu IUAA== MIME-Version: 1.0 X-Received: by 10.43.163.129 with SMTP id mo1mr395770icc.61.1430263949658; Tue, 28 Apr 2015 16:32:29 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.36.38.133 with HTTP; Tue, 28 Apr 2015 16:32:29 -0700 (PDT) In-Reply-To: References: Date: Tue, 28 Apr 2015 16:32:29 -0700 X-Google-Sender-Auth: nSierJALtjIeSja88mBq_CJQbF8 Message-ID: Subject: Re: RFT: numa policy branch From: Adrian Chadd To: Rui Paulo Cc: "freebsd-arch@freebsd.org" Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 23:32:30 -0000

On 28 April 2015 at 14:39, Rui Paulo wrote: > On Apr 26, 2015, at 01:30 PM, Adrian Chadd wrote: > > Hi! > > Another update: > > * updated to recent -HEAD; > * numactl now can set memory policy and cpuset domain information - so > it's easy to say "this runs in memory domain X and cpu domain Y" in > one pass with it; > > > That works, but --mempolicy=first-touch should ignore the --memdomain > argument (or print an error) if it's present. Ok. > * the locality matrix is now available.
Here's an example from scott's > 2x haswell v3, with cluster-on-die enabled: > > vm.phys_locality: > 0: 10 21 31 31 > 1: 21 10 31 31 > 2: 31 31 10 21 > 3: 31 31 21 10 > > And on the westmere-ex box, with no SLIT table: > > vm.phys_locality: > 0: -1 -1 -1 -1 > 1: -1 -1 -1 -1 > 2: -1 -1 -1 -1 > 3: -1 -1 -1 -1 > > > This worked for us on IvyBridge with a SLIT table. Cool. > * I've tested it on westmere-ex (4x socket), sandybridge, ivybridge, > haswell v3 and haswell v3 cluster on die. > * I've discovered that our implementation of libgomp (from gcc-4.2) is > very old and doesn't include some of the thread control environment > variables, grr. > * .. and that the gcc libgomp code doesn't at all have freebsd thread > affinity routines, so I added them to gcc-4.8. > > > I used gcc 4.9. > > I'd appreciate any reviews / testing people are able to provide. I'm > about at the functionality point where I'd like to submit it for > formal review and try to land it in -HEAD. > > There's a bug in the default sysctl policy. You're calling strcat on an > uninitialised string, so it produces garbage output. We also hit a > panic when our application starts allocating many GBs of memory. In this > case, the memory is split between two sockets and I think it's crashing like > you described on IRC. I'll fix the former soon, thanks for pointing that out. As for the crash - yeah, I reproduced it and sent a patch to alc for review. It's because vm_page_alloc() doesn't expect calls to vm_phys to fail a second time around. Trouble is - the VM thresholds are all global. Failing an allocation in one domain does cause pagedaemon to start up on that domain, but no paging actually occurs. Unfortunately the pager still thinks there's plenty of memory available, so it doesn't know it needs to run. There's a pagedaemon per domain, but no per-domain thresholds or paging targets. I don't think we're going to be able to fix that in this pass - I'd rather get this or something like this into the kernel so at least first-touch-rr, fixed-domain-rr and rr work. Then yes, the VM will need some updating.
-adrian


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 10:22:20 2015
Date: Wed, 29 Apr 2015 12:22:19 +0200
From: Svatopluk Kraus
To: John Baldwin
Cc: Adrian Chadd, Warner Losh, Konstantin Belousov, Jason Harmening, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 12:27 AM, John Baldwin wrote:
> On Tuesday, April 28, 2015 12:10:43 PM Adrian Chadd wrote:
>> On 28 April 2015 at 09:19, Warner Losh wrote:
>> >
>> >> On Apr 28, 2015, at 7:40 AM, John Baldwin wrote:
>> >>
>> >>> I believe UIO_USERSPACE is almost unused, it might be there for
>> >>> some obscure (and buggy) driver.
>> >>
>> >> I believe it was added (and only ever used) in crypto drivers, and
>> >> that they all did bus_dma operations in the context of the thread
>> >> that passed in the uio. I definitely think it is fragile and
>> >> should be replaced with something more reliable.
>> >
>> > Fusion I/O's SDK used this trick to allow mapping of userspace
>> > buffers down into the block layer after doing the requisite
>> > locking / pinning / etc. of the buffers into memory. That's if
>> > memory serves correctly (the SDK did these things, I can't easily
>> > check on that detail since I'm no longer at FIO).
>>
>> This is a long-standing trick. physio() does it too, and
>> aio_read/aio_write does it for direct block accesses. Now that pbufs
>> aren't involved anymore, it should scale rather well.
>>
>> So I'd like to see more of it in the kernel and disk/net APIs and
>> drivers.

> aio_read/write jump through gross hacks to create dedicated kthreads
> that "borrow" the address space of the requester. The fact is that we
> want to make unmapped I/O work in the general case, and the same
> solutions for temporary mappings there can be reused to temporarily
> map the wired pages backing a user request when needed. Reusing user
> mappings directly in the kernel isn't really the way forward.
>

If using unmapped buffers is the way we will take to handle user space
buffers, then:

(1) DMA clients which support DMA for user space buffers must use some
variant of _bus_dmamap_load_phys(). They must wire the physical pages
in the system anyway.

(2) Maybe some better way to temporarily allocate KVA for unmapped
buffers should be implemented.

(3) DMA clients which already use _bus_dmamap_load_uio() with
UIO_USERSPACE must be reimplemented or made obsolete.

(4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(), and the
man page should be changed accordingly.

(5) And pmap can be deleted from struct bus_dmamap and from all
functions which take it as an argument. Only the kernel pmap will be
used in the DMA framework.

Did I miss something?

> --
> John Baldwin


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 13:20:30 2015
Date: Wed, 29 Apr 2015 16:20:17 +0300
From: Konstantin Belousov
To: Svatopluk Kraus
Cc: John Baldwin, Adrian Chadd, Warner Losh, Jason Harmening, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 12:22:19PM
+0200, Svatopluk Kraus wrote:
> If using unmapped buffers is the way we will take to handle user
> space buffers, then:
>
> (1) DMA clients which support DMA for user space buffers must use
> some variant of _bus_dmamap_load_phys(). They must wire the physical
> pages in the system anyway.
No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
Or yes, if you count bus_dmamap_load_ma() as a variant of _load_phys().
I do not.

> (2) Maybe some better way to temporarily allocate KVA for unmapped
> buffers should be implemented.
See some other mail from me about a non-blocking sfbuf allocator with
a callback.

> (3) DMA clients which already use _bus_dmamap_load_uio() with
> UIO_USERSPACE must be reimplemented or made obsolete.
Yes.

> (4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(), and
> the man page should be changed accordingly.
Yes.

> (5) And pmap can be deleted from struct bus_dmamap and from all
> functions which take it as an argument. Only the kernel pmap will be
> used in the DMA framework.
Probably yes.

>
> Did I miss something?
>
> > --
> > John Baldwin


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 15:16:00 2015
Date: Wed, 29 Apr 2015 17:09:18 +0200
From: Svatopluk Kraus
To: Konstantin Belousov
Cc: John Baldwin, Adrian Chadd, Warner Losh, Jason Harmening, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 3:20 PM, Konstantin Belousov wrote:
> On Wed, Apr 29, 2015 at 12:22:19PM +0200, Svatopluk Kraus wrote:
>> If using unmapped buffers is the way
we will take to handle user
>> space buffers, then:
>>
>> (1) DMA clients which support DMA for user space buffers must use
>> some variant of _bus_dmamap_load_phys(). They must wire the physical
>> pages in the system anyway.
> No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
> Or yes, if you count bus_dmamap_load_ma() as a variant of _load_phys().
> I do not.

There are only two basic functions in the MD implementations which all
other functions call: _bus_dmamap_load_phys() and
_bus_dmamap_load_buffer(), for unmapped buffers and mapped ones
respectively. Are you saying that bus_dmamap_load_ma() should be some
third kind?

>
>> (2) Maybe some better way to temporarily allocate KVA for unmapped
>> buffers should be implemented.
> See some other mail from me about a non-blocking sfbuf allocator with
> a callback.

This small list was meant as a summary. As I saw your emails in this
thread, I added this point. I did not realize it's already in the
source tree.

>
>> (3) DMA clients which already use _bus_dmamap_load_uio() with
>> UIO_USERSPACE must be reimplemented or made obsolete.
> Yes.
>
>> (4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(), and
>> the man page should be changed accordingly.
> Yes.

Hmm, I think that for a start, _bus_dmamap_load_uio() for
UIO_USERSPACE can be hacked to use bus_dmamap_load_ma(), maybe with
some warning to force users of the old clients to reimplement them.

>
>> (5) And pmap can be deleted from struct bus_dmamap and from all
>> functions which take it as an argument. Only the kernel pmap will be
>> used in the DMA framework.
> Probably yes.
>


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 16:54:39 2015
Date: Wed, 29 Apr 2015 19:54:32 +0300
From: Konstantin Belousov
To: Svatopluk Kraus
Cc: John Baldwin, Adrian Chadd, Warner Losh, Jason Harmening, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 05:09:18PM +0200, Svatopluk Kraus wrote:
> On Wed, Apr 29, 2015 at 3:20 PM, Konstantin Belousov wrote:
> > On Wed, Apr 29, 2015 at 12:22:19PM +0200, Svatopluk Kraus wrote:
> >> If using unmapped buffers is the way we will take to handle user
> >> space buffers, then:
> >>
> >> (1) DMA clients which support DMA for user space buffers must use
> >> some variant of _bus_dmamap_load_phys(). They must wire the
> >> physical pages in the system anyway.
> > No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
> > Or yes, if you count bus_dmamap_load_ma() as a variant of
> > _load_phys(). I do not.
>
> There are only two basic functions in the MD implementations which
> all other functions call: _bus_dmamap_load_phys() and
> _bus_dmamap_load_buffer(), for unmapped buffers and mapped ones
> respectively. Are you saying that bus_dmamap_load_ma() should be
> some third kind?
It is. On the VT-d backed x86 busdma, load_ma() is the fundamental
function, which is called both by _load_buffer() and _load_phys().
This is not completely true: the real backstage worker is called
_load_something(), but it differs from _load_ma() only by taking a
casted tag and map.

On the other hand, the load_ma_triv() wrapper implements _load_ma()
using load_phys() on architectures which do not yet provide a native
_load_ma(), or where a native _load_ma() does not make sense.

>
> >> (2) Maybe some better way to temporarily allocate KVA for
> >> unmapped buffers should be implemented.
> > See some other mail from me about a non-blocking sfbuf allocator
> > with a callback.
>
> This small list was meant as a summary. As I saw your emails in this
> thread, I added this point. I did not realize it's already in the
> source tree.
No, it is not. I stopped working on it during the unmapped i/o work,
after I realized that there was not much interest from device driver
authors. Nobody cared about drivers like ATA PIO. Now, with the new
possible use for the non-blocking sfbuf allocator, it can be revived.

>
> >> (3) DMA clients which already use _bus_dmamap_load_uio() with
> >> UIO_USERSPACE must be reimplemented or made obsolete.
> > Yes.
>
> >> (4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(),
> >> and the man page should be changed accordingly.
> > Yes.
>
> Hmm, I think that for a start, _bus_dmamap_load_uio() for
> UIO_USERSPACE can be hacked to use bus_dmamap_load_ma(), maybe with
> some warning to force users of the old clients to reimplement them.
Also it would be a good test for my claim that
vm_fault_quick_hold_pages() + bus_dmamap_load_ma() is all that is
needed.

>
> >> (5) And pmap can be deleted from struct bus_dmamap and from all
> >> functions which take it as an argument. Only the kernel pmap will
> >> be used in the DMA framework.
> > Probably yes.
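To make that path concrete, here is a rough, untested sketch of what a
driver-side consumer could look like. The helper name and the MAXPAGES
bound are made up for illustration, and the calling convention for
_bus_dmamap_load_ma() follows the current internal MI/MD interface, so
it may change if a public bus_dmamap_load_ma() is ever added:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/proc.h>
#include <machine/bus.h>
#include <vm/vm.h>
#include <vm/vm_extern.h>
#include <vm/vm_map.h>
#include <vm/vm_page.h>

#define	MAXPAGES	8	/* arbitrary bound for this sketch */

/*
 * Wire the physical pages backing a short user buffer and load them
 * as unmapped pages, so busdma never sees the user VA.  The caller
 * must keep the pages held until the transfer completes, then
 * bus_dmamap_unload() and vm_page_unhold_pages().
 */
static int
load_user_buf(bus_dma_tag_t tag, bus_dmamap_t map, void *uaddr,
    size_t len, bus_dma_segment_t *segs, int *nsegs)
{
	vm_page_t ma[MAXPAGES];
	int count, error;

	/* Fault in and hold the backing pages; returns -1 on failure. */
	count = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    (vm_offset_t)uaddr, len, VM_PROT_READ | VM_PROT_WRITE, ma,
	    MAXPAGES);
	if (count == -1)
		return (EFAULT);

	/* Load the held pages; only physical addresses are used. */
	*nsegs = -1;
	error = _bus_dmamap_load_ma(tag, map, ma, len,
	    (vm_offset_t)uaddr & PAGE_MASK, BUS_DMA_NOWAIT, segs, nsegs);
	(*nsegs)++;
	if (error != 0)
		vm_page_unhold_pages(ma, count);
	return (error);
}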
From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 18:04:48 2015
Date: Wed, 29 Apr 2015 13:04:46 -0500
From: Jason Harmening
To: Konstantin Belousov
Cc: Svatopluk Kraus, John Baldwin, Adrian Chadd, Warner Losh, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

So, here's a patch that would add unmapped user bounce-buffer support
for the existing UIO_USERSPACE cases. I've only made sure it builds
(everywhere) and given it a quick check on amd64.

Things to note:
--no changes to sparc64 and intel dmar, because they don't use bounce
buffers
--effectively adds UIO_USERSPACE support for mips, which was a KASSERT
before
--I am worried about the cache maintenance operations for arm and
mips. I'm not an expert in non-coherent architectures. In particular,
I'm not sure what (if any) allowances need to be made for user VAs
that may be present in VIPT caches on other cores of SMP systems.
--the above point about cache maintenance also makes me wonder how it
should be handled for drivers that would use
vm_fault_quick_hold_pages() + bus_dmamap_load_ma(). Presumably, some
UVAs for the buffer could be present in caches for the same or another
core.
Index: sys/arm/arm/busdma_machdep-v6.c =================================================================== --- sys/arm/arm/busdma_machdep-v6.c (revision 282208) +++ sys/arm/arm/busdma_machdep-v6.c (working copy) @@ -1309,15 +1309,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t { struct bounce_page *bpage; struct sync_list *sl, *end; - /* - * If the buffer was from user space, it is possible that this is not - * the same vm map, especially on a POST operation. It's not clear that - * dma on userland buffers can work at all right now. To be safe, until - * we're able to test direct userland dma, panic on a map mismatch. - */ + if ((bpage = STAILQ_FIRST(&map->bpages)) != NULL) { - if (!pmap_dmap_iscurrent(map->pmap)) - panic("_bus_dmamap_sync: wrong user map for bounce sync."); CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x op 0x%x " "performing bounce", __func__, dmat, dmat->flags, op); @@ -1328,14 +1321,10 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t */ if (op & BUS_DMASYNC_PREWRITE) { while (bpage != NULL) { - if (bpage->datavaddr != 0) - bcopy((void *)bpage->datavaddr, - (void *)bpage->vaddr, - bpage->datacount); + if (bpage->datavaddr != 0 && pmap_dmap_iscurrent(map->pmap)) + bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount); else - physcopyout(bpage->dataaddr, - (void *)bpage->vaddr, - bpage->datacount); + physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount); cpu_dcache_wb_range((vm_offset_t)bpage->vaddr, bpage->datacount); l2cache_wb_range((vm_offset_t)bpage->vaddr, @@ -1396,14 +1385,10 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t arm_dcache_align; l2cache_inv_range(startv, startp, len); cpu_dcache_inv_range(startv, len); - if (bpage->datavaddr != 0) - bcopy((void *)bpage->vaddr, - (void *)bpage->datavaddr, - bpage->datacount); + if (bpage->datavaddr != 0 && pmap_dmap_iscurrent(map->pmap)) + bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount); else - physcopyin((void *)bpage->vaddr, - bpage->dataaddr, - bpage->datacount); + physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; @@ -1433,10 +1418,15 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t * that the sequence is inner-to-outer for PREREAD invalidation and * outer-to-inner for POSTREAD invalidation is not a mistake. */ +#ifndef ARM_L2_PIPT + /* + * If we don't have any physically-indexed caches, we don't need to do + * cache maintenance if we're not in the context that owns the VA. 
+ */ + if (!pmap_dmap_iscurrent(map->pmap)) + return; +#endif if (map->sync_count != 0) { - if (!pmap_dmap_iscurrent(map->pmap)) - panic("_bus_dmamap_sync: wrong user map for sync."); - sl = &map->slist[0]; end = &map->slist[map->sync_count]; CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x op 0x%x " @@ -1446,7 +1436,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t case BUS_DMASYNC_PREWRITE: case BUS_DMASYNC_PREWRITE | BUS_DMASYNC_PREREAD: while (sl != end) { - cpu_dcache_wb_range(sl->vaddr, sl->datacount); + if (pmap_dmap_iscurrent(map->pmap)) + cpu_dcache_wb_range(sl->vaddr, sl->datacount); l2cache_wb_range(sl->vaddr, sl->busaddr, sl->datacount); sl++; @@ -1472,7 +1463,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t l2cache_wb_range(sl->vaddr, sl->busaddr, 1); } - cpu_dcache_inv_range(sl->vaddr, sl->datacount); + if (pmap_dmap_iscurrent(map->pmap)) + cpu_dcache_inv_range(sl->vaddr, sl->datacount); l2cache_inv_range(sl->vaddr, sl->busaddr, sl->datacount); sl++; @@ -1487,7 +1479,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t while (sl != end) { l2cache_inv_range(sl->vaddr, sl->busaddr, sl->datacount); - cpu_dcache_inv_range(sl->vaddr, sl->datacount); + if (pmap_dmap_iscurrent(map->pmap)) + cpu_dcache_inv_range(sl->vaddr, sl->datacount); sl++; } break; Index: sys/arm/arm/busdma_machdep.c =================================================================== --- sys/arm/arm/busdma_machdep.c (revision 282208) +++ sys/arm/arm/busdma_machdep.c (working copy) @@ -131,7 +131,6 @@ struct bounce_page { struct sync_list { vm_offset_t vaddr; /* kva of bounce buffer */ - bus_addr_t busaddr; /* Physical address */ bus_size_t datacount; /* client data count */ }; @@ -177,6 +176,7 @@ struct bus_dmamap { STAILQ_ENTRY(bus_dmamap) links; bus_dmamap_callback_t *callback; void *callback_arg; + pmap_t pmap; int sync_count; struct sync_list *slist; }; @@ -831,7 +831,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma } static void -_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, +_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; @@ -851,10 +851,10 @@ static void vendaddr = (vm_offset_t)buf + buflen; while (vaddr < vendaddr) { - if (__predict_true(pmap == kernel_pmap)) + if (__predict_true(map->pmap == kernel_pmap)) paddr = pmap_kextract(vaddr); else - paddr = pmap_extract(pmap, vaddr); + paddr = pmap_extract(map->pmap, vaddr); if (run_filter(dmat, paddr) != 0) map->pagesneeded++; vaddr += PAGE_SIZE; @@ -1009,7 +1009,7 @@ _bus_dmamap_load_ma(bus_dma_tag_t dmat, bus_dmamap */ int _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, - bus_size_t buflen, struct pmap *pmap, int flags, bus_dma_segment_t *segs, + bus_size_t buflen, pmap_t pmap, int flags, bus_dma_segment_t *segs, int *segp) { bus_size_t sgsize; @@ -1023,8 +1023,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm if ((flags & BUS_DMA_LOAD_MBUF) != 0) map->flags |= DMAMAP_CACHE_ALIGNED; + map->pmap = pmap; + if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) { - _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); + _bus_dmamap_count_pages(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) @@ -1042,6 +1044,8 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm curaddr = pmap_kextract(vaddr); } else { curaddr = pmap_extract(pmap, vaddr); + if (curaddr == 0) + goto cleanup; map->flags &= ~DMAMAP_COHERENT; } @@ -1067,7 +1071,6 @@ 
_bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm sl++; sl->vaddr = vaddr; sl->datacount = sgsize; - sl->busaddr = curaddr; } else sl->datacount += sgsize; } @@ -1205,12 +1208,11 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap STAILQ_FOREACH(bpage, &map->bpages, links) { if (op & BUS_DMASYNC_PREWRITE) { - if (bpage->datavaddr != 0) - bcopy((void *)bpage->datavaddr, - (void *)bpage->vaddr, bpage->datacount); + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount); else - physcopyout(bpage->dataaddr, - (void *)bpage->vaddr,bpage->datacount); + physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount); cpu_dcache_wb_range(bpage->vaddr, bpage->datacount); cpu_l2cache_wb_range(bpage->vaddr, bpage->datacount); dmat->bounce_zone->total_bounced++; @@ -1218,12 +1220,11 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap if (op & BUS_DMASYNC_POSTREAD) { cpu_dcache_inv_range(bpage->vaddr, bpage->datacount); cpu_l2cache_inv_range(bpage->vaddr, bpage->datacount); - if (bpage->datavaddr != 0) - bcopy((void *)bpage->vaddr, - (void *)bpage->datavaddr, bpage->datacount); + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount); else - physcopyin((void *)bpage->vaddr, - bpage->dataaddr, bpage->datacount); + physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount); dmat->bounce_zone->total_bounced++; } } @@ -1243,7 +1244,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t _bus_dmamap_sync_bp(dmat, map, op); CTR3(KTR_BUSDMA, "%s: op %x flags %x", __func__, op, map->flags); bufaligned = (map->flags & DMAMAP_CACHE_ALIGNED); - if (map->sync_count) { + if (map->sync_count != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) { end = &map->slist[map->sync_count]; for (sl = &map->slist[0]; sl != end; sl++) bus_dmamap_sync_buf(sl->vaddr, sl->datacount, op, Index: sys/mips/mips/busdma_machdep.c =================================================================== --- sys/mips/mips/busdma_machdep.c (revision 282208) +++ sys/mips/mips/busdma_machdep.c (working copy) @@ -96,7 +96,6 @@ struct bounce_page { struct sync_list { vm_offset_t vaddr; /* kva of bounce buffer */ - bus_addr_t busaddr; /* Physical address */ bus_size_t datacount; /* client data count */ }; @@ -144,6 +143,7 @@ struct bus_dmamap { void *allocbuffer; TAILQ_ENTRY(bus_dmamap) freelist; STAILQ_ENTRY(bus_dmamap) links; + pmap_t pmap; bus_dmamap_callback_t *callback; void *callback_arg; int sync_count; @@ -725,7 +725,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma } static void -_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, +_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; @@ -747,9 +747,11 @@ static void while (vaddr < vendaddr) { bus_size_t sg_len; - KASSERT(kernel_pmap == pmap, ("pmap is not kernel pmap")); sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK); - paddr = pmap_kextract(vaddr); + if (map->pmap == kernel_pmap) + paddr = pmap_kextract(vaddr); + else + paddr = pmap_extract(map->pmap, vaddr); if (((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) && run_filter(dmat, paddr) != 0) { sg_len = roundup2(sg_len, dmat->alignment); @@ -895,7 +897,7 @@ _bus_dmamap_load_ma(bus_dma_tag_t dmat, bus_dmamap */ int 
_bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, - bus_size_t buflen, struct pmap *pmap, int flags, bus_dma_segment_t *segs, + bus_size_t buflen, pmap_t pmap, int flags, bus_dma_segment_t *segs, int *segp) { bus_size_t sgsize; @@ -908,8 +910,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm if (segs == NULL) segs = dmat->segments; + map->pmap = pmap; + if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) { - _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); + _bus_dmamap_count_pages(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) @@ -922,12 +926,11 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm while (buflen > 0) { /* * Get the physical address for this segment. - * - * XXX Don't support checking for coherent mappings - * XXX in user address space. */ - KASSERT(kernel_pmap == pmap, ("pmap is not kernel pmap")); - curaddr = pmap_kextract(vaddr); + if (pmap == kernel_pmap) + curaddr = pmap_kextract(vaddr); + else + curaddr = pmap_extract(pmap, vaddr); /* * Compute the segment size, and adjust counts. @@ -951,7 +954,6 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm sl++; sl->vaddr = vaddr; sl->datacount = sgsize; - sl->busaddr = curaddr; } else sl->datacount += sgsize; } @@ -1111,17 +1113,14 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap STAILQ_FOREACH(bpage, &map->bpages, links) { if (op & BUS_DMASYNC_PREWRITE) { - if (bpage->datavaddr != 0) + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) bcopy((void *)bpage->datavaddr, - (void *)(bpage->vaddr_nocache != 0 ? - bpage->vaddr_nocache : - bpage->vaddr), + (void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr), bpage->datacount); else physcopyout(bpage->dataaddr, - (void *)(bpage->vaddr_nocache != 0 ? - bpage->vaddr_nocache : - bpage->vaddr), + (void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr), bpage->datacount); if (bpage->vaddr_nocache == 0) { mips_dcache_wb_range(bpage->vaddr, @@ -1134,13 +1133,12 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap mips_dcache_inv_range(bpage->vaddr, bpage->datacount); } - if (bpage->datavaddr != 0) - bcopy((void *)(bpage->vaddr_nocache != 0 ? - bpage->vaddr_nocache : bpage->vaddr), + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr), (void *)bpage->datavaddr, bpage->datacount); else - physcopyin((void *)(bpage->vaddr_nocache != 0 ? - bpage->vaddr_nocache : bpage->vaddr), + physcopyin((void *)(bpage->vaddr_nocache != 0 ? 
bpage->vaddr_nocache : bpage->vaddr), bpage->dataaddr, bpage->datacount); dmat->bounce_zone->total_bounced++; } @@ -1164,7 +1162,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t return; CTR3(KTR_BUSDMA, "%s: op %x flags %x", __func__, op, map->flags); - if (map->sync_count) { + if (map->sync_count != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) { end = &map->slist[map->sync_count]; for (sl = &map->slist[0]; sl != end; sl++) bus_dmamap_sync_buf(sl->vaddr, sl->datacount, op); Index: sys/powerpc/powerpc/busdma_machdep.c =================================================================== --- sys/powerpc/powerpc/busdma_machdep.c (revision 282208) +++ sys/powerpc/powerpc/busdma_machdep.c (working copy) @@ -131,6 +131,7 @@ struct bus_dmamap { int nsegs; bus_dmamap_callback_t *callback; void *callback_arg; + pmap_t pmap; STAILQ_ENTRY(bus_dmamap) links; int contigalloc; }; @@ -596,7 +597,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma } static void -_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, +_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; @@ -619,10 +620,10 @@ static void bus_size_t sg_len; sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK); - if (pmap == kernel_pmap) + if (map->pmap == kernel_pmap) paddr = pmap_kextract(vaddr); else - paddr = pmap_extract(pmap, vaddr); + paddr = pmap_extract(map->pmap, vaddr); if (run_filter(dmat, paddr) != 0) { sg_len = roundup2(sg_len, dmat->alignment); map->pagesneeded++; @@ -785,8 +786,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, if (segs == NULL) segs = map->segments; + map->pmap = pmap; + if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) { - _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); + _bus_dmamap_count_pages(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) @@ -905,14 +908,11 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t if (op & BUS_DMASYNC_PREWRITE) { while (bpage != NULL) { - if (bpage->datavaddr != 0) - bcopy((void *)bpage->datavaddr, - (void *)bpage->vaddr, - bpage->datacount); + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount); else - physcopyout(bpage->dataaddr, - (void *)bpage->vaddr, - bpage->datacount); + physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; @@ -920,13 +920,11 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t if (op & BUS_DMASYNC_POSTREAD) { while (bpage != NULL) { - if (bpage->datavaddr != 0) - bcopy((void *)bpage->vaddr, - (void *)bpage->datavaddr, - bpage->datacount); + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount); else - physcopyin((void *)bpage->vaddr, - bpage->dataaddr, bpage->datacount); + physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; Index: sys/x86/x86/busdma_bounce.c =================================================================== --- sys/x86/x86/busdma_bounce.c (revision 282208) +++ sys/x86/x86/busdma_bounce.c (working copy) @@ -121,6 +121,7 @@ struct bus_dmamap { struct memdesc mem; bus_dmamap_callback_t 
*callback; void *callback_arg; + pmap_t pmap; STAILQ_ENTRY(bus_dmamap) links; }; @@ -139,7 +140,7 @@ static bus_addr_t add_bounce_page(bus_dma_tag_t dm static void free_bounce_page(bus_dma_tag_t dmat, struct bounce_page *bpage); int run_filter(bus_dma_tag_t dmat, bus_addr_t paddr); static void _bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, - pmap_t pmap, void *buf, bus_size_t buflen, + void *buf, bus_size_t buflen, int flags); static void _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dmamap_t map, vm_paddr_t buf, bus_size_t buflen, @@ -491,7 +492,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma } static void -_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, +_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; @@ -515,10 +516,10 @@ static void while (vaddr < vendaddr) { sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK); - if (pmap == kernel_pmap) + if (map->pmap == kernel_pmap) paddr = pmap_kextract(vaddr); else - paddr = pmap_extract(pmap, vaddr); + paddr = pmap_extract(map->pmap, vaddr); if (bus_dma_run_filter(&dmat->common, paddr) != 0) { sg_len = roundup2(sg_len, dmat->common.alignment); @@ -668,12 +669,14 @@ bounce_bus_dmamap_load_buffer(bus_dma_tag_t dmat, if (map == NULL) map = &nobounce_dmamap; + else + map->pmap = pmap; if (segs == NULL) segs = dmat->segments; if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0) { - _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); + _bus_dmamap_count_pages(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) @@ -775,15 +778,11 @@ bounce_bus_dmamap_sync(bus_dma_tag_t dmat, bus_dma if ((op & BUS_DMASYNC_PREWRITE) != 0) { while (bpage != NULL) { - if (bpage->datavaddr != 0) { - bcopy((void *)bpage->datavaddr, - (void *)bpage->vaddr, - bpage->datacount); - } else { - physcopyout(bpage->dataaddr, - (void *)bpage->vaddr, - bpage->datacount); - } + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount); + else + physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; @@ -791,15 +790,11 @@ bounce_bus_dmamap_sync(bus_dma_tag_t dmat, bus_dma if ((op & BUS_DMASYNC_POSTREAD) != 0) { while (bpage != NULL) { - if (bpage->datavaddr != 0) { - bcopy((void *)bpage->vaddr, - (void *)bpage->datavaddr, - bpage->datacount); - } else { - physcopyin((void *)bpage->vaddr, - bpage->dataaddr, - bpage->datacount); - } + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount); + else + physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 18:50:32 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 96E23A34; Wed, 29 Apr 2015 18:50:32 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client 
did not present a certificate)
Date: Wed, 29 Apr 2015 21:50:19 +0300
From: Konstantin Belousov
To: Jason Harmening
Cc: Svatopluk Kraus, John Baldwin, Adrian Chadd, Warner Losh, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 01:04:46PM -0500, Jason Harmening wrote:
> So, here's a patch that would add unmapped user bounce-buffer support
> for the existing UIO_USERSPACE cases. I've only made sure it builds
> (everywhere) and given it a quick check on amd64.
>
> Things to note:
> --no changes to sparc64 and intel dmar, because they don't use bounce
> buffers
> --effectively adds UIO_USERSPACE support for mips, which was a
> KASSERT before
> --I am worried about the cache maintenance operations for arm and
> mips. I'm not an expert in non-coherent architectures. In particular,
> I'm not sure what (if any) allowances need to be made for user VAs
> that may be present in VIPT caches on other cores of SMP systems.
> --the above point about cache maintenance also makes me wonder how it
> should be handled for drivers that would use
> vm_fault_quick_hold_pages() + bus_dmamap_load_ma(). Presumably, some
> UVAs for the buffer could be present in caches for the same or
> another core.

The spaces/tabs in your mail are damaged. It does not matter in the
text, but it makes the patch hard to read and impossible to apply.

I only read the x86/busdma_bounce.c part. It looks fine in the part
where you add the test for the current pmap being identical to the
pmap owning the user page mapping.

I do not understand the part of the diff for the bcopy/physcopyout
lines: I cannot find non-whitespace changes there, and a whitespace
change would make the lines too long. Did I misread the patch?

BTW, why not use physcopyout() unconditionally on x86? To avoid i386
sfbuf allocation failures?

For non-coherent arches, isn't the issue of CPUs having filled caches
for the DMA region present regardless of the vm_fault_quick_hold()
use?
DMASYNC_PREREAD/WRITE must ensure that the lines are written back and
invalidated even now, or always fall back to using bounce pages.

> [full patch quoted in the original; snipped]


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 19:17:51 2015
Date: Wed, 29 Apr 2015 14:17:50 -0500
From: Jason Harmening
To: Konstantin Belousov
Cc: Svatopluk Kraus, John Baldwin, Adrian Chadd, Warner Losh, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

> The spaces/tabs in your mail are damaged. It does not matter in the
> text, but it makes the patch hard to read and impossible to apply.

Ugh. I'm at work right now and using the gmail web client. It seems
like every day I find a new way in which that thing is incredibly
unfriendly for use with mailing lists. I will re-post the patch from a
sane mail client later.

> I only read the x86/busdma_bounce.c part. It looks fine in the part
> where you add the test for the current pmap being identical to the
> pmap owning the user page mapping.
>
> I do not understand the part of the diff for the bcopy/physcopyout
> lines: I cannot find non-whitespace changes there, and a whitespace
> change would make the lines too long. Did I misread the patch?

You probably misread it, since it is unreadable. There is a section in
bounce_bus_dmamap_sync() where I check for map->pmap being kernel_pmap
or curproc's pmap before doing the bcopy.

> BTW, why not use physcopyout() unconditionally on x86? To avoid i386
> sfbuf allocation failures?

Yes.

> For non-coherent arches, isn't the issue of CPUs having filled caches
> for the DMA region present regardless of the vm_fault_quick_hold()
> use?
> DMASYNC_PREREAD/WRITE must ensure that the lines are written back and > invalidated even now, or always fall back to use bounce page. > > Yes, that needs to be done regardless of how the pages are wired. The particular problem here is that some caches on arm and mips are virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)). That means the flush/invalidate instructions need virtual addresses, so figuring out the correct UVA to use for those could be a challenge. As I understand it, VIPT caches usually do have some hardware logic for finding all the cachelines that correspond to a physical address, so they can handle multiple VA mappings of the same PA. But it is unclear to me how cross-processor cache maintenance is supposed to work with VIPT caches on SMP systems. If the caches were physically-indexed, then I don't think there would be an issue. You'd just pass the PA to the flush/invalidate instruction, and presumably a sane SMP implementation would propagate that to other cores via IPI. From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 19:33:44 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 38C45F1B; Wed, 29 Apr 2015 19:33:44 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B6AF61D86; Wed, 29 Apr 2015 19:33:43 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3TJXb7x062579 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 29 Apr 2015 22:33:37 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3TJXb7x062579 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3TJXbHf062578; Wed, 29 Apr 2015 22:33:37 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 29 Apr 2015 22:33:37 +0300 From: Konstantin Belousov To: Jason Harmening Cc: Svatopluk Kraus , John Baldwin , Adrian Chadd , Warner Losh , freebsd-arch Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Message-ID: <20150429193337.GQ2390@kib.kiev.ua> References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 19:33:44 -0000 On Wed, Apr 29, 2015 at 02:17:50PM -0500, Jason Harmening wrote: > > > > > > The spaces/tabs in your mail are damaged. 
It does not matter in the
> > text, but makes the patch unapplicable and hardly readable.
> >
> Ugh. I'm at work right now and using the gmail web client. It seems like
> every day I find a new way in which that thing is incredibly unfriendly for
> use with mailing lists.
> I will re-post the patch from a sane mail client later.
>
> >
> > I only read the x86/busdma_bounce.c part. It looks fine in the part
> > where you add the test for the current pmap being identical to the pmap
> > owning the user page mapping.
> >
> > I do not understand the part of the diff for bcopy/physcopyout lines,
> > I cannot find non-whitespace changes there, and whitespace change would
> > make too long line. Did I misread the patch ?\
> >
> You probably misread it, since it is unreadable. There is a section in
> bounce_bus_dmamap_sync() where I check for map->pmap being kernel_pmap or
> curproc's pmap before doing bcopy.

See the paragraph in my mail before the one you answered. I am asking
about the bcopy()/physcopyout() lines in the diff, not about the if ()
conditions change. The latter is definitely fine.

> >
> > BTW, why not use physcopyout() unconditionally on x86 ? To avoid i386 sfbuf
> > allocation failures ?
> >
> Yes.
>
> >
> > For non-coherent arches, isn't the issue of CPUs having filled caches
> > for the DMA region present regardless of the vm_fault_quick_hold() use ?
> > DMASYNC_PREREAD/WRITE must ensure that the lines are written back and
> > invalidated even now, or always fall back to use bounce page.
> >
>
> Yes, that needs to be done regardless of how the pages are wired. The
> particular problem here is that some caches on arm and mips are
> virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)).
> That means the flush/invalidate instructions need virtual addresses, so
> figuring out the correct UVA to use for those could be a challenge. As I
> understand it, VIPT caches usually do have some hardware logic for finding
> all the cachelines that correspond to a physical address, so they can
> handle multiple VA mappings of the same PA. But it is unclear to me how
> cross-processor cache maintenance is supposed to work with VIPT caches on
> SMP systems.
>
> If the caches were physically-indexed, then I don't think there would be an
> issue. You'd just pass the PA to the flush/invalidate instruction, and
> presumably a sane SMP implementation would propagate that to other cores
> via IPI.

Even without SMP, a VIPT cache cannot hold two mappings of the same page.
As I understand it, sometimes it is more involved, e.g. if mappings have
the correct color (e.g. on ultrasparcs), then the cache can deal with
aliasing. Otherwise pmap has to map the page uncached for all mappings.

I do not see what would make this case special for SMP after that.
Cache invalidation would be either not needed, or coherency domain
propagation of the virtual address does the right thing.
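For reference, the pmap test discussed above reduces to something like the
following helper (the helper name is invented; the patch itself open-codes
the condition at each copy site):

	static bool
	busdma_map_pmap_is_current(bus_dmamap_t map)
	{
		/*
		 * Kernel-loaded maps can always be bcopy()ed; a map
		 * loaded from a user pmap only if we are running in
		 * the owning process' context.
		 */
		return (map->pmap == kernel_pmap ||
		    map->pmap == vmspace_pmap(curproc->p_vmspace));
	}

	...
	if (bpage->datavaddr != 0 && busdma_map_pmap_is_current(map))
		bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr,
		    bpage->datacount);
	else
		physcopyout(bpage->dataaddr, (void *)bpage->vaddr,
		    bpage->datacount);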
From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 19:59:05 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 983235A0; Wed, 29 Apr 2015 19:59:05 +0000 (UTC) Received: from mail-ig0-x230.google.com (mail-ig0-x230.google.com [IPv6:2607:f8b0:4001:c05::230]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 5AB801FDA; Wed, 29 Apr 2015 19:59:05 +0000 (UTC) Received: by igblo3 with SMTP id lo3so127421146igb.1; Wed, 29 Apr 2015 12:59:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=7Q2lep1yKgOp92i7Jdy38JoyONV+nE+5XTwrfLLxzoI=; b=VPN9c9PYvN3SoDQXW8WhgzcdR1qW7RKl8CrP+jmzU+6iGnFjxlfVpFKIBRvky0u8Jv IgGeqQ3ImTLeI/JfLuTk+XEIKX3P9UMOyDhhXppT5mVMDzgr71FIjpnpuq1oT4qdFHI5 kRaX6HqJYygTczNV2x7S/AEvcUl1V5EqRLI+YIR5ZP53hFUuiL3AOnVQe9Ur8iEuoMOK t4HpM8XnAmO3TpXwc02WCXLGXb/vAOjtNVafvfkGYrgKOxjh/2Zp7nH66FB00X7nzXCv R0ktxAahRQ2iKIvmCwXzWr5iYDgH2REsPI3wNL8F8Gt8QB34B4LGeN4n1LX/6P1gT2nW 2afQ== MIME-Version: 1.0 X-Received: by 10.50.41.8 with SMTP id b8mr29704345igl.38.1430337542406; Wed, 29 Apr 2015 12:59:02 -0700 (PDT) Received: by 10.36.106.70 with HTTP; Wed, 29 Apr 2015 12:59:02 -0700 (PDT) In-Reply-To: <20150429193337.GQ2390@kib.kiev.ua> References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> Date: Wed, 29 Apr 2015 14:59:02 -0500 Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Jason Harmening To: Konstantin Belousov Cc: Svatopluk Kraus , John Baldwin , Adrian Chadd , Warner Losh , freebsd-arch Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 19:59:05 -0000 > > See the paragraph in my mail before the one you answered. > I am asking about the bcopy()/physcopyout() lines in diff, not about > the if () conditions change. The later is definitely fine. > Oh, yes, sorry. There were a couple of whitespace changes there, but nothing of consequence. > Even without SMP, VIPT cache cannot hold two mappings of the same page. > As I understand, sometimes it is more involved, eg if mappings have > correct color (eg. on ultrasparcs), then cache can deal with aliasing. > Otherwise pmap has to map the page uncached for all mappings. > Yes, you are right. Regardless of whatever logic the cache uses (or doesn't use), FreeBSD's page-coloring scheme should prevent that. > > I do not see what would make this case special for SMP after that. > Cache invalidation would be either not needed, or coherency domain > propagation of the virtual address does the right thing. 
> Since VIPT cache operations require a virtual address, I'm wondering about the case where different processes are running on different cores, and the same UVA corresponds to a completely different physical page for each of those processes. If the d-cache for each core contains that UVA, then what does it mean when one core issues a flush/invalidate instruction for that UVA? Admittedly, there's a lot I don't know about how that's supposed to work in the arm/mips SMP world. For all I know, the SMP targets could be fully-snooped and we don't need to worry about cache maintenance at all. From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 20:05:54 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4120E79B for ; Wed, 29 Apr 2015 20:05:54 +0000 (UTC) Received: from mail-pa0-f46.google.com (mail-pa0-f46.google.com [209.85.220.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 0D15F10AD for ; Wed, 29 Apr 2015 20:05:53 +0000 (UTC) Received: by pacwv17 with SMTP id wv17so37335632pac.0 for ; Wed, 29 Apr 2015 13:05:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:subject:mime-version:content-type:from :in-reply-to:date:cc:message-id:references:to; bh=zf3QgncH7y1pFVp9phBTSTMoawrxRjhv0poiCdeXuQc=; b=eL4eUBaN39AYt6T5V9Wc6UH9iP4wxTLRCaT/R1hemV9xbeN0NrqPu+BqC+BtMDQ8q8 tOyA/Ne9WFyCWJJe6NvlV3PhR+BjgB+JmGv0IYbObyLgqnlFDhrOKgUdk1KBWuz59Y60 pTe+MSzkOrGq7ow7V0Dr5VO0YJA1cVweaQgrHB0cBg94O4WvIDCbGjLtXk+WLbcAUOcq 6qaEpCSFq0pu7b8FSlnipyGo5XlZUaGYTADJEuPse5jf9CR+JIKfB0IqlDLflq6jl555 B1PsT0ebsLUOJrxgIPZ4uj+FAfbR3nVEq8EfCTB4RGjzRhYfbDvPFBn74+BV5at5EGfe 36CA== X-Gm-Message-State: ALoCoQm/YnWkcWlGZSVUXKZAAvWRupzCYzj2hUmjforXY3ItcKfydrsFjPrbYL1hU0Uf05SKDVFc X-Received: by 10.70.124.233 with SMTP id ml9mr1432149pdb.9.1430337946909; Wed, 29 Apr 2015 13:05:46 -0700 (PDT) Received: from lgwl-sram.corp.netflix.com ([69.53.236.236]) by mx.google.com with ESMTPSA id c8sm32559pdj.65.2015.04.29.13.05.44 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 29 Apr 2015 13:05:45 -0700 (PDT) Sender: Warner Losh Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2098\)) Content-Type: multipart/signed; boundary="Apple-Mail=_BE24FC7E-A878-4059-963E-1A19E29BB82A"; protocol="application/pgp-signature"; micalg=pgp-sha512 X-Pgp-Agent: GPGMail 2.5b6 From: Warner Losh In-Reply-To: Date: Wed, 29 Apr 2015 14:05:42 -0600 Cc: Konstantin Belousov , Svatopluk Kraus , John Baldwin , Adrian Chadd , freebsd-arch Message-Id: <9807ECB0-5218-42D1-9BD9-94F6BB5C69C8@bsdimp.com> References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> To: Jason Harmening X-Mailer: Apple Mail (2.2098) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 20:05:54 -0000 
> On Apr 29, 2015, at 1:17 PM, Jason Harmening wrote:
>
> Yes, that needs to be done regardless of how the pages are wired. The particular problem here is that some caches on arm and mips are virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)). That means the flush/invalidate instructions need virtual addresses, so figuring out the correct UVA to use for those could be a challenge. As I understand it, VIPT caches usually do have some hardware logic for finding all the cachelines that correspond to a physical address, so they can handle multiple VA mappings of the same PA. But it is unclear to me how cross-processor cache maintenance is supposed to work with VIPT caches on SMP systems.
>
> If the caches were physically-indexed, then I don't think there would be an issue. You'd just pass the PA to the flush/invalidate instruction, and presumably a sane SMP implementation would propagate that to other cores via IPI.

I know on MIPS you cannot have more than one mapping to a page you are doing DMA to/from ever.

Warner

From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 22:23:38 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 21945B87; Wed, 29 Apr 2015 22:23:38 +0000 (UTC) Received: from relay.mailchannels.net (tkt-001-i373.relay.mailchannels.net [174.136.5.175]) by mx1.freebsd.org (Postfix) with ESMTP id 08EF91004; Wed, 29 Apr 2015 22:23:36 +0000 (UTC) X-Sender-Id: duocircle|x-authuser|hippie Received: from smtp2.ore.mailhop.org (ip-10-204-4-183.us-west-2.compute.internal [10.204.4.183]) by relay.mailchannels.net (Postfix) with ESMTPA id 5FF60A11A0; Wed, 29 Apr 2015 22:23:28 +0000 (UTC) X-Sender-Id: duocircle|x-authuser|hippie Received: from smtp2.ore.mailhop.org (smtp2.ore.mailhop.org [10.45.8.167]) (using TLSv1 with cipher DHE-RSA-AES256-SHA) by 0.0.0.0:2500 (trex/5.4.8); Wed, 29 Apr 2015 22:23:28 +0000 X-MC-Relay: Neutral X-MailChannels-SenderId: duocircle|x-authuser|hippie X-MailChannels-Auth-Id: duocircle X-MC-Loop-Signature:
1430346208532:536895691 X-MC-Ingress-Time: 1430346208532 Received: from c-73-34-117-227.hsd1.co.comcast.net ([73.34.117.227] helo=ilsoft.org) by smtp2.ore.mailhop.org with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.82) (envelope-from ) id 1YnaO7-0006MA-7y; Wed, 29 Apr 2015 22:23:27 +0000 Received: from revolution.hippie.lan (revolution.hippie.lan [172.22.42.240]) by ilsoft.org (8.14.9/8.14.9) with ESMTP id t3TMNOWI050105; Wed, 29 Apr 2015 16:23:24 -0600 (MDT) (envelope-from ian@freebsd.org) X-Mail-Handler: DuoCircle Outbound SMTP X-Originating-IP: 73.34.117.227 X-Report-Abuse-To: abuse@duocircle.com (see https://support.duocircle.com/support/solutions/articles/5000540958-duocircle-standard-smtp-abuse-information for abuse reporting information) X-MHO-User: U2FsdGVkX1/ky0KcCx9j8ENGWudC41dk Message-ID: <1430346204.1157.107.camel@freebsd.org> Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Ian Lepore To: Jason Harmening Cc: Konstantin Belousov , Adrian Chadd , Svatopluk Kraus , freebsd-arch Date: Wed, 29 Apr 2015 16:23:24 -0600 In-Reply-To: References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> Content-Type: text/plain; charset="us-ascii" X-Mailer: Evolution 3.12.10 FreeBSD GNOME Team Port Mime-Version: 1.0 Content-Transfer-Encoding: 7bit X-AuthUser: hippie X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 22:23:38 -0000 On Wed, 2015-04-29 at 14:59 -0500, Jason Harmening wrote: > > > > Even without SMP, VIPT cache cannot hold two mappings of the same page. > > As I understand, sometimes it is more involved, eg if mappings have > > correct color (eg. on ultrasparcs), then cache can deal with aliasing. > > Otherwise pmap has to map the page uncached for all mappings. > > > > Yes, you are right. Regardless of whatever logic the cache uses (or > doesn't use), FreeBSD's page-coloring scheme should prevent that. > > > > > > I do not see what would make this case special for SMP after that. > > Cache invalidation would be either not needed, or coherency domain > > propagation of the virtual address does the right thing. > > > > Since VIPT cache operations require a virtual address, I'm wondering about > the case where different processes are running on different cores, and the > same UVA corresponds to a completely different physical page for each of > those processes. If the d-cache for each core contains that UVA, then what > does it mean when one core issues a flush/invalidate instruction for that > UVA? > > Admittedly, there's a lot I don't know about how that's supposed to work in > the arm/mips SMP world. For all I know, the SMP targets could be > fully-snooped and we don't need to worry about cache maintenance at all. For what we call armv6 (which is mostly armv7)... The cache maintenance operations require virtual addresses, which means it looks a lot like a VIPT cache. Under the hood the implementation behaves as if it were a PIPT cache so even in the presence of multiple mappings of the same physical page into different virtual addresses, the SMP coherency hardware works correctly. The ARM ARM says... 
	[Stuff about ARMv6 and page coloring when a cache way exceeds 4K.]

	ARMv7 does not support page coloring, and requires that all data
	and unified caches behave as Physically Indexed Physically
	Tagged (PIPT) caches.

The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
that neatly sidesteps the aliasing problem that requires page coloring
solutions. So on modern arm chips we get to act like we've got PIPT data
caches, but with the quirk that cache ops are initiated by virtual
address.

Basically, when you perform a cache maintenance operation, a translation
table walk is done on the core that issued the cache op, then from that
point on the physical address is used within the cache hardware and
that's what gets broadcast to the other cores by the snoop control unit
or cache coherency fabric (depending on the chip).

Not that it's germane to this discussion, but an ARM instruction cache
can really be VIPT with no "behave as if" restrictions in the spec. That
means when doing i-cache maintenance on a virtual address that could be
multiply-mapped, our only option is a rather expensive all-cores
"invalidate entire i-cache and branch predictor cache".

For the older armv4/v5 world, which is VIVT, we have a restriction that
a page that is multiply-mapped cannot have cache enabled (it's handled
in pmap). That's also probably not very germane to this discussion,
because it doesn't seem likely that anyone is going to try to add
physical IO or userspace DMA support to that old code.

-- Ian

From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 23:10:19 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 933F736D; Wed, 29 Apr 2015 23:10:19 +0000 (UTC) Received: from mail-ie0-x232.google.com (mail-ie0-x232.google.com [IPv6:2607:f8b0:4001:c03::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 563E41583; Wed, 29 Apr 2015 23:10:19 +0000 (UTC) Received: by iebrs15 with SMTP id rs15so54806421ieb.3; Wed, 29 Apr 2015 16:10:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=e1bzNf8ir1N1xE/m9tDofiB92Z7aU/E9Gv+c76cKChA=; b=A2ITE4Tx6q3GF96Wq9KVWPgKtK815kLx0PaYNdONCs2YHxIBDYTNvuIhoxHgWACjY0 YqI3JTh753xMxtbmpMy9FSUJnudc1elZUvLolAIWlBz1/u3Pcm5cGooVe/N4mwV3AgGp GMyVY8fvxCjuFOMRAMAyrKwf0riWrYR1mp0rYTOCGAPjF/m4YwYV12rsChKsqW5mqXyp WQe4WDKDqebMGEVLNJDt5lFUdU3Bvo/fogzRDDhNBx9G9q0VH7AWjnpcxGnHixytZobZ 5oNF7IqaQHTG9efAp3qCnYypc8pFq8FX0SAByn/iWlfhdPbn5yQVipX43IkOw3nTAOYE 3T3A== MIME-Version: 1.0 X-Received: by 10.107.9.67 with SMTP id j64mr1964837ioi.39.1430349018698; Wed, 29 Apr 2015 16:10:18 -0700 (PDT) Received: by 10.36.106.70 with HTTP; Wed, 29 Apr 2015 16:10:18 -0700 (PDT) In-Reply-To: <1430346204.1157.107.camel@freebsd.org> References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> <1430346204.1157.107.camel@freebsd.org> Date: Wed, 29 Apr 2015 18:10:18 -0500 Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space
From: Jason Harmening To: Ian Lepore Cc: Konstantin Belousov , Adrian Chadd , Svatopluk Kraus , freebsd-arch Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 23:10:19 -0000 > > > For what we call armv6 (which is mostly armv7)... > > The cache maintenance operations require virtual addresses, which means > it looks a lot like a VIPT cache. Under the hood the implementation > behaves as if it were a PIPT cache so even in the presence of multiple > mappings of the same physical page into different virtual addresses, the > SMP coherency hardware works correctly. > > The ARM ARM says... > > [Stuff about ARMv6 and page coloring when a cache way exceeds > 4K.] > > ARMv7 does not support page coloring, and requires that all data > and unified caches behave as Physically Indexed Physically > Tagged (PIPT) caches. > > The only true armv6 chip we support isn't SMP and has a 16K/4-way cache > that neatly sidesteps the aliasing problem that requires page coloring > solutions. So modern arm chips we get to act like we've got PIPT data > caches, but with the quirk that cache ops are initiated by virtual > address. > Cool, thanks for the explanation! To satisfy my own curiosity, since it "looks like VIPT", does that mean we still have to flush the cache on context switch? > > Basically, when you perform a cache maintainence operation, a > translation table walk is done on the core that issued the cache op, > then from that point on the physical address is used within the cache > hardware and that's what gets broadcast to the other cores by the snoop > control unit or cache coherency fabric (depending on the chip). So, if we go back to the original problem of wanting to do bus_dmamap_sync() on userspace buffers from some asynchronous context: Say the process that owns the buffer is running on one core and prefetches some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is done by a kernel thread running on another core. Since the core running the kernel thread is responsible for the TLB lookup to get the physical address, then since that core has no UVA the cache ops will be treated as misses and the cacheline on the core that owns the UVA won't be invalidated, correct? That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c should stay? Sort of the same problem would apply to drivers using vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance, since there are no VAs to operate on. Indeed, both arm and mips implementation of _bus_dmamap_load_phys don't do anything with the sync list. 
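For reference, the held-pages path being discussed looks roughly like this
from a driver's point of view (a sketch only, with error handling elided;
_bus_dmamap_load_ma() is the underscored MD entry point mentioned earlier,
and MAXPAGES/MAXSEGS are illustrative sizes):

	vm_page_t ma[MAXPAGES];
	bus_dma_segment_t segs[MAXSEGS];
	int error, n, nsegs;

	n = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    uva, len, VM_PROT_READ | VM_PROT_WRITE, ma, MAXPAGES);
	if (n < 0)
		return (EFAULT);
	nsegs = -1;		/* the loader increments as it fills segs */
	error = _bus_dmamap_load_ma(dmat, map, ma, len, uva & PAGE_MASK,
	    BUS_DMA_NOWAIT, segs, &nsegs);
	/* ... start DMA, bus_dmamap_sync(), bus_dmamap_unload() ... */
	vm_page_unhold_pages(ma, n);

On arm and mips nothing lands on the sync list along this path today,
which is exactly the gap described above.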
From owner-freebsd-arch@FreeBSD.ORG Thu Apr 30 04:13:46 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B299B530; Thu, 30 Apr 2015 04:13:46 +0000 (UTC) Received: from mail-ig0-x229.google.com (mail-ig0-x229.google.com [IPv6:2607:f8b0:4001:c05::229]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 7427017EE; Thu, 30 Apr 2015 04:13:46 +0000 (UTC) Received: by iget9 with SMTP id t9so3713992ige.1; Wed, 29 Apr 2015 21:13:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=IxMUAtHmz0kI9UGh/ASkTjjQpgipqh5+cPEN50qx2eU=; b=HDW95hATGOmFWppbnYphTjSPXufwYWmx4YKfrtHnSjk4oXwH6HSdPXr/LnE+KNMPjf 9Jp1/op197oMnhfcrnqS/LE5wN5Y0kWEy3p95TkiXjuwrel1pzFCd2zY673y3ZBtuLUp CODHjsfQt0xqaiYd67/sG86m88uIhlBcH7AXvPNEQZXWNJSK35EYGhzWSwVQB4BhKiK9 gxhXvnYO6L+gmELRjQV7UdSMvj3s6F/1rpVnDN8lQKXXCMSH2N0WxjJj9oAshxwVXKyu 869h5rOYKB9SVOhvQUrvngQvPBFYGjbvCARjR1QJxmbVxhnpG+SapK9DSupocuQPuU5g 8EYQ== MIME-Version: 1.0 X-Received: by 10.50.72.8 with SMTP id z8mr1102031igu.36.1430367225931; Wed, 29 Apr 2015 21:13:45 -0700 (PDT) Received: by 10.36.106.70 with HTTP; Wed, 29 Apr 2015 21:13:45 -0700 (PDT) In-Reply-To: References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> <1430346204.1157.107.camel@freebsd.org> Date: Wed, 29 Apr 2015 23:13:45 -0500 Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Jason Harmening To: Ian Lepore Cc: Konstantin Belousov , Adrian Chadd , Svatopluk Kraus , freebsd-arch Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Apr 2015 04:13:46 -0000 On Wed, Apr 29, 2015 at 6:10 PM, Jason Harmening wrote: > >> For what we call armv6 (which is mostly armv7)... >> >> The cache maintenance operations require virtual addresses, which means >> it looks a lot like a VIPT cache. Under the hood the implementation >> behaves as if it were a PIPT cache so even in the presence of multiple >> mappings of the same physical page into different virtual addresses, the >> SMP coherency hardware works correctly. >> >> The ARM ARM says... >> >> [Stuff about ARMv6 and page coloring when a cache way exceeds >> 4K.] >> >> ARMv7 does not support page coloring, and requires that all data >> and unified caches behave as Physically Indexed Physically >> Tagged (PIPT) caches. >> >> The only true armv6 chip we support isn't SMP and has a 16K/4-way cache >> that neatly sidesteps the aliasing problem that requires page coloring >> solutions. So modern arm chips we get to act like we've got PIPT data >> caches, but with the quirk that cache ops are initiated by virtual >> address. >> > > Cool, thanks for the explanation! 
> To satisfy my own curiosity, since it "looks like VIPT", does that mean we > still have to flush the cache on context switch? > > >> >> Basically, when you perform a cache maintainence operation, a >> translation table walk is done on the core that issued the cache op, >> then from that point on the physical address is used within the cache >> hardware and that's what gets broadcast to the other cores by the snoop >> control unit or cache coherency fabric (depending on the chip). > > > So, if we go back to the original problem of wanting to do > bus_dmamap_sync() on userspace buffers from some asynchronous context: > > Say the process that owns the buffer is running on one core and prefetches > some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is > done by a kernel thread running on another core. Since the core running > the kernel thread is responsible for the TLB lookup to get the physical > address, then since that core has no UVA the cache ops will be treated as > misses and the cacheline on the core that owns the UVA won't be > invalidated, correct? > > That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c > should stay? > > Sort of the same problem would apply to drivers using > vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance, > since there are no VAs to operate on. Indeed, both arm and mips > implementation of _bus_dmamap_load_phys don't do anything with the sync > list. > It occurs to me that one way to deal with both the blocking-sfbuf for physcopy and VIPT cache maintenance might be to have a reserved per-CPU KVA page. For arches that don't have a direct map, the idea would be to grab a critical section, copy the bounce page or do cache maintenance on the synclist entry, then drop the critical section. That brought up a dim memory I had of Linux doing something similar, and in fact it seems to use kmap_atomic for both cache ops and bounce buffers. 
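A minimal sketch of that idea, assuming a page of KVA reserved per CPU at
boot (all names here are invented, and a real version would need
arch-specific TLB handling after the kremove):

	static vm_offset_t busdma_percpu_kva[MAXCPU];

	static void
	busdma_percpu_copyout(vm_paddr_t src, void *dst, size_t len)
	{
		vm_offset_t kva;

		KASSERT(len <= PAGE_SIZE - (src & PAGE_MASK),
		    ("copy crosses a page boundary"));
		critical_enter();	/* no preemption or migration */
		kva = busdma_percpu_kva[curcpu];
		pmap_kenter(kva, trunc_page(src));
		bcopy((char *)kva + (src & PAGE_MASK), dst, len);
		pmap_kremove(kva);	/* local TLB invalidation assumed */
		critical_exit();
	}

The same temporary mapping could also supply the VA handed to the cache
maintenance ops on the VIPT-flavored arches.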
From owner-freebsd-arch@FreeBSD.ORG Thu Apr 30 08:38:38 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A4F49446; Thu, 30 Apr 2015 08:38:38 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 47685136A; Thu, 30 Apr 2015 08:38:38 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3U8cWuR049983 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Thu, 30 Apr 2015 11:38:32 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3U8cWuR049983 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3U8cW59049982; Thu, 30 Apr 2015 11:38:32 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 30 Apr 2015 11:38:32 +0300 From: Konstantin Belousov To: Ian Lepore Cc: Jason Harmening , Adrian Chadd , Svatopluk Kraus , freebsd-arch Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Message-ID: <20150430083832.GR2390@kib.kiev.ua> References: <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> <1430346204.1157.107.camel@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1430346204.1157.107.camel@freebsd.org> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Apr 2015 08:38:38 -0000 On Wed, Apr 29, 2015 at 04:23:24PM -0600, Ian Lepore wrote: > For what we call armv6 (which is mostly armv7)... > > The cache maintenance operations require virtual addresses, which means > it looks a lot like a VIPT cache. Under the hood the implementation > behaves as if it were a PIPT cache so even in the presence of multiple > mappings of the same physical page into different virtual addresses, the > SMP coherency hardware works correctly. > > The ARM ARM says... > > [Stuff about ARMv6 and page coloring when a cache way exceeds > 4K.] > > ARMv7 does not support page coloring, and requires that all data > and unified caches behave as Physically Indexed Physically > Tagged (PIPT) caches. > > The only true armv6 chip we support isn't SMP and has a 16K/4-way cache > that neatly sidesteps the aliasing problem that requires page coloring > solutions. So modern arm chips we get to act like we've got PIPT data > caches, but with the quirk that cache ops are initiated by virtual > address. 
> Basically, when you perform a cache maintenance operation, a
> translation table walk is done on the core that issued the cache op,
> then from that point on the physical address is used within the cache
> hardware and that's what gets broadcast to the other cores by the snoop
> control unit or cache coherency fabric (depending on the chip).

This is the same as it is done on x86. There is a CLFLUSH instruction,
which takes a virtual address and invalidates the cache line, maintaining
cache coherency in the coherency domain and possibly doing write-back.
It even sets the accessed bit in the page table entry. My understanding
is that the decision to operate on virtual addresses on x86 was made to
allow the instruction to work from user mode.

Still, an instruction to flush a cache line addressed by its physical
address would be nice. The required circuits are already there, since
CPUs must react to the coherency requests from other CPUs. On amd64,
pmap_invalidate_cache_pages() uses the direct map, but on i386 the
kernel has to use a specially allocated KVA page frame for a temporary
mapping (per-cpu CMAP2), see i386/i386/pmap.c:pmap_flush_page().

From owner-freebsd-arch@FreeBSD.ORG Thu Apr 30 09:53:07 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 39CDE48E; Thu, 30 Apr 2015 09:53:07 +0000 (UTC) Received: from mail-ie0-x231.google.com (mail-ie0-x231.google.com [IPv6:2607:f8b0:4001:c03::231]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id F234A1CB9; Thu, 30 Apr 2015 09:53:06 +0000 (UTC) Received: by iedfl3 with SMTP id fl3so70652438ied.1; Thu, 30 Apr 2015 02:53:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=vE8UxfyFvnSUa7y0Ao9fY7DbeYyy+An+QqpnG4OUzu0=; b=E6bwNNpBoA69OaqVDEDr8MO/+MuR8+ZZC/6uXz8KU0VUwxe/SuXuU+O7FsTZu4d+l9 QRnhxdW1LwP/yTZJUOjxugXB8cEZyoT5kzKUEH+wbqekoxGVdMyM6VASwi54OBxkerhn LjYH/qcXEWBWdlyiOlDW/yx0jEEVkI+5SNgOXgrZmX9vzGm91f0jTnfMT5ccs85L5ZhR HJxKvtn00Sz9p+WiZSkxlL3q4Lq6r6XoQqlSP8cPJw8k1Vo/j/LJgqVeyArq8YMIp004 6PQK78DBSx00SLjKoumNfojAcu9rLdbSZOKc/UUhuuvp2tAhaRXTi0Y/yOwD652TJ3Pf cg/g== MIME-Version: 1.0 X-Received: by 10.107.28.146 with SMTP id c140mr4399830ioc.67.1430387586495; Thu, 30 Apr 2015 02:53:06 -0700 (PDT) Received: by 10.64.13.81 with HTTP; Thu, 30 Apr 2015 02:53:06 -0700 (PDT) In-Reply-To: References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> <1430346204.1157.107.camel@freebsd.org> Date: Thu, 30 Apr 2015 11:53:06 +0200 Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Svatopluk Kraus To: Jason Harmening Cc: Ian Lepore , Konstantin Belousov , Adrian Chadd , freebsd-arch Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: ,
X-List-Received-Date: Thu, 30 Apr 2015 09:53:07 -0000

On Thu, Apr 30, 2015 at 1:10 AM, Jason Harmening wrote:
>>
>> For what we call armv6 (which is mostly armv7)...
>>
>> The cache maintenance operations require virtual addresses, which means
>> it looks a lot like a VIPT cache. Under the hood the implementation
>> behaves as if it were a PIPT cache so even in the presence of multiple
>> mappings of the same physical page into different virtual addresses, the
>> SMP coherency hardware works correctly.
>>
>> The ARM ARM says...
>>
>> [Stuff about ARMv6 and page coloring when a cache way exceeds
>> 4K.]
>>
>> ARMv7 does not support page coloring, and requires that all data
>> and unified caches behave as Physically Indexed Physically
>> Tagged (PIPT) caches.
>>
>> The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
>> that neatly sidesteps the aliasing problem that requires page coloring
>> solutions. So modern arm chips we get to act like we've got PIPT data
>> caches, but with the quirk that cache ops are initiated by virtual
>> address.
>
> Cool, thanks for the explanation!
> To satisfy my own curiosity, since it "looks like VIPT", does that mean we
> still have to flush the cache on context switch?

No, in general, there is no need to flush PIPT caches (even if they "look
like VIPT") on context switch. When it comes to cache maintenance, the
physical page is either mapped in the correct context, or you have to map
it somewhere in the current context (KVA is used for that).

>>
>> Basically, when you perform a cache maintenance operation, a
>> translation table walk is done on the core that issued the cache op,
>> then from that point on the physical address is used within the cache
>> hardware and that's what gets broadcast to the other cores by the snoop
>> control unit or cache coherency fabric (depending on the chip).
>
> So, if we go back to the original problem of wanting to do bus_dmamap_sync()
> on userspace buffers from some asynchronous context:
>
> Say the process that owns the buffer is running on one core and prefetches
> some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is
> done by a kernel thread running on another core. Since the core running the
> kernel thread is responsible for the TLB lookup to get the physical address,
> then since that core has no UVA the cache ops will be treated as misses and
> the cacheline on the core that owns the UVA won't be invalidated, correct?
>
> That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c should
> stay?

Not for unmapped buffers. For user space buffers, it's still a question
how this will be resolved. It now looks like the aim is to not use UVAs
for DMA buffers in the kernel at all. In any case, even if UVAs are used,
the panic won't be needed once a correct implementation is done.

> Sort of the same problem would apply to drivers using
> vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance, since
> there are no VAs to operate on. Indeed, both arm and mips implementation of
> _bus_dmamap_load_phys don't do anything with the sync list.

I'm just working on a _bus_dmamap_load_phys() implementation for armv6.
That means I'm adding sync list entries for unmapped buffers (with the
virtual address set to zero) and implementing cache maintenance
operations that take a physical address as an argument: the given range
is temporarily mapped into the kernel (page by page) and the cache
operation is called on that virtual address. It's the same scenario as
in the i386 pmap.
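Sketched out, that approach might look like the following (the function
name and the reserved KVA page are illustrative, any needed serialization
is omitted, and cpu_dcache_*_range() are the existing arm primitives):

	static void
	dma_dcache_sync_phys(vm_paddr_t pa, vm_size_t size, int op)
	{
		vm_offset_t va = dma_sync_kva;	/* assumed reserved page */
		vm_size_t len;

		while (size > 0) {
			len = min(size, PAGE_SIZE - (pa & PAGE_MASK));
			pmap_kenter(va, trunc_page(pa));
			if (op & BUS_DMASYNC_PREWRITE)
				cpu_dcache_wb_range(va + (pa & PAGE_MASK), len);
			if (op & BUS_DMASYNC_POSTREAD)
				cpu_dcache_inv_range(va + (pa & PAGE_MASK), len);
			pmap_kremove(va);
			pa += len;
			size -= len;
		}
	}

Per the earlier explanation, once the local translation succeeds the
operation is broadcast by physical address on armv7-class hardware, so a
private temporary mapping should be sufficient even on SMP.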
From owner-freebsd-arch@FreeBSD.ORG Thu Apr 30 14:24:13 2015 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 1549B78D; Thu, 30 Apr 2015 14:24:13 +0000 (UTC) Received: from cell.glebius.int.ru (glebius.int.ru [81.19.69.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "cell.glebius.int.ru", Issuer "cell.glebius.int.ru" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 2EC1A1038; Thu, 30 Apr 2015 14:24:11 +0000 (UTC) Received: from cell.glebius.int.ru (localhost [127.0.0.1]) by cell.glebius.int.ru (8.14.9/8.14.9) with ESMTP id t3UEO8Nr022445 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 30 Apr 2015 17:24:08 +0300 (MSK) (envelope-from glebius@FreeBSD.org) Received: (from glebius@localhost) by cell.glebius.int.ru (8.14.9/8.14.9/Submit) id t3UEO849022444; Thu, 30 Apr 2015 17:24:08 +0300 (MSK) (envelope-from glebius@FreeBSD.org) X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to glebius@FreeBSD.org using -f Date: Thu, 30 Apr 2015 17:24:08 +0300 From: Gleb Smirnoff To: kib@FreeBSD.org, alc@FreeBSD.org Cc: arch@FreeBSD.org Subject: more strict KPI for vm_pager_get_pages() Message-ID: <20150430142408.GS546@nginx.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="45Z9DzgjV8m4Oswq" Content-Disposition: inline User-Agent: Mutt/1.5.23 (2014-03-12) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Apr 2015 14:24:13 -0000 --45Z9DzgjV8m4Oswq Content-Type: text/plain; charset=us-ascii Content-Disposition: inline

Hi!

The reason for writing this patch emerges from the projects/sendfile branch, where vm_pager_get_pages() is used in the sendfile(2) system call. Although the new sendfile works flawlessly, it makes some assumptions about the vnode_pager that theoretically may not be valid, although they always hold in our current code. Going deeper into the problem, I found more important points, which yielded the suggested patch.

To start, let me display the current KPI assumptions:

1) vm_pager_get_pages() works on an array of consecutive pages. The pindex of the (n+1)-th page must be the pindex of the n-th page + 1. One page is special, it is called reqpage.

2) vm_pager_get_pages() guarantees to swap in only the reqpage, and may skip or fail other pages for different reasons that may vary from pager to pager.

3) There is also the function vm_pager_has_page(), which reports the availability of a page at a given index in the pager, and also provides hints on how many consecutive pages before this one and after this one can be swapped in in a single pager request. Most pagers return zeros in these hints. The vnode pager for UFS returns a strong promise that one can later utilize in vm_pager_get_pages().

4) All pages must be busied on enter. On exit only the reqpage will be left busied. The KPI doesn't guarantee that the rest of the pages are still in place. The pager usually calls vm_page_readahead_finish() on them, which can either free the page or put it on the active/inactive queue, using quite a strange approach to choose the queue.

5) The pages must not be wired, since vm_page_free() may be called on them. However, this is violated by several consumers of the KPI, relying on the lack of errors in the pager. Moreover, the swap pager has a special function to skip wired pages while doing the sweep, to avoid this problem. So, passing wired pages to the swap pager is OK, while passing them to the rest is not.

6) Pagers may replace a page in the object with a new one. The sg_pager actually does that. To protect against this event, consumers of vm_pager_get_pages() always run vm_page_lookup() over the array of pages to relookup the pages. However, not all consumers do this.

Speaking of pagers and their consumers:

- 11 consumers request an array of size 1, a single page
- 3 consumers actually request an array

My suggestion is to change the KPI assumptions to the following:

1) There is no reqpage. All pages are entered busied, all pages are returned busied and validated. If the pager fails to validate all pages, it must return an error.

2) The consumer (not the pager!) is to decide what to do with the pages: vm_page_activate, vm_page_deactivate, vm_page_flash or just vm_page_free them. The consumer also unbusies pages, if it wants to. The consumer is free to wire pages before the call.

3) Consumers must first query the pager via vm_pager_has_page(), and use the after/before hints to limit the size of the requested pages array.

4) In case the pager replaces pages, it must also update the array, so that the consumer doesn't need to do a relookup.

Doing this sweep, I also noticed that all pagers have copy-pasted code for zeroing invalid regions of partially valid pages. Also, many pagers got a set of assertions copied and pasted from each other. So, I decided to un-inline vm_pager_get_pages(), bring it to the vm_pager.c file and gather all these copy-pastes into one place.

The suggested patch is attached. As expected, it simplifies and removes quite a lot of code. Right now it is tested on UFS only; testing NFS and ZFS is on my list. There is one panic known, but it seems unrelated, and Peter pho@ says it has been seen once before.

--
Totus tuus, Glebius.
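To make the proposed contract concrete, a consumer would follow roughly
this pattern (a sketch only: object locking and error paths are elided,
the array size is arbitrary, and this code is not part of the attached
patch):

	vm_page_t ma[16];
	int after, count, i, rv;

	if (!vm_pager_has_page(object, pindex, NULL, &after))
		return (VM_PAGER_FAIL);
	count = MIN(after + 1, nitems(ma));	/* 3) bound by the hint */
	for (i = 0; i < count; i++)
		ma[i] = vm_page_grab(object, pindex + i, VM_ALLOC_NORMAL);
	rv = vm_pager_get_pages(object, ma, count);	/* 1) no reqpage */
	if (rv != VM_PAGER_OK)
		return (rv);
	for (i = 0; i < count; i++) {
		/*
		 * 1) All pages come back busied and fully valid;
		 * 2) the consumer picks the disposition itself.
		 */
		if (i > 0) {
			vm_page_lock(ma[i]);
			vm_page_deactivate(ma[i]);
			vm_page_unlock(ma[i]);
		}
		vm_page_xunbusy(ma[i]);
	}
	/* 4) if the pager replaced pages, ma[] was updated in place */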
--45Z9DzgjV8m4Oswq Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="vm_pager_get_pages-new-KPI.diff" Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c =================================================================== --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (revision 282213) +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (working copy) @@ -5712,12 +5712,12 @@ ioflags(int ioflags) } static int -zfs_getpages(struct vnode *vp, vm_page_t *m, int count, int reqpage) +zfs_getpages(struct vnode *vp, vm_page_t *m, int count) { znode_t *zp = VTOZ(vp); zfsvfs_t *zfsvfs = zp->z_zfsvfs; objset_t *os = zp->z_zfsvfs->z_os; - vm_page_t mfirst, mlast, mreq; + vm_page_t mlast; vm_object_t object; caddr_t va; struct sf_buf *sf; @@ -5730,80 +5730,27 @@ static int ZFS_VERIFY_ZP(zp); pcount = OFF_TO_IDX(round_page(count)); - mreq = m[reqpage]; - object = mreq->object; + object = m[0]->object; + mlast = m[pcount - 1]; error = 0; - KASSERT(vp->v_object == object, ("mismatching object")); - - if (pcount > 1 && zp->z_blksz > PAGESIZE) { - startoff = rounddown(IDX_TO_OFF(mreq->pindex), zp->z_blksz); - reqstart = OFF_TO_IDX(round_page(startoff)); - if (reqstart < m[0]->pindex) - reqstart = 0; - else - reqstart = reqstart - m[0]->pindex; - endoff = roundup(IDX_TO_OFF(mreq->pindex) + PAGE_SIZE, - zp->z_blksz); - reqend = OFF_TO_IDX(trunc_page(endoff)) - 1; - if (reqend > m[pcount - 1]->pindex) - reqend = m[pcount - 1]->pindex; - reqsize = reqend - m[reqstart]->pindex + 1; - KASSERT(reqstart <= reqpage && reqpage < reqstart + reqsize, - ("reqpage beyond [reqstart, reqstart + reqsize[ bounds")); - } else { - reqstart = reqpage; - reqsize = 1; - } - mfirst = m[reqstart]; - mlast = m[reqstart + reqsize - 1]; - - zfs_vmobject_wlock(object); - - for (i = 0; i < reqstart; i++) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - for (i = reqstart + reqsize; i < pcount; i++) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - - if (mreq->valid && reqsize == 1) { - if (mreq->valid != VM_PAGE_BITS_ALL) - vm_page_zero_invalid(mreq, TRUE); - zfs_vmobject_wunlock(object); + if (IDX_TO_OFF(mlast->pindex) >= + object->un_pager.vnp.vnp_size) { ZFS_EXIT(zfsvfs); - return (zfs_vm_pagerret_ok); + return (zfs_vm_pagerret_bad); } PCPU_INC(cnt.v_vnodein); PCPU_ADD(cnt.v_vnodepgsin, reqsize); - if (IDX_TO_OFF(mreq->pindex) >= object->un_pager.vnp.vnp_size) { - for (i = reqstart; i < reqstart + reqsize; i++) { - if (i != reqpage) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - } - zfs_vmobject_wunlock(object); - ZFS_EXIT(zfsvfs); - return (zfs_vm_pagerret_bad); - } - lsize = PAGE_SIZE; if (IDX_TO_OFF(mlast->pindex) + lsize > object->un_pager.vnp.vnp_size) - lsize = object->un_pager.vnp.vnp_size - IDX_TO_OFF(mlast->pindex); + lsize = object->un_pager.vnp.vnp_size - + IDX_TO_OFF(mlast->pindex); - zfs_vmobject_wunlock(object); - - for (i = reqstart; i < reqstart + reqsize; i++) { + for (i = 0; i < pcount; i++) { size = PAGE_SIZE; - if (i == (reqstart + reqsize - 1)) + if (i == pcount - 1) size = lsize; va = zfs_map_page(m[i], &sf); error = dmu_read(os, zp->z_id, IDX_TO_OFF(m[i]->pindex), @@ -5812,21 +5759,15 @@ static int bzero(va + size, PAGE_SIZE - size); zfs_unmap_page(sf); if (error != 0) - break; + goto out; } zfs_vmobject_wlock(object); - - for (i = reqstart; i < reqstart + reqsize; i++) { - if (!error) - m[i]->valid = VM_PAGE_BITS_ALL; - KASSERT(m[i]->dirty == 0, 
("zfs_getpages: page %p is dirty", m[i])); - if (i != reqpage) - vm_page_readahead_finish(m[i]); - } - + for (i = 0; i < pcount; i++) + m[i]->valid = VM_PAGE_BITS_ALL; zfs_vmobject_wunlock(object); +out: ZFS_ACCESSTIME_STAMP(zfsvfs, zp); ZFS_EXIT(zfsvfs); return (error ? zfs_vm_pagerret_error : zfs_vm_pagerret_ok); @@ -5842,7 +5783,7 @@ zfs_freebsd_getpages(ap) } */ *ap; { - return (zfs_getpages(ap->a_vp, ap->a_m, ap->a_count, ap->a_reqpage)); + return (zfs_getpages(ap->a_vp, ap->a_m, ap->a_count)); } static int Index: sys/dev/drm2/i915/i915_gem.c =================================================================== --- sys/dev/drm2/i915/i915_gem.c (revision 282213) +++ sys/dev/drm2/i915/i915_gem.c (working copy) @@ -3174,8 +3174,7 @@ i915_gem_wire_page(vm_object_t object, vm_pindex_t m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL); if (m->valid != VM_PAGE_BITS_ALL) { if (vm_pager_has_page(object, pindex, NULL, NULL)) { - rv = vm_pager_get_pages(object, &m, 1, 0); - m = vm_page_lookup(object, pindex); + rv = vm_pager_get_pages(object, &m, 1); if (m == NULL) return (NULL); if (rv != VM_PAGER_OK) { Index: sys/dev/drm2/ttm/ttm_tt.c =================================================================== --- sys/dev/drm2/ttm/ttm_tt.c (revision 282213) +++ sys/dev/drm2/ttm/ttm_tt.c (working copy) @@ -291,7 +291,7 @@ int ttm_tt_swapin(struct ttm_tt *ttm) from_page = vm_page_grab(obj, i, VM_ALLOC_NORMAL); if (from_page->valid != VM_PAGE_BITS_ALL) { if (vm_pager_has_page(obj, i, NULL, NULL)) { - rv = vm_pager_get_pages(obj, &from_page, 1, 0); + rv = vm_pager_get_pages(obj, &from_page, 1); if (rv != VM_PAGER_OK) { vm_page_lock(from_page); vm_page_free(from_page); Index: sys/dev/md/md.c =================================================================== --- sys/dev/md/md.c (revision 282213) +++ sys/dev/md/md.c (working copy) @@ -835,7 +835,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp) if (m->valid == VM_PAGE_BITS_ALL) rv = VM_PAGER_OK; else - rv = vm_pager_get_pages(sc->object, &m, 1, 0); + rv = vm_pager_get_pages(sc->object, &m, 1); if (rv == VM_PAGER_ERROR) { vm_page_xunbusy(m); break; @@ -858,7 +858,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp) } } else if (bp->bio_cmd == BIO_WRITE) { if (len != PAGE_SIZE && m->valid != VM_PAGE_BITS_ALL) - rv = vm_pager_get_pages(sc->object, &m, 1, 0); + rv = vm_pager_get_pages(sc->object, &m, 1); else rv = VM_PAGER_OK; if (rv == VM_PAGER_ERROR) { @@ -874,7 +874,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp) m->valid = VM_PAGE_BITS_ALL; } else if (bp->bio_cmd == BIO_DELETE) { if (len != PAGE_SIZE && m->valid != VM_PAGE_BITS_ALL) - rv = vm_pager_get_pages(sc->object, &m, 1, 0); + rv = vm_pager_get_pages(sc->object, &m, 1); else rv = VM_PAGER_OK; if (rv == VM_PAGER_ERROR) { Index: sys/fs/fuse/fuse_vnops.c =================================================================== --- sys/fs/fuse/fuse_vnops.c (revision 282213) +++ sys/fs/fuse/fuse_vnops.c (working copy) @@ -1761,29 +1761,6 @@ fuse_vnop_getpages(struct vop_getpages_args *ap) npages = btoc(count); /* - * If the requested page is partially valid, just return it and - * allow the pager to zero-out the blanks. Partially valid pages - * can only occur at the file EOF. 
- */ - - VM_OBJECT_WLOCK(vp->v_object); - fuse_vm_page_lock_queues(); - if (pages[ap->a_reqpage]->valid != 0) { - for (i = 0; i < npages; ++i) { - if (i != ap->a_reqpage) { - fuse_vm_page_lock(pages[i]); - vm_page_free(pages[i]); - fuse_vm_page_unlock(pages[i]); - } - } - fuse_vm_page_unlock_queues(); - VM_OBJECT_WUNLOCK(vp->v_object); - return 0; - } - fuse_vm_page_unlock_queues(); - VM_OBJECT_WUNLOCK(vp->v_object); - - /* * We use only the kva address for the buffer, but this is extremely * convienient and fast. */ @@ -1811,17 +1788,6 @@ fuse_vnop_getpages(struct vop_getpages_args *ap) if (error && (uio.uio_resid == count)) { FS_DEBUG("error %d\n", error); - VM_OBJECT_WLOCK(vp->v_object); - fuse_vm_page_lock_queues(); - for (i = 0; i < npages; ++i) { - if (i != ap->a_reqpage) { - fuse_vm_page_lock(pages[i]); - vm_page_free(pages[i]); - fuse_vm_page_unlock(pages[i]); - } - } - fuse_vm_page_unlock_queues(); - VM_OBJECT_WUNLOCK(vp->v_object); return VM_PAGER_ERROR; } /* @@ -1862,8 +1828,6 @@ fuse_vnop_getpages(struct vop_getpages_args *ap) */ ; } - if (i != ap->a_reqpage) - vm_page_readahead_finish(m); } fuse_vm_page_unlock_queues(); VM_OBJECT_WUNLOCK(vp->v_object); Index: sys/fs/nfsclient/nfs_clbio.c =================================================================== --- sys/fs/nfsclient/nfs_clbio.c (revision 282213) +++ sys/fs/nfsclient/nfs_clbio.c (working copy) @@ -129,23 +129,6 @@ ncl_getpages(struct vop_getpages_args *ap) npages = btoc(count); /* - * Since the caller has busied the requested page, that page's valid - * field will not be changed by other threads. - */ - vm_page_assert_xbusied(pages[ap->a_reqpage]); - - /* - * If the requested page is partially valid, just return it and - * allow the pager to zero-out the blanks. Partially valid pages - * can only occur at the file EOF. - */ - if (pages[ap->a_reqpage]->valid != 0) { - vm_pager_free_nonreq(object, pages, ap->a_reqpage, npages, - FALSE); - return (VM_PAGER_OK); - } - - /* * We use only the kva address for the buffer, but this is extremely * convienient and fast. */ @@ -173,8 +156,6 @@ ncl_getpages(struct vop_getpages_args *ap) if (error && (uio.uio_resid == count)) { ncl_printf("nfs_getpages: error %d\n", error); - vm_pager_free_nonreq(object, pages, ap->a_reqpage, npages, - FALSE); return (VM_PAGER_ERROR); } @@ -218,8 +199,6 @@ ncl_getpages(struct vop_getpages_args *ap) */ ; } - if (i != ap->a_reqpage) - vm_page_readahead_finish(m); } VM_OBJECT_WUNLOCK(object); return (0); Index: sys/fs/smbfs/smbfs_io.c =================================================================== --- sys/fs/smbfs/smbfs_io.c (revision 282213) +++ sys/fs/smbfs/smbfs_io.c (working copy) @@ -424,7 +424,7 @@ smbfs_getpages(ap) #ifdef SMBFS_RWGENERIC return vop_stdgetpages(ap); #else - int i, error, nextoff, size, toff, npages, count, reqpage; + int i, error, nextoff, size, toff, npages, count; struct uio uio; struct iovec iov; vm_offset_t kva; @@ -436,7 +436,7 @@ smbfs_getpages(ap) struct smbnode *np; struct smb_cred *scred; vm_object_t object; - vm_page_t *pages, m; + vm_page_t *pages; vp = ap->a_vp; if ((object = vp->v_object) == NULL) { @@ -451,29 +451,7 @@ smbfs_getpages(ap) pages = ap->a_m; count = ap->a_count; npages = btoc(count); - reqpage = ap->a_reqpage; - /* - * If the requested page is partially valid, just return it and - * allow the pager to zero-out the blanks. Partially valid pages - * can only occur at the file EOF. 
- */ - m = pages[reqpage]; - - VM_OBJECT_WLOCK(object); - if (m->valid != 0) { - for (i = 0; i < npages; ++i) { - if (i != reqpage) { - vm_page_lock(pages[i]); - vm_page_free(pages[i]); - vm_page_unlock(pages[i]); - } - } - VM_OBJECT_WUNLOCK(object); - return 0; - } - VM_OBJECT_WUNLOCK(object); - scred = smbfs_malloc_scred(); smb_makescred(scred, td, cred); @@ -500,17 +478,8 @@ smbfs_getpages(ap) relpbuf(bp, &smbfs_pbuf_freecnt); - VM_OBJECT_WLOCK(object); if (error && (uio.uio_resid == count)) { printf("smbfs_getpages: error %d\n",error); - for (i = 0; i < npages; i++) { - if (reqpage != i) { - vm_page_lock(pages[i]); - vm_page_free(pages[i]); - vm_page_unlock(pages[i]); - } - } - VM_OBJECT_WUNLOCK(object); return VM_PAGER_ERROR; } @@ -544,9 +513,6 @@ smbfs_getpages(ap) */ ; } - - if (i != reqpage) - vm_page_readahead_finish(m); } VM_OBJECT_WUNLOCK(object); return 0; Index: sys/fs/tmpfs/tmpfs_subr.c =================================================================== --- sys/fs/tmpfs/tmpfs_subr.c (revision 282213) +++ sys/fs/tmpfs/tmpfs_subr.c (working copy) @@ -1320,7 +1320,7 @@ tmpfs_reg_resize(struct vnode *vp, off_t newsize, struct tmpfs_mount *tmp; struct tmpfs_node *node; vm_object_t uobj; - vm_page_t m, ma[1]; + vm_page_t m; vm_pindex_t idx, newpages, oldpages; off_t oldsize; int base, rv; @@ -1368,9 +1368,7 @@ retry: VM_OBJECT_WLOCK(uobj); goto retry; } else if (m->valid != VM_PAGE_BITS_ALL) { - ma[0] = m; - rv = vm_pager_get_pages(uobj, ma, 1, 0); - m = vm_page_lookup(uobj, idx); + rv = vm_pager_get_pages(uobj, &m, 1); } else /* A cached page was reactivated. */ rv = VM_PAGER_OK; Index: sys/kern/kern_exec.c =================================================================== --- sys/kern/kern_exec.c (revision 282213) +++ sys/kern/kern_exec.c (working copy) @@ -920,8 +920,7 @@ int exec_map_first_page(imgp) struct image_params *imgp; { - int rv, i; - int initial_pagein; + int rv, i, after, initial_pagein; vm_page_t ma[VM_INITIAL_PAGEIN]; vm_object_t object; @@ -937,9 +936,18 @@ exec_map_first_page(imgp) #endif ma[0] = vm_page_grab(object, 0, VM_ALLOC_NORMAL); if (ma[0]->valid != VM_PAGE_BITS_ALL) { - initial_pagein = VM_INITIAL_PAGEIN; - if (initial_pagein > object->size) - initial_pagein = object->size; + if (!vm_pager_has_page(object, 0, NULL, &after)) { + vm_page_xunbusy(ma[0]); + vm_page_lock(ma[0]); + vm_page_free(ma[0]); + vm_page_unlock(ma[0]); + VM_OBJECT_WUNLOCK(object); + return (EIO); + } + initial_pagein = min(after, VM_INITIAL_PAGEIN); + KASSERT(initial_pagein <= object->size, + ("%s: initial_pagein %d object->size %ju", + __func__, initial_pagein, (uintmax_t )object->size)); for (i = 1; i < initial_pagein; i++) { if ((ma[i] = vm_page_next(ma[i - 1])) != NULL) { if (ma[i]->valid) @@ -954,19 +962,21 @@ exec_map_first_page(imgp) } } initial_pagein = i; - rv = vm_pager_get_pages(object, ma, initial_pagein, 0); - ma[0] = vm_page_lookup(object, 0); - if ((rv != VM_PAGER_OK) || (ma[0] == NULL)) { - if (ma[0] != NULL) { - vm_page_lock(ma[0]); - vm_page_free(ma[0]); - vm_page_unlock(ma[0]); + rv = vm_pager_get_pages(object, ma, initial_pagein); + if (rv != VM_PAGER_OK) { + for (i = 0; i < initial_pagein; i++) { + vm_page_xunbusy(ma[i]); + vm_page_lock(ma[i]); + vm_page_free(ma[i]); + vm_page_unlock(ma[i]); } VM_OBJECT_WUNLOCK(object); return (EIO); } - } - vm_page_xunbusy(ma[0]); + } else + initial_pagein = 1; + for (i = 0; i < initial_pagein; i++) + vm_page_xunbusy(ma[i]); vm_page_lock(ma[0]); vm_page_hold(ma[0]); vm_page_activate(ma[0]); Index: sys/kern/uipc_shm.c 
=================================================================== --- sys/kern/uipc_shm.c (revision 282213) +++ sys/kern/uipc_shm.c (working copy) @@ -186,15 +186,7 @@ uiomove_object_page(vm_object_t obj, size_t len, s m = vm_page_grab(obj, idx, VM_ALLOC_NORMAL); if (m->valid != VM_PAGE_BITS_ALL) { if (vm_pager_has_page(obj, idx, NULL, NULL)) { - rv = vm_pager_get_pages(obj, &m, 1, 0); - m = vm_page_lookup(obj, idx); - if (m == NULL) { - printf( - "uiomove_object: vm_obj %p idx %jd null lookup rv %d\n", - obj, idx, rv); - VM_OBJECT_WUNLOCK(obj); - return (EIO); - } + rv = vm_pager_get_pages(obj, &m, 1); if (rv != VM_PAGER_OK) { printf( "uiomove_object: vm_obj %p idx %jd valid %x pager error %d\n", @@ -421,7 +413,7 @@ static int shm_dotruncate(struct shmfd *shmfd, off_t length) { vm_object_t object; - vm_page_t m, ma[1]; + vm_page_t m; vm_pindex_t idx, nobjsize; vm_ooffset_t delta; int base, rv; @@ -463,12 +455,9 @@ retry: VM_WAIT; VM_OBJECT_WLOCK(object); goto retry; - } else if (m->valid != VM_PAGE_BITS_ALL) { - ma[0] = m; - rv = vm_pager_get_pages(object, ma, 1, - 0); - m = vm_page_lookup(object, idx); - } else + } else if (m->valid != VM_PAGE_BITS_ALL) + rv = vm_pager_get_pages(object, &m, 1); + else /* A cached page was reactivated. */ rv = VM_PAGER_OK; vm_page_lock(m); Index: sys/kern/uipc_syscalls.c =================================================================== --- sys/kern/uipc_syscalls.c (revision 282213) +++ sys/kern/uipc_syscalls.c (working copy) @@ -2024,12 +2024,9 @@ sendfile_readpage(vm_object_t obj, struct vnode *v VM_OBJECT_WLOCK(obj); } else { if (vm_pager_has_page(obj, pindex, NULL, NULL)) { - rv = vm_pager_get_pages(obj, &m, 1, 0); + rv = vm_pager_get_pages(obj, &m, 1); SFSTAT_INC(sf_iocnt); - m = vm_page_lookup(obj, pindex); - if (m == NULL) - error = EIO; - else if (rv != VM_PAGER_OK) { + if (rv != VM_PAGER_OK) { vm_page_lock(m); vm_page_free(m); vm_page_unlock(m); Index: sys/kern/vfs_default.c =================================================================== --- sys/kern/vfs_default.c (revision 282213) +++ sys/kern/vfs_default.c (working copy) @@ -731,12 +731,11 @@ vop_stdgetpages(ap) struct vnode *a_vp; vm_page_t *a_m; int a_count; - int a_reqpage; } */ *ap; { return vnode_pager_generic_getpages(ap->a_vp, ap->a_m, - ap->a_count, ap->a_reqpage, NULL, NULL); + ap->a_count, NULL, NULL); } static int @@ -744,8 +743,8 @@ vop_stdgetpages_async(struct vop_getpages_async_ar { int error; - error = VOP_GETPAGES(ap->a_vp, ap->a_m, ap->a_count, ap->a_reqpage); - ap->a_iodone(ap->a_arg, ap->a_m, ap->a_reqpage, error); + error = VOP_GETPAGES(ap->a_vp, ap->a_m, ap->a_count); + ap->a_iodone(ap->a_arg, ap->a_m, ap->a_count, error); return (error); } Index: sys/kern/vnode_if.src =================================================================== --- sys/kern/vnode_if.src (revision 282213) +++ sys/kern/vnode_if.src (working copy) @@ -472,7 +472,6 @@ vop_getpages { IN struct vnode *vp; IN vm_page_t *m; IN int count; - IN int reqpage; }; @@ -482,7 +481,6 @@ vop_getpages_async { IN struct vnode *vp; IN vm_page_t *m; IN int count; - IN int reqpage; IN vop_getpages_iodone_t *iodone; IN void *arg; }; Index: sys/sys/buf.h =================================================================== --- sys/sys/buf.h (revision 282213) +++ sys/sys/buf.h (working copy) @@ -124,14 +124,9 @@ struct buf { struct ucred *b_wcred; /* Write credentials reference. */ void *b_saveaddr; /* Original b_addr for physio. 
*/ union { - TAILQ_ENTRY(buf) bu_freelist; /* (Q) */ - struct { - void (*pg_iodone)(void *, vm_page_t *, int, int); - int pg_reqpage; - } bu_pager; - } b_union; -#define b_freelist b_union.bu_freelist -#define b_pager b_union.bu_pager + TAILQ_ENTRY(buf) b_freelist; /* (Q) */ + void (*b_pgiodone)(void *, vm_page_t *, int, int); + }; union cluster_info { TAILQ_HEAD(cluster_list_head, buf) cluster_head; TAILQ_ENTRY(buf) cluster_entry; Index: sys/vm/default_pager.c =================================================================== --- sys/vm/default_pager.c (revision 282213) +++ sys/vm/default_pager.c (working copy) @@ -56,7 +56,7 @@ __FBSDID("$FreeBSD$"); static vm_object_t default_pager_alloc(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t, struct ucred *); static void default_pager_dealloc(vm_object_t); -static int default_pager_getpages(vm_object_t, vm_page_t *, int, int); +static int default_pager_getpages(vm_object_t, vm_page_t *, int); static void default_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *); static boolean_t default_pager_haspage(vm_object_t, vm_pindex_t, int *, @@ -121,11 +121,10 @@ default_pager_dealloc(object) * see a vm_page with assigned swap here. */ static int -default_pager_getpages(object, m, count, reqpage) +default_pager_getpages(object, m, count) vm_object_t object; vm_page_t *m; int count; - int reqpage; { return VM_PAGER_FAIL; } Index: sys/vm/device_pager.c =================================================================== --- sys/vm/device_pager.c (revision 282213) +++ sys/vm/device_pager.c (working copy) @@ -59,7 +59,7 @@ static void dev_pager_init(void); static vm_object_t dev_pager_alloc(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t, struct ucred *); static void dev_pager_dealloc(vm_object_t); -static int dev_pager_getpages(vm_object_t, vm_page_t *, int, int); +static int dev_pager_getpages(vm_object_t, vm_page_t *, int); static void dev_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *); static boolean_t dev_pager_haspage(vm_object_t, vm_pindex_t, int *, @@ -257,33 +257,27 @@ dev_pager_dealloc(object) } static int -dev_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int reqpage) +dev_pager_getpages(vm_object_t object, vm_page_t *ma, int count) { - int error, i; + int error; + /* Since our putpages reports zero after/before, the count is 1. 
*/ + KASSERT(count == 1, ("%s: count %d", __func__, count)); VM_OBJECT_ASSERT_WLOCKED(object); error = object->un_pager.devp.ops->cdev_pg_fault(object, - IDX_TO_OFF(ma[reqpage]->pindex), PROT_READ, &ma[reqpage]); + IDX_TO_OFF(ma[0]->pindex), PROT_READ, &ma[0]); VM_OBJECT_ASSERT_WLOCKED(object); - for (i = 0; i < count; i++) { - if (i != reqpage) { - vm_page_lock(ma[i]); - vm_page_free(ma[i]); - vm_page_unlock(ma[i]); - } - } - if (error == VM_PAGER_OK) { KASSERT((object->type == OBJT_DEVICE && - (ma[reqpage]->oflags & VPO_UNMANAGED) != 0) || + (ma[0]->oflags & VPO_UNMANAGED) != 0) || (object->type == OBJT_MGTDEVICE && - (ma[reqpage]->oflags & VPO_UNMANAGED) == 0), - ("Wrong page type %p %p", ma[reqpage], object)); + (ma[0]->oflags & VPO_UNMANAGED) == 0), + ("Wrong page type %p %p", ma[0], object)); if (object->type == OBJT_DEVICE) { TAILQ_INSERT_TAIL(&object->un_pager.devp.devp_pglist, - ma[reqpage], plinks.q); + ma[0], plinks.q); } } Index: sys/vm/phys_pager.c =================================================================== --- sys/vm/phys_pager.c (revision 282213) +++ sys/vm/phys_pager.c (working copy) @@ -137,7 +137,7 @@ phys_pager_dealloc(vm_object_t object) * Fill as many pages as vm_fault has allocated for us. */ static int -phys_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage) +phys_pager_getpages(vm_object_t object, vm_page_t *m, int count) { int i; @@ -152,13 +152,6 @@ static int ("phys_pager_getpages: partially valid page %p", m[i])); KASSERT(m[i]->dirty == 0, ("phys_pager_getpages: dirty page %p", m[i])); - /* The requested page must remain busy, the others not. */ - if (i == reqpage) { - vm_page_lock(m[i]); - vm_page_flash(m[i]); - vm_page_unlock(m[i]); - } else - vm_page_xunbusy(m[i]); } return (VM_PAGER_OK); } Index: sys/vm/sg_pager.c =================================================================== --- sys/vm/sg_pager.c (revision 282213) +++ sys/vm/sg_pager.c (working copy) @@ -49,7 +49,7 @@ __FBSDID("$FreeBSD$"); static vm_object_t sg_pager_alloc(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t, struct ucred *); static void sg_pager_dealloc(vm_object_t); -static int sg_pager_getpages(vm_object_t, vm_page_t *, int, int); +static int sg_pager_getpages(vm_object_t, vm_page_t *, int); static void sg_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *); static boolean_t sg_pager_haspage(vm_object_t, vm_pindex_t, int *, @@ -133,7 +133,7 @@ sg_pager_dealloc(vm_object_t object) } static int -sg_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage) +sg_pager_getpages(vm_object_t object, vm_page_t *m, int count) { struct sglist *sg; vm_page_t m_paddr, page; @@ -143,11 +143,13 @@ static int size_t space; int i; + /* Since our putpages reports zero after/before, the count is 1. */ + KASSERT(count == 1, ("%s: count %d", __func__, count)); VM_OBJECT_ASSERT_WLOCKED(object); sg = object->handle; memattr = object->memattr; VM_OBJECT_WUNLOCK(object); - offset = m[reqpage]->pindex; + offset = m[0]->pindex; /* * Lookup the physical address of the requested page. An initial @@ -176,7 +178,7 @@ static int } /* Return a fake page for the requested page. */ - KASSERT(!(m[reqpage]->flags & PG_FICTITIOUS), + KASSERT(!(m[0]->flags & PG_FICTITIOUS), ("backing page for SG is fake")); /* Construct a new fake page. 
*/ @@ -183,17 +185,9 @@ static int page = vm_page_getfake(paddr, memattr); VM_OBJECT_WLOCK(object); TAILQ_INSERT_TAIL(&object->un_pager.sgp.sgp_pglist, page, plinks.q); - - /* Free the original pages and insert this fake page into the object. */ - for (i = 0; i < count; i++) { - if (i == reqpage && - vm_page_replace(page, object, offset) != m[i]) - panic("sg_pager_getpages: invalid place replacement"); - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - m[reqpage] = page; + if (vm_page_replace(page, object, offset) != m[0]) + panic("sg_pager_getpages: invalid place replacement"); + m[0] = page; page->valid = VM_PAGE_BITS_ALL; return (VM_PAGER_OK); Index: sys/vm/swap_pager.c =================================================================== --- sys/vm/swap_pager.c (revision 282213) +++ sys/vm/swap_pager.c (working copy) @@ -362,8 +362,8 @@ static vm_object_t swap_pager_alloc(void *handle, vm_ooffset_t size, vm_prot_t prot, vm_ooffset_t offset, struct ucred *); static void swap_pager_dealloc(vm_object_t object); -static int swap_pager_getpages(vm_object_t, vm_page_t *, int, int); -static int swap_pager_getpages_async(vm_object_t, vm_page_t *, int, int, +static int swap_pager_getpages(vm_object_t, vm_page_t *, int); +static int swap_pager_getpages_async(vm_object_t, vm_page_t *, int, pgo_getpages_iodone_t, void *); static void swap_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *); static boolean_t @@ -418,16 +418,6 @@ static void swp_pager_meta_free(vm_object_t, vm_pi static void swp_pager_meta_free_all(vm_object_t); static daddr_t swp_pager_meta_ctl(vm_object_t, vm_pindex_t, int); -static void -swp_pager_free_nrpage(vm_page_t m) -{ - - vm_page_lock(m); - if (m->wire_count == 0) - vm_page_free(m); - vm_page_unlock(m); -} - /* * SWP_SIZECHECK() - update swap_pager_full indication * @@ -1109,20 +1099,11 @@ swap_pager_unswapped(vm_page_t m) * left busy, but the others adjusted. */ static int -swap_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage) +swap_pager_getpages(vm_object_t object, vm_page_t *m, int count) { struct buf *bp; - vm_page_t mreq; - int i; - int j; daddr_t blk; - mreq = m[reqpage]; - - KASSERT(mreq->object == object, - ("swap_pager_getpages: object mismatch %p/%p", - object, mreq->object)); - /* * Calculate range to retrieve. The pages have already been assigned * their swapblks. We require a *contiguous* range but we know it to @@ -1132,45 +1113,18 @@ static int * * The swp_*() calls must be made with the object locked. */ - blk = swp_pager_meta_ctl(mreq->object, mreq->pindex, 0); + blk = swp_pager_meta_ctl(m[0]->object, m[0]->pindex, 0); - for (i = reqpage - 1; i >= 0; --i) { - daddr_t iblk; - - iblk = swp_pager_meta_ctl(m[i]->object, m[i]->pindex, 0); - if (blk != iblk + (reqpage - i)) - break; - } - ++i; - - for (j = reqpage + 1; j < count; ++j) { - daddr_t jblk; - - jblk = swp_pager_meta_ctl(m[j]->object, m[j]->pindex, 0); - if (blk != jblk - (j - reqpage)) - break; - } - - /* - * free pages outside our collection range. Note: we never free - * mreq, it must remain busy throughout. - */ - if (0 < i || j < count) { - int k; - - for (k = 0; k < i; ++k) - swp_pager_free_nrpage(m[k]); - for (k = j; k < count; ++k) - swp_pager_free_nrpage(m[k]); - } - - /* - * Return VM_PAGER_FAIL if we have nothing to do. Return mreq - * still busy, but the others unbusied. 
- */ if (blk == SWAPBLK_NONE) return (VM_PAGER_FAIL); +#ifdef INVARIANTS + for (int i = 0; i < count; i++) + KASSERT(blk + i == + swp_pager_meta_ctl(m[i]->object, m[i]->pindex, 0), + ("%s: range is not contiguous", __func__)); +#endif + /* * Getpbuf() can sleep. */ @@ -1185,21 +1139,16 @@ static int bp->b_iodone = swp_pager_async_iodone; bp->b_rcred = crhold(thread0.td_ucred); bp->b_wcred = crhold(thread0.td_ucred); - bp->b_blkno = blk - (reqpage - i); - bp->b_bcount = PAGE_SIZE * (j - i); - bp->b_bufsize = PAGE_SIZE * (j - i); - bp->b_pager.pg_reqpage = reqpage - i; + bp->b_blkno = blk; + bp->b_bcount = PAGE_SIZE * count; + bp->b_bufsize = PAGE_SIZE * count; + bp->b_npages = count; VM_OBJECT_WLOCK(object); - { - int k; - - for (k = i; k < j; ++k) { - bp->b_pages[k - i] = m[k]; - m[k]->oflags |= VPO_SWAPINPROG; - } + for (int i = 0; i < count; i++) { + bp->b_pages[i] = m[i]; + m[i]->oflags |= VPO_SWAPINPROG; } - bp->b_npages = j - i; PCPU_INC(cnt.v_swapin); PCPU_ADD(cnt.v_swappgsin, bp->b_npages); @@ -1231,8 +1180,8 @@ static int * is set in the meta-data. */ VM_OBJECT_WLOCK(object); - while ((mreq->oflags & VPO_SWAPINPROG) != 0) { - mreq->oflags |= VPO_SWAPSLEEP; + while ((m[0]->oflags & VPO_SWAPINPROG) != 0) { + m[0]->oflags |= VPO_SWAPSLEEP; PCPU_INC(cnt.v_intrans); if (VM_OBJECT_SLEEP(object, &object->paging_in_progress, PSWP, "swread", hz * 20)) { @@ -1243,16 +1192,14 @@ static int } /* - * mreq is left busied after completion, but all the other pages - * are freed. If we had an unrecoverable read error the page will - * not be valid. + * If we had an unrecoverable read error pages will not be valid. */ - if (mreq->valid != VM_PAGE_BITS_ALL) { - return (VM_PAGER_ERROR); - } else { - return (VM_PAGER_OK); - } + for (int i = 0; i < count; i++) + if (m[i]->valid != VM_PAGE_BITS_ALL) + return (VM_PAGER_ERROR); + return (VM_PAGER_OK); + /* * A final note: in a low swap situation, we cannot deallocate swap * and mark a page dirty here because the caller is likely to mark @@ -1269,11 +1216,11 @@ static int */ static int swap_pager_getpages_async(vm_object_t object, vm_page_t *m, int count, - int reqpage, pgo_getpages_iodone_t iodone, void *arg) + pgo_getpages_iodone_t iodone, void *arg) { int r, error; - r = swap_pager_getpages(object, m, count, reqpage); + r = swap_pager_getpages(object, m, count); VM_OBJECT_WUNLOCK(object); switch (r) { case VM_PAGER_OK: @@ -1572,33 +1519,11 @@ swp_pager_async_iodone(struct buf *bp) */ if (bp->b_iocmd == BIO_READ) { /* - * When reading, reqpage needs to stay - * locked for the parent, but all other - * pages can be freed. We still want to - * wakeup the parent waiting on the page, - * though. ( also: pg_reqpage can be -1 and - * not match anything ). - * - * We have to wake specifically requested pages - * up too because we cleared VPO_SWAPINPROG and - * someone may be waiting for that. - * * NOTE: for reads, m->dirty will probably * be overridden by the original caller of * getpages so don't play cute tricks here. */ m->valid = 0; - if (i != bp->b_pager.pg_reqpage) - swp_pager_free_nrpage(m); - else { - vm_page_lock(m); - vm_page_flash(m); - vm_page_unlock(m); - } - /* - * If i == bp->b_pager.pg_reqpage, do not wake - * the page up. The caller needs to. - */ } else { /* * If a write error occurs, reactivate page @@ -1620,38 +1545,12 @@ swp_pager_async_iodone(struct buf *bp) * want to do that anyway, but it was an optimization * that existed in the old swapper for a time before * it got ripped out due to precisely this problem. 
- * - * If not the requested page then deactivate it. - * - * Note that the requested page, reqpage, is left - * busied, but we still have to wake it up. The - * other pages are released (unbusied) by - * vm_page_xunbusy(). */ KASSERT(!pmap_page_is_mapped(m), ("swp_pager_async_iodone: page %p is mapped", m)); - m->valid = VM_PAGE_BITS_ALL; KASSERT(m->dirty == 0, ("swp_pager_async_iodone: page %p is dirty", m)); - - /* - * We have to wake specifically requested pages - * up too because we cleared VPO_SWAPINPROG and - * could be waiting for it in getpages. However, - * be sure to not unbusy getpages specifically - * requested page - getpages expects it to be - * left busy. - */ - if (i != bp->b_pager.pg_reqpage) { - vm_page_lock(m); - vm_page_deactivate(m); - vm_page_unlock(m); - vm_page_xunbusy(m); - } else { - vm_page_lock(m); - vm_page_flash(m); - vm_page_unlock(m); - } + m->valid = VM_PAGE_BITS_ALL; } else { /* * For write success, clear the dirty @@ -1772,7 +1671,7 @@ swp_pager_force_pagein(vm_object_t object, vm_pind return; } - if (swap_pager_getpages(object, &m, 1, 0) != VM_PAGER_OK) + if (swap_pager_getpages(object, &m, 1) != VM_PAGER_OK) panic("swap_pager_force_pagein: read from swap failed");/*XXX*/ vm_object_pip_wakeup(object); vm_page_dirty(m); Index: sys/vm/vm_fault.c =================================================================== --- sys/vm/vm_fault.c (revision 282213) +++ sys/vm/vm_fault.c (working copy) @@ -672,26 +672,21 @@ vnode_locked: fs.m, behind, ahead, marray, &reqpage); rv = faultcount ? - vm_pager_get_pages(fs.object, marray, faultcount, - reqpage) : VM_PAGER_FAIL; + vm_pager_get_pages(fs.object, marray, faultcount) : + VM_PAGER_FAIL; if (rv == VM_PAGER_OK) { /* * Found the page. Leave it busy while we play - * with it. + * with it. Unbusy companion pages. */ - - /* - * Relookup in case pager changed page. Pager - * is responsible for disposition of old page - * if moved. - */ - fs.m = vm_page_lookup(fs.object, fs.pindex); - if (!fs.m) { - unlock_and_deallocate(&fs); - goto RetryFault; + for (int i = 0; i < faultcount; i++) { + if (i == reqpage) + continue; + vm_page_readahead_finish(marray[i]); } - + /* Pager could have changed the page. 
*/ + fs.m = marray[reqpage]; hardfault++; break; /* break to PAGE HAS BEEN FOUND */ } Index: sys/vm/vm_glue.c =================================================================== --- sys/vm/vm_glue.c (revision 282213) +++ sys/vm/vm_glue.c (working copy) @@ -230,7 +230,7 @@ vsunlock(void *addr, size_t len) static vm_page_t vm_imgact_hold_page(vm_object_t object, vm_ooffset_t offset) { - vm_page_t m, ma[1]; + vm_page_t m; vm_pindex_t pindex; int rv; @@ -238,11 +238,7 @@ vm_imgact_hold_page(vm_object_t object, vm_ooffset pindex = OFF_TO_IDX(offset); m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL); if (m->valid != VM_PAGE_BITS_ALL) { - ma[0] = m; - rv = vm_pager_get_pages(object, ma, 1, 0); - m = vm_page_lookup(object, pindex); - if (m == NULL) - goto out; + rv = vm_pager_get_pages(object, &m, 1); if (rv != VM_PAGER_OK) { vm_page_lock(m); vm_page_free(m); @@ -571,34 +567,37 @@ vm_thread_swapin(struct thread *td) { vm_object_t ksobj; vm_page_t ma[KSTACK_MAX_PAGES]; - int i, j, k, pages, rv; + int pages; pages = td->td_kstack_pages; ksobj = td->td_kstack_obj; VM_OBJECT_WLOCK(ksobj); - for (i = 0; i < pages; i++) + for (int i = 0; i < pages; i++) ma[i] = vm_page_grab(ksobj, i, VM_ALLOC_NORMAL | VM_ALLOC_WIRED); - for (i = 0; i < pages; i++) { - if (ma[i]->valid != VM_PAGE_BITS_ALL) { - vm_page_assert_xbusied(ma[i]); - vm_object_pip_add(ksobj, 1); - for (j = i + 1; j < pages; j++) { - if (ma[j]->valid != VM_PAGE_BITS_ALL) - vm_page_assert_xbusied(ma[j]); - if (ma[j]->valid == VM_PAGE_BITS_ALL) - break; - } - rv = vm_pager_get_pages(ksobj, ma + i, j - i, 0); - if (rv != VM_PAGER_OK) - panic("vm_thread_swapin: cannot get kstack for proc: %d", - td->td_proc->p_pid); - vm_object_pip_wakeup(ksobj); - for (k = i; k < j; k++) - ma[k] = vm_page_lookup(ksobj, k); + for (int i = 0; i < pages;) { + int j, a, count, rv; + + vm_page_assert_xbusied(ma[i]); + if (ma[i]->valid == VM_PAGE_BITS_ALL) { vm_page_xunbusy(ma[i]); - } else if (vm_page_xbusied(ma[i])) - vm_page_xunbusy(ma[i]); + i++; + continue; + } + vm_object_pip_add(ksobj, 1); + for (j = i + 1; j < pages; j++) + if (ma[j]->valid == VM_PAGE_BITS_ALL) + break; + rv = vm_pager_has_page(ksobj, ma[i]->pindex, NULL, &a); + KASSERT(rv == 1, ("%s: missing page %p", __func__, ma[i])); + count = min(a + 1, j - i); + rv = vm_pager_get_pages(ksobj, ma + i, count); + KASSERT(rv == VM_PAGER_OK, ("%s: cannot get kstack for proc %d", + __func__, td->td_proc->p_pid)); + vm_object_pip_wakeup(ksobj); + for (j = i; j < i + count; j++) + vm_page_xunbusy(ma[j]); + i += count; } VM_OBJECT_WUNLOCK(ksobj); pmap_qenter(td->td_kstack, ma, pages); Index: sys/vm/vm_object.c =================================================================== --- sys/vm/vm_object.c (revision 282213) +++ sys/vm/vm_object.c (working copy) @@ -2042,7 +2042,7 @@ vm_object_page_cache(vm_object_t object, vm_pindex boolean_t vm_object_populate(vm_object_t object, vm_pindex_t start, vm_pindex_t end) { - vm_page_t m, ma[1]; + vm_page_t m; vm_pindex_t pindex; int rv; @@ -2050,11 +2050,7 @@ vm_object_populate(vm_object_t object, vm_pindex_t for (pindex = start; pindex < end; pindex++) { m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL); if (m->valid != VM_PAGE_BITS_ALL) { - ma[0] = m; - rv = vm_pager_get_pages(object, ma, 1, 0); - m = vm_page_lookup(object, pindex); - if (m == NULL) - break; + rv = vm_pager_get_pages(object, &m, 1); if (rv != VM_PAGER_OK) { vm_page_lock(m); vm_page_free(m); Index: sys/vm/vm_page.c =================================================================== --- sys/vm/vm_page.c 
(revision 282213) +++ sys/vm/vm_page.c (working copy) @@ -863,32 +863,19 @@ void vm_page_readahead_finish(vm_page_t m) { - if (m->valid != 0) { - /* - * Since the page is not the requested page, whether - * it should be activated or deactivated is not - * obvious. Empirical results have shown that - * deactivating the page is usually the best choice, - * unless the page is wanted by another thread. - */ - vm_page_lock(m); - if ((m->busy_lock & VPB_BIT_WAITERS) != 0) - vm_page_activate(m); - else - vm_page_deactivate(m); - vm_page_unlock(m); - vm_page_xunbusy(m); - } else { - /* - * Free the completely invalid page. Such page state - * occurs due to the short read operation which did - * not covered our page at all, or in case when a read - * error happens. - */ - vm_page_lock(m); - vm_page_free(m); - vm_page_unlock(m); - } + /* + * Since the page is not the requested page, whether it should be + * activated or deactivated is not obvious. Empirical results have + * shown that deactivating the page is usually the best choice, + * unless the page is wanted by another thread. + */ + vm_page_lock(m); + if ((m->busy_lock & VPB_BIT_WAITERS) != 0) + vm_page_activate(m); + else + vm_page_deactivate(m); + vm_page_unlock(m); + vm_page_xunbusy(m); } /* Index: sys/vm/vm_pager.c =================================================================== --- sys/vm/vm_pager.c (revision 282213) +++ sys/vm/vm_pager.c (working copy) @@ -251,7 +251,95 @@ vm_pager_deallocate(object) } /* - * vm_pager_get_pages() - inline, see vm/vm_pager.h + * Retrieve pages from the VM system in order to map them into an object + * ( or into VM space somewhere ). If the pagein was successful, we + * must fully validate it. + */ +int +vm_pager_get_pages(vm_object_t object, vm_page_t *m, int count) +{ +#ifdef INVARIANTS + vm_pindex_t pindex = m[0]->pindex; +#endif + int r; + + VM_OBJECT_ASSERT_WLOCKED(object); + KASSERT(count > 0, ("%s: 0 count", __func__)); + + /* + * If the last page is partially valid, just return it and zero-out + * the blanks. Partially valid pages can only occur at the file EOF. + */ + if (m[count - 1]->valid != 0) { + vm_page_zero_invalid(m[count - 1], TRUE); + if (--count == 0) + return (VM_PAGER_OK); + } + +#ifdef INVARIANTS + /* + * All pages must be busied, not mapped, not valid, not dirty + * and belong to the proper object. + */ + for (int i = 0 ; i < count; i++) { + vm_page_assert_xbusied(m[i]); + KASSERT(!pmap_page_is_mapped(m[i]), + ("%s: page %p is mapped", __func__, m[i])); + KASSERT(m[i]->valid == 0, + ("%s: request for a valid page %p", __func__, m[i])); + KASSERT(m[i]->dirty == 0, + ("%s: page %p is dirty", __func__, m[i])); + KASSERT(m[i]->object == object, + ("%s: wrong object %p/%p", __func__, object, m[i]->object)); + } +#endif + + r = (*pagertab[object->type]->pgo_getpages)(object, m, count); + if (r != VM_PAGER_OK) + return (r); + + for (int i = 0; i < count; i++) { + /* + * If pager has replaced a page, assert that it had + * updated the array. + */ + KASSERT(m[i] == vm_page_lookup(object, pindex++), + ("%s: mismatch page %p pindex %ju", __func__, + m[i], (uintmax_t )pindex - 1)); + /* + * Zero out partially filled data. 
+ */ + if (m[i]->valid != VM_PAGE_BITS_ALL) + vm_page_zero_invalid(m[count - 1], TRUE); + } + return (VM_PAGER_OK); +} + +int +vm_pager_get_pages_async(vm_object_t object, vm_page_t *m, int count, + pgo_getpages_iodone_t iodone, void *arg) +{ + + VM_OBJECT_ASSERT_WLOCKED(object); + KASSERT(count > 0, ("%s: 0 count", __func__)); + + /* + * If the last page is partially valid, just return it and zero-out + * the blanks. Partially valid pages can only occur at the file EOF. + */ + if (m[count - 1]->valid != 0) { + vm_page_zero_invalid(m[count - 1], TRUE); + if (--count == 0) { + iodone(arg, m, 1, 0); + return (VM_PAGER_OK); + } + } + + return ((*pagertab[object->type]->pgo_getpages_async)(object, m, + count, iodone, arg)); +} + +/* * vm_pager_put_pages() - inline, see vm/vm_pager.h * vm_pager_has_page() - inline, see vm/vm_pager.h */ @@ -283,39 +371,6 @@ vm_pager_object_lookup(struct pagerlst *pg_list, v } /* - * Free the non-requested pages from the given array. To remove all pages, - * caller should provide out of range reqpage number. - */ -void -vm_pager_free_nonreq(vm_object_t object, vm_page_t ma[], int reqpage, - int npages, boolean_t object_locked) -{ - enum { UNLOCKED, CALLER_LOCKED, INTERNALLY_LOCKED } locked; - int i; - - if (object_locked) { - VM_OBJECT_ASSERT_WLOCKED(object); - locked = CALLER_LOCKED; - } else { - VM_OBJECT_ASSERT_UNLOCKED(object); - locked = UNLOCKED; - } - for (i = 0; i < npages; ++i) { - if (i != reqpage) { - if (locked == UNLOCKED) { - VM_OBJECT_WLOCK(object); - locked = INTERNALLY_LOCKED; - } - vm_page_lock(ma[i]); - vm_page_free(ma[i]); - vm_page_unlock(ma[i]); - } - } - if (locked == INTERNALLY_LOCKED) - VM_OBJECT_WUNLOCK(object); -} - -/* * initialize a physical buffer */ Index: sys/vm/vm_pager.h =================================================================== --- sys/vm/vm_pager.h (revision 282213) +++ sys/vm/vm_pager.h (working copy) @@ -50,9 +50,9 @@ typedef void pgo_init_t(void); typedef vm_object_t pgo_alloc_t(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t, struct ucred *); typedef void pgo_dealloc_t(vm_object_t); -typedef int pgo_getpages_t(vm_object_t, vm_page_t *, int, int); +typedef int pgo_getpages_t(vm_object_t, vm_page_t *, int); typedef void pgo_getpages_iodone_t(void *, vm_page_t *, int, int); -typedef int pgo_getpages_async_t(vm_object_t, vm_page_t *, int, int, +typedef int pgo_getpages_async_t(vm_object_t, vm_page_t *, int, pgo_getpages_iodone_t, void *); typedef void pgo_putpages_t(vm_object_t, vm_page_t *, int, int, int *); typedef boolean_t pgo_haspage_t(vm_object_t, vm_pindex_t, int *, int *); @@ -106,49 +106,13 @@ vm_object_t vm_pager_allocate(objtype_t, void *, v vm_ooffset_t, struct ucred *); void vm_pager_bufferinit(void); void vm_pager_deallocate(vm_object_t); -static __inline int vm_pager_get_pages(vm_object_t, vm_page_t *, int, int); -static inline int vm_pager_get_pages_async(vm_object_t, vm_page_t *, int, - int, pgo_getpages_iodone_t, void *); +int vm_pager_get_pages(vm_object_t, vm_page_t *, int); +int vm_pager_get_pages_async(vm_object_t, vm_page_t *, int, + pgo_getpages_iodone_t, void *); static __inline boolean_t vm_pager_has_page(vm_object_t, vm_pindex_t, int *, int *); void vm_pager_init(void); vm_object_t vm_pager_object_lookup(struct pagerlst *, void *); -void vm_pager_free_nonreq(vm_object_t object, vm_page_t ma[], int reqpage, - int npages, boolean_t object_locked); -/* - * vm_page_get_pages: - * - * Retrieve pages from the VM system in order to map them into an object - * ( or into VM space somewhere ). 
If the pagein was successful, we - * must fully validate it. - */ -static __inline int -vm_pager_get_pages( - vm_object_t object, - vm_page_t *m, - int count, - int reqpage -) { - int r; - - VM_OBJECT_ASSERT_WLOCKED(object); - r = (*pagertab[object->type]->pgo_getpages)(object, m, count, reqpage); - if (r == VM_PAGER_OK && m[reqpage]->valid != VM_PAGE_BITS_ALL) { - vm_page_zero_invalid(m[reqpage], TRUE); - } - return (r); -} - -static inline int -vm_pager_get_pages_async(vm_object_t object, vm_page_t *m, int count, - int reqpage, pgo_getpages_iodone_t iodone, void *arg) -{ - - VM_OBJECT_ASSERT_WLOCKED(object); - return ((*pagertab[object->type]->pgo_getpages_async)(object, m, - count, reqpage, iodone, arg)); -} - static __inline void vm_pager_put_pages( vm_object_t object, Index: sys/vm/vnode_pager.c =================================================================== --- sys/vm/vnode_pager.c (revision 282213) +++ sys/vm/vnode_pager.c (working copy) @@ -84,11 +84,9 @@ static int vnode_pager_addr(struct vnode *vp, vm_o static int vnode_pager_input_smlfs(vm_object_t object, vm_page_t m); static int vnode_pager_input_old(vm_object_t object, vm_page_t m); static void vnode_pager_dealloc(vm_object_t); -static int vnode_pager_local_getpages0(struct vnode *, vm_page_t *, int, int, +static int vnode_pager_getpages(vm_object_t, vm_page_t *, int); +static int vnode_pager_getpages_async(vm_object_t, vm_page_t *, int, vop_getpages_iodone_t, void *); -static int vnode_pager_getpages(vm_object_t, vm_page_t *, int, int); -static int vnode_pager_getpages_async(vm_object_t, vm_page_t *, int, int, - vop_getpages_iodone_t, void *); static void vnode_pager_putpages(vm_object_t, vm_page_t *, int, int, int *); static boolean_t vnode_pager_haspage(vm_object_t, vm_pindex_t, int *, int *); static vm_object_t vnode_pager_alloc(void *, vm_ooffset_t, vm_prot_t, @@ -662,7 +660,7 @@ vnode_pager_input_old(vm_object_t object, vm_page_ * backing vp's VOP_GETPAGES. 
*/ static int -vnode_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage) +vnode_pager_getpages(vm_object_t object, vm_page_t *m, int count) { int rtval; struct vnode *vp; @@ -670,7 +668,7 @@ static int vp = object->handle; VM_OBJECT_WUNLOCK(object); - rtval = VOP_GETPAGES(vp, m, bytes, reqpage); + rtval = VOP_GETPAGES(vp, m, bytes); KASSERT(rtval != EOPNOTSUPP, ("vnode_pager: FS getpages not implemented\n")); VM_OBJECT_WLOCK(object); @@ -679,7 +677,7 @@ static int static int vnode_pager_getpages_async(vm_object_t object, vm_page_t *m, int count, - int reqpage, vop_getpages_iodone_t iodone, void *arg) + vop_getpages_iodone_t iodone, void *arg) { struct vnode *vp; int rtval; @@ -686,8 +684,7 @@ vnode_pager_getpages_async(vm_object_t object, vm_ vp = object->handle; VM_OBJECT_WUNLOCK(object); - rtval = VOP_GETPAGES_ASYNC(vp, m, count * PAGE_SIZE, reqpage, - iodone, arg); + rtval = VOP_GETPAGES_ASYNC(vp, m, count * PAGE_SIZE, iodone, arg); KASSERT(rtval != EOPNOTSUPP, ("vnode_pager: FS getpages_async not implemented\n")); VM_OBJECT_WLOCK(object); @@ -703,8 +700,8 @@ int vnode_pager_local_getpages(struct vop_getpages_args *ap) { - return (vnode_pager_local_getpages0(ap->a_vp, ap->a_m, ap->a_count, - ap->a_reqpage, NULL, NULL)); + return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m, ap->a_count, + NULL, NULL)); } int @@ -711,42 +708,10 @@ int vnode_pager_local_getpages_async(struct vop_getpages_async_args *ap) { - return (vnode_pager_local_getpages0(ap->a_vp, ap->a_m, ap->a_count, - ap->a_reqpage, ap->a_iodone, ap->a_arg)); + return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m, ap->a_count, + ap->a_iodone, ap->a_arg)); } -static int -vnode_pager_local_getpages0(struct vnode *vp, vm_page_t *m, int bytecount, - int reqpage, vop_getpages_iodone_t iodone, void *arg) -{ - vm_page_t mreq; - - mreq = m[reqpage]; - - /* - * Since the caller has busied the requested page, that page's valid - * field will not be changed by other threads. - */ - vm_page_assert_xbusied(mreq); - - /* - * The requested page has valid blocks. Invalid part can only - * exist at the end of file, and the page is made fully valid - * by zeroing in vm_pager_get_pages(). Free non-requested - * pages, since no i/o is done to read its content. - */ - if (mreq->valid != 0) { - vm_pager_free_nonreq(mreq->object, m, reqpage, - round_page(bytecount) / PAGE_SIZE, FALSE); - if (iodone != NULL) - iodone(arg, m, reqpage, 0); - return (VM_PAGER_OK); - } - - return (vnode_pager_generic_getpages(vp, m, bytecount, reqpage, - iodone, arg)); -} - /* * This is now called from local media FS's to operate against their * own vnodes if they fail to implement VOP_GETPAGES. 
@@ -753,29 +718,31 @@ vnode_pager_local_getpages_async(struct vop_getpag */ int vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *m, int bytecount, - int reqpage, vop_getpages_iodone_t iodone, void *arg) + vop_getpages_iodone_t iodone, void *arg) { vm_object_t object; off_t foff; - int i, j, size, bsize, first, *freecnt; - daddr_t firstaddr, reqblock; + int error, count, bsize, i, after, secmask, *freecnt; + daddr_t reqblock; struct bufobj *bo; - int runpg; - int runend; struct buf *bp; - int count; - int error; - object = vp->v_object; - count = bytecount / PAGE_SIZE; + KASSERT(vp->v_type != VCHR && vp->v_type != VBLK, + ("%s does not support devices", __func__)); + KASSERT(bytecount > 0 && (bytecount & ~PAGE_MASK) == bytecount, + ("%s: bytecount %d", __func__, bytecount)); - KASSERT(vp->v_type != VCHR && vp->v_type != VBLK, - ("vnode_pager_generic_getpages does not support devices")); if (vp->v_iflag & VI_DOOMED) return VM_PAGER_BAD; + object = vp->v_object; + foff = IDX_TO_OFF(m[0]->pindex); + + KASSERT(foff < object->un_pager.vnp.vnp_size, + ("%s: page %p offset beyond vp %p size", __func__, m[0], vp)); + + count = bytecount >> PAGE_SHIFT; bsize = vp->v_mount->mnt_stat.f_iosize; - foff = IDX_TO_OFF(m[reqpage]->pindex); /* * Synchronous and asynchronous paging operations use different @@ -794,172 +761,58 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_ * If the file system doesn't support VOP_BMAP, use old way of * getting pages via VOP_READ. */ - error = VOP_BMAP(vp, foff / bsize, &bo, &reqblock, NULL, NULL); + error = VOP_BMAP(vp, foff / bsize, &bo, &reqblock, &after, NULL); if (error == EOPNOTSUPP) { relpbuf(bp, freecnt); VM_OBJECT_WLOCK(object); - for (i = 0; i < count; i++) - if (i != reqpage) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - PCPU_INC(cnt.v_vnodein); - PCPU_INC(cnt.v_vnodepgsin); - error = vnode_pager_input_old(object, m[reqpage]); + for (i = 0; i < count; i++) { + PCPU_INC(cnt.v_vnodein); + PCPU_INC(cnt.v_vnodepgsin); + error = vnode_pager_input_old(object, m[i]); + if (error) + break; + } VM_OBJECT_WUNLOCK(object); return (error); } else if (error != 0) { relpbuf(bp, freecnt); - vm_pager_free_nonreq(object, m, reqpage, count, FALSE); return (VM_PAGER_ERROR); - - /* - * if the blocksize is smaller than a page size, then use - * special small filesystem code. NFS sometimes has a small - * blocksize, but it can handle large reads itself. - */ - } else if ((PAGE_SIZE / bsize) > 1 && - (vp->v_mount->mnt_stat.f_type != nfs_mount_type)) { - relpbuf(bp, freecnt); - vm_pager_free_nonreq(object, m, reqpage, count, FALSE); - PCPU_INC(cnt.v_vnodein); - PCPU_INC(cnt.v_vnodepgsin); - return vnode_pager_input_smlfs(object, m[reqpage]); } /* - * Since the caller has busied the requested page, that page's valid - * field will not be changed by other threads. + * If the blocksize is smaller than a page size, then use + * special small filesystem code. NFS sometimes has a small + * blocksize, but it can handle large reads itself. */ - vm_page_assert_xbusied(m[reqpage]); - - /* - * If we have a completely valid page available to us, we can - * clean up and return. Otherwise we have to re-read the - * media. 
- */ - if (m[reqpage]->valid == VM_PAGE_BITS_ALL) { + if ((PAGE_SIZE / bsize) > 1 && + (vp->v_mount->mnt_stat.f_type != nfs_mount_type)) { relpbuf(bp, freecnt); - vm_pager_free_nonreq(object, m, reqpage, count, FALSE); - return (VM_PAGER_OK); - } else if (reqblock == -1) { - relpbuf(bp, freecnt); - pmap_zero_page(m[reqpage]); - KASSERT(m[reqpage]->dirty == 0, - ("vnode_pager_generic_getpages: page %p is dirty", m)); - VM_OBJECT_WLOCK(object); - m[reqpage]->valid = VM_PAGE_BITS_ALL; - vm_pager_free_nonreq(object, m, reqpage, count, TRUE); - VM_OBJECT_WUNLOCK(object); - return (VM_PAGER_OK); - } else if (m[reqpage]->valid != 0) { - VM_OBJECT_WLOCK(object); - m[reqpage]->valid = 0; - VM_OBJECT_WUNLOCK(object); - } - - /* - * here on direct device I/O - */ - firstaddr = -1; - - /* - * calculate the run that includes the required page - */ - for (first = 0, i = 0; i < count; i = runend) { - if (vnode_pager_addr(vp, IDX_TO_OFF(m[i]->pindex), &firstaddr, - &runpg) != 0) { - relpbuf(bp, freecnt); - /* The requested page may be out of range. */ - vm_pager_free_nonreq(object, m + i, reqpage - i, - count - i, FALSE); - return (VM_PAGER_ERROR); + for (i = 0; i < count; i++) { + PCPU_INC(cnt.v_vnodein); + PCPU_INC(cnt.v_vnodepgsin); + error = vnode_pager_input_smlfs(object, m[i]); + if (error) + break; } - if (firstaddr == -1) { - VM_OBJECT_WLOCK(object); - if (i == reqpage && foff < object->un_pager.vnp.vnp_size) { - panic("vnode_pager_getpages: unexpected missing page: firstaddr: %jd, foff: 0x%jx%08jx, vnp_size: 0x%jx%08jx", - (intmax_t)firstaddr, (uintmax_t)(foff >> 32), - (uintmax_t)foff, - (uintmax_t) - (object->un_pager.vnp.vnp_size >> 32), - (uintmax_t)object->un_pager.vnp.vnp_size); - } - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - VM_OBJECT_WUNLOCK(object); - runend = i + 1; - first = runend; - continue; - } - runend = i + runpg; - if (runend <= reqpage) { - VM_OBJECT_WLOCK(object); - for (j = i; j < runend; j++) { - vm_page_lock(m[j]); - vm_page_free(m[j]); - vm_page_unlock(m[j]); - } - VM_OBJECT_WUNLOCK(object); - } else { - if (runpg < (count - first)) { - VM_OBJECT_WLOCK(object); - for (i = first + runpg; i < count; i++) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - VM_OBJECT_WUNLOCK(object); - count = first + runpg; - } - break; - } - first = runend; + return (error); } /* - * the first and last page have been calculated now, move input pages - * to be zero based... + * Truncate bytecount to vnode real size and round up physical size + * for real devices. */ - if (first != 0) { - m += first; - count -= first; - reqpage -= first; - } + if ((foff + bytecount) > object->un_pager.vnp.vnp_size) + bytecount = object->un_pager.vnp.vnp_size - foff; + secmask = bo->bo_bsize - 1; + KASSERT(secmask < PAGE_SIZE && secmask > 0, + ("%s: sector size %d too large", __func__, secmask + 1)); + bytecount = (bytecount + secmask) & ~secmask; /* - * calculate the file virtual address for the transfer + * And map the pages to be read into the kva, if the filesystem + * requires mapped buffers. */ - foff = IDX_TO_OFF(m[0]->pindex); - - /* - * calculate the size of the transfer - */ - size = count * PAGE_SIZE; - KASSERT(count > 0, ("zero count")); - if ((foff + size) > object->un_pager.vnp.vnp_size) - size = object->un_pager.vnp.vnp_size - foff; - KASSERT(size > 0, ("zero size")); - - /* - * round up physical size for real devices. 
- */ - if (1) { - int secmask = bo->bo_bsize - 1; - KASSERT(secmask < PAGE_SIZE && secmask > 0, - ("vnode_pager_generic_getpages: sector size %d too large", - secmask + 1)); - size = (size + secmask) & ~secmask; - } - bp->b_kvaalloc = bp->b_data; - - /* - * and map the pages to be read into the kva, if the filesystem - * requires mapped buffers. - */ if ((vp->v_mount->mnt_kern_flag & MNTK_UNMAPPED_BUFS) != 0 && unmapped_buf_allowed) { bp->b_data = unmapped_buf; @@ -969,38 +822,33 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_ } else pmap_qenter((vm_offset_t)bp->b_kvaalloc, m, count); - /* build a minimal buffer header */ + /* Build a minimal buffer header. */ bp->b_iocmd = BIO_READ; KASSERT(bp->b_rcred == NOCRED, ("leaking read ucred")); KASSERT(bp->b_wcred == NOCRED, ("leaking write ucred")); bp->b_rcred = crhold(curthread->td_ucred); bp->b_wcred = crhold(curthread->td_ucred); - bp->b_blkno = firstaddr; + bp->b_blkno = reqblock + ((foff % bsize) / DEV_BSIZE); pbgetbo(bo, bp); bp->b_vp = vp; - bp->b_bcount = size; - bp->b_bufsize = size; - bp->b_runningbufspace = bp->b_bufsize; + bp->b_bcount = bp->b_bufsize = bp->b_runningbufspace = bytecount; for (i = 0; i < count; i++) bp->b_pages[i] = m[i]; bp->b_npages = count; - bp->b_pager.pg_reqpage = reqpage; + bp->b_iooffset = dbtob(bp->b_blkno); + atomic_add_long(&runningbufspace, bp->b_runningbufspace); - PCPU_INC(cnt.v_vnodein); PCPU_ADD(cnt.v_vnodepgsin, count); - /* do the input */ - bp->b_iooffset = dbtob(bp->b_blkno); - if (iodone != NULL) { /* async */ - bp->b_pager.pg_iodone = iodone; + bp->b_pgiodone = iodone; bp->b_caller1 = arg; bp->b_iodone = vnode_pager_generic_getpages_done_async; bp->b_flags |= B_ASYNC; BUF_KERNPROC(bp); bstrategy(bp); - /* Good bye! */ + return (0); } else { bp->b_iodone = bdone; bstrategy(bp); @@ -1011,9 +859,8 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_ bp->b_vp = NULL; pbrelbo(bp); relpbuf(bp, &vnode_pbuf_freecnt); + return (error != 0 ? VM_PAGER_ERROR : VM_PAGER_OK); } - - return (error != 0 ? 
VM_PAGER_ERROR : VM_PAGER_OK); } static void @@ -1022,8 +869,7 @@ vnode_pager_generic_getpages_done_async(struct buf int error; error = vnode_pager_generic_getpages_done(bp); - bp->b_pager.pg_iodone(bp->b_caller1, bp->b_pages, - bp->b_pager.pg_reqpage, error); + bp->b_pgiodone(bp->b_caller1, bp->b_pages, bp->b_npages, error); for (int i = 0; i < bp->b_npages; i++) bp->b_pages[i] = NULL; bp->b_vp = NULL; @@ -1089,9 +935,6 @@ vnode_pager_generic_getpages_done(struct buf *bp) object->un_pager.vnp.vnp_size - tfoff)) == 0, ("%s: page %p is dirty", __func__, mt)); } - - if (i != bp->b_pager.pg_reqpage) - vm_page_readahead_finish(mt); } VM_OBJECT_WUNLOCK(object); if (error != 0) Index: sys/vm/vnode_pager.h =================================================================== --- sys/vm/vnode_pager.h (revision 282213) +++ sys/vm/vnode_pager.h (working copy) @@ -41,7 +41,7 @@ #ifdef _KERNEL int vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *m, - int count, int reqpage, vop_getpages_iodone_t iodone, void *arg); + int count, vop_getpages_iodone_t iodone, void *arg); int vnode_pager_generic_putpages(struct vnode *vp, vm_page_t *m, int count, boolean_t sync, int *rtvals); --45Z9DzgjV8m4Oswq-- From owner-freebsd-arch@FreeBSD.ORG Fri May 1 16:56:39 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 652C5FCE for ; Fri, 1 May 2015 16:56:39 +0000 (UTC) Received: from mail-wg0-x231.google.com (mail-wg0-x231.google.com [IPv6:2a00:1450:400c:c00::231]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id E006510E9 for ; Fri, 1 May 2015 16:56:38 +0000 (UTC) Received: by wgin8 with SMTP id n8so95495050wgi.0 for ; Fri, 01 May 2015 09:56:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=9WFq2dRwwItZbk/UUMZi5StodnW2EEnemXORL6yI3zM=; b=YVkMYEAyAzQDL4zlQBkpdYwNTjTB7hZle2usKFwqmUVPBQ9x52trWCYZQHoDs9DpXU xMranw+o9D3UEE3Eo/02SWxmBfPOfnAemLg9c0DdJsijUnuxlalst4DQ13LSKi0VTjge YzdBlIl0fV2zYtuXRnXowVe5hRcHsLVGVKq7VW5SSDBeqIelE5sqJT2TkIU/vWA26vuj G5kKfLj05DcFE74lXOBo6uBhyYGB+3tdNs2w5rmKDq4zsv8xqIShmLILZJxq7ciKvti5 04+MYY72ahMzpzRMfo55xVgdKTvoscUf5/ZM0galDPIB0/T1Yj7Mzv0UjxiytxC5kO48 wcEA== X-Received: by 10.194.248.132 with SMTP id ym4mr20146995wjc.74.1430499397328; Fri, 01 May 2015 09:56:37 -0700 (PDT) Received: from dft-labs.eu (n1x0n-1-pt.tunnel.tserv5.lon1.ipv6.he.net. [2001:470:1f08:1f7::2]) by mx.google.com with ESMTPSA id nb9sm7428478wic.10.2015.05.01.09.56.35 (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Fri, 01 May 2015 09:56:36 -0700 (PDT) Date: Fri, 1 May 2015 18:56:33 +0200 From: Mateusz Guzik To: Bruce Evans Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH 1/2] Generalised support for copy-on-write structures shared by threads. 
Message-ID: <20150501165633.GA7112@dft-labs.eu>
References: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> <1430188443-19413-2-git-send-email-mjguzik@gmail.com> <20150428181802.F1119@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20150428181802.F1119@besplex.bde.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 01 May 2015 16:56:39 -0000

On Tue, Apr 28, 2015 at 06:45:01PM +1000, Bruce Evans wrote:
> On Tue, 28 Apr 2015, Mateusz Guzik wrote:
> >diff --git a/sys/sys/proc.h b/sys/sys/proc.h
> >index 64b99fc..f29d796 100644
> >--- a/sys/sys/proc.h
> >+++ b/sys/sys/proc.h
> >@@ -225,6 +225,7 @@ struct thread {
> >/* Cleared during fork1() */
> >#define td_startzero td_flags
> > int td_flags; /* (t) TDF_* flags. */
> >+ u_int td_cowgeneration;/* (k) Generation of COW pointers. */
> > int td_inhibitors; /* (t) Why can not run. */
> > int td_pflags; /* (k) Private thread (TDP_*) flags. */
> > int td_dupfd; /* (k) Ret value from fdopen. XXX */
>
> This name is so verbose that it messes up the comment indentation.

Yeah, that's crap, but the naming is already inconsistent and verbose.
For instance there is td_generation already. Is the _cowgen variant ok?

> >@@ -830,6 +832,11 @@ extern pid_t pid_max;
> > KASSERT((p)->p_lock == 0, ("process held")); \
> >} while (0)
> >
> >+#define PROC_UPDATE_COW(p) do { \
> >+ PROC_LOCK_ASSERT((p), MA_OWNED); \
> >+ p->p_cowgeneration++; \
>
> Missing parentheses.

Oops, fixed.

> >+} while (0)
> >+
> >/* Check whether a thread is safe to be swapped out. */
> >#define thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP)
> >
> >@@ -976,6 +983,10 @@ struct thread *thread_alloc(int pages);
> >int thread_alloc_stack(struct thread *, int pages);
> >void thread_exit(void) __dead2;
> >void thread_free(struct thread *td);
> >+void thread_get_cow_proc(struct thread *newtd, struct proc *p);
> >+void thread_get_cow(struct thread *newtd, struct thread *td);
> >+void thread_free_cow(struct thread *td);
> >+void thread_update_cow(struct thread *td);
>
> Insertion sort errors.
>
> Namespace errors. I don't like the style of naming things with objects
> first and verbs last, but it is good for sorting related objects. Here
> the verbs "get" and "free" are in the middle of the objects
> "thread_cow_proc" and "thread_cow". Also, shouldn't it be "thread_proc_cow"
> (but less verbose, maybe "tpcow"), not "thread_cow_proc", to indicate
> that the cow is hung off the proc? I didn't notice the details, but it
> makes no sense to hang a proc off a cow :-).
>

Well, all current funcs are named thread_*, so tpcow and the like would
be inconsistent. On another look, the existence of thread_suspend_*
suggests thread_cow_* naming. With this, putting the _proc variant
anywhere but at the end also breaks consistency. 'thread_cow_from_proc'
would increase verbosity.

That said, I would say the patch below is ok enough.
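To recap the scheme for anyone joining the thread: per-process state
that threads cache privately (so far only the credentials) is tagged
with a generation counter; writers bump it under the proc lock, and
each thread lazily resyncs on its next kernel entry. A rough sketch
using the names from the patch below (fragments, not complete
functions):

	/* Writer side: replace a COW-shared structure under the proc
	 * lock; proc_set_cred() bumps p->p_cowgen via PROC_UPDATE_COW(),
	 * invalidating every thread's cached generation. */
	PROC_LOCK(p);
	oldcred = proc_set_cred(p, newcred);
	PROC_UNLOCK(p);
	crfree(oldcred);

	/* Reader side: on kernel entry (cf. the trap.c hunks below) a
	 * plain integer comparison decides whether this thread has to
	 * refresh its private snapshot of the COW pointers. */
	if (td->td_cowgen != p->p_cowgen)
		thread_cow_update(td);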
diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c
index 193d207..cef3221 100644
--- a/sys/amd64/amd64/trap.c
+++ b/sys/amd64/amd64/trap.c
@@ -257,8 +257,8 @@ trap(struct trapframe *frame)
 		td->td_pticks = 0;
 		td->td_frame = frame;
 		addr = frame->tf_rip;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 
 		switch (type) {
 		case T_PRIVINFLT:	/* privileged instruction fault */
diff --git a/sys/arm/arm/trap-v6.c b/sys/arm/arm/trap-v6.c
index abafa86..7463d3c 100644
--- a/sys/arm/arm/trap-v6.c
+++ b/sys/arm/arm/trap-v6.c
@@ -394,8 +394,8 @@ abort_handler(struct trapframe *tf, int prefetch)
 	p = td->td_proc;
 	if (usermode) {
 		td->td_pticks = 0;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 	}
 
 	/* Invoke the appropriate handler, if necessary. */
diff --git a/sys/arm/arm/trap.c b/sys/arm/arm/trap.c
index 0f142ce..d7fb73a 100644
--- a/sys/arm/arm/trap.c
+++ b/sys/arm/arm/trap.c
@@ -214,8 +214,8 @@ abort_handler(struct trapframe *tf, int type)
 	if (user) {
 		td->td_pticks = 0;
 		td->td_frame = tf;
-		if (td->td_ucred != td->td_proc->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != td->td_proc->p_cowgen)
+			thread_cow_update(td);
 	}
 
 	/* Grab the current pcb */
@@ -644,8 +644,8 @@ prefetch_abort_handler(struct trapframe *tf)
 
 	if (TRAP_USERMODE(tf)) {
 		td->td_frame = tf;
-		if (td->td_ucred != td->td_proc->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != td->td_proc->p_cowgen)
+			thread_cow_update(td);
 	}
 	fault_pc = tf->tf_pc;
 	if (td->td_md.md_spinlock_count == 0) {
diff --git a/sys/i386/i386/trap.c b/sys/i386/i386/trap.c
index d783a2b..b118e73 100644
--- a/sys/i386/i386/trap.c
+++ b/sys/i386/i386/trap.c
@@ -306,8 +306,8 @@ trap(struct trapframe *frame)
 		td->td_pticks = 0;
 		td->td_frame = frame;
 		addr = frame->tf_eip;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 
 		switch (type) {
 		case T_PRIVINFLT:	/* privileged instruction fault */
diff --git a/sys/kern/init_main.c b/sys/kern/init_main.c
index b77b788..e0042e9 100644
--- a/sys/kern/init_main.c
+++ b/sys/kern/init_main.c
@@ -522,8 +522,6 @@ proc0_init(void *dummy __unused)
 #ifdef MAC
 	mac_cred_create_swapper(newcred);
 #endif
-	td->td_ucred = crhold(newcred);
-
 	/* Create sigacts. */
 	p->p_sigacts = sigacts_alloc();
 
@@ -555,6 +553,10 @@ proc0_init(void *dummy __unused)
 	p->p_limit->pl_rlimit[RLIMIT_MEMLOCK].rlim_max = pageablemem;
 	p->p_cpulimit = RLIM_INFINITY;
 
+	PROC_LOCK(p);
+	thread_cow_get_proc(td, p);
+	PROC_UNLOCK(p);
+
 	/* Initialize resource accounting structures. */
 	racct_create(&p->p_racct);
 
@@ -842,10 +844,10 @@ create_init(const void *udata __unused)
 	audit_cred_proc1(newcred);
 #endif
 	proc_set_cred(initproc, newcred);
+	cred_update_thread(FIRST_THREAD_IN_PROC(initproc));
 	PROC_UNLOCK(initproc);
 	sx_xunlock(&proctree_lock);
 	crfree(oldcred);
-	cred_update_thread(FIRST_THREAD_IN_PROC(initproc));
 	cpu_set_fork_handler(FIRST_THREAD_IN_PROC(initproc), start_init, NULL);
 }
 SYSINIT(init, SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL);
diff --git a/sys/kern/kern_fork.c b/sys/kern/kern_fork.c
index c3dd792..0dfecff 100644
--- a/sys/kern/kern_fork.c
+++ b/sys/kern/kern_fork.c
@@ -496,7 +496,6 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2,
 	p2->p_swtick = ticks;
 	if (p1->p_flag & P_PROFIL)
 		startprofclock(p2);
-	td2->td_ucred = crhold(p2->p_ucred);
 
 	if (flags & RFSIGSHARE) {
 		p2->p_sigacts = sigacts_hold(p1->p_sigacts);
@@ -526,6 +525,8 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2,
 	 */
 	lim_fork(p1, p2);
 
+	thread_cow_get_proc(td2, p2);
+
 	pstats_fork(p1->p_stats, p2->p_stats);
 
 	PROC_UNLOCK(p1);
diff --git a/sys/kern/kern_kthread.c b/sys/kern/kern_kthread.c
index ee94de0..863bbc6 100644
--- a/sys/kern/kern_kthread.c
+++ b/sys/kern/kern_kthread.c
@@ -289,7 +289,7 @@ kthread_add(void (*func)(void *), void *arg, struct proc *p,
 	cpu_set_fork_handler(newtd, func, arg);
 
 	newtd->td_pflags |= TDP_KTHREAD;
-	newtd->td_ucred = crhold(p->p_ucred);
+	thread_cow_get_proc(newtd, p);
 
 	/* this code almost the same as create_thread() in kern_thr.c */
 	p->p_flag |= P_HADTHREADS;
diff --git a/sys/kern/kern_prot.c b/sys/kern/kern_prot.c
index 9c49f71..b531763 100644
--- a/sys/kern/kern_prot.c
+++ b/sys/kern/kern_prot.c
@@ -1946,9 +1946,8 @@ cred_update_thread(struct thread *td)
 
 	p = td->td_proc;
 	cred = td->td_ucred;
-	PROC_LOCK(p);
+	PROC_LOCK_ASSERT(p, MA_OWNED);
 	td->td_ucred = crhold(p->p_ucred);
-	PROC_UNLOCK(p);
 	if (cred != NULL)
 		crfree(cred);
 }
@@ -1987,6 +1986,8 @@ proc_set_cred(struct proc *p, struct ucred *newcred)
 
 	oldcred = p->p_ucred;
 	p->p_ucred = newcred;
+	if (newcred != NULL)
+		PROC_UPDATE_COW(p);
 	return (oldcred);
 }
diff --git a/sys/kern/kern_syscalls.c b/sys/kern/kern_syscalls.c
index dada746..3d3df01 100644
--- a/sys/kern/kern_syscalls.c
+++ b/sys/kern/kern_syscalls.c
@@ -31,6 +31,8 @@ __FBSDID("$FreeBSD$");
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
diff --git a/sys/kern/kern_thr.c b/sys/kern/kern_thr.c
index 6911bb97..a53bd25 100644
--- a/sys/kern/kern_thr.c
+++ b/sys/kern/kern_thr.c
@@ -228,13 +228,13 @@ create_thread(struct thread *td, mcontext_t *ctx,
 	bcopy(&td->td_startcopy, &newtd->td_startcopy,
 	    __rangeof(struct thread, td_startcopy, td_endcopy));
 	newtd->td_proc = td->td_proc;
-	newtd->td_ucred = crhold(td->td_ucred);
+	thread_cow_get(newtd, td);
 
 	if (ctx != NULL) { /* old way to set user context */
 		error = set_mcontext(newtd, ctx);
 		if (error != 0) {
+			thread_cow_free(newtd);
 			thread_free(newtd);
-			crfree(td->td_ucred);
 			goto fail;
 		}
 	} else {
@@ -246,8 +246,8 @@ create_thread(struct thread *td, mcontext_t *ctx,
 		/* Setup user TLS address and TLS pointer register. */
 		error = cpu_set_user_tls(newtd, tls_base);
 		if (error != 0) {
+			thread_cow_free(newtd);
 			thread_free(newtd);
-			crfree(td->td_ucred);
 			goto fail;
 		}
 	}
diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c
index 0a93dbd..063dfe9 100644
--- a/sys/kern/kern_thread.c
+++ b/sys/kern/kern_thread.c
@@ -324,8 +324,7 @@ thread_reap(void)
 	mtx_unlock_spin(&zombie_lock);
 	while (td_first) {
 		td_next = TAILQ_NEXT(td_first, td_slpq);
-		if (td_first->td_ucred)
-			crfree(td_first->td_ucred);
+		thread_cow_free(td_first);
 		thread_free(td_first);
 		td_first = td_next;
 	}
@@ -381,6 +380,44 @@ thread_free(struct thread *td)
 	uma_zfree(thread_zone, td);
 }
 
+void
+thread_cow_get_proc(struct thread *newtd, struct proc *p)
+{
+
+	PROC_LOCK_ASSERT(p, MA_OWNED);
+	newtd->td_ucred = crhold(p->p_ucred);
+	newtd->td_cowgen = p->p_cowgen;
+}
+
+void
+thread_cow_get(struct thread *newtd, struct thread *td)
+{
+
+	newtd->td_ucred = crhold(td->td_ucred);
+	newtd->td_cowgen = td->td_cowgen;
+}
+
+void
+thread_cow_free(struct thread *td)
+{
+
+	if (td->td_ucred)
+		crfree(td->td_ucred);
+}
+
+void
+thread_cow_update(struct thread *td)
+{
+	struct proc *p;
+
+	p = td->td_proc;
+	PROC_LOCK(p);
+	if (td->td_ucred != p->p_ucred)
+		cred_update_thread(td);
+	td->td_cowgen = p->p_cowgen;
+	PROC_UNLOCK(p);
+}
+
 /*
  * Discard the current thread and exit from its context.
  * Always called with scheduler locked.
@@ -518,7 +555,7 @@ thread_wait(struct proc *p)
 	cpuset_rel(td->td_cpuset);
 	td->td_cpuset = NULL;
 	cpu_thread_clean(td);
-	crfree(td->td_ucred);
+	thread_cow_free(td);
 	thread_reap();	/* check for zombie threads etc. */
 }
diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c
index 1bf78b8..070ba28 100644
--- a/sys/kern/subr_syscall.c
+++ b/sys/kern/subr_syscall.c
@@ -61,8 +61,8 @@ syscallenter(struct thread *td, struct syscall_args *sa)
 
 	p = td->td_proc;
 	td->td_pticks = 0;
-	if (td->td_ucred != p->p_ucred)
-		cred_update_thread(td);
+	if (td->td_cowgen != p->p_cowgen)
+		thread_cow_update(td);
 	if (p->p_flag & P_TRACED) {
 		traced = 1;
 		PROC_LOCK(p);
diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c
index 93f7557..e5e55dd 100644
--- a/sys/kern/subr_trap.c
+++ b/sys/kern/subr_trap.c
@@ -213,8 +213,8 @@ ast(struct trapframe *framep)
 	thread_unlock(td);
 	PCPU_INC(cnt.v_trap);
 
-	if (td->td_ucred != p->p_ucred)
-		cred_update_thread(td);
+	if (td->td_cowgen != p->p_cowgen)
+		thread_cow_update(td);
 	if (td->td_pflags & TDP_OWEUPC && p->p_flag & P_PROFIL) {
 		addupc_task(td, td->td_profil_addr, td->td_profil_ticks);
 		td->td_profil_ticks = 0;
diff --git a/sys/powerpc/powerpc/trap.c b/sys/powerpc/powerpc/trap.c
index 0ceb170..bfbd94d 100644
--- a/sys/powerpc/powerpc/trap.c
+++ b/sys/powerpc/powerpc/trap.c
@@ -196,8 +196,8 @@ trap(struct trapframe *frame)
 	if (user) {
 		td->td_pticks = 0;
 		td->td_frame = frame;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 
 		/* User Mode Traps */
 		switch (type) {
diff --git a/sys/sparc64/sparc64/trap.c b/sys/sparc64/sparc64/trap.c
index b4f0e27..e9917e5 100644
--- a/sys/sparc64/sparc64/trap.c
+++ b/sys/sparc64/sparc64/trap.c
@@ -277,8 +277,8 @@ trap(struct trapframe *tf)
 		td->td_pticks = 0;
 		td->td_frame = tf;
 		addr = tf->tf_tpc;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 
 		switch (tf->tf_type) {
 		case T_DATA_MISS:
diff --git a/sys/sys/proc.h b/sys/sys/proc.h
index 64b99fc..5033957 100644
--- a/sys/sys/proc.h
+++ b/sys/sys/proc.h
@@ -225,6 +225,7 @@ struct thread {
 /* Cleared during fork1() */
 #define	td_startzero td_flags
 	int		td_flags;	/* (t) TDF_* flags. */
+	u_int		td_cowgen;	/* (k) Generation of COW pointers. */
 	int		td_inhibitors;	/* (t) Why can not run. */
 	int		td_pflags;	/* (k) Private thread (TDP_*) flags. */
 	int		td_dupfd;	/* (k) Ret value from fdopen. XXX */
@@ -531,6 +532,7 @@ struct proc {
 	pid_t		p_oppid;	/* (c + e) Save ppid in ptrace. XXX */
 	struct vmspace	*p_vmspace;	/* (b) Address space. */
 	u_int		p_swtick;	/* (c) Tick when swapped in or out. */
+	u_int		p_cowgen;	/* (c) Generation of COW pointers. */
 	struct itimerval p_realtimer;	/* (c) Alarm timer. */
 	struct rusage	p_ru;		/* (a) Exit information. */
 	struct rusage_ext p_rux;	/* (cu) Internal resource usage. */
@@ -830,6 +832,11 @@ extern pid_t pid_max;
 	KASSERT((p)->p_lock == 0, ("process held")); \
 } while (0)
 
+#define	PROC_UPDATE_COW(p) do {						\
+	PROC_LOCK_ASSERT((p), MA_OWNED);				\
+	(p)->p_cowgen++;						\
+} while (0)
+
 /* Check whether a thread is safe to be swapped out. */
 #define	thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP)
 
@@ -974,6 +981,10 @@ void cpu_thread_swapin(struct thread *);
 void	cpu_thread_swapout(struct thread *);
 struct thread *thread_alloc(int pages);
 int	thread_alloc_stack(struct thread *, int pages);
+void	thread_cow_get_proc(struct thread *newtd, struct proc *p);
+void	thread_cow_get(struct thread *newtd, struct thread *td);
+void	thread_cow_free(struct thread *td);
+void	thread_cow_update(struct thread *td);
 void	thread_exit(void) __dead2;
 void	thread_free(struct thread *td);
 void	thread_link(struct thread *td, struct proc *p);

-- 
Mateusz Guzik