From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 18:02:45 2015
Date: Sun, 26 Apr 2015 13:04:00 -0500
From: Jason Harmening
To: Konstantin Belousov
Cc: Svatopluk Kraus, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On 04/25/15 15:14, Konstantin Belousov wrote:
> On Sat, Apr 25, 2015 at 01:47:07PM -0500, Jason Harmening wrote:
>> On 04/25/15 13:18, Konstantin Belousov wrote:
>>> On Sat, Apr 25, 2015 at 12:55:13PM -0500, Jason Harmening wrote:
>>>> Ah, that looks much better.  A few things though:
>>>> 1) _bus_dmamap_load_ma (note the underscore) is still part of the MI/MD
>>>> interface, which we tell drivers not to use.
>>>> It looks like it's implemented for every arch though.  Should there
>>>> be a public and documented bus_dmamap_load_ma ?
>>> Might be yes.  But at least one consumer of the KPI must appear before
>>> the facility is introduced.
>> Could some of the GART/GTT code consume that?
> Do you mean, by GEM/GTT code ?  Indeed, this is interesting and probably
> workable suggestion.  I thought that I would need to provide a special
> interface from DMAR for the GEM, but your proposal seems to fit.  Still,
> an issue is that the Linux code is structured significantly differently,
> and this code, although isolated, is significantly divergent from the
> upstream.

Yes, GEM/GTT.  I know it would be useful for i915, maybe other drm2
drivers too.

>>>> 3) Using bus_dmamap_load_ma would mean always using physcopy for bounce
>>>> buffers...seems like the sfbufs would slow things down ?
>>> For amd64, sfbufs are a nop, due to the direct map.  But, I doubt that
>>> we can combine bounce buffers and performance in the same sentence.
>> In fact the amd64 implementation of uiomove_fromphys doesn't use sfbufs
>> at all thanks to the direct map.  sparc64 seems to avoid sfbufs as much
>> as possible too.  I don't know what arm64/aarch64 will be able to use.
>> Those seem like the platforms where bounce buffering would be the most
>> likely, along with i386 + PAE.  Bounce buffers might still be used on
>> 32-bit platforms for alignment or for devices with < 32-bit address
>> width, but those are likely to be old and slow anyway.
>>
>> I'm still a bit worried about the slowness of waiting for an sfbuf if
>> one is needed, but in practice that might not be a big issue.

I noticed the following in vm_map_delete, which is called by sys_munmap:

2956                  * Wait for wiring or unwiring of an entry to complete.
2957                  * Also wait for any system wirings to disappear on
2958                  * user maps.
2959                  */
2960                 if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0 ||
2961                     (vm_map_pmap(map) != kernel_pmap &&
2962                     vm_map_entry_system_wired_count(entry) != 0)) {
...
2970                         (void) vm_map_unlock_and_wait(map, 0);

It looks like munmap does wait on wired pages (well, system-wired pages,
not mlock'ed pages).  The system-wire count on the map entry will be
non-zero if vslock()/vm_map_wire(...VM_MAP_WIRE_SYSTEM...) was called on
it.  Does that mean UIO_USERSPACE dmamaps are actually safe from getting
the UVA taken out from under them?  Obviously it doesn't make bcopy safe
to do in the wrong process context, but that seems easily fixable.
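For concreteness, the pattern I have in mind looks roughly like the
untested sketch below -- "sc" and its fields, "callback", and the
single-iovec uio are made-up names for brevity, and error handling is
omitted:

-----------------------------
/* Untested sketch: wire the user buffer with vslock(), then load and
 * sync in the context of the owning process.  Error handling omitted. */
error = vslock(uio->uio_iov[0].iov_base, uio->uio_iov[0].iov_len);
if (error != 0)
        return (error);
error = bus_dmamap_load_uio(sc->dmat, sc->map, uio, callback, sc,
    BUS_DMA_NOWAIT);
/* ... start the transfer, then sleep until the "DMA-finished"
 * interrupt wakes us; we are still in the owning process context. */
bus_dmamap_sync(sc->dmat, sc->map, BUS_DMASYNC_POSTREAD);
bus_dmamap_unload(sc->dmat, sc->map);
vsunlock(uio->uio_iov[0].iov_base, uio->uio_iov[0].iov_len);
-----------------------------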
From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 19:56:58 2015
Date: Sun, 26 Apr 2015 21:56:57 +0200
From: Svatopluk Kraus
To: Jason Harmening
Cc: FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Sat, Apr 25, 2015 at 12:50 AM, Jason Harmening wrote:
> A couple of comments:
>
> --POSTWRITE and POSTREAD are only asynchronous if you call them from an
> asynchronous context.
> For a driver that's already performing DMA operations on usermode memory,
> it seems likely that there's going to be *some* place where you can call
> bus_dmamap_sync() and be guaranteed to be executing in the context of the
> process that owns the memory.  Then a regular bcopy will be safe and
> inexpensive, assuming the pages have been properly vslock-ed/vm_map_wire-d.
> That's usually whatever read/write/ioctl operation spawned the DMA
> transfer in the first place.  So, in those cases can you not just move
> the POSTREAD/POSTWRITE sync from the "DMA-finished" interrupt to the
> d_read/d_write/d_ioctl that waits on the "DMA-finished" interrupt?

Yes, it could be possible in those cases.  However, it implies that the
DMA unload must be moved as well, and, to make it symmetric, the DMA
load too.  Then the DMA driver just programs the hardware, and every
client must do the DMA load, sync, wait for finish, sync, and unload
itself.  So (1) almost the same code will be spread over many places,
and (2) all the resources taken by the DMA load will be held in the
system much longer.

> --physcopyin/physcopyout aren't trivial.  They go through
> uiomove_fromphys, which often uses sfbufs to create temporary KVA
> mappings for the physical pages.  sf_buf_alloc() can sleep (unless
> SFB_NOWAIT is specified, which means it can fail, and which
> uiomove_fromphys does not specify for good reason); that makes it
> unsafe for use in either a threaded interrupt or a filter.  Perhaps the
> physcopyout path could be changed to use pmap_qenter directly in this
> case, but that can still be expensive in terms of TLB shootdowns.

I thought that unmapped buffers are used to save KVA space.  For such
buffers physcopyin/physcopyout must be used already, so if there is some
slowdown, it is already taken into account.  And if that is good enough
for unmapped buffers, it should be good enough for user buffers as well.
I'm not so afraid of TLB shootdowns on ARM.  On the contrary, the
architecture is not DMA cache coherent, so cache maintenance is the main
concern: it must always be done for cached memory, bouncing or not.

> Checking against VM_MIN_KERNEL_ADDRESS seems sketchy; it eliminates the
> chance to use a much-less-expensive bcopy in cases where the sync is
> happening in correct process context.

Right, but it's the simplest solution.

> Context-switching during bus_dmamap_sync() shouldn't be an issue.  In a
> filter interrupt, curproc will be completely arbitrary, but none of
> this stuff should be called in a filter anyway.  Otherwise, if you're
> in a kernel thread (including an ithread), curproc should be whatever
> proc was supplied when the thread was created.  That's usually proc0,
> which only has kernel address space.  IOW, even if a context switch
> happens sometime during bus_dmamap_sync, any pmap check or copy should
> have a consistent and non-arbitrary process context.

That is a correct analysis under the given presumptions.  But why are
you so sure that this stuff should not be done in an interrupt filter?

> I think something like your second solution would be workable to make
> UIO_USERSPACE syncs work in non-interrupt kernel threads, but given all
> the restrictions and extra cost of physcopy, I'm not sure how useful
> that would be.

That, or a KASSERT to catch the bad context.  In fact, the second
solution does not close any door: if the sync is called in the correct
context, bcopy is used anyway, and if it's called in a bad context, some
extra work is done due to physcopyin/physcopyout.

> I do think busdma.9 could at least use a note that bus_dmamap_sync() is
> only safe to call in the context of the owning process for user buffers.

At least for now.  However, I would be unhappy if it stayed that way
forever.

> On Fri, Apr 24, 2015 at 8:13 AM, Svatopluk Kraus wrote:
>>
>> DMA can be done on a client buffer from user address space, for
>> example thru bus_dmamap_load_uio() when uio->uio_segflg is
>> UIO_USERSPACE.
>> Such a client buffer can bounce, and then it must be copied to and
>> from the bounce buffer in bus_dmamap_sync().
>>
>> Current implementations (in all archs) do not take into account that
>> bus_dmamap_sync() is asynchronous for POSTWRITE and POSTREAD in
>> general.  It can be asynchronous for PREWRITE and PREREAD too, for
>> example in driver implementations where DMA client buffer operations
>> are buffered.  In those cases, a simple bcopy() is not correct.
>>
>> Demonstration of the current implementation (x86) is the following:
>>
>> -----------------------------
>> struct bounce_page {
>>         vm_offset_t     vaddr;          /* kva of bounce buffer */
>>         bus_addr_t      busaddr;        /* physical address */
>>         vm_offset_t     datavaddr;      /* kva of client data */
>>         bus_addr_t      dataaddr;       /* client physical address */
>>         bus_size_t      datacount;      /* client data count */
>>         STAILQ_ENTRY(bounce_page) links;
>> };
>>
>> if ((op & BUS_DMASYNC_PREWRITE) != 0) {
>>         while (bpage != NULL) {
>>                 if (bpage->datavaddr != 0) {
>>                         bcopy((void *)bpage->datavaddr,
>>                             (void *)bpage->vaddr,
>>                             bpage->datacount);
>>                 } else {
>>                         physcopyout(bpage->dataaddr,
>>                             (void *)bpage->vaddr,
>>                             bpage->datacount);
>>                 }
>>                 bpage = STAILQ_NEXT(bpage, links);
>>         }
>>         dmat->bounce_zone->total_bounced++;
>> }
>> -----------------------------
>>
>> There are two things:
>>
>> (1) datavaddr is not always the kva of client data; sometimes it can
>> be the uva of client data.
>> (2) bcopy() can be used only if datavaddr is a kva, or when map->pmap
>> is the current pmap.
>>
>> Note that there is an implication for bus_dmamap_load_uio() with
>> uio->uio_segflg set to UIO_USERSPACE: the physical pages used must be
>> in-core and wired.  See "man bus_dma".
>>
>> There is no public interface to check that map->pmap is the current
>> pmap, so one solution is the following:
>>
>> if (bpage->datavaddr >= VM_MIN_KERNEL_ADDRESS) {
>>         bcopy();
>> } else {
>>         physcopy();
>> }
>>
>> If there were a public pmap_is_current(), then another solution would
>> be the following:
>>
>> if ((bpage->datavaddr != 0) && pmap_is_current(map->pmap)) {
>>         bcopy();
>> } else {
>>         physcopy();
>> }
>>
>> The second solution implies that a context switch must not happen
>> during bus_dmamap_sync() called from an interrupt routine.  However,
>> IMO, that is granted.
>>
>> Note that map->pmap should always be kernel_pmap for datavaddr >=
>> VM_MIN_KERNEL_ADDRESS.
>>
>> Comments, different solutions, or objections?
>>
>> Svatopluk Kraus
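To make the proposal concrete: a public pmap_is_current() could be as
small as the following sketch.  This is not an existing MI interface,
just roughly what some pmaps already do internally:

-----------------------------
/* Sketch only: not an existing MI KPI.  True when UVAs in the given
 * pmap are directly dereferencable on the current CPU. */
static inline bool
pmap_is_current(pmap_t pmap)
{

        return (pmap == kernel_pmap ||
            pmap == vmspace_pmap(curthread->td_proc->p_vmspace));
}
-----------------------------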
From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 20:00:33 2015
Date: Sun, 26 Apr 2015 22:00:32 +0200
From: Svatopluk Kraus
To: Konstantin Belousov
Cc: Jason Harmening, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Sat, Apr 25, 2015 at 11:41 AM, Konstantin Belousov wrote:
> On Fri, Apr 24, 2015 at 05:50:15PM -0500, Jason Harmening wrote:
>> A couple of comments:
>> [... same comments as quoted in full above ...]
> UIO_USERSPACE for busdma is absolutely unsafe and cannot be used
> without making the kernel panic.  Even if you wire the userspace
> buffer, there is nothing to prevent another thread in the user process,
> or another process sharing the same address space, from calling
> munmap(2) on the range.

Using vslock() is the method proposed in the bus_dma man page.  IMO, the
function looks complex and can be a big time eater.  However, are you
saying that vslock() does not work for that?  Then for what reason does
that function exist?

> The only safe method to work with userspace regions is to
> vm_fault_quick_hold() them to get a hold on the pages, and then either
> pass the page array down, or remap the pages in the KVA with
> pmap_qenter().

So, even vm_fault_quick_hold() does not keep the user mapping valid?

>> On Fri, Apr 24, 2015 at 8:13 AM, Svatopluk Kraus wrote:
>> [... original posting quoted in full above ...]
From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 20:08:32 2015
Date: Sun, 26 Apr 2015 22:08:31 +0200
From: Svatopluk Kraus
To: Konstantin Belousov
Cc: Jason Harmening, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Sat, Apr 25, 2015 at 7:28 PM, Konstantin Belousov wrote:
> On Sat, Apr 25, 2015 at 12:07:29PM -0500, Jason Harmening wrote:
>> On 04/25/15 11:34, Konstantin Belousov wrote:
>> > I believe UIO_USERSPACE is almost unused, it might be there for some
>> > obscure (and buggy) driver.
>> It may be nearly unused, but we still document it in busdma.9, and we
>> still explicitly check for it when setting the pmap in
>> _bus_dmamap_load_uio.  If it's not safe to use, then it's not OK for
>> us to do that.
>> We need to either a) remove support for it, by adding a failure/KASSERT
>> on UIO_USERSPACE in _bus_dmamap_load_uio() and removing the paragraph
>> on it from busdma.9, or b) make it safe.
>> I'd be in favor of b), because I think it is still valid to support
>> some non-painful way of using DMA with userspace buffers.  Right now,
>> the only safe way to do that seems to be:
>> 1) vm_fault_quick_hold_pages
>> 2) kva_alloc
>> 3) pmap_qenter
>> 4) bus_dmamap_load
> 1. vm_fault_quick_hold
> 2. bus_dmamap_load_ma
>
>> That seems both unnecessarily complex and wasteful of KVA space.
> The above sequence does not need a KVA allocation.

But if the buffer bounces, then some KVA must be allocated temporarily
for physcopyin/physcopyout anyway.

FYI, we are in the following situation on ARM: (1) the DMA is not cache
coherent, and (2) cache maintenance operations are done on virtual
addresses.  That means cache maintenance must be done for cached memory;
moreover, it must be done even for unmapped buffers, so they must be
mapped for it.  Thus it could be of much help if we could use the UVA
for that when the context is correct.
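To spell out the four-step sequence quoted above, an untested sketch
(error handling and teardown mostly omitted; "sc", "uaddr", "len", and
"dma_done_cb" are made-up names):

-----------------------------
/* Untested sketch of: 1) hold the user pages, 2) kva_alloc,
 * 3) pmap_qenter, 4) bus_dmamap_load.  Error handling omitted. */
vm_page_t ma[btoc(MAXPHYS) + 1];
vm_offset_t kva;
int count, error;

count = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
    (vm_offset_t)uaddr, len, VM_PROT_READ | VM_PROT_WRITE,
    ma, nitems(ma));
kva = kva_alloc(ptoa(count));
pmap_qenter(kva, ma, count);
error = bus_dmamap_load(sc->dmat, sc->map,
    (void *)(kva + ((vm_offset_t)uaddr & PAGE_MASK)),
    len, dma_done_cb, sc, BUS_DMA_NOWAIT);
/* ... and pmap_qremove()/kva_free()/vm_page_unhold_pages() on
 * teardown. */
-----------------------------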
From owner-freebsd-arch@FreeBSD.ORG Sun Apr 26 20:30:53 2015
Date: Sun, 26 Apr 2015 13:30:51 -0700
From: Adrian Chadd
To: "freebsd-arch@freebsd.org"
Subject: Re: RFT: numa policy branch

Hi!

Another update:

* updated to recent -HEAD;
* numactl now can set memory policy and cpuset domain information - so
  it's easy to say "this runs in memory domain X and cpu domain Y" in
  one pass with it;
* the locality matrix is now available.

Here's an example from scott's 2x haswell v3, with cluster-on-die
enabled (SLIT-style distances; local access is normalized to 10):

vm.phys_locality:
0: 10 21 31 31
1: 21 10 31 31
2: 31 31 10 21
3: 31 31 21 10

And on the westmere-ex box, with no SLIT table (-1 means no locality
information is available):

vm.phys_locality:
0: -1 -1 -1 -1
1: -1 -1 -1 -1
2: -1 -1 -1 -1
3: -1 -1 -1 -1

* I've tested it on westmere-ex (4x socket), sandybridge, ivybridge,
  haswell v3, and haswell v3 cluster-on-die.
* I've discovered that our implementation of libgomp (from gcc-4.2) is
  very old and doesn't include some of the thread control environment
  variables, grr.
* .. and the gcc libgomp code doesn't have freebsd thread affinity
  routines at all, so I added them to gcc-4.8.

Testing with a local copy of stream - using gcc-4.9 and the updated
libgomp to support thread pinning - shows that yes, it all works as
expected, and yes, for NUMA workloads it's quite a big difference.

I'd appreciate any reviews / testing people are able to provide.  I'm
about at the functionality point where I'd like to submit it for formal
review and try to land it in -HEAD.

-adrian
From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 08:14:59 2015
Date: Mon, 27 Apr 2015 11:14:53 +0300
From: Konstantin Belousov
To: Jason Harmening
Cc: Svatopluk Kraus, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Sun, Apr 26, 2015 at 01:04:00PM -0500, Jason Harmening wrote:
> [... earlier discussion and the vm_map_delete excerpt quoted in full
> above ...]
> It looks like munmap does wait on wired pages (well, system-wired
> pages, not mlock'ed pages).  The system-wire count on the map entry
> will be non-zero if vslock()/vm_map_wire(...VM_MAP_WIRE_SYSTEM...) was
> called on it.
> Does that mean UIO_USERSPACE dmamaps are actually safe from getting
> the UVA taken out from under them?
> Obviously it doesn't make bcopy safe to do in the wrong process
> context, but that seems easily fixable.

vslock() indeed would prevent the unmap, but it also causes very serious
user address space fragmentation.  vslock() carves a map entry covering
the specified region which, for the typical application use of malloced
memory for buffers, could easily fragment the bss into per-page map
entries.  It is not very important for the current vslock() use by the
sysctl code, since apps usually do a bounded number of sysctls at
startup, but it would definitely be an issue if vslock() appeared on the
i/o path.
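What "carves a map entry" means, schematically (an illustration of the
effect, not the actual vm_map_wire() code):

-----------------------------
/* vm_map_wire() clips the containing entry at both ends so the wired
 * count applies only to the requested range.  One anonymous bss entry
 * becomes three after wiring a page-sized buffer inside it:
 *
 *   before: [bss_start ........................... bss_end)
 *   after:  [bss_start, buf)[buf, buf+PAGE_SIZE)[buf+PAGE_SIZE, bss_end)
 *
 * Repeat that for many malloc'ed buffers and the map degenerates into
 * per-page entries. */
-----------------------------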
From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 10:48:51 2015
Date: Mon, 27 Apr 2015 12:48:41 +0200
From: Milan Obuch
To: Adrian Chadd
Cc: freebsd-arch@freebsd.org
Subject: Re: using libgpio to bitbang LCDs!

On Sat, 11 Apr 2015 23:45:55 -0700 Adrian Chadd wrote:

> Hi,
>
> The library source code and a demo program is available here:
>
> https://github.com/erikarn/freebsd-liblcd
>
> It includes the wiring needed to hook the example OLED board up
> (http://www.adafruit.com/products/684) to a Carambola 2 evaluation
> board.
>
> Anything you can get 5v and 5 GPIO pins from will work. (Well, as long
> as there's also libgpio / gpio API support for your device..)
>
> -adrian

Hi,

I downloaded master.zip from github, and now I am trying to modify it
for my hardware - a Raspberry Pi and a TFT display with the ILI9341
chip:
https://learn.adafruit.com/adafruit-pitft-28-inch-resistive-touchscreen-display-raspberry-pi
My original attempt to use SPI failed, probably because the ILI9341 does
not use pure SPI - there is one extension, the DC pin.  I did not manage
to satisfy the chip's timing/bit sequence/whatever, so I would like to
try bit banging.

I found two small issues with the downloaded archive.  First,
freebsd-liblcd-master/src/beastie_ili9340c_320x240/Makefile seems to be
a copy of freebsd-liblcd-master/src/beastie_ssd1351_128x128/Makefile;
I think there should be a difference.  After my fix:

--- beastie_ili9340c_320x240/Makefile	2015-04-23 21:40:10.693847000 +0200
+++ beastie_ssd1351_128x128/Makefile	2015-04-13 00:58:06.000000000 +0200
@@ -3,7 +3,7 @@

 .include

-PROG=beastie_ili9340c_320x240
+PROG=beastie_ssd1351_128x128
 SRCS=main.c
 CFLAGS+=-I../../lib/liblcd
 LDFLAGS+=-L../../lib/liblcd

The second issue is that when doing 'make install', the binaries
produced are installed into the / directory.  While neither is fatal for
me, they are annoyances at least.

I found the following in the source files:

	/* Configured for the carambola 2 board */
	cfg.gpio_unit = 0;
	cfg.pin_cs = 19;
	cfg.pin_rst = 20;
	cfg.pin_dc = 21;
	cfg.pin_sck = 22;
	cfg.pin_mosi = 23;

I see no 'pin_miso' there, so does this mean only unidirectional
communication is used, with no status reading, or are both reading and
writing carried over one pin?
If the former, then all I should do is change the pin numbers in the
excerpt above to ones valid for the Raspberry Pi; if the latter, I can't
use this without more modifications.

Also, I have a simple monochrome display with a PCD8544 chip, which
should use basically the same bus design, so it could be used for this
too, with some modification.

Regards,
Milan

From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 12:13:12 2015
Date: Mon, 27 Apr 2015 14:13:07 +0200
From: Milan Obuch
To: Adrian Chadd
Cc: freebsd-arch@freebsd.org
Subject: Re: using libgpio to bitbang LCDs!

On Mon, 27 Apr 2015 12:48:41 +0200 Milan Obuch wrote:
> [... previous message quoted in full above ...]
I decided just to try it.  I copied
freebsd-liblcd-master/src/beastie_ili9340c_320x240 to
freebsd-liblcd-master/src/beastie_ili9340c_320x240-1, just to keep
things clean (and to be able to revert easily if I screw something up
too much), and changed the configuration lines mentioned above to

	/* Configured for the Raspberry Pi board */
	cfg.gpio_unit = 0;
	cfg.pin_cs = 8;
	cfg.pin_rst = 0;
	cfg.pin_dc = 25;
	cfg.pin_sck = 11;
	cfg.pin_mosi = 10;

and it works, a bit slowly, but that could be expected.  Also note that
there is no reset pin connected on my display, so I put 0 there, which
may not be the best value, but it works.

One more issue here - the picture is turned upside down.  I have four
buttons below the screen, and I need to turn the display to be normally
readable so that they are on top... but this is not hard to solve...

I can use this display now; it is slow, but works.
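For reference, what I understand the library's bit-banged transfer to
boil down to, as an untested sketch against gpio(3) - this is my
reading, not the actual liblcd code; "h" would come from
gpio_open(cfg.gpio_unit), and the cfg pin values are the ones above:

-----------------------------
#include <libgpio.h>

/* Untested sketch: clock one byte out MSB first, the way a bit-banged
 * SPI-with-DC display transfer works. */
static void
lcd_write_byte(gpio_handle_t h, int is_data, uint8_t b)
{
	int i;

	/* DC high selects data, low selects command - the
	 * ILI9341/SSD1351 extension on top of plain SPI. */
	if (is_data)
		gpio_pin_high(h, cfg.pin_dc);
	else
		gpio_pin_low(h, cfg.pin_dc);
	gpio_pin_low(h, cfg.pin_cs);		/* assert chip select */
	for (i = 7; i >= 0; i--) {
		if (b & (1 << i))
			gpio_pin_high(h, cfg.pin_mosi);
		else
			gpio_pin_low(h, cfg.pin_mosi);
		gpio_pin_high(h, cfg.pin_sck);	/* latch on rising edge */
		gpio_pin_low(h, cfg.pin_sck);
	}
	gpio_pin_high(h, cfg.pin_cs);		/* deassert chip select */
}
-----------------------------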
Regards,
Milan

From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 14:21:36 2015
Date: Mon, 27 Apr 2015 16:21:35 +0200
From: Svatopluk Kraus
To: Konstantin Belousov
Cc: Jason Harmening, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Mon, Apr 27, 2015 at 10:14 AM, Konstantin Belousov wrote:
> On Sun, Apr 26, 2015 at 01:04:00PM -0500, Jason Harmening wrote:
>> [... thread quoted in full above ...]
> vslock() indeed would prevent the unmap, but it also causes very
> serious user address space fragmentation.  [...]  It is not very
> important for the current vslock() use by the sysctl code, since apps
> usually do a bounded number of sysctls at startup, but it would
> definitely be an issue if vslock() appeared on the i/o path.

In the scope of this thread, there are two things which must be
fulfilled during DMA operations:

(1) The affected physical pages must be kept in the system at any cost.
That means no swapping and no freeing.
(2) A DMA sync must be doable.  That means the physical pages must be
mapped somewhere, even if only temporarily, when needed.

Point (1) must be fulfilled by the DMA client in whatever way is
suitable for it.  It should not be part of any DMA load or unload
method.  The subject of this thread was meant to be point (2).  I have
no problem that it was extended to point (1) too; in fact, I welcome
that.

However, there are still two proposed solutions here for fixing the
bouncing of user space buffers.

The first solution is very simple: user space buffers are treated like
unmapped ones.
If a mapping is needed, some temporary KVA is used.

The second solution is simple too: if a mapping is needed and the
context is correct, the UVA is used; otherwise, some temporary KVA is
used.  I prefer this solution because, in the cache-non-coherent DMA
case, cache maintenance operations must be performed, and the buffer
must always have a valid mapping for them in the DMA sync.

I think that support for DMA from/to user space buffers is important
for graphics adapters, fast data grabbers, and whatever else needs fast
user process interaction with a device.  IMHO, there is no way to drop
support for it.  Thus some fix for the bouncing must be done in all
archs.

From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 14:46:53 2015
Date: Mon, 27 Apr 2015 09:46:52 -0500
From: Jason Harmening
To: Svatopluk Kraus
Cc: Konstantin Belousov, FreeBSD Arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

> Using vslock() is the method proposed in the bus_dma man page.  IMO,
> the function looks complex and can be a big time eater.  However, are
> you saying that vslock() does not work for that?  Then for what reason
> does that function exist?

There's been some misunderstanding here, I think.  If you use vslock (or
vm_map_wire, which vslock wraps), then the UVAs should be safe from
teardown, and you should be able to use bcopy if you are in the correct
context.  See the post elsewhere in this thread where I dig through the
sys_munmap path and find vm_map_delete waiting on system-wired map
entries.
> > >> >> >> > The only safe method to work with the userspace regions is to >> > vm_fault_quick_hold() them to get hold on the pages, and then either >> > pass pages array down, or remap them in the KVA with pmap_qenter(). >> > >> >> >> So, even vm_fault_quick_hold() does not keep valid user mapping? >> > vm_fault_quick_hold_pages() doesn't do any bookkeeping on UVAs, only the underlying physical pages. That means it is possible for the UVA region to be munmap'ed if vm_fault_quick_hold_pages() has been used. So if you use vm_fault_quick_hold_pages() instead of vslock(), you can't use bus_dmamap_load_uio(UIO_USERSPACE) because that assumes valid UVA mappings. You must instead deal only with the underlying vm_page_t's, which means using _bus_dmamap_load_ma(). Here's my take on it: vslock(), as you mention, is very complex. It not only keeps the physical pages from being swapped out, but it also removes them from page queues (see https://lists.freebsd.org/pipermail/freebsd-current/2015-March/054890.html) and does a lot of bookkeeping on the UVA mappings for those pages. Part of that involves simulating a pagefault, which as kib mentions can lead to a lot of UVA fragmentation. vm_fault_quick_hold_pages() is much cheaper and seems mostly intended for short-term DMA operations. So, you might use vslock() + bus_dmamap_load_uio() for long-duration DMA transfers, like continuous streaming to a circular buffer that could last minutes or longer. Then, the extra cost of the vslock will be amortized over the long time of the transfer, and UVA fragmentation will be less of a concern since you presumably will have a limited number of vslock() calls over the lifetime of the process. Also, you will probably be keeping the DMA map for a long duration anyway, so it should be OK to wait and call bus_dmamap_sync() in the process context. Since vslock() removed the pages from the page queues, there will also be less work for pagedaemon to do during the long transfer. OTOH, vm_fault_quick_hold_pages() + _bus_dmamap_load_ma() seems much better to do for frequent short transfers to widely-varying buffers, such as block I/O. The extra pagedaemon work is inconsequential here, and since the DMA operations are frequent and you may have many in-flight at once, the reduced setup cost and fragmentation are much more important. 
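To make the trade-off above concrete, a short-transfer setup along the lines Jason describes could look roughly like this. It is an editor's sketch only: error handling is minimal, len is assumed to fit in MAXPAGES pages, and the _bus_dmamap_load_ma() contract used here (in/out *segp holding the index of the last segment written, segs pointing at a caller-supplied array) is taken from the tree of that era and should be treated as an assumption rather than documentation.

	#include <sys/param.h>
	#include <sys/errno.h>
	#include <sys/proc.h>
	#include <machine/bus.h>	/* busdma KPI */
	#include <vm/vm.h>
	#include <vm/vm_extern.h>	/* vm_fault_quick_hold_pages() */
	#include <vm/vm_map.h>
	#include <vm/vm_page.h>

	#define	MAXPAGES	btoc(MAXPHYS)

	static int
	load_user_buffer(bus_dma_tag_t tag, bus_dmamap_t map, void *uva,
	    size_t len, bus_dma_segment_t *segs, int *nsegs)
	{
		vm_page_t ma[MAXPAGES];
		int count, error, segp;

		/* Hold the backing pages; no UVA bookkeeping is done. */
		count = vm_fault_quick_hold_pages(
		    &curproc->p_vmspace->vm_map, (vm_offset_t)uva, len,
		    VM_PROT_READ | VM_PROT_WRITE, ma, MAXPAGES);
		if (count == -1)
			return (EFAULT);

		/*
		 * Load by vm_page_t; the UVA may be munmap'ed afterwards
		 * without invalidating the transfer or the later sync.
		 */
		segp = -1;
		error = _bus_dmamap_load_ma(tag, map, ma, len,
		    (vm_offset_t)uva & PAGE_MASK, BUS_DMA_NOWAIT, segs, &segp);
		if (error != 0)
			vm_page_unhold_pages(ma, count);
		else
			*nsegs = segp + 1;
		return (error);
	}

The pages stay held until the driver calls vm_page_unhold_pages() after the transfer and the final bus_dmamap_sync(); the vslock() + bus_dmamap_load_uio(UIO_USERSPACE) pairing would instead wire the region once up front and keep the UVA mapping valid for the lifetime of a long-lived map.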
From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 16:13:07 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D13F2DC2 for ; Mon, 27 Apr 2015 16:13:07 +0000 (UTC) Received: from nm9-vm0.bullet.mail.bf1.yahoo.com (nm9-vm0.bullet.mail.bf1.yahoo.com [98.139.213.154]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 84D691973 for ; Mon, 27 Apr 2015 16:13:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1430151180; bh=fAyR+eXP0AOnpPupWvjSOb6S8LbSoHw4zMZ0WP8zVCE=; h=Date:From:To:Subject:From:Subject; b=UD2o/zIX0TahjIi/5DySmOQW8koiv/JsJBVWyngGn+h/oMWd56yn3GHzPb2dOSCEISNV94blkjYUasArYAHwg10Bi69TFI77rtBK3F8i0P2fXtFs2SJaYoYU/8P1JYF+COsD5uR1w8hXAIoH42KuJ1cnbnPx1kCkZH9619X7c49tLYhnsUfzB9KWiNMpf8icAOrdlb9dTv1spSn3kkW/KhG+btt1cGGn5tn8A+oCaE5w4AHWBPN/CGO3avN+F8enp0xoxgzf/4aZPouqtCHAR/OaWmEdAkg1HvNBtjp/w4Qr6h8rOIdCy7Ezi7kzBhz/xa3XazYMhMdPL1jo34RuBg== Received: from [98.139.170.178] by nm9.bullet.mail.bf1.yahoo.com with NNFMP; 27 Apr 2015 16:13:00 -0000 Received: from [98.139.213.9] by tm21.bullet.mail.bf1.yahoo.com with NNFMP; 27 Apr 2015 16:13:00 -0000 Received: from [127.0.0.1] by smtp109.mail.bf1.yahoo.com with NNFMP; 27 Apr 2015 16:13:00 -0000 X-Yahoo-Newman-Id: 54358.14564.bm@smtp109.mail.bf1.yahoo.com X-Yahoo-Newman-Property: ymail-3 X-YMail-OSG: IL1kViAVM1l9Yr9_15h2mRnP.V_3gpiWMYxO8Jsu2vRRndM jrBWlbJ6ThFsICgFYoGYs66CulABKxfOf.fDbDmaaWcQsEIeMohMCGOPvNXl aAaUnmNhXDq.9qqA_r0Az3HH66YNNi5qOv6zGDtvUzCDsimjOYg5KnzR0_QL BhRomkLuwJmQu9TSQO3xJjATLEifXMk4sCyDWrfhpAUsTGkPoooYBQUs_E7O r6EpjlTnUD1XWxhp0ayrWZaiaRxK_qi5YeyFjAeJRYkWJj6BbiVoV5ay5td6 rUpfD6xhjn552A3IGABY0eNpWfpdOG0mGyGxqaEPw6cqQbJ0TEFHUd1J8wjd eFaWrW37HEWuYPyXIQx0EmN4GfgG5YLU_qeUQ39g8J53HIh4.S4YoFCKayLv 5q0lrZrnEu204dXwEKsNMBUCqud79Mufwl9d9IgcHARBbU_y1l4ubGsTN9J7 YnwEH0pzlgt__pAvhg3jxQEd5JI3e1KcKa0p656E1C9e2DfPAKWrYjI5eAvF 8rzrzOLE1gfeB6GsHFSee1hzmvV5cFjRJ X-Yahoo-SMTP: xcjD0guswBAZaPPIbxpWwLcp9Unf Message-ID: <553E600D.2000405@FreeBSD.org> Date: Mon, 27 Apr 2015 11:13:01 -0500 From: Pedro Giffuni User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Adrian Chadd , FreeBSD-arch list Subject: Re: RFT: numa policy branch Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Apr 2015 16:13:07 -0000 Hello; Well, I figure it may help the effort so I created a GPLv2 patch[1] to add the CPU affinity code to our aging libgomp. It is taken from GCC-pre43 branch, so no idea how well it works and you are basically on your own, but if it doesn't break anything I can commit it later today. Thanks for all the great work on NUMA! Pedro. 
[1] https://people.freebsd.org/~pfg/patches/libgomp-GCCr123494.diff From owner-freebsd-arch@FreeBSD.ORG Mon Apr 27 17:41:35 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 5B302871; Mon, 27 Apr 2015 17:41:35 +0000 (UTC) Received: from mail-ig0-x234.google.com (mail-ig0-x234.google.com [IPv6:2607:f8b0:4001:c05::234]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 27ABC1434; Mon, 27 Apr 2015 17:41:35 +0000 (UTC) Received: by igbhj9 with SMTP id hj9so68529556igb.1; Mon, 27 Apr 2015 10:41:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=6E0lhif+d/w1YUY5jLgZUzZVFKZnLINzX5SImMV2wiY=; b=0WRgKeBm2wADJdK/oLuX6A1R65XcORZ+8wagpH3M1/2dZeXfpIVeoo1a8IeVnoMvPY A3Jt1IbmoHwP3armsMP/PuJX9JOoRI959gu1ltj5O1io6BZpQGT9Zx1ougM6qjhLntuZ twFW8LAwzhGjyRtokiRIwg8pod4gNh5ftNgujmXcZ19fpR44BlgQHYXJlrVhdA9fERbq ODRkWXuATXBj+S32ZOgt3RO3zv+XTXa1+azPGsbT8AxDOsOnfWrbUzGBsZJfrLTjuLpW oeyU8EtUSeQOYlj0z0RCA8eaghgBmYva7EkQ0dQ0Shxm/TAC+St+qvZaO+uXHk2U5z6J A1dQ== MIME-Version: 1.0 X-Received: by 10.107.46.39 with SMTP id i39mr15362517ioo.8.1430156494606; Mon, 27 Apr 2015 10:41:34 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.36.38.133 with HTTP; Mon, 27 Apr 2015 10:41:34 -0700 (PDT) In-Reply-To: <553E600D.2000405@FreeBSD.org> References: <553E600D.2000405@FreeBSD.org> Date: Mon, 27 Apr 2015 10:41:34 -0700 X-Google-Sender-Auth: 6LGBa85RH-o9tUlh_U80M-aykTw Message-ID: Subject: Re: RFT: numa policy branch From: Adrian Chadd To: Pedro Giffuni Cc: FreeBSD-arch list Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Apr 2015 17:41:35 -0000 Hi! Would you mind seeing if we can do the proc bind option too? That's apparently quite popular. 
-a
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 02:34:08 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 488993E0; Tue, 28 Apr 2015 02:34:08 +0000 (UTC) Received: from mail-wg0-x229.google.com (mail-wg0-x229.google.com [IPv6:2a00:1450:400c:c00::229]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id D81F210CC; Tue, 28 Apr 2015 02:34:07 +0000 (UTC) Received: by wgso17 with SMTP id o17so135659293wgs.1; Mon, 27 Apr 2015 19:34:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id; bh=4GG0lUEEMlCsfoUhPKIrpbqLcfVpZ6YgUj9kAX+Q3j4=; b=EsUuRKNEamy+KrisMIA4Ax1PMjk91fzPWGqapdn2m7aeS6sKt87b+fNbrJoCYEDhtH SxmhTlOS6cbhatxldzQA1mMYJRszE9ky50sxnREfrEFlxET0x8A+qDNh80c/MqbI7wMF GT6WQZjiHUMs6N/rDZGI8zicBB7RHq/ZkNmpnN3JDeygB8dVftN1cPupx8e7QPz32IZS 2sf/u9mezRiagVhR9rbWVedYUjrJvLDbBtY+e0one0pnw1HZYGgjshr3hVBNOzr0rYgp JMNud+zoW97SYhZUAbJVXhOmeaNqrT77srFvdsPEweU/L7AmahDpZm1IKxswY4etQW1s AazQ== X-Received: by 10.194.222.197 with SMTP id qo5mr27446540wjc.142.1430188446430; Mon, 27 Apr 2015 19:34:06 -0700 (PDT) Received: from localhost.localdomain (ip-89-102-11-63.net.upcbroadband.cz. [89.102.11.63]) by mx.google.com with ESMTPSA id fo7sm14118352wic.1.2015.04.27.19.34.05 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 19:34:05 -0700 (PDT) From: Mateusz Guzik To: freebsd-arch@freebsd.org Cc: Mateusz Guzik Subject: [PATCH 0/2] generalised cow per-thread structs Date: Tue, 28 Apr 2015 04:34:01 +0200 Message-Id: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> X-Mailer: git-send-email 1.8.3.1 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 02:34:08 -0000 From: Mateusz Guzik

struct ucred is managed per thread as follows: setuid and the like update the pointer in struct proc; on the kernel<->userspace boundary it is checked whether the thread needs updating.

This scheme is useful for other structures as well, so this patch generalises it by introducing a counter which is compared instead. This prevents the introduction of further comparisons as such structures are added.

The first patch just adds convenience funcs and adjusts cred handling to use them. The second patch implements lockless resource limits.

The bigger goal concerns struct filedesc: the plan is to split it into an fd part and a vnode part. The latter is seldom modified, so it could be accessed locklessly, and with further effort we can save some refs/unrefs on vnodes, since we will be sure they cannot go away.

Mateusz Guzik (2):
  Generalised support for copy-on-write structures shared by threads.
  Implement lockless resource limits.
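Distilled to its essence, the change in the patches that follow replaces one boundary-time pointer comparison per per-thread structure with a single generation check. The fragments below are paraphrased from the diffs themselves, not new code:

	/* Before: one comparison per COW'd structure on the boundary. */
	if (td->td_ucred != p->p_ucred)
		cred_update_thread(td);

	/*
	 * After: a single counter, bumped under the proc lock whenever
	 * any COW'd structure is replaced, guards them all.
	 */
	if (td->td_cowgeneration != p->p_cowgeneration)
		thread_update_cow(td);	/* re-caches td_ucred, td_limit, ... */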
contrib/binutils/ld/emultempl/spu_ovl.o | Bin 1432 -> 0 bytes sys/amd64/amd64/trap.c | 4 +- sys/arm/arm/trap-v6.c | 4 +- sys/arm/arm/trap.c | 11 +++-- sys/i386/i386/trap.c | 4 +- sys/kern/imgact_elf.c | 13 +++--- sys/kern/init_main.c | 8 ++-- sys/kern/kern_descrip.c | 24 +++++----- sys/kern/kern_event.c | 6 +-- sys/kern/kern_exec.c | 4 +- sys/kern/kern_fork.c | 7 ++- sys/kern/kern_kthread.c | 2 +- sys/kern/kern_proc.c | 7 +-- sys/kern/kern_prot.c | 5 ++- sys/kern/kern_resource.c | 77 +++++++++++++++++++------------- sys/kern/kern_sig.c | 2 +- sys/kern/kern_syscalls.c | 3 ++ sys/kern/kern_thr.c | 6 +-- sys/kern/kern_thread.c | 49 ++++++++++++++++++-- sys/kern/subr_syscall.c | 4 +- sys/kern/subr_trap.c | 4 +- sys/kern/subr_uio.c | 4 +- sys/kern/sysv_shm.c | 4 +- sys/kern/tty_pts.c | 4 +- sys/kern/uipc_sockbuf.c | 4 +- sys/kern/vfs_vnops.c | 7 ++- sys/powerpc/powerpc/trap.c | 4 +- sys/sparc64/sparc64/trap.c | 4 +- sys/sys/proc.h | 14 +++++- sys/sys/resourcevar.h | 9 ++-- sys/sys/vnode.h | 2 +- sys/vm/swap_pager.c | 4 +- sys/vm/vm_map.c | 14 +++--- sys/vm/vm_mmap.c | 34 +++++++------- sys/vm/vm_pageout.c | 2 +- sys/vm/vm_unix.c | 8 ++-- 36 files changed, 208 insertions(+), 154 deletions(-) delete mode 100644 contrib/binutils/ld/emultempl/spu_ovl.o -- 2.3.6 From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 02:34:11 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 0446647B; Tue, 28 Apr 2015 02:34:11 +0000 (UTC) Received: from mail-wi0-x232.google.com (mail-wi0-x232.google.com [IPv6:2a00:1450:400c:c05::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 7923710CE; Tue, 28 Apr 2015 02:34:10 +0000 (UTC) Received: by wicmx19 with SMTP id mx19so97303024wic.1; Mon, 27 Apr 2015 19:34:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=nbz4ZCJ5M6wNvU2eAyQ4Xkizd8PPaGDx+1u9eMA4IC4=; b=AMaVomoBm71KsxoN4UaBZthsGCSjQbCrMwjQSgPNmgOW0J6fscliQEJ00WMWNO9ZDf tNkIt4n+uv6+7Im7nqyMZRpE8LvV6mm4/bUarhZrXAv9cTxzPBNoRhViYf/mUIGjBPAk rD8UlElTpxpE8IpaKNcmSGqEVd3j/cWYLVJpCqzGt56JY0wIdr0PBDtUR/0de+K8JW6h 0ToT7Y0YnAdKUNEZSOk4ZlT5WEp9MkLNfKP7jmClOdkg2TV+IvM/y48gWgo7hl2V6JiS ZeL0gn3fGnoxf64/f6yKpuoej0Z3WF/9gRdE4X52LkWVFw0KiVzp/JxBN/6Oe7mtILNz q83g== X-Received: by 10.195.11.202 with SMTP id ek10mr27213692wjd.12.1430188449019; Mon, 27 Apr 2015 19:34:09 -0700 (PDT) Received: from localhost.localdomain (ip-89-102-11-63.net.upcbroadband.cz. [89.102.11.63]) by mx.google.com with ESMTPSA id fo7sm14118352wic.1.2015.04.27.19.34.07 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 19:34:08 -0700 (PDT) From: Mateusz Guzik To: freebsd-arch@freebsd.org Cc: Mateusz Guzik Subject: [PATCH 2/2] Implement lockless resource limits. 
Date: Tue, 28 Apr 2015 04:34:03 +0200 Message-Id: <1430188443-19413-3-git-send-email-mjguzik@gmail.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> References: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 02:34:11 -0000 From: Mateusz Guzik Employ the same mechanism which is used to manage per-thread credentials. --- sys/kern/imgact_elf.c | 13 ++++---- sys/kern/kern_descrip.c | 24 +++++++-------- sys/kern/kern_event.c | 6 +--- sys/kern/kern_exec.c | 4 +-- sys/kern/kern_fork.c | 4 +-- sys/kern/kern_proc.c | 7 ++--- sys/kern/kern_resource.c | 77 ++++++++++++++++++++++++++++-------------------- sys/kern/kern_sig.c | 2 +- sys/kern/kern_syscalls.c | 1 + sys/kern/kern_thread.c | 6 ++++ sys/kern/subr_uio.c | 4 +-- sys/kern/sysv_shm.c | 4 +-- sys/kern/tty_pts.c | 4 +-- sys/kern/uipc_sockbuf.c | 4 +-- sys/kern/vfs_vnops.c | 7 ++--- sys/sys/proc.h | 3 +- sys/sys/resourcevar.h | 9 ++++-- sys/sys/vnode.h | 2 +- sys/vm/swap_pager.c | 4 +-- sys/vm/vm_map.c | 14 ++++----- sys/vm/vm_mmap.c | 34 +++++++++++---------- sys/vm/vm_pageout.c | 2 +- sys/vm/vm_unix.c | 8 ++--- 23 files changed, 122 insertions(+), 121 deletions(-) diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c index 39e4df3..ff3a371 100644 --- a/sys/kern/imgact_elf.c +++ b/sys/kern/imgact_elf.c @@ -900,13 +900,17 @@ __CONCAT(exec_, __elfN(imgact))(struct image_params *imgp) * limits after loading the segments since we do * not actually fault in all the segments pages. */ +#ifdef RACCT PROC_LOCK(imgp->proc); - if (data_size > lim_cur(imgp->proc, RLIMIT_DATA) || +#endif + if (data_size > lim_cur(curthread, RLIMIT_DATA) || text_size > maxtsiz || - total_size > lim_cur(imgp->proc, RLIMIT_VMEM) || + total_size > lim_cur(curthread, RLIMIT_VMEM) || racct_set(imgp->proc, RACCT_DATA, data_size) != 0 || racct_set(imgp->proc, RACCT_VMEM, total_size) != 0) { +#ifdef RACCT PROC_UNLOCK(imgp->proc); +#endif return (ENOMEM); } @@ -922,9 +926,8 @@ __CONCAT(exec_, __elfN(imgact))(struct image_params *imgp) * calculation is that it leaves room for the heap to grow to * its maximum allowed size. */ - addr = round_page((vm_offset_t)vmspace->vm_daddr + lim_max(imgp->proc, + addr = round_page((vm_offset_t)vmspace->vm_daddr + lim_max(curthread, RLIMIT_DATA)); - PROC_UNLOCK(imgp->proc); imgp->entry_addr = entry; @@ -1963,7 +1966,7 @@ note_procstat_rlimit(void *arg, struct sbuf *sb, size_t *sizep) sbuf_bcat(sb, &structsize, sizeof(structsize)); PROC_LOCK(p); for (i = 0; i < RLIM_NLIMITS; i++) - lim_rlimit(p, i, &rlim[i]); + lim_rlimit_proc(p, i, &rlim[i]); PROC_UNLOCK(p); sbuf_bcat(sb, rlim, sizeof(rlim)); } diff --git a/sys/kern/kern_descrip.c b/sys/kern/kern_descrip.c index f3f27bf..cc7b276 100644 --- a/sys/kern/kern_descrip.c +++ b/sys/kern/kern_descrip.c @@ -109,7 +109,7 @@ static void fdgrowtable(struct filedesc *fdp, int nfd); static void fdgrowtable_exp(struct filedesc *fdp, int nfd); static void fdunused(struct filedesc *fdp, int fd); static void fdused(struct filedesc *fdp, int fd); -static int getmaxfd(struct proc *p); +static int getmaxfd(struct thread *td); /* Flags for do_dup() */ #define DUP_FIXED 0x1 /* Force fixed allocation. 
*/ @@ -331,16 +331,19 @@ struct getdtablesize_args { int sys_getdtablesize(struct thread *td, struct getdtablesize_args *uap) { - struct proc *p = td->td_proc; +#ifdef RACCT uint64_t lim; +#endif - PROC_LOCK(p); td->td_retval[0] = - min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc); + min((int)lim_cur(td, RLIMIT_NOFILE), maxfilesperproc); +#ifdef RACCT + PROC_LOCK(p); lim = racct_get_limit(td->td_proc, RACCT_NOFILE); PROC_UNLOCK(p); if (lim < td->td_retval[0]) td->td_retval[0] = lim; +#endif return (0); } @@ -785,15 +788,10 @@ kern_fcntl(struct thread *td, int fd, int cmd, intptr_t arg) } static int -getmaxfd(struct proc *p) +getmaxfd(struct thread *td) { - int maxfd; - - PROC_LOCK(p); - maxfd = min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc); - PROC_UNLOCK(p); - return (maxfd); + return (min((int)lim_cur(td, RLIMIT_NOFILE), maxfilesperproc)); } /* @@ -821,7 +819,7 @@ do_dup(struct thread *td, int flags, int old, int new) return (EBADF); if (new < 0) return (flags & DUP_FCNTL ? EINVAL : EBADF); - maxfd = getmaxfd(p); + maxfd = getmaxfd(td); if (new >= maxfd) return (flags & DUP_FCNTL ? EINVAL : EBADF); @@ -1619,7 +1617,7 @@ fdalloc(struct thread *td, int minfd, int *result) if (fdp->fd_freefile > minfd) minfd = fdp->fd_freefile; - maxfd = getmaxfd(p); + maxfd = getmaxfd(td); /* * Search the bitmap for a free descriptor starting at minfd. diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index e01f12c..618a68e 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -747,14 +747,10 @@ sys_kqueue(struct thread *td, struct kqueue_args *uap) p = td->td_proc; cred = td->td_ucred; crhold(cred); - PROC_LOCK(p); - if (!chgkqcnt(cred->cr_ruidinfo, 1, lim_cur(td->td_proc, - RLIMIT_KQUEUES))) { - PROC_UNLOCK(p); + if (!chgkqcnt(cred->cr_ruidinfo, 1, lim_cur(td, RLIMIT_KQUEUES))) { crfree(cred); return (ENOMEM); } - PROC_UNLOCK(p); fdp = p->p_fd; error = falloc(td, &fp, &fd, 0); diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c index 9d893f8..751f153 100644 --- a/sys/kern/kern_exec.c +++ b/sys/kern/kern_exec.c @@ -1061,9 +1061,7 @@ exec_new_vmspace(imgp, sv) /* Allocate a new stack */ if (imgp->stack_sz != 0) { ssiz = trunc_page(imgp->stack_sz); - PROC_LOCK(p); - lim_rlimit(p, RLIMIT_STACK, &rlim_stack); - PROC_UNLOCK(p); + lim_rlimit(curthread, RLIMIT_STACK, &rlim_stack); if (ssiz > rlim_stack.rlim_max) ssiz = rlim_stack.rlim_max; if (ssiz > rlim_stack.rlim_cur) { diff --git a/sys/kern/kern_fork.c b/sys/kern/kern_fork.c index d04c3e3..6cde199 100644 --- a/sys/kern/kern_fork.c +++ b/sys/kern/kern_fork.c @@ -912,10 +912,8 @@ fork1(struct thread *td, int flags, int pages, struct proc **procp, if (error == 0) ok = chgproccnt(td->td_ucred->cr_ruidinfo, 1, 0); else { - PROC_LOCK(p1); ok = chgproccnt(td->td_ucred->cr_ruidinfo, 1, - lim_cur(p1, RLIMIT_NPROC)); - PROC_UNLOCK(p1); + lim_cur(td, RLIMIT_NPROC)); } if (ok) { do_fork(td, flags, newproc, td2, vm2, pdflags); diff --git a/sys/kern/kern_proc.c b/sys/kern/kern_proc.c index 505521d..0708d71 100644 --- a/sys/kern/kern_proc.c +++ b/sys/kern/kern_proc.c @@ -2597,11 +2597,8 @@ sysctl_kern_proc_rlimit(SYSCTL_HANDLER_ARGS) /* * Retrieve limit. 
*/ - if (req->oldptr != NULL) { - PROC_LOCK(p); - lim_rlimit(p, which, &rlim); - PROC_UNLOCK(p); - } + if (req->oldptr != NULL) + lim_rlimit(curthread, which, &rlim); error = SYSCTL_OUT(req, &rlim, sizeof(rlim)); if (error != 0) goto errout; diff --git a/sys/kern/kern_resource.c b/sys/kern/kern_resource.c index dac49cd..bc677dc 100644 --- a/sys/kern/kern_resource.c +++ b/sys/kern/kern_resource.c @@ -560,15 +560,11 @@ ogetrlimit(struct thread *td, register struct ogetrlimit_args *uap) { struct orlimit olim; struct rlimit rl; - struct proc *p; int error; if (uap->which >= RLIM_NLIMITS) return (EINVAL); - p = td->td_proc; - PROC_LOCK(p); - lim_rlimit(p, uap->which, &rl); - PROC_UNLOCK(p); + lim_rlimit(td, uap->which, &rl); /* * XXX would be more correct to convert only RLIM_INFINITY to the @@ -625,7 +621,7 @@ lim_cb(void *arg) } PROC_STATUNLOCK(p); if (p->p_rux.rux_runtime > p->p_cpulimit * cpu_tickrate()) { - lim_rlimit(p, RLIMIT_CPU, &rlim); + lim_rlimit_proc(p, RLIMIT_CPU, &rlim); if (p->p_rux.rux_runtime >= rlim.rlim_max * cpu_tickrate()) { killproc(p, "exceeded maximum CPU limit"); } else { @@ -667,29 +663,21 @@ kern_proc_setrlimit(struct thread *td, struct proc *p, u_int which, limp->rlim_max = RLIM_INFINITY; oldssiz.rlim_cur = 0; - newlim = NULL; + newlim = lim_alloc(); PROC_LOCK(p); - if (lim_shared(p->p_limit)) { - PROC_UNLOCK(p); - newlim = lim_alloc(); - PROC_LOCK(p); - } oldlim = p->p_limit; alimp = &oldlim->pl_rlimit[which]; if (limp->rlim_cur > alimp->rlim_max || limp->rlim_max > alimp->rlim_max) if ((error = priv_check(td, PRIV_PROC_SETRLIMIT))) { PROC_UNLOCK(p); - if (newlim != NULL) - lim_free(newlim); + lim_free(newlim); return (error); } if (limp->rlim_cur > limp->rlim_max) limp->rlim_cur = limp->rlim_max; - if (newlim != NULL) { - lim_copy(newlim, oldlim); - alimp = &newlim->pl_rlimit[which]; - } + lim_copy(newlim, oldlim); + alimp = &newlim->pl_rlimit[which]; switch (which) { @@ -739,11 +727,10 @@ kern_proc_setrlimit(struct thread *td, struct proc *p, u_int which, if (p->p_sysent->sv_fixlimit != NULL) p->p_sysent->sv_fixlimit(limp, which); *alimp = *limp; - if (newlim != NULL) - p->p_limit = newlim; + p->p_limit = newlim; + PROC_UPDATE_COW(p); PROC_UNLOCK(p); - if (newlim != NULL) - lim_free(oldlim); + lim_free(oldlim); if (which == RLIMIT_STACK && /* @@ -793,15 +780,11 @@ int sys_getrlimit(struct thread *td, register struct __getrlimit_args *uap) { struct rlimit rlim; - struct proc *p; int error; if (uap->which >= RLIM_NLIMITS) return (EINVAL); - p = td->td_proc; - PROC_LOCK(p); - lim_rlimit(p, uap->which, &rlim); - PROC_UNLOCK(p); + lim_rlimit(td, uap->which, &rlim); error = copyout(&rlim, uap->rlp, sizeof(struct rlimit)); return (error); } @@ -1172,11 +1155,11 @@ lim_copy(struct plimit *dst, struct plimit *src) * which parameter specifies the index into the rlimit array. */ rlim_t -lim_max(struct proc *p, int which) +lim_max(struct thread *td, int which) { struct rlimit rl; - lim_rlimit(p, which, &rl); + lim_rlimit(td, which, &rl); return (rl.rlim_max); } @@ -1185,11 +1168,11 @@ lim_max(struct proc *p, int which) * The which parameter which specifies the index into the rlimit array */ rlim_t -lim_cur(struct proc *p, int which) +lim_cur(struct thread *td, int which) { struct rlimit rl; - lim_rlimit(p, which, &rl); + lim_rlimit(td, which, &rl); return (rl.rlim_cur); } @@ -1198,7 +1181,23 @@ lim_cur(struct proc *p, int which) * specified by 'which' in the rlimit structure pointed to by 'rlp'. 
*/ void -lim_rlimit(struct proc *p, int which, struct rlimit *rlp) +lim_rlimit(struct thread *td, int which, struct rlimit *rlp) +{ + struct proc *p = td->td_proc; + + MPASS(td == curthread); + KASSERT(which >= 0 && which < RLIM_NLIMITS, + ("request for invalid resource limit")); + *rlp = td->td_limit->pl_rlimit[which]; + if (p->p_sysent->sv_fixlimit != NULL) + p->p_sysent->sv_fixlimit(rlp, which); +} + +/* + * Same as lim_rlimit but can be used with non-curthread. + */ +void +lim_rlimit_proc(struct proc *p, int which, struct rlimit *rlp) { PROC_LOCK_ASSERT(p, MA_OWNED); @@ -1441,3 +1440,17 @@ chgkqcnt(struct uidinfo *uip, int diff, rlim_t max) } return (1); } + +void +lim_update_thread(struct thread *td) +{ + struct proc *p; + struct plimit *lim; + + p = td->td_proc; + lim = td->td_limit; + PROC_LOCK_ASSERT(p, MA_OWNED); + td->td_limit = lim_hold(p->p_limit); + if (lim != NULL) + lim_free(lim); +} diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c index 154c250..07a586f 100644 --- a/sys/kern/kern_sig.c +++ b/sys/kern/kern_sig.c @@ -3304,7 +3304,7 @@ coredump(struct thread *td) * a corefile is truncated instead of not being created, * if it is larger than the limit. */ - limit = (off_t)lim_cur(p, RLIMIT_CORE); + limit = (off_t)lim_cur(td, RLIMIT_CORE); if (limit == 0 || racct_get_available(p, RACCT_CORE) == 0) { PROC_UNLOCK(p); return (EFBIG); diff --git a/sys/kern/kern_syscalls.c b/sys/kern/kern_syscalls.c index 3d3df01..15574be 100644 --- a/sys/kern/kern_syscalls.c +++ b/sys/kern/kern_syscalls.c @@ -33,6 +33,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include #include diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c index df8511b..79e9c50 100644 --- a/sys/kern/kern_thread.c +++ b/sys/kern/kern_thread.c @@ -386,6 +386,7 @@ thread_get_cow_proc(struct thread *newtd, struct proc *p) PROC_LOCK_ASSERT(p, MA_OWNED); newtd->td_ucred = crhold(p->p_ucred); + newtd->td_limit = lim_hold(p->p_limit); newtd->td_cowgeneration = p->p_cowgeneration; } @@ -394,6 +395,7 @@ thread_get_cow(struct thread *newtd, struct thread *td) { newtd->td_ucred = crhold(td->td_ucred); + newtd->td_limit = lim_hold(td->td_limit); newtd->td_cowgeneration = td->td_cowgeneration; } @@ -403,6 +405,8 @@ thread_free_cow(struct thread *td) if (td->td_ucred) crfree(td->td_ucred); + if (td->td_limit) + lim_free(td->td_limit); } void @@ -414,6 +418,8 @@ thread_update_cow(struct thread *td) PROC_LOCK(p); if (td->td_ucred != p->p_ucred) cred_update_thread(td); + if (td->td_limit != p->p_limit) + lim_update_thread(td); td->td_cowgeneration = p->p_cowgeneration; PROC_UNLOCK(p); } diff --git a/sys/kern/subr_uio.c b/sys/kern/subr_uio.c index 87892fd..570298f 100644 --- a/sys/kern/subr_uio.c +++ b/sys/kern/subr_uio.c @@ -409,10 +409,8 @@ copyout_map(struct thread *td, vm_offset_t *addr, size_t sz) /* * Map somewhere after heap in process memory. */ - PROC_LOCK(td->td_proc); *addr = round_page((vm_offset_t)vms->vm_daddr + - lim_max(td->td_proc, RLIMIT_DATA)); - PROC_UNLOCK(td->td_proc); + lim_max(td, RLIMIT_DATA)); /* round size up to page boundry */ size = (vm_size_t)round_page(sz); diff --git a/sys/kern/sysv_shm.c b/sys/kern/sysv_shm.c index 274deda..00e3c0a 100644 --- a/sys/kern/sysv_shm.c +++ b/sys/kern/sysv_shm.c @@ -380,10 +380,8 @@ kern_shmat_locked(struct thread *td, int shmid, const void *shmaddr, * This is just a hint to vm_map_find() about where to * put it. 
*/ - PROC_LOCK(p); attach_va = round_page((vm_offset_t)p->p_vmspace->vm_daddr + - lim_max(p, RLIMIT_DATA)); - PROC_UNLOCK(p); + lim_max(td, RLIMIT_DATA)); } vm_object_reference(shmseg->object); diff --git a/sys/kern/tty_pts.c b/sys/kern/tty_pts.c index 2d1e8fe..fcc9c47 100644 --- a/sys/kern/tty_pts.c +++ b/sys/kern/tty_pts.c @@ -741,7 +741,7 @@ pts_alloc(int fflags, struct thread *td, struct file *fp) PROC_UNLOCK(p); return (EAGAIN); } - ok = chgptscnt(cred->cr_ruidinfo, 1, lim_cur(p, RLIMIT_NPTS)); + ok = chgptscnt(cred->cr_ruidinfo, 1, lim_cur(td, RLIMIT_NPTS)); if (!ok) { racct_sub(p, RACCT_NPTS, 1); PROC_UNLOCK(p); @@ -795,7 +795,7 @@ pts_alloc_external(int fflags, struct thread *td, struct file *fp, PROC_UNLOCK(p); return (EAGAIN); } - ok = chgptscnt(cred->cr_ruidinfo, 1, lim_cur(p, RLIMIT_NPTS)); + ok = chgptscnt(cred->cr_ruidinfo, 1, lim_cur(td, RLIMIT_NPTS)); if (!ok) { racct_sub(p, RACCT_NPTS, 1); PROC_UNLOCK(p); diff --git a/sys/kern/uipc_sockbuf.c b/sys/kern/uipc_sockbuf.c index 88952ed..243450d 100644 --- a/sys/kern/uipc_sockbuf.c +++ b/sys/kern/uipc_sockbuf.c @@ -420,9 +420,7 @@ sbreserve_locked(struct sockbuf *sb, u_long cc, struct socket *so, if (cc > sb_max_adj) return (0); if (td != NULL) { - PROC_LOCK(td->td_proc); - sbsize_limit = lim_cur(td->td_proc, RLIMIT_SBSIZE); - PROC_UNLOCK(td->td_proc); + sbsize_limit = lim_cur(td, RLIMIT_SBSIZE); } else sbsize_limit = RLIM_INFINITY; if (!chgsbsize(so->so_cred->cr_uidinfo, &sb->sb_hiwat, cc, diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c index 01d448e..9db72c3 100644 --- a/sys/kern/vfs_vnops.c +++ b/sys/kern/vfs_vnops.c @@ -2098,19 +2098,18 @@ vn_vget_ino_gen(struct vnode *vp, vn_get_ino_t alloc, void *alloc_arg, int vn_rlimit_fsize(const struct vnode *vp, const struct uio *uio, - const struct thread *td) + struct thread *td) { if (vp->v_type != VREG || td == NULL) return (0); - PROC_LOCK(td->td_proc); if ((uoff_t)uio->uio_offset + uio->uio_resid > - lim_cur(td->td_proc, RLIMIT_FSIZE)) { + lim_cur(td, RLIMIT_FSIZE)) { + PROC_LOCK(td->td_proc); kern_psignal(td->td_proc, SIGXFSZ); PROC_UNLOCK(td->td_proc); return (EFBIG); } - PROC_UNLOCK(td->td_proc); return (0); } diff --git a/sys/sys/proc.h b/sys/sys/proc.h index f29d796..9d58550 100644 --- a/sys/sys/proc.h +++ b/sys/sys/proc.h @@ -247,6 +247,7 @@ struct thread { int td_intr_nesting_level; /* (k) Interrupt recursion. */ int td_pinned; /* (k) Temporary cpu pin count. */ struct ucred *td_ucred; /* (k) Reference to credentials. */ + struct plimit *td_limit; /* (k) Resource limits. */ u_int td_estcpu; /* (t) estimated cpu utilization */ int td_slptick; /* (t) Time at sleep. */ int td_blktick; /* (t) Time spent blocked. */ @@ -497,7 +498,7 @@ struct proc { struct filedesc *p_fd; /* (b) Open files. */ struct filedesc_to_leader *p_fdtol; /* (b) Tracking node */ struct pstats *p_stats; /* (b) Accounting/statistics (CPU). */ - struct plimit *p_limit; /* (c) Process limits. */ + struct plimit *p_limit; /* (c) Resource limits. */ struct callout p_limco; /* (c) Limit callout handle */ struct sigacts *p_sigacts; /* (x) Signal actions, state (CPU). 
*/ diff --git a/sys/sys/resourcevar.h b/sys/sys/resourcevar.h index a07fdf8..426a27a 100644 --- a/sys/sys/resourcevar.h +++ b/sys/sys/resourcevar.h @@ -130,13 +130,14 @@ int kern_proc_setrlimit(struct thread *td, struct proc *p, u_int which, struct plimit *lim_alloc(void); void lim_copy(struct plimit *dst, struct plimit *src); -rlim_t lim_cur(struct proc *p, int which); +rlim_t lim_cur(struct thread *td, int which); void lim_fork(struct proc *p1, struct proc *p2); void lim_free(struct plimit *limp); struct plimit *lim_hold(struct plimit *limp); -rlim_t lim_max(struct proc *p, int which); -void lim_rlimit(struct proc *p, int which, struct rlimit *rlp); +rlim_t lim_max(struct thread *td, int which); +void lim_rlimit(struct thread *td, int which, struct rlimit *rlp); +void lim_rlimit_proc(struct proc *p, int which, struct rlimit *rlp); void ruadd(struct rusage *ru, struct rusage_ext *rux, struct rusage *ru2, struct rusage_ext *rux2); void rucollect(struct rusage *ru, struct rusage *ru2); @@ -156,5 +157,7 @@ void ui_racct_foreach(void (*callback)(struct racct *racct, void *arg2, void *arg3), void *arg2, void *arg3); #endif +void lim_update_thread(struct thread *td); + #endif /* _KERNEL */ #endif /* !_SYS_RESOURCEVAR_H_ */ diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h index d70aa57..4aecd93 100644 --- a/sys/sys/vnode.h +++ b/sys/sys/vnode.h @@ -691,7 +691,7 @@ int vn_rdwr_inchunks(enum uio_rw rw, struct vnode *vp, void *base, struct ucred *active_cred, struct ucred *file_cred, size_t *aresid, struct thread *td); int vn_rlimit_fsize(const struct vnode *vn, const struct uio *uio, - const struct thread *td); + struct thread *td); int vn_stat(struct vnode *vp, struct stat *sb, struct ucred *active_cred, struct ucred *file_cred, struct thread *td); int vn_start_write(struct vnode *vp, struct mount **mpp, int flags); diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c index 55e02c4..bdf55c5 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -222,16 +222,14 @@ swap_reserve_by_cred(vm_ooffset_t incr, struct ucred *cred) mtx_unlock(&sw_dev_mtx); if (res) { - PROC_LOCK(curproc); UIDINFO_VMSIZE_LOCK(uip); if ((overcommit & SWAP_RESERVE_RLIMIT_ON) != 0 && - uip->ui_vmsize + incr > lim_cur(curproc, RLIMIT_SWAP) && + uip->ui_vmsize + incr > lim_cur(curthread, RLIMIT_SWAP) && priv_check(curthread, PRIV_VM_SWAP_NORLIMIT)) res = 0; else uip->ui_vmsize += incr; UIDINFO_VMSIZE_UNLOCK(uip); - PROC_UNLOCK(curproc); if (!res) { mtx_lock(&sw_dev_mtx); swap_reserved -= incr; diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c index b7e668b..225837f 100644 --- a/sys/vm/vm_map.c +++ b/sys/vm/vm_map.c @@ -3421,10 +3421,8 @@ vm_map_stack(vm_map_t map, vm_offset_t addrbos, vm_size_t max_ssize, growsize = sgrowsiz; init_ssize = (max_ssize < growsize) ? 
max_ssize : growsize; vm_map_lock(map); - PROC_LOCK(curproc); - lmemlim = lim_cur(curproc, RLIMIT_MEMLOCK); - vmemlim = lim_cur(curproc, RLIMIT_VMEM); - PROC_UNLOCK(curproc); + lmemlim = lim_cur(curthread, RLIMIT_MEMLOCK); + vmemlim = lim_cur(curthread, RLIMIT_VMEM); if (!old_mlock && map->flags & MAP_WIREFUTURE) { if (ptoa(pmap_wired_count(map->pmap)) + init_ssize > lmemlim) { rv = KERN_NO_SPACE; @@ -3553,12 +3551,10 @@ vm_map_growstack(struct proc *p, vm_offset_t addr) int error; #endif + lmemlim = lim_cur(curthread, RLIMIT_MEMLOCK); + stacklim = lim_cur(curthread, RLIMIT_STACK); + vmemlim = lim_cur(curthread, RLIMIT_VMEM); Retry: - PROC_LOCK(p); - lmemlim = lim_cur(p, RLIMIT_MEMLOCK); - stacklim = lim_cur(p, RLIMIT_STACK); - vmemlim = lim_cur(p, RLIMIT_VMEM); - PROC_UNLOCK(p); vm_map_lock_read(map); diff --git a/sys/vm/vm_mmap.c b/sys/vm/vm_mmap.c index 02634d6..adc7fba 100644 --- a/sys/vm/vm_mmap.c +++ b/sys/vm/vm_mmap.c @@ -325,14 +325,12 @@ sys_mmap(td, uap) * There should really be a pmap call to determine a reasonable * location. */ - PROC_LOCK(td->td_proc); if (addr == 0 || (addr >= round_page((vm_offset_t)vms->vm_taddr) && addr < round_page((vm_offset_t)vms->vm_daddr + - lim_max(td->td_proc, RLIMIT_DATA)))) + lim_max(td, RLIMIT_DATA)))) addr = round_page((vm_offset_t)vms->vm_daddr + - lim_max(td->td_proc, RLIMIT_DATA)); - PROC_UNLOCK(td->td_proc); + lim_max(td, RLIMIT_DATA)); } if (flags & MAP_ANON) { /* @@ -1112,13 +1110,9 @@ vm_mlock(struct proc *proc, struct ucred *cred, const void *addr0, size_t len) if (npages > vm_page_max_wired) return (ENOMEM); map = &proc->p_vmspace->vm_map; - PROC_LOCK(proc); nsize = ptoa(npages + pmap_wired_count(map->pmap)); - if (nsize > lim_cur(proc, RLIMIT_MEMLOCK)) { - PROC_UNLOCK(proc); + if (nsize > lim_cur(curthread, RLIMIT_MEMLOCK)) return (ENOMEM); - } - PROC_UNLOCK(proc); if (npages + vm_cnt.v_wire_count > vm_page_max_wired) return (EAGAIN); #ifdef RACCT @@ -1171,12 +1165,8 @@ sys_mlockall(td, uap) * a hard resource limit, return ENOMEM. 
*/ if (!old_mlock && uap->how & MCL_CURRENT) { - PROC_LOCK(td->td_proc); - if (map->size > lim_cur(td->td_proc, RLIMIT_MEMLOCK)) { - PROC_UNLOCK(td->td_proc); + if (map->size > lim_cur(td, RLIMIT_MEMLOCK)) return (ENOMEM); - } - PROC_UNLOCK(td->td_proc); } #ifdef RACCT PROC_LOCK(td->td_proc); @@ -1551,21 +1541,29 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, size = round_page(size); if (map == &td->td_proc->p_vmspace->vm_map) { +#ifdef RACCT PROC_LOCK(td->td_proc); - if (map->size + size > lim_cur(td->td_proc, RLIMIT_VMEM)) { +#endif + if (map->size + size > lim_cur(td, RLIMIT_VMEM)) { +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif return (ENOMEM); } if (racct_set(td->td_proc, RACCT_VMEM, map->size + size)) { +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif return (ENOMEM); } if (!old_mlock && map->flags & MAP_WIREFUTURE) { if (ptoa(pmap_wired_count(map->pmap)) + size > - lim_cur(td->td_proc, RLIMIT_MEMLOCK)) { + lim_cur(td, RLIMIT_MEMLOCK)) { racct_set_force(td->td_proc, RACCT_VMEM, map->size); +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif return (ENOMEM); } error = racct_set(td->td_proc, RACCT_MEMLOCK, @@ -1573,11 +1571,15 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, if (error != 0) { racct_set_force(td->td_proc, RACCT_VMEM, map->size); +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif return (error); } } +#ifdef RACCT PROC_UNLOCK(td->td_proc); +#endif } /* diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c index 6f50053..8225522 100644 --- a/sys/vm/vm_pageout.c +++ b/sys/vm/vm_pageout.c @@ -1844,7 +1844,7 @@ again: /* * get a limit */ - lim_rlimit(p, RLIMIT_RSS, &rsslim); + lim_rlimit_proc(p, RLIMIT_RSS, &rsslim); limit = OFF_TO_IDX( qmin(rsslim.rlim_cur, rsslim.rlim_max)); diff --git a/sys/vm/vm_unix.c b/sys/vm/vm_unix.c index de9aa78..0e55ddf 100644 --- a/sys/vm/vm_unix.c +++ b/sys/vm/vm_unix.c @@ -83,11 +83,9 @@ sys_obreak(td, uap) int error = 0; boolean_t do_map_wirefuture; - PROC_LOCK(td->td_proc); - datalim = lim_cur(td->td_proc, RLIMIT_DATA); - lmemlim = lim_cur(td->td_proc, RLIMIT_MEMLOCK); - vmemlim = lim_cur(td->td_proc, RLIMIT_VMEM); - PROC_UNLOCK(td->td_proc); + datalim = lim_cur(td, RLIMIT_DATA); + lmemlim = lim_cur(td, RLIMIT_MEMLOCK); + vmemlim = lim_cur(td, RLIMIT_VMEM); do_map_wirefuture = FALSE; new = round_page((vm_offset_t)uap->nsize); -- 2.3.6 From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 02:34:09 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id AD5F13E2; Tue, 28 Apr 2015 02:34:09 +0000 (UTC) Received: from mail-wg0-x22c.google.com (mail-wg0-x22c.google.com [IPv6:2a00:1450:400c:c00::22c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 37B6710CD; Tue, 28 Apr 2015 02:34:09 +0000 (UTC) Received: by wgen6 with SMTP id n6so135281463wge.3; Mon, 27 Apr 2015 19:34:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=+5j929s7xJAmfYn5RKOs1RXBFUo1F4Y4UJayr5y/XkQ=; b=GxsFQ0Xe6qlRHS/55m5t28ft1GyynORaMAC7s6UdjmYILg/Pvbhw8SObetcwtChnT3 KDrTw3kK5VIiICCWdLqSqP4ZYrr32DhU7hxAUwpARWUHtWjGNsDnChauXM6dc4dUWsw1 
mOADaYPvWJSybWxQ8cpU6iE9B/e2h6KU72gZLHD647wxeTei/qDyfTCI+I37juiPdlMJ c6iQrRgAK74Ric7OEB7z33Dion8RXzLtPzlDbn6w2dN1bMrU2iBhoXg/34C7Dw421Oky aMtd9Pv2lhSoRaYuIuCePnFXLVIb5GVAYK+d4T2ikh9kD+pUEA6YXrQWzrXuQmmh8+uY WX8A== X-Received: by 10.194.184.10 with SMTP id eq10mr28223179wjc.147.1430188447676; Mon, 27 Apr 2015 19:34:07 -0700 (PDT) Received: from localhost.localdomain (ip-89-102-11-63.net.upcbroadband.cz. [89.102.11.63]) by mx.google.com with ESMTPSA id fo7sm14118352wic.1.2015.04.27.19.34.06 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 19:34:06 -0700 (PDT) From: Mateusz Guzik To: freebsd-arch@freebsd.org Cc: Mateusz Guzik Subject: [PATCH 1/2] Generalised support for copy-on-write structures shared by threads. Date: Tue, 28 Apr 2015 04:34:02 +0200 Message-Id: <1430188443-19413-2-git-send-email-mjguzik@gmail.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> References: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 02:34:09 -0000 From: Mateusz Guzik Previously td_ucred was managed by comparing it to struct proc's version on kernel<->userspace boundary. Now a dedicated counter is introduced instead which makes it possible to treat more structures this way without adding more tests for the common case (no change). --- sys/amd64/amd64/trap.c | 4 +-- sys/arm/arm/trap-v6.c | 4 +-- sys/arm/arm/trap.c | 11 ++++---- sys/i386/i386/trap.c | 4 +-- sys/kern/init_main.c | 8 +++--- sys/kern/kern_fork.c | 3 ++- sys/kern/kern_kthread.c | 2 +- sys/kern/kern_prot.c | 5 ++-- sys/kern/kern_syscalls.c | 2 ++ sys/kern/kern_thr.c | 6 ++--- sys/kern/kern_thread.c | 43 +++++++++++++++++++++++++++++--- sys/kern/subr_syscall.c | 4 +-- sys/kern/subr_trap.c | 4 +-- sys/powerpc/powerpc/trap.c | 4 +-- sys/sparc64/sparc64/trap.c | 4 +-- sys/sys/proc.h | 11 ++++++++ 17 files changed, 86 insertions(+), 33 deletions(-) diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c index 193d207..1883727 100644 --- a/sys/amd64/amd64/trap.c +++ b/sys/amd64/amd64/trap.c @@ -257,8 +257,8 @@ trap(struct trapframe *frame) td->td_pticks = 0; td->td_frame = frame; addr = frame->tf_rip; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); switch (type) { case T_PRIVINFLT: /* privileged instruction fault */ diff --git a/sys/arm/arm/trap-v6.c b/sys/arm/arm/trap-v6.c index abafa86..f521785 100644 --- a/sys/arm/arm/trap-v6.c +++ b/sys/arm/arm/trap-v6.c @@ -394,8 +394,8 @@ abort_handler(struct trapframe *tf, int prefetch) p = td->td_proc; if (usermode) { td->td_pticks = 0; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); } /* Invoke the appropriate handler, if necessary. 
*/ diff --git a/sys/arm/arm/trap.c b/sys/arm/arm/trap.c index 0f142ce..36faac2 100644 --- a/sys/arm/arm/trap.c +++ b/sys/arm/arm/trap.c @@ -214,9 +214,8 @@ abort_handler(struct trapframe *tf, int type) if (user) { td->td_pticks = 0; td->td_frame = tf; - if (td->td_ucred != td->td_proc->p_ucred) - cred_update_thread(td); - + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); } /* Grab the current pcb */ pcb = td->td_pcb; @@ -644,8 +643,8 @@ prefetch_abort_handler(struct trapframe *tf) if (TRAP_USERMODE(tf)) { td->td_frame = tf; - if (td->td_ucred != td->td_proc->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); } fault_pc = tf->tf_pc; if (td->td_md.md_spinlock_count == 0) { diff --git a/sys/i386/i386/trap.c b/sys/i386/i386/trap.c index d783a2b..41e62db 100644 --- a/sys/i386/i386/trap.c +++ b/sys/i386/i386/trap.c @@ -306,8 +306,8 @@ trap(struct trapframe *frame) td->td_pticks = 0; td->td_frame = frame; addr = frame->tf_eip; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); switch (type) { case T_PRIVINFLT: /* privileged instruction fault */ diff --git a/sys/kern/init_main.c b/sys/kern/init_main.c index b77b788..97e5878 100644 --- a/sys/kern/init_main.c +++ b/sys/kern/init_main.c @@ -522,8 +522,6 @@ proc0_init(void *dummy __unused) #ifdef MAC mac_cred_create_swapper(newcred); #endif - td->td_ucred = crhold(newcred); - /* Create sigacts. */ p->p_sigacts = sigacts_alloc(); @@ -555,6 +553,10 @@ proc0_init(void *dummy __unused) p->p_limit->pl_rlimit[RLIMIT_MEMLOCK].rlim_max = pageablemem; p->p_cpulimit = RLIM_INFINITY; + PROC_LOCK(p); + thread_get_cow_proc(td, p); + PROC_UNLOCK(p); + /* Initialize resource accounting structures. 
*/ racct_create(&p->p_racct); @@ -842,10 +844,10 @@ create_init(const void *udata __unused) audit_cred_proc1(newcred); #endif proc_set_cred(initproc, newcred); + cred_update_thread(FIRST_THREAD_IN_PROC(initproc)); PROC_UNLOCK(initproc); sx_xunlock(&proctree_lock); crfree(oldcred); - cred_update_thread(FIRST_THREAD_IN_PROC(initproc)); cpu_set_fork_handler(FIRST_THREAD_IN_PROC(initproc), start_init, NULL); } SYSINIT(init, SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL); diff --git a/sys/kern/kern_fork.c b/sys/kern/kern_fork.c index c3dd792..d04c3e3 100644 --- a/sys/kern/kern_fork.c +++ b/sys/kern/kern_fork.c @@ -496,7 +496,6 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2, p2->p_swtick = ticks; if (p1->p_flag & P_PROFIL) startprofclock(p2); - td2->td_ucred = crhold(p2->p_ucred); if (flags & RFSIGSHARE) { p2->p_sigacts = sigacts_hold(p1->p_sigacts); @@ -526,6 +525,8 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2, */ lim_fork(p1, p2); + thread_get_cow_proc(td2, p2); + pstats_fork(p1->p_stats, p2->p_stats); PROC_UNLOCK(p1); diff --git a/sys/kern/kern_kthread.c b/sys/kern/kern_kthread.c index ee94de0..0614d89 100644 --- a/sys/kern/kern_kthread.c +++ b/sys/kern/kern_kthread.c @@ -289,7 +289,7 @@ kthread_add(void (*func)(void *), void *arg, struct proc *p, cpu_set_fork_handler(newtd, func, arg); newtd->td_pflags |= TDP_KTHREAD; - newtd->td_ucred = crhold(p->p_ucred); + thread_get_cow_proc(newtd, p); /* this code almost the same as create_thread() in kern_thr.c */ p->p_flag |= P_HADTHREADS; diff --git a/sys/kern/kern_prot.c b/sys/kern/kern_prot.c index 9c49f71..b531763 100644 --- a/sys/kern/kern_prot.c +++ b/sys/kern/kern_prot.c @@ -1946,9 +1946,8 @@ cred_update_thread(struct thread *td) p = td->td_proc; cred = td->td_ucred; - PROC_LOCK(p); + PROC_LOCK_ASSERT(p, MA_OWNED); td->td_ucred = crhold(p->p_ucred); - PROC_UNLOCK(p); if (cred != NULL) crfree(cred); } @@ -1987,6 +1986,8 @@ proc_set_cred(struct proc *p, struct ucred *newcred) oldcred = p->p_ucred; p->p_ucred = newcred; + if (newcred != NULL) + PROC_UPDATE_COW(p); return (oldcred); } diff --git a/sys/kern/kern_syscalls.c b/sys/kern/kern_syscalls.c index dada746..3d3df01 100644 --- a/sys/kern/kern_syscalls.c +++ b/sys/kern/kern_syscalls.c @@ -31,6 +31,8 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include +#include #include #include #include diff --git a/sys/kern/kern_thr.c b/sys/kern/kern_thr.c index d5f1ce6..242e4dd 100644 --- a/sys/kern/kern_thr.c +++ b/sys/kern/kern_thr.c @@ -226,13 +226,13 @@ create_thread(struct thread *td, mcontext_t *ctx, bcopy(&td->td_startcopy, &newtd->td_startcopy, __rangeof(struct thread, td_startcopy, td_endcopy)); newtd->td_proc = td->td_proc; - newtd->td_ucred = crhold(td->td_ucred); + thread_get_cow(newtd, td); if (ctx != NULL) { /* old way to set user context */ error = set_mcontext(newtd, ctx); if (error != 0) { + thread_free_cow(newtd); thread_free(newtd); - crfree(td->td_ucred); goto fail; } } else { @@ -244,8 +244,8 @@ create_thread(struct thread *td, mcontext_t *ctx, /* Setup user TLS address and TLS pointer register. 
*/ error = cpu_set_user_tls(newtd, tls_base); if (error != 0) { + thread_free_cow(newtd); thread_free(newtd); - crfree(td->td_ucred); goto fail; } } diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c index 0a93dbd..df8511b 100644 --- a/sys/kern/kern_thread.c +++ b/sys/kern/kern_thread.c @@ -324,8 +324,7 @@ thread_reap(void) mtx_unlock_spin(&zombie_lock); while (td_first) { td_next = TAILQ_NEXT(td_first, td_slpq); - if (td_first->td_ucred) - crfree(td_first->td_ucred); + thread_free_cow(td_first); thread_free(td_first); td_first = td_next; } @@ -381,6 +380,44 @@ thread_free(struct thread *td) uma_zfree(thread_zone, td); } +void +thread_get_cow_proc(struct thread *newtd, struct proc *p) +{ + + PROC_LOCK_ASSERT(p, MA_OWNED); + newtd->td_ucred = crhold(p->p_ucred); + newtd->td_cowgeneration = p->p_cowgeneration; +} + +void +thread_get_cow(struct thread *newtd, struct thread *td) +{ + + newtd->td_ucred = crhold(td->td_ucred); + newtd->td_cowgeneration = td->td_cowgeneration; +} + +void +thread_free_cow(struct thread *td) +{ + + if (td->td_ucred) + crfree(td->td_ucred); +} + +void +thread_update_cow(struct thread *td) +{ + struct proc *p; + + p = td->td_proc; + PROC_LOCK(p); + if (td->td_ucred != p->p_ucred) + cred_update_thread(td); + td->td_cowgeneration = p->p_cowgeneration; + PROC_UNLOCK(p); +} + /* * Discard the current thread and exit from its context. * Always called with scheduler locked. @@ -518,7 +555,7 @@ thread_wait(struct proc *p) cpuset_rel(td->td_cpuset); td->td_cpuset = NULL; cpu_thread_clean(td); - crfree(td->td_ucred); + thread_free_cow(td); thread_reap(); /* check for zombie threads etc. */ } diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c index 1bf78b8..8fdb828 100644 --- a/sys/kern/subr_syscall.c +++ b/sys/kern/subr_syscall.c @@ -61,8 +61,8 @@ syscallenter(struct thread *td, struct syscall_args *sa) p = td->td_proc; td->td_pticks = 0; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); if (p->p_flag & P_TRACED) { traced = 1; PROC_LOCK(p); diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c index cfc3ed7..e055e54 100644 --- a/sys/kern/subr_trap.c +++ b/sys/kern/subr_trap.c @@ -219,8 +219,8 @@ ast(struct trapframe *framep) thread_unlock(td); PCPU_INC(cnt.v_trap); - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); if (td->td_pflags & TDP_OWEUPC && p->p_flag & P_PROFIL) { addupc_task(td, td->td_profil_addr, td->td_profil_ticks); td->td_profil_ticks = 0; diff --git a/sys/powerpc/powerpc/trap.c b/sys/powerpc/powerpc/trap.c index 0ceb170..007752c 100644 --- a/sys/powerpc/powerpc/trap.c +++ b/sys/powerpc/powerpc/trap.c @@ -196,8 +196,8 @@ trap(struct trapframe *frame) if (user) { td->td_pticks = 0; td->td_frame = frame; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); /* User Mode Traps */ switch (type) { diff --git a/sys/sparc64/sparc64/trap.c b/sys/sparc64/sparc64/trap.c index b4f0e27..54c1ebe 100644 --- a/sys/sparc64/sparc64/trap.c +++ b/sys/sparc64/sparc64/trap.c @@ -277,8 +277,8 @@ trap(struct trapframe *tf) td->td_pticks = 0; td->td_frame = tf; addr = tf->tf_tpc; - if (td->td_ucred != p->p_ucred) - cred_update_thread(td); + if (td->td_cowgeneration != p->p_cowgeneration) + thread_update_cow(td); switch (tf->tf_type) { case T_DATA_MISS: diff --git a/sys/sys/proc.h b/sys/sys/proc.h index 
64b99fc..f29d796 100644 --- a/sys/sys/proc.h +++ b/sys/sys/proc.h @@ -225,6 +225,7 @@ struct thread { /* Cleared during fork1() */ #define td_startzero td_flags int td_flags; /* (t) TDF_* flags. */ + u_int td_cowgeneration;/* (k) Generation of COW pointers. */ int td_inhibitors; /* (t) Why can not run. */ int td_pflags; /* (k) Private thread (TDP_*) flags. */ int td_dupfd; /* (k) Ret value from fdopen. XXX */ @@ -531,6 +532,7 @@ struct proc { pid_t p_oppid; /* (c + e) Save ppid in ptrace. XXX */ struct vmspace *p_vmspace; /* (b) Address space. */ u_int p_swtick; /* (c) Tick when swapped in or out. */ + u_int p_cowgeneration;/* (c) Generation of COW pointers. */ struct itimerval p_realtimer; /* (c) Alarm timer. */ struct rusage p_ru; /* (a) Exit information. */ struct rusage_ext p_rux; /* (cu) Internal resource usage. */ @@ -830,6 +832,11 @@ extern pid_t pid_max; KASSERT((p)->p_lock == 0, ("process held")); \ } while (0) +#define PROC_UPDATE_COW(p) do { \ + PROC_LOCK_ASSERT((p), MA_OWNED); \ + p->p_cowgeneration++; \ +} while (0) + /* Check whether a thread is safe to be swapped out. */ #define thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP) @@ -976,6 +983,10 @@ struct thread *thread_alloc(int pages); int thread_alloc_stack(struct thread *, int pages); void thread_exit(void) __dead2; void thread_free(struct thread *td); +void thread_get_cow_proc(struct thread *newtd, struct proc *p); +void thread_get_cow(struct thread *newtd, struct thread *td); +void thread_free_cow(struct thread *td); +void thread_update_cow(struct thread *td); void thread_link(struct thread *td, struct proc *p); void thread_reap(void); int thread_single(struct proc *p, int how); -- 2.3.6 From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 08:45:16 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E3DFCC1E; Tue, 28 Apr 2015 08:45:15 +0000 (UTC) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 75AEB1B4F; Tue, 28 Apr 2015 08:45:14 +0000 (UTC) Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 06F8D1040193; Tue, 28 Apr 2015 18:45:01 +1000 (AEST) Date: Tue, 28 Apr 2015 18:45:01 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Mateusz Guzik cc: freebsd-arch@freebsd.org, Mateusz Guzik Subject: Re: [PATCH 1/2] Generalised support for copy-on-write structures shared by threads. 
In-Reply-To: <1430188443-19413-2-git-send-email-mjguzik@gmail.com> Message-ID: <20150428181802.F1119@besplex.bde.org> References: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> <1430188443-19413-2-git-send-email-mjguzik@gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=dKqfxopb c=1 sm=1 tr=0 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8 a=1twMqG6x6PHpFXFWjvsA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 08:45:16 -0000 On Tue, 28 Apr 2015, Mateusz Guzik wrote: > diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c > index 193d207..1883727 100644 > --- a/sys/amd64/amd64/trap.c > +++ b/sys/amd64/amd64/trap.c > @@ -257,8 +257,8 @@ trap(struct trapframe *frame) > td->td_pticks = 0; > td->td_frame = frame; > addr = frame->tf_rip; > - if (td->td_ucred != p->p_ucred) > - cred_update_thread(td); > + if (td->td_cowgeneration != p->p_cowgeneration) > + thread_update_cow(td); > > switch (type) { > case T_PRIVINFLT: /* privileged instruction fault */ This seems reasonable, but I don't like verbose names like p_cowgeneration. It is especially bad to abbreviate "copy on write" to "cow" and then spell "generation" in full. "gen" would be a reasonable abbreviation, but "g" goes better with "cow". Old bad names visible in the patch include "thread" instead of "td". "td" is not such a good abbreviation for "thread pointer". > diff --git a/sys/kern/kern_thr.c b/sys/kern/kern_thr.c > index d5f1ce6..242e4dd 100644 > --- a/sys/kern/kern_thr.c > +++ b/sys/kern/kern_thr.c "thread" has too many different spellings. For just file names, there are kern_thr.c and kern_thread.c. For variable names, there is also "t" in "tid". "tid" is the best of all the names mentioned so far. > diff --git a/sys/sys/proc.h b/sys/sys/proc.h > index 64b99fc..f29d796 100644 > --- a/sys/sys/proc.h > +++ b/sys/sys/proc.h > @@ -225,6 +225,7 @@ struct thread { > /* Cleared during fork1() */ > #define td_startzero td_flags > int td_flags; /* (t) TDF_* flags. */ > + u_int td_cowgeneration;/* (k) Generation of COW pointers. */ > int td_inhibitors; /* (t) Why can not run. */ > int td_pflags; /* (k) Private thread (TDP_*) flags. */ > int td_dupfd; /* (k) Ret value from fdopen. XXX */ This name is so verbose that it messes up the comment indentation. > @@ -830,6 +832,11 @@ extern pid_t pid_max; > KASSERT((p)->p_lock == 0, ("process held")); \ > } while (0) > > +#define PROC_UPDATE_COW(p) do { \ > + PROC_LOCK_ASSERT((p), MA_OWNED); \ > + p->p_cowgeneration++; \ Missing parentheses. > +} while (0) > + > /* Check whether a thread is safe to be swapped out. */ > #define thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP) > > @@ -976,6 +983,10 @@ struct thread *thread_alloc(int pages); > int thread_alloc_stack(struct thread *, int pages); > void thread_exit(void) __dead2; > void thread_free(struct thread *td); > +void thread_get_cow_proc(struct thread *newtd, struct proc *p); > +void thread_get_cow(struct thread *newtd, struct thread *td); > +void thread_free_cow(struct thread *td); > +void thread_update_cow(struct thread *td); Insertion sort errors. Namespace errors. I don't like the style of naming things with objects first and verbs last, but it is good for sorting related objects. 
Here the verbs "get" and "free" are in the middle of the objects "thread_cow_proc" and "thread_cow". Also, shouldn't it be "thread_proc_cow" (but less verbose, maybe "tpcow"), not "thread_cow_proc", to indicate that the cow is hung off the proc? I didn't notice the details, but it makes no sense to hang a proc off a cow :-).

Bruce

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 13:45:10 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 1A28BD3B for ; Tue, 28 Apr 2015 13:45:10 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id E88CD1FD4 for ; Tue, 28 Apr 2015 13:45:09 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id CF0B4B93C; Tue, 28 Apr 2015 09:45:07 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Cc: Konstantin Belousov , Jason Harmening , Svatopluk Kraus Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Date: Tue, 28 Apr 2015 09:40:33 -0400 Message-ID: <1876382.0PQNo3Rp24@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: <20150425163444.GL2390@kib.kiev.ua> References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 28 Apr 2015 09:45:07 -0400 (EDT) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 13:45:10 -0000

On Saturday, April 25, 2015 07:34:44 PM Konstantin Belousov wrote: > On Sat, Apr 25, 2015 at 09:02:12AM -0500, Jason Harmening wrote: > > It seems like in general it is too hard for drivers using busdma to deal > > with usermode memory in a way that's both safe and efficient: > > --bus_dmamap_load_uio + UIO_USERSPACE is apparently really unsafe > > --if they do things the other way and allocate in the kernel, then > > they had better either be willing to do extra copying, or create and > > refcount their own vm_objects and use d_mmap_single (I still haven't > > seen a good example of that), or leak a bunch of memory (if they use > > d_mmap), because the old device pager is also really unsafe. > munmap(2) does not free the pages, it removes the mapping and dereferences > the backing vm object. If the region was wired, munmap would decrement > the wiring count for the pages. So if kernel code wired the region's > pages, they are kept wired, but no longer mapped into userspace. > So bcopy() still does not work. > > d_mmap_single() is used by GPU drivers, definitely by the GEM and TTM code, and possibly > by the proprietary nvidia driver. Yes, the nvidia driver uses it. I've also used it for some proprietary driver extensions. > I believe UIO_USERSPACE is almost unused, it might be there for some > obscure (and buggy) driver.
I believe it was added (and only ever used) in crypto drivers, and that they all did bus_dma operations in the context of the thread that passed in the uio. I definitely think it is fragile and should be replaced with something more reliable. -- John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 13:45:10 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 011F9D3A; Tue, 28 Apr 2015 13:45:10 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id CE1E01FD3; Tue, 28 Apr 2015 13:45:09 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 87851B95B; Tue, 28 Apr 2015 09:45:08 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Cc: Adrian Chadd , Davide Italiano Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes Date: Tue, 28 Apr 2015 09:35:10 -0400 Message-ID: <1832557.zVusTDjZUx@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: References: MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 28 Apr 2015 09:45:08 -0400 (EDT) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 13:45:10 -0000

On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote: > On 25 April 2015 at 11:18, Davide Italiano wrote: > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote: > >> Hi! > >> > >> I've been doing some NUMA testing on large boxes and I've found that > >> there's lock contention in the ACPI path. It's due to my change a > >> while ago to start using sleep states above ACPI C1 by default. The > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep > >> path that grabs a serialiser lock, and on an 80 thread box this is > >> costly. > >> > >> I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state > >> doesn't require the same register fiddling (to disable bus mastering, > >> if I'm reading it right) and so it doesn't enter that particular > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper > >> CPU sleep state (C6 on each of these). I think it is still a good default > >> for both servers and desktops. > >> > >> If no-one has a problem with this then I'll do it after the weekend. > >> > > > > This sounds to me just a way to hide a problem. > > Very few people nowadays run on NUMA and they can tune the machine as > > they like when they do testing. > > If there's a lock contention problem, it needs to be fixed and not > > hidden under another default. > The lock contention problem is inside ACPI and how it's designed/implemented.
> We're not going to easily be able to make ACPI lock "better" as we're > constrained by how ACPI implements things in the shared ACPICA code. Is the contention actually harmful? Note that this only happens when the CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle stuff uses heuristics to only drop into deeper sleep states if the CPU has recently been idle "more" so that if you are relatively busy you will only go into C1 instead. (I think this latter might have changed since eventtimers came in, it looks like we now choose the idle state based on how long until the next timer interrupt?) If the only consequence of this is that it adds noise to profiling, then hack your profiling results to ignore this lock. I think that is a better tradeoff than sacrificing power gains to reduce noise in profiling output. Alternatively, your machine may be better off using cpu_idle_mwait. There are already CPUs now that only advertise deeper sleep states for use with mwait but not ACPI, so we may certainly end up with defaulting to mwait instead of ACPI for certain CPUs anyway. -- John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 14:13:08 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id ED1AF5A9; Tue, 28 Apr 2015 14:13:08 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 747651300; Tue, 28 Apr 2015 14:13:08 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3SED2uW007586 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 28 Apr 2015 17:13:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3SED2uW007586 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3SED2Ft007585; Tue, 28 Apr 2015 17:13:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 28 Apr 2015 17:13:02 +0300 From: Konstantin Belousov To: John Baldwin Cc: freebsd-arch@freebsd.org, Davide Italiano , Adrian Chadd Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes Message-ID: <20150428141302.GH2390@kib.kiev.ua> References: <1832557.zVusTDjZUx@ralph.baldwin.cx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1832557.zVusTDjZUx@ralph.baldwin.cx> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 14:13:09 -0000

On Tue, Apr 28, 2015 at 09:35:10AM -0400, John Baldwin wrote: > On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote: > > On 25 April 2015 at 11:18, Davide Italiano wrote: > > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote: > > >> Hi! > > >> > > >> I've been doing some NUMA testing on large boxes and I've found that > > >> there's lock contention in the ACPI path. It's due to my change a > > >> while ago to start using sleep states above ACPI C1 by default. The > > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep > > >> path that grabs a serialiser lock, and on an 80 thread box this is > > >> costly. > > >> > > >> I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state > > >> doesn't require the same register fiddling (to disable bus mastering, > > >> if I'm reading it right) and so it doesn't enter that particular > > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge > > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper > > >> CPU sleep state (C6 on each of these). I think it is still a good default > > >> for both servers and desktops. > > >> > > >> If no-one has a problem with this then I'll do it after the weekend. > > >> > > > > > > This sounds to me just a way to hide a problem. > > > Very few people nowadays run on NUMA and they can tune the machine as > > > they like when they do testing. > > > If there's a lock contention problem, it needs to be fixed and not > > > hidden under another default. > > The lock contention problem is inside ACPI and how it's designed/implemented. > > We're not going to easily be able to make ACPI lock "better" as we're > > constrained by how ACPI implements things in the shared ACPICA code. > > Is the contention actually harmful? Note that this only happens when the > CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle > stuff uses heuristics to only drop into deeper sleep states if the CPU has > recently been idle "more" so that if you are relatively busy you will only go > into C1 instead. (I think this latter might have changed since eventtimers > came in, it looks like we now choose the idle state based on how long until > the next timer interrupt?) You have to spin, waiting for other cores, to get the right to reduce the power state. > > If the only consequence of this is that it adds noise to profiling, then hack > your profiling results to ignore this lock. I think that is a better tradeoff > than sacrificing power gains to reduce noise in profiling output. I suspect that it adds latency, since interrupts cannot stop the wait for the ACPI lock. Also, it probably increases the power usage since the CPU has to spend more time contending for the lock instead of sleeping. > > Alternatively, your machine may be better off using cpu_idle_mwait. There > are already CPUs now that only advertise deeper sleep states for use with > mwait but not ACPI, so we may certainly end up with defaulting to mwait > instead of ACPI for certain CPUs anyway. cpu_idle_mwait is quite useless, it only enters C1, which should be almost the same as hlt. mwait for C1 might reduce the latency of waking up, but definitely would not reduce power consumption on par with higher Cx. That said, I think that for non-laptop usage, limiting the lowest state to C2 is fine. For Haswells, Intel's recommendation for BIOS writers is to limit the announced states to C2 (eliminating BM avoidance altogether). Internally, ACPI C2 is mapped to CPU C6 or maybe even C7.
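For reference, the knob being discussed can already be pinned from userland; a typical non-laptop configuration along the proposed lines would be (values illustrative):

	# /etc/rc.conf
	performance_cx_lowest="C2"	# lowest C-state while on AC power
	economy_cx_lowest="C2"		# lowest C-state while on battery

	# equivalently, at runtime:
	# sysctl hw.acpi.cpu.cx_lowest=C2
	# sysctl dev.cpu.0.cx_lowest=C2	(per-CPU)

dev.cpu.N.cx_supported shows which states each CPU announces.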
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 14:47:50 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 5CB0A1BD; Tue, 28 Apr 2015 14:47:50 +0000 (UTC) Received: from mail-ob0-x234.google.com (mail-ob0-x234.google.com [IPv6:2607:f8b0:4003:c01::234]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 22876176C; Tue, 28 Apr 2015 14:47:50 +0000 (UTC) Received: by obbeb7 with SMTP id eb7so109723254obb.3; Tue, 28 Apr 2015 07:47:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type; bh=mKS1Vuc2xwpr85kVpktLCtktynSHmLjCBPWh5fmR2uw=; b=yA+AnwvOvYvgIEwmzNS9YbrhIscZH2N3BLeMBkS0FiunpIbEml04iOyQnd2bdHhSwj C+pnjNx2n8dfhUpCfH3sQFdcEUEtBaFQ99mTX6fy3546Pj4S3Wq2HGqjlKCtJGZl+q4S h6pptbbTS6S0OcmtHny6vV5JCutgygXh5wr+a+cZjhPUxSo3RHAsMyQ3HTSh+u2PJQw1 WzYH4tl2H7hU/8MWkN9afJ2JXe/7LtxJuxE3M2zhILREfSUH8ROG6gQGipPCiPJaxI3s nT/QN5u2eAtgEyfQD/zlRlRsTMZISz7fMGMmetuPQi1N2AdErdSXeLqByaXmMxaNb3IU 6mtQ== X-Received: by 10.202.225.65 with SMTP id y62mr13884239oig.78.1430232469407; Tue, 28 Apr 2015 07:47:49 -0700 (PDT) Received: from corona.austin.rr.com (cpe-72-177-6-10.austin.res.rr.com. [72.177.6.10]) by mx.google.com with ESMTPSA id ph19sm9959705oeb.9.2015.04.28.07.47.48 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 07:47:48 -0700 (PDT) Message-ID: <553F9DE2.5080908@gmail.com> Date: Tue, 28 Apr 2015 09:49:06 -0500 From: Jason Harmening User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: John Baldwin , freebsd-arch@freebsd.org CC: Konstantin Belousov , Svatopluk Kraus Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> <1876382.0PQNo3Rp24@ralph.baldwin.cx> In-Reply-To: <1876382.0PQNo3Rp24@ralph.baldwin.cx> Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="FdFxoWkPFWu4kEMfT2l2KniOuEbJax1dd" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 14:47:50 -0000

On 04/28/15 08:40, John Baldwin wrote: > On Saturday, April 25, 2015 07:34:44 PM Konstantin Belousov wrote: >> On Sat, Apr 25, 2015 at 09:02:12AM -0500, Jason Harmening wrote: >>> It seems like in general it is too hard for drivers using busdma to deal >>> with usermode memory in a way that's both safe and efficient: >>> --bus_dmamap_load_uio + UIO_USERSPACE is apparently really unsafe >>> --if they do things the other way and allocate in the kernel, then >>> they had better either be willing to do extra copying, or create and >>> refcount their own vm_objects and use d_mmap_single (I still haven't >>> seen a good example of that), or leak a bunch of memory (if they use >>> d_mmap), because the old device pager is also really unsafe. >> munmap(2) does not free the pages, it removes the mapping and dereferences >> the backing vm object. If the region was wired, munmap would decrement >> the wiring count for the pages. So if kernel code wired the region's >> pages, they are kept wired, but no longer mapped into userspace. >> So bcopy() still does not work. >> >> d_mmap_single() is used by GPU drivers, definitely by the GEM and TTM code, and possibly >> by the proprietary nvidia driver. > Yes, the nvidia driver uses it. I've also used it for some proprietary > driver extensions. I've seen d_mmap_single() used in the GPU code, but I haven't seen it used in conjunction with busdma (but maybe I'm not looking in the right place). > >> I believe UIO_USERSPACE is almost unused, it might be there for some >> obscure (and buggy) driver. > I believe it was added (and only ever used) in crypto drivers, and that they > all did bus_dma operations in the context of the thread that passed in the > uio. I definitely think it is fragile and should be replaced with something > more reliable. > I think it's useful to make the bounce-buffering logic more robust in cases where it's not executed in the owning process; it's also a really simple set of changes. Of course doing vslock beforehand is still going to be the only safe way to use that API, but that seems reasonable if it's documented and done sparingly (which it is). In the longer term, vm_fault_quick_hold_pages + _bus_dmamap_load_ma is probably better for user buffers, at least for short transfers (which I think is most of them). load_ma needs to at least be made a public and documented KPI though. I'd like to try moving some of the drm2 code to use it once I finally have a reasonably modern machine for testing -current. Either _bus_dmamap_load_ma or out-of-context UIO_USERSPACE bounce buffering could have issues with waiting on sfbufs on some arches, including arm. That could be fixed by making each unmapped bounce buffer set up a kva mapping for the data addr when it's created, but that fix might be worse than the problem it's trying to solve.
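To make that concrete, here is a rough sketch of the wire-then-load pattern (assumptions: a public bus_dmamap_load_ma() wrapper with a bus_dmamap_load()-style callback, which does not exist yet; the unload path is elided):

	#include <sys/param.h>
	#include <sys/proc.h>
	#include <machine/bus.h>
	#include <vm/vm.h>
	#include <vm/vm_extern.h>
	#include <vm/vm_map.h>
	#include <vm/vm_page.h>

	/* Page slots for one maximal transfer; sized for illustration. */
	#define	UBUF_MAXPAGES	(howmany(MAXPHYS, PAGE_SIZE) + 1)

	static int
	load_user_buffer(bus_dma_tag_t tag, bus_dmamap_t map, void *uaddr,
	    size_t len, bus_dmamap_callback_t *cb, void *cbarg)
	{
		vm_page_t ma[UBUF_MAXPAGES];
		int count, error;

		/*
		 * Fault in and wire the pages backing the user buffer.
		 * This is the only step that must run in the context of
		 * the owning process.
		 */
		count = vm_fault_quick_hold_pages(
		    &curproc->p_vmspace->vm_map, (vm_offset_t)uaddr, len,
		    VM_PROT_READ | VM_PROT_WRITE, ma, UBUF_MAXPAGES);
		if (count == -1)
			return (EFAULT);

		/*
		 * Hand the wired pages to busdma by physical address;
		 * after this, nothing depends on the user VA or pmap.
		 * bus_dmamap_load_ma() is the assumed public wrapper
		 * around today's _bus_dmamap_load_ma().
		 */
		error = bus_dmamap_load_ma(tag, map, ma, len,
		    (vm_offset_t)uaddr & PAGE_MASK, BUS_DMA_NOWAIT,
		    cb, cbarg);
		if (error != 0)
			vm_page_unhold_pages(ma, count);
		/* On success, unhold the pages after bus_dmamap_unload(). */
		return (error);
	}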
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 15:36:39 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id EE3C93ED; Tue, 28 Apr 2015 15:36:39 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id AC3AD1DBD; Tue, 28 Apr 2015 15:36:39 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 1D5B8B93A; Tue, 28 Apr 2015 11:36:38 -0400 (EDT) From: John Baldwin To: Konstantin Belousov Cc: freebsd-arch@freebsd.org, Davide Italiano , Adrian Chadd Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes Date: Tue, 28 Apr 2015 10:23:33 -0400 Message-ID: <3094092.O50xjOxef9@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: <20150428141302.GH2390@kib.kiev.ua> References: <1832557.zVusTDjZUx@ralph.baldwin.cx> <20150428141302.GH2390@kib.kiev.ua> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 28 Apr 2015 11:36:38 -0400 (EDT) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 15:36:40 -0000

On Tuesday, April 28, 2015 05:13:02 PM Konstantin Belousov wrote: > On Tue, Apr 28, 2015 at 09:35:10AM -0400, John Baldwin wrote: > > On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote: > > > On 25 April 2015 at 11:18, Davide Italiano wrote: > > > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote: > > > >> Hi! > > > >> > > > >> I've been doing some NUMA testing on large boxes and I've found that > > > >> there's lock contention in the ACPI path. It's due to my change a > > > >> while ago to start using sleep states above ACPI C1 by default. The > > > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep > > > >> path that grabs a serialiser lock, and on an 80 thread box this is > > > >> costly. > > > >> > > > >> I'd like to drop performance_cx_lowest to C2 in -HEAD.
ACPI C2 state > > > >> doesn't require the same register fiddling (to disable bus mastering, > > > >> if I'm reading it right) and so it doesn't enter that particular > > > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge > > > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper > > > >> CPU sleep state (C6 on each of these). I think it is still a good default > > > >> for both servers and desktops. > > > >> > > > >> If no-one has a problem with this then I'll do it after the weekend. > > > >> > > > > > > > > This sounds to me just a way to hide a problem. > > > > Very few people nowadays run on NUMA and they can tune the machine as > > > > they like when they do testing. > > > > If there's a lock contention problem, it needs to be fixed and not > > > > hidden under another default. > > > > > > The lock contention problem is inside ACPI and how it's designed/implemented. > > > We're not going to easily be able to make ACPI lock "better" as we're > > > constrained by how ACPI implements things in the shared ACPICA code. > > > > Is the contention actually harmful? Note that this only happens when the > > CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle > > stuff uses heuristics to only drop into deeper sleep states if the CPU has > > recently been idle "more" so that if you are relatively busy you will only go > > into C1 instead. (I think this latter might have changed since eventtimers > > came in, it looks like we now choose the idle state based on how long until > > the next timer interrupt?) > You have to spin, waiting for other cores, to get the right to reduce the > power state. Yes, normally spinning wouldn't do that, but the cpu idle hooks run with interrupts disabled. We could fix that, perhaps, though ACPI doesn't quite have what we would want (a single op that would disable interrupts after grabbing the lock, do the test and set of the bit in question and return its old value leaving interrupts disabled after dropping the lock). However, I would still like to know if the contention here is actually harmful in some measurable way aside from showing up in profiling output. > > Alternatively, your machine may be better off using cpu_idle_mwait. There > > are already CPUs now that only advertise deeper sleep states for use with > > mwait but not ACPI, so we may certainly end up with defaulting to mwait > > instead of ACPI for certain CPUs anyway. > > cpu_idle_mwait is quite useless, it only enters C1, which should be > almost the same as hlt. mwait for C1 might reduce the latency of waking up, > but definitely would not reduce power consumption on par with higher Cx. Mmm, it was your pending patch I was thinking of. Don't you use mwait with the hints to use deeper sleep states in your change? > That said, I think that for non-laptop usage, limiting the lowest state to C2 > is fine. For Haswells, Intel's recommendation for BIOS writers is to > limit the announced states to C2 (eliminating BM avoidance altogether). > Internally, ACPI C2 is mapped to CPU C6 or maybe even C7. The problem, of course, is detecting non-laptops. :-/ In my own crude measurements based on the power draw numbers in the BMC on recent SuperMicro X9 boards for SandyBridge servers, most of the gain you get is from C2; C3 doesn't add much difference once you are able to do C2. Also of note is the comment above the busmaster register in question about USB. I'm not sure if that is still true anymore.
If it were, systems would never go into C3, in which case this would be a moot point and there would be no need to enable C3. -- John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 15:42:50 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A2DE877C; Tue, 28 Apr 2015 15:42:50 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 199FD1EBE; Tue, 28 Apr 2015 15:42:49 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3SFgjlm028728 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 28 Apr 2015 18:42:45 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3SFgjlm028728 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3SFgjF8028727; Tue, 28 Apr 2015 18:42:45 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 28 Apr 2015 18:42:45 +0300 From: Konstantin Belousov To: Jason Harmening Cc: John Baldwin , freebsd-arch@freebsd.org, Svatopluk Kraus Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Message-ID: <20150428154245.GJ2390@kib.kiev.ua> References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> <1876382.0PQNo3Rp24@ralph.baldwin.cx> <553F9DE2.5080908@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <553F9DE2.5080908@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 15:42:50 -0000

On Tue, Apr 28, 2015 at 09:49:06AM -0500, Jason Harmening wrote: > > Either _bus_dmamap_load_ma or out-of-context UIO_USERSPACE bounce > buffering could have issues with waiting on sfbufs on some arches, > including arm. That could be fixed by making each unmapped bounce > buffer set up a kva mapping for the data addr when it's created, but > that fix might be worse than the problem it's trying to solve. I had an implementation of the sfbuf allocator which never sleeps. If an sfbuf was not available without sleeping, a callback is called later, when a reusable sf buf is freed. It was written to allow drivers like PIO ATA to take unmapped bios, but I never finished it; at least, I did not convert a single driver. I am not sure whether I can find the branch, or whether it is reasonable to try to rebase it, but the base idea may be useful for the UIO_USERSPACE case as well.
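The interface described above might look roughly like the following (the name and signature are invented here purely for illustration; the unfinished branch may differ):

	/*
	 * Map the page into KVA without sleeping.  Returns NULL if no
	 * sf_buf is free; in that case cb(arg, sf) is invoked later,
	 * from whatever context frees a reusable sf_buf.
	 */
	struct sf_buf *
	sf_buf_alloc_nowait(vm_page_t m,
	    void (*cb)(void *arg, struct sf_buf *sf), void *arg);

A bounce-buffer sync running outside the owning process could then defer the copy to the callback instead of sleeping for a mapping.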
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 16:19:23 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8E279B49 for ; Tue, 28 Apr 2015 16:19:23 +0000 (UTC) Received: from mail-ie0-f176.google.com (mail-ie0-f176.google.com [209.85.223.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 5849E1358 for ; Tue, 28 Apr 2015 16:19:23 +0000 (UTC) Received: by iejt8 with SMTP id t8so22157319iej.2 for ; Tue, 28 Apr 2015 09:19:22 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:subject:mime-version:content-type:from :in-reply-to:date:cc:message-id:references:to; bh=RlN8tnWkK8GsdOjWPJE7l64uVm/t2RPf2F/1sWXevFE=; b=lPaIKZjkVVe8Z0AIkNIwWMXT54sXQPpcTkjEM5hgmZ0+hWunGb/bHmR46PR/JL8JqW uw9zpIMKHTgvA3Tmzp5dJSMWMpgFOFWBR8I2Yh6YKXtl19Ke+TmG2WVpWVTuHOXuG7gC w+Sq1eEXiKUB+193fZxXLhY+TdlLpQHffqyezBpzASF29eh+q39smyI8HvyIBXBaRMAf 0IROf+6KPgl6cVCuc2UPlU0InvWi/sD+l5PrF0zIfnsnActf+LophR+EjNIuvH98xQQd 01oPMnjuRSGJwMYpEVWfj+x3NgKF05udpd6pHhBi9EQ8Wy4lWTgZQLbHF+kH3jJ4AShg +PbA== X-Gm-Message-State: ALoCoQlr+9z2yaipnDFafdtd4tGDx+YonlJVVt9x1OHaKSmxcZzwcTVPU2optrHc7OjIvcsuqjJq X-Received: by 10.50.61.234 with SMTP id t10mr14345968igr.19.1430237962233; Tue, 28 Apr 2015 09:19:22 -0700 (PDT) Received: from netflix-mac-wired.bsdimp.com ([50.253.99.174]) by mx.google.com with ESMTPSA id qo11sm7476281igb.17.2015.04.28.09.19.20 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Apr 2015 09:19:21 -0700 (PDT) Sender: Warner Losh Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2098\)) Content-Type: multipart/signed; boundary="Apple-Mail=_851502F5-21E5-4D4C-B196-6A58C0E7DE9E"; protocol="application/pgp-signature"; micalg=pgp-sha512 X-Pgp-Agent: GPGMail 2.5b6 From: Warner Losh In-Reply-To: <1876382.0PQNo3Rp24@ralph.baldwin.cx> Date: Tue, 28 Apr 2015 10:19:20 -0600 Cc: freebsd-arch , Konstantin Belousov , Jason Harmening , Svatopluk Kraus Message-Id: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> <1876382.0PQNo3Rp24@ralph.baldwin.cx> To: John Baldwin X-Mailer: Apple Mail (2.2098) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 16:19:23 -0000

> On Apr 28, 2015, at 7:40 AM, John Baldwin wrote: > >> I believe UIO_USERSPACE is almost unused, it might be there for some >> obscure (and buggy) driver. > > I believe it was added (and only ever used) in crypto drivers, and that they > all did bus_dma operations in the context of the thread that passed in the > uio. I definitely think it is fragile and should be replaced with something > more reliable.
Fusion I/O's SDK used this trick to allow mapping of userspace buffers down into the block layer after doing the requisite locking / pinning / etc of the buffers into memory. That's if memory serves correctly (the SDK did these things; I can't easily check on that detail since I'm no longer at FIO).

Warner

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 16:55:14 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id EF0C3BC7; Tue, 28 Apr 2015 16:55:14 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 7822D1882; Tue, 28 Apr 2015 16:55:14 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3SGt3Ji045172 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 28 Apr 2015 19:55:04 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3SGt3Ji045172 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3SGt3sn045168; Tue, 28 Apr 2015 19:55:03 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 28 Apr 2015 19:55:03 +0300 From: Konstantin Belousov To: John Baldwin Cc: freebsd-arch@freebsd.org, Davide Italiano , Adrian Chadd Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes Message-ID: <20150428165503.GK2390@kib.kiev.ua> References: <1832557.zVusTDjZUx@ralph.baldwin.cx> <20150428141302.GH2390@kib.kiev.ua> <3094092.O50xjOxef9@ralph.baldwin.cx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3094092.O50xjOxef9@ralph.baldwin.cx> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home X-BeenThere:
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 16:55:15 -0000

On Tue, Apr 28, 2015 at 10:23:33AM -0400, John Baldwin wrote: > On Tuesday, April 28, 2015 05:13:02 PM Konstantin Belousov wrote: > > On Tue, Apr 28, 2015 at 09:35:10AM -0400, John Baldwin wrote: > > > On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote: > > > > On 25 April 2015 at 11:18, Davide Italiano wrote: > > > > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote: > > > > >> Hi! > > > > >> > > > > >> I've been doing some NUMA testing on large boxes and I've found that > > > > >> there's lock contention in the ACPI path. It's due to my change a > > > > >> while ago to start using sleep states above ACPI C1 by default. The > > > > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep > > > > >> path that grabs a serialiser lock, and on an 80 thread box this is > > > > >> costly. > > > > >> > > > > >> I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state > > > > >> doesn't require the same register fiddling (to disable bus mastering, > > > > >> if I'm reading it right) and so it doesn't enter that particular > > > > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge > > > > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper > > > > >> CPU sleep state (C6 on each of these). I think it is still a good default > > > > >> for both servers and desktops. > > > > >> > > > > >> If no-one has a problem with this then I'll do it after the weekend. > > > > >> > > > > > > > > > > This sounds to me just a way to hide a problem. > > > > > Very few people nowadays run on NUMA and they can tune the machine as > > > > > they like when they do testing. > > > > > If there's a lock contention problem, it needs to be fixed and not > > > > > hidden under another default. > > > > > > > > The lock contention problem is inside ACPI and how it's designed/implemented. > > > > We're not going to easily be able to make ACPI lock "better" as we're > > > > constrained by how ACPI implements things in the shared ACPICA code. > > > > > > Is the contention actually harmful? Note that this only happens when the > > > CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle > > > stuff uses heuristics to only drop into deeper sleep states if the CPU has > > > recently been idle "more" so that if you are relatively busy you will only go > > > into C1 instead. (I think this latter might have changed since eventtimers > > > came in, it looks like we now choose the idle state based on how long until > > > the next timer interrupt?) > > You have to spin, waiting for other cores, to get the right to reduce the > > power state. > > Yes, normally spinning wouldn't do that, but the cpu idle hooks run with > interrupts disabled. We could fix that, perhaps, though ACPI doesn't quite > have what we would want (a single op that would disable interrupts after > grabbing the lock, do the test and set of the bit in question and return > its old value leaving interrupts disabled after dropping the lock). > > However, I would still like to know if the contention here is actually > harmful in some measurable way aside from showing up in profiling output. I think Adrian could run intel pmc on his box with C2 and C3 and compare the power reports.
> > > Alternatively, your machine may be better off using cpu_idle_mwait. There > > > are already CPUs now that only advertise deeper sleep states for use with > > > mwait but not ACPI, so we may certainly end up with defaulting to mwait > > > instead of ACPI for certain CPUs anyway. > > > > cpu_idle_mwait is quite useless, it only enters C1, which should be > > almost the same as hlt. mwait for C1 might reduce the latency of waking up, > > but definitely would not reduce power consumption on par with higher Cx. > > Mmm, it was your pending patch I was thinking of. Don't you use mwait with > the hints to use deeper sleep states in your change? Only in the acpi idle method. It is not safe to blindly enter states higher than C1 with mwait. Intel wrote a driver for Linux which does not rely on ACPI _CST tables for this. The driver has hard-coded tables for cores >= Nehalem which specify supported states, latency and cache behaviour. This is what I tried to mention in the original mail. If we write such a driver (and rip the tables from Linux), we could allow deeper states in cpu_idle_mwait. But I remember that avg did not like the approach, and I agree that this is not maintainable, if you are not Intel. > > > That said, I think that for non-laptop usage, limiting the lowest state to C2 > > is fine. For Haswells, Intel's recommendation for BIOS writers is to > > limit the announced states to C2 (eliminating BM avoidance altogether). > > Internally, ACPI C2 is mapped to CPU C6 or maybe even C7. > > The problem, of course, is detecting non-laptops. :-/ In my own crude > measurements based on the power draw numbers in the BMC on recent > SuperMicro X9 boards for SandyBridge servers, most of the gain you get is > from C2; C3 doesn't add much difference once you are able to do C2. Also of > note is the comment above the busmaster register in question about USB. I'm > not sure if that is still true anymore. If it were, systems would never go > into C3, in which case this would be a moot point and there would be no need to > enable C3. I remember turbo boost requires C3, and non-trivially deep package C states on older CPUs also require C3. This is an argument against Adrian's change, but I think it is not applicable on newer processors.
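For illustration, the hard-coded tables in question amount to one array per microarchitecture, roughly of this shape (a sketch modeled on Linux's intel_idle; the MWAIT hints are the real sub-state encodings, the latencies only indicative):

	struct cx_entry {
		const char *name;	/* "C1", "C3", "C6", ... */
		uint32_t mwait_hint;	/* EAX hint passed to MWAIT */
		int	exit_latency;	/* wakeup latency, usec */
		bool	flushes_cache;	/* is cache state lost? */
	};

	/* Example entries for Nehalem-class cores. */
	static const struct cx_entry nehalem_cstates[] = {
		{ "C1", 0x00,   3, false },
		{ "C3", 0x10,  20, true  },
		{ "C6", 0x20, 200, true  },
	};

Keeping such tables current for every new core is exactly the maintainability problem mentioned above.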
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 19:10:44 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 914DA82E; Tue, 28 Apr 2015 19:10:44 +0000 (UTC) Received: from mail-ig0-x235.google.com (mail-ig0-x235.google.com [IPv6:2607:f8b0:4001:c05::235]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 56962195F; Tue, 28 Apr 2015 19:10:44 +0000 (UTC) Received: by igblo3 with SMTP id lo3so29229362igb.0; Tue, 28 Apr 2015 12:10:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=YXXRMHR5TssIc7/XIinO/uAA5sROdLXc4Scm+HA05S8=; b=QgQ+MAmjok5hHzKoja4Fb4tNksyei5zkoB0bgU9mJG4BniruS7SNMR1bdfxLvSXBRM oRz5f76UmHSlmM5X3yKFFFo+yqqcpAGHckEh+NoAA67VdGnezBOkFW256vdO2Nj4duaq R9qZpvmMWAinVhAINX9aPrF4XzKbamEU2rIpYLIS54LN5kIkQZ7T6vj5Rll5/4s3MmS/ 45gc1VsrCb7Xx6X0o08oQRAfA5rxux+CpXoLMBQLmut1dN5Lc7FIaCu9x/Mq2/LyBHih g+TEmJPmIEaj7PryEiqLLy1JedboTPND60WcTp/XWm0ioUkfwzI2ct5HInz1hmfyswdt DAmw== MIME-Version: 1.0 X-Received: by 10.50.73.198 with SMTP id n6mr22560481igv.32.1430248243568; Tue, 28 Apr 2015 12:10:43 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.36.38.133 with HTTP; Tue, 28 Apr 2015 12:10:43 -0700 (PDT) In-Reply-To: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> References: <553B9E64.8030907@gmail.com> <20150425163444.GL2390@kib.kiev.ua> <1876382.0PQNo3Rp24@ralph.baldwin.cx> <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> Date: Tue, 28 Apr 2015 12:10:43 -0700 X-Google-Sender-Auth: 9H9hpSigDX-70d3_tIqGxz5PMEk Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Adrian Chadd To: Warner Losh Cc: John Baldwin , Konstantin Belousov , Jason Harmening , Svatopluk Kraus , freebsd-arch Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 19:10:44 -0000

On 28 April 2015 at 09:19, Warner Losh wrote: > >> On Apr 28, 2015, at 7:40 AM, John Baldwin wrote: >> >>> I believe UIO_USERSPACE is almost unused, it might be there for some >>> obscure (and buggy) driver. >> >> I believe it was added (and only ever used) in crypto drivers, and that they >> all did bus_dma operations in the context of the thread that passed in the >> uio. I definitely think it is fragile and should be replaced with something >> more reliable. > > Fusion I/O's SDK used this trick to allow mapping of userspace buffers down > into the block layer after doing the requisite locking / pinning / etc of the buffers > into memory. That's if memory serves correctly (the SDK did these things, I can't > easily check on that detail since I'm no longer at FIO). This is a long-standing trick. physio() does it too, aio_read/aio_write does it for direct block accesses. Now that pbufs aren't involved anymore, it should scale rather well.
So I'd like to see more of it in the kernel and disk/net APIs and drivers. -adrian

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 22:27:44 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 117655C9; Tue, 28 Apr 2015 22:27:44 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DE5EA1F35; Tue, 28 Apr 2015 22:27:43 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id EDC3FB926; Tue, 28 Apr 2015 18:27:41 -0400 (EDT) From: John Baldwin To: Adrian Chadd Cc: Warner Losh , Konstantin Belousov , Jason Harmening , Svatopluk Kraus , freebsd-arch Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Date: Tue, 28 Apr 2015 18:27:34 -0400 Message-ID: <1761247.Bq816CMB8v@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 28 Apr 2015 18:27:42 -0400 (EDT) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 22:27:44 -0000

On Tuesday, April 28, 2015 12:10:43 PM Adrian Chadd wrote: > On 28 April 2015 at 09:19, Warner Losh wrote: > > > >> On Apr 28, 2015, at 7:40 AM, John Baldwin wrote: > >> > >>> I believe UIO_USERSPACE is almost unused, it might be there for some > >>> obscure (and buggy) driver. > >> > >> I believe it was added (and only ever used) in crypto drivers, and that they > >> all did bus_dma operations in the context of the thread that passed in the > >> uio. I definitely think it is fragile and should be replaced with something > >> more reliable. > > > > Fusion I/O's SDK used this trick to allow mapping of userspace buffers down > > into the block layer after doing the requisite locking / pinning / etc of the buffers > > into memory. That's if memory serves correctly (the SDK did these things, I can't > > easily check on that detail since I'm no longer at FIO). > > This is a long-standing trick. physio() does it too, > aio_read/aio_write does it for direct block accesses. Now that pbufs > aren't involved anymore, it should scale rather well. > > So I'd like to see more of it in the kernel and disk/net APIs and drivers. aio_read/write jump through gross hacks to create dedicated kthreads that "borrow" the address space of the requester. The fact is that we want to make unmapped I/O work in the general case and the same solutions for temporary mappings for that can be reused to temporarily map the wired pages backing a user request when needed. Reusing user mappings directly in the kernel isn't really the way forward.
-- John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 22:39:33 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9770997F; Tue, 28 Apr 2015 22:39:33 +0000 (UTC) Received: from st11p02mm-asmtp001.mac.com (st11p02mm-asmtpout001.mac.com [17.172.220.236]) (using TLSv1.2 with cipher DHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6AF69105B; Tue, 28 Apr 2015 22:39:33 +0000 (UTC) Received: from st11p02mm-spool001.mac.com ([17.172.220.246]) by st11p02mm-asmtp001.mac.com (Oracle Communications Messaging Server 7.0.5.35.0 64bit (built Dec 4 2014)) with ESMTP id <0NNJ000Z8DHMJ260@st11p02mm-asmtp001.mac.com>; Tue, 28 Apr 2015 21:39:25 +0000 (GMT) X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.13.68,1.0.33,0.0.0000 definitions=2015-04-28_07:2015-04-28,2015-04-28,1970-01-01 signatures=0 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 suspectscore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=7.0.1-1412110000 definitions=main-1504280242 MIME-version: 1.0 Received: from localhost ([17.172.220.163]) by st11p02mm-spool001.mac.com (Oracle Communications Messaging Server 7.0.5.33.0 64bit (built Aug 27 2014)) with ESMTP id <0NNJ00FF4DHMBP10@st11p02mm-spool001.mac.com>; Tue, 28 Apr 2015 21:39:22 +0000 (GMT) To: Adrian Chadd Cc: "freebsd-arch@freebsd.org" From: Rui Paulo Subject: Re: RFT: numa policy branch Date: Tue, 28 Apr 2015 21:39:22 +0000 (GMT) X-Mailer: iCloud MailClient15B.8196069 MailServer15B.18830 X-Originating-IP: [12.218.212.178] Message-id: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 22:39:33 -0000

On Apr 26, 2015, at 01:30 PM, Adrian Chadd wrote:

> Hi!
>
> Another update:
>
> * updated to recent -HEAD;
> * numactl now can set memory policy and cpuset domain information - so
> it's easy to say "this runs in memory domain X and cpu domain Y" in
> one pass with it;

That works, but --mempolicy=first-touch should ignore the --memdomain argument (or print an error) if it's present.

> * the locality matrix is now available. Here's an example from scott's
> 2x haswell v3, with cluster-on-die enabled:
>
> vm.phys_locality:
> 0: 10 21 31 31
> 1: 21 10 31 31
> 2: 31 31 10 21
> 3: 31 31 21 10
>
> And on the westmere-ex box, with no SLIT table:
>
> vm.phys_locality:
> 0: -1 -1 -1 -1
> 1: -1 -1 -1 -1
> 2: -1 -1 -1 -1
> 3: -1 -1 -1 -1

This worked for us on IvyBridge with a SLIT table.

> * I've tested it on westmere-ex (4x socket), sandybridge, ivybridge,
> haswell v3 and haswell v3 cluster on die.
> * I've discovered that our implementation of libgomp (from gcc-4.2) is
> very old and doesn't include some of the thread control environment
> variables, grr.
> * .. and that the gcc libgomp code doesn't at all have freebsd thread
> affinity routines, so I added them to gcc-4.8.

I used gcc 4.9.

> I'd appreciate any reviews / testing people are able to provide. I'm
> about at the functionality point where I'd like to submit it for
> formal review and try to land it in -HEAD.

There's a bug in the default sysctl policy. You're calling strcat on an uninitialised string, so it produces garbage output. We also hit a panic when our application starts allocating many GBs of memory. In this case, the memory is split between two sockets and I think it's crashing like you described on IRC.

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 28 23:32:30 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 7729A3A5 for ; Tue, 28 Apr 2015 23:32:30 +0000 (UTC) Received: from mail-ig0-x235.google.com (mail-ig0-x235.google.com [IPv6:2607:f8b0:4001:c05::235]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 422151688 for ; Tue, 28 Apr 2015 23:32:30 +0000 (UTC) Received: by igbyr2 with SMTP id yr2so105406484igb.0 for ; Tue, 28 Apr 2015 16:32:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=ZZs1ydaXaqCxtEC9pBx/o1ptA2BaBG+l7mud9JS8/u8=; b=kkFN0C/IP3L1GLFuSzIUyjJtMcg8tdS8egO/2RL5RmIoJowBV1hd6VguwcsqYEGpdE NSrb1fNWG5wQB4MUF5Agq6zV6HGwPCMqYuBqtq5MkmV1bA2PJNGtc0/aifcZiAXNJwGU G7FSW/EKx4oKFAq5OzcmexF+ePdmmZ9xng3D54Ap5SgbIpCX85qxPcLtF4i66/RRKpjP TcPVNmzV8AXufYfvYRZIZZgKWbfINukI9J9kZ+DY5EnC10Q/AMcHJAoarEwGC9zwNW6c A/kqmFOcn3d2bGZ2zcwhFthAz7EQhOLoAVxbFALd7bcSOcwB5g7BeBrV+P8NHtVxzVVu IUAA== MIME-Version: 1.0 X-Received: by 10.43.163.129 with SMTP id mo1mr395770icc.61.1430263949658; Tue, 28 Apr 2015 16:32:29 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.36.38.133 with HTTP; Tue, 28 Apr 2015 16:32:29 -0700 (PDT) In-Reply-To: References: Date: Tue, 28 Apr 2015 16:32:29 -0700 X-Google-Sender-Auth: nSierJALtjIeSja88mBq_CJQbF8 Message-ID: Subject: Re: RFT: numa policy branch From: Adrian Chadd To: Rui Paulo Cc: "freebsd-arch@freebsd.org" Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Apr 2015 23:32:30 -0000

On 28 April 2015 at 14:39, Rui Paulo wrote: > On Apr 26, 2015, at 01:30 PM, Adrian Chadd wrote: > > Hi! > > Another update: > > * updated to recent -HEAD; > * numactl now can set memory policy and cpuset domain information - so > it's easy to say "this runs in memory domain X and cpu domain Y" in > one pass with it; > > > That works, but --mempolicy=first-touch should ignore the --memdomain > argument (or print an error) if it's present. Ok. > * the locality matrix is now available.
Here's an example from scott's > 2x haswell v3, with cluster-on-die enabled: > > vm.phys_locality: > 0: 10 21 31 31 > 1: 21 10 31 31 > 2: 31 31 10 21 > 3: 31 31 21 10 > > And on the westmere-ex box, with no SLIT table: > > vm.phys_locality: > 0: -1 -1 -1 -1 > 1: -1 -1 -1 -1 > 2: -1 -1 -1 -1 > 3: -1 -1 -1 -1 > > > This worked for us on IvyBridge with a SLIT table. Cool. > * I've tested it on westmere-ex (4x socket), sandybridge, ivybridge, > haswell v3 and haswell v3 cluster on die. > * I've discovered that our implementation of libgomp (from gcc-4.2) is > very old and doesn't include some of the thread control environment > variables, grr. > * .. and that the gcc libgomp code doesn't at all have freebsd thread > affinity routines, so I added them to gcc-4.8. > > > I used gcc 4.9. > > I'd appreciate any reviews / testing people are able to provide. I'm > about at the functionality point where I'd like to submit it for > formal review and try to land it in -HEAD. > > There's a bug in the default sysctl policy. You're calling strcat on an > uninitialised string, so it produces garbage output. We also hit a > panic when our application starts allocating many GBs of memory. In this > case, the memory is split between two sockets and I think it's crashing like > you described on IRC. I'll fix the former soon, thanks for pointing that out. As for the crash - yeah, I reproduced it and sent a patch to alc for review. It's because vm_page_alloc() doesn't expect calls to vm_phys to fail a second time around. Trouble is - the VM thresholds are all global. Failing an allocation in one domain does cause pagedaemon to start up on that domain, but no paging actually occurs. Unfortunately the pager still thinks there's plenty of memory available, so it doesn't know it needs to run. There's a pagedaemon per domain, but no per-domain thresholds or paging targets. I don't think we're going to be able to fix that in this pass - I'd rather get this or something like this into the kernel so at least first-touch-rr, fixed-domain-rr and rr work. Then yes, the VM will need some updating.
-adrian


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 10:22:20 2015
Date: Wed, 29 Apr 2015 12:22:19 +0200
From: Svatopluk Kraus
To: John Baldwin
Cc: Adrian Chadd, Warner Losh, Konstantin Belousov, Jason Harmening, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 12:27 AM, John Baldwin wrote:
> On Tuesday, April 28, 2015 12:10:43 PM Adrian Chadd wrote:
>> On 28 April 2015 at 09:19, Warner Losh wrote:
>> >
>> >> On Apr 28, 2015, at 7:40 AM, John Baldwin wrote:
>> >>
>> >>> I believe UIO_USERSPACE is almost unused, it might be there for
>> >>> some obscure (and buggy) driver.
>> >>
>> >> I believe it was added (and only ever used) in crypto drivers, and
>> >> that they all did bus_dma operations in the context of the thread
>> >> that passed in the uio. I definitely think it is fragile and
>> >> should be replaced with something more reliable.
>> >
>> > Fusion I/O's SDK used this trick to allow mapping of userspace
>> > buffers down into the block layer after doing the requisite
>> > locking / pinning / etc. of the buffers into memory. That's if
>> > memory serves correctly (the SDK did these things, I can't easily
>> > check on that detail since I'm no longer at FIO).
>>
>> This is a long-standing trick. physio() does it too, and
>> aio_read/aio_write does it for direct block accesses. Now that pbufs
>> aren't involved anymore, it should scale rather well.
>>
>> So I'd like to see more of it in the kernel and disk/net APIs and
>> drivers.

> aio_read/write jump through gross hacks to create dedicated kthreads
> that "borrow" the address space of the requester. The fact is that we
> want to make unmapped I/O work in the general case, and the same
> solutions for temporary mappings there can be reused to temporarily
> map the wired pages backing a user request when needed. Reusing user
> mappings directly in the kernel isn't really the way forward.
>

If using unmapped buffers is the way we will take to handle user space
buffers, then:

(1) DMA clients which support DMA for user space buffers must use some
variant of _bus_dmamap_load_phys(). They must wire the physical pages
in the system anyway.

(2) Maybe some better way to temporarily allocate KVA for unmapped
buffers should be implemented.

(3) DMA clients which already use _bus_dmamap_load_uio() with
UIO_USERSPACE must be reimplemented or made obsolete.

(4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(), and the
man page should be changed accordingly.

(5) And pmap can be deleted from struct bus_dmamap and from all
functions which take it as an argument. Only the kernel pmap will be
used in the DMA framework.

Did I miss something?

> --
> John Baldwin


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 13:20:30 2015
Date: Wed, 29 Apr 2015 16:20:17 +0300
From: Konstantin Belousov
To: Svatopluk Kraus
Cc: John Baldwin, Adrian Chadd, Warner Losh, Jason Harmening, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 12:22:19PM
+0200, Svatopluk Kraus wrote:
> If using unmapped buffers is the way we will take to handle user
> space buffers, then:
>
> (1) DMA clients which support DMA for user space buffers must use
> some variant of _bus_dmamap_load_phys(). They must wire the physical
> pages in the system anyway.
No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
Or yes, if you count bus_dmamap_load_ma() as a variant of _load_phys().
I do not.

> (2) Maybe some better way to temporarily allocate KVA for unmapped
> buffers should be implemented.
See some other mail from me about a non-blocking sfbuf allocator with
a callback.

> (3) DMA clients which already use _bus_dmamap_load_uio() with
> UIO_USERSPACE must be reimplemented or made obsolete.
Yes.

> (4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(), and
> the man page should be changed accordingly.
Yes.

> (5) And pmap can be deleted from struct bus_dmamap and from all
> functions which take it as an argument. Only the kernel pmap will be
> used in the DMA framework.
Probably yes.

>
> Did I miss something?
>
> > --
> > John Baldwin


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 15:16:00 2015
Date: Wed, 29 Apr 2015 17:09:18 +0200
From: Svatopluk Kraus
To: Konstantin Belousov
Cc: John Baldwin, Adrian Chadd, Warner Losh, Jason Harmening, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 3:20 PM, Konstantin Belousov wrote:
> On Wed, Apr 29, 2015 at 12:22:19PM +0200, Svatopluk Kraus wrote:
>> If using unmapped buffers is the way
we will take to handle user
>> space buffers, then:
>>
>> (1) DMA clients which support DMA for user space buffers must use
>> some variant of _bus_dmamap_load_phys(). They must wire the physical
>> pages in the system anyway.
> No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
> Or yes, if you count bus_dmamap_load_ma() as a variant of _load_phys().
> I do not.

There are only two basic functions in the MD implementations which all
other functions call: _bus_dmamap_load_phys() and
_bus_dmamap_load_buffer(), for unmapped buffers and mapped ones
respectively. Are you saying that bus_dmamap_load_ma() should be some
third kind?

>
>> (2) Maybe some better way to temporarily allocate KVA for unmapped
>> buffers should be implemented.
> See some other mail from me about a non-blocking sfbuf allocator with
> a callback.

This small list was meant as a summary. As I saw your emails in this
thread, I added this point. I did not realize it's already in the
source tree.

>
>> (3) DMA clients which already use _bus_dmamap_load_uio() with
>> UIO_USERSPACE must be reimplemented or made obsolete.
> Yes.
>
>> (4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(), and
>> the man page should be changed accordingly.
> Yes.

Hmm, I think that for a start, _bus_dmamap_load_uio() for
UIO_USERSPACE can be hacked to use bus_dmamap_load_ma(), maybe with
some warning to force users of the old clients to reimplement them.

>
>> (5) And pmap can be deleted from struct bus_dmamap and from all
>> functions which take it as an argument. Only the kernel pmap will be
>> used in the DMA framework.
> Probably yes.
>


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 16:54:39 2015
Date: Wed, 29 Apr 2015 19:54:32 +0300
From: Konstantin Belousov
To: Svatopluk Kraus
Cc: John Baldwin, Adrian Chadd, Warner Losh, Jason Harmening, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 05:09:18PM +0200, Svatopluk Kraus wrote:
> On Wed, Apr 29, 2015 at 3:20 PM, Konstantin Belousov wrote:
> > On Wed, Apr 29, 2015 at 12:22:19PM +0200, Svatopluk Kraus wrote:
> >> If using unmapped buffers is the way we will take to handle user
> >> space buffers, then:
> >>
> >> (1) DMA clients which support DMA for user space buffers must use
> >> some variant of _bus_dmamap_load_phys(). They must wire the
> >> physical pages in the system anyway.
> > No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
> > Or yes, if you count bus_dmamap_load_ma() as a variant of
> > _load_phys(). I do not.
>
> There are only two basic functions in the MD implementations which
> all other functions call: _bus_dmamap_load_phys() and
> _bus_dmamap_load_buffer(), for unmapped buffers and mapped ones
> respectively. Are you saying that bus_dmamap_load_ma() should be
> some third kind?
It is. On the VT-d backed x86 busdma, load_ma() is the fundamental
function, which is called both by _load_buffer() and _load_phys().
This is not completely true: the real backstage worker is called
_load_something(), but it differs from _load_ma() only by taking a
casted tag and map.

On the other hand, the load_ma_triv() wrapper implements _load_ma()
using load_phys() on architectures which do not yet provide a native
_load_ma(), or where a native _load_ma() does not make sense.

>
> >> (2) Maybe some better way to temporarily allocate KVA for
> >> unmapped buffers should be implemented.
> > See some other mail from me about a non-blocking sfbuf allocator
> > with a callback.
>
> This small list was meant as a summary. As I saw your emails in this
> thread, I added this point. I did not realize it's already in the
> source tree.
No, it is not. I stopped working on it during the unmapped i/o work,
after I realized that there was not much interest from device driver
authors. Nobody cared about drivers like ATA PIO. Now, with the new
possible use for the non-blocking sfbuf allocator, it can be revived.

>
> >> (3) DMA clients which already use _bus_dmamap_load_uio() with
> >> UIO_USERSPACE must be reimplemented or made obsolete.
> > Yes.
>
> >> (4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(),
> >> and the man page should be changed accordingly.
> > Yes.
>
> Hmm, I think that for a start, _bus_dmamap_load_uio() for
> UIO_USERSPACE can be hacked to use bus_dmamap_load_ma(), maybe with
> some warning to force users of the old clients to reimplement them.
Also it would be a good test for my claim that
vm_fault_quick_hold_pages() + bus_dmamap_load_ma() is all that is
needed.

>
> >> (5) And pmap can be deleted from struct bus_dmamap and from all
> >> functions which take it as an argument. Only the kernel pmap will
> >> be used in the DMA framework.
> > Probably yes.
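To make that path concrete, here is a rough, untested sketch of what a
driver-side consumer could look like. The helper name and the MAXPAGES
bound are made up for illustration, and the calling convention for
_bus_dmamap_load_ma() follows the current internal MI/MD interface, so
it may change if a public bus_dmamap_load_ma() is ever added:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/proc.h>
#include <machine/bus.h>
#include <vm/vm.h>
#include <vm/vm_extern.h>
#include <vm/vm_map.h>
#include <vm/vm_page.h>

#define	MAXPAGES	8	/* arbitrary bound for this sketch */

/*
 * Wire the physical pages backing a short user buffer and load them
 * as unmapped pages, so busdma never sees the user VA.  The caller
 * must keep the pages held until the transfer completes, then
 * bus_dmamap_unload() and vm_page_unhold_pages().
 */
static int
load_user_buf(bus_dma_tag_t tag, bus_dmamap_t map, void *uaddr,
    size_t len, bus_dma_segment_t *segs, int *nsegs)
{
	vm_page_t ma[MAXPAGES];
	int count, error;

	/* Fault in and hold the backing pages; returns -1 on failure. */
	count = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    (vm_offset_t)uaddr, len, VM_PROT_READ | VM_PROT_WRITE, ma,
	    MAXPAGES);
	if (count == -1)
		return (EFAULT);

	/* Load the held pages; only physical addresses are used. */
	*nsegs = -1;
	error = _bus_dmamap_load_ma(tag, map, ma, len,
	    (vm_offset_t)uaddr & PAGE_MASK, BUS_DMA_NOWAIT, segs, nsegs);
	(*nsegs)++;
	if (error != 0)
		vm_page_unhold_pages(ma, count);
	return (error);
}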
From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 18:04:48 2015
Date: Wed, 29 Apr 2015 13:04:46 -0500
From: Jason Harmening
To: Konstantin Belousov
Cc: Svatopluk Kraus, John Baldwin, Adrian Chadd, Warner Losh, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

So, here's a patch that would add unmapped user bounce-buffer support
for the existing UIO_USERSPACE cases. I've only made sure it builds
(everywhere) and given it a quick check on amd64.

Things to note:
--no changes to sparc64 and intel dmar, because they don't use bounce
buffers
--effectively adds UIO_USERSPACE support for mips, which was a KASSERT
before
--I am worried about the cache maintenance operations for arm and
mips. I'm not an expert in non-coherent architectures. In particular,
I'm not sure what (if any) allowances need to be made for user VAs
that may be present in VIPT caches on other cores of SMP systems.
--the above point about cache maintenance also makes me wonder how it
should be handled for drivers that would use
vm_fault_quick_hold_pages() + bus_dmamap_load_ma(). Presumably, some
UVAs for the buffer could be present in caches for the same or another
core.
Index: sys/arm/arm/busdma_machdep-v6.c =================================================================== --- sys/arm/arm/busdma_machdep-v6.c (revision 282208) +++ sys/arm/arm/busdma_machdep-v6.c (working copy) @@ -1309,15 +1309,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t { struct bounce_page *bpage; struct sync_list *sl, *end; - /* - * If the buffer was from user space, it is possible that this is not - * the same vm map, especially on a POST operation. It's not clear that - * dma on userland buffers can work at all right now. To be safe, until - * we're able to test direct userland dma, panic on a map mismatch. - */ + if ((bpage = STAILQ_FIRST(&map->bpages)) != NULL) { - if (!pmap_dmap_iscurrent(map->pmap)) - panic("_bus_dmamap_sync: wrong user map for bounce sync."); CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x op 0x%x " "performing bounce", __func__, dmat, dmat->flags, op); @@ -1328,14 +1321,10 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t */ if (op & BUS_DMASYNC_PREWRITE) { while (bpage != NULL) { - if (bpage->datavaddr != 0) - bcopy((void *)bpage->datavaddr, - (void *)bpage->vaddr, - bpage->datacount); + if (bpage->datavaddr != 0 && pmap_dmap_iscurrent(map->pmap)) + bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount); else - physcopyout(bpage->dataaddr, - (void *)bpage->vaddr, - bpage->datacount); + physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount); cpu_dcache_wb_range((vm_offset_t)bpage->vaddr, bpage->datacount); l2cache_wb_range((vm_offset_t)bpage->vaddr, @@ -1396,14 +1385,10 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t arm_dcache_align; l2cache_inv_range(startv, startp, len); cpu_dcache_inv_range(startv, len); - if (bpage->datavaddr != 0) - bcopy((void *)bpage->vaddr, - (void *)bpage->datavaddr, - bpage->datacount); + if (bpage->datavaddr != 0 && pmap_dmap_iscurrent(map->pmap)) + bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount); else - physcopyin((void *)bpage->vaddr, - bpage->dataaddr, - bpage->datacount); + physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; @@ -1433,10 +1418,15 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t * that the sequence is inner-to-outer for PREREAD invalidation and * outer-to-inner for POSTREAD invalidation is not a mistake. */ +#ifndef ARM_L2_PIPT + /* + * If we don't have any physically-indexed caches, we don't need to do + * cache maintenance if we're not in the context that owns the VA. 
+ */ + if (!pmap_dmap_iscurrent(map->pmap)) + return; +#endif if (map->sync_count != 0) { - if (!pmap_dmap_iscurrent(map->pmap)) - panic("_bus_dmamap_sync: wrong user map for sync."); - sl = &map->slist[0]; end = &map->slist[map->sync_count]; CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x op 0x%x " @@ -1446,7 +1436,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t case BUS_DMASYNC_PREWRITE: case BUS_DMASYNC_PREWRITE | BUS_DMASYNC_PREREAD: while (sl != end) { - cpu_dcache_wb_range(sl->vaddr, sl->datacount); + if (pmap_dmap_iscurrent(map->pmap)) + cpu_dcache_wb_range(sl->vaddr, sl->datacount); l2cache_wb_range(sl->vaddr, sl->busaddr, sl->datacount); sl++; @@ -1472,7 +1463,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t l2cache_wb_range(sl->vaddr, sl->busaddr, 1); } - cpu_dcache_inv_range(sl->vaddr, sl->datacount); + if (pmap_dmap_iscurrent(map->pmap)) + cpu_dcache_inv_range(sl->vaddr, sl->datacount); l2cache_inv_range(sl->vaddr, sl->busaddr, sl->datacount); sl++; @@ -1487,7 +1479,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t while (sl != end) { l2cache_inv_range(sl->vaddr, sl->busaddr, sl->datacount); - cpu_dcache_inv_range(sl->vaddr, sl->datacount); + if (pmap_dmap_iscurrent(map->pmap)) + cpu_dcache_inv_range(sl->vaddr, sl->datacount); sl++; } break; Index: sys/arm/arm/busdma_machdep.c =================================================================== --- sys/arm/arm/busdma_machdep.c (revision 282208) +++ sys/arm/arm/busdma_machdep.c (working copy) @@ -131,7 +131,6 @@ struct bounce_page { struct sync_list { vm_offset_t vaddr; /* kva of bounce buffer */ - bus_addr_t busaddr; /* Physical address */ bus_size_t datacount; /* client data count */ }; @@ -177,6 +176,7 @@ struct bus_dmamap { STAILQ_ENTRY(bus_dmamap) links; bus_dmamap_callback_t *callback; void *callback_arg; + pmap_t pmap; int sync_count; struct sync_list *slist; }; @@ -831,7 +831,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma } static void -_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, +_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; @@ -851,10 +851,10 @@ static void vendaddr = (vm_offset_t)buf + buflen; while (vaddr < vendaddr) { - if (__predict_true(pmap == kernel_pmap)) + if (__predict_true(map->pmap == kernel_pmap)) paddr = pmap_kextract(vaddr); else - paddr = pmap_extract(pmap, vaddr); + paddr = pmap_extract(map->pmap, vaddr); if (run_filter(dmat, paddr) != 0) map->pagesneeded++; vaddr += PAGE_SIZE; @@ -1009,7 +1009,7 @@ _bus_dmamap_load_ma(bus_dma_tag_t dmat, bus_dmamap */ int _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, - bus_size_t buflen, struct pmap *pmap, int flags, bus_dma_segment_t *segs, + bus_size_t buflen, pmap_t pmap, int flags, bus_dma_segment_t *segs, int *segp) { bus_size_t sgsize; @@ -1023,8 +1023,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm if ((flags & BUS_DMA_LOAD_MBUF) != 0) map->flags |= DMAMAP_CACHE_ALIGNED; + map->pmap = pmap; + if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) { - _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); + _bus_dmamap_count_pages(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) @@ -1042,6 +1044,8 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm curaddr = pmap_kextract(vaddr); } else { curaddr = pmap_extract(pmap, vaddr); + if (curaddr == 0) + goto cleanup; map->flags &= ~DMAMAP_COHERENT; } @@ -1067,7 +1071,6 @@ 
_bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm sl++; sl->vaddr = vaddr; sl->datacount = sgsize; - sl->busaddr = curaddr; } else sl->datacount += sgsize; } @@ -1205,12 +1208,11 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap STAILQ_FOREACH(bpage, &map->bpages, links) { if (op & BUS_DMASYNC_PREWRITE) { - if (bpage->datavaddr != 0) - bcopy((void *)bpage->datavaddr, - (void *)bpage->vaddr, bpage->datacount); + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount); else - physcopyout(bpage->dataaddr, - (void *)bpage->vaddr,bpage->datacount); + physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount); cpu_dcache_wb_range(bpage->vaddr, bpage->datacount); cpu_l2cache_wb_range(bpage->vaddr, bpage->datacount); dmat->bounce_zone->total_bounced++; @@ -1218,12 +1220,11 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap if (op & BUS_DMASYNC_POSTREAD) { cpu_dcache_inv_range(bpage->vaddr, bpage->datacount); cpu_l2cache_inv_range(bpage->vaddr, bpage->datacount); - if (bpage->datavaddr != 0) - bcopy((void *)bpage->vaddr, - (void *)bpage->datavaddr, bpage->datacount); + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount); else - physcopyin((void *)bpage->vaddr, - bpage->dataaddr, bpage->datacount); + physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount); dmat->bounce_zone->total_bounced++; } } @@ -1243,7 +1244,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t _bus_dmamap_sync_bp(dmat, map, op); CTR3(KTR_BUSDMA, "%s: op %x flags %x", __func__, op, map->flags); bufaligned = (map->flags & DMAMAP_CACHE_ALIGNED); - if (map->sync_count) { + if (map->sync_count != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) { end = &map->slist[map->sync_count]; for (sl = &map->slist[0]; sl != end; sl++) bus_dmamap_sync_buf(sl->vaddr, sl->datacount, op, Index: sys/mips/mips/busdma_machdep.c =================================================================== --- sys/mips/mips/busdma_machdep.c (revision 282208) +++ sys/mips/mips/busdma_machdep.c (working copy) @@ -96,7 +96,6 @@ struct bounce_page { struct sync_list { vm_offset_t vaddr; /* kva of bounce buffer */ - bus_addr_t busaddr; /* Physical address */ bus_size_t datacount; /* client data count */ }; @@ -144,6 +143,7 @@ struct bus_dmamap { void *allocbuffer; TAILQ_ENTRY(bus_dmamap) freelist; STAILQ_ENTRY(bus_dmamap) links; + pmap_t pmap; bus_dmamap_callback_t *callback; void *callback_arg; int sync_count; @@ -725,7 +725,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma } static void -_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, +_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; @@ -747,9 +747,11 @@ static void while (vaddr < vendaddr) { bus_size_t sg_len; - KASSERT(kernel_pmap == pmap, ("pmap is not kernel pmap")); sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK); - paddr = pmap_kextract(vaddr); + if (map->pmap == kernel_pmap) + paddr = pmap_kextract(vaddr); + else + paddr = pmap_extract(map->pmap, vaddr); if (((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) && run_filter(dmat, paddr) != 0) { sg_len = roundup2(sg_len, dmat->alignment); @@ -895,7 +897,7 @@ _bus_dmamap_load_ma(bus_dma_tag_t dmat, bus_dmamap */ int 
_bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, - bus_size_t buflen, struct pmap *pmap, int flags, bus_dma_segment_t *segs, + bus_size_t buflen, pmap_t pmap, int flags, bus_dma_segment_t *segs, int *segp) { bus_size_t sgsize; @@ -908,8 +910,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm if (segs == NULL) segs = dmat->segments; + map->pmap = pmap; + if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) { - _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); + _bus_dmamap_count_pages(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) @@ -922,12 +926,11 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm while (buflen > 0) { /* * Get the physical address for this segment. - * - * XXX Don't support checking for coherent mappings - * XXX in user address space. */ - KASSERT(kernel_pmap == pmap, ("pmap is not kernel pmap")); - curaddr = pmap_kextract(vaddr); + if (pmap == kernel_pmap) + curaddr = pmap_kextract(vaddr); + else + curaddr = pmap_extract(pmap, vaddr); /* * Compute the segment size, and adjust counts. @@ -951,7 +954,6 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm sl++; sl->vaddr = vaddr; sl->datacount = sgsize; - sl->busaddr = curaddr; } else sl->datacount += sgsize; } @@ -1111,17 +1113,14 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap STAILQ_FOREACH(bpage, &map->bpages, links) { if (op & BUS_DMASYNC_PREWRITE) { - if (bpage->datavaddr != 0) + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) bcopy((void *)bpage->datavaddr, - (void *)(bpage->vaddr_nocache != 0 ? - bpage->vaddr_nocache : - bpage->vaddr), + (void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr), bpage->datacount); else physcopyout(bpage->dataaddr, - (void *)(bpage->vaddr_nocache != 0 ? - bpage->vaddr_nocache : - bpage->vaddr), + (void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr), bpage->datacount); if (bpage->vaddr_nocache == 0) { mips_dcache_wb_range(bpage->vaddr, @@ -1134,13 +1133,12 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap mips_dcache_inv_range(bpage->vaddr, bpage->datacount); } - if (bpage->datavaddr != 0) - bcopy((void *)(bpage->vaddr_nocache != 0 ? - bpage->vaddr_nocache : bpage->vaddr), + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr), (void *)bpage->datavaddr, bpage->datacount); else - physcopyin((void *)(bpage->vaddr_nocache != 0 ? - bpage->vaddr_nocache : bpage->vaddr), + physcopyin((void *)(bpage->vaddr_nocache != 0 ? 
bpage->vaddr_nocache : bpage->vaddr), bpage->dataaddr, bpage->datacount); dmat->bounce_zone->total_bounced++; } @@ -1164,7 +1162,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t return; CTR3(KTR_BUSDMA, "%s: op %x flags %x", __func__, op, map->flags); - if (map->sync_count) { + if (map->sync_count != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) { end = &map->slist[map->sync_count]; for (sl = &map->slist[0]; sl != end; sl++) bus_dmamap_sync_buf(sl->vaddr, sl->datacount, op); Index: sys/powerpc/powerpc/busdma_machdep.c =================================================================== --- sys/powerpc/powerpc/busdma_machdep.c (revision 282208) +++ sys/powerpc/powerpc/busdma_machdep.c (working copy) @@ -131,6 +131,7 @@ struct bus_dmamap { int nsegs; bus_dmamap_callback_t *callback; void *callback_arg; + pmap_t pmap; STAILQ_ENTRY(bus_dmamap) links; int contigalloc; }; @@ -596,7 +597,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma } static void -_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, +_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; @@ -619,10 +620,10 @@ static void bus_size_t sg_len; sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK); - if (pmap == kernel_pmap) + if (map->pmap == kernel_pmap) paddr = pmap_kextract(vaddr); else - paddr = pmap_extract(pmap, vaddr); + paddr = pmap_extract(map->pmap, vaddr); if (run_filter(dmat, paddr) != 0) { sg_len = roundup2(sg_len, dmat->alignment); map->pagesneeded++; @@ -785,8 +786,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, if (segs == NULL) segs = map->segments; + map->pmap = pmap; + if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) { - _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); + _bus_dmamap_count_pages(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) @@ -905,14 +908,11 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t if (op & BUS_DMASYNC_PREWRITE) { while (bpage != NULL) { - if (bpage->datavaddr != 0) - bcopy((void *)bpage->datavaddr, - (void *)bpage->vaddr, - bpage->datacount); + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount); else - physcopyout(bpage->dataaddr, - (void *)bpage->vaddr, - bpage->datacount); + physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; @@ -920,13 +920,11 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t if (op & BUS_DMASYNC_POSTREAD) { while (bpage != NULL) { - if (bpage->datavaddr != 0) - bcopy((void *)bpage->vaddr, - (void *)bpage->datavaddr, - bpage->datacount); + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount); else - physcopyin((void *)bpage->vaddr, - bpage->dataaddr, bpage->datacount); + physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; Index: sys/x86/x86/busdma_bounce.c =================================================================== --- sys/x86/x86/busdma_bounce.c (revision 282208) +++ sys/x86/x86/busdma_bounce.c (working copy) @@ -121,6 +121,7 @@ struct bus_dmamap { struct memdesc mem; bus_dmamap_callback_t 
*callback; void *callback_arg; + pmap_t pmap; STAILQ_ENTRY(bus_dmamap) links; }; @@ -139,7 +140,7 @@ static bus_addr_t add_bounce_page(bus_dma_tag_t dm static void free_bounce_page(bus_dma_tag_t dmat, struct bounce_page *bpage); int run_filter(bus_dma_tag_t dmat, bus_addr_t paddr); static void _bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, - pmap_t pmap, void *buf, bus_size_t buflen, + void *buf, bus_size_t buflen, int flags); static void _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dmamap_t map, vm_paddr_t buf, bus_size_t buflen, @@ -491,7 +492,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma } static void -_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, +_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; @@ -515,10 +516,10 @@ static void while (vaddr < vendaddr) { sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK); - if (pmap == kernel_pmap) + if (map->pmap == kernel_pmap) paddr = pmap_kextract(vaddr); else - paddr = pmap_extract(pmap, vaddr); + paddr = pmap_extract(map->pmap, vaddr); if (bus_dma_run_filter(&dmat->common, paddr) != 0) { sg_len = roundup2(sg_len, dmat->common.alignment); @@ -668,12 +669,14 @@ bounce_bus_dmamap_load_buffer(bus_dma_tag_t dmat, if (map == NULL) map = &nobounce_dmamap; + else + map->pmap = pmap; if (segs == NULL) segs = dmat->segments; if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0) { - _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); + _bus_dmamap_count_pages(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) @@ -775,15 +778,11 @@ bounce_bus_dmamap_sync(bus_dma_tag_t dmat, bus_dma if ((op & BUS_DMASYNC_PREWRITE) != 0) { while (bpage != NULL) { - if (bpage->datavaddr != 0) { - bcopy((void *)bpage->datavaddr, - (void *)bpage->vaddr, - bpage->datacount); - } else { - physcopyout(bpage->dataaddr, - (void *)bpage->vaddr, - bpage->datacount); - } + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount); + else + physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; @@ -791,15 +790,11 @@ bounce_bus_dmamap_sync(bus_dma_tag_t dmat, bus_dma if ((op & BUS_DMASYNC_POSTREAD) != 0) { while (bpage != NULL) { - if (bpage->datavaddr != 0) { - bcopy((void *)bpage->vaddr, - (void *)bpage->datavaddr, - bpage->datacount); - } else { - physcopyin((void *)bpage->vaddr, - bpage->dataaddr, - bpage->datacount); - } + if (bpage->datavaddr != 0 && + (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) + bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount); + else + physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount); bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 18:50:32 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 96E23A34; Wed, 29 Apr 2015 18:50:32 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client 
did not present a certificate)
Date: Wed, 29 Apr 2015 21:50:19 +0300
From: Konstantin Belousov
To: Jason Harmening
Cc: Svatopluk Kraus, John Baldwin, Adrian Chadd, Warner Losh, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

On Wed, Apr 29, 2015 at 01:04:46PM -0500, Jason Harmening wrote:
> So, here's a patch that would add unmapped user bounce-buffer support
> for the existing UIO_USERSPACE cases. I've only made sure it builds
> (everywhere) and given it a quick check on amd64.
>
> Things to note:
> --no changes to sparc64 and intel dmar, because they don't use bounce
> buffers
> --effectively adds UIO_USERSPACE support for mips, which was a
> KASSERT before
> --I am worried about the cache maintenance operations for arm and
> mips. I'm not an expert in non-coherent architectures. In particular,
> I'm not sure what (if any) allowances need to be made for user VAs
> that may be present in VIPT caches on other cores of SMP systems.
> --the above point about cache maintenance also makes me wonder how it
> should be handled for drivers that would use
> vm_fault_quick_hold_pages() + bus_dmamap_load_ma(). Presumably, some
> UVAs for the buffer could be present in caches for the same or
> another core.

The spaces/tabs in your mail are damaged. It does not matter in the
text, but it makes the patch hard to read and impossible to apply.

I only read the x86/busdma_bounce.c part. It looks fine in the part
where you add the test for the current pmap being identical to the
pmap owning the user page mapping.

I do not understand the part of the diff for the bcopy/physcopyout
lines: I cannot find non-whitespace changes there, and a whitespace
change would make the lines too long. Did I misread the patch?

BTW, why not use physcopyout() unconditionally on x86? To avoid i386
sfbuf allocation failures?

For non-coherent arches, isn't the issue of CPUs having filled caches
for the DMA region present regardless of the vm_fault_quick_hold()
use?
DMASYNC_PREREAD/WRITE must ensure that the lines are written back and
invalidated even now, or always fall back to using bounce pages.

> [full patch quoted in the original; snipped]


From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 19:17:51 2015
Date: Wed, 29 Apr 2015 14:17:50 -0500
From: Jason Harmening
To: Konstantin Belousov
Cc: Svatopluk Kraus, John Baldwin, Adrian Chadd, Warner Losh, freebsd-arch
Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space

> The spaces/tabs in your mail are damaged. It does not matter in the
> text, but it makes the patch hard to read and impossible to apply.

Ugh. I'm at work right now and using the gmail web client. It seems
like every day I find a new way in which that thing is incredibly
unfriendly for use with mailing lists. I will re-post the patch from a
sane mail client later.

> I only read the x86/busdma_bounce.c part. It looks fine in the part
> where you add the test for the current pmap being identical to the
> pmap owning the user page mapping.
>
> I do not understand the part of the diff for the bcopy/physcopyout
> lines: I cannot find non-whitespace changes there, and a whitespace
> change would make the lines too long. Did I misread the patch?

You probably misread it, since it is unreadable. There is a section in
bounce_bus_dmamap_sync() where I check for map->pmap being kernel_pmap
or curproc's pmap before doing the bcopy.

> BTW, why not use physcopyout() unconditionally on x86? To avoid i386
> sfbuf allocation failures?

Yes.

> For non-coherent arches, isn't the issue of CPUs having filled caches
> for the DMA region present regardless of the vm_fault_quick_hold()
> use?
> DMASYNC_PREREAD/WRITE must ensure that the lines are written back and > invalidated even now, or always fall back to use bounce page. > > Yes, that needs to be done regardless of how the pages are wired. The particular problem here is that some caches on arm and mips are virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)). That means the flush/invalidate instructions need virtual addresses, so figuring out the correct UVA to use for those could be a challenge. As I understand it, VIPT caches usually do have some hardware logic for finding all the cachelines that correspond to a physical address, so they can handle multiple VA mappings of the same PA. But it is unclear to me how cross-processor cache maintenance is supposed to work with VIPT caches on SMP systems. If the caches were physically-indexed, then I don't think there would be an issue. You'd just pass the PA to the flush/invalidate instruction, and presumably a sane SMP implementation would propagate that to other cores via IPI. From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 19:33:44 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 38C45F1B; Wed, 29 Apr 2015 19:33:44 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B6AF61D86; Wed, 29 Apr 2015 19:33:43 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3TJXb7x062579 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 29 Apr 2015 22:33:37 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3TJXb7x062579 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3TJXbHf062578; Wed, 29 Apr 2015 22:33:37 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 29 Apr 2015 22:33:37 +0300 From: Konstantin Belousov To: Jason Harmening Cc: Svatopluk Kraus , John Baldwin , Adrian Chadd , Warner Losh , freebsd-arch Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Message-ID: <20150429193337.GQ2390@kib.kiev.ua> References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 19:33:44 -0000 On Wed, Apr 29, 2015 at 02:17:50PM -0500, Jason Harmening wrote: > > > > > > The spaces/tabs in your mail are damaged. 
It does not matter in the
> > text, but makes the patch unapplicable and hardly readable.
> >
> Ugh. I'm at work right now and using the gmail web client. It seems like
> every day I find a new way in which that thing is incredibly unfriendly for
> use with mailing lists.
> I will re-post the patch from a sane mail client later.
>
> >
> > I only read the x86/busdma_bounce.c part. It looks fine in the part
> > where you add the test for the current pmap being identical to the pmap
> > owning the user page mapping.
> >
> > I do not understand the part of the diff for bcopy/physcopyout lines,
> > I cannot find non-whitespace changes there, and whitespace change would
> > make too long line. Did I misread the patch ?\
> >
> You probably misread it, since it is unreadable. There is a section in
> bounce_bus_dmamap_sync() where I check for map->pmap being kernel_pmap or
> curproc's pmap before doing bcopy.

See the paragraph in my mail before the one you answered. I am asking
about the bcopy()/physcopyout() lines in the diff, not about the if ()
conditions change. The latter is definitely fine.

> >
> > BTW, why not use physcopyout() unconditionally on x86 ? To avoid i386 sfbuf
> > allocation failures ?
> >
> Yes.
>
> >
> > For non-coherent arches, isn't the issue of CPUs having filled caches
> > for the DMA region present regardless of the vm_fault_quick_hold() use ?
> > DMASYNC_PREREAD/WRITE must ensure that the lines are written back and
> > invalidated even now, or always fall back to use bounce page.
> >
>
> Yes, that needs to be done regardless of how the pages are wired. The
> particular problem here is that some caches on arm and mips are
> virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)).
> That means the flush/invalidate instructions need virtual addresses, so
> figuring out the correct UVA to use for those could be a challenge. As I
> understand it, VIPT caches usually do have some hardware logic for finding
> all the cachelines that correspond to a physical address, so they can
> handle multiple VA mappings of the same PA. But it is unclear to me how
> cross-processor cache maintenance is supposed to work with VIPT caches on
> SMP systems.
>
> If the caches were physically-indexed, then I don't think there would be an
> issue. You'd just pass the PA to the flush/invalidate instruction, and
> presumably a sane SMP implementation would propagate that to other cores
> via IPI.

Even without SMP, a VIPT cache cannot hold two mappings of the same page.
As I understand it, sometimes it is more involved, e.g. if mappings have
the correct color (e.g. on ultrasparcs), then the cache can deal with
aliasing. Otherwise pmap has to map the page uncached for all mappings.

I do not see what would make this case special for SMP after that.
Cache invalidation would be either not needed, or coherency domain
propagation of the virtual address does the right thing.
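For reference, the pmap test discussed above reduces to something like the
following helper (the helper name is invented; the patch itself open-codes
the condition at each copy site):

	static bool
	busdma_map_pmap_is_current(bus_dmamap_t map)
	{
		/*
		 * Kernel-loaded maps can always be bcopy()ed; a map
		 * loaded from a user pmap only if we are running in
		 * the owning process' context.
		 */
		return (map->pmap == kernel_pmap ||
		    map->pmap == vmspace_pmap(curproc->p_vmspace));
	}

	...
	if (bpage->datavaddr != 0 && busdma_map_pmap_is_current(map))
		bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr,
		    bpage->datacount);
	else
		physcopyout(bpage->dataaddr, (void *)bpage->vaddr,
		    bpage->datacount);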
From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 19:59:05 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 983235A0; Wed, 29 Apr 2015 19:59:05 +0000 (UTC) Received: from mail-ig0-x230.google.com (mail-ig0-x230.google.com [IPv6:2607:f8b0:4001:c05::230]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 5AB801FDA; Wed, 29 Apr 2015 19:59:05 +0000 (UTC) Received: by igblo3 with SMTP id lo3so127421146igb.1; Wed, 29 Apr 2015 12:59:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=7Q2lep1yKgOp92i7Jdy38JoyONV+nE+5XTwrfLLxzoI=; b=VPN9c9PYvN3SoDQXW8WhgzcdR1qW7RKl8CrP+jmzU+6iGnFjxlfVpFKIBRvky0u8Jv IgGeqQ3ImTLeI/JfLuTk+XEIKX3P9UMOyDhhXppT5mVMDzgr71FIjpnpuq1oT4qdFHI5 kRaX6HqJYygTczNV2x7S/AEvcUl1V5EqRLI+YIR5ZP53hFUuiL3AOnVQe9Ur8iEuoMOK t4HpM8XnAmO3TpXwc02WCXLGXb/vAOjtNVafvfkGYrgKOxjh/2Zp7nH66FB00X7nzXCv R0ktxAahRQ2iKIvmCwXzWr5iYDgH2REsPI3wNL8F8Gt8QB34B4LGeN4n1LX/6P1gT2nW 2afQ== MIME-Version: 1.0 X-Received: by 10.50.41.8 with SMTP id b8mr29704345igl.38.1430337542406; Wed, 29 Apr 2015 12:59:02 -0700 (PDT) Received: by 10.36.106.70 with HTTP; Wed, 29 Apr 2015 12:59:02 -0700 (PDT) In-Reply-To: <20150429193337.GQ2390@kib.kiev.ua> References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> Date: Wed, 29 Apr 2015 14:59:02 -0500 Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Jason Harmening To: Konstantin Belousov Cc: Svatopluk Kraus , John Baldwin , Adrian Chadd , Warner Losh , freebsd-arch Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 19:59:05 -0000 > > See the paragraph in my mail before the one you answered. > I am asking about the bcopy()/physcopyout() lines in diff, not about > the if () conditions change. The later is definitely fine. > Oh, yes, sorry. There were a couple of whitespace changes there, but nothing of consequence. > Even without SMP, VIPT cache cannot hold two mappings of the same page. > As I understand, sometimes it is more involved, eg if mappings have > correct color (eg. on ultrasparcs), then cache can deal with aliasing. > Otherwise pmap has to map the page uncached for all mappings. > Yes, you are right. Regardless of whatever logic the cache uses (or doesn't use), FreeBSD's page-coloring scheme should prevent that. > > I do not see what would make this case special for SMP after that. > Cache invalidation would be either not needed, or coherency domain > propagation of the virtual address does the right thing. 
> Since VIPT cache operations require a virtual address, I'm wondering about the case where different processes are running on different cores, and the same UVA corresponds to a completely different physical page for each of those processes. If the d-cache for each core contains that UVA, then what does it mean when one core issues a flush/invalidate instruction for that UVA? Admittedly, there's a lot I don't know about how that's supposed to work in the arm/mips SMP world. For all I know, the SMP targets could be fully-snooped and we don't need to worry about cache maintenance at all. From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 20:05:54 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4120E79B for ; Wed, 29 Apr 2015 20:05:54 +0000 (UTC) Received: from mail-pa0-f46.google.com (mail-pa0-f46.google.com [209.85.220.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 0D15F10AD for ; Wed, 29 Apr 2015 20:05:53 +0000 (UTC) Received: by pacwv17 with SMTP id wv17so37335632pac.0 for ; Wed, 29 Apr 2015 13:05:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:subject:mime-version:content-type:from :in-reply-to:date:cc:message-id:references:to; bh=zf3QgncH7y1pFVp9phBTSTMoawrxRjhv0poiCdeXuQc=; b=eL4eUBaN39AYt6T5V9Wc6UH9iP4wxTLRCaT/R1hemV9xbeN0NrqPu+BqC+BtMDQ8q8 tOyA/Ne9WFyCWJJe6NvlV3PhR+BjgB+JmGv0IYbObyLgqnlFDhrOKgUdk1KBWuz59Y60 pTe+MSzkOrGq7ow7V0Dr5VO0YJA1cVweaQgrHB0cBg94O4WvIDCbGjLtXk+WLbcAUOcq 6qaEpCSFq0pu7b8FSlnipyGo5XlZUaGYTADJEuPse5jf9CR+JIKfB0IqlDLflq6jl555 B1PsT0ebsLUOJrxgIPZ4uj+FAfbR3nVEq8EfCTB4RGjzRhYfbDvPFBn74+BV5at5EGfe 36CA== X-Gm-Message-State: ALoCoQm/YnWkcWlGZSVUXKZAAvWRupzCYzj2hUmjforXY3ItcKfydrsFjPrbYL1hU0Uf05SKDVFc X-Received: by 10.70.124.233 with SMTP id ml9mr1432149pdb.9.1430337946909; Wed, 29 Apr 2015 13:05:46 -0700 (PDT) Received: from lgwl-sram.corp.netflix.com ([69.53.236.236]) by mx.google.com with ESMTPSA id c8sm32559pdj.65.2015.04.29.13.05.44 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 29 Apr 2015 13:05:45 -0700 (PDT) Sender: Warner Losh Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2098\)) Content-Type: multipart/signed; boundary="Apple-Mail=_BE24FC7E-A878-4059-963E-1A19E29BB82A"; protocol="application/pgp-signature"; micalg=pgp-sha512 X-Pgp-Agent: GPGMail 2.5b6 From: Warner Losh In-Reply-To: Date: Wed, 29 Apr 2015 14:05:42 -0600 Cc: Konstantin Belousov , Svatopluk Kraus , John Baldwin , Adrian Chadd , freebsd-arch Message-Id: <9807ECB0-5218-42D1-9BD9-94F6BB5C69C8@bsdimp.com> References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> To: Jason Harmening X-Mailer: Apple Mail (2.2098) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 20:05:54 -0000 
> On Apr 29, 2015, at 1:17 PM, Jason Harmening wrote:
>
> Yes, that needs to be done regardless of how the pages are wired. The particular problem here is that some caches on arm and mips are virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)). That means the flush/invalidate instructions need virtual addresses, so figuring out the correct UVA to use for those could be a challenge. As I understand it, VIPT caches usually do have some hardware logic for finding all the cachelines that correspond to a physical address, so they can handle multiple VA mappings of the same PA. But it is unclear to me how cross-processor cache maintenance is supposed to work with VIPT caches on SMP systems.
>
> If the caches were physically-indexed, then I don't think there would be an issue. You'd just pass the PA to the flush/invalidate instruction, and presumably a sane SMP implementation would propagate that to other cores via IPI.

I know on MIPS you cannot have more than one mapping to a page you are doing DMA to/from ever.

Warner

From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 22:23:38 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 21945B87; Wed, 29 Apr 2015 22:23:38 +0000 (UTC) Received: from relay.mailchannels.net (tkt-001-i373.relay.mailchannels.net [174.136.5.175]) by mx1.freebsd.org (Postfix) with ESMTP id 08EF91004; Wed, 29 Apr 2015 22:23:36 +0000 (UTC) X-Sender-Id: duocircle|x-authuser|hippie Received: from smtp2.ore.mailhop.org (ip-10-204-4-183.us-west-2.compute.internal [10.204.4.183]) by relay.mailchannels.net (Postfix) with ESMTPA id 5FF60A11A0; Wed, 29 Apr 2015 22:23:28 +0000 (UTC) X-Sender-Id: duocircle|x-authuser|hippie Received: from smtp2.ore.mailhop.org (smtp2.ore.mailhop.org [10.45.8.167]) (using TLSv1 with cipher DHE-RSA-AES256-SHA) by 0.0.0.0:2500 (trex/5.4.8); Wed, 29 Apr 2015 22:23:28 +0000 X-MC-Relay: Neutral X-MailChannels-SenderId: duocircle|x-authuser|hippie X-MailChannels-Auth-Id: duocircle X-MC-Loop-Signature:
1430346208532:536895691 X-MC-Ingress-Time: 1430346208532 Received: from c-73-34-117-227.hsd1.co.comcast.net ([73.34.117.227] helo=ilsoft.org) by smtp2.ore.mailhop.org with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.82) (envelope-from ) id 1YnaO7-0006MA-7y; Wed, 29 Apr 2015 22:23:27 +0000 Received: from revolution.hippie.lan (revolution.hippie.lan [172.22.42.240]) by ilsoft.org (8.14.9/8.14.9) with ESMTP id t3TMNOWI050105; Wed, 29 Apr 2015 16:23:24 -0600 (MDT) (envelope-from ian@freebsd.org) X-Mail-Handler: DuoCircle Outbound SMTP X-Originating-IP: 73.34.117.227 X-Report-Abuse-To: abuse@duocircle.com (see https://support.duocircle.com/support/solutions/articles/5000540958-duocircle-standard-smtp-abuse-information for abuse reporting information) X-MHO-User: U2FsdGVkX1/ky0KcCx9j8ENGWudC41dk Message-ID: <1430346204.1157.107.camel@freebsd.org> Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Ian Lepore To: Jason Harmening Cc: Konstantin Belousov , Adrian Chadd , Svatopluk Kraus , freebsd-arch Date: Wed, 29 Apr 2015 16:23:24 -0600 In-Reply-To: References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> Content-Type: text/plain; charset="us-ascii" X-Mailer: Evolution 3.12.10 FreeBSD GNOME Team Port Mime-Version: 1.0 Content-Transfer-Encoding: 7bit X-AuthUser: hippie X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 22:23:38 -0000 On Wed, 2015-04-29 at 14:59 -0500, Jason Harmening wrote: > > > > Even without SMP, VIPT cache cannot hold two mappings of the same page. > > As I understand, sometimes it is more involved, eg if mappings have > > correct color (eg. on ultrasparcs), then cache can deal with aliasing. > > Otherwise pmap has to map the page uncached for all mappings. > > > > Yes, you are right. Regardless of whatever logic the cache uses (or > doesn't use), FreeBSD's page-coloring scheme should prevent that. > > > > > > I do not see what would make this case special for SMP after that. > > Cache invalidation would be either not needed, or coherency domain > > propagation of the virtual address does the right thing. > > > > Since VIPT cache operations require a virtual address, I'm wondering about > the case where different processes are running on different cores, and the > same UVA corresponds to a completely different physical page for each of > those processes. If the d-cache for each core contains that UVA, then what > does it mean when one core issues a flush/invalidate instruction for that > UVA? > > Admittedly, there's a lot I don't know about how that's supposed to work in > the arm/mips SMP world. For all I know, the SMP targets could be > fully-snooped and we don't need to worry about cache maintenance at all. For what we call armv6 (which is mostly armv7)... The cache maintenance operations require virtual addresses, which means it looks a lot like a VIPT cache. Under the hood the implementation behaves as if it were a PIPT cache so even in the presence of multiple mappings of the same physical page into different virtual addresses, the SMP coherency hardware works correctly. The ARM ARM says... 
	[Stuff about ARMv6 and page coloring when a cache way exceeds 4K.]

	ARMv7 does not support page coloring, and requires that all data
	and unified caches behave as Physically Indexed Physically
	Tagged (PIPT) caches.

The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
that neatly sidesteps the aliasing problem that requires page coloring
solutions. So on modern arm chips we get to act like we've got PIPT data
caches, but with the quirk that cache ops are initiated by virtual
address.

Basically, when you perform a cache maintenance operation, a translation
table walk is done on the core that issued the cache op, then from that
point on the physical address is used within the cache hardware and
that's what gets broadcast to the other cores by the snoop control unit
or cache coherency fabric (depending on the chip).

Not that it's germane to this discussion, but an ARM instruction cache
can really be VIPT with no "behave as if" restrictions in the spec. That
means when doing i-cache maintenance on a virtual address that could be
multiply-mapped, our only option is a rather expensive all-cores
"invalidate entire i-cache and branch predictor cache".

For the older armv4/v5 world, which is VIVT, we have a restriction that
a page that is multiply-mapped cannot have cache enabled (it's handled
in pmap). That's also probably not very germane to this discussion,
because it doesn't seem likely that anyone is going to try to add
physical IO or userspace DMA support to that old code.

-- Ian

From owner-freebsd-arch@FreeBSD.ORG Wed Apr 29 23:10:19 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 933F736D; Wed, 29 Apr 2015 23:10:19 +0000 (UTC) Received: from mail-ie0-x232.google.com (mail-ie0-x232.google.com [IPv6:2607:f8b0:4001:c03::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 563E41583; Wed, 29 Apr 2015 23:10:19 +0000 (UTC) Received: by iebrs15 with SMTP id rs15so54806421ieb.3; Wed, 29 Apr 2015 16:10:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=e1bzNf8ir1N1xE/m9tDofiB92Z7aU/E9Gv+c76cKChA=; b=A2ITE4Tx6q3GF96Wq9KVWPgKtK815kLx0PaYNdONCs2YHxIBDYTNvuIhoxHgWACjY0 YqI3JTh753xMxtbmpMy9FSUJnudc1elZUvLolAIWlBz1/u3Pcm5cGooVe/N4mwV3AgGp GMyVY8fvxCjuFOMRAMAyrKwf0riWrYR1mp0rYTOCGAPjF/m4YwYV12rsChKsqW5mqXyp WQe4WDKDqebMGEVLNJDt5lFUdU3Bvo/fogzRDDhNBx9G9q0VH7AWjnpcxGnHixytZobZ 5oNF7IqaQHTG9efAp3qCnYypc8pFq8FX0SAByn/iWlfhdPbn5yQVipX43IkOw3nTAOYE 3T3A== MIME-Version: 1.0 X-Received: by 10.107.9.67 with SMTP id j64mr1964837ioi.39.1430349018698; Wed, 29 Apr 2015 16:10:18 -0700 (PDT) Received: by 10.36.106.70 with HTTP; Wed, 29 Apr 2015 16:10:18 -0700 (PDT) In-Reply-To: <1430346204.1157.107.camel@freebsd.org> References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> <1430346204.1157.107.camel@freebsd.org> Date: Wed, 29 Apr 2015 18:10:18 -0500 Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space
From: Jason Harmening To: Ian Lepore Cc: Konstantin Belousov , Adrian Chadd , Svatopluk Kraus , freebsd-arch Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Apr 2015 23:10:19 -0000 > > > For what we call armv6 (which is mostly armv7)... > > The cache maintenance operations require virtual addresses, which means > it looks a lot like a VIPT cache. Under the hood the implementation > behaves as if it were a PIPT cache so even in the presence of multiple > mappings of the same physical page into different virtual addresses, the > SMP coherency hardware works correctly. > > The ARM ARM says... > > [Stuff about ARMv6 and page coloring when a cache way exceeds > 4K.] > > ARMv7 does not support page coloring, and requires that all data > and unified caches behave as Physically Indexed Physically > Tagged (PIPT) caches. > > The only true armv6 chip we support isn't SMP and has a 16K/4-way cache > that neatly sidesteps the aliasing problem that requires page coloring > solutions. So modern arm chips we get to act like we've got PIPT data > caches, but with the quirk that cache ops are initiated by virtual > address. > Cool, thanks for the explanation! To satisfy my own curiosity, since it "looks like VIPT", does that mean we still have to flush the cache on context switch? > > Basically, when you perform a cache maintainence operation, a > translation table walk is done on the core that issued the cache op, > then from that point on the physical address is used within the cache > hardware and that's what gets broadcast to the other cores by the snoop > control unit or cache coherency fabric (depending on the chip). So, if we go back to the original problem of wanting to do bus_dmamap_sync() on userspace buffers from some asynchronous context: Say the process that owns the buffer is running on one core and prefetches some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is done by a kernel thread running on another core. Since the core running the kernel thread is responsible for the TLB lookup to get the physical address, then since that core has no UVA the cache ops will be treated as misses and the cacheline on the core that owns the UVA won't be invalidated, correct? That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c should stay? Sort of the same problem would apply to drivers using vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance, since there are no VAs to operate on. Indeed, both arm and mips implementation of _bus_dmamap_load_phys don't do anything with the sync list. 
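For reference, the held-pages path being discussed looks roughly like this
from a driver's point of view (a sketch only, with error handling elided;
_bus_dmamap_load_ma() is the underscored MD entry point mentioned earlier,
and MAXPAGES/MAXSEGS are illustrative sizes):

	vm_page_t ma[MAXPAGES];
	bus_dma_segment_t segs[MAXSEGS];
	int error, n, nsegs;

	n = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    uva, len, VM_PROT_READ | VM_PROT_WRITE, ma, MAXPAGES);
	if (n < 0)
		return (EFAULT);
	nsegs = -1;		/* the loader increments as it fills segs */
	error = _bus_dmamap_load_ma(dmat, map, ma, len, uva & PAGE_MASK,
	    BUS_DMA_NOWAIT, segs, &nsegs);
	/* ... start DMA, bus_dmamap_sync(), bus_dmamap_unload() ... */
	vm_page_unhold_pages(ma, n);

On arm and mips nothing lands on the sync list along this path today,
which is exactly the gap described above.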
From owner-freebsd-arch@FreeBSD.ORG Thu Apr 30 04:13:46 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B299B530; Thu, 30 Apr 2015 04:13:46 +0000 (UTC) Received: from mail-ig0-x229.google.com (mail-ig0-x229.google.com [IPv6:2607:f8b0:4001:c05::229]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 7427017EE; Thu, 30 Apr 2015 04:13:46 +0000 (UTC) Received: by iget9 with SMTP id t9so3713992ige.1; Wed, 29 Apr 2015 21:13:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=IxMUAtHmz0kI9UGh/ASkTjjQpgipqh5+cPEN50qx2eU=; b=HDW95hATGOmFWppbnYphTjSPXufwYWmx4YKfrtHnSjk4oXwH6HSdPXr/LnE+KNMPjf 9Jp1/op197oMnhfcrnqS/LE5wN5Y0kWEy3p95TkiXjuwrel1pzFCd2zY673y3ZBtuLUp CODHjsfQt0xqaiYd67/sG86m88uIhlBcH7AXvPNEQZXWNJSK35EYGhzWSwVQB4BhKiK9 gxhXvnYO6L+gmELRjQV7UdSMvj3s6F/1rpVnDN8lQKXXCMSH2N0WxjJj9oAshxwVXKyu 869h5rOYKB9SVOhvQUrvngQvPBFYGjbvCARjR1QJxmbVxhnpG+SapK9DSupocuQPuU5g 8EYQ== MIME-Version: 1.0 X-Received: by 10.50.72.8 with SMTP id z8mr1102031igu.36.1430367225931; Wed, 29 Apr 2015 21:13:45 -0700 (PDT) Received: by 10.36.106.70 with HTTP; Wed, 29 Apr 2015 21:13:45 -0700 (PDT) In-Reply-To: References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> <1430346204.1157.107.camel@freebsd.org> Date: Wed, 29 Apr 2015 23:13:45 -0500 Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Jason Harmening To: Ian Lepore Cc: Konstantin Belousov , Adrian Chadd , Svatopluk Kraus , freebsd-arch Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Apr 2015 04:13:46 -0000 On Wed, Apr 29, 2015 at 6:10 PM, Jason Harmening wrote: > >> For what we call armv6 (which is mostly armv7)... >> >> The cache maintenance operations require virtual addresses, which means >> it looks a lot like a VIPT cache. Under the hood the implementation >> behaves as if it were a PIPT cache so even in the presence of multiple >> mappings of the same physical page into different virtual addresses, the >> SMP coherency hardware works correctly. >> >> The ARM ARM says... >> >> [Stuff about ARMv6 and page coloring when a cache way exceeds >> 4K.] >> >> ARMv7 does not support page coloring, and requires that all data >> and unified caches behave as Physically Indexed Physically >> Tagged (PIPT) caches. >> >> The only true armv6 chip we support isn't SMP and has a 16K/4-way cache >> that neatly sidesteps the aliasing problem that requires page coloring >> solutions. So modern arm chips we get to act like we've got PIPT data >> caches, but with the quirk that cache ops are initiated by virtual >> address. >> > > Cool, thanks for the explanation! 
> To satisfy my own curiosity, since it "looks like VIPT", does that mean we > still have to flush the cache on context switch? > > >> >> Basically, when you perform a cache maintainence operation, a >> translation table walk is done on the core that issued the cache op, >> then from that point on the physical address is used within the cache >> hardware and that's what gets broadcast to the other cores by the snoop >> control unit or cache coherency fabric (depending on the chip). > > > So, if we go back to the original problem of wanting to do > bus_dmamap_sync() on userspace buffers from some asynchronous context: > > Say the process that owns the buffer is running on one core and prefetches > some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is > done by a kernel thread running on another core. Since the core running > the kernel thread is responsible for the TLB lookup to get the physical > address, then since that core has no UVA the cache ops will be treated as > misses and the cacheline on the core that owns the UVA won't be > invalidated, correct? > > That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c > should stay? > > Sort of the same problem would apply to drivers using > vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance, > since there are no VAs to operate on. Indeed, both arm and mips > implementation of _bus_dmamap_load_phys don't do anything with the sync > list. > It occurs to me that one way to deal with both the blocking-sfbuf for physcopy and VIPT cache maintenance might be to have a reserved per-CPU KVA page. For arches that don't have a direct map, the idea would be to grab a critical section, copy the bounce page or do cache maintenance on the synclist entry, then drop the critical section. That brought up a dim memory I had of Linux doing something similar, and in fact it seems to use kmap_atomic for both cache ops and bounce buffers. 
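A minimal sketch of that idea, assuming a page of KVA reserved per CPU at
boot (all names here are invented, and a real version would need
arch-specific TLB handling after the kremove):

	static vm_offset_t busdma_percpu_kva[MAXCPU];

	static void
	busdma_percpu_copyout(vm_paddr_t src, void *dst, size_t len)
	{
		vm_offset_t kva;

		KASSERT(len <= PAGE_SIZE - (src & PAGE_MASK),
		    ("copy crosses a page boundary"));
		critical_enter();	/* no preemption or migration */
		kva = busdma_percpu_kva[curcpu];
		pmap_kenter(kva, trunc_page(src));
		bcopy((char *)kva + (src & PAGE_MASK), dst, len);
		pmap_kremove(kva);	/* local TLB invalidation assumed */
		critical_exit();
	}

The same temporary mapping could also supply the VA handed to the cache
maintenance ops on the VIPT-flavored arches.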
From owner-freebsd-arch@FreeBSD.ORG Thu Apr 30 08:38:38 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A4F49446; Thu, 30 Apr 2015 08:38:38 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 47685136A; Thu, 30 Apr 2015 08:38:38 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t3U8cWuR049983 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Thu, 30 Apr 2015 11:38:32 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t3U8cWuR049983 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id t3U8cW59049982; Thu, 30 Apr 2015 11:38:32 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 30 Apr 2015 11:38:32 +0300 From: Konstantin Belousov To: Ian Lepore Cc: Jason Harmening , Adrian Chadd , Svatopluk Kraus , freebsd-arch Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space Message-ID: <20150430083832.GR2390@kib.kiev.ua> References: <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> <1430346204.1157.107.camel@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1430346204.1157.107.camel@freebsd.org> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Apr 2015 08:38:38 -0000 On Wed, Apr 29, 2015 at 04:23:24PM -0600, Ian Lepore wrote: > For what we call armv6 (which is mostly armv7)... > > The cache maintenance operations require virtual addresses, which means > it looks a lot like a VIPT cache. Under the hood the implementation > behaves as if it were a PIPT cache so even in the presence of multiple > mappings of the same physical page into different virtual addresses, the > SMP coherency hardware works correctly. > > The ARM ARM says... > > [Stuff about ARMv6 and page coloring when a cache way exceeds > 4K.] > > ARMv7 does not support page coloring, and requires that all data > and unified caches behave as Physically Indexed Physically > Tagged (PIPT) caches. > > The only true armv6 chip we support isn't SMP and has a 16K/4-way cache > that neatly sidesteps the aliasing problem that requires page coloring > solutions. So modern arm chips we get to act like we've got PIPT data > caches, but with the quirk that cache ops are initiated by virtual > address. 
> Basically, when you perform a cache maintenance operation, a
> translation table walk is done on the core that issued the cache op,
> then from that point on the physical address is used within the cache
> hardware and that's what gets broadcast to the other cores by the snoop
> control unit or cache coherency fabric (depending on the chip).

This is the same as it is done on x86. There is a CLFLUSH instruction,
which takes a virtual address and invalidates the cache line, maintaining
cache coherency in the coherency domain and possibly doing write-back.
It even sets the accessed bit in the page table entry. My understanding
is that the decision to operate on virtual addresses on x86 was made to
allow the instruction to work from user mode.

Still, an instruction to flush a cache line addressed by its physical
address would be nice. The required circuits are already there, since
CPUs must react to the coherency requests from other CPUs. On amd64,
pmap_invalidate_cache_pages() uses the direct map, but on i386 the
kernel has to use a specially allocated KVA page frame for a temporary
mapping (per-cpu CMAP2), see i386/i386/pmap.c:pmap_flush_page().

From owner-freebsd-arch@FreeBSD.ORG Thu Apr 30 09:53:07 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 39CDE48E; Thu, 30 Apr 2015 09:53:07 +0000 (UTC) Received: from mail-ie0-x231.google.com (mail-ie0-x231.google.com [IPv6:2607:f8b0:4001:c03::231]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id F234A1CB9; Thu, 30 Apr 2015 09:53:06 +0000 (UTC) Received: by iedfl3 with SMTP id fl3so70652438ied.1; Thu, 30 Apr 2015 02:53:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=vE8UxfyFvnSUa7y0Ao9fY7DbeYyy+An+QqpnG4OUzu0=; b=E6bwNNpBoA69OaqVDEDr8MO/+MuR8+ZZC/6uXz8KU0VUwxe/SuXuU+O7FsTZu4d+l9 QRnhxdW1LwP/yTZJUOjxugXB8cEZyoT5kzKUEH+wbqekoxGVdMyM6VASwi54OBxkerhn LjYH/qcXEWBWdlyiOlDW/yx0jEEVkI+5SNgOXgrZmX9vzGm91f0jTnfMT5ccs85L5ZhR HJxKvtn00Sz9p+WiZSkxlL3q4Lq6r6XoQqlSP8cPJw8k1Vo/j/LJgqVeyArq8YMIp004 6PQK78DBSx00SLjKoumNfojAcu9rLdbSZOKc/UUhuuvp2tAhaRXTi0Y/yOwD652TJ3Pf cg/g== MIME-Version: 1.0 X-Received: by 10.107.28.146 with SMTP id c140mr4399830ioc.67.1430387586495; Thu, 30 Apr 2015 02:53:06 -0700 (PDT) Received: by 10.64.13.81 with HTTP; Thu, 30 Apr 2015 02:53:06 -0700 (PDT) In-Reply-To: References: <38574E63-2D74-4ECB-8D68-09AC76DFB30C@bsdimp.com> <1761247.Bq816CMB8v@ralph.baldwin.cx> <20150429132017.GM2390@kib.kiev.ua> <20150429165432.GN2390@kib.kiev.ua> <20150429185019.GO2390@kib.kiev.ua> <20150429193337.GQ2390@kib.kiev.ua> <1430346204.1157.107.camel@freebsd.org> Date: Thu, 30 Apr 2015 11:53:06 +0200 Message-ID: Subject: Re: bus_dmamap_sync() for bounced client buffers from user address space From: Svatopluk Kraus To: Jason Harmening Cc: Ian Lepore , Konstantin Belousov , Adrian Chadd , freebsd-arch Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: ,
X-List-Received-Date: Thu, 30 Apr 2015 09:53:07 -0000

On Thu, Apr 30, 2015 at 1:10 AM, Jason Harmening wrote:
>>
>> For what we call armv6 (which is mostly armv7)...
>>
>> The cache maintenance operations require virtual addresses, which means
>> it looks a lot like a VIPT cache. Under the hood the implementation
>> behaves as if it were a PIPT cache so even in the presence of multiple
>> mappings of the same physical page into different virtual addresses, the
>> SMP coherency hardware works correctly.
>>
>> The ARM ARM says...
>>
>> [Stuff about ARMv6 and page coloring when a cache way exceeds
>> 4K.]
>>
>> ARMv7 does not support page coloring, and requires that all data
>> and unified caches behave as Physically Indexed Physically
>> Tagged (PIPT) caches.
>>
>> The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
>> that neatly sidesteps the aliasing problem that requires page coloring
>> solutions. So modern arm chips we get to act like we've got PIPT data
>> caches, but with the quirk that cache ops are initiated by virtual
>> address.
>
> Cool, thanks for the explanation!
> To satisfy my own curiosity, since it "looks like VIPT", does that mean we
> still have to flush the cache on context switch?

No, in general, there is no need to flush PIPT caches (even if they "look
like VIPT") on context switch. When it comes to cache maintenance, the
physical page is either mapped in the correct context, or you have to map
it somewhere in the current context (KVA is used for that).

>>
>> Basically, when you perform a cache maintenance operation, a
>> translation table walk is done on the core that issued the cache op,
>> then from that point on the physical address is used within the cache
>> hardware and that's what gets broadcast to the other cores by the snoop
>> control unit or cache coherency fabric (depending on the chip).
>
> So, if we go back to the original problem of wanting to do bus_dmamap_sync()
> on userspace buffers from some asynchronous context:
>
> Say the process that owns the buffer is running on one core and prefetches
> some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is
> done by a kernel thread running on another core. Since the core running the
> kernel thread is responsible for the TLB lookup to get the physical address,
> then since that core has no UVA the cache ops will be treated as misses and
> the cacheline on the core that owns the UVA won't be invalidated, correct?
>
> That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c should
> stay?

Not for unmapped buffers. For user space buffers, it's still a question
how this will be resolved. It now looks like the aim is to not use UVAs
for DMA buffers in the kernel at all. In any case, even if UVAs are used,
the panic won't be needed once a correct implementation is done.

> Sort of the same problem would apply to drivers using
> vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance, since
> there are no VAs to operate on. Indeed, both arm and mips implementation of
> _bus_dmamap_load_phys don't do anything with the sync list.

I'm just working on a _bus_dmamap_load_phys() implementation for armv6.
That means I'm adding sync list entries for unmapped buffers (with the
virtual address set to zero) and implementing cache maintenance
operations that take a physical address as an argument: the given range
is temporarily mapped into the kernel (page by page) and the cache
operation is called on that virtual address. It's the same scenario as
in the i386 pmap.
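Sketched out, that approach might look like the following (the function
name and the reserved KVA page are illustrative, any needed serialization
is omitted, and cpu_dcache_*_range() are the existing arm primitives):

	static void
	dma_dcache_sync_phys(vm_paddr_t pa, vm_size_t size, int op)
	{
		vm_offset_t va = dma_sync_kva;	/* assumed reserved page */
		vm_size_t len;

		while (size > 0) {
			len = min(size, PAGE_SIZE - (pa & PAGE_MASK));
			pmap_kenter(va, trunc_page(pa));
			if (op & BUS_DMASYNC_PREWRITE)
				cpu_dcache_wb_range(va + (pa & PAGE_MASK), len);
			if (op & BUS_DMASYNC_POSTREAD)
				cpu_dcache_inv_range(va + (pa & PAGE_MASK), len);
			pmap_kremove(va);
			pa += len;
			size -= len;
		}
	}

Per the earlier explanation, once the local translation succeeds the
operation is broadcast by physical address on armv7-class hardware, so a
private temporary mapping should be sufficient even on SMP.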
From owner-freebsd-arch@FreeBSD.ORG Thu Apr 30 14:24:13 2015 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 1549B78D; Thu, 30 Apr 2015 14:24:13 +0000 (UTC) Received: from cell.glebius.int.ru (glebius.int.ru [81.19.69.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "cell.glebius.int.ru", Issuer "cell.glebius.int.ru" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 2EC1A1038; Thu, 30 Apr 2015 14:24:11 +0000 (UTC) Received: from cell.glebius.int.ru (localhost [127.0.0.1]) by cell.glebius.int.ru (8.14.9/8.14.9) with ESMTP id t3UEO8Nr022445 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 30 Apr 2015 17:24:08 +0300 (MSK) (envelope-from glebius@FreeBSD.org) Received: (from glebius@localhost) by cell.glebius.int.ru (8.14.9/8.14.9/Submit) id t3UEO849022444; Thu, 30 Apr 2015 17:24:08 +0300 (MSK) (envelope-from glebius@FreeBSD.org) X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to glebius@FreeBSD.org using -f Date: Thu, 30 Apr 2015 17:24:08 +0300 From: Gleb Smirnoff To: kib@FreeBSD.org, alc@FreeBSD.org Cc: arch@FreeBSD.org Subject: more strict KPI for vm_pager_get_pages() Message-ID: <20150430142408.GS546@nginx.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="45Z9DzgjV8m4Oswq" Content-Disposition: inline User-Agent: Mutt/1.5.23 (2014-03-12) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Apr 2015 14:24:13 -0000 --45Z9DzgjV8m4Oswq Content-Type: text/plain; charset=us-ascii Content-Disposition: inline

Hi!

The reason for writing this patch emerges from the projects/sendfile branch, where vm_pager_get_pages() is used in the sendfile(2) system call. Although the new sendfile works flawlessly, it makes some assumptions about the vnode_pager that theoretically may not be valid, although they always hold in our current code. Going deeper into the problem, I found more important points, which yielded the suggested patch.

To start, let me display the current KPI assumptions:

1) vm_pager_get_pages() works on an array of consecutive pages. The pindex of the (n+1)-th page must be the pindex of the n-th page + 1. One page is special, it is called reqpage.

2) vm_pager_get_pages() guarantees to swap in only the reqpage, and may skip or fail other pages for different reasons that may vary from pager to pager.

3) There is also the function vm_pager_has_page(), which reports the availability of a page at a given index in the pager, and also provides hints on how many consecutive pages before this one and after this one can be swapped in in a single pager request. Most pagers return zeros in these hints. The vnode pager for UFS returns a strong promise that one can later utilize in vm_pager_get_pages().

4) All pages must be busied on enter. On exit only the reqpage will be left busied. The KPI doesn't guarantee that the rest of the pages are still in place. The pager usually calls vm_page_readahead_finish() on them, which can either free the page or put it on the active/inactive queue, using quite a strange approach to choose the queue.

5) The pages must not be wired, since vm_page_free() may be called on them. However, this is violated by several consumers of the KPI, relying on the lack of errors in the pager. Moreover, the swap pager has a special function to skip wired pages while doing the sweep, to avoid this problem. So, passing wired pages to the swap pager is OK, while passing them to the rest is not.

6) Pagers may replace a page in the object with a new one. The sg_pager actually does that. To protect against this event, consumers of vm_pager_get_pages() always run vm_page_lookup() over the array of pages to relookup the pages. However, not all consumers do this.

Speaking of pagers and their consumers:

- 11 consumers request an array of size 1, a single page
- 3 consumers actually request an array

My suggestion is to change the KPI assumptions to the following:

1) There is no reqpage. All pages are entered busied, all pages are returned busied and validated. If the pager fails to validate all pages, it must return an error.

2) The consumer (not the pager!) is to decide what to do with the pages: vm_page_activate, vm_page_deactivate, vm_page_flash or just vm_page_free them. The consumer also unbusies pages, if it wants to. The consumer is free to wire pages before the call.

3) Consumers must first query the pager via vm_pager_has_page(), and use the after/before hints to limit the size of the requested pages array.

4) In case the pager replaces pages, it must also update the array, so that the consumer doesn't need to do a relookup.

Doing this sweep, I also noticed that all pagers have copy-pasted code for zeroing invalid regions of partially valid pages. Also, many pagers got a set of assertions copied and pasted from each other. So, I decided to un-inline vm_pager_get_pages(), bring it to the vm_pager.c file and gather all these copy-pastes into one place.

The suggested patch is attached. As expected, it simplifies and removes quite a lot of code. Right now it is tested on UFS only; testing NFS and ZFS is on my list. There is one panic known, but it seems unrelated, and Peter pho@ says it has been seen once before.

--
Totus tuus, Glebius.
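To make the proposed contract concrete, a consumer would follow roughly
this pattern (a sketch only: object locking and error paths are elided,
the array size is arbitrary, and this code is not part of the attached
patch):

	vm_page_t ma[16];
	int after, count, i, rv;

	if (!vm_pager_has_page(object, pindex, NULL, &after))
		return (VM_PAGER_FAIL);
	count = MIN(after + 1, nitems(ma));	/* 3) bound by the hint */
	for (i = 0; i < count; i++)
		ma[i] = vm_page_grab(object, pindex + i, VM_ALLOC_NORMAL);
	rv = vm_pager_get_pages(object, ma, count);	/* 1) no reqpage */
	if (rv != VM_PAGER_OK)
		return (rv);
	for (i = 0; i < count; i++) {
		/*
		 * 1) All pages come back busied and fully valid;
		 * 2) the consumer picks the disposition itself.
		 */
		if (i > 0) {
			vm_page_lock(ma[i]);
			vm_page_deactivate(ma[i]);
			vm_page_unlock(ma[i]);
		}
		vm_page_xunbusy(ma[i]);
	}
	/* 4) if the pager replaced pages, ma[] was updated in place */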
--45Z9DzgjV8m4Oswq Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="vm_pager_get_pages-new-KPI.diff" Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c =================================================================== --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (revision 282213) +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (working copy) @@ -5712,12 +5712,12 @@ ioflags(int ioflags) } static int -zfs_getpages(struct vnode *vp, vm_page_t *m, int count, int reqpage) +zfs_getpages(struct vnode *vp, vm_page_t *m, int count) { znode_t *zp = VTOZ(vp); zfsvfs_t *zfsvfs = zp->z_zfsvfs; objset_t *os = zp->z_zfsvfs->z_os; - vm_page_t mfirst, mlast, mreq; + vm_page_t mlast; vm_object_t object; caddr_t va; struct sf_buf *sf; @@ -5730,80 +5730,27 @@ static int ZFS_VERIFY_ZP(zp); pcount = OFF_TO_IDX(round_page(count)); - mreq = m[reqpage]; - object = mreq->object; + object = m[0]->object; + mlast = m[pcount - 1]; error = 0; - KASSERT(vp->v_object == object, ("mismatching object")); - - if (pcount > 1 && zp->z_blksz > PAGESIZE) { - startoff = rounddown(IDX_TO_OFF(mreq->pindex), zp->z_blksz); - reqstart = OFF_TO_IDX(round_page(startoff)); - if (reqstart < m[0]->pindex) - reqstart = 0; - else - reqstart = reqstart - m[0]->pindex; - endoff = roundup(IDX_TO_OFF(mreq->pindex) + PAGE_SIZE, - zp->z_blksz); - reqend = OFF_TO_IDX(trunc_page(endoff)) - 1; - if (reqend > m[pcount - 1]->pindex) - reqend = m[pcount - 1]->pindex; - reqsize = reqend - m[reqstart]->pindex + 1; - KASSERT(reqstart <= reqpage && reqpage < reqstart + reqsize, - ("reqpage beyond [reqstart, reqstart + reqsize[ bounds")); - } else { - reqstart = reqpage; - reqsize = 1; - } - mfirst = m[reqstart]; - mlast = m[reqstart + reqsize - 1]; - - zfs_vmobject_wlock(object); - - for (i = 0; i < reqstart; i++) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - for (i = reqstart + reqsize; i < pcount; i++) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - - if (mreq->valid && reqsize == 1) { - if (mreq->valid != VM_PAGE_BITS_ALL) - vm_page_zero_invalid(mreq, TRUE); - zfs_vmobject_wunlock(object); + if (IDX_TO_OFF(mlast->pindex) >= + object->un_pager.vnp.vnp_size) { ZFS_EXIT(zfsvfs); - return (zfs_vm_pagerret_ok); + return (zfs_vm_pagerret_bad); } PCPU_INC(cnt.v_vnodein); PCPU_ADD(cnt.v_vnodepgsin, reqsize); - if (IDX_TO_OFF(mreq->pindex) >= object->un_pager.vnp.vnp_size) { - for (i = reqstart; i < reqstart + reqsize; i++) { - if (i != reqpage) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - } - zfs_vmobject_wunlock(object); - ZFS_EXIT(zfsvfs); - return (zfs_vm_pagerret_bad); - } - lsize = PAGE_SIZE; if (IDX_TO_OFF(mlast->pindex) + lsize > object->un_pager.vnp.vnp_size) - lsize = object->un_pager.vnp.vnp_size - IDX_TO_OFF(mlast->pindex); + lsize = object->un_pager.vnp.vnp_size - + IDX_TO_OFF(mlast->pindex); - zfs_vmobject_wunlock(object); - - for (i = reqstart; i < reqstart + reqsize; i++) { + for (i = 0; i < pcount; i++) { size = PAGE_SIZE; - if (i == (reqstart + reqsize - 1)) + if (i == pcount - 1) size = lsize; va = zfs_map_page(m[i], &sf); error = dmu_read(os, zp->z_id, IDX_TO_OFF(m[i]->pindex), @@ -5812,21 +5759,15 @@ static int bzero(va + size, PAGE_SIZE - size); zfs_unmap_page(sf); if (error != 0) - break; + goto out; } zfs_vmobject_wlock(object); - - for (i = reqstart; i < reqstart + reqsize; i++) { - if (!error) - m[i]->valid = VM_PAGE_BITS_ALL; - KASSERT(m[i]->dirty == 0, 
("zfs_getpages: page %p is dirty", m[i])); - if (i != reqpage) - vm_page_readahead_finish(m[i]); - } - + for (i = 0; i < pcount; i++) + m[i]->valid = VM_PAGE_BITS_ALL; zfs_vmobject_wunlock(object); +out: ZFS_ACCESSTIME_STAMP(zfsvfs, zp); ZFS_EXIT(zfsvfs); return (error ? zfs_vm_pagerret_error : zfs_vm_pagerret_ok); @@ -5842,7 +5783,7 @@ zfs_freebsd_getpages(ap) } */ *ap; { - return (zfs_getpages(ap->a_vp, ap->a_m, ap->a_count, ap->a_reqpage)); + return (zfs_getpages(ap->a_vp, ap->a_m, ap->a_count)); } static int Index: sys/dev/drm2/i915/i915_gem.c =================================================================== --- sys/dev/drm2/i915/i915_gem.c (revision 282213) +++ sys/dev/drm2/i915/i915_gem.c (working copy) @@ -3174,8 +3174,7 @@ i915_gem_wire_page(vm_object_t object, vm_pindex_t m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL); if (m->valid != VM_PAGE_BITS_ALL) { if (vm_pager_has_page(object, pindex, NULL, NULL)) { - rv = vm_pager_get_pages(object, &m, 1, 0); - m = vm_page_lookup(object, pindex); + rv = vm_pager_get_pages(object, &m, 1); if (m == NULL) return (NULL); if (rv != VM_PAGER_OK) { Index: sys/dev/drm2/ttm/ttm_tt.c =================================================================== --- sys/dev/drm2/ttm/ttm_tt.c (revision 282213) +++ sys/dev/drm2/ttm/ttm_tt.c (working copy) @@ -291,7 +291,7 @@ int ttm_tt_swapin(struct ttm_tt *ttm) from_page = vm_page_grab(obj, i, VM_ALLOC_NORMAL); if (from_page->valid != VM_PAGE_BITS_ALL) { if (vm_pager_has_page(obj, i, NULL, NULL)) { - rv = vm_pager_get_pages(obj, &from_page, 1, 0); + rv = vm_pager_get_pages(obj, &from_page, 1); if (rv != VM_PAGER_OK) { vm_page_lock(from_page); vm_page_free(from_page); Index: sys/dev/md/md.c =================================================================== --- sys/dev/md/md.c (revision 282213) +++ sys/dev/md/md.c (working copy) @@ -835,7 +835,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp) if (m->valid == VM_PAGE_BITS_ALL) rv = VM_PAGER_OK; else - rv = vm_pager_get_pages(sc->object, &m, 1, 0); + rv = vm_pager_get_pages(sc->object, &m, 1); if (rv == VM_PAGER_ERROR) { vm_page_xunbusy(m); break; @@ -858,7 +858,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp) } } else if (bp->bio_cmd == BIO_WRITE) { if (len != PAGE_SIZE && m->valid != VM_PAGE_BITS_ALL) - rv = vm_pager_get_pages(sc->object, &m, 1, 0); + rv = vm_pager_get_pages(sc->object, &m, 1); else rv = VM_PAGER_OK; if (rv == VM_PAGER_ERROR) { @@ -874,7 +874,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp) m->valid = VM_PAGE_BITS_ALL; } else if (bp->bio_cmd == BIO_DELETE) { if (len != PAGE_SIZE && m->valid != VM_PAGE_BITS_ALL) - rv = vm_pager_get_pages(sc->object, &m, 1, 0); + rv = vm_pager_get_pages(sc->object, &m, 1); else rv = VM_PAGER_OK; if (rv == VM_PAGER_ERROR) { Index: sys/fs/fuse/fuse_vnops.c =================================================================== --- sys/fs/fuse/fuse_vnops.c (revision 282213) +++ sys/fs/fuse/fuse_vnops.c (working copy) @@ -1761,29 +1761,6 @@ fuse_vnop_getpages(struct vop_getpages_args *ap) npages = btoc(count); /* - * If the requested page is partially valid, just return it and - * allow the pager to zero-out the blanks. Partially valid pages - * can only occur at the file EOF. 
- */ - - VM_OBJECT_WLOCK(vp->v_object); - fuse_vm_page_lock_queues(); - if (pages[ap->a_reqpage]->valid != 0) { - for (i = 0; i < npages; ++i) { - if (i != ap->a_reqpage) { - fuse_vm_page_lock(pages[i]); - vm_page_free(pages[i]); - fuse_vm_page_unlock(pages[i]); - } - } - fuse_vm_page_unlock_queues(); - VM_OBJECT_WUNLOCK(vp->v_object); - return 0; - } - fuse_vm_page_unlock_queues(); - VM_OBJECT_WUNLOCK(vp->v_object); - - /* * We use only the kva address for the buffer, but this is extremely * convienient and fast. */ @@ -1811,17 +1788,6 @@ fuse_vnop_getpages(struct vop_getpages_args *ap) if (error && (uio.uio_resid == count)) { FS_DEBUG("error %d\n", error); - VM_OBJECT_WLOCK(vp->v_object); - fuse_vm_page_lock_queues(); - for (i = 0; i < npages; ++i) { - if (i != ap->a_reqpage) { - fuse_vm_page_lock(pages[i]); - vm_page_free(pages[i]); - fuse_vm_page_unlock(pages[i]); - } - } - fuse_vm_page_unlock_queues(); - VM_OBJECT_WUNLOCK(vp->v_object); return VM_PAGER_ERROR; } /* @@ -1862,8 +1828,6 @@ fuse_vnop_getpages(struct vop_getpages_args *ap) */ ; } - if (i != ap->a_reqpage) - vm_page_readahead_finish(m); } fuse_vm_page_unlock_queues(); VM_OBJECT_WUNLOCK(vp->v_object); Index: sys/fs/nfsclient/nfs_clbio.c =================================================================== --- sys/fs/nfsclient/nfs_clbio.c (revision 282213) +++ sys/fs/nfsclient/nfs_clbio.c (working copy) @@ -129,23 +129,6 @@ ncl_getpages(struct vop_getpages_args *ap) npages = btoc(count); /* - * Since the caller has busied the requested page, that page's valid - * field will not be changed by other threads. - */ - vm_page_assert_xbusied(pages[ap->a_reqpage]); - - /* - * If the requested page is partially valid, just return it and - * allow the pager to zero-out the blanks. Partially valid pages - * can only occur at the file EOF. - */ - if (pages[ap->a_reqpage]->valid != 0) { - vm_pager_free_nonreq(object, pages, ap->a_reqpage, npages, - FALSE); - return (VM_PAGER_OK); - } - - /* * We use only the kva address for the buffer, but this is extremely * convienient and fast. */ @@ -173,8 +156,6 @@ ncl_getpages(struct vop_getpages_args *ap) if (error && (uio.uio_resid == count)) { ncl_printf("nfs_getpages: error %d\n", error); - vm_pager_free_nonreq(object, pages, ap->a_reqpage, npages, - FALSE); return (VM_PAGER_ERROR); } @@ -218,8 +199,6 @@ ncl_getpages(struct vop_getpages_args *ap) */ ; } - if (i != ap->a_reqpage) - vm_page_readahead_finish(m); } VM_OBJECT_WUNLOCK(object); return (0); Index: sys/fs/smbfs/smbfs_io.c =================================================================== --- sys/fs/smbfs/smbfs_io.c (revision 282213) +++ sys/fs/smbfs/smbfs_io.c (working copy) @@ -424,7 +424,7 @@ smbfs_getpages(ap) #ifdef SMBFS_RWGENERIC return vop_stdgetpages(ap); #else - int i, error, nextoff, size, toff, npages, count, reqpage; + int i, error, nextoff, size, toff, npages, count; struct uio uio; struct iovec iov; vm_offset_t kva; @@ -436,7 +436,7 @@ smbfs_getpages(ap) struct smbnode *np; struct smb_cred *scred; vm_object_t object; - vm_page_t *pages, m; + vm_page_t *pages; vp = ap->a_vp; if ((object = vp->v_object) == NULL) { @@ -451,29 +451,7 @@ smbfs_getpages(ap) pages = ap->a_m; count = ap->a_count; npages = btoc(count); - reqpage = ap->a_reqpage; - /* - * If the requested page is partially valid, just return it and - * allow the pager to zero-out the blanks. Partially valid pages - * can only occur at the file EOF. 
- */ - m = pages[reqpage]; - - VM_OBJECT_WLOCK(object); - if (m->valid != 0) { - for (i = 0; i < npages; ++i) { - if (i != reqpage) { - vm_page_lock(pages[i]); - vm_page_free(pages[i]); - vm_page_unlock(pages[i]); - } - } - VM_OBJECT_WUNLOCK(object); - return 0; - } - VM_OBJECT_WUNLOCK(object); - scred = smbfs_malloc_scred(); smb_makescred(scred, td, cred); @@ -500,17 +478,8 @@ smbfs_getpages(ap) relpbuf(bp, &smbfs_pbuf_freecnt); - VM_OBJECT_WLOCK(object); if (error && (uio.uio_resid == count)) { printf("smbfs_getpages: error %d\n",error); - for (i = 0; i < npages; i++) { - if (reqpage != i) { - vm_page_lock(pages[i]); - vm_page_free(pages[i]); - vm_page_unlock(pages[i]); - } - } - VM_OBJECT_WUNLOCK(object); return VM_PAGER_ERROR; } @@ -544,9 +513,6 @@ smbfs_getpages(ap) */ ; } - - if (i != reqpage) - vm_page_readahead_finish(m); } VM_OBJECT_WUNLOCK(object); return 0; Index: sys/fs/tmpfs/tmpfs_subr.c =================================================================== --- sys/fs/tmpfs/tmpfs_subr.c (revision 282213) +++ sys/fs/tmpfs/tmpfs_subr.c (working copy) @@ -1320,7 +1320,7 @@ tmpfs_reg_resize(struct vnode *vp, off_t newsize, struct tmpfs_mount *tmp; struct tmpfs_node *node; vm_object_t uobj; - vm_page_t m, ma[1]; + vm_page_t m; vm_pindex_t idx, newpages, oldpages; off_t oldsize; int base, rv; @@ -1368,9 +1368,7 @@ retry: VM_OBJECT_WLOCK(uobj); goto retry; } else if (m->valid != VM_PAGE_BITS_ALL) { - ma[0] = m; - rv = vm_pager_get_pages(uobj, ma, 1, 0); - m = vm_page_lookup(uobj, idx); + rv = vm_pager_get_pages(uobj, &m, 1); } else /* A cached page was reactivated. */ rv = VM_PAGER_OK; Index: sys/kern/kern_exec.c =================================================================== --- sys/kern/kern_exec.c (revision 282213) +++ sys/kern/kern_exec.c (working copy) @@ -920,8 +920,7 @@ int exec_map_first_page(imgp) struct image_params *imgp; { - int rv, i; - int initial_pagein; + int rv, i, after, initial_pagein; vm_page_t ma[VM_INITIAL_PAGEIN]; vm_object_t object; @@ -937,9 +936,18 @@ exec_map_first_page(imgp) #endif ma[0] = vm_page_grab(object, 0, VM_ALLOC_NORMAL); if (ma[0]->valid != VM_PAGE_BITS_ALL) { - initial_pagein = VM_INITIAL_PAGEIN; - if (initial_pagein > object->size) - initial_pagein = object->size; + if (!vm_pager_has_page(object, 0, NULL, &after)) { + vm_page_xunbusy(ma[0]); + vm_page_lock(ma[0]); + vm_page_free(ma[0]); + vm_page_unlock(ma[0]); + VM_OBJECT_WUNLOCK(object); + return (EIO); + } + initial_pagein = min(after, VM_INITIAL_PAGEIN); + KASSERT(initial_pagein <= object->size, + ("%s: initial_pagein %d object->size %ju", + __func__, initial_pagein, (uintmax_t )object->size)); for (i = 1; i < initial_pagein; i++) { if ((ma[i] = vm_page_next(ma[i - 1])) != NULL) { if (ma[i]->valid) @@ -954,19 +962,21 @@ exec_map_first_page(imgp) } } initial_pagein = i; - rv = vm_pager_get_pages(object, ma, initial_pagein, 0); - ma[0] = vm_page_lookup(object, 0); - if ((rv != VM_PAGER_OK) || (ma[0] == NULL)) { - if (ma[0] != NULL) { - vm_page_lock(ma[0]); - vm_page_free(ma[0]); - vm_page_unlock(ma[0]); + rv = vm_pager_get_pages(object, ma, initial_pagein); + if (rv != VM_PAGER_OK) { + for (i = 0; i < initial_pagein; i++) { + vm_page_xunbusy(ma[i]); + vm_page_lock(ma[i]); + vm_page_free(ma[i]); + vm_page_unlock(ma[i]); } VM_OBJECT_WUNLOCK(object); return (EIO); } - } - vm_page_xunbusy(ma[0]); + } else + initial_pagein = 1; + for (i = 0; i < initial_pagein; i++) + vm_page_xunbusy(ma[i]); vm_page_lock(ma[0]); vm_page_hold(ma[0]); vm_page_activate(ma[0]); Index: sys/kern/uipc_shm.c 
=================================================================== --- sys/kern/uipc_shm.c (revision 282213) +++ sys/kern/uipc_shm.c (working copy) @@ -186,15 +186,7 @@ uiomove_object_page(vm_object_t obj, size_t len, s m = vm_page_grab(obj, idx, VM_ALLOC_NORMAL); if (m->valid != VM_PAGE_BITS_ALL) { if (vm_pager_has_page(obj, idx, NULL, NULL)) { - rv = vm_pager_get_pages(obj, &m, 1, 0); - m = vm_page_lookup(obj, idx); - if (m == NULL) { - printf( - "uiomove_object: vm_obj %p idx %jd null lookup rv %d\n", - obj, idx, rv); - VM_OBJECT_WUNLOCK(obj); - return (EIO); - } + rv = vm_pager_get_pages(obj, &m, 1); if (rv != VM_PAGER_OK) { printf( "uiomove_object: vm_obj %p idx %jd valid %x pager error %d\n", @@ -421,7 +413,7 @@ static int shm_dotruncate(struct shmfd *shmfd, off_t length) { vm_object_t object; - vm_page_t m, ma[1]; + vm_page_t m; vm_pindex_t idx, nobjsize; vm_ooffset_t delta; int base, rv; @@ -463,12 +455,9 @@ retry: VM_WAIT; VM_OBJECT_WLOCK(object); goto retry; - } else if (m->valid != VM_PAGE_BITS_ALL) { - ma[0] = m; - rv = vm_pager_get_pages(object, ma, 1, - 0); - m = vm_page_lookup(object, idx); - } else + } else if (m->valid != VM_PAGE_BITS_ALL) + rv = vm_pager_get_pages(object, &m, 1); + else /* A cached page was reactivated. */ rv = VM_PAGER_OK; vm_page_lock(m); Index: sys/kern/uipc_syscalls.c =================================================================== --- sys/kern/uipc_syscalls.c (revision 282213) +++ sys/kern/uipc_syscalls.c (working copy) @@ -2024,12 +2024,9 @@ sendfile_readpage(vm_object_t obj, struct vnode *v VM_OBJECT_WLOCK(obj); } else { if (vm_pager_has_page(obj, pindex, NULL, NULL)) { - rv = vm_pager_get_pages(obj, &m, 1, 0); + rv = vm_pager_get_pages(obj, &m, 1); SFSTAT_INC(sf_iocnt); - m = vm_page_lookup(obj, pindex); - if (m == NULL) - error = EIO; - else if (rv != VM_PAGER_OK) { + if (rv != VM_PAGER_OK) { vm_page_lock(m); vm_page_free(m); vm_page_unlock(m); Index: sys/kern/vfs_default.c =================================================================== --- sys/kern/vfs_default.c (revision 282213) +++ sys/kern/vfs_default.c (working copy) @@ -731,12 +731,11 @@ vop_stdgetpages(ap) struct vnode *a_vp; vm_page_t *a_m; int a_count; - int a_reqpage; } */ *ap; { return vnode_pager_generic_getpages(ap->a_vp, ap->a_m, - ap->a_count, ap->a_reqpage, NULL, NULL); + ap->a_count, NULL, NULL); } static int @@ -744,8 +743,8 @@ vop_stdgetpages_async(struct vop_getpages_async_ar { int error; - error = VOP_GETPAGES(ap->a_vp, ap->a_m, ap->a_count, ap->a_reqpage); - ap->a_iodone(ap->a_arg, ap->a_m, ap->a_reqpage, error); + error = VOP_GETPAGES(ap->a_vp, ap->a_m, ap->a_count); + ap->a_iodone(ap->a_arg, ap->a_m, ap->a_count, error); return (error); } Index: sys/kern/vnode_if.src =================================================================== --- sys/kern/vnode_if.src (revision 282213) +++ sys/kern/vnode_if.src (working copy) @@ -472,7 +472,6 @@ vop_getpages { IN struct vnode *vp; IN vm_page_t *m; IN int count; - IN int reqpage; }; @@ -482,7 +481,6 @@ vop_getpages_async { IN struct vnode *vp; IN vm_page_t *m; IN int count; - IN int reqpage; IN vop_getpages_iodone_t *iodone; IN void *arg; }; Index: sys/sys/buf.h =================================================================== --- sys/sys/buf.h (revision 282213) +++ sys/sys/buf.h (working copy) @@ -124,14 +124,9 @@ struct buf { struct ucred *b_wcred; /* Write credentials reference. */ void *b_saveaddr; /* Original b_addr for physio. 
*/ union { - TAILQ_ENTRY(buf) bu_freelist; /* (Q) */ - struct { - void (*pg_iodone)(void *, vm_page_t *, int, int); - int pg_reqpage; - } bu_pager; - } b_union; -#define b_freelist b_union.bu_freelist -#define b_pager b_union.bu_pager + TAILQ_ENTRY(buf) b_freelist; /* (Q) */ + void (*b_pgiodone)(void *, vm_page_t *, int, int); + }; union cluster_info { TAILQ_HEAD(cluster_list_head, buf) cluster_head; TAILQ_ENTRY(buf) cluster_entry; Index: sys/vm/default_pager.c =================================================================== --- sys/vm/default_pager.c (revision 282213) +++ sys/vm/default_pager.c (working copy) @@ -56,7 +56,7 @@ __FBSDID("$FreeBSD$"); static vm_object_t default_pager_alloc(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t, struct ucred *); static void default_pager_dealloc(vm_object_t); -static int default_pager_getpages(vm_object_t, vm_page_t *, int, int); +static int default_pager_getpages(vm_object_t, vm_page_t *, int); static void default_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *); static boolean_t default_pager_haspage(vm_object_t, vm_pindex_t, int *, @@ -121,11 +121,10 @@ default_pager_dealloc(object) * see a vm_page with assigned swap here. */ static int -default_pager_getpages(object, m, count, reqpage) +default_pager_getpages(object, m, count) vm_object_t object; vm_page_t *m; int count; - int reqpage; { return VM_PAGER_FAIL; } Index: sys/vm/device_pager.c =================================================================== --- sys/vm/device_pager.c (revision 282213) +++ sys/vm/device_pager.c (working copy) @@ -59,7 +59,7 @@ static void dev_pager_init(void); static vm_object_t dev_pager_alloc(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t, struct ucred *); static void dev_pager_dealloc(vm_object_t); -static int dev_pager_getpages(vm_object_t, vm_page_t *, int, int); +static int dev_pager_getpages(vm_object_t, vm_page_t *, int); static void dev_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *); static boolean_t dev_pager_haspage(vm_object_t, vm_pindex_t, int *, @@ -257,33 +257,27 @@ dev_pager_dealloc(object) } static int -dev_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int reqpage) +dev_pager_getpages(vm_object_t object, vm_page_t *ma, int count) { - int error, i; + int error; + /* Since our putpages reports zero after/before, the count is 1. 
*/ + KASSERT(count == 1, ("%s: count %d", __func__, count)); VM_OBJECT_ASSERT_WLOCKED(object); error = object->un_pager.devp.ops->cdev_pg_fault(object, - IDX_TO_OFF(ma[reqpage]->pindex), PROT_READ, &ma[reqpage]); + IDX_TO_OFF(ma[0]->pindex), PROT_READ, &ma[0]); VM_OBJECT_ASSERT_WLOCKED(object); - for (i = 0; i < count; i++) { - if (i != reqpage) { - vm_page_lock(ma[i]); - vm_page_free(ma[i]); - vm_page_unlock(ma[i]); - } - } - if (error == VM_PAGER_OK) { KASSERT((object->type == OBJT_DEVICE && - (ma[reqpage]->oflags & VPO_UNMANAGED) != 0) || + (ma[0]->oflags & VPO_UNMANAGED) != 0) || (object->type == OBJT_MGTDEVICE && - (ma[reqpage]->oflags & VPO_UNMANAGED) == 0), - ("Wrong page type %p %p", ma[reqpage], object)); + (ma[0]->oflags & VPO_UNMANAGED) == 0), + ("Wrong page type %p %p", ma[0], object)); if (object->type == OBJT_DEVICE) { TAILQ_INSERT_TAIL(&object->un_pager.devp.devp_pglist, - ma[reqpage], plinks.q); + ma[0], plinks.q); } } Index: sys/vm/phys_pager.c =================================================================== --- sys/vm/phys_pager.c (revision 282213) +++ sys/vm/phys_pager.c (working copy) @@ -137,7 +137,7 @@ phys_pager_dealloc(vm_object_t object) * Fill as many pages as vm_fault has allocated for us. */ static int -phys_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage) +phys_pager_getpages(vm_object_t object, vm_page_t *m, int count) { int i; @@ -152,13 +152,6 @@ static int ("phys_pager_getpages: partially valid page %p", m[i])); KASSERT(m[i]->dirty == 0, ("phys_pager_getpages: dirty page %p", m[i])); - /* The requested page must remain busy, the others not. */ - if (i == reqpage) { - vm_page_lock(m[i]); - vm_page_flash(m[i]); - vm_page_unlock(m[i]); - } else - vm_page_xunbusy(m[i]); } return (VM_PAGER_OK); } Index: sys/vm/sg_pager.c =================================================================== --- sys/vm/sg_pager.c (revision 282213) +++ sys/vm/sg_pager.c (working copy) @@ -49,7 +49,7 @@ __FBSDID("$FreeBSD$"); static vm_object_t sg_pager_alloc(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t, struct ucred *); static void sg_pager_dealloc(vm_object_t); -static int sg_pager_getpages(vm_object_t, vm_page_t *, int, int); +static int sg_pager_getpages(vm_object_t, vm_page_t *, int); static void sg_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *); static boolean_t sg_pager_haspage(vm_object_t, vm_pindex_t, int *, @@ -133,7 +133,7 @@ sg_pager_dealloc(vm_object_t object) } static int -sg_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage) +sg_pager_getpages(vm_object_t object, vm_page_t *m, int count) { struct sglist *sg; vm_page_t m_paddr, page; @@ -143,11 +143,13 @@ static int size_t space; int i; + /* Since our putpages reports zero after/before, the count is 1. */ + KASSERT(count == 1, ("%s: count %d", __func__, count)); VM_OBJECT_ASSERT_WLOCKED(object); sg = object->handle; memattr = object->memattr; VM_OBJECT_WUNLOCK(object); - offset = m[reqpage]->pindex; + offset = m[0]->pindex; /* * Lookup the physical address of the requested page. An initial @@ -176,7 +178,7 @@ static int } /* Return a fake page for the requested page. */ - KASSERT(!(m[reqpage]->flags & PG_FICTITIOUS), + KASSERT(!(m[0]->flags & PG_FICTITIOUS), ("backing page for SG is fake")); /* Construct a new fake page. 
*/ @@ -183,17 +185,9 @@ static int page = vm_page_getfake(paddr, memattr); VM_OBJECT_WLOCK(object); TAILQ_INSERT_TAIL(&object->un_pager.sgp.sgp_pglist, page, plinks.q); - - /* Free the original pages and insert this fake page into the object. */ - for (i = 0; i < count; i++) { - if (i == reqpage && - vm_page_replace(page, object, offset) != m[i]) - panic("sg_pager_getpages: invalid place replacement"); - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - m[reqpage] = page; + if (vm_page_replace(page, object, offset) != m[0]) + panic("sg_pager_getpages: invalid place replacement"); + m[0] = page; page->valid = VM_PAGE_BITS_ALL; return (VM_PAGER_OK); Index: sys/vm/swap_pager.c =================================================================== --- sys/vm/swap_pager.c (revision 282213) +++ sys/vm/swap_pager.c (working copy) @@ -362,8 +362,8 @@ static vm_object_t swap_pager_alloc(void *handle, vm_ooffset_t size, vm_prot_t prot, vm_ooffset_t offset, struct ucred *); static void swap_pager_dealloc(vm_object_t object); -static int swap_pager_getpages(vm_object_t, vm_page_t *, int, int); -static int swap_pager_getpages_async(vm_object_t, vm_page_t *, int, int, +static int swap_pager_getpages(vm_object_t, vm_page_t *, int); +static int swap_pager_getpages_async(vm_object_t, vm_page_t *, int, pgo_getpages_iodone_t, void *); static void swap_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *); static boolean_t @@ -418,16 +418,6 @@ static void swp_pager_meta_free(vm_object_t, vm_pi static void swp_pager_meta_free_all(vm_object_t); static daddr_t swp_pager_meta_ctl(vm_object_t, vm_pindex_t, int); -static void -swp_pager_free_nrpage(vm_page_t m) -{ - - vm_page_lock(m); - if (m->wire_count == 0) - vm_page_free(m); - vm_page_unlock(m); -} - /* * SWP_SIZECHECK() - update swap_pager_full indication * @@ -1109,20 +1099,11 @@ swap_pager_unswapped(vm_page_t m) * left busy, but the others adjusted. */ static int -swap_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage) +swap_pager_getpages(vm_object_t object, vm_page_t *m, int count) { struct buf *bp; - vm_page_t mreq; - int i; - int j; daddr_t blk; - mreq = m[reqpage]; - - KASSERT(mreq->object == object, - ("swap_pager_getpages: object mismatch %p/%p", - object, mreq->object)); - /* * Calculate range to retrieve. The pages have already been assigned * their swapblks. We require a *contiguous* range but we know it to @@ -1132,45 +1113,18 @@ static int * * The swp_*() calls must be made with the object locked. */ - blk = swp_pager_meta_ctl(mreq->object, mreq->pindex, 0); + blk = swp_pager_meta_ctl(m[0]->object, m[0]->pindex, 0); - for (i = reqpage - 1; i >= 0; --i) { - daddr_t iblk; - - iblk = swp_pager_meta_ctl(m[i]->object, m[i]->pindex, 0); - if (blk != iblk + (reqpage - i)) - break; - } - ++i; - - for (j = reqpage + 1; j < count; ++j) { - daddr_t jblk; - - jblk = swp_pager_meta_ctl(m[j]->object, m[j]->pindex, 0); - if (blk != jblk - (j - reqpage)) - break; - } - - /* - * free pages outside our collection range. Note: we never free - * mreq, it must remain busy throughout. - */ - if (0 < i || j < count) { - int k; - - for (k = 0; k < i; ++k) - swp_pager_free_nrpage(m[k]); - for (k = j; k < count; ++k) - swp_pager_free_nrpage(m[k]); - } - - /* - * Return VM_PAGER_FAIL if we have nothing to do. Return mreq - * still busy, but the others unbusied. 
- */ if (blk == SWAPBLK_NONE) return (VM_PAGER_FAIL); +#ifdef INVARIANTS + for (int i = 0; i < count; i++) + KASSERT(blk + i == + swp_pager_meta_ctl(m[i]->object, m[i]->pindex, 0), + ("%s: range is not contiguous", __func__)); +#endif + /* * Getpbuf() can sleep. */ @@ -1185,21 +1139,16 @@ static int bp->b_iodone = swp_pager_async_iodone; bp->b_rcred = crhold(thread0.td_ucred); bp->b_wcred = crhold(thread0.td_ucred); - bp->b_blkno = blk - (reqpage - i); - bp->b_bcount = PAGE_SIZE * (j - i); - bp->b_bufsize = PAGE_SIZE * (j - i); - bp->b_pager.pg_reqpage = reqpage - i; + bp->b_blkno = blk; + bp->b_bcount = PAGE_SIZE * count; + bp->b_bufsize = PAGE_SIZE * count; + bp->b_npages = count; VM_OBJECT_WLOCK(object); - { - int k; - - for (k = i; k < j; ++k) { - bp->b_pages[k - i] = m[k]; - m[k]->oflags |= VPO_SWAPINPROG; - } + for (int i = 0; i < count; i++) { + bp->b_pages[i] = m[i]; + m[i]->oflags |= VPO_SWAPINPROG; } - bp->b_npages = j - i; PCPU_INC(cnt.v_swapin); PCPU_ADD(cnt.v_swappgsin, bp->b_npages); @@ -1231,8 +1180,8 @@ static int * is set in the meta-data. */ VM_OBJECT_WLOCK(object); - while ((mreq->oflags & VPO_SWAPINPROG) != 0) { - mreq->oflags |= VPO_SWAPSLEEP; + while ((m[0]->oflags & VPO_SWAPINPROG) != 0) { + m[0]->oflags |= VPO_SWAPSLEEP; PCPU_INC(cnt.v_intrans); if (VM_OBJECT_SLEEP(object, &object->paging_in_progress, PSWP, "swread", hz * 20)) { @@ -1243,16 +1192,14 @@ static int } /* - * mreq is left busied after completion, but all the other pages - * are freed. If we had an unrecoverable read error the page will - * not be valid. + * If we had an unrecoverable read error pages will not be valid. */ - if (mreq->valid != VM_PAGE_BITS_ALL) { - return (VM_PAGER_ERROR); - } else { - return (VM_PAGER_OK); - } + for (int i = 0; i < count; i++) + if (m[i]->valid != VM_PAGE_BITS_ALL) + return (VM_PAGER_ERROR); + return (VM_PAGER_OK); + /* * A final note: in a low swap situation, we cannot deallocate swap * and mark a page dirty here because the caller is likely to mark @@ -1269,11 +1216,11 @@ static int */ static int swap_pager_getpages_async(vm_object_t object, vm_page_t *m, int count, - int reqpage, pgo_getpages_iodone_t iodone, void *arg) + pgo_getpages_iodone_t iodone, void *arg) { int r, error; - r = swap_pager_getpages(object, m, count, reqpage); + r = swap_pager_getpages(object, m, count); VM_OBJECT_WUNLOCK(object); switch (r) { case VM_PAGER_OK: @@ -1572,33 +1519,11 @@ swp_pager_async_iodone(struct buf *bp) */ if (bp->b_iocmd == BIO_READ) { /* - * When reading, reqpage needs to stay - * locked for the parent, but all other - * pages can be freed. We still want to - * wakeup the parent waiting on the page, - * though. ( also: pg_reqpage can be -1 and - * not match anything ). - * - * We have to wake specifically requested pages - * up too because we cleared VPO_SWAPINPROG and - * someone may be waiting for that. - * * NOTE: for reads, m->dirty will probably * be overridden by the original caller of * getpages so don't play cute tricks here. */ m->valid = 0; - if (i != bp->b_pager.pg_reqpage) - swp_pager_free_nrpage(m); - else { - vm_page_lock(m); - vm_page_flash(m); - vm_page_unlock(m); - } - /* - * If i == bp->b_pager.pg_reqpage, do not wake - * the page up. The caller needs to. - */ } else { /* * If a write error occurs, reactivate page @@ -1620,38 +1545,12 @@ swp_pager_async_iodone(struct buf *bp) * want to do that anyway, but it was an optimization * that existed in the old swapper for a time before * it got ripped out due to precisely this problem. 
- * - * If not the requested page then deactivate it. - * - * Note that the requested page, reqpage, is left - * busied, but we still have to wake it up. The - * other pages are released (unbusied) by - * vm_page_xunbusy(). */ KASSERT(!pmap_page_is_mapped(m), ("swp_pager_async_iodone: page %p is mapped", m)); - m->valid = VM_PAGE_BITS_ALL; KASSERT(m->dirty == 0, ("swp_pager_async_iodone: page %p is dirty", m)); - - /* - * We have to wake specifically requested pages - * up too because we cleared VPO_SWAPINPROG and - * could be waiting for it in getpages. However, - * be sure to not unbusy getpages specifically - * requested page - getpages expects it to be - * left busy. - */ - if (i != bp->b_pager.pg_reqpage) { - vm_page_lock(m); - vm_page_deactivate(m); - vm_page_unlock(m); - vm_page_xunbusy(m); - } else { - vm_page_lock(m); - vm_page_flash(m); - vm_page_unlock(m); - } + m->valid = VM_PAGE_BITS_ALL; } else { /* * For write success, clear the dirty @@ -1772,7 +1671,7 @@ swp_pager_force_pagein(vm_object_t object, vm_pind return; } - if (swap_pager_getpages(object, &m, 1, 0) != VM_PAGER_OK) + if (swap_pager_getpages(object, &m, 1) != VM_PAGER_OK) panic("swap_pager_force_pagein: read from swap failed");/*XXX*/ vm_object_pip_wakeup(object); vm_page_dirty(m); Index: sys/vm/vm_fault.c =================================================================== --- sys/vm/vm_fault.c (revision 282213) +++ sys/vm/vm_fault.c (working copy) @@ -672,26 +672,21 @@ vnode_locked: fs.m, behind, ahead, marray, &reqpage); rv = faultcount ? - vm_pager_get_pages(fs.object, marray, faultcount, - reqpage) : VM_PAGER_FAIL; + vm_pager_get_pages(fs.object, marray, faultcount) : + VM_PAGER_FAIL; if (rv == VM_PAGER_OK) { /* * Found the page. Leave it busy while we play - * with it. + * with it. Unbusy companion pages. */ - - /* - * Relookup in case pager changed page. Pager - * is responsible for disposition of old page - * if moved. - */ - fs.m = vm_page_lookup(fs.object, fs.pindex); - if (!fs.m) { - unlock_and_deallocate(&fs); - goto RetryFault; + for (int i = 0; i < faultcount; i++) { + if (i == reqpage) + continue; + vm_page_readahead_finish(marray[i]); } - + /* Pager could have changed the page. 
*/ + fs.m = marray[reqpage]; hardfault++; break; /* break to PAGE HAS BEEN FOUND */ } Index: sys/vm/vm_glue.c =================================================================== --- sys/vm/vm_glue.c (revision 282213) +++ sys/vm/vm_glue.c (working copy) @@ -230,7 +230,7 @@ vsunlock(void *addr, size_t len) static vm_page_t vm_imgact_hold_page(vm_object_t object, vm_ooffset_t offset) { - vm_page_t m, ma[1]; + vm_page_t m; vm_pindex_t pindex; int rv; @@ -238,11 +238,7 @@ vm_imgact_hold_page(vm_object_t object, vm_ooffset pindex = OFF_TO_IDX(offset); m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL); if (m->valid != VM_PAGE_BITS_ALL) { - ma[0] = m; - rv = vm_pager_get_pages(object, ma, 1, 0); - m = vm_page_lookup(object, pindex); - if (m == NULL) - goto out; + rv = vm_pager_get_pages(object, &m, 1); if (rv != VM_PAGER_OK) { vm_page_lock(m); vm_page_free(m); @@ -571,34 +567,37 @@ vm_thread_swapin(struct thread *td) { vm_object_t ksobj; vm_page_t ma[KSTACK_MAX_PAGES]; - int i, j, k, pages, rv; + int pages; pages = td->td_kstack_pages; ksobj = td->td_kstack_obj; VM_OBJECT_WLOCK(ksobj); - for (i = 0; i < pages; i++) + for (int i = 0; i < pages; i++) ma[i] = vm_page_grab(ksobj, i, VM_ALLOC_NORMAL | VM_ALLOC_WIRED); - for (i = 0; i < pages; i++) { - if (ma[i]->valid != VM_PAGE_BITS_ALL) { - vm_page_assert_xbusied(ma[i]); - vm_object_pip_add(ksobj, 1); - for (j = i + 1; j < pages; j++) { - if (ma[j]->valid != VM_PAGE_BITS_ALL) - vm_page_assert_xbusied(ma[j]); - if (ma[j]->valid == VM_PAGE_BITS_ALL) - break; - } - rv = vm_pager_get_pages(ksobj, ma + i, j - i, 0); - if (rv != VM_PAGER_OK) - panic("vm_thread_swapin: cannot get kstack for proc: %d", - td->td_proc->p_pid); - vm_object_pip_wakeup(ksobj); - for (k = i; k < j; k++) - ma[k] = vm_page_lookup(ksobj, k); + for (int i = 0; i < pages;) { + int j, a, count, rv; + + vm_page_assert_xbusied(ma[i]); + if (ma[i]->valid == VM_PAGE_BITS_ALL) { vm_page_xunbusy(ma[i]); - } else if (vm_page_xbusied(ma[i])) - vm_page_xunbusy(ma[i]); + i++; + continue; + } + vm_object_pip_add(ksobj, 1); + for (j = i + 1; j < pages; j++) + if (ma[j]->valid == VM_PAGE_BITS_ALL) + break; + rv = vm_pager_has_page(ksobj, ma[i]->pindex, NULL, &a); + KASSERT(rv == 1, ("%s: missing page %p", __func__, ma[i])); + count = min(a + 1, j - i); + rv = vm_pager_get_pages(ksobj, ma + i, count); + KASSERT(rv == VM_PAGER_OK, ("%s: cannot get kstack for proc %d", + __func__, td->td_proc->p_pid)); + vm_object_pip_wakeup(ksobj); + for (j = i; j < i + count; j++) + vm_page_xunbusy(ma[j]); + i += count; } VM_OBJECT_WUNLOCK(ksobj); pmap_qenter(td->td_kstack, ma, pages); Index: sys/vm/vm_object.c =================================================================== --- sys/vm/vm_object.c (revision 282213) +++ sys/vm/vm_object.c (working copy) @@ -2042,7 +2042,7 @@ vm_object_page_cache(vm_object_t object, vm_pindex boolean_t vm_object_populate(vm_object_t object, vm_pindex_t start, vm_pindex_t end) { - vm_page_t m, ma[1]; + vm_page_t m; vm_pindex_t pindex; int rv; @@ -2050,11 +2050,7 @@ vm_object_populate(vm_object_t object, vm_pindex_t for (pindex = start; pindex < end; pindex++) { m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL); if (m->valid != VM_PAGE_BITS_ALL) { - ma[0] = m; - rv = vm_pager_get_pages(object, ma, 1, 0); - m = vm_page_lookup(object, pindex); - if (m == NULL) - break; + rv = vm_pager_get_pages(object, &m, 1); if (rv != VM_PAGER_OK) { vm_page_lock(m); vm_page_free(m); Index: sys/vm/vm_page.c =================================================================== --- sys/vm/vm_page.c 
(revision 282213) +++ sys/vm/vm_page.c (working copy) @@ -863,32 +863,19 @@ void vm_page_readahead_finish(vm_page_t m) { - if (m->valid != 0) { - /* - * Since the page is not the requested page, whether - * it should be activated or deactivated is not - * obvious. Empirical results have shown that - * deactivating the page is usually the best choice, - * unless the page is wanted by another thread. - */ - vm_page_lock(m); - if ((m->busy_lock & VPB_BIT_WAITERS) != 0) - vm_page_activate(m); - else - vm_page_deactivate(m); - vm_page_unlock(m); - vm_page_xunbusy(m); - } else { - /* - * Free the completely invalid page. Such page state - * occurs due to the short read operation which did - * not covered our page at all, or in case when a read - * error happens. - */ - vm_page_lock(m); - vm_page_free(m); - vm_page_unlock(m); - } + /* + * Since the page is not the requested page, whether it should be + * activated or deactivated is not obvious. Empirical results have + * shown that deactivating the page is usually the best choice, + * unless the page is wanted by another thread. + */ + vm_page_lock(m); + if ((m->busy_lock & VPB_BIT_WAITERS) != 0) + vm_page_activate(m); + else + vm_page_deactivate(m); + vm_page_unlock(m); + vm_page_xunbusy(m); } /* Index: sys/vm/vm_pager.c =================================================================== --- sys/vm/vm_pager.c (revision 282213) +++ sys/vm/vm_pager.c (working copy) @@ -251,7 +251,95 @@ vm_pager_deallocate(object) } /* - * vm_pager_get_pages() - inline, see vm/vm_pager.h + * Retrieve pages from the VM system in order to map them into an object + * ( or into VM space somewhere ). If the pagein was successful, we + * must fully validate it. + */ +int +vm_pager_get_pages(vm_object_t object, vm_page_t *m, int count) +{ +#ifdef INVARIANTS + vm_pindex_t pindex = m[0]->pindex; +#endif + int r; + + VM_OBJECT_ASSERT_WLOCKED(object); + KASSERT(count > 0, ("%s: 0 count", __func__)); + + /* + * If the last page is partially valid, just return it and zero-out + * the blanks. Partially valid pages can only occur at the file EOF. + */ + if (m[count - 1]->valid != 0) { + vm_page_zero_invalid(m[count - 1], TRUE); + if (--count == 0) + return (VM_PAGER_OK); + } + +#ifdef INVARIANTS + /* + * All pages must be busied, not mapped, not valid, not dirty + * and belong to the proper object. + */ + for (int i = 0 ; i < count; i++) { + vm_page_assert_xbusied(m[i]); + KASSERT(!pmap_page_is_mapped(m[i]), + ("%s: page %p is mapped", __func__, m[i])); + KASSERT(m[i]->valid == 0, + ("%s: request for a valid page %p", __func__, m[i])); + KASSERT(m[i]->dirty == 0, + ("%s: page %p is dirty", __func__, m[i])); + KASSERT(m[i]->object == object, + ("%s: wrong object %p/%p", __func__, object, m[i]->object)); + } +#endif + + r = (*pagertab[object->type]->pgo_getpages)(object, m, count); + if (r != VM_PAGER_OK) + return (r); + + for (int i = 0; i < count; i++) { + /* + * If pager has replaced a page, assert that it had + * updated the array. + */ + KASSERT(m[i] == vm_page_lookup(object, pindex++), + ("%s: mismatch page %p pindex %ju", __func__, + m[i], (uintmax_t )pindex - 1)); + /* + * Zero out partially filled data. 
+ */ + if (m[i]->valid != VM_PAGE_BITS_ALL) + vm_page_zero_invalid(m[count - 1], TRUE); + } + return (VM_PAGER_OK); +} + +int +vm_pager_get_pages_async(vm_object_t object, vm_page_t *m, int count, + pgo_getpages_iodone_t iodone, void *arg) +{ + + VM_OBJECT_ASSERT_WLOCKED(object); + KASSERT(count > 0, ("%s: 0 count", __func__)); + + /* + * If the last page is partially valid, just return it and zero-out + * the blanks. Partially valid pages can only occur at the file EOF. + */ + if (m[count - 1]->valid != 0) { + vm_page_zero_invalid(m[count - 1], TRUE); + if (--count == 0) { + iodone(arg, m, 1, 0); + return (VM_PAGER_OK); + } + } + + return ((*pagertab[object->type]->pgo_getpages_async)(object, m, + count, iodone, arg)); +} + +/* * vm_pager_put_pages() - inline, see vm/vm_pager.h * vm_pager_has_page() - inline, see vm/vm_pager.h */ @@ -283,39 +371,6 @@ vm_pager_object_lookup(struct pagerlst *pg_list, v } /* - * Free the non-requested pages from the given array. To remove all pages, - * caller should provide out of range reqpage number. - */ -void -vm_pager_free_nonreq(vm_object_t object, vm_page_t ma[], int reqpage, - int npages, boolean_t object_locked) -{ - enum { UNLOCKED, CALLER_LOCKED, INTERNALLY_LOCKED } locked; - int i; - - if (object_locked) { - VM_OBJECT_ASSERT_WLOCKED(object); - locked = CALLER_LOCKED; - } else { - VM_OBJECT_ASSERT_UNLOCKED(object); - locked = UNLOCKED; - } - for (i = 0; i < npages; ++i) { - if (i != reqpage) { - if (locked == UNLOCKED) { - VM_OBJECT_WLOCK(object); - locked = INTERNALLY_LOCKED; - } - vm_page_lock(ma[i]); - vm_page_free(ma[i]); - vm_page_unlock(ma[i]); - } - } - if (locked == INTERNALLY_LOCKED) - VM_OBJECT_WUNLOCK(object); -} - -/* * initialize a physical buffer */ Index: sys/vm/vm_pager.h =================================================================== --- sys/vm/vm_pager.h (revision 282213) +++ sys/vm/vm_pager.h (working copy) @@ -50,9 +50,9 @@ typedef void pgo_init_t(void); typedef vm_object_t pgo_alloc_t(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t, struct ucred *); typedef void pgo_dealloc_t(vm_object_t); -typedef int pgo_getpages_t(vm_object_t, vm_page_t *, int, int); +typedef int pgo_getpages_t(vm_object_t, vm_page_t *, int); typedef void pgo_getpages_iodone_t(void *, vm_page_t *, int, int); -typedef int pgo_getpages_async_t(vm_object_t, vm_page_t *, int, int, +typedef int pgo_getpages_async_t(vm_object_t, vm_page_t *, int, pgo_getpages_iodone_t, void *); typedef void pgo_putpages_t(vm_object_t, vm_page_t *, int, int, int *); typedef boolean_t pgo_haspage_t(vm_object_t, vm_pindex_t, int *, int *); @@ -106,49 +106,13 @@ vm_object_t vm_pager_allocate(objtype_t, void *, v vm_ooffset_t, struct ucred *); void vm_pager_bufferinit(void); void vm_pager_deallocate(vm_object_t); -static __inline int vm_pager_get_pages(vm_object_t, vm_page_t *, int, int); -static inline int vm_pager_get_pages_async(vm_object_t, vm_page_t *, int, - int, pgo_getpages_iodone_t, void *); +int vm_pager_get_pages(vm_object_t, vm_page_t *, int); +int vm_pager_get_pages_async(vm_object_t, vm_page_t *, int, + pgo_getpages_iodone_t, void *); static __inline boolean_t vm_pager_has_page(vm_object_t, vm_pindex_t, int *, int *); void vm_pager_init(void); vm_object_t vm_pager_object_lookup(struct pagerlst *, void *); -void vm_pager_free_nonreq(vm_object_t object, vm_page_t ma[], int reqpage, - int npages, boolean_t object_locked); -/* - * vm_page_get_pages: - * - * Retrieve pages from the VM system in order to map them into an object - * ( or into VM space somewhere ). 
If the pagein was successful, we - * must fully validate it. - */ -static __inline int -vm_pager_get_pages( - vm_object_t object, - vm_page_t *m, - int count, - int reqpage -) { - int r; - - VM_OBJECT_ASSERT_WLOCKED(object); - r = (*pagertab[object->type]->pgo_getpages)(object, m, count, reqpage); - if (r == VM_PAGER_OK && m[reqpage]->valid != VM_PAGE_BITS_ALL) { - vm_page_zero_invalid(m[reqpage], TRUE); - } - return (r); -} - -static inline int -vm_pager_get_pages_async(vm_object_t object, vm_page_t *m, int count, - int reqpage, pgo_getpages_iodone_t iodone, void *arg) -{ - - VM_OBJECT_ASSERT_WLOCKED(object); - return ((*pagertab[object->type]->pgo_getpages_async)(object, m, - count, reqpage, iodone, arg)); -} - static __inline void vm_pager_put_pages( vm_object_t object, Index: sys/vm/vnode_pager.c =================================================================== --- sys/vm/vnode_pager.c (revision 282213) +++ sys/vm/vnode_pager.c (working copy) @@ -84,11 +84,9 @@ static int vnode_pager_addr(struct vnode *vp, vm_o static int vnode_pager_input_smlfs(vm_object_t object, vm_page_t m); static int vnode_pager_input_old(vm_object_t object, vm_page_t m); static void vnode_pager_dealloc(vm_object_t); -static int vnode_pager_local_getpages0(struct vnode *, vm_page_t *, int, int, +static int vnode_pager_getpages(vm_object_t, vm_page_t *, int); +static int vnode_pager_getpages_async(vm_object_t, vm_page_t *, int, vop_getpages_iodone_t, void *); -static int vnode_pager_getpages(vm_object_t, vm_page_t *, int, int); -static int vnode_pager_getpages_async(vm_object_t, vm_page_t *, int, int, - vop_getpages_iodone_t, void *); static void vnode_pager_putpages(vm_object_t, vm_page_t *, int, int, int *); static boolean_t vnode_pager_haspage(vm_object_t, vm_pindex_t, int *, int *); static vm_object_t vnode_pager_alloc(void *, vm_ooffset_t, vm_prot_t, @@ -662,7 +660,7 @@ vnode_pager_input_old(vm_object_t object, vm_page_ * backing vp's VOP_GETPAGES. 
*/ static int -vnode_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage) +vnode_pager_getpages(vm_object_t object, vm_page_t *m, int count) { int rtval; struct vnode *vp; @@ -670,7 +668,7 @@ static int vp = object->handle; VM_OBJECT_WUNLOCK(object); - rtval = VOP_GETPAGES(vp, m, bytes, reqpage); + rtval = VOP_GETPAGES(vp, m, bytes); KASSERT(rtval != EOPNOTSUPP, ("vnode_pager: FS getpages not implemented\n")); VM_OBJECT_WLOCK(object); @@ -679,7 +677,7 @@ static int static int vnode_pager_getpages_async(vm_object_t object, vm_page_t *m, int count, - int reqpage, vop_getpages_iodone_t iodone, void *arg) + vop_getpages_iodone_t iodone, void *arg) { struct vnode *vp; int rtval; @@ -686,8 +684,7 @@ vnode_pager_getpages_async(vm_object_t object, vm_ vp = object->handle; VM_OBJECT_WUNLOCK(object); - rtval = VOP_GETPAGES_ASYNC(vp, m, count * PAGE_SIZE, reqpage, - iodone, arg); + rtval = VOP_GETPAGES_ASYNC(vp, m, count * PAGE_SIZE, iodone, arg); KASSERT(rtval != EOPNOTSUPP, ("vnode_pager: FS getpages_async not implemented\n")); VM_OBJECT_WLOCK(object); @@ -703,8 +700,8 @@ int vnode_pager_local_getpages(struct vop_getpages_args *ap) { - return (vnode_pager_local_getpages0(ap->a_vp, ap->a_m, ap->a_count, - ap->a_reqpage, NULL, NULL)); + return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m, ap->a_count, + NULL, NULL)); } int @@ -711,42 +708,10 @@ int vnode_pager_local_getpages_async(struct vop_getpages_async_args *ap) { - return (vnode_pager_local_getpages0(ap->a_vp, ap->a_m, ap->a_count, - ap->a_reqpage, ap->a_iodone, ap->a_arg)); + return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m, ap->a_count, + ap->a_iodone, ap->a_arg)); } -static int -vnode_pager_local_getpages0(struct vnode *vp, vm_page_t *m, int bytecount, - int reqpage, vop_getpages_iodone_t iodone, void *arg) -{ - vm_page_t mreq; - - mreq = m[reqpage]; - - /* - * Since the caller has busied the requested page, that page's valid - * field will not be changed by other threads. - */ - vm_page_assert_xbusied(mreq); - - /* - * The requested page has valid blocks. Invalid part can only - * exist at the end of file, and the page is made fully valid - * by zeroing in vm_pager_get_pages(). Free non-requested - * pages, since no i/o is done to read its content. - */ - if (mreq->valid != 0) { - vm_pager_free_nonreq(mreq->object, m, reqpage, - round_page(bytecount) / PAGE_SIZE, FALSE); - if (iodone != NULL) - iodone(arg, m, reqpage, 0); - return (VM_PAGER_OK); - } - - return (vnode_pager_generic_getpages(vp, m, bytecount, reqpage, - iodone, arg)); -} - /* * This is now called from local media FS's to operate against their * own vnodes if they fail to implement VOP_GETPAGES. 
@@ -753,29 +718,31 @@ vnode_pager_local_getpages_async(struct vop_getpag */ int vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *m, int bytecount, - int reqpage, vop_getpages_iodone_t iodone, void *arg) + vop_getpages_iodone_t iodone, void *arg) { vm_object_t object; off_t foff; - int i, j, size, bsize, first, *freecnt; - daddr_t firstaddr, reqblock; + int error, count, bsize, i, after, secmask, *freecnt; + daddr_t reqblock; struct bufobj *bo; - int runpg; - int runend; struct buf *bp; - int count; - int error; - object = vp->v_object; - count = bytecount / PAGE_SIZE; + KASSERT(vp->v_type != VCHR && vp->v_type != VBLK, + ("%s does not support devices", __func__)); + KASSERT(bytecount > 0 && (bytecount & ~PAGE_MASK) == bytecount, + ("%s: bytecount %d", __func__, bytecount)); - KASSERT(vp->v_type != VCHR && vp->v_type != VBLK, - ("vnode_pager_generic_getpages does not support devices")); if (vp->v_iflag & VI_DOOMED) return VM_PAGER_BAD; + object = vp->v_object; + foff = IDX_TO_OFF(m[0]->pindex); + + KASSERT(foff < object->un_pager.vnp.vnp_size, + ("%s: page %p offset beyond vp %p size", __func__, m[0], vp)); + + count = bytecount >> PAGE_SHIFT; bsize = vp->v_mount->mnt_stat.f_iosize; - foff = IDX_TO_OFF(m[reqpage]->pindex); /* * Synchronous and asynchronous paging operations use different @@ -794,172 +761,58 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_ * If the file system doesn't support VOP_BMAP, use old way of * getting pages via VOP_READ. */ - error = VOP_BMAP(vp, foff / bsize, &bo, &reqblock, NULL, NULL); + error = VOP_BMAP(vp, foff / bsize, &bo, &reqblock, &after, NULL); if (error == EOPNOTSUPP) { relpbuf(bp, freecnt); VM_OBJECT_WLOCK(object); - for (i = 0; i < count; i++) - if (i != reqpage) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - PCPU_INC(cnt.v_vnodein); - PCPU_INC(cnt.v_vnodepgsin); - error = vnode_pager_input_old(object, m[reqpage]); + for (i = 0; i < count; i++) { + PCPU_INC(cnt.v_vnodein); + PCPU_INC(cnt.v_vnodepgsin); + error = vnode_pager_input_old(object, m[i]); + if (error) + break; + } VM_OBJECT_WUNLOCK(object); return (error); } else if (error != 0) { relpbuf(bp, freecnt); - vm_pager_free_nonreq(object, m, reqpage, count, FALSE); return (VM_PAGER_ERROR); - - /* - * if the blocksize is smaller than a page size, then use - * special small filesystem code. NFS sometimes has a small - * blocksize, but it can handle large reads itself. - */ - } else if ((PAGE_SIZE / bsize) > 1 && - (vp->v_mount->mnt_stat.f_type != nfs_mount_type)) { - relpbuf(bp, freecnt); - vm_pager_free_nonreq(object, m, reqpage, count, FALSE); - PCPU_INC(cnt.v_vnodein); - PCPU_INC(cnt.v_vnodepgsin); - return vnode_pager_input_smlfs(object, m[reqpage]); } /* - * Since the caller has busied the requested page, that page's valid - * field will not be changed by other threads. + * If the blocksize is smaller than a page size, then use + * special small filesystem code. NFS sometimes has a small + * blocksize, but it can handle large reads itself. */ - vm_page_assert_xbusied(m[reqpage]); - - /* - * If we have a completely valid page available to us, we can - * clean up and return. Otherwise we have to re-read the - * media. 
- */ - if (m[reqpage]->valid == VM_PAGE_BITS_ALL) { + if ((PAGE_SIZE / bsize) > 1 && + (vp->v_mount->mnt_stat.f_type != nfs_mount_type)) { relpbuf(bp, freecnt); - vm_pager_free_nonreq(object, m, reqpage, count, FALSE); - return (VM_PAGER_OK); - } else if (reqblock == -1) { - relpbuf(bp, freecnt); - pmap_zero_page(m[reqpage]); - KASSERT(m[reqpage]->dirty == 0, - ("vnode_pager_generic_getpages: page %p is dirty", m)); - VM_OBJECT_WLOCK(object); - m[reqpage]->valid = VM_PAGE_BITS_ALL; - vm_pager_free_nonreq(object, m, reqpage, count, TRUE); - VM_OBJECT_WUNLOCK(object); - return (VM_PAGER_OK); - } else if (m[reqpage]->valid != 0) { - VM_OBJECT_WLOCK(object); - m[reqpage]->valid = 0; - VM_OBJECT_WUNLOCK(object); - } - - /* - * here on direct device I/O - */ - firstaddr = -1; - - /* - * calculate the run that includes the required page - */ - for (first = 0, i = 0; i < count; i = runend) { - if (vnode_pager_addr(vp, IDX_TO_OFF(m[i]->pindex), &firstaddr, - &runpg) != 0) { - relpbuf(bp, freecnt); - /* The requested page may be out of range. */ - vm_pager_free_nonreq(object, m + i, reqpage - i, - count - i, FALSE); - return (VM_PAGER_ERROR); + for (i = 0; i < count; i++) { + PCPU_INC(cnt.v_vnodein); + PCPU_INC(cnt.v_vnodepgsin); + error = vnode_pager_input_smlfs(object, m[i]); + if (error) + break; } - if (firstaddr == -1) { - VM_OBJECT_WLOCK(object); - if (i == reqpage && foff < object->un_pager.vnp.vnp_size) { - panic("vnode_pager_getpages: unexpected missing page: firstaddr: %jd, foff: 0x%jx%08jx, vnp_size: 0x%jx%08jx", - (intmax_t)firstaddr, (uintmax_t)(foff >> 32), - (uintmax_t)foff, - (uintmax_t) - (object->un_pager.vnp.vnp_size >> 32), - (uintmax_t)object->un_pager.vnp.vnp_size); - } - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - VM_OBJECT_WUNLOCK(object); - runend = i + 1; - first = runend; - continue; - } - runend = i + runpg; - if (runend <= reqpage) { - VM_OBJECT_WLOCK(object); - for (j = i; j < runend; j++) { - vm_page_lock(m[j]); - vm_page_free(m[j]); - vm_page_unlock(m[j]); - } - VM_OBJECT_WUNLOCK(object); - } else { - if (runpg < (count - first)) { - VM_OBJECT_WLOCK(object); - for (i = first + runpg; i < count; i++) { - vm_page_lock(m[i]); - vm_page_free(m[i]); - vm_page_unlock(m[i]); - } - VM_OBJECT_WUNLOCK(object); - count = first + runpg; - } - break; - } - first = runend; + return (error); } /* - * the first and last page have been calculated now, move input pages - * to be zero based... + * Truncate bytecount to vnode real size and round up physical size + * for real devices. */ - if (first != 0) { - m += first; - count -= first; - reqpage -= first; - } + if ((foff + bytecount) > object->un_pager.vnp.vnp_size) + bytecount = object->un_pager.vnp.vnp_size - foff; + secmask = bo->bo_bsize - 1; + KASSERT(secmask < PAGE_SIZE && secmask > 0, + ("%s: sector size %d too large", __func__, secmask + 1)); + bytecount = (bytecount + secmask) & ~secmask; /* - * calculate the file virtual address for the transfer + * And map the pages to be read into the kva, if the filesystem + * requires mapped buffers. */ - foff = IDX_TO_OFF(m[0]->pindex); - - /* - * calculate the size of the transfer - */ - size = count * PAGE_SIZE; - KASSERT(count > 0, ("zero count")); - if ((foff + size) > object->un_pager.vnp.vnp_size) - size = object->un_pager.vnp.vnp_size - foff; - KASSERT(size > 0, ("zero size")); - - /* - * round up physical size for real devices. 
- */ - if (1) { - int secmask = bo->bo_bsize - 1; - KASSERT(secmask < PAGE_SIZE && secmask > 0, - ("vnode_pager_generic_getpages: sector size %d too large", - secmask + 1)); - size = (size + secmask) & ~secmask; - } - bp->b_kvaalloc = bp->b_data; - - /* - * and map the pages to be read into the kva, if the filesystem - * requires mapped buffers. - */ if ((vp->v_mount->mnt_kern_flag & MNTK_UNMAPPED_BUFS) != 0 && unmapped_buf_allowed) { bp->b_data = unmapped_buf; @@ -969,38 +822,33 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_ } else pmap_qenter((vm_offset_t)bp->b_kvaalloc, m, count); - /* build a minimal buffer header */ + /* Build a minimal buffer header. */ bp->b_iocmd = BIO_READ; KASSERT(bp->b_rcred == NOCRED, ("leaking read ucred")); KASSERT(bp->b_wcred == NOCRED, ("leaking write ucred")); bp->b_rcred = crhold(curthread->td_ucred); bp->b_wcred = crhold(curthread->td_ucred); - bp->b_blkno = firstaddr; + bp->b_blkno = reqblock + ((foff % bsize) / DEV_BSIZE); pbgetbo(bo, bp); bp->b_vp = vp; - bp->b_bcount = size; - bp->b_bufsize = size; - bp->b_runningbufspace = bp->b_bufsize; + bp->b_bcount = bp->b_bufsize = bp->b_runningbufspace = bytecount; for (i = 0; i < count; i++) bp->b_pages[i] = m[i]; bp->b_npages = count; - bp->b_pager.pg_reqpage = reqpage; + bp->b_iooffset = dbtob(bp->b_blkno); + atomic_add_long(&runningbufspace, bp->b_runningbufspace); - PCPU_INC(cnt.v_vnodein); PCPU_ADD(cnt.v_vnodepgsin, count); - /* do the input */ - bp->b_iooffset = dbtob(bp->b_blkno); - if (iodone != NULL) { /* async */ - bp->b_pager.pg_iodone = iodone; + bp->b_pgiodone = iodone; bp->b_caller1 = arg; bp->b_iodone = vnode_pager_generic_getpages_done_async; bp->b_flags |= B_ASYNC; BUF_KERNPROC(bp); bstrategy(bp); - /* Good bye! */ + return (0); } else { bp->b_iodone = bdone; bstrategy(bp); @@ -1011,9 +859,8 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_ bp->b_vp = NULL; pbrelbo(bp); relpbuf(bp, &vnode_pbuf_freecnt); + return (error != 0 ? VM_PAGER_ERROR : VM_PAGER_OK); } - - return (error != 0 ? 
VM_PAGER_ERROR : VM_PAGER_OK); } static void @@ -1022,8 +869,7 @@ vnode_pager_generic_getpages_done_async(struct buf int error; error = vnode_pager_generic_getpages_done(bp); - bp->b_pager.pg_iodone(bp->b_caller1, bp->b_pages, - bp->b_pager.pg_reqpage, error); + bp->b_pgiodone(bp->b_caller1, bp->b_pages, bp->b_npages, error); for (int i = 0; i < bp->b_npages; i++) bp->b_pages[i] = NULL; bp->b_vp = NULL; @@ -1089,9 +935,6 @@ vnode_pager_generic_getpages_done(struct buf *bp) object->un_pager.vnp.vnp_size - tfoff)) == 0, ("%s: page %p is dirty", __func__, mt)); } - - if (i != bp->b_pager.pg_reqpage) - vm_page_readahead_finish(mt); } VM_OBJECT_WUNLOCK(object); if (error != 0) Index: sys/vm/vnode_pager.h =================================================================== --- sys/vm/vnode_pager.h (revision 282213) +++ sys/vm/vnode_pager.h (working copy) @@ -41,7 +41,7 @@ #ifdef _KERNEL int vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *m, - int count, int reqpage, vop_getpages_iodone_t iodone, void *arg); + int count, vop_getpages_iodone_t iodone, void *arg); int vnode_pager_generic_putpages(struct vnode *vp, vm_page_t *m, int count, boolean_t sync, int *rtvals); --45Z9DzgjV8m4Oswq-- From owner-freebsd-arch@FreeBSD.ORG Fri May 1 16:56:39 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 652C5FCE for ; Fri, 1 May 2015 16:56:39 +0000 (UTC) Received: from mail-wg0-x231.google.com (mail-wg0-x231.google.com [IPv6:2a00:1450:400c:c00::231]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id E006510E9 for ; Fri, 1 May 2015 16:56:38 +0000 (UTC) Received: by wgin8 with SMTP id n8so95495050wgi.0 for ; Fri, 01 May 2015 09:56:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=9WFq2dRwwItZbk/UUMZi5StodnW2EEnemXORL6yI3zM=; b=YVkMYEAyAzQDL4zlQBkpdYwNTjTB7hZle2usKFwqmUVPBQ9x52trWCYZQHoDs9DpXU xMranw+o9D3UEE3Eo/02SWxmBfPOfnAemLg9c0DdJsijUnuxlalst4DQ13LSKi0VTjge YzdBlIl0fV2zYtuXRnXowVe5hRcHsLVGVKq7VW5SSDBeqIelE5sqJT2TkIU/vWA26vuj G5kKfLj05DcFE74lXOBo6uBhyYGB+3tdNs2w5rmKDq4zsv8xqIShmLILZJxq7ciKvti5 04+MYY72ahMzpzRMfo55xVgdKTvoscUf5/ZM0galDPIB0/T1Yj7Mzv0UjxiytxC5kO48 wcEA== X-Received: by 10.194.248.132 with SMTP id ym4mr20146995wjc.74.1430499397328; Fri, 01 May 2015 09:56:37 -0700 (PDT) Received: from dft-labs.eu (n1x0n-1-pt.tunnel.tserv5.lon1.ipv6.he.net. [2001:470:1f08:1f7::2]) by mx.google.com with ESMTPSA id nb9sm7428478wic.10.2015.05.01.09.56.35 (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Fri, 01 May 2015 09:56:36 -0700 (PDT) Date: Fri, 1 May 2015 18:56:33 +0200 From: Mateusz Guzik To: Bruce Evans Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH 1/2] Generalised support for copy-on-write structures shared by threads. 
Message-ID: <20150501165633.GA7112@dft-labs.eu>
References: <1430188443-19413-1-git-send-email-mjguzik@gmail.com> <1430188443-19413-2-git-send-email-mjguzik@gmail.com> <20150428181802.F1119@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20150428181802.F1119@besplex.bde.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 01 May 2015 16:56:39 -0000

On Tue, Apr 28, 2015 at 06:45:01PM +1000, Bruce Evans wrote:
> On Tue, 28 Apr 2015, Mateusz Guzik wrote:
> >diff --git a/sys/sys/proc.h b/sys/sys/proc.h
> >index 64b99fc..f29d796 100644
> >--- a/sys/sys/proc.h
> >+++ b/sys/sys/proc.h
> >@@ -225,6 +225,7 @@ struct thread {
> >/* Cleared during fork1() */
> >#define td_startzero td_flags
> > int td_flags; /* (t) TDF_* flags. */
> >+ u_int td_cowgeneration;/* (k) Generation of COW pointers. */
> > int td_inhibitors; /* (t) Why can not run. */
> > int td_pflags; /* (k) Private thread (TDP_*) flags. */
> > int td_dupfd; /* (k) Ret value from fdopen. XXX */
>
> This name is so verbose that it messes up the comment indentation.

Yeah, that's crap, but the naming is already inconsistent and verbose.
For instance there is td_generation already. Is the _cowgen variant ok?

> >@@ -830,6 +832,11 @@ extern pid_t pid_max;
> > KASSERT((p)->p_lock == 0, ("process held")); \
> >} while (0)
> >
> >+#define PROC_UPDATE_COW(p) do { \
> >+ PROC_LOCK_ASSERT((p), MA_OWNED); \
> >+ p->p_cowgeneration++; \
>
> Missing parentheses.

Oops, fixed.

> >+} while (0)
> >+
> >/* Check whether a thread is safe to be swapped out. */
> >#define thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP)
> >
> >@@ -976,6 +983,10 @@ struct thread *thread_alloc(int pages);
> >int thread_alloc_stack(struct thread *, int pages);
> >void thread_exit(void) __dead2;
> >void thread_free(struct thread *td);
> >+void thread_get_cow_proc(struct thread *newtd, struct proc *p);
> >+void thread_get_cow(struct thread *newtd, struct thread *td);
> >+void thread_free_cow(struct thread *td);
> >+void thread_update_cow(struct thread *td);
>
> Insertion sort errors.
>
> Namespace errors. I don't like the style of naming things with objects
> first and verbs last, but it is good for sorting related objects. Here
> the verbs "get" and "free" are in the middle of the objects
> "thread_cow_proc" and "thread_cow". Also, shouldn't it be "thread_proc_cow"
> (but less verbose, maybe "tpcow"), not "thread_cow_proc", to indicate
> that the cow is hung off the proc? I didn't notice the details, but it
> makes no sense to hang a proc off a cow :-).
>

Well, all current funcs are named thread_*, so tpcow and the like would
be inconsistent. On another look, the existence of thread_suspend_*
suggests thread_cow_* naming. With this, putting the _proc variant
anywhere but at the end also breaks consistency. 'thread_cow_from_proc'
would increase verbosity.

That said, I would say the patch below is ok enough.
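To recap the scheme for anyone joining the thread: per-process state
that threads cache privately (so far only the credentials) is tagged
with a generation counter; writers bump it under the proc lock, and
each thread lazily resyncs on its next kernel entry. A rough sketch
using the names from the patch below (fragments, not complete
functions):

	/* Writer side: replace a COW-shared structure under the proc
	 * lock; proc_set_cred() bumps p->p_cowgen via PROC_UPDATE_COW(),
	 * invalidating every thread's cached generation. */
	PROC_LOCK(p);
	oldcred = proc_set_cred(p, newcred);
	PROC_UNLOCK(p);
	crfree(oldcred);

	/* Reader side: on kernel entry (cf. the trap.c hunks below) a
	 * plain integer comparison decides whether this thread has to
	 * refresh its private snapshot of the COW pointers. */
	if (td->td_cowgen != p->p_cowgen)
		thread_cow_update(td);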
diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c
index 193d207..cef3221 100644
--- a/sys/amd64/amd64/trap.c
+++ b/sys/amd64/amd64/trap.c
@@ -257,8 +257,8 @@ trap(struct trapframe *frame)
 		td->td_pticks = 0;
 		td->td_frame = frame;
 		addr = frame->tf_rip;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 
 		switch (type) {
 		case T_PRIVINFLT:	/* privileged instruction fault */
diff --git a/sys/arm/arm/trap-v6.c b/sys/arm/arm/trap-v6.c
index abafa86..7463d3c 100644
--- a/sys/arm/arm/trap-v6.c
+++ b/sys/arm/arm/trap-v6.c
@@ -394,8 +394,8 @@ abort_handler(struct trapframe *tf, int prefetch)
 	p = td->td_proc;
 	if (usermode) {
 		td->td_pticks = 0;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 	}
 
 	/* Invoke the appropriate handler, if necessary. */
diff --git a/sys/arm/arm/trap.c b/sys/arm/arm/trap.c
index 0f142ce..d7fb73a 100644
--- a/sys/arm/arm/trap.c
+++ b/sys/arm/arm/trap.c
@@ -214,8 +214,8 @@ abort_handler(struct trapframe *tf, int type)
 	if (user) {
 		td->td_pticks = 0;
 		td->td_frame = tf;
-		if (td->td_ucred != td->td_proc->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != td->td_proc->p_cowgen)
+			thread_cow_update(td);
 	}
 
 	/* Grab the current pcb */
@@ -644,8 +644,8 @@ prefetch_abort_handler(struct trapframe *tf)
 
 	if (TRAP_USERMODE(tf)) {
 		td->td_frame = tf;
-		if (td->td_ucred != td->td_proc->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != td->td_proc->p_cowgen)
+			thread_cow_update(td);
 	}
 	fault_pc = tf->tf_pc;
 	if (td->td_md.md_spinlock_count == 0) {
diff --git a/sys/i386/i386/trap.c b/sys/i386/i386/trap.c
index d783a2b..b118e73 100644
--- a/sys/i386/i386/trap.c
+++ b/sys/i386/i386/trap.c
@@ -306,8 +306,8 @@ trap(struct trapframe *frame)
 		td->td_pticks = 0;
 		td->td_frame = frame;
 		addr = frame->tf_eip;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 
 		switch (type) {
 		case T_PRIVINFLT:	/* privileged instruction fault */
diff --git a/sys/kern/init_main.c b/sys/kern/init_main.c
index b77b788..e0042e9 100644
--- a/sys/kern/init_main.c
+++ b/sys/kern/init_main.c
@@ -522,8 +522,6 @@ proc0_init(void *dummy __unused)
 #ifdef MAC
 	mac_cred_create_swapper(newcred);
 #endif
-	td->td_ucred = crhold(newcred);
-
 	/* Create sigacts. */
 	p->p_sigacts = sigacts_alloc();
 
@@ -555,6 +553,10 @@ proc0_init(void *dummy __unused)
 	p->p_limit->pl_rlimit[RLIMIT_MEMLOCK].rlim_max = pageablemem;
 	p->p_cpulimit = RLIM_INFINITY;
 
+	PROC_LOCK(p);
+	thread_cow_get_proc(td, p);
+	PROC_UNLOCK(p);
+
 	/* Initialize resource accounting structures. */
 	racct_create(&p->p_racct);
 
@@ -842,10 +844,10 @@ create_init(const void *udata __unused)
 	audit_cred_proc1(newcred);
 #endif
 	proc_set_cred(initproc, newcred);
+	cred_update_thread(FIRST_THREAD_IN_PROC(initproc));
 	PROC_UNLOCK(initproc);
 	sx_xunlock(&proctree_lock);
 	crfree(oldcred);
-	cred_update_thread(FIRST_THREAD_IN_PROC(initproc));
 	cpu_set_fork_handler(FIRST_THREAD_IN_PROC(initproc), start_init, NULL);
 }
 SYSINIT(init, SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL);
diff --git a/sys/kern/kern_fork.c b/sys/kern/kern_fork.c
index c3dd792..0dfecff 100644
--- a/sys/kern/kern_fork.c
+++ b/sys/kern/kern_fork.c
@@ -496,7 +496,6 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2,
 	p2->p_swtick = ticks;
 	if (p1->p_flag & P_PROFIL)
 		startprofclock(p2);
-	td2->td_ucred = crhold(p2->p_ucred);
 
 	if (flags & RFSIGSHARE) {
 		p2->p_sigacts = sigacts_hold(p1->p_sigacts);
@@ -526,6 +525,8 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2,
 	 */
 	lim_fork(p1, p2);
 
+	thread_cow_get_proc(td2, p2);
+
 	pstats_fork(p1->p_stats, p2->p_stats);
 
 	PROC_UNLOCK(p1);
diff --git a/sys/kern/kern_kthread.c b/sys/kern/kern_kthread.c
index ee94de0..863bbc6 100644
--- a/sys/kern/kern_kthread.c
+++ b/sys/kern/kern_kthread.c
@@ -289,7 +289,7 @@ kthread_add(void (*func)(void *), void *arg, struct proc *p,
 	cpu_set_fork_handler(newtd, func, arg);
 
 	newtd->td_pflags |= TDP_KTHREAD;
-	newtd->td_ucred = crhold(p->p_ucred);
+	thread_cow_get_proc(newtd, p);
 
 	/* this code almost the same as create_thread() in kern_thr.c */
 	p->p_flag |= P_HADTHREADS;
diff --git a/sys/kern/kern_prot.c b/sys/kern/kern_prot.c
index 9c49f71..b531763 100644
--- a/sys/kern/kern_prot.c
+++ b/sys/kern/kern_prot.c
@@ -1946,9 +1946,8 @@ cred_update_thread(struct thread *td)
 
 	p = td->td_proc;
 	cred = td->td_ucred;
-	PROC_LOCK(p);
+	PROC_LOCK_ASSERT(p, MA_OWNED);
 	td->td_ucred = crhold(p->p_ucred);
-	PROC_UNLOCK(p);
 	if (cred != NULL)
 		crfree(cred);
 }
@@ -1987,6 +1986,8 @@ proc_set_cred(struct proc *p, struct ucred *newcred)
 
 	oldcred = p->p_ucred;
 	p->p_ucred = newcred;
+	if (newcred != NULL)
+		PROC_UPDATE_COW(p);
 	return (oldcred);
 }
diff --git a/sys/kern/kern_syscalls.c b/sys/kern/kern_syscalls.c
index dada746..3d3df01 100644
--- a/sys/kern/kern_syscalls.c
+++ b/sys/kern/kern_syscalls.c
@@ -31,6 +31,8 @@ __FBSDID("$FreeBSD$");
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
diff --git a/sys/kern/kern_thr.c b/sys/kern/kern_thr.c
index 6911bb97..a53bd25 100644
--- a/sys/kern/kern_thr.c
+++ b/sys/kern/kern_thr.c
@@ -228,13 +228,13 @@ create_thread(struct thread *td, mcontext_t *ctx,
 	bcopy(&td->td_startcopy, &newtd->td_startcopy,
 	    __rangeof(struct thread, td_startcopy, td_endcopy));
 	newtd->td_proc = td->td_proc;
-	newtd->td_ucred = crhold(td->td_ucred);
+	thread_cow_get(newtd, td);
 
 	if (ctx != NULL) { /* old way to set user context */
 		error = set_mcontext(newtd, ctx);
 		if (error != 0) {
+			thread_cow_free(newtd);
 			thread_free(newtd);
-			crfree(td->td_ucred);
 			goto fail;
 		}
 	} else {
@@ -246,8 +246,8 @@ create_thread(struct thread *td, mcontext_t *ctx,
 		/* Setup user TLS address and TLS pointer register. */
 		error = cpu_set_user_tls(newtd, tls_base);
 		if (error != 0) {
+			thread_cow_free(newtd);
 			thread_free(newtd);
-			crfree(td->td_ucred);
 			goto fail;
 		}
 	}
diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c
index 0a93dbd..063dfe9 100644
--- a/sys/kern/kern_thread.c
+++ b/sys/kern/kern_thread.c
@@ -324,8 +324,7 @@ thread_reap(void)
 	mtx_unlock_spin(&zombie_lock);
 	while (td_first) {
 		td_next = TAILQ_NEXT(td_first, td_slpq);
-		if (td_first->td_ucred)
-			crfree(td_first->td_ucred);
+		thread_cow_free(td_first);
 		thread_free(td_first);
 		td_first = td_next;
 	}
@@ -381,6 +380,44 @@ thread_free(struct thread *td)
 	uma_zfree(thread_zone, td);
 }
 
+void
+thread_cow_get_proc(struct thread *newtd, struct proc *p)
+{
+
+	PROC_LOCK_ASSERT(p, MA_OWNED);
+	newtd->td_ucred = crhold(p->p_ucred);
+	newtd->td_cowgen = p->p_cowgen;
+}
+
+void
+thread_cow_get(struct thread *newtd, struct thread *td)
+{
+
+	newtd->td_ucred = crhold(td->td_ucred);
+	newtd->td_cowgen = td->td_cowgen;
+}
+
+void
+thread_cow_free(struct thread *td)
+{
+
+	if (td->td_ucred)
+		crfree(td->td_ucred);
+}
+
+void
+thread_cow_update(struct thread *td)
+{
+	struct proc *p;
+
+	p = td->td_proc;
+	PROC_LOCK(p);
+	if (td->td_ucred != p->p_ucred)
+		cred_update_thread(td);
+	td->td_cowgen = p->p_cowgen;
+	PROC_UNLOCK(p);
+}
+
 /*
  * Discard the current thread and exit from its context.
  * Always called with scheduler locked.
@@ -518,7 +555,7 @@ thread_wait(struct proc *p)
 	cpuset_rel(td->td_cpuset);
 	td->td_cpuset = NULL;
 	cpu_thread_clean(td);
-	crfree(td->td_ucred);
+	thread_cow_free(td);
 	thread_reap();	/* check for zombie threads etc. */
 }
diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c
index 1bf78b8..070ba28 100644
--- a/sys/kern/subr_syscall.c
+++ b/sys/kern/subr_syscall.c
@@ -61,8 +61,8 @@ syscallenter(struct thread *td, struct syscall_args *sa)
 
 	p = td->td_proc;
 	td->td_pticks = 0;
-	if (td->td_ucred != p->p_ucred)
-		cred_update_thread(td);
+	if (td->td_cowgen != p->p_cowgen)
+		thread_cow_update(td);
 	if (p->p_flag & P_TRACED) {
 		traced = 1;
 		PROC_LOCK(p);
diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c
index 93f7557..e5e55dd 100644
--- a/sys/kern/subr_trap.c
+++ b/sys/kern/subr_trap.c
@@ -213,8 +213,8 @@ ast(struct trapframe *framep)
 	thread_unlock(td);
 	PCPU_INC(cnt.v_trap);
 
-	if (td->td_ucred != p->p_ucred)
-		cred_update_thread(td);
+	if (td->td_cowgen != p->p_cowgen)
+		thread_cow_update(td);
 	if (td->td_pflags & TDP_OWEUPC && p->p_flag & P_PROFIL) {
 		addupc_task(td, td->td_profil_addr, td->td_profil_ticks);
 		td->td_profil_ticks = 0;
diff --git a/sys/powerpc/powerpc/trap.c b/sys/powerpc/powerpc/trap.c
index 0ceb170..bfbd94d 100644
--- a/sys/powerpc/powerpc/trap.c
+++ b/sys/powerpc/powerpc/trap.c
@@ -196,8 +196,8 @@ trap(struct trapframe *frame)
 	if (user) {
 		td->td_pticks = 0;
 		td->td_frame = frame;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 
 		/* User Mode Traps */
 		switch (type) {
diff --git a/sys/sparc64/sparc64/trap.c b/sys/sparc64/sparc64/trap.c
index b4f0e27..e9917e5 100644
--- a/sys/sparc64/sparc64/trap.c
+++ b/sys/sparc64/sparc64/trap.c
@@ -277,8 +277,8 @@ trap(struct trapframe *tf)
 		td->td_pticks = 0;
 		td->td_frame = tf;
 		addr = tf->tf_tpc;
-		if (td->td_ucred != p->p_ucred)
-			cred_update_thread(td);
+		if (td->td_cowgen != p->p_cowgen)
+			thread_cow_update(td);
 
 		switch (tf->tf_type) {
 		case T_DATA_MISS:
diff --git a/sys/sys/proc.h b/sys/sys/proc.h
index 64b99fc..5033957 100644
--- a/sys/sys/proc.h
+++ b/sys/sys/proc.h
@@ -225,6 +225,7 @@ struct thread {
 /* Cleared during fork1() */
 #define	td_startzero td_flags
 	int		td_flags;	/* (t) TDF_* flags. */
+	u_int		td_cowgen;	/* (k) Generation of COW pointers. */
 	int		td_inhibitors;	/* (t) Why can not run. */
 	int		td_pflags;	/* (k) Private thread (TDP_*) flags. */
 	int		td_dupfd;	/* (k) Ret value from fdopen. XXX */
@@ -531,6 +532,7 @@ struct proc {
 	pid_t		p_oppid;	/* (c + e) Save ppid in ptrace. XXX */
 	struct vmspace	*p_vmspace;	/* (b) Address space. */
 	u_int		p_swtick;	/* (c) Tick when swapped in or out. */
+	u_int		p_cowgen;	/* (c) Generation of COW pointers. */
 	struct itimerval p_realtimer;	/* (c) Alarm timer. */
 	struct rusage	p_ru;		/* (a) Exit information. */
 	struct rusage_ext p_rux;	/* (cu) Internal resource usage. */
@@ -830,6 +832,11 @@ extern pid_t pid_max;
 	KASSERT((p)->p_lock == 0, ("process held")); \
 } while (0)
 
+#define	PROC_UPDATE_COW(p) do {						\
+	PROC_LOCK_ASSERT((p), MA_OWNED);				\
+	(p)->p_cowgen++;						\
+} while (0)
+
 /* Check whether a thread is safe to be swapped out. */
 #define	thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP)
 
@@ -974,6 +981,10 @@ void cpu_thread_swapin(struct thread *);
 void	cpu_thread_swapout(struct thread *);
 struct thread *thread_alloc(int pages);
 int	thread_alloc_stack(struct thread *, int pages);
+void	thread_cow_get_proc(struct thread *newtd, struct proc *p);
+void	thread_cow_get(struct thread *newtd, struct thread *td);
+void	thread_cow_free(struct thread *td);
+void	thread_cow_update(struct thread *td);
 void	thread_exit(void) __dead2;
 void	thread_free(struct thread *td);
 void	thread_link(struct thread *td, struct proc *p);

-- 
Mateusz Guzik