From owner-freebsd-arch@FreeBSD.ORG Mon Oct 27 22:42:44 2014
From: Svatopluk Kraus
To: Alan Cox
Cc: alc@freebsd.org, FreeBSD Arch
Subject: Re: vm_page_array and VM_PHYSSEG_SPARSE
Date: Mon, 27 Oct 2014 23:42:43 +0100
In-Reply-To: <544E7376.6040002@rice.edu>
References: <5428AF3B.1030906@rice.edu> <54497DC1.5070506@rice.edu> <544DED4C.3010501@rice.edu> <544E7376.6040002@rice.edu>
Content-Type: text/plain; charset=UTF-8
List-Id: Discussion related to FreeBSD architecture

On Mon, Oct 27, 2014 at 5:31 PM, Alan Cox wrote:
>
> On 10/27/2014 08:22, Svatopluk Kraus wrote:
> >
> > On Mon, Oct 27, 2014 at 7:59 AM, Alan Cox wrote:
>>
>> On 10/24/2014 06:33, Svatopluk Kraus wrote:
>>
>> On Fri, Oct 24, 2014 at 12:14 AM, Alan Cox wrote:
>>>
>>> On 10/08/2014 10:38, Svatopluk Kraus wrote:
>>> > On Mon, Sep 29, 2014 at 3:00 AM, Alan Cox wrote:
>>> >
>>> >> On 09/27/2014 03:51, Svatopluk Kraus wrote:
>>> >>
>>> >> On Fri, Sep 26, 2014 at 8:08 PM, Alan Cox wrote:
>>> >>
>>> >>> On Wed, Sep 24, 2014 at 7:27 AM, Svatopluk Kraus wrote:
>>> >>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> Michal and I are finishing the new ARM pmap-v6 code. There is one problem we've dealt with somehow, but now we would like to do it better. It's about physical pages which are allocated before the vm subsystem is initialized. While later on these pages can be found in vm_page_array when the VM_PHYSSEG_DENSE memory model is used, that is not true for the VM_PHYSSEG_SPARSE memory model. And the ARM world uses the VM_PHYSSEG_SPARSE model.
>>> >>>>
>>> >>>> It really would be nice to utilize vm_page_array for such preallocated physical pages even when the VM_PHYSSEG_SPARSE memory model is used. Things could be much easier then. In our case, it's about pages which are used for level 2 page tables. In the VM_PHYSSEG_SPARSE model, we have two sets of such pages: the first ones are preallocated and the second ones are allocated after the vm subsystem has been initialized. We must deal with each set differently, so the code is more complex and so is debugging.
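[For illustration, a small stand-alone C sketch of the problem just described. The board layout, addresses, and all names in it (phys_avail_sk, dump_avail_sk, covered) are invented and this is not the kernel code; it only shows that a page carved out of phys_avail[] before vm_page_startup() is missed by segments built from phys_avail[], but would be covered by segments built from dump_avail[], which comes up later in the thread.]

    /*
     * Stand-alone user-space sketch, not kernel code.  A page grabbed for
     * level 2 page tables before vm_page_startup() lies outside every
     * phys_avail[] range, so segments sized from phys_avail[] cannot give
     * it a vm_page_array entry; segments sized from dump_avail[] can.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t vm_paddr_t;

    struct range { vm_paddr_t start, end; };

    /* All RAM present on an imaginary board. */
    static const struct range dump_avail_sk[] = { { 0x80000000, 0x90000000 } };

    /* Free RAM left over after the bootstrap carved out
     * 0x80100000-0x80200000 for preallocated level 2 page tables. */
    static const struct range phys_avail_sk[] = {
            { 0x80000000, 0x80100000 },
            { 0x80200000, 0x90000000 },
    };

    static bool
    covered(const struct range *r, int n, vm_paddr_t pa)
    {
            for (int i = 0; i < n; i++)
                    if (pa >= r[i].start && pa < r[i].end)
                            return (true);
            return (false);
    }

    int
    main(void)
    {
            vm_paddr_t pt_page = 0x80100000;  /* a preallocated page-table page */

            printf("covered by phys_avail-based segments: %d\n",
                covered(phys_avail_sk, 2, pt_page));      /* 0: no vm_page_t */
            printf("covered by dump_avail-based segments: %d\n",
                covered(dump_avail_sk, 1, pt_page));      /* 1: has a vm_page_t */
            return (0);
    }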
>>> >>>> Thus we need some method of saying that a part of physical memory should be included in vm_page_array, but that the pages from that region should not be put on the free list during initialization. We think that such a possibility could be utilized in general. There could be a need for some physical space which:
>>> >>>>
>>> >>>> (1) is needed only during boot and later on can be freed and given to the vm subsystem,
>>> >>>>
>>> >>>> (2) is needed for something else, and the vm_page_array code could be used without some kind of duplication.
>>> >>>>
>>> >>>> There is already some code which deals with blacklisted pages in the vm_page.c file. So the easiest way to deal with the presented situation is to add some callback to this part of the code which will be able to exclude either a whole phys_avail[i], phys_avail[i+1] region or single pages. As the biggest phys_avail region is used for vm subsystem allocations, some more coding is needed. (However, blacklisted pages are not dealt with on that part of the region.)
>>> >>>>
>>> >>>> We would like to know if there is any objection:
>>> >>>>
>>> >>>> (1) to dealing with the presented problem,
>>> >>>> (2) to dealing with the problem in the presented way.
>>> >>>>
>>> >>>> Any help is very much appreciated. Thanks.
>>> >>>
>>> >>> As an experiment, try modifying vm_phys.c to use dump_avail instead of phys_avail when sizing vm_page_array. On amd64, where the same problem exists, this allowed me to use VM_PHYSSEG_SPARSE. Right now, this is probably my preferred solution. The catch being that not all architectures implement dump_avail, but my recollection is that arm does.
>>> >>
>>> >> Frankly, I would prefer this too, but there is one big open question:
>>> >>
>>> >> What is dump_avail for?
>>> >>
>>> >> dump_avail[] is solving a similar problem in the minidump code, hence the prefix "dump_" in its name. In other words, the minidump code couldn't use phys_avail[] either, because it didn't describe the full range of physical addresses that might be included in a minidump, so dump_avail[] was created.
>>> >>
>>> >> There is already precedent for what I'm suggesting. dump_avail[] is already (ab)used outside of the minidump code on x86 to solve this same problem in x86/x86/nexus.c, and on arm in arm/arm/mem.c.
>>> >>
>>> >> Using it for vm_page_array initialization and segmentation means that phys_avail must be a subset of it. And this must be stated and be visible enough. Maybe it should even be checked in code. I like the idea of thinking about dump_avail as something that describes all memory in a system, but it's not how dump_avail is defined in archs now.
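[The "maybe it should even be checked in code" remark above can be pictured with a tiny stand-alone program. The check below is hypothetical, not a proposal for actual kernel code; it only borrows the start/end pair layout, terminated by a zero entry, that phys_avail[] and dump_avail[] use.]

    /*
     * Sketch of a "phys_avail[] must be a subset of dump_avail[]" sanity
     * check, as a stand-alone program with invented contents.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t vm_paddr_t;

    static const vm_paddr_t dump_avail_sk[] = {
            0x80000000, 0x90000000,
            0, 0
    };
    static const vm_paddr_t phys_avail_sk[] = {
            0x80000000, 0x80100000,
            0x80200000, 0x90000000,
            0, 0
    };

    /* Is [start, end) fully inside one of the ranges in 'set'? */
    static bool
    range_covered(vm_paddr_t start, vm_paddr_t end, const vm_paddr_t *set)
    {
            for (int i = 0; set[i + 1] != 0; i += 2)
                    if (start >= set[i] && end <= set[i + 1])
                            return (true);
            return (false);
    }

    /* The invariant that could be asserted at boot time. */
    static bool
    phys_avail_subset_of_dump_avail(void)
    {
            for (int i = 0; phys_avail_sk[i + 1] != 0; i += 2)
                    if (!range_covered(phys_avail_sk[i], phys_avail_sk[i + 1],
                        dump_avail_sk))
                            return (false);
            return (true);
    }

    int
    main(void)
    {
            printf("phys_avail is a subset of dump_avail: %s\n",
                phys_avail_subset_of_dump_avail() ? "yes" : "no");
            return (0);
    }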
>>> >> When you say "it's not how dump_avail is defined in archs now", I'm not sure whether you're talking about the code or the comments. In terms of code, dump_avail[] is a superset of phys_avail[], and I'm not aware of any code that would have to change. In terms of comments, I did a grep looking for comments defining what dump_avail[] is, because I couldn't remember any. I found one ... on arm. So, I don't think it's an onerous task changing the definition of dump_avail[]. :-)
>>> >>
>>> >> Already, as things stand today with dump_avail[] being used outside of the minidump code, one could reasonably argue that it should be renamed to something like phys_exists[].
>>> >>
>>> >> I will experiment with it on Monday then. However, it's not only about how memory segments are created in vm_phys.c, but about how the vm_page_array size is computed in vm_page.c too.
>>> >>
>>> >> Yes, and there is also a place in vm_reserv.c that needs to change. I've attached the patch that I developed and tested a long time ago. It still applies cleanly and runs ok on amd64.
>>> >
>>> > Well, I've created and tested a minimalistic patch which - I hope - is committable. It runs ok on a pandaboard (arm-v6) and solves the presented problem. I would really appreciate it if this could be committed. Thanks.
>>>
>>> Sorry for the slow reply. I've just been swamped with work lately. I finally had some time to look at this in the last day or so.
>>>
>>> The first thing that I propose to do is commit the attached patch. This patch changes pmap_init() on amd64, armv6, and i386 so that it no longer consults phys_avail[] to determine the end of memory. Instead, it calls a new function provided by vm_phys.c to obtain the same information from vm_phys_segs[].
>>>
>>> With this change, the new variable phys_managed in your patch wouldn't need to be a global. It could be a local variable in vm_page_startup() that we pass as a parameter to vm_phys_init() and vm_reserv_init().
>>>
>>> More generally, the long-term vision that I have is that we would stop using phys_avail[] after vm_page_startup() has completed. It would only be used during initialization. After that we would use vm_phys_segs[] and functions provided by vm_phys.c.
>>
>> I understand. The patch and the long-term vision are fine with me. I just was not so bold as to pass phys_managed as a parameter to vm_phys_init() and vm_reserv_init(). However, I certainly was thinking about it. While reading the comment above vm_phys_get_end(): do we care if the last usable address is 0xFFFFFFFF?
>>
>> To date, this hasn't been a problem. However, handling 0xFFFFFFFF is easy. So, the final version of the patch that I committed this weekend does so.
>>
>> Can you please try the attached patch? It replaces phys_avail[] with vm_phys_segs[] in arm's busdma.
> >
> > It works fine on the arm-v6 pandaboard. I have no objection to committing it. However, it's only a 1:1 replacement.
>
> Right now, yes. However, once your patch is committed, it won't be 1:1 anymore, because vm_phys_segs[] will be populated based on dump_avail[] rather than phys_avail[].
>
> My interpretation of the affected code is that using the ranges defined by dump_avail[] is actually closer to what this code intended.

True in both cases. As you said, it's closer.
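[On the 0xFFFFFFFF question raised above: a minimal sketch of the corner case, assuming a 32-bit physical address space and half-open segments. It is not the committed vm_phys change; it only shows why a helper that reports "the end of memory" from the segment list is better off returning the last usable address than the exclusive end, which wraps to 0 when RAM runs to the top of the address space.]

    /*
     * Illustration only: the exclusive end of the last segment is an
     * awkward value to report once memory reaches 0xFFFFFFFF.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t vm_paddr_t;            /* a 32-bit physical address space */

    struct seg { vm_paddr_t start, end; };  /* 'end' is exclusive */

    /* Imaginary segment whose last usable byte is 0xFFFFFFFF; the exclusive
     * end (0x100000000) does not fit in 32 bits and wraps to 0. */
    static const struct seg last_seg = { 0xE0000000, 0x00000000 };

    static vm_paddr_t
    highest_addr_sketch(void)
    {
            return (last_seg.end - 1);      /* 0 - 1 wraps back to 0xFFFFFFFF */
    }

    int
    main(void)
    {
            printf("exclusive end:        0x%08x\n", (unsigned)last_seg.end);
            printf("last usable address:  0x%08x\n", (unsigned)highest_addr_sketch());
            return (0);
    }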
> > In fact, I still keep the following pattern in my head:
> >
> > present memory in the system <=> all RAM and whatnot
> > nobounce memory <=> addressable by DMA
>
> In general, I don't see how this can be an attribute of the memory, because it's going to depend on the device. In other words, a given physical address may require bouncing for some device but not all devices.

True again. I was thinking about it as some common property across all DMA devices on the platform. If it's not that, but a test for present RAM, then dump_avail[] is closer. However, again, does dump_avail[] represent all present RAM?

> > managed memory by the vm subsystem <=> i.e. kept in vm_page_array
> > available memory for the vm subsystem <=> can be allocated
> >
> > So, it's no problem to use phys_avail[], i.e. vm_phys_segs[], but it could be too limiting in some scenarios. I would like to see something different in exclusion_bounce_check() in the future - something that reflects the NOBOUNCE property and not the NOALLOC one as now.

>> Do you think that the rest of my patch, considering the changes due to your patch, is OK?
>>
>> Basically, yes. I do, however, think that
>>
>> +#if defined(__arm__)
>> +	phys_managed = dump_avail;
>> +#else
>> +	phys_managed = phys_avail;
>> +#endif
>>
>> should also be conditioned on VM_PHYSSEG_SPARSE.

So I've prepared a new patch. phys_managed[] is passed to vm_phys_init() and vm_reserv_init() as a parameter and a small optimization is made in vm_page_startup(). I added the VM_PHYSSEG_SPARSE condition to the place you mentioned. Anyhow, I still think that this is only a temporary hack. In general, phys_managed[] should always be distinguished from phys_avail[].

>>> > BTW, while I was inspecting all the archs, I think that maybe it's time to do what was done for busdma not long ago. There is a lot of similar code across the archs which deals with physical memory and could be generalized and put into kern/subr_physmem.c for common use. All work with physical memory could be simplified to two arrays of regions:
>>> >
>>> > phys_present[] ... describes all present physical memory regions
>>> > phys_exclude[] ... describes various exclusions from phys_present[]
>>> >
>>> > Each excluded region will be labeled by flags to say what kind of exclusion it is. Flags like NODUMP, NOALLOC, NOMANAGE, NOBOUNCE, and NOMEMRW could be combined. This idea is taken from sys/arm/arm/physmem.c.
>>> >
>>> > All other arrays like phys_managed[], phys_avail[], and dump_avail[] will be created from these phys_present[] and phys_exclude[]. This way the bootstrap code in the archs could be simplified and unified. For example, dealing with either hw.physmem or the page with PA 0x00000000 could be transparent.
>>> >
>>> > I'm prepared to volunteer if the idea is ripe. However, I will be looking for a tutor.
>>>
>>> I've never really looked at arm/arm/physmem.c before. Let me do that before I comment on this.
>>
>> No problem. This could be a long-term aim. However, I hope the VM_PHYSSEG_SPARSE problem can be dealt with in MI code in the present time. In any case, thanks for your help.
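[To make the phys_present[]/phys_exclude[] idea above concrete, a rough stand-alone sketch in the spirit of sys/arm/arm/physmem.c. The flag values, the region layout, and the derive() helper are all hypothetical, not a proposed interface; the point is only that one list of present RAM plus one list of flagged exclusions is enough to produce dump_avail[]-like and phys_avail[]-like tables.]

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define nitems(x)       (sizeof(x) / sizeof((x)[0]))

    #define EXFLAG_NODUMP   0x01    /* leave the region out of dump_avail[] */
    #define EXFLAG_NOALLOC  0x02    /* leave the region out of phys_avail[] */

    typedef uint64_t vm_paddr_t;

    struct region { vm_paddr_t start, end; uint32_t flags; };

    /* All RAM present on an imaginary board. */
    static const struct region present[] = {
            { 0x80000000, 0x90000000, 0 },
    };

    /* Preallocated L2 page tables: not free for allocation, but still real,
     * dumpable RAM.  Exclusions are assumed sorted and non-overlapping. */
    static const struct region exclude[] = {
            { 0x80100000, 0x80200000, EXFLAG_NOALLOC },
    };

    /* Print the present memory minus the exclusions that carry 'flag'. */
    static void
    derive(const char *name, uint32_t flag)
    {
            printf("%s:\n", name);
            for (size_t i = 0; i < nitems(present); i++) {
                    vm_paddr_t pa = present[i].start;

                    while (pa < present[i].end) {
                            vm_paddr_t chunk_end = present[i].end;
                            bool skipped = false;

                            for (size_t j = 0; j < nitems(exclude); j++) {
                                    if ((exclude[j].flags & flag) == 0)
                                            continue;
                                    if (pa >= exclude[j].start &&
                                        pa < exclude[j].end) {
                                            /* pa sits inside an exclusion. */
                                            pa = exclude[j].end;
                                            skipped = true;
                                            break;
                                    }
                                    if (exclude[j].start > pa &&
                                        exclude[j].start < chunk_end)
                                            chunk_end = exclude[j].start;
                            }
                            if (skipped)
                                    continue;
                            printf("  0x%08jx - 0x%08jx\n",
                                (uintmax_t)pa, (uintmax_t)chunk_end);
                            pa = chunk_end;
                    }
            }
    }

    int
    main(void)
    {
            derive("dump_avail (NODUMP regions excluded)", EXFLAG_NODUMP);
            derive("phys_avail (NOALLOC regions excluded)", EXFLAG_NOALLOC);
            return (0);
    }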