From owner-freebsd-current@FreeBSD.ORG Mon Nov 1 22:02:21 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AD039106566C; Mon, 1 Nov 2010 22:02:21 +0000 (UTC) (envelope-from giovanni.trematerra@gmail.com) Received: from mail-fx0-f54.google.com (mail-fx0-f54.google.com [209.85.161.54]) by mx1.freebsd.org (Postfix) with ESMTP id 619188FC13; Mon, 1 Nov 2010 22:02:19 +0000 (UTC) Received: by fxm17 with SMTP id 17so5568344fxm.13 for ; Mon, 01 Nov 2010 15:02:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received :in-reply-to:references:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=l9BU8NySE7Q+hVsdtO/sQcBq4hkk8ZjrGY+SlKq6GvA=; b=twLAB41DT6Kh1tejKpAVCoFq8XVP2usKAnWO0jvI10ntqZaNlvj+w2BgULT31oLAGw cGL74M4fjd9yZatCCdRb5AxFvshj5i32gNGKCg0U+vXCKGorqn/kMhEHJuey6E9guC0S qGf9HFGLPYiy3ybsZcAhKThfImoTMqLRr+7nA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=j2Jwwa0FzxT00ptoqfRxg2db0knyv6thLy+lfcBTYO++FOzS5t2JWjMlT2XNJDeQ2t 2uqYUlzM2QuvsFE01I+OK/7346SSItzL0+k1jLT1ArVZSfHuOH2T40LMFhwk16QumZf5 QhYdOmuQLam/d7+JAVH0TP0R/dxVlZ3yTgKis= MIME-Version: 1.0 Received: by 10.223.125.132 with SMTP id y4mr9719598far.148.1288648939110; Mon, 01 Nov 2010 15:02:19 -0700 (PDT) Sender: giovanni.trematerra@gmail.com Received: by 10.223.115.84 with HTTP; Mon, 1 Nov 2010 15:02:19 -0700 (PDT) In-Reply-To: <201011011514.50322.jhb@freebsd.org> References: <4C9B9B9C.6000807@freebsd.org> <4CBD40E2.7030507@freebsd.org> <201011011514.50322.jhb@freebsd.org> Date: Mon, 1 Nov 2010 23:02:19 +0100 X-Google-Sender-Auth: R3D1ExywdR_gpPOG4W9Izs-5rKQ Message-ID: From: Giovanni Trematerra To: John Baldwin Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: alc@freebsd.org, Attilio Rao , freebsd-current@freebsd.org, kib@freebsd.org, Andriy Gapon Subject: Re: panic in uma_startup for many-core amd64 system X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 01 Nov 2010 22:02:21 -0000 On Mon, Nov 1, 2010 at 8:14 PM, John Baldwin wrote: > On Monday, November 01, 2010 1:09:22 pm Giovanni Trematerra wrote: >> On Tue, Oct 19, 2010 at 8:55 AM, Andriy Gapon wrote: >> > on 19/10/2010 00:01 Giovanni Trematerra said the following: >> >> >> >> Your patch seems just a work around about initial slab size where the >> >> keg is backed. >> > >> > Well, setting aside my confusion with the terminology - yes, the patch= is > just >> > that, and precisely because I only tried to solve that particular prob= lem. >> > >> >> Having dynamic slab sizes would allow to have the keg backed on a lar= ger > slab >> >> without going OFFPAGE. >> > >> > I agree in principle. >> > But without seeing code that implements that I can't guess if it would > really be >> > more efficient or more maintainable, i.e. more useful in general. >> > Still a very good idea. >> > >> >> Here the patch that was in my mind. >> The patch doesn't implement dynamic slab size just allow >> to have a multipage slab to back uma_zone objects. >> I'm going to work more on the topic "dynamic slab size" soon. >> I tested the patch on qemu with -smp 32. >> I'm looking for real hw to test the patch on. >> >> here some interesting output: >> qemulab# vmstat -z | more >> ITEM =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 SIZE =A0LIMIT =A0 =A0 USED =A0 = =A0 FREE =A0 =A0 =A0REQ FAIL SLEEP >> >> UMA Kegs: =A0 =A0 =A0 =A0 =A0 =A0 =A0 208, =A0 =A0 =A00, =A0 =A0 149, = =A0 =A0 =A0 4, =A0 =A0 149, =A0 0, =A0 0 >> UMA Zones: =A0 =A0 =A0 =A0 =A0 =A0 4480, =A0 =A0 =A00, =A0 =A0 149, =A0 = =A0 =A0 0, =A0 =A0 149, =A0 0, =A0 0 >> UMA Slabs: =A0 =A0 =A0 =A0 =A0 =A0 =A0568, =A0 =A0 =A00, =A0 =A0 836, = =A0 =A0 =A0 4, =A0 =A01187, =A0 0, =A0 0 >> UMA RCntSlabs: =A0 =A0 =A0 =A0 =A0568, =A0 =A0 =A00, =A0 =A0 202, =A0 = =A0 =A0 1, =A0 =A0 202, =A0 0, =A0 0 >> UMA Hash: =A0 =A0 =A0 =A0 =A0 =A0 =A0 256, =A0 =A0 =A00, =A0 =A0 =A0 2, = =A0 =A0 =A013, =A0 =A0 =A0 3, =A0 0, =A0 0 >> >> qemulab# sysctl kern | grep cpu >> kern.ccpu: 0 >> =A0 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, = 10, >> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, >> 28, 29, 30, 31 >> kern.smp.cpus: 32 >> kern.smp.maxcpus: 32 >> >> Any feedback will be appreciate. >> >> -- >> Giovanni Trematerra >> >> >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> diff -r 36572cbc6817 sys/vm/uma_core.c >> --- a/sys/vm/uma_core.c =A0 =A0 =A0 Tue Oct 05 04:49:04 2010 -0400 >> +++ b/sys/vm/uma_core.c =A0 =A0 =A0 Mon Nov 01 11:54:38 2010 -0400 >> @@ -930,27 +930,36 @@ startup_alloc(uma_zone_t zone, int bytes >> =A0{ >> =A0 =A0 =A0 uma_keg_t keg; >> =A0 =A0 =A0 uma_slab_t tmps; >> - >> - =A0 =A0 keg =3D zone_first_keg(zone); >> + =A0 =A0 u_int16_t pages; >> + >> + =A0 =A0 keg =A0 =3D zone_first_keg(zone); >> + =A0 =A0 pages =3D bytes / PAGE_SIZE; >> + >> + =A0 =A0 /* Account for remainder */ >> + =A0 =A0 if ((pages * PAGE_SIZE) < bytes) >> + =A0 =A0 =A0 =A0 =A0 =A0 pages++; >> + =A0 =A0 KASSERT(pages > 0, ("startup_alloc can't reserve 0 pages\n")); > > You can use 'pages =3D howmany(bytes, PAGE_SIZE)' here. Thanks for the hint. > Also, why did you make > pages a uint16_t instead of an int? =A0An int is generally more convenien= t > unless you really need a uint16_t (and C99 spells it without an _ after t= he > leading 'u'.. FYI). Uhm just to be coherent with field uk_ppera of struct keg, but I think I can just use an int. BTW is new code supposed to use C99 form even if the rest of the file use u_int* form? > >> =A0 =A0 =A0 /* >> =A0 =A0 =A0 =A0* Check our small startup cache to see if it has pages re= maining. >> =A0 =A0 =A0 =A0*/ >> =A0 =A0 =A0 mtx_lock(&uma_boot_pages_mtx); >> - =A0 =A0 if ((tmps =3D LIST_FIRST(&uma_boot_pages)) !=3D NULL) { >> - =A0 =A0 =A0 =A0 =A0 =A0 LIST_REMOVE(tmps, us_link); >> + =A0 =A0 do { >> + =A0 =A0 =A0 =A0 =A0 =A0 if ((tmps =3D LIST_FIRST(&uma_boot_pages)) != =3D NULL) >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 LIST_REMOVE(tmps, us_link); >> + =A0 =A0 } while (--pages && tmps !=3D NULL); >> + =A0 =A0 if (tmps !=3D NULL) { >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 mtx_unlock(&uma_boot_pages_mtx); >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 *pflag =3D tmps->us_flags; >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 return (tmps->us_data); >> - =A0 =A0 } >> + =A0 =A0 } else if (booted =3D=3D 0) >> + =A0 =A0 =A0 =A0 =A0 =A0 panic("UMA: Increase vm.boot_pages"); >> =A0 =A0 =A0 mtx_unlock(&uma_boot_pages_mtx); >> - =A0 =A0 if (booted =3D=3D 0) >> - =A0 =A0 =A0 =A0 =A0 =A0 panic("UMA: Increase vm.boot_pages"); > > Probably best to make the pages test here explicit. =A0Also, is there any= reason > you can't do this as: > > =A0 =A0 =A0 =A0while (--pages > 0) { > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0tmps =3D LIST_FIRST(&uma_boot_pages); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (tmps !=3D NULL) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0LIST_REMOVE(tmps, us_link)= ; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0else if (booted =3D=3D 0) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0panic(...); > =A0 =A0 =A0 =A0} > Well, no, even if I'll need to initialize tmps to NULL otherwise the compiler will raise a warning. do {} while(); might be still better than a while(){}. bytes parameter will never be zero so pages will always be at least one and KASSERT will catch some wired behavior. Anyway that looks to me more readable, thanks. I could add an "else break;" just in the few cases that "pages" is still > 0 and tmps =3D=3D NULL, that could be useless though. > One question btw, how does this work since if you need to allocate more t= han 1 > page it seems that the 'tmps' values for all but the last are simply igno= red > and leaked? When you extract one item from the list you have tmps->us_data pointing to start address of the memory page. The pages are contiguous in decrescent order of address (see below) so when you extract 2 items the last one will point at the start address of 2 contiguous pages of memory, just what I need to retu= rn. > > Is there some unwritten assumption that the free pages are all virtually > contiguous (and in order), so you can just pull off a run of X and use > the address from X? > sys/vm/vm_page.c:351 @@ vm_page_startup(vm_offset_t vaddr) /* * Allocate memory for use when boot strapping the kernel memory * allocator. */ new_end =3D end - (boot_pages * UMA_SLAB_SIZE); new_end =3D trunc_page(new_end); mapped =3D pmap_map(&vaddr, new_end, end, VM_PROT_READ | VM_PROT_WRITE); bzero((void *)mapped, end - new_end); uma_startup((void *)mapped, boot_pages); -------- sys/vm/uma_core.c:1683 @@ uma_startup(void *bootmem, int boot_pages) #ifdef UMA_DEBUG printf("Filling boot free list.\n"); #endif for (i =3D 0; i < boot_pages; i++) { slab =3D (uma_slab_t)((u_int8_t *)bootmem + (i * UMA_SLAB_SIZE)); slab->us_data =3D (u_int8_t *)slab; slab->us_flags =3D UMA_SLAB_BOOT; LIST_INSERT_HEAD(&uma_boot_pages, slab, us_link); } So if I'm not wrong I'd say that the pages are virtually contiguous. Just for the record, attilio@ made a light review of this patch in private. -- Giovanni Trematerra