From: Stefan Andritoiu
To: freebsd-hackers@freebsd.org
Date: Wed, 10 Jun 2015 23:15:53 +0300
Subject: Gang scheduling implementation in the ULE scheduler

Hello,

I am currently working on a gang scheduling implementation for the bhyve
VCPU threads on FreeBSD 10.1. I have added a new field, "int gang", to the
thread structure to specify the gang a thread belongs to (0 for no gang),
and have modified the bhyve code to initialize this field when a VCPU is
created. I will post these modifications in another message.

When I start a virtual machine, IPIs are sent and received correctly
between CPUs during the guest's boot, but after a few seconds I get:

spin lock 0xffffffff8164c290 (smp rendezvous) held by 0xfffff8000296c000 (tid 100009) too long
panic: spin lock held too long

If I limit the number of IPIs that are sent, I do not have this problem.
This leads me to believe that, because of the constant context switching
while the guest boots, the high number of IPIs sent starves the system.

Does anyone know what is happening, or know of a possible solution?

Thank you,
Stefan

======================================================================================

I have added here the modifications to the sched_ule.c file and a brief
explanation of them:

In struct tdq, I have added two new fields:

  - int scheduled_gang;
      /* Set to a non-zero value if the respective CPU is required to
         schedule a thread belonging to a gang. The value of
         scheduled_gang is also the ID of the gang that we want
         scheduled. For now I have considered only one running guest, so
         the value is 0 or 1.
      */

  - int gang_leader;
      /* Set if the respective CPU is the one that initiated gang
         scheduling. Zero otherwise. Not relevant to the final code and
         will be removed. Just for debugging purposes.
      */

Created a new function "static void schedule_gang(void *arg)" that will be
called by each processor when it receives an IPI from the gang leader:
  - sets scheduled_gang = 1
  - informs the system that it needs to reschedule. Not yet implemented.

In function "struct thread *tdq_choose(struct tdq *tdq)":
  if (tdq->scheduled_gang)
  - checks whether a thread belonging to a gang must be scheduled. If so,
    calls functions that check the runqs and return a gang thread. I have
    yet to implement these functions.

In function "sched_choose()":
  if (td->gang)
  - checks whether the chosen thread is part of a gang. If so, it signals
    all other CPUs to run the function "schedule_gang(void *arg)".
  if (tdq->scheduled_gang)
  - if scheduled_gang is set, the scheduler is being called after the code
    in schedule_gang() has run, and it bypasses sending IPIs to the other
    CPUs. Without this check, a CPU would receive an IPI and set
    scheduled_gang = 1; the scheduler would be called and would choose a
    thread to run; that thread would be part of a gang; and an IPI would
    be sent to all other CPUs. A constant back-and-forth of IPIs between
    the CPUs would result.

The CPU that initiates gang scheduling does not receive an IPI and does
not even call the "schedule_gang(void *arg)" function. It continues
scheduling the gang thread it selected, the one that started the gang
scheduling process.

===================================================================
--- sched_ule.c	(revision 24)
+++ sched_ule.c	(revision 26)
@@ -247,6 +247,9 @@
 	struct runq	tdq_timeshare;		/* timeshare run queue. */
 	struct runq	tdq_idle;		/* Queue of IDLE threads. */
 	char		tdq_name[TDQ_NAME_LEN];
+
+	int		gang_leader;
+	int		scheduled_gang;
 #ifdef KTR
 	char		tdq_loadname[TDQ_LOADNAME_LEN];
 #endif
@@ -1308,6 +1311,20 @@
 	struct thread *td;
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+
+	/* Pick gang thread to run */
+	if (tdq->scheduled_gang){
+		/* basically the normal choosing of threads but with regards to scheduled_gang
+		td = runq_choose_gang(&tdq->tdq_realtime);
+		if (td != NULL)
+			return (td);
+
+		td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx);
+		if (td != NULL)
+			return (td);
+		*/
+	}
+
 	td = runq_choose(&tdq->tdq_realtime);
 	if (td != NULL)
 		return (td);
@@ -2295,6 +2312,22 @@
 	return (load);
 }
 
+static void
+schedule_gang(void * arg){
+	struct tdq *tdq;
+	struct tdq *from_tdq = arg;
+	tdq = TDQ_SELF();
+
+	if(tdq == from_tdq){
+		/* Just for testing IPI. Code is never reached, and should never be*/
+		tdq->scheduled_gang = 1;
+//		printf("[schedule_gang] received IPI from himself\n");
+	}
+	else{
+		tdq->scheduled_gang = 1;
+//		printf("[schedule_gang] received on cpu: %s \n", tdq->tdq_name);
+	}
+}
 /*
  * Choose the highest priority thread to run.  The thread is removed from
  * the run-queue while running however the load remains.  For SMP we set
@@ -2305,11 +2338,26 @@
 {
 	struct thread *td;
 	struct tdq *tdq;
+	cpuset_t map;
 
 	tdq = TDQ_SELF();
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	td = tdq_choose(tdq);
 	if (td) {
+		if(tdq->scheduled_gang){
+			/* Scheduler called after IPI
+			   jump over rendezvous*/
+			tdq->scheduled_gang = 0;
+		}
+		else{
+			if(td->gang){
+				map = all_cpus;
+				CPU_CLR(curcpu, &map);
+
+				smp_rendezvous_cpus(map, NULL, schedule_gang, NULL, tdq);
+			}
+		}
+
 		tdq_runq_rem(tdq, td);
 		tdq->tdq_lowpri = td->td_priority;
 		return (td);
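
To make the reasoning behind the scheduled_gang check easier to follow, here is a minimal user-space sketch of the IPI back-and-forth described above. It is not kernel code and is not part of the patch. The names cpu_state, sched_choose_sim, simulate and NCPU are hypothetical. It assumes that every chosen thread is a gang thread and that IPIs are delivered in synchronous rounds, which is a simplification of what really happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NCPU 4  /* hypothetical CPU count for the simulation */

/* Hypothetical user-space stand-in for the per-CPU struct tdq. */
struct cpu_state {
    int scheduled_gang; /* non-zero: an IPI asked this CPU to run a gang thread */
};

static struct cpu_state cpus[NCPU];

/* Simulated IPI handler; mirrors what schedule_gang() does in the patch. */
static void
schedule_gang(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    cpus[cpu].scheduled_gang = 1;
}

/*
 * Simulated sched_choose() on `cpu`, assuming the chosen thread is always a
 * gang thread. Returns the number of simulated IPIs this call sent.
 */
static int
sched_choose_sim(int cpu, bool use_guard)
{
    int other, sent = 0;

    if (use_guard && cpus[cpu].scheduled_gang) {
        /* Called as a result of an IPI: clear the flag, send nothing. */
        cpus[cpu].scheduled_gang = 0;
        return (0);
    }
    for (other = 0; other < NCPU; other++) {
        if (other == cpu)
            continue;
        schedule_gang(other);
        sent++;
    }
    return (sent);
}

/*
 * CPU 0 acts as gang leader in round 0. In every later round, each CPU that
 * received an IPI in the previous round reschedules. Returns the total
 * number of simulated IPIs sent within max_rounds rounds.
 */
static int
simulate(bool use_guard, int max_rounds)
{
    bool pending[NCPU] = { false }, next[NCPU];
    int cpu, other, round, total = 0;

    memset(cpus, 0, sizeof(cpus));
    pending[0] = true;
    for (round = 0; round < max_rounds; round++) {
        bool any = false;

        memset(next, 0, sizeof(next));
        for (cpu = 0; cpu < NCPU; cpu++) {
            int sent;

            if (!pending[cpu])
                continue;
            sent = sched_choose_sim(cpu, use_guard);
            total += sent;
            if (sent > 0) {
                /* Every other CPU got an IPI and must reschedule. */
                for (other = 0; other < NCPU; other++)
                    if (other != cpu)
                        next[other] = true;
                any = true;
            }
        }
        memcpy(pending, next, sizeof(pending));
        if (!any)
            break;
    }
    return (total);
}
```

With four CPUs and three rounds, simulate(true, 3) sends NCPU - 1 = 3 IPIs and then stops, while simulate(false, 3) sends 24 and would keep going if allowed more rounds.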