Date:      Wed, 10 Jun 2015 23:15:53 +0300
From:      Stefan Andritoiu <stefan.andritoiu@gmail.com>
To:        freebsd-hackers@freebsd.org
Subject:   Gang scheduling implementation in the ULE scheduler
Message-ID:  <CAO3d8=aoPypn-57-EJKk0MUXtiLwM_Md6z41ONruxArkuOcHaw@mail.gmail.com>

Hello,

I am currently working on a gang scheduling implementation for the
bhyve VCPU-threads on FreeBSD 10.1.
I have added a new field "int gang" to the thread structure to specify
the gang it is part of (0 for no gang), and have modified the bhyve
code to initialize this field when a VCPU is created. I will post
these modifications in another message.

When I start a Virtual Machine, during the guest's boot, IPIs are sent
and received correctly between CPUs, but after a few seconds I get:
    spin lock 0xffffffff8164c290 (smp rendezvous) held by 0xfffff8000296c000 (tid 100009) too long
    panic: spin lock held too long

If I limit the number of IPIs that are sent, the problem does not
occur. This leads me to believe that, because of the constant context
switching while the guest boots, the high number of IPIs sent starves
the system.

Does anyone know what is happening, or know of a possible solution?

Thank you,
Stefan


======================================================================================
Below are the modifications to the sched_ule.c file and a brief
explanation of them:

In struct tdq, I have added two new fields:
  - int scheduled_gang;
    /* Set to a non-zero value if the respective CPU is required to
schedule a thread belonging to a gang. The value is also the ID of the
gang to be scheduled. For now I consider only one running guest, so the
value is 0 or 1. */
  - int gang_leader;
    /* Set if the respective CPU is the one that initiated gang
scheduling, zero otherwise. Used only for debugging; it is not relevant
to the final code and will be removed. */

Created a new function "static void schedule_gang(void * arg)" that
will be called by each processor when it receives an IPI from the gang
leader:
  - sets scheduled_gang = 1
  - informs the system that it needs to reschedule (not yet implemented)

In function "struct thread* tdq_choose (struct tdq * tdq)":
    if (tdq->scheduled_gang) - checks to see if a thread belonging to
a gang must be scheduled. If so, calls functions that check the runqs
and return a gang thread. I have yet to implement these functions.

In function "sched_choose()":
   if (td->gang) - checks if the chosen thread is part of a gang. If
so, it signals all other CPUs to run the function "schedule_gang(void *
arg)".
   if (tdq->scheduled_gang) - if scheduled_gang is set, the scheduler
was called after the code in schedule_gang() ran, so it skips sending
IPIs to the other CPUs. Without this check, a CPU would receive an IPI
and set scheduled_gang=1; the scheduler would be called and would choose
a thread to run; that thread would be part of a gang; and an IPI would
be sent to all other CPUs. This would create a constant back-and-forth
of IPIs between the CPUs.
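
The effect of the scheduled_gang check can be illustrated with a small
user-space simulation. This is a sketch under simplifying assumptions,
not the kernel code: the names sim_cpu, sim_schedule_gang,
sim_sched_choose and sim_run are hypothetical, there are four simulated
CPUs, every CPU always picks a gang thread, and an "IPI" is just a
function call that sets flags on the target.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SIM_NCPU 4

struct sim_cpu {
	int scheduled_gang;	/* set by the IPI handler, cleared on choose */
	int resched;		/* the CPU must run its scheduler again */
};

static struct sim_cpu sim_cpus[SIM_NCPU];

/* What the IPI handler does on the receiving CPU. */
static void
sim_schedule_gang(int cpu)
{
	sim_cpus[cpu].scheduled_gang = 1;
	sim_cpus[cpu].resched = 1;
}

/* Returns 1 if this scheduler pass broadcast to the other CPUs, else 0. */
static int
sim_sched_choose(int cpu, bool use_guard)
{
	assert(cpu >= 0 && cpu < SIM_NCPU);
	if (use_guard && sim_cpus[cpu].scheduled_gang) {
		/* Called because of an IPI: consume the flag, stay quiet. */
		sim_cpus[cpu].scheduled_gang = 0;
		return (0);
	}
	for (int i = 0; i < SIM_NCPU; i++) {
		if (i != cpu)
			sim_schedule_gang(i);
	}
	return (1);
}

/* CPU 0 starts gang scheduling; count broadcasts over max_steps rounds. */
static int
sim_run(bool use_guard, int max_steps)
{
	int broadcasts;

	memset(sim_cpus, 0, sizeof(sim_cpus));
	broadcasts = sim_sched_choose(0, use_guard);
	for (int step = 0; step < max_steps; step++) {
		for (int cpu = 0; cpu < SIM_NCPU; cpu++) {
			if (!sim_cpus[cpu].resched)
				continue;
			sim_cpus[cpu].resched = 0;
			broadcasts += sim_sched_choose(cpu, use_guard);
		}
	}
	return (broadcasts);
}
```

With the guard enabled, sim_run() reports only the leader's single
broadcast; with it disabled, every rescheduled CPU broadcasts again and
the count keeps growing with the number of rounds.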

The CPU that initiates gang scheduling does not receive an IPI and
does not call the "schedule_gang(void * arg)" function. It continues
scheduling the gang thread it selected, the one that started the gang
scheduling process.
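
The "every CPU except the initiator" target set can be sketched in user
space with a plain bitmask. This is only an analogue of copying all_cpus
and clearing curcpu with CPU_CLR() as the patch does; the function name
ipi_targets and the 64-CPU limit are assumptions made for the
illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical analogue of "map = all_cpus; CPU_CLR(curcpu, &map);":
 * bit i of the result is set if CPU i should receive the IPI.
 */
static uint64_t
ipi_targets(unsigned ncpu, unsigned self)
{
	uint64_t all;

	assert(ncpu >= 1 && ncpu <= 64 && self < ncpu);
	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	all = (ncpu == 64) ? UINT64_MAX : ((UINT64_C(1) << ncpu) - 1);
	return (all & ~(UINT64_C(1) << self));
}
```

On a single-CPU system the mask is empty, so no IPI would be sent at all.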


===================================================================
--- sched_ule.c (revision 24)
+++ sched_ule.c (revision 26)
@@ -247,6 +247,9 @@
  struct runq tdq_timeshare; /* timeshare run queue. */
  struct runq tdq_idle; /* Queue of IDLE threads. */
  char tdq_name[TDQ_NAME_LEN];
+
+ int gang_leader;
+ int scheduled_gang;
 #ifdef KTR
  char tdq_loadname[TDQ_LOADNAME_LEN];
 #endif
@@ -1308,6 +1311,20 @@
  struct thread *td;

  TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+
+ /* Pick gang thread to run */
+ if (tdq->scheduled_gang){
+ /* basically the normal choosing of threads but with regards to scheduled_gang
+ td = runq_choose_gang(&tdq->tdq_realtime);
+ if (td != NULL)
+ return (td);
+
+ td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx);
+ if (td != NULL)
+ return (td);
+ */
+ }
+
  td = runq_choose(&tdq->tdq_realtime);
  if (td != NULL)
  return (td);
@@ -2295,6 +2312,22 @@
  return (load);
 }

+static void
+schedule_gang(void * arg){
+ struct tdq *tdq;
+ struct tdq *from_tdq = arg;
+ tdq = TDQ_SELF();
+
+ if(tdq == from_tdq){
+ /* Just for testing IPI. Code is never reached, and should never be*/
+ tdq->scheduled_gang = 1;
+// printf("[schedule_gang] received IPI from himself\n");
+ }
+ else{
+ tdq->scheduled_gang = 1;
+// printf("[schedule_gang] received on cpu: %s \n", tdq->tdq_name);
+ }
+}
 /*
  * Choose the highest priority thread to run.  The thread is removed from
  * the run-queue while running however the load remains.  For SMP we set
@@ -2305,11 +2338,26 @@
 {
  struct thread *td;
  struct tdq *tdq;
+ cpuset_t map;

  tdq = TDQ_SELF();
  TDQ_LOCK_ASSERT(tdq, MA_OWNED);
  td = tdq_choose(tdq);
  if (td) {
+ if(tdq->scheduled_gang){
+ /* Scheduler called after IPI
+ jump over rendezvous*/
+ tdq->scheduled_gang = 0;
+ }
+ else{
+ if(td->gang){
+ map = all_cpus;
+ CPU_CLR(curcpu, &map);
+
+ smp_rendezvous_cpus(map, NULL, schedule_gang, NULL, tdq);
+ }
+ }
+
  tdq_runq_rem(tdq, td);
  tdq->tdq_lowpri = td->td_priority;
  return (td);


