Date: Thu, 28 Jan 2016 13:08:16 +1100 From: Kubilay Kocak <koobs@FreeBSD.org> To: Wojciech Macek <wma@semihalf.com>, developers@freebsd.org, freebsd-arm@freebsd.org Cc: Olivier Houchard <cognet@ci0.org>, arm64-dev <arm64-dev@semihalf.com> Subject: Re: SCHED_ULE race condition, fix proposal Message-ID: <56A97810.8090303@FreeBSD.org> In-Reply-To: <CANsEV8e2QbW1Y83eC-d3GczWu4Lu91jDK14Xa1FkL=Y2s%2BRBMQ@mail.gmail.com> References: <CANsEV8e2QbW1Y83eC-d3GczWu4Lu91jDK14Xa1FkL=Y2s%2BRBMQ@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On 28/01/2016 4:18 AM, Wojciech Macek wrote: > Hello, > > I've encountered a very nasty race condition during debugging armv8 > HWPMC. It seems that ULE scheduler can execute the same thread on two > different CPUs at the same time... > > Here is the scenario. > The PMC driver must execute some of the code on the CPU0. To ensure > that, a process migration is triggered as following: > > > thread_lock(curthread); > sched_bind(curthread, cpu); > thread_unlock(curthread); > > KASSERT(curthread->td_oncpu == cpu, > ("[pmc,%d] CPU not bound [cpu=%d, curr=%d]", __LINE__, > cpu, curthread->td_oncpu)); > > > That causes the context switch and (finally) execution of sched_switch() > function. The code correctly detects migration and calls > sched_switch_migrate. That function is supposed to add current thread to > the runqueue of another CPU ("tdn" variable). So it does: > > tdq_lock_pair(tdn, tdq); > tdq_add(tdn, td, flags); > tdq_notify(tdn, td); > TDQ_UNLOCK(tdn); > spinlock_exit(); > > > But that sometimes is causing a crash, because the other CPU is staring > to process mi_switch as soon as the IPI arrives (via tdq_notify) and the > runqueue lock is released. The problem is, that the thread does not > contain valid register set, because its context was not yet stored - > that happens later in machine dependent cpu_switch function. In another > words, the sched_switch run on the CPU we want the thread to migrate > onto restores thread context before it was actually stored on another > core - that causes setting regs/pc/lt to some junk data and crash. > > > I'd like to discuss a possible solution for this. I think it would be > reasonable to extend cpu_switch to be capable of releasing a lock as the > last thing it does after storing everything into the PCB. We could then > remove the "TDQ_UNLOCK(tdn);" from the sched_switch_migrate and be sure > that in the situation of migration nobody is allowed to touch the target > runqueue until the migrating process finishes storing its context. > > But first I'd like to discuss some possible alternatives and maybe find > another solution, because any change in this area will impact all > supported architectures. > > > Regards, > Wojtek > wma@freebsd.org <mailto:wma@freebsd.org> Can you create an issue in bugzilla to track this?
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?56A97810.8090303>