From owner-freebsd-hackers@FreeBSD.ORG Sun Nov 25 18:05:45 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 48A7A6E0; Sun, 25 Nov 2012 18:05:45 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com [209.85.217.182]) by mx1.freebsd.org (Postfix) with ESMTP id 8651E8FC12; Sun, 25 Nov 2012 18:05:44 +0000 (UTC) Received: by mail-lb0-f182.google.com with SMTP id go10so7357866lbb.13 for ; Sun, 25 Nov 2012 10:05:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=fj/Ch2vYBmUbefjW5oxM0i5CoeuVR1BNzv1sevGFXrw=; b=xSIRYqoV319H0kS43aAdS0XXOefGb2+148b+qJtqEa8ihr2UtvlYW3a34g8AaCX7oI HrYmVGxZHty8oG0u5SbTmBZynv4iCc9vMmawu/IE2pVXv8G1CsQiv8rDaYsuOuCdfI21 5mPuoF9bHQFWR8brVjd47D+ptdlf1ZGfVtIkv5geBjsXVkvbfqCUArbNAJSpJ7LQ2siI 9PEusjckaGIbSfQ8BfXZnZHfCJfm+p7uSm5ckExtiwWkdSmb+X4KpOsFfaZX1+0DnUC4 YpnrdRhqjNlTrMbLcjJ87bEU13YJoT8IB484G2oZwExeu83hlWQRg/KGb/jp4DasntLM s5Hg== MIME-Version: 1.0 Received: by 10.112.24.41 with SMTP id r9mr4032080lbf.115.1353866743281; Sun, 25 Nov 2012 10:05:43 -0800 (PST) Sender: asmrookie@gmail.com Received: by 10.112.134.5 with HTTP; Sun, 25 Nov 2012 10:05:43 -0800 (PST) In-Reply-To: <50B22DA0.9080207@FreeBSD.org> References: <50A5F12C.1050902@FreeBSD.org> <50A63D1D.9090500@FreeBSD.org> <50A65208.4050804@FreeBSD.org> <50B21545.5060807@FreeBSD.org> <50B22DA0.9080207@FreeBSD.org> Date: Sun, 25 Nov 2012 19:05:43 +0100 X-Google-Sender-Auth: J17eA8mJLy5cFfHpa9pQu1SOthE Message-ID: Subject: Re: stop_cpus_hard when multiple CPUs are panicking from an NMI From: Attilio Rao To: Andriy Gapon Content-Type: text/plain; charset=UTF-8 Cc: freebsd-hackers@freebsd.org, Ryan Stone X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: attilio@FreeBSD.org List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 25 Nov 2012 18:05:45 -0000 On Sun, Nov 25, 2012 at 3:39 PM, Andriy Gapon wrote: > on 25/11/2012 16:01 Attilio Rao said the following: >> On Sun, Nov 25, 2012 at 12:55 PM, Andriy Gapon wrote: >>> on 25/11/2012 14:29 Attilio Rao said the following: >>>> I think the patch you propose makes such effects even worse, because >>>> it disables interrupts in generic_stop_cpus(). >>>> What I suggest to do, is the following: >>>> - The CPU which wins the race for generic_stop_cpus also signals the >>>> CPUs it is willing to stop on a global mask >>>> - Another CPU entering generic_stop_cpus() and loosing the race, >>>> checks the mask of cpus which might be stopped and stops itself if >>>> necessary (ie. not yet done). We must be careful with races here, but >>>> I'm confindent this can be done easily enough. >>> >>> I think that you either misunderstood my patch or I misunderstand your >>> suggestion, because my patch does exactly what you wrote above. >> >> The patch is someway incomplete: >> - I don't think that we need specific checks in cpustop_handler() (and >> if you have added them to prevent races, I don't think they are >> enough, see below) > > The check is to cover the following scenario: > - two CPUs race in hard-stop scenario, and both are in NMI contexts Please consider also another type of (similar race): two CPUs race in (normal) stop scenario and the loser is with interrupt disabled. I don't this will be completely fatal, but I'm sure it can lead to issues (double stop, etc.). > - one CPU wins and the other does spinning right in generic_stop_cpus > - NMI for the spinning CPU is blocked and remains pending > - the winning CPU releases all other CPUs > - the spinning CPU exits the NMI context and get the NMI which was pending > >> - setting of "stopping_cpus" map must happen atomically/before the >> stopper_cpu cpuid setting, otherwise some CPUs may end up using a NULL >> mask in the check > > Not NULL, but empty or stale. But a very good point, I agree. > The logic must be redone. > >> - Did you consider the races about when a stop and restart request >> happen just after the CPU_ISSET() check? I think CPUs can deadlock >> there. > > Yeah, good point again. This seems to be a different side of the issue above. > stopping_cpus is probably a bad idea. > >> - I'm very doubious about the spinlock_enter() stuff, I think I can >> just make the problem worse atm. > > Well, this is where I disagree. I think that cpu_stop must already be called in > context which effectively disable pre-emption and interrupts. > The added spinlock_enter() stuff is kind of a safety belt to make things more > explicit, but it could be changed into some sort of an assert if that's possible. Maybe it can be come really safe once we take care of all the races involved and reported above. I reserve to suspend my judgement until we don't care of the other races. > >> However you are right, the concept of your patch is the same I really >> wanted to get, we maybe need to just lift it up a bit. > > I agree. > > To add some gas to the fire. Do you recall my wish to drop the mask parameter > from the stop calls? If that was done then it would be simpler to handle these > things. In that case only "stopper_cpu" ID (master/winner) would be needed. If you really want to do something like that please rename s/generic_stop_cpus/generic_stop_butself() or similar convention and I may be not opposed to it. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein