From owner-freebsd-hackers@freebsd.org Wed Sep 4 12:17:04 2019 Return-Path: Delivered-To: freebsd-hackers@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 2A86FF0381; Wed, 4 Sep 2019 12:17:04 +0000 (UTC) (envelope-from shreyankfbsd@gmail.com) Received: from mail-yw1-xc35.google.com (mail-yw1-xc35.google.com [IPv6:2607:f8b0:4864:20::c35]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) server-signature RSA-PSS (4096 bits) client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 46NjVl3jKvz4DNm; Wed, 4 Sep 2019 12:17:03 +0000 (UTC) (envelope-from shreyankfbsd@gmail.com) Received: by mail-yw1-xc35.google.com with SMTP id u141so7196843ywe.4; Wed, 04 Sep 2019 05:17:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=QQSudw3YiR1S8i3vweVzzCI95tPzIv4sgN4B3GJX9ZA=; b=SFYV/SPyrW6azhaUKjh0o5Xhu6qEanktr+hlTbecICB7mV/WozMPQgVtAsDj7PpVtd jCcpgCNZC30rgJ5TJusQ57s1q59Q6dSMLDao+Kp8gH0gVUYim+KUucQvdA17B1viKERW 0GJICTkXUguFgYy2iBkggddXXBXRTWBqTVJI35YIRKBAtnhaibRO7nc/jeNAjlLV7e1s TlUbP3MNMIAXr+YYoFN4Hf++Uyc2xcQeYeczzGfVCc8Ph7evFfHbPwzyOIl43P/BpDYL 6a7FOvo3IYfFz0SgKgeX2uxGnPzQIO92y1x2OIvG4R6KgYDjSV4avUeMAGSs3Ow8k/6R R2YQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=QQSudw3YiR1S8i3vweVzzCI95tPzIv4sgN4B3GJX9ZA=; b=Qyc+bXaadmmiwVyb3BpSxknHW+koKOz05CzOvLK2TBSUTM6OwiWo+IgGaFpTMbTAye prMzGtWPpV8tZa4cuu1Bd+ZCb1+v+3texkN5SheQduGUv/W1lmqa1gngbL8bB8FxG/qo SYGu6XjSZYYVwne4PhJn+v+XyKoKY0cveix2mfbydQOVqjy98rg8WkqRsBqA8MGG35kc Cin37vX0kW6KGcnrT0fVLuKt5xu8Z/iSwk5X7JWwMGIzHVKM5ubzx5Ez76gjeSBk3Rst Ev0c6I0E+Yl2hk/VaEOgHEEIhhcY3MAQHB3go07IjslN0Y2QYvgt5O9UF5hXbNUWs/n9 XWvA== X-Gm-Message-State: APjAAAURyFkVwHk1F6upBTCeR6+KVYqmKp/TYc2CNR5cWsTeQ7Giv/Vu Q7L/1OQFbEXmz+RVZLZbbLEfYlPVWayk6xQQf4qy X-Google-Smtp-Source: APXvYqzpVxM6y/iukAsCNI/M1BqFdpmUmL7azlbU+76RSWZ1P9Wd0m2Xa42kHn0U3VuH2YkPBvfAE2bhrcuIleASHhs= X-Received: by 2002:a81:af21:: with SMTP id n33mr9670799ywh.131.1567599421305; Wed, 04 Sep 2019 05:17:01 -0700 (PDT) MIME-Version: 1.0 References: <9d9e8a91-6030-99d9-a57c-a75bd4f4302f@FreeBSD.org> In-Reply-To: <9d9e8a91-6030-99d9-a57c-a75bd4f4302f@FreeBSD.org> From: shreyank amartya Date: Wed, 4 Sep 2019 17:46:49 +0530 Message-ID: Subject: Re: hwpmc: NMI cpuxx ... going to debugger To: Andriy Gapon Cc: mmacy@freebsd.org, freebsd-drivers@freebsd.org, freebsd-hackers@freebsd.org X-Rspamd-Queue-Id: 46NjVl3jKvz4DNm X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20161025 header.b=SFYV/SPy; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (mx1.freebsd.org: domain of shreyankfbsd@gmail.com designates 2607:f8b0:4864:20::c35 as permitted sender) smtp.mailfrom=shreyankfbsd@gmail.com X-Spamd-Result: default: False [-3.99 / 15.00]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000,0]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20161025]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[4]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36]; FREEMAIL_FROM(0.00)[gmail.com]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; TO_MATCH_ENVRCPT_ALL(0.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000,0]; TO_DN_SOME(0.00)[]; IP_SCORE_FREEMAIL(0.00)[]; IP_SCORE(0.00)[ip: (-8.60), ipnet: 2607:f8b0::/32(-2.77), asn: 15169(-2.28), country: US(-0.05)]; DKIM_TRACE(0.00)[gmail.com:+]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; RCVD_IN_DNSWL_NONE(0.00)[5.3.c.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0.0.4.6.8.4.0.b.8.f.7.0.6.2.list.dnswl.org : 127.0.5.0]; NEURAL_HAM_SHORT(-0.99)[-0.994,0]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; FREEMAIL_ENVFROM(0.00)[gmail.com]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; RCVD_COUNT_TWO(0.00)[2]; RCVD_TLS_ALL(0.00)[]; DWL_DNSWL_NONE(0.00)[gmail.com.dwl.dnswl.org : 127.0.5.0] Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 04 Sep 2019 12:17:04 -0000 Hi, I have narrowed down the problem to this: - Inside process_csw_out() (during thread context switch out) routine in hwpmc_mod.c, we stop the pmc and record the current value for reloading it at a future time. here, I observe that the pmc has already overflowed while recording the value. The NMI handler is invoked near the end of the routine and it fails to service the interrupt. Now, is this expected behavior? Should I modify the code to deal with this situation? Or should this situation not occur at all, in which case this could be a hardware problem. If you could shed some light on this, I will be able to pursue this further. Thanks for your time Shreyank On Wed, Sep 4, 2019 at 12:09 PM Andriy Gapon wrote: > On 03/09/2019 21:03, shreyank amartya wrote: > > Hi, > > > > I'm running a profiler which utilizes hwpmc and am frequently running > into this > > situation: after a process context switch out I get an NMI. The NMI > handler is > > unable to service it due to the fact that the pmc hardware is marked as > free at > > the end of the process_csw_out() (hwpmc_mod.c) routine. > > > > Here are some logs I've collected when this happens: > > > > 1750 231 49360676470980 MDP:CFG:1: cpu=231 ri=0 pm=0xfffff801e7b2ea00 > > 1749 231 49360676470360 CSW:SWI:1: cpu=231 proc=0xfffff813e80ff000 > (5930, > > java) pp=0xfffff81077179000 > > 1748 249 49360676421260 CSW:SWO:3: cpu=249 ri=33 > > 1748 249 49360676421260 CSW:SWO:3: cpu=249 ri=33 > > 1747 249 49360676420320 MDP:SWO:1: pc=0xfffff8010a06c400 > > pp=0xfffff81077179000 enable-msr=0 > > 1746 249 49360676419000 CSW:SWO:1: cpu=249 proc=0xfffff813e80ff000 > (5930, > > java) pp=0xfffff81077179000 > > 1745 231 49360675789440 MDP:INT:2: retval=0 isnull=16 ovrflw=0 > > 1744 231 49360675789240 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0 > > 1743 231 49360675788260 CSW:SWO:3: cpu=231 ri=33 > > 1742 231 49360675788080 MDP:SWO:1: pc=0xfffff80106fe1400 > > pp=0xfffff81077179000 enable-msr=0 > > 1741 231 49360675787880 MDP:CFG:1: cpu=231 ri=0 pm=0x0 > > 1740 231 49360675787020 CSW:SWO:2: cpu=231 ri=17 val=-164 (samp) > > 1739 231 49360675786940 MDP:REA:2: amd-read (post-munge) id=0 -> -164 > > 1738 231 49360675786880 MDP:REA:2: amd-read (pre-munge) id=0 -> 164 > > 1737 231 49360675786720 MDP:REA:1: amd-read id=0 class=2 > > 1736 231 49360675785180 MDP:STO:1: amd-stop ri=0 > > 1735 231 49360675784160 CSW:SWO:1: cpu=231 proc=0xfffff813e80ff000 > (5930, > > java) pp=0xfffff81077179000 > > 1734 231 49360675617080 MDP:INT:2: retval=1 isnull=0 ovrflw=0 > > 1733 231 49360675615240 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0 > > 1732 231 49360675505160 MDP:SWI:1: pc=0xfffff80106fe1400 > > pp=0xfffff81077179000 enable-msr=0 > > 1731 231 49360675503600 MDP:STA:2: amd-start config=0x530076 > > 1730 231 49360675503060 MDP:STA:1: amd-start cpu=231 ri=0 > > 1729 231 49360675500720 MDP:WRI:1: amd-write cpu=231 ri=0 > v=fffffffffffd97d4 > > 1728 231 49360675500240 CSW:SWI:1: cpu=231 ri=17 new=157740 > > 1727 231 49360675498680 MDP:CFG:1: cpu=231 ri=0 pm=0xfffff801e7b2ea00 > > 1726 231 49360675498060 CSW:SWI:1: cpu=231 proc=0xfffff813e80ff000 > (5930, > > java) pp=0xfffff81077179000 > > 1725 154 49360675363180 CSW:SWO:3: cpu=154 ri=33 > > 1724 154 49360675362020 MDP:SWO:1: pc=0xfffff80106d59600 > > pp=0xfffff81077179000 enable-msr=0 > > 1723 154 49360675360220 CSW:SWO:1: cpu=154 proc=0xfffff813e80ff000 > (5930, > > java) pp=0xfffff81077179000 > > 1722 231 49360675009200 CSW:SWO:3: cpu=231 ri=33 > > 1721 231 49360675009020 MDP:SWO:1: pc=0xfffff80106fe1400 > > pp=0xfffff81077179000 enable-msr=0 > > 1720 231 49360675008320 MDP:CFG:1: cpu=231 ri=0 pm=0x0 > > 1719 231 49360675006220 CSW:SWO:2: cpu=231 ri=17 val=157740 (samp) > > 1718 231 49360675005700 MDP:REA:2: amd-read (post-munge) id=0 -> > 157740 > > 1717 231 49360675005240 MDP:REA:2: amd-read (pre-munge) id=0 -> > 281474976552916 > > 1716 231 49360675004140 MDP:REA:1: amd-read id=0 class=2 > > 1715 231 49360675002660 MDP:STO:1: amd-stop ri=0 > > 1714 231 49360675001460 CSW:SWO:1: cpu=231 proc=0xfffff813e80ff000 > (5930, > > java) pp=0xfffff81077179000 > > 1713 231 49360674941580 MDP:INT:2: retval=1 isnull=0 ovrflw=0 > > 1712 231 49360674939500 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0 > > 1711 231 49360674793500 MDP:SWI:1: pc=0xfffff80106fe1400 > > pp=0xfffff81077179000 enable-msr=0 > > 1710 231 49360674792680 MDP:STA:2: amd-start config=0x530076 > > 1709 220 49360674792340 CSW:SWO:3: cpu=220 ri=33 > > 1708 220 49360674791060 MDP:SWO:1: pc=0xfffff80106f8a600 > > pp=0xfffff81077179000 enable-msr=0 > > 1707 231 49360674790840 MDP:STA:1: amd-start cpu=231 ri=0 > > > > The stray NMI is entry 1744 & 1745 (isnull=16). In entry 1741 when the > context > > switch out occurs, the hardware is marked free (pm=0x0). The NMI handler > is > > unable to find the hardware and ends up returning 0. Entry 1740 reports > the > > val=-164, I'm not sure about the significance but the value always seems > to be > > negative when this problem occurs. > > > > Once i get a stray NMI, I do not get any further interrupts which is > results in > > loss of samples in the generated pmclog. > > > > I'm using hwpmc from the most recent stable/12 branch which is same as > current > > (with a stable/12 kernel). I'm running it on a java application with a > single > > event in sampling mode. > > > > Any idea on how to resolve/avoid this problem? > > Any pointers or suggestion are appreciated. > > Sorry, I do not have time to debug this. > It looks like there could be a discrepancy between the hardware state and > the > driver state, so the hardware thinks that it should generate the NMI, but > the > driver does not expect to get it. Or there could be a hardware problem, > errata > are not unheard of. > It seems that you already have some data on what's going on, so maybe you > will > be able to pursue this further and narrow down the possibilities. > > Sorry again for not being able to actually help. > > > -- > Andriy Gapon >