From owner-freebsd-arm@freebsd.org  Sun Sep  2 16:14:18 2018
Return-Path: <owner-freebsd-arm@freebsd.org>
Delivered-To: freebsd-arm@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id DC5ACFF3480
 for <freebsd-arm@mailman.ysv.freebsd.org>;
 Sun,  2 Sep 2018 16:14:17 +0000 (UTC) (envelope-from ian@freebsd.org)
Received: from outbound1a.eu.mailhop.org (outbound1a.eu.mailhop.org
 [52.58.109.202])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 6B3537E65D
 for <freebsd-arm@freebsd.org>; Sun,  2 Sep 2018 16:14:17 +0000 (UTC)
 (envelope-from ian@freebsd.org)
X-MHO-RoutePath: aGlwcGll
X-MHO-User: 3a47f8ab-aecb-11e8-a747-09a40681ccbf
X-Report-Abuse-To: https://support.duocircle.com/support/solutions/articles/5000540958-duocircle-standard-smtp-abuse-information
X-Originating-IP: 67.177.211.60
X-Mail-Handler: DuoCircle Outbound SMTP
Received: from ilsoft.org (unknown [67.177.211.60])
 by outbound1.eu.mailhop.org (Halon) with ESMTPSA
 id 3a47f8ab-aecb-11e8-a747-09a40681ccbf;
 Sun, 02 Sep 2018 16:14:12 +0000 (UTC)
Received: from rev (rev [172.22.42.240])
 by ilsoft.org (8.15.2/8.15.2) with ESMTP id w82GEAol024890;
 Sun, 2 Sep 2018 10:14:11 -0600 (MDT) (envelope-from ian@freebsd.org)
Message-ID: <1535904850.9486.15.camel@freebsd.org>
Subject: Re: Kernel Panic on BBB cause by ti_adc intr
From: Ian Lepore <ian@freebsd.org>
To: "Dr. Rolf Jansen" <rj@obsigna.com>
Cc: freebsd-arm@freebsd.org
Date: Sun, 02 Sep 2018 10:14:10 -0600
In-Reply-To: <09B4DAE6-4021-4D77-8D74-6E112EE5E9E8@obsigna.com>
References: <B259CA27-7D08-45B1-97BB-35A544E346BB@obsigna.com>
 <1535900968.9486.5.camel@freebsd.org>
 <09B4DAE6-4021-4D77-8D74-6E112EE5E9E8@obsigna.com>
Content-Type: text/plain; charset="ISO-8859-1"
X-Mailer: Evolution 3.18.5.1 FreeBSD GNOME Team Port 
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
X-BeenThere: freebsd-arm@freebsd.org
X-Mailman-Version: 2.1.27
Precedence: list
List-Id: "Porting FreeBSD to ARM processors." <freebsd-arm.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-arm>,
 <mailto:freebsd-arm-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arm/>
List-Post: <mailto:freebsd-arm@freebsd.org>
List-Help: <mailto:freebsd-arm-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-arm>,
 <mailto:freebsd-arm-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 02 Sep 2018 16:14:18 -0000

On Sun, 2018-09-02 at 12:40 -0300, Dr. Rolf Jansen wrote:
> > 
> > On 02.09.2018 at 12:09, Ian Lepore <ian@freebsd.org> wrote:
> > 
> > On Sun, 2018-09-02 at 00:15 -0300, Dr. Rolf Jansen wrote:
> > > 
> > > I have signal sources connected to AIN0 and AIN1 of the BBB. The
> > > signals are divided, clipped and clamped and are guaranteed to
> > > stay in the range of 0 to 1.8 V. Generally, the circuitry works
> > > and the ADC readings match expectations very well.
> > > 
> > > Only sometimes, usually when I power on a considerable load
> > > (e.g. a hair dryer) connected to a different AC outlet in the
> > > same room, the BBB bails out, giving the stack backtrace shown
> > > below. It may well be that a power-on spike traverses the AC
> > > supply, but there is no way that the spike, after clipping and
> > > clamping, would exceed said limits.
> > > 
> > > My understanding of the stack backtrace is that an interrupt is
> > > somehow triggered by said spike, and then it hits a bug in the
> > > interrupt handler. It seems that an exclusive sleep mutex is
> > > locked when it is not expected to be. This happened on FreeBSD
> > > 12.0-ALPHA3 and today also on -ALPHA4.
> > > 
> > > Question:
> > > 
> > >    I don't need interrupt handling in my project, since the
> > >    signal changes are slow and only need to be read at defined
> > >    time intervals. So, is it possible to deactivate the
> > >    interrupt handler of the ti_adc?
> > > 
> > > Presumably the exclusive sleep mutex on ti_adc0 would then not
> > > be challenged and could continue sleeping forever. Of course, I
> > > still want to be able to read the ADC values at timed intervals.
> > > 
> > > Any suggestions would be greatly appreciated, since a BBB which
> > > can be DoS'ed by powering on a hair dryer is not as useful as it
> > > could be.
> > > 
> > > Best regards
> > > 
> > > Rolf
> > > 
> > > 
> > > Kernel page fault with the following non-sleepable locks held:
> > > exclusive sleep mutex ti_adc0 (ti_adc) r = 0 (0xc2277d08) locked @ /usr/src/sys/arm/ti/ti_adc.c:508
> > > stack backtrace:
> > > Fatal kernel mode data abort: 'Translation Fault (L1)' on read
> > > trapframe: 0xd2ebeca0
> > > FSR=00000005, FAR=00000128, spsr=20000013
> > > r0 =00000000, r1 =00000003, r2 =00000001, r3 =00000000
> > > r4 =00000000, r5 =00000000, r6 =00000003, r7 =00000016
> > > r8 =00000000, r9 =c2280e00, r10=00000021, r11=d2ebed60
> > > r12=c0ace03c, ssp=d2ebed30, slr=c067d61c, pc =c00888c0
> > > 
> > > panic: Fatal abort
> > > cpuid = 0
> > > time = 1535844155
> > > KDB: stack backtrace:
> > > db_trace_self() at db_trace_self
> > > 	 pc = 0xc05c7484  lr = 0xc0075d04 (db_trace_self_wrapper+0x30)
> > > 	 sp = 0xd2ebea80  fp = 0xd2ebeb98
> > > db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> > > 	 pc = 0xc0075d04  lr = 0xc029d60c (vpanic+0x16c)
> > > 	 sp = 0xd2ebeba0  fp = 0xd2ebebc0
> > > 	 r4 = 0x00000100  r5 = 0x00000001
> > > 	 r6 = 0xc071bb22  r7 = 0xc0a8cfd8
> > > vpanic() at vpanic+0x16c
> > > 	 pc = 0xc029d60c  lr = 0xc029d3ec (doadump)
> > > 	 sp = 0xd2ebebc8  fp = 0xd2ebebcc
> > > 	 r4 = 0xd2ebeca0  r5 = 0x00000013
> > > 	 r6 = 0x00000128  r7 = 0x00000005
> > > 	 r8 = 0x00000005  r9 = 0xd2ebeca0
> > > 	r10 = 0x00000128
> > > doadump() at doadump
> > > 	 pc = 0xc029d3ec  lr = 0xc05e9bb0 (abort_align)
> > > 	 sp = 0xd2ebebd4  fp = 0xd2ebec00
> > > 	 r4 = 0xc029d3ec  r5 = 0xd2ebebd4
> > > abort_align() at abort_align
> > > 	 pc = 0xc05e9bb0  lr = 0xc05e9740 (abort_handler+0x2e0)
> > > 	 sp = 0xd2ebec08  fp = 0xd2ebec98
> > > 	 r4 = 0x00000013  r5 = 0x00000128
> > > abort_handler() at abort_handler+0x2e0
> > > 	 pc = 0xc05e9740  lr = 0xc05c9dd4 (exception_exit)
> > > 	 sp = 0xd2ebeca0  fp = 0xd2ebed60
> > > 	 r4 = 0x00000000  r5 = 0x00000000
> > > 	 r6 = 0x00000003  r7 = 0x00000016
> > > 	 r8 = 0x00000000  r9 = 0xc2280e00
> > > 	r10 = 0x00000021
> > > exception_exit() at exception_exit
> > > 	 pc = 0xc05c9dd4  lr = 0xc067d61c (ti_adc_intr+0x88)
> > > 	 sp = 0xd2ebed30  fp = 0xd2ebed60
> > > 	 r0 = 0x00000000  r1 = 0x00000003
> > > 	 r2 = 0x00000001  r3 = 0x00000000
> > > 	 r4 = 0x00000000  r5 = 0x00000000
> > > 	 r6 = 0x00000003  r7 = 0x00000016
> > > 	 r8 = 0x00000000  r9 = 0xc2280e00
> > > 	r10 = 0x00000021 r12 = 0xc0ace03c
> > > evdev_push_event() at evdev_push_event+0x4c
> > > 	 pc = 0xc00888c0  lr = 0xc067d61c (ti_adc_intr+0x88)
> > > 	 sp = 0xd2ebed68  fp = 0xd2ebedd0
> > > 	 r4 = 0xd2fce800  r5 = 0xc2277d00
> > > 	 r6 = 0x00000000  r7 = 0x00000421
> > > 	 r8 = 0xc2277d18  r9 = 0xc2280e00
> > > ti_adc_intr() at ti_adc_intr+0x88
> > > 	 pc = 0xc067d61c  lr = 0xc02662fc (ithread_loop+0x1f0)
> > > 	 sp = 0xd2ebedd8  fp = 0xd2ebee20
> > > 	 r4 = 0xd2fce800  r5 = 0x00000000
> > > 	 r6 = 0xd2fce844  r7 = 0x00000000
> > > 	 r8 = 0xc0719541  r9 = 0xc2280e00
> > > 	r10 = 0x00000000
> > > ithread_loop() at ithread_loop+0x1f0
> > > 	 pc = 0xc02662fc  lr = 0xc0262ef8 (fork_exit+0xa0)
> > That's a strange exception stack, with lots of registers containing
> > zeroes at exception time that were non-zero in the prior stack
> > frame. It makes me think something has overwritten the stack with
> > garbage data. When I look at ti_adc_tsc_read_data(), it has a
> > stack-allocated data array with 16 elements and a loop that could
> > load more than 16 elements into that array (ADC_FIFO_COUNT_MSK is
> > 0x7f); that seems like trouble.
> > 
> > You said you don't need interrupts. Does that mean you're reading
> > the values via sysctl and aren't using the EVDEV stuff? If so, you
> > might be able to quickly work around the panic by building a custom
> > kernel using 'nooption EVDEV_SUPPORT'.
> I forgot to mention that, at the time of the panic,
> dev.ti_adc.0.ain.0.enable and dev.ti_adc.0.ain.1.enable were not yet
> set to 1 (enabled), so the ADC was not expected to read anything.
> 
> Yes, I only need the values in defined time intervals and I poll the
> ADC readings with the sysctlbyname() function.
> 
> I compared an (arbitrarily chosen) old version of ti_adc_intr(void
> *arg) in ti_adc.c with the current one. The offending call is on line
> 508, and it is TI_ADC_LOCK(sc);. The striking difference between the
> old and the new code is that the new code calls TI_ADC_LOCK(sc);
> unconditionally, while in the old code the following check happens
> before TI_ADC_LOCK(sc); can be called:
> 
> ti_adc_intr(void *arg) from 2014:
> 
> 	status = ADC_READ4(sc, ADC_IRQSTATUS);
> 	if (status == 0)
> 		return;
> 
> I have started to set up a cross-building environment on a fast i7
> box. My plan is to put the above check into said function. If this
> doesn't help, I will rebuild the kernel with 'nooption
> EVDEV_SUPPORT'. Thank you for pointing me in that direction. I don't
> even know what EVDEV is good for.
> 
> Best regards
> 
> Rolf

The problem isn't that the interrupt routine takes a lock; the problem
is that a page fault occurs while the lock is held. The page fault is
the actual problem. It appears to be caused by something dereferencing
a NULL pointer. I'm inferring that from the Fault Address Register
(FAR) in the exception being 0x128 (FAR=00000128 in the trapframe): a
number that small is typically generated by accessing a field of a
struct through a NULL pointer. So the question is, which pointer is
NULL and why?
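
As a rough illustration of that inference (the struct below is made up
for the example and is NOT the real evdev or ti_adc layout): if a
member sits 0x128 bytes into a struct, reading it through a NULL base
pointer makes the CPU access address 0 + 0x128, which is exactly the
kind of small value that shows up in the FAR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, only for illustration; not a real kernel struct. */
struct example_dev {
	char	pad[0x128];	/* stand-in for whatever members come first */
	int	field;		/* ends up at offset 0x128 */
};

/* Compile-time check that the example really places field at 0x128. */
static_assert(offsetof(struct example_dev, field) == 0x128,
    "field is expected at offset 0x128");

/*
 * Address that an access to p->field would touch for a given base
 * address, computed without actually dereferencing anything.
 */
static uintptr_t
fault_address(uintptr_t base)
{
	return (base + offsetof(struct example_dev, field));
}
```

With a base of 0 (a NULL pointer) fault_address() yields 0x128, so
finding which struct has a member at that offset may help identify the
pointer.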

Now that I look at the code more closely, I'm not sure turning off
EVDEV_SUPPORT will help; it will likely just change the symptom to some
other kind of panic or fault in another location. I think the EVDEV
stuff has something to do with using the ADC as a touchscreen input
device controller.

A better attempt at working around the problem may be to change the
size of the data[] array on line 430 from 16 to 128, so that it can
hold the largest count the ADC_FIFO_COUNT_MSK of 0x7f allows. If that
helps, it'll be a powerful clue and we can look for a permanent fix.
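
To make the size mismatch concrete, here is a minimal sketch of the
other obvious way to keep the copy in bounds: clamp the FIFO count to
the array size. This is not the driver code and not a proposed patch;
bounded_count() and DATA_SLOTS are names invented for the example, and
only the 0x7f mask and the 16-element size come from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADC_FIFO_COUNT_MSK	0x7f	/* mask value quoted in this thread */
#define DATA_SLOTS		16	/* hypothetical name for the data[] size */

/* The mask admits counts up to 127, far more than data[] can hold. */
static_assert(ADC_FIFO_COUNT_MSK > DATA_SLOTS,
    "FIFO count can exceed the buffer size");

/*
 * Hypothetical helper: number of FIFO words that may be copied into a
 * buffer with `slots` entries without writing past its end.
 */
static size_t
bounded_count(uint32_t fifo_count_reg, size_t slots)
{
	size_t count = fifo_count_reg & ADC_FIFO_COUNT_MSK;

	return (count > slots ? slots : count);
}
```

A count register reading 0x7f would be cut down to 16 here, while a
128-entry buffer would accept all 127 words untouched.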

-- Ian