Date: Sun, 02 Sep 2018 10:14:10 -0600
From: Ian Lepore <ian@freebsd.org>
To: "Dr. Rolf Jansen" <rj@obsigna.com>
Cc: freebsd-arm@freebsd.org
Subject: Re: Kernel Panic on BBB cause by ti_adc intr
Message-ID: <1535904850.9486.15.camel@freebsd.org>
In-Reply-To: <09B4DAE6-4021-4D77-8D74-6E112EE5E9E8@obsigna.com>
References: <B259CA27-7D08-45B1-97BB-35A544E346BB@obsigna.com> <1535900968.9486.5.camel@freebsd.org> <09B4DAE6-4021-4D77-8D74-6E112EE5E9E8@obsigna.com>

On Sun, 2018-09-02 at 12:40 -0300, Dr. Rolf Jansen wrote:
> > On 02.09.2018 at 12:09, Ian Lepore <ian@freebsd.org> wrote:
> >
> > On Sun, 2018-09-02 at 00:15 -0300, Dr. Rolf Jansen wrote:
> > >
> > > I got signal sources connected to AIN0 and AIN1 of the BBB. The
> > > signals are divided, clipped and clamped and are guaranteed to stay
> > > in the range of 0 to 1.8 V. Generally, the circuitry does work and
> > > the ADC readings match the expectations very well.
> > >
> > > Only sometimes, usually when I power on some considerable load (e.g.
> > > a hair dryer) connected to a different AC plug, but in the same room,
> > > the BBB bails out, giving the stack backtrace shown below. It might
> > > well be that a power-on spike traverses the AC electricity supply,
> > > but there is no way that the spike after clipping and clamping would
> > > exceed said limits.
> > >
> > > My understanding of the stack backtrace is that somehow an interrupt
> > > is triggered by said spike, and then it hits a bug in the interrupt
> > > handler. It seems that an exclusive sleep mutex is locked when it is
> > > not expected to be. This happened on FreeBSD 12.0-ALPHA3 and today
> > > also on -ALPHA4.
> > >
> > > Question:
> > >
> > > I don't need interrupt handling in my project, since the signal
> > > changes are slow, and the changes need to be read in defined
> > > time intervals. So, is it possible to deactivate the interrupt
> > > handler of the ti_adc?
> > >
> > > Presumably then the feature of the exclusive sleep mutex on ti_adc0
> > > would not be challenged and therefore may continue sleeping forever.
> > > Of course, I still want to be able to read the ADC values at timed
> > > intervals.
> > >
> > > Any suggestions would be greatly appreciated, since a BBB which can
> > > be DoS'ed by powering on a hair dryer is not as useful as it could
> > > be.
> > >
> > > Best regards
> > >
> > > Rolf
> > >
> > >
> > > Kernel page fault with the following non-sleepable locks held:
> > > exclusive sleep mutex ti_adc0 (ti_adc) r = 0 (0xc2277d08) locked @ /usr/src/sys/arm/ti/ti_adc.c:508
> > > stack backtrace:
> > > Fatal kernel mode data abort: 'Translation Fault (L1)' on read
> > > trapframe: 0xd2ebeca0
> > > FSR=00000005, FAR=00000128, spsr=20000013
> > > r0 =00000000, r1 =00000003, r2 =00000001, r3 =00000000
> > > r4 =00000000, r5 =00000000, r6 =00000003, r7 =00000016
> > > r8 =00000000, r9 =c2280e00, r10=00000021, r11=d2ebed60
> > > r12=c0ace03c, ssp=d2ebed30, slr=c067d61c, pc =c00888c0
> > >
> > > panic: Fatal abort
> > > cpuid = 0
> > > time = 1535844155
> > > KDB: stack backtrace:
> > > db_trace_self() at db_trace_self
> > >     pc = 0xc05c7484 lr = 0xc0075d04 (db_trace_self_wrapper+0x30)
> > >     sp = 0xd2ebea80 fp = 0xd2ebeb98
> > > db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> > >     pc = 0xc0075d04 lr = 0xc029d60c (vpanic+0x16c)
> > >     sp = 0xd2ebeba0 fp = 0xd2ebebc0
> > >     r4 = 0x00000100 r5 = 0x00000001
> > >     r6 = 0xc071bb22 r7 = 0xc0a8cfd8
> > > vpanic() at vpanic+0x16c
> > >     pc = 0xc029d60c lr = 0xc029d3ec (doadump)
> > >     sp = 0xd2ebebc8 fp = 0xd2ebebcc
> > >     r4 = 0xd2ebeca0 r5 = 0x00000013
> > >     r6 = 0x00000128 r7 = 0x00000005
> > >     r8 = 0x00000005 r9 = 0xd2ebeca0
> > >     r10 = 0x00000128
> > > doadump() at doadump
> > >     pc = 0xc029d3ec lr = 0xc05e9bb0 (abort_align)
> > >     sp = 0xd2ebebd4 fp = 0xd2ebec00
> > >     r4 = 0xc029d3ec r5 = 0xd2ebebd4
> > > abort_align() at abort_align
> > >     pc = 0xc05e9bb0 lr = 0xc05e9740 (abort_handler+0x2e0)
> > >     sp = 0xd2ebec08 fp = 0xd2ebec98
> > >     r4 = 0x00000013 r5 = 0x00000128
> > > abort_handler() at abort_handler+0x2e0
> > >     pc = 0xc05e9740 lr = 0xc05c9dd4 (exception_exit)
> > >     sp = 0xd2ebeca0 fp = 0xd2ebed60
> > >     r4 = 0x00000000 r5 = 0x00000000
> > >     r6 = 0x00000003 r7 = 0x00000016
> > >     r8 = 0x00000000 r9 = 0xc2280e00
> > >     r10 = 0x00000021
> > > exception_exit() at exception_exit
> > >     pc = 0xc05c9dd4 lr = 0xc067d61c (ti_adc_intr+0x88)
> > >     sp = 0xd2ebed30 fp = 0xd2ebed60
> > >     r0 = 0x00000000 r1 = 0x00000003
> > >     r2 = 0x00000001 r3 = 0x00000000
> > >     r4 = 0x00000000 r5 = 0x00000000
> > >     r6 = 0x00000003 r7 = 0x00000016
> > >     r8 = 0x00000000 r9 = 0xc2280e00
> > >     r10 = 0x00000021 r12 = 0xc0ace03c
> > > evdev_push_event() at evdev_push_event+0x4c
> > >     pc = 0xc00888c0 lr = 0xc067d61c (ti_adc_intr+0x88)
> > >     sp = 0xd2ebed68 fp = 0xd2ebedd0
> > >     r4 = 0xd2fce800 r5 = 0xc2277d00
> > >     r6 = 0x00000000 r7 = 0x00000421
> > >     r8 = 0xc2277d18 r9 = 0xc2280e00
> > > ti_adc_intr() at ti_adc_intr+0x88
> > >     pc = 0xc067d61c lr = 0xc02662fc (ithread_loop+0x1f0)
> > >     sp = 0xd2ebedd8 fp = 0xd2ebee20
> > >     r4 = 0xd2fce800 r5 = 0x00000000
> > >     r6 = 0xd2fce844 r7 = 0x00000000
> > >     r8 = 0xc0719541 r9 = 0xc2280e00
> > >     r10 = 0x00000000
> > > ithread_loop() at ithread_loop+0x1f0
> > >     pc = 0xc02662fc lr = 0xc0262ef8 (fork_exit+0xa0)
> > That's a strange exception stack, with lots of registers containing
> > zeroes at exception time that were non-zero in the prior stack frame.
> > It makes me think something has overwritten the stack with garbage
> > data. When I look at ti_adc_tsc_read_data() it has a stack-allocated
> > data array with 16 elements, and a loop that could load more than 16
> > elements into that array (ADC_FIFO_COUNT_MSK is 0x7f); that seems like
> > trouble.
> >
> > You said you don't need interrupts; does that mean you're reading the
> > values via sysctl and aren't using the EVDEV stuff? If so, you might be
> > able to quickly work around the panic by building a custom kernel using
> > 'nooption EVDEV_SUPPORT'.
> I forgot to mention that at the time of the panic,
> dev.ti_adc.0.ain.0.enable and dev.ti_adc.0.ain.1.enable were not set
> to 1 (enabled) yet, and were not expected to read anything.
>
> Yes, I only need the values in defined time intervals and I poll the
> ADC readings with the sysctlbyname() function.
>
> I compared an (arbitrarily) old version of ti_adc_intr(void *arg) in
> ti_adc.c with the current one. The offending call happens on line
> 508, and it is TI_ADC_LOCK(sc);. The striking difference between the
> old and the new code is that in the latter one TI_ADC_LOCK(sc); is
> called unconditionally, while in the old one the following check
> happens before TI_ADC_LOCK(sc); may get called:
>
> ti_adc_intr(void *arg) from 2014:
>
>     status = ADC_READ4(sc, ADC_IRQSTATUS);
>     if (status == 0)
>         return;
>
> I started to set up a cross-building environment on a fast i7 box. My
> plan is to place the above check into the said function. If this doesn't
> help, I will rebuild the kernel with 'nooption EVDEV_SUPPORT'. Thank
> you for pointing me in that direction. I don't even know what EVDEV
> is good for.
>
> Best regards
>
> Rolf

The problem isn't that the interrupt routine takes a lock; the problem
is that a page fault occurs while the lock is held. The page fault is
the actual problem. The reason for the page fault appears to be that
something is dereferencing a NULL pointer. I'm inferring that from the
Fault Address Register (FAR) in the exception being 0x128... a number
that small is typically generated by accessing a field of a struct
through a NULL pointer. So the question is, which pointer is NULL and
why?

Now that I look at the code a bit closer, I'm not sure turning off
EVDEV_SUPPORT will help; it will likely just change the symptom to some
other kind of panic or fault in another location. I think the EVDEV
stuff has something to do with using the ADC as a touchscreen input
device controller.

A better attempt to work around the problem may be to change the size
of the data[] array on line 430 from 16 to 128. If that helps, it'll be
a powerful clue and we can look for a permanent fix.

-- Ian
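
The FAR reasoning above can be illustrated with a small C sketch. The struct below is purely hypothetical (it is not the evdev or ti_adc layout, and nothing in the message says which struct is involved); it only shows why reading a member through a NULL pointer produces a fault address equal to that member's offset, which is how a value like 0x128 can arise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure for illustration only; NOT a real kernel layout. */
struct example_dev {
	char     other_fields[0x128]; /* stand-in for whatever precedes the member */
	uint32_t flags;               /* member that happens to sit at offset 0x128 */
};

/* Compile-time check that the illustrative member really is at 0x128. */
static_assert(offsetof(struct example_dev, flags) == 0x128,
    "flags is expected at offset 0x128 in this sketch");

/*
 * Address that a load of p->flags would touch. It is computed with
 * integer arithmetic only, so nothing is actually dereferenced.
 */
static uintptr_t
fault_address_for_flags(const struct example_dev *p)
{
	return ((uintptr_t)p + offsetof(struct example_dev, flags));
}
```

Calling fault_address_for_flags(NULL) yields the member offset itself on platforms where the null pointer converts to integer 0. Matching the FAR against member offsets of the structs reachable from the faulting function is one way to narrow down which pointer was NULL.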
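
The concern about a 16-element stack array filled by a loop whose count is masked with 0x7f can also be sketched in plain C. This is a minimal illustration under stated assumptions, not the driver's code: fake_fifo, fifo_count(), fifo_pop() and drain_bounded() are made-up names standing in for the hardware FIFO registers, and bounding the loop by the array size is a different technique from the 16 -> 128 experiment suggested above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_COUNT_MASK 0x7fu /* value the message cites for ADC_FIFO_COUNT_MSK */
#define DATA_SLOTS      16u   /* size of the stack array discussed above */

/* Hypothetical stand-in for the hardware FIFO; not the real register interface. */
struct fake_fifo {
	uint32_t pending; /* samples still queued */
	uint32_t next;    /* value the next pop returns */
};

/* Number of queued samples, masked the same way a count register would be. */
static uint32_t
fifo_count(const struct fake_fifo *f)
{
	return (f->pending & FIFO_COUNT_MASK);
}

/* Remove and return one sample; the caller must check fifo_count() first. */
static uint32_t
fifo_pop(struct fake_fifo *f)
{
	assert(f->pending > 0);
	f->pending--;
	return (f->next++);
}

/* Store at most cap samples in out[]; returns how many were stored. */
static size_t
drain_bounded(struct fake_fifo *f, uint32_t *out, size_t cap)
{
	size_t n = 0;

	while (n < cap && fifo_count(f) > 0)
		out[n++] = fifo_pop(f);
	return (n);
}
```

With 100 samples pending and a 16-slot buffer, drain_bounded() stops after 16 stores and leaves the rest queued, so the array is never written past its end. Whether leftover samples should be discarded or read on a later pass is a design decision this sketch does not settle.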