Date:      Sat, 7 Nov 2020 16:12:18 -0800
From:      Pete Wright <pete@nomadlogic.org>
To:        Patrick Mahan <plmahan@gmail.com>
Cc:        questions list <freebsd-questions@freebsd.org>
Subject:   Re: Helping understand cause of SIGSEGV
Message-ID:  <46c6e046-8786-8142-0e4f-7c5ec407b3f4@nomadlogic.org>
In-Reply-To: <CAFDHx1JDyJq%2Bsepz1O186AeijTqyXP6AuQajsETY00j5eAsLXQ@mail.gmail.com>
References:  <c2eab4b0-b10b-9db3-1aa3-1f61689e24e8@nomadlogic.org> <CAFDHx1Jg_9k3oWU8X-WdP2CJX8hnBYgMz%2BvxwOs766JZcM3WRQ@mail.gmail.com> <0764e7ef-bd81-a6c5-47c4-7cd539a428f5@nomadlogic.org> <CAFDHx1K2-RWS4=xYtNUKMV3t_J7OKKPUE56f9JY45Q%2B0nH_TFA@mail.gmail.com> <f51dfaf6-46da-9cd8-ea37-b2733f5ad9bc@nomadlogic.org> <CAFDHx1JDyJq%2Bsepz1O186AeijTqyXP6AuQajsETY00j5eAsLXQ@mail.gmail.com>


On 11/7/20 11:57 AM, Patrick Mahan wrote:
> On Sat, Nov 7, 2020 at 9:59 AM Pete Wright <pete@nomadlogic.org> wrote:
>
>>
>> On 11/5/20 9:44 PM, Patrick Mahan wrote:
>>
>> On Thu, Nov 5, 2020 at 5:01 PM Pete Wright <pete@nomadlogic.org> wrote:
>>
>>>
>>> On 11/5/20 4:01 PM, Patrick Mahan wrote:
>>>
>>>
>>>
>>>> | thread #1, name = 'fluent-bit', stop reason = signal SIGABRT
>>>>     * frame #0: 0x000000004087100a libc.so.7`__sys_thr_kill at
>>>> thr_kill.S:4
>>>>       frame #1: 0x00000000407e6c84 libc.so.7`__raise(s=6) at raise.c:52:10
>>>>       frame #2: 0x000000004089a5d9 libc.so.7`abort at abort.c:67:8
>>>>       frame #3: 0x000000000034a7a8
>>>> fluent-bit`flb_signal_handler(signal=11) at fluent-bit.c:418:9
>>>>       frame #4: 0x00000000406d1c20
>>>> libthr.so.3`handle_signal(actp=0x00007fffdfffc600, sig=11,
>>>> info=0x00007fffdfffc9f0, ucp=0x00007fffdfffc680) at thr_sig.c:303:3
>>>>       frame #5: 0x00000000406d11ef libthr.so.3`thr_sighandler(sig=11,
>>>> info=0x00007fffdfffc9f0, _ucp=0x00007fffdfffc680) at thr_sig.c:246:2
>>>>       frame #6: 0x00007fffffffe193
>>>>       frame #7: 0x000000000036fe0c fluent-bit`tasks_start [inlined]
>>>> output_params_set(th=0x00000000416091c0, data=0x000000004165d980,
>>>> bytes=128, tag="random.0", tag_len=8, i_ins=0x0000000040e58000,
>>>> out_plugin=0x0000000040e2dfc0, out_context=0x00000000416051e0,
>>>> config=0x0000000040e19180) at flb_output.h:429:5
>>>>
>>> I would look at what is happening here in output_params_set().  Something
>>> is accessing out of bounds memory.
>>>
>>>
>>>
>>> thanks for your response Patrick i really appreciate it.
>>>
>>> So here is where output_params_set() is defined - with an interesting
>>> comment that i haven't chased down yet:
>>>
>>> 521     /* Workaround for makecontext() */
>>> 522     output_params_set(th,
>>> 523                       buf,
>>> 524                       size,
>>> 525                       tag,
>>> 526                       tag_len,
>>> 527                       i_ins,
>>> 528                       o_ins->p,
>>> 529                       o_ins->context,
>>> 530                       config);
>>> 531     return th;
>>> 532 }
>>> 533
>>>
>>> and the frame from the backtrace is this for reference:
>>>       frame #8: 0x000000000036fd14 fluent-bit`tasks_start [inlined]
>>> flb_output_thread(task=0x00000000416410a0, i_ins=0x0000000040e58000,
>>> o_ins=0x0000000040e5b000, config=0x0000000040e19180,
>>> buf=0x000000004165d980, size=128, tag="random.0", tag_len=8) at
>>> flb_output.h:522
>>>
>>> and then later on line 429 of flb_output.h it does this:
>>> 428     FLB_TLS_SET(flb_libco_params, params);
>>> 429     co_switch(th->callee);
>>>
>>> like i said i'm not really sure how to grok this, but it sounds like one
>>> of the params in output_params_set isn't being set correctly.  hopefully
>>> the code snippet makes the error more obvious :)
>>>
>>>
>> Okay, I don't know lldb very well.  But according to the GDB to LLDB
>> command map <http://lldb.llvm.org/use/map.html>, it uses the same commands
>> to move between frames.  So at startup you want to ensure you are in thread
>> 1 (thread select 1).  That should place you in the last frame on the stack
>> (frame #0).  You just move up the stack using the command 'up' until you
>> are in frame #7.
>>
>> Once there you need to dump the contents of 'th' using the command 'p *th'
>> or 'frame variable -T *th'.  I suspect the value of th->callee is
>> incorrect.  The next frame on the stack is -
>>
>>      frame #6: 0x00007fffffffe193
>>
>> This is different from the rest of the stack addresses.  So I suspect it
>> is out of bounds.
>>
>> Patrick
>>
>>
>>
>> that's totally it - thanks Patrick!
>>
>> frame #7: 0x000000000036fe0c fluent-bit`tasks_start [inlined]
>> output_params_set(th=0x00000000416091c0, data=0x000000004165d980,
>> bytes=128, tag="random.0", tag_len=8, i_ins=0x0000000040e58000,
>> out_plugin=0x0000000040e2dfc0, out_context=0x00000000416051e0,
>> config=0x0000000040e19180) at flb_output.h:429:5
>>     426       params->th          = th;
>>     427
>>     428       FLB_TLS_SET(flb_libco_params, params);
>> -> 429       co_switch(th->callee);
>>     430   }
>>     431
>>     432   static FLB_INLINE void output_pre_cb_flush(void)
>> (lldb) p *th
>> (flb_thread) $0 = {
>>    caller = 0x00000000406b2950
>>    callee = 0x000000004169f640
>>    data = 0xa5a5a5a5a5a5a5a5
>>    cb_destroy = 0x0000000000000000
>> }
>> (lldb)
>>
>> i guess the next question to answer is why is this out of bounds.  i'm
>> gonna poke around and see what i can learn today.
>>
>>
> The value of th->callee should be a function, I think.  That is just from a
> cursory glance at libco.
>
> Good luck.

interesting - so it looks like fluent-bit includes its own copy of 
libco under lib/flb_libco.  from a cursory glance i didn't see any major 
differences from upstream.  the included doc has this to say 
about co_switch():

void co_switch(cothread_t cothread)
Switch to specified cothread.
Null (0) or invalid cothread handle is not allowed.
Passing handle of active cothread to this function is not allowed.
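
for what it's worth, the null-handle and active-handle rules there are easy 
to check before the switch.  a minimal sketch, assuming a hypothetical 
helper (cothread_switch_allowed() and checked_target() are names i made up, 
they are not part of libco or fluent-bit) that is given the handle 
co_active() would return.  it can't tell whether a non-null handle is stale 
or corrupted, which may well be the more likely failure here:

```c
#include <assert.h>
#include <stddef.h>

/* libco declares cothread_t as an opaque pointer type. */
typedef void *cothread_t;

/*
 * Hypothetical helper, not part of libco or fluent-bit: returns 1 when
 * `target` satisfies the two checkable preconditions of co_switch(),
 * 0 otherwise.  `active` is what co_active() would return at the call site.
 */
static inline int cothread_switch_allowed(cothread_t target, cothread_t active)
{
    if (target == NULL)
        return 0;   /* null handle is not allowed */
    if (target == active)
        return 0;   /* handle of the active cothread is not allowed */
    return 1;
}

/* Hypothetical wrapper: aborts in debug builds if the check fails. */
static inline cothread_t checked_target(cothread_t target, cothread_t active)
{
    assert(cothread_switch_allowed(target, active));
    return target;
}
```

at the call site that would read something like 
co_switch(checked_target(th->callee, co_active()));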

looking through their flb_thread_libco.h file the implementation looks 
like this:
#define flb_thread_return(th) co_switch(th->caller)

static FLB_INLINE void flb_thread_resume(struct flb_thread *th)
{
     pthread_setspecific(flb_thread_key, (void *) th);

     /*
      * In the past we used to have a flag to mark when a coroutine
      * has finished (th->ended == MK_TRUE), now we let the coroutine
      * to submit an event to the event loop indicating what's going on
      * through the call FLB_OUTPUT_RETURN(...).
      *
      * So we just swap context and let the event loop to handle all
      * the cleanup required.
      */

     th->caller = co_active();
     co_switch(th->callee);
}

the above code is old (from 2016) so i don't think that's the issue.
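
one more observation on the lldb output quoted above: data = 
0xa5a5a5a5a5a5a5a5 matches the byte jemalloc writes into freshly allocated 
memory when junk filling is enabled (0xa5 on allocation, 0x5a on free), so 
that field may simply never have been initialized.  that is my assumption, 
i haven't confirmed it against the fluent-bit source.  a small hypothetical 
helper (the names are mine) to flag pointers like that while debugging:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical debugging helper: returns 1 if every byte of the pointer
 * value is 0xa5 (jemalloc junk for fresh allocations) or every byte is
 * 0x5a (jemalloc junk for freed memory), 0 otherwise.
 */
static inline int looks_like_junk_ptr(const void *p)
{
    unsigned char bytes[sizeof p];
    size_t i;
    int all_a5 = 1, all_5a = 1;

    /* Inspect the pointer's own bytes, not the memory it points to. */
    memcpy(bytes, &p, sizeof p);
    for (i = 0; i < sizeof p; i++) {
        if (bytes[i] != 0xa5)
            all_a5 = 0;
        if (bytes[i] != 0x5a)
            all_5a = 0;
    }
    return all_a5 || all_5a;
}

/* Hypothetical wrapper: aborts in debug builds on a junk-looking pointer. */
static inline const void *require_initialized(const void *p)
{
    assert(!looks_like_junk_ptr(p));
    return p;
}
```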

thanks for your help on this Patrick - i suspect that to make much more 
progress i'll need someone from the fluent-bit team to take a closer 
look at what's happening.

cheers,
-pete

-- 
Pete Wright
pete@nomadlogic.org
@nomadlogicLA
