Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 26 Jun 2013 22:31:33 +0200
From:      Michael Gmelin <freebsd@grem.de>
To:        Dimitry Andric <dim@FreeBSD.org>
Cc:        Kostik Belousov <kostikbel@gmail.com>, Brooks Davis <brooks@FreeBSD.org>, David Chisnall <theraven@freebsd.org>, "freebsd-ports@freebsd.org Ports" <freebsd-ports@freebsd.org>, Matthias Andree <mandree@FreeBSD.org>
Subject:   Re: Global destructor order problems (was: Re: Are ports supposed to build and run on 10-CURRENT?)
Message-ID:  <20130626223133.1cc1e009@bsd64.grem.de>
In-Reply-To: <7CD9075C-F8D6-41C1-8D21-8B10DF866ECE@FreeBSD.org>
References:  <20130613031535.4087d7f9@bsd64.grem.de> <EF830CD7-00F1-4628-8515-76133BBE85E7@FreeBSD.org> <C1CC40FC-4489-4164-96B7-5E1A25DCB37F@FreeBSD.org> <20130626015508.426ab5b9@bsd64.grem.de> <51CAADB8.7090603@FreeBSD.org> <20130626133149.4835f14a@bsd64.grem.de> <7CD9075C-F8D6-41C1-8D21-8B10DF866ECE@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 26 Jun 2013 21:26:09 +0200
Dimitry Andric <dim@FreeBSD.org> wrote:

> On Jun 26, 2013, at 13:31, Michael Gmelin <freebsd@grem.de> wrote:
> > On Wed, 26 Jun 2013 11:00:40 +0200
> > Dimitry Andric <dim@FreeBSD.org> wrote:
> >> On 2013-06-26 01:55, Michael Gmelin wrote:
> >> ...
> >>> The problem is that static initialization happens in the expected
> >>> order (same translation unit), but termination does *not* happen
> >>> in the reverse order of initialization,
> ...
> > Yep, strange indeed - my test cases didn't use fPIC at first, so it
> > took a while to figure it out. It's seems to be some sort of
> > combined link/runtime problem, since the same executable built on 10
> > runs fine on 9.1-RELEASE when copied over. I tried replacing various
> > system libraries with their versions from 9.1 in a jail to see if I
> > could make it run on 10, but to no success.
> > 
> > By the way, the same code built on 9.1 using clang 3.1 or clang 3.3
> > runs fine on 10 as well, so the only case that does NOT work is
> > build on 10 and run on 10 using clang. Also, when I link copies of
> > main.o and libout.so that have been built on 10 on 9.1 using
> > clang33 the problem doesn't happen as well. So it appears that the
> > problem happens when linking the executable when one of the objects
> > is position independent and then only surfaces on 10.
> 
> So I did a bit of investigation, and the root cause is that both clang
> and newer versions of gcc emit direct calls to the destructors of
> global objects, while older gcc's, such as the one in base, generate
> anonymous wrapper functions, which in turn call the destructors.
> 
> The direct destructor calls will not work correctly, if the
> destructors are located in shared objects, while the global objects
> themselves are in the main program, and if the main program is
> compiled with -fPIC. This problem happens after the following
> revision, which changed the behavior of __cxa_finalize();
> 
>   http://svnweb.freebsd.org/base?view=revision&revision=211706
> 
> This revision is not in 9.1-RELEASE, but it is in 9-STABLE, so the
> problem can also be reproduced there.
> 
> To illustrate: if you compile your test program's main.cpp with gcc
> -fPIC, it produces (excerpted the assembly a bit for readability):
> 
>         .section        .ctors,"aw",@progbits
>         .align 4
>         .long   _GLOBAL__I_main
> [...]
> __tcf_1:
>         pushl   %ebp
>         movl    %esp, %ebp
>         pushl   %ebx
>         call    __i686.get_pc_thunk.bx
>         addl    $_GLOBAL_OFFSET_TABLE_, %ebx
>         subl    $16, %esp
>         leal    innerInstance@GOTOFF(%ebx), %eax
>         pushl   %eax
>         call    _ZN5InnerD1Ev@PLT
>         addl    $16, %esp
>         movl    -4(%ebp), %ebx
>         leave
>         ret
> [...]
> _Z41__static_initialization_and_destruction_0ii:
>         pushl   %ebp
>         movl    %esp, %ebp
>         pushl   %esi
>         pushl   %ebx
>         call    __i686.get_pc_thunk.bx
>         addl    $_GLOBAL_OFFSET_TABLE_, %ebx
>         decl    %eax
>         jne     .L14
>         cmpl    $65535, %edx
>         jne     .L14
>         subl    $12, %esp
>         leal    outerInstance@GOTOFF(%ebx), %eax
>         pushl   %eax
>         call    _ZN5OuterC1Ev@PLT
>         movl    __dso_handle@GOT(%ebx), %esi
>         addl    $12, %esp
>         leal    __tcf_0@GOTOFF(%ebx), %eax
>         pushl   %esi
>         pushl   $0
>         pushl   %eax
>         call    __cxa_atexit@PLT
>         leal    innerInstance@GOTOFF(%ebx), %eax
>         movl    %eax, (%esp)
>         call    _ZN5InnerC1Ev@PLT
>         addl    $12, %esp
>         pushl   %esi
>         pushl   $0
>         leal    __tcf_1@GOTOFF(%ebx), %eax
>         pushl   %eax
>         call    __cxa_atexit@PLT
>         addl    $16, %esp
> .L14:
>         leal    -8(%ebp), %esp
>         popl    %ebx
>         popl    %esi
>         popl    %ebp
>         ret
> [...]
> _GLOBAL__I_main:
>         pushl   %ebp
>         movl    $65535, %edx
>         movl    %esp, %ebp
>         movl    $1, %eax
>         popl    %ebp
>         jmp     _Z41__static_initialization_and_destruction_0ii
> [...]
> __tcf_0:
>         pushl   %ebp
>         movl    %esp, %ebp
>         pushl   %ebx
>         call    __i686.get_pc_thunk.bx
>         addl    $_GLOBAL_OFFSET_TABLE_, %ebx
>         subl    $16, %esp
>         leal    outerInstance@GOTOFF(%ebx), %eax
>         pushl   %eax
>         call    _ZN5OuterD1Ev@PLT
>         addl    $16, %esp
>         movl    -4(%ebp), %ebx
>         leave
>         ret
> [...]
> 
> Summarizing:
> - the startup code calls _GLOBAL__I_main, a.k.a. "global constructors
> keyed to main"
> - jumps to _Z41__static_initialization_and_destruction_0ii, a.k.a.
>   __static_initialization_and_destruction_0(int, int)
> - calls _ZN5OuterC1Ev, a.k.a. Outer::Outer(), to construct the
>   outerInstance object
> - calls __cxa_atexit(), registering a generated wrapper function
>   __tcf_0(), which actually calls _ZN5OuterD1Ev, a.k.a.
> Outer::~Outer()
> - similar for the innerInstance object
> 
> In contrast, clang produces the following:
> 
> _GLOBAL__I_a:                           # @_GLOBAL__I_a
>         pushl   %ebp
>         movl    %esp, %ebp
>         pushl   %ebx
>         pushl   %edi
>         pushl   %esi
>         subl    $12, %esp
>         calll   .L2$pb
> .L2$pb:
>         popl    %ebx
>         addl    $_GLOBAL_OFFSET_TABLE_+(.Ltmp13-.L2$pb), %ebx
>         leal    _ZL13outerInstance@GOTOFF(%ebx), %edi
>         movl    %edi, (%esp)
>         calll   _ZN5OuterC1Ev@PLT
>         movl    __dso_handle@GOT(%ebx), %esi
>         movl    %esi, 8(%esp)
>         movl    %edi, 4(%esp)
>         movl    _ZN5OuterD1Ev@GOT(%ebx), %eax
>         movl    %eax, (%esp)
>         calll   __cxa_atexit@PLT
>         leal    .Lstr5@GOTOFF(%ebx), %eax
>         movl    %eax, (%esp)
>         calll   puts@PLT
>         movl    %esi, 8(%esp)
>         leal    _ZL13innerInstance@GOTOFF(%ebx), %eax
>         movl    %eax, 4(%esp)
>         movl    _ZN5InnerD1Ev@GOT(%ebx), %eax
>         movl    %eax, (%esp)
>         calll   __cxa_atexit@PLT
>         addl    $12, %esp
>         popl    %esi
>         popl    %edi
>         popl    %ebx
>         popl    %ebp
>         ret
> [...]
>         .section        .ctors,"aw",@progbits
>         .align  4
>         .long   _GLOBAL__I_a
> 
> Summarizing:
> - the startup code calls _GLOBAL__I_a, a.k.a. "global constructors
>   keyed to a"
> - calls _ZN5OuterC1Ev, a.k.a. Outer::Outer(), to construct the
>   outerInstance object
> - calls __cxa_atexit(), directly registering _ZN5OuterD1Ev, a.k.a
>   Outer::~Outer()
> - similar for the innerInstance object (though the constructor is
>   inlined)
> 
> The crucial difference is that clang *directly* registers the
> destructor's function pointer, instead of using a locally generated
> wrapper.  Newer versions of gcc behave the same way, since this
> upstream revision:
> 
>   http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=125253
> 
> This is roughly gcc 4.3.0 and later.  For example, gcc 4.8 generates:
> 
> _GLOBAL__sub_I_main.cpp:
>         pushl   %ebp
>         movl    %esp, %ebp
>         pushl   %edi
>         pushl   %esi
>         pushl   %ebx
>         call    __x86.get_pc_thunk.bx
>         addl    $_GLOBAL_OFFSET_TABLE_, %ebx
>         subl    $24, %esp
>         leal    _ZL13outerInstance@GOTOFF(%ebx), %edi
>         pushl   %edi
>         call    _ZN5OuterC1Ev@PLT
>         leal    __dso_handle@GOTOFF(%ebx), %esi
>         addl    $12, %esp
>         pushl   %esi
>         pushl   %edi
>         pushl   _ZN5OuterD1Ev@GOT(%ebx)
>         call    __cxa_atexit@PLT
>         leal    .LC2@GOTOFF(%ebx), %eax
>         movl    %eax, (%esp)
>         call    puts@PLT
>         addl    $12, %esp
>         pushl   %esi
>         leal    _ZL13innerInstance@GOTOFF(%ebx), %eax
>         pushl   %eax
>         pushl   _ZN5InnerD1Ev@GOT(%ebx)
>         call    __cxa_atexit@PLT
>         addl    $16, %esp
>         leal    -12(%ebp), %esp
>         popl    %ebx
>         popl    %esi
>         popl    %edi
>         popl    %ebp
>         ret
> [...]
>         .section        .ctors,"aw",@progbits
>         .align 4
>         .long   _GLOBAL__sub_I_main.cpp
> 
> In each case, __cxa_exit() is called with the following three
> arguments: the address of the destructor, the pointer to the object
> ('this'), and the dso handle, which in this case belongs to main.
> 
> Now, when the program exits, it will repeatedly call __cxa_finalize()
> to actually call the registered exit functions, each time passing a
> pointer to the dso being unloaded (or NULL for main):
> 
> void
> __cxa_finalize(void *dso)
> {
>         struct dl_phdr_info phdr_info;
>         struct atexit *p;
>         struct atexit_fn fn;
>         int n, has_phdr;
> 
>         if (dso != NULL)
>                 has_phdr = _rtld_addr_phdr(dso, &phdr_info);
>         else
>                 has_phdr = 0;
> 
>         _MUTEX_LOCK(&atexit_mutex);
>         for (p = __atexit; p; p = p->next) {
>                 for (n = p->ind; --n >= 0;) {
>                         if (p->fns[n].fn_type == ATEXIT_FN_EMPTY)
>                                 continue; /* already been called */
>                         fn = p->fns[n];
>                         if (dso != NULL && dso != fn.fn_dso) {
>                                 /* wrong DSO ? */
>                                 if (!has_phdr
> || !__elf_phdr_match_addr( &phdr_info, fn.fn_ptr.cxa_func))
>                                         continue;
>                         }
>                         /*
>                           Mark entry to indicate that this particular
> handler has already been called.
>                         */
>                         p->fns[n].fn_type = ATEXIT_FN_EMPTY;
>                         _MUTEX_UNLOCK(&atexit_mutex);
> 
>                         /* Call the function of correct type. */
>                         if (fn.fn_type == ATEXIT_FN_CXA)
>                                 fn.fn_ptr.cxa_func(fn.fn_arg);
>                         else if (fn.fn_type == ATEXIT_FN_STD)
>                                 fn.fn_ptr.std_func();
> [...]
> 
> The problem is in the part with the comment "wrong DSO?".  When the
> main program is compiled with -fPIC, and __cxa_finalize() is called
> for libout.so (which is the first dso to be processed), it will
> encounter the entry for Outer::~Outer().
> 
> Then, the "wrong DSO?" part will be entered, and because has_phdr is
> true, it will call __elf_phdr_match_addr() with the address of the
> destructor.  Since the destructor is registered with
> _ZN5OuterD1Ev@GOT, it will match, and it will be called.
> 
> In contrast, if the main program is not compiled with -fPIC, the
> destructor will be registered with _ZN5OuterD1Ev (e.g. without @GOT),
> and __elf_phdr_match_addr() will not match, and the loop continues
> without calling the destructor.
> 
> Finally, if the main program is compiled with gcc and -fPIC, the
> destructor itself is never considered in the __cxa_finalize() loop,
> only the locally generated wrapper function.  That function will only
> be called in the __cxa_finalize() call for the main program, and so
> the destructor will be called at the right time.
> 
> I am not entirely sure what can be done to remedy this scenario, and I
> also do not know the exact reasons for r211706 changing the behavior.
> 
> E.g., before r211706, if the atexit_fn's fn_dso did not match the dso
> being unloaded, the loop would unconditionally continue without
> calling the handler.  On the other hand, r211706 seems to make sure
> functions from dso's will be called before they are unloaded, as
> calling them afterwards would not always make sense... :-)
> 

Thanks for the in-depth analysis, quite interesting read that makes a
lot of sense and matches the gut feeling that "it's destroying
everything defined in the shared lib first".

Call me Mr. Obvious, but I assume clang and gcc won't change the way
destructors are registered, so we need a fix in FreeBSD. Maybe kib@
could shed some light on this?

Cheers,
Michael 

-- 
Michael Gmelin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130626223133.1cc1e009>