Date:      Sun, 03 Nov 2013 11:06:18 -0700
From:      Ian Lepore <ian@FreeBSD.org>
To:        Jason Evans <jasone@FreeBSD.org>
Cc:        Tim Kientzle <tim@kientzle.com>, freebsd-arm@FreeBSD.org, Howard Su <howard0su@gmail.com>
Subject:   Re: sshd crash
Message-ID:  <1383501978.31172.127.camel@revolution.hippie.lan>
In-Reply-To: <2F2E1775-A459-4D0F-A464-F41B8A7EAB9B@freebsd.org>
References:  <CAAvnz_rj43Ww6=mMfnp2u5TA2pWb20vWOqyAtuK08wgzy0dH6A@mail.gmail.com> <1383313834.31172.65.camel@revolution.hippie.lan> <CAHNYxxMMF_GJv10drYuQFO%2Bav%2BTdp8OBvJfFZObEZ=tgaBovSA@mail.gmail.com> <1383328423.31172.92.camel@revolution.hippie.lan> <CAHNYxxNiuKP8wfTaZuL%2BBXiLcYA9eU3LBb-659ZBYr-WBSmZeQ@mail.gmail.com> <1383343354.31172.102.camel@revolution.hippie.lan> <EB18203F-C516-4917-9AA4-DBA6E66DAAB6@kientzle.com> <1383399220.31172.116.camel@revolution.hippie.lan> <20131102153953.GA39106@night.db.net> <2F2E1775-A459-4D0F-A464-F41B8A7EAB9B@freebsd.org>

On Sun, 2013-11-03 at 08:51 -0800, Jason Evans wrote:
> On Nov 2, 2013, at 8:39 AM, Diane Bruce <db@db.net> wrote:
> > On Sat, Nov 02, 2013 at 07:33:40AM -0600, Ian Lepore wrote:
> >> 
> >> I'm not sure it's a mundane stray-write either.  The routine that's
> >> asserting is checking to see if the contents of a page are all-zero
> >> because a jemalloc internal flag is set that says it should be.  I had
> >> the routine print the non-zero data it found, and it looks like this:
> >> 
> >> not-zero at 0 0x20c99000 = 0x20800a00
> >> not-zero at 1 0x20c99004 = 0x00000001
> >> not-zero at 2 0x20c99008 = 0x0000002f
> >> not-zero at 3 0x20c9900c = 0xffffffff
> >> not-zero at 4 0x20c99010 = 0x00007fff
> >> not-zero at 5 0x20c99014 = 0x00000003
> >> not-zero at 96 0x20c99180 = 0x5a5a5a5a
> >> not-zero at 97 0x20c99184 = 0x5a5a5a5a
> >> not-zero at 98 0x20c99188 = 0x5a5a5a5a
> >> 
> >> The 0x5a continues to the end of the page.  So jemalloc has metadata
> >> that says it thinks the page is all-zeroes, and the page is a mix of
> >> data and some zeroes and the 5a junk-fill byte.  It seems more like the
> >> metadata is in error somehow.  (Maybe a stray write hit the metadata.)
> 
> This looks to me like the sort of thing that would happen if the chunk page map were corrupted.  This could happen due to a double free, freeing an interior pointer of a multi-page allocation, or a variety of more complicated errors.  The page is filled with 0x5a bytes, yet jemalloc thinks the page should contain 0x00 bytes, and that implies that the chunk page table claims this is the first use of the page since it was mapped.
> 
> Does this problem reproduce on amd64?  If so, I'll dig in and figure out if jemalloc is to blame.  If not on amd64, given enough hand holding re: hardware acquisition and configuration I can probably be convinced to set up an ARM system.
> 

FWIW, when re-examining that data yesterday I noticed that the 0x5a
fill doesn't continue to the end of the page; it continues until word
328, and the rest of the page is zeroes.  I assume that's still
consistent with a double-free and other such usage errors.
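
For anyone who wants to repeat that measurement, a scan along these
lines would show where the non-zero data starts and stops.  This is only
a minimal sketch, not the instrumentation that was actually added to
jemalloc; the function names and the 1024-word page (4 KiB of 32-bit
words) are assumptions made for illustration.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_WORDS 1024	/* assumed: 4096-byte page of 32-bit words */

/* Print every non-zero word in the page; return how many were found. */
static size_t
report_nonzero_words(const uint32_t *page, size_t nwords)
{
	size_t i, count = 0;

	for (i = 0; i < nwords; i++) {
		if (page[i] != 0) {
			printf("not-zero at %zu %p = 0x%08" PRIx32 "\n",
			    i, (const void *)&page[i], page[i]);
			count++;
		}
	}
	return (count);
}

/* Return the index of the last non-zero word, or -1 if the page is all zero. */
static long
last_nonzero_word(const uint32_t *page, size_t nwords)
{
	size_t i;

	for (i = nwords; i > 0; i--) {
		if (page[i - 1] != 0)
			return ((long)(i - 1));
	}
	return (-1);
}
```

Calling last_nonzero_word() on the failing page reports directly where
the junk fill ends, without reading through the whole dump.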

An interesting part of this problem is that the changeset that
introduced this problem is the one that makes the malloc-related symbols
in libc weak references to the jemalloc implementation.  Diane sees some
evidence in gdb that there is a non-jemalloc implementation of malloc
present in the process.  I wonder if we've got something like a mix of
statically and dynamically linked code and thus two mallocs somehow?

Would allocating a block from one malloc implementation then freeing it
to the other be consistent with the assertion data above?
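
To make the question concrete, here is a toy model of two independent
allocators, each with its own private page map.  It is not jemalloc
code and says nothing about what jemalloc really does with a foreign
pointer; every name in it (toy_heap, toy_alloc, toy_owns, toy_free) is
hypothetical.  It only shows that the second allocator has no record of
a block handed out by the first, so a free routine that skipped the
ownership check would be updating metadata for memory it never
allocated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGES	8
#define TOY_PAGE_SIZE	64

/* Hypothetical toy allocator: one page per allocation, private page map. */
struct toy_heap {
	unsigned char mem[TOY_PAGES * TOY_PAGE_SIZE];
	bool in_use[TOY_PAGES];	/* this heap's own record of live pages */
};

/* Hand out the first free page, or NULL if the heap is full. */
static void *
toy_alloc(struct toy_heap *h)
{
	size_t i;

	for (i = 0; i < TOY_PAGES; i++) {
		if (!h->in_use[i]) {
			h->in_use[i] = true;
			return (&h->mem[i * TOY_PAGE_SIZE]);
		}
	}
	return (NULL);
}

/* Return true if ptr lies inside this heap's memory. */
static bool
toy_owns(const struct toy_heap *h, const void *ptr)
{
	uintptr_t p = (uintptr_t)ptr;
	uintptr_t base = (uintptr_t)h->mem;

	return (p >= base && p < base + sizeof(h->mem));
}

/* Free a page; return false for a foreign pointer or a double free. */
static bool
toy_free(struct toy_heap *h, void *ptr)
{
	size_t i;

	if (!toy_owns(h, ptr))
		return (false);	/* block came from some other allocator */
	i = ((uintptr_t)ptr - (uintptr_t)h->mem) / TOY_PAGE_SIZE;
	if (!h->in_use[i])
		return (false);	/* page map says it is already free */
	h->in_use[i] = false;
	return (true);
}
```

With two heaps a and b, a pointer from toy_alloc(&a) is rejected by
toy_free(&b, ...), accepted once by toy_free(&a, ...), and rejected on
a second free to a.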

I think if this happened on x86 we'd be hearing from a LOT of folks
about it.  I wonder if it reproduces in an arm emulation environment?  I
don't know anything about using emulation, but others here do.

-- Ian




