Date: Sat, 2 Nov 2013 10:39:53 -0500
From: Diane Bruce <db@db.net>
To: Ian Lepore <ian@FreeBSD.org>
Cc: Tim Kientzle <tim@kientzle.com>, jasone@FreeBSD.org, freebsd-arm@FreeBSD.org, Howard Su <howard0su@gmail.com>
Subject: Re: sshd crash
Message-ID: <20131102153953.GA39106@night.db.net>
In-Reply-To: <1383399220.31172.116.camel@revolution.hippie.lan>
References: <CAAvnz_rj43Ww6=mMfnp2u5TA2pWb20vWOqyAtuK08wgzy0dH6A@mail.gmail.com>
 <1383313834.31172.65.camel@revolution.hippie.lan>
 <CAHNYxxMMF_GJv10drYuQFO%2Bav%2BTdp8OBvJfFZObEZ=tgaBovSA@mail.gmail.com>
 <1383328423.31172.92.camel@revolution.hippie.lan>
 <CAHNYxxNiuKP8wfTaZuL%2BBXiLcYA9eU3LBb-659ZBYr-WBSmZeQ@mail.gmail.com>
 <1383343354.31172.102.camel@revolution.hippie.lan>
 <EB18203F-C516-4917-9AA4-DBA6E66DAAB6@kientzle.com>
 <1383399220.31172.116.camel@revolution.hippie.lan>
On Sat, Nov 02, 2013 at 07:33:40AM -0600, Ian Lepore wrote:
> On Fri, 2013-11-01 at 22:35 -0700, Tim Kientzle wrote:
> > On Nov 1, 2013, at 3:02 PM, Ian Lepore <ian@freebsd.org> wrote:
> >
> > > On Sat, 2013-11-02 at 02:40 +0800, Jia-Shiun Li wrote:
> > >> On Sat, Nov 2, 2013 at 1:53 AM, Ian Lepore <ian@freebsd.org> wrote:
> > >>> On Sat, 2013-11-02 at 01:44 +0800, Jia-Shiun Li wrote:
> > >>>> may I add: putty causes this to happen. mine 0.62. But ssh from another
> > >>>> FreeBSD host has no problem.
> > >>>>
> > >>>> I suspect it to be some issues related to memory or malloc issues
> > >>>> specific to bbb. 'tmux a -d' without existing detached sessions
> > >>>> causes tmux client to core dump. But sshd and it are both fine on rpi.
> > >>>>
> > >>>> -Jia-Shiun.
> > >>>
> > >>> This is the first I've heard of being able to ssh to an arm platform
> > >>> that doesn't have PrivSep disabled, since about July or so. I've never
> > >>> heard a report yet that anything on the client side could make a
> > >>> difference.
> > >>>
> > >>> It's definitely not a beaglebone thing, it happens on every arm board
> > >>> I've got... dreamplug, rpi, bbw, imx53, wandboard.
> > >>
> > >>
> > >> Ok let me make sure I did not mix things up. ;)
> > >>
> > >> IIRC I once saw similar issue on rpi shortly. But after another
> > >> weekly update it was gone. I did not pay too much attention on rpi,
> > >> and thought it was bbb specific.
> > >>
> > >> I did not change sshd_config, UsePrivilegeSeparation supposed
> > >> remaining on as default is.
> >
> > I started looking into it a couple of months ago but didn't get
> > very far; Diane Bruce got a lot further than I did.
> >
> > If I recall correctly, it started up when the malloc libc symbols
> > were changed. That may have altered what malloc implementation
> > sshd used.
> >
> > So it could be a long-standing stray write that jemalloc just
> > happens to detect.
> >
> > It could also be related to locking (there's some multi-threaded
> > crypto code in sshd that may be involved).
>
> There's lots of stuff with lock in the name, but I don't think there are
> actually any threads involved in sshd, just forking. ldd says sshd
> doesn't link to libthr.
>
> I'm not sure it's a mundane stray-write either. The routine that's
> asserting is checking to see if the contents of a page are all-zero
> because a jemalloc internal flag is set that says it should be. I had
> the routine print the non-zero data it found, and it looks like this:
>
> not-zero at 0 0x20c99000 = 0x20800a00
> not-zero at 1 0x20c99004 = 0x00000001
> not-zero at 2 0x20c99008 = 0x0000002f
> not-zero at 3 0x20c9900c = 0xffffffff
> not-zero at 4 0x20c99010 = 0x00007fff
> not-zero at 5 0x20c99014 = 0x00000003
> not-zero at 96 0x20c99180 = 0x5a5a5a5a
> not-zero at 97 0x20c99184 = 0x5a5a5a5a
> not-zero at 98 0x20c99188 = 0x5a5a5a5a
>
> The 0x5a continues to the end of the page. So jemalloc has metadata
> that says it thinks the page is all-zeroes, and the page is a mix of
> data and some zeroes and the 5a junk-fill byte. It seems more like the
> metadata is in error somehow. (Maybe a stray write hit the metadata.)
>
> -- Ian
>

I did a ln -s "quarantine:16000000" /etc/malloc.conf
which also works. This led me down the garden path of thinking
it might be a use after free. This was the conclusion jasone also
came to, which led to me reporting this possibility to secteam and des.

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=199241+0+archive/2013/freebsd-arm/20130728.freebsd-arm

Nevertheless, running efence from ports failed to come up with any
use after free.

I put together some notes for des at
http://www.freebsd.org/~db/fordes

The rev in question is
http://svnweb.freebsd.org/base?view=revision&revision=250991

That is when jemalloc was turned on for userland.
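For reference, the all-zero page check Ian describes in the quoted text
above can be sketched roughly as follows. This is only a minimal
illustration under stated assumptions, not jemalloc's actual routine: the
function name report_nonzero_words and its parameters are hypothetical,
and only the output format follows the "not-zero at" lines quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not taken from jemalloc): treat a page as an
 * array of 32-bit words, print every word that is not zero, and return
 * the number of non-zero words found.  Pass out == NULL to count only.
 */
static size_t
report_nonzero_words(const uint32_t *page, size_t page_size, FILE *out)
{
	size_t nwords = page_size / sizeof(uint32_t);
	size_t found = 0;

	assert(page != NULL);
	for (size_t i = 0; i < nwords; i++) {
		if (page[i] != 0) {
			if (out != NULL)
				fprintf(out, "not-zero at %zu %p = 0x%08x\n",
				    i, (const void *)&page[i],
				    (unsigned)page[i]);
			found++;
		}
	}
	return (found);
}
```

A page that really is all-zero returns 0 and prints nothing. A call such
as report_nonzero_words(page, 4096, stderr) on the page from Ian's report
would list the six data words followed by the run of 0x5a5a5a5a words.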
There existed an older malloc (also by jasone):
/usr/src/lib/libc/stdlib/malloc.c

I agree with Ian, it is not thread locking. I have a thread test program
which does not show any faults in our thread locking. Yes, it is purely
associated with the fork.

zbb@ also reported a similar problem with another platform.

===
Hello.

I'm sending you the logs. Please see below.

Best regards
Zbyszek Bodek

1.
=======
--- ExprConstant.o ---
<jemalloc>: /home/zbb/projects/armsp/freebsd-arm-superpages/lib/libc/../../contrib/jemalloc/include/jemalloc/internal/arena.h:757: Failed assertion: "binind < NBINS"
./StmtNodes.inc.h: In member function 'RetTy clang::StmtVisitorBase<Ptr, ImplClass, RetTy>::Visit(typename Ptr<clang::Stmt>::type) [with Ptr = clang::make_const_ptr, ImplClass = <unnamed>::LValueExprEvaluator, RetTy = bool]':
./StmtNodes.inc.h:873: internal compiler error: Abort trap
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://gcc.gnu.org/bugs.html> for instructions.
*** [ExprConstant.o] Error code 1

make[6]: stopped in /usr/src/lib/clang/libclangast

make[6]: stopped in /usr/src/lib/clang/libclangast
*** [all] Error code 2

make[5]: stopped in /usr/src/lib/clang
1 error

make[5]: stopped in /usr/src/lib/clang
*** [all] Error code 2

make[4]: stopped in /usr/src/lib
1 error

make[4]: stopped in /usr/src/lib
A failure has been detected in another branch of the parallel make

make[3]: stopped in /usr/src
*** [libraries] Error code 2

make[2]: stopped in /usr/src
1 error

make[2]: stopped in /usr/src
*** [_libraries] Error code 2

make[1]: stopped in /usr/src
1 error

make[1]: stopped in /usr/src
*** [buildworld] Error code 2

make: stopped in /usr/src
1 error

2.
=======
--- ExprConstant.o ---
<jemalloc>: /home/zbb/projects/armsp/freebsd-arm-superpages/lib/libc/../../contrib/jemalloc/include/jemalloc/internal/arena.h:757: Failed assertion: "binind < NBINS"
/usr/src/lib/clang/libclangast/../../../contrib/llvm/tools/clang/lib/AST/ExprConstant.cpp: In member function 'RetTy<unnamed>::ExprEvaluatorBase<Derived, RetTy>::VisitCallExpr(const clang::CallExpr*) [with Derived = <unnamed>::IntExprEvaluator, RetTy = bool]':
/usr/src/lib/clang/libclangast/../../../contrib/llvm/tools/clang/lib/AST/ExprConstant.cpp:3190: internal compiler error: Abort trap
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://gcc.gnu.org/bugs.html> for instructions.
*** [ExprConstant.o] Error code 1

----- End forwarded message -----

There is also an open bug report for that one, from both zbb and
Matthias Meyser; see PR 182060.

It's time to bring in jasone again, I think, and I have included him on the
cc. jemalloc fills memory in a number of places using the same pattern.
I modified the pattern to be different at each place in order to track what
we are seeing. Where I have left it now is that I think it might be
associated with the thread cache code, because the pattern I see comes from
that branch of his code. I have copious notes here but will have to dig
them up.

Both Ian and I were rather hoping zbb@ had fixed this one when he fixed
a stupid in the arm vm, but Ian tells me it is still there.

- Diane
-- 
- db@FreeBSD.org db@db.net http://www.db.net/~db