From owner-freebsd-current@FreeBSD.ORG Mon Nov 26 21:21:15 2012 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A5472583; Mon, 26 Nov 2012 21:21:15 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail13.syd.optusnet.com.au (mail13.syd.optusnet.com.au [211.29.132.194]) by mx1.freebsd.org (Postfix) with ESMTP id 263408FC1C; Mon, 26 Nov 2012 21:21:14 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail13.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qAQLL5T2008212 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 27 Nov 2012 08:21:06 +1100 Date: Tue, 27 Nov 2012 08:21:05 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov Subject: Re: clang compiled kernel panic when mounting zfs root on i386 In-Reply-To: <20121126171658.GD3013@kib.kiev.ua> Message-ID: <20121127071243.D1255@besplex.bde.org> References: <50b37d46.8584440a.735c.ffffb4e6@mx.google.com> <20121126171658.GD3013@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-Cloudmark-Score: 0 X-Optus-Cloudmark-Analysis: v=2.0 cv=TL4d0CZa c=1 sm=1 a=Dqyo4nusdH8A:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=PZxpgLn_-10A:10 a=IMFgP6cxdJxPq-vExgcA:9 a=CjuIK1q_8ugA:10 a=yiplnose_AnHQ6Q_:21 a=7WiQoUoJrVuut8JH:21 a=bxQHXO5Py4tHmhUgaywp5w==:117 X-Mailman-Approved-At: Mon, 26 Nov 2012 21:54:03 +0000 Cc: freebsd-current@freebsd.org, fs@freebsd.org, sig6247 X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Nov 2012 21:21:15 -0000 On Mon, 26 Nov 2012, Konstantin Belousov wrote: > On Mon, Nov 26, 2012 at 06:31:34AM -0800, sig6247 wrote: >> >> Just checked out r243529, this only happens when the kernel is compiled >> by clang, and only on i386, either recompiling the kernel with gcc or >> booting from a UFS root works fine. Is it a known problem? > It looks like that clang uses more stack than gcc, and zfs makes quite > deep call chains. > > It would be a waste, generally, to increase the init process kernel > stack size only to pacify zfs. And I suspect that it would not help > in the similar situations when the same procedure initiated for non-root > mounts. Or to pacify clang... >> ---------------------------------------------------------------------- >> WARNING: WITNESS option enabled, expect reduced performance. >> Trying to mount root from zfs:zroot []... >> >> Fatal double fault: >> eip = 0xc0adc37d >> esp = 0xc86bffc8 >> ebp = 0xc86c003c >> cpuid = 1; apic id = 01 >> panic: double fault >> cpuid = 1 >> KDB: enter: panic >> [ thread pid 1 tid 100002 ] >> Stopped at kdb_enter+0x3d: movl $0,kdb_why >> db> bt >> Tracing pid 1 tid 100002 td 0xc89efbc0 >> kdb_enter(c1064aa4,c1064aa4,c10b806f,c139e3b8,f5eacada,...) at kdb_enter+0x3d >> panic(c10b806f,1,1,1,c86c003c,...) at panic+0x14b >> dblfault_handler() at dblfault_handler+0xab >> --- trap 0x17, eip = 0xc0adc37d, esp = 0xc86bffc8, ebp = 0xc86c003c --- >> witness_checkorder(c1fd7508,9,c109df18,7fa,0,...) at witness_checkorder+0x37d >> __mtx_lock_flags(c1fd7518,0,c109df18,7fa,c135d918,...) at __mtx_lock_flags+0x87 >> uma_zalloc_arg(c1fd66c0,0,1,4d3,c86c0110,...) at uma_zalloc_arg+0x605 >> vm_map_insert(c1fd508c,c13dfc10,bb1f000,0,cba1e000,...) at vm_map_insert+0x499 >> kmem_back(c1fd508c,cba1e000,1000,3,c86c01d4,...) at kmem_back+0x76 >> kmem_malloc(c1fd508c,1000,3) at kmem_malloc+0x250 >> page_alloc(c1fd1d80,1000,c86c020b,3,c1fd1d80,...) at page_alloc+0x27 >> keg_alloc_slab(103,4,c109df18,870,cb99ef6c,...) at keg_alloc_slab+0xc3 >> keg_fetch_slab(103,c1fd1d80,cb99ef6c,c1fc8230,c86c02c0,...) at keg_fetch_slab+0xe2 >> zone_fetch_slab(c1fd1d80,c1fd0480,103,826,0,...) at zone_fetch_slab+0x43 >> uma_zalloc_arg(c1fd1d80,0,102,3,2,...) at uma_zalloc_arg+0x3f2 >> malloc(4c,c1686100,102,c86c0388,c173d09a,...) at malloc+0xe9 >> zfs_kmem_alloc(4c,102,cb618820,c89efbc0,cb618820,...) at zfs_kmem_alloc+0x20 >> vdev_mirror_io_start(cb8218a0,10,cb8218a0,1,0,...) at vdev_mirror_io_start+0x14a >> zio_vdev_io_start(cb8218a0,c89efbc0,0,cb8218a0,c86c0600,...) at zio_vdev_io_start+0x228 >> zio_execute(cb8218a0,cb618000,cba1b640,cb900000,400,...) at zio_execute+0x106 >> spa_load_verify_cb(cb618000,0,cba1b640,cb884b40,c86c0600,...) at spa_load_verify_cb+0x89 >> traverse_visitbp(cb884b40,cba1b640,c86c0600,c86c0ba0,0,...) at traverse_visitbp+0x29f >> traverse_dnode(cb884b40,0,0,8b,0,...) at traverse_dnode+0x92 >> traverse_visitbp(cb884bb8,cba07200,c86c0890,cb884bf4,c16ce7e0,...) at traverse_visitbp+0xe47 >> traverse_visitbp(cb884bf4,cb9bf840,c86c0968,c86c0ba0,0,...) at traverse_visitbp+0xf32 >> traverse_dnode(cb884bf4,0,0,0,0,...) at traverse_dnode+0x92 >> traverse_visitbp(0,cb618398,c86c0b50,2,cb9f1c78,...) at traverse_visitbp+0x96d >> traverse_impl(0,0,cb618398,3e1,0,...) at traverse_impl+0x268 >> traverse_pool(cb618000,3e1,0,d,c1727830,...) at traverse_pool+0x79 >> spa_load(0,1,c86c0ec4,1e,0,...) at spa_load+0x1dde >> spa_load(0,0,c13d8c94,1,3,...) at spa_load+0x11a5 >> spa_load_best(0,ffffffff,ffffffff,1,c0adc395,...) at spa_load_best+0x71 >> spa_open_common(c17e0e1e,0,0,c86c1190,c16f5a1c,...) at spa_open_common+0x11a >> spa_open(c86c1078,c86c1074,c17e0e1e,c135d918,c1fd7798,...) at spa_open+0x27 >> dsl_dir_open_spa(0,cb770030,c17e11b1,c86c11f8,c86c11f4,...) at dsl_dir_open_spa+0x6c >> dsl_dataset_hold(cb770030,cb613800,c86c1240,cb613800,cb613800,...) at dsl_dataset_hold+0x3a >> dsl_dataset_own(cb770030,0,cb613800,c86c1240,c1684e30,...) at dsl_dataset_own+0x21 >> dmu_objset_own(cb770030,2,1,cb613800,c86c1290,...) at dmu_objset_own+0x2a >> zfsvfs_create(cb770030,c86c13ac,c17ee05d,681,0,...) at zfsvfs_create+0x4c >> zfs_mount(cb78ed20,c17f411c,c9ff4600,c89cae80,0,...) at zfs_mount+0x42c >> vfs_donmount(c89efbc0,4000,0,c86c1790,cb6c0800,...) at vfs_donmount+0xc6d >> kernel_mount(cb7700b0,4000,0,0,1,...) at kernel_mount+0x6b >> parse_mount(cb7700e0,c1194498,0,1,0,...) at parse_mount+0x606 >> vfs_mountroot(c13d95b0,4,c105c042,2bb,0,...) at vfs_mountroot+0x6cf >> start_init(0,c86c1d08,c105e94c,3db,0,...) at start_init+0x6a >> fork_exit(c0a42090,0,c86c1d08) at fork_exit+0x7f >> fork_trampoline() at fork_trampoline+0x8 >> --- trap 0, eip = 0, esp = 0xc86c1d40, ebp = 0 --- >> db> 43 deep (before the double fault) is disgusting, but even if clang has broken stack alignment due to a wrong default and no -mpreferred-stack-boundary to fix it, that's still only about 8*43 extra bytes (8 for the average extra stack to align to 16 bytes). Probably zfs is also putting large data structures on the stack. It would be useful if the stack trace printed the the stack pointer on every function call, so that you could see how much stack each function used. All those ', ...' printed after 5 args show further apparent clang lossage. vfs_mountroot takes no args at all, but the above shows ot taking 5 known args and further unknown args. ddb's stack tracer does limited parsing of the bytes in function's caller to guess the number of args. It usually guesses right for gcc. It apparently usually guesses wrong for clang. I can't test this for i386 since ref10-i386 is down, but for amd64 the parsing seems to be already too dificult for gcc and impossible for clang: for the function call in `void bar(void); void foo(void) { bar(); }', clang generates a tail call: % test: # @test % .cfi_startproc % # BB#0: # %entry % xorb %al, %al % jmp bar # TAILCALL % .Ltmp0: The bytes after the tail call are unparseable. The bytes before the tail call are difficult to parse, and AFAIR ddb doesn't look at them (the xorb says that there are no xmm args, but that is of no interest in the kernel). Gcc generates: % test: % .LFB2: % subq $8, %rsp % .LCFI0: % movl $0, %eax % call bar % addq $8, %rsp % ret The addq $8,%rsp after the call looks like it is popping 1 arg and is probably misguessed as such -- it is actually to clean up the stack alignment before return. Actually, this guess doesn't apply to amd64 -- the first few args are in registers where there number is almost impossible to guess even for gcc if there are none on stack; it is only when there are some on the stack that the number there can possibly be guessed and the total number = #number in registers + #on stack. With 1 arg: `void bar(int x); void foo(int) { bar(x); }', the generated code is identical for both compilers. So have any chance of guessing the number of args for bar(), bytes in bar()'s caller's callers would have to be checked, recursively. This is too hard for ddb. With tail calls, the checking is unbounded. Tail calls also break the stack trace, but save space so running out of kernel stack is less likely :-). It is arguable a bug to compile kernels with options that make stack tracing impossible. Even passing args in registers makes correct arg printing impossible -- ddb would have to assume that there are always at least 3 (?) args on amd64. The above shows 3 being printed for fork_exit(), which is correct, and 3 for kmem_malloc(), which is correct, and none for fork_trampoline(), which depends on ddb having special knowledge of this and a few other low-level/asm functions. All the others are broken. An very old bug in this area are that functions written in asm don't have normal frames (at least on amd64 and i386) unless profiling is configured. They are mostly leaf functions that shouldn't trap, so this is not very important for interactive use of ddb. But it makes the args of a function like memcpy() harder to find in interactive use, and breaks stack traces if an interrupt or trap occurs in the functions. Compilers may also omit frame pointers for leaf functions, but this is easy to prevent. The efficiency gains from not using frame pointers are small or negative, at least on modern x86 since the frame pointer management can run in parallel and is especially fast in leaf functions where it is not actually used (except for debugging), so frame pointers should never be omitted if there is any chance that the code needs to be debugged. The chance is almost 100% for kernels since they can be debugged if GDB or DDB or stack tracing is configured or if panic dumps are enabled and occur. A not so old bug in this area is that compiling with -O2 or just with with -march tends to break things. Even on old i386 kernels compiled with -O -march=i386, I mostly see 5 args (but no ', ...'). For a just a few functions, the calling sequence is "call foo; addl $4*N,%esp" and ddb correctly guesses that this means N args. Another sequence is "call foo; movl %eax,%esi; addl $8,%esp". ddb doesn't understand this at all, and misguesses that there are 5 args. Bruce