Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 2 Mar 2009 07:40:04 GMT
From:      Mikolaj Golub <to.my.trociny@gmail.com>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/131290: [hang]: How to completely freeze FreeBSD 7.1 under a non-privileged user
Message-ID:  <200903020740.n227e4lA094780@freefall.freebsd.org>

next in thread | raw e-mail | index | archive | help
The following reply was made to PR kern/131290; it has been noted by GNATS.

From: Mikolaj Golub <to.my.trociny@gmail.com>
To: freebsd-bugs@FreeBSD.org, bug-followup@FreeBSD.org
Cc: Sergey S <ad.sergey@gmail.com>
Subject: Re: kern/131290: [hang]: How to completely freeze FreeBSD 7.1 under a non-privileged user
Date: Mon, 02 Mar 2009 09:39:00 +0200

 Some of the messages were not attached to the pr and it looks a bit confusing,
 so I will summarize the previous conversation, which can be found in
 freebsd-bugs maillist archive.
 
 As Sergey wrote in the first message, one can hang the system running
 the simple erlang script. The system hangs when the following erlang
 statement is executed
 
 net_kernel:start([test_node, shortnames])
 
 Usually it will require to run the script several times to hang the system, so
 Sergey provided shell scrip to run the erlang program in loop.
 
 The problem was reproduced by me and Kenji Rikitake on i386 7.x two core
 systems (Sergey observed the hang on multicore amd64).
 
 To observe the hang erlang should be built with option
 -enable-smp-support. Without this option the problem is not observed.
 
 I compiled the kernel with additional debug and witness options hoping the
 system would panic instead of hanging but it didn't. Also I was not able to
 break into debugger during the hang.
 
 But after compiling erlang with -enable-smp-support I have noticed that the
 system panics when running
 
 /usr/local/etc/rc.d/ejabberd stop
 
 Usually I needed to run the command several times to make the system
 crash. The backtrace provided erlier was from such crash. And actually I don't
 know if this crash is related to the initial problem with hang. What does be
 related is that both problems appear when the erlang is built with
 -enable-smp-support.
 
 I have done a bit deeper analysis of the crash and here are results.
 
 At the moment of the crash two threads of beam.smp process look the most
 interesting. One of the threads is sleeping in poll system call waiting read
 data for two descriptors (3 and 8):
 
 #0  sched_switch (td=0xc52d9cc0, newtd=0xc3f12880, flags=1) at /usr/src/sys/kern/sched_4bsd.c:934
 #1  0xc07a25d3 in mi_switch (flags=Variable "flags" is not available.
 ) at /usr/src/sys/kern/kern_synch.c:440
 #2  0xc07cb702 in sleepq_switch (wchan=0xc0cc721c) at /usr/src/sys/kern/subr_sleepqueue.c:497
 #3  0xc07cb99f in sleepq_catch_signals (wchan=0xc0cc721c) at /usr/src/sys/kern/subr_sleepqueue.c:417
 #4  0xc07cc177 in sleepq_timedwait_sig (wchan=0xc0cc721c) at /usr/src/sys/kern/subr_sleepqueue.c:631
 #5  0xc0760f7b in _cv_timedwait_sig (cvp=0xc0cc721c, lock=0xc0cc7204, timo=903) at /usr/src/sys/kern/kern_condvar.c:369
 #6  0xc07d51f4 in poll (td=0xc52d9cc0, uap=0xe6988cfc) at /usr/src/sys/kern/sys_generic.c:962
 #7  0xc0a9f2d3 in syscall (frame=0xe6988d38) at /usr/src/sys/i386/i386/trap.c:1090
 #8  0xc0a85480 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:255
 #9  0x00000033 in ?? ()
 
 
 (kgdb) p uap
 $1 = (struct poll_args *) 0xe6988cfc
 (kgdb) p *uap
 $2 = {fds_l_ = 0xe6988cfc "", fds = 0x28516500, fds_r_ = 0xe6988d00 "\002", nfds_l_ = 0xe6988d00 "\002", nfds = 2, 
   nfds_r_ = 0xe6988d04 "\206\003", timeout_l_ = 0xe6988d04 "\206\003", timeout = 902, timeout_r_ = 0xe6988d08 "\200Z@("}
 
 (kgdb) p bits
 $4 = (struct pollfd *) 0xe6988b74
 (kgdb) p *bits
 $5 = {fd = 3, events = 65, revents = 0}
 (kgdb) p *(bits+1)
 $6 = {fd = 8, events = 65, revents = 0}
 
 65 = 0x40 | 0x1
 
 #define POLLIN          0x0001          /* any readable data available */
 #define POLLRDNORM      0x0040          /* non-OOB/URG data available */
 
 The descriptor 3 is one end of a pipe:
 
 (kgdb)  p *td->td_proc->p_fd.fd_ofiles[3]
 $25 = {f_list = {le_next = 0xc42c9214, le_prev = 0xc4f6db48}, f_type = 3, f_data = 0xc42cc000, f_flag = 7, 
   f_mtxp = 0xc3edb7f0, f_ops = 0xc0c19660, f_cred = 0xc42b0900, f_count = 1, f_vnode = 0x0, f_offset = 0, 
   f_vnread_flags = 0, f_gcflag = 0, f_msgcount = 0, f_seqcount = 0, f_nextoff = 0, f_label = 0x0, f_cdevpriv = 0x0}
 
 (#define DTYPE_PIPE      3       /* pipe */)
 
 While the other descriptor (8) is a socket:
 
 (kgdb)  p *td->td_proc->p_fd.fd_ofiles[8]
 $30 = {f_list = {le_next = 0xc5013098, le_prev = 0xc4d09098}, f_type = 2, f_data = 0xc4ee34e0, f_flag = 7, 
   f_mtxp = 0xc3edb3e8, f_ops = 0xc0c19840, f_cred = 0xc42b0900, f_count = 1, f_vnode = 0x0, f_offset = 0, 
   f_vnread_flags = 0, f_gcflag = 0, f_msgcount = 0, f_seqcount = 0, f_nextoff = 0, f_label = 0x0, f_cdevpriv = 0x0}
 
 (#define DTYPE_SOCKET    2       /* communications endpoint */)
 
 The other thread is the thread that actually causes the panic:
 
 #0  doadump () at pcpu.h:196
 #1  0xc079a07e in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418
 #2  0xc079a352 in panic (fmt=Variable "fmt" is not available.
 ) at /usr/src/sys/kern/kern_shutdown.c:574
 #3  0xc0493a07 in db_panic (addr=Could not find the frame base for "db_panic".
 ) at /usr/src/sys/ddb/db_command.c:446
 #4  0xc049440c in db_command (last_cmdp=0xc0c48114, cmd_table=0x0, dopager=1) at /usr/src/sys/ddb/db_command.c:413
 #5  0xc049451a in db_command_loop () at /usr/src/sys/ddb/db_command.c:466
 #6  0xc0495d0d in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:228
 #7  0xc07c3866 in kdb_trap (type=3, code=0, tf=0xe6985ac8) at /usr/src/sys/kern/subr_kdb.c:524
 #8  0xc0a9fb5b in trap (frame=0xe6985ac8) at /usr/src/sys/i386/i386/trap.c:688
 #9  0xc0a8541b in calltrap () at /usr/src/sys/i386/i386/exception.s:159
 #10 0xc07c39ea in kdb_enter_why (why=0xc0b42038 "panic", msg=0xc0b42038 "panic") at cpufunc.h:60
 #11 0xc079a336 in panic (fmt=0xc0af1d18 "Bad link elm %p next->prev != elm") at /usr/src/sys/kern/kern_shutdown.c:557
 #12 0xc07d4b3b in doselwakeup (sip=0xc42cc064, pri=88) at /usr/src/sys/kern/sys_generic.c:1142
 #13 0xc07d4c1e in selwakeuppri (sip=0xc42cc064, pri=88) at /usr/src/sys/kern/sys_generic.c:1118
 #14 0xc07d9c8e in pipe_write (fp=0xc4f6db48, uio=0xe6985c60, active_cred=0xc42b0900, flags=0, td=0xc52da000)
     at /usr/src/sys/kern/sys_pipe.c:528
 #15 0xc07d6095 in dofilewrite (td=0xc52da000, fd=4, fp=0xc4f6db48, auio=0xe6985c60, offset=-1, flags=0) at file.h:256
 #16 0xc07d6318 in kern_writev (td=0xc52da000, fd=4, auio=0xe6985c60) at /usr/src/sys/kern/sys_generic.c:401
 #17 0xc07d638f in write (td=0xc52da000, uap=0xe6985cfc) at /usr/src/sys/kern/sys_generic.c:317
 #18 0xc0a9f2d3 in syscall (frame=0xe6985d38) at /usr/src/sys/i386/i386/trap.c:1090
 #19 0xc0a85480 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:255
 #20 0x00000033 in ?? ()
 Previous frame inner to this frame (corrupt stack?)
 
 The thread just has written some data to the pipe (fd=4) and called
 selwakeuppri to wake up the thread sleeping in poll but has found selinfo
 inconsistent.
 
 (kgdb) frame 12   
 #12 0xc07d4b3b in doselwakeup (sip=0xc42cc064, pri=88) at /usr/src/sys/kern/sys_generic.c:1142
 1142            TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
 (kgdb) p sip
 $1 = (struct selinfo *) 0xc42cc064
 (kgdb) p *sip
 $2 = {si_thrlist = {tqe_next = 0xc4ee3530, tqe_prev = 0xc52d9ce0}, si_thread = 0xc52d9cc0, si_note = {kl_list = {
       slh_first = 0x0}, kl_lock = 0xc0770070 <knlist_mtx_lock>, kl_unlock = 0xc07700a0 <knlist_mtx_unlock>, 
     kl_locked = 0xc0773640 <knlist_mtx_locked>, kl_lockarg = 0xc42cc170}, si_flags = 0}
 (kgdb) p *sip->si_thrlist.tqe_next
 $3 = {si_thrlist = {tqe_next = 0x0, tqe_prev = 0x0}, si_thread = 0x0, si_note = {kl_list = {slh_first = 0x0}, 
     kl_lock = 0xc0770070 <knlist_mtx_lock>, kl_unlock = 0xc07700a0 <knlist_mtx_unlock>, 
     kl_locked = 0xc0773640 <knlist_mtx_locked>, kl_lockarg = 0xc4ee3554}, si_flags = 0}
 
 Here is how descriptors f_data structures look:
 
 (kgdb) p  *((struct pipe *) td->td_proc->p_fd.fd_ofiles[3].f_data)
 $37 = {pipe_buffer = {cnt = 1, in = 1, out = 0, size = 16384, buffer = 0xe1530000 "!"}, pipe_map = {cnt = 0, pos = 0, 
     npages = 0, ms = {0x0 <repeats 17 times>}}, pipe_sel = {si_thrlist = {tqe_next = 0xc4ee3530, 
       tqe_prev = 0xc52d9ce0}, si_thread = 0xc52d9cc0, si_note = {kl_list = {slh_first = 0x0}, 
       kl_lock = 0xc0770070 <knlist_mtx_lock>, kl_unlock = 0xc07700a0 <knlist_mtx_unlock>, 
       kl_locked = 0xc0773640 <knlist_mtx_locked>, kl_lockarg = 0xc42cc170}, si_flags = 0}, pipe_atime = {
     tv_sec = 1234768224, tv_nsec = 0}, pipe_mtime = {tv_sec = 1234768224, tv_nsec = 0}, pipe_ctime = {
     tv_sec = 1234768224, tv_nsec = 0}, pipe_sigio = 0x0, pipe_peer = 0xc42cc0b8, pipe_pair = 0xc42cc000, 
   pipe_state = 2304, pipe_busy = 0, pipe_present = 1}
 
 
 (kgdb) p  *((struct socket *) td->td_proc->p_fd.fd_ofiles[8].f_data)
 $36 = {so_count = 1, so_type = 1, so_options = 0, so_linger = 0, so_state = 258, so_qstate = 0, so_pcb = 0xc4bb2a8c, 
   so_proto = 0xc0c20fe8, so_head = 0x0, so_incomp = {tqh_first = 0x0, tqh_last = 0xc4ee34fc}, so_comp = {
     tqh_first = 0x0, tqh_last = 0xc4ee3504}, so_list = {tqe_next = 0x0, tqe_prev = 0x0}, so_qlen = 0, so_incqlen = 0, 
   so_qlimit = 0, so_timeo = 0, so_error = 0, so_sigio = 0x0, so_oobmark = 0, so_aiojobq = {tqh_first = 0x0, 
     tqh_last = 0xc4ee3528}, so_rcv = {sb_sel = {si_thrlist = {tqe_next = 0x0, tqe_prev = 0x0}, si_thread = 0x0, 
       si_note = {kl_list = {slh_first = 0x0}, kl_lock = 0xc0770070 <knlist_mtx_lock>, 
         kl_unlock = 0xc07700a0 <knlist_mtx_unlock>, kl_locked = 0xc0773640 <knlist_mtx_locked>, 
         kl_lockarg = 0xc4ee3554}, si_flags = 0}, sb_mtx = {lock_object = {lo_name = 0xc0b472c8 "so_rcv", 
         lo_type = 0xc0b472c8 "so_rcv", lo_flags = 16973824, lo_witness_data = {lod_list = {stqe_next = 0xc0c8a8d0}, 
           lod_witness = 0xc0c8a8d0}}, mtx_lock = 4, mtx_recurse = 0}, sb_sx = {lock_object = {
         lo_name = 0xc0b4aa15 "so_rcv_sx", lo_type = 0xc0b4aa15 "so_rcv_sx", lo_flags = 37421056, lo_witness_data = {
           lod_list = {stqe_next = 0xc0c87ea0}, lod_witness = 0xc0c87ea0}}, sx_lock = 1, sx_recurse = 0}, sb_state = 0, 
     sb_mb = 0x0, sb_mbtail = 0x0, sb_lastrecord = 0x0, sb_sndptr = 0x0, sb_sndptroff = 0, sb_cc = 0, sb_hiwat = 71680, 
     sb_mbcnt = 0, sb_mcnt = 0, sb_ccnt = 0, sb_mbmax = 262144, sb_ctl = 0, sb_lowat = 1, sb_timeo = 0, 
     sb_flags = 2048}, so_snd = {sb_sel = {si_thrlist = {tqe_next = 0x0, tqe_prev = 0x0}, si_thread = 0x0, si_note = {
         kl_list = {slh_first = 0x0}, kl_lock = 0xc0770070 <knlist_mtx_lock>, 
         kl_unlock = 0xc07700a0 <knlist_mtx_unlock>, kl_locked = 0xc0773640 <knlist_mtx_locked>, 
         kl_lockarg = 0xc4ee35e8}, si_flags = 0}, sb_mtx = {lock_object = {lo_name = 0xc0b472c1 "so_snd", 
         lo_type = 0xc0b472c1 "so_snd", lo_flags = 16973824, lo_witness_data = {lod_list = {stqe_next = 0xc0c8a8f8}, 
           lod_witness = 0xc0c8a8f8}}, mtx_lock = 4, mtx_recurse = 0}, sb_sx = {lock_object = {
         lo_name = 0xc0b4aa0b "so_snd_sx", lo_type = 0xc0b4aa0b "so_snd_sx", lo_flags = 37421056, lo_witness_data = {
           lod_list = {stqe_next = 0xc0c87ec8}, lod_witness = 0xc0c87ec8}}, sx_lock = 1, sx_recurse = 0}, sb_state = 0, 
     sb_mb = 0x0, sb_mbtail = 0x0, sb_lastrecord = 0x0, sb_sndptr = 0x0, sb_sndptroff = 0, sb_cc = 0, sb_hiwat = 43008, 
     sb_mbcnt = 0, sb_mcnt = 0, sb_ccnt = 0, sb_mbmax = 262144, sb_ctl = 0, sb_lowat = 2048, sb_timeo = 0, 
     sb_flags = 2048}, so_upcall = 0, so_upcallarg = 0x0, so_cred = 0xc42b0900, so_label = 0x0, so_peerlabel = 0x0, 
   so_gencnt = 850, so_emuldata = 0x0, so_accf = 0x0, so_fibnum = 0}
 
 So 
 
  so_rcv = {sb_sel = {si_thrlist = {tqe_next = 0x0, tqe_prev = 0x0}, si_thread = 0x0 
 
 for the socket descriptor looks wrong. Looking through the kernel code I tried
 to find the path that could lead to this state but failed and at present I
 don't have any idea where I should look.
 
 Is it possible to have the kernel built with QUEUE_MACRO_DEBUG defined?
 Setting it to 1 in sys/queue.h makes impossible even to build the kernel. Also
 I tried to add the line
 
 #define QUEUE_MACRO_DEBUG 1
 
 in kern/sys_generic.c (just before the headers). The kernel has built but I
 get panic during the system boot.
 
 -- 
 Mikolaj Golub



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200903020740.n227e4lA094780>