Date: Fri, 23 Sep 2016 10:10:04 +0000
From: bugzilla-noreply@freebsd.org
To: freebsd-amd64@FreeBSD.org
Subject: [Bug 212920] Li loaded web server cath race condition on _close () from /lib/libc.so.7 with accf_http
Message-ID: <bug-212920-6@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212920

            Bug ID: 212920
           Summary: Li loaded web server cath race condition on _close ()
                    from /lib/libc.so.7 with accf_http
           Product: Base System
           Version: 10.3-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs@FreeBSD.org
          Reporter: fbsd98816551@avksrv.org
                CC: freebsd-amd64@FreeBSD.org

Hello!

Recently we upgraded our heavily loaded web server to FreeBSD 10.3-STABLE
r305091 and ran into a problem with nginx (nginx-1.10.1_2,2, compiled from the
latest ports with mostly default settings). After some time one worker stopped
answering requests, and top shows it in state "soclos":

 1072 nobody           1  22    0  1698M 65680K soclos  5   0:13   0.00% nginx

After a short while the next worker stops in the same state, and so on until
all workers are in "soclos" and the web server stops serving requests (it still
accepts connections, which die on timeout after the client has sent a request).
Increasing the worker count only postpones the problem by about half an hour.
Restarting nginx fixes it, but only for a short time.

The server is fairly heavily loaded, at 1000-2000 requests/sec. It is a
frontend proxy using the proxy_cache functionality. We tried two different
physical servers with different NICs and CPUs.

When we reverted the kernel (only the kernel and modules in /boot/kernel, not
world) to r302223, the problem went away. We also tried upgrading to
yesterday's r306194; the problem is still there.
Something changed in the kernel code between the end of June and the end of
August that causes the problem.

Backtrace from nginx while it is in "soclos":

#0  0x0000000801a17d28 in _close () from /lib/libc.so.7
#1  0x000000080098a925 in pthread_suspend_all_np () from /lib/libthr.so.3
#2  0x00000000004329b9 in ngx_close_connection (c=0x869c1de70)
    at src/core/ngx_connection.c:1169
#3  0x0000000000486370 in ngx_http_close_connection (c=0x869c1de70)
    at src/http/ngx_http_request.c:3543
#4  0x0000000000488e86 in ngx_http_close_request (r=0x80244c050, rc=408)
    at src/http/ngx_http_request.c:3406
#5  0x000000000048d9ed in ngx_http_process_request_headers (rev=0x807810b70)
    at src/http/ngx_http_request.c:1202
#6  0x000000000044fdbd in ngx_event_expire_timers ()
    at src/event/ngx_event_timer.c:94
#7  0x000000000044e60f in ngx_process_events_and_timers (cycle=0x802488050)
    at src/event/ngx_event.c:256
#8  0x000000000045f406 in ngx_worker_process_cycle (cycle=0x802488050,
    data=0xa) at src/os/unix/ngx_process_cycle.c:753
#9  0x000000000045ae7c in ngx_spawn_process (cycle=0x802488050,
    proc=0x45f2f0 <ngx_worker_process_cycle>, data=0xa,
    name=0x53ecea "worker process", respawn=-3)
    at src/os/unix/ngx_process.c:198
#10 0x000000000045cc89 in ngx_start_worker_processes (cycle=0x802488050,
    n=16, type=-3) at src/os/unix/ngx_process_cycle.c:358
#11 0x000000000045c486 in ngx_master_process_cycle (cycle=0x802488050)
    at src/os/unix/ngx_process_cycle.c:130
#12 0x0000000000413288 in main (argc=1, argv=0x7fffffffead0)
    at src/core/nginx.c:367

(gdb) list src/core/ngx_connection.c:1169
1164
1165        if (c->shared) {
1166            return;
1167        }
1168
1169        if (ngx_close_socket(fd) == -1) {       <<<<<<<<
1170
1171            err = ngx_socket_errno;
1172
1173            if (err == NGX_ECONNRESET || err == NGX_ENOTCONN) {

which actually calls close(fd):

#define ngx_close_socket    close

All TCP sessions opened by the worker are frozen in their current state.
At the same time, if we do not load accf_http and do not use it in the nginx
config, the problem does not reproduce with any of the 3 tested kernels.

The kernel is GENERIC, with only the extra modules accf_http, ipmi, smbus,
mfip, ums, zfs and opensolaris loaded.

Since accf_http does our server some good, we cannot simply disable the module
in the production environment.

I will debug further, but since I am not a good C programmer, it will take some
time. If someone knows what changed in the related functions, it may be faster
to check from that side.

-- 
You are receiving this mail because:
You are on the CC list for the bug.