Date:      Fri, 23 Sep 2016 10:10:04 +0000
From:      bugzilla-noreply@freebsd.org
To:        freebsd-amd64@FreeBSD.org
Subject:   [Bug 212920] Li loaded web server cath race condition on _close () from /lib/libc.so.7 with accf_http
Message-ID:  <bug-212920-6@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212920

            Bug ID: 212920
           Summary: Li loaded web server cath race condition on _close ()
                    from /lib/libc.so.7 with accf_http
           Product: Base System
           Version: 10.3-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs@FreeBSD.org
          Reporter: fbsd98816551@avksrv.org
                CC: freebsd-amd64@FreeBSD.org

Hello!

Recently we upgraded our heavily loaded web server to FreeBSD 10.3-STABLE
r305091 and ran into a problem with nginx (nginx-1.10.1_2,2, compiled from the
latest ports with mostly default settings). After some time one worker stopped
answering requests, and top showed it in state soclos:
 1072 nobody           1  22    0  1698M 65680K soclos  5   0:13   0.00% nginx

After a short while the next worker stops in the same state, and so on until
all workers are in "soclos" and the web server stops serving requests (it still
accepts connections, which die on timeout after the client has sent a request).
Increasing the worker count only postpones the problem by about half an hour.

Restarting nginx fixes it, but only for a short time. The server is fairly
heavily loaded, at 1000-2000 requests/sec. It is a frontend proxy with
proxy_cache functionality. We tried 2 different physical servers with
different NICs and CPUs. When we rolled the kernel back to r302223 (only the
kernel and modules in /boot/kernel, not world), the problem went away.

We tried upgrading to yesterday's r306194. The problem is still there.
Something changed in the kernel code between the end of June and the end of
August that causes the problem.

Backtrace from nginx while it is in "soclos":

#0  0x0000000801a17d28 in _close () from /lib/libc.so.7
#1  0x000000080098a925 in pthread_suspend_all_np () from /lib/libthr.so.3
#2  0x00000000004329b9 in ngx_close_connection (c=0x869c1de70) at src/core/ngx_connection.c:1169
#3  0x0000000000486370 in ngx_http_close_connection (c=0x869c1de70) at src/http/ngx_http_request.c:3543
#4  0x0000000000488e86 in ngx_http_close_request (r=0x80244c050, rc=408) at src/http/ngx_http_request.c:3406
#5  0x000000000048d9ed in ngx_http_process_request_headers (rev=0x807810b70) at src/http/ngx_http_request.c:1202
#6  0x000000000044fdbd in ngx_event_expire_timers () at src/event/ngx_event_timer.c:94
#7  0x000000000044e60f in ngx_process_events_and_timers (cycle=0x802488050) at src/event/ngx_event.c:256
#8  0x000000000045f406 in ngx_worker_process_cycle (cycle=0x802488050, data=0xa) at src/os/unix/ngx_process_cycle.c:753
#9  0x000000000045ae7c in ngx_spawn_process (cycle=0x802488050, proc=0x45f2f0 <ngx_worker_process_cycle>, data=0xa, name=0x53ecea "worker process", respawn=-3) at src/os/unix/ngx_process.c:198
#10 0x000000000045cc89 in ngx_start_worker_processes (cycle=0x802488050, n=16, type=-3) at src/os/unix/ngx_process_cycle.c:358
#11 0x000000000045c486 in ngx_master_process_cycle (cycle=0x802488050) at src/os/unix/ngx_process_cycle.c:130
#12 0x0000000000413288 in main (argc=1, argv=0x7fffffffead0) at src/core/nginx.c:367

(gdb) list src/core/ngx_connection.c:1169
1164
1165        if (c->shared) {
1166            return;
1167        }
1168
1169        if (ngx_close_socket(fd) == -1) { <<<<<<<<
1170
1171            err = ngx_socket_errno;
1172
1173            if (err == NGX_ECONNRESET || err == NGX_ENOTCONN) {

which actually calls close(fd):
#define ngx_close_socket    close

All TCP sessions opened by the worker are frozen in their current state.

If we do not load accf_http and do not use it in the nginx config, the problem
does not reproduce with any of the 3 tested kernels.
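
For readers who want to reproduce this outside nginx: on FreeBSD an accept
filter is attached to a listening socket with setsockopt(SO_ACCEPTFILTER), and
the filter provided by accf_http is named "httpready". The sketch below is an
illustration only, not nginx's code; attach_accept_filter is a hypothetical
helper name, and returning ENOSYS on systems without SO_ACCEPTFILTER is an
assumption made so the example compiles elsewhere.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: attach the named accept filter (e.g. "httpready"
 * from accf_http) to a socket that is already listening.
 * Returns 0 on success, -1 with errno set on failure.
 */
int attach_accept_filter(int listen_fd, const char *name)
{
    assert(name != NULL);

#ifdef SO_ACCEPTFILTER
    struct accept_filter_arg afa;

    /* Zero the struct so af_name stays NUL-terminated after strncpy. */
    memset(&afa, 0, sizeof(afa));
    strncpy(afa.af_name, name, sizeof(afa.af_name) - 1);

    return setsockopt(listen_fd, SOL_SOCKET, SO_ACCEPTFILTER,
                      &afa, sizeof(afa));
#else
    /* Assumption for portability: no accept filters on this platform. */
    (void)listen_fd;
    errno = ENOSYS;
    return -1;
#endif
}
```

Calling attach_accept_filter(fd, "httpready") after listen() requires the
accf_http module to be loaded. A test program that accepts such connections
and closes them on a timeout, with and without the filter, might help narrow
down whether the hang depends on the filter alone.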

The kernel is GENERIC, with only the extra modules accf_http, ipmi, smbus,
mfip, ums, zfs and opensolaris loaded.

Since accf_http has been beneficial for our server, we cannot simply disable
the module in our production environment.

I'll debug further, but since I'm not a good C programmer it will take some
time. If someone knows what changed in the related functions, it may be
faster to check from that side.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


