Date:      Wed, 02 Oct 2024 23:17:37 +0000
From:      bugzilla-noreply@freebsd.org
To:        net@FreeBSD.org
Subject:   [Bug 281560] gve (4) uma deadlock during high tcp throughput
Message-ID:  <bug-281560-7501-XoYNK7EcKd@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-281560-7501@https.bugs.freebsd.org/bugzilla/>
References:  <bug-281560-7501@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281560

--- Comment #18 from shailend@google.com ---
(In reply to Konstantin Belousov from comment #14)

Although I do not have access to the VMs to do `show pcpu`, I checked my notes
to find this `ps` entry:

100438                   Run     CPU 11                      [gve0 txq 4 xmit]

The packet transmitting thread is hogging the cpu and preventing iperf from
ever running to release the uma lock. The "gve0 txq 4 xmit" thread runs forever
because it is waiting on the tx cleanup thread to make room on the ring, and
that thread is not doing anything because it is waiting on the uma zone lock.

I did another repro, and the situation is similar:

```
db> show lockchain 100416
thread 100416 (pid 0, gve0 rxq 0) is blocked on lock 0xfffffe00df57a3d0 (sleep mutex) "mbuf"
thread 100708 (pid 860, iperf) is on a run queue
db> show lockchain 100423
thread 100423 (pid 0, gve0 rxq 7) is blocked on lock 0xfffff8010447daa0 (rw) "tcpinp"
thread 100736 (pid 860, iperf) is blocked on lock 0xfffffe00df57a3d0 (sleep mutex) "mbuf"
thread 100708 (pid 860, iperf) is on a run queue
db> show lockchain 100452
thread 100452 (pid 0, gve0 txq 10) is blocked on lock 0xfffffe00df57a3d0 (sleep mutex) "mbuf"
thread 100708 (pid 860, iperf) is on a run queue
```

Here 100708 is the offending iperf thread. Let's see its state:

```
db> show thread 100708
Thread 100708 at 0xfffff800a86bd000:
 proc (pid 860): 0xfffffe01a439bac0
 name: iperf
 pcb: 0xfffff800a86bd520
 stack: 0xfffffe01a4dc1000-0xfffffe01a4dc4fff
 flags: 0x5  pflags: 0x100
 state: RUNQ
 priority: 4
 container lock: sched lock 31 (0xfffffe001bee8440)
 last voluntary switch: 11510.470 s ago
 last involuntary switch: 11510.470 s ago
```

And now let's see what's happening on cpu 31:

```
db> show pcpu 31
cpuid        = 31
dynamic pcpu = 0xfffffe009a579d80
curthread    = 0xfffff800a8501740: pid 0 tid 100453 critnest 0 "gve0 txq 10 xmit"
curpcb       =3D 0xfffff800a8501c60
fpcurthread  =3D none
idlethread   =3D 0xfffff80003b04000: tid 100034 "idle: cpu31"
self         =3D 0xffffffff8242f000
curpmap      =3D 0xffffffff81b79c50
tssp         =3D 0xffffffff8242f384
rsp0         =3D 0xfffffe01a4ca8000
kcr3         =3D 0xffffffffffffffff
ucr3         =3D 0xffffffffffffffff
scr3         =3D 0x0
gs32p        =3D 0xffffffff8242f404
ldt          =3D 0xffffffff8242f444
tss          =3D 0xffffffff8242f434
curvnet      =3D 0
spin locks held:
```

Sure enough, a driver transmit thread is hogging the cpu. And to close the loop,
let's see what this queue's cleanup thread is doing:

```
db> show lockchain 100452
thread 100452 (pid 0, gve0 txq 10) is blocked on lock 0xfffffe00df57a3d0 (sleep mutex) "mbuf"
thread 100708 (pid 860, iperf) is on a run queue
```

In summary, this is the usual loop:

iperf thread (with uma zone lock) ---sched---> gve tx xmit thread ---for
room---> gve tx cleanup thread ---uma zone lock---> iperf thread

There is clearly problematic behavior in the driver transmit thread
(gve_xmit_br): this taskqueue should not re-enqueue itself when the ring is
full, and should instead let the cleanup taskqueue wake it up once room is made
in the ring, so I'll work on that.
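
To illustrate the wake-up approach described above, here is a minimal userspace
sketch using pthreads. It is not gve code and not the eventual fix: the names
(`struct tx_ring`, `xmit_thread`, `cleanup_thread`, `run_ring_demo`,
`RING_SLOTS`) are hypothetical, and a condition variable stands in for the
cleanup taskqueue waking the transmit taskqueue. The point is only that the
transmit side sleeps when the ring is full instead of spinning or re-enqueueing
itself, so it cannot starve other runnable threads on its cpu.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define RING_SLOTS 8            /* hypothetical ring size */

struct tx_ring {
    pthread_mutex_t lock;
    pthread_cond_t  room;       /* signalled by cleanup when a slot frees up */
    pthread_cond_t  pending;    /* signalled by xmit when a slot is filled */
    size_t in_flight;           /* descriptors posted but not yet cleaned */
    size_t to_send;             /* total packets xmit will post */
    size_t completed;           /* packets reclaimed by cleanup */
};

/* Transmit side: sleeps while the ring is full rather than busy-retrying. */
static void *xmit_thread(void *arg)
{
    struct tx_ring *r = arg;

    for (size_t i = 0; i < r->to_send; i++) {
        pthread_mutex_lock(&r->lock);
        while (r->in_flight == RING_SLOTS)
            pthread_cond_wait(&r->room, &r->lock);  /* gives up the cpu */
        r->in_flight++;
        pthread_cond_signal(&r->pending);
        pthread_mutex_unlock(&r->lock);
    }
    return NULL;
}

/* Cleanup side: reclaims one slot at a time and wakes the transmit side. */
static void *cleanup_thread(void *arg)
{
    struct tx_ring *r = arg;

    pthread_mutex_lock(&r->lock);
    while (r->completed < r->to_send) {
        while (r->in_flight == 0)
            pthread_cond_wait(&r->pending, &r->lock);
        r->in_flight--;
        r->completed++;
        pthread_cond_signal(&r->room);              /* room was made */
    }
    pthread_mutex_unlock(&r->lock);
    return NULL;
}

/* Pushes `packets` through the ring and returns how many were reclaimed. */
size_t run_ring_demo(size_t packets)
{
    struct tx_ring r = { .in_flight = 0, .to_send = packets, .completed = 0 };
    pthread_t tx, cl;

    pthread_mutex_init(&r.lock, NULL);
    pthread_cond_init(&r.room, NULL);
    pthread_cond_init(&r.pending, NULL);

    pthread_create(&cl, NULL, cleanup_thread, &r);
    pthread_create(&tx, NULL, xmit_thread, &r);
    pthread_join(tx, NULL);
    pthread_join(cl, NULL);

    pthread_cond_destroy(&r.pending);
    pthread_cond_destroy(&r.room);
    pthread_mutex_destroy(&r.lock);
    return r.completed;
}
```

Calling `run_ring_demo(n)` from a `main()` pushes `n` packets through the
8-slot ring; the transmit thread blocks whenever all slots are in flight and
resumes only after the cleanup thread signals `room`.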

But I also want to confirm that it is not problematic for an iperf thread to be
knocked off the cpu with the zone lock held: is it not a critical enough lock
to disallow that? (I am not familiar enough with schedulers to know if this is
a naive question.)

-- 
You are receiving this mail because:
You are the assignee for the bug.
