Date: Thu, 6 Feb 2025 11:58:07 -0800
From: Gregory Shapiro
To: freebsd-net@freebsd.org
Subject: bird2 netlink switch causing mbuf exhaustion and reboot
Message-ID: <555hzkf4hdqr3d357bs4awilg2qpolfeyc245qzzutamnzwqwk@jfjb5htjuwcq>
List-Id: Networking and TCP/IP with FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-net

I have a handful of FreeBSD 14.2 machines acting as BGP routers using bird2. Three of them would drop BGP connections at least once a week; the others would either stay stable or drop off sporadically. Through a painful root-cause exercise, I've narrowed the cause to the bird2 port changing from using a routing socket to netlink. On the routers which failed most often, I've switched from bird2 to bird2-rtsock and the problem has disappeared.

I don't know the rtsock/netlink source well enough to debug deeper, but I'm more than happy to go back to the netlink version to help debug if anyone is interested in tackling the problem. I'll give some details on the router that failed most often to whet your appetite.

The problem starts with this appearing in the logs:

Jan 29 03:42:08 pesto kernel: [zone: mbuf] kern.ipc.nmbufs limit reached

After that, these appear every 30-40 seconds until I intervene:

Jan 29 03:51:35 pesto kernel: sonewconn: pcb 0xfffff8002c198540 ([::]:179 (proto 6)): Listen queue overflow: 13 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 502, jail 0
Jan 29 03:52:38 pesto kernel: sonewconn: pcb 0xfffff8002c198540 ([::]:179 (proto 6)): Listen queue overflow: 13 already in queue awaiting acceptance (17 occurrences), euid 0, rgid 502, jail 0

After a while, IPv4 connections start appearing too:

Jan 29 04:49:28 pesto kernel: sonewconn: pcb 0xfffff8002c198000 (0.0.0.0:179 (proto 6)): Listen queue overflow: 13 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 502, jail 0
Jan 29 04:49:48 pesto kernel: sonewconn: pcb 0xfffff8002c198540 ([::]:179 (proto 6)): Listen queue overflow: 13 already in queue awaiting acceptance (36 occurrences), euid 0, rgid 502, jail 0

When I intervene, connections to bird2 (via birdc control) hang, as do attempts to shut it down (/usr/local/etc/rc.d/bird stop). If I try to look at the routing table (netstat -rnf inet6), it starts to list and then the system reboots itself (likely a kernel crash, but nothing is logged between the sonewconn messages and the boot). After the reboot, things are good for a few days.

My naive guess is that the netlink code leaks mbufs and that, in this state, listing the routing table with netstat triggers a kernel crash. If someone is interested in debugging, then besides switching back to the netlink build of bird2 I can also run a debug kernel to catch the stack trace.
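
For anyone who wants to build a timeline from messages like the ones quoted above, here is a minimal sketch that extracts the timestamp, listening address and occurrence count from each sonewconn line. The regular expression and the function names are illustrative assumptions, not part of any FreeBSD tool, and the total assumes each "(N occurrences)" count covers overflows since the previous message.

```python
import re

# Matches the "Listen queue overflow" kernel message in the form quoted in
# this report. The pattern and all names below are illustrative assumptions.
SONEWCONN_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"sonewconn: pcb (?P<pcb>0x[0-9a-f]+) "
    r"\((?P<listen>.+?) \(proto (?P<proto>\d+)\)\): "
    r"Listen queue overflow: (?P<queued>\d+) already in queue awaiting acceptance "
    r"\((?P<occurrences>\d+) occurrences\)"
)


def parse_sonewconn(line):
    """Return a dict of fields for a sonewconn overflow line, else None."""
    m = SONEWCONN_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("proto", "queued", "occurrences"):
        rec[key] = int(rec[key])
    return rec


def total_overflows(lines):
    """Sum the reported occurrence counts over all matching lines."""
    records = (parse_sonewconn(line) for line in lines)
    return sum(r["occurrences"] for r in records if r is not None)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python3 sonewconn.py < /var/log/messages
    for raw in sys.stdin:
        rec = parse_sonewconn(raw)
        if rec:
            print(rec["stamp"], rec["listen"], rec["occurrences"])
```

Grouping the output by listening address would show when the IPv4 listener joins the IPv6 one in overflowing.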

System details:

FreeBSD pesto.gshapiro.net 14.2-RELEASE FreeBSD 14.2-RELEASE releng/14.2-n269506-c8918d6c7412 GENERIC amd64
bird2-rtsock-2.16.1 (which is stable; bird2-2.16.1, which uses netlink instead of rtsock, causes the issue)

Since this is a router, it doesn't do anything except routing (bird2), networking (tailscale, vxlan, wireguard) and OS daemons (syslog, ntpd for log accuracy, cron, devd, getty for console).

Only sysctl tuning done:

net.fibs=2
net.add_addr_allfibs=1

As for system configuration, here is my rc.conf (happy to share redacted values with those working on this issue):

accounting_enable="YES"
bird_enable="YES"
cloned_interfaces="lo1 vxlan0 vxlan1"
create_args_vxlan0="vxlanid _redacted_ vxlanlocal _redacted_ vxlanremote _redacted_ vxlanttl 255 tunnelfib 1"
create_args_vxlan1="vxlanid _redacted_ vxlanlocal _redacted_ vxlanremote _redacted_ vxlanttl 255 tunnelfib 1"
defaultrouter="_redacted_"
defaultrouter_fib1="${defaultrouter}"
devd_flags="-q"
firewall_enable="YES"
firewall_logging="YES"
firewall_quiet="YES"
firewall_type="/etc/ipfw.rules"
fsck_y_enable="YES"
gateway_enable="YES"
hostname="pesto.gshapiro.net"
ifconfig_lo1_descr="Announced"
ifconfig_lo1_ipv6="inet6 _redacted_ prefer_source"
ifconfig_lo1_alias0="inet6 _redacted_"
ifconfig_vtnet0="inet _redacted_"
ifconfig_vtnet0_descr="HYEHOST Uplink"
ifconfig_vtnet0_ipv6="inet6 _redacted_ no_prefer_iface"
ifconfig_vtnet1_descr="F4IX"
ifconfig_vtnet1_ipv6="inet6 _redacted_ no_prefer_iface"
ifconfig_vxlan0="inet _redacted_"
ifconfig_vxlan0_descr="BGPExchange"
ifconfig_vxlan0_ipv6="inet6 _redacted_ no_prefer_iface"
ifconfig_vxlan1_descr="HNIX BGP Tunnel"
ifconfig_vxlan1_ipv6="inet6 _redacted_ no_prefer_iface"
ipv6_defaultrouter="_redacted_"
ipv6_defaultrouter_fib1="${ipv6_defaultrouter}"
ipv6_gateway_enable="YES"
ipv6_route_bgp6="_redacted_ -iface vtnet0"
ipv6_route_gw6="_redacted_ -iface vtnet0 -fib 0,1"
ipv6_static_routes="gw6 bgp6"
ntpd_enable="YES"
ntpd_sync_on_start="YES"
qemu_guest_agent_enable="YES"
qemu_guest_agent_flags="-d -v -l /var/log/qemu-ga.log"
sshd_enable="YES"
syslogd_flags="-ss"
tailscaled_enable="YES"
tailscaled_exitnode_enable="YES"
tailscaled_up_args="--accept-dns=false --timeout=30s"
wireguard_enable="YES"
wireguard_interfaces="wg0"

And wg0.conf:

# Route64
[Interface]
PrivateKey = _redacted_
Address = _redacted_, _redacted_
Table = off
PostUp = /sbin/ifconfig %i description "Route64 Upstream"
PostUp = /sbin/ifconfig %i inet6 auto_linklocal
PostUp = /sbin/ifconfig %i inet6 no_prefer_iface
PostUp = /sbin/ifconfig %i tunnelfib 1

[Peer]
PublicKey = _redacted_
AllowedIPs = ::/0, 0.0.0.0/0
Endpoint = _redacted_
PersistentKeepAlive = 30