FreeBSD Mail Archives

Date:      Tue, 10 Sep 2024 04:05:44 +0300
From:      Vadim Goncharov <vadimnuclight@gmail.com>
To:        freebsd-arch@FreeBSD.org, freebsd-hackers@FreeBSD.org, freebsd-net@FreeBSD.org, tcpdump-workers@lists.tcpdump.org, tech-net@NetBSD.org
Cc:        Alexander Nasonov <alnsn@NetBSD.org>
Subject:   BPF64: proposal of platform-independent hardware-friendly backwards-compatible eBPF alternative
Message-ID:  <20240910040544.125245ad@nuclight.lan>

next in thread | raw e-mail | index | archive | help

Hello!

                                         We don't need ELF relocations!
                                           We want better loop control!
                                               No so little parameters,
				        Verifier! Leave our code alone!
                                                          -- Ping Floyd

I've recently had some experience with Linux's ePBF in it's XDP, and this l=
eft
quite negative impression. I was following via https://github.com/xdp-proje=
ct/xdp-tutorial
and after 3rd lesson was trying to create a simple program for searching TCP
timestamp option and incrementing it by one. As you know, eBPF tool stack
consists of at least clang and eBPF verifier in the kernel, and after two d=
ozen
tries eBPF verifier still didn't accept my code. I was digging into verifier
sources, and the abysses opened in front of me! Carefully and boringly going
via disassembler and verifier output, I've found that clang optimizer ignor=
es
just checked register - patching one byte in assembler sources (and target =
.o)
did help. I've filed https://github.com/iovisor/bcc/issues/5062 with details
if one curious.

So, looking at eBPF ecosystem, I must say it's a Frankenstein. Sewn from go=
od,
sometimes brilliant parts, it's a monster in outcome. Verifier is in it's o=
wn
right, compiler/optimizer is in it's own right... But at the end you even
don't have a high-level programming language! You must write in C, relative=
ly
low-level C, and restricted subset of C. This requires very skilled
professionals - it's far from something not even user-friendly, but at least
sysadmin-friendly, like `ipfw` or `iptables` firewall rules.

Thus I looked at the foundation of eBPF architecture, with which presupposi=
tions
in mind it was created with. In fact, it tries to be just usual programming
after checks - that is, with all that pointers. It's too x86-centric and
Linux-centric - number of registers was added just ten. So if you look at t=
he
GitHub ticket above, when I tried to add debug to program - you know, just
specific `printf()`s - it failed verifier checks again because compiler now
had to move some variables between registers and memory, as there is limit =
on
just 5 arguments to call due to limit of 5 registers! And verifier, despite
being more than 20,000 lines of code, still was not smart enough to track i=
nfo
between registers and stack.

So, if we'd started from beginning, what should we do? Remember classic BPF:
it has very simple validator due to it's Virtual Machine design - only forw=
ard
jumps, checks for packet boundaries at runtime, etc. You'd say eBPF tries f=
or
performance if verifier's checks were passed? But in practice you have to t=
oss
in as much packet boundary checks as near to actual access as possible, or
verifier may "forget" it, because of compiler optimizer. So this is not of
much difference for checking if access is after packet in classic BPF - the
same CMP/JUMP in JIT if buffer is linear, and if your OS has put packet in
several buffers, like *BSD or DPDK `mbuf`'s, the runtime check overhead is
negligible in comparison.

Ensuring kernel stability? Just don't allow arbitrary pointers, like origin=
al BPF.
Guaranteed termination time? It's possible if you place some restrictions. =
For
example, don't allow backward jumps but allow function calls - in case of
stack overflow, terminate program. Really need backward jumps? Let's analyze
for what purpose. You'll find these are needed for loops on packet contents.
Solve it but supporting loops in "hardware"-controlled loops, which can't be
infinite.

Finally, platforms. It's beginning of sunset of x86 era now - RISC is comin=
g.
ARM is now not only on mobiles, but on desktops and servers. Moreover, it's
era of specialized hardware accelerators - e.g. GPU, neural processors. Even
general purpose ARM64 has 31 register, and specialized hardware can
implement much more. Then, don't tie to Linux kernel - BPF helpers are very
rigid interface, from ancient era, like syscalls.

So, let's continue *Berkeley* Packet Filter with Berkeley RISC design - hav=
ing
register window idea, updated by SPARC and then by Itanium (to not waste
registers). Take NetBSD's coprocessor functions which set is passed with
a context, instead of hardcoded enums of functions - for example, BPF maps =
is
not something universal, both NetBSD and FreeBSD have their own tables in
firewall.

Add more features actually needed for *network* processor - e.g. 128-bit
registers for IPv6 (eBPF axed out even BPF_MSH!). And do all of this in ful=
ly
backwards-compatible way - new language should allow to run older programs
from e.g. `tcpdump` to run without any modifications, binary-compatible
(again, eBPF does not do this - it's incompatible with classiv BPF and uses
a translator from it).

Next, eBPF took "we are masquerading usual x86 programming" way not only ju=
st
in assembly language. They have very complex ELF infrastructure around it w=
hich
may be not suitable for every network card - having pc-addressed literals, =
as
in RISC processors allows for much simpler format: just BLOB of instruction=
s.
BPF64 adds BPF_LITERAL "instruction" of varying length (it's interpreted by
just skipping over contents as if it was jump), which, if have special
signatures and format, allow for this BLOB of instructions to contain some
metadata about itself for loading, much simpler than ELF (esp. with DWARF).

Then, ecosystem. eBPF defines functions callable from user code like:

> enum bpf_func_id___x { BPF_FUNC_snprintf___x =3D 42 /* avoid zero */ };

That is, ancient syscall-like way of global constant, instead of context. A
"context" here is the structure passed with code to execution which contains
function pointers of what is available to this user code, in spirit of
NetBSD's `bpf_ctx_t` for their BPF_COP/BPF_COPX extensions. This is not only
provides better way than "set in stone" syscall-like number, but BPF64 goes
further and defines an "packages" in running kernel with namespaces to allow
e.g. Foo::Bar::baz() function to call Foo::quux() from another BPF program,
populating ("linking") it's context with needed function without relocation=
s.
These "packages" expected to be available to admin in e.g. sysctl tree, with
descriptions, versioning and other attributes.

Some other quotes about how restricted eBPF is:

> First, a BPF program using bpf_trace_printk() has to have a GPL-compatibl=
e license.
> Another hard limitation is that bpf_trace_printk() can accept only up to =
3 input arguments (in addition to fmt and fmt_size). This is quite often pr=
etty limiting and you might need to use multiple bpf_trace_printk() invocat=
ions to log all the data. This limitation stems from the BPF helpers abilit=
y to accept only up to 5 input arguments in total.
> Previously, bpf_trace_printk() allowed the use of only one string (%s) ar=
gument, which was quite limiting. Linux 5.13 release lifts this restriction=
 and allows multiple string arguments, as long as total formatted output do=
esn't exceed 512 bytes. Another annoying restriction was the lack of suppor=
t for width specifiers, like %10d or %-20s. This restriction is gone now as=
 well

> Helper function bpf_snprintf
> Outputs a string into the str buffer of size str_size based on a format s=
tring stored in a read-only map pointed by fmt.
>
> Each format specifier in fmt corresponds to one u64 element in the data a=
rray. For strings and pointers where pointees are accessed, only the pointe=
r values are stored in the data array. The data_len is the size of data in =
bytes - must be a multiple of 8.
>
> Formats %s and %p{i,I}{4,6} require to read kernel memory. Reading kernel=
 memory may fail due to either invalid address or valid address but requiri=
ng a major memory fault. If reading kernel memory fails, the string for %s =
will be an empty string, and the ip address for %p{i,I}{4,6} will be 0. Not=
 returning error to bpf program is consistent with what bpf_trace_printk() =
does for now.
>
> Returns
>
> The strictly positive length of the formatted string, including the trail=
ing zero character. If the return value is greater than str_size, str conta=
ins a truncated string, guaranteed to be zero-terminated except when str_si=
ze is 0.
>
> Or -EBUSY if the per-CPU memory copy buffer is busy.
>
> static long (* const bpf_snprintf)(char *str, __u32 str_size, const char =
*fmt, __u64 *data, __u32 data_len) =3D (void *) 165;

So. In summary: BPF64 has not only traditional ("firewall on network packet=
s")
area of use but also suitable - and having goals to design in mind - for:

* non-network. e.g. syscall arguments checking
* network protocols passing untrusted code which can be run by other side in
  restricted sandbox (e.g. muSCTP custom (de)compression rules)
* an alternative to https://capnproto.org/rpc.html for multi-method
  calls on high-latency links (e.g. MQTT5-like and/or for environments
  when Cap=E2=80=99n Proto itself is not applicable, e.g. no direct connect=
ions)

I've put a sketch of design to https://github.com/nuclight/bpf64 with files:

* The `nuc_ts_prog_kern.c` (and it's include `nuc_ts_common_kern_user.h`) is
  XDP/eBPF program (for Linux 6.5) for parsing TCP packet and incrementing
  it's Timestamp option, if any, recording statisitics intop eBPF map.
* The `nuc_ts_incr.baw` (and it's include `nuc_ts_incr.bah`) is the equival=
ent
  program doing the same thing, but in a new BPF64 Assembler Wrapper langua=
ge,
  not yet written and subject to change. Note this is a lower-level language
  than C, viewed as intermediate solution until BPF64 becomes stable, after
  which more higher-level language (higher than C) should be written, at le=
ast
  as expressible as `tcpdump` (`libpcap`) one.
* `bpf64spec.md` for draft of Specification, but currently it's more in a
  "a draft of draft" state :-)

I am requesting comments about this architecture before implementing,
especially from people knowledgeable in JIT compilers, because, though while
I see BPF64 much more simpler than eBPF (probably just several man-months to
implement), my sabbatical has ended - and design mistakes are much harder to
fix when implementation already exists.

--=20
WBR, @nuclight

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20240910040544.125245ad>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation