Date: Tue, 10 Sep 2024 04:05:44 +0300 From: Vadim Goncharov <vadimnuclight@gmail.com> To: freebsd-arch@FreeBSD.org, freebsd-hackers@FreeBSD.org, freebsd-net@FreeBSD.org, tcpdump-workers@lists.tcpdump.org, tech-net@NetBSD.org Cc: Alexander Nasonov <alnsn@NetBSD.org> Subject: BPF64: proposal of platform-independent hardware-friendly backwards-compatible eBPF alternative Message-ID: <20240910040544.125245ad@nuclight.lan>
next in thread | raw e-mail | index | archive | help
Hello! We don't need ELF relocations! We want better loop control! No so little parameters, Verifier! Leave our code alone! -- Ping Floyd I've recently had some experience with Linux's ePBF in it's XDP, and this l= eft quite negative impression. I was following via https://github.com/xdp-proje= ct/xdp-tutorial and after 3rd lesson was trying to create a simple program for searching TCP timestamp option and incrementing it by one. As you know, eBPF tool stack consists of at least clang and eBPF verifier in the kernel, and after two d= ozen tries eBPF verifier still didn't accept my code. I was digging into verifier sources, and the abysses opened in front of me! Carefully and boringly going via disassembler and verifier output, I've found that clang optimizer ignor= es just checked register - patching one byte in assembler sources (and target = .o) did help. I've filed https://github.com/iovisor/bcc/issues/5062 with details if one curious. So, looking at eBPF ecosystem, I must say it's a Frankenstein. Sewn from go= od, sometimes brilliant parts, it's a monster in outcome. Verifier is in it's o= wn right, compiler/optimizer is in it's own right... But at the end you even don't have a high-level programming language! You must write in C, relative= ly low-level C, and restricted subset of C. This requires very skilled professionals - it's far from something not even user-friendly, but at least sysadmin-friendly, like `ipfw` or `iptables` firewall rules. Thus I looked at the foundation of eBPF architecture, with which presupposi= tions in mind it was created with. In fact, it tries to be just usual programming after checks - that is, with all that pointers. It's too x86-centric and Linux-centric - number of registers was added just ten. So if you look at t= he GitHub ticket above, when I tried to add debug to program - you know, just specific `printf()`s - it failed verifier checks again because compiler now had to move some variables between registers and memory, as there is limit = on just 5 arguments to call due to limit of 5 registers! And verifier, despite being more than 20,000 lines of code, still was not smart enough to track i= nfo between registers and stack. So, if we'd started from beginning, what should we do? Remember classic BPF: it has very simple validator due to it's Virtual Machine design - only forw= ard jumps, checks for packet boundaries at runtime, etc. You'd say eBPF tries f= or performance if verifier's checks were passed? But in practice you have to t= oss in as much packet boundary checks as near to actual access as possible, or verifier may "forget" it, because of compiler optimizer. So this is not of much difference for checking if access is after packet in classic BPF - the same CMP/JUMP in JIT if buffer is linear, and if your OS has put packet in several buffers, like *BSD or DPDK `mbuf`'s, the runtime check overhead is negligible in comparison. Ensuring kernel stability? Just don't allow arbitrary pointers, like origin= al BPF. Guaranteed termination time? It's possible if you place some restrictions. = For example, don't allow backward jumps but allow function calls - in case of stack overflow, terminate program. Really need backward jumps? Let's analyze for what purpose. You'll find these are needed for loops on packet contents. Solve it but supporting loops in "hardware"-controlled loops, which can't be infinite. Finally, platforms. It's beginning of sunset of x86 era now - RISC is comin= g. ARM is now not only on mobiles, but on desktops and servers. Moreover, it's era of specialized hardware accelerators - e.g. GPU, neural processors. Even general purpose ARM64 has 31 register, and specialized hardware can implement much more. Then, don't tie to Linux kernel - BPF helpers are very rigid interface, from ancient era, like syscalls. So, let's continue *Berkeley* Packet Filter with Berkeley RISC design - hav= ing register window idea, updated by SPARC and then by Itanium (to not waste registers). Take NetBSD's coprocessor functions which set is passed with a context, instead of hardcoded enums of functions - for example, BPF maps = is not something universal, both NetBSD and FreeBSD have their own tables in firewall. Add more features actually needed for *network* processor - e.g. 128-bit registers for IPv6 (eBPF axed out even BPF_MSH!). And do all of this in ful= ly backwards-compatible way - new language should allow to run older programs from e.g. `tcpdump` to run without any modifications, binary-compatible (again, eBPF does not do this - it's incompatible with classiv BPF and uses a translator from it). Next, eBPF took "we are masquerading usual x86 programming" way not only ju= st in assembly language. They have very complex ELF infrastructure around it w= hich may be not suitable for every network card - having pc-addressed literals, = as in RISC processors allows for much simpler format: just BLOB of instruction= s. BPF64 adds BPF_LITERAL "instruction" of varying length (it's interpreted by just skipping over contents as if it was jump), which, if have special signatures and format, allow for this BLOB of instructions to contain some metadata about itself for loading, much simpler than ELF (esp. with DWARF). Then, ecosystem. eBPF defines functions callable from user code like: > enum bpf_func_id___x { BPF_FUNC_snprintf___x =3D 42 /* avoid zero */ }; That is, ancient syscall-like way of global constant, instead of context. A "context" here is the structure passed with code to execution which contains function pointers of what is available to this user code, in spirit of NetBSD's `bpf_ctx_t` for their BPF_COP/BPF_COPX extensions. This is not only provides better way than "set in stone" syscall-like number, but BPF64 goes further and defines an "packages" in running kernel with namespaces to allow e.g. Foo::Bar::baz() function to call Foo::quux() from another BPF program, populating ("linking") it's context with needed function without relocation= s. These "packages" expected to be available to admin in e.g. sysctl tree, with descriptions, versioning and other attributes. Some other quotes about how restricted eBPF is: > First, a BPF program using bpf_trace_printk() has to have a GPL-compatibl= e license. > Another hard limitation is that bpf_trace_printk() can accept only up to = 3 input arguments (in addition to fmt and fmt_size). This is quite often pr= etty limiting and you might need to use multiple bpf_trace_printk() invocat= ions to log all the data. This limitation stems from the BPF helpers abilit= y to accept only up to 5 input arguments in total. > Previously, bpf_trace_printk() allowed the use of only one string (%s) ar= gument, which was quite limiting. Linux 5.13 release lifts this restriction= and allows multiple string arguments, as long as total formatted output do= esn't exceed 512 bytes. Another annoying restriction was the lack of suppor= t for width specifiers, like %10d or %-20s. This restriction is gone now as= well > Helper function bpf_snprintf > Outputs a string into the str buffer of size str_size based on a format s= tring stored in a read-only map pointed by fmt. > > Each format specifier in fmt corresponds to one u64 element in the data a= rray. For strings and pointers where pointees are accessed, only the pointe= r values are stored in the data array. The data_len is the size of data in = bytes - must be a multiple of 8. > > Formats %s and %p{i,I}{4,6} require to read kernel memory. Reading kernel= memory may fail due to either invalid address or valid address but requiri= ng a major memory fault. If reading kernel memory fails, the string for %s = will be an empty string, and the ip address for %p{i,I}{4,6} will be 0. Not= returning error to bpf program is consistent with what bpf_trace_printk() = does for now. > > Returns > > The strictly positive length of the formatted string, including the trail= ing zero character. If the return value is greater than str_size, str conta= ins a truncated string, guaranteed to be zero-terminated except when str_si= ze is 0. > > Or -EBUSY if the per-CPU memory copy buffer is busy. > > static long (* const bpf_snprintf)(char *str, __u32 str_size, const char = *fmt, __u64 *data, __u32 data_len) =3D (void *) 165; So. In summary: BPF64 has not only traditional ("firewall on network packet= s") area of use but also suitable - and having goals to design in mind - for: * non-network. e.g. syscall arguments checking * network protocols passing untrusted code which can be run by other side in restricted sandbox (e.g. muSCTP custom (de)compression rules) * an alternative to https://capnproto.org/rpc.html for multi-method calls on high-latency links (e.g. MQTT5-like and/or for environments when Cap=E2=80=99n Proto itself is not applicable, e.g. no direct connect= ions) I've put a sketch of design to https://github.com/nuclight/bpf64 with files: * The `nuc_ts_prog_kern.c` (and it's include `nuc_ts_common_kern_user.h`) is XDP/eBPF program (for Linux 6.5) for parsing TCP packet and incrementing it's Timestamp option, if any, recording statisitics intop eBPF map. * The `nuc_ts_incr.baw` (and it's include `nuc_ts_incr.bah`) is the equival= ent program doing the same thing, but in a new BPF64 Assembler Wrapper langua= ge, not yet written and subject to change. Note this is a lower-level language than C, viewed as intermediate solution until BPF64 becomes stable, after which more higher-level language (higher than C) should be written, at le= ast as expressible as `tcpdump` (`libpcap`) one. * `bpf64spec.md` for draft of Specification, but currently it's more in a "a draft of draft" state :-) I am requesting comments about this architecture before implementing, especially from people knowledgeable in JIT compilers, because, though while I see BPF64 much more simpler than eBPF (probably just several man-months to implement), my sabbatical has ended - and design mistakes are much harder to fix when implementation already exists. --=20 WBR, @nuclight
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20240910040544.125245ad>