BPF, or the Berkeley Packet Filter mechanism, was first introduced in Linux in 1997 in version 2.1.75. It has seen a number of extensions over the years. Recently, in versions 3.15 - 3.19, it received a major overhaul which drastically expanded its applicability. This talk will cover how the instruction set looks today and why, along with its architecture, capabilities, interface, and just-in-time compilers. We will also talk about how it's being used in different areas of the kernel, like tracing and networking, and about future plans.
6. 1. extended BPF in networking
• socket filters
• four use cases of bpf in openvswitch (bpf+ovs)
  • bpf as an action on flow-hit
  • bpf as fallback on flow-miss
  • bpf as packet parser before flow lookup
  • bpf to completely replace ovs datapath
• two use cases in traffic control (bpf+tc)
  • cls: packet parser and classifier
  • act: action
• bpf as net_device
7. 2. extended BPF in tracing
• bpf+kprobe: dtrace/systemtap-like
• bpf+syscalls: analytics and monitoring
• bpf+tracepoints: faster alternative to kprobes
• TCP stack instrumentation with bpf+tracepoints as a non-intrusive alternative to web10g
• disk latency monitoring
• live kernel debugging (with and without debug info)
8. 3. extended BPF for in-kernel optimizations
• kernel interface is kept unmodified; subsystems use bpf to accelerate internal execution
• predicate tree walker of tracing filters -> bpf
• nft (netfilter tables) -> bpf
9. 4. extended BPF for HW modeling
• p4: language for programming flexible network switches
• p4 compiler into bpf (userspace)
• pass bpf into kernel via switchdev abstraction
• rocker device (part of qemu) to execute bpf
10. 5. other crazy uses of BPF
• 'reverse BPF' was proposed
  • in-kernel NIC drivers expose BPF back to user space as a generic program to construct hw-specific data structures
• bpf -> NPUs
  • some networking HW vendors are planning to translate bpf directly to HW
11. classic BPF
• BPF: Berkeley Packet Filter
• inspired by BSD
• introduced in Linux in 1997 in version 2.1.75
• initially used as socket filter by packet capture tool tcpdump (via libpcap)
12. classic BPF
• two 32-bit registers: A, X
• implicit stack of 16 32-bit slots (LD_MEM, ST_MEM insns)
• full integer arithmetic
• explicit load/store from packet (LD_ABS, LD_IND insns)
• conditional branches (with two destinations: jump true/false)
13. Ex: tcpdump syntax and classic BPF assembler
• tcpdump -d 'ip and tcp port 22'
(000) ldh  [12]                    // fetch eth proto
(001) jeq  #0x800   jt 2   jf 12   // is it IPv4 ?
(002) ldb  [23]                    // fetch ip proto
(003) jeq  #0x6     jt 4   jf 12   // is it TCP ?
(004) ldh  [20]                    // fetch frag_off
(005) jset #0x1fff  jt 12  jf 6    // is it a frag?
(006) ldxb 4*([14]&0xf)            // fetch ip header len
(007) ldh  [x + 14]                // fetch src port
(008) jeq  #0x16    jt 11  jf 9    // is it 22 ?
(009) ldh  [x + 16]                // fetch dest port
(010) jeq  #0x16    jt 11  jf 12   // is it 22 ?
(011) ret  #65535                  // trim packet and pass
(012) ret  #0                      // ignore packet
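To make the listing above concrete, here is a minimal sketch of how such a filter could be evaluated against a raw Ethernet frame. It is a toy interpreter written for illustration only: `struct insn`, `enum op`, `ssh_filter` and `run_filter` are hypothetical names, not the kernel's `struct sock_filter` or its interpreter. It also uses the absolute jump targets that `tcpdump -d` prints, whereas the real encoding stores relative offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy instruction format for this sketch only. */
enum op { LDH_ABS, LDB_ABS, LDH_IND, LDXB_MSH, JEQ, JSET, RET };

struct insn {
    enum op  op;
    uint8_t  jt, jf;  /* absolute jump targets, as printed by tcpdump -d */
    uint32_t k;       /* packet offset or immediate value */
};

/* 'ip and tcp port 22', transcribed from the listing above. */
static const struct insn ssh_filter[] = {
    /* 000 */ { LDH_ABS,   0,  0, 12 },
    /* 001 */ { JEQ,       2, 12, 0x800 },
    /* 002 */ { LDB_ABS,   0,  0, 23 },
    /* 003 */ { JEQ,       4, 12, 0x6 },
    /* 004 */ { LDH_ABS,   0,  0, 20 },
    /* 005 */ { JSET,     12,  6, 0x1fff },
    /* 006 */ { LDXB_MSH,  0,  0, 14 },
    /* 007 */ { LDH_IND,   0,  0, 14 },
    /* 008 */ { JEQ,      11,  9, 0x16 },
    /* 009 */ { LDH_IND,   0,  0, 16 },
    /* 010 */ { JEQ,      11, 12, 0x16 },
    /* 011 */ { RET,       0,  0, 65535 },
    /* 012 */ { RET,       0,  0, 0 },
};

/* Returns the number of bytes to keep; 0 means "ignore packet".
 * Any load outside the packet drops it (dynamic boundary check). */
static uint32_t run_filter(const struct insn *prog, size_t len,
                           const uint8_t *pkt, size_t pkt_len)
{
    uint32_t A = 0, X = 0;  /* the two classic BPF registers */
    size_t pc = 0;

    while (pc < len) {
        const struct insn *i = &prog[pc];
        size_t off;

        switch (i->op) {
        case LDH_ABS:
        case LDH_IND:
            off = (size_t)i->k + (i->op == LDH_IND ? X : 0);
            if (off + 2 > pkt_len)
                return 0;
            A = (uint32_t)pkt[off] << 8 | pkt[off + 1];  /* network order */
            pc++;
            break;
        case LDB_ABS:
            if (i->k >= pkt_len)
                return 0;
            A = pkt[i->k];
            pc++;
            break;
        case LDXB_MSH:
            if (i->k >= pkt_len)
                return 0;
            X = 4 * (pkt[i->k] & 0xf);  /* IP header length in bytes */
            pc++;
            break;
        case JEQ:
            pc = (A == i->k) ? i->jt : i->jf;
            break;
        case JSET:
            pc = (A & i->k) ? i->jt : i->jf;
            break;
        case RET:
            return i->k;
        }
    }
    return 0;  /* fell off the end: treat as drop */
}
```

Calling `run_filter(ssh_filter, 13, frame, frame_len)` on an unfragmented IPv4/TCP frame whose source or destination port is 22 follows the path 000 -> ... -> 011 and returns 65535; any other frame ends at 012 and returns 0.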
14. Classic BPF use cases
• socket filters (drop or trim packet and pass to user space)
  • used by tcpdump/libpcap, wireshark, nmap, dhcp, arpd, ...
• in networking subsystems
  • cls_bpf (TC classifier), xt_bpf, ppp, team, ...
• seccomp (chrome sandboxing)
  • introduced in 2012 to filter syscall arguments with a bpf program
15. Classic BPF safety
• verifier checks all instructions: forward jumps only, stack slot load/store, etc.
• instruction set has some built-in safety (no exposed stack pointer; instead the load instruction has a "mem" modifier)
• dynamic packet-boundary checks
16. Classic BPF extensions
• over the years multiple extensions were added in the form of "load from negative hard-coded offset"
• LD_ABS -0x1000: skb->protocol
  LD_ABS -0x1000+4: skb->pkt_type
  LD_ABS -0x1000+56: get_random()
17. Extended BPF
• design goals:
  • parse, lookup, update, modify network packets
  • loadable as kernel modules on demand, on live traffic
  • safe on production system
  • performance equal to native x86 code
  • fast interpreter speed (good performance on all architectures)
  • calls into bpf and calls from bpf to kernel should be free (no FFI overhead)
20. Early prototypes
• Failed approach #1 (design a VM from scratch)
  • performance was too slow; user tools would need to be developed from scratch as well
• Failed approach #2 (have kernel disassemble and verify x86 instructions)
  • too many instruction combinations; disasm/verifier would need to be rewritten for every architecture
21. Extended BPF
• take a mix of real CPU instructions
  • 10% classic BPF + 70% x86 + 25% arm64 + 5% risc
• rename every x86 instruction "mov rax, rbx" into "mov r1, r2"
• analyze x86/arm64/risc calling conventions and define a common one for this "renamed" instruction set
• make instruction encoding fixed size (for high interpreter speed)
• reuse classic BPF instruction encoding (for trivial classic->extended conversion)
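The fixed-size encoding can be sketched as an 8-byte record. The field layout below mirrors `struct bpf_insn` from the kernel's `linux/bpf.h`, but `struct ebpf_insn` and the helper functions are hypothetical names for this illustration, and the sketch assumes GCC/Clang-style bit-fields on `uint8_t`. The three opcode values are the standard encodings for a 64-bit register move, a 64-bit immediate move and exit.

```c
#include <stdint.h>

/* One extended BPF instruction: always 8 bytes. */
struct ebpf_insn {
    uint8_t code;         /* opcode */
    uint8_t dst_reg : 4;  /* destination register, R0..R10 */
    uint8_t src_reg : 4;  /* source register, R0..R10 */
    int16_t off;          /* signed offset (jumps, load/store) */
    int32_t imm;          /* signed immediate constant */
};

_Static_assert(sizeof(struct ebpf_insn) == 8, "encoding must stay fixed size");

enum {
    OP_MOV64_REG = 0xbf,  /* BPF_ALU64 | BPF_MOV | BPF_X */
    OP_MOV64_IMM = 0xb7,  /* BPF_ALU64 | BPF_MOV | BPF_K */
    OP_EXIT      = 0x95,  /* BPF_JMP | BPF_EXIT */
};

/* "mov r1, r2" style register move: dst = src */
static struct ebpf_insn mov64_reg(uint8_t dst, uint8_t src)
{
    struct ebpf_insn i = { .code = OP_MOV64_REG,
                           .dst_reg = dst & 0xf, .src_reg = src & 0xf };
    return i;
}

/* dst = imm */
static struct ebpf_insn mov64_imm(uint8_t dst, int32_t imm)
{
    struct ebpf_insn i = { .code = OP_MOV64_IMM,
                           .dst_reg = dst & 0xf, .imm = imm };
    return i;
}

/* return R0 to the caller */
static struct ebpf_insn exit_insn(void)
{
    struct ebpf_insn i = { .code = OP_EXIT };
    return i;
}
```

A two-instruction program that returns 0 would then be `{ mov64_imm(0, 0), exit_insn() }`. Because every instruction has the same width, an interpreter can fetch and dispatch with simple array indexing.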
22. extended vs classic BPF
• ten 64-bit registers vs two 32-bit registers
• arbitrary load/store vs stack load/store
• call instruction
23. Performance
• user space compiler "thinks" that it's emitting simplified x86 code
• kernel verifies this "simplified x86" code
• kernel JIT translates each "simplified x86" insn into real x86
  • all registers map one-to-one
  • most instructions map one-to-one
  • bpf "call" instruction maps to x86 "call"
24. Extended BPF calling convention
• BPF calling convention was carefully selected to match a subset of amd64/arm64 ABIs to avoid extra copy in calls:
  • R0: return value
  • R1..R5: function arguments
  • R6..R9: callee saved
  • R10: frame pointer
26. calls and helper functions
• bpf "call" and the set of in-kernel helper functions define what bpf programs can do
• bpf code itself is a "glue" between calls to in-kernel helper functions
• helpers
  • map_lookup/update/delete
  • ktime_get
  • packet_write
  • fetch
27. BPF maps
• maps are generic storage of different types for sharing data between kernel and userspace
• maps are accessed from user space via the bpf() syscall, which has commands to:
  • create a map with given type and attributes
    map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
  • lookup key/value, update, delete, iterate, delete a map
• userspace programs use this syscall to create/access maps that BPF programs are concurrently updating
28. BPF compilers
• BPF backend for LLVM is in trunk and will be released as part of 3.7
• BPF backend for GCC is being worked on
• C front-end (clang) is used today to compile C code into BPF
• tracing and networking use cases may need custom languages
• BPF backend only knows how to emit instructions (calls to helper functions look like normal calls)
30. compiler as a library
[Diagram. User side: perf binary with libllvm; tracing script in .txt file -> .txt parser -> llvm mcjit api, with a bpf backend producing bpf code and an x64 backend producing x86 code ("run it"). bpf_create_map and bpf_prog_load cross the user/kernel boundary.]
31. BPF verifier (CFG check)
• To minimize run-time overhead, anything that can be checked statically is done by the verifier
• all jumps of a program form a CFG which is checked for loops
  • DAG check = non-recursive depth-first-search
  • if back-edge exists -> there is a loop -> reject program
• jumps back are allowed if they don't form loops
• bpf compiler can move cold basic blocks out of critical path
• likely/unlikely() hints give extra performance
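The DAG check described above can be sketched as an iterative depth-first search over jump targets. This is a toy model rather than the kernel's implementation: `struct toy_insn`, `successors` and `cfg_is_loop_free` are hypothetical names, and each instruction is reduced to "falls through", "jumps", "branches" or "exits". Unlike a full verifier, the sketch does not reject unreachable instructions.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy instruction: only control flow matters for this check. */
enum toy_kind { TOY_OTHER, TOY_JA, TOY_JCOND, TOY_EXIT };
struct toy_insn { enum toy_kind kind; int off; /* target = pc + 1 + off */ };

enum visit { UNSEEN = 0, ON_PATH, DONE };

/* Fill succ[] with the successors of insn pc; return how many there are. */
static int successors(const struct toy_insn *prog, int pc, int succ[2])
{
    int n = 0;

    if (prog[pc].kind == TOY_OTHER || prog[pc].kind == TOY_JCOND)
        succ[n++] = pc + 1;                 /* fall-through edge */
    if (prog[pc].kind == TOY_JA || prog[pc].kind == TOY_JCOND)
        succ[n++] = pc + 1 + prog[pc].off;  /* jump edge */
    return n;
}

/* True if every path stays in range and no back-edge (loop) exists. */
static bool cfg_is_loop_free(const struct toy_insn *prog, int len)
{
    if (len <= 0)
        return false;

    unsigned char *state = calloc((size_t)len, 1);
    int *next_edge = calloc((size_t)len, sizeof(*next_edge));
    int *stack = malloc((size_t)len * sizeof(*stack));
    bool ok = state && next_edge && stack;
    int sp = 0;

    if (ok) {
        state[0] = ON_PATH;
        stack[sp++] = 0;
    }
    while (ok && sp > 0) {
        int pc = stack[sp - 1], succ[2];
        int n = successors(prog, pc, succ);

        if (next_edge[pc] == n) {           /* all edges explored */
            state[pc] = DONE;
            sp--;
            continue;
        }
        int target = succ[next_edge[pc]++];
        if (target < 0 || target >= len)
            ok = false;                     /* jump or fall-through out of range */
        else if (state[target] == ON_PATH)
            ok = false;                     /* back-edge: there is a loop */
        else if (state[target] == UNSEEN) {
            state[target] = ON_PATH;
            stack[sp++] = target;
        }
        /* DONE targets were already fully explored: nothing to do */
    }
    free(state);
    free(next_edge);
    free(stack);
    return ok;
}
```

Note that an edge to a `DONE` instruction is accepted even when its target has a lower index, which is how a backward jump that does not form a loop passes the check.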
32. BPF verifier (instruction walking)
• once it's known that all paths through the program reach the final "exit" instruction, a brute-force analyzer of all instructions starts
• it descends all possible paths from the 1st insn till the "exit" insn
• it simulates execution of every insn and updates the state of registers and stack
33. BPF verifier
• at the start of the program:
  • type of R1 = PTR_TO_CTX
    type of R10 = FRAME_PTR
    other registers and stack are unreadable
• when verifier sees:
  • "R2 = R1" instruction, it copies the type of R1 into R2
  • "R3 = 123" instruction, the type of R3 becomes CONST_IMM
  • "exit" instruction, it checks that R0 is readable
  • "if (R4 == 456) goto pc+5" instruction, it checks that R4 is readable and forks the current state of registers and stack into "true" and "false" branches
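The register-type rules above can be modeled in a few lines. This is a simplified sketch that tracks register types only (no stack, no pointer arithmetic); `struct vstate` and the `check_*` functions are hypothetical names, and `NOT_INIT` stands in for "unreadable".

```c
#include <stdbool.h>
#include <stdint.h>

enum reg_type { NOT_INIT = 0, PTR_TO_CTX, FRAME_PTR, CONST_IMM };
enum { NUM_REGS = 11 };  /* R0..R10 */

struct reg_state { enum reg_type type; int32_t imm; };
struct vstate { struct reg_state regs[NUM_REGS]; };

/* State at program entry: only R1 and R10 are readable. */
static void init_state(struct vstate *s)
{
    for (int r = 0; r < NUM_REGS; r++) {
        s->regs[r].type = NOT_INIT;
        s->regs[r].imm = 0;
    }
    s->regs[1].type = PTR_TO_CTX;
    s->regs[10].type = FRAME_PTR;
}

static bool readable(const struct vstate *s, int r)
{
    return r >= 0 && r < NUM_REGS && s->regs[r].type != NOT_INIT;
}

/* "Rdst = Rsrc": the type of the source is copied to the destination. */
static bool check_mov_reg(struct vstate *s, int dst, int src)
{
    if (dst < 0 || dst >= NUM_REGS || !readable(s, src))
        return false;
    s->regs[dst] = s->regs[src];
    return true;
}

/* "Rdst = imm": the destination becomes a known constant. */
static bool check_mov_imm(struct vstate *s, int dst, int32_t imm)
{
    if (dst < 0 || dst >= NUM_REGS)
        return false;
    s->regs[dst].type = CONST_IMM;
    s->regs[dst].imm = imm;
    return true;
}

/* "exit": the return value register R0 must be readable. */
static bool check_exit(const struct vstate *s)
{
    return readable(s, 0);
}

/* "if (Rreg == imm) goto ...": the register must be readable, and the
 * state is forked so both branches can be explored independently. */
static bool check_jeq_imm(const struct vstate *s, int reg,
                          struct vstate *branch_true,
                          struct vstate *branch_false)
{
    if (!readable(s, reg))
        return false;
    *branch_true = *s;
    *branch_false = *s;
    return true;
}
```

Starting from `init_state()`, an immediate `check_exit()` fails because R0 was never written, while `check_mov_imm(&s, 0, 0)` followed by `check_exit()` passes.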
34. BPF verifier (state pruning)
• every branch adds another fork for the verifier to explore, therefore branch pruning is important
• when the verifier sees an old state that has a stricter register state and a stricter stack state, the current branch doesn't need to be explored further, since the verifier already concluded that the stricter state leads to a valid "exit"
• two states are equivalent if the explored register state is more conservative and the explored stack state is more conservative than the current one
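One way to read "more conservative" is sketched below for registers only; the stack would be compared the same way. The names are hypothetical, and the rule is an assumed simplification. A previously verified state covers the current one if each register either matches exactly or was unreadable in the old state, since the old path then reached "exit" without ever relying on that register.

```c
#include <stdbool.h>
#include <stdint.h>

enum reg_type { NOT_INIT = 0, PTR_TO_CTX, FRAME_PTR, CONST_IMM };
enum { NUM_REGS = 11 };  /* R0..R10 */

struct reg_state { enum reg_type type; int32_t imm; };

/* Does the already-verified register state cover the current one? */
static bool reg_covers(const struct reg_state *old, const struct reg_state *cur)
{
    if (old->type == cur->type && old->imm == cur->imm)
        return true;   /* identical: same verdict as before */
    if (old->type == NOT_INIT)
        return true;   /* old path never read it, so any content is safe */
    return false;
}

/* True if the current branch can be pruned at this instruction. */
static bool can_prune(const struct reg_state old[NUM_REGS],
                      const struct reg_state cur[NUM_REGS])
{
    for (int r = 0; r < NUM_REGS; r++)
        if (!reg_covers(&old[r], &cur[r]))
            return false;
    return true;
}
```

The check is deliberately asymmetric: a current state with more initialized registers than the old one can be pruned, but not the other way around.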
35. unprivileged programs?
• today extended BPF is root only
• to consider unprivileged access:
  • teach verifier to conditionally reject programs that expose kernel addresses to user space
  • constant blinding pass
36. BPF for tracing
• BPF is seen as an alternative to systemtap/dtrace
• provides in-kernel aggregation, event filtering
• can be 'always on'
• must have minimal overhead
37. BPF for tracing (kernel part)
/* sent to kernel as bpf map via bpf() syscall */
struct bpf_map_def SEC("maps") my_hist_map = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = 64,
};

/* name of elf section: tracing event to attach via perf_event ioctl */
SEC("events/syscalls/sys_enter_write")
/* compiled by llvm into .o and loaded via bpf() syscall */
int bpf_prog(struct bpf_context *ctx)
{
	u64 write_size = ctx->arg3;
	u32 index = log2(write_size);
	u64 *value;

	value = bpf_map_lookup_elem(&my_hist_map, &index);
	if (value)
		__sync_fetch_and_add(value, 1);
	return 0;
}
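The program above calls log2() without showing it. Since the verifier rejects loops, such a helper has to be branch-and-shift code with no iteration. The following is one possible loop-free sketch; `hist_log2` and `hist_log2_u32` are hypothetical names, chosen to avoid clashing with the C library's `log2`.

```c
#include <stdint.h>

/* floor(log2(v)) for 32-bit v, computed by binary search on the bit
 * position. Returns 0 for v == 0 so the result is always a valid index. */
static uint32_t hist_log2_u32(uint32_t v)
{
    uint32_t r, shift;

    r = (uint32_t)(v > 0xFFFF) << 4; v >>= r;
    shift = (uint32_t)(v > 0xFF) << 3; v >>= shift; r |= shift;
    shift = (uint32_t)(v > 0xF) << 2; v >>= shift; r |= shift;
    shift = (uint32_t)(v > 0x3) << 1; v >>= shift; r |= shift;
    r |= (v >> 1);
    return r;
}

/* 64-bit variant: result is in 0..63, matching max_entries = 64. */
static uint32_t hist_log2(uint64_t v)
{
    uint32_t hi = (uint32_t)(v >> 32);

    return hi ? hist_log2_u32(hi) + 32 : hist_log2_u32((uint32_t)v);
}
```

With this helper a 4096-byte write lands in bucket 12, and the largest possible size maps to bucket 63, so the index never exceeds the 64-entry array map.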