In-depth eBPF | Seven core questions you need to know
Author | Yanxun
Over the past year, ARMS has built Kubernetes monitoring on top of eBPF, providing non-intrusive, multi-language observability for application, system, and network performance, which validated the effectiveness of the technology. eBPF and its ecosystem are developing well, and the outlook is promising. Written from a practitioner's perspective, this article introduces eBPF itself by answering seven core questions, to lift the veil on eBPF for everyone.
What is eBPF?
eBPF is a technology for running sandboxed programs in the kernel. It provides a mechanism to safely inject code when kernel events and user-program events occur, so that developers who do not work on the kernel can also control kernel behavior. As the kernel has evolved, eBPF has gradually expanded from its original purpose of packet filtering to networking, kernel internals, security, tracing, and more, and its feature set is still growing rapidly. The early BPF is now called classic BPF, or cBPF for short; it is exactly this expansion of functionality that gives the current BPF the name Extended BPF, or eBPF for short.
What are the application scenarios of eBPF?
Network Optimization
eBPF combines high performance with high extensibility, which makes it a preferred choice for packet processing in networking solutions:
- High performance
The JIT compiler delivers execution efficiency close to that of native kernel code.
- Highly extensible
Protocol parsing and routing policies can be added quickly within the kernel context.
Troubleshooting
eBPF can trace both the kernel and user programs through mechanisms such as kprobes and tracepoints. This end-to-end tracing capability makes it possible to diagnose faults quickly. eBPF can also expose profiling statistics more efficiently: unlike traditional systems, it does not need to export large volumes of sampled data, which makes continuous real-time profiling feasible.
Security Control
eBPF can see all system calls, all network packets, and all socket-level network operations. Combining process-context tracing, network-operation-level filtering, and system call filtering gives better security control.
Performance Monitoring
Traditional system monitoring tools such as sar provide only static counters and gauges. eBPF supports programmable, dynamic collection of custom metrics and events and aggregates them at the edge, where the events occur, which greatly improves the efficiency and flexibility of performance monitoring.
Why did eBPF emerge?
eBPF emerged essentially to resolve the conflict between the slow iteration of the kernel and rapidly changing system requirements. A common analogy in the eBPF community is that eBPF is to the Linux kernel what JavaScript is to HTML: the key point is programmability. Programmability usually brings new problems of its own. Kernel modules, for example, were designed to solve the same problem, but they do not provide a good isolation boundary, so a kernel module can affect the stability of the kernel itself, and it has to be adapted to each kernel version. eBPF adopts the following strategies to make it a safe and efficient technology for programming the kernel:
- Safe
An eBPF program can run only after it passes the verifier, and it must not contain unreachable instructions. An eBPF program cannot call arbitrary kernel functions; it can call only the helper functions defined in the API. The stack of an eBPF program is limited to 512 bytes; anything larger must be stored in a BPF map.
- Efficient
The just-in-time (JIT) compiler translates eBPF instructions into native code, and because they run inside the kernel, data does not have to be copied to user mode, which greatly improves the efficiency of event processing.
- Standard
BPF Helpers, BTF, and PERF MAP provide standard interfaces and data models for developers.
- Powerful
eBPF not only increased the number of registers and introduced BPF maps as a new kind of storage; in the 4.x kernels it also gradually expanded from the original single event type, packet filtering, to kernel-mode functions, user-mode functions, tracepoints, performance events (perf_events), and security control.
How to use eBPF?
5 steps
1. Develop an eBPF program in C;
This is the sandboxed eBPF program that is invoked when the instrumentation point fires an event. It runs in kernel mode.
2. Compile the eBPF program into BPF bytecode with LLVM;
The bytecode is later verified and executed in the eBPF virtual machine.
3. Submit the BPF bytecode to the kernel through the bpf system call;
A user-mode program loads the BPF bytecode into the kernel through the bpf system call.
4. The kernel verifies and runs the BPF bytecode, and saves state to BPF maps;
The kernel verifies that the BPF bytecode is safe and ensures that the correct eBPF program is invoked when the corresponding event occurs. Any state that needs to be kept, such as monitoring data, is written to a BPF map.
5. The user program queries the running state of the BPF bytecode through BPF maps.
User mode reads the contents of the BPF maps to obtain the running state of the bytecode, for example the captured monitoring data.
A complete eBPF program usually has two parts, a user-mode part and a kernel-mode part. The user-mode program interacts with the kernel through the bpf system call to load the eBPF program, attach it to events, and create and update maps. In kernel mode, the eBPF program cannot call arbitrary kernel functions; it has to do its work through BPF helper functions. In particular, when it accesses memory addresses it must read the data through the bpf_probe_read family of functions, so that memory access is safe and efficient. When the eBPF program needs large blocks of storage, a BPF map of a type suited to the scenario is introduced, and the map also exposes run-time state to the user-space program.
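The five steps can be sketched with bcc's Python front end, the same library the b.attach_kprobe call later in this article comes from. This is a minimal sketch, not a definitive implementation: it assumes bcc and the kernel headers are installed and that the script runs as root. The names BPF_PROGRAM, count_openat, counts, and run are made up for this illustration.

```python
import time

# Steps 1-2: the kernel-mode part, written in C. bcc compiles it to BPF
# bytecode with LLVM when the BPF object is created.
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);   // BPF map: PID -> number of openat() calls

int count_openat(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);    // state is kept in the map, not on the stack
    return 0;
}
"""


def run(seconds=5):
    """Load the program, attach it, wait, then read the map from user mode."""
    from bcc import BPF  # imported lazily so this module loads without bcc

    # Steps 3-4: compile, submit through the bpf system call, get verified.
    b = BPF(text=BPF_PROGRAM)
    # Attach to the kernel function that implements the openat system call.
    b.attach_kprobe(event=b.get_syscall_fnname("openat"),
                    fn_name="count_openat")
    time.sleep(seconds)
    # Step 5: user mode queries the running state through the BPF map.
    return {key.value: value.value for key, value in b["counts"].items()}
```

Calling run() as root should return a dictionary that maps each PID to the number of openat calls it made during the sampling window. The kernel-mode part never prints or sends anything itself; everything user mode learns comes from the map.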
eBPF program classification and usage scenarios
bpftool feature probe | grep program_type
The command above lists the eBPF program types supported by the system. Typical output looks like this:
eBPF program_type socket_filter is available
eBPF program_type kprobe is available
eBPF program_type sched_cls is available
eBPF program_type sched_act is available
eBPF program_type tracepoint is available
eBPF program_type xdp is available
eBPF program_type perf_event is available
eBPF program_type cgroup_skb is available
eBPF program_type cgroup_sock is available
eBPF program_type lwt_in is available
eBPF program_type lwt_out is available
eBPF program_type lwt_xmit is available
eBPF program_type sock_ops is available
eBPF program_type sk_skb is available
eBPF program_type cgroup_device is available
eBPF program_type sk_msg is available
eBPF program_type raw_tracepoint is available
eBPF program_type cgroup_sock_addr is available
eBPF program_type lwt_seg6local is available
eBPF program_type lirc_mode2 is NOT available
eBPF program_type sk_reuseport is available
eBPF program_type flow_dissector is available
eBPF program_type cgroup_sysctl is available
eBPF program_type raw_tracepoint_writable is available
eBPF program_type cgroup_sockopt is available
eBPF program_type tracing is available
eBPF program_type struct_ops is available
eBPF program_type ext is available
eBPF program_type lsm is available
For details, please refer to https://elixir.bootlin.com/linux/v5.13/source/include/linux/bpf_types.h
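When this check has to be scripted, for example to confirm that a program type is available before loading a program, the plain-text output listed above can be parsed directly. The following is a small sketch that assumes the line format shown in this article; parse_program_types and the sample text are illustrative names and data, not part of bpftool.

```python
import re

# One line of `bpftool feature probe` output, in the format listed above.
_LINE = re.compile(r"^eBPF program_type (\S+) is (NOT )?available$")


def parse_program_types(output):
    """Return {program_type: is_available} from bpftool's plain-text output."""
    result = {}
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            # group(2) is "NOT " only for unavailable program types.
            result[match.group(1)] = match.group(2) is None
    return result


# A three-line excerpt of the output shown in this article.
sample = """\
eBPF program_type kprobe is available
eBPF program_type xdp is available
eBPF program_type lirc_mode2 is NOT available
"""
types = parse_program_types(sample)
unavailable = sorted(name for name, ok in types.items() if not ok)
print(unavailable)
```

On a real system the input could come from running bpftool feature probe through subprocess.run with capture_output=True and text=True, and passing its stdout to the function.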
These program types fall into three major usage scenarios:
- Tracing
tracepoint, kprobe, perf_event, and similar types are mainly used to extract trace information from the system, which then supports monitoring, troubleshooting, and performance optimization.
- Networking
xdp, sock_ops, cgroup_sock_addr, sk_msg, and similar types are mainly used to filter and process network packets, enabling network observation, filtering, traffic control, and performance optimization. Packets can be dropped or redirected at these hooks.
Cilium uses almost all of these hook points.
- Security and others
lsm is used for security. Others, such as flow_dissector and lwt_in, are not commonly used and are not covered here.
What are the best practices for eBPF?
Finding kernel instrumentation points
As the previous sections show, writing the eBPF program itself is not hard; the hard part is finding a suitable event source to trigger it. For monitoring and diagnostics, tracing eBPF programs have three kinds of event sources: kernel functions (kprobe), kernel tracepoints (tracepoint), and performance events (perf_event). Two questions need answering here:
1. Which kernel functions, kernel tracepoints, and performance events does the kernel provide?
- Use debugfs to list kernel tracepoints (kernel function symbols are listed in /proc/kallsyms)
sudo ls /sys/kernel/debug/tracing/events
- Use bpftrace to list kernel functions and kernel tracepoints
# List all kernel probes and tracepoints
sudo bpftrace -l
# Use a wildcard to list all system call tracepoints
sudo bpftrace -l 'tracepoint:syscalls:*'
# Use a wildcard to list all probes whose name contains "open"
sudo bpftrace -l '*open*'
- Use perf list to get performance events
sudo perf list tracepoint
2. For kernel functions and kernel tracepoints, how can the definitions of their data structures be looked up when their input parameters and return values need to be traced?
- Use debugfs
sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/format
- Use bpftrace
sudo bpftrace -lv tracepoint:syscalls:sys_enter_openat
For details on how to use this information, refer to the bcc documentation.
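To show what the format file provides, the sketch below parses its field lines into names, C types, offsets, and sizes, which is the information an eBPF program needs to read tracepoint arguments. It is an illustration under assumptions: SAMPLE is an abbreviated, hand-written excerpt in the layout the tracing subsystem uses (offsets can differ between kernels), and Field and parse_format are hypothetical names.

```python
from typing import NamedTuple


class Field(NamedTuple):
    ctype: str
    name: str
    offset: int
    size: int


def parse_format(text):
    """Extract the field declarations from a tracepoint format file."""
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("field:"):
            continue
        # e.g. "field:const char * filename;  offset:24;  size:8;  signed:0;"
        parts = [p.strip() for p in line.split(";") if p.strip()]
        decl = parts[0][len("field:"):]
        ctype, name = decl.rsplit(None, 1)  # the last word is the field name
        attrs = dict(p.split(":", 1) for p in parts[1:])
        fields.append(Field(ctype, name, int(attrs["offset"]), int(attrs["size"])))
    return fields


# Abbreviated sample; a real file has more fields and a "print fmt" line.
SAMPLE = (
    "name: sys_enter_openat\n"
    "format:\n"
    "\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;\n"
    "\n"
    "\tfield:int __syscall_nr;\toffset:8;\tsize:4;\tsigned:1;\n"
    "\tfield:int dfd;\toffset:16;\tsize:8;\tsigned:0;\n"
    "\tfield:const char * filename;\toffset:24;\tsize:8;\tsigned:0;\n"
)

# Fields prefixed with common_ are shared by all tracepoints; the rest are
# specific to this event.
args = [f for f in parse_format(SAMPLE) if not f.name.startswith("common_")]
for field in args:
    print(field.name, field.ctype, field.offset, field.size)
```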
Find instrumentation points for your application
1. How do you find the tracepoints of a user process?
- Statically compiled languages keep debugging information when built with the -g option, and the application binary then contains DWARF (Debugging With Attributed Record Format) data. With the debugging information, tools such as readelf, objdump, and nm can list the symbols of the functions and variables that are available for tracing
# Query the symbol table
readelf -Ws /usr/lib/x86_64-linux-gnu/libc.so.6
# Query USDT information
readelf -n /usr/lib/x86_64-linux-gnu/libc.so.6
- Use bpftrace
# Query uprobes
bpftrace -l 'uprobe:/usr/lib/x86_64-linux-gnu/libc.so.6:*'
# Query USDT probes
bpftrace -l 'usdt:/usr/lib/x86_64-linux-gnu/libc.so.6:*'
uprobes are file based. When a function in a file is traced, every process that uses the file is instrumented by default unless you filter by process PID.
The above applies to statically compiled languages and is similar to tracing the kernel. An application's symbol information can be stored in the ELF binary itself or in a separate debug file; the kernel's symbol information, besides being stored in the kernel binary, is also exposed to user space through /proc/kallsyms and /sys/kernel/debug.
Languages that are not statically compiled fall into two main groups:
- Interpreted languages
Use the same query methods as for compiled applications to find uprobe and USDT tracepoints at the interpreter level. Relating interpreter-level behavior to application behavior requires analysis by experts in the language concerned.
- Just-in-time compiled languages
Application source code in these languages is first compiled to bytecode and then compiled to machine code by the just-in-time (JIT) compiler, with heavy optimization along the way, so tracing is very difficult. As with interpreted languages, uprobe and USDT tracing can be applied only to the JIT compiler itself, and information about the application's functions has to be obtained from the parameters of the JIT compiler's tracepoints. Working out how the JIT compiler's tracepoints relate to the execution of the application again requires analysis by experts in the language concerned.
You can refer to BCC's application tracing tools. User process tracing essentially executes the uprobe handler through breakpoints. Although the kernel community has done a lot of performance tuning for BPF, tracing user-mode functions, especially high-frequency ones such as lock contention and memory allocation, can still add significant overhead. When using uprobes, avoid tracing high-frequency functions wherever possible.
For details on how to use the above information, please refer to: https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments
Correlating Issues and Instrumentation Points
Ideally, for every problem it would be clear which instrumentation points to observe, but that requires engineers to understand the details of the end-to-end software stack thoroughly. A more practical approach is the 80/20 rule: cover the core 80% of the paths, which is enough to make sure a problem in them is detected, and then use the kernel stack and user stack to inspect the specific call stack and find the root cause. For example, suppose the network is dropping packets and the cause is unknown. We know that dropping a packet always calls the kfree_skb kernel function, so we can run:
sudo bpftrace -e 'kprobe:kfree_skb /comm=="<your comm>"/ {printf("kstack: %s\n", kstack);}'
This prints the call stack of the function:
kstack:
kfree_skb+1
udpv6_destroy_sock+66
sk_common_release+34
udp_lib_close+9
inet_release+75
inet6_release+49
__sock_release+66
sock_close+21
__fput+159
____fput+14
task_work_run+103
exit_to_user_mode_loop+411
exit_to_user_mode_prepare+187
syscall_exit_to_user_mode+23
do_syscall_64+110
entry_SYSCALL_64_after_hwframe+68
You can then trace back through these functions to see which line makes the call and under what conditions, and so locate the problem. Besides locating problems, this method also helps deepen your understanding of kernel call paths; for example, you can view all network-related tracepoints and their call stacks.
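When many stacks are collected, it helps to post-process them rather than read them by eye. The sketch below splits a kstack dump like the one above into (function, offset) frames so that the immediate caller of kfree_skb can be extracted or counted. parse_kstack is a hypothetical helper name, and the sample is the first five frames of the stack shown in this article.

```python
import re

# A frame looks like "udp_lib_close+9": symbol name, "+", decimal offset.
_FRAME = re.compile(r"^([A-Za-z_][\w.]*)\+(\d+)$")


def parse_kstack(text):
    """Split kstack output into (function, offset) frames, innermost first."""
    frames = []
    for token in text.split():
        match = _FRAME.match(token)
        if match:  # skips the "kstack:" label and anything else
            frames.append((match.group(1), int(match.group(2))))
    return frames


sample = ("kstack: kfree_skb+1 udpv6_destroy_sock+66 sk_common_release+34 "
          "udp_lib_close+9 inet_release+75")
frames = parse_kstack(sample)
callers = [name for name, _ in frames[1:]]
print(callers[0])
```

Because the parser splits on any whitespace, it accepts the stack whether the frames arrive on one line or one per line. In this sample the immediate caller is udpv6_destroy_sock, which suggests the buffer was freed while a UDP socket was being closed.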
What is the implementation principle of eBPF?
5 modules
eBPF is mainly composed of 5 modules in the kernel:
1. BPF Verifier
Ensures the safety of eBPF programs. The verifier builds a directed acyclic graph (DAG) of the instructions to be executed to make sure the program contains no unreachable instructions, then simulates the execution of the instructions to make sure no invalid instruction is executed. The verifier cannot guarantee 100% safety, however, so all BPF programs still need strict monitoring and review.
2. BPF JIT
Compiles eBPF bytecode into native machine instructions so that it runs more efficiently in the kernel.
3. A storage module consisting of multiple 64-bit registers, a program counter and a 512-byte stack
Controls the execution of eBPF programs, holds stack data, and passes input and output parameters.
4. BPF Helpers (helper functions)
Provide a set of functions through which eBPF programs interact with other kernel modules. Not every eBPF program can call every helper; the set of available functions is determined by the BPF program type. Note that any modification of input and output parameters in eBPF must follow the BPF specification: apart from changes to local variables, all other changes must go through BPF helpers, and what the helpers do not support cannot be modified.
Running bpftool feature probe shows which BPF helpers each type of eBPF program can call.
5. BPF Map & context
Provide large blocks of storage that user-space programs can access, which is how user space controls and observes the running state of eBPF programs.
bpftool feature probe
The command above shows which map types the system supports.
3 actions
Let's start with the key system call, bpf:
int bpf(int cmd, union bpf_attr *attr, unsigned int size);
Here cmd is the key: attr holds the arguments for cmd and size is the size of attr, so what matters is which commands exist:
// Kernel 5.11
enum bpf_cmd {
BPF_MAP_CREATE,
BPF_MAP_LOOKUP_ELEM,
BPF_MAP_UPDATE_ELEM,
BPF_MAP_DELETE_ELEM,
BPF_MAP_GET_NEXT_KEY,
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
BPF_PROG_ATTACH,
BPF_PROG_DETACH,
BPF_PROG_TEST_RUN,
BPF_PROG_GET_NEXT_ID,
BPF_MAP_GET_NEXT_ID,
BPF_PROG_GET_FD_BY_ID,
BPF_MAP_GET_FD_BY_ID,
BPF_OBJ_GET_INFO_BY_FD,
BPF_PROG_QUERY,
BPF_RAW_TRACEPOINT_OPEN,
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
BPF_BTF_GET_NEXT_ID,
BPF_MAP_LOOKUP_BATCH,
BPF_MAP_LOOKUP_AND_DELETE_BATCH,
BPF_MAP_UPDATE_BATCH,
BPF_MAP_DELETE_BATCH,
BPF_LINK_CREATE,
BPF_LINK_UPDATE,
BPF_LINK_GET_FD_BY_ID,
BPF_LINK_GET_NEXT_ID,
BPF_ENABLE_STATS,
BPF_ITER_CREATE,
BPF_LINK_DETACH,
BPF_PROG_BIND_MAP,
};
The core commands are the PROG and MAP ones, which handle program loading and map operations.
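To make the structure of the command list concrete, the sketch below mirrors the enum above in Python and groups the commands by the object they act on. It relies on the fact that a C enum numbers its members from 0 in declaration order; BpfCmd and group_by_object are illustrative names, not part of any BPF library.

```python
from collections import defaultdict
from enum import IntEnum

# Command names in the same order as the kernel 5.11 enum listed above.
_NAMES = """
BPF_MAP_CREATE BPF_MAP_LOOKUP_ELEM BPF_MAP_UPDATE_ELEM BPF_MAP_DELETE_ELEM
BPF_MAP_GET_NEXT_KEY BPF_PROG_LOAD BPF_OBJ_PIN BPF_OBJ_GET BPF_PROG_ATTACH
BPF_PROG_DETACH BPF_PROG_TEST_RUN BPF_PROG_GET_NEXT_ID BPF_MAP_GET_NEXT_ID
BPF_PROG_GET_FD_BY_ID BPF_MAP_GET_FD_BY_ID BPF_OBJ_GET_INFO_BY_FD
BPF_PROG_QUERY BPF_RAW_TRACEPOINT_OPEN BPF_BTF_LOAD BPF_BTF_GET_FD_BY_ID
BPF_TASK_FD_QUERY BPF_MAP_LOOKUP_AND_DELETE_ELEM BPF_MAP_FREEZE
BPF_BTF_GET_NEXT_ID BPF_MAP_LOOKUP_BATCH BPF_MAP_LOOKUP_AND_DELETE_BATCH
BPF_MAP_UPDATE_BATCH BPF_MAP_DELETE_BATCH BPF_LINK_CREATE BPF_LINK_UPDATE
BPF_LINK_GET_FD_BY_ID BPF_LINK_GET_NEXT_ID BPF_ENABLE_STATS BPF_ITER_CREATE
BPF_LINK_DETACH BPF_PROG_BIND_MAP
""".split()

# start=0 reproduces the numbering of the C enum.
BpfCmd = IntEnum("BpfCmd", _NAMES, start=0)


def group_by_object(cmds):
    """Group command names by the word after BPF_ (MAP, PROG, LINK, ...)."""
    groups = defaultdict(list)
    for cmd in cmds:
        groups[cmd.name.split("_")[1]].append(cmd.name)
    return dict(groups)


groups = group_by_object(BpfCmd)
print(int(BpfCmd.BPF_PROG_LOAD), len(groups["MAP"]), len(groups["PROG"]))
```

The MAP and PROG groups together account for well over half of the commands, which matches the statement that program loading and map handling are the core of the interface.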
1. Program loading
Calling the BPF_PROG_LOAD cmd loads the BPF program into the kernel. Unlike a regular thread, however, an eBPF program does not keep running once it has been started; it executes only when an event triggers it. Such events include system calls, kernel tracepoints, entry to and exit from kernel functions and user-mode functions, network events, and so on, which is why the second action is needed.
2. Bind events
b.attach_kprobe(event="xxx", fn_name="yyy")
The call above binds a specific event to a specific BPF function. Under the hood it works as follows:
(1) After the BPF program is loaded through the bpf system call, the returned file descriptor is saved;
(2) The attach operation looks up the event ID for the corresponding probe type;
(3) perf_event_open is called with that event ID to create a performance monitoring event;
(4) The BPF program is bound to the performance monitoring event through the PERF_EVENT_IOC_SET_BPF command of ioctl.
3. Map operations
The MAP-related commands create, update, and delete map entries, and user mode then interacts with kernel mode through the maps.
What is the development status of eBPF?
Kernel support
Kernel 4.14 or later is recommended.
Ecosystem
From the bottom up, the eBPF ecosystem looks like this:
1. Infrastructure
Supports the development of eBPF's basic capabilities.
- Linux Kernel
- LLVM
2. Development toolsets
Mainly used to compile, load, and debug eBPF programs. Different languages have different toolsets:
- Go
https://github.com/cilium/ebpf
https://github.com/aquasecurity/libbpfgo
- C/C++
https://github.com/libbpf/libbpf
3. eBPF applications
- bcc (https://github.com/iovisor/bcc)
Provides a set of development tools and scripts.
- bpftrace (https://github.com/iovisor/bpftrace)
Built on bcc; provides a scripting language.
- cilium (https://github.com/cilium/cilium)
Network optimization and security
- Falco (https://github.com/falcosecurity/falco)
Runtime security
- Katran (https://github.com/facebookincubator/katran)
High-performance Layer 4 load balancing
- Hubble (https://github.com/cilium/hubble)
Observability
- Kindling (https://github.com/CloudDectective-Harmonycloud/kindling)
Observability
- Pixie (https://github.com/pixie-io/pixie)
Observability
- kubectl trace (https://github.com/iovisor/kubectl-trace)
Schedules bpftrace scripts
- L3AF (https://github.com/l3af-project/l3afd)
A platform for launching and managing eBPF programs in a distributed environment
- ply (https://github.com/iovisor/ply)
Dynamic tracing for Linux
- Tracee (https://github.com/aquasecurity/tracee)
Linux runtime security monitoring
4. Websites that track the ecosystem
- https://ebpf.io/projects
- https://github.com/zoidbergwill/awesome-ebpf
Final thoughts
Using eBPF well requires understanding the software stack
After the introduction above, you should have a reasonable understanding of eBPF. eBPF provides only a framework and a mechanism. At the core, the people using eBPF still need to understand the software stack well enough to find suitable instrumentation points and to correlate them with application problems.
The killer features of eBPF are full coverage, non-intrusiveness, and programmability
1. Full coverage
Instrumentation points in both the kernel and applications are covered.
2. Non-intrusive
The hooked code does not need to be modified.
3. Programmable
eBPF programs can be delivered dynamically, executed at the edge, and used for dynamic aggregation and analysis.