Introduction to eBPF and libbpf
eBPF (Extended Berkeley Packet Filter) is a powerful technology in the Linux kernel that allows developers to run custom code safely and efficiently within the kernel space without changing kernel source code or loading kernel modules. Initially designed for packet filtering, eBPF has evolved to support a wide range of use cases, including performance monitoring, security enforcement, and network traffic control. eBPF programs are sandboxed, ensuring they do not harm the stability or security of the system, and they are verified by the kernel before execution.
libbpf is a C library designed to simplify working with eBPF programs. It provides the tools and abstractions needed to load, attach, and manage eBPF programs and maps, so developers can focus on their specific use cases rather than the intricacies of the eBPF subsystem. One of its notable features is BPF CO-RE (Compile Once – Run Everywhere), which improves portability by letting a compiled eBPF program run across different kernel versions without being recompiled. The main facilities the library offers are:
- Loading BPF Programs: Functions to load eBPF programs from ELF files into the kernel.
- Verifying BPF Programs: libbpf submits programs to the in-kernel verifier, which checks their correctness and safety before they are executed, and reports the verifier log when a program is rejected.
- Managing BPF Maps: Functions to create, manage, and interact with BPF maps, which store data shared between eBPF programs and user space.
- Attaching BPF Programs: Utilities to attach eBPF programs to various kernel hooks and events.
- BPF Object Management: High-level APIs for handling BPF object skeletons, which simplify interaction with eBPF programs and maps.
Sample program
We are going to look at the following repository, which contains sample code for getting started with eBPF and libbpf. Let's examine the bootstrap example. Keep in mind that, with this library, a file with the .bpf.c extension holds the code that runs in kernel mode, while a plain .c file holds the code that runs in user mode. Building on several of the examples, we will create a program that kills every process that tries to use the ptrace syscall.
Kernel mode code
The following code is the part that will be executed in kernel mode:
In [1] we declare a structure that holds the data collected for each event:
- Process identifier (PID)
- Command name (comm)
- Whether the action was successful (success)
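A structure with these three fields might look like the following minimal sketch. The field types, the struct name event, and the TASK_COMM_LEN constant are assumptions made for illustration, so the original listing may differ. It is written here as it would compile in user space; in the .bpf.c file, vmlinux.h already provides bool.

```c
#include <stdbool.h>

/* Assumed value: the kernel stores task command names in 16 bytes,
 * including the terminating NUL. */
#define TASK_COMM_LEN 16

/* One record per intercepted ptrace call (hypothetical layout). */
struct event {
    int  pid;                  /* process identifier (PID) */
    char comm[TASK_COMM_LEN];  /* command name of the calling process */
    bool success;              /* whether the action was successful */
};
```

Keeping a definition like this in a header shared by the kernel mode and user mode code lets both sides agree on the layout of each record.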
We will need to declare a ring buffer map [2], with a maximum size of 256 KB, to pass the information to our user mode code (we will inspect it later).
The SEC macro lets the programmer specify the section in which a function or variable will be placed within the eBPF object file that clang generates for us. In this case [3], we are telling libbpf that the function that follows must be attached to the tracepoint/syscalls/sys_enter_ptrace tracepoint, so it runs every time a process enters the ptrace syscall.
We retrieve the PID [4] of the process that triggered the event for logging purposes and send it the SIGKILL (9) signal in [5].
Finally, we copy the collected data into the event struct and send it to the user mode code in [6].
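For step [4], eBPF programs commonly obtain the PID through the bpf_get_current_pid_tgid() helper, which packs two identifiers into a single 64-bit value. Whether the original listing does exactly this is an assumption. The sketch below is plain C rather than BPF code, and its function names are hypothetical; it only illustrates how that packed value is decoded.

```c
#include <stdint.h>

/* bpf_get_current_pid_tgid() returns (tgid << 32) | pid. The kernel's
 * "tgid" is what user space calls the PID, and the kernel's "pid" is the
 * thread ID. Both helpers below are hypothetical and only show the layout. */
static inline uint32_t process_id(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);  /* upper half: process ID (tgid) */
}

static inline uint32_t thread_id(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;          /* lower half: thread ID */
}
```

Inside a BPF program the same shift is usually written inline, as in bpf_get_current_pid_tgid() >> 32.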
User mode code
The following code is the part that will be executed in user mode:
The user mode code is rather simple. It can be summarized in the following steps:
- Opening the eBPF object file
- Loading this file into the kernel using bpf syscalls
- Attaching the loaded program
- Creating the ring buffer structure to receive events from the kernel mode program and attaching a callback function to it (handle_event)
- Polling the ring buffer with a timeout of 100 ms
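As an illustration of the callback step, here is a minimal sketch of what handle_event could look like. The signature matches the one libbpf expects for ring buffer sample callbacks. The event layout and the output format are assumptions, and the struct is repeated so the sketch is self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed to match the structure filled in by the kernel mode program. */
struct event {
    int  pid;
    char comm[16];
    bool success;
};

/* Ring buffer sample callback: libbpf calls it once per record with a
 * pointer to the data and its size. Returning 0 keeps polling going. */
int handle_event(void *ctx, void *data, size_t data_sz)
{
    const struct event *e = data;

    (void)ctx;                  /* no user context needed in this sketch */
    if (data_sz < sizeof(*e))
        return 0;               /* ignore truncated records */

    printf("%-7d %-16.16s %s\n", e->pid, e->comm,
           e->success ? "killed" : "signal failed");
    return 0;
}
```

In a real loader this function would be passed to ring_buffer__new() and driven by calling ring_buffer__poll() with a 100 ms timeout inside a loop.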
Makefile
This Makefile is pretty straightforward: it just defines the steps needed to compile this application, including the use of bpftool to generate the vmlinux.h file (typically with bpftool btf dump file /sys/kernel/btf/vmlinux format c).
The vmlinux.h file is a header generated from the Linux kernel's BTF (BPF Type Format) information. It contains the type definitions of the kernel data structures that eBPF programs need in order to interact with the kernel. This file is crucial for eBPF development because it lets a program refer to kernel types without depending on the kernel headers being installed.
Setting up a good testing environment
After we have seen a working example, we might want to try it ourselves on our lab machine. This is where the problems start: dependencies.
Setting them up is not a straightforward process, so this post aims to give you a quick Docker environment that will help you compile your eBPF programs. The following Dockerfile has been created for this purpose:
When built, this Docker container already has the tools needed to start building your eBPF tools.
Still, if you try to run something with bpftrace to check that everything is OK, you will encounter the following error:
root@8246a0d0a6c4:/src# ./bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
stdin:1:1-34: ERROR: tracepoint not found: raw_syscalls:sys_enter
tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This fails because DebugFS is not currently mounted inside the container.
DebugFS
DebugFS is a special filesystem in the Linux kernel designed for debugging purposes. It provides a simple and efficient way for kernel developers to expose debugging information and controls to userspace.
By mounting the DebugFS filesystem, developers can access various files that represent kernel internals, such as system states, statistics, and debug information for specific subsystems or drivers. To mount it, run mount -t debugfs none /sys/kernel/debug as root.
After mounting it, you will be able to list the tracing folder inside /sys/kernel/debug, which contains information about the kernel tracing events that eBPF uses to get information:
root@8246a0d0a6c4:/src# mount -t debugfs none /sys/kernel/debu
mount: /sys/kernel/debu: mount point does not exist.
root@8246a0d0a6c4:/src# mount -t debugfs none /sys/kernel/debug/
root@8246a0d0a6c4:/src# ls /sys/kernel/debug/tracing
README dyn_ftrace_total_info instances saved_cmdlines set_ftrace_notrace_pid synthetic_events trace_stat
available_events dynamic_events kprobe_events saved_cmdlines_size set_ftrace_pid timestamp_mode tracing_cpumask
available_filter_functions enabled_functions kprobe_profile saved_tgids set_graph_function trace tracing_max_latency
available_tracers error_log max_graph_depth set_event set_graph_notrace trace_clock tracing_on
buffer_percent events options set_event_notrace_pid snapshot trace_marker tracing_thresh
buffer_size_kb free_buffer osnoise set_event_pid stack_max_size trace_marker_raw uprobe_events
buffer_total_size_kb function_profile_enabled per_cpu set_ftrace_filter stack_trace trace_options uprobe_profile
current_tracer hwlat_detector printk_formats set_ftrace_notrace stack_trace_filter trace_pipe
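A tool can also check up front whether DebugFS is mounted instead of failing later with a confusing error. The following minimal sketch scans a mounts table, normally /proc/self/mounts, for an entry of a given filesystem type. The function name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the mounts table read from `mounts` lists a filesystem of
 * type `fstype`, 0 otherwise. Each line has the form:
 *   <device> <mount point> <fs type> <options> <dump> <pass> */
int fs_type_mounted(FILE *mounts, const char *fstype)
{
    char line[512], dev[128], dir[256], type[64];

    while (fgets(line, sizeof(line), mounts)) {
        if (sscanf(line, "%127s %255s %63s", dev, dir, type) == 3 &&
            strcmp(type, fstype) == 0)
            return 1;
    }
    return 0;
}
```

A caller would open /proc/self/mounts with fopen() and pass "debugfs" as the type, printing a hint to mount it when the function returns 0.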
Executing the BPF Docker container
To run this Docker container, you will need to add the --privileged flag. This is needed for two reasons: the default seccomp profile blocks the bpf syscall, and loading a BPF program into the kernel requires extra capabilities, such as CAP_BPF (a relatively new one) or, as a brute-force option, CAP_SYS_ADMIN.
If we examine the following default seccomp profile we can see the following:
- If CAP_BPF is set, the bpf syscall can be performed
- If CAP_SYS_ADMIN is set, the bpf syscall can be performed
The default action for syscalls that are not explicitly whitelisted is to deny them.
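To see which side of that rule a process falls on, we can inspect its effective capability mask. The sketch below parses the CapEff line of a /proc/<pid>/status style stream and tests the two relevant bits. The function names are hypothetical, and the capability numbers are copied from linux/capability.h so the code also builds against headers that predate CAP_BPF.

```c
#include <stdint.h>
#include <stdio.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_SYS_ADMIN_BIT 21
#define CAP_BPF_BIT       39   /* added in Linux 5.8 */

/* Returns 1 if the effective mask contains CAP_BPF or CAP_SYS_ADMIN,
 * the two capabilities the seccomp profile keys on, and 0 otherwise. */
int mask_allows_bpf(uint64_t cap_eff)
{
    return ((cap_eff >> CAP_BPF_BIT) & 1) ||
           ((cap_eff >> CAP_SYS_ADMIN_BIT) & 1);
}

/* Reads the "CapEff:" line from a /proc/<pid>/status style stream.
 * Returns 0 and stores the mask in *out on success, -1 otherwise. */
int read_cap_eff(FILE *status, uint64_t *out)
{
    char line[256];
    unsigned long long mask;

    while (fgets(line, sizeof(line), status)) {
        if (sscanf(line, "CapEff: %llx", &mask) == 1) {
            *out = mask;
            return 0;
        }
    }
    return -1;
}
```

Running this against /proc/self/status inside the container, with and without --privileged, shows whether either capability is present. It does not tell us whether seccomp is active, which is a separate check.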
Developing More Complex eBPF Programs
Once you have your development environment set up, you can start exploring more complex eBPF programs. eBPF’s versatility allows you to tackle various tasks, from advanced networking features to sophisticated performance monitoring tools.
Advanced Networking
eBPF can be used to implement advanced networking features such as load balancing, firewall rules, and network address translation (NAT). With libbpf, you can write eBPF programs that filter and manipulate network packets, attach them to various networking hooks, and dynamically manage network traffic based on custom logic. One example built with bpftrace is tcpconnect.bt, from the official bpftrace repository.
This script traces active TCP connections, that is, the ones a process initiates through connect().
Security Enforcement
eBPF can enhance system security by implementing custom security policies and monitoring system calls. You can write eBPF programs that detect and prevent suspicious activities, enforce access control policies, and audit system events. eBPF’s ability to run in the kernel space with minimal overhead makes it an ideal choice for security-sensitive applications.
For example, seccomp filtering is itself implemented with BPF programs, as the kernel documentation explains:
“Seccomp filtering provides a means for a process to specify a filter for incoming system calls. The filter is expressed as a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data operated on is related to the system call being made: system call number and the system call arguments. This allows for expressive filtering of system calls using a filter program language with a long history of being exposed to userland and a straightforward data set.”
The syscall filter itself is written in BPF. How cool is that?
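To make the quote concrete, here is a minimal sketch of such a filter. It is written in classic BPF (not eBPF) and makes ptrace fail with EPERM for the calling process while allowing everything else. The array and function names are hypothetical, and a production filter should also validate the arch field of seccomp_data, which is omitted here for brevity.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Classic BPF program evaluated by the kernel on every system call. */
static struct sock_filter deny_ptrace[] = {
    /* Load the system call number from the seccomp_data structure. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* If it is ptrace, fall through to the next instruction; else skip one. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
    /* ptrace: fail the call with EPERM. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* Anything else: allow. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter for the calling process. Returns 0 on success and
 * -1 on failure. The filter cannot be removed once installed. */
int install_deny_ptrace(void)
{
    struct sock_fprog prog = {
        .len = sizeof(deny_ptrace) / sizeof(deny_ptrace[0]),
        .filter = deny_ptrace,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Compared with the eBPF program from the sample above, this filter only affects the process that installs it and its children, and it rejects the call instead of killing the caller.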
Conclusion
This was the first post about eBPF! More will come. Thanks for reading until the end.