Intro to development using eBPF

Introduction to eBPF and libbpf

eBPF (Extended Berkeley Packet Filter) is a powerful technology in the Linux kernel that allows developers to run custom code safely and efficiently within the kernel space without changing kernel source code or loading kernel modules. Initially designed for packet filtering, eBPF has evolved to support a wide range of use cases, including performance monitoring, security enforcement, and network traffic control. eBPF programs are sandboxed, ensuring they do not harm the stability or security of the system, and they are verified by the kernel before execution.

libbpf is a C library designed to streamline the process of working with eBPF programs. It provides the necessary tools and abstractions for loading and managing eBPF programs and maps. libbpf simplifies the development of applications that leverage eBPF, allowing developers to focus on their specific use cases rather than the intricacies of the eBPF subsystem. One of its notable features is BPF CO-RE (Compile Once – Run Everywhere), which enhances portability by enabling eBPF programs to run across different kernel versions without modification. The main areas of functionality the library offers are:

Loading BPF Programs: Functions to load eBPF programs from ELF files into the kernel.
Verifying BPF Programs: Verification itself is performed by the in-kernel verifier at load time; libbpf submits the program to it and surfaces the verifier log when a program is rejected.
Managing BPF Maps: Functions to create, manage, and interact with BPF maps, which store data shared between eBPF programs and user space.
Attaching BPF Programs: Utilities to attach eBPF programs to various kernel hooks and events.
BPF Object Management: High-level APIs for handling BPF object skeletons, which simplify interaction with eBPF programs and maps.

Sample program

We are going to look at the following repository, which contains sample code for getting started with eBPF and libbpf. Let's examine the bootstrap example. With this library, each tool is split into a file with the .bpf.c extension, which runs in kernel mode, and a .c file, which runs in user mode. Combining ideas from several of the examples, we will create a program that kills every process that tries to use the ptrace syscall.

Kernel mode code

The following code is executed in kernel mode:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct event {
    int pid;
    char comm[16];
    bool success;
}; // 1

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} rb SEC(".maps"); // 2

SEC("tp/syscalls/sys_enter_ptrace") // 3
int handle_ptrace(struct trace_event_raw_sys_enter *ctx) {

    size_t pid = bpf_get_current_pid_tgid() >> 32; // 4

    long success = bpf_send_signal(9); // 5

    struct event *e;
    e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (e) {
        e->success = (success == 0);
        e->pid = pid;
        bpf_get_current_comm(&e->comm, sizeof(e->comm)); 
        bpf_ringbuf_submit(e, 0); // 6
    }

    return 0;
}

In [1] we declare the structure that holds the data collected for each event:

Process identifier (PID)
Command name (comm)
Whether the action was successful (success)

We need to declare a ring buffer map [2] to pass this information to our user mode code (we will inspect it later), with a maximum size of 256 KB.
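The value given to max_entries is not arbitrary. As far as we know, the kernel only accepts ring buffer sizes that are a power of two and a multiple of the page size. The following user-space sketch expresses that rule; the helper name ringbuf_size_ok is made up for illustration, and the 4096-byte page size mentioned in the usage note is an assumption about the target machine.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of libbpf): true when `size` looks usable
 * as max_entries for a BPF_MAP_TYPE_RINGBUF map, i.e. it is a power of
 * two and a multiple of the page size. */
bool ringbuf_size_ok(size_t size, size_t page_size)
{
    bool power_of_two = size != 0 && (size & (size - 1)) == 0;
    bool page_aligned = page_size != 0 && size % page_size == 0;

    return power_of_two && page_aligned;
}
```

With a 4096-byte page, ringbuf_size_ok(256 * 1024, 4096) accepts the size used in our map, while a value such as 300 * 1024 is rejected.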

The SEC macro lets the programmer specify the section in which a function or variable is placed within the eBPF object file that clang generates for us. In this case [3], we are telling libbpf that the following function should run whenever the syscalls/sys_enter_ptrace tracepoint fires in the kernel.
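Under the hood, SEC is little more than a compiler section attribute. The sketch below uses a simplified stand-in macro (DEMO_SEC, which is our own name and not libbpf's) in an ordinary user-space file, only to show the mechanism. Real eBPF code should keep using SEC from bpf_helpers.h, and the section name demo_hooks is arbitrary.

```c
/* DEMO_SEC is a simplified stand-in for libbpf's SEC macro: it places the
 * symbol in the named ELF section and keeps it even if nothing calls it. */
#define DEMO_SEC(name) __attribute__((section(name), used))

/* Emitted into the "demo_hooks" section of the object file instead of
 * the default .text section. */
DEMO_SEC("demo_hooks")
int demo_handler(int arg)
{
    return arg + 1;
}
```

After compiling this with gcc or clang on Linux, listing the section headers of the object file (for example with readelf -S) should show demo_hooks next to the usual sections. libbpf reads section names like tp/syscalls/sys_enter_ptrace in the same way to decide the program type and where to attach it.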

We retrieve the PID [4] of the process that triggered the event for logging purposes, and send that process the SIGKILL (9) signal in [5].
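The shift by 32 in [4] is needed because bpf_get_current_pid_tgid() packs two identifiers into one 64-bit value: the thread group ID (what user space calls the PID) in the upper half and the thread ID in the lower half. Here is a small plain C sketch of that unpacking; the function names tgid_of and tid_of are ours, not kernel helpers.

```c
#include <stdint.h>

/* Upper 32 bits: thread group ID, which user space calls the PID. */
uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: the kernel's per-thread ID, which user space calls the TID. */
uint32_t tid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}
```

For a single-threaded process both halves hold the same number, which is why dropping the lower half is enough for our logging.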

Finally, we copy the collected data into the event struct and send it to the user mode code in [6].

User mode code

The following code is executed in user mode:

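// SPDX-License-Identifier: BSD-3-Clause
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <bpf/libbpf.h>
#include "sigkill.skel.h"

static volatile sig_atomic_t exiting;

/* Must match the layout of struct event in sigkill.bpf.c */
struct event {
    int pid;
    char comm[16];
    bool success;
};

/* Ask the polling loop to stop on Ctrl-C or SIGTERM */
static void handle_signal(int sig)
{
    exiting = 1;
}

static int handle_event(void *ctx, void *data, size_t data_sz)
{
    const struct event *e = data;
    if (e->success)
        printf("Killed PID %d (%s) for trying to use ptrace syscall\n", e->pid, e->comm);
    else
        printf("Failed to kill PID %d (%s) for trying to use ptrace syscall\n", e->pid, e->comm);
    return 0;
}

int main(int argc, char **argv)
{
    struct ring_buffer *rb = NULL;
    struct sigkill *skel;
    int err;

    signal(SIGINT, handle_signal);
    signal(SIGTERM, handle_signal);

    skel = sigkill__open(); // 1
    if (!skel) {
        fprintf(stderr, "Failed to open BPF skeleton\n");
        return 1;
    }

    err = sigkill__load(skel); // 2
    if (err) {
        fprintf(stderr, "Failed to load BPF skeleton\n");
        goto cleanup;
    }

    err = sigkill__attach(skel); // 3
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    // 4
    rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), handle_event, NULL, NULL);
    if (!rb) {
        err = -1;
        fprintf(stderr, "Failed to create ring buffer\n");
        goto cleanup;
    }

    printf("Successfully started!\n");
    printf("Sending SIGKILL to any program using the ptrace syscall\n");
    while (!exiting) {
        err = ring_buffer__poll(rb, 100 /* timeout, ms */); // 5
        if (err == -EINTR) {
            err = 0;
            break;
        }
        if (err < 0) {
            fprintf(stderr, "Error polling ring buffer: %d\n", err);
            break;
        }
    }

cleanup:
    ring_buffer__free(rb);
    sigkill__destroy(skel);
    return err < 0 ? -err : 0;
}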

The user mode code is rather simple; it can be summarized in the following steps:

Opening the eBPF object file
Loading this file into the kernel using bpf syscalls
Attaching the loaded program to its tracepoint
Creating the ring buffer structure to receive events from the kernel mode program and registering a callback for it (handle_event)
Polling the ring buffer with a timeout of 100 ms
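The callback registered for the ring buffer receives a raw pointer and a size, so a slightly more defensive version can check the size before casting. The sketch below is one way to do it and not the code used above; the name handle_event_checked is ours, and the struct is repeated only to keep the snippet self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Same layout as the struct shared with the kernel mode program. */
struct event {
    int pid;
    char comm[16];
    bool success;
};

/* Returns 0 when the record was printed, -1 when it was too short to
 * hold a struct event. */
int handle_event_checked(void *ctx, void *data, size_t data_sz)
{
    const struct event *e = data;

    (void)ctx; /* unused in this sketch */
    if (data_sz < sizeof(*e))
        return -1;

    /* %.*s bounds the read in case comm is not NUL-terminated */
    printf("%s PID %d (%.*s)\n",
           e->success ? "Killed" : "Failed to kill",
           e->pid, (int)sizeof(e->comm), e->comm);
    return 0;
}
```

To our understanding, a negative return value from the callback makes ring_buffer__poll stop and report that error, so return 0 instead if you would rather skip a malformed record and keep going.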

Makefile

APP=sigkill

.PHONY: $(APP)
$(APP): skel
    clang sigkill.c -lbpf -lelf -o $(APP)

.PHONY: vmlinux
vmlinux:
    bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

.PHONY: bpf
bpf: vmlinux
    clang -g -O3 -target bpf -c sigkill.bpf.c -o sigkill.bpf.o

.PHONY: skel
skel: bpf
    bpftool gen skeleton sigkill.bpf.o name sigkill > sigkill.skel.h

.PHONY: run
run: $(APP)
    sudo ./$(APP)

.PHONY: clean
clean:
    -rm -f ./*.o ./*.skel.h ./vmlinux.h $(APP)

This Makefile is pretty straightforward: it just defines the steps needed to compile this application, including the use of bpftool to generate the vmlinux.h file.

The vmlinux.h file is a header file generated from the Linux kernel’s BTF (BPF Type Format) information. It contains type definitions and other kernel data structures that eBPF programs need to interact with the kernel. This file is crucial for eBPF development because it provides the necessary context for writing eBPF programs that can interact with various kernel components and data structures.

Setting up a good testing environment

After seeing a working example, we might want to try it ourselves on our lab machine. This is where we start to face problems: dependencies.

Installing them is not a straightforward process, so this post aims to give you a quick Docker environment that will help you compile your eBPF programs. The following Dockerfile has been created for this purpose:

FROM ubuntu:latest

RUN apt-get update && \
    apt-get install -y build-essential git cmake \
                       zlib1g-dev libevent-dev \
                       libelf-dev llvm \
                       clang libc6-dev-i386 \
                       nano pkg-config wget

RUN mkdir /src && git init 
WORKDIR /src

RUN wget https://github.com/bpftrace/bpftrace/releases/download/v0.20.4/bpftrace
RUN chmod +x bpftrace

RUN ln -s /usr/include/x86_64-linux-gnu/asm/ /usr/include/asm

RUN git clone https://github.com/libbpf/libbpf-bootstrap.git && \
    cd libbpf-bootstrap && \
    git submodule update --init --recursive

RUN cd libbpf-bootstrap/libbpf/src && \
    make BUILD_STATIC_ONLY=y && \
    make install BUILD_STATIC_ONLY=y LIBDIR=/usr/lib/x86_64-linux-gnu/

RUN git clone --recurse-submodules https://github.com/libbpf/bpftool.git && \
    cd bpftool/src && \
    make -j$(nproc) && \
    make install

RUN git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git && \
    cp linux/include/uapi/linux/bpf* /usr/include/linux/

Once built, this Docker container already has the tools needed to start building your eBPF tools.
Still, if you try to run something with bpftrace to check that everything is OK, you will encounter the following error:

root@8246a0d0a6c4:/src# ./bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
stdin:1:1-34: ERROR: tracepoint not found: raw_syscalls:sys_enter
tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This fails because DebugFS is not currently mounted.

DebugFS

DebugFS is a special filesystem in the Linux kernel designed for debugging purposes. It provides a simple and efficient way for kernel developers to expose debugging information and controls to userspace.

By mounting the DebugFS filesystem, developers can access various files that represent kernel internals, such as system states, statistics, and debug information for specific subsystems or drivers. To mount it you will need to execute the following command:

mount -t debugfs none /sys/kernel/debug

After mounting it, you will be able to list the tracing folder inside /sys/kernel/debug which contains information about the kernel tracing events that eBPF uses to get information:

root@8246a0d0a6c4:/src# mount -t debugfs none /sys/kernel/debug/
root@8246a0d0a6c4:/src# ls /sys/kernel/debug/tracing
README                      dyn_ftrace_total_info     instances        saved_cmdlines         set_ftrace_notrace_pid  synthetic_events  trace_stat
available_events            dynamic_events            kprobe_events    saved_cmdlines_size    set_ftrace_pid          timestamp_mode    tracing_cpumask
available_filter_functions  enabled_functions         kprobe_profile   saved_tgids            set_graph_function      trace             tracing_max_latency
available_tracers           error_log                 max_graph_depth  set_event              set_graph_notrace       trace_clock       tracing_on
buffer_percent              events                    options          set_event_notrace_pid  snapshot                trace_marker      tracing_thresh
buffer_size_kb              free_buffer               osnoise          set_event_pid          stack_max_size          trace_marker_raw  uprobe_events
buffer_total_size_kb        function_profile_enabled  per_cpu          set_ftrace_filter      stack_trace             trace_options     uprobe_profile
current_tracer              hwlat_detector            printk_formats   set_ftrace_notrace     stack_trace_filter      trace_pipe
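If you want a tool to detect this situation by itself before failing with a confusing error, one option is to look for a debugfs entry in /proc/mounts. The following is a minimal sketch under that assumption; both function names are ours, and it only checks the filesystem type, not where it is mounted.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if one line of /proc/mounts describes a filesystem of type
 * `fstype`. Each line reads: device mountpoint fstype options dump pass */
int mount_line_has_type(const char *line, const char *fstype)
{
    char type[64];

    if (sscanf(line, "%*s %*s %63s", type) != 1)
        return 0;
    return strcmp(type, fstype) == 0;
}

/* Returns 1 if any mounted filesystem has type `fstype`, 0 if none does,
 * and -1 if /proc/mounts cannot be read. */
int fs_type_mounted(const char *fstype)
{
    char line[4096];
    int found = 0;
    FILE *f = fopen("/proc/mounts", "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof(line), f))
        found = mount_line_has_type(line, fstype);
    fclose(f);
    return found;
}
```

Calling fs_type_mounted("debugfs") at startup lets the program print a hint about the mount command instead of a tracepoint error.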

Executing the BPF Docker container

To run this Docker container, you will need to add the --privileged flag. The default seccomp profile blocks the bpf syscall unless the container holds the right capability, and loading a BPF program into the kernel requires capabilities such as CAP_BPF (a relatively new one) or, as the brute-force option, CAP_SYS_ADMIN. The --privileged flag grants all capabilities and disables seccomp filtering for the container.

If we examine the default seccomp profile, we can see the following:

If CAP_BPF is set, the bpf syscall can be performed:

"names": [
    "bpf"
],
"action": "SCMP_ACT_ALLOW",
"includes": {
    "caps": [
        "CAP_BPF"
    ]
}

If CAP_SYS_ADMIN is set, the bpf syscall can be performed as well:

"names": [
    "bpf",
    "clone",
    "clone3",
    "fanotify_init",
    ...
],
"action": "SCMP_ACT_ALLOW",
"includes": {
    "caps": [
        "CAP_SYS_ADMIN"
    ]
}

The default action for syscalls that are not explicitly allowed is to deny them.
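To find out from inside the container which of these capabilities the process actually holds, we can read the CapEff mask from /proc/self/status. The sketch below assumes the usual capability numbers from linux/capability.h (21 for CAP_SYS_ADMIN and 39 for CAP_BPF); the macro and function names are ours.

```c
#include <stdint.h>
#include <stdio.h>

#define CAP_SYS_ADMIN_BIT 21 /* CAP_SYS_ADMIN in <linux/capability.h> */
#define CAP_BPF_BIT       39 /* CAP_BPF, available since Linux 5.8 */

/* Returns 1 when capability number `cap` is set in the bitmask `mask`. */
int cap_is_set(uint64_t mask, int cap)
{
    return (int)((mask >> cap) & 1);
}

/* Reads the effective capability mask (CapEff) of the calling process.
 * Returns 0 on success, -1 if it cannot be read. */
int read_effective_caps(uint64_t *mask)
{
    char line[256];
    unsigned long long value;
    int ret = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "CapEff: %llx", &value) == 1) {
            *mask = value;
            ret = 0;
            break;
        }
    }
    fclose(f);
    return ret;
}
```

A program can call read_effective_caps and then test both bits with cap_is_set before trying to load anything. Keep in mind that holding the capability is necessary but may not be sufficient, since the seccomp profile and other restrictions still apply.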

Developing More Complex eBPF Programs

Once you have your development environment set up, you can start exploring more complex eBPF programs. eBPF’s versatility allows you to tackle various tasks, from advanced networking features to sophisticated performance monitoring tools.

Advanced Networking

eBPF can be used to implement advanced networking features such as load balancing, firewall rules, and network address translation (NAT). With libbpf, you can write eBPF programs that filter and manipulate network packets, attach them to various networking hooks, and dynamically manage network traffic based on custom logic. One example written with bpftrace is tcpconnect.bt, from the official bpftrace repo:

#!/usr/bin/env bpftrace

#ifndef BPFTRACE_HAVE_BTF
#include <linux/socket.h>
#include <net/sock.h>
#else
#define AF_INET   2 /* IPv4 */
#define AF_INET6 10 /* IPv6 */
#endif

BEGIN
{
  printf("Tracing tcp connections. Hit Ctrl-C to end.\n");
  printf("%-8s %-8s %-16s ", "TIME", "PID", "COMM");
  printf("%-39s %-6s %-39s %-6s\n", "SADDR", "SPORT", "DADDR", "DPORT");
}

kprobe:tcp_connect
{
  $sk = ((struct sock *) arg0);
  $inet_family = $sk->__sk_common.skc_family;

  if ($inet_family == AF_INET || $inet_family == AF_INET6) {
    if ($inet_family == AF_INET) {
      $daddr = ntop($sk->__sk_common.skc_daddr);
      $saddr = ntop($sk->__sk_common.skc_rcv_saddr);
    } else {
      $daddr = ntop($sk->__sk_common.skc_v6_daddr.in6_u.u6_addr8);
      $saddr = ntop($sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr8);
    }
    $lport = $sk->__sk_common.skc_num;
    $dport = $sk->__sk_common.skc_dport;

    // Destination port is big endian, it must be flipped
    $dport = bswap($dport);

    time("%H:%M:%S ");
    printf("%-8d %-16s ", pid, comm);
    printf("%-39s %-6d %-39s %-6d\n", $saddr, $lport, $daddr, $dport);
  }
}

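This script traces all active TCP connections by attaching a kprobe to the kernel's tcp_connect function, which runs when a process initiates a connection (for example through the connect() syscall), and reads the addresses and ports from the socket structure.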
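The byte swap applied to the destination port deserves a closer look. The port is stored in network byte order (big endian), so on a little-endian machine the two bytes appear reversed when read as an integer. Here is a plain C sketch of the same operation; swap16 is our own helper, and in real user-space code ntohs() from arpa/inet.h is the usual choice.

```c
#include <stdint.h>

/* Swaps the two bytes of a 16-bit value, which is what the script's
 * bswap() call does to the 16-bit destination port. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

For example, port 80 is stored as the bytes 0x00 0x50; a little-endian CPU reads them as 0x5000, and swap16(0x5000) gives back 80.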

Security Enforcement

eBPF can enhance system security by implementing custom security policies and monitoring system calls. You can write eBPF programs that detect and prevent suspicious activities, enforce access control policies, and audit system events. eBPF’s ability to run in the kernel space with minimal overhead makes it an ideal choice for security-sensitive applications.

For example, seccomp filtering is itself implemented with BPF programs (the classic BPF flavor rather than eBPF), as the kernel docs explain:

“Seccomp filtering provides a means for a process to specify a filter for incoming system calls. The filter is expressed as a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data operated on is related to the system call being made: system call number and the system call arguments. This allows for expressive filtering of system calls using a filter program language with a long history of being exposed to userland and a straightforward data set.”

The actual syscall filter is done using BPF, how cool is that?
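To make this concrete, here is a minimal sketch of a seccomp filter that makes ptrace fail with EPERM for the calling process, written as a classic BPF program with the macros from linux/filter.h. It does not use libbpf, it assumes an x86_64 machine (other architectures are killed by the first check), and the names deny_ptrace and install_deny_ptrace are ours.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Classic BPF program evaluated by the kernel on every syscall. */
static struct sock_filter deny_ptrace[] = {
    /* [0] load the architecture of the calling process */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] x86_64: skip the kill below; anything else: fall through */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* [2] unexpected architecture: kill the process */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] load the syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [4] ptrace: go to [5]; any other syscall: jump to [6] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
    /* [5] make ptrace fail with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* [6] allow everything else */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter for the calling process and its future children.
 * Returns 0 on success or a negative errno value. */
int install_deny_ptrace(void)
{
    struct sock_fprog prog = {
        .len = sizeof(deny_ptrace) / sizeof(deny_ptrace[0]),
        .filter = deny_ptrace,
    };

    /* Required so that an unprivileged process may install a filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -errno;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
        return -errno;
    return 0;
}
```

Unlike our eBPF tool, which watches every process on the machine, this filter only applies to the process that installs it and to the children it creates afterwards, and it cannot be removed once installed.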

Conclusion

This was the first post talking about eBPF! More will come, thanks for reading until the end.