Jan 21, 2023

Benchmarking RISC-V: VisionFive 2 vs the world

I recently got my "super early bird" version of the VisionFive 2 RISC-V board. As the documentation says, it is supposed to be "the world’s first high-performance RISC-V single board computer (SBC) with an integrated GPU".

Since I have access to several kinds of RISC-V, ARM and x86 boards, let's see if the claim about performance holds up! We will look at both processing performance and energy efficiency.

Updated 2023-01-22: added Kobol Helios64 performance results from Max

Updated 2023-01-23: added results (performance and power) for Raspberry Pi 3B+

Updated 2023-01-24:

  • re-did all SBC power measurements, with significant changes for the VisionFive 2
  • fixed completely wrong performance measurements for the Raspberry Pi 3B+, caused by a faulty USB cable (which resulted in a huge 2.5x drop in performance!)
  • re-did the Raspberry Pi 1 measurements with Debian 11 and without the faulty USB cable
  • added results for the Raspberry Pi 3B: it was not fried after all, that too was the faulty USB cable
  • re-did the Raspberry Pi 4 power measurements to make them more comparable (avoiding POE)

Power measurement setup: the VisionFive 2 is visible at the bottom with its serial cable, the wattmeter is on the left. The other visible boards are Raspberry Pis.

A disclaimer on methodology

Benchmarking CPU performance correctly requires deep software and hardware expertise, which I certainly cannot claim to have. I chose two basic computing primitives, hoping that they are representative enough: crypto (sha1 and chacha20-poly1305 using openssl) and decompression (xz).

All numbers shown in this article are very "unscientific": I did not formally repeat runs to account for variability, and there are many factors that I purposefully ignore (kernel version, software version, compiler...). That being said, I tried to document these parameters as much as possible to help further analysis.

Overall, the goal is to give a rough idea of the CPU performance and power efficiency you can expect from RISC-V hardware.

Hardware and software environment

The VisionFive 2 has a StarFive JH7110 SoC, with 4 SiFive U74 cores at 1.5 GHz.

The original VisionFive had a StarFive JH7100 SoC with 2 SiFive U74 cores at 1.2 GHz. It had known hardware design issues: a non-coherent bus that required frequent L2 cache flushes, and a slow RAM controller. So, the new SoC should be significantly faster.

Software-wise, I built a Linux kernel from the non-upstream repository (5.18-based for the VisionFive 1, 5.15-based for the VisionFive 2). I built a Debian rootfs using the Debian guide for the VisionFive. That guide works almost the same way for the VisionFive 2, but that (as well as the upstream status of kernel support) will be the topic of another article.

The other systems used in the comparison mostly run Debian or Ubuntu, with a few exceptions (NixOS, Armbian).

CPU performance

Here are the three benchmarks I will use:

openssl speed -evp sha1
openssl speed -evp chacha20-poly1305
# https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz (116606704 bytes)
time xz -d < /dev/shm/linux-5.10.tar.xz > /dev/null

The xz benchmark uses decompression of a known file to ease reproducibility, and this file is stored in memory (/dev/shm) to make sure we have no disk I/O.
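
For reference, the test file can be fetched straight into the tmpfs (wget shown as an example; any download tool will do):

# download the reference file into a tmpfs so that no disk I/O is involved
wget -O /dev/shm/linux-5.10.tar.xz https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz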

All benchmarks are using a single CPU core.

I converted all results into MB/s for easier comparison, taking the largest block size for the openssl results (16 KiB). As a reminder, one MB equals 1000000 bytes. For the xz benchmark, the real elapsed time is used.
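
For instance, here is roughly how an xz timing turns into a throughput (the elapsed time below is a placeholder, not an actual measurement):

# file size in bytes / elapsed wall-clock time / 10^6 = throughput in MB/s
ELAPSED=25.0   # hypothetical "real" time reported by `time`, in seconds
echo "116606704 / $ELAPSED / 1000000" | bc -l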

To ease comparisons with other hardware, I computed a "speedup" of each board compared to the VisionFive 2: 1x means the same performance, 2x means twice as fast, 0.33x means three times as slow, etc.

You can find the full output of the benchmarks for each machine here.

Most hardware is either running locally, from the Compile Farm, or from Grid'5000. The Celeron G1840T system belongs to Deuxfleurs. The Kobol Helios64 result is courtesy of Max. Raspberry Pi 3B+ and 4 are courtesy of $DAYJOB.

Hardware | sha1 | chacha20 | xz decompr.

RISC-V
VisionFive 2 (Debian unstable) | 97.5 MB/s | 50.8 MB/s | 4.66 MB/s
VisionFive 1 (gcc91, Debian unstable) | 64.1 MB/s (0.66x) | 33.0 MB/s (0.65x) | 2.68 MB/s (0.58x)
HiFive Unmatched (gcc92, Ubuntu 22.04) | 34.7 MB/s (0.36x) | 40.6 MB/s (0.80x) | 3.12 MB/s (0.67x)

ARM / ARM64
Raspberry Pi 1 (Debian 11, armv6l) | 27.3 MB/s (0.28x) | 22.9 MB/s (0.45x) | 0.928 MB/s (0.20x)
Raspberry Pi 3B (Debian 11) | 149 MB/s (1.53x) | 188 MB/s (3.70x) | 4.35 MB/s (0.93x)
Raspberry Pi 3B+ (Debian 11) | 174 MB/s (1.79x) | 225 MB/s (4.44x) | 5.02 MB/s (1.08x)
Raspberry Pi 4B (Debian 11) | 192 MB/s (1.97x) | 266 MB/s (5.24x) | 6.82 MB/s (1.46x)
Kobol Helios64 (Armbian 22.02.1) | 979 MB/s (10x) | 323 MB/s (6.36x) | 7.64 MB/s (1.64x)
Ampere eMAG (gcc185, CentOS 8) | 903 MB/s (9.26x) | 296 MB/s (5.83x) | 12.3 MB/s (2.64x)
Mac M1 (gcc103, Debian 12) | 2244 MB/s (23x) | 1710 MB/s (33.7x) | 21.0 MB/s (4.51x)

x86_64
Celeron G1840T (NixOS 22.11) | 599 MB/s (6.14x) | 678 MB/s (13.3x) | 11.2 MB/s (2.40x)
Xeon Gold 6130 (dahu.g5k, Deb. 11) | 1045 MB/s (10.7x) | 2611 MB/s (51.4x) | 17.1 MB/s (3.67x)
i7-8086K (Ubuntu 20.04) | 1414 MB/s (14.5x) | 2971 MB/s (58.5x) | 22.6 MB/s (4.85x)
AMD EPYC 7642 (neowise.g5k, Deb. 11) | 1706 MB/s (17.5x) | 1796 MB/s (35.4x) | 17.0 MB/s (3.65x)
AMD EPYC 7513 (grat.g5k, Deb. 11) | 1875 MB/s (19.2x) | 2460 MB/s (48.4x) | 21.3 MB/s (4.57x)

SHA1 and Chacha20-poly1305 results vary a lot across systems, which may be due to optimizations in certain versions of OpenSSL (vectorisation, assembly implementations) or even hardware acceleration for SHA1. They also seem to be sensitive to memory bandwidth: the Raspberry Pis have much better memory bandwidth than the RISC-V boards. In contrast, the xz results seem much more representative of raw CPU performance (clock frequency, CPU cache, out-of-order execution, memory access patterns...).

To take clock frequency out of the equation, I am now showing xz results normalized by clock frequency, measured in "CPU cycles per processed byte" (basically dividing the clock frequency by the xz throughput). This should give an idea of the overall performance of the CPU architecture for this specific decompression task. Beware, lower values are now better!
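
As an illustration, here is the computation for the VisionFive 2 (1.5 GHz, 4.66 MB/s on the xz benchmark above):

# cycles per byte = clock frequency (cycles/s) / throughput (bytes/s)
echo "1.5 * 10^9 / (4.66 * 10^6)" | bc -l   # ≈ 322 cycles/byte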

Hardware | Max clock frequency | xz -d cycles/byte (lower is better)
VisionFive 2 | 1.50 GHz | 322
VisionFive 1 | 1.20 GHz | 448 (0.72x)
HiFive Unmatched | 1.20 GHz | 385 (0.84x)
Raspberry Pi 1 | 0.70 GHz | 754 (0.43x)
Raspberry Pi 3B | 1.20 GHz | 276 (1.17x)
Raspberry Pi 3B+ | 1.40 GHz | 279 (1.15x)
Raspberry Pi 4B | 1.50 GHz | 220 (1.46x)
Kobol Helios64 | 1.80 GHz | 236 (1.37x)
Ampere eMAG | 3.00 GHz | 244 (1.32x)
Mac M1 | 3.00 GHz | 143 (2.25x)
Celeron G1840T | 2.50 GHz | 223 (1.44x)
Xeon Gold 6130 | 3.70 GHz | 216 (1.49x)
i7-8086K | 5.00 GHz | 221 (1.45x)
AMD EPYC 7642 | 3.30 GHz | 194 (1.66x)
AMD EPYC 7513 | 3.65 GHz | 171 (1.88x)

Here are some of the main highlights of these results:

  • VisionFive 2 single-core performance is 52% to 74% higher than the VisionFive 1. This is very good compared to the 25% clock frequency improvement: when normalizing by clock frequency, the VisionFive 2 is 39% faster per MHz than the VisionFive 1.
  • VisionFive 2 is also 25% to 50% faster than the HiFive Unmatched. When normalizing by clock frequency, the VisionFive 2 is 20% faster per MHz than the Unmatched. The Unmatched was itself slightly faster than the VisionFive 1 on a single-core basis.
  • VisionFive 2 is roughly as fast as a Raspberry Pi 3B/3B+ on the xz benchmark, but much slower for SHA1 and Chacha20.
  • VisionFive 2 is still around 1.5 times slower than a Raspberry Pi 4 (and 5 times slower on Chacha20).

Here are other interesting insights:

  • The Raspberry Pi 1 always felt really slow. Well, now I know it's objectively really slow. Even when taking into account its low clock frequency of 700 MHz, performance per MHz is still really poor.
  • The Intel CPUs (from 2014, 2017 and 2018 respectively) have very similar performance per MHz for this task, despite being very different in terms of frequency, number of cores and price. This indicates that they basically share the same kind of architectural design.
  • The Raspberry Pi 4 and the Helios64 have good performance per MHz for a SoC, even comparable to a $1900 Intel CPU from 2017! Of course, the Intel CPU has many more cores, and there may be other workloads where Intel CPUs are much better.
  • The AMD EPYC CPUs (Zen 2 and Zen 3) have very good performance per MHz for this workload, and there is a clear improvement from Zen 2 to Zen 3.
  • As always, the Mac M1 is really impressive and easily smashes all other processors I could test on a per-MHz basis.

As a final note: remember that this is a single benchmark and is not representative of all kinds of computing workloads. I suspect xz is quite sensitive to the amount of CPU cache and to memory latency.

Energy consumption

Now that we have an idea of CPU performance, the other important criterion is energy consumption. Here, I am interested in whole-system energy consumption. I could only measure it for systems I have locally, so only a subset of the previous machines is tested here. Technically, I could have used the wattmeters available on Grid'5000, but it makes little sense to compare the power consumption of a big server with that of a small embedded board.

All figures below are taken using a basic Perel plug-in wattmeter on 230V. The wattmeter gives the "active" (or real) power in Watts, as well as the power factor. All figures include the power transformer, which is either: an Akashi ALT2USBACCH USB transformer designed for up to 2.4A (VisionFive 1 & 2, Raspberry Pis); the stock Lenovo power transformer (Celeron G1840T); or an ATX power supply (HiFive Unmatched, i7-8086K).

For each system, I measure power consumption in the following situations: idle; 1 CPU core at 100%; half of the CPU cores at 100%; all CPU cores at 100% (ignoring hyper-threads). Each measurement runs for only a few seconds (just long enough to reach a steady state) in order to avoid thermal throttling.

The workload is a simple infinite loop in bash: while :; do :; done. All systems run Linux (various versions and distributions), have one NIC up, and no screen or other peripheral attached.
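
For the multi-core cases, the load can be generated by starting several such loops in the background, roughly like this (a sketch, not necessarily the exact commands I used; the loop count is adjusted to each board):

# start N busy loops (e.g. N=2 for "half cores" on a 4-core board)
N=2
for i in $(seq "$N"); do ( while :; do :; done ) & done
# take the power reading, then stop the load
kill $(jobs -p)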

Note: I am not very confident in the absolute power values shown below (because I don't really trust the wattmeter or the USB transformer). However, since I did all measurements in the same conditions, the values are comparable with each other.

Hardware | Idle | 1 core | Half cores | All cores
VisionFive 2 (4 cores, 8 GB RAM) | 7.4 W | 10.4 W | 11.2 W | 13.1 W
VisionFive 1 (2 cores, 8 GB RAM) | 10.6 W | 11.1 W | - | 11.6 W
HiFive Unmatched (4 cores, 16 GB RAM) | 56.8 W | 57.7 W | 58.6 W | 60.7 W
Raspberry Pi 1 (1 core, 512 MB RAM) | 5.9 W | 6.2 W | - | -
Raspberry Pi 3B (rev 1.2, 4 cores, 1 GB RAM) | 4.5 W | 6.5 W | 8.7 W | 13.8 W
Raspberry Pi 3B+ (4 cores, 1 GB RAM) | 7.0 W | 9.7 W | 12.2 W | 18.0 W
Raspberry Pi 4B (rev 1.5, 4 cores, 2 GB RAM) | 4.6 W | 6.8 W | 8.3 W | 11.1 W
Celeron G1840T (2 cores) | 12 W | 18 W | - | 23.5 W
i7-8086K (6 c. / 12 threads) | 23.4 W | 61.5 W | 79.5 W | 112.7 W

Clearly, the VisionFive 2 is quite power-efficient compared to the older RISC-V boards. According to its documentation, it can run without any heatsink or fan for bursty loads (e.g. web browsing), but a fan is recommended for long computations. This is consistent with my power consumption measurements.

Interestingly, the Raspberry Pi 3B+ has a similar power profile to the VisionFive 2. This makes sense because they are in the same class of devices: same number of cores, similar maximum clock frequency, similar performance. But it is still noteworthy that the relatively young SoC on the VisionFive 2 consumes about as much power as the more mature SoC on the Raspberry Pi 3B+.

We can also observe that Intel is much better at dynamic frequency scaling, which helps achieve low power usage when the CPU is idle. As far as I know, the SoCs in the VisionFive 1 and the HiFive Unmatched have no frequency scaling, which explains their near-constant power usage. The VisionFive 2 does have frequency scaling, so it already fares much better (-45% power usage when idle compared to fully loaded).
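
On Linux, a quick way to check whether frequency scaling is active on a given board is the cpufreq sysfs interface (these files only exist when a cpufreq driver is loaded):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # current frequency, in kHz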

Here are some details about the hardware to put these numbers into context:

  • VisionFive 1: no fan, kernel 5.18 (Debian). Power consumption changes significantly with die temperature (9 W idle at 36 °C, 10.4 W idle at 50 °C)
  • VisionFive 2: no fan, kernel 5.15 (Debian), no NVMe, 100M NIC. Using the gigabit NIC would add 0.5 W of power usage.
  • Raspberry Pi 3B: rev 1.2, 1 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.20 GHz max frequency.
  • Raspberry Pi 3B+: 1 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.40 GHz max frequency.
  • Raspberry Pi 4B: rev 1.5, 2 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.50 GHz max frequency.
  • Celeron G1840T: 800 MHz idle frequency, 2.5 GHz max frequency. Lenovo ThinkCentre M73, 4 GB DDR3, ST500LM021-1KJ15 disk.
  • i7-8086K: 800 MHz idle frequency, 4 GHz max frequency, 5 GHz turbo frequency. ASRock H310CM-HDV/M.2 motherboard, 16 GB + 8 GB DDR4, Samsung 980 500GB NVMe, ATX power supply, 2 case fans

Note: earlier versions of this article used POE power measurements from a switch (for the Raspberry Pis, with the POE hat). After re-doing the measurements with the USB power supply and the plug-in wattmeter, it turned out that the power readings reported by the POE switch were substantially lower than those of the wattmeter (probably because the POE switch does not account for the AC-to-DC power converter). Moreover, the POE values were not stable. In the end, I decided to remove these POE values and only use the USB power supply to enable a fair comparison.

CPU performance vs. energy

Now that we have both CPU performance and energy consumption, we can mix the two results to look at energy efficiency. The most reliable figure in the table below is single-core efficiency: it is obtained by simply dividing the result of the single-core performance benchmark for xz by the single-core power consumption. I also extrapolate some figures for all-cores efficiency, but this value should be taken with a grain of salt: it is obtained by multiplying single-core performance by the number of cores (excluding hyper-threads) and dividing the total by the measured all-cores power consumption. Many effects such as thermal throttling, frequency boost for single-core load, and shared cache between cores may decrease the actual all-cores performance and thus decrease the actual all-cores efficiency compared to the figures below.
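
As a concrete example, here is the computation for the VisionFive 2, using its xz throughput and power figures from the tables above:

# single-core efficiency: xz throughput / 1-core power
echo "4.66 / 10.4" | bc -l       # ≈ 0.448 MB/s/W
# extrapolated all-cores efficiency: throughput x 4 cores / all-cores power
echo "4.66 * 4 / 13.1" | bc -l   # ≈ 1.42 MB/s/W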

Hardware | Single-core efficiency | All-cores efficiency (extrapolated)
VisionFive 2 (4 cores) | 0.448 MB/s/W | 1.42 MB/s/W
VisionFive 1 (2 cores) | 0.241 MB/s/W | 0.462 MB/s/W
HiFive Unmatched (4 cores) | 0.0541 MB/s/W | 0.206 MB/s/W
Raspberry Pi 1 (1 core) | 0.150 MB/s/W | -
Raspberry Pi 3B (4 cores) | 0.670 MB/s/W | 1.26 MB/s/W
Raspberry Pi 3B+ (4 cores) | 0.517 MB/s/W | 1.12 MB/s/W
Raspberry Pi 4B (4 cores) | 1.00 MB/s/W | 2.46 MB/s/W
Celeron G1840T (2 cores) | 0.622 MB/s/W | 0.953 MB/s/W
i7-8086K (6 c. / 12 threads) | 0.367 MB/s/W | 1.20 MB/s/W

Overall, the VisionFive 2 is much more energy-efficient than earlier RISC-V boards: it is 2 to 3 times more energy-efficient than the VisionFive 1, and 7 to 8 times more energy-efficient than the Unmatched. It may seem counter-intuitive that the Unmatched is so inefficient, but that is probably because of its larger form factor, power-hungry PCIe and DDR4, and the need for an ATX power supply that may not be very efficient at low load.

Similarly, even though x86_64 hardware is much faster than the VisionFive 2 (2.4 times to 4.8 times faster), it has roughly the same energy efficiency! If you have moderate computing needs, the VisionFive 2 is an efficient alternative to bigger systems.

Compared to the Raspberry Pi 3B and 3B+, the VisionFive 2 again has similar energy-efficiency. This makes sense because it has roughly the same performance and the same power consumption.

Finally, the Raspberry Pi 4 is the real winner on the efficiency metric: the VisionFive 2 is only half as energy-efficient as a Raspberry Pi 4.

Conclusion

When looking at single-core CPU performance, the VisionFive 2 is roughly 75% faster than the original VisionFive. Since it also has twice the core count, that means roughly 3.5 times the aggregate throughput, assuming perfect scaling. And since it has a similar power consumption, it is also 2 to 3 times more energy-efficient. So this is definitely a very big improvement.

Compared to the HiFive Unmatched (which technically is not even an SBC), the VisionFive 2 still outperforms it by 50%, and is 7 to 8 times more energy-efficient. So, as far as I can tell, the claim about it being a "high-performance RISC-V SBC" is true.

When comparing with Raspberry Pis, the VisionFive 2 is about as fast as a Raspberry Pi 3B+, although much slower on memory-heavy benchmarks, and it is about as energy-efficient. However, a Raspberry Pi 4 is still about 46% faster and twice as energy-efficient. As far as I can tell, both SoCs are built on a 28 nm process, so we would ideally expect the same energy efficiency.

Compared to low-power x86_64 systems, the VisionFive 2 is of course slower in raw performance, but at the same time it is just as energy-efficient. This is a general advantage that SBCs have over more complete systems: they have far fewer peripherals, are less extensible and generally offer lower performance, but they are much more energy-efficient.

Again, remember that all figures discussed here are approximate, and specific benchmark results cannot be extrapolated to generic performance results for all applications.

Overall, the VisionFive 2 is a big step in the right direction, and this kind of RISC-V hardware can definitely compete with recent ARM boards since they have very similar performance-energy tradeoffs.

More pictures

VisionFive 2 in its box (I removed the antistatic wrapping)

VisionFive 2 front: audio, 4xUSB, HDMI, 2xNIC (one of them a 100M NIC, specific to the super early bird version)

VisionFive 2 rear: USB-C power input, reset button, GPIOs

VisionFive 2 back: NVMe M.2 slot, micro-SD card slot

Aug 13, 2022

Debugging eBPF-enabled programs in Docker

These days, I'm adding XDP offloading to l2tpns, an L2TP server used in production by several non-profit ISPs in France.

While doing that, I need to test whether l2tpns can successfully load XDP programs into the kernel. But I don't want to run it directly on my Debian host: it might break network connectivity, and l2tpns also updates the kernel routing table. So, let's just run l2tpns in Docker and let it break things there! It turns out this is not so easy.

eBPF and XDP

As a reminder, XDP is a kernel mechanism that lets you load custom eBPF programs that execute right in the network device driver. You write your eBPF program in C, load it into the kernel from userspace with a simple system call, and from that point on, your program can process network packets in the kernel, before the rest of the kernel has even started parsing them! For a project like l2tpns, this is extremely powerful, fast and flexible, because we should be able to offload the bulk of the encapsulation/decapsulation work to the kernel while keeping a lot of flexibility.

That being said, the eBPF ecosystem is still young and moving fast, and the whole software architecture that makes this work is actually very complex. In the end, you always end up with weird errors that can be hard to track down, especially when trying to run XDP in Docker!

What I want to debug

In this case, I'm extending l2tpns so that it loads XDP programs on network interfaces when it starts. The basic process looks like this with libbpf (error handling omitted):

#include <net/if.h>      // if_nametoindex()
#include <bpf/libbpf.h>  // bpf_prog_load(), bpf_set_link_xdp_fd() (legacy libbpf API)

const char *xdp_filename = "/path/to/xdp_prog.o";
const char *if_name = "eth0";
__u32 ifindex;
int prog_fd = -1;
struct bpf_object *obj;
__u32 xdp_flags = 0;

// Load XDP program into the kernel
bpf_prog_load(xdp_filename, BPF_PROG_TYPE_XDP, &obj, &prog_fd);

// Find network interface index by name
ifindex = if_nametoindex(if_name);

// Attach XDP program to the network interface
bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);

See the xdp-tutorial repository for more complete examples, but as a starting point this is the basic functionality I want to debug in Docker.

Most programs that manipulate eBPF leverage libbpf to do the hard work. As such, the debugging steps below can be generalized to any eBPF-enabled userspace program.

Basic Docker setup

To keep things simple, I only want to run l2tpns in a container. I will keep developing and building on my Debian host. So, let's get started with a simple Dockerfile that installs the required libraries and creates a minimal config to make l2tpns happy:

# Dockerfile used to test l2tpns during development.
# Do not use in production!

FROM debian:bullseye

RUN mkdir -p /etc/l2tpns; echo "10.10.10.0/24" > /etc/l2tpns/ip_pool
RUN apt update && apt install -y libbpf0 libcli1.10 iproute2

WORKDIR /src

VOLUME /src

ENTRYPOINT ["/src/l2tpns"]

My Debian host is running Bullseye, so I use the same distro in the container to make sure I have the same libraries.
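
As a quick sanity check (hypothetical commands, assuming libbpf0 is also installed on the host), the library version can be compared between host and image; the --entrypoint override is needed because the image's entrypoint is l2tpns:

$ dpkg -s libbpf0 | grep '^Version'                                              # on the host
$ docker run --rm --entrypoint dpkg l2tpns:latest -s libbpf0 | grep '^Version'   # in the image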

Build the image from the Dockerfile:

$ docker build - -t l2tpns:latest < Dockerfile

Then give it a try (from the host, in the l2tpns git repository):

$ make -j4
# To send all logs to stderr
$ sed -i -e 's/set log_file/#set log_file/' etc/startup-config.default
# Run docker image with parameters
$ docker run -it --rm -v $PWD:/src l2tpns:latest -c etc/startup-config.default

This yields an error:

Can't open /dev/net/tun: No such file or directory

Ok, this first error is unrelated to XDP: l2tpns needs to create a tun interface and it cannot. Let's fix this:

$ docker run -it --rm -v $PWD:/src --cap-add=NET_ADMIN --device=/dev/net/tun l2tpns:latest -c etc/startup-config.default

Now we start seeing the interesting stuff:

libbpf: Error in bpf_object__probe_loading():Operation not permitted(1).
Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y)
and/or that RLIMIT_MEMLOCK is set to big enough value.

From this point on, I will omit the tun-related options from the examples, but for the specific case of l2tpns they are still needed.

Allowing the BPF syscall

Obviously, to load an eBPF program into the kernel, you need to make a syscall at some point. This is the role of the bpf() syscall, which is also used for other eBPF-related functionality.

There is a new CAP_BPF capability that allows a process to use the bpf() syscall without full root privileges. It was introduced in Linux 5.8 according to capabilities(7), which is good because Debian bullseye runs a 5.10 kernel. Let's try:

$ docker run -it --rm -v $PWD:/src --cap-add=BPF l2tpns:latest -c etc/startup-config.default

Result:

docker: Error response from daemon: invalid CapAdd: unknown capability: "CAP_BPF".

Crap. Maybe my Docker version is too old to know about this capability. Let's just use a bigger hammer and settle for CAP_SYS_ADMIN, which grants a lot of privileges, including BPF:

$ docker run -it --rm -v $PWD:/src --cap-add=SYS_ADMIN l2tpns:latest -c etc/startup-config.default

Result:

libbpf: Error in bpf_object__probe_loading():Operation not permitted(1).
Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y)
and/or that RLIMIT_MEMLOCK is set to big enough value.

Well, this is the exact same error as before!

Configuring limits in the container

Helpfully, the error message mentions something about the "memlock" limit. Let's have a look at the limits in a simple Debian bullseye container:

$ docker run -it --rm debian:bullseye /bin/sh -c "ulimit -a"

Since ulimit is a shell builtin, we cannot run it directly as the command from Docker.

Result:

time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     unlimited
memory(kbytes)       unlimited
locked memory(kbytes) 64
process              unlimited
nofiles              1048576
vmemory(kbytes)      unlimited
locks                unlimited
rtprio               0

We are interested in the "locked memory" limit. 64 KB is indeed on the low side (try comparing this value with your host system).
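
For comparison, the same limit can be checked directly on the host (the default varies with the distribution and systemd configuration):

$ ulimit -l
$ grep "Max locked memory" /proc/self/limits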

Looking at the relevant Docker documentation, we find there's an option we can pass to Docker to raise this limit:

$ docker run -it --rm --ulimit memlock=1073741824 debian:bullseye /bin/sh -c "ulimit -l"
1048576

That looks much better! Now on the real container:

$ docker run -it --rm -v $PWD:/src --ulimit memlock=1073741824 --cap-add=SYS_ADMIN l2tpns:latest -c etc/startup-config.default
libbpf: map 'sessions_table': failed to create: Invalid argument(-22)

Ok, we still have an error, but it looks application-specific (libbpf fails to create a map that is defined in the l2tpns code).

EDIT 2022-08-15: it turned out to be indeed a programming error: BPF array maps MUST have a 32-bit key size, and I was trying to create a map with a 16-bit key size. It's hard to debug because there is no detailed error reporting: the syscall simply fails with EINVAL. Here is what strace sees, which is not really helpful:

bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=2, value_size=20,
                     max_entries=60000, map_flags=0, inner_map_fd=0,
                     map_name="sessions_table", map_ifindex=0, btf_fd=0,
                     btf_key_type_id=0, btf_value_type_id=0,
                     btf_vmlinux_value_type_id=0},
    72)
  = -1 EINVAL (Invalid argument)
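
For reference, a trace like the one above can be captured by filtering on the bpf() syscall, along these lines (assuming strace is available where l2tpns runs; tracing inside the container may need extra privileges depending on the Docker version):

$ strace -f -e trace=bpf ./l2tpns -c etc/startup-config.default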

After fixing this bug, libbpf happily creates the map in the kernel:

libbpf: map 'sessions_table': created successfully, fd=8

Conclusion

So far, after a bit of effort, I could get basic BPF functionality to work in a Docker container for debugging purposes! Of course, for further debugging, you would need tools such as bpftool to dump the XDP programs, observe the behaviour of the program by sending packets to the interface, and so on. But that part of the work should be quite similar whether you use Docker or not. If it turns out to be more difficult than expected, I will update the article!
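
As a hedged starting point for that next step, bpftool can already list what ended up in the kernel (assuming it is installed in the same environment and run with enough privileges):

# list loaded eBPF programs and maps
$ bpftool prog show
$ bpftool map show
# show XDP programs attached to network interfaces
$ bpftool net show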