Jan 21, 2023
I recently got my "super early bird" version of the VisionFive 2 RISC-V board.
As the documentation says, it is supposed to be "the world’s first high-performance RISC-V single board computer (SBC) with an integrated GPU".
Since I have access to several kinds of RISC-V, ARM and x86 boards, let's
see if the claim about performance holds! We will look both at
processing performance and energy efficiency.
Updated 2023-01-22: added Kobol Helios64 performance results from Max
Updated 2023-01-23: added results (performance and power) for Raspberry Pi 3B+
Updated 2023-01-24:
- re-done all SBC power measurements, significant changes for VisionFive 2
- fixed completely incorrect performance measurements for Raspberry Pi 3B+
caused by a faulty USB cable (causing a huge 2.5x drop in performance!)
- re-done measurements for Raspberry Pi 1 with Debian 11 and without the
faulty USB cable
- added results for Raspberry Pi 3B, it was not fried after all, it was
also the faulty USB cable
- re-done Raspberry Pi 4 power measurements to be more comparable (avoid
POE)
Power measurement setup. The VisionFive 2 is visible at the bottom with
its serial cable, the wattmeter is on the left. The other visible boards
are Raspberry Pis.
A disclaimer on methodology
Benchmarking CPU performance correctly requires deep software and hardware
expertise, which I certainly cannot claim to have. I have chosen two basic
computing primitives, hoping that they are representative enough: crypto
(sha1 and chacha20-poly1305 using openssl) and decompression (xz).
All numbers shown in this article are very "unscientific": I did not
repeat runs to account for variability, and there are many factors that I
purposefully ignore (kernel version, software version, compiler...). That
being said, I tried to document these parameters as much as possible to
help further analysis.
Overall, the goal is to give a rough idea of the CPU performance and power
efficiency you can expect from RISC-V hardware.
Hardware and software environment
The VisionFive 2 has a StarFive JH7110 SoC, with 4 SiFive U74 cores at 1.5 GHz.
The original VisionFive had a StarFive JH7100 SoC with 2 SiFive U74 cores at 1.2 GHz.
It had known hardware design issues: its bus was not cache-coherent, which
required frequent L2 cache flushes, and its RAM controller was slow.
So, the new SoC should be significantly faster.
Software-wise, I built a Linux kernel using the non-upstream repository
(5.18-based for VisionFive 1, and 5.15-based for VisionFive 2).
I built a Debian rootfs using the Debian guide for VisionFive.
That guide works almost the same way for VisionFive 2, but that (as well as the upstream status for kernel support) will be for another article.
The other systems used in the comparison mostly run Debian or Ubuntu, with
a few exceptions (NixOS, Armbian).
CPU performance
Here are the three benchmarks I will use:
openssl speed -evp sha1
openssl speed -evp chacha20-poly1305
# https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz (116606704 bytes)
time xz -d < /dev/shm/linux-5.10.tar.xz > /dev/null
The xz benchmark uses decompression of a known file to ease
reproducibility, and this file is stored in memory (/dev/shm) to make sure
we have no disk I/O.
All benchmarks are using a single CPU core.
I converted all results into MB/s for easier comparison, taking the
largest block size for the openssl results (16 KiB). As a reminder, one
MB equals 1000000 bytes. For the xz benchmark, the real elapsed time is
used.
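For example, here is how the xz figure can be reconstructed from a timed
run (a sketch of the arithmetic only: it assumes the throughput is counted
over the compressed input size, i.e. the 116606704 bytes mentioned above):

SIZE=116606704   # compressed size of linux-5.10.tar.xz, in bytes
START=$(date +%s.%N)
xz -d < /dev/shm/linux-5.10.tar.xz > /dev/null
END=$(date +%s.%N)
echo "$SIZE $START $END" | awk '{printf "%.2f MB/s\n", $1 / ($3 - $2) / 1e6}'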
To ease comparisons with other hardware, I computed a "speedup" of each
board compared to the VisionFive 2: 1x means the same performance, 2x
means twice as fast, 0.33x means three times as slow, and so on.
You can find the full output of the benchmarks for each machine
here.
Most hardware is either running locally, from the Compile
Farm, or from
Grid'5000. The Celeron G1840T
system belongs to Deuxfleurs. The Kobol
Helios64 result is courtesy of Max. Raspberry Pi 3B+ and 4 are courtesy
of $DAYJOB.
| Hardware | sha1 | chacha20 | xz decompr. |
|---|---|---|---|
| RISC-V | | | |
| VisionFive 2 (Debian unstable) | 97.5 MB/s | 50.8 MB/s | 4.66 MB/s |
| VisionFive 1 (gcc91, Debian unstable) | 64.1 MB/s (0.66x) | 33.0 MB/s (0.65x) | 2.68 MB/s (0.58x) |
| HiFive Unmatched (gcc92, Ubuntu 22.04) | 34.7 MB/s (0.36x) | 40.6 MB/s (0.80x) | 3.12 MB/s (0.67x) |
| ARM / ARM64 | | | |
| Raspberry Pi 1 (Debian 11, armv6l) | 27.3 MB/s (0.28x) | 22.9 MB/s (0.45x) | 0.928 MB/s (0.20x) |
| Raspberry Pi 3B (Debian 11) | 149 MB/s (1.53x) | 188 MB/s (3.70x) | 4.35 MB/s (0.93x) |
| Raspberry Pi 3B+ (Debian 11) | 174 MB/s (1.79x) | 225 MB/s (4.44x) | 5.02 MB/s (1.08x) |
| Raspberry Pi 4B (Debian 11) | 192 MB/s (1.97x) | 266 MB/s (5.24x) | 6.82 MB/s (1.46x) |
| Kobol Helios64 (Armbian 22.02.1) | 979 MB/s (10x) | 323 MB/s (6.36x) | 7.64 MB/s (1.64x) |
| Ampere eMAG (gcc185, CentOS 8) | 903 MB/s (9.26x) | 296 MB/s (5.83x) | 12.3 MB/s (2.64x) |
| Mac M1 (gcc103, Debian 12) | 2244 MB/s (23x) | 1710 MB/s (33.7x) | 21.0 MB/s (4.51x) |
| x86_64 | | | |
| Celeron G1840T (NixOS 22.11) | 599 MB/s (6.14x) | 678 MB/s (13.3x) | 11.2 MB/s (2.40x) |
| Xeon Gold 6130 (dahu.g5k, Deb. 11) | 1045 MB/s (10.7x) | 2611 MB/s (51.4x) | 17.1 MB/s (3.67x) |
| i7-8086K (Ubuntu 20.04) | 1414 MB/s (14.5x) | 2971 MB/s (58.5x) | 22.6 MB/s (4.85x) |
| AMD EPYC 7642 (neowise.g5k, Deb. 11) | 1706 MB/s (17.5x) | 1796 MB/s (35.4x) | 17.0 MB/s (3.65x) |
| AMD EPYC 7513 (grat.g5k, Deb. 11) | 1875 MB/s (19.2x) | 2460 MB/s (48.4x) | 21.3 MB/s (4.57x) |
SHA1 and Chacha20-poly1305 results vary a lot, which may be due to
optimizations in certain versions of OpenSSL (vectorisation, assembly
implementations) or even hardware acceleration for SHA1. They also seem to
be sensitive to memory bandwidth: the Raspberry Pis have much better
memory bandwidth than the RISC-V boards. In contrast, xz results seem much
more representative of raw CPU performance (clock frequency, CPU cache,
out-of-order execution, memory access patterns...).
To get clock frequency out of the equation, I am now showing xz results
normalized by the clock frequency, measured in "CPU cycles per processed
byte" (basically dividing the clock frequency by the xz throughput). It
should give an idea of the overall performance of the CPU architecture for
this specific decompression task. Beware, lower values are now better!
| Hardware | Max clock frequency | xz -d cycles/byte (lower is better) |
|---|---|---|
| VisionFive 2 | 1.50 GHz | 322 |
| VisionFive 1 | 1.20 GHz | 448 (0.72x) |
| HiFive Unmatched | 1.20 GHz | 385 (0.84x) |
| Raspberry Pi 1 | 0.70 GHz | 754 (0.43x) |
| Raspberry Pi 3B | 1.20 GHz | 276 (1.17x) |
| Raspberry Pi 3B+ | 1.40 GHz | 279 (1.15x) |
| Raspberry Pi 4B | 1.50 GHz | 220 (1.46x) |
| Kobol Helios64 | 1.80 GHz | 236 (1.37x) |
| Ampere eMAG | 3.00 GHz | 244 (1.32x) |
| Mac M1 | 3.00 GHz | 143 (2.25x) |
| Celeron G1840T | 2.50 GHz | 223 (1.44x) |
| Xeon Gold 6130 | 3.70 GHz | 216 (1.49x) |
| i7-8086K | 5.00 GHz | 221 (1.45x) |
| AMD EPYC 7642 | 3.30 GHz | 194 (1.66x) |
| AMD EPYC 7513 | 3.65 GHz | 171 (1.88x) |
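As a quick sanity check, the normalization is simply the clock frequency
divided by the xz throughput; recomputing the VisionFive 2 value from the
first table:

# cycles/byte = clock frequency (Hz) / xz throughput (bytes/s)
echo "1500000000 4660000" | awk '{printf "%.0f cycles per byte\n", $1 / $2}'
# -> 322 cycles per byte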
Here are some of the main highlights of these results:
- VisionFive 2 single-core performance is 52% to 74% higher than
VisionFive 1. This is very good compared to the 25% clock frequency
improvement. When normalizing by the clock frequency, the VisionFive 2
is 39% faster per MHz compared to the VisionFive 1
- VisionFive 2 is also 25% to 50% faster than the HiFive Unmatched.
When normalizing by the clock frequency, the VisionFive 2 is 20%
faster per MHz compared to the Unmatched. The Unmatched was itself
slightly faster than the VisionFive 1 on a single-core basis.
- VisionFive 2 is roughly as fast as a Raspberry Pi 3B/3B+ on the xz
  benchmark, but much slower for SHA1 and Chacha20.
- VisionFive 2 is still around 1.5 times slower than a Raspberry Pi 4 (and
  5 times slower on Chacha20)
Here are other interesting insights:
- The Raspberry Pi 1 always felt really slow. Well, now I know it's
objectively really slow. Even when taking into account its low
clock frequency of 700 MHz, performance per MHz is still really poor.
- The Intel CPUs (from 2014, 2017 and 2018 respectively) have very similar
  performance per MHz for this task, despite being very different in terms
  of frequency, number of cores and price. This indicates that they
  basically share the same kind of architectural design.
- The Raspberry Pi 4 and the Helios64 have good performance per MHz for a
  SoC, even comparable to a $1900 Intel CPU from 2017! Of course, the
  Intel CPU has many more cores, and there may be other workloads where
  Intel CPUs are much better.
- The AMD EPYC CPUs (Zen 2 and Zen 3) have very good performance per MHz
for this workload, and there is a clear improvement from Zen 2 to Zen 3.
- As always, the Mac M1 is really impressive and easily smashes all other
processors I could test on a per-MHz basis.
As a final note: remember that this is a single benchmark and is not
representative of all kinds of computing workloads. I suspect xz to be
quite sensitive to the amount of CPU cache and to memory latency.
Energy consumption
Now that we have an idea of CPU performance, the other important criterion
is energy consumption. Here, I am interested in whole-system energy
consumption. I could only measure it for systems I have locally, so only
a subset of the previous machines are tested here. Technically, I could
have used wattmeters available on Grid'5000, but it makes little sense to
compare the power consumption of a big server with that of a small
embedded board.
All figures below are taken using a basic Perel plug-in wattmeter on
230V. The wattmeter gives the "active" (or real) power in Watts, as well
as the power factor. All figures include the power transformer, which is
either: an Akashi ALT2USBACCH USB transformer rated for up to 2.4 A
(VisionFive 1 & 2, Raspberry Pis); the stock Lenovo power transformer
(Celeron G1840T); or an ATX power supply (HiFive Unmatched, i7-8086K).
For each system, I measure power consumption in the following situations:
idle; 1 CPU core at 100%; half of the CPU cores at 100%; all CPU cores at
100% (ignoring hyper-threads). Each measurement is run for only a few
seconds (still waiting for a steady state) to avoid thermal throttling.
The workload is a simple infinite loop in bash: while :; do :; done
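For the multi-core measurements, here is a sketch of how such loads can be
launched and stopped (my reconstruction of the setup, not an exact
transcript; N is the number of cores to load):

N=4
for i in $(seq 1 "$N"); do
    ( while :; do :; done ) &
done
# ... read the wattmeter once the value is stable, then stop the loops:
kill $(jobs -p)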
All systems run Linux (various versions and distributions), have one NIC
up, and no screen or other peripheral attached.
Note: I am not very confident in the absolute power values shown below
(because I don't really trust the wattmeter or the USB transformer).
However, since I did all measurements in the same conditions, the values
are comparable with each other.
| Hardware | Idle | 1 core | Half cores | All cores |
|---|---|---|---|---|
| VisionFive 2 (4 cores, 8 GB RAM) | 7.4 W | 10.4 W | 11.2 W | 13.1 W |
| VisionFive 1 (2 cores, 8 GB RAM) | 10.6 W | 11.1 W | - | 11.6 W |
| HiFive Unmatched (4 cores, 16 GB RAM) | 56.8 W | 57.7 W | 58.6 W | 60.7 W |
| Raspberry Pi 1 (1 core, 512 MB RAM) | 5.9 W | 6.2 W | - | - |
| Raspberry Pi 3B rev 1.2 (4 cores, 1 GB RAM) | 4.5 W | 6.5 W | 8.7 W | 13.8 W |
| Raspberry Pi 3B+ (4 cores, 1 GB RAM) | 7.0 W | 9.7 W | 12.2 W | 18.0 W |
| Raspberry Pi 4B rev 1.5 (4 cores, 2 GB RAM) | 4.6 W | 6.8 W | 8.3 W | 11.1 W |
| Celeron G1840T (2 cores) | 12 W | 18 W | - | 23.5 W |
| i7-8086K (6 c. / 12 threads) | 23.4 W | 61.5 W | 79.5 W | 112.7 W |
Clearly, the VisionFive 2 is quite power-efficient compared to the older
RISC-V boards. According to its documentation, it can run without any
heatsink or fan for bursty loads (e.g. web browsing), but a fan is
recommended for long computations. This is consistent with my power
consumption measurements.
Interestingly, the Raspberry Pi 3B+ has a power profile similar to the
VisionFive 2. This makes sense because they are in the same class of
devices: same number of cores, similar maximum clock frequency, similar
performance. But it is still noteworthy that the relatively young SoC
found on the VisionFive 2 has a power consumption so similar to that of
the more mature SoC found on the Raspberry Pi 3B+.
We can also observe that Intel is much better at dynamic frequency
scaling, which helps achieve low power usage when the CPU is idle. As far
as I know, the SoCs in the VisionFive 1 and the HiFive Unmatched have no
frequency scaling, which explains their near-constant power usage. The
VisionFive 2 does have frequency scaling, so it is already much better
(-45% power usage when idle compared to fully loaded).
Here are some details about the hardware to put these numbers into context:
- VisionFive 1: no fan, kernel 5.18 (Debian). Power consumption
changes significantly with die temperature (9 W idle at 36 °C, 10.4 W idle at 50 °C)
- VisionFive 2: no fan, kernel 5.15 (Debian), no NVMe, 100M NIC.
Using the gigabit NIC would add 0.5 W of power usage.
- Raspberry Pi 3B: rev 1.2, 1 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.20 GHz max frequency.
- Raspberry Pi 3B+: 1 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.40 GHz max frequency.
- Raspberry Pi 4B: rev 1.5, 2 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.50 GHz max frequency.
- Celeron G1840T: 800 MHz idle frequency, 2.5 GHz max frequency.
Lenovo ThinkCentre M73, 4 GB DDR3, ST500LM021-1KJ15 disk.
- i7-8086K: 800 MHz idle frequency, 4 GHz max frequency, 5 GHz turbo
frequency. ASRock H310CM-HDV/M.2 motherboard, 16 GB + 8 GB DDR4,
Samsung 980 500GB NVMe, ATX power supply, 2 case fans
Note: earlier versions of this article used some POE power measurements
from a switch (for the Raspberry Pis, with the POE hat). After re-doing
the measurements with the USB power supply and plug-in wattmeter, it turns
out that power measurements given by the POE switch were substantially
lower than the wattmeter (probably because the POE switch measurements do
not include the AC-to-DC power converter). Moreover, POE values were not
stable. In the end, I decided to remove these POE values and only use the
USB power supply to enable a fair comparison.
CPU performance vs. energy
Now that we have both CPU performance and energy consumption, we can mix
the two results to look at energy efficiency. The most reliable figure in
the table below is single-core efficiency: it is obtained by simply
dividing the result of the single-core xz benchmark by
the single-core power consumption. I also extrapolate some figures for
all-cores efficiency, but this value should be taken with a grain of
salt: it is obtained by multiplying single-core performance by the number
of cores (excluding hyper-threads) and dividing the total by the measured
all-cores power consumption. Many effects such as thermal throttling,
frequency boost for single-core load, and shared cache between cores may
decrease the actual all-cores performance and thus decrease the actual
all-cores efficiency compared to the figures below.
| Hardware | Single-core efficiency | All-cores efficiency (extrapolated) |
|---|---|---|
| VisionFive 2 (4 cores) | 0.448 MB/s/W | 1.42 MB/s/W |
| VisionFive 1 (2 cores) | 0.241 MB/s/W | 0.462 MB/s/W |
| HiFive Unmatched (4 cores) | 0.0541 MB/s/W | 0.206 MB/s/W |
| Raspberry Pi 1 (1 core) | 0.150 MB/s/W | |
| Raspberry Pi 3B (4 cores) | 0.670 MB/s/W | 1.26 MB/s/W |
| Raspberry Pi 3B+ (4 cores) | 0.517 MB/s/W | 1.12 MB/s/W |
| Raspberry Pi 4B (4 cores) | 1.00 MB/s/W | 2.46 MB/s/W |
| Celeron G1840T (2 cores) | 0.622 MB/s/W | 0.953 MB/s/W |
| i7-8086K (6 c. / 12 threads) | 0.367 MB/s/W | 1.20 MB/s/W |
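To make the arithmetic concrete, here is how the VisionFive 2 row can be
recomputed from the performance and power tables above:

# single-core efficiency = xz throughput / 1-core power
echo "4.66 10.4" | awk '{printf "%.3f MB/s/W\n", $1 / $2}'        # 0.448 MB/s/W
# extrapolated all-cores efficiency = (throughput * cores) / all-cores power
echo "4.66 4 13.1" | awk '{printf "%.2f MB/s/W\n", $1 * $2 / $3}' # 1.42 MB/s/W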
Overall, the VisionFive 2 is much more energy-efficient than existing
RISC-V boards: it is 2 to 3 times more energy-efficient than the
VisionFive 1, and 7 to 8 times more energy-efficient than the
Unmatched. It may seem counter-intuitive that the Unmatched is so
inefficient, but that's probably because of its larger form factor,
power-hungry PCIe and DDR4, and the need for an ATX power supply that may
not be super efficient at low power load.
Similarly, even though x86_64 hardware is much faster than the
VisionFive 2 (2.4 times to 4.8 times faster), it has roughly the same
energy efficiency! If you have moderate computing needs, the VisionFive 2
is an efficient alternative to bigger systems.
Compared to the Raspberry Pi 3B and 3B+, the VisionFive 2 again has
similar energy-efficiency. This makes sense because it has roughly
the same performance and the same power consumption.
Finally, the Raspberry Pi 4 is the real winner on the efficiency metric:
the VisionFive 2 is only half as energy-efficient as a Raspberry Pi 4.
Conclusion
When looking at single-core CPU performance, the VisionFive 2 is roughly
75% faster than the original VisionFive. Since it also has twice the core
count, total throughput should be roughly 3 to 3.5 times higher. And since
it has a similar power consumption, it is also 2 to 3 times more
energy-efficient. So that's definitely a very big improvement.
Compared to the HiFive Unmatched (which is technically not even an SBC),
the VisionFive 2 still outperforms it by 25% to 50%, and is 7 to 8 times
more energy-efficient. So, as far as I can tell, the claim about it being
a "high-performance RISC-V SBC" is true.
When comparing with Raspberry Pis, the VisionFive 2 is about as fast as a
Raspberry Pi 3B+, although much slower on memory-heavy benchmarks, and
also as energy-efficient. However, a Raspberry Pi 4 is still 46% faster on
the xz benchmark, and twice as energy-efficient. As far as I can tell,
both SoCs are built on a 28 nm process, so we would ideally expect the
same energy efficiency.
Compared to low-power x86_64 systems, the VisionFive 2 is of course slower
in raw performance, but at the same time it is just as energy-efficient.
This is a general advantage that SBCs have over more complete systems:
they have far fewer peripherals, are less extensible and generally slower,
but they are much more energy-efficient.
Again, remember that all figures discussed here are approximate, and
specific benchmark results cannot be extrapolated to generic performance
results for all applications.
Overall, the VisionFive 2 is a big step in the right direction, and this
kind of RISC-V hardware can definitely compete with recent ARM boards
since they have very similar performance-energy tradeoffs.
More pictures
VisionFive 2 in its box (I removed the antistatic wrapping)
Front with audio, 4xUSB, HDMI, 2xNIC (with one being a 100M NIC, specific to the super early bird version)
Rear with USB-C power input, reset button, GPIOs
Back with NVMe M.2 slot, micro-SD card slot
Aug 13, 2022
These days, I'm adding XDP offloading to l2tpns,
an L2TP server used in production by several non-profit ISPs in France.
While doing that, I need to test if l2tpns can successfully load
XDP programs into the kernel. But I don't want to run that directly
on my Debian host: it might break network connectivity, and in addition
l2tpns is updating the routing table of the kernel. So, let's just run
l2tpns in Docker and allow it to break things! It turns out to be not so easy.
eBPF and XDP
As a reminder, XDP is a kernel mechanism that allows you to load custom eBPF programs
that will execute right in the network device driver. You write your eBPF
program in C, load it in the kernel from userspace with a simple system call,
and from that point on, your program can process network packets in the kernel,
before the rest of the kernel has even started parsing the packets!
For a project like l2tpns, this is extremely powerful, fast and flexible,
because we should be able to offload the bulk of the encapsulation and
decapsulation work to the kernel while keeping a lot of flexibility.
That being said, the eBPF ecosystem is still young and moving fast, and
the whole software architecture to make this work is actually very complex.
In the end, you always end up with weird errors that can be hard to track
down, especially when trying to run XDP in Docker!
What I want to debug
In this case, I'm extending l2tpns so that it loads XDP programs on network
interfaces when it starts. The basic process looks like this with libbpf
(error handling omitted):
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include <net/if.h>

char xdp_filename[] = "/path/to/xdp_prog.o";
char if_name[] = "eth0";
__u32 ifindex;
int prog_fd = -1;
struct bpf_object *obj;
__u32 xdp_flags = 0;

// Load the XDP program (a compiled ELF object) into the kernel
bpf_prog_load(xdp_filename, BPF_PROG_TYPE_XDP, &obj, &prog_fd);
// Find the network interface by name
ifindex = if_nametoindex(if_name);
// Attach the XDP program to the network interface
bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
See the xdp-tutorial repository
for more complete examples, but as a starting point this is the basic
functionality I want to debug in Docker.
Most programs manipulating eBPF leverage libbpf to do the hard work. As
such, the debugging steps below can be generalized to any eBPF-enabled
userspace program.
Basic Docker setup
To keep things simple, I only want to run l2tpns in a container. I will
keep developing and building on my Debian host. So, let's get started with
a simple Dockerfile that installs the required libraries and creates a minimum
config to make l2tpns happy:
# Dockerfile used to test l2tpns during development.
# Do not use in production!
FROM debian:bullseye
RUN mkdir -p /etc/l2tpns; echo "10.10.10.0/24" > /etc/l2tpns/ip_pool
RUN apt update && apt install -y libbpf0 libcli1.10 iproute2
WORKDIR /src
VOLUME /src
ENTRYPOINT ["/src/l2tpns"]
My Debian host is running Bullseye, so I use the same distro in the container
to make sure I have the same libraries.
Build the image from the Dockerfile:
$ docker build - -t l2tpns:latest < Dockerfile
Then give it a try (from the host, in the l2tpns git repository):
$ make -j4
# To send all logs to stderr
$ sed -i -e 's/set log_file/#set log_file/' etc/startup-config.default
# Run docker image with parameters
$ docker run -it --rm -v $PWD:/src l2tpns:latest -c etc/startup-config.default
This yields an error:
Can't open /dev/net/tun: No such file or directory
Ok, this first error is unrelated to XDP: l2tpns needs to create a tun
interface and it cannot. Let's fix this:
$ docker run -it --rm -v $PWD:/src --cap-add=NET_ADMIN --device=/dev/net/tun l2tpns:latest -c etc/startup-config.default
Now we start seeing the interesting stuff:
libbpf: Error in bpf_object__probe_loading():Operation not permitted(1).
Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y)
and/or that RLIMIT_MEMLOCK is set to big enough value.
From this point on, I will omit the tun-related options from the examples,
but for the specific case of l2tpns they are still needed.
Allowing the BPF syscall
Obviously, to load an eBPF program into the kernel, you need to make a
syscall at some point. This is the role of the bpf() syscall, which is also
used for other eBPF-related functionality. There is a new CAP_BPF
capability that enables the bpf() syscall for unprivileged users. This was
introduced in Linux 5.8 according to capabilities(7), which is good because
Debian bullseye runs a 5.10 kernel. Let's try:
$ docker run -it --rm -v $PWD:/src --cap-add=BPF l2tpns:latest -c etc/startup-config.default
Result:
docker: Error response from daemon: invalid CapAdd: unknown capability: "CAP_BPF".
Crap. Maybe my Docker version is too old to know about this capability.
Let's just use a bigger hammer and settle for CAP_SYS_ADMIN, which grants
a lot of privileges, including BPF:
$ docker run -it --rm -v $PWD:/src --cap-add=SYS_ADMIN l2tpns:latest -c etc/startup-config.default
Result:
libbpf: Error in bpf_object__probe_loading():Operation not permitted(1).
Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y)
and/or that RLIMIT_MEMLOCK is set to big enough value.
Well, this is the exact same error as before!
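Before digging further, one quick way to rule out a missing capability is
to look at the capability sets the container actually receives (a
diagnostic sketch, not something the error message asks for):

$ docker run --rm --cap-add=SYS_ADMIN debian:bullseye grep Cap /proc/self/status
# The CapEff bitmask can be decoded on the host with capsh (package libcap2-bin):
#   capsh --decode=<CapEff value printed above>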
Configuring limits in the container
Helpfully, the error message mentions something about the "memlock" limit.
Let's have a look at the limits in a simple Debian bullseye container:
$ docker run -it --rm debian:bullseye /bin/sh -c "ulimit -a"
Since ulimit is a shell builtin, we cannot run it directly as the command
from Docker.
Result:
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
memory(kbytes) unlimited
locked memory(kbytes) 64
process unlimited
nofiles 1048576
vmemory(kbytes) unlimited
locks unlimited
rtprio 0
We are interested in the "locked memory" limit. 64 KB is indeed on the low side
(try comparing this value with your host system).
Looking at the relevant Docker documentation,
we find there's an option we can pass to Docker to raise this limit:
$ docker run -it --rm --ulimit memlock=1073741824 debian:bullseye /bin/sh -c "ulimit -l"
1048576
That looks much better! Now on the real container:
$ docker run -it --rm -v $PWD:/src --ulimit memlock=1073741824 --cap-add=SYS_ADMIN l2tpns:latest -c etc/startup-config.default
libbpf: map 'sessions_table': failed to create: Invalid argument(-22)
Ok, we still have an error, but it looks application-specific (libbpf fails to create
a map that is defined in the l2tpns code).
EDIT 2022-08-15: it turned out to be indeed a programming error: BPF array
maps MUST have a 32-bit key size, and I was trying to create a map with a
16-bit key size. It's hard to debug because there is no detailed error
reporting: the syscall simply fails with EINVAL. Here is what strace sees,
which is not really helpful:
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=2, value_size=20,
max_entries=60000, map_flags=0, inner_map_fd=0,
map_name="sessions_table", map_ifindex=0, btf_fd=0,
btf_key_type_id=0, btf_value_type_id=0,
btf_vmlinux_value_type_id=0}, 72) = -1 EINVAL (Invalid argument)
After fixing this bug, libbpf happily creates the map in the kernel:
libbpf: map 'sessions_table': created successfully, fd=8
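If you want to double-check the kernel's behaviour independently of the
application, recent versions of bpftool can create maps directly (a
sketch, assuming bpftool is installed and a bpffs is mounted at
/sys/fs/bpf):

$ sudo bpftool map create /sys/fs/bpf/test type array key 2 value 20 entries 60000 name test
# fails with EINVAL, just like the syscall above
$ sudo bpftool map create /sys/fs/bpf/test type array key 4 value 20 entries 60000 name test
# succeeds; inspect it:
$ sudo bpftool map show pinned /sys/fs/bpf/test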
Conclusion
So far, after a bit of effort, I could get basic BPF functionality to work
in a Docker container for debugging purposes! Of course, for further
debugging, you would need tools such as bpftool to dump the XDP programs,
observe the behaviour of the program by sending packets to the interface,
and so on.
But this part of the work should be quite similar whether using Docker or not.
If this turns out to be more difficult than expected, I will update the article!
Feb 13, 2019
Today I have been mostly teaching or preparing upcoming courses. I also
had a nice lunch discussion with
colleagues on DNS and the
role of transaction IDs, but that story will have to wait until tomorrow!
Teaching routing
I gave another networking course for first-year students today. This was
the first practical session where they actually had to plug cables around:
you can imagine the excitement but also the mess! To make things even
easier, the course was in a new networking lab I had never been to before,
so I had to improvise with the hardware lying around.
The students learnt how to configure network interfaces (ifconfig, route
& netstat on FreeBSD), and they had to use their prior knowledge of packet
capture and ping to troubleshoot when things didn't work as expected. They
had to form a simple "chain" topology (shown below) with two subnets, and
the computer in the middle needed to be configured as a router. They
needed to figure out that static routes were required on both edge
computers, so that they knew how to reach the remote subnet through the
router. Finally, they looked in detail at the behaviour of ARP and the
scope of MAC addresses.
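For reference, on the edge machines the FreeBSD side of this boils down to
something like the following (a sketch with made-up addresses, not the
exact lab addressing):

# On an edge machine in subnet 10.0.1.0/24, with 10.0.1.1 being the router
ifconfig em0 inet 10.0.1.2/24 up
route add -net 10.0.2.0/24 10.0.1.1
# Check the routing table and the ARP cache
netstat -rn
arp -a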

Network security course
I then prepared an upcoming practical session on network security with a
colleague working for Quarkslab. I already have a good part of the course
ready from last year on firewalling and advanced uses of iptables
(including compiling custom BPF programs!).
My colleague wants to add a part where students will practice ARP
spoofing, so we looked at how to integrate that with the existing content.
Interestingly, he showed me how to automate virtual machine generation
using Packer. This should be really helpful for
future teachers in this course: they will be able to easily customize and
rebuild the virtual machine images used by the students! Last year, I
installed and configured the virtual machine manually, which makes it hard
to update it or apply the same modifications to a new VM image.
Feb 12, 2019
Multipath scheduling
Today I mostly worked on my current research project, a simulator of
multipath multistream scheduling algorithms.
Multipath scheduling is needed when you want to transmit data over several
concurrent paths: which piece of data should be sent on which path? This
problem has been made visible by Multipath TCP, since the Linux
implementation includes several schedulers that can be changed at runtime.
Several new schedulers for MPTCP are proposed in the academic literature
every year: the original LowRTT and its evaluation, Delay-Aware Packet
Scheduling, BLocking ESTimation, Earliest Completion First, and many
others.
All these algorithms mostly differ in the objective function they try to
optimize or in the assumptions that can be made about specific data flows
(e.g. video streaming traffic). However, they all adopt the semantics of
TCP, which transports a single flow of data. I am interested in extending
the problem to several streams (have you heard of QUIC?) that need to be
scheduled on multiple paths. Instead of a single optimisation problem, you
now end up with several concurrent streams, where each stream wants to
complete as soon as possible!
Writing a simulator
The goal of my simulator is to quickly obtain an intuition on the
behaviour of scheduling algorithms: it provides a graphical and animated
visualisation of what's going on over time. The simulator also allows for
more in-depth exploration, for instance comparing the completion times of
streams for different scheduling algorithms.
Below is a screenshot of the current simulator: it is not very pretty
because that's not its goal! The streams are represented by vertical bars
whose size equals the amount of data remaining to be transmitted, and the
darker parts represent in-flight data. Below are two paths, each modelled
as a packet queue with a constant-rate service (link capacity) and a fixed
propagation time.

I am writing this simulator in Python thanks to
salabim: this is a really well designed,
easy-to-use and well-documented simulation framework. I had an initial
simulation prototype working in less than one day, and it took only an
additional day to add graphical visualisation. One of the reasons it's so
easy to use is Python itself: I didn't want to spend days implementing
complex algorithms in NS-3, even though that would be much more realistic.
At the same time, salabim is reasonably fast once you disable logging and
visualisation.
After working with salabim some more, I did find some limitations: the
programming style around salabim is fine for small simulations, but
quickly becomes a mess for larger projects. All the examples use lots of
global variables, which encourages you to write all your code in one file
(after all, this is how salabim itself is developed, with its 15k lines in
a single file...).
Feb 12, 2019
As you may have seen, I am not very good at writing regular articles here!
I often get ideas for an article; sometimes I am motivated enough to actually
start writing it; but then, most of the time, I never finish the article.
With my third and (hopefully) last year of PhD going full steam, and still
lots of involvement in community networks, I decided to change my writing
approach and start publishing a daily log of what I do.
What should you expect?
Content-wise, I will mostly talk about networking, of course!
More precisely, I will cover the following activities:
- my research activities: what I'm currently working on, interesting
discussions with colleagues, conferences I attend, etc;
- my teaching activities, also mostly related to networking;
- my non-profit activities in Grenode,
Rézine, Fédération FDN
and other organisations related to community networks.
I may also cover other activities that are not directly related to
networking, for instance
Openstreetmap, the GCC compile
farm, contribution to various free
software, and so on.
Feedback
Since this is a new exercise for me, I welcome all kinds of feedback! You
can reach me on Mastodon or by email (root at <this blog's domain name>).
Feb 11, 2018
What is an Internet exchange point?
An Internet exchange point, or IXP (Internet eXchange Point), is a place
where several network operators interconnect to exchange traffic.
Simplifying a bit, think of it as a big Ethernet switch that each network
operator plugs into, using an RJ45 cable or an optical fiber. Yes, the
same kind of Ethernet switch you probably have at home to connect your
computers, just a bit faster and more reliable (and therefore more
expensive).
In reality, most IXPs have a more complex architecture, with several
switches in different racks of a datacenter, and points of presence
("PoP") in several datacenters, linked together with optical fibers. But
the principle stays the same.
Analysing Internet exchange points
Recently, the CAIDA research center published a dataset about IXPs, so I
thought I would take a look at what is inside!
It is interesting to have a global view of the IXP landscape, because IXPs
shape a large part of the physical architecture of the Internet (which is,
as a reminder, precisely an interconnection of networks).
Let's first look at which information is available:
$ tail -n +2 ixs_201712.jsonl | jq 'select(.name == "France-IX")'
which gives:
{
"name": "France-IX",
"city": "Paris",
"country": "FR",
"sources": [
"pdb",
"wiki",
"pch",
"looking"
],
"alternatenames": [
"Mix Internet Exchange and Transit",
"FNIX6",
"France Internet Exchange "
],
"geo_id": 2988507,
"region": "Paris",
"pch_id": 74,
"url": [
"http://www.mixt.net/",
"http://www.fnix6.net/"
],
"pdb_id": 68,
"pdb_org_id": 147,
"alternativenames": [
"French National Internet Exchange IPv6"
],
"ix_id": 377,
"org_id": 23
}
We can already notice several things:
- the data comes from different sources, as indicated here: PeeringDB, the
Wikipedia page on exchange points, PCH, and bgplookingglass.com.
- it is a bit of a mess... By cross-referencing the different sources,
CAIDA has merged exchange points that have nothing to do with each other
(FranceIX, MIXT, FNIX6)! Here again, the method used is described on the
dataset page.
Next, we see that we can easily filter by country:
$ tail -n +2 ixs_201712.jsonl | jq 'select(.country == "FR") | .name' | wc -l
42
So there seem to be 42 IXPs in France (give or take duplicates and
errors); you can't make that up :)
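The same kind of one-liner gives a quick overview of the whole dataset,
for instance counting IXPs per country (a small sketch in the same
spirit, not part of the original analysis):

$ tail -n +2 ixs_201712.jsonl | jq -r '.country' | sort | uniq -c | sort -rn | head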
Exchange points in France
After manually cleaning up duplicates and old IXPs that have disappeared,
about 21 exchange points remain in France, which is still a sizeable
number!
Contrary to what you might think, there are only 4 active exchange points
in Paris: FranceIX, FR-IX, Equinix and SFINX (plus Hopus, which is not
really an IXP). All the others (about 13 active ones) are thus either in
other regions or overseas. That is surprising when you know how
ultra-centralised on Paris the French Internet is (fortunately, this has
been improving for a few years, notably thanks to Rézopole). Historically,
there have actually been many exchange points in Paris, but most of them
died or were absorbed.
This relative scarcity of exchange points in Paris, together with their
proliferation in other regions and overseas, can be explained in several
ways:
IXPs help develop the local territory
Exchange points are important for developing the network locally, since
they allow local operators to exchange traffic directly, without going
through the big interconnection hubs such as Paris, London or Amsterdam.
This reduces latency and cost, and lowers the dependency on infrastructure
that becomes critical because of its concentration (for instance, TH2 in
Paris concentrates a large part of the interconnections of the French
Internet...). In short, it decentralises and relocalises the network,
which has not only technical and economic benefits, but also human ones:
it also relocalises technical skills.
This is all the more important overseas! Imagine a subscriber in Réunion
who wants to reach a server that is also hosted in Réunion. Without a
local exchange point, the request will go through a submarine fiber,
probably travel a few hundred or thousand kilometers, and come back the
same way...
So it seems logical that more and more regions are developing local
Internet exchange points. For example, Rézopole is partly funded by the
Rhône-Alpes Region to run LyonIX and GrenoblIX.
Two IXPs on the same territory compete with each other
Another explanation is that there is little room for several exchange
points on the same territory. Indeed:
1) for an operator, connecting to an exchange point is a mostly fixed
cost, which depends very little on the amount of traffic exchanged (unlike
transit). You have to pay for the cabling in the datacenter, then for the
port on the exchange point's switch: this last cost is usually tied to the
capacity of the port (1 Gbit/s, 10 Gbit/s, etc) and not to its actual
utilisation.
As a result, if an operator has the choice between 5 small exchange points
that together allow it to exchange 400 Mbit/s, and a single bigger
exchange point on which it can push the same 400 Mbit/s, it will tend to
favour the bigger one. Of course, there are other selection criteria
(redundancy, presence in several datacenters, pricing, quality of service)
which allow a few exchange points to coexist on the same territory, but
this still strongly limits the potential for having dozens of IXPs in the
same place.
2) the network effect is at play: as with many networked systems, the more
members an exchange point has, the more attractive it becomes to connect
to it. Indeed, more members means more potential exchanged traffic, for
the same fixed cost. This effect naturally tends to make big IXPs grow and
small ones disappear, and usually ends up converging towards a single IXP
on a given territory (except in Paris, where demand is strong enough and
datacenters numerous enough to let a few IXPs coexist; one can also see
IXPs with very different policies coexist, for instance an academic IXP
and a commercial one).
Note that it is still possible to go against this network effect. For
example, the SIX exchange point in Seattle has a peculiar financial model:
operators only pay a service access fee, and can then exchange traffic on
the exchange point without any recurring fees! The MINAP in Milan has a
similar model at a smaller scale, where even the access fee is waived (but
not the cross-connect fee).
More generally, quite a few exchange points (especially the small ones)
are sponsored by players of the local telecom market, who are well aware
of the technical and political benefits of local interconnection: low
latency, control over the infrastructure, independence. Besides, the
members of the exchange point are potential customers, to whom the
sponsors of the exchange point will later be able to sell hosting or
transit!
The quality of service of an IXP must be irreproachable
When an exchange point starts to grow, the question of quality of service
inevitably arises. As long as the exchange point connects the non-profit
ISP from the next town over and the two small local companies, outages do
not have a huge impact. But when hundreds of members are connected, some
of them large, the slightest failure can affect millions of end users.
Besides, for operators, an exchange point means maintenance and monitoring
work, which can turn out to be heavier and more costly than the benefit of
being connected to it. Operators therefore naturally tend to favour
well-managed, reliable exchange points. In response, exchange points that
want to survive and grow give themselves the means to provide a reliable
service: 24/7 on-call duty, redundant technical architecture, high-end
hardware, etc.
Let's be clear: running an exchange point of reasonable size is not easy,
since it requires both strong technical expertise (specialised hardware,
architecture distributed over several sites) and a strong relational
component: the organisation operating the exchange point has to interact
with hundreds of heterogeneous organisations, all of which want a working
service without spending too much time on management and upkeep.
So we see both a pooling of skills, through organisations such as
Rézopole, to avoid reinventing everything from scratch at each IXP, and a
strong sharing of knowledge and experience at a larger scale, with RIPE
and EuroIX.
Conclusion
The ecosystem of exchange points is not a new topic, but it remains
fascinating because it intertwines technical issues with relationships
between sometimes very different organisations. It nicely illustrates the
distributed, peer-to-peer model that made the Internet a success. One can
also observe that some exchange points are managed as a commons!
If you are interested in the topic, the RIPE NCC maintains a very active
collaborative blog on subjects related to the Internet in Europe,
including IXPs and peering. Also on RIPE Labs, Uta Meier-Hahn regularly
writes fascinating articles on the stakes of interconnections between
operators.
Sep 24, 2015
OVH announced today its
OverTheBox project, which is
basically a link-aggregation solution for Internet access links.
Analysis of the technology
Foreword on link aggregation
First of all, aggregating Internet access links has nothing to do with
classical
link aggregation (also
called bonding or trunking). This is a much harder problem, because the
access links typically have very diverse characteristics, in terms of
latency, capacity, and packet loss.
Think of aggregating a DSL line, a FTTH line and a satellite connection.
If you simply send packets in a round-robin fashion, you will basically
get the worst out of each link: packets will be heavily reordered, causing
TCP to fall apart. The latency of a flow will basically be the latency of
the worst link. Additionally, packet loss on any of the links will
heavily impact the whole flow.
Technology used in OverTheBox
For OverTheBox, the main technology used by OVH is
Multipath TCP, often abbreviated as MPTCP.
Multipath TCP basically allows splitting a TCP flow across multiple paths,
providing redundancy and increased throughput. It does so in a clever way:
each subflow runs TCP independently, providing congestion control and
packet loss recovery separately on each path. A scheduler decides which
path to send data on, preferring the path with the lowest RTT and moving
on to the next path when its congestion window is full.
While Multipath TCP was not initially designed for link aggregation, it
implements all necessary ingredients to do this efficiently. However, it
only works for TCP traffic, and requires that both ends of a TCP
connection know how to speak Multipath TCP. This is actually by design:
end hosts are in the best position to discover paths and their associated
characteristics (the typical use-case being a
smartphone with both 4G and Wi-Fi).
OVH used the Linux implementation of Multipath TCP, and
based its distribution on OpenWRT,
using an existing patch.
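For reference, on the out-of-tree multipath-tcp.org kernel (the
implementation OVH builds on), MPTCP is controlled through a handful of
sysctls; a minimal sketch, assuming that kernel is running:

# Enable MPTCP and keep the default (lowest-RTT-first) scheduler
sysctl -w net.mptcp.mptcp_enabled=1
sysctl -w net.mptcp.mptcp_scheduler=default
# The path manager decides which subflows to open (e.g. a full mesh of interfaces)
sysctl -w net.mptcp.mptcp_path_manager=fullmesh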
Since Multipath TCP is not yet widely deployed in end-hosts, a
link-aggregation solution based on Multipath TCP must be transparent for
the devices behind the aggregation point. To do this, OVH used a
classical solution based on a VPN. The idea is to run a VPN protocol able
to tunnel data over TCP, such as OpenVPN. This way, provided both the VPN
client and servers and MPTCP-compatible, the VPN will automatically use
all available paths, with associated load-balancing and failover benefits.
OVH
apparently decided
to use vtun, which I had never heard of
before. That being said, there are also
references to OpenVPN
in the code, so I am not sure which one they use.
In addition to that, OVH seems to use a transparent SOCKS proxy,
shadowsocks. The goal is to avoid
TCP over TCP encapsulation, which is
notoriously a bad idea.
Thanks to the SOCKS proxy, TCP connections from local clients are
terminated locally, and new TCP connections are established from the other
end of the tunnel towards the destination. This way, any packet loss on
the path towards the destination does not trigger retransmissions inside
the VPN.
For UDP traffic, I am not sure whether it also goes through the SOCKS
proxy (this is possible with SOCKS5, but would be somewhat useless in this
case) or travels directly on the VPN.
Finally, as a last note, OVH decided to shoot IPv6 in the head by
completely ignoring AAAA DNS requests
in their local DNS resolver. This is an ugly hack, and sounds like a quick
and dirty fix for an issue discovered just before the initial release. My
guess is that either shadowsocks does not support IPv6, or the IPv6
connectivity provided by some of the access links interferes with the
operation of the OverTheBox box. I do hope that this is a temporary fix,
because crippling IPv6 like this will certainly not help its deployment.
By the way, Multipath TCP of course fully supports IPv6.
By the way, this analysis is based on a rather quick look at the source
code, and my own experience. If you think I made a mistake, feel free to
send me an email (contact at the domain name of this blog).
Impact of OverTheBox
As such, this project from OVH merely assembles existing components. It
introduces nothing new, except maybe a nice web interface (which is
actually non-negligible in terms of user impact).
And indeed, technically speaking, people have already been doing the exact
same thing for a while: Multipath TCP for link aggregation, a VPN such as
OpenVPN for encapsulation, and a transparent SOCKS proxy to terminate
client TCP connections before entering the tunnel. See for instance
this mail on the mptcp-dev mailing list.
But this is, to my knowledge, the first open off-the-shelf solution
providing an easy-to-use interface. What's more, OVH
released the code, and the solution should work
just fine with your own VPN server: it does not force you to use OVH
services, which is extremely nice.
This is in huge contrast with existing proprietary solutions for the same
problem, such as the products sold by
Peplink. Their
business model is to sell you the hardware and the service, with
associated licensing fees. Since the protocol is proprietary, you are
forced to use the Peplink VPN servers (even though they seem to offer to
deploy VPN servers in the cloud, that you can manage through their
provided interface). OverTheBox is likely to have an effect on this kind
of proprietary businesses. On the other hand, providers like Peplink can
(and probably should) make a difference by providing custom support for
companies, something that OVH probably won't do.
Finally, let us note that there are other solutions to the original
problem, such as MLVPN (which is not
based on Multipath TCP). But OVH clearly has enough weight to make a huge
impact with its nice, integrated solution.
Aug 22, 2014
This article was originally written in French, since it mostly concerns
French readers.
In France, SFR provides IPv6 on its DSL and fiber accesses: that's great!
However, it is not native IPv6 connectivity (probably because the access
network is IPv4-only for now). IPv6 connectivity is provided through a
tunnel built on top of IPv4. When using SFR's own box, this is
transparent: the box sets up the tunnel itself, and you see nothing
special (apart from a somewhat low MTU).
However, if you replace the box with your own router (for instance running
OpenWRT), you have to set up the tunnel yourself if you want IPv6. This is
not entirely obvious (technically, it is IPv6 over L2TP over UDP over
IPv4, with PPP and DHCPv6 thrown in for good measure). The goal of this
article is to detail how to set up the IPv6 tunnel on OpenWRT, knowing
that the configuration can be adapted to other GNU/Linux or BSD systems.
IPv4 configuration
Getting an IPv4 address with a router plugged into the SFR ONT has already
been widely documented: you just need to do DHCP with a specific
vendor-id. On OpenWRT, this translates to:
# /etc/config/network
config interface 'wan'
option ifname 'eth0' # adapt to your setup
option proto 'dhcp'
option vendorid "neufbox-BypassedNeufBox-DirectConnectionToFTTH-toto@nowhere.xxx"
The vendorid actually just has to start with "neufbox", but stating that
it is not a Neufbox seems advisable, in case technical support ever looks
at it (even if, in practice, SFR support is more along the lines of "How
strange, your Internet connection does not seem to work." "It does, I
assure you, it works perfectly fine." "Oh, OK.").
Analysing the tunnel
The first step is to determine the address of the LNS (L2TP Network
Server), which is the router the L2TP tunnel is established with. To bring
up the tunnel, there are then two levels of authentication:
- an authentication to establish the L2TP tunnel itself. This is simply a
hard-coded password, the same for all Neufboxes: 6pe
- a PPP authentication, whose login/password pair is specific to each SFR
customer.
So you need to know the PPP login and password. Fortunately, the Neufbox
sends them in cleartext when it establishes the tunnel.
To recover all this information, it is enough to sniff the traffic of the
Neufbox right after it boots.
Sniffing the Neufbox traffic
For a fiber connection, this is very simple: just put yourself between the
Neufbox and the ONT. See this article for more details.
The simplest approach is probably to use a Linux machine with two network
interfaces (for instance, a laptop with a USB Ethernet adapter). One
interface is plugged into the WAN port of the Neufbox, the other is
plugged into the ONT. Then, bridge the two interfaces:
# brctl addbr br0
# brctl addif br0 eth0
# brctl addif br0 eth1
# ip link set eth0 up
# ip link set eth1 up
# ip link set br0 up
# sysctl -w net.ipv4.ip_forward=1
You also need to make sure that the firewall allows packet forwarding. If
in doubt:
# iptables -P FORWARD ACCEPT
# iptables -F FORWARD
All that is left is to look at the traffic going through.
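For instance, a capture along these lines is enough to spot the LNS
address, the L2TP tunnel and the PPP credentials (my own sketch, the exact
command is not shown here; any L2TP-aware tool such as Wireshark works
just as well):

# tcpdump -n -i br0 -w neufbox-boot.pcap
# tcpdump -n -r neufbox-boot.pcap 'udp port 1701'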
LNS address
On my fiber connection, the address of the LNS with which the Neufbox
establishes the tunnel is 109.6.3.95. The server may differ depending on
the region or other criteria. For instance, in the capture made by Marin,
the LNS is 109.6.1.72. Another user reports that for him, the LNS is
109.6.4.36.
The simplest approach is therefore to sniff the traffic and use the same
LNS as your Neufbox.
PPP login and password
The PPP login is apparently of the form dhcp/XX.XX.XX.XX@YYYYYYYYYYYY,
where XX.XX.XX.XX is the public IPv4 address of the Internet access, and
YYYYYYYYYYYY is the MAC address of the WAN port of the Neufbox, without
the ":" separators.
For the PPP password, there does not seem to be any particular logic. It
looks like a 16-character string over the alphabet [A-Z0-9] (alphanumeric
with upper-case letters only).
Configuration on OpenWRT Barrier Breaker
Everything is in the OpenWRT documentation. Adapted for SFR, it boils down to:
# /etc/config/network
config interface 6pe
option proto l2tp
option server '109.6.3.95' # adapt to your setup
option username 'dhcp/XX.XX.XX.XX@YYYYYYYYYYYY'
option password 'ZZZZZZZZZZZZZZZZ'
option keepalive '6'
option ipv6 '1'
config interface 'wan6'
option ifname '@6pe'
option proto 'dhcpv6'
As well as:
# /etc/xl2tpd/xl2tp-secrets
* * 6pe
And presto, it just works (autoconfiguration on the LAN, firewall rules,
etc). OpenWRT is magic, isn't it? :)
Configuration for other OSes (GNU/Linux, BSD)
The manual method, configuring xl2tpd and then pppd, is also documented on
the OpenWRT wiki.
Someone also tried with an EdgeRouter Lite, and ended up with a
configuration that is not super clean, but works.
Note that the Neufbox firmware is available at
http://neufbox.alwaysdata.net/. In particular, it is possible to retrieve
the xl2tpd and pppd configuration to make sure you use the same one.
Performance
The tunnel is set up on a Netgear WNDR3800 router running OpenWRT Barrier
Breaker rc3. The connection is a 1G/200M SFR fiber line. The tests are run
from a laptop connected over Ethernet, towards
ipv6.intuxication.testdebit.info, which is about 10 ms away, with the
following commands:
# Download
wget -O /dev/null http://ipv6.intuxication.testdebit.info/fichiers/1000Mo.dat
# Upload
curl -o /dev/null -F 'filecontent=@1000Mo.dat' http://ipv6.intuxication.testdebit.info
Each command is launched several times in parallel if needed (to fill the
pipe), and the instantaneous throughput is read on the router. I get the
following maximum IP throughput:
- 80 Mbit/s upload
- 105 Mbit/s download
During upload, top on the router shows that the CPU spends 100% of its
time handling software interrupts (sirq). During download, CPU usage is
more like 90%.
For comparison, over IPv4 in the same conditions, I get 180 Mbit/s with
85% CPU usage on the router, with identical results for upload and
download. Apparently, encapsulating L2TP packets is more expensive than
decapsulating them.
Conclusion
This article describes how to use the IPv6 connectivity provided by SFR
when replacing the Neufbox with your own router. You may also want to use
other services (phone, television, etc), but plenty of people have
documented how to do that: see the links below.
Links
Jul 23, 2014
So, writing a blog again. I had one, years ago. Hosted at home, like
this one (though it was behind DSL at the time, FTTH wasn't as widespread
as it is now). It was about free software, programming languages
(especially functional), and maybe already some bits of networks. I
remember writing a long post after discovering network neutrality for the
first time, thanks to a talk by La Quadrature du Net.
This new blog will be mostly about networks, and how you can use them for
fun and for saving the world (yes, this is overly ambitious). Overlay
networks, wireless mesh networks, routing protocols, free software, free
hardware, Do-It-Yourself ISPs, community-owned networks... And probably
other stuff I forgot.
As a general rule, ideas and principles behind networks will be
discussed, and not only "how to do this particular thing with that
particular software". I strongly believe that networking is not hard,
provided you understand what you are doing, which is often the most
difficult part. Once you know what you are doing, it is relatively easy
to use the available networking tools, or to create new ones. That being
said, I will definitely provide configuration examples when they are
non-obvious and/or use some obscure functionalities.
By the way, this blog will not show photos of big Cisco routers, or
explain how to do X with <insert your favourite proprietary router OS
here>.
Hopefully, this blog will stay up longer than the previous one. Enjoy
reading, and happy hacking!