Bits of networks

Networking, system, research

21 Jan 23

Benchmarking RISC-V: VisionFive 2 vs the world

I recently got my "super early bird" version of the VisionFive 2 RISC-V board. As the documentation says, it is supposed to be "the world’s first high-performance RISC-V single board computer (SBC) with an integrated GPU".

Since I have access to several kinds of RISC-V, ARM and x86 boards, let's see if the claim about performance is true! We will look at both processing performance and energy efficiency.

Updated 2023-01-22: added Kobol Helios64 performance results from Max

Updated 2023-01-23: added results (performance and power) for Raspberry Pi 3B+

Updated 2023-01-24:

  • redid all SBC power measurements, with significant changes for VisionFive 2
  • fixed completely incorrect performance measurements for Raspberry Pi 3B+ caused by a faulty USB cable (causing a huge 2.5x drop in performance!)
  • redid measurements for Raspberry Pi 1 with Debian 11 and without the faulty USB cable
  • added results for Raspberry Pi 3B: it was not fried after all, the faulty USB cable was to blame there too
  • redid Raspberry Pi 4 power measurements to be more comparable (avoiding POE)

Picture of my setup

Power measurement setup. The VisionFive 2 is visible at the bottom with its serial cable, the wattmeter is on the left. The other visible boards are Raspberry Pis.

A disclaimer on methodology

Benchmarking CPU performance correctly requires a lot of software and hardware expertise, and I certainly cannot claim to have it. I have chosen two basic computing primitives, hoping that they are representative enough: crypto (sha1 and chacha20-poly1305 using openssl) and decompression (xz).

All numbers shown in this article are very "unscientific": I made no formal repetition to account for variability, and there are many factors that I purposefully ignore (kernel version, software version, compiler...). That being said, I tried to document these parameters as much as possible to help further analysis.

Overall, the goal is to give a rough idea of the CPU performance and power efficiency you can expect from RISC-V hardware.

Hardware and software environment

The VisionFive 2 has a StarFive JH7110 SoC, with 4 SiFive U74 cores at 1.5 GHz.

The original VisionFive had a StarFive JH7100 SoC with 2 SiFive U74 cores at 1.2 GHz. It had known hardware design issues: frequent L2 cache flushing needed because of a non-coherent bus and a slow RAM controller. So, the new SoC should be significantly faster.

Software-wise, I built a Linux kernel using the non-upstream repository (5.18-based for VisionFive 1, and 5.15-based for VisionFive 2). I built a Debian rootfs using the Debian guide for VisionFive. That guide works almost the same way for VisionFive 2, but that (as well as the upstream status for kernel support) will be for another article.

The other systems used in the comparison mostly run Debian or Ubuntu, with a few exceptions (NixOS, Armbian, CentOS).

CPU performance

Here are the three benchmarks I will use:

openssl speed -evp sha1
openssl speed -evp chacha20-poly1305
# https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz (116606704 bytes)
time xz -d < /dev/shm/linux-5.10.tar.xz > /dev/null

The xz benchmark uses decompression of a known file to ease reproducibility, and this file is stored in memory (/dev/shm) to make sure we have no disk I/O.

All benchmarks use a single CPU core.

I converted all results into MB/s for easier comparison, taking the largest block size for the openssl results (16 KiB). As a reminder, one MB equals 1000000 bytes. For the xz benchmark, the real elapsed time is used.
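As a rough sketch of these conversions: the function names below are made up for illustration, the 25-second elapsed time and the openssl value are hypothetical inputs rather than measurements, and the xz throughput is assumed to be computed over the compressed input size.

```shell
# Throughput in MB/s (1 MB = 1000000 bytes) from a byte count and an elapsed time in seconds.
bytes_to_mbps() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", bytes / secs / 1e6 }'
}

# openssl speed prints throughput in thousands of bytes per second with a "k" suffix:
# strip the suffix and divide by 1000 to get MB/s.
openssl_k_to_mbps() {
    awk -v kbytes="${1%k}" 'BEGIN { printf "%.1f\n", kbytes / 1000 }'
}

# Hypothetical inputs: 25 s of "real" time for the 116606704-byte archive,
# and an illustrative value for the 16384-byte column of openssl speed.
bytes_to_mbps 116606704 25
openssl_k_to_mbps 97500.00k
```

The elapsed time would come from the "real" line printed by time, and the openssl value from the last column of the summary line printed at the end of openssl speed.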

To ease comparisons with other hardware, I computed a "speedup" of each board compared to the VisionFive 2: 1x means the same performance, 2x means twice as fast, 0.33x means three times as slow, etc.
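The speedup is a plain ratio of throughputs. A minimal sketch, using a hypothetical helper name and two xz figures from the results table:

```shell
# Speedup of a board relative to the reference (VisionFive 2): board / reference.
speedup() {
    awk -v board="$1" -v ref="$2" 'BEGIN { printf "%.2fx\n", board / ref }'
}

# xz results: Raspberry Pi 4B (6.82 MB/s) vs. VisionFive 2 (4.66 MB/s)
speedup 6.82 4.66
# sha1 results: VisionFive 1 (64.1 MB/s) vs. VisionFive 2 (97.5 MB/s)
speedup 64.1 97.5
```

Values above 1x mean the board is faster than the VisionFive 2, values below 1x mean it is slower.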

You can find the full output of the benchmarks for each machine here.

Most hardware is either running locally, from the Compile Farm, or from Grid'5000. The Celeron G1840T system belongs to Deuxfleurs. The Kobol Helios64 result is courtesy of Max. Raspberry Pi 3B+ and 4 are courtesy of $DAYJOB.

| Arch.       | Hardware             | System               | sha1               | chacha20           | xz decompr.         |
|-------------|----------------------|----------------------|--------------------|--------------------|---------------------|
| RISC-V      | VisionFive 2         | Debian unstable      | 97.5 MB/s          | 50.8 MB/s          | 4.66 MB/s           |
| RISC-V      | VisionFive 1 (gcc91) | Debian unstable      | 64.1 MB/s (0.66x)  | 33.0 MB/s (0.65x)  | 2.68 MB/s (0.58x)   |
| RISC-V      | HiFive Unmatched     | gcc92, Ubuntu 22.04  | 34.7 MB/s (0.36x)  | 40.6 MB/s (0.80x)  | 3.12 MB/s (0.67x)   |
| ARM / ARM64 | Raspberry Pi 1       | Debian 11, armv6l    | 27.3 MB/s (0.28x)  | 22.9 MB/s (0.45x)  | 0.928 MB/s (0.20x)  |
| ARM / ARM64 | Raspberry Pi 3B      | Debian 11            | 149 MB/s (1.53x)   | 188 MB/s (3.70x)   | 4.35 MB/s (0.93x)   |
| ARM / ARM64 | Raspberry Pi 3B+     | Debian 11            | 174 MB/s (1.79x)   | 225 MB/s (4.44x)   | 5.02 MB/s (1.08x)   |
| ARM / ARM64 | Raspberry Pi 4B      | Debian 11            | 192 MB/s (1.97x)   | 266 MB/s (5.24x)   | 6.82 MB/s (1.46x)   |
| ARM / ARM64 | Kobol Helios64       | Armbian 22.02.1      | 979 MB/s (10x)     | 323 MB/s (6.36x)   | 7.64 MB/s (1.64x)   |
| ARM / ARM64 | Ampere eMAG          | gcc185, CentOS 8     | 903 MB/s (9.26x)   | 296 MB/s (5.83x)   | 12.3 MB/s (2.64x)   |
| ARM / ARM64 | Mac M1               | gcc103, Debian 12    | 2244 MB/s (23x)    | 1710 MB/s (33.7x)  | 21.0 MB/s (4.51x)   |
| x86_64      | Celeron G1840T       | NixOS 22.11          | 599 MB/s (6.14x)   | 678 MB/s (13.3x)   | 11.2 MB/s (2.40x)   |
| x86_64      | Xeon Gold 6130       | dahu.g5k, Deb. 11    | 1045 MB/s (10.7x)  | 2611 MB/s (51.4x)  | 17.1 MB/s (3.67x)   |
| x86_64      | i7-8086K             | Ubuntu 20.04         | 1414 MB/s (14.5x)  | 2971 MB/s (58.5x)  | 22.6 MB/s (4.85x)   |
| x86_64      | AMD EPYC 7642        | neowise.g5k, Deb. 11 | 1706 MB/s (17.5x)  | 1796 MB/s (35.4x)  | 17.0 MB/s (3.65x)   |
| x86_64      | AMD EPYC 7513        | grat.g5k, Deb. 11    | 1875 MB/s (19.2x)  | 2460 MB/s (48.4x)  | 21.3 MB/s (4.57x)   |

SHA1 and Chacha20-poly1305 results vary a lot from one machine to another, which may be due to optimizations in certain versions of OpenSSL (vectorisation, assembly implementation) or even hardware acceleration for SHA1. They also seem to be sensitive to memory bandwidth: the Raspberry Pis have much better memory bandwidth than the RISC-V boards. In contrast, xz results seem much more representative of raw CPU performance (clock frequency, CPU cache, out-of-order execution, memory access patterns...).

To get clock frequency out of the equation, I am now showing xz results normalized by the clock frequency, measured in "CPU cycles per processed byte" (basically dividing clock frequency by xz performance). It should give an idea of the overall performance of the CPU architecture for this specific decompression task. Beware, lower values are now better!

| Hardware         | Max clock frequency | xz -d cycles/byte (lower is better) |
|------------------|---------------------|-------------------------------------|
| VisionFive 2     | 1.50 GHz            | 322                                 |
| VisionFive 1     | 1.20 GHz            | 448 (0.72x)                         |
| HiFive Unmatched | 1.20 GHz            | 385 (0.84x)                         |
| Raspberry Pi 1   | 0.70 GHz            | 754 (0.43x)                         |
| Raspberry Pi 3B  | 1.20 GHz            | 276 (1.17x)                         |
| Raspberry Pi 3B+ | 1.40 GHz            | 279 (1.15x)                         |
| Raspberry Pi 4B  | 1.50 GHz            | 220 (1.46x)                         |
| Kobol Helios64   | 1.80 GHz            | 236 (1.37x)                         |
| Ampere eMAG      | 3.00 GHz            | 244 (1.32x)                         |
| Mac M1           | 3.00 GHz            | 143 (2.25x)                         |
| Celeron G1840T   | 2.50 GHz            | 223 (1.44x)                         |
| Xeon Gold 6130   | 3.70 GHz            | 216 (1.49x)                         |
| i7-8086K         | 5.00 GHz            | 221 (1.45x)                         |
| AMD EPYC 7642    | 3.30 GHz            | 194 (1.66x)                         |
| AMD EPYC 7513    | 3.65 GHz            | 171 (1.88x)                         |
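The cycles-per-byte figures can be reproduced from the maximum clock frequency and the xz throughput. This is only a sketch with a hypothetical helper name, and it assumes the CPU actually ran at its maximum frequency for the whole benchmark:

```shell
# CPU cycles spent per processed byte: clock frequency (Hz) / throughput (bytes/s).
# Arguments: clock frequency in GHz, xz throughput in MB/s (1 MB = 1000000 bytes).
cycles_per_byte() {
    awk -v ghz="$1" -v mbps="$2" 'BEGIN { printf "%.0f\n", (ghz * 1e9) / (mbps * 1e6) }'
}

cycles_per_byte 1.50 4.66   # VisionFive 2
cycles_per_byte 3.00 21.0   # Mac M1
```

Since lower is better here, the relative figure shown next to each value is the VisionFive 2 value divided by the board's value, not the other way around.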

Here are some of the main highlights of these results:

  • VisionFive 2 single-core performance is 52% to 74% higher than VisionFive 1. This is very good compared to the 25% clock frequency improvement. When normalizing by the clock frequency, the VisionFive 2 is 39% faster per MHz than the VisionFive 1.
  • VisionFive 2 is also 25% to 50% faster than the HiFive Unmatched on Chacha20 and xz (and about 2.8 times faster on SHA1). When normalizing by the clock frequency, the VisionFive 2 is 20% faster per MHz than the Unmatched. The Unmatched was itself slightly faster than the VisionFive 1 on a single-core basis.
  • VisionFive 2 is roughly as fast as a Raspberry Pi 3B/3B+ on the xz benchmark, but much slower for SHA1 and Chacha20.
  • VisionFive 2 is still around 1.5 times slower than a Raspberry Pi 4 on xz (and 5 times slower on Chacha20).

Here are other interesting insights:

  • The Raspberry Pi 1 always felt really slow. Well, now I know it's objectively really slow. Even when taking into account its low clock frequency of 700 MHz, performance per MHz is still really poor.
  • The Intel CPUs (from 2014, 2017 and 2018 respectively) have very similar performance per MHz for this task, despite being very different in terms of frequency, number of cores and price. This indicates that they basically share the same kind of architectural design.
  • The Raspberry Pi 4 and the Helios64 have good performance per MHz for a SoC, even comparable to a $1900 Intel CPU from 2017! Of course, the Intel CPU has many more cores, and there may be other workloads where Intel CPUs are much better.
  • The AMD EPYC CPUs (Zen 2 and Zen 3) have very good performance per MHz for this workload, and there is a clear improvement from Zen 2 to Zen 3.
  • As always, the Mac M1 is really impressive and easily smashes all other processors I could test on a per-MHz basis.

As a final note: remember that this is a single benchmark and is not representative of all kinds of computing workloads. I suspect xz to be quite sensitive to the amount of CPU cache and to memory latency.

Energy consumption

Now that we have an idea of CPU performance, the other important criterion is energy consumption. Here, I am interested in whole-system energy consumption. I could only measure it for systems I have locally, so only a subset of the previous machines are tested here. Technically, I could have used wattmeters available on Grid'5000, but it makes little sense to compare the power consumption of a big server with that of a small embedded board.

All figures below are taken using a basic Perel plug-in wattmeter on 230V. The wattmeter gives the "active" (or real) power in Watts, as well as the power factor. All figures include the power transformer, which is either an Akashi ALT2USBACCH USB transformer designed for up to 2.4A (VisionFive 1 & 2, Raspberry Pis), the stock Lenovo power transformer (Celeron G1840T), or an ATX power supply (HiFive Unmatched, i7-8086K).

For each system, I measure power consumption in the following situations: idle, 1 CPU core at 100%, half of the CPU cores at 100%, and all CPU cores at 100% (ignoring hyper-threads). Each measurement is run for only a few seconds (while still waiting for a steady state) to avoid thermal throttling.

The workload is a simple infinite loop in bash: while :; do :; done. All systems run Linux (various versions and distributions), have one NIC up, and no screen or other peripheral attached.
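To load several cores at once, one such loop can be started per core. The sketch below is an assumption about how this could be scripted, not necessarily how the measurements were driven: load_cores is a hypothetical helper, and it relies on the scheduler to spread the loops over distinct cores rather than pinning them explicitly.

```shell
# Start N busy loops in the background, keep them running for a given
# number of seconds, then stop them. Read the wattmeter while it sleeps.
load_cores() {
    n="$1"
    secs="$2"
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        ( while :; do :; done ) &
        pids="$pids $!"
        i=$((i + 1))
    done
    sleep "$secs"
    # $pids is deliberately unquoted so that each PID is a separate argument.
    kill $pids 2>/dev/null
    wait $pids 2>/dev/null || true
    echo "loaded $n core(s) for ${secs}s"
}

# Example: load 2 cores for 3 seconds.
load_cores 2 3
```

On a 4-core board, the "1 core", "half cores" and "all cores" situations would correspond to N = 1, 2 and 4.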

Note: I am not very confident in the absolute power values shown below (because I don't really trust the wattmeter or the USB transformer). However, since I did all measurements in the same conditions, the values are comparable with each other.

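| Hardware         | Configuration              | Idle   | 1 core | Half cores | All cores |
|------------------|----------------------------|--------|--------|------------|-----------|
| VisionFive 2     | 4 cores, 8 GB RAM          | 7.4 W  | 10.4 W | 11.2 W     | 13.1 W    |
| VisionFive 1     | 2 cores, 8 GB RAM          | 10.6 W | 11.1 W | -          | 11.6 W    |
| HiFive Unmatched | 4 cores, 16 GB RAM         | 56.8 W | 57.7 W | 58.6 W     | 60.7 W    |
| Raspberry Pi 1   | 1 core, 512 MB RAM         | 5.9 W  | 6.2 W  | -          | -         |
| Raspberry Pi 3B  | rev 1.2, 4 cores, 1 GB RAM | 4.5 W  | 6.5 W  | 8.7 W      | 13.8 W    |
| Raspberry Pi 3B+ | 4 cores, 1 GB RAM          | 7.0 W  | 9.7 W  | 12.2 W     | 18.0 W    |
| Raspberry Pi 4B  | rev 1.5, 4 cores, 2 GB RAM | 4.6 W  | 6.8 W  | 8.3 W      | 11.1 W    |
| Celeron G1840T   | 2 cores                    | 12 W   | 18 W   | -          | 23.5 W    |
| i7-8086K         | 6 c. / 12 threads          | 23.4 W | 61.5 W | 79.5 W     | 112.7 W   |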

Clearly, the VisionFive 2 is quite power-efficient compared to the older RISC-V boards. According to its documentation, it can run without any heatsink or fan for bursty loads (e.g. web browsing), but a fan is recommended for long computations. This is consistent with my power consumption measurements.

Interestingly, the Raspberry Pi 3B+ has a similar power profile to the VisionFive 2. This makes sense because they are in the same class of devices: same number of cores, similar maximum clock frequency, similar performance. But it's still noteworthy that the relatively young SoC found on the VisionFive 2 has a power consumption that is so similar to that of the more mature SoC found on the Raspberry Pi 3B+.

We can also observe that Intel is much better at dynamic frequency scaling, which helps to achieve low power usage when the CPU is idle. As far as I know, the SoCs in the VisionFive 1 and the HiFive Unmatched have no frequency scaling, which explains their near-constant power usage. The VisionFive 2 does have frequency scaling, so it's already much better (about 44% less power when idle than when fully loaded: 7.4 W vs. 13.1 W).

Here are some details about the hardware to put these numbers into context:

  • VisionFive 1: no fan, kernel 5.18 (Debian). Power consumption changes significantly with die temperature (9 W idle at 36 °C, 10.4 W idle at 50 °C)
  • VisionFive 2: no fan, kernel 5.15 (Debian), no NVMe, 100M NIC. Using the gigabit NIC would add 0.5 W of power usage.
  • Raspberry Pi 3B: rev 1.2, 1 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.20 GHz max frequency.
  • Raspberry Pi 3B+: 1 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.40 GHz max frequency.
  • Raspberry Pi 4B: rev 1.5, 2 GB RAM, no fan, kernel 5.10 (Debian 10). 600 MHz idle frequency, 1.50 GHz max frequency.
  • Celeron G1840T: 800 MHz idle frequency, 2.5 GHz max frequency. Lenovo ThinkCentre M73, 4 GB DDR3, ST500LM021-1KJ15 disk.
  • i7-8086K: 800 MHz idle frequency, 4 GHz max frequency, 5 GHz turbo frequency. ASRock H310CM-HDV/M.2 motherboard, 16 GB + 8 GB DDR4, Samsung 980 500GB NVMe, ATX power supply, 2 case fans

Note: earlier versions of this article used some POE power measurements from a switch (for the Raspberry Pis, with the POE hat). After re-doing the measurements with the USB power supply and plug-in wattmeter, it turns out that power measurements given by the POE switch were substantially lower than the wattmeter (probably because the POE switch measurements do not include the AC-to-DC power converter). Moreover, POE values were not stable. In the end, I decided to remove these POE values and only use the USB power supply to enable a fair comparison.

CPU performance vs. energy

Now that we have both CPU performance and energy consumption, we can mix the two results to look at energy efficiency. The most reliable figure in the table below is single-core efficiency: it is obtained by simply dividing the result of the single-core performance benchmark for xz by the single-core power consumption. I also extrapolate some figures for all-cores efficiency, but this value should be taken with a grain of salt: it is obtained by multiplying single-core performance by the number of cores (excluding hyper-threads) and dividing the total by the measured all-cores power consumption. Many effects such as thermal throttling, frequency boost for single-core load, and shared cache between cores may decrease the actual all-cores performance and thus decrease the actual all-cores efficiency compared to the figures below.

| Hardware         | Cores             | Single-core efficiency | All-cores efficiency (extrapolated) |
|------------------|-------------------|------------------------|-------------------------------------|
| VisionFive 2     | 4 cores           | 0.448 MB/s/W           | 1.42 MB/s/W                         |
| VisionFive 1     | 2 cores           | 0.241 MB/s/W           | 0.462 MB/s/W                        |
| HiFive Unmatched | 4 cores           | 0.0541 MB/s/W          | 0.206 MB/s/W                        |
| Raspberry Pi 1   | 1 core            | 0.150 MB/s/W           | -                                   |
| Raspberry Pi 3B  | 4 cores           | 0.670 MB/s/W           | 1.26 MB/s/W                         |
| Raspberry Pi 3B+ | 4 cores           | 0.517 MB/s/W           | 1.12 MB/s/W                         |
| Raspberry Pi 4B  | 4 cores           | 1.00 MB/s/W            | 2.46 MB/s/W                         |
| Celeron G1840T   | 2 cores           | 0.622 MB/s/W           | 0.953 MB/s/W                        |
| i7-8086K         | 6 c. / 12 threads | 0.367 MB/s/W           | 1.20 MB/s/W                         |
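Both efficiency figures follow the same formula, sketched here with a hypothetical helper and the VisionFive 2 numbers (4.66 MB/s on xz, 10.4 W with one core loaded, 13.1 W with all four cores loaded):

```shell
# Energy efficiency in MB/s/W: single-core throughput times the number of
# loaded cores, divided by the power measured with that many cores loaded.
# The all-cores figure is an extrapolation: it assumes perfect scaling.
efficiency() {
    awk -v mbps="$1" -v cores="$2" -v watts="$3" \
        'BEGIN { printf "%.3g\n", mbps * cores / watts }'
}

efficiency 4.66 1 10.4   # VisionFive 2, single core
efficiency 4.66 4 13.1   # VisionFive 2, all cores (extrapolated)
```

For CPUs with hyper-threading, the core count passed here is the number of physical cores, as in the table.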

Overall, the VisionFive 2 is much more energy-efficient than existing RISC-V boards: it is 2 to 3 times more energy-efficient than the VisionFive 1, and 7 to 8 times more energy-efficient than the Unmatched. It may seem counter-intuitive that the Unmatched is so inefficient, but that's probably because of its larger form factor, power-hungry PCIe and DDR4, and the need for an ATX power supply that may not be super efficient at low power load.

Similarly, even though x86_64 hardware is much faster than the VisionFive 2 (2.4 times to 4.8 times faster), it has roughly the same energy efficiency! If you have moderate computing needs, the VisionFive 2 is an efficient alternative to bigger systems.

Compared to the Raspberry Pi 3B and 3B+, the VisionFive 2 again has similar energy-efficiency. This makes sense because it has roughly the same performance and the same power consumption.

Finally, the Raspberry Pi 4 is the real winner on the efficiency metric: the VisionFive 2 is half as energy-efficient as a Raspberry Pi 4.

Conclusion

When looking at single-core CPU performance, the VisionFive 2 is roughly 75% faster than the original VisionFive. Since it also has twice the core count, total performance is roughly 3.5 times higher (+250%). And since it has a similar power consumption, it is also 2 to 3 times more energy-efficient. So that's definitely a very big improvement.

Compared to the HiFive Unmatched (which is not even technically an SBC), the VisionFive 2 still outperforms it by 50%, and is 7 to 8 times more energy-efficient. So, as far as I can tell, the claim about it being a "high-performance RISC-V SBC" is true.

When comparing with Raspberry Pis, the VisionFive 2 is about as fast as a Raspberry Pi 3B+, although much slower on memory-heavy benchmarks, and also as energy-efficient. However, the Raspberry Pi 4 is still 46% faster on xz and about twice as energy-efficient. As far as I can tell, both SoCs are 28 nm, so we would ideally expect the same energy efficiency.

Compared to low-power x86_64 systems, the VisionFive 2 is of course slower when looking at raw performance, but at the same time it is as energy-efficient. This is a general advantage that SBCs have over more complete systems: they have far fewer peripherals, are less extensible and have generally lower performance, but they are much more energy-efficient.

Again, remember that all figures discussed here are approximate, and specific benchmark results cannot be extrapolated to generic performance results for all applications.

Overall, the VisionFive 2 is a big step in the right direction, and this kind of RISC-V hardware can definitely compete with recent ARM boards since they have very similar performance-energy tradeoffs.

More pictures

VisionFive 2 in its box

VisionFive 2 in its box (I removed the antistatic wrapping)

VisionFive 2 front

Front with audio, 4xUSB, HDMI, 2xNIC (with one being a 100M NIC, specific to the super early bird version)

VisionFive 2 rear

Rear with USB-C power input, reset button, GPIOs

VisionFive 2 back

Back with NVMe M.2 slot, micro-SD card slot