
Eating Every Last Slice of Your Raspberry Pi – GPU Offload on a Pi4

Sept. 26, 2022

IQT Labs has been experimenting with Raspberry Pi4s and Software Defined Radios (SDRs) as an on-the-go platform through the GamutRF project. GamutRF primarily focuses on detecting, capturing, and distinguishing between different kinds of radio frequency (RF) signals such as FM radio stations and WiFi. In this post, we discuss how we made GamutRF, running on a Pi4, more responsive and reliable by using the built-in GPU to perform fast Fourier transforms (FFTs). Last month (August 2022), Vulkan 1.2 support for the Pi4 became available, enabling us to use the VkFFT framework with the Vulkan API. In our benchmark, this combination cut user CPU time by roughly a third, freeing up CPU time for compression and disk I/O and helping to avoid dropped data. We also discovered that a subtle CPU hardware difference between Pi4s greatly influenced system reliability for our use case.

Putting a Pi4 and SDR to Work

The continuing silicon supply shortage has brought delays and uncertainty of supply to consumers of general-purpose CPUs and FPGAs globally. Even users of lower-end devices like the Raspberry Pi have felt this pinch. For our team, difficulties in securing the Pi4s we typically use have inspired a series of efforts to wring every last CPU cycle of performance out of the hardware we have on hand.

The Pi4 is certainly more powerful than previous Pi models, and can come with as much as 8GB RAM, a Gigabit Ethernet port, and USB3. It also has very low power requirements compared with desktop computers – 5W or less – so it can potentially be powered for hours by a consumer USB power bank. Of course, it was not designed to match a desktop or typical laptop’s performance, so software must be written with resource constraints in mind.

GamutRF has a few different components that rely on a Pi4. One of the more resource-intensive components is what we call the “worker”. The “worker” is an SDR attached to a Pi4, treated as a single unit, that other components of the system task with making a brief but high-bandwidth recording (typically 20Msps) of a signal at a given radio frequency.

The key features required by the “worker” are:

  1. respond quickly to an RF record request
  2. record the signal
  3. produce a power spectrum for other tools to work with as fast as possible

These recordings can be a mess: modulation schemes, data protocols, and manufacturing artifacts all contribute to the patterns present in RF signals. Often these patterns are hidden or imperceptible, making them irrelevant to normal communications, but nonetheless informative to a carefully designed observer. Fourier analysis allows us to decompose an RF signal into a sum of well-defined periodic components, each with its own amplitude, frequency, and phase. Fast Fourier transforms (FFTs) are algorithms that efficiently convert a recorded signal to a representation in the frequency domain. Traditionally, FFTs have been difficult and time-consuming to produce. Part of the GamutRF exploration focuses on how to produce FFTs more effectively, using low-cost hardware like Pi4s and SDRs, as we seek to better understand the RF domain.
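To make the idea of a frequency-domain representation concrete, here is a minimal sketch of a radix-2 FFT in plain Python. It is an illustration only, not GamutRF's code: the 64-sample test tone and the names `fft`, `N`, and `TONE_BIN` are assumptions made for the example.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return list(samples)
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# Hypothetical test signal: a complex tone sitting exactly on bin 5 of 64.
N = 64
TONE_BIN = 5
signal = [cmath.exp(2j * math.pi * TONE_BIN * n / N) for n in range(N)]

spectrum = fft(signal)
magnitudes = [abs(x) for x in spectrum]
peak_bin = max(range(N), key=lambda k: magnitudes[k])
print("strongest bin:", peak_bin)
```

All of the tone's energy lands in a single output bin, which is the "relative level per frequency" view that the rest of the pipeline builds on.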

De-Noising the Data and Improving Spectrum Visualization

As a quick primer, think of FFTs as the relative levels of different frequencies found in a small slice of the signal. FFTs can be used to build either a Power Spectral Density (PSD), which shows the distribution of power across the frequencies of a signal for a small window of time, or a spectrogram, which shows the changes in the power spectrum over multiple windows of time.

Power spectral density (PSD), a view of the RF recording over a small window of time, is key to understanding what signals the recording may contain. A common method for estimating the PSD is the “periodogram”: compute the squared magnitude of each frequency component from the FFT (the “power”), then normalize by the frequency bin width.
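The periodogram described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GamutRF's implementation: it uses a direct DFT for brevity, and the 32-sample tone, `FS`, and the function names are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Direct O(N^2) DFT, fine for a tiny illustration."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        for k in range(n)
    ]


def periodogram(samples, sample_rate):
    """Return a PSD estimate (power per Hz) for each of the N frequency bins."""
    n = len(samples)
    bin_width = sample_rate / n  # Hz covered by one bin
    spectrum = dft(samples)
    # |X[k]|^2 / N^2 is the power in bin k; dividing by the bin width gives a density.
    return [(abs(x) ** 2 / n ** 2) / bin_width for x in spectrum]


# Hypothetical input: 32 samples of a unit-amplitude complex tone at 20 Msps.
FS = 20e6
N = 32
samples = [cmath.exp(2j * math.pi * 4 * i / N) for i in range(N)]
psd = periodogram(samples, FS)

# Sanity check (Parseval): integrating the PSD over frequency gives the mean power.
mean_power = sum(abs(x) ** 2 for x in samples) / N
integrated = sum(psd) * (FS / N)
print(mean_power, integrated)
```

The final check is a useful habit: if integrating the PSD does not recover the signal's mean power, the normalization is off.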

While there are multiple ways of approaching PSD estimation, we use an improved PSD estimate called Welch’s Method. This method splits the signal into overlapping windowed segments, computes a periodogram for each segment, and averages the results, which reduces noise in the estimated power spectrum. A spectrogram can then be created by stacking a series of PSDs together, showing how the power spectrum changes over a longer span of time. Depending on the characteristics of a particular signal, we can use either the PSD or the spectrogram to identify signals of interest using machine learning models.
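A rough sketch of Welch-style averaging over overlapping windowed segments follows. It is an assumption-laden illustration rather than GamutRF's code: the Hamming window, the 16-point segments, the seeded noise, and the function names are all chosen for the example. The 50% overlap echoes the ratio of the `--nfft 2048 --nfft_overlap 1024` flags in the benchmark command later in this post, although how the tool interprets those flags is not described here.

```python
import cmath
import math
import random


def dft(samples):
    """Direct O(N^2) DFT; adequate for the tiny segments used here."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        for k in range(n)
    ]


def hamming(n):
    """Hamming window coefficients for a segment of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def welch_psd(samples, sample_rate, nfft, overlap):
    """Average windowed periodograms over segments that overlap by `overlap` samples."""
    window = hamming(nfft)
    scale = sample_rate * sum(w * w for w in window)  # density normalization
    step = nfft - overlap
    totals = [0.0] * nfft
    segments = 0
    for start in range(0, len(samples) - nfft + 1, step):
        segment = [samples[start + i] * window[i] for i in range(nfft)]
        for k, x in enumerate(dft(segment)):
            totals[k] += abs(x) ** 2 / scale
        segments += 1
    return [t / segments for t in totals], segments


# Hypothetical input: a tone on bin 3 of a 16-point FFT, buried in seeded noise.
rng = random.Random(1)
NFFT, OVERLAP, FS = 16, 8, 20e6
samples = [
    cmath.exp(2j * math.pi * 3 * i / NFFT)
    + complex(rng.gauss(0, 0.5), rng.gauss(0, 0.5))
    for i in range(128)
]
psd, segments = welch_psd(samples, FS, NFFT, OVERLAP)
peak_bin = max(range(NFFT), key=lambda k: psd[k])
print("segments averaged:", segments, "peak bin:", peak_bin)
```

Averaging more segments smooths the noise floor at the cost of frequency resolution, since each segment is shorter than the full recording.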

An example showing the visual differences between an FFT, a PSD, and a spectrogram. In GamutRF, the signal amplitude would be in decibels (dB).
Source: https://blog.endaq.com/vibration-analysis-fft-psd-and-spectrogram

We built a custom tool on the GamutRF platform that performs the process above: it records RF samples in a particular environment and uses the FFT to generate a spectral representation. The spectral output can be used by other GamutRF tools to determine what kind of signal is present.

Example of a GamutRF spectrogram.

Building the FFTs: Collecting, Compressing, Processing the Data

At 20Msps (megasamples per second, or millions of samples per second), each I/Q sample from the SDR is a complex number represented as a pair of signed 16-bit integers (4 bytes), so the raw data rate is 4 bytes * 20Msps, or 80e6 bytes per second.
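The arithmetic above is easy to check in a few lines. In this small sketch, the little-endian byte order and the helper name `raw_bytes_per_second` are assumptions for illustration. The 20.48Msps case matches the `max_buffer_size: 81920000 (20480000 samples)` line in the benchmark logs later in this post.

```python
import struct

BYTES_PER_COMPONENT = 2    # one signed 16-bit integer
COMPONENTS_PER_SAMPLE = 2  # I and Q
BYTES_PER_SAMPLE = BYTES_PER_COMPONENT * COMPONENTS_PER_SAMPLE


def raw_bytes_per_second(sample_rate):
    """Uncompressed data rate for 16-bit I/Q samples at the given sample rate."""
    return int(sample_rate * BYTES_PER_SAMPLE)


# One I/Q sample packed as two int16 values (little-endian is an assumption here).
packed = struct.pack("<hh", 1234, -567)
print("bytes per sample:", len(packed))
print("20 Msps:", raw_bytes_per_second(20e6), "bytes/s")
print("20.48 Msps:", raw_bytes_per_second(20.48e6), "bytes/s")
```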

This large volume of data must first be written to disk efficiently. GamutRF uses Facebook’s zstd compression library which, depending on the signal recorded, achieves 20-200% compression in our experiments. This compression greatly reduces the amount of I/O needed to write out the recorded samples, which gives the CPU more headroom. It also dramatically increases the number of samples that can be stored in the limited storage supported by a Pi4.
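To illustrate why the compression achieved depends on the signal recorded, here is a small sketch. Note the substitution: it uses Python's built-in `zlib` as a stand-in for zstd so that the example runs without extra packages, so the ratios it prints say nothing about zstd's actual performance. The synthetic "quiet" and "loud" recordings are hypothetical.

```python
import random
import struct
import zlib


def pack_iq(values):
    """Pack int16 I/Q components as little-endian bytes (byte order is an assumption)."""
    return struct.pack("<%dh" % len(values), *values)


rng = random.Random(0)
COUNT = 20000  # number of 16-bit components (I and Q interleaved)

# Hypothetical recordings: a weak signal using few bits vs. full-scale noise.
quiet = pack_iq([rng.randint(-8, 8) for _ in range(COUNT)])
loud = pack_iq([rng.randint(-32768, 32767) for _ in range(COUNT)])

sizes = {}
for name, raw in (("quiet", quiet), ("loud", loud)):
    compressed = zlib.compress(raw, 1)  # zlib stands in for zstd here
    sizes[name] = len(compressed)
    print(name, len(raw), "->", len(compressed), "bytes")
```

A low-amplitude recording leaves most high-order bits unused and compresses well, while full-scale noise is close to incompressible.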

Next, FFTs must be calculated to produce the spectrogram and PSDs. In a separate thread on the recorder, the I/Q samples are staged in memory where the Pi4 GPU can access them. A shader program is uploaded to the GPU and performs the FFT calculation in batches. The power spectrum results are retrieved from the GPU, converted to dB values, and written directly to disk as a spectrogram image.
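The last step, turning power values into dB and then into image pixels, might look something like the sketch below. This is an illustration under assumptions, not the tool's actual code: the clamping floor, the -120 dB to 0 dB display range, and the function names are hypothetical choices.

```python
import math


def power_to_db(power, floor=1e-12):
    """Convert linear power to decibels, clamping to avoid log10(0)."""
    return 10.0 * math.log10(max(power, floor))


def db_to_pixel(db, min_db=-120.0, max_db=0.0):
    """Map a dB value onto a 0-255 grayscale level (the range is a hypothetical choice)."""
    fraction = (db - min_db) / (max_db - min_db)
    return round(255 * min(1.0, max(0.0, fraction)))


# One hypothetical row of power values, as might come back from a single FFT.
powers = [1.0, 0.1, 0.001, 0.0]
row_db = [power_to_db(p) for p in powers]
row_pixels = [db_to_pixel(d) for d in row_db]
print(row_db)
print(row_pixels)
```

Stacking one such row per FFT gives the rows of a spectrogram image.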

The processing load for all the operations above is intense, and we have found it consistently pushes the Pi4 to its limits.

Sharing the Load – Leveraging the GPU for Additional Pi Compute

Once we worked out how to stage I/Q samples in memory for direct access by the GPU and offload FFT calculations from the CPU, we were able to increase not only the number of samples that could be recorded and processed, but also the resiliency of being able to do so consistently.

The following is a benchmark comparing the CPU-based FFT with the VkFFT GPU-based FFT on the same Pi4 and SDR:

CPU-based version:

$ sudo time -p ./mt_rx_samples_to_file --args num_recv_frames=128,recv_frame_size=16360 --nfft 2048 --nfft_overlap 1024 --duration 10 --fftnull --file test.dat.zst --rate 20.48e6 --novkfft

Creating the usrp device with: num_recv_frames=128,recv_frame_size=16360...
[INFO] [UHD] linux; GNU C++ version 11.2.0; Boost_107400; UHD_4.1.0.5-3
[INFO] [B200] Detected Device: B200
[INFO] [B200] Operating over USB 3.
[INFO] [B200] Initialize CODEC control...
[INFO] [B200] Initialize Radio control...
[INFO] [B200] Performing register loopback test... 
[INFO] [B200] Register loopback test passed
[INFO] [B200] Setting master clock rate selection to 'automatic'.
[INFO] [B200] Asking for clock rate 16.000000 MHz... 
[INFO] [B200] Actually got clock rate 16.000000 MHz.
Using Device: Single USRP:
  Device: B-Series Device
  Mboard 0: B200
  RX Channel: 0
    RX DSP: 0
    RX Dboard: A
    RX Subdev: FE-RX1
  TX Channel: 0
    TX DSP: 0
    TX Dboard: A
    TX Subdev: FE-TX1

Setting RX Rate: 20.480000 Msps...
[INFO] [B200] Asking for clock rate 20.480000 MHz... 
[INFO] [B200] Actually got clock rate 20.480000 MHz.
Actual RX Rate: 20.480000 Msps...

Setting RX Freq: 100.000000 MHz...
Setting RX LO Offset: 0.000000 MHz...
Actual RX Freq: 100.000000 MHz...

Waiting for "lo_locked": ++++++++++ locked.

Press Ctrl + C to stop streaming...
defaulting spb to rate (20480000)
max_samps_per_packet from stream: 4086
max_buffer_size: 81920000 (20480000 samples)
opening /home/ubuntu/mt_rx_samples_to_file/build/.test.dat.zst
writing zstd compressed output
using FFT point size 2048
.
.
.
stream stopped
.
.
.
.
.
.
.
write samples worker done
fft worker done
fft out worker done
closing test.dat.zst
closed

Done!

real 50.38
user 75.92
sys 17.26


GPU-based version:

$ sudo time -p ./mt_rx_samples_to_file --args num_recv_frames=128,recv_frame_size=16360 --nfft 2048 --nfft_overlap 1024 --duration 10 --fftnull --file test.dat.zst --rate 20.48e6
using vkFFT batch size 100 on V3D 4.2

Creating the usrp device with: num_recv_frames=128,recv_frame_size=16360...
[INFO] [UHD] linux; GNU C++ version 11.2.0; Boost_107400; UHD_4.1.0.5-3
[INFO] [B200] Detected Device: B200
[INFO] [B200] Operating over USB 3.
[INFO] [B200] Initialize CODEC control...
[INFO] [B200] Initialize Radio control...
[INFO] [B200] Performing register loopback test... 
[INFO] [B200] Register loopback test passed
[INFO] [B200] Setting master clock rate selection to 'automatic'.
[INFO] [B200] Asking for clock rate 16.000000 MHz... 
[INFO] [B200] Actually got clock rate 16.000000 MHz.
Using Device: Single USRP:
  Device: B-Series Device
  Mboard 0: B200
  RX Channel: 0
    RX DSP: 0
    RX Dboard: A
    RX Subdev: FE-RX1
  TX Channel: 0
    TX DSP: 0
    TX Dboard: A
    TX Subdev: FE-TX1

Setting RX Rate: 20.480000 Msps...
[INFO] [B200] Asking for clock rate 20.480000 MHz... 
[INFO] [B200] Actually got clock rate 20.480000 MHz.
Actual RX Rate: 20.480000 Msps...

Setting RX Freq: 100.000000 MHz...
Setting RX LO Offset: 0.000000 MHz...
Actual RX Freq: 100.000000 MHz...

Waiting for "lo_locked": ++++++++++ locked.

Press Ctrl + C to stop streaming...
defaulting spb to rate (20480000)
max_samps_per_packet from stream: 4086
max_buffer_size: 81920000 (20480000 samples)
opening /home/ubuntu/mt_rx_samples_to_file/build/.test.dat.zst
writing zstd compressed output
using FFT point size 2048
.
.
.
stream stopped
.
.
.
.
.
.
.
write samples worker done
fft worker done
fft out worker done
closing test.dat.zst
closed

Done!

real 54.39
user 49.70
sys 48.81

Improved Pi4 Performance and Process Resiliency via Sharing the Load

The two runs above are essentially the same except for the time consumed: the VkFFT version takes about 49.7 seconds of user CPU time, while the CPU-based version takes about 75.9 seconds of user CPU time, for a recording duration of 10 seconds. (One note: this particular GamutRF tool uses multiple threads as a processing pipeline, so it will use more CPU time than real time.)
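For readers who want to check the comparison, this short sketch works through the `time -p` figures from the two logs above. The numbers are copied from the logs. The dictionary layout and variable names are just for illustration.

```python
# `time -p` results copied from the two benchmark runs above (seconds).
runs = {
    "cpu_fft": {"real": 50.38, "user": 75.92, "sys": 17.26},
    "gpu_fft": {"real": 54.39, "user": 49.70, "sys": 48.81},
}

user_saved = runs["cpu_fft"]["user"] - runs["gpu_fft"]["user"]
user_reduction = user_saved / runs["cpu_fft"]["user"]
print(f"user CPU time saved: {user_saved:.2f} s ({user_reduction:.1%})")

# With a multi-threaded pipeline, CPU time (user + sys) can exceed wall-clock time.
cpu_to_real = {}
for name, t in runs.items():
    cpu_to_real[name] = (t["user"] + t["sys"]) / t["real"]
    print(f"{name}: CPU time is {cpu_to_real[name]:.2f}x real time")
```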

While the overall processing time, which includes writing the compressed sample file to disk, is similar, the saved CPU time adds a significant reliability margin. Approximately 5% of CPU-based runs drop data because they are unable to sustain the required I/O rate, while none of the GPU runs did. By pushing some of the processing load to the GPU, we were able to increase performance and build in some resiliency for the overloaded CPU.

Stepped On – The Headaches of Slight, Non-Obvious Differences in Pi4 Revisions

While exploring this GPU/CPU tasking concept, we encountered a puzzling performance difference between two seemingly identical Pi4s. One Pi4 could reliably keep up with the approximately 20Msps sample rate; the other could not. We eventually determined that the CPUs were slightly different. It turns out that an earlier revision of the Pi4’s CPU (the B0 revision specifically) has restrictions on what areas of available memory can be accessed for I/O. This difference is barely noticeable in day-to-day tasks. For example, in our benchmarking, a compilation job on the B0 revision CPU takes a few seconds longer, over a timescale of many minutes, than on the slightly newer C0 revision CPU.

In the case of our RF collector, this slight difference was enough to make operations on a B0 CPU extremely unreliable. This unreliability appears in the form of overflows: our system was not able to finish processing RF samples from the SDR before new samples arrived. This resulted in dropped data, which is not ideal. Last year, Jeff Geerling wrote up excellent details about the two CPU revisions appearing in Pi4 8GB models. It is definitely worth a read.

GamutRF is ongoing, and efforts to better leverage the Pi4’s resources continue. Potentially, the GPU could perform more operations, such as applying the Hamming window before the FFT and converting to dB afterwards, which would reduce user CPU time substantially. On the Pi4, every little bit helps. The more the Pi4 can do, the less reliant we are on more expensive devices like FPGAs.

Here’s hoping some of this saves others the hours (and headaches) we spent diagnosing our issues, or improves your use of the Pi4.
