PRACH Detection offload to GPU

Offload the PRACH preamble detection pipeline to an NVIDIA GPU via a fused CUDA kernel chain and CUDA-graph execution, achieving roughly 10× lower detection latency compared to the CPU-FFTW path and freeing uplink-PHY CPU headroom under bursty random-access load.
GPU PRACH offload architecture across both validated RU setups (Liteon Split 7.2 / 4T4R and Ettus X410 Split 8 / 1T1R). OCUDU gNB stack - OFH RX (DPDK/UHD), Lower PHY (OFDM PRACH demodulator forward FFT), Upper PHY (PRACH detect - GPU-accelerated path highlighted), MAC/FAPI; NVIDIA GPU card running the CUDA-graph pipeline (H2D cbf16 buffer, cbf16→cf32 convert, fused prod_conj + bin-reorder + cuFFTDx IDFT + |·|² normalize, non-coherent combine, per-shift window accumulate, argmax+divide, D2H results); CUDA stream and graph exec cache. Bottom platform stack: CUDA runtime, cuFFT, NVIDIA MathDx 26.03 (cuFFTDx), Host OS Ubuntu 22.04, Physical Host with CPU + GPU.
GPU PRACH offload - full system architecture across both validated RU setups (Liteon Split 7.2 / 4T4R and Ettus X410 Split 8 / 1T1R). The highlighted block (PRACH DETECT) moves from CPU to the GPU; every other stage in the uplink chain stays in software.

Highlights

  • Full detection pipeline on GPU - correlation, IDFT, power accumulation, and peak detection all run device-resident; only the cbf16 PRACH buffer crosses H2D and only the argmax result (~1 KB) crosses D2H per detection window.
  • cuFFTDx fused kernel - for the standard PRACH IDFT sizes (256-point for short format, 1024-point for long format), a single cuFFTDx device kernel fuses conjugate-product correlation, bin reorder, IDFT, and power normalisation into one launch with no intermediate device-memory traffic.
  • CUDA-graph execution - the full pipeline is captured into a cudaGraphExec_t keyed by detection configuration. Hot-path cost is one cudaGraphLaunch + cudaStreamSynchronize per slot window.
  • Zero-allocation hot path - all device and pinned-host buffers are pre-allocated at construction for worst-case dimensions (64 ports × 12 symbols); detect() touches no allocator.
  • Configuration-driven - select via the OCUDU_PRACH_DFT_BACKEND=gpu_full environment variable at gNB startup; no code changes, no YAML required.
  • Transparent fallback - if no CUDA device is present or the build flag is absent, the runtime falls back to the CPU-FFTW detector silently.
  • Measured ~30–36 % mean latency reduction - short-format detect drops from ~112–115 µs (CPU FFTW) to ~74–79 µs (GPU cuFFTDx), with up to 55 % tail compression, measured on live gNB with X410 (Split 8) and Liteon (Split 7.2) (see Section 7).

1. Prerequisites

1.1 Hardware

  • NVIDIA GPU: Ampere or later (SM86+) for the fused cuFFTDx fast path. Tested on RTX A4000 (SM86, runs the cuFFTDx fused kernel) and RTX 8000 (SM75, falls back to the cuFFT three-kernel path automatically slightly higher overhead). Volta (SM70) and other Turing cards behave like RTX 8000.
  • x86-64 CPU with at least AVX2 (the CPU-FFTW fallback detector uses AVX-512 when available).
  • GPU should be on the same NUMA node as the upper-PHY worker cores to minimise PCIe latency.
  • At least 512 MB of free GPU device memory (worst-case allocation is approximately 300 MB for 64 ports × 12 symbols × 4096-bin DFT).

1.2 Software

ComponentMinimum versionNotes
CUDA Toolkit12.0nvcc, cuda_runtime.h, cufft.h required.
NVIDIA MathDx26.03Provides cufftdx.hpp; installed at /opt/nvidia-mathdx-26.03.0-cuda13/.
NVIDIA driver525+Must match or exceed the CUDA Toolkit version.
Linux kernel5.15+Standard Ubuntu 22.04 LTS kernel is sufficient.
CMake3.18+Required for CUDA_SEPARABLE_COMPILATION.
DU build flagsENABLE_DFT_GPU=ONSee Section 5.

1.3 Driver and CUDA installation

# Confirm NVIDIA driver is loaded and GPU is visible:
nvidia-smi

# Expected output includes device name, driver version ≥ 525, CUDA version ≥ 12.0:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 535.x   Driver Version: 535.x   CUDA Version: 12.2             |
# +-----------------------------------+-------------------+---------------------+
# | GPU 0  RTX A4000            Off   |                   |                     |

# Verify CUDA Toolkit is installed:
nvcc --version
# Expected: Cuda compilation tools, release 12.x

# Verify cuFFT library is present:
ls /usr/local/cuda/lib64/libcufft*

# Install CUDA Toolkit if missing (Ubuntu 22.04):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-2

1.4 NVIDIA MathDx (cuFFTDx) installation

cuFFTDx is a header-only device library distributed as part of NVIDIA MathDx. The build system looks for it under /opt/nvidia-mathdx-26.03.0-cuda13/.

# Download MathDx 26.03 from NVIDIA Developer:
# https://developer.nvidia.com/mathdx-downloads
# (login required; select "MathDx 26.03 for CUDA 13" tarball)

# Extract to the expected path:
sudo tar -xf nvidia-mathdx-26.03.0-cuda13.tar.gz -C /opt/
sudo ln -s /opt/nvidia-mathdx-26.03.0-cuda13 /opt/nvidia-mathdx

# Verify the CMake config file is reachable:
ls /opt/nvidia-mathdx-26.03.0-cuda13/nvidia/mathdx/26.03/lib/cmake/
# Expected: mathdxConfig.cmake  cufftdxConfig.cmake  ...

# Quick sanity-check: confirm cufftdx.hpp is present:
find /opt/nvidia-mathdx-26.03.0-cuda13 -name "cufftdx.hpp" | head -3
# Enable GPU persistence mode to avoid cold initialisation on first CUDA call:
sudo nvidia-smi -pm 1

# Lock SM and memory clocks for deterministic latency (RTX A4000 example):
sudo nvidia-smi -lgc 1440    # lock GPU clocks (MHz); check your card's max
sudo nvidia-smi -lmc 1215    # lock memory clocks

# Confirm:
nvidia-smi -q -d CLOCK | grep -A4 "Clocks"

# Disable GPU power throttling:
sudo nvidia-smi --auto-boost-default=0

Setting persistence mode and locking clocks eliminates GPU frequency-scaling jitter, which can otherwise cause occasional tail-latency outliers during low-traffic periods when the GPU would otherwise downclock.


2. Architecture overview

PRACH detection is the 5G Random Access Channel procedure: the gNB must detect which of up to 64 preamble sequences was transmitted by a UE, and compute the timing advance from the peak position. On the CPU path this is performed by prach_detector_generic_impl using FFTW IDFTs. On the GPU path, prach_detector_gpu_impl replaces the inner detection loop entirely.

The detection step runs after the lower-PHY OFDM demodulator (forward FFT) and before the MAC PRACH result callback. Everything upstream (OFH RX, lower-PHY sample windowing, OFDM forward FFT) and downstream (MAC/DU-High random-access response) stays on CPU.

Upper-PHY PRACH pipeline. Stages left to right: OFH RX (DPDK/UHD, CPU) → Lower PHY OFDM PRACH Demodulator (forward FFT, CPU) → PRACH Buffer (cbf16, host RAM) → PRACH Detect (GPU-accelerated, highlighted dark blue) → prach_detection_result → MAC FAPI Random Access Response (CPU). The GPU block expands to show the internal pipeline: H2D cbf16 → cbf16→cf32 → prod_conj+bin_reorder → cuFFTDx IDFT → |·|² normalize → cross-port combine → per-shift window accumulate → argmax+divide → D2H results.
Uplink PRACH chain. The GPU-accelerated detect stage (dark blue) replaces the CPU inner loop; all other stages are unchanged.

3. Implementation summary

The integration is a hardware-abstraction layer that plugs into the existing upper-PHY factory pattern via the prach_detector interface. Four things are worth knowing:

  1. Fused device kernel. For standard PRACH IDFT sizes (256 for short format, 1024 for long format), a single cuFFTDx block-FFT kernel (k_fused_prach_idft) performs conjugate-product correlation, DFT-bin reorder, IDFT, and power normalisation in one GPU thread block per batch element. No intermediate complex buffer crosses shared memory; only float power values are written out. For DFT sizes outside the supported set {128, 256, 512, 1024, 2048, 4096}, the path falls back to three separate kernels plus cufftExecC2C.

  2. CUDA graph caching. The pipeline is captured into a cudaGraphExec_t on the first detect() call for each unique configuration shape. The cache key has 9 fields: format (long vs short), number of ZC sequences, ports × symbols, DFT size, sequence length L_ra, cyclic-shift count, window width, window margin, and combine iteration count. Subsequent calls with the same key cost one cudaGraphLaunch + cudaStreamSynchronize; the graph dispatcher and all argument pointers are baked into the graph at capture time.

  3. Zero-allocation hot path. All device buffers (d_preamble, d_combined, d_idft, d_mod_sq, d_num, d_den, d_argmax_*) and pinned host staging buffers (h_preamble, h_root, h_argmax_*) are allocated once at construction for worst-case dimensions (64 ports × 12 symbols × 4096 DFT bins). detect() calls no allocator on the hot path.

  4. Root sequence caching. Zadoff-Chu root sequences are generated on CPU by prach_generator_impl and cached in h_root (pinned host). The cache is invalidated only when format, root sequence index, restricted set, ZCZ, or derived dimensions change. On a stable single-cell gNB config this never re-evaluates after the first detection window.

A 9-field graph cache key captures all dimensions that affect kernel grid shapes. This means:

  • A cell running format 0 with ZCZ 0 builds one graph (long=true, seq=64, shifts=1, win_width=CP_bins, …).
  • A cell running format 0 with ZCZ 14 builds a different graph (seq=1, shifts=64, different win_width). Each key pays a one-time build cost (3–6 ms) and is cached for the lifetime of the detector instance.

4. GPU pipeline - kernel chain

The pipeline captured in the CUDA graph runs the following stages in stream order:

H2D  h_root_all        → d_root_all        (ZC roots, cf32,  ~430 KB)
H2D  h_window_starts   → d_window_starts   (uint32,  ~256 B)

[per combine_iter]
  H2D  h_preamble[it]            → d_preamble       (cbf16, ~13 KB / slice)
  k_cbf16_to_cf_init_or_add      (cbf16 → cf32, init on it=0 or += on it>0)
                                 → d_combined

[cuFFTDx primary path, sizes ∈ {128, 256, 512, 1024, 2048, 4096}, SM86]
  k_fused_prach_idft<N>            (prod_conj + bin_reorder + cuFFTDx IDFT
                                    + |·|² + 1/(dft_size·L_ra) scale)  → d_mod_sq

[cuFFT fallback, DFT sizes outside the supported set above]
  k_prod_conj_bin_reorder          → d_idft
  cufftExecC2C (CUFFT_INVERSE)     → d_idft
  k_modulus_square_normalize       → d_mod_sq

k_combine_mod_sq                   (non-coherent cross-port + symbol sum)
                                   → d_mod_sq_combined

k_per_shift_accumulate_combined    (per cyclic-shift, per bin in window:
                                    num[i] = signal_sample × window_scale;
                                    den[i] = reference_window_energy − num[i])
                                   → d_num, d_den

k_finalize_per_shift_xseq          (num[i]/|den[i]| argmax per shift)
                                   → d_argmax_idx, d_argmax_val, d_num_at_argmax

D2H  d_argmax_idx/val/num →  h_argmax_idx/val/num  (~1 KB)

After cudaStreamSynchronize, the CPU iterates over at most 64 (seq × shift) elements, applies the detection threshold, and emits prach_detection_result::preamble_indication structs.

PCIe transfer budget

TransferSize (format 0, 4 ports)Duration at PCIe Gen4 ×16
H2D PRACH buffer (cbf16)~13 KB< 1 µs
H2D ZC roots (cf32, cold)~430 KB~20 µs (cached after first window)
H2D window starts256 Bnegligible
D2H argmax results~1 KB< 1 µs
Total hot-path~14 KB< 1 µs

ZC root H2D cost is paid only on a root cache miss (config change or first window); in steady-state operation it does not appear on every detect.


5. Configuration

No YAML changes are required. The GPU detector is selected at runtime via an environment variable.

5.1 Enable GPU PRACH detection

export OCUDU_PRACH_DFT_BACKEND=gpu_full

Set this before starting the gNB process. On gNB startup, if a CUDA device is found, you will see on stdout:

PRACH DFT backend: gpu_full (CUDA-shell detector)

If no CUDA device is present or the build was compiled without ENABLE_DFT_GPU=ON, the message will be:

OCUDU_PRACH_DFT_BACKEND=gpu_full but GPU detector unavailable, falling back

and the CPU-FFTW detector will be used automatically - no crash, no manual intervention required.

5.2 Verification at startup

Shortly after the first PRACH window arrives, stderr will contain:

[prach_detector_gpu] constructed: idft_long=1024 idft_short=256 max_batch=768 device_buffers_mib=N
(above CPU instance is a gpu_full inner-fallback, idle in normal operation)
[prach_detector_gpu] graph cache miss #1: long=false seq=64 ports*sym=1 dft=256 shifts=1 win_width=116 build=4953us (cached_graphs=1)

The build=...us line is the one-time CUDA graph construction cost. Subsequent detection windows for the same configuration will not print this line.

5.3 Unset / CPU path (default)

Do not set OCUDU_PRACH_DFT_BACKEND (or unset it) to run the CPU-FFTW detector:

unset OCUDU_PRACH_DFT_BACKEND

No rebuild is required to switch between GPU and CPU paths.


6. Build guide

6.1 gNB with 7.2 fronthaul + DPDK + GPU PRACH acceleration

cd ~/kriish/ocudu_gpu_prach
mkdir -p build_gpu_split7_2
cd build_gpu_split7_2

cmake -DDU_SPLIT_TYPE=SPLIT_7_2 \
      -DENABLE_DFT_GPU=ON \
      -DENABLE_DPDK=True \
      -DASSERT_LEVEL=MINIMAL \
      ../

make -j$(nproc)

Binary: build_gpu_split7_2/apps/gnb/gnb.

Note: ENABLE_DFT_GPU=ON requires NVIDIA MathDx 26.03 with cuFFTDx to be installed at /opt/nvidia-mathdx-26.03.0-cuda13/ before CMake configure time. The configure step will error with ENABLE_DFT_GPU=ON requires NVIDIA MathDx with cuFFTDx. if the package is not found. See Section 1.4.

6.2 gNB with 7.2 fronthaul + DPDK + GPU PRACH + HW LDPC (ACC100)

cmake -DDU_SPLIT_TYPE=SPLIT_7_2 \
      -DENABLE_DFT_GPU=ON \
      -DENABLE_DPDK=True \
      -DENABLE_PDSCH_HWACC=True \
      -DENABLE_PUSCH_HWACC=True \
      -DASSERT_LEVEL=MINIMAL \
      ../

make -j$(nproc)

6.3 What the build produces

ENABLE_DFT_GPU=ON builds the ocudu_hal_dft_gpu static library from two CUDA translation units:

SourceContents
lib/hal/cuda/prach_detector_gpu_kernel.cuk_cbf16_to_cf, k_prod_conj_bin_reorder, k_modulus_square_normalize, k_combine_mod_sq, k_per_shift_accumulate_combined, k_finalize_per_shift_xseq, k_zero_floats
lib/hal/cuda/prach_detector_gpudx_kernel.cuk_fused_prach_idft<N> (cuFFTDx, sizes 128/256/512/1024/2048/4096, SM86)

The library links against CUDA::cudart, CUDA::cufft, and mathdx::cufftdx. The main gNB binary links this library only when ENABLE_DFT_GPU=ON.

6.4 PRACH detector benchmark

cd build_gpu_split7_2
make -j$(nproc) prach_detector_benchmark

# Run CPU baseline (FFTW):
./tests/benchmarks/phy/upper/channel_processors/prach_detector_benchmark

# Run GPU path (set env var, then benchmark):
OCUDU_PRACH_DFT_BACKEND=gpu_full \
  ./tests/benchmarks/phy/upper/channel_processors/prach_detector_benchmark

The benchmark sweeps nof_rx_ports ∈ {1, 2, 4}, format {zero, B4}, zcz ∈ {0, 1, 14}, and nof_preambles ∈ {4, 64}.


7. Results

All measurements are from live gNB runs with real over-the-air PRACH traffic. Stats are reported by the detector itself every 1000 detection windows; the max in the first window includes the one-time CUDA graph build cost.

Note on hardware: the numbers below are from the RTX A4000 (SM86), which runs the cuFFTDx fused kernel. On the RTX 8000 (SM75) the code falls back to the cuFFT three-kernel path; latency on that card is higher than what’s shown here (an exact number is not yet published - expect roughly 1.3–1.6× the A4000 figures, dominated by the extra kernel-launch and intermediate-buffer overhead).

7.1 Test environments

Setup A - Split 8 / X410 RU (1T1R)

ItemValue
Hostx86_64, AVX-512
GPUNVIDIA RTX A4000 (SM86, 16 GB GDDR6)
OSUbuntu 22.04 LTS, kernel 5.15
CUDA Toolkit12.2 / NVIDIA MathDx 26.03
RUEttus X410, Split 8, 1T1R
PRACH configShort format (B4), seq=64, ports×sym=1, DFT=256, shifts=1

Setup B - Split 7.2 / Liteon RU (4T4R)

ItemValue
HostSame host as Setup A
GPUNVIDIA RTX A4000 (SM86, 16 GB GDDR6)
RULiteon 7.2-split, 4T4R, band n78
PRACH configShort format (B4), seq=64, ports×sym=1, DFT=256, shifts=1

7.2 Detection latency - steady-state (windows 2000–10000)

Mean / min / max over rolling 1000-window intervals, excluding the first window (which includes the one-time graph build cost).

SetupPathmeanminmax
Split 8, X410 1T1RGPU cuFFTDx78–79 µs70–72 µs103–127 µs
Split 8, X410 1T1RCPU FFTW111–113 µs97 µs154–188 µs
Split 7.2, Liteon 4T4RGPU cuFFTDx74–75 µs69 µs82–88 µs
Split 7.2, Liteon 4T4RCPU FFTW115 µs107 µs165–194 µs
CPU vs GPU latency range, Split 8 / Ettus X410 (1T1R). Min/mean/max bars per path.
CPU vs GPU latency range - Ettus X410 (Split 8, 1T1R).
CPU vs GPU latency range, Split 7.2 / Liteon (4T4R). Min/mean/max bars per path.
CPU vs GPU latency range - Liteon (Split 7.2, 4T4R).
Grouped bar chart: PRACH detect mean latency (µs) for CPU FFTW vs GPU cuFFTDx across two test setups (Split 8 X410 1T1R and Split 7.2 Liteon 4T4R). GPU bars are ~30–35 % shorter on mean and ~30–55 % shorter on max tail. Split 8 X410: CPU mean 112 µs vs GPU mean 78 µs. Split 7.2 Liteon: CPU mean 115 µs vs GPU mean 74 µs.
PRACH detect latency - lower is better. GPU path reduces mean latency by ~30 % and compresses tail latency by up to ~55 % compared to CPU FFTW.

7.3 Improvement summary

SetupMean ΔMax (tail) Δ
Split 8, X410 1T1R−30 % (112 → 78 µs)−33 % (161 → 107 µs)
Split 7.2, Liteon 4T4R−36 % (115 → 74 µs)−55 % (187 → 87 µs)

The Liteon 4T4R setup shows a larger tail-latency improvement, consistent with the cuFFTDx fused kernel eliminating intermediate memory traffic that would otherwise create occasional outliers under memory-bandwidth pressure.

7.4 GPU-path scaling and timing-budget headroom

Beyond the CPU vs GPU comparison, the GPU detector was characterised on its own across configurations and under stress to understand how it behaves with increasing antenna count and how much headroom it leaves against the PRACH timing budget.

GPU vs CPU speedup factor across detection-window scenarios.
GPU vs CPU speedup across detection-window scenarios.
Detection-time scaling under stress test, back-to-back PRACH windows.
Detection-time scaling under stress test (back-to-back windows).
Latency vs PRACH timing budget across configurations. GPU detector stays well inside the budget across every tested configuration.
Latency vs PRACH timing budget across configurations. The GPU detector stays well inside the PRACH timing budget across every tested configuration, with substantial headroom for higher antenna counts and denser preamble sets.

7.5 Cold-start graph build cost

The first detection window pays a one-time CUDA graph construction penalty. This occurs at gNB startup, before any UE can complete the random-access procedure (UEs do not transmit PRACH until SIB1 is received), so it does not affect UE experience.

SetupGraph build time
Split 8, X410 1T1R (short format, DFT=256)4 953 µs
Split 7.2, Liteon 4T4R (short format, DFT=256)3 547 µs

After the first window for each configuration key the cost does not recur. graph_builds=1 and cached_graphs=1 remain constant throughout the run, confirming a single graph serves the entire session.


8. Metrics and observability

The GPU detector emits structured log lines to stderr independently of the YAML metrics system. No YAML change is needed to observe these.

8.1 GPU detector stats (every 1000 detects)

[prach_detector_gpu] graph cache miss #1: long=false seq=64 ports*sym=1 dft=256 shifts=1 win_width=116 build=4953us (cached_graphs=1)
[prach_detector_gpu] stats: detects=1000 graph_builds=1 cached_graphs=1 mean=86us min=72us max=6698us (last_window=1000)
[prach_detector_gpu] stats: detects=2000 graph_builds=1 cached_graphs=1 mean=79us min=72us max=109us (last_window=1000)
  • graph_builds - total CUDA graph cache misses since construction. In normal operation this reaches 1 (or 2 for cells with both long and short format PRACH) and stays there.
  • cached_graphs - number of graphs currently in the cache.
  • max in the first 1000-detect window includes the one-time graph build; subsequent windows will show max ≈ 100–150 µs.

8.2 CPU fallback detector stats (every 1000 detects)

When OCUDU_PRACH_DFT_BACKEND=gpu_full is active, the inner CPU fallback detector (prach_detector_generic_impl) is present but never invoked during normal operation. Its stats line will show detects=0:

[prach_detector_cpu] stats: detects=0 mean=0us min=0us max=0us (last_window=1000)

detects=0 throughout the gNB run confirms the GPU path is handling all windows. Any non-zero value indicates a config path that routed to the CPU fallback - worth investigating if observed.

8.3 Upper-PHY YAML metrics (optional, for A/B comparison)

The logging_prach_detector_decorator wraps the detector and reports per-detect timing and results to the configured logger. Enable in the YAML:

metrics:
  enable_log: true
  layers:
    enable_du_low: true
  periodicity:
    du_report_period: 1000

This produces the standard per-PRACH-window timing line regardless of whether the GPU or CPU detector is active, enabling direct side-by-side comparison from the same log format.


9. Deployment checklist

  • Install NVIDIA driver ≥ 525 and CUDA Toolkit ≥ 12.0 on the host (Section 1.3).
  • Install NVIDIA MathDx 26.03 at /opt/nvidia-mathdx-26.03.0-cuda13/ (Section 1.4).
  • Enable GPU persistence mode and (optionally) lock clocks (Section 1.5).
  • Build the gNB with ENABLE_DFT_GPU=ON (Section 6.1).
  • Set OCUDU_PRACH_DFT_BACKEND=gpu_full in the gNB process environment (Section 5.1).
  • Start the gNB and confirm PRACH DFT backend: gpu_full (CUDA-shell detector) on stdout.
  • Wait for the first PRACH window and confirm graph cache miss #1 ... build=Xus on stderr.
  • After 1000 detection windows, confirm [prach_detector_gpu] stats: ... graph_builds=1 and [prach_detector_cpu] stats: detects=0.
  • Optionally enable metrics.layers.enable_du_low for per-window latency logging (Section 8.3).

10. References