Hardware Acceleration

Offload CPU-heavy upper-PHY stages to vRAN accelerator cards for lower latency, higher throughput, and reduced host CPU load.

A modern 5G NR PHY layer is dominated by a handful of compute hot-spots: LDPC decode on PUSCH (up to ~50 iterations × thousands of variable nodes per code block), LDPC encode and rate matching on PDSCH, FFT/IFFT on the OFDM modulator/demodulator, and channel estimation/equalisation. Of these, LDPC is by far the most CPU-expensive and the easiest to offload to a fixed-function accelerator with a clean, batch-oriented API. OCUDU therefore concentrates its hardware-acceleration story on LDPC and leaves the rest to optimised CPU kernels.

The hardware-abstraction layer (HAL) is a C++ abstraction that lets the PHY stack drive an accelerator without knowing which device, transport, or vendor SDK is underneath. Today the only concrete backend is DPDK BBDev targeting Intel ACC100-class devices; the abstraction is intentionally generic enough that GPU and FPGA backends can be added without disturbing the PHY code.

Design principles

  • Pluggable backends behind one interface. Every accelerator implements the templated hw_accelerator<T,U> interface; the PHY block holds a unique_ptr to the interface and never sees the underlying device. Adding a GPU or FPGA backend is a new factory and a new concrete implementation; encoder/decoder PHY code stays unchanged.
  • Per-block, per-direction selection. PDSCH encode and PUSCH decode are independent decisions. A deployment can offload PDSCH encode to ACC100 and keep PUSCH decode on the CPU, or vice versa, by toggling separate pdsch_enc / pusch_dec blocks under hal.bbdev_hwacc.
  • Static binding at construction time. The factory returns either the HW or the SW implementation when the PHY is built. There is no per-transport-block (TB) runtime switch, because that would defeat pre-allocated queues and pinned memory.
  • Graceful fallback when HW is absent. Open-source builds have OCUDU_HAS_ENTERPRISE undefined, so the BBDev factories return nullptr, the application detects that, and silently falls back to the pure-software path. The same source tree compiles and runs on a laptop with no accelerator.
  • Build-time feature flag, runtime device selection. CMake ENABLE_DPDK (default OFF) decides whether the DPDK/BBDev code is compiled in at all. With ENABLE_DPDK=ON, the actual device (ACC100, ACC200, VRB1) and per-channel offload settings are picked from YAML at startup.
  • DPDK is for baseband, not for fronthaul Ethernet. The OFH RX/TX path uses standard Linux networking; DPDK in OCUDU is exclusively the BBDev transport. Keeping these isolated avoids resource contention between baseband acceleration and fronthaul packet I/O.
  • Async queue semantics, polling completion. The HAL interface is enqueue_operation() / dequeue_operation() with internal queue IDs, so multiple code blocks pipeline through hardware concurrently. The wrapping encode() / decode() call appears synchronous to the PHY (it polls until all CBs are dequeued); from the device’s perspective many CBs are in flight at once.
  • External HARQ buffer support is optional. PUSCH decoders can keep soft-bit HARQ buffers either in on-device memory (lower latency, fewer concurrent CBs) or in host memory (force_local_harq=true, more flexible). This is a per-deployment trade-off exposed through config.
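
The principles above (one interface, static binding, nullptr fallback, enqueue/poll pipelining) can be sketched in a few dozen lines of C++. This is a minimal illustration under stated assumptions, not OCUDU's actual code. Only the names hw_accelerator<T,U>, enqueue_operation() and dequeue_operation() come from the text; the method signatures, loopback_accelerator, create_hw_encoder, ldpc_encoder and the backend_available flag are hypothetical. The flag stands in for the compile-time OCUDU_HAS_ENTERPRISE check, and the "encoding" is an identity copy so the control flow stays visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

using code_block = std::vector<std::uint8_t>;

// Hypothetical shape of the templated HAL interface (signatures are assumed).
template <typename T, typename U>
class hw_accelerator {
public:
  virtual ~hw_accelerator() = default;
  // Returns false when the device queue is full; the caller retries later.
  virtual bool enqueue_operation(const T& in, unsigned cb_index) = 0;
  // Returns false when no completed operation is available yet.
  virtual bool dequeue_operation(U& out, unsigned& cb_index) = 0;
};

using accel = hw_accelerator<code_block, code_block>;

// Toy backend standing in for a real device: bounded queue, identity output.
class loopback_accelerator : public accel {
public:
  explicit loopback_accelerator(std::size_t depth) : depth_(depth) { assert(depth > 0); }
  bool enqueue_operation(const code_block& in, unsigned cb_index) override {
    if (queue_.size() >= depth_) return false;
    queue_.push_back({in, cb_index});
    return true;
  }
  bool dequeue_operation(code_block& out, unsigned& cb_index) override {
    if (queue_.empty()) return false;
    out = std::move(queue_.front().data);
    cb_index = queue_.front().index;
    queue_.pop_front();
    return true;
  }
private:
  struct entry { code_block data; unsigned index; };
  std::size_t depth_;
  std::deque<entry> queue_;
};

// Factory: returns nullptr when no accelerator backend is available.
// (In the text this is decided at compile time; a bool is used here for clarity.)
std::unique_ptr<accel> create_hw_encoder(bool backend_available) {
  if (!backend_available) return nullptr;
  return std::make_unique<loopback_accelerator>(4);
}

// PHY-side block: backend is bound once at construction, never per call.
class ldpc_encoder {
public:
  explicit ldpc_encoder(std::unique_ptr<accel> hw) : hw_(std::move(hw)) {}
  bool uses_hw() const { return hw_ != nullptr; }

  // Appears synchronous to the caller; internally keeps several CBs in flight.
  std::vector<code_block> encode(const std::vector<code_block>& cbs) {
    std::vector<code_block> out(cbs.size());
    if (!hw_) {  // software fallback path (identity placeholder)
      for (std::size_t i = 0; i < cbs.size(); ++i) out[i] = cbs[i];
      return out;
    }
    std::size_t enqueued = 0, dequeued = 0;
    while (dequeued < cbs.size()) {
      // Fill the device queue as far as it will accept.
      while (enqueued < cbs.size() &&
             hw_->enqueue_operation(cbs[enqueued], static_cast<unsigned>(enqueued))) {
        ++enqueued;
      }
      // Poll for completions and place each result by its CB index.
      code_block result;
      unsigned idx = 0;
      while (hw_->dequeue_operation(result, idx)) {
        out[idx] = std::move(result);
        ++dequeued;
      }
    }
    return out;
  }
private:
  std::unique_ptr<accel> hw_;
};
```

Constructing ldpc_encoder(create_hw_encoder(false)) yields an encoder that runs the software path, while passing true binds the toy backend; in both cases the caller only ever sees encode(). Carrying the CB index through the queue lets results be written in place even if a real device completes operations out of order.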

Supported accelerators

The HAL is shaped to host multiple backends; today only one is wired through. The ACC100 row links to its dedicated guide; the others are placed here so the roadmap is visible.

| Accelerator | Offloaded stages | Transport | Status | Guide |
|---|---|---|---|---|
| Intel ACC100 / ACC200 / VRB1 | PDSCH LDPC encode, PUSCH LDPC decode | DPDK BBDev over PCIe | Implemented | Intel ACC100 - LDPC offload |
| NVIDIA A4000 (Ampere GPU) | Target: PUSCH LDPC decode, channel estimation | CUDA / pinned host memory over PCIe | Planned | - |
| Xilinx (AMD) ZCU102 RFSoC | Target: combined LDPC + rate matching, optionally OFDM | XRT / XDMA over PCIe | Planned | - |

When to enable acceleration

Offloading is not free: it adds setup latency, mempool memory, an extra DMA round-trip, and operational complexity (DPDK, hugepages, NUMA pinning, device driver upgrades). The decision is made per deployment and per channel. Offload PDSCH encode and/or PUSCH decode when:

  • Throughput per cell is high. Massive MIMO, 100 MHz bandwidth, 256-QAM, or multiple aggregated carriers: anything that pushes per-cell PHY load past ~40 % of a CPU core dedicated to LDPC. ACC100 takes that load off the CPU entirely.
  • Cell count per server is high. Software LDPC cost grows linearly with the number of cells, while one accelerator card is a fixed cost, so the balance tips quickly once you host more than two or three cells per box.
  • You need deterministic worst-case PHY latency. Software LDPC decode time scales with iteration count and SNR, and the variance is wide under heavy retransmission. ACC100’s per-CB latency is bounded and predictable, which matters for tight HARQ deadlines and URLLC slices.
  • You have CPU headroom problems but spare PCIe slots. ACC100 frees ~2–4 cores per heavily-loaded gNB process; if your bottleneck is core count rather than PCIe, this is a direct win.
  • Power per bit matters. Fixed-function silicon is several times more efficient per encoded/decoded bit than SIMD on a general-purpose core.

First time enabling HWACC? Start with the ACC100 guide; it covers prerequisites, kernel/VFIO setup, YAML, build flags, and a deployment checklist.

Already running HWACC and want to understand the numbers? Jump to the ACC100 results section for the A/B comparison.
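
The cell-count argument above (software LDPC cost grows linearly, the card is a fixed cost) can be made concrete with a small calculation. The sketch below is illustrative only: offload_model, its fields, and break_even_cells are hypothetical names, and every input is a deployment-specific assumption that you would replace with your own measurements rather than an OCUDU figure.

```cpp
#include <cassert>
#include <cmath>

// All inputs are assumptions supplied by the operator, expressed in "CPU cores".
struct offload_model {
  double sw_ldpc_cores_per_cell;  // cores software LDPC consumes per cell (must be > 0)
  double card_fixed_cost_cores;   // fixed cost of one accelerator card, in equivalent cores
};

// Smallest cell count n for which n * sw_ldpc_cores_per_cell exceeds the
// card's fixed cost, i.e. the point where offloading starts to pay off.
inline unsigned break_even_cells(const offload_model& m) {
  assert(m.sw_ldpc_cores_per_cell > 0.0);
  const double ratio = m.card_fixed_cost_cores / m.sw_ldpc_cores_per_cell;
  return static_cast<unsigned>(std::floor(ratio)) + 1;
}
```

For example, if software LDPC is assumed to cost half a core per cell and the card is valued at 1.5 cores, break_even_cells returns 4: three cells cost exactly as much in software as the card, and the fourth tips the balance.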

Intel ACC100 - LDPC offload

Offload LDPC encoding (PDSCH) and decoding (PUSCH) to Intel ACC100 vRAN accelerator cards via DPDK BBDev, with full upper-PHY metrics instrumentation for side-by-side comparison with the software AVX-512 path.

PRACH Detection offload to GPU

Offload the PRACH preamble detection pipeline to an NVIDIA GPU via a fused CUDA kernel chain and CUDA-graph execution, achieving roughly 10× lower detection latency compared to the CPU-FFTW path and freeing uplink-PHY CPU headroom under bursty random-access load.

Inline GPU Acceleration - PRACH and SRS

Release A inline GPU path: the NIC writes uplink fronthaul packets directly into GPU VRAM via GPUDirect RDMA, so PRACH preamble detection and SRS channel estimation run end-to-end on the GPU with no CPU-side sample copy. Mean PRACH detection latency drops ~3× versus the CPU AVX-512 path; SRS per-occasion cost falls toward 0.6 µs at 256-UE batches.