Intel ACC100 - LDPC offload
Offload LDPC encoding (PDSCH) and decoding (PUSCH) to Intel ACC100 vRAN accelerator cards via DPDK BBDEV, with full upper-PHY metrics instrumentation for side-by-side comparison with the software-AVX-512 path.
A modern 5G NR PHY layer is dominated by a handful of compute hot-spots: LDPC decode on PUSCH (up to ~50 iterations × thousands of variable nodes per code block), LDPC encode and rate matching on PDSCH, FFT/IFFT on the OFDM modulator/demodulator, and channel estimation/equalisation. Of these, LDPC is by far the most CPU-expensive and the easiest to offload to a fixed-function accelerator with a clean, batch-oriented API. OCUDU therefore concentrates its hardware-acceleration story on LDPC and leaves the rest to optimised CPU kernels.
The hardware-abstraction layer (HAL) is a C++ abstraction that lets the PHY stack drive an accelerator without knowing which device, transport, or vendor SDK is underneath. Today the only concrete backend is DPDK BBDev targeting Intel ACC100-class devices; the abstraction is intentionally generic enough that GPU and FPGA backends can be added without disturbing the PHY code.
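To make the abstraction concrete, here is a minimal sketch of how a PHY block might hold an accelerator behind a generic interface. It is not the project's actual code. The template name hw_accelerator<T,U> and the enqueue_operation() / dequeue_operation() method names appear in this document, but their signatures here are assumptions. The loopback_encoder backend and the pdsch_encoder_block class are hypothetical names invented for illustration, and the "encoding" is a single parity bit rather than LDPC.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Assumed shape of the generic interface: T is the submitted operation,
// U is the completed result. The real signatures may differ.
template <typename T, typename U>
class hw_accelerator {
public:
  virtual ~hw_accelerator() = default;
  // Returns false when the device cannot accept more work right now.
  virtual bool enqueue_operation(const T& op) = 0;
  // Returns a finished result, or std::nullopt if none is ready yet.
  virtual std::optional<U> dequeue_operation() = 0;
};

using bits = std::vector<std::uint8_t>;

// Hypothetical stand-in backend: appends one even-parity bit.
// A real backend would submit the code block to the device instead.
class loopback_encoder final : public hw_accelerator<bits, bits> {
public:
  bool enqueue_operation(const bits& op) override {
    bits out = op;
    std::uint8_t parity = 0;
    for (std::uint8_t b : op) parity ^= (b & 1u);
    out.push_back(parity);
    done_.push(std::move(out));
    return true;
  }
  std::optional<bits> dequeue_operation() override {
    if (done_.empty()) return std::nullopt;
    bits out = std::move(done_.front());
    done_.pop();
    return out;
  }
private:
  std::queue<bits> done_;
};

// PHY-side block: owns the interface and never names the concrete device.
class pdsch_encoder_block {
public:
  explicit pdsch_encoder_block(std::unique_ptr<hw_accelerator<bits, bits>> acc)
      : acc_(std::move(acc)) {
    assert(acc_ != nullptr);
  }
  bits encode(const bits& cb) {
    while (!acc_->enqueue_operation(cb)) {}          // retry until accepted
    std::optional<bits> out;
    while (!(out = acc_->dequeue_operation())) {}    // poll until done
    return *out;
  }
private:
  std::unique_ptr<hw_accelerator<bits, bits>> acc_;
};
```

Under these assumptions, swapping in a different backend only changes what is passed to the pdsch_encoder_block constructor; the encode() call site stays the same.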
- hw_accelerator<T,U> interface; the PHY block holds a unique_ptr to the interface and never sees the underlying device. Adding a GPU or FPGA backend is a new factory and a new concrete implementation; encoder/decoder PHY code stays unchanged.
- pdsch_enc / pusch_dec blocks under hal.bbdev_hwacc.
- OCUDU_HAS_ENTERPRISE undefined, so the BBDev factories return nullptr, the application detects that, and silently falls back to the pure-software path. The same source tree compiles and runs on a laptop with no accelerator.
- ENABLE_DPDK (default OFF) decides whether the DPDK/BBDev code is compiled in at all. With ENABLE_DPDK=ON, the actual device (ACC100, ACC200, VRB1) and per-channel offload settings are picked from YAML at startup.
- enqueue_operation() / dequeue_operation() with internal queue IDs, so multiple code blocks pipeline through hardware concurrently. The wrapping encode() / decode() call appears synchronous to the PHY (it polls until all CBs are dequeued); from the device’s perspective many CBs are in flight at once.
- force_local_harq=true, more flexible). This is a per-deployment trade-off exposed through config.

The HAL is shaped to host multiple backends; today only one is wired through. The ACC100 row links to its dedicated guide; the others are placed here so the roadmap is visible.
| Accelerator | Offloaded stages | Transport | Status | Guide |
|---|---|---|---|---|
| Intel ACC100 / ACC200 / VRB1 | PDSCH LDPC encode, PUSCH LDPC decode | DPDK BBDev over PCIe | Implemented | Intel ACC100 - LDPC offload |
| NVIDIA A4000 (Ampere GPU) | Target: PUSCH LDPC decode, channel estimation | CUDA / pinned host memory over PCIe | Planned | |
| Xilinx (AMD) ZCU102 RFSoC | Target: combined LDPC + rate matching, optionally OFDM | XRT / XDMA over PCIe | Planned | |
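Two behaviours described earlier can be sketched together: a factory that returns nullptr when the backend is not compiled in, and a decode() wrapper that looks synchronous to the caller while keeping several code blocks in flight. This is a sketch under stated assumptions, not the project's implementation. Only OCUDU_HAS_ENTERPRISE, enqueue_operation(), dequeue_operation(), and decode() are names taken from this page; decoder_accel, fake_device, create_decoder_accel, software_decode, and the tag field are hypothetical. The tag stands in for the internal queue IDs, and the "decode" is a trivial integer operation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

struct cb_result { std::size_t tag; int payload; };

class decoder_accel {
public:
  virtual ~decoder_accel() = default;
  // Returns false when the device queue is full.
  virtual bool enqueue_operation(int cb, std::size_t tag) = 0;
  virtual std::optional<cb_result> dequeue_operation() = 0;
};

int software_decode(int cb) { return cb + 1; }  // placeholder for the CPU path

// Hypothetical device model: accepts up to `depth` code blocks at once.
class fake_device final : public decoder_accel {
public:
  explicit fake_device(std::size_t depth) : depth_(depth) {}
  bool enqueue_operation(int cb, std::size_t tag) override {
    if (q_.size() >= depth_) return false;
    q_.push_back({tag, software_decode(cb)});
    if (q_.size() > peak_) peak_ = q_.size();
    return true;
  }
  std::optional<cb_result> dequeue_operation() override {
    if (q_.empty()) return std::nullopt;
    cb_result r = q_.front();
    q_.pop_front();
    return r;
  }
  std::size_t peak_in_flight() const { return peak_; }
private:
  std::size_t depth_;
  std::size_t peak_ = 0;
  std::deque<cb_result> q_;
};

// Factory gated at compile time: nullptr means "no accelerator available".
std::unique_ptr<decoder_accel> create_decoder_accel() {
#ifdef OCUDU_HAS_ENTERPRISE
  return std::make_unique<fake_device>(4);
#else
  return nullptr;
#endif
}

// Synchronous from the caller's view; pipelined from the device's view.
std::vector<int> decode(decoder_accel* acc, const std::vector<int>& cbs) {
  std::vector<int> out(cbs.size());
  if (acc == nullptr) {  // fall back to the software path
    for (std::size_t i = 0; i < cbs.size(); ++i) out[i] = software_decode(cbs[i]);
    return out;
  }
  std::size_t next = 0, done = 0;
  while (done < cbs.size()) {
    // Submit as many code blocks as the device will take.
    while (next < cbs.size() && acc->enqueue_operation(cbs[next], next)) ++next;
    // Poll for completions and place each result by its tag.
    while (auto r = acc->dequeue_operation()) {
      out[r->tag] = r->payload;
      ++done;
    }
  }
  return out;
}
```

Calling decode(create_decoder_accel().get(), cbs) takes the software path whenever the factory returns nullptr, so the call site does not need to know whether an accelerator was built in.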
Offloading is not free: it adds setup latency, mempool memory, an extra DMA round-trip, and operational complexity (DPDK, hugepages, NUMA pinning, device driver upgrades). The decision is per-deployment and per-channel. Offload PDSCH encode and/or PUSCH decode when:
Offload the PRACH preamble detection pipeline to an NVIDIA GPU via a fused CUDA kernel chain and CUDA-graph execution, achieving roughly 10× lower detection latency compared to the CPU-FFTW path and freeing uplink-PHY CPU headroom under bursty random-access load.
Release A inline GPU path: NIC writes uplink fronthaul packets directly into GPU VRAM via GPUDirect RDMA so PRACH preamble detection and SRS channel estimation run end-to-end on the GPU with no CPU-side sample copy. Mean PRACH detect latency drops ~3× vs the CPU AVX-512 path; SRS per-occasion cost falls toward 0.6 µs at 256-UE batches.