OCUDU supports offloading CPU-heavy stages of the upper-PHY from the
host to dedicated vRAN accelerators. The goal is better latency and
throughput on the same hardware budget, with the freedom to fall back to
a pure-software path when no accelerator is available.
Design principles
- Software parity. The HW and SW paths implement the same interfaces; no functional difference is observable from the MAC or above.
- Configuration-driven. The path is chosen via the DU YAML at startup. No code changes, no runtime toggling.
- Unified metrics. Every enqueue/dequeue on the HW path emits the same metric fields the SW path emits, so A/B comparisons use one log format.
- Batched dispatch. Operations are accumulated per TB and submitted to the accelerator as a single burst, amortising the DPDK per-call cost.
- On-device state. HARQ soft-buffer data stays on the accelerator between transmissions; the host tracks only context lifecycle.
When to use it
- High-PRB-count cells (80–100 MHz), where software LDPC becomes the upper-PHY bottleneck.
- Multi-cell hosts, where freeing CPU on the upper-PHY increases cell density per server.
- Tail-latency-sensitive deployments: LDPC decode tail latency compresses substantially on hardware.
What to read next
- First time enabling HWACC? Start with the ACC100 guide: it covers prerequisites, kernel/VFIO setup, YAML, build flags, and a deployment checklist.
- Already running HWACC and want to understand the numbers? Jump to the benchmark results for the A/B comparison.
- Adding a new accelerator backend? See the contributing guide for how to file an RFC.
1 - Intel ACC100 LDPC offload
Offload LDPC encoding (PDSCH) and decoding (PUSCH) to Intel ACC100 vRAN accelerator cards via DPDK BBDEV, with full upper-PHY metrics instrumentation for side-by-side comparison with the software AVX-512 path.
Scope: hardware-accelerated LDPC encoding (PDSCH) and decoding (PUSCH)
on Intel ACC100 vRAN accelerator cards, with full upper-PHY metrics
instrumentation so operators can directly compare CPU-software and
HW-offload performance in production logs.
Introduced in commit aa6db7d066.
Highlights
- LDPC offload for both PDSCH and PUSCH: the most CPU-heavy channel-coding steps move off the host and onto the accelerator.
- Configuration-driven: enable or disable via the DU YAML; no code changes.
- Unified metrics: the same per-block PHY metric fields populate whether LDPC runs in software (AVX-512) or on the accelerator, enabling direct side-by-side comparison from one log format.
- Measured gains on real traffic: significantly lower PUSCH decode latency, tighter tail latency, higher processor throughput, and reduced upper-PHY uplink CPU (see §7).
1. Prerequisites
1.1 Hardware
- Intel ACC100 PCIe card, SR-IOV capable; at least one VF exposed to the host.
- x86-64 CPU with AVX2 (AVX-512 recommended for the software-fallback path).
- ≥ 2 GB of 2 MiB hugepages.
- PCIe slot on the same NUMA node as the upper-PHY worker cores.
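The hugepage requirement above maps directly onto the `hugepages=1024` boot parameter shown later. A minimal sketch of that arithmetic, with an illustrative helper name:

```python
# Sketch: nr_hugepages needed for a target reservation, assuming the
# 2 MiB page size used throughout this guide (helper name is illustrative).
MIB = 1024 * 1024

def nr_hugepages(target_bytes: int, page_size: int = 2 * MIB) -> int:
    """Number of hugepages needed to cover target_bytes (rounded up)."""
    return -(-target_bytes // page_size)  # ceiling division

# The guide's minimum of 2 GiB in 2 MiB pages yields hugepages=1024.
print(nr_hugepages(2 * 1024 * MIB))  # 1024
```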
1.2 Software
| Component | Minimum version | Notes |
|---|---|---|
| DPDK | 22.11 | Tested with 25.11; ACC100 PMD (baseband_acc) required. |
```bash
# Kernel boot parameters (then update-grub + reboot):
intel_iommu=on iommu=pt hugepagesz=2M hugepages=1024

# Hugepage reservation if not done by the distro:
echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Load VFIO modules:
sudo modprobe vfio-pci
echo 1 | sudo tee /sys/module/vfio_pci/parameters/enable_sriov  # if kernel-builtin

# Create one SR-IOV VF and bind it to vfio-pci:
echo 1 | sudo tee /sys/bus/pci/devices/<ACC100_PF_BDF>/sriov_numvfs
sudo dpdk-devbind.py --bind=vfio-pci <ACC100_VF_BDF>

# Start pf_bb_config holding the PF group open with a VF token.
# The token (UUID) is required in the DU's EAL args.
sudo /opt/pf-bb-config/pf_bb_config ACC100 \
  -c /opt/pf-bb-config/acc100/acc100_config_vf_5g.cfg \
  -v <UUID> &
```
After setup, dpdk-test-bbdev should enumerate the VF as intel_acc100_vf.
2. Architecture overview
2.1 Where LDPC sits in the upper-PHY
LDPC encode (PDSCH) and LDPC decode (PUSCH) are the two steps this
feature accelerates. On the software path they are AVX2/AVX-512 kernels; on
the HW path they become batched DPDK BBDEV operations dispatched to the
ACC100. Everything else in the chain stays on the CPU.
UL and DL pipelines through the upper-PHY. The LDPC stages (dark blue) run on the ACC100; all other stages stay on CPU.
2.2 HAL layering
The upper-PHY factory checks for a hardware accelerator factory at
construction. If present, the HW path is built; otherwise the software
AVX-512 path is used. The choice is made once at startup from the YAML;
there is no runtime toggling.
Five-layer HAL from the upper-PHY down to the ACC100 silicon. Each layer's interface to the one below is labelled.
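The startup-time selection can be sketched as follows. This is an illustrative Python model, not the actual OCUDU classes: both paths expose the same encoder interface, and the choice happens once, at factory construction.

```python
# Sketch of the factory's one-time path selection (hypothetical names).
from typing import Optional, Protocol

class LdpcEncoder(Protocol):
    def encode(self, tb: bytes) -> bytes: ...

class SoftwareAvx512Encoder:
    def encode(self, tb: bytes) -> bytes:
        return b"sw:" + tb  # placeholder for the AVX-512 kernel

class Acc100Encoder:
    def encode(self, tb: bytes) -> bytes:
        return b"hw:" + tb  # placeholder for the BBDEV enqueue/dequeue path

def make_encoder(hwacc_factory: Optional[object]) -> LdpcEncoder:
    # Decided once at startup from configuration; no runtime toggling.
    return Acc100Encoder() if hwacc_factory is not None else SoftwareAvx512Encoder()

print(make_encoder(None).encode(b"tb").decode())      # sw:tb
print(make_encoder(object()).encode(b"tb").decode())  # hw:tb
```

Because both classes satisfy the same interface, nothing above the factory can tell which path was built, which is what makes the A/B comparison in §7 meaningful.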
3. Implementation summary
The integration is implemented as a hardware-abstraction layer that plugs
into the existing upper-PHY factory pattern. Four things are worth knowing:
- Batching. Encode and decode ops are accumulated in a per-instance buffer and submitted to the accelerator as a single burst on the first dequeue call of each transport block. This amortises the DPDK per-call cost across all code blocks of the TB.
- Shared pools. The HAL factory owns one set of DPDK mbuf and op mempools for all encoder and decoder instances the upper-PHY creates. Pool size scales automatically with the total queue count across allowlisted accelerators.
- On-chip HARQ. Soft data for HARQ combining stays on the accelerator between transmissions, addressed by a CB-indexed offset. The host tracks only the lifecycle of each context entry, not the soft data itself.
- Unified metrics. A thin decorator wraps the HW accelerator interface, times each enqueue–dequeue pair, and emits the same metric events the software path does. The existing upper-PHY aggregator consumes both sources transparently.
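The metrics decorator can be sketched in a few lines. This is an illustrative Python model with hypothetical names, not the actual implementation: the wrapper times each operation and forwards the same event shape the software path emits, so the aggregator needs no HW-specific handling.

```python
# Sketch of the unified-metrics decorator (names are illustrative).
import time

class MetricDecorator:
    def __init__(self, inner, emit):
        self.inner = inner   # the real HW accelerator interface
        self.emit = emit     # the same metric sink the SW path uses

    def process(self, op):
        t0 = time.perf_counter()
        result = self.inner.process(op)
        # Same field names as the software path, so one log format serves both.
        self.emit({"elapsed_us": (time.perf_counter() - t0) * 1e6})
        return result

class FakeAccelerator:
    def process(self, op):
        return op.upper()  # stand-in for the enqueue/dequeue round trip

events = []
wrapped = MetricDecorator(FakeAccelerator(), events.append)
print(wrapped.process("cb0"))  # CB0
print(len(events))             # 1 -> one metric event per operation
```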
A handful of ACC100 silicon quirks are handled internally by the HAL (input
byte-alignment, a single-CB transport-block special case, a per-op E-limit
guard, a long-session HARQ-context wrap-around). These require no action
from the operator.
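The batching scheme from the first point above can be sketched as follows, again with hypothetical names: enqueue calls only buffer code-block ops, and the first dequeue of the transport block flushes the whole buffer to the device as one burst, so the per-call dispatch cost is paid once per TB rather than once per CB.

```python
# Minimal sketch of per-TB batched dispatch (illustrative, not OCUDU code).
class BatchedDispatcher:
    def __init__(self):
        self.pending = []  # ops accumulated for the current TB
        self.bursts = 0    # device submissions that actually happened

    def enqueue(self, op):
        self.pending.append(op)  # no device call here, just buffering

    def dequeue_all(self):
        # First dequeue of the TB: submit everything as a single burst.
        if self.pending:
            self.bursts += 1  # stands in for one BBDEV burst submission
            burst, self.pending = self.pending, []
            return [f"done:{op}" for op in burst]
        return []

d = BatchedDispatcher()
for cb in range(8):           # 8 code blocks in one transport block
    d.enqueue(f"cb{cb}")
results = d.dequeue_all()
print(d.bursts, len(results))  # 1 8 -> one burst covers all 8 CBs
```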
4. Configuration
Enable ACC100 offload in the DU YAML:
```yaml
hal:
  eal_args: "--lcores (0-1)@(0-17) --file-prefix=ocudu_gnb --no-telemetry
             -a <OFH_NIC_VF_BDF>
             -a <ACC100_VF_BDF>
             --vfio-vf-token=<PF_BB_CONFIG_UUID>
             --iova-mode=pa"
  bbdev_hwacc:
    hwacc_type: "acc100"
    id: 0
    pdsch_enc:
      nof_hwacc: 4
      cb_mode: true
      dedicated_queue: true
    pusch_dec:
      nof_hwacc: 4
      force_local_harq: false
      dedicated_queue: true
```
Notes:
- nof_hwacc: for single-cell deployments, 2–4 is sufficient; for multi-cell or heavy-load setups, 8–16. Setting it higher than the upper-PHY's concurrency limits is wasteful.
- --iova-mode=pa is recommended; the va mode is known to interact poorly with some NIC drivers in DPDK 25.11.
- The VFIO-VF token must match the UUID passed to pf_bb_config -v. If pf_bb_config is restarted, update the YAML.
At startup, the DU logs a warning when nof_hwacc overrides the configured concurrency:

```
Warning: the configured maximum PDSCH concurrency ... is overridden by the
number of PDSCH encoder hardware accelerated functions (N)
Warning: the configured maximum PUSCH and SRS concurrency ... is overridden
by the number of PUSCH decoder hardware accelerated functions (N)
```
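The override the warnings describe can be modelled in two lines. This is a sketch with a hypothetical helper, assuming the hardware function count always wins when it differs from the YAML:

```python
# Sketch of the concurrency override (assumption: nof_hwacc always wins).
def effective_concurrency(configured_max: int, nof_hwacc: int):
    """Return (effective limit, whether the override warning would fire)."""
    return nof_hwacc, configured_max != nof_hwacc

print(effective_concurrency(8, 4))  # (4, True)  -> warning logged
print(effective_concurrency(4, 4))  # (4, False) -> no warning
```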
The same fields populate for both the CPU-software and ACC100-HW paths,
enabling side-by-side A/B comparison from a single log format.
7. Results
7.1 Test environment
| Item | Value |
|---|---|
| Host | Single-socket x86_64, 18 cores, AVX-512 capable |
| OS / DPDK | Linux 5.15 / DPDK 25.11 |
| pf_bb_config | 24.03 |
| ACC100 | 1 PF + 1 VF bound to vfio-pci |
| OFH NIC | iavf VF for 7.2-split RU |
| UE workload | Real UE + iperf3 DL/UL + 2 min video streaming + 2 speed tests |
| Cell | 100 MHz TDD, 30 kHz SCS, 4T4R, band n78, 256-QAM max |
All measurements taken with the same physical UE and RU; only the gNB
binary differs between the two rows (AVX-512 vs ACC100).
Summary: ACC100 percentage improvement over AVX-512 across all measured upper-PHY metrics. LDPC decode throughput leads the board at +174 %.
7.2 End-to-end upper-PHY A/B
Mean / max over ~175 one-second metric windows each.
| Metric | AVX-512 (mean / max) | ACC100 (mean / max) | ACC100 Δ |
|---|---|---|---|
| DL processing max_latency | 114.9 / 330.2 µs | 99.7 / 206.4 µs | −13 % mean, −37 % tail |
| PDSCH Processor avg_latency | 44.0 / 101.7 µs | 39.3 / 67.8 µs | −11 % mean, −33 % tail |
| PDSCH Processor proc_rate | 198.0 Mbps | 310.1 Mbps | +57 % |
| PDSCH Processor max_latency | 96.5 / 309.8 µs | 76.6 / 203.2 µs | −21 % mean, −34 % tail |
| LDPC Decoder avg_latency | 51.6 / 109.4 µs | 24.5 / 50.1 µs | −53 % (2.1× faster) |
| LDPC Decoder max_latency | 145.9 / 457.7 µs | 40.5 / 86.7 µs | −72 % (3.6× better tails) |
| LDPC Decoder decode_rate | 46.6 Mbps | 137.4 Mbps | +195 % (2.9×) |
| PUSCH Processor avg_data_latency | 231.9 / 761.5 µs | 246.3 / 736.4 µs | +6 % mean, −3 % tail |
| PUSCH Processor proc_rate | 24.0 Mbps | 37.4 Mbps | +56 % |
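The Δ column is the plain percentage change of the ACC100 value against the AVX-512 baseline; recomputing a few rows from the mean values confirms it:

```python
# Sanity check: recompute the Δ column from the table's mean values.
def pct_delta(avx512: float, acc100: float) -> int:
    """Percentage change of ACC100 relative to the AVX-512 baseline."""
    return round(100 * (acc100 - avx512) / avx512)

print(pct_delta(51.6, 24.5))    # LDPC Decoder avg_latency mean: -53
print(pct_delta(145.9, 40.5))   # LDPC Decoder max_latency mean: -72
print(pct_delta(46.6, 137.4))   # LDPC Decoder decode_rate:      195
print(pct_delta(198.0, 310.1))  # PDSCH Processor proc_rate:     57
```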
LDPC Decoder latency (lower is better). The tail-latency gap (rightmost pair) is the headline: 3.6× tighter under ACC100.
PDSCH Processor latency (lower is better). Tail latency compresses by roughly one third.
Throughput (higher is better). LDPC decode rate nearly triples; PDSCH and PUSCH processor rates both rise by about half.
7.3 CPU utilisation
| Metric | AVX-512 (mean / max) | ACC100 (mean / max) |
|---|---|---|
| upper_phy_dl | 3.63 % / 22.7 % | 4.92 % / 33.9 % |
| upper_phy_ul | 3.79 % / 28.9 % | 2.36 % / 15.4 % (−38 %) |
| ldpc_rm (rate match) | 1.30 % / 10.9 % | 0.00 % (on accelerator) |
| ldpc_rdm (rate dematch) | 0.13 % / 1.0 % | 0.00 % (on accelerator) |
The uplink CPU reduction is the headline operational benefit: one core is
freed on the upper-PHY under sustained traffic, allowing either higher
cell counts on the same host or tighter scheduling-latency budgets.
Note on the DL rows. The per-CB ldpc_encoder_* fields on the HW
path reflect batch wall-clock time rather than serialised per-CB compute
time, because ops are submitted to the accelerator in bursts. Use the
PDSCH Processor rows and upper_phy_dl for apples-to-apples DL
comparison; those are measured once per TB and are directly comparable.
8. Deployment checklist
1. Install or confirm DPDK ≥ 22.11 with the ACC100 PMD.
2. Confirm the pf_bb_config daemon is running; note its VFIO-VF token.
3. Bind the ACC100 VF(s) to vfio-pci.
4. Add the hal.bbdev_hwacc block to the DU YAML (see §4).
5. Build with ENABLE_PDSCH_HWACC=True and ENABLE_PUSCH_HWACC=True (see §5).
6. Verify the startup log shows intel_acc100_vf and ldpc_enc_q=N ldpc_dec_q=N.
7. Enable metrics.layers.enable_du_low to observe the per-component LDPC metrics (§6).