Documentation

Specifications, reference implementations, deployment guides, and contribution workflows for OCUDU India.

Most sections are under active development. Page outlines are in place so the structure is clear, but detailed content will land over the next few release cycles. One section is already complete: Hardware Acceleration; read it for a concrete sense of the depth and style the rest of the docs are headed toward.

Sections

  • Getting Started: set up a development environment and run a first reference deployment.
  • Architecture: upper-PHY layering, interfaces, platform assumptions.
  • Hardware Acceleration: offload CPU-heavy upper-PHY stages to vRAN accelerators. Already complete.
  • Deployment: reference blueprints, hardware profiles, network topologies.
  • Contributing: how to file issues, propose RFCs, and submit pull requests.

1 - Getting Started

Set up a development environment and run a first reference deployment.

Coming soon. The quick-start walkthrough is being written alongside the first public reference-stack release. Until then, the Hardware Acceleration section has a complete operator guide that exercises most of the platform.

Planned content:

  • Prerequisites: Linux host, Kubernetes cluster, CLI tools.
  • Installing the reference stack with Helm and kpt.
  • Deploying a minimal CU/DU pair.
  • Verifying the deployment.
  • Pointers to the Architecture, Deployment, and Hardware Acceleration sections for going deeper.

2 - Architecture

High-level architectural design of the OCUDU 5G gNB: layer responsibilities, inter-layer correlation, deployment topology, and the execution and async fabric that binds them.

1. What OCUDU is

OCUDU is a fully 3GPP/O-RAN-compliant 5G NR gNB implemented in C++17/C++20. It terminates every standardized RAN interface: Uu toward the UE (via PHY/OFH/RU), F1-C/F1-U between CU and DU, E1 between CU-CP and CU-UP, N2/N3 toward the 5G Core (AMF/UPF), Xn between peer gNBs, and E2 toward the near-RT RIC. The codebase is functionally disaggregated, so the same binaries can run co-located (gnb) or split across machines (cu_cp + cu_up + du, with optional du_low for an O-RAN Split-6 PHY).

Three architectural ideas run top-to-bottom through the code:

  1. Single logical entity → independent functional units. Every layer (CU-CP, CU-UP, DU-High, DU-Low, RU) is an owned object tree with a well-defined public interface and internal adapter notifiers. Layers never call each other directly; they call adapters, which the assembly code wires to concrete callees at construction time. This is why the same DU-High object can be wired to a local in-process CU-CP (via f1c_local_connector) or to a remote CU-CP (via SCTP) without recompilation.
  2. Async-procedure first. All multi-step control-plane flows (UE setup, handover, PDU session setup, E1/F1 bearer ops) are modeled as async_task<R> C++ coroutines composed with CORO_AWAIT_VALUE(...). The procedure classes live under routines/ and procedures/ subdirectories of every protocol layer.
  3. Per-entity executors. Concurrency is expressed as task dispatch to named task_executor instances: per-cell, per-UE-UL, per-UE-DL, per-crypto-worker, per-gateway-IO. Serialization is achieved either by a single-threaded worker or a strand over a shared pool, never by coarse locks.
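The adapter idea behind point 1 can be sketched in a few lines. This is an illustrative toy (the names f1c_message_notifier, local_cucp, sctp_gateway, and du_high_f1c_tx are invented for this sketch; the real interfaces in the OCUDU adapter headers are richer), but it shows why the same caller can be wired to an in-process or remote callee at construction time:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical notifier interface a DU-side layer calls to push a PDU "up".
struct f1c_message_notifier {
  virtual ~f1c_message_notifier() = default;
  virtual void on_new_message(const std::string& pdu) = 0;
};

// Concrete callee #1: an in-process CU-CP (what a local connector would wire).
struct local_cucp final : f1c_message_notifier {
  std::vector<std::string> received;
  void on_new_message(const std::string& pdu) override { received.push_back(pdu); }
};

// Concrete callee #2: a gateway that would serialize the PDU to a remote
// CU-CP over SCTP. Here it just records how many bytes it would have sent.
struct sctp_gateway final : f1c_message_notifier {
  std::size_t bytes_sent = 0;
  void on_new_message(const std::string& pdu) override { bytes_sent += pdu.size(); }
};

// The calling layer holds only the abstract notifier; assembly code picks
// the concrete target at construction time, so the layer never recompiles.
struct du_high_f1c_tx {
  explicit du_high_f1c_tx(f1c_message_notifier& n) : notifier(n) {}
  void send_ul_rrc(const std::string& pdu) { notifier.on_new_message(pdu); }
  f1c_message_notifier& notifier;
};
```

Swapping local_cucp for sctp_gateway changes only the object passed at construction, which is the essence of running gnb co-located versus split.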

2. High-level topology

flowchart LR
    UE((UE))
    AMF[/AMF/]
    UPF[/UPF/]
    RIC[/near-RT RIC/]
    PEER[/Peer gNB/]

    subgraph gNB
      direction LR
      subgraph CU[Centralized Unit]
        CUCP[CU-CP<br/>RRC · NGAP · XnAP · NRPPa]
        CUUP[CU-UP<br/>PDCP · SDAP · GTP-U]
      end
      subgraph DU[Distributed Unit]
        DUH[DU-High<br/>F1AP · MAC · RLC · Scheduler]
        DUL[DU-Low<br/>Upper PHY]
      end
      RU[Radio Unit<br/>Lower PHY / OFH-WG4]
    end

    UE  <-- Uu --> RU
    RU  <-- OFH/SDR --> DUL
    DUL <-- FAPI-like PDU API --> DUH
    DUH <-- F1-C sig --> CUCP
    DUH <-- F1-U data --> CUUP
    CUCP <-- E1 --> CUUP
    CUCP <-- N2 SCTP --> AMF
    CUUP <-- N3 GTP-U --> UPF
    CUCP <-- Xn --> PEER
    CUCP <-- E2 --> RIC
    DUH  <-- E2 --> RIC
    CUUP <-- E2 --> RIC

Each functional block is a standalone compilation unit under lib/, wrapped by an application unit under apps/units/ that adds YAML config, logging registration, metrics, and PCAP plumbing. The gnb binary composes all three app units in one process; the split binaries compose only their own unit and use SCTP/UDP gateways for cross-entity links.

3. CU-CP Centralized Unit, Control Plane

Source: lib/cu_cp/, lib/ngap/, lib/f1ap/cu_cp/, lib/e1ap/cu_cp/, lib/rrc/, lib/xnap/, lib/nrppa/.

CU-CP is the RRC/NGAP termination point and the orchestrator of every per-UE control procedure. The owning class is cu_cp_impl (lib/cu_cp/cu_cp_impl.h); it aggregates four repositories (one each for DUs, CU-UPs, AMFs, and Xn peers) plus a ue_manager, a mobility_manager, a cell_meas_manager, and an nrppa_entity.

graph TD
    CUCP[cu_cp_impl]
    UEM[ue_manager<br/>cu_cp_ue per ue_index]
    DUR[du_processor_repository]
    CUR[cu_up_processor_repository]
    NGR[ngap_repository]
    XNR[xnap_repository]
    MOB[mobility_manager]
    MEAS[cell_meas_manager]
    CUCP --> UEM
    CUCP --> DUR --> DUP[du_processor_impl<br/>owns F1AP + RRC-DU]
    CUCP --> CUR --> CUUPP[cu_up_processor_impl<br/>owns E1AP]
    CUCP --> NGR --> NGAP[ngap_impl<br/>per AMF]
    CUCP --> XNR
    CUCP --> MOB
    CUCP --> MEAS

Interfaces terminated. NGAP (TS 38.413) on N2, F1AP (TS 38.473) on F1-C, E1AP (TS 38.463) on E1, XnAP (TS 38.423 / 37.483) on Xn-C, plus NRPPa for positioning and E2AP for near-RT RIC. Each protocol has its own state machine in its lib/ directory and exposes an adapter interface back into cu_cp_impl.

UE lifecycle. The canonical Initial UE Message flow runs like this:

  1. DU sends F1AP Initial UL RRC Message Transfer; du_processor_impl allocates a cu_cp_ue via ue_manager::add_ue() and binds F1AP/RRC adapters.
  2. NGAP forwards a NAS Initial UE Message to the AMF and establishes an NGAP UE context.
  3. AMF responds with Initial Context Setup Request, which launches initial_context_setup_routine, a coroutine that sequentially awaits: Security Mode Command on RRC → F1AP UE Context Setup → UE Capability Transfer → a nested pdu_session_resource_setup_routine (E1AP Bearer Context Setup → F1AP UE Context Modification → RRC Reconfiguration).

Every step is a CORO_AWAIT_VALUE on the next async sub-procedure, so the routine reads like synchronous pseudocode but never blocks a thread.

Concurrency model. CU-CP runs on a single cu_cp_executor. A CU-CP-wide FIFO (cu_cp_common_task_scheduler) orders global tasks; each UE has its own FIFO (ue_task_scheduler_impl) so per-UE procedures serialize without blocking unrelated UEs. AMF connections get their own FIFO per NGAP instance. The result is fine-grained serialization without a single global lock.

Mobility. mobility_manager inspects measurement reports from cell_meas_manager and dispatches to one of three paths: intra-CU, inter-CU via Xn, or inter-CU via NG (AMF-routed). Conditional Handover has its own state machine in cu_cp_ue_cho_context. All three paths share a common coroutine skeleton under lib/cu_cp/routines/.

4. CU-UP Centralized Unit, User Plane

Source: lib/cu_up/, lib/pdcp/, lib/sdap/, lib/gtpu/, lib/f1u/cu_up/, lib/e1ap/cu_up/.

CU-UP terminates N3 (GTP-U to UPF) and F1-U (NR-U to DU) and implements the PDCP/SDAP layers in between. The E1AP interface receives Bearer Context Setup/Modify/Release from CU-CP and materializes the per-UE object tree.

flowchart LR
    subgraph UE_CTX[Per-UE context]
      direction TB
      PDUs[pdu_session] --> DRB[drb_context]
      DRB --> QF[qos_flow_context]
    end
    subgraph N3[N3  UPF]
      NG[gtpu_tunnel_ngu_rx/tx]
    end
    subgraph F1U[F1-U  DU]
      FB[f1u_bearer_impl<br/>NR-U DDDS]
    end
    NG -->|TEID demux| SDAPT[sdap_entity_tx<br/>QFI mark]
    SDAPT --> PDCPT[pdcp_entity_tx<br/>cipher · integrity · SN]
    PDCPT --> FB
    FB --> PDCPR[pdcp_entity_rx<br/>decipher · reorder]
    PDCPR --> SDAPR[sdap_entity_rx<br/>QFI strip]
    SDAPR --> NG

Object hierarchy. A pdu_session owns its N3 GTP-U tunnel, an SDAP entity, and a map of drb_context entries. Each DRB owns a PDCP entity, an F1-U CU-UP bearer, and a map of qos_flow_context entries. TEIDs are allocated from n3_teid_allocator and f1u_teid_allocator pools.
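A TEID pool of the kind mentioned above can be sketched as a simple recycle-first allocator (illustrative only; the class name teid_pool and its shape are invented here, and the real n3_teid_allocator / f1u_teid_allocator are more elaborate):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_set>

// Toy TEID pool: hands out 32-bit TEIDs and recycles released ones before
// minting new values, so the TEID space is reused under churn.
class teid_pool {
public:
  std::optional<uint32_t> allocate() {
    if (!free_list.empty()) {
      uint32_t teid = *free_list.begin();   // recycle a released TEID first
      free_list.erase(free_list.begin());
      in_use.insert(teid);
      return teid;
    }
    if (next == 0) return std::nullopt;     // counter wrapped: pool exhausted
    uint32_t teid = next++;
    in_use.insert(teid);
    return teid;
  }
  bool release(uint32_t teid) {
    if (in_use.erase(teid) == 0) return false;  // unknown TEID
    free_list.insert(teid);
    return true;
  }
private:
  uint32_t next = 1;  // TEID 0 is conventionally reserved
  std::unordered_set<uint32_t> in_use;
  std::unordered_set<uint32_t> free_list;
};
```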

PDCP. TX maintains TX_NEXT, TX_TRANS_CRYPTO, TX_REORD_CRYPTO, TX_TRANS, TX_NEXT_ACK (TS 38.323 §7.1). Ciphering and integrity run on a parallel crypto_executor pool; custom state variables track in-flight crypto operations so PDUs can be emitted in the correct order even when parallel workers finish out-of-sequence. RX implements the reordering window and the t-Reordering timer per TS 38.323 §5.2.2.2.

F1-U / NR-U. f1u_bearer_impl consumes NR-U data delivery status messages from the DU and feeds handle_transmit_notification() / handle_delivery_notification() into PDCP TX, which advances TX_NEXT_ACK and releases discard-timer slots. The DU also reports desired buffer size; PDCP TX uses it as a back-pressure signal for early drop.
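The effect of a delivery notification on PDCP TX state can be sketched as follows (assumed shapes, not the real pdcp_entity_tx; pdcp_tx_window and its members are invented for this sketch):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model: the DU's NR-U delivery status carries the highest PDCP SN
// delivered in sequence; PDCP TX advances TX_NEXT_ACK past it and drops
// the per-SDU discard-timer state for everything at or below that SN.
class pdcp_tx_window {
public:
  void on_sdu(uint32_t sn) { pending.emplace(sn, true); }

  // NR-U "highest successfully delivered PDCP SN" notification.
  void handle_delivery_notification(uint32_t highest_sn) {
    // Release discard-timer slots for all SNs <= highest_sn.
    pending.erase(pending.begin(), pending.upper_bound(highest_sn));
    if (highest_sn >= tx_next_ack) tx_next_ack = highest_sn + 1;
  }

  uint32_t tx_next_ack = 0;          // oldest SN still awaiting delivery
  std::map<uint32_t, bool> pending;  // SN -> discard-timer slot (stand-in)
};
```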

Concurrency. Every UE is assigned four executors by the ue_executor_mapper: ctrl (E1AP), ul_pdu (F1-U RX), dl_pdu (N3 RX), and crypto (pooled). The GTP-U demux dispatches PDUs by TEID onto the owning session’s dl_pdu executor in batches, so a single UE’s data path is serialized while different UEs run in parallel on different workers.

5. DU-High MAC, RLC, F1AP, DU manager

Source: lib/du/du_high/, lib/mac/, lib/rlc/, lib/f1ap/du/.

DU-High is orchestrated by du_manager_impl, which owns the cell and UE context repositories and reacts to three event streams: F1AP procedures from CU-CP (UE Context Setup/Modify/Release per TS 38.473), MAC indications (UL-CCCH from Msg3, C-RNTI CE on handover access), and operator reconfig from the app-level configurator.

flowchart TB
    F1AP[f1ap_du_impl<br/>TS 38.473 procedures] --> DMGR[du_manager_impl<br/>UE lifecycle orchestration]
    DMGR --> MACC[mac_controller<br/>UE ctx · RNTI]
    DMGR --> RLCF[rlc factory]
    MACC --> MDL[mac_dl_processor<br/>PDSCH / SIB / RAR / Paging assembler]
    MACC --> MUL[mac_ul_processor<br/>PUSCH demux · BSR · PHR · CRC]
    MDL <-->|slot_result| SCHED[MAC Scheduler]
    MUL --> SCHED
    MDL -->|pull_pdu LCID| RLCTX[rlc_tx]
    MUL -->|handle_pdu LCID| RLCRX[rlc_rx]

MAC split. mac_dl_processor per cell runs on a high-priority slot_ind_executor; on each slot it calls the scheduler’s get_slot_result() and then pulls PDUs from RLC TX entities for each granted logical channel. mac_ul_processor receives Rx_Data indications, routes by C-RNTI via rnti_manager (lock-free atomic allocator starting at MIN_CRNTI = 0x4601, TS 38.321 §7.1), demultiplexes MAC subPDUs, feeds BSR/PHR to the scheduler, and hands LCID payloads to RLC RX.
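The lock-free RNTI allocation mentioned above reduces to one atomic counter. This sketch keeps only that idea (the constants mirror the text; the wrap-around policy and the omission of release/reuse tracking are simplifications of the real rnti_manager):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint16_t MIN_CRNTI = 0x4601;  // starting value, as in the text
constexpr uint16_t MAX_CRNTI = 0xFFEF;  // upper bound of the C-RNTI space

// One relaxed fetch_add per allocation; the result is folded into the
// [MIN_CRNTI, MAX_CRNTI] range. No locks anywhere on the allocation path.
class rnti_allocator {
public:
  uint16_t allocate() {
    uint64_t n = counter.fetch_add(1, std::memory_order_relaxed);
    return static_cast<uint16_t>(MIN_CRNTI + n % (MAX_CRNTI - MIN_CRNTI + 1));
  }
private:
  std::atomic<uint64_t> counter{0};
};
```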

RLC. Three modes exist: TM (SRB0, passthrough), UM (SRBs, no ARQ), AM (DRBs, full ARQ). AM TX tracks TX_NEXT_ACK, TX_NEXT, POLL_SN plus byte/PDU poll counters; AM RX runs a reassembly window keyed on RX_NEXT with a t-Reassembly timer that generates STATUS PDUs on gap or timeout (TS 38.322 §5.2, §5.3.3). The SDU queue between PDCP and RLC is a lock-free SPSC; this is the reason the slot-indication hot path can pull PDUs without blocking on RLC state updates happening on the UE executor.
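The SPSC property that makes this non-blocking can be shown with a minimal ring in the style of the rigtorp queue (a teaching-size sketch, not the production implementation): the producer writes only head, the consumer writes only tail, so neither side ever waits on the other.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Minimal single-producer/single-consumer ring buffer. Capacity is N-1
// because one empty slot distinguishes "full" from "empty".
template <typename T, std::size_t N>
class spsc_queue {
public:
  bool try_push(T v) {
    std::size_t h    = head.load(std::memory_order_relaxed);
    std::size_t next = (h + 1) % N;
    if (next == tail.load(std::memory_order_acquire)) return false;  // full
    buf[h] = std::move(v);
    head.store(next, std::memory_order_release);  // publish to consumer
    return true;
  }
  std::optional<T> try_pop() {
    std::size_t t = tail.load(std::memory_order_relaxed);
    if (t == head.load(std::memory_order_acquire)) return std::nullopt;  // empty
    T v = std::move(buf[t]);
    tail.store((t + 1) % N, std::memory_order_release);  // free the slot
    return v;
  }
private:
  T buf[N]{};
  std::atomic<std::size_t> head{0}, tail{0};
};
```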

F1AP-DU. f1ap_du_impl decodes F1AP ASN.1, dispatches to per-procedure coroutines (F1 Setup, UE Context Setup/Modify/Release, DL/UL RRC Message Transfer, Paging), and forwards RRC containers to the right RLC SRB via adapters in lib/du/du_high/du_manager/du_ue/du_ue_adapters.h.

Adapter pattern. DU-High never calls F1AP or PDCP directly; it holds a set of small adapter classes (f1c_rx_sdu_rlc_adapter, rlc_rx_rrc_sdu_adapter, mac_sdu_tx_builder, mac_sdu_rx_notifier) whose targets are set at UE-creation time. This is what lets the same DU-High binary work with an in-process F1-C connector or a remote SCTP F1-C without code changes.

6. MAC scheduler

Source: lib/scheduler/.

The scheduler is the most intricate subsystem. scheduler_impl owns one cell_scheduler per cell and one ue_scheduler per carrier-aggregation cell group. This split is deliberate: cell-wide resources (SSB, PRACH, SI, CSI-RS, PUCCH format resources) are per-cell state, while UE data state must be shared across CA component carriers.

flowchart TD
    SI[scheduler_impl::slot_indication] --> CS[cell_scheduler::run_slot]
    CS --> RG[cell_resource_allocator<br/>ring buffer, ~16 slots]
    CS --> SSB[ssb_sch]
    CS --> CSIRS[csi_rs_sch]
    CS --> SIS[si_sch<br/>SIB1 + SI msgs]
    CS --> PR[prach_sch]
    CS --> RA[ra_scheduler<br/>RAR · Msg3 · Msg4]
    CS --> PG[paging_sch]
    CS --> US[ue_scheduler::run_slot]
    US --> EV[event_manager<br/>config · feedback]
    US --> UCI[uci_scheduler<br/>SR · CSI PUCCH]
    US --> SRS[srs_scheduler]
    US --> FB[fallback_sched<br/>SRB0]
    US --> INTER[inter_slice_scheduler]
    INTER --> INTRA[intra_slice_scheduler]
    INTRA --> POL[scheduler_policy<br/>time_rr · time_qos]

Resource grid. cell_resource_allocator is a circular buffer of per-slot cell_slot_resource_allocator entries, sized for SCHEDULER_MAX_K0 / K1 / K2 look-ahead. Each entry contains symbol × CRB bitmaps for DL and UL and the accumulated sched_result handed to MAC.
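The ring-buffer indexing behind this can be sketched as follows (names and the depth of 16 are illustrative; the real cell_slot_resource_allocator entries hold full symbol × CRB bitmaps and the accumulated sched_result):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a per-slot allocator entry.
struct slot_resources {
  uint32_t slot        = 0;  // absolute slot this entry currently represents
  uint64_t dl_crb_mask = 0;  // stand-ins for the symbol x CRB bitmaps
  uint64_t ul_crb_mask = 0;
};

// Circular buffer indexed by slot modulo the look-ahead depth. When the
// ring wraps onto a stale slot, the entry is lazily reset before reuse.
template <std::size_t RING_SIZE = 16>
class slot_allocator_ring {
public:
  slot_resources& operator[](uint32_t slot) {
    slot_resources& e = ring[slot % RING_SIZE];
    if (e.slot != slot) e = slot_resources{slot, 0, 0};  // recycle stale entry
    return e;
  }
private:
  std::array<slot_resources, RING_SIZE> ring{};
};
```

Look-ahead scheduling (k0/k1/k2 offsets) then amounts to indexing `ring[current_slot + k]`, provided the offset stays below the ring depth.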

Per-UE state split. ue is cell-group-wide (logical channels, DRX, timing advance). ue_cell is per-cell (active BWP, HARQ entities, MCS calculator, power controllers, fallback flag). A UE in CA has one ue and several ue_cell views indexed by serving cell index (PCell = 0).

Slicing. Two layers: inter_slice_scheduler ranks RAN slices by SLA/min-PRB/max-PRB each slot and produces DL/UL candidates; intra_slice_scheduler then applies a pluggable scheduler_policy (time-domain Round-Robin or Proportional-Fair implemented in lib/scheduler/policy/) to rank UEs within a slice and allocate PDSCH/PUSCH. Fallback UEs on SRB0 use a dedicated ue_fallback_scheduler instead.

HARQ. Each UE has 8 DL + 8 UL HARQ processes per serving cell (cell_harq_manager), tracked by NDI toggling, a bounded max_nof_harq_retxs, and a slot_timeout for missed ACKs.
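The NDI bookkeeping for one process reduces to a few lines (a sketch of the convention, not the real cell_harq_manager; timeout handling is elided):

```cpp
#include <cassert>
#include <cstdint>

// One HARQ process: a *new* transport block toggles the NDI bit, while a
// retransmission keeps it, which is how the receiver tells the two apart.
struct harq_process {
  bool    ndi        = false;
  uint8_t retx_count = 0;
  bool    active     = false;

  void new_tx() {                            // fresh TB
    ndi        = !ndi;                       // toggle NDI
    retx_count = 0;
    active     = true;
  }
  bool retx(uint8_t max_nof_harq_retxs) {    // same TB, NDI unchanged
    if (!active || retx_count >= max_nof_harq_retxs) return false;
    ++retx_count;
    return true;
  }
  void ack() { active = false; }             // TB delivered, process free
};
```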

Config safety. sched_config_manager converts add/update/remove UE requests into ue_config_update_event objects applied at slot boundaries by the event manager. Config never changes mid-slot.

Concurrency. One slot runs on one thread per cell; in CA, cell_group_mutex is taken only when the cell group has more than one cell, so single-cell deployments pay zero lock cost.

7. DU-Low, PHY, RU, Open Fronthaul

Source: lib/du/du_low/, lib/phy/, lib/ru/, lib/ofh/, lib/radio/.

DU-Low implements only the Upper PHY; lib/du/du_low/README.md states the DU-Low is O-RAN Split 7.2x aligned, with the Lower PHY pushed into the RU. The radio_unit interface (include/ocudu/ru/ru.h) has three concrete implementations: OFH-RU (O-RAN fronthaul, production), SDR-RU (direct baseband via UHD/ZMQ, which pulls the Lower PHY back into the host), and Dummy-RU (loopback for testing).

flowchart LR
    MAC[MAC / Scheduler]
    subgraph DU_LOW[DU-Low · Upper PHY]
      DLPOOL[downlink_processor_pool]
      ULPOOL[uplink_processor_pool]
      DLPOOL --> PDSCH[PDSCH proc<br/>LDPC · mod]
      DLPOOL --> PDCCH[PDCCH proc<br/>polar]
      DLPOOL --> SSB[SSB proc]
      ULPOOL --> PUSCH[PUSCH proc]
      ULPOOL --> PUCCH[PUCCH proc]
      ULPOOL --> PRACH[PRACH det]
    end
    subgraph RU[Radio Unit]
      LP[Lower PHY<br/>OFDM · FFT · CP]
      OFH[OFH tx/rx<br/>eCPRI · BFP · WG4]
    end
    MAC -->|PDU API| DLPOOL
    ULPOOL -->|UL ind| MAC
    DLPOOL -->|DL grid| LP
    LP -->|UL grid| ULPOOL
    LP <--> OFH
    OFH <--> WIRE((Fronthaul Ethernet))

Upper PHY. Implements LDPC (base-graph 1/2 per TS 38.212 §5.3.2, with AVX2/AVX512/NEON kernels), polar coding for control channels, CRC (LUT or CLMUL), scrambling (Gold sequences per TS 38.211 §5.2.1), and modulation mapping up to 256QAM. Channels are objects: pdsch_processor, pdcch_processor, ssb_processor, csi_rs_generator, pusch_processor, pucch_processor, prach_detector. For hardware offload of LDPC, see Hardware Acceleration → Intel ACC100 (LDPC).

Lower PHY (when present via the SDR path). OFDM modulator/demodulator with pluggable DFT backends: FFTW, AMD FFTZ/AOCL, ARM Performance Library, or a generic Cooley-Tukey implementation. CP length selection follows TS 38.211; phase compensation is precomputed via LUT.

Open Fronthaul (WG4 CUS). ofh_sector encapsulates one OFH logical antenna array. The transmitter encodes C-plane section types (Type 1 for DL/UL data, Type 3 for PRACH), the U-plane packer compresses IQ via Block Floating Point (O-RAN.WG4.CUS Annex A.1.2, with SIMD kernels), and eCPRI framing produces Ethernet-ready packets. The receiver reverses this, with an rx_window_checker that rejects packets outside the RX window and a symbol reorderer that re-sequences out-of-order U-plane traffic. Timing is driven by realtime_timing_worker against CLOCK_REALTIME (PTP-disciplined in production); it emits OTA symbol boundaries to which the transmitter, receiver, and DU-Low subscribe.

DPDK integration. Ethernet TX/RX under lib/ofh/ethernet/dpdk/ uses busy-polling on dedicated lcores, selected by CPU affinity in the worker manager.

8. Cross-cutting infrastructure

Source: include/ocudu/support/, include/ocudu/adt/, lib/gateways/, lib/ocudulog/.

Async. async_task<R> is a C++20 stackless coroutine; async_procedure<R> is a non-coroutine fallback with the same awaitable shape. event_signal and manual_event are the awaitable primitives used to park a coroutine until a PHY or peer response arrives. protocol_transaction_manager wraps the transaction-ID-plus-timeout pattern every ASN.1 protocol needs.

Executors. The task_executor interface has a zoo of implementations: inline_task_executor (test), general_task_worker_executor (one thread, policy-driven queue), priority_task_worker_executor (multi-priority), strand_executor (serialize over a shared pool using an atomic job count), sync_task_executor (block until done). All tasks are unique_function<void(), 64>, a 64-byte small-buffer-optimized closure; typical lambdas incur no heap allocation.
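The strand idea can be sketched as follows. This toy version uses a mutex-guarded queue and flag instead of the real lock-free atomic job count, and inline_executor stands in for the shared pool; what it preserves is the invariant that at most one drain job is ever in flight, so posted tasks run FIFO even over a multi-threaded pool:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Stand-in for the underlying executor (a thread pool in production).
struct inline_executor {
  void execute(std::function<void()> job) { job(); }
};

template <typename Exec>
class strand {
public:
  explicit strand(Exec& e) : exec(e) {}

  void post(std::function<void()> task) {
    bool dispatch = false;
    {
      std::lock_guard<std::mutex> lk(m);
      queue.push_back(std::move(task));
      dispatch = !draining;   // only the first poster schedules a drain job
      draining = true;
    }
    if (dispatch) exec.execute([this] { drain(); });
  }

private:
  void drain() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lk(m);
        if (queue.empty()) { draining = false; return; }
        task = std::move(queue.front());
        queue.pop_front();
      }
      task();  // runs outside the lock, but strictly one task at a time
    }
  }
  Exec& exec;
  std::mutex m;
  std::deque<std::function<void()>> queue;
  bool draining = false;
};
```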

Queues. Lock-free SPSC (rigtorp) and MPMC (rigtorp) underpin the data path; locking MPSC/MPMC variants exist for cold paths. The SPSC RLC SDU queue is the reason the slot-indication pull path is non-blocking.

byte_buffer. A segmented, reference-counted zero-copy buffer backed by a thread-local segment pool. Slicing produces views without copying; every data-path handoff moves buffers by reference-count bump.
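The zero-copy slicing idea can be modeled in miniature (buffer_view is a toy invented here; the real byte_buffer is segmented and backed by a thread-local pool rather than one contiguous vector):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// A view shares ownership of the underlying storage (reference-count bump,
// no byte copy) and carries only an offset and a length.
class buffer_view {
public:
  explicit buffer_view(std::vector<uint8_t> data)
    : storage(std::make_shared<std::vector<uint8_t>>(std::move(data))),
      offset(0), len(storage->size()) {}

  // O(1) slice: copies the shared_ptr (refcount + 1), never the bytes.
  buffer_view slice(std::size_t off, std::size_t n) const {
    buffer_view v(*this);
    v.offset += off;
    v.len = n;
    return v;
  }
  uint8_t     operator[](std::size_t i) const { return (*storage)[offset + i]; }
  std::size_t size() const { return len; }
  long        use_count() const { return storage.use_count(); }

private:
  std::shared_ptr<std::vector<uint8_t>> storage;
  std::size_t offset;
  std::size_t len;
};
```

Every data-path handoff in this model is a move or refcount bump; the payload bytes are written once and never copied again.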

Timers. timer_manager is a flat tick-driven timer service: tick() is called once per ms and expired callbacks are dispatched to the per-timer executor. async_wait_for() wraps a timer as an awaitable, which is how PDCP t-Reordering, RLC t-PollRetransmit, and RA contention-resolution timers integrate with the coroutine model.
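A tick-driven service of this kind can be sketched in a few lines (tick_timer_service is a toy invented here; executor dispatch, timer handles, and stop/restart are elided):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Timers are bucketed by absolute expiry tick; tick() advances the clock
// by one millisecond and fires every callback whose bucket just came due.
class tick_timer_service {
public:
  void start(uint32_t duration_ms, std::function<void()> on_expiry) {
    buckets[now + duration_ms].push_back(std::move(on_expiry));
  }
  void tick() {
    ++now;
    auto it = buckets.find(now);
    if (it == buckets.end()) return;
    for (auto& cb : it->second) cb();  // expired callbacks fire on this tick
    buckets.erase(it);
  }
private:
  uint64_t now = 0;
  std::map<uint64_t, std::vector<std::function<void()>>> buckets;
};
```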

Gateways. sctp_network_server_impl / sctp_network_client_impl for N2/F1-C/E1/Xn; udp_network_gateway_impl for N3/F1-U; io_broker (epoll) manages socket FDs and dispatches events to executors. Every gateway takes an executor reference so RX callbacks run where the protocol layer expects them.

Logging & tracing. ocudulog is an async log framework with per-channel levels, pluggable sinks (file/stdout/syslog/UDP), and formatter classes. l1_dl_tracer, l1_ul_tracer, l2_tracer emit compile-time-gated binary trace events for latency analysis; Tracy integration is optional.

9. Deployment topologies and wiring

OCUDU compiles four binaries (gnb, cu_cp, cu_up, du) plus du_low for Split 6. The same lib/ code powers all of them; the difference is which app units each binary composes and which gateway factories it picks.

flowchart LR
    subgraph gnb[Co-located gnb binary]
      CC1[CU-CP] -- local --- CU1[CU-UP]
      CC1 -- local --- D1[DU]
      CU1 -- local --- D1
    end
    subgraph split[Split CU/DU]
      CC2[cu_cp binary] -- SCTP/F1-C --- D2[du binary]
      CC2 -- SCTP/E1 --- CU2[cu_up binary]
      CU2 -- UDP/F1-U --- D2
    end

Selection rule. gnb.cpp instantiates f1c_local_connector, e1_local_connector, and f1u_local_connector (zero-copy in-process queues). The split binaries instantiate SCTP servers/clients and a UDP gateway instead. The application units and lib/ code are identical in both paths; only the connector factory differs. Recent commits added full SCTP socket-parameter plumbing (RTO, heartbeat, retransmission) into the F1 and E1 config so operators can tune transport per deployment.

App units (apps/units/). o_cu_cp, o_cu_up, and flexible_o_du provide a uniform interface (application_unit) covering YAML schema registration, logger setup, worker-manager contribution (CPU affinity, NUMA, pool sizes), PCAP plumbing, and metrics aggregation. This is how a single gnb binary cleanly composes three functional entities with shared workers and a single buffer pool.

Worker manager and buffer pool. One worker_manager sizes and pins every executor thread per the YAML-declared affinities. One buffer_pool_manager provides the byte_buffer segment pool that every layer uses; no layer allocates from its own heap in the data path.

Remote control and metrics. An optional uWebSockets-backed remote_server exposes JSON commands (UE dump, cell start/stop, metrics query). A central metrics_manager aggregates producers from every layer and fans them out to configurable sinks (log, stdout, JSON, file) on a periodic tick.

10. Layer correlation summary

| Plane | CU-CP | CU-UP | DU-High | DU-Low / RU |
|---|---|---|---|---|
| L3 / NAS | RRC, NGAP, XnAP, NRPPa | — | — | — |
| L2 Ctrl | F1AP-CU, E1AP-CU | E1AP-CU-UP | F1AP-DU, MAC Ctrl | — |
| L2 Data | — | PDCP, SDAP, GTP-U, F1-U CU | RLC, MAC DL/UL, Sched | — |
| L1 | — | — | FAPI-like PDU API | Upper PHY (DU-Low), Lower PHY + OFH (RU) |
| Transport | SCTP (N2/F1-C/E1/Xn) | UDP (N3/F1-U) | SCTP (F1-C), UDP (F1-U) | eCPRI/Ethernet (OFH) or UHD/ZMQ (SDR) |

The control plane forms a chain (CU-CP → DU-High via F1-C, with E1 as a side-link to CU-UP for bearer context). The user plane forms an independent pipe (UPF ↔ CU-UP ↔ DU-High ↔ RU) entirely outside CU-CP’s hot path. The coroutine-based procedure framework cuts across every control-plane layer uniformly, so a flow like PDU Session Setup reads as a single linear routine even though it straddles NGAP, E1AP, F1AP, and RRC.

Further reading

3 - Hardware Acceleration

Offload CPU-heavy upper-PHY stages to vRAN accelerator cards for lower latency, higher throughput, and reduced host CPU.

OCUDU India supports offloading CPU-heavy stages of the upper-PHY from the host to dedicated vRAN accelerators. The goal is better latency and throughput on the same hardware budget, with the freedom to fall back to a pure-software path when no accelerator is available.

Design principles

  • Software parity. The HW and SW paths implement the same interfaces; no functional difference observable from MAC or above.
  • Configuration-driven. Choose path via DU YAML at startup. No code changes, no runtime toggling.
  • Unified metrics. Every enqueue/dequeue on the HW path emits the same metric fields the SW path emits. A/B comparisons use one log format.
  • Batched dispatch. Operations are accumulated per-TB and submitted to the accelerator as a single burst, amortising DPDK per-call cost.
  • On-device state. HARQ soft-buffer data stays on the accelerator between transmissions; the host tracks only context lifecycle.

Supported accelerators

| Accelerator | Offloaded stages | Status | Guide |
|---|---|---|---|
| Intel ACC100 | LDPC encode (PDSCH), LDPC decode (PUSCH), on-chip HARQ | Supported | Intel ACC100 LDPC offload |
| Intel ACC200 | FEC + equalisation | Planned | — |
| NVIDIA Aerial | Full upper-PHY offload | Evaluation | — |
| Other DPDK BBDEV backends | via baseband_* PMD | Design review | — |

When to enable acceleration

  • High-PRB-count cells (80–100 MHz) where software LDPC becomes the upper-PHY bottleneck.
  • Multi-cell hosts where freeing CPU on the upper-PHY increases cell density per server.
  • Tail-latency-sensitive deployments: LDPC decode tail latency compresses substantially on hardware.
  • First time enabling HWACC? Start with the ACC100 guide; it covers prerequisites, kernel/VFIO setup, YAML, build flags, and a deployment checklist.
  • Already running HWACC and want to understand the numbers? Jump to the benchmark results for the A/B comparison.
  • Adding a new accelerator backend? See the contributing guide for how to file an RFC.

3.1 - Intel ACC100 LDPC offload

Offload LDPC encoding (PDSCH) and decoding (PUSCH) to Intel ACC100 vRAN accelerator cards via DPDK BBDEV, with full upper-PHY metrics instrumentation for side-by-side comparison with the software-AVX-512 path.

Scope: hardware-accelerated LDPC encoding (PDSCH) and decoding (PUSCH) on Intel ACC100 vRAN accelerator cards, with full upper-PHY metrics instrumentation so operators can directly compare CPU-software and HW-offload performance in production logs.

Introduced in commit: aa6db7d066

Highlights

  • LDPC offload for both PDSCH and PUSCH: the most CPU-heavy channel-coding steps move off the host and onto the accelerator.
  • Configuration-driven: enable or disable via the DU YAML; no code changes.
  • Multi-VF scaling: allowlisting additional ACC100 VFs automatically spreads load across them.
  • Unified metrics: the same per-block PHY metric fields populate whether LDPC runs in software (AVX-512) or on the accelerator, enabling direct side-by-side comparison from one log format.
  • Measured gains on real traffic: significantly lower PUSCH decode latency, tighter tail latency, higher processor throughput, and reduced upper-PHY uplink CPU (see §7).

1. Prerequisites

1.1 Hardware

  • Intel ACC100 PCIe card, SR-IOV capable; at least one VF exposed to the host.
  • x86-64 CPU with AVX2 (AVX-512 recommended for the software-fallback path).
  • 2 GB of 2 MiB hugepages.
  • PCIe slot on the same NUMA node as the upper-PHY worker cores.

1.2 Software

| Component | Minimum version | Notes |
|---|---|---|
| DPDK | 22.11 | Tested with 25.11; ACC100 PMD (baseband_acc) required. |
| Linux kernel | 5.15+ | IOMMU enabled (intel_iommu=on iommu=pt). |
| pf_bb_config | 24.03+ | PF configurator daemon; must run continuously. |
| DU build flags | ENABLE_DPDK=True, ENABLE_PDSCH_HWACC=True, ENABLE_PUSCH_HWACC=True | See §5. |

1.3 Kernel and VFIO setup

# Kernel boot parameters (then update-grub + reboot):
intel_iommu=on iommu=pt hugepagesz=2M hugepages=1024

# Hugepage mount if not done by distro:
echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Load VFIO modules:
sudo modprobe vfio-pci
echo 1 | sudo tee /sys/module/vfio_pci/parameters/enable_sriov   # if kernel-builtin

# Create one SR-IOV VF and bind it to vfio-pci:
echo 1 | sudo tee /sys/bus/pci/devices/<ACC100_PF_BDF>/sriov_numvfs
sudo dpdk-devbind.py --bind=vfio-pci <ACC100_VF_BDF>

# Start pf_bb_config holding the PF group open with a VF token.
# The token (UUID) is required in the DU's EAL args.
sudo /opt/pf-bb-config/pf_bb_config ACC100 \
     -c /opt/pf-bb-config/acc100/acc100_config_vf_5g.cfg \
     -v <UUID> &

After setup, dpdk-test-bbdev should enumerate the VF as intel_acc100_vf.

2. Architecture overview

2.1 Where LDPC sits in the upper-PHY

LDPC encode (PDSCH) and LDPC decode (PUSCH) are the two steps this feature accelerates. On the software path they are AVX2/AVX-512 kernels; on the HW path they become batched DPDK BBDEV operations dispatched to the ACC100. Everything else in the chain stays on the CPU.

Figure: UL and DL pipelines through the upper-PHY. UL chain (top): OFH RX → demodulation → rate dematching → LDPC decoding (HW) → CRC → MAC. DL chain (bottom): MAC TB → segmentation → LDPC encoding (HW) → rate matching → scrambling → modulation → layer mapping → precoding → OFH TX. The LDPC stages run on the ACC100; all other stages stay on CPU.

2.2 HAL layering

The upper-PHY factory checks for a hardware accelerator factory at construction. If present, the HW path is built; otherwise the software AVX-512 path is used. The choice is made once at startup from the YAML there is no runtime toggling.

HAL layering diagram with five layers: (1) Upper-PHY (PUSCH, PDSCH, PHY metrics) calling into hw_accelerator_*; (2) HWACC METRIC decorators (hwacc_pusch_dec_decorator, hwacc_pdsch_enc_decorator, metric emitter); (3) BBDEV layer (HARQ context, pdsch_enc_bbdev_imp, pusch_dec_bbdev_imp) calling rte_bbdev_enqueue/dequeue; (4) DPDK BBDEV device wrapper (bbdev_acc, mbuf, bbdev op) over VFIO; (5) ACC100 card (LDPC encode, LDPC decode, on-chip HARQ).
Five-layer HAL from the upper-PHY down to the ACC100 silicon. Each layer's interface to the one below is labelled.

3. Implementation summary

The integration is implemented as a hardware-abstraction layer that plugs into the existing upper-PHY factory pattern. Four things are worth knowing:

  1. Batching. Encode and decode ops are accumulated in a per-instance buffer and submitted to the accelerator as a single burst on the first dequeue call of each transport block. This amortises the DPDK per-call cost across all code blocks of the TB.

  2. Shared pools. The HAL factory owns one set of DPDK mbuf and op mempools for all encoder and decoder instances the upper-PHY creates. Pool size scales automatically with the total queue count across allowlisted accelerators.

  3. On-chip HARQ. Soft data for HARQ combining stays on the accelerator between transmissions, addressed by a CB-indexed offset. The host tracks only the lifecycle of each context entry, not the soft data itself.

  4. Unified metrics. A thin decorator wraps the HW accelerator interface, times each enqueue–dequeue pair, and emits the same metric events the software path does. The existing upper-PHY aggregator consumes both sources transparently.

A handful of ACC100 silicon quirks are handled internally by the HAL (input byte-alignment, a single-CB transport-block special case, a per-op E-limit guard, a long-session HARQ-context wrap-around). These require no action from the operator.
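The batching pattern from point 1 can be sketched generically (tb_batcher is invented for this sketch and the burst callback stands in for the driver submission; the real code batches DPDK BBDEV operations):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Enqueue calls only stage per-code-block operations; the first dequeue of
// the transport block flushes the whole buffer as one burst, amortising the
// per-burst driver cost across all code blocks of the TB.
template <typename Op>
class tb_batcher {
public:
  void enqueue(Op op) { staged.push_back(std::move(op)); }

  // `burst` stands in for the single driver submission of all staged ops.
  template <typename BurstFn>
  std::size_t dequeue(BurstFn burst) {
    if (!flushed) {
      burst(staged);   // one submission for every CB of the TB
      flushed = true;
    }
    return staged.size();
  }

  void reset() {       // prepare for the next transport block
    staged.clear();
    flushed = false;
  }

private:
  std::vector<Op> staged;
  bool flushed = false;
};
```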

4. Configuration

Enable ACC100 offload in the DU YAML:

hal:
  eal_args: "--lcores (0-1)@(0-17) --file-prefix=ocudu_gnb --no-telemetry
             -a <OFH_NIC_VF_BDF>
             -a <ACC100_VF_BDF>
             --vfio-vf-token=<PF_BB_CONFIG_UUID>
             --iova-mode=pa"
  bbdev_hwacc:
    hwacc_type: "acc100"
    id: 0
    pdsch_enc:
      nof_hwacc: 4
      cb_mode: true
      dedicated_queue: true
    pusch_dec:
      nof_hwacc: 4
      force_local_harq: false
      dedicated_queue: true

Notes:

  • nof_hwacc: for single-cell deployments, 2–4 is sufficient; for multi-cell or heavy-load setups, 8–16. Setting it higher than the upper-PHY’s concurrency limits is wasteful.
  • --iova-mode=pa is recommended; the va mode is known to interact poorly with some NIC drivers in DPDK 25.11.
  • The VFIO-VF token must match the UUID passed to pf_bb_config -v. If pf_bb_config is restarted, update the YAML.

5. Build guide

5.1 gNB with 7.2 fronthaul and ACC100 offload

cd ~/ocudu
mkdir -p build_hwacc && cd build_hwacc

sudo cmake -DDU_SPLIT_TYPE=SPLIT_7_2 \
           -DENABLE_DPDK=True \
           -DENABLE_PDSCH_HWACC=True \
           -DENABLE_PUSCH_HWACC=True \
           -DASSERT_LEVEL=MINIMAL \
           ../

sudo make -j$(nproc)

Binary: build_hwacc/apps/gnb/gnb.

5.2 Benchmarks

cd build_hwacc
sudo make -j$(nproc) pdsch_processor_benchmark pusch_processor_benchmark

5.3 Startup verification

When the DU starts with ACC100 configured, the log should contain:

[HWACC] [I] [bbdev] dev=0 driver=intel_acc100_vf ...
[HWACC] [I] [bbdev] dev=0 started: ldpc_enc_q=N ldpc_dec_q=N ...

and on stdout:

Warning: the configured maximum PDSCH concurrency ... is overridden by the
         number of PDSCH encoder hardware accelerated functions (N)
Warning: the configured maximum PUSCH and SRS concurrency ... is overridden
         by the number of PUSCH decoder hardware accelerated functions (N)

6. Metrics

Enable the upper-PHY metric block in the YAML:

metrics:
  enable_log: true
  enable_verbose: true
  layers:
    enable_du_low: true
  periodicity:
    du_report_period: 1000

Every second the log will include a block similar to:

LDPC Encoder:  avg_cb_size=... bits, avg_latency=... us, encode_rate=... Mbps
LDPC Decoder:  avg_cb_size=... bits, avg_latency=... us, avg_nof_iter=..., decode_rate=... Mbps
  ...
CPU usage: upper_phy_dl=...%, ldpc_enc=...%, ...
           upper_phy_ul=...%, ldpc_dec=...%, ...

The same fields populate for both the CPU-software and ACC100-HW paths, enabling side-by-side A/B comparison from a single log format.
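As a sketch of how that shared format can be mined offline: the file metrics_ab.log and its two sample lines are assumptions (latency and rate values borrowed from §7, cb size illustrative), and awk extracts one field per one-second window:

```shell
set -eu
# Two sample 'LDPC Decoder' windows: first AVX-512, then ACC100.
cat > metrics_ab.log <<'EOF'
LDPC Decoder:  avg_cb_size=3520 bits, avg_latency=51.6 us, avg_nof_iter=3.1, decode_rate=46.6 Mbps
LDPC Decoder:  avg_cb_size=3520 bits, avg_latency=24.5 us, avg_nof_iter=2.8, decode_rate=137.4 Mbps
EOF

# Extract avg_latency per window for a quick A/B diff.
awk -F'avg_latency=' '/LDPC Decoder/ { split($2, f, " "); print f[1] }' metrics_ab.log
# prints: 51.6 then 24.5
```

The same one-liner, pointed at two real logs, yields paired series ready for mean/max or percentile comparison.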

7. Results

7.1 Test environment

Item         | Value
-------------|----------------------------------------------------------------
Host         | Single-socket x86_64, 18 cores, AVX-512 capable
OS / DPDK    | Linux 5.15 / DPDK 25.11
pf_bb_config | 24.03
ACC100       | 1 PF + 1 VF bound to vfio-pci
OFH NIC      | iavf VF for 7.2-split RU
UE workload  | Real UE + iperf3 DL/UL + 2 min video streaming + 2 speed tests
Cell         | 100 MHz TDD, 30 kHz SCS, 4T4R, band n78, 256-QAM max

All measurements were taken with the same physical UE and RU; only the gNB binary differs between the two configurations (AVX-512 vs ACC100).

Figure: ACC100 percentage improvement over AVX-512 across 12 upper-PHY metrics (horizontal bars). LDPC decode_rate +174 %, LDPC max (mean) +74 %, LDPC avg (mean) +56 %, PDSCH proc_rate +46 %, PUSCH proc_rate +45 %, DL proc max_lat (tail) +42 %, PDSCH max_lat (tail) +39 %, PDSCH avg (tail) +38 %, PDSCH max_lat (mean) +26 %, DL proc max_lat (mean) +19 %, PDSCH avg (mean) +17 %, PUSCH avg (tail) +10 %. LDPC decode throughput leads the board.

7.2 End-to-end upper-PHY A/B

Mean / max over ~175 one-second metric windows each.

Metric                           | AVX-512 (mean / max) | ACC100 (mean / max) | ACC100 Δ
---------------------------------|----------------------|---------------------|---------------------------
DL processing max_latency        | 114.9 / 330.2 µs     | 99.7 / 206.4 µs     | −13 % mean, −37 % tail
PDSCH Processor avg_latency      | 44.0 / 101.7 µs      | 39.3 / 67.8 µs      | −11 % mean, −33 % tail
PDSCH Processor proc_rate        | 198.0 Mbps           | 310.1 Mbps          | +57 %
PDSCH Processor max_latency      | 96.5 / 309.8 µs      | 76.6 / 203.2 µs     | −21 % mean, −34 % tail
LDPC Decoder avg_latency         | 51.6 / 109.4 µs      | 24.5 / 50.1 µs      | −53 % (2.1× faster)
LDPC Decoder max_latency         | 145.9 / 457.7 µs     | 40.5 / 86.7 µs      | −72 % (3.6× better tails)
LDPC Decoder decode_rate         | 46.6 Mbps            | 137.4 Mbps          | +195 % (2.9×)
PUSCH Processor avg_data_latency | 231.9 / 761.5 µs     | 246.3 / 736.4 µs    | +6 % mean, −3 % tail
PUSCH Processor proc_rate        | 24.0 Mbps            | 37.4 Mbps           | +56 %
Figure: LDPC Decoder latency (lower is better). Grouped bars compare AVX-512 and ACC100 across avg_lat mean, avg_lat tail, max_lat mean, and max_lat tail. ACC100 is consistently lower; the headline is the tail-latency gap at max_lat tail, ~457 µs (AVX-512) vs ~87 µs (ACC100), 3.6× tighter.
Figure: PDSCH Processor latency (lower is better). Same grouping; ACC100 is lower across the board, most prominently at max_lat tail (~310 µs vs ~203 µs), compressing tail latency by roughly one third.
Figure: Throughput (higher is better). ACC100 leads on all three rates: PDSCH proc_rate ~310 Mbps vs ~198 Mbps, LDPC decode_rate ~137 Mbps vs ~47 Mbps, PUSCH proc_rate ~37 Mbps vs ~24 Mbps. LDPC decode rate nearly triples; PDSCH and PUSCH processor rates both rise by about half.
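The Δ figures are plain ratios over the two measurement columns; as a sanity check, the LDPC decode-rate delta from §7.2 reproduces as:

```shell
# (137.4 / 46.6 − 1) × 100 ≈ 195 %, matching the LDPC Decoder decode_rate row.
awk 'BEGIN { printf "+%.0f %%\n", (137.4 / 46.6 - 1) * 100 }'   # prints: +195 %
```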

7.3 CPU utilisation

Metric                  | AVX-512 (mean / max) | ACC100 (mean / max)
------------------------|----------------------|-------------------------
upper_phy_dl            | 3.63 % / 22.7 %      | 4.92 % / 33.9 %
upper_phy_ul            | 3.79 % / 28.9 %      | 2.36 % / 15.4 % (−38 %)
ldpc_rm (rate match)    | 1.30 % / 10.9 %      | 0.00 % (on accelerator)
ldpc_rdm (rate dematch) | 0.13 % / 1.0 %       | 0.00 % (on accelerator)

The uplink CPU reduction is the headline operational benefit: one core is freed on the upper-PHY under sustained traffic, allowing either higher cell counts on the same host or tighter scheduling-latency budgets.

Note on the DL rows: the per-CB ldpc_encoder_* fields on the HW path reflect batch wall-clock time rather than serialised per-CB compute time, because ops are submitted to the accelerator in bursts. For an apples-to-apples DL comparison, use the PDSCH Processor rows and upper_phy_dl; those are measured once per TB and are directly comparable.

8. Deployment checklist

  • Install or confirm DPDK ≥ 22.11 with ACC100 PMD.
  • Confirm pf_bb_config daemon is running; note its VFIO-VF token.
  • Bind the ACC100 VF(s) to vfio-pci.
  • Add the hal.bbdev_hwacc block to the DU YAML (see §4).
  • Build with ENABLE_PDSCH_HWACC=True and ENABLE_PUSCH_HWACC=True (see §5).
  • Verify startup log shows intel_acc100_vf and ldpc_enc_q=N ldpc_dec_q=N.
  • Enable metrics.layers.enable_du_low to observe the per-component LDPC metrics (§6).
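The binding item in the checklist can be verified from sysfs. A sketch, runnable here against a mocked sysfs tree (SYS, the BDF value, and the mock layout are assumptions; on a real host set SYS=/sys and BDF to the ACC100 VF's PCI address):

```shell
set -eu
SYS=./mock_sys
BDF=0000:51:00.1   # hypothetical VF address

# Build a minimal sysfs mock so the check can run anywhere.
mkdir -p "$SYS/bus/pci/drivers/vfio-pci" "$SYS/bus/pci/devices/$BDF"
ln -sfn ../../drivers/vfio-pci "$SYS/bus/pci/devices/$BDF/driver"

# The driver symlink's target name tells us what the VF is bound to.
drv=$(basename "$(readlink "$SYS/bus/pci/devices/$BDF/driver")")
if [ "$drv" = vfio-pci ]; then
  echo "$BDF bound to vfio-pci"
else
  echo "$BDF bound to $drv; rebind with: dpdk-devbind.py --bind=vfio-pci $BDF" >&2
  exit 1
fi
```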

4 - Deployment

Reference blueprints for real deployments from a single edge site to a multi-site operator footprint.

Coming soon. Deployment blueprints are being authored and validated in OCUDU India lab environments. Feature-specific deployment guidance that is ready sits under the relevant feature page; for example, Intel ACC100 LDPC offload has a full deployment checklist.

Planned content:

  • Single edge-site blueprint: one DU, one CU, one core simulator.
  • Multi-DU aggregation: one CU serving multiple DUs (F1 at scale).
  • Regional deployment: multiple edge sites + central CU-CP via Nephio GitOps.
  • Validated hardware profiles.
  • Timing and sync, observability, upgrade procedures.

5 - Contributing

How to file issues, propose changes, and submit pull requests to OCUDU India repositories.

Coming soon. The full contributor guide is being written to match the project onboarding flow. Until it lands, the Community page lists all the public channels (mailing list, Slack, TSC meeting) where you can get oriented.

Planned content:

  • Code of Conduct and DCO sign-off requirements.
  • How small changes, features, and substantial RFCs flow through review.
  • Coding conventions per language: Go, Python, C++, Helm/Kustomize.
  • CI expectations and how to run the test suites locally.
  • Getting help: mailing list, Slack, office hours.