Inline GPU Acceleration - PRACH and SRS

Release A inline GPU path: NIC writes uplink fronthaul packets directly into GPU VRAM via GPUDirect RDMA so PRACH preamble detection and SRS channel estimation run end-to-end on the GPU with no CPU-side sample copy. Mean PRACH detect latency drops ~3× vs the CPU AVX-512 path; SRS per-occasion cost falls toward 0.6 µs at 256-UE batches.


Move the two most timing-critical uplink-PHY functions - PRACH preamble detection and SRS channel estimation - onto an inline GPU data path where the NIC writes fronthaul packets directly into GPU VRAM via GPUDirect RDMA and the CPU is never in the data path. PRACH detect mean latency drops ~3× vs the CPU AVX-512 path (~108 µs vs ~325 µs in the live A/B of Section 7.1); SRS estimate per-occasion cost falls toward 0.6 µs at 256-UE batches (~22× faster than serial CPU) on a commodity NVIDIA RTX A4000.

Highlights

Zero host-to-device copy. Packets land in a VRAM-backed mbuf pool the NIC DMA-maps directly. BFP9 decompression is the first GPU kernel, no longer a CPU step. Only a 48-byte header is peeked by the host poll thread to classify (PRACH / SRS / other); the body never leaves VRAM.
Two-queue NIC split. A duplicate rte_flow rule directs PRACH/SRS UL packets to NIC queue 1 (GPU VRAM) while PUSCH/PUCCH and everything else stays on queue 0 (CPU RAM). Both paths coexist in one binary; YAML flags pick which runs.
One batched CUDA graph per detection/occasion. PRACH detect is a single graph launch + sync (≈49 µs microbench, ~108 µs system A/B). SRS estimate batches up to 256 UEs into one captured graph; per-UE GPU cost falls from 44 µs at N=1 to 0.59 µs at N=256.
Lock-free MAC ↔ GPU coordination. A 64-slot srs_schedule_tap ring lets the GPU listener identify SRS packets by (slot, symbol) against the MAC-published schedule; a paired srs_result_tap ring feeds the GPU’s estimate back into the upper PHY’s process_srs, which attaches RNTI and suppresses the redundant CPU estimate.
Numerically equivalent to the CPU baseline. PRACH inline detector reports metric=47.824 on the canonical FR1-TDD B4 config vs 47.825 from the CPU reference (diff < 1e-3 relative, well inside bfloat16 quantization). SRS channel matrix / EPRE / RSRP match exactly; noise variance agrees to five significant figures.
Commodity workstation GPU. Validated on NVIDIA RTX A4000 (sm_86, 16 GiB VRAM). Scales unchanged to A30 / A100 / H100; PCIe bandwidth ceases to matter because samples never traverse it.
Transparent CPU fallback. Drop prach_rx_to_gpu/srs_rx_to_gpu from YAML - the same gNB binary runs the upstream CPU detector / estimator with no rebuild and no code path divergence.

1. Prerequisites

1.1 Hardware

NVIDIA GPU, Ampere or later (sm_80+). Validated on RTX A4000 (sm_86, 48 SMs, 16 GiB VRAM, PCIe Gen4 x16). Older Turing (sm_75) should work but is unvalidated.
NIC with GPUDirect RDMA support. Validated on Mellanox/NVIDIA ConnectX-5 (mlx5_pci driver) at 25/100 Gbps. The duplicate-rule trick (host-queue copy alongside the GPU-steered queue) is mlx5-specific; on other DPDK PMDs the duplicate rule may be rejected and non-PRACH eCPRI will be silently dropped by the GPU listener - see Section 8.2.
GPU on the same NUMA node as the upper-PHY worker cores. Cross-NUMA PCIe adds tail-latency variance.
x86-64 CPU with AVX2 minimum (AVX-512 used by the CPU PRACH detector when available).
≥ 1 GiB free VRAM for the inline mempool + per-sector PRACH/SRS buffers. Worst-case at 273 PRB / 4 RX is well under 256 MiB.

1.2 Software

Component	Minimum	Notes
CUDA Toolkit	12.0	`nvcc`, `cuda_runtime.h`, `cufft.h`. Validated on 12.6 and 13.2.
NVIDIA driver	525+	Must match or exceed the CUDA toolkit version.
DPDK	22.11	Built with `gpu/cuda` driver + `gpudev` library enabled.
`libbsd-dev`, `libnuma-dev`	system	Transitive DPDK pkg-config deps.
Linux kernel	5.15+	Standard Ubuntu 22.04 LTS suffices.
CMake	3.18+	`CUDA_SEPARABLE_COMPILATION` is used by the inline-detector lib.
OCUDU build flags	`ENABLE_GPU_FRONTHAUL=ON`	See Section 5.

The inline path needs no MathDx / cuFFTDx (those are for the earlier “GPU-full” PRACH path). Only standard cuFFT is required.

1.3 Verify driver + CUDA

# Driver loaded, GPU visible:
nvidia-smi
# Expected: NVIDIA-SMI / Driver Version ≥ 525, CUDA Version ≥ 12.0

# CUDA toolkit:
nvcc --version
# Expected: Cuda compilation tools, release 12.x or newer

# cuFFT present:
ls /usr/local/cuda/lib64/libcufft*

# DPDK pkg-config visible:
PKG_CONFIG_PATH=/opt/mellanox/dpdk/lib/x86_64-linux-gnu/pkgconfig \
  pkg-config --modversion libdpdk
# Expected: 22.11 or newer

If libbsd.pc / libnuma.pc are missing, install them:

sudo apt install libbsd-dev libnuma-dev

1.4 GPU persistence and clocks (recommended for production)

# Persistence mode to avoid cold-init on first CUDA call:
sudo nvidia-smi -pm 1

# Lock SM and memory clocks for deterministic latency (RTX A4000 example):
sudo nvidia-smi -lgc 1560
sudo nvidia-smi -lmc 7000

# Confirm:
nvidia-smi -q -d CLOCK | head -20

1.5 Hugepages + DPDK runtime dirs

# 1 GB hugepages for the GPU mempool backing memory:
echo 4 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
sudo mkdir -p /mnt/huge1G
sudo mount -t hugetlbfs -o pagesize=1G none /mnt/huge1G

# Verify:
grep Huge /proc/meminfo

2. Architecture overview

2.1 The two-queue NIC split

                       NIC RX
                       ╱        ╲
             queue 0 (CPU RAM)   queue 1 (GPU VRAM)
             PUSCH / PUCCH /      PRACH + SRS UL
             everything else      (accelerated, inline)
             → stock CPU path     → gpu_listener → inline pipelines

At sector start-up, OCUDU allocates a contiguous VRAM region per sector, calls rte_extmem_register + rte_dev_dma_map to install it in the NIC’s IOMMU domain, and wraps it as an mbuf pool. Two rte_flow rules then steer:

Queue 1 rule - matches eCPRI Ethertype 0xAEFE over VLAN 0x8100, steers the matched packets into the VRAM-backed mempool.
Queue 0 duplicate rule - installs the same match steered to queue 0, so PUSCH/PUCCH and everything else still reaches the CPU’s OFH receiver. On mlx5 this duplicate rule is accepted; on PMDs that reject it, all non-PRACH UL is silently lost (the path is dev-only without the duplicate).

A dedicated CPU thread - the GPU listener - polls queue 1, peeks each frame’s 48-byte header, classifies it (PRACH eAxC, SRS via srs_schedule_tap, or “other”), and dispatches the packet to the matching aggregator. The mbuf body never leaves VRAM.
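
The header peek can be pictured with a small Python model. This is a sketch under stated assumptions, not the listener's actual code: it assumes a single 802.1Q tag, the standard eCPRI type-0 header, the O-RAN U-plane timing header, and it treats the whole 16-bit ecpriPcid as the eAxC id. The function name `classify` and its arguments are hypothetical.

```python
import struct

ETH_VLAN_LEN = 18          # 14-byte Ethernet header + 4-byte 802.1Q tag
VLAN_TPID = 0x8100
ECPRI_ETHERTYPE = 0xAEFE
ECPRI_IQ_DATA = 0x00       # eCPRI message type 0: IQ data


def classify(header, prach_eaxcs, is_srs_symbol):
    """Classify one UL frame from its peeked header: 'prach', 'srs' or 'other'."""
    if len(header) < ETH_VLAN_LEN + 12:
        return "other"
    tpid, _tci, ethertype = struct.unpack_from("!HHH", header, 12)
    if tpid != VLAN_TPID or ethertype != ECPRI_ETHERTYPE:
        return "other"
    # eCPRI common header (4 bytes), then ecpriPcid (eAxC) and ecpriSeqid.
    _rev, msg_type, _size, eaxc, _seq = struct.unpack_from(
        "!BBHHH", header, ETH_VLAN_LEN)
    if msg_type != ECPRI_IQ_DATA:
        return "other"
    # U-plane timing header: direction/filter byte, frame id, then
    # subframe (4 bits) | slot (6 bits) | symbol (6 bits).
    _dir_filter, frame, sss = struct.unpack_from(
        "!BBH", header, ETH_VLAN_LEN + 8)
    subframe, slot, symbol = sss >> 12, (sss >> 6) & 0x3F, sss & 0x3F
    if eaxc in prach_eaxcs:
        return "prach"
    slot_id = ((frame & 0xFF) << 16) | (subframe << 8) | slot
    return "srs" if is_srs_symbol(slot_id, symbol) else "other"
```

Only header bytes are read here, which mirrors the point above: classification needs nothing from the IQ payload, so the body can stay in VRAM.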

2.2 Inline vs the earlier offload paths

Path	Sample lives in	First-stage cost	Stages on GPU
CPU-only (default)	CPU RAM	BFP9 decompress on CPU	none
CUDA-graph PRACH (`OCUDU_PRACH_DFT_BACKEND=gpu_full`)	CPU RAM, then H2D copy	CPU BFP9 + PCIe H2D	correlation + IDFT + peak
Inline (this doc) - `prach_rx_to_gpu: true`	GPU VRAM (NIC→VRAM DMA)	GPU BFP9 kernel	all of detect / estimate

The inline path removes the PCIe sample copy entirely; only the result (few hundred bytes per detection) crosses the bus, D2H, after the work.

3. PRACH inline pipeline

3.1 Aggregator

PRACH packets arrive across multiple eAxC streams and symbols within a slot. The aggregator (inline_prach_pipeline) maintains a per-(slot_id, eAxC) bucket ring (default depth slot_ring_depth = 4, configurable). When all nof_prach_eaxc × nof_prach_symbols packets for an occasion have arrived, the aggregator fires the BFP-decompress kernel + the inline detector and invokes on_result_cb with the detection result. Stale slots are evicted with an empty result so a lost packet can’t wedge the pipeline.
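
The aggregation rule can be sketched in a few lines of Python. This is a behavioural model only: the class name `PrachAggregator`, the `detect` placeholder and the use of an ordered dictionary in place of the real bucket ring are all assumptions made for illustration.

```python
from collections import OrderedDict


class PrachAggregator:
    """Collect PRACH packets per slot; fire when an occasion is complete."""

    def __init__(self, nof_eaxc, nof_symbols, on_result, ring_depth=4):
        self.expected = nof_eaxc * nof_symbols
        self.on_result = on_result
        self.ring_depth = ring_depth
        self.buckets = OrderedDict()   # slot_id -> {(eaxc, symbol): packet}

    def push(self, slot_id, eaxc, symbol, packet):
        bucket = self.buckets.get(slot_id)
        if bucket is None:
            # Ring full: evict the oldest slot and report an empty result,
            # so a lost packet cannot block later occasions.
            if len(self.buckets) == self.ring_depth:
                stale_id, _ = self.buckets.popitem(last=False)
                self.on_result(stale_id, None)
            bucket = self.buckets[slot_id] = {}
        bucket[(eaxc, symbol)] = packet
        if len(bucket) == self.expected:
            del self.buckets[slot_id]
            self.on_result(slot_id, self.detect(bucket))

    def detect(self, bucket):
        # Placeholder for the BFP-decompress kernel + inline detector.
        return {"nof_packets": len(bucket)}
```

With `nof_eaxc=4` and one PRACH symbol, the callback fires on the fourth packet of a slot; a slot that never completes is reported with `None` once `ring_depth` newer slots have opened.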

3.2 Kernel chain

GPU VRAM samples
   ▼ BFP9 decompress + PRACH-band RE extract        (k_bfp_decompress_re_demap)
   ▼ convert cbf16 → cf32, non-coherent sum across symbols
   ▼ correlate: received · conjugate(reference Zadoff-Chu)
   ▼ cuFFT IDFT (1024-pt long / 256-pt short, batched plan)
   ▼ magnitude² + scale, non-coherent antenna combine
   ▼ per-shift window reduction → GLRT, argmax + threshold
   ▼ result: { delay[], detected[], metric[], power[] }
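
For readers unfamiliar with the first stage of the chain above, here is a CPU-side Python sketch of block-floating-point decompression for one PRB. It assumes the O-RAN BFP layout (one exponent byte followed by 12 REs of I/Q mantissas packed MSB-first) and returns unscaled integer-valued samples; the function name is hypothetical and the GPU kernel's output scaling is not shown in this document.

```python
def bfp_decompress_prb(payload, iq_width=9):
    """Decompress one PRB: 1 exponent byte + 12 REs x (I, Q) mantissas."""
    nbytes = (24 * iq_width + 7) // 8            # 27 bytes for BFP9
    if len(payload) < 1 + nbytes:
        raise ValueError("short PRB payload")
    exponent = payload[0] & 0x0F                 # low nibble holds the exponent
    bits = int.from_bytes(payload[1:1 + nbytes], "big")
    pad = nbytes * 8 - 24 * iq_width             # trailing pad bits, 0 for BFP9
    mask = (1 << iq_width) - 1
    sign_bit = 1 << (iq_width - 1)
    values = []
    for k in range(24):
        shift = pad + (23 - k) * iq_width        # mantissa k, MSB-first
        m = (bits >> shift) & mask
        if m & sign_bit:                         # two's-complement sign extend
            m -= 1 << iq_width
        values.append(m * (1 << exponent))       # apply the shared exponent
    return [complex(values[2 * i], values[2 * i + 1]) for i in range(12)]
```

At 9 bits per component a PRB occupies 28 bytes on the wire, which is why decompressing on the GPU keeps the unpacked samples off the PCIe bus entirely.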

Reference Zadoff-Chu sequences are generated once on the CPU at pipeline init using the same prach_generator_factory_sw the CPU detector uses, then cached in pinned host memory. The cache is invalidated only on cell-config change (format / prach_root_sequence_index / restricted-set / ZCZ); on a stable single-cell gNB this never re-evaluates after init.
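
The correlate, IDFT and argmax stages can be illustrated with a pure-Python toy. It is a sketch under simplifying assumptions: a single antenna and symbol, a naive unpadded DFT of prime length 139 instead of the padded cuFFT plan, and a plain peak-to-mean power ratio standing in for the GLRT metric. All function names are hypothetical.

```python
import cmath


def zadoff_chu(u, length):
    """Root-u Zadoff-Chu sequence of odd prime length."""
    return [cmath.exp(-1j * cmath.pi * ((u * n * (n + 1)) % (2 * length)) / length)
            for n in range(length)]


def dft(x, inverse=False):
    """Naive O(N^2) DFT; fine for a 139-point illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    w = [cmath.exp(sign * 2j * cmath.pi * k / n) for k in range(n)]
    out = [sum(x[t] * w[(f * t) % n] for t in range(n)) for f in range(n)]
    return [v / n for v in out] if inverse else out


def detect(received, reference, threshold=10.0):
    """Correlate in the frequency domain, IDFT, then argmax + threshold."""
    rx_f, ref_f = dft(received), dft(reference)
    corr = dft([y * x.conjugate() for y, x in zip(rx_f, ref_f)], inverse=True)
    power = [abs(c) ** 2 for c in corr]
    peak = max(range(len(power)), key=power.__getitem__)
    noise = (sum(power) - power[peak]) / (len(power) - 1)
    metric = power[peak] / max(noise, 1e-12)
    return {"delay": peak, "detected": metric > threshold, "metric": metric}
```

A reference delayed by five samples produces a peak at lag 5, while a sequence generated from a different root correlates flat and stays below the threshold.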

3.3 One graph launch per detection

The pipeline issues a single batched cuFFT execution + the surrounding kernel chain into one CUDA stream and does exactly one cudaStreamSynchronize. There is no per-sequence loop - all nof_sequences correlations and IDFTs are batched into one launch. Microbench detect() mean dropped from 1869 µs → 49 µs (~38×) after this collapse (status doc, stage 6F).

4. SRS inline pipeline

SRS mirrors PRACH with three structural differences that drove the design:

4.1 Schedule tap (MAC → listener)

SRS shares its uplink eAxC with PUSCH/PUCCH; there is no dedicated stream, so the GPU listener cannot tell an SRS symbol from a regular UL symbol by looking at the packet alone. The MAC’s FAPI-to-PHY fastpath translator publishes each scheduled SRS occasion to a lock-free 64-entry ring (srs_schedule_tap) keyed on a packed slot id, ((sfn & 0xFF) << 16) | (subframe << 8) | slot. The listener calls srs_schedule_tap::is_srs_symbol(slot_id, sym_idx) per packet; on a hit it routes the packet to the SRS aggregator, otherwise the packet stays on the “other” path.
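
A single-threaded Python model of the schedule tap makes the key packing and the hit test concrete. Everything beyond the packing formula is an assumption for illustration: the ring-index rule, the per-slot symbol bitmask and the `publish` method name are hypothetical, and the real ring is lock-free, which this sketch does not attempt to reproduce.

```python
RING_SIZE = 64


def pack_slot_id(sfn, subframe, slot):
    """Packed key: ((sfn & 0xFF) << 16) | (subframe << 8) | slot."""
    return ((sfn & 0xFF) << 16) | (subframe << 8) | slot


class SrsScheduleTap:
    def __init__(self, slots_per_subframe=2):      # 2 slots/subframe at 30 kHz
        self.slots_per_subframe = slots_per_subframe
        self.keys = [None] * RING_SIZE
        self.symbol_masks = [0] * RING_SIZE

    def _index(self, slot_id):
        # Assumed indexing: linear slot count modulo the ring size.
        sfn, subframe, slot = slot_id >> 16, (slot_id >> 8) & 0xFF, slot_id & 0xFF
        return ((sfn * 10 + subframe) * self.slots_per_subframe + slot) % RING_SIZE

    def publish(self, slot_id, symbols):
        """MAC side: record which symbols of this slot carry SRS."""
        i = self._index(slot_id)
        self.symbol_masks[i] = sum(1 << s for s in set(symbols))
        self.keys[i] = slot_id

    def is_srs_symbol(self, slot_id, sym_idx):
        """Listener side: hit only if the entry belongs to this exact slot."""
        i = self._index(slot_id)
        return self.keys[i] == slot_id and bool((self.symbol_masks[i] >> sym_idx) & 1)
```

Storing the full key in each entry is what lets the listener reject a stale entry left over from an earlier slot that happened to map to the same ring position.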

4.2 Estimator kernel chain

For each SRS occasion (all RX ports batched together):

GPU VRAM samples
   ▼ BFP9 decompress + UL-band RE extract           (srs_inline_bfp_kernel)
   ▼ least-squares estimate with noise accumulation
   ▼ correlate against reference sequence
   ▼ cuFFT IDFT for the timing-advance peak (batched plan)
   ▼ fractional-tap peak fit → TA in ns
   ▼ per-RE phase compensation (signal + noise)
   ▼ wideband coefficient, signal subtraction, noise variance accumulation
   ▼ result: { channel_matrix, EPRE, RSRP, noise_var, time_alignment }

Phase-compensation kernels use the same 1024-entry quantized exponential table as the CPU estimator. This is why noise_var agrees with the CPU reference to five significant figures even on steep per-RE phase ramps.
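
The effect of a shared quantized table can be shown with a short Python sketch. The rounding rule (nearest table entry) and the function names are assumptions; only the 1024-entry table size comes from the text above.

```python
import cmath

TABLE_SIZE = 1024
EXP_TABLE = [cmath.exp(2j * cmath.pi * k / TABLE_SIZE) for k in range(TABLE_SIZE)]


def quantized_expj(phase_cycles):
    """exp(j*2*pi*phase) from the table; phase is given in cycles."""
    return EXP_TABLE[round(phase_cycles * TABLE_SIZE) % TABLE_SIZE]


def compensate(samples, slope_cycles_per_re):
    """Remove a linear per-RE phase ramp using table lookups only."""
    return [s * quantized_expj(-slope_cycles_per_re * k)
            for k, s in enumerate(samples)]
```

Because both backends read the same table entries, any quantization error in the phase ramp is identical on CPU and GPU, so it cancels when the two noise estimates are compared.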

4.3 Result tap (GPU → upper PHY)

The GPU produces channel_matrix + scalars but does not know the UE’s RNTI - that lives only in the MAC-scheduled srs_pdu. So the pipeline’s on_result_cb publishes the result to a paired lock-free 64-entry srs_result_tap ring keyed on the same slot id. The upper PHY’s process_srs worker, when it runs for that slot, calls srs_result_tap::consume(...). On hit, it attaches the MAC PDU’s RNTI to the GPU result and emits on_new_srs_results to L2 - and skips the CPU srs_estimator::estimate(...). On miss, it runs the CPU estimator as a fallback. Same callback either way; L2 cannot tell which backend produced the estimate.
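
The hit/miss behaviour can be modelled in Python as follows. This is an illustrative sketch, not the upper-PHY code: the ring-index rule, the dictionary-shaped result and the `cpu_estimate` callable are hypothetical, and the second return value exists only so the example can show which branch ran.

```python
def ring_index(slot_id, size=64, slots_per_subframe=2):
    # Assumed indexing: linear slot count modulo the ring size.
    sfn, subframe, slot = slot_id >> 16, (slot_id >> 8) & 0xFF, slot_id & 0xFF
    return ((sfn * 10 + subframe) * slots_per_subframe + slot) % size


class SrsResultTap:
    def __init__(self):
        self.entries = [None] * 64               # each entry: (slot_id, result)

    def publish(self, slot_id, result):
        """GPU side: called from the pipeline's result callback."""
        self.entries[ring_index(slot_id)] = (slot_id, result)

    def consume(self, slot_id):
        """Upper-PHY side: return and clear the result, or None on a miss."""
        i = ring_index(slot_id)
        entry = self.entries[i]
        if entry is None or entry[0] != slot_id:
            return None
        self.entries[i] = None
        return entry[1]


def process_srs(tap, slot_id, rnti, cpu_estimate):
    """Attach the RNTI to the GPU result on a hit; else run the CPU estimator."""
    result = tap.consume(slot_id)
    used_gpu = result is not None
    if not used_gpu:
        result = cpu_estimate()
    return dict(result, rnti=rnti), used_gpu
```

The report handed upward has the same shape on both branches, which matches the statement above that L2 sees one callback regardless of backend.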

4.4 Batched graph for many UEs

When the cell has multiple concurrent SRS UEs in the same occasion, the inline pipeline batches them into a single captured CUDA graph (build_batch_graph + run_batch_graph). N occasions → one launch → one sync. This is why per-UE GPU cost falls as the cell fills (one fixed dispatch cost amortized across N).

5. Build

5.1 GPU-inline build (PRACH + SRS on the inline path)

cd ~/ocudu                           # or wherever your checkout lives
rm -rf build_72
mkdir build_72 && cd build_72

cmake .. \
  -DDU_SPLIT_TYPE=SPLIT_7_2 \
  -DENABLE_DPDK=True \
  -DENABLE_GPU_FRONTHAUL=ON \
  -DASSERT_LEVEL=MINIMAL \
  -DENABLE_UHD=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86

make gnb ru_emulator srs_inline_benchmark -j$(nproc)

Key flags:

ENABLE_GPU_FRONTHAUL=ON - compiles the inline-GPU PRACH + SRS pipelines, the GPU mempool, the duplicate-flow steering rules, and the GPU listener thread.
DU_SPLIT_TYPE=SPLIT_7_2 - the inline path is split-7.2 OFH only.
ENABLE_DPDK=True - required; the inline path is DPDK-backed.
CMAKE_CUDA_ARCHITECTURES=86 - change to 89 for Ada (RTX 4090), 80 for A100, 90 for H100. sm_86 (RTX A4000) is the only architecture validated so far.
ASSERT_LEVEL=MINIMAL - removes hot-path asserts that distort measurement; correct for production / benchmark builds.

ru_emulator and srs_inline_benchmark are optional but useful: the former drives synthetic UL traffic for closed-loop A/B testing; the latter is a standalone microbenchmark of CPU vs GPU SRS estimation.

5.2 CPU-only build (no GPU dependencies)

The same source tree builds a CPU-only gnb. Drop the GPU flag; CUDA and DPDK-gpudev become unnecessary:

cd ~/ocudu
rm -rf build_cpu
mkdir build_cpu && cd build_cpu

cmake .. \
  -DDU_SPLIT_TYPE=SPLIT_7_2 \
  -DENABLE_DPDK=True \
  -DENABLE_GPU_FRONTHAUL=OFF \
  -DASSERT_LEVEL=MINIMAL \
  -DENABLE_UHD=OFF \
  -DCMAKE_BUILD_TYPE=Release

make gnb -j$(nproc)

If DPDK isn’t needed either (socket-based RU), drop -DENABLE_DPDK=True too.

5.3 Single binary, runtime backend selection

The GPU-inline build also fully supports CPU-only execution - load a YAML without prach_rx_to_gpu / srs_rx_to_gpu and the same gNB runs the upstream CPU detector / estimator. There is no rebuild between modes; A/B is a YAML flag and a restart.

6. Configuration

6.1 Enable GPU-inline PRACH and SRS (YAML)

Per-cell, under the OFH cells: block:

ru_ofh:
  cells:
    - network_interface: 0000:53:00.0
      ru_mac_addr:  b8:3f:d2:b6:ff:e2
      du_mac_addr:  b8:3f:d2:b6:ff:e3
      enable_promiscuous: true
      dl_port_id:    [0, 1, 2, 3]
      ul_port_id:    [0, 1, 2, 3]
      prach_port_id: [4, 5, 6, 7]
      # ---- inline GPU switches ----
      prach_rx_to_gpu: true        # NIC→VRAM + inline PRACH detect
      srs_rx_to_gpu:   true        # SRS classify + inline channel estimate

Notes:

srs_rx_to_gpu requires prach_rx_to_gpu: true - the SRS classifier shares the listener thread and duplicate-rule steering created for PRACH.
The full PRACH config (prach_config_index, zero_correlation_zone, prach_root_sequence_index, restricted_set) is read from the standard cell_cfg.prach block and wired into the inline pipeline at startup. No inline-specific PRACH knobs.
Drop both flags (or set to false) to run the CPU-only path on the same binary.

6.2 SRS schedule + cell config

The inline SRS pipeline reaches into cell_cfg.srs: for the resource shape. Minimum block:

cell_cfg:
  srs:
    type_enabled:       periodic
    period_ms:          20
    max_nof_sym_per_slot: 2
    nof_sym_per_resource: 1
    tx_comb:            4
    cyclic_shift_reuse: 1
    sequence_id_reuse:  1

6.3 DPDK / EAL knobs

The GPU mempool backing memory is allocated from 1-GiB hugepages. Your dpdk.eal_args must include a matching --huge-dir and a per-binary --file-prefix to keep state from clashing with other DPDK processes:

dpdk:
  eal_args: "--lcores (0-1)@(0,2) -a 0000:53:00.0 --iova-mode=pa --huge-dir /mnt/huge1G --file-prefix=gnb"

6.4 Runtime environment variables

Variable	Purpose
`OCUDU_PRACH_BENCH=1`	Unsuppress periodic GPU listener / detector stats lines (`[gpu_listener] …`, `[prach_detector_inline] stats: …`, `[prach_detector_cpu] stats: …`). Default off - interactive gnb stays clean. Set by `scripts/prach_ab_benchmark.sh` automatically.
`OCUDU_SRS_GPU_FORCE=<sf>:<slot>:<sym>`	Force the GPU listener to treat all packets at the given `(subframe, slot_in_sf, symbol)` triple as SRS, regardless of `srs_schedule_tap`. Used for UE-less rig validation where the MAC never publishes a real SRS PDU.
`OCUDU_SRS_GPU_RES=csrs:bsrs:fpos:fshift:comb:coff:seqid:nsym:nrx`	When set together with `OCUDU_SRS_GPU_FORCE`, publishes a synthetic SRS resource configuration matching the captured-IQ payload. Lets the inline estimator run against a `ru_emulator` SRS replay file with no UE attached.
`OCUDU_PRACH_DFT_BACKEND`	Selects the earlier CUDA-graph PRACH backend (`gpu_full`) or its variants. Leave unset / `cpu` when running the inline path - they don’t compose.
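
To make the field order of the two SRS override variables explicit, here is a small Python parser. It is a sketch only: it assumes every field is a decimal integer, and the function name and returned dictionary shape are hypothetical rather than part of the gNB.

```python
import os

RES_FIELDS = ("csrs", "bsrs", "fpos", "fshift", "comb",
              "coff", "seqid", "nsym", "nrx")


def parse_srs_force(env=None):
    """Parse OCUDU_SRS_GPU_FORCE / OCUDU_SRS_GPU_RES into a dictionary."""
    env = os.environ if env is None else env
    force = env.get("OCUDU_SRS_GPU_FORCE")
    if not force:
        return None                               # override not requested
    parts = force.split(":")
    if len(parts) != 3:
        raise ValueError("expected <sf>:<slot>:<sym>")
    subframe, slot, symbol = (int(p) for p in parts)
    out = {"subframe": subframe, "slot": slot, "symbol": symbol}
    res = env.get("OCUDU_SRS_GPU_RES")
    if res:
        values = res.split(":")
        if len(values) != len(RES_FIELDS):
            raise ValueError("expected %d resource fields" % len(RES_FIELDS))
        out["resource"] = dict(zip(RES_FIELDS, (int(v) for v in values)))
    return out
```

For example, `OCUDU_SRS_GPU_RES="61:0:0:0:4:0:1:1:4"` maps to csrs=61, comb=4, seqid=1, nsym=1 and nrx=4, with the remaining fields zero.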

6.5 Validating the configuration at startup

Once the gnb is launched, look for these lines in the log (green-highlighted on a TTY):

[ofh_factories] Sector#0 inline GPU PRACH pipeline active
                (nof_prach_eaxc=4 nof_rx_ports=4 prach_compr=BFP9 iq_offset=36)
[ofh_factories] Sector#0 inline GPU SRS pipeline active
                (nof_rx_ports=4 ul_eaxcs=4 cell_prbs=273 max_seq=1638 ul_compr=BFP9)

Plus, once traffic starts:

[gpu_listener] thread started: port=0 queue=1 prach_eaxcs_count=4
[gpu_dpdk_mempool] dma_map(port=0) ok — VRAM=0x… len=… now visible to NIC IOMMU

If the dma_map(port=0) ok line is missing, see Section 8.1.

7. Performance

7.1 PRACH - live A/B on the gNB

Captured via scripts/prach_ab_benchmark.sh. ru_emulator drives synthetic PRACH occasions; the gnb is restarted between A and B, flipping only prach_rx_to_gpu. ZCZ=0, 4 RX ports, BFP9.

Metric	Inline GPU	Default CPU
Mean detector latency	~108 µs	~325 µs
Speedup	~3.0×	(baseline)
Latency floor (min)	44 µs	320 µs
Tail (max)	2.5 ms (rare)	770 µs

The GPU’s floor (44 µs) sits well below the CPU’s best case (320 µs); the rare ~2.5 ms outlier is a sync-stall against another CUDA workload - worth monitoring under sustained multi-tenant GPU use (status doc, stage 6F).

7.2 SRS - standalone microbench

apps/examples/srs_inline_benchmark/srs_inline_benchmark builds and replays a captured CUDA graph for N concurrent SRS occasions. 4 RX ports, 1 Tx, 1 symbol, comb=4, varying allocation width.

Throughput sweep at 64 PRB (seq_len=192):

#UEs	CPU /occ	GPU graph/occ	Speedup
1	13.5 µs	44.1 µs	0.31×
4	13.5 µs	11.0 µs	1.23×
16	13.5 µs	2.7 µs	5.06×
64	13.5 µs	0.84 µs	16.0×
128	13.5 µs	0.69 µs	19.6×
256	13.5 µs	0.59 µs	22.8×

Crossover by allocation width:

Allocation	GPU overtakes CPU at	Peak speedup
4 PRB (seq 12)	~16 UEs	10.8× @ 256 UEs
64 PRB (seq 192)	~4 UEs	22.8× @ 256 UEs
184 PRB (seq 552)	~2 UEs	32.7× @ 128 UEs

Crossover drops as allocation grows because each occasion carries more parallel work for the GPU.
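
The speedup and crossover columns follow directly from the per-occasion timings. The Python sketch below recomputes them for the 64 PRB sweep using the rounded values from the table, so results can differ from the published figures in the last digit; the helper names are hypothetical.

```python
CPU_US_PER_OCC = 13.5
GPU_US_PER_OCC = {1: 44.1, 4: 11.0, 16: 2.7, 64: 0.84, 128: 0.69, 256: 0.59}


def speedups(cpu_us, gpu_us):
    """CPU/GPU per-occasion ratio for each batch size."""
    return {n: cpu_us / g for n, g in sorted(gpu_us.items())}


def crossover(cpu_us, gpu_us):
    """Smallest measured batch size at which the GPU beats the CPU."""
    return next((n for n, g in sorted(gpu_us.items()) if g < cpu_us), None)


def batch_totals(gpu_us):
    """Whole-batch GPU time: per-occasion cost times batch size."""
    return {n: n * g for n, g in sorted(gpu_us.items())}
```

The batch totals stay near 44 µs up to 16 UEs and only reach about 54 µs at 64 UEs, which is the amortization effect in numbers: the fixed launch cost dominates until the batch is large.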

7.3 CPU time freed per slot (offload metric)

CPU serial = host runs N back-to-back srs_estimator::estimate() calls. GPU dispatch = host submits the captured graph asynchronously; GPU executes in parallel. Slot budget = 500 µs (30 kHz TDD).

#UEs	CPU serial	GPU dispatch (CPU)	Offload	GPU walltime	Fits 500 µs slot
1	13.5 µs	9.8 µs	1.4×	40 µs	✓
8	107.7 µs	9.9 µs	10.9×	43 µs	✓
16	215.4 µs	9.8 µs	21.9×	45 µs	✓
32	430.9 µs	9.8 µs	43.8×	50 µs	✓
64	861.7 µs	10.0 µs	85.9×	54 µs	✓

At 64 UEs the CPU-serial path (862 µs) has already overrun the slot budget; the inline GPU finishes in 54 µs (~11 % of the budget) and the CPU thread is free for the remaining 446 µs to handle PUSCH / PUCCH / L2.
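
The slot-budget arithmetic can be checked with a few lines of Python. The rows are the rounded values from the table above, so the recomputed offload ratios may differ slightly from the published column; the function name and output keys are hypothetical.

```python
SLOT_BUDGET_US = 500.0     # one slot at 30 kHz subcarrier spacing

# (UEs, CPU serial us, GPU dispatch us on the CPU, GPU walltime us)
ROWS = [
    (1, 13.5, 9.8, 40.0),
    (8, 107.7, 9.9, 43.0),
    (16, 215.4, 9.8, 45.0),
    (32, 430.9, 9.8, 50.0),
    (64, 861.7, 10.0, 54.0),
]


def summarize(rows, budget_us=SLOT_BUDGET_US):
    """Offload ratio and slot-budget fit for each row."""
    out = []
    for ues, cpu_serial, gpu_dispatch, gpu_wall in rows:
        out.append({
            "ues": ues,
            "offload": cpu_serial / gpu_dispatch,      # CPU time saved, as a ratio
            "cpu_fits": cpu_serial <= budget_us,
            "gpu_fits": gpu_wall <= budget_us,
            "gpu_budget_pct": 100.0 * gpu_wall / budget_us,
        })
    return out
```

Only the 64-UE row fails `cpu_fits`, while every row passes `gpu_fits` with the GPU using roughly a tenth of the slot.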

7.4 Correctness

PRACH - inline detector reports metric=47.824 on the canonical FR1-TDD B4 config vs the CPU reference’s 47.825 (diff < 1e-3 relative, well inside bfloat16 quantization). Threshold-table and preamble-emission semantics are identical to the canonical CPU path (status doc, stage 6D).
SRS - channel_matrix.frobenius_norm(), EPRE, RSRP match the CPU reference exactly. noise_variance agrees to five significant figures even for steep per-RE phase ramps.

8. Troubleshooting

8.1 NIC delivers nothing - `rx_packets_phy` increments, `rx_packets` stays zero

Quiet failure mode. The wire shows packets arriving but DPDK delivers none - because the VRAM-to-NIC DMA mapping silently failed and every queue-1 packet is being dropped by the NIC’s IOMMU. Confirm by:

# Should see this in the gnb startup log:
grep -E 'dma_map\(port=' /tmp/gnb*.log
# Expected: [gpu_dpdk_mempool] dma_map(port=0) ok — VRAM=0x… len=… …

If missing, the most common causes are:

DPDK built without gpudev (rte_gpu_count_avail() == 0).
GPU not on the same IOMMU group as the NIC (cross-NUMA without iommu passthrough).
nvidia-peermem / gdrdrv kernel module not loaded - needed by GPUDirect for VRAM ↔ IOMMU mapping.

lsmod | grep -E 'nvidia_peermem|gdrdrv'
sudo modprobe nvidia-peermem

8.2 Duplicate host-queue rule rejected by mlx5

If you see host-queue duplicate rule REJECTED by mlx5 in the startup log, the NIC firmware refused to install the queue-0 mirror of the queue-1 steering rule. Effect: non-PRACH eCPRI (PUSCH/PUCCH/SRS) will be silently dropped by the GPU listener, because they no longer reach the CPU OFH receiver. Fix is firmware-side (newer mlx5 firmware accepts the duplicate); until then, prach_rx_to_gpu: true is dev-only on this NIC.

8.3 GPU listener `srs={}` counter stays at zero

The SRS classifier returns “not SRS” for every packet because no MAC SRS PDU has been published to srs_schedule_tap. Two causes:

No UE attached - the MAC’s FAPI translator only publishes on add_srs_pdu, which fires only for a UE with an active SRS resource.
The cell’s cell_cfg.srs block is missing or malformed.

For UE-less rig validation, force-publish a synthetic SRS resource via environment variables (Section 6.4):

sudo OCUDU_SRS_GPU_FORCE=1:1:13 \
     OCUDU_SRS_GPU_RES="61:0:0:0:4:0:1:1:4" \
     ./build_72/apps/gnb/gnb -c configs/gpu_inline_acc_gnb.yml

8.4 PRACH detector latency tail spikes

If the GPU PRACH detector’s max occasionally jumps to multi-ms, the most common cause is host-side cudaStreamSynchronize blocking on an unrelated CUDA context (display driver tick, other CUDA processes). Mitigations:

Pin the GPU to compute-exclusive mode (nvidia-smi -c EXCLUSIVE_PROCESS).
Disable GPU power-state transitions (Section 1.4).
Ensure no graphical session shares the GPU.

8.5 ru_emulator can’t link (`srs_estimator_inline_impl.cpp.o`: undefined reference)

Symptom on a fresh checkout: ru_emulator fails to link with undefined reference to 'ocudu::create_low_papr_sequence_generator_sw_factory()'. The inline detector library bundles srs_estimator_inline_impl.cpp.o, which pulls in ocudu_sequence_generators. That dependency should be declared as a PRIVATE link on the inline library (lib/hal/cuda/CMakeLists.txt). If it is missing in your tree, add ocudu_sequence_generators to the inline detector’s target_link_libraries PRIVATE list and rebuild.

9. Future work - PUSCH and PUCCH

This release brings PRACH and SRS onto the inline path (queue 1) while PUSCH and PUCCH still ride queue 0 to the CPU. The next step is extending the same NIC→VRAM dispatch to PUSCH/PUCCH symbols so the bulk of the uplink arrives in GPU memory and L1 user-data processing (LDPC decode, equalization, demodulation, MIMO detection) runs where the samples already live.

Once that lands, every uplink sample resides in VRAM the instant it arrives - the position needed for AI-RAN workloads (neural receivers, learned channel estimation, AI beam management) where an ML model becomes just another consumer of a buffer already in GPU memory.