User-defined buffers in the Arena SDK

Overview

This knowledge base article explains how to use user-defined buffers with the Arena SDK. User-defined buffers allow you to provide your own memory pointer to the Arena SDK, rather than having the SDK allocate memory automatically. This gives you control over where image data resides. When combined with GPU memory and GPUDirect RDMA hardware, this enables zero-copy workflows in which camera data is written directly to GPU VRAM.

Note: For information on choosing an appropriate buffer count, see the companion KB article How to choose an appropriate buffer count for the Arena SDK.

Prerequisites

For user-defined buffers (basic):

Arena SDK: Version with user-defined buffer support.
Memory: Any valid memory pointer (malloc, cudaMalloc, mmap, etc.)

For zero-copy GPUDirect RDMA (advanced):

Network Interface Card: RDMA-capable NIC with GPUDirect RDMA support.
GPU: NVIDIA GPU with CUDA support.
Software: CUDA Toolkit installed (for NVIDIA GPUs).
Drivers: nvidia-peermem kernel module (for NVIDIA GPUs).
Network: Proper RDMA configuration on NIC and network infrastructure (RoCEv2).
Operating System: Ubuntu 24.04 LTS (Noble Numbat)

Important: Not all RDMA-capable NICs support GPUDirect RDMA. The NIC must specifically support peer-to-peer DMA and have drivers that integrate with CUDA.

Memory Location

The fundamental difference between the classic StartStream call and user-defined buffers is where the image buffer memory physically resides in your computer.

Classic StartStream (automatic allocation)

pDevice->StartStream(10);

The Arena SDK allocates the buffers internally, conceptually: void* buffers = malloc(10 * bufferSize);
Physical location: System RAM.
Accessible by: GPU, but only after cudaMemcpy (transfer over PCIe).
Total allocation: 10 buffers × 12 MB = 120 MB in system RAM.

User-defined buffers with system RAM

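// 10 buffers x 12 MB each = 120 MB of system RAM
const size_t bufferSize = 12 * 1024 * 1024;
void* myBuffers = malloc(10 * bufferSize);
// Wrap the pool in a UserSuppliedBuffer list, then provide it to the Arena SDK
pDevice->StartStream(bufferList);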

Provide your system memory pointer to the Arena SDK through pDevice->StartStream.
Physical location: System RAM, the same as the classic approach. You control the allocation; the Arena SDK uses your memory.
Still requires cudaMemcpy for GPU processing.

User-defined buffers with GPU memory (no RDMA)

void* gpuBuffers;
cudaMalloc(&gpuBuffers, 10 * bufferSize);  // 10 x 12 MB = 120 MB of GPU VRAM
pDevice->StartStream(gpuBufferList);

Physical location: GPU VRAM (on graphics card).
Data flow: Camera → NIC → System RAM → PCIe → GPU VRAM.
Internal copy from system RAM to GPU is still required.
Arena writes to GPU memory, but data passes through the CPU first.

User-defined buffers with GPUDirect RDMA

void* gpuBuffers;
cudaMalloc(&gpuBuffers, 10 * bufferSize);  // 10 x 12 MB = 120 MB of GPU VRAM
pDevice->StartStream(gpuBufferList);

RDMA-capable NIC + nvidia-peermem loaded.
Physical location: GPU VRAM (on graphics card).
Data flow: Camera → NIC → PCIe → GPU VRAM.
Zero-copy: NIC writes directly to GPU, bypassing system RAM entirely.
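
Because the two GPU-memory cases use identical application code and differ only in hardware and driver state, it can help to confirm at startup that the peer-memory module is loaded. The sketch below scans a module listing in the format of Linux /proc/modules for nvidia_peermem (the kernel lists module names with underscores). The function names are illustrative, not part of any SDK, and a loaded module is only a necessary condition: it does not prove that the NIC is performing GPUDirect transfers.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns true if moduleName is the first field of any line in a
// /proc/modules-style listing.
bool isModuleListed(std::istream& modules, const std::string& moduleName) {
    std::string line;
    while (std::getline(modules, line)) {
        std::istringstream fields(line);
        std::string name;
        if (fields >> name && name == moduleName) {
            return true;
        }
    }
    return false;
}

// Convenience wrapper for the live system (Linux only).
bool isPeerMemLoaded() {
    std::ifstream modules("/proc/modules");
    return modules && isModuleListed(modules, "nvidia_peermem");
}
```

Keeping the parser separate from the file access makes the check testable against a fixed string without the module being present.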

Understanding the components

User-defined buffers (Arena SDK feature)

You provide a memory pointer; the Arena SDK uses it instead of allocating its own.
Works with ANY memory type: system RAM (malloc), GPU memory (cudaMalloc), pinned memory (cudaMallocHost), shared memory (mmap).
Does NOT require RDMA or GPU.
Use case: Control memory allocation strategy.

GPU memory (CUDA feature)

Memory allocated on GPU using cudaMalloc().
Does not automatically enable zero-copy; without RDMA, data still goes through system RAM first.
Use case: Keep data on GPU for processing, reduce copy operations.

GPUDirect RDMA (NIC vendor feature)

Technology that allows the NIC to write directly to GPU memory, bypassing the CPU and system RAM entirely.
Requires an RDMA-capable NIC with GPUDirect support (Mellanox/NVIDIA, Broadcom, or Intel models), the nvidia-peermem (NIC-agnostic) driver loaded, and GPU memory allocated with cudaMalloc.
Must be combined with user-defined buffers pointing to GPU memory.
Use case: zero-copy workflows for GPU-accelerated processing.

UserSuppliedBuffer API

The UserSuppliedBuffer struct is how you provide buffers to the Arena SDK:

struct UserSuppliedBuffer
{
    uint8_t* pData;      // Pointer to the actual buffer
    size_t bufferSize;   // Size of the buffer in bytes
    void* userContext;   // Optional handle for user metadata
    UserSuppliedBuffer(uint8_t* data, size_t size, 
                       void* ctx = nullptr);
};

pData: Pointer to your allocated buffer (system RAM or GPU VRAM).
bufferSize: Size of this buffer in bytes (typically from PayloadSize node).
userContext: Optional metadata pointer that travels with this buffer (retrieved via GetPrivateDataPtr()).
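
Since the SDK takes the pointers and sizes as given, a quick sanity check before starting the stream can catch sizing mistakes early. The following is a sketch only: buffersAreUsable is a hypothetical helper, the struct is a local stand-in for the one above, and the overlap rule reflects the assumption that two frames should never share memory rather than a documented SDK requirement.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-in mirroring the struct shown above.
struct UserSuppliedBuffer {
    uint8_t* pData;
    size_t bufferSize;
    void* userContext;
    UserSuppliedBuffer(uint8_t* data, size_t size, void* ctx = nullptr)
        : pData(data), bufferSize(size), userContext(ctx) {}
};

// Hypothetical pre-flight check: every entry is non-null, holds at least
// payloadSize bytes, and no two entries overlap. Addresses are compared as
// integers and never dereferenced, so GPU pointers are fine here.
bool buffersAreUsable(const std::vector<UserSuppliedBuffer>& list,
                      size_t payloadSize) {
    for (size_t i = 0; i < list.size(); ++i) {
        if (list[i].pData == nullptr || list[i].bufferSize < payloadSize) {
            return false;
        }
        const uintptr_t aBegin = reinterpret_cast<uintptr_t>(list[i].pData);
        const uintptr_t aEnd = aBegin + list[i].bufferSize;
        for (size_t j = i + 1; j < list.size(); ++j) {
            const uintptr_t bBegin = reinterpret_cast<uintptr_t>(list[j].pData);
            const uintptr_t bEnd = bBegin + list[j].bufferSize;
            if (aBegin < bEnd && bBegin < aEnd) {
                return false;  // address ranges intersect
            }
        }
    }
    return true;
}
```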

Function signature

void StartStream(const std::vector<UserSuppliedBuffer>& bufferList);

Example code

// Step 1: Get buffer size
size_t bufferSize = static_cast<size_t>(Arena::GetNodeValue<int64_t>(
    pDevice->GetNodeMap(), "PayloadSize"));
const int numBuffers = 10;
// Step 2: Allocate GPU memory pool
void* d_bufferPool;
cudaError_t err = cudaMalloc(&d_bufferPool, 
                              bufferSize * numBuffers);
if (err != cudaSuccess) {
    std::cerr << "GPU allocation failed: " 
              << cudaGetErrorString(err) << std::endl;
    return -1;
}
// Step 3: Create buffer list with GPU pointers
std::vector<UserSuppliedBuffer> bufferList;
bufferList.reserve(numBuffers);
for (int i = 0; i < numBuffers; i++) {
    uint8_t* bufferPtr = 
        static_cast<uint8_t*>(d_bufferPool) + 
        (i * bufferSize);
    bufferList.emplace_back(bufferPtr, bufferSize);
}
// Step 4: Start streaming
pDevice->StartStream(bufferList);
// Step 5: Get images
for (int i = 0; i < 100; i++) {
    Arena::IImage* pImage = pDevice->GetImage(2000);
    // At this point, image data is in GPU memory
    // pImage->GetData() returns a GPU memory address
    //////////////////////////////////////////////////////////////////////////
    // STILL NEED TO PROCESS THIS DATA ON THE GPU - SEE CUDA KERNEL SECTION //
    //////////////////////////////////////////////////////////////////////////
    pDevice->RequeueBuffer(pImage);
}
// Step 6: Cleanup
pDevice->StopStream();
bufferList.clear();
cudaFree(d_bufferPool);

Understanding GPU processing with CUDA kernels

The most common way to process GPU data is with CUDA kernels. A CUDA kernel is a function that runs on the GPU and processes data in parallel across thousands of threads. In the context of image processing, each thread typically handles one pixel, allowing the entire image to be processed simultaneously.

__global__ void processKernel(uint8_t* image, size_t width, size_t height) {
    // Calculate which pixel this thread processes
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        size_t idx = y * width + x;
        image[idx] = 255 - image[idx];  // Process this pixel
    }
}

__global__: Keyword indicating this function runs on the GPU but is called from CPU code
uint8_t* image: Pointer to image data. With user-defined GPU buffers, this points to GPU VRAM (not system RAM)
size_t width, height: Image dimensions passed to the kernel
blockIdx.x, blockIdx.y: Which block (tile) this thread belongs to in the grid
blockDim.x, blockDim.y: Number of threads per block in each dimension (16×16 = 256 threads)
threadIdx.x, threadIdx.y: This thread’s position within its block (0-15 in each dimension)
int x, y: The pixel coordinates this thread will process, calculated from block and thread indices
if (x < width && y < height): Bounds check to ensure thread only processes pixels inside the image
size_t idx = y * width + x: Converts 2D pixel coordinates to 1D array index (images stored row-by-row)
image[idx] = 255 - image[idx]: The actual processing (this example inverts pixel values)
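
To check the index arithmetic without a GPU, the same mapping can be emulated on the CPU, with nested loops standing in for blocks and threads. This is a host-only sketch: invertOnHost is an illustrative name, it runs sequentially, and it is meant for verifying the indexing logic, not as a substitute for the kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates processKernel on the CPU. The two outer loops play the role of
// blockIdx.y / blockIdx.x, the two inner loops the role of threadIdx.y /
// threadIdx.x.
void invertOnHost(std::vector<uint8_t>& image, size_t width, size_t height,
                  size_t blockDim = 16) {
    const size_t gridX = (width + blockDim - 1) / blockDim;   // blocks across
    const size_t gridY = (height + blockDim - 1) / blockDim;  // blocks down
    for (size_t by = 0; by < gridY; ++by) {
        for (size_t bx = 0; bx < gridX; ++bx) {
            for (size_t ty = 0; ty < blockDim; ++ty) {
                for (size_t tx = 0; tx < blockDim; ++tx) {
                    const size_t x = bx * blockDim + tx;
                    const size_t y = by * blockDim + ty;
                    // Same bounds check as the kernel: threads in edge
                    // blocks that fall outside the image do nothing.
                    if (x < width && y < height) {
                        const size_t idx = y * width + x;
                        image[idx] = static_cast<uint8_t>(255 - image[idx]);
                    }
                }
            }
        }
    }
}
```

Running this on an image whose dimensions are not multiples of 16 exercises the edge blocks: each pixel is visited exactly once, so every value ends up inverted exactly once.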

Once you have defined your CUDA kernel, it will need to be launched from the CPU:

// Get image from GPU buffer
Arena::IImage* pImage = pDevice->GetImage(2000);
 
// Configure kernel launch parameters
dim3 threads(16, 16);
dim3 blocks((pImage->GetWidth() + 15) / 16, 
            (pImage->GetHeight() + 15) / 16);
 
// Launch kernel
processKernel<<<blocks, threads>>>(
    const_cast<uint8_t*>(pImage->GetData()),
    pImage->GetWidth(),
    pImage->GetHeight());
 
// Wait for GPU to finish
cudaDeviceSynchronize();
 
// Safe to requeue now
pDevice->RequeueBuffer(pImage);

dim3 threads(16, 16): Defines the number of threads per block. Creates 16×16 = 256 threads per block. Each block processes a 16×16 tile of the image.
dim3 blocks((width + 15) / 16, (height + 15) / 16): Calculates how many blocks are needed to cover the entire image. The formula (width + 15) / 16 rounds up to ensure all pixels are covered. For a 1024×768 image, this creates 64×48 = 3,072 blocks.
<<<blocks, threads>>>: CUDA kernel launch syntax. The triple angle brackets specify the grid configuration: blocks defines how many blocks to create, threads defines how many threads per block. This launches 3,072 blocks × 256 threads = 786,432 parallel threads.
const_cast<uint8_t*>(pImage->GetData()): Removes the const qualifier from the image data pointer. With user-defined GPU buffers, GetData() returns a GPU memory address (not system RAM). The const_cast is needed because the kernel will modify the data.
pImage->GetWidth(), pImage->GetHeight(): Pass image dimensions to the kernel so each thread knows the image boundaries.
cudaDeviceSynchronize(): Blocks CPU execution until the GPU kernel completes. Kernel launches are asynchronous (they return immediately without waiting). You must synchronize before requeuing the buffer, otherwise you’ll requeue while the GPU is still processing it, causing undefined behavior.
pDevice->RequeueBuffer(pImage): Returns the buffer to Arena SDK for reuse.
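
The rounding-up arithmetic behind the launch configuration can be sketched in plain C++ as follows. GridSize, ceilDiv and gridFor are illustrative names introduced here, and the default tile of 16 matches the 16×16 thread block used above.

```cpp
#include <cstddef>

// Illustrative result type: blocks per dimension plus total threads launched.
struct GridSize {
    size_t blocksX;
    size_t blocksY;
    size_t totalThreads;
};

// Integer division rounded up: same result as (n + 15) / 16 when d == 16.
constexpr size_t ceilDiv(size_t n, size_t d) {
    return (n + d - 1) / d;
}

// Number of tile x tile blocks needed to cover a width x height image.
constexpr GridSize gridFor(size_t width, size_t height, size_t tile = 16) {
    return GridSize{ceilDiv(width, tile), ceilDiv(height, tile),
                    ceilDiv(width, tile) * ceilDiv(height, tile) * tile * tile};
}
```

For the 1024×768 image this reproduces 64×48 blocks and 786,432 threads. For an image that is not a multiple of 16, such as 1000×700, it gives 63×44 blocks and 709,632 threads for 700,000 pixels, which is why the bounds check in the kernel is needed.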

Working with buffer metadata (userContext)

The userContext field in UserSuppliedBuffer allows you to attach metadata to each buffer. This metadata pointer travels with the buffer and can be retrieved via GetPrivateDataPtr() when you get an image. The userContext is an opaque pointer; Arena SDK stores it and returns it to you unchanged, but never reads or validates the data. This is useful for tracking which buffer is being used, collecting statistics, or debugging buffer reuse patterns.

// Create metadata (this example uses simple integers).
// The array must stay alive for as long as the stream is running.
int bufferIDs[10];
for (int i = 0; i < 10; i++) {
    bufferIDs[i] = i;  // Metadata: just the buffer number
}
// Attach metadata when creating buffers (inside the buffer-creation loop,
// where i is the buffer index)
bufferList.emplace_back(
    bufferPtr,        // Buffer pointer
    bufferSize,       // Buffer size  
    &bufferIDs[i]     // Metadata pointer (third parameter)
);
// Later, retrieve metadata
Arena::IImage* pImage = pDevice->GetImage(2000);  // timeout in milliseconds
int* bufferID = (int*)pImage->GetPrivateDataPtr();
printf("Using buffer %d\n", *bufferID);
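
Because the pointer is opaque, it can refer to a richer record than an integer. The sketch below keeps a small per-buffer statistics struct; BufferStats, makeStats and recordDelivery are hypothetical names, and the example assumes the pointer retrieved from the image is the same one that was attached to the buffer.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-buffer metadata. Any layout works, because the SDK
// stores and returns the pointer without reading what it points to.
struct BufferStats {
    int id;                   // index of the buffer in the list
    uint64_t timesDelivered;  // how often this buffer carried an image
};

// One record per buffer. Keep the vector alive and do not resize it while
// streaming: pointers to its elements are handed out as userContext.
std::vector<BufferStats> makeStats(int numBuffers) {
    std::vector<BufferStats> stats;
    stats.reserve(numBuffers);
    for (int i = 0; i < numBuffers; ++i) {
        stats.push_back(BufferStats{i, 0});
    }
    return stats;
}

// Call with the metadata pointer retrieved from an image. Returns the
// buffer id, or -1 if no metadata was attached.
int recordDelivery(void* userContext) {
    if (userContext == nullptr) {
        return -1;
    }
    BufferStats* stats = static_cast<BufferStats*>(userContext);
    ++stats->timesDelivered;
    return stats->id;
}
```

In this arrangement, &stats[i] would be passed as the third argument when creating buffer i, and the value from GetPrivateDataPtr() would be handed to recordDelivery after each GetImage call.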