
Compute Backends

ReliQ supports three compute backends, all implementing the same `each` macro interface. User code is backend-agnostic: the same `for n in each` loop works unchanged on all three backends.

Backend Comparison

| Feature | OpenCL | SYCL | OpenMP |
|---|---|---|---|
| Compilation | JIT (runtime) | Pre-compiled (C++) | Source-level (Nim→C) |
| Target | GPUs, FPGAs, CPUs | Intel/AMD GPUs, CPUs | CPUs only |
| Requirements | OpenCL runtime | Intel oneAPI (icpx) | GCC/Clang with OpenMP |
| Compile flag | (default) | BACKEND=sycl | BACKEND=openmp |
| SIMD | GPU warps | GPU subgroups | Explicit intrinsics |
| Memory | OpenCL buffers | SYCL USM/buffers | Shared memory (AoSoA) |
| Test count | 245 tests | 245 tests | 295 tests |

OpenCL Backend (Default)

The OpenCL backend generates kernel source code at compile time and JIT-compiles it at runtime. This is the default backend.

How It Works

The each macro in opencl/cldisp receives the loop body as an AST and:

  1. Classifies each assignment expression into a dispatch kind (copy, add, matmul, scalar-mul, etc.)
  2. Gathers all TensorFieldView symbols referenced in the loop
  3. Detects stencil neighbor accesses
  4. Generates an OpenCL C kernel source string with:
    • Buffer parameters for each view
    • Element type declarations (float/double/int/long)
    • Site indexing with get_global_id(0)
    • Inlined arithmetic matching the expression pattern
  5. JIT-compiles the kernel using clCreateProgramWithSource and clBuildProgram
  6. Dispatches with clEnqueueNDRangeKernel

Expression Kinds

| Kind | Example | Kernel pattern |
|---|---|---|
| Copy | vC[n] = vA[n] | C[i] = A[i] |
| Add | vC[n] = vA[n] + vB[n] | Element-wise sum |
| Sub | vC[n] = vA[n] - vB[n] | Element-wise diff |
| MatMul | vC[n] = vA[n] * vB[n] | C[i,j] = Σ A[i,k]*B[k,j] |
| MatVec | vC[n] = vA[n] * vB[n] | C[i] = Σ A[i,k]*B[k] |
| ScalarMul | vC[n] = 3.0 * vA[n] | C[i] = s * A[i] |
| ScalarAdd | vC[n] = vA[n] + 1.0 | C[i] = A[i] + s |
| StencilCopy | vC[n] = vA[fwd] | Offset index lookup |

Debug Kernels

Compile with -d:DebugKernels to print generated OpenCL kernel source to stdout at compile time.

Modules

| Module | Description |
|---|---|
| opencl/cldisp | each macro → OpenCL kernel generation |
| opencl/clbase | Platform, device, context, buffer management |

OpenCL Base Layer

opencl/clbase provides the OpenCL platform and device management:

# Initialization (called automatically by TensorFieldView constructors)
initCL()
finalizeCL()

# Manual platform selection
let platform = firstPlatform()
let devices = getDevices(platform)
echo platform.name  # e.g., "NVIDIA CUDA"

# Device properties
echo devices[0].globalMemory
echo devices[0].maxWorkItems

SYCL Backend

The SYCL backend dispatches to pre-compiled C++ kernel templates in a shared library (libreliq_sycl.so), avoiding JIT compilation overhead.

How It Works

The each macro in sycl/sycldisp:

  1. Analyzes the AST (same expression classification as OpenCL)
  2. Builds an execution plan for complex expressions involving temporaries
  3. Dispatches to typed C++ template functions via FFI:
    • sycl_kernel_copy_f64(queue, bufA, bufC, nSites, elems)
    • sycl_kernel_matmul_f64(queue, bufA, bufB, bufC, nSites, rows, cols)
    • etc.
  4. Each kernel function wraps a sycl::handler::parallel_for with a type-specialized inner loop

Building the SYCL Wrapper

# Build libreliq_sycl.so (requires Intel oneAPI or hipSYCL)
make sycl-lib

# Then build/test with SYCL backend
make tensorview BACKEND=sycl
make test-sycl

Type-Specialized Kernels

The SYCL wrapper provides kernels for each supported element type (float, double, int, and long, selected by function suffix as in sycl_kernel_copy_f64), plus complex-number variants for Complex64 fields.

Stencil Kernels

Pre-compiled stencil kernels accept an offset buffer and perform neighbor lookups on-device:

| Kernel | Description |
|---|---|
| kernelStencilCopy | Copy from neighbor site |
| kernelStencilScalarMul | Scalar × neighbor value |
| kernelStencilAdd | Sum of site and neighbor |

Modules

| Module | Description |
|---|---|
| sycl/sycldisp | each macro → native SYCL dispatch |
| sycl/syclbase | Queue/buffer management |
| sycl/syclwrap | Low-level C++ FFI (60+ kernel functions) |

OpenMP Backend

The OpenMP backend generates SIMD-vectorized C code with explicit intrinsic calls, targeting CPU architectures.

How It Works

The each macro in openmp/ompdisp:

  1. Analyzes the AST (same expression classification)
  2. Determines if SIMD vectorization is applicable
  3. Generates C code with:
    • #pragma omp parallel for for outer loop parallelism
    • SIMD intrinsics (AVX2/AVX-512) for inner loop vectorization
    • AoSoA memory access pattern matching the SimdLatticeLayout

SIMD-Vectorized Views

When using the OpenMP backend, TensorFieldView can use a SIMD-aware AoSoA layout:

# Default SIMD grid (auto-distributed based on VectorWidth)
var vA = localA.newTensorFieldView(iokRead)

# Explicit SIMD grid
var vB = localB.newTensorFieldView(iokRead, [1, 1, 1, 8])

The SIMD grid controls how lattice sites are grouped into SIMD lanes. With VectorWidth=8 (AVX-512), 8 consecutive sites along the innermost SIMD dimension are processed simultaneously.

Loop Patterns

The SIMD backend uses an outer/inner loop pattern:

# Outer loop: iterates over SIMD groups (OpenMP parallel)
# Inner loop: iterates over SIMD lanes (vectorized)
for outer in 0..<nSitesOuter:
  for lane in 0..<VectorWidth:
    # Each lane processes one site within the SIMD group

The eachOuter macro in openmp/ompsimd provides direct access to this pattern for low-level SIMD programming.

SIMD Intrinsics

The OpenMP backend can use hardware-specific SIMD intrinsics:

# SIMD vector operations (compile with -d:AVX2 or -d:AVX512)
import reliq

var a = SimdF64x4(data: [1.0, 2.0, 3.0, 4.0])
var b = SimdF64x4(data: [5.0, 6.0, 7.0, 8.0])
var c = a + b              # [6.0, 8.0, 10.0, 12.0]
let s = a.sum()            # 10.0

Modules

| Module | Description |
|---|---|
| openmp/ompdisp | each macro → SIMD-vectorized code |
| openmp/ompbase | OpenMP initialization, thread management |
| openmp/ompsimd | SIMD-aware dispatch, eachOuter macro |

Building and Testing

# Build with specific backend
make tensorview                    # OpenCL (default)
make tensorview BACKEND=openmp     # OpenMP
make tensorview BACKEND=sycl       # SYCL

# Run all backend tests
make test          # All backends (1,660 tests)
make test-core     # Core tests (875)
make test-opencl   # OpenCL tests (245)
make test-openmp   # OpenMP tests (295)
make test-sycl     # SYCL tests (245)