ReliQ supports three compute backends, each implementing the same `each` macro interface. User code is backend-agnostic: the same `for n in each` loop works across all three backends.
| Feature | OpenCL | SYCL | OpenMP |
|---|---|---|---|
| Compilation | JIT (runtime) | Pre-compiled (C++) | Source-level (Nim→C) |
| Target | GPUs, FPGAs, CPUs | Intel/AMD GPUs, CPUs | CPUs only |
| Requirements | OpenCL runtime | Intel oneAPI (icpx) | GCC/Clang with OpenMP |
| Compile flag | (default) | BACKEND=sycl | BACKEND=openmp |
| SIMD | GPU warps | GPU subgroups | Explicit intrinsics |
| Memory | OpenCL buffers | SYCL USM/buffers | Shared memory (AoSoA) |
| Test count | 245 tests | 245 tests | 295 tests |
The OpenCL backend generates kernel source code at compile time and JIT-compiles it at runtime. This is the default backend.
The `each` macro in `opencl/cldisp` receives the loop body as an AST and pattern-matches it against the following kernel kinds:
| Kind | Example | Kernel pattern |
|---|---|---|
| Copy | `vC[n] = vA[n]` | `C[i] = A[i]` |
| Add | `vC[n] = vA[n] + vB[n]` | Element-wise sum |
| Sub | `vC[n] = vA[n] - vB[n]` | Element-wise difference |
| MatMul | `vC[n] = vA[n] * vB[n]` | `C[i,j] = Σ A[i,k]*B[k,j]` |
| MatVec | `vC[n] = vA[n] * vB[n]` | `C[i] = Σ A[i,k]*B[k]` |
| ScalarMul | `vC[n] = 3.0 * vA[n]` | `C[i] = s * A[i]` |
| ScalarAdd | `vC[n] = vA[n] + 1.0` | `C[i] = A[i] + s` |
| StencilCopy | `vC[n] = vA[fwd]` | Offset index lookup |
Compile with `-d:DebugKernels` to print the generated OpenCL kernel source to stdout at compile time.
| Module | Description |
|---|---|
| `opencl/cldisp` | `each` macro → OpenCL kernel generation |
| `opencl/clbase` | Platform, device, context, buffer management |
`opencl/clbase` provides OpenCL platform and device management:
```nim
# Initialization (called automatically by TensorFieldView constructors)
initCL()
finalizeCL()

# Manual platform selection
let platform = firstPlatform()
let devices = getDevices(platform)
echo platform.name  # e.g., "NVIDIA CUDA"

# Device properties
echo devices[0].globalMemory
echo devices[0].maxWorkItems
```
The SYCL backend dispatches to pre-compiled C++ kernel templates in a shared library (`libreliq_sycl.so`), avoiding JIT compilation overhead.
The `each` macro in `sycl/sycldisp` recognizes the same loop patterns and dispatches each one to a pre-compiled kernel in the shared library.
```shell
# Build libreliq_sycl.so (requires Intel oneAPI or hipSYCL)
make sycl-lib

# Then build/test with SYCL backend
make tensorview BACKEND=sycl
make test-sycl
```
The SYCL wrapper provides kernels for each element type, plus complex-number variants for `Complex64` fields.
Pre-compiled stencil kernels accept an offset buffer and perform neighbor lookups on-device:
| Kernel | Description |
|---|---|
| `kernelStencilCopy` | Copy from neighbor site |
| `kernelStencilScalarMul` | Scalar × neighbor value |
| `kernelStencilAdd` | Sum of site and neighbor |
| Module | Description |
|---|---|
| `sycl/sycldisp` | `each` macro → native SYCL dispatch |
| `sycl/syclbase` | Queue/buffer management |
| `sycl/syclwrap` | Low-level C++ FFI (60+ kernel functions) |
The OpenMP backend generates SIMD-vectorized C code with explicit intrinsic calls, targeting CPU architectures.
The `each` macro in `openmp/ompdisp` recognizes the same loop patterns and generates SIMD-vectorized C code for each.
When using the OpenMP backend, `TensorFieldView` can use a SIMD-aware AoSoA (array-of-structures-of-arrays) layout:
```nim
# Default SIMD grid (auto-distributed based on VectorWidth)
var vA = localA.newTensorFieldView(iokRead)

# Explicit SIMD grid
var vB = localB.newTensorFieldView(iokRead, [1, 1, 1, 8])
```
The SIMD grid controls how lattice sites are grouped into SIMD lanes. With `VectorWidth=8` (AVX-512), 8 consecutive sites along the innermost SIMD dimension are processed simultaneously.
The SIMD backend uses an outer/inner loop pattern:
```nim
# Outer loop: iterates over SIMD groups (OpenMP parallel)
for outer in 0..<nSitesOuter:
  # Inner loop: iterates over SIMD lanes (vectorized)
  for lane in 0..<VectorWidth:
    # Each lane processes one site within the SIMD group
    discard
```
The `eachOuter` macro in `openmp/ompsimd` provides direct access to this pattern for low-level SIMD programming.
The OpenMP backend can use hardware-specific SIMD intrinsics:
```nim
# SIMD vector operations (compile with -d:AVX2 or -d:AVX512)
import reliq

var a = SimdF64x4(data: [1.0, 2.0, 3.0, 4.0])
var b = SimdF64x4(data: [5.0, 6.0, 7.0, 8.0])
var c = a + b     # [6.0, 8.0, 10.0, 12.0]
let s = a.sum()   # 10.0
```
| Module | Description |
|---|---|
| `openmp/ompdisp` | `each` macro → SIMD-vectorized code |
| `openmp/ompbase` | OpenMP initialization, thread management |
| `openmp/ompsimd` | SIMD-aware dispatch, `eachOuter` macro |
```shell
# Build with specific backend
make tensorview                  # OpenCL (default)
make tensorview BACKEND=openmp   # OpenMP
make tensorview BACKEND=sycl     # SYCL

# Run all backend tests
make test         # All backends (1,660 tests)
make test-core    # Core tests (875)
make test-opencl  # OpenCL tests (245)
make test-openmp  # OpenMP tests (295)
make test-sycl    # SYCL tests (245)
```