TensorFieldView - Device-Side Views for Backend Dispatch
This module provides TensorFieldView[L,T], the type that the each macro operates on. A view wraps a LocalTensorField in device-side buffers (OpenCL cl_mem, SYCL buffers, or raw host pointers for OpenMP) and handles the AoSoA layout transformation needed for efficient SIMD / GPU execution.
Key capabilities:
- Construction: newTensorFieldView(local, ioKind) allocates device buffers and optionally synchronises data from the parent local tensor (read/readwrite) or just allocates (write)
- AoSoA layout: transformToAoSoA / transformFromAoSoA convert between natural Array-of-Structures order and the blocked Array-of-Structures-of-Arrays layout used by kernels
- Backend dispatch: the each macro inspects view arguments at compile time and emits OpenCL, SYCL, or OpenMP code accordingly
- Stencil integration: views can be passed to each together with a LatticeStencil for neighbor access in kernels
- Destruction: on scope exit, write/readwrite views synchronise data back to the parent local tensor field
The backend is selected at compile time:
| Backend | Compile flag | Buffer type |
|---|---|---|
| OpenCL | (default) | cl_mem |
| SYCL | -d:UseSycl | SyclBuffer |
| OpenMP | -d:UseOpenMP | raw pointer |
Example
var local = field.newLocalTensorField()
var vSrc = local.newTensorFieldView(iokRead)
var vDst = local.newTensorFieldView(iokWrite)

each vDst, vSrc, n:
  vDst[n] = 2.0 * vSrc[n]
Types
DeviceStorage = object
  buffers*: seq[BackendBuffer]
  queues*: seq[BackendQueue]
  sitesPerDevice*: seq[int]
  totalSites*: int
  elementsPerSite*: int          ## Scalar elements per site (for OpenCL; complex counts as 2)
  tensorElementsPerSite*: int    ## Tensor elements per site (for memory; Complex64 counts as 1)
  elementSize*: int              ## sizeof(T)
  hostPtr*: pointer
  hostOffsets*: seq[int]
  siteOffsets*: seq[int]         ## Precomputed flat offsets for each lex site in padded GA memory
  destroyed*: bool               ## Flag to prevent double destruction
  simdLayout*: SimdLatticeLayout ## SIMD layout for vectorized AoSoA access
  aosoaData*: pointer            ## Pointer to AoSoA transformed data (OpenMP)
  aosoaSeqRef*: RootRef          ## Reference to keep AoSoA seq alive (OpenMP)
- Device memory storage representation (type-erased)
IOKind = enum
  iokRead, iokWrite, iokReadWrite
TensorFieldView[L; T] = object
  ioKind*: IOKind
  lattice*: L
  dims*: int
  rank*: int
  shape*: seq[int]
  data*: DeviceStorage
  hasPadding*: bool
  simdGrid*: seq[int] ## Runtime SIMD lane grid (e.g., [1,2,2,2] for 8 lanes)
- Tensor field view on device memory. L is the lattice type, T is the scalar element type. The shape is stored at runtime to avoid static-generic issues.
Consts
UseOpenMP {.booldefine.} = false
UseSycl {.booldefine.} = false
VectorWidth {.intdefine.} = 8
- Number of sites processed together in a vector group. Set via -d:VectorWidth=4 for AVX2, -d:VectorWidth=8 for AVX-512, etc. Default is 8 for good GPU and modern CPU performance.
Procs
proc `=copy`(dest: var DeviceStorage; src: DeviceStorage) {.error: "DeviceStorage cannot be copied".}
proc `=copy`[L, T](dest: var TensorFieldView[L, T]; src: TensorFieldView[L, T]) {.error: "TensorFieldView cannot be copied".}
proc `=destroy`[L, T](view: var TensorFieldView[L, T]) {....raises: [].}
- Destructor for TensorFieldView (OpenCL/SYCL backend) Reads device data back, transforms from AoSoA to AoS, and scatters into the padded GA memory via siteOffsets. Because the host pointer points directly into GA memory, this write-back immediately updates the Global Array — no separate flush is needed. Releases all device buffers after write-back.
proc `[]`[L, T, D: static int](view: TensorFieldView[L, T]; shift: StencilShift[D]): TensorSiteProxy[L, T] {.inline.}
- Access a neighbor site using a stencil shift (phantom for GPU codegen).
  Usage:
    let nbr = view[stencil.shift(n, +1, 0)]  # Forward x neighbor
    let nbr = view[stencil.fwd(n, 0)]        # Same as above
    let nbr = view[stencil.bwd(n, 3)]        # Backward t neighbor
proc `[]`[L, T](view: TensorFieldView[L, T]; site: int): TensorSiteProxy[L, T]
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: MatAddResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: MatMulResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: MatVecResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: ScalarAddResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: ScalarMulResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: T)
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: TensorSiteProxy[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: VecAddResult[L, T])
proc aosoaDataPtr[L, T](view: TensorFieldView[L, T]): ptr UncheckedArray[T] {.inline.}
- Returns pointer to AoSoA data buffer (for SIMD views)
proc buffers[L, T](view: TensorFieldView[L, T]): seq[BackendBuffer]
- Returns the underlying device memory buffers
proc computeElementsPerSite(shape: openArray[int]; isComplex: bool = false): int {....raises: [], tags: [], forbids: [].}
- Compute the number of elements per lattice site
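Based on the field comments on DeviceStorage (a complex scalar counts as two real elements for OpenCL), the computation is presumably the product of the shape dimensions with an optional factor of two. A minimal sketch, in Python for illustration only (the real proc is Nim):

```python
def compute_elements_per_site(shape, is_complex=False):
    # Product of the tensor shape dimensions; for OpenCL buffers a
    # complex scalar occupies two real elements, hence the factor of 2.
    n = 1
    for s in shape:
        n *= s
    return n * 2 if is_complex else n
```

For example, a 3x3 color matrix of Complex64 yields 9 tensor elements but 18 scalar elements on the OpenCL side.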
proc computeTotalLatticeSites(localGrid: openArray[int]): int {....raises: [], tags: [], forbids: [].}
- Compute the total number of lattice sites from the local grid
proc defaultSimdGrid(localGrid: seq[int]): seq[int] {....raises: [], tags: [], forbids: [].}
- Generate default SIMD grid that distributes VectorWidth lanes across dimensions. Overload for seq input.
proc defaultSimdGrid[D: static[int]](localGrid: array[D, int]): seq[int]
- Generate default SIMD grid that distributes VectorWidth lanes across dimensions. Prefers faster-varying (lower index) dimensions.
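The lane distribution can be sketched as repeated doubling along dimensions, preferring lower (faster-varying) indices. This is a hypothetical reconstruction in Python for illustration; the exact tie-breaking between dimensions in the Nim proc may differ:

```python
def default_simd_grid(local_grid, vector_width=8):
    # Hypothetical sketch: double the lane count along one dimension at a
    # time, starting from the fastest-varying (lowest) index, as long as
    # the doubled factor still divides the local extent evenly.
    grid = [1] * len(local_grid)
    lanes = 1
    progressed = True
    while lanes < vector_width and progressed:
        progressed = False
        for d in range(len(grid)):
            if lanes == vector_width:
                break
            if local_grid[d] % (grid[d] * 2) == 0:
                grid[d] *= 2
                lanes *= 2
                progressed = True
    return grid
```

The product of the resulting grid equals VectorWidth, and each lane factor divides the corresponding local extent.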
proc elementsPerSite[L, T](view: TensorFieldView[L, T]): int {.inline.}
- Returns the number of elements per lattice site
proc formatSiteData[L, T](view: TensorFieldView[L, T]; site: int): string
- Read and format tensor data for a single site. Returns a nicely formatted string representation.
proc hasSimdLayout[L, T](view: TensorFieldView[L, T]): bool {.inline.}
- Returns true if this view was created with a SIMD layout
proc matadd[L, T](a, b: TensorFieldView[L, T]; site: int): MatAddResult[L, T] {.inline.}
- Matrix addition marker for OpenCL codegen
proc matmul[L, T](a, b: TensorFieldView[L, T]; site: int): MatMulResult[L, T] {.inline.}
- Matrix multiplication marker for OpenCL codegen.
  Usage:  mViewC[n] = matmul(mViewA, mViewB, n)
  Prefer: mViewC[n] = mViewA[n] * mViewB[n]
proc matvec[L, T](mat, vec: TensorFieldView[L, T]; site: int): MatVecResult[L, T] {.inline.}
- Matrix-vector multiplication marker for OpenCL codegen
proc nSitesInner[L, T](view: TensorFieldView[L, T]): int {.inline.}
- Returns the number of SIMD lanes (sites per vector group)
proc nSitesOuter[L, T](view: TensorFieldView[L, T]): int {.inline.}
- Returns the number of vector groups (outer loop iterations)
proc numGlobalSites[L, T](view: TensorFieldView[L, T]): int {.inline.}
- Returns the total number of global lattice sites across all MPI ranks
proc numSites[L, T](view: TensorFieldView[L, T]): int {.inline.}
- Returns the total number of lattice sites in the tensor field view
proc numVectorGroups(numSites: int): int {.inline, ...raises: [], tags: [], forbids: [].}
- Compute number of vector groups (ceiling division)
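The ceiling division can be written out directly; a sketch in Python for illustration, with the default VectorWidth of 8 (the real proc is Nim):

```python
def num_vector_groups(num_sites: int, vector_width: int = 8) -> int:
    # Ceiling division: the last group may be only partially filled
    # when num_sites is not a multiple of vector_width.
    return (num_sites + vector_width - 1) // vector_width
```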
proc readSiteData[L, T](view: TensorFieldView[L, T]; site: int): RuntimeSiteData[T]
- Read tensor data for a single site from device memory. Used for CPU fallback when print statements are present.
proc simdLayout[L, T](view: TensorFieldView[L, T]): SimdLatticeLayout
- Returns the SIMD layout for vectorized access
proc sitesPerDevice[L, T](view: TensorFieldView[L, T]): seq[int]
- Returns the number of sites assigned to each device
proc splitLatticeSites(totalSites: int; numDevices: int): seq[int] {....raises: [], tags: [], forbids: [].}
- Split lattice sites as evenly as possible among devices
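One plausible even-split policy hands the leftover sites to the first devices; a sketch in Python for illustration (the Nim proc's remainder placement may differ):

```python
def split_lattice_sites(total_sites: int, num_devices: int) -> list:
    # Every device gets the floor share; the first (total mod devices)
    # devices each get one extra site, so counts differ by at most 1.
    base, rem = divmod(total_sites, num_devices)
    return [base + (1 if d < rem else 0) for d in range(num_devices)]
```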
proc totalElements[L, T](view: TensorFieldView[L, T]): int {.inline.}
- Returns the total number of elements across all sites
proc transformAoSoAtoAoS[T](src: pointer; numSites, elemsPerSite: int): seq[T]
- Transform data from AoSoA back to flat contiguous AoS layout. Used by updateGlobalTensorField for manual write-back.
proc transformAoSoAtoAoSSimd[T](src: pointer; layout: SimdLatticeLayout; elemsPerSite: int): seq[T]
- Transform data from AoSoA back to flat contiguous AoS layout using the SIMD layout. Used by updateGlobalTensorField for manual write-back; inverse of transformAoStoAoSoASimd.
proc transformAoStoAoSoA[T](src: pointer; numSites, elemsPerSite: int; siteOffsets: seq[int]): seq[T]
- Transform data from AoS (Array of Structures) to AoSoA layout.
  AoS layout:   site0[e0, e1, ...], site1[e0, e1, ...], ...
  AoSoA layout: group0[e0: s0..s(V-1), e1: s0..s(V-1), ...], group1[...]
  Each vector group contains VectorWidth sites with elements interleaved, which enables SIMD-friendly memory access patterns. Uses siteOffsets to handle padded GA memory strides.
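The index arithmetic of the simple VectorWidth-based transform can be sketched as follows, in Python for illustration; the siteOffsets indirection for padded GA memory is omitted, and zero-padding the trailing partial group is an assumption:

```python
def aos_to_aosoa(src, num_sites, elems_per_site, vector_width=8):
    # Destination index for (site, element e):
    #   group = site // V, lane = site % V
    #   dst[group*V*E + e*V + lane] = src[site*E + e]
    groups = (num_sites + vector_width - 1) // vector_width
    dst = [0.0] * (groups * vector_width * elems_per_site)
    for site in range(num_sites):
        g, lane = divmod(site, vector_width)
        for e in range(elems_per_site):
            dst[g * vector_width * elems_per_site + e * vector_width + lane] = \
                src[site * elems_per_site + e]
    return dst

def aosoa_to_aos(dst, num_sites, elems_per_site, vector_width=8):
    # Inverse mapping, as used for write-back to the host field.
    out = [0.0] * (num_sites * elems_per_site)
    for site in range(num_sites):
        g, lane = divmod(site, vector_width)
        for e in range(elems_per_site):
            out[site * elems_per_site + e] = \
                dst[g * vector_width * elems_per_site + e * vector_width + lane]
    return out
```

With 2 sites of 2 elements and VectorWidth 2, AoS order [s0e0, s0e1, s1e0, s1e1] becomes [e0: s0,s1, e1: s0,s1], i.e. the two lanes of each element land in adjacent memory for SIMD loads.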
proc transformAoStoAoSoASimd[T](src: pointer; layout: SimdLatticeLayout; elemsPerSite: int; siteOffsets: seq[int]): seq[T]
- Transform data from AoS to AoSoA layout using a SIMD layout with a configurable lane grid.
  Unlike the simple VectorWidth-based AoSoA, this uses the SimdLatticeLayout to map lattice coordinates to SIMD lanes according to the user-specified simdGrid. Uses siteOffsets to handle padded GA memory strides.
  AoS layout:   site0[e0, e1, ...], site1[e0, e1, ...], ...
  AoSoA layout: outer0[e0: lane0..laneN, e1: lane0..laneN, ...], outer1[...]
  Parameters:
  - src: Pointer to GA memory (padded strides)
  - layout: SimdLatticeLayout with innerGeom (SIMD lanes) and outerGeom (vector groups)
  - elemsPerSite: Number of tensor elements per site
  - siteOffsets: Precomputed flat offsets for each lexicographic site in padded GA memory
proc vecadd[L, T](a, b: TensorFieldView[L, T]; site: int): VecAddResult[L, T] {.inline.}
- Vector addition marker for OpenCL codegen
proc writeSiteElement[L, T](view: TensorFieldView[L, T]; site: int; elementIdx: int; value: T)
- Write a single element of a tensor at a specific site to device memory. Used for element-level writes like view[n] = value during CPU fallback. This is an immediate write that syncs with the device.
Templates
template newTensorFieldView[D: static[int]; R: static[int]; L, T](tensor: LocalTensorField[D, R, L, T]; io: IOKind): TensorFieldView[L, T]
- Create tensor field view from a local tensor field. Uses AoSoA layout on device for SIMD-friendly access patterns.
template newTensorFieldView[D: static[int]; R: static[int]; L, T](tensor: TensorField[D, R, L, T]; io: IOKind): TensorFieldView[L, T]
- Create tensor field view from global tensor field
Exports
- Re-exports the lattice stencil API (LatticeStencil, StencilView, StencilShift, StencilPattern, newLatticeStencil, fwd, bwd, shift, forwardStencil, backwardStencil, nearestNeighborStencil, laplacianStencil, plaquettePath, ...), the SIMD layout API (SimdLatticeLayout, newSimdLatticeLayout, aosoaIndex, vectorGroups, simdLanes, validateSimdGrid, ...), the full OpenCL bindings (platform/device/context/queue management, buffer and kernel wrappers, enqueue* procs, CL_* and DEVICE_* constants, each, eachImpl, run2d, run3d, ...), and the site-proxy tensor algebra (TensorSiteProxy, TensorElementProxy, Mat*/Vec* types, MatMulResult, MatVecResult, MatAddResult, VecAddResult, ScalarAddResult, ScalarMulResult, and their arithmetic operators), plus the backend defines UseOpenCL, UseSycl, UseOpenMP, and VectorWidth.