tensor/tensorview


TensorFieldView - Device-Side Views for Backend Dispatch

This module provides TensorFieldView[L,T], the type that the each macro operates on. A view wraps a LocalTensorField in device-side buffers (OpenCL cl_mem, SYCL buffers, or raw host pointers for OpenMP) and handles the AoSoA layout transformation needed for efficient SIMD / GPU execution.

Key capabilities:

  • Construction: newTensorFieldView(local, ioKind) allocates device buffers and optionally synchronises data from the parent local tensor (read/readwrite) or just allocates (write)
  • AoSoA layout: transformToAoSoA / transformFromAoSoA convert between natural Array-of-Structures order and the blocked Array-of-Structures-of-Arrays layout used by kernels
  • Backend dispatch: the each macro inspects view arguments at compile time and emits OpenCL, SYCL, or OpenMP code accordingly
  • Stencil integration: views can be passed to each together with a LatticeStencil for neighbor access in kernels
  • Destruction: on scope exit, write/readwrite views synchronise data back to the parent local tensor field

The backend is selected at compile time:

Backend   Compile flag    Buffer type
OpenCL    (default)       cl_mem
SYCL      -d:UseSycl      SyclBuffer
OpenMP    -d:UseOpenMP    raw pointer

Example

var local = field.newLocalTensorField()
var vSrc = local.newTensorFieldView(iokRead)
var vDst = local.newTensorFieldView(iokWrite)
each vDst, vSrc, n:
  vDst[n] = 2.0 * vSrc[n]

Types

DeviceStorage = object
  buffers*: seq[BackendBuffer]
  queues*: seq[BackendQueue]
  sitesPerDevice*: seq[int]
  totalSites*: int
  elementsPerSite*: int      ## Scalar elements per site (for OpenCL - complex counts as 2)
  tensorElementsPerSite*: int ## Tensor elements per site (for memory - Complex64 counts as 1)
  elementSize*: int          ## sizeof(T)
  hostPtr*: pointer
  hostOffsets*: seq[int]
  siteOffsets*: seq[int]     ## Precomputed flat offsets for each lex site in padded GA memory
  destroyed*: bool           ## Flag to prevent double destruction
  simdLayout*: SimdLatticeLayout ## SIMD layout for vectorized AoSoA access
  aosoaData*: pointer        ## Pointer to AoSoA transformed data (OpenMP)
  aosoaSeqRef*: RootRef      ## Reference to keep AoSoA seq alive (OpenMP)
Device memory storage representation (type-erased)
IOKind = enum
  iokRead, iokWrite, iokReadWrite
TensorFieldView[L; T] = object
  ioKind*: IOKind
  lattice*: L
  dims*: int
  rank*: int
  shape*: seq[int]
  data*: DeviceStorage
  hasPadding*: bool
  simdGrid*: seq[int]        ## Runtime SIMD lane grid (e.g., [1,2,2,2] for 8 lanes)

Tensor field view on device memory

L is the lattice type, T is the scalar element type. Shape is stored at runtime to avoid static generic issues.

Consts

UseOpenMP {.booldefine.} = false
UseSycl {.booldefine.} = false
VectorWidth {.intdefine.} = 8
Number of sites processed together in a vector group. Set via -d:VectorWidth=4 for AVX2, -d:VectorWidth=8 for AVX-512, etc. Default is 8 for good GPU and modern CPU performance.

Procs

proc `=copy`(dest: var DeviceStorage; src: DeviceStorage) {.
    error: "DeviceStorage cannot be copied".}
proc `=copy`[L, T](dest: var TensorFieldView[L, T]; src: TensorFieldView[L, T]) {.
    error: "TensorFieldView cannot be copied".}
proc `=destroy`[L, T](view: var TensorFieldView[L, T]) {....raises: [].}
Destructor for TensorFieldView (OpenCL/SYCL backend). Reads device data back, transforms from AoSoA to AoS, and scatters into the padded GA memory via siteOffsets. Because the host pointer points directly into GA memory, this write-back immediately updates the Global Array — no separate flush is needed. Releases all device buffers after write-back.
proc `[]`[L, T, D: static int](view: TensorFieldView[L, T];
                               shift: StencilShift[D]): TensorSiteProxy[L, T] {.
    inline.}

Access neighbor site using stencil shift (phantom for GPU codegen)

Usage:

  let nbr = view[stencil.shift(n, +1, 0)]  # Forward x neighbor
  let nbr = view[stencil.fwd(n, 0)]        # Same as above
  let nbr = view[stencil.bwd(n, 3)]        # Backward t neighbor

proc `[]`[L, T](view: TensorFieldView[L, T]; site: int): TensorSiteProxy[L, T]
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int;
                 value: MatAddResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int;
                 value: MatMulResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int;
                 value: MatVecResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int;
                 value: ScalarAddResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int;
                 value: ScalarMulResult[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int; value: T)
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int;
                 value: TensorSiteProxy[L, T])
proc `[]=`[L, T](view: TensorFieldView[L, T]; site: int;
                 value: VecAddResult[L, T])
proc aosoaDataPtr[L, T](view: TensorFieldView[L, T]): ptr UncheckedArray[T] {.
    inline.}
Returns pointer to AoSoA data buffer (for SIMD views)
proc buffers[L, T](view: TensorFieldView[L, T]): seq[BackendBuffer]
Returns the underlying device memory buffers
proc computeElementsPerSite(shape: openArray[int]; isComplex: bool = false): int {.
    ...raises: [], tags: [], forbids: [].}
Compute the number of elements per lattice site
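The distinction between scalar and tensor element counts (elementsPerSite vs tensorElementsPerSite in DeviceStorage) matters for complex data: for OpenCL each complex value counts as two scalars, while for host memory accounting it counts as one. A minimal Python sketch of this counting, assuming shape is the per-site tensor shape:

```python
from math import prod

def compute_elements_per_site(shape, is_complex=False):
    # Product of the tensor dimensions gives elements per site;
    # for OpenCL, each complex element occupies two scalars.
    n = prod(shape) if shape else 1
    return 2 * n if is_complex else n
```

For example, a 3x3 complex matrix per site has 9 tensor elements but 18 scalar elements on the OpenCL side.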
proc computeTotalLatticeSites(localGrid: openArray[int]): int {....raises: [],
    tags: [], forbids: [].}
Compute the total number of lattice sites from the local grid
proc defaultSimdGrid(localGrid: seq[int]): seq[int] {....raises: [], tags: [],
    forbids: [].}
Generate default SIMD grid that distributes VectorWidth lanes across dimensions. Overload for seq input.
proc defaultSimdGrid[D: static[int]](localGrid: array[D, int]): seq[int]
Generate default SIMD grid that distributes VectorWidth lanes across dimensions. Prefers faster-varying (lower index) dimensions.
proc elementsPerSite[L, T](view: TensorFieldView[L, T]): int {.inline.}
Returns the number of elements per lattice site
proc formatSiteData[L, T](view: TensorFieldView[L, T]; site: int): string
Read and format tensor data for a single site. Returns a nicely formatted string representation.
proc hasSimdLayout[L, T](view: TensorFieldView[L, T]): bool {.inline.}
Returns true if this view was created with a SIMD layout
proc matadd[L, T](a, b: TensorFieldView[L, T]; site: int): MatAddResult[L, T] {.
    inline.}
Matrix addition marker for OpenCL codegen
proc matmul[L, T](a, b: TensorFieldView[L, T]; site: int): MatMulResult[L, T] {.
    inline.}
Matrix multiplication marker for OpenCL codegen. Usage: mViewC[n] = matmul(mViewA, mViewB, n). Prefer using: mViewC[n] = mViewA[n] * mViewB[n]
proc matvec[L, T](mat, vec: TensorFieldView[L, T]; site: int): MatVecResult[L, T] {.
    inline.}
Matrix-vector multiplication marker for OpenCL codegen
proc nSitesInner[L, T](view: TensorFieldView[L, T]): int {.inline.}
Returns the number of SIMD lanes (sites per vector group)
proc nSitesOuter[L, T](view: TensorFieldView[L, T]): int {.inline.}
Returns the number of vector groups (outer loop iterations)
proc numGlobalSites[L, T](view: TensorFieldView[L, T]): int {.inline.}
Returns the total number of global lattice sites across all MPI ranks
proc numSites[L, T](view: TensorFieldView[L, T]): int {.inline.}
Returns the total number of lattice sites in the tensor field view
proc numVectorGroups(numSites: int): int {.inline, ...raises: [], tags: [],
    forbids: [].}
Compute number of vector groups (ceiling division)
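The vector-group count is a plain ceiling division of sites by VectorWidth; the last group may be only partially filled. A Python sketch, assuming the default VectorWidth of 8:

```python
def num_vector_groups(num_sites, vector_width=8):
    # Ceiling division: the last group may be only partially filled
    return (num_sites + vector_width - 1) // vector_width
```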
proc readSiteData[L, T](view: TensorFieldView[L, T]; site: int): RuntimeSiteData[
    T]
Read tensor data for a single site from device memory. Used for CPU fallback when print statements are present.
proc simdLayout[L, T](view: TensorFieldView[L, T]): SimdLatticeLayout
Returns the SIMD layout for vectorized access
proc sitesPerDevice[L, T](view: TensorFieldView[L, T]): seq[int]
Returns the number of sites assigned to each device
proc splitLatticeSites(totalSites: int; numDevices: int): seq[int] {....raises: [],
    tags: [], forbids: [].}
Split lattice sites as evenly as possible among devices
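"As evenly as possible" means device counts differ by at most one site. One way to sketch this in Python (the actual ordering of the remainder sites is an assumption):

```python
def split_lattice_sites(total_sites, num_devices):
    # Base share per device, with the remainder handed out
    # one extra site at a time to the first devices.
    base, rem = divmod(total_sites, num_devices)
    return [base + 1 if i < rem else base for i in range(num_devices)]
```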
proc totalElements[L, T](view: TensorFieldView[L, T]): int {.inline.}
Returns the total number of elements across all sites
proc transformAoSoAtoAoS[T](src: pointer; numSites, elemsPerSite: int): seq[T]

Transform data from AoSoA back to flat contiguous AoS layout.

This is used by updateGlobalTensorField for manual write-back.
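The inverse transform gathers each site's elements back out of the interleaved vector groups into contiguous per-site order. A Python sketch of the indexing (ignoring the padded-offset handling the real proc performs):

```python
def aosoa_to_aos(src, num_sites, elems_per_site, vector_width=8):
    # For site s in group g at lane l, element e lives at
    # g * elems_per_site * vector_width + e * vector_width + l.
    out = [0] * (num_sites * elems_per_site)
    for s in range(num_sites):
        group, lane = divmod(s, vector_width)
        base = group * elems_per_site * vector_width
        for e in range(elems_per_site):
            out[s * elems_per_site + e] = src[base + e * vector_width + lane]
    return out
```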

proc transformAoSoAtoAoSSimd[T](src: pointer; layout: SimdLatticeLayout;
                                elemsPerSite: int): seq[T]

Transform data from AoSoA back to flat contiguous AoS layout using SIMD layout.

This is used by updateGlobalTensorField for manual write-back. Inverse of transformAoStoAoSoASimd.

proc transformAoStoAoSoA[T](src: pointer; numSites, elemsPerSite: int;
                            siteOffsets: seq[int]): seq[T]

Transform data from AoS (Array of Structures) to AoSoA layout.

AoS layout: site0[e0,e1,...], site1[e0,e1,...], ... AoSoA layout: group0[e0: s0,s1,...,sV-1, e1: s0,s1,...,sV-1, ...], group1[...]

Each vector group contains VectorWidth sites with elements interleaved. This enables SIMD-friendly memory access patterns. Uses siteOffsets to handle padded GA memory strides.
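The forward transform scatters contiguous per-site data so that, within a group, all lanes' copies of element e are adjacent. A Python sketch of the core interleaving (omitting the siteOffsets/padding handling the real proc applies to GA memory):

```python
def aos_to_aosoa(src, num_sites, elems_per_site, vector_width=8):
    # Destination index for site s (group g, lane l), element e:
    # g * elems_per_site * vector_width + e * vector_width + l.
    num_groups = (num_sites + vector_width - 1) // vector_width
    out = [0] * (num_groups * elems_per_site * vector_width)
    for s in range(num_sites):
        group, lane = divmod(s, vector_width)
        base = group * elems_per_site * vector_width
        for e in range(elems_per_site):
            out[base + e * vector_width + lane] = src[s * elems_per_site + e]
    return out
```

With this layout, a SIMD kernel can load element e for all VectorWidth sites of a group as one contiguous vector.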

proc transformAoStoAoSoASimd[T](src: pointer; layout: SimdLatticeLayout;
                                elemsPerSite: int; siteOffsets: seq[int]): seq[T]

Transform data from AoS to AoSoA layout using SIMD layout with configurable lane grid.

Unlike the simple VectorWidth-based AoSoA, this uses the SimdLatticeLayout to properly map lattice coordinates to SIMD lanes based on the user-specified simdGrid. Uses siteOffsets to handle padded GA memory strides.

AoS layout: site0[e0,e1,...], site1[e0,e1,...], ... AoSoA layout: outer0[e0: lane0..laneN, e1: lane0..laneN, ...], outer1[...]

Parameters:

  src: Pointer to GA memory (padded strides)
  layout: SimdLatticeLayout with innerGeom (SIMD lanes) and outerGeom (vector groups)
  elemsPerSite: Number of tensor elements per site
  siteOffsets: Precomputed flat offsets for each lexicographic site in padded GA memory

proc vecadd[L, T](a, b: TensorFieldView[L, T]; site: int): VecAddResult[L, T] {.
    inline.}
Vector addition marker for OpenCL codegen
proc writeSiteElement[L, T](view: TensorFieldView[L, T]; site: int;
                            elementIdx: int; value: T)
Write a single element of a tensor at a specific site to device memory. Used for element-level writes like view[n] = value during CPU fallback. This is an immediate write that syncs with the device.

Templates

template newTensorFieldView[D: static[int]; R: static[int]; L, T](
    tensor: LocalTensorField[D, R, L, T]; io: IOKind): TensorFieldView[L, T]
Create tensor field view from local tensor field. Uses AoSoA layout on device for SIMD-friendly access patterns.
template newTensorFieldView[D: static[int]; R: static[int]; L, T](
    tensor: TensorField[D, R, L, T]; io: IOKind): TensorFieldView[L, T]
Create tensor field view from global tensor field

Exports

newLatticeStencil, newLatticeStencil, UseOpenCL, LatticeStencil, newLatticeStencil, nLanes, Y, pathToStencil, addPoint, fwd, ==, localToPadded, newStencilView, backward, hash, nPoints, newLatticeStencil, idx, Z, bwd, $, $, directions, $, shift, allDirections, neighborOffset, offsetBufferSize, rectanglePath, forEachNeighbor, addPoint, nearestNeighborStencil, paddedToLocal, bwd, forwardStencil, VectorWidth, neighborSimd, Stencil, nearestNeighborStencil, StencilPattern, laplacianStencil, Direction, nOuter, SignedDirection, backwardStencil, StencilEntry, StencilPoint, neighbor, step, newLatticeStencil, T, plaquettePath, StencilBackend, paddedToLocal, forwardStencil, fwd, backwardStencil, forward, points, laplacianStencil, getEntry, newStencilPattern, UseSYCL, newStencilPoint, X, getOffsetBuffer, $, localToPadded, shift, StencilShift, StencilView, nSites, newStencil, neighbors, newStencilPoint, addPoint, isGhostNeighbor, PathStep, sites, UseOpenMP, SimdLatticeLayout, $, simdLanes, lexicographicToCoords, outerInnerToLocal, computeStrides, newSimdLatticeLayout, localToOuterInner, coordsToLexicographic, newSimdLatticeLayout, vectorGroups, validateSimdGrid, computeLocalGeom, generateCoordTable, aosoaIndexFromLocal, aosoaIndex, computeProduct, Tcommand_type, DEVICE_VENDOR_ID, buildProgram, fmod, []=, COMMAND_COPY_BUFFER, createCommandQueue, getImageInfo, DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE, KERNEL_ARG_ADDRESS_CONSTANT, KERNEL_ARG_ADDRESS_GLOBAL, enqueueFillImage, KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, PROGRAM_BINARY_TYPE_LIBRARY, FILTER_NEAREST, GLOBAL, KERNEL_ARG_TYPE_RESTRICT, PROGRAM_DEVICES, createContextFromType, EVENT_COMMAND_QUEUE, Tcommand_queue_info, Tmem_object_type, DEVICE_IMAGE_MAX_BUFFER_SIZE, DEVICE_PARTITION_BY_COUNTS, write_mem_fence_impl, EVENT_COMMAND_EXECUTION_STATUS, enqueueMapBuffer, Tdevice_exec_capabilities, round, CL_SIGNED_INT32, PLATFORM_NAME, COMPLETE, atomic_and_impl, finalizeCL, PLATFORM_PROFILE, Tbuffer_region, MEM_READ_WRITE, 
Tmem_info, MEM_OBJECT_IMAGE1D, KERNEL_CONTEXT, PROGRAM_BINARY_TYPE_EXECUTABLE, DEVICE_MAX_COMPUTE_UNITS, COMMAND_WRITE_BUFFER_RECT, DEVICE_PARTITION_TYPE, getEventInfo, enqueueTask, enqueueReadBufferRect, createImage3D, release, native_tan_impl, getCommandQueueInfo, retainMemObject, MEM_OBJECT_IMAGE1D_BUFFER, Tplatform_info, createProgramWithSource, PROGRAM_KERNEL_NAMES, createUserEvent, native_exp, DEVICE_AFFINITY_DOMAIN_L2_CACHE, PROGRAM_BINARY_TYPE, CONTEXT_PROPERTIES, setKernelArg, CL_A, DEVICE_PROFILE, MEM_HOST_WRITE_ONLY, CL_SIGNED_INT16, clamp, Pplatform_id, enqueueReadImage, args, DEVICE_MAX_CONSTANT_ARGS, MAP_READ, [], native_sin, PROGRAM_BINARY_TYPE_COMPILED_OBJECT, MEM_HOST_READ_ONLY, EVENT_CONTEXT, run2d, COMMAND_WRITE_BUFFER, native_exp2, Timage_format, name, global, DEVICE_IMAGE3D_MAX_DEPTH, PROGRAM_NUM_KERNELS, native_powr_impl, DEVICE_MAX_WORK_ITEM_DIMENSIONS, COMMAND_MAP_BUFFER, getProgramBuildInfo, enqueueCopyBufferRect, setArg, retainProgram, DEVICE_PLATFORM, ADDRESS_MIRRORED_REPEAT, gpuBuffer, waitForEvents, MEM_OBJECT_IMAGE2D, enqueueUnmapMemObject, LocalBuffer, MEM_ALLOC_HOST_PTR, Tkernel_arg_info, KERNEL_ARG_TYPE_CONST, getDevices, Timage_info, CL_RG, DEVICE_TYPE_ALL, finish, createProgramWithBinary, MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED, DEPTH_STENCIL, Dim, IMAGE_BUFFER, CL_RGx, releaseContext, atomic_xor, DEVICE_PARTITION_MAX_SUB_DEVICES, DEVICE_PARENT_DEVICE, enqueueMapImage, multipleDeviceDefaults, EVENT_COMMAND_TYPE, DEVICE_MAX_READ_IMAGE_ARGS, createProgram, DEVICE_EXECUTION_CAPABILITIES, mem_fence, DEVICE_NATIVE_VECTOR_WIDTH_INT, DEVICE_PARTITION_BY_AFFINITY_DOMAIN, QUEUE_PROFILING_ENABLE, getKernelArgInfo, CL_TRUE, DEVICE_GLOBAL_MEM_CACHE_TYPE, KERNEL_PRIVATE_MEM_SIZE, FP_ROUND_TO_ZERO, run2d, KERNEL_ARG_ADDRESS_QUALIFIER, PROGRAM_BUILD_STATUS, trunc, COMMAND_TASK, COMMAND_READ_BUFFER_RECT, ceil, retainDevice, createKernelsInProgram, DEVICE_PREFERRED_VECTOR_WIDTH_LONG, Tkernel_work_group_info, getSupportedImageFormats, 
Tcontext_properties, releaseDevice, buildOn, native_recip_impl, MEM_SIZE, DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE, native_rsqrt, releaseProgram, MEM_USE_HOST_PTR, globalMemory, Tkernel_arg_type_qualifier, PROGRAM_REFERENCE_COUNT, createImage2D, Tprogram_info, MAP_WRITE, DEVICE_IMAGE2D_MAX_HEIGHT, IMAGE_FORMAT, write, BLOCKING, NON_BLOCKING, KERNEL_ARG_ACCESS_READ_ONLY, Tmem_flags, DEVICE_IMAGE3D_MAX_WIDTH, Tbuild_status, native_log2_impl, release, EXEC_NATIVE_KERNEL, IMAGE_WIDTH, setArg, atomic_cmpxchg_impl, buildOn, DEVICE_VENDOR, CLK_LOCAL_MEM_FENCE, native_cos, check, CL_BGRA, fmin, FP_ROUND_TO_NEAREST, getContextInfo, KERNEL_PROGRAM, CL_SIGNED_INT8, COMMAND_MAP_IMAGE, CL_RGBx, getKernelWorkGroupInfo, MEM_WRITE_ONLY, RUNNING, BUILD_SUCCESS, DEVICE_LOCAL_MEM_SIZE, DEVICE_QUEUE_PROPERTIES, IMAGE_ARRAY_SIZE, setUserEventStatus, KERNEL_WORK_GROUP_SIZE, PROGRAM_BUILD_LOG, acos, Tchannel_order, Tdevice_partition_property, read_mem_fence, Tcommand_queue_properties, createSampler, IMAGE_NUM_MIP_LEVELS, raiseEOpenCL, enqueueCopyBufferToImage, IMAGE_ELEMENT_SIZE, native_sqrt_impl, flush, DEVICE_TYPE_GPU, COMMAND_MARKER, createAndBuild, Pkernel, FP_SOFT_FLOAT, DEVICE_AFFINITY_DOMAIN_L3_CACHE, atomic_max, MEM_OBJECT_BUFFER, DEVICE_PARTITION_PROPERTIES, get_work_dim, createImage, Tsampler_info, ADDRESS_CLAMP_TO_EDGE, DEVICE_NATIVE_VECTOR_WIDTH_HALF, enqueueBarrier, tanh, DEVICE_AFFINITY_DOMAIN_NUMA, Tbool, createKernel, Tprofiling_info, SAMPLER_FILTER_MODE, openclDefaults, read, COMMAND_COPY_BUFFER_RECT, rsqrt_impl, clamp_impl, Tdevice_info, MAP_WRITE_INVALIDATE_REGION, write, native_recip, getPlatformByName, DEVICE_PREFERRED_VECTOR_WIDTH_HALF, run, Tmap_flags, DEVICE_IMAGE_MAX_ARRAY_SIZE, retainContext, release, setMemObjectDestructorCallback, getPlatformIDs, kernel, floor, SAMPLER_CONTEXT, KERNEL_ARG_NAME, BUILD_ERROR, TUserCb, COMMAND_WRITE_IMAGE, VERSION_1_1, DEVICE_TYPE_ACCELERATOR, PLATFORM_VENDOR, MEM_HOST_PTR, atomic_xchg, release, DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT, 
native_sin_impl, TEventCb, COMMAND_USER, getDeviceIDs, atomic_xor_impl, BUILD_IN_PROGRESS, CL_Rx, atomic_or_impl, run3d, COMMAND_MIGRATE_MEM_OBJECTS, maxWorkGroups, setArg, Pdevice_id, GpuBuffer, native_powr, DEVICE_IMAGE_PITCH_ALIGNMENT, TClResult, DEVICE_MAX_PARAMETER_SIZE, enqueueCopyImage, QUEUE_DEVICE, CL_UNORM_INT_101010, fmax, DEVICE_PREFERRED_INTEROP_USER_SYNC, MEM_OBJECT_IMAGE2D_ARRAY, sinh, release, KERNEL_ARG_ACCESS_READ_WRITE, MEM_TYPE, barrier, DEVICE_MAX_WORK_GROUP_SIZE, CONTEXT_PLATFORM, atomic_dec, DRIVER_VERSION, CL_SNORM_INT16, Tdevice_affinity_domain, Tevent_info, BUFFER_CREATE_TYPE_REGION, Tbuffer_create_type, VERSION_1_2, eachImpl, atomic_add_impl, CONTEXT_NUM_DEVICES, Tdevice_fp_config, DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, PROGRAM_SOURCE, FP_INF_NAN, release, SUBMITTED, MEM_FLAGS, BUILD_NONE, MIGRATE_MEM_OBJECT_HOST, CONTEXT_DEVICES, enqueueMigrateMemObjects, PROGRAM_BUILD_OPTIONS, buildErrors, DEVICE_TYPE_CPU, DEVICE_IMAGE_SUPPORT, Tdevice_local_mem_type, CL_FLOAT, createProgramWithBuiltInKernels, Tkernel_arg_address_qualifier, createSubDevices, READ_WRITE_CACHE, DEPTH, DEVICE_OPENCL_C_VERSION, DEVICE_PREFERRED_VECTOR_WIDTH_INT, releaseSampler, native_sqrt, QUEUE_REFERENCE_COUNT, native_log10, DEVICE_ERROR_CORRECTION_SUPPORT, log10, INTENSITY, atomic_or, DEVICE_GLOBAL_MEM_CACHE_SIZE, native_log, atomic_inc_impl, exp, COMMAND_FILL_IMAGE, DEVICE_EXTENSIONS, FP_DENORM, KERNEL_ARG_ADDRESS_LOCAL, VERSION_1_0, KERNEL_ARG_TYPE_NONE, native_cos_impl, DEVICE_AVAILABLE, DEVICE_SINGLE_FP_CONFIG, KERNEL_NUM_ARGS, linkProgram, DEVICE_TYPE_DEFAULT, PROGRAM_NUM_DEVICES, atan, createBuffer, barrier_impl, CL_UNORM_INT16, IMAGE_NUM_SAMPLES, getExtensionFunctionAddressForPlatform, retainKernel, IMAGE_SLICE_PITCH, asin, TCreateContextCb, sin, MEM_HOST_NO_ACCESS, CL_RA, atan2, QUEUE_PROPERTIES, DEVICE_PARTITION_EQUALLY, run3d, enqueueReadBuffer, enqueueBarrierWithWaitList, SAMPLER_ADDRESSING_MODE, Tkernel_arg_access_qualifier, DEVICE_MAX_CLOCK_FREQUENCY, 
enqueueWriteBufferRect, setArg, CL_UNSIGNED_INT8, COMMAND_UNMAP_MEM_OBJECT, log, COMMAND_COPY_IMAGE_TO_BUFFER, CLK_GLOBAL_MEM_FENCE, SAMPLER_NORMALIZED_COORDS, Taddressing_mode, Pprogram, pow, native_log_impl, [], unloadCompiler, KERNEL_GLOBAL_WORK_SIZE, compileProgram, createContext, SAMPLER_REFERENCE_COUNT, DEVICE_IMAGE3D_MAX_HEIGHT, singleDeviceDefaults, get_group_id, ADDRESS_REPEAT, atomic_min, Tdevice_mem_cache_type, DEVICE_NATIVE_VECTOR_WIDTH_SHORT, COMMAND_RELEASE_GL_OBJECTS, Pevent, DEVICE_TYPE_CUSTOM, atomic_sub_impl, createAndBuild, releaseKernel, unloadPlatformCompiler, createAndBuildBinary, get_num_groups, atomic_max_impl, atomic_sub, atomic_cmpxchg, each, DEVICE_REFERENCE_COUNT, DEVICE_MEM_BASE_ADDR_ALIGN, name, CL_UNORM_INT24, run, CL_UNORM_SHORT_555, PLATFORM_EXTENSIONS, Tprogram_build_info, FP_ROUND_TO_INF, PROGRAM_CONTEXT, CL_SNORM_INT8, ElementType, COMMAND_ACQUIRE_GL_OBJECTS, Tchannel_type, KERNEL_REFERENCE_COUNT, DEVICE_MAX_WORK_ITEM_SIZES, CL_ARGB, DEVICE_MAX_CONSTANT_BUFFER_SIZE, DEVICE_PREFERRED_VECTOR_WIDTH_CHAR, CL_R, DEVICE_BUILT_IN_KERNELS, NONE, getPlatformInfo, write_mem_fence, initCL, CL_UNSIGNED_INT16, TDeviceType, EXEC_KERNEL, DEVICE_VERSION, enqueueWriteBuffer, enqueueNDRangeKernel, DEVICE_ENDIAN_LITTLE, setArg, ADDRESS_NONE, get_global_id, TProgramCb, enqueueCopyBuffer, getEventProfilingInfo, KERNEL_COMPILE_WORK_GROUP_SIZE, UseWorkGroups, COMMAND_READ_BUFFER, setEventCallback, log2, COMMAND_NDRANGE_KERNEL, atomic_min_impl, DEVICE_AFFINITY_DOMAIN_L4_CACHE, cosh, releaseEvent, Tcontext_info, rsqrt, buffer, Tfilter_mode, KERNEL_ARG_TYPE_VOLATILE, DebugKernels, IMAGE_HEIGHT, COMMAND_BARRIER, atomic_inc, DEVICE_NAME, Tprogram_binary_type, COMMAND_FILL_BUFFER, native_exp_impl, enqueueWaitForEvents, createKernel, retainEvent, enqueueWriteImage, localMemory, Tkernel_info, CL_FALSE, PLATFORM_VERSION, tan, DEVICE_MAX_MEM_ALLOC_SIZE, getSamplerInfo, retainCommandQueue, KERNEL_ARG_ADDRESS_PRIVATE, CL_HALF_FLOAT, CONTEXT_INTEROP_USER_SYNC, 
releaseMemObject, getProgramInfo, MEM_MAP_COUNT, Timage_desc, COMMAND_COPY_BUFFER_TO_IMAGE, KERNEL_ARG_ACCESS_WRITE_ONLY, fma, KERNEL_ARG_TYPE_QUALIFIER, bufferLike, device, DEVICE_GLOBAL_MEM_SIZE, getKernelInfo, DEVICE_ADDRESS_BITS, createContext, gpuBufferLike, exp2, DEVICE_NATIVE_VECTOR_WIDTH_CHAR, maxWorkItems, DEVICE_NATIVE_VECTOR_WIDTH_LONG, DEVICE_IMAGE2D_MAX_WIDTH, atomic_add, DEVICE_AFFINITY_DOMAIN_L1_CACHE, get_global_size, QUEUED, MEM_ASSOCIATED_MEMOBJECT, EVENT_REFERENCE_COUNT, Tmem_migration_flags, DEVICE_PARTITION_BY_COUNTS_LIST_END, DEVICE_MAX_WRITE_IMAGE_ARGS, MEM_CONTEXT, write, PROGRAM_BINARIES, CL_UNORM_SHORT_565, createSubBuffer, IMAGE_DEPTH, CL_UNSIGNED_INT32, atomic_and, firstPlatform, Psampler, DEVICE_PARTITION_AFFINITY_DOMAIN, read, DEVICE_PROFILING_TIMER_RESOLUTION, IMAGE_ROW_PITCH, sqrt, EOpenCL, TMemObjectDestructorCb, DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE, mem_fence_impl, QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, CONTEXT_REFERENCE_COUNT, FILTER_LINEAR, constant, MEM_COPY_HOST_PTR, CL_RGBA, DEVICE_GLOBAL_MEM_CACHELINE_SIZE, COMMAND_READ_IMAGE, DEVICE_MAX_SAMPLERS, native_tan, MEM_OBJECT_IMAGE3D, getDeviceInfo, DEVICE_COMPILER_AVAILABLE, DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, releaseCommandQueue, DEVICE_LINKER_AVAILABLE, DEVICE_NATIVE_VECTOR_WIDTH_FLOAT, version, KERNEL_FUNCTION_NAME, commandQueueFor, Pcommand_queue, QUEUE_CONTEXT, createProgramBinary, Tbitfield, get_local_size, KERNEL_ARG_ACCESS_QUALIFIER, DEVICE_LOCAL_MEM_TYPE, native_log10_impl, enqueueNativeKernel, KERNEL_ARG_ACCESS_NONE, enqueueCopyImageToBuffer, atomic_xchg_impl, enqueueMarker, MEM_OFFSET, DEVICE_PRINTF_BUFFER_SIZE, ADDRESS_CLAMP, retainSampler, atomic_dec_impl, get_local_id, COMMAND_COPY_IMAGE, getExtensionFunctionAddress, PROGRAM_BINARY_TYPE_NONE, enqueueMarkerWithWaitList, read_mem_fence_impl, KERNEL_ATTRIBUTES, MEM_READ_ONLY, DEVICE_HOST_UNIFIED_MEMORY, Pmem, FP_FMA, native_log2, FP_CORRECTLY_ROUNDED_DIVIDE_SQRT, COMMAND_NATIVE_KERNEL, DEVICE_TYPE, oclName, VectorWidth, 
native_rsqrt_impl, getMemObjectInfo, PROGRAM_BINARY_SIZES, CL_UNORM_INT8, enqueueFillBuffer, read, READ_ONLY_CACHE, get_global_offset, DEVICE_MIN_DATA_TYPE_ALIGN_SIZE, LUMINANCE, []=, native_exp2_impl, fabs, cos, MEM_REFERENCE_COUNT, KERNEL_LOCAL_MEM_SIZE, DEVICE_DOUBLE_FP_CONFIG, MEM_OBJECT_IMAGE1D_ARRAY, local, CL_RGB, Pcontext, KERNEL_ARG_TYPE_NAME, LOCAL, []=, *, MatVecResult, Vec, +, RuntimeSiteData, MatMulResult, *, LocalSiteProxy, *, ScalarMulResult, LocalScalarMulResult, Mat4d, Vec3, $, LocalScalarAddResult, Vec2f, $, isVec, -, [], storageType, LocalMulResult, Vec2, *, []=, [], -, *, numElements, -, -, matRows, Vec4f, +, TensorSiteProxy, -, []=, -, Vec4d, *, Mat2, Mat3f, *, Mat2d, +, *, TensorElementProxy, Mat4, elementType, transpose, *, isSiteTensor, []=, vecSize, +, +, elementType, UseOpenMP, []=, +, *, identity, matCols, VecAddResult, Vec2d, +, +, +, [], trace, $, ScalarAddResult, Vec3f, -, -, MatAddResult, -, Mat3d, +, +, Mat2f, LocalAddResult, isMat, dot, []=, -, []=, UseSycl, [], numElements, Vec3d, Mat, []=, +, *, -, *, +, Mat4f, +, *, +, Mat3, [], -, +, [], *, *, *, [], Vec4, []