The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees functional portability for conforming applications and platforms, performance portability across such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as in the type, number and complexity of their computational units. To characterize and compare the OpenCL performance of existing and future devices, we propose a suite of microbenchmarks, uCLbench.
It provides programs measuring the following data points:

  1. Arithmetic Throughput
    Parallel and sequential throughput for all basic mathematical operations and many of the built-in functions defined by the OpenCL standard. Where available, native implementations (with reduced accuracy) are also measured.
  2. Memory Subsystem
    Host-to-device, device-to-device and device-to-host copy bandwidth. Streaming bandwidth for on-device address spaces. Latency of memory accesses to the global, local and constant address spaces. These benchmarks also determine the existence and size of caches.
  3. Branching Penalty
    Impact of divergent dynamic branching on device performance, particularly pronounced on GPUs.
  4. Runtime Overheads
    Kernel compilation time and queuing delays incurred when invoking kernels of varying code size.
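
The branching penalty listed in item 3 can be illustrated with a toy cost model. This is only a sketch and is not taken from uCLbench. It assumes a device that runs work-items in lockstep SIMD groups and executes every branch path taken by at least one work-item of a group, one path after the other. The function names, the group width of 32 and the cycle counts are all hypothetical values chosen for illustration.

```python
"""Toy cost model for divergent branching on a lockstep SIMD device.

Assumption (not from uCLbench): a SIMD group executes each distinct branch
path taken by any of its work-items once, with inactive lanes masked off.
"""


def simd_group_cost(paths_taken, path_cost):
    """Cost of one SIMD group: sum over the distinct paths its items take."""
    return sum(path_cost[p] for p in set(paths_taken))


def kernel_cost(path_of_item, path_cost, group_width):
    """Total cost over all SIMD groups, assumed to run one after another."""
    total = 0
    for start in range(0, len(path_of_item), group_width):
        group = path_of_item[start:start + group_width]
        total += simd_group_cost(group, path_cost)
    return total


# Hypothetical setup: 64 work-items, SIMD width 32, two paths of 10 cycles each.
N_ITEMS, WIDTH = 64, 32
PATH_COST = {0: 10, 1: 10}

uniform = [0] * N_ITEMS                                 # every item takes path 0
per_group = [(i // WIDTH) % 2 for i in range(N_ITEMS)]  # branch aligned to groups
per_item = [i % 2 for i in range(N_ITEMS)]              # neighbouring items diverge

if __name__ == "__main__":
    patterns = [("uniform", uniform), ("per_group", per_group), ("per_item", per_item)]
    for name, pattern in patterns:
        print(name, kernel_cost(pattern, PATH_COST, WIDTH))
```

Under these assumptions the uniform and group-aligned patterns both cost 20 cycles, while the per-item pattern costs 40, because each group has to execute both paths. Real devices differ in group width, scheduling and predication, which is why the penalty has to be measured per device rather than modeled.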
GitHub Repository