Data compiled by Clamchowder from various sources. Thanks to Cheese, Luma, Sylvie, Titanic, Karbin, Sask, and many others from Cheese's Discord for testing. Google Charts is used
for graphing.
Some things to keep in mind:
Be careful about graphing CPUs and GPUs together. They do different things.
Refer to the page source for raw data.
Some data is incomplete, as the test was changed to add more data points over time. Where there's a gap, the average of the two nearest data points is shown.
Notes about data:
Vulkan data was collected using Nemes's test. She did not use a random access pattern. Instead, her test
tries to spill out of cache cleanly.
An OpenCL-based test was used to gather other GPU latency data, using a full random access pattern.
CPU data was gathered using a test written in C, also using a full random access pattern.
OpenCL/CPU tests use current = arr[current], which means complex address generation is rolled
into the measured latency. Expect CPU latencies to be 1 cycle higher than the minimum. We've recently added a test mode that uses hand-written assembly to ensure simple addressing, and we'll be updating data for CPUs that have that complex addressing penalty as we get around to re-testing them.
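As a rough illustration, here is a minimal sketch of that pointer chasing pattern in plain C. This is not the actual test code; the region size, iteration count, and chain setup are arbitrary assumptions made for the example.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Example region size only; the real test sweeps many sizes. */
    size_t count = (32 * 1024 * 1024) / sizeof(uint32_t);
    uint32_t *arr = malloc(count * sizeof(uint32_t));
    if (!arr) return 1;

    /* Sattolo's algorithm: build a single random cycle so the chain
       visits every element before repeating. */
    for (size_t i = 0; i < count; i++) arr[i] = (uint32_t)i;
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i) */
        uint32_t tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }

    /* Chase the chain. Each load's address depends on the previous load,
       so accesses can't overlap and loop time is dominated by load latency.
       Note the indexed addressing (base + scaled index) mentioned above. */
    uint64_t iterations = 200000000ULL;
    uint32_t current = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iterations; i++) current = arr[current];
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    /* Print the final index so the compiler can't delete the loop. */
    printf("%.2f ns per access (last index %u)\n", ns / iterations, (unsigned)current);
    free(arr);
    return 0;
}

Dividing total loop time by the iteration count gives an approximate load-to-use latency for whatever level of the memory hierarchy the region size lands in.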
Huge pages results use VirtualAlloc with MEM_LARGE_PAGES on Windows, or mmap with MAP_HUGETLB on Linux, which usually gives 2 MB page sizes
(the granularity of virtual to physical address translation). That reduces TLB miss penalties, and
for most CPUs, eliminates them in cache-sized regions. So, huge pages results make raw cache latency easier to see. But they're not representative of the latency experienced by most applications, which use 4 KB pages.
Unless otherwise noted, latency results represent access latency using the scalar datapath on AMD GPUs that have separate vector and scalar datapaths. That is, all GCN-based and newer AMD GPUs.
Vector access latencies are tested by running two threads in a workgroup that both always follow the pointer chasing chain in sync, though the compiler can't determine that. On Terascale and most Nvidia
architectures, that vector latency workaround measures the same latency as the regular test.
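A hedged sketch of that workaround, written as an illustrative OpenCL C kernel: this is not the actual test kernel, and the vector_latency name, the start0/start1 parameters, and the two-thread launch are assumptions made for the example.

/* Illustrative OpenCL C kernel, not the actual test kernel. A workgroup
   of two threads is launched, and the host is assumed to fill arr so both
   starting indices land on the same pointer chasing chain. The compiler
   can't prove the loaded index is uniform across the wavefront, so it
   has to use vector (per-lane) loads rather than the scalar datapath. */
__kernel void vector_latency(__global const uint *arr,
                             __global uint *result,
                             uint iterations,
                             uint start0,
                             uint start1)
{
    uint lid = (uint)get_local_id(0);
    uint current = (lid == 0) ? start0 : start1;  /* same chain in practice */
    for (uint i = 0; i < iterations; i++)
        current = arr[current];
    result[lid] = current;  /* keep the dependent loads from being optimized out */
}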