Data compiled by Clamchowder from various sources. Thanks to Cheese, Luma, Sylvie, Titanic, Karbin, Sask, and many others from Cheese's Discord for testing. Google Charts is used
for graphing.
Some things to keep in mind:
Be careful about graphing CPUs and GPUs together. They do different things.
Refer to the page source for raw data.
Some data is incomplete, as the test was changed to add more data points over time. Where there's a gap, the average of the two nearest data points is shown.
Notes about data:
Vulkan data was collected using Nemes's test. She did not use a random access pattern. Instead, her test
tries to spill out of cache cleanly.
An OpenCL-based test was used to gather other GPU latency data, using a full random access pattern.
CPU data was gathered using a test written in C, also using a full random access pattern.
OpenCL/CPU tests use current = arr[current], which means complex address generation is rolled
into the measured latency. Expect CPU latencies to be 1 cycle higher than the minimum. We've recently added a test mode that uses hand-written assembly to ensure simple addressing, and we'll be updating data for CPUs that have that complex addressing penalty as we get around to re-testing them.
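As a rough illustration, here is a minimal sketch of that pointer chasing pattern in plain C. This is not the actual test code; the region size, iteration count, and chain setup are arbitrary assumptions made for the example.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Example region size only; the real test sweeps many sizes. */
    size_t count = (32 * 1024 * 1024) / sizeof(uint32_t);
    uint32_t *arr = malloc(count * sizeof(uint32_t));
    if (!arr) return 1;

    /* Sattolo's algorithm: build a single random cycle so the chain
       visits every element before repeating. */
    for (size_t i = 0; i < count; i++) arr[i] = (uint32_t)i;
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i) */
        uint32_t tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }

    /* Chase the chain. Each load's address depends on the previous load,
       so accesses can't overlap and loop time is dominated by load latency.
       Note the indexed addressing (base + scaled index) mentioned above. */
    uint64_t iterations = 200000000ULL;
    uint32_t current = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iterations; i++) current = arr[current];
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    /* Print the final index so the compiler can't delete the loop. */
    printf("%.2f ns per access (last index %u)\n", ns / iterations, (unsigned)current);
    free(arr);
    return 0;
}

Dividing total loop time by the iteration count gives an approximate load-to-use latency for whatever level of the memory hierarchy the region size lands in.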
Huge pages results use VirtualAlloc with MEM_LARGE_PAGES on Windows, or mmap with MAP_HUGETLB on Linux, which usually gives 2 MB page sizes
(the granularity of virtual to physical address translation). That reduces TLB miss penalties, and
for most CPUs, eliminates them in cache-sized regions. So, huge pages results make raw cache latency easier to see. But they're not representative of the latency experienced by most applications, which use 4 KB pages.
Unless otherwise noted, latency results represent access latency using the scalar datapath on AMD GPUs that have separate vector and scalar datapaths. That is, all GCN-based and newer AMD GPUs.
Vector access latencies are tested by running two threads in a workgroup that both always follow the pointer chasing chain in sync, though the compiler can't determine that. On Terascale and most Nvidia
architectures, that vector latency workaround measures the same latency as the regular test.
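A hedged sketch of that workaround, written as an illustrative OpenCL C kernel: this is not the actual test kernel, and the vector_latency name, the start0/start1 parameters, and the two-thread launch are assumptions made for the example.

/* Illustrative OpenCL C kernel, not the actual test kernel. A workgroup
   of two threads is launched, and the host is assumed to fill arr so both
   starting indices land on the same pointer chasing chain. The compiler
   can't prove the loaded index is uniform across the wavefront, so it
   has to use vector (per-lane) loads rather than the scalar datapath. */
__kernel void vector_latency(__global const uint *arr,
                             __global uint *result,
                             uint iterations,
                             uint start0,
                             uint start1)
{
    uint lid = (uint)get_local_id(0);
    uint current = (lid == 0) ? start0 : start1;  /* same chain in practice */
    for (uint i = 0; i < iterations; i++)
        current = arr[current];
    result[lid] = current;  /* keep the dependent loads from being optimized out */
}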