SIMD intrinsics and memory bus size - how does the CPU fetch all 128/256 bits in a single memory read?
Hello Forum – I have a few related questions about SIMD intrinsics. I searched online, including Stack Overflow, but did not find good answers, so I am requesting your help.
Basically I am trying to understand how a 64-bit CPU fetches all 128 bits in a single read, and what the requirements are for such an operation (see the sketch after this list for the kind of load I mean):
- Would the CPU fetch all 128 bits from memory in a single memory operation, or will it do two 64-bit reads?
- Do CPU manufacturers demand a certain memory bus size? For example, for a 64-bit CPU, would Intel require a 128-bit bus for SSE memory-bound operations?
- Are these operations dependent on memory bus size, the number of channels, and the number of memory modules?
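For concreteness, here is a minimal sketch of the kind of operation I mean (the function name is mine); the SSE2 intrinsic _mm_loadu_si128 compiles to a single 128-bit load instruction such as movdqu:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One 128-bit load expressed in source code. Whether the hardware
   satisfies it with one wide access or several narrower ones is
   exactly what the questions above are asking. */
__m128i load128(const uint8_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}
```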
Solution 1:[1]
Loads/stores don't go directly to memory (unless you use them on an uncacheable memory region). Even NT stores go into a write-combining fill buffer.
Loads/stores go between execution units and L1D cache. CPUs internally have wide data paths from cache to execution units, and from L1D to outer caches. See How can cache be that fast? on electronics.SE, about Intel IvyBridge.
e.g. IvB has 128b data paths between execution units and L1D. Haswell widened that to 256 bits. Unaligned loads/stores have full performance as long as they don't cross a cache-line boundary. Skylake-AVX512 widened that to 512 bits, so it can do two 64-byte loads and one 64-byte store in a single clock cycle (as long as the data is hot in L1D cache).
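To illustrate the cache-line-boundary point, here is a small sketch of mine (the buffer and offsets are made up for the example): with AVX, a 32-byte load is full speed when it falls entirely within one 64-byte line, but a load straddling two lines costs an extra L1D access internally:

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <stdalign.h>

alignas(64) static float buf[32];  /* 128 bytes = two 64-byte cache lines */

__m256 demo(void)
{
    /* Bytes 0..31: entirely inside the first cache line - full speed. */
    __m256 a = _mm256_loadu_ps(&buf[0]);

    /* Bytes 48..79: straddles the line boundary at byte 64, so the CPU
       performs a cache-line-split access internally. */
    __m256 b = _mm256_loadu_ps(&buf[12]);

    return _mm256_add_ps(a, b);
}
```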
AMD CPUs, including Ryzen, handle 256b vectors in 128b chunks (even in the execution units, unlike Intel CPUs after Pentium M, which are full width). Older CPUs (e.g. Pentium III and Pentium M) split 128b loads/stores (and vector ALU operations) into two 64-bit halves because their load/store execution units were only 64 bits wide.
The memory controllers are DDR2/3/4. The bus is 64 bits wide, but uses a burst mode with a burst size of 64 bytes (8 transfers of 8 bytes each; not coincidentally, the size of a cache line).
Being a "64-bit" CPU is unrelated to the width of any internal or external data buses. That terminology did get used for other CPUs in the past, but even P5 Pentium had a 64-bit data bus. (aligned 8-byte load/store is guaranteed atomic as far back as P5, e.g. x87 or MMX.) 64-bit in this case refers to the width of pointers, and of integer registers.
Further reading:
David Kanter's Haswell deep dive compares data-path widths in Haswell vs. SnB cores.
What Every Programmer Should Know About Memory (but note that much of the software-prefetch material is obsolete; modern CPUs have better HW prefetchers than Pentium 4). Still essential reading, especially if you want to understand how CPUs are connected to DDR2/3/4 memory.
Other performance links in the x86 tag wiki.
Enhanced REP MOVSB for memcpy for more about x86 memory bandwidth. Note especially that single-threaded bandwidth can be limited by max_concurrency / latency, rather than by the DRAM controller, especially on a many-core Xeon (higher latency to L3 / memory).
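To make the max_concurrency / latency limit concrete, here is a back-of-the-envelope sketch; the buffer count and latency below are illustrative assumptions of mine, not figures from the answer:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not measured values: */
    const double line_bytes  = 64.0;  /* one cache line per fill buffer */
    const double concurrency = 10.0;  /* ~10 line-fill buffers per core  */
    const double latency_ns  = 80.0;  /* load latency to DRAM            */

    /* single-core cap ~= max_concurrency * line_size / latency */
    double gbps = concurrency * line_bytes / latency_ns;  /* bytes/ns = GB/s */
    printf("single-core bandwidth cap ~= %.1f GB/s\n", gbps);
    return 0;
}
```

With those numbers the cap works out to about 8 GB/s, which is why adding DRAM channels doesn't help a single latency-bound thread.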
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
[1] Source: Stack Overflow