• 1. Memory Hierarchies and Optimizations. Kathy Yelick, yelick@cs.berkeley.edu, www.cs.berkeley.edu/~yelick/cs194f07
• 2. Motivation: Multiple cores or processors on a single system are there for performance. Many applications run well below the "peak" of the system, often under 10% of peak arithmetic performance. Optimizing the code on a single core can give as much benefit as writing it in parallel. Most of the single-processor performance loss is in the memory system; to understand this, we need to look under the hood of modern processors. A parallel programmer is also a performance programmer: understand your hardware.
• 3. Outline: Idealized and actual costs in modern processors; parallelism within single processors; memory hierarchies; use of microbenchmarks to characterize performance; case study: matrix multiplication; use of performance models to understand performance.
• 4. Outline (continued): Idealized and actual costs in modern processors; parallelism within single processors; memory hierarchies; use of microbenchmarks to characterize performance; case study: matrix multiplication; use of performance models to understand performance.
• 5. Idealized Uniprocessor Model: The processor names bytes, words, etc. in its address space; these represent integers, floats, pointers, arrays, etc. Operations include reads and writes to very fast memory called registers, and arithmetic and other logical operations on registers. Order is specified by the program, and a read returns the most recently written data. The compiler and architecture translate high-level expressions into "obvious" lower-level instructions, and the hardware executes instructions in the order specified by the compiler. Idealized cost: each operation has roughly the same cost (read, write, add, multiply, etc.). Example: A = B + C becomes: read address(B) to R1; read address(C) to R2; R3 = R1 + R2; write R3 to address(A).
• 6. Uniprocessors in the Real World: Real processors have registers and caches (small amounts of fast memory that store values of recently used or nearby data), so different memory operations can have very different costs. They have parallelism: multiple "functional units" that can run in parallel, so different orders and instruction mixes have different costs. They have pipelining, a form of parallelism, like an assembly line in a factory. Why is this your problem? In theory, compilers understand all of this and can optimize your program; in practice they don't. Even if they could optimize one algorithm, they won't know about a different algorithm that might be a much better "match" to the processor.
• 7. Outline: Idealized and actual costs in modern processors; parallelism within single processors, hidden from software (sort of): pipelining, SIMD units; memory hierarchies; use of microbenchmarks to characterize performance; case study: matrix multiplication; use of performance models to understand performance.
• 8. What is Pipelining? Dave Patterson's laundry example: 4 people doing laundry, where wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency per load. [Figure: pipelined laundry timeline for loads A-D.] In this example, sequential execution takes 4 * 90 min = 6 hours, while pipelined execution takes 30 + 4*40 + 20 min = 3.5 hours. Bandwidth = loads per hour: 4/6 loads/hour without pipelining, 4/3.5 loads/hour with pipelining, and up to 1.5 loads/hour with pipelining as the number of loads grows. Pipelining helps bandwidth but not latency (still 90 min per load). Bandwidth is limited by the slowest pipeline stage, and the potential speedup equals the number of pipeline stages.
• 9. Example: 5 Steps of the MIPS Datapath (Figure 3.4, page 134, CA:AQA 2e by Patterson and Hennessy). [Figure: pipelined datapath with stages Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, and Write Back.] Pipelining is also used within arithmetic units: a floating-point multiply may have a latency of 10 cycles but a throughput of 1 per cycle.
• 10. SIMD: Single Instruction, Multiple Data. Scalar processing (the traditional mode): one operation produces one result, X + Y. SIMD processing (with SSE / SSE2): one operation produces multiple results, e.g., a single vector add computes x3+y3, x2+y2, x1+y1, x0+y0 at once. Slide source: Alex Klimovitski & Dean Macri, Intel Corporation.
• 11. SSE / SSE2 SIMD on Intel. SSE2 data types: anything that fits into 16 bytes, e.g., 16 x bytes, 4 x floats, or 2 x doubles. Instructions perform add, multiply, etc. on all the data in this 16-byte register in parallel. Challenges: data needs to be contiguous in memory and aligned, and some instructions are needed to move data around from one part of the register to another.
• 12. What does this mean to you? In addition to SIMD extensions, the processor may have other special instructions, such as fused multiply-add (FMA): x = y + c*z is so common that some processors execute the multiply and add as a single instruction, at the same rate (bandwidth) as + or * alone. In theory, the compiler understands all of this: when compiling, it will rearrange instructions to get a good "schedule" that maximizes pipelining and uses FMAs and SIMD, working with the mix of instructions inside an inner loop or other block of code. But in practice the compiler may need your help: choose a different compiler or optimization flags, rearrange your code to make things more obvious, use special functions ("intrinsics"), or write in assembly. A small intrinsics sketch follows.
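To make the SIMD idea concrete, here is a minimal C sketch using the SSE2 intrinsics from <emmintrin.h> to compute y[i] = y[i] + c*z[i] two doubles at a time. The function name and loop structure are illustrative, not from the lecture; note that with plain SSE2 the multiply and add are still two separate instructions, since hardware FMA arrived in later instruction sets.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* y[i] += c * z[i], two doubles per SIMD operation.
       Assumes n is even; a scalar cleanup loop would handle the tail. */
    void axpy_sse2(int n, double c, const double *z, double *y)
    {
        __m128d vc = _mm_set1_pd(c);                  /* broadcast the scalar */
        for (int i = 0; i < n; i += 2) {
            __m128d vz = _mm_loadu_pd(&z[i]);         /* load 2 doubles */
            __m128d vy = _mm_loadu_pd(&y[i]);
            vy = _mm_add_pd(vy, _mm_mul_pd(vc, vz));  /* multiply, then add */
            _mm_storeu_pd(&y[i], vy);                 /* store 2 results */
        }
    }

With aligned data, _mm_load_pd / _mm_store_pd could be used instead of the unaligned variants, which is exactly the alignment challenge mentioned on slide 11.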
• 13. Peak Performance. Whenever you look at performance, it's useful to understand what you're expecting. Machine peak is the maximum number of arithmetic operations a processor (or set of them) can perform, often referred to as the "speed of light" or "guaranteed not to exceed" number. Treat integer and floating point separately, and be clear about how wide (how many bits) the operands are. If you see a number higher than peak, you have a mistake in your timers, formulas, or ...
• 14. Peak Performance Examples. Example 1: Power5 processors in Bassi at NERSC: 1.9 GHz (Gcycles/sec) with 4 double-precision floating-point ops/cycle gives 7.6 GFlop/s peak per processor (two FP units, each of which can do a fused multiply-add per cycle; that is throughput!). Each SMP has 8 processors, so 60.8 GFlop/s per SMP. Example 2: 20 of the PSI nodes (in CITRIS): 2.33 GHz with SSE2 (2-wide for 64-bit FPUs, with FMA) gives 9.32 GFlop/s per core; 2 quad-cores per node gives 74.56 GFlop/s. The first example's arithmetic is spelled out in the sketch below.
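As a sanity check on the arithmetic above, a tiny C program that reproduces the Power5 numbers; the variable names are mine, and the only inputs are the clock rate, FPU count, and flops per FMA quoted on the slide.

    #include <stdio.h>

    int main(void)
    {
        /* Power5 example: 1.9 GHz, 2 FP units, each doing one
           fused multiply-add (2 flops) per cycle. */
        double ghz = 1.9, fp_units = 2.0, flops_per_fma = 2.0;
        double peak_per_proc = ghz * fp_units * flops_per_fma;      /* GFlop/s */
        printf("peak per processor: %.1f GFlop/s\n", peak_per_proc);    /* 7.6  */
        printf("peak per 8-way SMP: %.1f GFlop/s\n", 8 * peak_per_proc);/* 60.8 */
        return 0;
    }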
• 15. Outline: Idealized and actual costs in modern processors; parallelism within single processors; memory hierarchies: temporal and spatial locality, basics of caches; use of microbenchmarks to characterize performance; case study: matrix multiplication; use of performance models to understand performance.
• 16. Memory Hierarchy. Most programs have a high degree of locality in their accesses: spatial locality (accessing things nearby previous accesses) and temporal locality (reusing an item that was previously accessed). The memory hierarchy tries to exploit locality. [Figure: memory hierarchy pyramid: processor (control, datapath, registers), on-chip cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (disk/tape); access times range roughly from 1 ns through 10 ns and 100 ns up to 10 ms and 10 s, and sizes from bytes through KB, MB, and GB up to TB.]
• 17. Processor-DRAM Gap (latency). [Figure: performance over time, 1980 to 2000, on a log scale; CPU performance ("Moore's Law") improves about 60% per year while DRAM latency improves about 7% per year, so the processor-memory performance gap grows about 50% per year.] Memory hierarchies are getting deeper: processors get faster more quickly than memory.
• 18. Approaches to Handling Memory Latency. Bandwidth has improved more than latency. Approaches to the memory latency problem: (1) eliminate memory operations by saving values in small, fast memory (cache) and reusing them; this needs temporal locality in the program. (2) Take advantage of better bandwidth by fetching a chunk of memory into small, fast memory (cache) and using the whole chunk; this needs spatial locality in the program. (3) Take advantage of better bandwidth by allowing the processor to issue multiple reads to the memory system at once; this needs concurrency in the instruction stream, e.g., loading a whole array, as in vector processors, or prefetching (see the sketch below).
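As an illustration of the third approach, a minimal sketch of software prefetching, assuming a GCC-compatible compiler that provides __builtin_prefetch; the prefetch distance PF_DIST is a made-up tuning parameter, and on many machines the hardware prefetcher already does this for simple sequential accesses.

    /* Keep several cache-line reads in flight by asking for data
       PF_DIST elements ahead of where it is used. */
    #define PF_DIST 16   /* elements ahead; hypothetical tuning parameter */

    double sum_with_prefetch(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&a[i + PF_DIST], 0, 1);  /* read, low reuse */
            s += a[i];
        }
        return s;
    }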
• 19. Little's Law. Latency vs. bandwidth: latency is physics (wire length); e.g., the network latency on the Earth Simulator is only about 2x the speed of light across the machine room. Bandwidth is cost: add more wires to increase bandwidth (an over-simplification). Little's Law: for a production system in steady state, Inventory = Throughput x Flow Time. For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency: Required concurrency = Bandwidth x Latency, the bandwidth-delay product.
• 20. Little's Law Examples. Example 1, a single processor: if the latency to memory is 50 ns and the bandwidth is 5 GB/s (0.2 ns per byte, or 12.8 ns per 64-byte cache line), the system must support 50/12.8 ~= 4 outstanding cache-line misses to stay balanced (run at bandwidth speed); an application must be able to have 4 cache-line misses in flight in parallel (without dependencies between them). Example 2, a 1000-processor system with a 1 GHz clock, 100 ns memory latency, and 100 words of memory in the data paths between CPUs and memory: main memory bandwidth is about 1000 x 100 words x 10^9/s = 10^14 words/sec, so to achieve full performance an application needs about 10^-7 x 10^14 = 10^7-way concurrency (some of which may be hidden in the instruction stream). The first example is worked out in the sketch below.
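Here is the first example worked out as a tiny C program; the inputs are the numbers on the slide (50 ns latency, 5 GB/s bandwidth, 64-byte lines), and everything else is just the bandwidth-delay arithmetic.

    #include <stdio.h>

    int main(void)
    {
        double latency_ns    = 50.0;   /* memory latency            */
        double bandwidth_GBs = 5.0;    /* = bytes per nanosecond    */
        double line_bytes    = 64.0;   /* cache line size           */

        double ns_per_line = line_bytes / bandwidth_GBs;   /* 12.8 ns */
        double concurrency = latency_ns / ns_per_line;     /* ~3.9    */
        printf("outstanding cache lines needed: %.1f\n", concurrency);
        return 0;
    }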
• 21. Cache Basics (Review). A cache is fast (expensive) memory which keeps a copy of data in main memory; it is hidden from software. A cache hit is an in-cache memory access (cheap); a cache miss is a non-cached memory access (expensive) that must go to the next, slower level of the hierarchy. Consider a tiny cache (for illustration only):
Address pattern | Data held (2 bytes)
X000 | addresses 101000 through 101001
X010 | addresses 001010 through 001011
X100 | addresses 111100 through 111101
X110 | addresses 110110 through 110111
Cache line length is the number of bytes loaded together in one entry (2 in the example above). Associativity: direct-mapped means only 1 address (line) in a given range can be in the cache; n-way means n >= 2 lines with different addresses can be stored. The sketch below shows how such a cache decomposes an address.
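Treating the toy cache above as direct-mapped (1 offset bit for the 2-byte line, 2 index bits for the 4 entries, and the remaining high bits as the tag, shown as X in the table), here is a short C sketch of how an address splits into those fields; the example address is arbitrary, not taken from the slide.

    #include <stdio.h>

    int main(void)
    {
        unsigned addr = 0x2A;                 /* example 6-bit address: 101010 */
        unsigned offset = addr & 0x1;         /* which byte within the line    */
        unsigned index  = (addr >> 1) & 0x3;  /* which of the 4 cache entries  */
        unsigned tag    = addr >> 3;          /* must match for a cache hit    */
        printf("offset=%u index=%u tag=%u\n", offset, index, tag);
        return 0;
    }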
• 22. Why Have Multiple Levels of Cache? On-chip vs. off-chip: on-chip caches are faster but limited in size. A large cache has delays: the hardware to check longer addresses in the cache takes more time, and associativity, which allows a more general set of data in the cache, also takes more time. Some examples: the Cray T3E eliminated one cache to speed up misses; IBM uses a level of cache as a "victim cache", which is cheaper. There are other levels of the memory hierarchy: registers, pages (TLB, virtual memory), and so on. And it isn't always a hierarchy.
• 23. Experimental Study of Memory (Membench). A microbenchmark for memory system performance. Basic loop: time the following (repeat many times and average): for i from 0 to L, load A[i] from memory (4 bytes). One experiment: for array A of length L from 4 KB to 8 MB by 2x, and for stride s from 4 bytes (1 word) to L/2 by 2x, time the following loop (repeat many times and average): for i from 0 to L by s, load A[i] from memory (4 bytes). A C sketch of the inner timing loop follows.
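A possible C rendering of the timed strided loop, assuming a POSIX clock_gettime timer; this is a sketch of the idea, not the actual membench code, and the real benchmark repeats the loop many times and averages.

    #include <time.h>      /* clock_gettime (POSIX) */
    #include <stddef.h>

    /* Average cost of one strided 4-byte load over an array of L_bytes. */
    double avg_load_time(int *A, long L_bytes, long stride_bytes)
    {
        long n = L_bytes / (long)sizeof(int);
        long s = stride_bytes / (long)sizeof(int);
        long iters = (n + s - 1) / s;

        volatile int sink = 0;              /* keep the loads from being optimized away */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; i += s)
            sink += A[i];                   /* one 4-byte load per iteration */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        return secs / (double)iters;        /* average cost per load */
    }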
• 24. Membench: What to Expect. Consider the average cost per load; plot one line for each array length, time vs. stride. Small stride is best: if a cache line holds 4 words, at most 1/4 of accesses miss. If the array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for long enough runs). [Figure: average cost per access vs. stride s; arrays whose total size fits in L1 show the cache hit time, while arrays larger than L1 show memory time.] The picture assumes only one level of cache, and these values have become more difficult to measure on modern processors.
• 25. Memory Hierarchy on a Sun Ultra-2i. Sun Ultra-2i, 333 MHz. [Figure: membench results, one line per array length.] Measured: L1 is 16 KB with 16-byte lines and 2-cycle (6 ns) hits; L2 is 2 MB with 64-byte lines and 12-cycle (36 ns) hits; memory is 396 ns (132 cycles); 8 KB pages, 32 TLB entries. See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details.
• 26. Memory Hierarchy on a Pentium III. Katmai processor on Millennium, 550 MHz. [Figure: membench results vs. array size.] Measured: L1 is 64 KB, about 5 ns, 4-way(?), with a 32-byte line(?); L2 is 512 KB, about 60 ns.
• 27. Memory Hierarchy on a Power3 (Seaborg). Power3, 375 MHz. [Figure: membench results vs. array size.] Measured: L1 is 32 KB with 128-byte lines and 0.5-2 cycle hits; L2 is 8 MB with 128-byte lines and 9-cycle hits; memory is 396 ns (132 cycles).
• 28. Memory Performance on Itanium 2 (CITRIS). Itanium 2, 900 MHz. [Figure: membench results vs. array size.] Measured: L1 is 32 KB with 64-byte lines and 0.34-1 cycle hits; L2 is 256 KB with 128-byte lines and 0.5-4 cycle hits; L3 is 2 MB with 128-byte lines and 3-20 cycle hits; memory is 11-60 cycles. Uses the MAPS benchmark: http://www.sdsc.edu/PMaC/MAPs/maps.html
• 29. Stanza Triad. An even smaller benchmark, for prefetching, derived from STREAM Triad. The stanza length L is the length of a unit-stride run: while i < arraylength, for each L-element stanza compute A[i] = scalar * X[i] + Y[i], then skip k elements. Pattern: (1) do L triads, (2) skip k elements, (3) do L triads, and so on, stanza after stanza. Source: Kamil et al., MSP05. A C sketch follows.
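A minimal C sketch of the stanza triad kernel described above; the parameter names (L for the stanza length, k for the skip) follow the slide, and everything else is illustrative.

    /* Do a unit-stride triad for L elements, skip k elements, repeat. */
    void stanza_triad(double *A, const double *X, const double *Y,
                      long n, long L, long k, double scalar)
    {
        long i = 0;
        while (i + L <= n) {
            for (long j = i; j < i + L; j++)   /* one unit-stride stanza */
                A[j] = scalar * X[j] + Y[j];
            i += L + k;                        /* skip k elements */
        }
    }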
• 30. Stanza Triad Results. The graph's x-axis starts at a cache line size (>= 16 bytes). If cache locality were the only thing that mattered, we would expect flat lines equal to the measured peak memory bandwidth (STREAM), as on the Pentium 3. Prefetching fetches the next cache line (pipelining) while the current one is being used; it does not "kick in" immediately, so performance depends on L.
• 31. Lessons. The actual performance of a simple program can be a complicated function of the architecture: slight changes in the architecture or program can change performance significantly. To write fast programs, you need to consider the architecture; this is true on sequential and parallel processors alike. We would like simple models to help us design efficient algorithms. We will illustrate with a common technique for improving cache performance, called blocking or tiling. Idea: use divide-and-conquer to define a subproblem that fits in registers / L1 cache / L2 cache.
• 32. Outline: Idealized and actual costs in modern processors; parallelism within single processors; memory hierarchies; use of microbenchmarks to characterize performance; case study: matrix multiplication; use of performance models to understand performance: simple cache model, warm-up: matrix-vector multiplication, case study continued next time.
• 33. Why Matrix Multiplication? It is an important kernel in scientific problems: it appears in many linear algebra algorithms and is closely related to other algorithms, e.g., transitive closure on a graph using Floyd-Warshall. The optimization ideas can be used in other problems. It is the best case for optimization payoffs, and the most-studied algorithm in parallel computing.
• 34. Administrivia. Homework 1 (with the code!) is online: there is a lot to understand but relatively little code to write, and understanding the performance results and the code is important for your grade. Homework 0 should be done. Submit should (finally) be working, but perhaps not on the CITRIS/PSI machines (use the instructional machines to submit).
• 35. Note on Matrix Storage. A matrix is a 2-D array of elements, but memory addresses are "1-D". Conventions for matrix layout: by column, or "column major" (Fortran default), where A(i,j) is at A + i + j*n; by row, or "row major" (C default), where A(i,j) is at A + i*n + j; recursive (later). We use column major for now. [Figure: a column-major matrix in memory; the blue row of the matrix is stored in red cache lines, i.e., a single row of a column-major matrix is spread across many cache lines. Figure source: Larry Carter, UCSD.] The two indexing formulas are spelled out in the sketch below.
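The two layouts written as C macros for an n-by-n matrix stored in a flat 1-D array; these just restate A + i + j*n and A + i*n + j from the slide. Walking along a row of a column-major matrix jumps by n elements per step, which is why the row crosses many cache lines in the picture.

    /* Element (i, j) of an n-by-n matrix stored in a flat array. */
    #define COLMAJOR(A, n, i, j)  ((A)[(i) + (j) * (n)])   /* Fortran default: A + i + j*n */
    #define ROWMAJOR(A, n, i, j)  ((A)[(i) * (n) + (j)])   /* C default:       A + i*n + j */

For example, summing a column of a column-major matrix touches consecutive addresses (unit stride), while summing one of its rows touches every n-th element.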
• 36. Using a Simple Model of Memory to Optimize. Assume just 2 levels in the hierarchy, fast and slow, with all data initially in slow memory. Let m = number of memory elements (words) moved between fast and slow memory, tm = time per slow memory operation, f = number of arithmetic operations, tf = time per arithmetic operation (tf << tm), and q = f/m = the average number of flops per slow memory access. The minimum possible time is f*tf, when all data is in fast memory. The actual time is f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q)). Larger q means the time is closer to the minimum f*tf. q is the computational intensity, the key to algorithm efficiency; tm/tf is the machine balance, the key to machine efficiency; q >= tm/tf is needed to get at least half of peak speed. A small numeric sketch of this model follows.
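A small numeric sketch of the model in C; the tf and tm values below are placeholders chosen only to show how q and the predicted time interact, not measurements from any machine.

    #include <stdio.h>

    int main(void)
    {
        double tf = 1e-9;      /* time per flop (hypothetical)          */
        double tm = 100e-9;    /* time per slow memory access (hyp.)    */
        double f  = 2e6;       /* flops, e.g. 2*n^2 for n=1000 matvec   */
        double m  = 1e6;       /* words moved between slow and fast mem */

        double q = f / m;                            /* computational intensity */
        double t = f * tf * (1.0 + (tm / tf) / q);   /* predicted time          */
        printf("q = %.1f, predicted time = %g s (minimum f*tf = %g s)\n",
               q, t, f * tf);
        return 0;
    }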
• 37. Warm-up: Matrix-Vector Multiplication. {implements y = y + A*x} for i = 1:n, for j = 1:n: y(i) = y(i) + A(i,j)*x(j). [Figure: row picture of the update, y(i) = y(i) + A(i,:) * x(:).]
• 38. Warm-up: Matrix-Vector Multiplication. {read x(1:n) into fast memory} {read y(1:n) into fast memory} for i = 1:n: {read row i of A into fast memory}; for j = 1:n: y(i) = y(i) + A(i,j)*x(j). {write y(1:n) back to slow memory}. m = number of slow memory references = 3n + n^2, f = number of arithmetic operations = 2n^2, so q = f/m ~= 2. Matrix-vector multiplication is limited by slow memory speed. A C version is sketched below.
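The same computation in C, with A stored row major (A[i*n + j]); keeping y(i) in a scalar mirrors the assumption that it sits in fast memory during the inner loop. This is a plain sketch, not the tuned DGEMV discussed on slide 41.

    /* y = y + A*x for an n-by-n row-major matrix A. */
    void matvec(int n, const double *A, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];                  /* keep y(i) in a register */
            for (int j = 0; j < n; j++)
                yi += A[i * n + j] * x[j];     /* 2 flops per element of A */
            y[i] = yi;
        }
    }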
• 39. Modeling Matrix-Vector Multiplication. Compute the time for an n x n = 1000 x 1000 matrix: time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q)) = 2*n^2*tf*(1 + (tm/tf)*(1/2)). For tf and tm, use data from R. Vuduc's PhD dissertation (pp. 351-353), http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf; for tm, use minimum memory latency / words per cache line. [Table: machine balance tm/tf for each machine; q must be at least this to reach half of peak speed.]
• 40. Simplifying Assumptions. What simplifying assumptions did we make in this analysis? We ignored parallelism between memory and arithmetic within the processor (sometimes the arithmetic term is dropped entirely in this type of analysis). We assumed the fast memory was large enough to hold the three vectors, which is reasonable for any level of cache but not for registers (~32 words). We assumed the cost of a fast memory access is 0, which is reasonable for registers but not necessarily for cache (1-2 cycles for L1). We assumed memory latency is constant. We could simplify even further by ignoring the memory operations on the x and y vectors: Mflop rate per element = 2 / (2*tf + tm).
• 41. Validating the Model. How well does the model predict actual performance? The actual DGEMV is the most highly optimized code for each platform. The model is sufficient to compare across machines, but it under-predicts on the most recent ones because of the latency estimate.
• 42. Summary. Details of the machine are important for performance: the processor and memory system matter, not just parallelism. Before you parallelize, make sure you are getting good serial performance. What should you expect? Use your understanding of hardware limits. There is parallelism hidden within processors: pipelining, SIMD, etc. Locality is at least as important as computation: temporal locality is reuse of data recently used; spatial locality is use of data near data recently used. Machines have memory hierarchies: it takes hundreds of cycles to read from DRAM (main memory), and caches are small, fast memories that optimize the average case. You can rearrange code and data to improve locality.