• 1. Memory Hierarchies and Optimizations: Case Study in Matrix Multiplication
  Kathy Yelick, yelick@cs.berkeley.edu, www.cs.berkeley.edu/~yelick/cs194f07
  9/10/2007, CS194 Lecture
• 2. Naïve Matrix Multiply
  {implements C = C + A*B}
  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
  [Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
  The algorithm has 2*n^3 = O(n^3) flops and operates on 3*n^2 words of memory.
  The reuse quotient (q = flops/word) in the algorithm is potentially as large as 2*n^3 / 3*n^2 = O(n).
• 3. Naïve Matrix Multiply
  {implements C = C + A*B}
  for i = 1 to n
    {read row i of A into fast memory}
    for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}
  [Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
• 4. Naïve Matrix Multiply
  Number of slow memory references in the unblocked matrix multiply:
    m = n^3     to read each column of B n times
      + n^2     to read each row of A once
      + 2*n^2   to read and write each element of C once
      = n^3 + 3*n^2
  So the reuse quotient is q = f / m = 2*n^3 / (n^3 + 3*n^2) ~= 2 for large n.
  This is no better than matrix-vector multiply, and far from the "best possible", which is (2/3)*n for large n.
  And this doesn't take cache lines into account.
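As a concrete sketch, the naïve loop above might look like this in C (row-major storage; `matmul_naive` is an illustrative name, not from the slides):

```c
#include <stddef.h>

/* Naive C = C + A*B for n x n row-major matrices.
   Each trip through the k loop re-reads a column of B from slow
   memory, which is what drives q down toward 2 in the analysis. */
void matmul_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double cij = C[i*n + j];          /* read C(i,j) once */
            for (size_t k = 0; k < n; k++)
                cij += A[i*n + k] * B[k*n + j];
            C[i*n + j] = cij;                 /* write C(i,j) once */
        }
}
```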
• 5. Blocked (Tiled) Matrix Multiply
  Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size.
  for i = 1 to N
    for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
        {read block A(i,k) into fast memory}
        {read block B(k,j) into fast memory}
        C(i,j) = C(i,j) + A(i,k) * B(k,j)  {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}
  [Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j) on blocks]
• 6. Blocked (Tiled) Matrix Multiply
  Recall:
    m is the amount of memory traffic between slow and fast memory
    the matrix has n x n elements, and N x N blocks each of size b x b (N = n/b)
    f is the number of floating point operations, 2*n^3 for this problem
    q = f / m is our measure of algorithm efficiency in the memory system
  m = N*n^2   read each block of B N^3 times (N^3 * b^2 = N^3 * (n/N)^2 = N*n^2)
    + N*n^2   read each block of A N^3 times
    + 2*n^2   read and write each block of C once
    = (2N + 2) * n^2
  Reuse quotient: q = f / m = 2*n^3 / ((2N + 2) * n^2) ~= n / N = b for large n
  So we can improve performance by increasing the block size b.
  This can be much faster than matrix-vector multiply (q = 2).
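A minimal C sketch of the tiled loop nest, assuming for simplicity that the block size b divides n evenly (function name and layout are illustrative):

```c
#include <stddef.h>

/* Blocked C = C + A*B with block size b (assumes b divides n).
   All three b x b blocks are meant to fit in fast memory, so each
   element is reused ~b times once loaded: q ~= b. */
void matmul_blocked(size_t n, size_t b,
                    const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i += b)
        for (size_t j = 0; j < n; j += b)
            for (size_t k = 0; k < n; k += b)
                /* b x b block multiply: C(i,j) += A(i,k) * B(k,j) */
                for (size_t ii = i; ii < i + b; ii++)
                    for (size_t jj = j; jj < j + b; jj++) {
                        double cij = C[ii*n + jj];
                        for (size_t kk = k; kk < k + b; kk++)
                            cij += A[ii*n + kk] * B[kk*n + jj];
                        C[ii*n + jj] = cij;
                    }
}
```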
• 7. Using Analysis to Understand Machines
  The blocked algorithm has computational intensity q ~= b.
  The larger the block size, the more efficient our algorithm will be.
  Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large.
  Assume your fast memory has size Mfast:
    3*b^2 <= Mfast, so q ~= b <= sqrt(Mfast/3)
  To build a machine that runs matrix multiply at 1/2 the peak arithmetic speed, we need a fast memory of size
    Mfast >= 3*b^2 ~= 3*q^2 = 3*(tm/tf)^2
  This size is reasonable for an L1 cache, but not for register sets.
  Note: the analysis assumes it is possible to schedule the instructions perfectly.
• 8. Limits to Optimizing Matrix Multiply
  The blocked algorithm changes the order in which values are accumulated into each C[i,j] by applying associativity.
    We get slightly different answers than the naïve code because of roundoff - OK.
  The previous analysis showed that the blocked algorithm has computational intensity q ~= b <= sqrt(Mfast/3).
  Aside (for those who have taken CS170 or equivalent): there is a lower bound result that says we cannot do any better than this (using only associativity).
    Theorem (Hong & Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(Mfast)).
  What if there are more levels of memory hierarchy?
• 9. Tiling for Multiple Levels
  Multiple levels: pages, caches, registers in machines.
  Each level could have its own three nested loops:
    for i, for j, for k             {matmul blocks that fit in L2 cache}
      for ii, for jj, for kk        {matmul blocks that fit in L1}
        for iii, for jjj, for kkk   {matmul blocks that fit in registers}
  But in practice we don't use so many levels, and we fully unroll the code for register "tiles".
  E.g., if these are 2x2 matrix multiplies, we can write out all 4 statements without loops:
    c[0,0] = c[0,0] + a[0,0]*b[0,0] + a[0,1]*b[1,0]
    c[0,1] = c[0,1] + a[0,0]*b[0,1] + a[0,1]*b[1,1]
    c[1,0] = c[1,0] + a[1,0]*b[0,0] + a[1,1]*b[1,0]
    c[1,1] = c[1,1] + a[1,0]*b[0,1] + a[1,1]*b[1,1]
  Many possible code variations; see Mark's notes on code generators on the HW page.
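The fully unrolled 2x2 register tile can be written as straight-line C (names illustrative; the small arrays stand in for the register variables a compiler would assign):

```c
/* Fully unrolled 2x2 register tile: C += A*B for 2x2 row-major
   blocks, written out as straight-line code with no loop overhead,
   so a, b, and c can be kept in registers. */
void matmul_2x2(const double a[4], const double b[4], double c[4]) {
    c[0] += a[0]*b[0] + a[1]*b[2];   /* c[0][0] */
    c[1] += a[0]*b[1] + a[1]*b[3];   /* c[0][1] */
    c[2] += a[2]*b[0] + a[3]*b[2];   /* c[1][0] */
    c[3] += a[2]*b[1] + a[3]*b[3];   /* c[1][1] */
}
```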
• 10. Matrix Multiply, Optimized Several Ways
  [Figure: speed of n-by-n matrix multiply on a Sun Ultra-1/170, peak = 330 MFlops]
• 11. Search Over Block Sizes
  Strategies for choosing block sizes:
    Find out the hardware parameters:
      Read the vendor manual for register count, cache sizes (not part of the ISA), page sizes (OS config)
      Measure yourself using the memory benchmark from last time
      But in practice these don't always work well; memory systems are complex, as we saw
    Manually try many variations
    Write an "autotuner" to search over the "design space" of possible implementations
      ATLAS - incorporated into Matlab
      PHiPAC - the original dense linear algebra tuning project from Berkeley; came from a homework assignment like HW2
• 12. What the Search Space Looks Like
  [Figure: a 2-D slice of a 3-D register-tile search space; the dark blue region was pruned.
   Axes: number of rows / number of columns in the register block.
   Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler]
• 13. ATLAS (DGEMM n = 500)
  ATLAS is faster than all other portable BLAS implementations, and it is comparable with the machine-specific libraries provided by the vendor.
  Source: Jack Dongarra
• 14. Recursion: Cache-Oblivious Algorithms
  The tiled algorithm requires finding a good block size.
  Cache-oblivious algorithms offer an alternative:
    Treat the n x n matrix multiply as a set of smaller problems
    Eventually, these fit in cache
  The cache-oblivious idea is used for problems other than matrix multiply. The general idea:
    Think of a recursive formulation
    If the subproblems use smaller data sets with some reuse within each data set, then a recursive order may improve performance
• 15. Cache-Oblivious Algorithms
  Divide all dimensions (AD):
    C00 = A00*B00 + A01*B10    C01 = A00*B01 + A01*B11
    C10 = A10*B00 + A11*B10    C11 = A10*B01 + A11*B11
    8-way recursive tree down to 1x1 blocks
    Gray-code order promotes reuse
    Bilardi, et al.
  Divide largest dimension (LD):
    C0 = A0*B
    C1 = A1*B
    Two-way recursive tree down to 1x1 blocks
    Frigo, Leiserson, et al.
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
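A sketch of the divide-all-dimensions (AD) recursion in C, assuming n is a power of two; `lda` is the row stride of the full matrix so quadrants can be addressed in place without copying (names are illustrative):

```c
#include <stddef.h>

/* Cache-oblivious C += A*B, dividing all dimensions (AD).
   n is assumed to be a power of two; lda is the row stride of the
   enclosing matrices so the four quadrants are just offsets. */
void matmul_rec(size_t n, size_t lda,
                const double *A, const double *B, double *C) {
    if (n == 1) { C[0] += A[0] * B[0]; return; }
    size_t h = n / 2;
    const double *A00 = A, *A01 = A + h, *A10 = A + h*lda, *A11 = A + h*lda + h;
    const double *B00 = B, *B01 = B + h, *B10 = B + h*lda, *B11 = B + h*lda + h;
    double       *C00 = C, *C01 = C + h, *C10 = C + h*lda, *C11 = C + h*lda + h;
    /* The 8-way recursive tree: C00 += A00*B00 + A01*B10, etc. */
    matmul_rec(h, lda, A00, B00, C00); matmul_rec(h, lda, A01, B10, C00);
    matmul_rec(h, lda, A00, B01, C01); matmul_rec(h, lda, A01, B11, C01);
    matmul_rec(h, lda, A10, B00, C10); matmul_rec(h, lda, A11, B10, C10);
    matmul_rec(h, lda, A10, B01, C11); matmul_rec(h, lda, A11, B11, C11);
}
```

A real implementation would cut the recursion off at a small block and call a tuned micro-kernel there, as the later slides discuss.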
• 16. Cache-Oblivious: Discussion
  Block sizes are generated dynamically at each level in the recursive call tree.
  Our experience: performance is similar; we use AD for the rest of the talk.
  [Figure: Mflop/s (up is good)]
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 17. Cache-Conscious Algorithms
  Usually iterative: nested loops.
  Implementation of blocking:
    Cache blocking achieved by loop tiling
    Register blocking also requires loop unrolling
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 18. Structure of Tiled (Cache-Conscious) Code
  [Figure: cache blocking and register blocking in the tiled loop nest]
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 19. Data Structures
  Fit the control structure better.
  Improve spatial locality: streaming, prefetching.
  Layouts: Row-major, Row-Block-Row (RBR), Morton-Z.
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 20. Data Structures: Discussion
  Morton-Z:
    Matches the recursive control structure better than RBR
    Suggests better performance for cache-oblivious (CO) code
    More complicated to implement
    In our experience the payoff is small or even negative
  We use RBR for the rest of the talk.
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 21. UltraSPARC IIIi
  Peak performance: 2 GFlops
  Memory hierarchy:
    Registers: 32
    L1 data cache: 64KB, 4-way
    L2 data cache: 1MB, 4-way
  Compilers:
    FORTRAN: Sun F95 7.1
    C: Sun C 5.5
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 22. Control Structures
  Iterative: triply nested loop.
  Recursive: down to 1 x 1 x 1.
  [Taxonomy: Outer control structure: Iterative / Recursive; Inner control structure: Statement]
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 23. Control Structures
  Recursion down to NB; unfold completely below NB to get a basic block.
  Micro-Kernel (None / Compiler): the basic block compiled with the native compiler.
    Best performance for NB = 12
    Compiler unable to use registers
    Unfolding reduces control overhead, limited by I-cache
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 24. Control Structures
  Recursion down to NB; unfold completely below NB to get a basic block.
  Micro-Kernel (Scalarized / Compiler):
    Scalarize all array references in the basic block
    Compile with the native compiler
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 25. Control Structures
  Recursion down to NB; unfold completely below NB to get a basic block.
  Micro-Kernel (Belady / BRILA):
    Perform Belady's register allocation on the basic block
    Schedule using the BRILA compiler
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 26. Control Structures
  Recursion down to NB; unfold completely below NB to get a basic block.
  Micro-Kernel (Coloring / BRILA):
    Construct a preliminary schedule
    Perform graph-coloring register allocation
    Schedule using the BRILA compiler
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 27. Control Structures
  Recursion down to MU x NU x KU.
  Iterative Micro-Kernel:
    Completely unroll the MU x NU x KU triply nested loop
    Construct a preliminary schedule
    Perform graph-coloring register allocation
    Schedule using the BRILA compiler
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 28. Control Structures
  Recursion down to NB.
  Mini-Kernel:
    NB x NB x NB triply nested loop
    Tiling for the L1 cache
    Body is the Micro-Kernel
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 29. Control Structures
  ATLAS CGw/S, ATLAS Unleashed: specialized code generator with search.
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 30. Control Structures
  [Recap of the full taxonomy: outer control structure (iterative / recursive), inner control structure (statement / recursive / iterative), micro-kernel (None / Compiler, Scalarized / Compiler, Belady / BRILA, Coloring / BRILA), mini-kernel, ATLAS CGw/S, ATLAS Unleashed]
  Slide source: Roeder, Yotov, Pingali at Cornell/UTexas
• 31. Experience
  In practice, we need to cut off the recursion.
  Implementing a high-performance cache-oblivious code is not easy:
    Careful attention to the micro-kernel and mini-kernel is needed
    Using a fully recursive approach with a highly optimized recursive micro-kernel, Pingali et al. report that they never got more than 2/3 of peak
  Issues with the cache-oblivious (recursive) approach:
    Recursive micro-kernels yield less performance than iterative ones using the same scheduling techniques
    Prefetching is needed to compete with the best code, and it is not well understood in the context of cache-oblivious codes
• 32. Basic Linear Algebra Subroutines (BLAS)
  Industry standard interface (evolving): www.netlib.org/blas, www.netlib.org/blas/blast-forum
  Vendors and others supply optimized implementations.
  History:
    BLAS1 (1970s): vector operations: dot product, saxpy (y = a*x + y), etc.
      m = 2*n, f = 2*n, q ~ 1 or less
    BLAS2 (mid 1980s): matrix-vector operations: matrix-vector multiply, etc.
      m = n^2, f = 2*n^2, q ~ 2; less overhead; somewhat faster than BLAS1
    BLAS3 (late 1980s): matrix-matrix operations: matrix-matrix multiply, etc.
      m <= 3*n^2, f = O(n^3), so q = f/m can possibly be as large as n; BLAS3 is potentially much faster than BLAS2
  Good algorithms use BLAS3 when possible (LAPACK & ScaLAPACK): see www.netlib.org/{lapack,scalapack}
  More later in the course.
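The q values quoted for each level follow directly from the slide's f and m counts; this is illustrative arithmetic (not a library API), using the slide's own formulas:

```c
/* Computational intensity q = f/m for each BLAS level, using the
   operation and traffic counts from the slide. */
double q_blas1(double n) { return (2.0*n)     / (2.0*n);     }  /* ~1: saxpy    */
double q_blas2(double n) { return (2.0*n*n)   / (n*n);       }  /* ~2: matvec   */
double q_blas3(double n) { return (2.0*n*n*n) / (3.0*n*n);   }  /* ~2n/3: matmul, grows with n */
```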
• 33. BLAS speeds on an IBM RS6000/590
  [Figure: BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors). Peak speed = 266 Mflops]
• 34. Strassen's Matrix Multiply
  The traditional algorithm (with or without tiling) has O(n^3) flops.
  Strassen discovered an algorithm with asymptotically fewer flops: O(n^2.81).
  Consider a 2x2 matrix multiply: it normally takes 8 multiplies and 4 adds; Strassen does it with 7 multiplies and 18 adds.
  Let M = [m11 m12; m21 m22] = [a11 a12; a21 a22] * [b11 b12; b21 b22]
  Let p1 = (a12 - a22) * (b21 + b22)    p5 = a11 * (b12 - b22)
      p2 = (a11 + a22) * (b11 + b22)    p6 = a22 * (b21 - b11)
      p3 = (a11 - a21) * (b11 + b12)    p7 = (a21 + a22) * b11
      p4 = (a11 + a12) * b22
  Then m11 = p1 + p2 - p4 + p6
       m12 = p4 + p5
       m21 = p6 + p7
       m22 = p2 - p3 + p5 - p7
  Extends to n x n by divide & conquer.
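The 2x2 step transcribes directly into C. This sketch computes the product of two 2x2 row-major blocks (non-accumulating; the function name is illustrative):

```c
/* Strassen's 2x2 step from the slide: 7 multiplies, 18 adds.
   Inputs are row-major 2x2 blocks: a[0]=a11, a[1]=a12, a[2]=a21, a[3]=a22. */
void strassen_2x2(const double a[4], const double b[4], double m[4]) {
    double p1 = (a[1] - a[3]) * (b[2] + b[3]);
    double p2 = (a[0] + a[3]) * (b[0] + b[3]);
    double p3 = (a[0] - a[2]) * (b[0] + b[1]);
    double p4 = (a[0] + a[1]) * b[3];
    double p5 = a[0] * (b[1] - b[3]);
    double p6 = a[3] * (b[2] - b[0]);
    double p7 = (a[2] + a[3]) * b[0];
    m[0] = p1 + p2 - p4 + p6;   /* m11 */
    m[1] = p4 + p5;             /* m12 */
    m[2] = p6 + p7;             /* m21 */
    m[3] = p2 - p3 + p5 - p7;   /* m22 */
}
```

Applied recursively to half-size blocks instead of scalars, this gives the O(n^2.81) divide-and-conquer algorithm.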
• 35. Strassen (continued)
  Asymptotically faster:
    Several times faster for large n in practice
    Cross-over depends on the machine
    Available in several libraries
    "Tuning Strassen's Matrix Multiplication for Memory Efficiency", M. S. Thottethodi, S. Chatterjee, and A. Lebeck, in Proceedings of Supercomputing '98
  Caveats:
    Needs more memory than the standard algorithm
    Can be less accurate because of roundoff error
• 36. Other Fast Matrix Multiplication Algorithms
  The current world record is O(n^2.376...) (Coppersmith & Winograd).
  Why does the Hong/Kung theorem not apply?
  Possibility of an O(n^(2+epsilon)) algorithm! (Cohn, Umans, Kleinberg, 2003)
  Fast methods (besides Strassen) may need unrealistically large n.
• 37. Optimizing in Practice
  Tiling for registers: loop unrolling, use of named "register" variables.
  Tiling for multiple levels of cache and TLB.
  Exploiting fine-grained parallelism in the processor: superscalar execution, pipelining.
  Complicated compiler interactions; hard to do by hand (but you'll try).
  Automatic optimization is an active research area:
    BeBOP: bebop.cs.berkeley.edu/
    PHiPAC: www.icsi.berkeley.edu/~bilmes/phipac (in particular tr-98-035.ps.gz)
    ATLAS: www.netlib.org/atlas
• 38. Removing False Dependencies
  Using local variables, reorder operations to remove false dependencies.
  Before (potential false read-after-write hazard between a[i] and b[i+1] if a and b alias):
    a[i] = b[i] + c;
    a[i+1] = b[i+1] * d;
  After:
    float f1 = b[i];
    float f2 = b[i+1];
    a[i] = f1 + c;
    a[i+1] = f2 * d;
  With some compilers, you can instead declare a and b unaliased, via "restrict pointers," a compiler flag, or a pragma.
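A sketch of the alternative route the slide mentions: C99 `restrict` makes the no-aliasing promise explicit, so the compiler can reorder the loads and stores itself without the manual local-variable rewrite (the function and loop shape here are illustrative, not from the slide):

```c
/* restrict promises the compiler that a and b never overlap, which
   removes the false read-after-write dependence between the store
   to a[i] and the load of b[i+1]. */
void scale_add(float *restrict a, const float *restrict b,
               float c, float d, int n) {
    for (int i = 0; i + 1 < n; i += 2) {
        a[i]     = b[i]     + c;   /* independent of the next line ... */
        a[i + 1] = b[i + 1] * d;   /* ... once aliasing is ruled out   */
    }
}
```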
• 39. Exploit Multiple Registers
  Reduce demands on memory bandwidth by pre-loading into local variables.
  Before:
    while( … ) {
      *res++ = filter[0]*signal[0] + filter[1]*signal[1] + filter[2]*signal[2];
      signal++;
    }
  After:
    float f0 = filter[0];
    float f1 = filter[1];
    float f2 = filter[2];
    while( … ) {
      *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
      signal++;
    }
  Also: register float f0 = …;
  The example is a convolution.
• 40. Minimize Pointer Updates
  Replace pointer updates for strided memory addressing with constant array offsets.
  Before:
    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;
  After:
    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;
  Pointer and array expression costs may differ; some compilers do a better job of analyzing one than the other.
• 41. Loop Unrolling
  Expose instruction-level parallelism:
    float f0 = filter[0], f1 = filter[1], f2 = filter[2];
    float s0 = signal[0], s1 = signal[1], s2 = signal[2];
    *res++ = f0*s0 + f1*s1 + f2*s2;
    do {
      signal += 3;
      s0 = signal[0];
      res[0] = f0*s1 + f1*s2 + f2*s0;
      s1 = signal[1];
      res[1] = f0*s2 + f1*s0 + f2*s1;
      s2 = signal[2];
      res[2] = f0*s0 + f1*s1 + f2*s2;
      res += 3;
    } while( … );
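A compilable variant of the same 3-tap convolution (the `conv3` name and the bounded-loop shape are illustrative): it keeps the filter taps in locals and unrolls by three; `signal` must provide n + 2 samples to produce n outputs:

```c
/* 3-tap convolution, unrolled by 3: res[i] = f0*s[i] + f1*s[i+1] + f2*s[i+2].
   n is assumed to be a multiple of 3; signal must have n + 2 samples. */
void conv3(const float filter[3], const float *signal, float *res, int n) {
    float f0 = filter[0], f1 = filter[1], f2 = filter[2];  /* taps in registers */
    for (int i = 0; i < n; i += 3) {
        float s0 = signal[i],     s1 = signal[i + 1], s2 = signal[i + 2];
        float s3 = signal[i + 3], s4 = signal[i + 4];
        /* three independent multiply-add chains the scheduler can overlap */
        res[i]     = f0*s0 + f1*s1 + f2*s2;
        res[i + 1] = f0*s1 + f1*s2 + f2*s3;
        res[i + 2] = f0*s2 + f1*s3 + f2*s4;
    }
}
```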
• 42. Expose Independent Operations
  Hide instruction latency:
    Use local variables to expose independent operations that can execute in parallel or in a pipelined fashion
    Balance the instruction mix (what functional units are available?)
  f1 = f5 * f9;
  f2 = f6 + f10;
  f3 = f7 * f11;
  f4 = f8 + f12;
• 43. Copy Optimization
  Copy input operands or blocks:
    Reduce cache conflicts
    Constant array offsets for fixed-size blocks
    Expose page-level locality
  Original matrix (numbers are addresses):
     0  1  2  3
     4  5  6  7
     8  9 10 11
    12 13 14 15
  Reorganized into 2x2 blocks (new addresses):
     0  1  4  5
     2  3  6  7
     8  9 12 13
    10 11 14 15
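A sketch of the copy step in C (names illustrative; assumes the block size b divides n): it repacks a row-major n x n matrix into contiguous b x b blocks in block-row order, so a tiled kernel sees constant offsets and unit-stride streams:

```c
#include <stddef.h>

/* Copy an n x n row-major matrix into contiguous b x b blocks,
   blocks laid out in row-major block order.  For n=4, b=2 this
   produces exactly the address remapping shown on the slide. */
void copy_to_blocks(size_t n, size_t b, const double *src, double *dst) {
    size_t out = 0;
    for (size_t bi = 0; bi < n; bi += b)           /* block row    */
        for (size_t bj = 0; bj < n; bj += b)       /* block column */
            for (size_t i = bi; i < bi + b; i++)   /* within block */
                for (size_t j = bj; j < bj + b; j++)
                    dst[out++] = src[i*n + j];
}
```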
• 44. Locality in Other Algorithms
  The performance of any algorithm is limited by q.
  In matrix multiply, we increase q by changing the computation order: increased temporal locality.
  For other algorithms and data structures, even hand-transformations are still an open problem:
    sparse matrices (reordering, blocking)
      Weekly research meetings: bebop.cs.berkeley.edu
      About to release OSKI - tuning for sparse matrix-vector multiply
    trees (B-Trees are for the disk level of the hierarchy)
    linked lists (some work done here)
• 45. Summary
  Performance programming on uniprocessors requires:
    understanding of the memory system
    understanding of fine-grained parallelism in the processor
  Simple performance models can aid understanding. Two ratios are key to efficiency (relative to peak):
    computational intensity of the algorithm: q = f/m = # floating point operations / # slow memory references
    machine balance of the memory system: tm/tf = time per slow memory reference / time per floating point operation
  Want q > tm/tf to get half of machine peak.
  Blocking (tiling) is a basic approach to increasing q.
  The techniques apply generally, but the details (e.g., block size) are architecture-dependent.
  Similar techniques are possible on other data structures and algorithms.