Distributed Matrix Computation on Spark


Marlin: Efficient Large-Scale Distributed Matrix Computation with Spark
Rong Gu, Yun Tang
PASA Big Data Lab, Nanjing University

About Me
• Ph.D. student at Nanjing University
 – Contact: gurongwalker@gmail.com
 – GitHub: https://github.com/RongGu and https://github.com/PasaLab

Outline
• Background & Related Work
• Overview of Marlin
• Distributed Matrix Multiplication on Spark
• Evaluation
• Conclusion

Background
• Matrix computation is at the core of many massive data-intensive scientific applications [1], such as large-scale numerical analysis, data mining, and computational physics.
• In the Big Data era, as matrices grow in scale, traditional single-node systems can hardly handle the problem.
[1] G. Stewart, "The decompositional approach to matrix computation," Computing in Science & Engineering, vol. 2, no. 1, pp. 50–59, 2000.

Parallel Matrix Multiplication Algorithms
• Grid-based approach
 – Regards the processors as residing on a two- or three-dimensional grid; the computation is iterative, proceeding in several rounds.
 – Example: SUMMA [2], the most widely used parallel matrix multiplication algorithm.
 – Achieves good performance on grid- or torus-based topologies, but may not perform as well on more general topologies.
• BFS/DFS approach
 – Views the processor layout as a hierarchy rather than a grid, building on sequential recursive algorithms.
 – Example: CARMA [3].
[2] R. A. Van De Geijn and J. Watts, "SUMMA: Scalable universal matrix multiplication algorithm," Concurrency: Practice and Experience, vol. 9, no. 4, pp. 255–274, 1997.
[3] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, "Communication-optimal parallel recursive rectangular matrix multiplication," in IPDPS. IEEE, 2013, pp. 261–272.

Parallel Matrix Multiplication Algorithms (cont.)
• Demmel et al. have proved that SUMMA is communication-optimal only for certain matrix dimensions, while CARMA minimizes communication for all matrix dimensions [3].
• However, the data layout CARMA requires is quite different from that of any existing linear algebra library, so CARMA cannot work well with the widely used libraries; this limits its practical use.

Related Work (in brief)
• ScaLAPACK [4] (MPI): fast, but not easy to use and not very robust.
• HAMA [5] (MapReduce): easy to use, but not efficient.
• MadLINQ [6] (Dryad): also DAG execution, but does not exploit memory efficiently and is not open source (and Dryad appears to be discontinued).
[4] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker, "ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers," in Frontiers of Massively Parallel Computation. IEEE, 1992, pp. 120–127.
[5] S. Seo, E. J. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng, "HAMA: An efficient matrix computation with the MapReduce framework," in CloudCom. IEEE, 2010, pp. 721–726.
[6] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang, "MadLINQ: large-scale distributed matrix computation for the cloud," in EuroSys. ACM, 2012, pp. 197–210.

Overview of Marlin
(Figure: the system stack of Marlin and its related systems.)

Features of Marlin
• Native linear algebra library acceleration
 – Marlin takes a divide-and-conquer strategy to deal with large-scale matrix computation.
 – For each sub-problem, instead of performing linear algebra computations on the JVM, Marlin offloads the CPU-intensive operations to a native linear algebra library (e.g., BLAS, LAPACK, MKL).
• Fine-grained fault tolerance and ease of use
 – Achieves fine-grained fault tolerance, extended from Spark.
 – Offers developers high-level matrix computation interfaces in Scala/Java, which accelerate the development of big data applications.
• Efficient distributed matrix operations
 – Provides quite a few distributed matrix operations; this talk focuses on distributed matrix multiplication.

Distributed Matrix Multiplication on Spark
(Figure: representing large-scale matrices with Spark RDDs.)

Distributed Matrix Multiplication in Marlin
• We propose three distributed matrix multiplication algorithms, each suitable for different situations.
• Based on these, we design an adaptive model that chooses the best approach for a given problem.
• Instead of using Spark naively, we also put forward several optimization methods.

Approach 1: Block-splitting matrix multiplication
• Similar to the blocking approach in HAMA [5]: split the two original matrices into blocked matrices and execute the submatrix multiplications in parallel.
• This approach is suitable for multiplying two square matrices.

Approach 2: CARMA matrix multiplication
• When the two input matrices are not square, the above dimension-splitting method is no longer suitable.
• To solve this problem, we draw on the equivalent representation of dimension-splitting in the BFS steps of CARMA and design a similar dimension-splitting method.
(Figure: the workflow of the CARMA-approach matrix multiplication in the Spark programming model, here r = 1, s = 2, t = 2.)

Approach 3: Broadcast matrix multiplication
• If matrix B is quite small, broadcast it to each executor to avoid shuffling the large matrix A across the network.

Adaptive Approach Selection
• Based on the time cost model analyzed above, we put forward an algorithm that selects the appropriate matrix multiplication approach for two given distributed matrices.

Evaluation
Experimental Setup
• A local cluster with 17 nodes.
• Each node has two quad-core 2.4 GHz Xeon processors (16 logical cores in total), 24 GB of memory, and two 2 TB 7200 RPM SATA hard disks.
• Marlin contains three matrix multiplication approaches, denoted Marlin-Blocking (block-splitting), Marlin-CARMA, and Marlin-Broadcast.
(Figures: effects of adopting a native linear algebra library; effects of adaptive approach selection; effects of tuning the matrix split granularity; performance comparison with other systems; scalability; performance analysis.)

Conclusion
• We propose Marlin, an efficient distributed matrix computation library built on top of Spark. Three distributed matrix multiplication algorithms, suitable for different scenarios, are designed in Marlin, and an adaptive model is proposed to select the best multiplication approach.
• Marlin is currently open-sourced at https://github.com/PasaLab/marlin

Thanks! Q&A
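As a concrete illustration of Approach 1, the block-splitting scheme can be sketched in plain single-machine Python (hypothetical helper names, not Marlin's actual Spark code):

```python
# Sketch of Approach 1 (block-splitting) for two square matrices stored as
# plain lists of lists. All names here are hypothetical, for illustration only.

def matmul(a, b):
    """Naive dense multiply of an m*k matrix by a k*n matrix."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def split_blocks(mat, bs):
    """Split an n*n matrix into a grid of bs*bs blocks keyed by (block_row, block_col)."""
    n = len(mat)
    return {(bi, bj): [row[bj * bs:(bj + 1) * bs]
                       for row in mat[bi * bs:(bi + 1) * bs]]
            for bi in range(n // bs) for bj in range(n // bs)}

def block_multiply(a, b, bs):
    """C(i,j) = sum_k A(i,k) * B(k,j); each (i, j, k) partial product is an
    independent task that a cluster could execute in parallel."""
    n = len(a)
    nb = n // bs
    ab, bb = split_blocks(a, bs), split_blocks(b, bs)
    c = [[0] * n for _ in range(n)]
    for bi in range(nb):
        for bj in range(nb):
            for bk in range(nb):
                part = matmul(ab[(bi, bk)], bb[(bk, bj)])
                for i in range(bs):
                    for j in range(bs):
                        c[bi * bs + i][bj * bs + j] += part[i][j]
    return c
```

On Spark, each (i, j, k) partial product would become a task whose result is keyed by (i, j) and summed in the shuffle (e.g., with reduceByKey), with the inner block multiply offloaded to a native BLAS as described above.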
Backups

Parallel Matrix Multiplication Overview
• Convention: the computation is C = AB
 – A: m×k, B: k×n, C: m×n
• Computational complexity
 – Classic algorithm: O(mnk)
 – Newer fast algorithms: O(n^2.977); no parallel implementation yet
• Load balancing (spreading the computation evenly across the processors)
• Communication complexity*
 – P < d3/d2: 1 large dimension
 – d3/d2 < P < d2·d3/d1^2: 2 large dimensions
 – d2·d3/d1^2 < P: 3 large dimensions
* d1, d2, d3 are the smallest, middle, and largest of m, n, k; for more details see [3].
(Figure: algorithms that attain the communication lower bound in each regime: CARMA, 2D SUMMA/CARMA, 3D SUMMA/CARMA.)

2D SUMMA: Data Distribution
• This is the algorithm adopted in the PBLAS library.
• First, define a processor grid
 – Suppose there are 6 processors, laid out on a 2×3 grid.
• The matrices A, B, and C are distributed across the processors' memory according to the processor grid
 – In the blocking scheme of a matrix X, block Xij is assigned to processor Pij.

2D SUMMA: Algorithm
 For k = 0 to n/b - 1    ... b is the block size: the number of columns in A(i,k) and rows in B(k,j)
   for all i = 1 to pr in parallel    ... in this example pr = 4
     the owner of A(i,k) broadcasts it to its whole processor row
   for all j = 1 to pc in parallel    ... in this example pc = 4
     the owner of B(k,j) broadcasts it to its whole processor column
   receive A(i,k) into Acol
   receive B(k,j) into Brow
   C_myproc = C_myproc + Acol * Brow
(This slide is adapted from James Demmel, "Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication.")

The Recursive CARMA Algorithm
• The algorithm is network-oblivious and attains the optimal communication lower bound in all cases.
• Based on local blocked matrix multiplication.
• Based on the idea of recursive divide-and-conquer.
• Example: suppose m = 32, k = 8, n = 16 with 8 processors in total; see the example below.
For more details see J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, "Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication," in IPDPS. IEEE, 2013, pp. 261–272.
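CARMA's BFS splitting (halve the largest of the three dimensions m, k, n and split the processor set with it, recursing until each processor owns one local multiply) can be sketched as follows. This is a minimal sketch, not CARMA's actual implementation; note that when the shared dimension k is split, the two halves produce partial results for the same C block that must later be summed.

```python
# Sketch of CARMA's recursive BFS dimension splitting.
# Returns the local (m, k, n) problem size assigned to each processor.

def carma_split(m, k, n, procs):
    """Recursively halve the largest dimension, splitting the processor
    set in two, until each processor owns a single local multiply."""
    if procs == 1:
        return [(m, k, n)]
    if m >= k and m >= n:      # split the rows of A (and of C)
        half = carma_split(m // 2, k, n, procs // 2)
    elif n >= k:               # split the columns of B (and of C)
        half = carma_split(m, k, n // 2, procs // 2)
    else:                      # split the shared dimension k;
        half = carma_split(m, k // 2, n, procs // 2)  # partial C's must be summed
    return half + half         # both processor groups get the same shape
```

With the slide's example (m = 32, k = 8, n = 16, 8 processors), every processor ends up with a local 8×8 by 8×8 multiply.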
CARMA Example
(Figure: recursive splitting of the 32×8 by 8×16 multiplication over P = {0,1,2,...,7}: A is first split into A1, A2 with processor groups P1 = {0,1,2,3} and P2 = {4,5,6,7}; B is then split into B1, B2 with P11 = {0,1} and P12 = {2,3}; and so on, until each processor owns one local product contributing to C.)

CARMA: Data Decomposition
• A major problem with recursion-based algorithms: the data decomposition is not consistent.
• The same matrix is decomposed in different ways in different situations.
(Figure: how CARMA decomposes the data of the 32×8 by 8×16 multiplication onto 8 processors.)

The HAMA Implementation: Data Storage
• HAMA originally worked toward the functionality now covered by Spark MLlib, but since 2012 it has focused on its BSP framework.
• In HAMA's early paper, dense matrix multiplication was implemented on the MapReduce framework.
• Matrices are stored in HBase tables: the row key is the row index, and the column_vector family holds the elements in that row.

The HAMA Implementation: Algorithm
• It uses the matrix multiplication formula
 – C(i,j) = Σk A(i,k) * B(k,j)
 – The formula also holds for blocked matrices, as long as A's blocking along columns matches B's blocking along rows.
• The multiplication is implemented with two MapReduce jobs
 – Pass 1: generate the CollectionTable from the storage tables in HBase.
 – Pass 2: compute the blocked matrix products from the CollectionTable and merge them into the result matrix C.

The HAMA Implementation: Algorithm (cont.)
• Pass 1: generate the CollectionTable from the HBase tables.
• Pass 2: multiply the blocked matrices
 – Mapper: receives block(i,j)-k, performs the block multiplication A(i,k) * B(k,j), and emits the partial sum C'(i,j)-k of C(i,j).
 – Reducer: receives the partial sums of C(i,j) and accumulates C(i,j) = Σk C'(i,j)-k.
• Because Spark can express the MapReduce model, we first consider using HAMA's implementation.

CollectionTable layout:
 row key                    | matrix A        | matrix B
 block(0,0)-0               | block(0,0)      | block(0,0)
 block(0,0)-1               | block(0,1)      | block(1,0)
 ...                        | ...             | ...
 block(i,j)-k               | block(i,k)      | block(k,j)
 ...                        | ...             | ...
 block(N-1,N-1)-(N^3-1)     | block(N-1,N-1)  | block(N-1,N-1)

Performance Problem
• When the CollectionTable is generated, the large matrices A and B must each be transferred over the network in full, and the amount of data transferred is O(mnk/b), where b is the width of the matrix blocks.
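HAMA's second pass over the CollectionTable can be simulated with plain dicts, one mapper call per CollectionTable row and one reducer call per output block; the helper names here are hypothetical, chosen only to mirror the slide's description:

```python
# Sketch of HAMA's second MapReduce pass. Each CollectionTable row
# block(i,j)-k pairs A's block (i,k) with B's block (k,j); the mapper emits
# the partial product keyed by (i,j), and the reducer sums the partials.

def matmul(a, b):
    """Naive dense multiply of two blocks stored as lists of lists."""
    return [[sum(a[r][p] * b[p][c] for p in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def mapper(i, j, k, a_block, b_block):
    """Emit the partial product C'(i,j)-k for one CollectionTable row."""
    return ((i, j), matmul(a_block, b_block))

def reducer(partials):
    """Accumulate C(i,j) = sum over k of C'(i,j)-k."""
    it = iter(partials)
    acc = [row[:] for row in next(it)]
    for p in it:
        for r in range(len(acc)):
            for c in range(len(acc[0])):
                acc[r][c] += p[r][c]
    return acc

def hama_multiply(a_blocks, b_blocks, nb):
    """a_blocks[(i,k)] and b_blocks[(k,j)] are the blocked inputs;
    nb is the number of blocks per matrix side."""
    shuffled = {}
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                key, val = mapper(i, j, k, a_blocks[(i, k)], b_blocks[(k, j)])
                shuffled.setdefault(key, []).append(val)   # the shuffle phase
    return {key: reducer(vals) for key, vals in shuffled.items()}
```

The shuffle dict stands in for MapReduce's group-by-key; the O(mnk/b) cost noted above comes from materializing every (i, j, k) pairing of blocks before this pass can run.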