Advanced MPI: I/O and One-Sided Communication


Argonne National Laboratory is managed by The University of Chicago for the U.S. Department of Energy Advanced MPI: I/O and One-Sided Communication William Gropp, Rusty Lusk, Rob Ross, and Rajeev Thakur Mathematics and Computer Science Division 2 Table of Contents • Conway’s Game of Life – 8 – Parallel I/O and Life – 19 – Exchanging Data with RMA – 49 – Life with 2D Block-Block Decomposition – 72 • Sparse Matrix I/O – 88 – Writing Sparse Matrices – 117 • pNeo: Modeling the Human Brain – 118 • Passive Target RMA – 128 • Improving Performance – 157 – Tuning MPI-IO – 158 – Tuning RMA – 195 • Conclusions - 214 3 Outline Before Lunch • Introduction – MPI-1 Status, MPI-2 Status – C++ and Fortran90 • Life, 1D Decomposition – point-to-point – checkpoint/restart • stdout • MPI-IO • PnetCDF – RMA • fence • post/start/complete/wait After Lunch • Life, 2D Decomposition – point-to-point – RMA • Sparse Matrix I/O – CSR format – checkpoint/restart • stdout • MPI-IO • pNeo application • Passive Target RMA • Tuning – I/O tuning – RMA tuning • Conclusions 4 MPI-1 • MPI is a message-passing library interface standard. – Specification, not implementation – Library, not a language – Classical message-passing programming model • MPI was defined (1994) by a broadly-based group of parallel computer vendors, computer scientists, and applications developers. – 2-year intensive process • Implementations appeared quickly and now MPI is taken for granted as vendor-supported software on any parallel machine. • Free, portable implementations exist for clusters (MPICH, LAM, OpenMPI) and other environments (MPICH) 5 MPI-2 • Same process of definition by MPI Forum • MPI-2 is an extension of MPI – Extends the message-passing model. • Parallel I/O • Remote memory operations (one-sided) • Dynamic process management – Adds other functionality • C++ and Fortran 90 bindings – similar to original C and Fortran-77 bindings • External interfaces • Language interoperability • MPI interaction with threads 6 MPI-2 Implementation Status • Most parallel computer vendors now support MPI-2 on their machines – Except in some cases for the dynamic process management functions, which require interaction with other system software • Cluster MPIs, such as MPICH2 and LAM, support most of MPI-2 including dynamic process management • Our examples here have all been run on MPICH2 7 Our Approach in this Tutorial • Example driven – Structured data (Life) – Unstructured data (Sparse Matrix) – Unpredictable communication (pNeo) – Passive target RMA (global arrays and MPI mutex) • Show solutions that use the MPI-2 support for parallel I/O and RMA – Walk through actual code • We assume familiarity with MPI-1 8 Conway’s Game of Life • A cellular automata – Described in 1970 Scientific American – Many interesting behaviors; see: • http://www.ibiblio.org/lifepatterns/october1970.html • Program issues are very similar to those for codes that use regular meshes, such as PDE solvers – Allows us to concentrate on the MPI issues 9 Rules for Life • Matrix values A(i,j) initialized to 1 (live) or 0 (dead) • In each iteration, A(i,j) is set to – 1(live) if either • the sum of the values of its 8 neighbors is 3, or • the value was already 1 and the sum of its 8 neighbors is 2 or 3 – 0 (dead) otherwise j i j-1 j+1 i+1 i-1 10 Implementing Life • For the non-parallel version, we: – Allocate a 2D matrix to hold state • Actually two matrices, and we will swap them between steps – Initialize the matrix • Force boundaries to be “dead” • Randomly generate states inside – 
At each time step: • Calculate each new cell state based on previous cell states (including neighbors) • Store new states in second matrix • Swap new and old matrices 11 Steps in Designing the Parallel Version • Start with the “global” array as the main object – Natural for output – result we’re computing • Describe decomposition in terms of global array • Describe communication of data, still in terms of the global array • Define the “local” arrays and the communication between them by referring to the global array 12 Step 1: Description of Decomposition • By rows (1D or row-block) – Each process gets a group of adjacent rows • Later we’ll show a 2D decomposition Columns Rows 13 Step 2: Communication • “Stencil” requires read access to data from neighbor cells • We allocate extra space on each process to store neighbor cells • Use send/recv or RMA to update prior to computation 14 Step 3: Define the Local Arrays • Correspondence between the local and global array • “Global” array is an abstraction; there is no one global array allocated anywhere • Instead, we compute parts of it (the local arrays) on each process • Provide ways to output the global array by combining the values on each process (parallel I/O!) 15 Boundary Regions • In order to calculate next state of cells in edge rows, need data from adjacent rows • Need to communicate these regions at each step – First cut: use isend and irecv – Revisit with RMA later 16 Life Point-to-Point Code Walkthrough • Points to observe in the code: – Handling of command-line arguments – Allocation of local arrays – Use of a routine to implement halo exchange • Hides details of exchange mdatamatrix Allows us to use matrix[row][col] to address elements See mlife.c pp. 1-8 for code example. 17 Note: Parsing Arguments • MPI standard does not guarantee that command line arguments will be passed to all processes. – Process arguments on rank 0 – Broadcast options to others • Derived types allow one bcast to handle most args – Two ways to deal with strings • Big, fixed-size buffers • Two-step approach: size first, data second (what we do in the code) See mlife.c pp. 9-10 for code example. 18 Point-to-Point Exchange • Duplicate communicator to ensure communications do not conflict • Non-blocking sends and receives allow implementation greater flexibility in passing messages See mlife-pt2pt.c pp. 1-3 for code example. 19 Parallel I/O and Life 20 Supporting Checkpoint/Restart • For long-running applications, the cautious user checkpoints • Application-level checkpoint involves the application saving its own state – Portable! • A canonical representation is preferred – Independent of number of processes • Restarting is then possible – Canonical representation aids restarting with a different number of processes 21 Defining a Checkpoint • Need enough to restart – Header information • Size of problem (e.g. matrix dimensions) • Description of environment (e.g. input parameters) – Program state • Should represent the global (canonical) view of the data • Ideally stored in a convenient container – Single file! 
• If all processes checkpoint at once, naturally a parallel, collective operation 22 Life Checkpoint/Restart API • Define an interface for checkpoint/restart for the row-block distributed Life code • Five functions: – MLIFEIO_Init – MLIFEIO_Finalize – MLIFEIO_Checkpoint – MLIFEIO_Can_restart – MLIFEIO_Restart • All functions are collective • Once the interface is defined, we can implement it for different back-end formats 23 Life Checkpoint • MLIFEIO_Checkpoint(char *prefix, int **matrix, int rows, int cols, int iter, MPI_Info info); • Prefix is used to set filename • Matrix is a reference to the data to store • Rows, cols, and iter describe the data (header) • Info is used for tuning purposes (more later!) 24 Life Checkpoint (Fortran) • MLIFEIO_Checkpoint(prefix, matrix, rows, cols, iter, info ) character*(*) prefix integer rows, cols, iter integer matrix(rows,cols) integer info • Prefix is used to set filename • Matrix is a reference to the data to store • Rows, cols, and iter describe the data (header) • Info is used for tuning purposes (more later!) 25 stdio Life Checkpoint Code Walkthrough • Points to observe – All processes call checkpoint routine • Collective I/O from the viewpoint of the program – Interface describes the global array – Output is independent of the number of processes See mlife-io-stdout.c pp. 1-2 for code example. 26 Life stdout “checkpoint” • The first implementation is one that simply prints out the “checkpoint” in an easy-to-read format • MPI standard does not specify that all stdout will be collected in any particular way – Pass data back to rank 0 for printing – Portable! – Not scalable, but ok for the purpose of stdio See mlife-io-stdout.c pp. 3 for code example. 27 Describing Data • Lots of rows, all the same size – Rows are all allocated as one big block – Perfect for MPI_Type_vector MPI_Type_vector(count = myrows, blklen = cols, stride = cols+2, MPI_INT, &vectype); – Second type gets memory offset right MPI_Type_hindexed(count = 1, len = 1, disp = &matrix[1][1], vectype, &type); matrix[1][0..cols-1] matrix[myrows][0..cols-1] See mlife-io-stdout.c pp. 4-6 for code example. 28 Describing Data (Fortran) • Lots of rows, all the same size – Rows are all allocated as one big block – Perfect for MPI_Type_vector Call MPI_Type_vector(count = myrows, blklen = cols, stride = cols+2, MPI_INTEGER, vectype, ierr ) Matrix(1,0:cols-1) Matrix(myrows,0:cols-1) 29 Life Checkpoint/Restart Notes • MLIFEIO_Init – Duplicates communicator to avoid any collisions with other communication • MLIFEIO_Finalize – Frees the duplicated communicator • MLIFEIO_Checkpoint and _Restart – MPI_Info parameter is used for tuning I/O behavior Note: Communicator duplication may not always be necessary, but is good practice for safety See mlife-io-stdout.c pp. 1-8 for code example. 30 Parallel I/O and MPI • The stdio checkpoint routine works but is not parallel – One process is responsible for all I/O – Wouldn’t want to use this approach for real • How can we get the full benefit of a parallel file system? – We first look at how parallel I/O works in MPI – We then implement a fully parallel checkpoint routine • Because it will use the same interface, we can use it without changing the rest of the parallel life code 31 Why MPI is a Good Setting for Parallel I/O • Writing is like sending and reading is like receiving. 
• Any parallel I/O system will need: – collective operations – user-defined datatypes to describe both memory and file layout – communicators to separate application-level message passing from I/O-related message passing – non-blocking operations • I.e., lots of MPI-like machinery 32 What does Parallel I/O Mean? • At the program level: – Concurrent reads or writes from multiple processes to a common file • At the system level: – A parallel file system and hardware that support such concurrent access 33 Collective I/O and MPI • A critical optimization in parallel I/O • All processes (in the communicator) must call the collective I/O function • Allows communication of “big picture” to file system – Framework for I/O optimizations at the MPI-IO layer • Basic idea: build large blocks, so that reads/writes in I/O system will be large – Requests from different processes may be merged together – Particularly effective when the accesses of different processes are noncontiguous and interleaved Small individual requests Large collective access 34 Collective I/O Functions • MPI_File_write_at_all, etc. – _all indicates that all processes in the group specified by the communicator passed to MPI_File_open will call this function – _at indicates that the position in the file is specified as part of the call; this provides thread-safety and clearer code than using a separate “seek” call • Each process specifies only its own access information — the argument list is the same as for the non-collective functions 35 MPI-IO Life Checkpoint Code Walkthrough • Points to observe – Use of a user-defined MPI datatype to handle the local array – Use of MPI_Offset for the offset into the file • “Automatically” supports files larger than 2GB if the underlying file system supports large files – Collective I/O calls • Extra data on process 0 See mlife-io-mpiio.c pp. 1-2 for code example. 36 Life MPI-IO Checkpoint/Restart • We can map our collective checkpoint directly to a single collective MPI-IO file write: MPI_File_write_at_all – Process 0 writes a little extra (the header) • On restart, two steps are performed: – Everyone reads the number of rows and columns from the header in the file with MPI_File_read_at_all • Sometimes faster to read individually and bcast (see later example) – If they match those in current run, a second collective call used to read the actual data • Number of processors can be different See mlife-io-mpiio.c pp. 3-6 for code example. 37 Describing Header and Data • Data is described just as before • Create a struct wrapped around this to describe the header as well: – no. of rows – no. of columns – Iteration no. – data (using previous type) See mlife-io-mpiio.c pp. 7 for code example. 38 Placing Data in Checkpoint Rows Columns Iteration Global Matrix File Layout Note: We store the matrix in global, canonical order with no ghost cells. See mlife-io-mpiio.c pp. 9 for code example. 
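A minimal sketch, in C, of the collective checkpoint write just described; it is an illustration under stated assumptions, not the tutorial's mlife-io-mpiio.c. The names checkpoint_write, local, and firstrow are hypothetical, and the local buffer is assumed to hold only this process's non-ghost cells packed contiguously (the real code instead uses a derived datatype over the ghosted local array).

#include <mpi.h>

/* Hypothetical sketch: collective row-block checkpoint write.
 * "local" is assumed to hold only this process's non-ghost cells,
 * packed contiguously; firstrow is its first global row. */
void checkpoint_write(char *filename, int *local, int myrows,
                      int cols, int firstrow, int rows, int iter)
{
    MPI_File   fh;
    MPI_Offset disp;
    int        header[3], rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    if (rank == 0) {
        /* only rank 0 writes the small header: rows, cols, iteration */
        header[0] = rows;  header[1] = cols;  header[2] = iter;
        MPI_File_write_at(fh, 0, header, 3, MPI_INT,
                          MPI_STATUS_IGNORE);
    }

    /* every process writes its rows at its global offset; the
     * arithmetic is done in MPI_Offset so files > 2 GB work */
    disp = 3 * sizeof(int) + (MPI_Offset)firstrow * cols * sizeof(int);
    MPI_File_write_at_all(fh, disp, local, myrows * cols, MPI_INT,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}

On restart, the same offset arithmetic can be reused with MPI_File_read_at_all once the header has been read and checked.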
39 The Other Collective I/O Calls •MPI_File_seek •MPI_File_read_all •MPI_File_write_all •MPI_File_read_at_all •MPI_File_write_at_all •MPI_File_read_ordered •MPI_File_write_ordered combine seek and I/O for thread safety use shared file pointer like Unix I/O 40 Portable Checkpointing 41 Portable File Formats • Ad-hoc file formats – Difficult to collaborate – Cannot leverage post-processing tools • MPI provides external32 data encoding • High level I/O libraries – netCDF and HDF5 – Better solutions than external32 • Define a “container” for data – Describes contents – May be queried (self-describing) • Standard format for metadata about the file • Wide range of post-processing tools available 42 File Interoperability in MPI-IO • Users can optionally create files with a portable binary data representation • “datarep” parameter to MPI_File_set_view • native - default, same as in memory, not portable • external32 - a specific representation defined in MPI, (basically 32-bit big-endian IEEE format), portable across machines and MPI implementations • internal – implementation-defined representation providing an implementation-defined level of portability – Not used by anyone we know of… 43 Higher Level I/O Libraries • Scientific applications work with structured data and desire more self-describing file formats • netCDF and HDF5 are two popular “higher level” I/O libraries – Abstract away details of file layout – Provide standard, portable file formats – Include metadata describing contents • For parallel machines, these should be built on top of MPI-IO – HDF5 has an MPI-IO option • http://hdf.ncsa.uiuc.edu/HDF5/ 44 Parallel netCDF (PnetCDF) • (Serial) netCDF – API for accessing multi-dimensional data sets – Portable file format – Popular in both fusion and climate communities • Parallel netCDF – Very similar API to netCDF – Tuned for better performance in today’s computing environments – Retains the file format so netCDF and PnetCDF applications can share files – PnetCDF builds on top of any MPI-IO implementation ROMIOROMIO PnetCDFPnetCDF PVFS2PVFS2 Cluster IBM MPIIBM MPI PnetCDFPnetCDF GPFSGPFS IBM SP 45 I/O in netCDF and PnetCDF • (Serial) netCDF – Parallel read • All processes read the file independently • No possibility of collective optimizations – Sequential write • Parallel writes are carried out by shipping data to a single process • Just like our stdout checkpoint code • PnetCDF – Parallel read/write to shared netCDF file – Built on top of MPI-IO which utilizes optimal I/O facilities of the parallel file system and MPI-IO implementation – Allows for MPI-IO hints and datatypes for further optimization P0 P1 P2 P3 netCDF Parallel File System Parallel netCDF P0 P1 P2 P3 Parallel File System 46 Life PnetCDF Checkpoint/Restart • Third implementation of MLIFEIO interface • Stores matrix as a two-dimensional array of integers in the netCDF file format – Same canonical ordering as in MPI-IO version • Iteration number stored as an attribute 47 PnetCDF Life Checkpoint Code Walkthrough • Points to observe – Creating a netCDF file – Defining dimensions – Defining variables – Storing attributes – Discovering dimensions on restart See mlife-io-pnetcdf.c pp. 1-6 for code example. 48 Discovering Variable Dimensions • Because netCDF is self-describing, applications can inquire about data in netCDF files: err = ncmpi_inq_dimlen(ncid, dims[0], &coldimsz); • Allows us to discover the dimensions of our matrix at restart time See mlife-io-pnetcdf.c pp. 7-8 for code example. 
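A minimal sketch of the PnetCDF steps listed above, assuming the same hypothetical names (local, firstrow) and a packed, ghost-free local buffer; it is an illustration, not the tutorial's mlife-io-pnetcdf.c.

#include <mpi.h>
#include <pnetcdf.h>

/* Hypothetical sketch: PnetCDF checkpoint of the global matrix. */
void pnetcdf_checkpoint(const char *filename, int *local,
                        int myrows, int firstrow,
                        int rows, int cols, int iter)
{
    int ncid, dimids[2], varid;
    MPI_Offset start[2], count[2];

    ncmpi_create(MPI_COMM_WORLD, filename, NC_CLOBBER,
                 MPI_INFO_NULL, &ncid);

    /* define the two dimensions and the matrix variable */
    ncmpi_def_dim(ncid, "row", rows, &dimids[0]);
    ncmpi_def_dim(ncid, "col", cols, &dimids[1]);
    ncmpi_def_var(ncid, "matrix", NC_INT, 2, dimids, &varid);

    /* store the iteration number as a global attribute */
    ncmpi_put_att_int(ncid, NC_GLOBAL, "iteration", NC_INT, 1, &iter);
    ncmpi_enddef(ncid);                 /* leave define mode */

    /* each process writes its block of rows collectively */
    start[0] = firstrow;  start[1] = 0;
    count[0] = myrows;    count[1] = cols;
    ncmpi_put_vara_int_all(ncid, varid, start, count, local);

    ncmpi_close(ncid);
}

Restart can then use ncmpi_open, ncmpi_inq_dimlen (as shown above), and ncmpi_get_vara_int_all to read each process's rows.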
49 Exchanging Data with RMA 50 Revisiting Mesh Communication • Recall how we designed the parallel implementation – Determine source and destination data • Do not need full generality of send/receive – Each process can completely define what data needs to be moved to itself, relative to each processes local mesh • Each process can “get” data from its neighbors – Alternately, each can define what data is needed by the neighbor processes • Each process can “put” data to its neighbors 51 Remote Memory Access • Separates data transfer from indication of completion (synchronization) • In message-passing, they are combined store send receive load Proc 0 Proc 1 Proc 0 Proc 1 fence put fence fence fence load store fence fence get or 52 Remote Memory Access in MPI-2 (also called One-Sided Operations) • Goals of MPI-2 RMA Design – Balancing efficiency and portability across a wide class of architectures • shared-memory multiprocessors • NUMA architectures • distributed-memory MPP’s, clusters • Workstation networks – Retaining “look and feel” of MPI-1 – Dealing with subtle memory behavior issues: cache coherence, sequential consistency 53 Remote Memory Access Windows and Window Objects Get Put Process 2 Process 1 Process 3 Process 0 = address spaces = window object window 54 Basic RMA Functions for Communication • MPI_Win_create exposes local memory to RMA operation by other processes in a communicator – Collective operation – Creates window object • MPI_Win_free deallocates window object • MPI_Put moves data from local memory to remote memory • MPI_Get retrieves data from remote memory into local memory • MPI_Accumulate updates remote memory using local values • Data movement operations are non-blocking • Subsequent synchronization on window object needed to ensure operation is complete 55 Performance of RMA Caveats: On SGI, MPI_Put uses specially allocated memory 56 Advantages of RMA Operations • Can do multiple data transfers with a single synchronization operation – like BSP model • Bypass tag matching – effectively precomputed as part of remote offset • Some irregular communication patterns can be more economically expressed • Can be significantly faster than send/receive on systems with hardware support for remote memory access, such as shared memory systems 57 Irregular Communication Patterns with RMA • If communication pattern is not known a priori, the send-recv model requires an extra step to determine how many sends-recvs to issue • RMA, however, can handle it easily because only the origin or target process needs to issue the put or get call • This makes dynamic communication easier to code in RMA 58 RMA Window Objects MPI_Win_create(base, size, disp_unit, info, comm, win) • Exposes memory given by (base, size) to RMA operations by other processes in comm • win is window object used in RMA operations • disp_unit scales displacements: – 1 (no scaling) or sizeof(type), where window is an array of elements of type type – Allows use of array indices – Allows heterogeneity 59 RMA Communication Calls • MPI_Put - stores into remote memory • MPI_Get - reads from remote memory • MPI_Accumulate - updates remote memory • All are non-blocking: data transfer is described, maybe even initiated, but may continue after call returns • Subsequent synchronization on window object is needed to ensure operations are complete 60 Put, Get, and Accumulate • MPI_Put(origin_addr, origin_count, origin_datatype, target_rank, target_offset, target_count, target_datatype, window) • MPI_Get( ... 
) • MPI_Accumulate( ..., op, ... ) • op is as in MPI_Reduce, but no user-defined operations are allowed 61 The Synchronization Issue • Issue: Which value is retrieved? – Some form of synchronization is required between local load/stores and remote get/put/accumulates • MPI provides multiple forms local stores MPI_Get 62 Synchronization with Fence Simplest methods for synchronizing on window objects: • MPI_Win_fence - like barrier, supports BSP model Process 0 MPI_Win_fence(win) MPI_Put MPI_Put MPI_Win_fence(win) Process 1 MPI_Win_fence(win) MPI_Win_fence(win) 63 Mesh Exchange Using MPI RMA • Define the windows – Why – safety, options for performance (later) • Define the data to move • Mark the points where RMA can start and where it must complete (e.g., fence/put/put/fence) 64 Outline of 1D RMA Exchange • Create Window object • Computing target offsets • Exchange operation 65 Computing the Offsets • Offset to top ghost row – 1 • Offset to bottom ghost row – 1 + (# cells in a row)*(# of rows – 1) – = 1 + (nx + 2)*(e – s + 2) e s nx a(1,e) a(1,s) 66 Fence Life Exchange Code Walkthrough • Points to observe – MPI_Win_fence is used to separate RMA accesses from non-RMA accesses • Both starts and ends data movement phase – Any memory may be used • No special malloc or restrictions on arrays – Uses same exchange interface as the point-to-point version See mlife-fence.c pp. 1-3 for code example. 67 Comments on Window Creation • MPI-2 provides MPI_SIZEOF for Fortran users – Not universally implemented – Use MPI_Type_size for portability • Using a displacement size corresponding to a basic type allows use of put/get/accumulate on heterogeneous systems – Even when the sizes of basic types differ • Displacement size also allows easier computation of offsets in terms of array index instead of byte offset 68 More on Fence • MPI_Win_fence is collective over the group of the window object • MPI_Win_fence is used to separate, not just complete, RMA and local memory operations – That is why there are two fence calls • Why? – MPI RMA is designed to be portable to a wide variety of machines, including those without cache coherent hardware (including some of the fastest machines made) – See performance tuning for more info 69 Scalable Synchronization with Post/Start/Complete/Wait • Fence synchronization is not scalable because it is collective over the group in the window object • MPI provides a second synchronization mode: Scalable Synchronization – Uses four routines instead of the single MPI_Win_fence: • 2 routines to mark the begin and end of calls to RMA routines – MPI_Win_start, MPI_Win_complete • 2 routines to mark the begin and end of access to the memory window – MPI_Win_post, MPI_Win_wait • P/S/C/W allows synchronization to be performed only among communicating processes 70 Synchronization with P/S/C/W • Origin process calls MPI_Win_start and MPI_Win_complete • Target process calls MPI_Win_post and MPI_Win_wait Process 0 MPI_Win_start(target_grp) MPI_Put MPI_Put MPI_Win_complete(target_grp) Process 1 MPI_Win_post(origin_grp) MPI_Win_wait(origin_grp) 71 P/S/C/W Life Exchange Code Walkthrough • Points to Observe – Use of MPI group routines to describe neighboring processes – No change to MPI_Put calls • You can start with MPI_Win_fence, then switch to P/S/C/W calls if necessary to improve performance See mlife-pscw.c pp. 1-4 for code example. 72 Life with 2D Block-Block Decomposition 73 Why Use a 2D Decomposition? 
• More scalable due to reduced communication requirements – We can see why with a simple communication model. – Let the time to move n words from one process to another be Tc = s + rn – 1D decomposition time on p processes is • T = 2(s+rn) + T1/p – 2D decomposition time on p processes is • T = 4(s + r(n/√p)) + T1/p – For large n, 2D decomposition has much smaller communication time – (Even stronger effect for 3D decompositions of 3D problems) 74 Designing the 2D Decomposition • Go back to global mesh view • Define decomposition • Define data to move • Define local mesh 75 Mesh Exchange for 2D Decomposition • Creating the datatypes • Using fence • Using scalable synchronization 76 Outline of 2D RMA Exchange • Create Window Object • Computing target offsets – Even for less regular decompositions • Creating Datatypes • Exchange Operation 77 Creating the Window MPI_Win win; int *localMesh; /* nx is the number of (non-ghost) values in x, ny in y */ nx = ex - sx + 1; ny = ey - sy + 1; MPI_Win_create(localMesh, (ex-sx+3)*(ey-sy+3)*sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win); • Nothing new here 78 Creating the Window (C++) MPI::Win win; int *localMesh; // nx is the number of (non-ghost) values in x, // ny in y nx = ex - sx + 1; ny = ey - sy + 1; win = MPI::Win::Create(localMesh, (ex-sx+3)*(ey-sy+3)*sizeof(int), sizeof(int), MPI::INFO_NULL, MPI::COMM_WORLD); • Nothing new here 79 Creating the Window (Fortran) integer win, sizedouble, ierr double precision a(sx-1:ex+1,sy-1:ey+1) ! nx is the number of (non-ghost) values in x, ny in y nx = ex - sx + 1 ny = ey - sy + 1 call MPI_TYPE_SIZE(MPI_DOUBLE_PRECISION, sizedouble,& ierr) call MPI_WIN_CREATE(a, (ex-sx+3)*(ey-sy+3)*sizedouble, & sizedouble, MPI_INFO_NULL, & MPI_COMM_WORLD, win, ierr) • Nothing new here 80 Computing Target Offsets • Similar to 1D, but may include some computation since neighbor with shared boundary still needs to know the size of the other dimension as that is needed to compute the offsets 81 Creating Datatypes for Columns MPI_Datatype coltype; /* Vector type used on origin process */ MPI_Type_vector(1, ny, nx+2, MPI_INT, &coltype); MPI_Type_commit(&coltype); GLastRow GFirstRow LCols LRows GFirstCol GLastCol Stride# elements • For both the left and right side 82 Creating Datatypes for Columns (C++) MPI::Datatype coltype; // Vector type used on origin process coltype = MPI::Type::Create_vector(1, ny, nx+2, MPI::INT ); coltype.Commit(); GLastRow GFirstRow LCols LRows GFirstCol GLastCol Stride# elements • For both the left and right side 83 Creating Datatypes for Columns (Fortran) integer coltype ! Vector type used on origin process call MPI_TYPE_VECTOR(1, ny, nx+2,& MPI_DOUBLE_PRECISION, & coltype, ierr) call MPI_TYPE_COMMIT(coltype, ierr) sy ey nx ny sx ex Stride # elements • For both the left and right side 84 2D Life Code Walkthrough • Points to observe – More complicated than 1D! – Communication of noncontiguous regions uses derived datatypes • For the RMA version (mlife2d-fence) – Be careful in determining the datatype for the target process – Be careful in determining the offset – MPI_Win_fence must return before data may be used on target See mlife2d.c, mlife2d-pt2pt.c, mlife2d-fence.c for code examples. 
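To make the fence-based 2D exchange concrete, here is a minimal hypothetical sketch in C. Unlike the Fortran-oriented code above, it assumes a row-major local mesh of (ny+2) x (nx+2) ints with one ghost layer, element (i,j) at offset i*(nx+2)+j, equal nx and ny on all processes, a window created with disp_unit = sizeof(int), and neighbor ranks set to MPI_PROC_NULL at the domain edges (an MPI_Put to MPI_PROC_NULL is a no-op).

#include <mpi.h>

/* Hypothetical sketch: fence-synchronized 2D halo exchange. */
void exchange2d(int *mesh, int nx, int ny, MPI_Win win,
                int up, int down, int left, int right)
{
    MPI_Datatype coltype;
    int rowlen = nx + 2;

    /* one ghost column: ny elements, one per row, stride nx+2
       (built here for brevity; real code would create it once) */
    MPI_Type_vector(ny, 1, rowlen, MPI_INT, &coltype);
    MPI_Type_commit(&coltype);

    MPI_Win_fence(0, win);

    /* rows are contiguous: my first/last real row goes into the
       neighbor's ghost row; offsets are in units of int because
       the window uses disp_unit = sizeof(int) */
    MPI_Put(&mesh[1*rowlen + 1],  nx, MPI_INT, up,
            (MPI_Aint)(ny + 1)*rowlen + 1, nx, MPI_INT, win);
    MPI_Put(&mesh[ny*rowlen + 1], nx, MPI_INT, down,
            1, nx, MPI_INT, win);

    /* columns are strided: one coltype element on each side */
    MPI_Put(&mesh[1*rowlen + 1],  1, coltype, left,
            (MPI_Aint)rowlen + nx + 1, 1, coltype, win);
    MPI_Put(&mesh[1*rowlen + nx], 1, coltype, right,
            (MPI_Aint)rowlen, 1, coltype, win);

    MPI_Win_fence(0, win);
    MPI_Type_free(&coltype);
}

Switching to post/start/complete/wait changes only the synchronization calls around the puts, as in the 1D version.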
85 LUNCH 86 I/O for General Distributed Data 87 Handling Irregular Data Structures • One strength of MPI is that you can handle any kind of situation (because you have to do much of the work yourself) • Example: sparse matrix operations, such as used in PDE codes 88 Sparse Matrix I/O • We have seen how to use MPI-I/O with regular data structures. What about irregular data structures? – Each process has a different amount of data • For a simple example, we look at I/O for sparse matrices – Similar code can be used for unstructured meshes • First more on I/O, then the example 89 Sparse Matrix I/O Characteristics • Local to global data mapping not known by each process – Depends on number of nonzeros on previous ranks! • Will need to communicate to determine relative positions before performing I/O • Will use independent I/O in some cases • Will read noncontiguous regions from file 90 Independent I/O with MPI-IO 91 Writing to a File • Use MPI_File_write or MPI_File_write_at • Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags to MPI_File_open • If the file doesn’t exist previously, the flag MPI_MODE_CREATE must also be passed to MPI_File_open • We can pass multiple flags by using bitwise-or ‘|’ in C, or addition ‘+” in Fortran 92 Ways to Access a Shared File •MPI_File_seek •MPI_File_read •MPI_File_write •MPI_File_read_at •MPI_File_write_at •MPI_File_read_shared •MPI_File_write_shared combine seek and I/O for thread safety use shared file pointer like Unix I/O 93 Using Explicit Offsets #include “mpi.h” MPI_Status status; MPI_File fh; MPI_Offset offset; MPI_File_open(MPI_COMM_WORLD, “/pfs/datafile”, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh) nints = FILESIZE / (nprocs*INTSIZE); offset = rank * nints * INTSIZE; MPI_File_read_at(fh, offset, buf, nints, MPI_INT, &status); MPI_Get_count(&status, MPI_INT, &count ); printf( “process %d read %d ints\n”, rank, count ); MPI_File_close(&fh); 94 Using Explicit Offsets (C++) #include “mpi.h” MPI::Status status; MPI::Offset offset; fh = MPI::FILE::Open(MPI::COMM_WORLD,“/pfs/datafile”, MPI::MODE_RDONLY, MPI::INFO_NULL); nints = FILESIZE / (nprocs*sizeof(int)); offset = rank * nints * sizeof(int); fh.Read_at(offset, buf, nints, MPI::INT, status); count = status.Get_count( MPI::INT ); cout << “process “ << rank << “read “ << count << “ints” << “\n”; fh.Close(); 95 Using Explicit Offsets (Fortran) include 'mpif.h' integer status(MPI_STATUS_SIZE) integer (kind=MPI_OFFSET_KIND) offset C in F77, see implementation notes (might be integer*8) call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile', & MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr) nints = FILESIZE / (nprocs*INTSIZE) offset = rank * nints * INTSIZE call MPI_FILE_READ_AT(fh, offset, buf, nints, MPI_INTEGER, status, ierr) call MPI_GET_COUNT(status, MPI_INTEGER, count, ierr) print *, 'process ', rank, 'read ', count, 'integers' call MPI_FILE_CLOSE(fh, ierr) 96 Why Use Independent I/O? 
• Sometimes the synchronization of collective calls is not natural • Sometimes the overhead of collective calls outweighs their benefits – Example: very small I/O during header reads 97 Sparse Matrix Operations • A typical operation is a matrix-vector multiply • Consider an example where the sparse matrix is produced by one application and you wish to use a parallel program to solve the linear system 98 Sparse Matrix Format n – number of rows/cols (matrix dimensions) nz – number of nonzero elements ia[0..n] – index into data for first element in row i ja[0..nz-1] – column location for element j a[0..nz-1] – actual data ( 0, 0, 0, 0, 4 1, 0, 3, 0, 0 5, 2, 0, 0, 8 0, 6, 7, 0, 0 0, 0, 0, 9, 0 ) n = 5 nz = 9 ia[] = ( 0, 1, 3, 6, 8, 9 ) ja[] = ( 4, 0, 2, 0, 1, 4, 1, 2, 3 ) a[] = ( 4, 1, 3, 5, 2, 8, 6, 7, 9 ) (known as CSR or AIJ format) Note: Format isn’t a win for a matrix of this size and density. 99 Steps in Designing the Parallel Version • Same as our other examples: – Decomposition – Communication (for the matrix-vector product) – Define the local representation 100 Step 1: Description of Decomposition • By rows (matches equations) • In practice, the diagonal block and off-diagonal block are stored separately – For simplicity, we will ignore this 101 Step 2: Communication • For matrix-vector product, we would need – Elements of vector (also distributed in the same way as the matrix) from other processes corresponding to columns in which there are non-zero entries • Can be implemented with send and receive or with RMA – For simplicity, we will not show this part of the code 102 Step 3: Define the Local Arrays • Correspondence between the local and global arrays • “Global” array is an abstraction; there is no one global array allocated anywhere. Instead, we compute parts of it (the local arrays) and provide ways to output the global array by combining the values on each process (parallel I/O!) 103 I/O in Sparse Matrix Codes • Define the file format • We want the file to be independent of the number of processes • File requires: – Header information • Size of matrix, number of non-zeros • Name of matrix – ia, ja, and A vectors 104 Placing Data in Checkpoint • Unlike data layout in the Life case, positioning of data for a given process depends on the values held by other processes (number of nonzero values)! • Each process has pieces that are spread out in the file (noncontiguous!) title nnz ia[0..n] File Layout ja[0..nz-1] a[0..nz-1] 105 stdio CSRIO Code Walkthrough • Points to observe – MPI_Exscan and MPI_Allreduce to discover starting locations and complete sizes of vectors – Passing data to rank 0 for printing – Converting ia from local to global references See csrio-stdout.c pp. 1-2 for code example. 106 Writing Sparse Matrices (stdout) • Steps: – MPI_Exscan to get count of nonzeros from all previous processes • gives starting offset in ja[] and a[] arrays and value to add to ia[] elements – MPI_Allreduce to get total count of nonzeros (nz) – gives size of ja[] and a[] arrays – Process zero writes header (title, n, nz) – Copy ia[] and adjust to refer to global matrix locations – Pass data back to rank zero for printing title nnz ia[0..n] File Layout ja[0..nz-1] a[0..nz-1] See csrio-stdout.c pp. 3-8 for code example. 
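A minimal hypothetical sketch of the Exscan/Allreduce step just described; the names csr_global_offsets, myn (local row count), and mynz (local nonzero count) are assumptions, and for brevity it adjusts ia[] in place where the tutorial code copies it first.

#include <mpi.h>

/* Hypothetical sketch: compute global offsets for local CSR data. */
void csr_global_offsets(int myn, int mynz, int *ia, MPI_Comm comm,
                        int *nzoffset_out, int *totalnz_out)
{
    int nzoffset = 0, totalnz = 0, rank, i;

    MPI_Comm_rank(comm, &rank);

    /* sum of nonzeros on all lower ranks = my starting position
       in the global ja[] and a[] arrays */
    MPI_Exscan(&mynz, &nzoffset, 1, MPI_INT, MPI_SUM, comm);
    if (rank == 0) nzoffset = 0;   /* Exscan leaves rank 0 undefined */

    /* total number of nonzeros = size of the global ja[]/a[] */
    MPI_Allreduce(&mynz, &totalnz, 1, MPI_INT, MPI_SUM, comm);

    /* shift local row pointers to refer to global positions */
    for (i = 0; i <= myn; i++)
        ia[i] += nzoffset;

    *nzoffset_out = nzoffset;
    *totalnz_out  = totalnz;
}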
107 Noncontiguous I/O in File • Each process describes the part of the file that it is responsible for – This is the “file view” – Described in MPI with an offset (useful for headers) and an MPI_Datatype • Only the part of the file described by the file view is visible to the process; reads and writes access these locations • This provides an efficient way to perform noncontiguous accesses 108 Noncontiguous Accesses • Common in parallel applications • Example: distributed arrays stored in files • A big advantage of MPI I/O over Unix I/O is the ability to specify noncontiguous accesses in memory and file within a single function call by using derived datatypes • Allows implementation to optimize the access • Collective I/O combined with noncontiguous accesses yields the highest performance 109 File Views • Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view • displacement = number of bytes to be skipped from the start of the file – E.g., to skip a file header • etype = basic unit of data access (can be any basic or derived datatype) • filetype = specifies which portion of the file is visible to the process 110 A Simple Noncontiguous File View Example etype = MPI_INT filetype = two MPI_INTs followed by a gap of four MPI_INTs displacement filetype filetype and so on... FILEhead of file 111 Noncontiguous File View Code MPI_Aint lb, extent; MPI_Datatype etype, filetype, contig; MPI_Offset disp; MPI_Type_contiguous(2, MPI_INT, &contig); lb = 0; extent = 6 * sizeof(int); MPI_Type_create_resized(contig, lb, extent, &filetype); MPI_Type_commit(&filetype); disp = 5 * sizeof(int); etype = MPI_INT; MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh); MPI_File_set_view(fh, disp, etype, filetype, "native", MPI_INFO_NULL); MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE); 112 Noncontiguous File View Code (C++) MPI::Aint lb, extent; MPI::Datatype etype, filetype, contig; MPI::Offset disp; contig = MPI::Type::Contiguous(2, MPI::INT); lb = 0; extent = 6 * sizeof(int); filetype = MPI::Type::Create_resized(contig, lb, extent); filetype.Commit(); disp = 5 * sizeof(int); etype = MPI::INT; fh = MPI::File::Open(MPI::COMM_WORLD, "/pfs/datafile", MPI::MODE_CREATE | MPI::MODE_RDWR, MPI::INFO_NULL ); fh.Set_view( disp, etype, filetype, "native", MPI::INFO_NULL); fh.Write(buf, 1000, MPI::INT); 113 Noncontiguous File View Code (Fortran) integer (kind=MPI_ADDRESS_KIND) lb, extent; integer etype, filetype, contig; integer (kind=MPI_OFFSET_KIND) disp; call MPI_Type_contiguous(2, MPI_INTEGER, contig, ierr) call MPI_Type_size( MPI_INTEGER, sizeofint, ierr ) lb = 0 extent = 6 * sizeofint call MPI_Type_create_resized(contig, lb, extent, filetype, ierr) call MPI_Type_commit(filetype, ierr); disp = 5 * sizeof(int); etype = MPI_INTEGER call MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile", & MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, fh, ierr) call MPI_File_set_view(fh, disp, etype, filetype, "native", & MPI_INFO_NULL, ierr) call MPI_File_write(fh, buf, 1000, MPI_INTEGER, MPI_STATUS_IGNORE) 114 MPI-IO CSRIO Code Walkthrough • Points to observe – Independent I/O when reading or writing the header – Use of file views when reading or writing data See csrio-mpiio.c pp. 1-2 for code example. 
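Tying the file-view machinery to the CSR layout, here is a minimal hypothetical sketch of one piece of the idea: each process sets a (trivially contiguous) view at its own displacement and writes its slice of the global a[] array collectively. The function name, the header_bytes parameter, and the assumption that the sizes and offsets are already known are illustrative only; this is not the tutorial's csrio-mpiio.c.

#include <mpi.h>

/* Hypothetical sketch: write this process's slice of the global a[]
 * (doubles).  The file layout is header, ia[0..n], ja[0..nz-1],
 * a[0..nz-1] as shown above; n is the global row count, totalnz and
 * nzoffset come from the Exscan/Allreduce step. */
void write_a_slice(MPI_File fh, double *a, int mynz, int nzoffset,
                   int n, int totalnz, MPI_Offset header_bytes)
{
    MPI_Offset disp = header_bytes
                    + (MPI_Offset)(n + 1) * sizeof(int)      /* ia */
                    + (MPI_Offset)totalnz * sizeof(int)      /* ja */
                    + (MPI_Offset)nzoffset * sizeof(double); /* my a */

    /* contiguous view starting at my slice of a[] */
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native",
                      MPI_INFO_NULL);
    MPI_File_write_all(fh, a, mynz, MPI_DOUBLE, MPI_STATUS_IGNORE);
}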
115 Reading Sparse Matrix Header • Steps: – Process 0 reads the title, n, and nz independently (i.e., using independent I/O) • Collective open times can be very large – MPI_Bcast values to everyone • MPI_Type_struct used to combine data into a single broadcast title nnz ia[0..n] File Layout ja[0..nz-1] a[0..nz-1] See csrio-mpiio.c pp. 3-5 for code example. 116 Reading Sparse Matrix Data • Steps: – Everyone reads the portion of ia[] for their rows – MPI_Allreduce to verify that everyone successfully allocated memory • Avoids potential deadlocks if one process were to return an error – Collectively read data – Convert ia[] to refer to local matrix title nnz ia[0..n] File Layout ja[0..nz-1] a[0..nz-1] See csrio-mpiio.c pp. 6-9 for code example. 117 Writing Sparse Matrices • Steps: – MPI_Scan to get count of nonzeros from all previous processes • gives starting offset in ja[] and a[] arrays and value to add to ia[] elements – MPI_Allreduce to get total count of nonzeros (nz) – gives size of ja[] and a[] arrays – Process zero writes header (title, n, nz) – Copy ia[] and adjust to refer to global matrix locations – All processes write ia, ja, a collectively title nnz ia[0..n] File Layout ja[0..nz-1] a[0..nz-1] See csrio-mpiio.c pp. 10-13 for code example. 118 pNeo - Modeling the Human Brain 119 Science Driver • Goal: Understand conditions, causes, and possible corrections for epilepsy • Approach: Study the onset and progression of epileptiform activity in the neocortex • Technique: Create a model of neurons and their interconnection network, based on models combining wet lab measurements of resected tissue samples and in vivo studies • Computation: Develop a simulation program that can be used for detailed parameter studies 120 Model Neurons IS Soma Na K Spike Ex Inh IS Soma Na K Spike Ex Inh Soma Na K Spike Ex Neurons in the focal neocortex Compartmental neural models Excitatory and inhibitory signal wiring between neurons 121 Modeling Approach • Individual neurons are modeled using electrical analogs to parameters measured in the laboratory • Differential equations describe evolution of the neuron state variables • Neuron spiking output is wired to thousands of cells in a neighborhood • Wiring diagram is based on wiring patterns observed in neocortex tissue samples • Computation is divided among available processors Schematic of a two dimensional patch of neurons showing communication neighborhood for one of the cells in the simulation and partitioning of the patch among processors. 122 Abstract pNeo for Tutorial Example • “Simulate the simulation” of the evolution of neuron state instead of solving the differential equations • Focus on how to code the interactions between cells in MPI • Assume one cell per process for simplicity – Real code multiplexes many individual neurons onto one MPI process 123 What Happens In Real Life • Each cell has a fixed number of connections to some other cells • Cell “state” evolves continuously • From time to time “spikes” arrive from connected cells. 
• Spikes influence the evolution of cell state • From time to time the cell state causes spikes to be sent to other connected cells 124 What Happens In Existing pNeo Code • In pNeo, each cell is connected to about 1000 cells – Large runs have 73,000 cells – Brain has ~100 billion cells • Connections are derived from neuro-anatomical data • There is a global clock marking time steps • The state evolves according to a set of differential equations • About 10 or more time steps between spikes – I.e., communication is unpredictable and sparse • Possible MPI-1 solutions – Redundant communication of communication pattern before communication itself, to tell each process how many receives to do – Redundant “no spikes this time step” messages • MPI-2 solution: straightforward use of Put, Fence 125 What Happens in Tutorial Example • There is a global clock marking time steps • At the beginning of a time step, a cell notes spikes from connected cells (put by them in a previous time step). • A dummy evolution algorithm is used in place of the differential equation solver. • This evolution computes which new spikes are to be sent to connected cells. • Those spikes are sent (put), and the time step ends. • We show both a Fence and a Post/Start/Complete/Wait version. 126 Two Examples Using RMA • Global synchronization – Global synchronization of all processes at each step – Illustrates Put, Get, Fence • Local synchronization – Synchronization across connected cells, for improved scalability (synchronization is local) – Illustrates Start, Complete, Post, Wait 127 pNeo Code Walkthrough • Points to observe – Data structures can be the same for multiple synchronization approaches • Code is simple compared to what a send/receive version would look like – Processes do no need to know which other processes will send them spikes at each step See pneo_fence.c and pneo_pscw.c for code examples. 128 Passive Target RMA 129 Active vs. 
Passive Target RMA • Active target RMA requires participation from the target process in the form of synchronization calls (fence or P/S/C/W) • In passive target RMA, target process makes no synchronization call 130 Passive Target RMA • We need to indicate the beginning and ending of RMA calls by the process performing the RMA – This process is called the origin process – The process being accessed is the target process • For passive target, the begin/end calls are – MPI_Win_lock, MPI_Win_unlock 131 Synchronization for Passive Target RMA • MPI_Win_lock(locktype, rank, assert, win) – Locktype is • MPI_LOCK_EXCLUSIVE – One process at a time may access – Use when modifying the window • MPI_LOCK_SHARED – Multiple processes (as long as none hold MPI_LOCK_EXCLUSIVE) – Consider using when using MPI_Get (only) on the window – Assert is either 0 or MPI_MODE_NOCHECK • MPI_Win_unlock(rank, win) • Lock is not a real lock but means begin-RMA; unlock is end- RMA, not real unlock 132 Put with Lock if (rank == 0) { MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win); MPI_Put(outbuf, n, MPI_INT, 1, 0, n, MPI_INT, win); MPI_Win_unlock(1, win); } • Only process performing MPI_Put makes MPI RMA calls – Process with memory need not make any MPI calls; it is “passive” • Similarly for MPI_Get, MPI_Accumulate 133 Put with Lock (C++) if (rank == 0) { win.Lock( MPI::LOCK_EXCLUSIVE, 1, 0 ); win.Put(outbuf, n, MPI::INT, 1, 0, n, MPI::INT); win.Unlock(1); } • Only process performing MPI_Put makes MPI RMA calls – Process with memory need not make any MPI calls; it is “passive” • Similarly for MPI_Get, MPI_Accumulate 134 Put with Lock (Fortran) if (rank .eq. 0) then call MPI_Win_lock(MPI_LOCK_EXCLUSIVE,& 1, 0, win, ierr ) call MPI_Put(outbuf, n, MPI_INTEGER, & 1, 0, n, MPI_INTEGER, win, ierr) call MPI_Win_unlock(1, win, ierr) endif • Only process performing MPI_Put makes MPI RMA calls – Process with memory need not make any MPI calls; it is “passive” • Similarly for MPI_Get, MPI_Accumulate 135 Global Arrays • Lets look at updating a single array, distributed across a group of processes 136 A Global Distributed Array • Problem: Application needs a single, 1-dimensional array that any process can update or read • Solution: Create a window object describing local parts of the array, and use MPI_Put and MPI_Get to access • Each process has alocal[n] • We must provide access to a[pn] • We cannot use MPI_Win_fence; we must use MPI_Win_lock and MPI_Win_unlock n pn 137 Creating the Global Array double *locala; ... MPI_Alloc_mem(n * sizeof(double), MPI_INFO_NULL, &locala); MPI_Win_create(locala, n * sizeof(double), sizeof(double), MPI_INFO_NULL, comm, &win); 138 Creating the Global Array (C++) Volatile double *locala; ... 
locala = MPI::Alloc_mem(n * sizeof(double), MPI::INFO_NULL ); win = MPI::Win::Create(locala, n * sizeof(double), sizeof(double), MPI::INFO_NULL, comm); 139 Comments • MPI-2 allows “global” to be relative to a communicator, enabling hierarchical algorithms – i.e., “global” does not have to refer to MPI_COMM_WORLD • MPI_Alloc_mem is required for greatest portability – Some MPI implementations may allow memory not allocated with MPI_Alloc_mem in passive target RMA operations 140 Accessing the Global Array From a Remote Process • To update: rank = i / n; offset = i % n; MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank, 0, win); MPI_Put(&value, 1, MPI_DOUBLE, rank, offset, 1,MPI_DOUBLE, win); MPI_Win_unlock(rank, win); • To read: rank = i / n; offset = i % n; MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win); MPI_Get(&value, 1, MPI_DOUBLE, rank, offset, 1, MPI_DOUBLE, win); MPI_Win_unlock(rank, win); 141 Accessing the Global Array From a Remote Process (C++) • To update: rank = i / n; offset = i % n; win.Lock(MPI_LOCK_EXCLUSIVE, rank, 0); win.Put(&value, 1, MPI::DOUBLE, rank, offset, 1, MPI::DOUBLE); win.Unlock(rank); • To read: rank = i / n; offset = i % n; win.Lock(MPI::LOCK_SHARED, rank, 0); win.Get(&value, 1, MPI_DOUBLE, rank, offset, 1, MPI_DOUBLE); win.Unlock(rank); 142 Accessing the Global Array From a Remote Process (Fortran) • To update: rank = i / n offset = mod(i,n) call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank, 0, & win, ierr) call MPI_Put(value, 1, MPI_DOUBLE_PRECISION, & rank, offset, 1, MPI_DOUBLE_PRECISION, & win, ierr ) call MPI_Win_unlock(rank, win, ierr ) • To read: rank = i / n offset = mod(i,n) call MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, & win, ierr ) call MPI_Get(value, 1, MPI_DOUBLE_PRECISION, & rank, offset, 1, MPI_DOUBLE_PRECISION, & win, ierr ) call MPI_Win_unlock(rank, win, ierr ) 143 Accessing the Global Array From a Local Process • The issues – Cache coherence (if no hardware) – Data in register • To read: volatile double *locala; rank = i / n; offset = i % n; MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win); if (rank == myrank) { value = locala[offset]; } else { MPI_Get(&value, 1, MPI_DOUBLE, rank, offset, 1, MPI_DOUBLE, win); } MPI_Win_unlock(rank, win); 144 Accessing the Global Array From a Local Process (C++) • The issues – Cache coherence (if no hardware) – Data in register • To read: volatile double *locala; rank = i / n; offset = i % n; win.Lock(MPI::LOCK_SHARED, rank, 0); if (rank == myrank) { value = locala[offset]; } else { win.Get(&value, 1, MPI::DOUBLE, rank, offset, 1, MPI::DOUBLE); } win.Unlock(rank); 145 Accessing the Global Array From a Local Process (Fortran) • The issues – Cache coherence (if no hardware) – Data in register – (We’ll come back to this case) • To read: double precision locala(0:mysize-1) rank = i / n; offset = mod(i,n) call MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win, ierr) if (rank .eq. 
myrank) then value = locala(offset); else call MPI_Get(&value, 1, MPI_DOUBLE_PRECISION, & rank, offset, 1, MPI_DOUBLE_PRECISION, & win, ierr ) endif call MPI_Win_unlock(rank, win, ierr ) 146 Memory for Passive Target RMA • Passive target operations are harder to implement – Hardware support helps • MPI allows (but does not require) an implementation to require that windows objects used for passive target RMA use local windows allocated with MPI_Alloc_mem 147 Allocating Memory • MPI_Alloc_mem, MPI_Free_mem • Special Issue: Checking for no memory available: – e.g., the Alloc_mem equivalent of a null return from malloc – Default error behavior of MPI is to abort • Solution: – Change the error handler on MPI_COMM_WORLD to MPI_ERRORS_RETURN, using MPI_COMM_SET_ERRHANDLER (in MPI-1, MPI_ERRHANDLER_SET) – Check error class with MPI_ERROR_CLASS • Error codes are not error classes 148 Using MPI_Alloc_mem from Fortran • No general solution, but some Fortran extensions allow the following: double precision u pointer (p, u(0:50,0:20)) integer (kind=MPI_ADDRESS_KIND) size integer sizeofdouble, ierror ! careful with size (must be MPI_ADDRESS_KIND) call MPI_SIZEOF(u, sizeofdouble, ierror) size = 51 * 21 * sizeofdouble call MPI_ALLOC_MEM(size, MPI_INFO_NULL, p, ierror) ... ... program may now refer to u, including passing it ... to MPI_WIN_CREATE ... call MPI_FREE_MEM(u, ierror) ! not p! 149 Mutex with Passive Target RMA • MPI_Win_lock/unlock DO NOT define a critical section • One has to implement a distributed locking algorithm using passive target RMA operations in order to achieve the equivalent of a mutex • Example follows 150 Implementing Mutex • Create “waitwin” window object – One process has N-byte array (byte per process) • One access epoch to try to lock – Put “1” into corresponding byte – Get copy of all other values • If all other values are zero, obtained lock • Otherwise must wait … waitwin[N] Process 0 Process 1 Process N-1 … waitwin window object 151 11 Attempting to lock • Processes use one access epoch to attempt to obtain the lock • Process 1 succeeds, but process 3 must wait 0 0 0 0 waitwin[4] Process 0 Process 1 Process 3 Lock Put(1 at byte 1) Get(other 3 bytes) Unlock Lock Put(1 at byte 3) Get(other 3 bytes) Unlock 0 0 0 0 1 0 No other 1s, so lock was obtained 1 in rank 1 position, so process must wait 152 Waiting for the lock • Naïve approach: simply MPI_Get the other bytes over and over – Lots of extra remote memory access – Better approach is to somehow notify waiting processes – Using RMA, set up a second window object with a byte on each process, spin-wait on local memory • This approach is like MCS locks • Lots of wasted CPU cycles spinning • Better approach: Using MPI-1 point-to-point, send a zero-byte message to the waiting process to notify it that it has the lock • Let MPI implementation handle checking for message arrival 153 1 Releasing the Lock • Process 1 uses one access epoch to release the lock • Because process 3 is waiting, process 1 must send a message to notify process 3 that it now owns the lock 100 0 waitwin[4] Process 0 Process 1 Process 3 Lock Put(0 at byte 1) Get(other 3 bytes) Unlock MPI_Recv(ANY_SRC) 0 0 1 1 in rank 3 position, must notify of release MPI_Recv completes, Process 3 has lock MPI_Send(rank 3) 154 Mutex Code Walkthrough • mpimutex_t type, for reference: typedef struct mpimutex { int nprocs, myrank, homerank; MPI_Comm comm; MPI_Win waitlistwin; MPI_Datatype waitlisttype; unsigned char *waitlist; } *mpimutex_t; See mpimutex.c for code example. 
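A minimal hypothetical sketch of the lock-acquisition epoch described above, using the mpimutex_t typedef shown just before this point. It assumes the waitlist window uses a displacement unit of one byte and that waitlisttype describes the nprocs-1 waitlist bytes other than this process's own slot; the waiting and release paths (the zero-byte notification messages) are omitted. It is not the full mpimutex.c.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch: one access epoch that tries to obtain the
 * mutex; returns 1 if the lock was obtained, 0 if we must wait. */
int mpimutex_try_acquire(mpimutex_t mutex)
{
    unsigned char one = 1;
    unsigned char *copy = (unsigned char *) malloc(mutex->nprocs - 1);
    int i, have_lock = 1;

    /* one access epoch: set my byte, read all of the others */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, mutex->homerank, 0,
                 mutex->waitlistwin);
    MPI_Put(&one, 1, MPI_BYTE, mutex->homerank,
            mutex->myrank, 1, MPI_BYTE, mutex->waitlistwin);
    MPI_Get(copy, mutex->nprocs - 1, MPI_BYTE, mutex->homerank,
            0, 1, mutex->waitlisttype, mutex->waitlistwin);
    MPI_Win_unlock(mutex->homerank, mutex->waitlistwin);

    /* lock obtained only if every other byte is still zero */
    for (i = 0; i < mutex->nprocs - 1; i++)
        if (copy[i]) have_lock = 0;

    free(copy);
    return have_lock;
}

If any other byte was set, the process then blocks in MPI_Recv waiting for the zero-byte message sent by the releasing lock holder, as described above.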
… waitlist[N] Process “homerank” Process nprocs - 1Process 0 … waitlistwin object • Code allows any process to be the “home” of the array: 155 Comments on Local Access • Volatile: – Tells compiler that some other agent (such as another thread or process) may change the value – In practice, rarely necessary for arrays but usually necessary for scalars – Volatile is not just for MPI-2. Any shared-memory program needs to worry about this (even for cache- coherent shared-memory systems) • Fortran users don’t have volatile (yet): – But they can use the following evil trick … 156 Simulating Volatile for Fortran • Replace MPI_Win_unlock with subroutine My_Win_unlock(rank, win, var, ierr) integer rank, win, ierr double precision var call MPI_Win_unlock(rank, win) return • When used in Fortran code, the compiler only sees call My_Win_unlock(rank, win, var, ierr) and assumes that var might be changed, causing the compiler to reload var from memory rather than using a value in register 157 Improving Performance • MPI provides ways to tune for performance • I/O – Using the right functions the right way – Providing Hints • RMA – Asserts and info 158 Tuning MPI-IO 159 General Guidelines for Achieving High I/O Performance • Buy sufficient I/O hardware for the machine • Use fast file systems, not NFS-mounted home directories • Do not perform I/O from one process only • Make large requests wherever possible • For noncontiguous requests, use derived datatypes and a single collective I/O call 160 Using the Right MPI-IO Function • Any application as a particular “I/O access pattern” based on its I/O needs • The same access pattern can be presented to the I/O system in different ways depending on what I/O functions are used and how • In our SC98 paper, we classify the different ways of expressing I/O access patterns in MPI-IO into four levels: level 0 – level 3 • We demonstrate how the user’s choice of level affects performance 161 Example: Distributed Array Access P0 P12 P4 P8 P2 P14 P6 P10 P1 P13 P5 P9 P3 P15 P7 P11 P0 P1 P2 P3 P0 P1 P2 P4 P5 P6 P7 P4 P5 P6 P8 P9 P8 P9 Large array distributed among 16 processes Access Pattern in the file Each square represents a subarray in the memory of a single process P10 P11 P10 P15P13P12 P12 P13 P14P14 162 Level-0 Access • Each process makes one independent read request for each row in the local array (as in Unix) MPI_File_open(..., file, ..., &fh) for (i=0; i