• 1. CS 194 Parallel Programming: Creating and Using Threads
Katherine Yelick, yelick@cs.berkeley.edu
http://www.cs.berkeley.edu/~yelick/cs194f07
10/22/2018, CS194 Lecture
• 2. Parallel Programming Models
A programming model is made up of the languages and libraries that create an abstract view of the machine.
Control: How is parallelism created? How are dependencies (orderings) enforced?
Data: Can data be shared, or is it all private? How is shared data accessed or private data communicated?
Synchronization: What operations can be used to coordinate parallelism? What are the atomic (indivisible) operations?
• 3. Simple Example
Consider applying a function square to the elements of an array A and then computing its sum:
A = array of all data
Asqr = map(square, A)
s = sum(Asqr)
Stop and discuss: What can be done in parallel? How long would it take (in big-O) if we have an unlimited number of processors?
• 4. Problem Decomposition
In designing a parallel algorithm:
Decompose the problem into smaller tasks.
Determine which can be done in parallel with each other, and which must be done in some order.
Conceptualize a decomposition as a task dependency graph: a directed graph with nodes corresponding to tasks and edges indicating dependencies, i.e., that the result of one task is required for processing the next.
A given problem may be decomposed into tasks in many different ways.
Tasks may be of the same, different, or indeterminate sizes.
[Figure: tasks sqr(A[0]), sqr(A[1]), sqr(A[2]), …, sqr(A[n]) all feeding into a sum task]
• 5. Example: Multiplying a Matrix with a Vector
for i = 1 to n
  for j = 1 to n
    y[i] += A[i,j] * x[j];
Dependencies: Each output element of y depends on one row of A and all of x.
Task graph: Since each output is independent, our task graph can have n nodes and no dependence edges.
Observations: All tasks are of the same size in terms of number of operations.
Question: Could we have more tasks? Fewer?
Slide derived from: Grama, Karypis, Kumar and Gupta
• 6. Example: Database Query Processing
Consider the execution of the query:
MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
on the following database:

ID#   Model    Year  Color  Dealer  Price
4523  Civic    2002  Blue   MN      $18,000
3476  Corolla  1999  White  IL      $15,000
7623  Camry    2001  Green  NY      $21,000
9834  Prius    2001  Green  CA      $18,000
6734  Civic    2001  White  OR      $17,000
5342  Altima   2001  Green  FL      $19,000
3845  Maxima   2001  Blue   NY      $22,000
8354  Accord   2000  Green  VT      $18,000
4395  Civic    2001  Red    CA      $17,000
7352  Civic    2002  Red    WA      $18,000

Slide derived from: Grama, Karypis, Kumar and Gupta
• 7. Example: Database Query Processing
One decomposition creates tasks that generate an intermediate table of entries that satisfy a particular clause.
Slide derived from: Grama, Karypis, Kumar and Gupta
• 8. Example: Database Query Processing
Here is a different decomposition.
Choice of decomposition will affect parallel performance.
Slide derived from: Grama, Karypis, Kumar and Gupta
• 9. Granularity of Task Decompositions
Granularity: the number of tasks into which a problem is decomposed.
Rough terminology:
Fine-grained: large number of small tasks.
Coarse-grained: smaller number of larger tasks.
[Figure: a coarse-grained version of dense matrix-vector product]
Slide derived from: Grama, Karypis, Kumar and Gupta
• 10. Degree of Concurrency
The degree of concurrency of a task graph is the number of tasks that can be executed in parallel.
It may vary over the execution, so we can talk about the maximum or average degree of concurrency.
The degree of concurrency increases as the decomposition becomes finer in granularity.
A directed path in a task graph represents a sequence of tasks that must be processed one after the other. The critical path is the longest such path.
These graphs are normally weighted by the cost of each task (node), and path lengths are the sums of weights.
Slide derived from: Grama, Karypis, Kumar and Gupta
• 11. Limits on Parallel Performance
Parallel time can be made smaller by making the decomposition finer.
There is an inherent bound on how fine the granularity of a computation can be. For example, in the case of multiplying a dense matrix with a vector, there can be no more than Θ(n²) concurrent tasks.
In addition, there may be communication overhead between tasks.
Slide derived from: Grama, Karypis, Kumar and Gupta
• 12. Shared Memory Programming
A program is a collection of threads of control, which can be created dynamically, mid-execution, in some languages.
Each thread has a set of private variables, e.g., local stack variables.
There is also a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
Threads communicate implicitly by writing and reading shared variables.
Threads coordinate by synchronizing on shared variables.
[Figure: processors P0, P1, …, Pn accessing variables s and y in shared memory, with each thread holding a private copy of i (values 2, 5, 8)]
• 13. Shared Memory “Code” for Computing a Sum
static int s = 0;

Thread 1:
for i = 0, n/2-1
  s = s + sqr(A[i])

Thread 2:
for i = n/2, n-1
  s = s + sqr(A[i])

The problem is a race condition on the variable s in the program.
A race condition or data race occurs when:
two processors (or two threads) access the same variable, and at least one does a write;
the accesses are concurrent (not synchronized), so they could happen simultaneously.
• 14. Shared Memory Code for Computing a Sum
static int s = 0;

Thread 1:
…
compute f(A[i]) and put in reg0
reg1 = s
reg1 = reg1 + reg0
s = reg1
…

Thread 2:
…
compute f(A[i]) and put in reg0
reg1 = s
reg1 = reg1 + reg0
s = reg1
…

Assume A = [3,5], f is the square function, and s = 0 initially.
For this program to work, s should be 34 at the end, but it may be 34, 9, or 25.
The atomic operations are reads and writes: you never see half of one number, but the += operation is not atomic.
All computations happen in (private) registers.
• 15. Corrected Code for Computing a Sum
static int s = 0;
static lock lk;

Thread 1:
local_s1 = 0
for i = 0, n/2-1
  local_s1 = local_s1 + sqr(A[i])
lock(lk);
s = s + local_s1
unlock(lk);

Thread 2:
local_s2 = 0
for i = n/2, n-1
  local_s2 = local_s2 + sqr(A[i])
lock(lk);
s = s + local_s2
unlock(lk);

Since addition is associative, it’s OK to rearrange the order. Right?
Most computation is on private variables.
Sharing frequency is also reduced, which might improve speed.
But there is still a race condition on the update of shared s, which is why each update is wrapped in lock(lk)/unlock(lk).
• 16. Parallel Programming with Threads
• 17. Overview of POSIX Threads
POSIX: Portable Operating System Interface for UNIX
An interface to Operating System utilities.
PThreads: the POSIX threading interface
System calls to create and synchronize threads.
Should be relatively uniform across UNIX-like OS platforms.
PThreads contain support for:
creating parallelism;
synchronizing.
There is no explicit support for communication, because shared memory is implicit; a pointer to shared data is passed to a thread.
• 19. Simple Threading Example
#include <pthread.h>
#include <stdio.h>

void* SayHello(void *foo) {
  printf( "Hello, world!\n" );
  return NULL;
}

int main() {
  pthread_t threads[16];
  int tn;
  for(tn=0; tn<16; tn++) {
    pthread_create(&threads[tn], NULL, SayHello, NULL);
  }
  for(tn=0; tn<16; tn++) {
    pthread_join(threads[tn], NULL);
  }
  return 0;
}

Compile using gcc -lpthread
See Millennium/NERSC docs for paths/modules.
Stop, run code, and discuss.
• 20. Loop Level Parallelism
Many applications have parallelism in loops.
With threads:
… A[n];
for (int i = 0; i < n; i++)
  … pthread_create (square, …, &A[i]);
Problem: the overhead of thread creation is nontrivial.
Squaring is not enough work for a separate thread (it takes much less time to square a number than to create a thread).
Unlike the original example, this would overwrite A[i]; how would you do this if you wanted a separate result array?
• 21. Shared Data and Threads
Variables declared outside of main are shared.
Objects allocated on the heap may be shared (if a pointer is passed).
Variables on the stack are private: passing a pointer to these around to other threads can cause problems.
Sharing is often done by creating a large “thread data” struct, passed into all threads as an argument.
Simple example:
char *message = "Hello World!\n";
pthread_create( &thread1, NULL, print_fun, (void*) message);
• 23. Basic Types of Synchronization: Barrier
Barrier -- global synchronization
Especially common when running multiple copies of the same function in parallel: SPMD, “Single Program Multiple Data”.
Simple use of barriers -- all threads hit the same one:
work_on_my_problem();
barrier;
get_data_from_others();
barrier;
More complicated -- barriers on branches (or loops):
if (tid % 2 == 0) {
  work1();
  barrier
} else {
  barrier
}
• 24. Creating and Initializing a Barrier
To (dynamically) initialize a barrier, use code similar to this (which sets the number of threads to 3):
pthread_barrier_t b;
pthread_barrier_init(&b, NULL, 3);
The second argument specifies an object attribute; using NULL yields the default attributes.
To wait at a barrier, a process executes:
pthread_barrier_wait(&b);
This barrier could have been statically initialized by assigning an initial value created using the macro PTHREAD_BARRIER_INITIALIZER(3).
Note: the barrier is not in all pthreads implementations, but we’ll provide something you can use when it isn’t.
• 25. Basic Types of Synchronization: Mutexes
Mutexes -- mutual exclusion, aka locks
Threads are working mostly independently but need to access a common data structure:
lock *l = alloc_and_init(); /* shared */
acquire(l);
access data
release(l);
Java and other languages have lexically scoped synchronization, similar to the cobegin/coend vs. fork-and-join tradeoff.
Semaphores give guarantees on “fairness” in getting the lock, but embody the same idea of mutual exclusion.
Locks only affect processors using them: pair-wise synchronization.