• 1. CS 194 Parallel Programming: Creating and Using Threads. Katherine Yelick, yelick@cs.berkeley.edu, http://www.cs.berkeley.edu/~yelick/cs194f07
• 2. Parallel Programming Models. A programming model is made up of the languages and libraries that create an abstract view of the machine.
  Control: How is parallelism created? How are dependencies (orderings) enforced?
  Data: Can data be shared, or is it all private? How is shared data accessed or private data communicated?
  Synchronization: What operations can be used to coordinate parallelism? What are the atomic (indivisible) operations?
• 3. Simple Example. Consider applying a function square to the elements of an array A and then computing its sum:
  A = array of all data
  Asqr = map(square, A)
  s = sum(Asqr)
  Stop and discuss: What can be done in parallel? How long would it take (in big-O) if we have an unlimited number of processors?
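For concreteness, a sequential C baseline of this computation might look like the sketch below (the array contents and the square helper are illustrative, not from the slides); the parallel versions later in the lecture split this loop across threads.

    #include <stdio.h>

    static double square(double x) { return x * x; }

    int main(void) {
        double A[] = {3.0, 5.0, 2.0, 4.0};   /* example data */
        int n = sizeof A / sizeof A[0];
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += square(A[i]);               /* Asqr[i] and the sum fused in one loop */
        printf("s = %f\n", s);
        return 0;
    }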
• 4. Problem Decomposition. In designing a parallel algorithm:
  Decompose the problem into smaller tasks.
  Determine which can be done in parallel with each other, and which must be done in some order.
  Conceptualize a decomposition as a task dependency graph: a directed graph with nodes corresponding to tasks and edges indicating dependencies, i.e., that the result of one task is required for processing the next.
  A given problem may be decomposed into tasks in many different ways.
  Tasks may be of the same, different, or indeterminate sizes.
  [Figure: task graph with nodes sqr(A[0]), sqr(A[1]), sqr(A[2]), ..., sqr(A[n]) all feeding into a sum node.]
• 5. Example: Multiplying a Matrix with a Vector.
  for i = 1 to n
    for j = 1 to n
      y[i] += A[i,j] * x[j];
  Dependencies: Each output element of y depends on one row of A and all of x.
  Task graph: Since each output is independent, our task graph can have n nodes and no dependence edges.
  Observations: All tasks are of the same size in terms of number of operations.
  Question: Could we have more tasks? Fewer?
  Slide derived from: Grama, Karypis, Kumar and Gupta
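A minimal C sketch of this decomposition (the function names and row-major layout are illustrative, not from the slides): each row's dot product is an independent task, so any subset of rows could run concurrently.

    /* One task: compute y[i] = A[i,:] . x for a single row i.
       A is stored row-major as a flat array of n*n doubles. */
    void row_task(const double *A, const double *x, double *y, int i, int n) {
        double acc = 0.0;
        for (int j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;                 /* each task writes a distinct y[i]: no race */
    }

    /* Sequential driver: every iteration is an independent task,
       so the rows could be assigned to threads in any order. */
    void matvec(const double *A, const double *x, double *y, int n) {
        for (int i = 0; i < n; i++)
            row_task(A, x, y, i, n);
    }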
• 6. Example: Database Query Processing. Consider the execution of the query
  MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
  on the following database:
  ID#   Model    Year  Color  Dealer  Price
  4523  Civic    2002  Blue   MN      $18,000
  3476  Corolla  1999  White  IL      $15,000
  7623  Camry    2001  Green  NY      $21,000
  9834  Prius    2001  Green  CA      $18,000
  6734  Civic    2001  White  OR      $17,000
  5342  Altima   2001  Green  FL      $19,000
  3845  Maxima   2001  Blue   NY      $22,000
  8354  Accord   2000  Green  VT      $18,000
  4395  Civic    2001  Red    CA      $17,000
  7352  Civic    2002  Red    WA      $18,000
  Slide derived from: Grama, Karypis, Kumar and Gupta
• 7. Example: Database Query Processing. One decomposition creates tasks that generate an intermediate table of entries that satisfy a particular clause. Slide derived from: Grama, Karypis, Kumar and Gupta
• 8. Example: Database Query Processing. Here is a different decomposition. The choice of decomposition will affect parallel performance. Slide derived from: Grama, Karypis, Kumar and Gupta
• 9. Granularity of Task Decompositions. Granularity: the number of tasks into which a problem is decomposed. Rough terminology: fine-grained means a large number of small tasks; coarse-grained means a smaller number of larger tasks. [Figure: a coarse-grained version of dense matrix-vector product.] Slide derived from: Grama, Karypis, Kumar and Gupta
• 10. Degree of Concurrency. The degree of concurrency of a task graph is the number of tasks that can be executed in parallel. It may vary over the execution, so we can talk about the maximum or the average. The degree of concurrency increases as the decomposition becomes finer in granularity. A directed path in a task graph represents a sequence of tasks that must be processed one after the other. The critical path is the longest such path. These graphs are normally weighted by the cost of each task (node), and path lengths are the sums of the weights. Slide derived from: Grama, Karypis, Kumar and Gupta
• 11. Limits on Parallel Performance. Parallel time can be made smaller by making the decomposition finer. There is an inherent bound on how fine the granularity of a computation can be. For example, in the case of multiplying a dense matrix with a vector, there can be no more than Θ(n²) concurrent tasks. In addition, there may be communication overhead between tasks. Slide derived from: Grama, Karypis, Kumar and Gupta
• 12. Shared Memory Programming. A program is a collection of threads of control; threads can be created dynamically, mid-execution, in some languages. Each thread has a set of private variables, e.g., local stack variables, and also a set of shared variables, e.g., static variables, shared common blocks, or the global heap. Threads communicate implicitly by writing and reading shared variables. Threads coordinate by synchronizing on shared variables. [Figure: processors P0, P1, ..., Pn access a shared memory holding s (via statements like s = ... and y = ..s ...), while each keeps its own copy of i (e.g., i: 2, i: 5, i: 8) in private memory.]
• 13. Shared Memory "Code" for Computing a Sum.
  static int s = 0;
  Thread 1:
    for i = 0, n/2-1
      s = s + sqr(A[i])
  Thread 2:
    for i = n/2, n-1
      s = s + sqr(A[i])
  The problem is a race condition on variable s in the program. A race condition or data race occurs when two processors (or two threads) access the same variable, and at least one does a write. The accesses are concurrent (not synchronized), so they could happen simultaneously.
• 14. Shared Memory Code for Computing a Sum.
  static int s = 0;
  Thread 1:
    ...
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    ...
  Thread 2:
    ...
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    ...
  Assume A = [3,5], f is the square function, and s = 0 initially. For this program to work, s should be 34 at the end, but it may be 34, 9, or 25 depending on how the threads interleave. The atomic operations are the reads and writes: you never see half of one number, but the += operation is not atomic. All computations happen in (private) registers.
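A minimal Pthreads sketch of this buggy pattern (the array contents and thread count are illustrative, not from the slides); because the update of s is not atomic, different runs can print different results.

    #include <pthread.h>
    #include <stdio.h>

    static int A[2] = {3, 5};
    static int s = 0;                        /* shared, unprotected */

    static void *add_square(void *arg) {
        int i = *(int *)arg;
        s = s + A[i] * A[i];                 /* read-modify-write: not atomic */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int idx[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, add_square, &idx[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("s = %d (expected 34, but 9 or 25 are possible)\n", s);
        return 0;
    }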
• 15. Corrected Code for Computing a Sum.
  static int s = 0;
  static lock lk;
  Thread 1:
    local_s1 = 0
    for i = 0, n/2-1
      local_s1 = local_s1 + sqr(A[i])
    lock(lk);
    s = s + local_s1
    unlock(lk);
  Thread 2:
    local_s2 = 0
    for i = n/2, n-1
      local_s2 = local_s2 + sqr(A[i])
    lock(lk);
    s = s + local_s2
    unlock(lk);
  Since addition is associative, it's OK to rearrange the order. Right? Most computation is on private variables. Sharing frequency is also reduced, which might improve speed. But there is still a race condition on the update of the shared s, which is why the lock(lk)/unlock(lk) pair protects that update.
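A runnable C rendering of this corrected pattern (the array contents, block assignment, and thread count are illustrative, not from the slides): each thread accumulates a private partial sum and only the final update of s is protected by a mutex.

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    #define NTHREADS 2

    static int A[N] = {3, 5, 2, 4, 1, 6, 7, 8};
    static int s = 0;                                /* shared sum */
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    static void *partial_sum(void *arg) {
        int t = *(int *)arg;                         /* thread index 0..NTHREADS-1 */
        int local_s = 0;                             /* private partial sum */
        for (int i = t * N / NTHREADS; i < (t + 1) * N / NTHREADS; i++)
            local_s += A[i] * A[i];
        pthread_mutex_lock(&lk);                     /* protect the shared update */
        s += local_s;
        pthread_mutex_unlock(&lk);
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        int id[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            id[t] = t;
            pthread_create(&th[t], NULL, partial_sum, &id[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(th[t], NULL);
        printf("sum of squares = %d\n", s);
        return 0;
    }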
• 16. Parallel Programming with Threads
• 17. Overview of POSIX Threads. POSIX (Portable Operating System Interface for UNIX) is the interface to operating system utilities. PThreads is the POSIX threading interface: system calls to create and synchronize threads, which should be relatively uniform across UNIX-like OS platforms. PThreads contain support for creating parallelism and for synchronizing. There is no explicit support for communication, because shared memory is implicit; a pointer to shared data is passed to a thread.
• 18. Forking POSIX Threads.
  Signature:
    int pthread_create(pthread_t *, const pthread_attr_t *, void * (*)(void *), void *);
  Example call:
    errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
  thread_id is the thread id or handle (used to halt, etc.).
  thread_attribute: various attributes; standard default values are obtained by passing a NULL pointer.
  thread_fun: the function to be run (takes and returns void*).
  fun_arg: an argument that can be passed to thread_fun when it starts.
  errcode will be set nonzero if the create operation fails.
• 19. Simple Threading Example.
  void* SayHello(void *foo) {
    printf( "Hello, world!\n" );
    return NULL;
  }
  int main() {
    pthread_t threads[16];
    int tn;
    for(tn=0; tn<16; tn++) {
      pthread_create(&threads[tn], NULL, SayHello, NULL);
    }
    for(tn=0; tn<16; tn++) {
      pthread_join(threads[tn], NULL);
    }
    return 0;
  }
  Compile using gcc -lpthread (see Millennium/NERSC docs for paths/modules).
  Stop, run code, and discuss.
• 20. Loop-Level Parallelism. Many applications have parallelism in loops. With threads:
  ... A[n];
  for (int i = 0; i < n; i++)
    ... pthread_create(square, ..., &A[i]);
  Problem: the overhead of thread creation is nontrivial, and square is not enough work for a separate thread (it takes much less time to square a number than to create a thread). Unlike the original example, this would overwrite A[i]; how would you do this if you wanted a separate result array? (One possible approach is sketched below.)
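One possible way to address both issues (a hypothetical sketch, not the slide's prescribed solution): give each thread a contiguous block of iterations, passed via a small argument struct, and write the squares into a separate result array.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        const double *in;          /* input array A */
        double *out;               /* separate result array */
        int lo, hi;                /* this thread's block [lo, hi) */
    } block_arg_t;

    static void *square_block(void *p) {
        block_arg_t *b = (block_arg_t *)p;
        for (int i = b->lo; i < b->hi; i++)
            b->out[i] = b->in[i] * b->in[i];
        return NULL;
    }

    /* Squares A[0..n-1] into Asqr using nthreads threads. */
    void parallel_square(const double *A, double *Asqr, int n, int nthreads) {
        pthread_t *th = malloc(nthreads * sizeof(pthread_t));
        block_arg_t *args = malloc(nthreads * sizeof(block_arg_t));
        for (int t = 0; t < nthreads; t++) {
            args[t].in = A;
            args[t].out = Asqr;
            args[t].lo = t * n / nthreads;
            args[t].hi = (t + 1) * n / nthreads;
            pthread_create(&th[t], NULL, square_block, &args[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(th[t], NULL);
        free(th);
        free(args);
    }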
• 21. Shared Data and Threads. Variables declared outside of main are shared. Objects allocated on the heap may be shared (if a pointer is passed). Variables on the stack are private: passing pointers to these around to other threads can cause problems. Sharing is often done by creating a large "thread data" struct, passed into all threads as an argument. Simple example:
  char *message = "Hello World!\n";
  pthread_create( &thread1, NULL, (void*)&print_fun, (void*) message);
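A complete, hedged version of this fragment (print_fun and the final join are filled in here; they are not shown on the slide): the thread receives a pointer to the shared message through its void* argument.

    #include <pthread.h>
    #include <stdio.h>

    /* Thread body: the argument is a pointer to shared (read-only) data. */
    static void *print_fun(void *arg) {
        printf("%s", (char *)arg);
        return NULL;
    }

    int main(void) {
        pthread_t thread1;
        char *message = "Hello World!\n";     /* heap or global data would also work */
        pthread_create(&thread1, NULL, print_fun, (void *)message);
        pthread_join(thread1, NULL);          /* wait before main exits */
        return 0;
    }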
• 22. (Details: Setting Attribute Values.) Once an initialized attribute object exists, changes can be made. For example, to change the stack size for a thread to 8192 (before calling pthread_create):
  pthread_attr_setstacksize(&my_attributes, (size_t)8192);
  To get the stack size:
  size_t my_stack_size;
  pthread_attr_getstacksize(&my_attributes, &my_stack_size);
  Other attributes:
  Detached state: set if no other thread will use pthread_join to wait for this thread (improves efficiency).
  Guard size: used to protect against stack overflow.
  Inherit scheduling attributes (from the creating thread), or not.
  Scheduling parameter(s): in particular, thread priority.
  Scheduling policy: FIFO or Round Robin.
  Contention scope: with what threads does this thread compete for a CPU.
  Stack address: explicitly dictate where the stack is located.
  Lazy stack allocation: allocate on demand (lazy) or all at once, "up front".
  Slide Source: Theewara Vorakosit
• 23. Basic Types of Synchronization: Barrier. A barrier provides global synchronization. It is especially common when running multiple copies of the same function in parallel, i.e., SPMD ("Single Program Multiple Data"). Simple use of barriers: all threads hit the same one.
  work_on_my_problem();
  barrier;
  get_data_from_others();
  barrier;
  More complicated: barriers on branches (or loops).
  if (tid % 2 == 0) {
    work1();
    barrier
  } else {
    barrier
  }
• 24. Creating and Initializing a Barrier. To (dynamically) initialize a barrier, use code similar to this (which sets the number of threads to 3):
  pthread_barrier_t b;
  pthread_barrier_init(&b, NULL, 3);
  The second argument specifies an object attribute; using NULL yields the default attributes. To wait at a barrier, a process executes:
  pthread_barrier_wait(&b);
  This barrier could have been statically initialized by assigning an initial value created using the macro PTHREAD_BARRIER_INITIALIZER(3).
  Note: the barrier is not in all Pthreads implementations, but we'll provide something you can use when it isn't.
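A minimal SPMD-style sketch using this barrier API (the thread count and the phase work are illustrative, not from the slides): every thread must reach the barrier before any thread continues to the second phase.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3

    static pthread_barrier_t b;

    static void *worker(void *arg) {
        int tid = *(int *)arg;
        printf("thread %d: phase 1\n", tid);   /* e.g., work_on_my_problem() */
        pthread_barrier_wait(&b);              /* wait for all threads */
        printf("thread %d: phase 2\n", tid);   /* e.g., get_data_from_others() */
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        int id[NTHREADS];
        pthread_barrier_init(&b, NULL, NTHREADS);
        for (int t = 0; t < NTHREADS; t++) {
            id[t] = t;
            pthread_create(&th[t], NULL, worker, &id[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(th[t], NULL);
        pthread_barrier_destroy(&b);
        return 0;
    }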
• 25. Basic Types of Synchronization: Mutexes. Mutexes (mutual exclusion), also known as locks, are used when threads are working mostly independently but need to access a common data structure:
  lock *l = alloc_and_init();   /* shared */
  acquire(l);
  access data
  release(l);
  Java and other languages have lexically scoped synchronization, similar to the cobegin/coend vs. fork-and-join tradeoff. Semaphores give guarantees on "fairness" in getting the lock, but embody the same idea of mutual exclusion. Locks only affect the processors using them: pair-wise synchronization.
• 26. Mutexes in POSIX Threads. To create a mutex:
  #include <pthread.h>
  pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;
  /* or: */ pthread_mutex_init(&amutex, NULL);
  To use it:
  pthread_mutex_lock(&amutex);
  pthread_mutex_unlock(&amutex);
  To deallocate a mutex:
  int pthread_mutex_destroy(pthread_mutex_t *mutex);
  Multiple mutexes may be held, but this can lead to deadlock:
  thread1          thread2
  lock(a)          lock(b)
  lock(b)          lock(a)
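One common way to avoid this deadlock (a sketch of the lock-ordering idea, not something prescribed on the slide) is to have every thread acquire the mutexes in the same global order:

    #include <pthread.h>

    static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

    /* Every thread takes a before b, so the circular wait
       in the slide's example cannot occur. */
    static void update_both(void) {
        pthread_mutex_lock(&a);
        pthread_mutex_lock(&b);
        /* ... touch data protected by both locks ... */
        pthread_mutex_unlock(&b);
        pthread_mutex_unlock(&a);
    }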
• 27. Shared Memory Programming. There are several other thread libraries besides PThreads; e.g., Solaris threads are very similar, and there are older libraries such as P4, Parmacs, etc. OpenMP can also be used for shared memory parallel programming (http://www.openMP.org). It is easier to use, e.g., you just mark a loop as parallel, but it is not available everywhere, and performance is harder to control.
• 28. Summary of Programming with Threads. POSIX Threads are based on OS features. They can be used from multiple languages (you need the appropriate header), and most of the program stays in a familiar language. The ability to share data is convenient. Pitfalls: data race bugs are very nasty to find because they can be intermittent; deadlocks are usually easier to find, but can also be intermittent. Researchers are looking at transactional memory as an alternative. OpenMP is commonly used today as an alternative.
• 29. Monte Carlo Example
• 30. Example: Monte Carlo Pi Calculation. Estimate π by throwing darts at a unit square and calculating the percentage that fall in the unit circle. Area of the square = r² = 1; area of the circle quadrant = ¼·π·r² = π/4. Randomly throw darts at (x, y) positions; if x² + y² < 1, the point is inside the circle. Compute the ratio (# points inside / # points total); then π = 4 × ratio. [Figure: quarter circle of radius r = 1 inscribed in the unit square.]
• 31. Pi in C. Independent estimates of pi:
  main(int argc, char **argv) {
    int i, hits = 0, trials = 0;
    double pi;
    if (argc != 2) trials = 1000000;
    else trials = atoi(argv[1]);
    srand(0); // see hw tutorial
    for (i = 0; i < trials; i++)
      hits += hit();
    pi = 4.0*hits/trials;
    printf("PI estimated to %f.", pi);
  }
• 32. Helper Code for Pi in UPC. Required includes:
  #include <stdio.h>
  #include <stdlib.h>
  #include <math.h>
  Function to throw a dart and calculate where it hits:
  int hit() {
    int const rand_max = 0xFFFFFF;
    double x = ((double) rand()) / RAND_MAX;
    double y = ((double) rand()) / RAND_MAX;
    if ((x*x + y*y) <= 1.0) {
      return(1);
    } else {
      return(0);
    }
  }
• 33. Parallelism in Pi. Stop and discuss: What are some possible parallel task decompositions? (One threaded sketch appears below.)
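One possible decomposition (a hedged sketch, not the course's official solution): each thread throws trials/NTHREADS darts with a private hit counter and adds its count to a shared total under a mutex. Note that rand() is not thread-safe, so this sketch uses rand_r() with a per-thread seed; NTHREADS and the seeding scheme are illustrative.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4

    static long trials_per_thread;
    static long total_hits = 0;
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    static void *throw_darts(void *arg) {
        unsigned int seed = (unsigned int)(long)arg;   /* per-thread seed */
        long my_hits = 0;
        for (long i = 0; i < trials_per_thread; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                my_hits++;                             /* private counter: no race */
        }
        pthread_mutex_lock(&lk);                       /* one shared update per thread */
        total_hits += my_hits;
        pthread_mutex_unlock(&lk);
        return NULL;
    }

    int main(int argc, char **argv) {
        long trials = (argc == 2) ? atol(argv[1]) : 1000000;
        trials_per_thread = trials / NTHREADS;
        pthread_t th[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&th[t], NULL, throw_darts, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(th[t], NULL);
        printf("PI estimated to %f.\n",
               4.0 * total_hits / (trials_per_thread * NTHREADS));
        return 0;
    }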
• 34. Administrivia. See the web page for lecture slides and some reading assignments. Please fill out the course survey. Homework 0 is due Friday; Homework 1 will be handed out by Friday (online). Pick up your NERSC user agreement forms from Brian and return them to him when they're filled out (list UC Berkeley as your institution and Prof. Yelick as PI). If you didn't get one, you can download and fax it yourself: http://www.nersc.gov/nusers/accounts/usage.php
• 35. Extra Slides
• 36. Simple Example. Shared memory strategy: a small number p << n = size(A) of processors attached to a single memory. Parallel decomposition: each evaluation and each partial sum is a task. Assign n/p numbers to each of the p procs; each computes independent "private" results and a partial sum; then collect the p partial sums and compute a global sum. Two classes of data: logically shared (the original n numbers, the global sum) and logically private (the individual function evaluations). What about the individual partial sums?