• 1. CS 194 Parallel Programming: Why Program for Parallelism?
  Katherine Yelick
  yelick@cs.berkeley.edu
  http://www.cs.berkeley.edu/~yelick/cs194f07
• 2. What is Parallel Computing?
  - Parallel computing: using multiple processors in parallel to solve problems more quickly than with a single processor
  - Examples of parallel machines:
    - A cluster computer: multiple PCs combined with a high-speed network
    - A shared memory multiprocessor (SMP*): multiple processors connected to a single memory system
    - A chip multiprocessor (CMP): multiple processors (called cores) on a single chip
  - Concurrent execution comes from the desire for performance, unlike the inherent concurrency in a multi-user distributed system
  - A minimal shared-memory sketch follows below
  * Technically, SMP stands for "Symmetric Multi-Processor"
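  As a minimal sketch of shared-memory parallelism on an SMP or CMP, here is a small C program using OpenMP. The programming model and all values are chosen purely for illustration; the slide does not prescribe any particular model.

      /* Minimal shared-memory parallelism sketch (illustrative only).
         Compile with: gcc -fopenmp sum.c                              */
      #include <stdio.h>
      #include <omp.h>

      #define N 1000000

      int main(void) {
          static double a[N];
          double sum = 0.0;

          /* Loop iterations are divided among the available cores;
             the reduction combines each thread's partial sum.        */
          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < N; i++) {
              a[i] = 0.5 * i;
              sum += a[i];
          }

          printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
          return 0;
      }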
• 3. Why Parallel Computing Now?
  - Researchers have been using parallel computing for decades:
    - Mostly in computational science and engineering
    - Problems too large to solve on one computer; use 100s or 1000s of processors
    - There has been a graduate course in parallel computing (CS267) for over a decade
  - Many companies in the 80s/90s "bet" on parallel computing and failed
    - Computers got faster too quickly for there to be a large market
  - Why is Berkeley adding an undergraduate course now?
    - Because the entire computing industry has bet on parallelism
    - There is a desperate need for parallel programmers
  - Let's see why...
• 4. Technology Trends: Microprocessor Capacity
  - Moore's Law: 2X transistors per chip every 1.5 years
  - Microprocessors have become smaller, denser, and more powerful
  - Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months
  Slide source: Jack Dongarra
• 5. Microprocessor Transistors and Clock Rate
  [Charts: growth in transistors per chip; increase in clock rate]
  - Why bother with parallel programming? Just wait a year or two...
• 6. Limit #1: Power Density
  [Chart: power density (W/cm^2) vs. year, 1970-2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, with reference lines for a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]
  - Scaling clock speed (business as usual) will not work
  - "Can soon put more transistors on a chip than can afford to turn on." -- Patterson '07
• 7. Parallelism Saves Power
  - Exploit explicit parallelism to reduce power
    - Power = C * V^2 * F   (C = capacitance, V = voltage, F = frequency)
    - Performance = Cores * F
  - Using additional cores increases density (more transistors = more capacitance)
    - Increase cores (2x) at the same frequency: 2x performance
      Power = 2C * V^2 * F, Performance = 2 * Cores * F
    - Or increase cores (2x) but decrease frequency (and voltage) by half: same performance at 1/4 the power
      Power = 2C * (V^2/4) * (F/2) = (C * V^2 * F) / 4, Performance = 2 * Cores * (F/2) = Cores * F
  - Additional benefit: small/simple cores give more predictable performance
  - A back-of-the-envelope check of this arithmetic follows below
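  The short C program below just re-derives the numbers above, with C, V, and F normalized to 1. As in the slide's equations, the implicit assumption is that halving the frequency also lets us halve the voltage.

      /* Dynamic power ~ C * V^2 * F; performance ~ cores * F. */
      #include <stdio.h>

      int main(void) {
          double C = 1.0, V = 1.0, F = 1.0;   /* normalized baseline values */

          double power1 = C * V * V * F;                  /* 1 core             */
          double perf1  = 1 * F;

          double power2 = (2 * C) * V * V * F;            /* 2 cores, same V, F */
          double perf2  = 2 * F;

          double power3 = (2 * C) * (V / 2) * (V / 2) * (F / 2);  /* 2 cores, V/2, F/2 */
          double perf3  = 2 * (F / 2);

          printf("1 core       : power %.2f  performance %.2f\n", power1, perf1);
          printf("2 cores      : power %.2f  performance %.2f\n", power2, perf2);
          printf("2 cores, F/2 : power %.2f  performance %.2f\n", power3, perf3);
          /* Last case: same performance as the baseline at 1/4 the power. */
          return 0;
      }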
• 8. Limit #2: Hidden Parallelism Tapped Out
  [Chart: uniprocessor performance, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
  - VAX: 25%/year, 1978 to 1986
  - RISC + x86: 52%/year, 1986 to 2002
  - Application performance was increasing by 52% per year, as measured by the SPECint benchmarks shown here
    - About half due to transistor density
    - About half due to architecture changes, e.g., instruction-level parallelism (ILP)
• 9. Limit #2: Hidden Parallelism Tapped Out
  - Superscalar (SS) designs were the state of the art; many forms of parallelism were not visible to the programmer:
    - Multiple instruction issue
    - Dynamic scheduling: hardware discovers parallelism between instructions
    - Speculative execution: look past predicted branches
    - Non-blocking caches: multiple outstanding memory operations
  - You may have heard of these in 61C, but you haven't needed to know about them to write software
  - Unfortunately, these sources have been used up
  - A small illustration of instruction-level parallelism follows below
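  A purely illustrative C fragment (hypothetical, not from the slides) of the kind of parallelism the hardware looks for: in the first loop every addition depends on the previous one, while the second keeps four independent accumulators that a multiple-issue processor can overlap.

      /* Dependent chain: each addition must wait for the previous result. */
      double sum_chain(const double *x, int n) {
          double s = 0.0;
          for (int i = 0; i < n; i++)
              s += x[i];
          return s;
      }

      /* Four independent accumulators: more instructions can issue per cycle.
         For simplicity this version assumes n is a multiple of 4.            */
      double sum_unrolled(const double *x, int n) {
          double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
          for (int i = 0; i < n; i += 4) {
              s0 += x[i];
              s1 += x[i + 1];
              s2 += x[i + 2];
              s3 += x[i + 3];
          }
          return s0 + s1 + s2 + s3;
      }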
• 10. Performance Comparison
  - The measure of success for hidden parallelism is instructions per cycle (IPC)
  - The 6-issue design has higher IPC than the 2-issue design, but far less than 3x higher
  - Reasons: waiting for memory (D-cache and I-cache stalls) and dependencies (pipeline stalls)
  Graphs from: Olukotun et al., ASPLOS, 1996
• 11. Uniprocessor Performance (SPECint) Today
  [Chart, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; annotations: "3X" gap, "2x every 5 years?"]
  - VAX: 25%/year, 1978 to 1986
  - RISC + x86: 52%/year, 1986 to 2002
  - RISC + x86: ??%/year, 2002 to present
  - Sea change in chip design: multiple "cores" or processors per chip
• 12. Limit #3: Chip Yield
  - Moore's (Rock's) 2nd law: fabrication costs go up
    - Yield (% of usable chips) drops
  - Manufacturing costs and yield problems limit the use of density
  - Parallelism can help:
    - Smaller, simpler processors are easier to design and validate
    - Partially working chips can still be used: e.g., the Cell processor (PS3) is sold with 7 of its 8 cores "on" to improve yield
• 13. Limit #4: Speed of Light (Fundamental)
  - Consider a 1 Tflop/s sequential machine:
    - Data must travel some distance, r, to get from memory to the CPU
    - To get 1 data element per cycle, data must make the trip 10^12 times per second at the speed of light, c = 3x10^8 m/s
    - Thus r < c / 10^12 = 0.3 mm
  - Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
    - Each bit occupies about 1 square Angstrom, the size of a small atom
  - No choice but parallelism
  - A short numerical check of this argument follows below
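  A short C program that just re-derives the numbers in the argument above (a sketch; it takes 1 Tbyte as 8x10^12 bits and 1 square Angstrom as 10^-20 m^2):

      #include <stdio.h>

      int main(void) {
          double c    = 3.0e8;     /* speed of light, m/s                   */
          double rate = 1.0e12;    /* 1 Tflop/s: one operand every 1e-12 s  */

          double r = c / rate;     /* farthest memory can be from the CPU   */
          printf("r = %.2f mm\n", r * 1000.0);             /* about 0.3 mm  */

          double bits = 8.0e12;                    /* 1 Tbyte of storage    */
          double area_per_bit = (r * r) / bits;    /* m^2 available per bit */
          printf("area per bit = %.2f square Angstroms\n",
                 area_per_bit / 1.0e-20);     /* about 1 square Angstrom    */
          return 0;
      }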
• 14. Revolution is Happening Now
  - Chip density is continuing to increase, ~2x every 2 years
    - Clock speed is not
    - The number of processor cores may double instead
  - There is little or no hidden parallelism (ILP) left to be found
  - Parallelism must be exposed to and managed by software
  Source: Intel, Microsoft (Sutter), and Stanford (Olukotun, Hammond)
• 15. Multicore in Products
  - "We are dedicating all of our future product development to multicore designs. ... This is a sea change in computing." Paul Otellini, President, Intel (2005)
  - All microprocessor companies have switched to multiprocessors (2X CPUs / 2 yrs)
    - Procrastination penalized: 2X sequential performance / 5 yrs
  - And at the same time:
    - The STI Cell processor (PS3) has 8 cores
    - The latest NVIDIA graphics processing unit (GPU) has 128 cores
    - Intel has demonstrated an 80-core research chip

  Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'07
  Processors/chip         2          2          2         8
  Threads/processor       1          2          2        16
  Threads/chip            2          4          4       128
• 16. Tunnel Vision by Experts
  - "On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it." Ken Kennedy, CRPC Director, 1994
  - "640K [of memory] ought to be enough for anybody." Bill Gates, chairman of Microsoft, 1981
  - "There is no reason for any individual to have a computer in their home." Ken Olsen, president and founder of Digital Equipment Corporation, 1977
  - "I think there is a world market for maybe five computers." Thomas Watson, chairman of IBM, 1943
  Slide source: Warfield et al.
• 17. Why Parallelism (2007)?
  - These arguments are no longer theoretical
    - All major processor vendors are producing multicore chips
    - Every machine will soon be a parallel machine
    - All programmers will be parallel programmers???
  - New software model
    - Want a new feature? Hide the "cost" by speeding up the code first
    - All programmers will be performance programmers???
  - Some of this may eventually be hidden in libraries, compilers, and high-level languages
    - But a lot of work is needed to get there
  - Big open questions:
    - What will be the killer apps for multicore machines?
    - How should the chips be designed, and how will they be programmed?
• 18. Outline
  - Why powerful computers must be parallel processors (all of them, including your laptop)
  - Why writing (fast) parallel programs is hard
  - Principles of parallel computing performance
  - Structure of the course
• 19. Why writing (fast) parallel programs is hard
• 20. Principles of Parallel Computing
  - Finding enough parallelism (Amdahl's Law)
  - Granularity
  - Locality
  - Load balance
  - Coordination and synchronization
  - Performance modeling
  All of these things make parallel programming even harder than sequential programming.
• 21. Finding Enough Parallelism
  - Suppose only part of an application seems parallel
  - Amdahl's Law:
    - Let s be the fraction of work done sequentially, so (1-s) is the fraction that is parallelizable
    - Let P = number of processors
    - Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s
  - Even if the parallel part speeds up perfectly, performance is limited by the sequential part
  - A small numeric illustration follows below
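  A small C sketch that evaluates the formula above; the 10% sequential fraction is made up purely for illustration.

      #include <stdio.h>

      /* Speedup(P) = 1 / (s + (1-s)/P), where s is the sequential fraction. */
      static double amdahl_speedup(double s, int p) {
          return 1.0 / (s + (1.0 - s) / p);
      }

      int main(void) {
          double s = 0.10;            /* assume 10% of the work is sequential */
          int procs[] = { 1, 2, 4, 16, 256, 65536 };
          int count = sizeof procs / sizeof procs[0];

          for (int i = 0; i < count; i++)
              printf("P = %6d  speedup = %.2f\n", procs[i], amdahl_speedup(s, procs[i]));

          /* No matter how large P gets, the speedup never exceeds 1/s = 10. */
          return 0;
      }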
• 22. Overhead of Parallelism
  - Given enough parallel work, this is the biggest barrier to getting the desired speedup
  - Parallelism overheads include:
    - Cost of starting a thread or process
    - Cost of communicating shared data
    - Cost of synchronizing
    - Extra (redundant) computation
  - Each of these can be in the range of milliseconds (= millions of flops) on some systems
  - Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
  - A granularity sketch follows below
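  A granularity sketch in C with pthreads (array size, thread count, and block decomposition are made-up illustration values): each thread gets a large block of the array, so the cost of creating and joining threads is amortized over plenty of work, rather than being paid per element.

      /* Compile with: gcc -pthread scale.c */
      #include <pthread.h>
      #include <stdio.h>

      #define N        1000000
      #define NTHREADS 4
      #define BLOCK    (N / NTHREADS)   /* assumes NTHREADS divides N evenly */

      static double x[N];

      static void *scale_block(void *arg) {
          long t = (long)arg;
          for (long i = t * BLOCK; i < (t + 1) * BLOCK; i++)
              x[i] *= 2.0;              /* lots of work per thread creation  */
          return NULL;
      }

      int main(void) {
          pthread_t tid[NTHREADS];

          for (long i = 0; i < N; i++)  /* initialize the shared array */
              x[i] = 1.0;

          for (long t = 0; t < NTHREADS; t++)
              pthread_create(&tid[t], NULL, scale_block, (void *)t);
          for (long t = 0; t < NTHREADS; t++)
              pthread_join(tid[t], NULL);

          printf("x[0] = %.1f, x[N-1] = %.1f\n", x[0], x[N - 1]);
          return 0;
      }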
• 23. Locality and Parallelism
  - Large memories are slow; fast memories are small
  - Storage hierarchies are large and fast on average
  - Parallel processors, collectively, have large, fast caches
    - The slow accesses to "remote" data are what we call "communication"
  - An algorithm should do most of its work on local data (see the sketch below)
  [Diagram: conventional storage hierarchy (Proc, Cache, L2 Cache, L3 Cache, Memory), replicated per processor, with potential interconnects between the memories]
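  A minimal illustration of doing most work on "local" data, here at the cache level (sizes are illustrative): traversing a C matrix row by row walks through memory contiguously and mostly hits in cache, while a column-by-column traversal touches a distant cache line on almost every access.

      #include <stdio.h>

      #define N 1024
      static double a[N][N];

      /* Good locality: the inner loop walks contiguous memory. */
      static double sum_by_rows(void) {
          double s = 0.0;
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  s += a[i][j];
          return s;
      }

      /* Poor locality: the inner loop strides by a full row (8 KB) per access. */
      static double sum_by_columns(void) {
          double s = 0.0;
          for (int j = 0; j < N; j++)
              for (int i = 0; i < N; i++)
                  s += a[i][j];
          return s;
      }

      int main(void) {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  a[i][j] = 1.0;
          printf("%.0f %.0f\n", sum_by_rows(), sum_by_columns());
          return 0;
      }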
• 24. Load Imbalance
  - Load imbalance is the time that some processors in the system are idle, due to:
    - Insufficient parallelism (during that phase)
    - Unequal size tasks
  - Examples of the latter:
    - Adapting to "interesting parts of a domain"
    - Tree-structured computations
    - Fundamentally unstructured problems
  - The algorithm needs to balance the load (a sketch using dynamic scheduling follows below)
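  One common way an algorithm balances load is dynamic scheduling: idle processors grab the next task instead of receiving a fixed share up front. A sketch in C with OpenMP (the task sizes are artificial, just to make the imbalance visible):

      #include <stdio.h>

      /* Artificially unequal task sizes: task i does i*1000 units of work. */
      static double work(int i) {
          double s = 0.0;
          for (long k = 0; k < (long)i * 1000; k++)
              s += 1.0;
          return s;
      }

      int main(void) {
          double total = 0.0;

          /* schedule(dynamic): threads pull iterations as they finish, so a
             thread that drew small tasks does not sit idle at the end.
             schedule(static) would split iterations evenly by index, leaving
             the thread that owns the largest indices with most of the work. */
          #pragma omp parallel for schedule(dynamic) reduction(+:total)
          for (int i = 0; i < 2000; i++)
              total += work(i);

          printf("total = %.0f\n", total);
          return 0;
      }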
• 25. Course Organization
• 26. Course Mechanics
  - Expected background:
    - All of the 61 series
    - At least one upper-division software/systems course, preferably 162
  - Work in the course:
    - Homework with programming (~1/week for the first 8 weeks)
    - Parallel hardware in CS, from Intel, at LBNL
    - Final project of your own choosing: may use other hardware (PS3, GPUs, Niagara 2, etc.) depending on availability
    - 2 in-class quizzes, mostly covering lecture topics
  - See the course web page for a tentative calendar, etc.: http://www.cs.berkeley.edu/~yelick/cs194f07
  - Grades: homework (30%), quizzes (30%), project (40%)
  - Caveat: This is the first offering of this course, so things will change dynamically
• 27. Reading Materials
  - Optional text:
    - Introduction to Parallel Computing, 2nd Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Addison-Wesley, 2003
  - Some online texts (on high-performance scientific programming):
    - Demmel's notes from CS267 Spring 1999, which are similar to 2000 and 2001; however, they contain links to HTML notes from 1996: http://www.cs.berkeley.edu/~demmel/cs267_Spr99/
    - Ian Foster's book, "Designing and Building Parallel Programs": http://www-unix.mcs.anl.gov/dbpp/