• 1. The Spark Project Today and What’s Next. Andy Konwinski (@andykonwinski)
  • 2. Community
  • 3. Project History: Spark started as a research project in 2009. Open sourced in 2010. Entered the Apache Incubator in June 2013. Graduated to Top-Level Project in 8 months.
  • 4. Development Community: Over 100 contributors, from 30+ companies: Databricks, Yahoo!, Intel, Adobe, Cloudera, Bizo, …
  • 5. Development Community: One of the most active communities in big data. Comparison: Storm (48), Giraph (52), Drill (18), Tez (1… Past 14 months: more active devs than Hadoop MapReduce!
  • 6. Development Community: Healthy across the whole ecosystem
  • 7. Release Growth (contributors per release): Spark 0.6 (Oct ‘12): 17 contributors; Spark 0.7 (Feb ‘13): 31 contributors; Spark 0.8 (Sept ‘13): 67 contributors; Spark 0.9 (Feb ‘14): 83 contributors
  • 8. Some Community Contributions: YARN support (Yahoo!); Columnar compression in Shark (Yahoo!); Fair scheduling (Intel); Metrics reporting (Intel, Quantifind); New RDD operators (Bizo, ClearStory); Scala 2.10 support (Imaginea)
  • 9. Events Growing Exponentially (attendees): AMP Camp 1 (Aug 2012): 150; AMP Camp 2 (Aug 2013): 220; Spark Summit (Nov 2013): 440; Spark Summit (Jun 2014): 900
  • 10. Meetup Groups Taking Off: Spark meetups now scheduled or running in SF & South Bay Area, NYC, Seattle, Vancouver, and Austin. Early plans are coming together in DC and Portland.
  • 11. Vendor Training & Support
  • 12. Certified on Spark: Ensure Apache Spark is THE standard. Certify distributions. Certify apps that build on Spark.
  • 13. Integrated Component Stack: Spark, with Spark Streaming (real-time), Spark SQL (Catalyst, Shark), BlinkDB, MLlib (machine learning), GraphX (graph), …
  • 14. What’s Next?
  • 15. Our View: While big data tools have advanced a lot, they are still far too difficult to tune and use. Goal: design big data systems that are as powerful & seamless as those for small data.
  • 16. Current Priorities: 1) Standard libraries; 2) Deployment; 3) Ease of use; 4) Spark 1.0
  • 17. 1) Standard Libraries: While writing K-means in 30 lines is great, it’s even better to call it from a library! Spark 0.8 introduced MLlib with 7 algorithms. Spark 0.9 introduced GraphX and expanded MLlib. Goal: a complete and interoperable stack. More to come in Spark 1.0.
  • 18. 1) Standard Libraries
      val rdd: RDD[Array[Double]] = ...
      val model = KMeans.train(rdd, k = 10)
      val graph = Graph(vertexRDD, edgeRDD)
      val ranks = PageRank.run(graph, iters = 10)
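For contrast with the one-line library call, here is a rough sketch of what "writing K-means in 30 lines" can look like. It assumes plain local Scala collections rather than RDDs, and every name in it (`KMeansSketch`, `Point`, `train`, `closestCenter`) is hypothetical, not MLlib's API. It seeds the centers with the first k points and runs a fixed number of rounds, which is a simplification chosen for brevity.

```scala
// Plain-Scala sketch of Lloyd's k-means on local collections (no Spark).
// All names are hypothetical and do not mirror MLlib's API.
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center nearest to p.
  def closestCenter(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dim = points.head.length
    Array.tabulate(dim)(d => points.map(_(d)).sum / points.size)
  }

  // Runs a fixed number of assign-then-recompute rounds,
  // seeding the centers with the first k points (an assumption made here).
  def train(data: Seq[Point], k: Int, iterations: Int): Seq[Point] = {
    require(k > 0 && data.size >= k, "need at least k points")
    var centers: Seq[Point] = data.take(k)
    for (_ <- 1 to iterations) {
      val groups = data.groupBy(p => closestCenter(centers, p))
      // A center that attracted no points keeps its previous position.
      centers = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
    centers
  }
}
```

Even this small version has to decide on seeding, empty clusters, and a stopping rule, which is the kind of detail a standard library can settle once for everyone.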
  • 19. 2) Deployment: Want Spark to easily run anywhere. Spark 0.8: much improved YARN, EC2 support. Spark 0.8.1: support for YARN 2.2. SIMR: launch Spark in MapReduce clusters as a Hadoop job (no installation needed!). For experimenting; see talk by Ahir.
  • 20. 3) Ease of Use: Monitoring and metrics (0.8). Better support for large # of tasks (0.8.1). High availability for standalone mode (0.8.1). External hashing & sorting (0.9). Long-term: remove need to tune beyond defaults.
  • 21. 4) Spark 1.0 Release: In final release phase: feature freeze, QA only. Spark SQL: fully integrated SQL support; remove dependency on Hive. Java 8 support. Web UI persistence.
  • 22. Future Roadmap: SparkR: R API and library. Continued investment in Spark SQL (check out the YouTube video from a recent Meetup for more about this).
  • 23. What Makes Spark Unique?
  • 24. Big Data Systems Today: General batch processing: MapReduce. Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …
  • 25. Spark’s Approach: Instead of specializing, generalize MapReduce to support new apps in the same engine. Two changes (general task DAG & data sharing) are enough to express previous models! Unification has big benefits: for the engine, and for users. Components shown: Spark, Shark, Streaming, GraphX, MLbase, …
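To make "general task DAG & data sharing" a little more concrete, here is a toy, single-machine sketch. It is not how Spark is implemented. Each dataset records only how to compute itself from its parent (a chain of dependencies, the simplest kind of DAG), nothing runs until `collect()`, and `cache()` keeps a result in memory so later computations can share it. All names (`DagSketch`, `Dataset`, `Source`, `Mapped`, `Filtered`) are hypothetical.

```scala
// Toy illustration only: lazily evaluated datasets with parent links and
// optional in-memory caching. Names are hypothetical, not Spark's API.
object DagSketch {
  sealed trait Dataset[A] {
    private var cached: Option[Vector[A]] = None
    private var cacheRequested = false

    // How this dataset is derived; only called from collect().
    protected def compute(): Vector[A]

    // Marks the dataset so its first computed result is kept in memory.
    def cache(): this.type = { cacheRequested = true; this }

    // Evaluates the dependency chain, reusing a cached result if present.
    def collect(): Vector[A] = cached.getOrElse {
      val result = compute()
      if (cacheRequested) cached = Some(result)
      result
    }

    // Transformations only build new nodes; they do no work themselves.
    def map[B](f: A => B): Dataset[B] = new Mapped(this, f)
    def filter(p: A => Boolean): Dataset[A] = new Filtered(this, p)
  }

  final class Source[A](load: () => Vector[A]) extends Dataset[A] {
    protected def compute(): Vector[A] = load()
  }

  final class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    protected def compute(): Vector[B] = parent.collect().map(f)
  }

  final class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    protected def compute(): Vector[A] = parent.collect().filter(p)
  }
}
```

In this sketch, two downstream computations built on a cached dataset trigger the expensive source load only once, which is the data-sharing benefit the slide alludes to for iterative and interactive workloads.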
  • 26. Code Size (chart; non-test, non-example source lines): Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), Spark
  • 27. Code Size (chart; non-test, non-example source lines): Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), Spark with Streaming added
  • 28. Code Size (chart; non-test, non-example source lines): Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), Spark with Streaming and Shark* added (* also calls into Hive)
  • 29. Code Size (chart; non-test, non-example source lines): Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), Spark with Streaming, Shark* and GraphX added (* also calls into Hive)
  • 30. Performance (three charts). SQL, Response Time (s): Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem). Graph, Response Time (min): Hadoop, Giraph, GraphX. Streaming, Throughput (MB/s/node): Storm, Spark.
  • 31. What it Means for Users: Separate frameworks: HDFS read, ETL, HDFS write; HDFS read, train, HDFS write; HDFS read, query, HDFS write; … Spark: HDFS read, then ETL, train, query in one engine. Interactive analysis.
  • 32. Combining Processing Types. From Scala:
      val points = sc.runSql[Double, Double]("select latitude, longitude from historic_tweets")
      val model = KMeans.train(points, 10)
      sc.twitterStream(...)
        .map(t => (model.closestCenter(t.location), 1))
        .reduceByWindow("5s", _ + _)
  • 33. Combining Processing Types. From SQL (in Shark 0.8.1):
      GENERATE KMeans(tweet_locations) AS TABLE tweet_clusters
      // Scala table generating function (TGF):
      object KMeans {
        @Schema(spec = "x double, y double, cluster int")
        def apply(points: RDD[(Double, Double)]) = {
          ...
        }
      }
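The last step of the Scala pipeline on the "Combining Processing Types" slides (map each tweet to its closest cluster, then count per window) can be sketched on plain local collections. This is only an approximation under stated assumptions: the cluster centers are taken as already trained, the "stream" is a finite list of timestamped events, and windows are non-overlapping 5-second buckets. `WindowedCounts`, `Event`, and `countsPerWindow` are hypothetical names, not Spark Streaming APIs.

```scala
// Local-collections sketch: per-window counts of the closest cluster center.
// Names are hypothetical; a real stream would be unbounded.
object WindowedCounts {
  final case class Event(timeSeconds: Long, location: (Double, Double))

  // Index of the center nearest to p (squared Euclidean distance).
  def closestCenter(centers: Seq[(Double, Double)], p: (Double, Double)): Int =
    centers.indices.minBy { i =>
      val (cx, cy) = centers(i)
      val dx = cx - p._1
      val dy = cy - p._2
      dx * dx + dy * dy
    }

  // Groups events into non-overlapping windows of windowSeconds and counts,
  // per window, how many events fall nearest to each center.
  // Result: window index -> (cluster index -> count).
  def countsPerWindow(
      centers: Seq[(Double, Double)],
      events: Seq[Event],
      windowSeconds: Long): Map[Long, Map[Int, Int]] =
    events.groupBy(_.timeSeconds / windowSeconds).map { case (window, inWindow) =>
      window -> inWindow
        .groupBy(e => closestCenter(centers, e.location))
        .map { case (cluster, es) => cluster -> es.size }
    }
}
```

The point of the slide is that the model-training step and this counting step can live in one program and one engine; the sketch only mirrors the shape of that logic.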
  • 34. Conclusion: Next challenge in big data will be complex and low-latency applications. Spark offers a unified engine to tackle and combine these apps. Best strength is the community: enjoy Spark Summit China!
  • 35. Contributors: Aaron Davidson, Alexander Pivovarov, Ali Ghodsi, Ameet Talwalkar, Andre Shumacher, Andrew Ash, Andrew Psaltis, Andrew Xia, Andrey Kouznetsov, Andy Feng, Andy Konwinski, Ankur Dave, Antonio Lupher, Benjamin Hindman, Bill Zhao, Charles Reiss, Chris Mattmann, Christoph Grothaus, Christopher Nguyen, Chu Tong, Cliff Engle, Cody Koeninger, David McCauley, Denny Britz, Dmitriy Lyubimov, Edison Tung, Eric Zhang, Erik van Oosten, Ethan Jewett, Evan Chan, Evan Sparks, Ewen Cheslack-Postava, Fabrizio Milo, Fernand Pajot, Frank Dai, Gavin Li, Ginger Smith, Giovanni Delussu, Grace Huang, Haitao Yao, Haoyuan Li, Harold Lim, Harvey Feng, Henry Milner, Henry Saputra, Hiral Patel, Holden Karau, Ian Buss, Imran Rashid, Ismael Juma, James Phillpotts, Jason Dai, Jerry Shao, Jey Kottalam, Joseph E. Gonzalez, Josh Rosen, Justin Ma, Kalpit Shah, Karen Feng, Karthik Tunga, Kay Ousterhout, Kody Koeniger, Konstantin Boudnik, Lee Moon Soo, Lian Cheng, Liang-Chi Hsieh, Marc Mercer, Marek Kolodziej, Mark Hamstra, Matei Zaharia, Matthew Taylor, Michael Heuer, Mike Potts, Mikhail Bautin, Mingfei Shi, Mosharaf Chowdhury, Mridul Muralidharan, Nathan Howell, Neal Wiggins, Nick Pentreath, Olivier Grisel, Patrick Wendell, Paul Cavallaro, Paul Ruan, Peter Sankauskas, Pierre Borckmans, Prabeesh K., Prashant Sharma, Ram Sriharsha, Ravi Pandya, Ray Racine, Reynold Xin, Richard Benkovsky, Richard McKinley, Rohit Rai, Roman Tkalenko, Ryan LeCompte, S. Kumar, Sean McNamara, Shane Huang, Shivaram Venkataraman, Stephen Haberman, Tathagata Das, Thomas Dudziak, Thomas Graves, Timothy Hunter, Tyson Hamilton, Vadim Chekan, Wu Zeming, Xinghao Pan, and many more…