• 1. This is how you would solve it for small data.Chief Architect | Traintracks.io 首席架构师 | Traintracks.io 大数据分析平台Ryan BraleyBig Data Algorithms with Twitter Algebird + Spark Streaming 基于Twitter Algebird 和 Spark Streaming 的大规模流式数据处理算法
  • 2. You are an engineer at a video-sharing website company. 你是一个在视频分享网站工作的工程师。
  • 3. Task: Calculate the number of unique users in the views of all videos… 任务: 对于所有观看了视频的用户,统计去重后的用户数...…in real time. …如何实时统计?
  • 4. Small data, small problems 数据量小的时候,这个问题很容易解决Big data? 数据量非常大?SELECT COUNT(DISTINCT USER_ID) FROM VIEWS; 1 billion users. 1 PetaByte. Slow. 十亿用户,PB 级别数据,传统的解决方案将会很慢
  • 5. What if I told you that we could solve this same problem with 98% accuracy with only 1.5kb of memory? 如果我告诉你只需使用 1.5 kb 的内存,就可以将答案的准确率控制在 98% 左右,你是否会像我一样感到无比的兴奋?
  • 6. Approximation Algorithms with Twitter Algebird. 当近似算法遇见 Twitter Algebird。
  • 7. Approximation algorithms 近似算法1. Usually based on hashing 通常基于哈希表2. Performance guarantee and provable accuracy 良好的性能与可证明的准确性 3. Many are based on Monoids 很多基于 Monoids
  • 8. MonoidsIntegers: (1 + 2) + 3 = 1 + (2 + 3) 2. Strings: ("spark" + "meetup") + "beijing" = "spark" + ("meetup" + "beijing") 3. Set union: ([“jeff”, “hsu”] u [“ryan”, “braley”]) u [“nils”, “pihl"] = [“jeff”, “hsu”] u ([“ryan”, “braley”] u [“nils”, “pihl”])
  • 9. Monoids are great because they are associative and parallelizable. Monoids 是极好的因为它具有结合律和并发性。
  • 10. Approximation algorithms based on monoids have low memory footprint and are inherently parallelizable. 基于Monoids 的近似算法具有低内存占用且天生支持并行化的特点。
  • 11. Algebird
  • 12. Calculate the number of unique users in the views of all videos… 统计 views 中去重后的用户数…in real time, with the HyperLogLog algorithm. 利用 HyperLogLog 算法来实时计算
  • 13. HyperLogLog1. Similar to an approximate set 与近似集合相似2. Calculates approximation of cardinality of the set. 近似计算集合的基数3. Can calculate with predictable error bounds 可预见的误差范围98% accuracy with only 1.5kb of memory. 只需 1.5KB 的内存,98%的准确率。
  • 14. Task #2: Calculate the number of views per unique users of all videos… 任务2:计算去重后每个用户观看的视频的次数...…in real time, with the Count-Min Sketch algorithm. …使用 Count-Min Sketch 算法实时计算。
  • 15. Count-Min Sketch1. Keeps track of frequency of each element in stream 记录一连串数据中每个数据出现的频次2. Returns an upper bound, but may overestimate 具有上限,有可能会过高 3. Similar to a bloom filter 与布隆过滤器类似x% accuracy with only y-kb of memory. 只需 y kb 内存,拥有 x% 的准确率。
  • 16. Task #3: Calculate the number of views per video and top 10 videos... 任务3:计算所有视频中观看次数最多的 10 个视频...…in real time, with the Sketch Map algorithm. …使用 Sketch Map 算法。
  • 17. Sketch Map1. Keeps track of frequency of each element in the stream 记录一连串数据中每个数据出现的频次2. Returns an upper bound, but may overestimate 具有上限,有可能会过高3. Similar to a bloom filter 与布隆过滤器类似 4. Includes running list of Top K heavy hitters 能实时获取当前所有元素中最频繁的 K 个x% accuracy with only y-kb of memory. 只需 y kb 内存,拥有 x% 的准确率。
  • 18. Demo
  • 19. In addition to HyperLogLog, Count-min Sketch, and Sketch map… 除了 HyperLogLog, Count-min Sketch 和 Sketch map…Twitter Algebird offers way more approximation algorithms based on monoids… Twitter Algebird 还提供了更多基于 monoids 的近似算法... BloomFilter, ExponentialMA, Moments, MinHash, QTree, TopK, just to name a few.
  • 20. This is how you would solve it for small data.Algebirdwww.traintracks.io@Traintracks ryan@traintracks.io @TraintracksTeam facebook.com/traintracks.io http://github.com/traintracks/sparkstreaming-algebird-algorithms-demo