• 1. This is how you would solve it for small data.Chief Architect | Traintracks.io 首席架构师 | Traintracks.io 大数据分析平台Ryan BraleyBig Data Algorithms with Twitter Algebird + Spark Streaming 基于Twitter Algebird 和 Spark Streaming 的大规模流式数据处理算法
• 2. You are an engineer at a video-sharing website company. 你是一个在视频分享网站工作的工程师。
• 3. Task: Calculate the number of unique users in the views of all videos… 任务: 对于所有观看了视频的用户，统计去重后的用户数...…in real time. …如何实时统计?
• 4. Small data, small problems 数据量小的时候，这个问题很容易解决Big data? 数据量非常大？SELECT COUNT(DISTINCT USER_ID) FROM VIEWS; 1 billion users. 1 PetaByte. Slow. 十亿用户，PB 级别数据，传统的解决方案将会很慢
• 5. What if I told you that we could solve this same problem with 98% accuracy with only 1.5kb of memory? 如果我告诉你只需使用 1.5 kb 的内存，就可以将答案的准确率控制在 98% 左右，你是否会像我一样感到无比的兴奋?
• 7. Approximation algorithms 近似算法1. Usually based on hashing 通常基于哈希表2. Performance guarantee and provable accuracy 良好的性能与可证明的准确性 3. Many are based on Monoids 很多基于 Monoids
• 8. MonoidsIntegers: (1 + 2) + 3 = 1 + (2 + 3) 2. Strings: ("spark" + "meetup") + "beijing" = "spark" + ("meetup" + "beijing") 3. Set union: ([“jeff”, “hsu”] u [“ryan”, “braley”]) u [“nils”, “pihl"] = [“jeff”, “hsu”] u ([“ryan”, “braley”] u [“nils”, “pihl”])
• 9. Monoids are great because they are associative and parallelizable. Monoids 是极好的因为它具有结合律和并发性。
• 10. Approximation algorithms based on monoids have low memory footprint and are inherently parallelizable. 基于Monoids 的近似算法具有低内存占用且天生支持并行化的特点。
• 11. Algebird
• 12. Calculate the number of unique users in the views of all videos… 统计 views 中去重后的用户数…in real time, with the HyperLogLog algorithm. 利用 HyperLogLog 算法来实时计算
• 13. HyperLogLog1. Similar to an approximate set 与近似集合相似2. Calculates approximation of cardinality of the set. 近似计算集合的基数3. Can calculate with predictable error bounds 可预见的误差范围98% accuracy with only 1.5kb of memory. 只需 1.5KB 的内存，98%的准确率。
• 14. Task #2: Calculate the number of views per unique users of all videos… 任务2：计算去重后每个用户观看的视频的次数...…in real time, with the Count-Min Sketch algorithm. …使用 Count-Min Sketch 算法实时计算。
• 15. Count-Min Sketch1. Keeps track of frequency of each element in stream 记录一连串数据中每个数据出现的频次2. Returns an upper bound, but may overestimate 具有上限，有可能会过高 3. Similar to a bloom filter 与布隆过滤器类似x% accuracy with only y-kb of memory. 只需 y kb 内存，拥有 x% 的准确率。
• 16. Task #3: Calculate the number of views per video and top 10 videos... 任务3：计算所有视频中观看次数最多的 10 个视频...…in real time, with the Sketch Map algorithm. …使用 Sketch Map 算法。
• 17. Sketch Map1. Keeps track of frequency of each element in the stream 记录一连串数据中每个数据出现的频次2. Returns an upper bound, but may overestimate 具有上限，有可能会过高3. Similar to a bloom filter 与布隆过滤器类似 4. Includes running list of Top K heavy hitters 能实时获取当前所有元素中最频繁的 K 个x% accuracy with only y-kb of memory. 只需 y kb 内存，拥有 x% 的准确率。
• 18. Demo
• 19. In addition to HyperLogLog, Count-min Sketch, and Sketch map… 除了 HyperLogLog, Count-min Sketch 和 Sketch map…Twitter Algebird offers way more approximation algorithms based on monoids… Twitter Algebird 还提供了更多基于 monoids 的近似算法... BloomFilter, ExponentialMA, Moments, MinHash, QTree, TopK, just to name a few.
• 20. This is how you would solve it for small data.Algebirdwww.traintracks.io@Traintracks ryan@traintracks.io @TraintracksTeam facebook.com/traintracks.io http://github.com/traintracks/sparkstreaming-algebird-algorithms-demo