Introduction to Apache Kylin

Yang Li (李扬), liyang@apache.org
Kyligence: a leading intelligent big data analytics platform, http://kyligence.io

About Kyligence
• Kyligence is the first startup in China founded by the core contributor team of an Apache top-level project.
• It is dedicated to advancing the development and evolution of the Apache Kylin open source project, providing big data analytics products and services based on Apache Kylin, expanding the global user community, and building a richer ecosystem.

Why Apache Kylin
[Figure: "Happiness" plotted against "Latency", with a mark at 10s]
What have we tried?
Kylin

About Apache Kylin
Extreme OLAP Engine for Big Data
Apache Kylin™ is an open source distributed analytics engine. It offers standard SQL queries and multidimensional analysis (OLAP) over extremely large datasets on Hadoop and other large distributed data platforms, and delivers sub-second interactive analysis.
Official website: http://kylin.apache.org

Apache Kylin overview
OLAP / data mart
• Built to serve data analysis
• Best OLAP on Hadoop
• Extremely fast queries
• Standard SQL interface
• Seamless integration with existing BI tools
• Extensible architecture

Apache Kylin community history
• September 2013: project started
• October 2014: open sourced and entered the Apache Incubator
• September 2015: InfoWorld Bossie Award, Best Open Source Big Data Tools
• October 2015: Apache Kylin v1.0 released
• November 2015: graduated to an Apache top-level project, the first top-level project contributed entirely by a team from China
• March 2016: Kyligence Inc, a startup formed by Kylin core contributors, was founded

Apache Kylin users worldwide
[Figure: user logos]

Use cases

eBay
• DSS Nous: NRT SEO Dashboard
  - Near real-time and historical data together
• eCG: GA Deep Dive
  - Pilot case
  - Extreme challenge for storage and build time
  - Enabled Tableau
• eCS: Tracking Report
  - Largest cube: 100+ billion raw records in one cube

JD.com
• OLAP analysis: the largest single cube has 16 dimensions, up to 10 billion rows, and up to 400 GB of storage. 95% of queries respond within 15 seconds.
• Raw detail record queries (contributed back to the community by JD.com): the largest single cube has 8 dimensions, up to 400 million rows, and up to 800 GB of storage. The 30 cubes occupy about 4 TB in total. At about 50 QPS the average response time over all queries is 200 ms; at about 200 QPS the average response time stays within 1 s.

Baidu Maps
[Figure slide]

NetEase
[Figure slide]

Architecture and technical implementation

OLAP Cube. Theoretical basis: trading space for time
[Figure: cuboid lattice over the dimensions time, item, location, supplier]
  0-D (apex) cuboid: (no dimensions)
  1-D cuboids: time | item | location | supplier
  2-D cuboids: time, item | time, location | time, supplier | item, location | item, supplier | location, supplier
  3-D cuboids: time, item, location | time, item, supplier | time, location, supplier | item, location, supplier
  4-D (base) cuboid: time, item, location, supplier
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  1. (9/15, milk, Urbana, Dairy_land)
  2. (9/15, milk, Urbana, *)
  3. (*, milk, Urbana, *)
  4. (*, milk, Chicago, *)
  5. (*, milk, *, *)
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)

System architecture
[Figure labels: MapReduce/Spark; Kylin; BI Tools, Web App…; ANSI SQL]
[Screenshot slides: Define Data Model; Manage Jobs; Explore the Data; Interact with BI Tools (Tableau); Integration with Excel/PowerBI]

Full Cube vs. Partial Cube
• Full Cube
  - Pre-aggregates all dimension combinations
  - "Curse of dimensionality": an N-dimension cube has 2^N cuboids
• Partial Cube
  - To avoid dimension explosion, the dimensions are divided into different aggregation groups
  - 2^(N+M+L) → 2^N + 2^M + 2^L
  - For a cube with 30 dimensions divided into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
  - 2^30 → 2^10 + 2^10 + 2^10
  - A tradeoff between online aggregation and offline pre-aggregation

Partial Cube
[Figure slide]

Performance and concurrency
[Figure slide]

Apache Kylin 1.5
[Figure: architecture diagram. Labels: Cube Builder (MapReduce…); SQL; Low Latency - Seconds; Routing; 3rd Party App (Web App, Mobile…); Metadata; SQL-Based Tool (BI Tools: Tableau…); Query Engine; Hadoop; Hive; REST API; JDBC/ODBC; REST Server; Star Schema Data; Key Value Data; Data Cube; OLAP Cubes (HBase); Data Source Abstraction; Engine Abstraction; Storage Abstraction]
• Legend: online analysis data flow; offline data flow
• Clients and users interact with Kylin via SQL
• The OLAP cube is transparent to users

Extensible architecture
[Figure labels: MR Engine (IN, OUT); Hive Source; HBase Storage; Cube Metadata; SourceFactory; StorageFactory; EngineFactory]

Extensible architecture (adapters)
[Figure labels: MR Engine; Hive Adapter (load data, adapt to IN); HBase Adapter (save cube, adapt to OUT); Hive Source; HBase Storage]

Development modules
• Engine
  - MR V1
  - MR V2
  - Spark (early)
  - Streaming (experimental)
• Source
  - Hive
  - Kafka
  - Spark SQL & DataFrames
• Storage
  - HBase
  - ? Kudu
  - ? Cassandra

Extensibility and flexibility
• Freedom
  - Zoo break: not bound to Hadoop any more
  - Free to move to a better engine or storage
• Extensibility
  - Accept any input, e.g. Kafka
  - Embrace next-generation distributed platforms, e.g. Spark
• Flexibility
  - Choose a different engine for each data set

Layered cubing algorithm
[Figure: Full Data → 4-D Cuboid (A,B,C,D) → 3-D Cuboids (A,B,C | A,B,D | A,C,D | B,C,D) → 2-D Cuboid → 1-D Cuboid → 0-D Cuboid, with one MR round per layer]
• Pros
  - Simple implementation that relies on the MR shuffle to merge sort and then aggregate
  - Low memory requirement
• Cons
  - Aggregation happens on the reducer side
  - Mappers output raw data, so the shuffle is huge
  - Overhead of multiple MR rounds
  - Shuffle can be 100x the cube size, causing heavy I/O pressure

Fast cubing algorithm
[Figure: several data splits, each processed by a mapper, then Merge Sort (Shuffle), a reducer, and the final cube]
• Pros
  - In-memory cubing algorithm that can be reused by Streaming, Spark, etc.
  - Mapper-side aggregation
  - Less shuffling, given the right data split
  - One MR round
• Cons
  - Code complexity
  - High mapper CPU and memory consumption

MR V2 engine
• If data splits are unique, fast cubing wins
• If data splits are common, layered cubing wins
• The new cube engine chooses the right algorithm based on data sampling
• Overall build time is 1.5x faster, summed over the results of 500 jobs

Parallel scan
• Slow queries are 5-10x faster
• The new HBase storage enables partitioning of cuboids that are big enough
• Overall query time is 2x faster than before, summed over the results of 10,000+ queries
[Figure: queries against Cuboid A, Cuboid B, and Cuboid C; the large cuboids are split into partitions (A1, A2, A3; B1, B2) spread over Server 1, Server 2, and Server 3]

Near real-time cube computation
• Micro cubes built every few minutes
• Kafka source
• In-memory cubing
• Auto merge

User-defined functions
• HyperLogLog Count Distinct
• TopN
• BitMap Precise Count Distinct
  - from Sun, Yerui (meituan.com)
• Raw Records
  - from Wang, Xiaoyu (jd.com)
• Domain-specific aggregations now become easy
  - aggregate user events to detect time series or access patterns
  - draw a sketch of certain user groups
  - pre-calculate clusters of data points
  - histogram…

ODBC enhancements
• Works with Tableau 9.1
• Works with MS Excel
• Works with MS Power BI

Apache Kylin 1.5
• New in Apache Kylin 1.5
  - Pluggable architecture
  - New MR cube engine with fast cubing (1.5x faster)
  - New HBase storage with parallel scan (2x faster)
  - Near real-time analysis (experimental)
  - User-defined aggregations
  - Excel / PowerBI / Zeppelin integration

Q & A
• More information
  - Websites: http://kyligence.io, http://kylin.apache.org
  - WeChat official account: ApacheKylin
  - Twitter: @ApacheKylin
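
Appendix sketch: cuboid counts. The arithmetic on the cuboid lattice and "Full Cube vs. Partial Cube" slides can be checked with a few lines of Python. This is only an illustration of the counting argument, assuming the simplified formula from the slides (each aggregation group is counted as its own small cube, and any overlap between groups is ignored). The function names are hypothetical and are not part of Kylin's code or API.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Group every subset of `dimensions` by its size (0-D apex up to base)."""
    return {
        k: [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }


def full_cube_size(n_dims):
    """A full cube pre-aggregates every dimension combination: 2^N cuboids."""
    return 2 ** n_dims


def partial_cube_size(group_sizes):
    """Slide formula for aggregation groups: 2^N + 2^M + 2^L + ..."""
    return sum(2 ** g for g in group_sizes)


# The four-dimension lattice from the slides.
lattice = enumerate_cuboids(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")

# 30 dimensions as one full cube versus three groups of 10 dimensions.
print(full_cube_size(30))               # 1073741824
print(partial_cube_size([10, 10, 10]))  # 3072
```

The lattice has 1 + 4 + 6 + 4 + 1 = 16 = 2^4 cuboids, and splitting 30 dimensions into three groups reduces the count from 2^30 to 3 × 2^10. The price is that a query spanning dimensions from different groups can no longer be answered from a single pre-aggregated cuboid, which is the online versus offline aggregation tradeoff the slide mentions.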



