Kylin OLAP Engine

互联⺴新技术在线教育领航者 LOGO Kylin OLAP Engine 韩卿 | Luke Han | 互联⺴新技术在线教育领航者 Agenda •  What’s Kylin •  Tech Highlights •  Performance •  Open Source •  Q & A 互联⺴新技术在线教育领航者 About eBay Data 互联⺴新技术在线教育领航者 Big Data Era n  More and more data becoming available on Hadoop n  Limitations in existing Business Intelligence (BI) Tools n  Limited support for Hadoop n  Data size growing exponentially n  High latency of interactive queries n  Challenges to adopt Hadoop as interactive analysis system n  Majority of analyst groups are SQL savvy n  No mature SQL interface on Hadoop n  Full OLAP capability on Hadoop ecosystem not ready yet 互联⺴新技术在线教育领航者 Business Needs for Big Data Analysis n  Sub-second query latency on billions of rows n  ANSI SQL for both analysts and engineers n  Full OLAP capability to offer advanced functionality n  Seamless Integration with BI Tools n  High concurrency – thousands of end users n  Support of high cardinality and high dimensions n  Distributed and scale out architecture for large data volume 互联⺴新技术在线教育领航者 n  Commercial Solutions n  Open Source Options Options Considered No one solution that could match our business needs 互联⺴新技术在线教育领航者 7 Build an engine from scratch 互联⺴新技术在线教育领航者 Extreme OLAP Engine for Big Data Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets What’s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form 互联⺴新技术在线教育领航者 Transaction Operation Strategy High Level Aggregation • Very High Level, e.g GMV by site by vertical by weeks Analysis Query • Middle level, e.g GMV by site by vertical, by category (level x) past 12 weeks Drill Down to Detail • Detail Level (Summary Table) Low Level Aggregation • First Level Aggragation Transaction Level • Transaction Data 9 Analytics Query Taxonomy 80+% Analytic s Kylin is designed to accelerate analytics queries performance on Hadoop 互联⺴新技术在线教育领航者 10 What’s OLAP Cube? 10 time, item time, item, location time, item, location, supplier time item location supplier time, location Time, supplier item, location item, supplier location, supplier time, item, supplier time, location, supplier item, location, supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid •  Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1.  (9/15, milk, Urbana, Dairy_land) - 2.  (9/15, milk, Urbana, *) - 3.  (*, milk, Urbana, *) - 4.  (*, milk, Chicago, *) - 5.  (*, milk, *, *) - •  Cuboid = one combination of dimensions •  Cube = all combination of dimensions (all cuboids) 互联⺴新技术在线教育领航者 11 From Relational to Key-Value 互联⺴新技术在线教育领航者 12 Kylin Architecture Overview 12 Cube Build Engine SQ L Low Latency - Seconds Mid Latency - Minutes Roung 3rd Party App (Web App, Mobile…) Metadata SQL-Based Tool (BI Tools: Tableau…) Query Engine Hadoop Hive REST API JDBC/ODBC Ø  Online Analysis Data Flow Ø  Offline Data Flow Ø  Clients/Users interactive with Kylin via SQL Ø  OLAP Cube is transparent to users Star Schema Data Key Value Data Data Cube OLAP Cube (HBase) SQ L REST Server 互联⺴新技术在线教育领航者 13 n  Pre-built cube n  No runtime Hive table scan and MapReduce job n  Leveraging distributed computing infrastructure n  Compression and encoding n  Put “Computing” to “Data” n  Cached Why is Kylin fast? Agenda n  What’s Kylin n  Tech Highlights n  Performance n  Open Source n  Q & A 互联⺴新技术在线教育领航者 15 Data Modeling Cube: … Fact Table: … Dimensions: … Measures: … Storage(HBase): … Fact Dim Dim Dim Source Star Schema row A row B row C Column Family Val 1 Val 2 Val 3 Row Key Column Target HBase Storage Mapping Cube Metadata End User Cube Modeler Admin 互联⺴新技术在线教育领航者 16 Cube Metadata •  Dimension –  Normal –  Mandatory –  Hierarchy –  Derived •  Measure –  Sum –  Count –  Max –  Min –  Average –  Distinct Count (based on HyperLogLog) 互联⺴新技术在线教育领航者 17 Cube Build Job Flow 互联⺴新技术在线教育领航者 18 Query Engine – Calcite (Optiq) n  Dynamic data management framework. n  Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others. n 互联⺴新技术在线教育领航者 19 Calcite Plugins •  Metadata SPI –  Provide table schema from kylin metadata •  Optimize Rule –  Translate the logic operator into kylin operator •  Relational Operator –  Find right cube –  Translate SQL into storage engine api call –  Generate physical execute plan by linq4j java implementation •  Result Enumerator –  Translate storage engine result into java implementation result. •  SQL Function –  Add HyperLogLog for distinct count –  Implement date time related functions (i.e. Quarter) 互联⺴新技术在线教育领航者 20 Query Engine – Kylin Explain Plan SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = ’New' GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name,test_sites.site_name OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8]) OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ’New'))]) OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]]) 互联⺴新技术在线教育领航者 21 n  Provide cube query for query engine n  Common iterator interface for storage engine n  Isolate query engine from underline storage n  Translate cube query into HBase table scan n  Groups à Cuboid ID n  Filters -> Scan Range (Row Key) n  Aggregations -> Measure Columns (Row Values) n  Scan HBase table and translate HBase result into cube result n  HBase Result (key + value) -> Cube Result (dimensions + measures) How To Query Cube? – Storage Engine 互联⺴新技术在线教育领航者 22 n  “Curse of dimensionality”: N dimension cube has 2N cuboid n  Full Cube vs. Partial Cube n  Hugh data volume n  Dictionary Encoding n  Incremental Building n  Slow Table Scan – TopN Query on High Cardinality Dimension n  Bitmap inverted index n  Time range partition n  In-memory parallel scan: block cache + endpoint coprocessor How To Optimize Cube? - Challenges 互联⺴新技术在线教育领航者 23 n  Full Cube n  Pre-aggregate all dimension combinations n  “Curse of dimensionality”: N dimension cube has 2N cuboid. n  Partial Cube n  To avoid dimension explosion, we divide the dimensions into different aggregation groups n  2N+M+L à 2N + 2M + 2L n  For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands n  230 à 210 + 210 + 210 n  Tradeoff between online aggregation and offline pre-aggregation How To Optimize Cube? – Full Cube vs. Partial Cube 互联⺴新技术在线教育领航者 24 How To Optimize Cube? – Partial Cube 互联⺴新技术在线教育领航者 25 n  Data cube has lost of duplicated dimension values n  Dictionary maps dimension values into IDs that will reduce the memory and storage footprint. n  Dictionary is based on Trie How To Optimize Cube? – Dictionary Encoding 互联⺴新技术在线教育领航者 26 How To Optimize Cube? – Incremental Building Agenda n  What’s Kylin n  Tech Highlights n  Performance n  Open Source n  Q & A 互联⺴新技术在线教育领航者 28 Kylin vs. Hive # Query Type Return Dataset Query On Kylin (s) Query On Hive (s) Comments 1 High Level Aggregation 4 0.129 157.437 1,217 times 2 Analysis Query 22,669 1.615 109.206 68 times 3 Drill Down to Detail 325,029 12.058 113.123 9 times 4 Drill Down to Detail 524,780 22.42 6383.21 278 times 5 Data Dump 972,002 49.054 N/A 0 50 100 150 200 SQL #1 SQL #2 SQL #3 Hive Kylin High Level Aggregati on Analysis Query Drill Down to Detail Low Level Aggregati on Transactio n Level 互联⺴新技术在线教育领航者 29 Performance -- Concurrency Linear scale out with more nodes 互联⺴新技术在线教育领航者 30 Performance - Query Latency 90%tile queries <5s Agenda n  What’s Kylin n  Tech Highlights n  Performance n  Open Source n  Q & A 互联⺴新技术在线教育领航者 32 n  Kylin Site: n n  Github Repo: n n  Google Group: n  Kylin OLAP Open Source 互联⺴新技术在线教育领航者 33 n  Kylin Core n  Fundamental framework of Kylin OLAP Engine n  Extension n  Plugins to support for additional functions and features n  Integration n  Lifecycle Management Support to integrate with other applications n  Interface n  Allows for third party users to build more features via user-interface atop Kylin core n  Driver n  ODBC and JDBC Drivers Kylin Ecosystem 互联⺴新技术在线教育领航者 34 What’s Next? •  v1.1 (Current version under development) –  InvertedIndex –  Remote JDBC Driver –  Filter on Coprocessor •  v1.2 –  Job Schedule and Priority –  Capacity Management –  Automation •  v2.0 –  HOLAP (Hybrid OLAP to combine MOLAP, ROLAP and MPP) –  In-Memory Analysis –  Mobile –  Spark Support –  More… 互联⺴新技术在线教育领航者 35 Thanks 互联⺴新技术在线教育领航者 LOGO 小象学院:互联网新技术在线教育领航者 联系我们: –  新浪微博:ChinaHadoop –  微信公号:ChinaHadoop




需要 7 金币 [ 分享pdf获得金币 ] 1 人已下载





下载需要 7 金币 [金币充值 ]
亲,您也可以通过 分享原创pdf 来获得金币奖励!