

Approximate Queries on Very Large Data
Sameer Agarwal, UC Berkeley
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, and Ion Stoica

Our Goal
Support interactive SQL-like aggregate queries over massive sets of data:

blinkdb> SELECT AVG(jobtime) FROM very_big_log
(AVG, COUNT, SUM, STDEV, PERCENTILE, etc.)

blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'
(FILTERs and GROUP BY clauses)

blinkdb> SELECT AVG(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
(JOINs, nested queries, etc.)

blinkdb> SELECT my_function(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
(ML primitives and user-defined functions)

For 100 TB of data on 1,000 machines, a full scan takes roughly 1/2-1 hour from hard disks and 1-5 minutes from memory. Can we answer in about 1 second?

Query Execution on Samples
Consider a table of per-session buffering ratios:

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

What is the average buffering ratio in the table? The exact answer is 0.2325.

A uniform sample at rate 1/4:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
3   Berkeley  0.25        1/4
4   NYC       0.19        1/4

yields the estimate 0.19, or, with an error bar, 0.19 +/- 0.05.

A larger uniform sample at rate 1/2:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
4   NYC       0.19        1/2
6   Berkeley  0.09        1/2
7   NYC       0.18        1/2
10  Berkeley  0.49        1/2

yields 0.22 +/- 0.02: more data, tighter error bar.

Speed/Accuracy Trade-off
Error falls as execution time (i.e., sample size) grows, from interactive queries at ~2 seconds up to ~30 minutes to execute on the entire dataset. Beyond a point, shrinking the sampling error buys nothing: it disappears into the pre-existing noise in the data.

Sampling vs. No Sampling
Query response time vs. fraction of the full data (error bars in parentheses):

Fraction of full data   Response Time (s)   Error
1                       1020                -
10^-1                   103                 (0.02%)
10^-2                   18                  (0.07%)
10^-3                   13                  (1.1%)
10^-4                   10                  (3.4%)
10^-5                   8                   (11%)

Even sampling a tenth of the data gives a ~10x speedup, since response time is dominated by I/O.

What is BlinkDB?
A framework built on Shark and Spark that:
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime
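The uniform-sample estimates above (0.19 +/- 0.05, 0.22 +/- 0.02) can be reproduced in a few lines of Python. This is a minimal sketch, not BlinkDB's implementation: `estimate_avg` is a hypothetical helper, and the error bar is a rough 95% confidence half-width (two standard errors) from the Central Limit Theorem.

```python
import math
import random

# Full table of buffering ratios from the example slide.
ratios = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
          0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

def estimate_avg(population, fraction, seed=0):
    """Estimate AVG from a uniform sample, with a CLT-based error bar."""
    rng = random.Random(seed)
    # Uniform sampling: keep each row independently with the given probability.
    sample = [x for x in population if rng.random() < fraction]
    n = len(sample)
    if n < 2:
        raise ValueError("sample too small to estimate an error bar")
    mean = sum(sample) / n
    # Sample variance and a ~95% confidence half-width (2 standard errors).
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = 2 * math.sqrt(var / n)
    return mean, half_width

mean, err = estimate_avg(ratios, 0.5)
print(f"AVG ~= {mean:.4f} +/- {err:.4f}")
```

With `fraction=1.0` the "sample" is the whole table and the estimate collapses to the exact answer, 0.2325; smaller fractions trade accuracy for less data scanned, which is the speed/accuracy trade-off above.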
Uniform Samples
Building a uniform sample U of the table:
1. FILTER rand() < 1/3
2. Add per-row weights
3. (Optional) ORDER BY rand()

ID  City      Data  Weight
2   NYC       0.13  1/3
3   Berkeley  0.25  1/3
6   Berkeley  0.09  1/3
11  NYC       0.19  1/3

This doesn't change Shark RDD semantics.

Stratified Samples
A uniform sample can under-represent or entirely miss rare groups, so BlinkDB also builds stratified samples that cap the rows taken per group. Building a stratified sample S over City:
1. SPLIT the table by the stratification column
2. GROUP to count rows per value (NYC: 7, Berkeley: 5) and derive a per-group sampling ratio for a fixed cap of 2 rows per group (NYC: 2/7, Berkeley: 2/5)
3. JOIN the ratios back and sample each group at its own rate

ID  City      Data  Weight
2   NYC       0.13  2/7
8   NYC       0.15  2/7
6   Berkeley  0.09  2/5
12  Berkeley  0.10  2/5

This, too, doesn't change Shark RDD semantics.

Error Estimation
Closed-form aggregate functions:
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE, and STDEV
Run the aggregate once on the sample and attach a closed-form error bar: A +/- ε.

Generalized aggregate functions:
- Statistical bootstrap
- Applicable to complex and nested queries, UDFs, joins, etc.
Resample the sample with replacement many times (A1, A2, ..., A100), run the aggregate on each resample, and derive the error bar from the spread of the results: B +/- ε.
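The bootstrap procedure just described can be sketched as follows. This is an illustration, not BlinkDB's code: `bootstrap_error` is a hypothetical helper, the sample values are the ones from the earlier example, and per-row weights are omitted for simplicity.

```python
import random
import statistics

# A small uniform sample of buffering ratios (weights omitted in this sketch).
sample = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]

def bootstrap_error(sample, aggregate, trials=100, seed=0):
    """Estimate the error of `aggregate` on `sample` via the bootstrap:
    resample the sample with replacement `trials` times, apply the
    aggregate to each resample, and report the spread of the results."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [aggregate([rng.choice(sample) for _ in range(n)])
                 for _ in range(trials)]
    point = aggregate(sample)
    # The standard deviation of the bootstrap estimates approximates
    # the standard error of the aggregate.
    return point, statistics.stdev(estimates)

point, err = bootstrap_error(sample, lambda xs: sum(xs) / len(xs))
print(f"{point:.4f} +/- {err:.4f}")
```

The key property is that nothing about `aggregate` is assumed beyond being a function of the sample, which is why the bootstrap extends error bars to UDFs, joins, and nested queries where no closed form exists.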
BlinkDB also verifies the correctness of the error bars that it returns at runtime, using Kleiner's diagnostics [KDD '13]: plotting error against sample size shows that more data yields higher accuracy, and the diagnostic reaches 97% accuracy with as few as 300 data points.

Single Pass Execution
An approximate query runs over a sample A:

State  Age  Metric  Weight
CA     20   1971    1/4
CA     22   2819    1/4
MA     22   3819    1/4
MA     30   3091    1/4

- A resampling operator R is injected into the query plan.
- Sample "pushdown" ensures that only the columns the query actually needs (here, Metric and Weight) reach the resampling operator:

Metric  Weight
1971    1/4
3819    1/4

- The resampling operator generates the bootstrap resamples A1, ..., An of the sample.
- Materializing each resample with replacement would be expensive. Instead, BlinkDB leverages Poissonized resampling to generate samples with replacement: for each row and each resample, draw a count from a Poisson(1) distribution; the row appears that many times in that resample.

Metric  Weight  S1
1971    1/4     2
3819    1/4     1
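The Poissonized resampling step can be sketched as below. This is a toy illustration under stated assumptions, not BlinkDB's implementation: `poisson_draw` and `attach_bootstrap_weights` are hypothetical names, and the Poisson draw uses Knuth's inversion method since Python's standard library has no Poisson sampler.

```python
import math
import random

def poisson_draw(rng, lam=1.0):
    """Draw from Poisson(lam) by inversion (Knuth's method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def attach_bootstrap_weights(rows, num_resamples=100, seed=0):
    """Single pass over the sample: for each row, draw one Poisson(1)
    count per resample. Row i then appears count-many times in resample j,
    approximating resampling with replacement without ever materializing
    the resamples."""
    rng = random.Random(seed)
    return [(row, [poisson_draw(rng) for _ in range(num_resamples)])
            for row in rows]

weighted = attach_bootstrap_weights([0.13, 0.25, 0.19, 0.09], num_resamples=5)
for row, counts in weighted:
    print(row, counts)
```

Because each count is drawn independently per row, the weights for all resamples are produced in one scan and the work parallelizes trivially across partitions, which matches the "single pass, embarrassingly parallel" claim below.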
All resamples are constructed in a single pass:

Metric  Weight  S1  ...  Sk
1971    1/4     2   ...  1
3819    1/4     1   ...  0

Per row, BlinkDB stores the sample weight (S), the bootstrap weights (S1 ... S100), and the diagnostics weights (Da1 ... Da100, Db1 ... Db100, Dc1 ... Dc100). The additional overhead is about 200 bytes/row, and the computation is embarrassingly parallel.

(Charts: query response time with single-pass execution, error-estimation overhead, and diagnostics overhead.)

BlinkDB Architecture
- Storage: Hadoop storage (e.g., HDFS, HBase, Presto) plus a metastore
- Execution: Hadoop/Spark/Presto
- Query layer: SQL parser, query optimizer, physical plan, SerDes, UDFs, execution driver
- Interfaces: command-line shell, Thrift/JDBC

Getting Started
1. Alpha 0.1.1 released and available at http://blinkdb.org
2. Allows you to create random and stratified samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL
4. Compatible with Apache Hive, AMP Lab's Shark, and Facebook's Presto (storage, SerDes, UDFs, types, metadata)

Feature Roadmap
1. Support a majority of Hive aggregates and user-defined functions
2. Runtime correctness tests
3. Integrating BlinkDB with Shark and Facebook's Presto as an experimental feature
4. Automatic sample management
5. Streaming support

Summary
1. Approximate querying is an important means of achieving interactivity when processing large datasets.
2. BlinkDB:
- returns approximate answers with error bars by executing queries on small samples of data
- supports existing Hive/Shark/Presto queries
3. For more information, please see our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2014 (http://bit.ly/blinkdb-2) papers.

Thanks!