9. Faster performance: Cost Based Optimizer
Table A (1000 records)
Columns: No., v_name, Card_id
No. values shown: 1, 2, ..., 9999999, 10000000
v_name values shown: kurt, mary, john, smith
Card_id values shown: 622523454095243, 622550042034568, 622544334568763, 622534878982324
Table B (1000 records)
JOIN ON A.card_id=B.card_id
Cost based optimizer: table size, intermediate result size, data skew, value distribution, selectivity
Query Plan: Map Join, Lookup Join, Hash Join, Common Join, Co-Group Join
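To make the idea concrete, here is a minimal sketch of how a cost based optimizer might weigh table size and data skew when picking between a Map Join and a Common Join. It is not Inceptor's actual optimizer: the `TableStats` fields, the cost formulas, the thresholds, and the sample statistics are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical per-table statistics an optimizer could collect."""
    name: str
    rows: int              # estimated row count
    avg_row_bytes: int     # estimated average row width
    max_key_share: float   # fraction of rows on the hottest join key (skew)

    @property
    def size_bytes(self) -> int:
        return self.rows * self.avg_row_bytes


# Illustrative constants, not taken from any real system.
MAP_JOIN_MAX_BYTES = 64 * 1024 * 1024  # largest table we allow to be broadcast
SHUFFLE_FACTOR = 2.0                   # assumed relative cost of shuffling a byte


def estimate_costs(left: TableStats, right: TableStats, workers: int) -> dict:
    """Return an estimated cost for each join strategy that is applicable."""
    small, large = sorted((left, right), key=lambda t: t.size_bytes)
    costs = {}
    if small.size_bytes <= MAP_JOIN_MAX_BYTES:
        # Map Join: copy the small table to every worker, scan the large one once.
        costs["Map Join"] = small.size_bytes * workers + large.size_bytes
    # Common Join: shuffle both sides; a skewed key overloads a single worker,
    # so the cost grows with the share of rows on the hottest key.
    skew = max(left.max_key_share, right.max_key_share)
    shuffled = (left.size_bytes + right.size_bytes) * SHUFFLE_FACTOR
    costs["Common Join"] = shuffled * (1 + skew * workers)
    return costs


def choose_join(left: TableStats, right: TableStats, workers: int = 8) -> str:
    """Pick the strategy with the lowest estimated cost."""
    costs = estimate_costs(left, right, workers)
    return min(costs, key=costs.get)


# Made-up statistics: one large table and one small table.
big = TableStats("A", rows=10_000_000, avg_row_bytes=100, max_key_share=0.01)
tiny = TableStats("B", rows=1_000, avg_row_bytes=100, max_key_share=0.01)
print(choose_join(big, tiny))
```

With these assumed numbers the small table fits under the broadcast limit, so the sketch prefers the Map Join; when both sides are large, only the Common Join remains applicable.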
11. Using SSDs as the cache layer
Use Case: SQL statement
Count: select count(ss_item_sk) from store_sales;
Find: select * from store_sales where ss_item_sk=141031;
Filter: select count(1) from store_sales where ss_customer_sk like "%634%";
Inner join: select /*+mapjoin(b)*/ count(*) from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_customer_sk = b.sr_customer_sk and a.ss_item_sk = 141031;
Dimension Stats: select ss_item_sk, count(distinct ss_customer_sk) as customers from store_sales group by ss_item_sk order by customers desc limit 10;
Implicit Join: select count(*) from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_customer_sk = b.sr_customer_sk;
Sort: select ss_item_sk, ss_sold_date_sk, count(1) as num from store_sales group by ss_item_sk, ss_sold_date_sk order by num desc limit 10;
Window Aggregation: select * from (select *, rank() over (partition by ss_sold_date_sk, ss_item_sk order by num desc) as r from (select ss_sold_date_sk, ss_item_sk, count(1) as num from store_sales group by ss_sold_date_sk, ss_item_sk) tmp) tmp2 where r=1 limit 100;
Only 20% performance degradation on SSD compared to memory.
Using SSD as the cache layer lets users process 10x more data at the same price as memory, with similar performance.
12. Interactive analysis: NYC 311 service call records
Dataset
NYC 311 service call records
10GB data size
Steps
Load dataset into memory
Connect Tableau to Inceptor
Geographic distribution of service calls (few calls in Central Park)
Time distribution (few calls on weekends)
Type distribution (most concern street lights or neighbor noise)
Central Park, New York
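The time and type distributions in the steps above come down to simple grouped counts. The following is a minimal sketch of those two aggregations in plain Python. The sample rows, the timestamp format, and the category labels are invented for illustration and do not come from the real 311 dataset or from Inceptor.

```python
from collections import Counter
from datetime import datetime

# Made-up sample rows: (timestamp, complaint type). Not real 311 data.
calls = [
    ("2014-03-03 08:15", "street light"),
    ("2014-03-04 22:40", "noise"),
    ("2014-03-05 19:05", "street light"),
    ("2014-03-08 23:30", "noise"),
    ("2014-03-06 07:50", "street light"),
]


def weekday_vs_weekend(rows):
    """Count calls that fall on weekdays versus weekends."""
    counts = Counter()
    for timestamp, _ in rows:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").weekday()
        # weekday() returns 0 for Monday through 6 for Sunday.
        counts["weekend" if day >= 5 else "weekday"] += 1
    return counts


def by_type(rows):
    """Count calls per complaint type."""
    return Counter(kind for _, kind in rows)


print(weekday_vs_weekend(calls))
print(by_type(calls).most_common(1))
```

In an interactive setup the same grouping would be pushed down to the SQL engine as a `group by` so that the visualization tool only receives the aggregated counts.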
13. Full support for the R language
R Studio
R package from Transwarp
R – SQL Interface from Transwarp
R – Spark Interface from Transwarp
Statistics Library
Machine Learning Library
Tables: Distributed Columnar Store on SSD
Spark RDD: Resilient Distributed Dataset in Memory
Files: Hadoop Distributed File System
Calls shown in the diagram: call SQL, call parallelized algorithms, call sequential algorithm for distributed dataset
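One common way to run a sequential algorithm over a distributed dataset is to apply it to each partition independently and then combine the partial results. The sketch below illustrates that general pattern for a mean. It is written in Python because the slide does not show the R interface's API, and it is not Transwarp's implementation: the function names are hypothetical and local threads stand in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(partition):
    """Sequential step: runs unchanged on one partition of the data."""
    return len(partition), sum(partition)


def distributed_mean(partitions):
    """Apply the sequential step to every partition, then merge the results."""
    # Threads here only simulate workers that each hold one partition.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_stats, partitions))
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count


# Three made-up partitions of one logical dataset.
print(distributed_mean([[1, 2, 3], [4, 5], [6]]))
```

The pattern works directly for statistics that can be merged from partial results, such as counts, sums, and means; algorithms that need a global view of the data require a purpose-built parallel version instead.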
14. Business district analysis with R
Dataset
PoS transaction records from the past three months for all shops in Shanghai