- 1. Introduction to Pig
roachxiang 2013-07-29
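- 2. Agenda
Introduction to Pig
Installation and tool usage
Pig basic concepts
Common Pig operations
Pig built-in functions
Protobuf (PB) usage at 广点通 (Guangdiantong)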
- 4. What is Pig?
A tool for analyzing massive datasets, running on the Hadoop platform
Consists of the Pig Latin language and the Pig Engine
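- 5. Where the name "Pig" comes from
Pigs eat anything: Pig handles data whether or not it is structured
Pigs are docile: the data processing flow is easy to control
Pigs live anywhere: Pig can process data wherever it is stored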
- 6. Status of the Pig open-source project
Current version: pig-0.11.1
More than 50% of Yahoo's production jobs currently run on Pig (10k+ per day)
Twitter and LinkedIn also use it
- 8. Installing Pig
Download Pig
Unpack the Pig archive (it bundles Hadoop 1.0.0)
Add Pig's bin directory to PATH
pig --help
- 9. Pig run modes
1. Interactive mode (Grunt): pig or pig -x local
2. Local mode, batch: pig -x local xx.pig
3. MapReduce mode, batch: pig -x mapreduce xx.pig
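As a minimal sketch of what a batch script such as xx.pig might contain, the following filters a small file. The file name 'input.txt', its two tab-separated columns, and the output directory 'output' are assumptions made for illustration only.

```pig
-- xx.pig: a minimal script for trying the batch modes.
-- Assumes a hypothetical tab-separated file 'input.txt' with two columns.
records = LOAD 'input.txt' AS (name:chararray, score:int);
-- Keep only the rows whose score is above 60.
high    = FILTER records BY score > 60;
-- 'output' is a hypothetical directory; it must not exist before the run.
STORE high INTO 'output';
```

With -x local the paths are resolved on the local file system; with -x mapreduce they are resolved on HDFS.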
- 11. Pig basic concepts
relation: a bag
bag:a collection of tuples
tuple:an ordered set of fields
field:a piece of data
- 12. Pig basic concepts: example
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
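To connect the four concepts to the grouped output, DESCRIBE can be used to inspect how the fields nest. This is a sketch that assumes the same 'data' file; the alias Y is introduced here for illustration.

```pig
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DESCRIBE A;   -- prints the schema of relation A: three int fields

X = GROUP A BY f1;
DESCRIBE X;   -- two fields: 'group' (the key) and 'A' (a bag of the matching tuples)

-- Reach into the nested bag: project one field from every tuple in it.
Y = FOREACH X GENERATE group, A.f2;
DUMP Y;
```

Each row of X is a tuple whose second field is itself a bag, which is why GROUP output prints with braces inside parentheses.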
- 13. Data types supported by Pig
int, long, float, double, boolean
chararray (string)
bytearray (blob)
datetime
tuple
bag
map
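The complex types can be declared in a schema alongside the scalar ones. The sketch below assumes a hypothetical file 'people' whose columns are written in Pig's text notation for tuples, bags and maps; all field names are invented for the example.

```pig
-- Hypothetical input, tab-separated, one record per line, for example:
--   alice    (3,4)    {(1),(2)}    [city#Shenzhen]
P = LOAD 'people' AS (
        name:chararray,
        pos:tuple(x:int, y:int),       -- tuple: ordered fields
        tags:bag{t:tuple(id:int)},     -- bag: a collection of tuples
        info:map[]                     -- map: chararray keys, untyped values
    );
-- Dereference a tuple field with '.', and a map value with '#'.
Q = FOREACH P GENERATE name, pos.x, info#'city';
DUMP Q;
```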
- 15. LOAD
LOAD 'data' [USING function] [AS schema];
'data': a file or directory
USING: keyword; if omitted, PigStorage is used
function: the load function used to read the data
AS: keyword
schema: user-defined schema
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
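A short sketch of the two optional clauses in use. The file name 'data.csv' and its comma-separated layout are assumptions for illustration.

```pig
-- Assumes a hypothetical comma-separated file 'data.csv' with three integer columns.
-- USING selects the load function; PigStorage takes the field delimiter as its argument.
A = LOAD 'data.csv' USING PigStorage(',') AS (f1:int, f2:int, f3:int);

-- Without AS the fields have no names and default to bytearray,
-- so they are referenced by position ($0 is the first field).
B = LOAD 'data.csv' USING PigStorage(',');
C = FOREACH B GENERATE $0, $2;
```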
- 16. STORE
STORE alias INTO 'directory' [USING function];
alias: the name of a relation
INTO: keyword
'directory': the output directory
USING: keyword; if omitted, PigStorage is used
function: the store function
STORE A INTO 'myoutput' USING PigStorage();
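As a sketch, the same statement with an explicit delimiter writes comma-separated text instead of the tab-separated default. The directory name 'myoutput_csv' is hypothetical.

```pig
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
-- 'myoutput_csv' is a hypothetical output directory; the job fails if it already exists.
STORE A INTO 'myoutput_csv' USING PigStorage(',');
```

The result is a directory of part files rather than a single file.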
- 17. Common load/store functions
PigStorage([field_delimiter], ['options']): handles text
BinStorage()
JsonLoader/JsonStorage
- 18. FOREACH (1)
alias = FOREACH { block | nested_block };
X = FOREACH A GENERATE a1, a2;
X = FOREACH C GENERATE group, SUM (A.a1);
X = FOREACH C GENERATE group, FLATTEN(A);
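The three statements above use relations that the slide does not define. A self-contained sketch, assuming a file 'data' with three integer columns and taking C to be A grouped by a1:

```pig
A  = LOAD 'data' AS (a1:int, a2:int, a3:int);

-- Projection, plus a computed column named with AS.
X1 = FOREACH A GENERATE a1, a2, a1 + a2 AS total;

C  = GROUP A BY a1;
-- One output row per group: the key and the sum of a2 within that group.
X2 = FOREACH C GENERATE group, SUM(A.a2);
-- FLATTEN un-nests the bag, producing one row per original tuple again.
X3 = FOREACH C GENERATE group, FLATTEN(A);
```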
- 19. FOREACH (2)
a = load '1.txt' as (a0, a1:chararray, a2:chararray);
b = group a by a0;
c = foreach b {
c0 = foreach a generate TOMAP(a1,a2);
generate c0;
}
dump c;
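The nested block can also hold other relational operators such as ORDER and LIMIT. Here is a sketch that keeps the two largest values per key; the schema and the alias names are assumptions, not taken from the slide.

```pig
-- Assumes '1.txt' holds a key column and an integer column (hypothetical layout).
a = LOAD '1.txt' AS (a0:chararray, a1:int);
b = GROUP a BY a0;
c = FOREACH b {
    sorted = ORDER a BY a1 DESC;     -- sort the tuples inside each group's bag
    top2   = LIMIT sorted 2;         -- keep at most two of them
    GENERATE group, FLATTEN(top2);   -- GENERATE must be the last statement in the block
};
DUMP c;
```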
- 20. GROUP
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age;
DESCRIBE B;
B: {group: int, A: {name: chararray,age: int,gpa: float}}
C = FOREACH B GENERATE group, COUNT(A);
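Several aggregates can be computed per group in one FOREACH, and GROUP ... ALL covers whole-relation totals. This sketch assumes the same 'student' file; the output field names are illustrative.

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

-- Per-age statistics: rename 'group' and name each aggregate with AS.
B = GROUP A BY age;
C = FOREACH B GENERATE group AS age, COUNT(A) AS n, AVG(A.gpa) AS avg_gpa;

-- GROUP ... ALL puts every tuple into a single group.
D = GROUP A ALL;
E = FOREACH D GENERATE COUNT(A) AS total, MAX(A.gpa) AS best_gpa;
```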
- 21. COGROUP
A = LOAD 'data1' AS (owner:chararray, pet:chararray);
B = LOAD 'data2' AS (friend1:chararray, friend2:chararray);
X = COGROUP A BY owner, B BY friend2;
DESCRIBE X;
X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}}
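Because COGROUP keeps one bag per input, a key that appears in only one input gets an empty bag for the other. The following sketch uses that to find owners nobody lists as friend2; the alias names are assumptions.

```pig
A = LOAD 'data1' AS (owner:chararray, pet:chararray);
B = LOAD 'data2' AS (friend1:chararray, friend2:chararray);
X = COGROUP A BY owner, B BY friend2;

-- Keys present in A but absent from B: non-empty bag A, empty bag B.
only_a = FILTER X BY NOT IsEmpty(A) AND IsEmpty(B);
Y = FOREACH only_a GENERATE group AS owner, COUNT(A) AS pets;
```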
- 22. JOIN
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
B = LOAD 'data2' AS (b1:int,b2:int);
X = JOIN A BY a1, B BY b1;
C = JOIN A by $0 LEFT OUTER, B BY $0;
C = JOIN A BY $0 FULL, B BY $0;
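After a join, each field carries its source relation as a prefix, written with '::'. This sketch assumes the same two files; the aliases Y, L and M are illustrative.

```pig
A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
B = LOAD 'data2' AS (b1:int, b2:int);

-- Inner join: X has the fields A::a1, A::a2, A::a3, B::b1, B::b2.
X = JOIN A BY a1, B BY b1;
Y = FOREACH X GENERATE A::a1, A::a2, B::b2;

-- Left outer join: rows of A without a match get nulls for B's fields,
-- so filtering on a null key yields the unmatched rows of A.
L = JOIN A BY a1 LEFT OUTER, B BY b1;
M = FILTER L BY B::b1 IS NULL;
```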
- 23. UNION
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = LOAD 'data' AS (b1:int,b2:int);
X = UNION A, B;
- 24. DISTINCT
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
X = DISTINCT A;
B = GROUP A BY a1;
C = FOREACH B { D = DISTINCT A; GENERATE group, COUNT(D); };
- 25. ORDER BY
A = LOAD 'mydata' AS (x: int, y:int);
B = ORDER A BY x;
C = ORDER A BY y DESC;
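ORDER BY accepts several keys and is commonly paired with LIMIT for a top-N result. A small sketch over the same 'mydata' file, with alias names chosen for illustration:

```pig
A = LOAD 'mydata' AS (x:int, y:int);
-- Sort by x ascending, breaking ties by y descending.
sorted = ORDER A BY x ASC, y DESC;
-- Keep the first ten rows of the sorted relation.
top10  = LIMIT sorted 10;
DUMP top10;
```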
- 27. Aggregate and other eval functions
AVG
CONCAT
COUNT/COUNT_STAR
MAX/MIN
SIZE
SUM
TOKENIZE
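Several of these functions combine naturally in a word count. This is a sketch; the file name 'input.txt' and the alias names are assumptions.

```pig
-- Assumes a hypothetical text file; with the default loader only the text
-- before the first tab of each line ends up in 'line'.
lines  = LOAD 'input.txt' AS (line:chararray);
-- TOKENIZE splits a string into a bag of words; FLATTEN turns that bag into rows.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grp    = GROUP words BY word;
-- COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple.
counts = FOREACH grp GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```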
- 28. Math functions
ABS
ACOS/ASIN/ATAN/COS/COSH/SIN/SINH/TAN/TANH
CEIL/FLOOR/ROUND
LOG/LOG10
SQRT
RANDOM
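The math functions are applied per row inside FOREACH ... GENERATE. A brief sketch, assuming a hypothetical file 'numbers' with one double per line:

```pig
A = LOAD 'numbers' AS (v:double);
B = FOREACH A GENERATE
        v,
        ABS(v)               AS magnitude,
        ROUND(v)             AS nearest,      -- rounds to the nearest whole number
        SQRT(ABS(v))         AS root,         -- ABS keeps the argument non-negative
        LOG10(ABS(v) + 1.0)  AS log_scaled,
        RANDOM()             AS r;            -- pseudo-random double, takes no argument
DUMP B;
```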
- 29. String functions
INDEXOF/LAST_INDEX_OF
LCFIRST/LOWER/UCFIRST/UPPER
REGEX_EXTRACT/REGEX_EXTRACT_ALL
REPLACE
STRSPLIT
TRIM
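As an illustration, the sketch below splits "host:port" strings with REGEX_EXTRACT and normalizes them with a few of the other functions. The file 'hosts.txt' and its contents are hypothetical.

```pig
-- Assumes one address per line, for example 192.168.1.5:8020
hosts = LOAD 'hosts.txt' AS (addr:chararray);
parts = FOREACH hosts GENERATE
            REGEX_EXTRACT(addr, '(.*):(.*)', 1) AS ip,      -- first capture group
            REGEX_EXTRACT(addr, '(.*):(.*)', 2) AS port,    -- second capture group
            UPPER(TRIM(addr))                   AS normalized,
            REPLACE(addr, ':', '/')             AS slashed; -- the second argument is a regex
DUMP parts;
```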
- 30. Date/time functions
AddDuration / SubtractDuration
XXBetween (e.g. DaysBetween, HoursBetween)
GetXX (e.g. GetYear, GetMonth)
CurrentTime
ToDate
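A sketch of how these functions fit together: parse strings into datetime values with ToDate, then derive fields from them. The file 'events.txt', its columns and the timestamp format are assumptions.

```pig
-- Assumes timestamps such as 2013-07-26 08:59:37 stored as text.
events = LOAD 'events.txt' AS (id:chararray, start_s:chararray, end_s:chararray);
t = FOREACH events GENERATE id,
        ToDate(start_s, 'yyyy-MM-dd HH:mm:ss') AS start_t,
        ToDate(end_s,   'yyyy-MM-dd HH:mm:ss') AS end_t;
d = FOREACH t GENERATE id,
        GetYear(start_t)            AS year,
        DaysBetween(end_t, start_t) AS days,      -- first argument minus second, in days
        AddDuration(start_t, 'P1D') AS next_day;  -- duration given in ISO 8601 form
DUMP d;
```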
- 32. Protobuf (PB) usage at 广点通 (Guangdiantong)
REGISTER '/data/etl/roach/pig/protobuftest/*.jar';
raw_data = LOAD 'ad00008_20130726085937_991_10.171.86.18.txt.gz'
    USING org.apache.pig.tdw.protobuf.load.ProtobufLengthRecordPigLoader('QZAP.log.OuterPageview.Pageview');
position = FOREACH raw_data GENERATE FLATTEN(positions);
imp = FOREACH position GENERATE FLATTEN(positions::imps);
mt = FOREACH imp GENERATE imps::id, FLATTEN(imps::ad.matched_targeting);
DUMP mt;
- 33. Q&A