PIG on Storm
Presented by Mridul Jain | June 3, 2014
2014 Hadoop Summit, San Jose, California
• Intro – PIG, Storm
• PIG on Storm
• PIG – Hybrid Mode
Quick Intuition: PIG
Q = LOAD 'sentences' USING PigStorage() AS (query:chararray);
words = FOREACH Q GENERATE FLATTEN(TOKENIZE(query));
word_grps = GROUP words BY $0;
word_counts = FOREACH word_grps GENERATE $0, COUNT($1);
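For intuition, a hypothetical run of this script, using the deck's own example sentences (the exact tuples below are illustrative, not from the slides):

-- Input relation Q, one sentence per tuple:
--   (obama wins elections)
--   (obama care)
-- After TOKENIZE + FLATTEN, words holds one token per tuple:
--   (obama) (wins) (elections) (obama) (care)
-- word_counts after the GROUP/COUNT:
--   (obama,2) (wins,1) (elections,1) (care,1)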
Quick Intuition: Storm
[Diagram: a Storm topology - Kafka Loader → Tokenizer → Group By → Count. Sentences such as "Obama wins elections" and "Obama care" arrive from Kafka, are split into tokens ("Obama", "wins", "elections", "care"), grouped by word, and counted.]
Quick Intuition: PIG on Storm
[Diagram: the same Kafka Loader → Tokenizer → Group By → Count topology as on the previous slide, now generated from the PIG script below.]
Q = LOAD 'sentences' USING KafkaLoader() AS (query:chararray);
words = FOREACH Q GENERATE FLATTEN(TOKENIZE(query));
word_grps = GROUP words BY $0;
word_counts = FOREACH word_grps GENERATE $0, COUNT($1);
Storm Modes
Event Processing
[Diagram: a Spout reads an event stream from Kafka/AMQ and emits events one at a time through Bolts A → B → C.]
Batch Processing
[Diagram: a coordinator (C) and spout (S) with emitters E1-E3 read from Kafka/AMQ; events are processed as a bunch, tagged by BatchID, through Bolts A → B → C.]
[Diagram: a Coordinator Spout and Emitter1-Emitter3 read from Kafka/AMQ and feed Bolts A → B → C; all three emitters tag their tuples with the same batch ID (ID1). The batch spans across the emitters.]
[Diagram: tuples 1-3, marked by the batch identifier, are emitted as and when they arrive.]
[Diagram: the next batch (ID2) starts flowing through the emitters.]
[Diagram: tuples 4-6 of batch ID2 flow through the topology.]
Multiple batches can run in parallel in the topology. Ordering of batches at any node can be guaranteed using committers.
PIG on Storm
Write once, run anywhere
[Diagram: one PIG script compiles to either Map Reduce or Storm.]
• Express in PIG and run on Storm - simplified code & inbuilt ops.
• Think & write in PIG - scripts as well as UDFs.
• The same script generally runs on MR or Storm - existing scripts are easy to move over to realtime, quickly.
• Easy pluggability to any streaming data source.
Batch Aggregation
• Batches are supported in PIG, and aggregations happen on them.
• Aggregation across batches is also possible now!
Rich State Semantics
• A state can now be associated with & represented as a PIG relation, and operated upon like a usual PIG relation.
• Sliding Windows are now available in Storm via PIG - the window is automatically updated with every new batch; HBase & any other store are pluggable (see the sketch below).
• Global Mutable State - updated state available with every batch and exclusively accessible during commits; PigStorage, HBase & any other store are pluggable.
• Richer operations & state mgmt - upcoming!
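A minimal sketch of sliding-window state expressed in PIG, modeled on the WindowHbaseStore usage shown later in this deck (the table name, relation names and window bounds here are illustrative):

-- Append the current batch of words to the window state kept in HBase;
-- the window is updated automatically with every new batch.
STORE words INTO 'wordtable' USING org.pos.state.hbase.WindowHbaseStore('fam');
-- Read the last 6 batches back as an ordinary PIG relation.
window = LOAD 'wordtable,-6,-1' USING org.pos.state.hbase.WindowHbaseStore('fam') AS (word, cnt);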
Hybrid Mode
A mode which automatically decides which parts of your PIG script run on Storm and which on MR.
Think streaming in PIG
A = LOAD 'a.source' USING StorageA();
B = LOAD 'b.source' USING StorageB();
C = FOREACH A GENERATE PIGUDF(*);
D = GROUP A BY $0;
E = FOREACH D GENERATE PIGUDF1(*);
F = CROSS A, B;
A script variable of a PIG type is an open pipe which receives data as time passes, rather than having all records available upfront as in classic PIG. Two streams, A and B, are open here.
• Semantics are the same as PIG's, i.e. the programmer deals with batches of records and thinks in the same terms.
• Each batch here corresponds to a single batch in Storm.
• The tuples for a batch are generated as time passes, in a streaming fashion; tuples start moving in the pipeline as and when they are generated within a batch, rather than waiting for the whole batch to finish.
• Pipelining of batches is supported, since a batch doesn't have to traverse the whole topology (the complete PIG script here) before the next batch can start.
• In line with PIG's philosophy, joining, merging of streams and every other stream transformation is done explicitly by the programmer (see the sketch below).
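For example, merging the two open streams is written out explicitly, just as in batch PIG (a minimal sketch; UNION is standard PIG syntax, and I am assuming it is among the operators supported in streaming mode):

-- Explicitly merge streams A and B into one stream; nothing is merged implicitly.
M = UNION A, B;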
Language Models for Trend Detection
[Diagram: a query arriving at 1600hrs today ("Obama wins elections") is copied to two calculators: a current-window calculator (past x hrs from the current time) and a yesterday's-window calculator (the same x-hr span, yesterday).]
Total count of each n-gram from 1200-1600hrs today (total vocab size for this window: 1000):
  Obama: 10, wins: 30, elections: 5
Total count of each n-gram from 1200-1600hrs yesterday (total vocab size for this window: 1200):
  Obama: 5, wins: 10, elections: 2
Buzz score = Probability("Obama wins elections") in the current window / Probability("Obama wins elections") in yesterday's window.
● Detects trends in Search or Twitter signals by comparing n-gram frequencies across current and historic time windows
● Needs a notion of time windows
● Needs state mgmt for historic data
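To make the two window probabilities concrete, one plausible formulation (my assumption; the slides only name per-window totals and vocab sizes, which suggests add-one smoothing) is:

P_w(g) = \frac{c_w(g) + 1}{N_w + V_w}, \qquad \mathrm{buzz}(g) = \frac{P_{\mathrm{today}}(g)}{P_{\mathrm{yesterday}}(g)}

where c_w(g) is the count of n-gram g in window w, N_w the window's total n-gram count, and V_w its vocab size (1000 today, 1200 yesterday in the example above).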
Implementation
[Diagram: a Coordinated Spout drives an Emitter (Query) and an Emitter (Twitter), with the same batchid shared across emitters. Query ngrams flow through a Splitter; Twitter ngrams feed Vocab Count and Total Count bolts, with past-day and past-week ngram windows held in HBase; a candset emitter and the counts feed the Buzz Scorer.]
● Lines of code (native Storm): 5000
● Lines of code (PIG): ~200
● From 55 mins to 5 mins (classic to Storm)
● Nearest: > 8x-10x
register '/home/mridul/posapps/trending/ngram_query.py' using jython as nqfunc;
--get live queries as a batch {(q1),(q2),(q3)...}
LiveQueries = LOAD 'twitter' USING org.pos.udfs.kafka.KafkaEmitter('60');
--generate the relation Ngrams having {(n1,q1),(n2,q2)...} from LiveQueries
Ngrams = FOREACH LiveQueries GENERATE FLATTEN(nqfunc.split_ngram_query($0));
--store the above Ngrams in HBase for the sliding window
STORE Ngrams INTO 'testtable' USING org.pos.state.hbase.WindowHbaseStore('fam');
--load the current 5 hr window from the datasource, which is of the form {(n1,c1),(n2,c2),(n3,c3)...}
NgramModel = LOAD 'testtable,-6,-1' USING org.pos.state.hbase.WindowHbaseStore('fam') AS (word, cnt);
--group all to find the total in the next step {(ALL,{(n1,c1),(n2,c2),(n3,c3)...})}
GM = GROUP NgramModel ALL;
--find the total count of all tuples in the current window
TotalNgramCount = FOREACH GM GENERATE SUM($1.$1);
--find the unique count of tuples in the current window
VocabCount = FOREACH GM GENERATE COUNT($1);
--Next steps get all the data per ngram in a format which helps in calculating the MLE:
--{(ngram1,query1,ngram1,ngram1_frequency,total_ngrams,vocab_size),(ngram2,query2,ngram2,ngram2_frequency,total_ngrams,vocab_size)}
--Join the streams to calculate the counts for an ngram in the query and the unique vocab, from NgramModel (every batch)
CW1 = JOIN Ngrams BY $0, NgramModel BY $0;
[Diagram: a Kafka spout (search/twitter) and a WindowHbaseStore spout emit batches 1-8; CW1 is the joined batch.]
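The deck's script stops at the join. A hypothetical continuation (my sketch, not from the slides) that turns the joined batch into per-ngram scores might read:

-- TotalNgramCount and VocabCount are single-tuple relations, so CROSS
-- simply appends the window totals to every joined row.
CW2 = CROSS CW1, TotalNgramCount, VocabCount;
-- Fields now: ngram, query, word, cnt, total_ngrams, vocab_size.
-- Add-one smoothed probability of each ngram in the current window.
Scores = FOREACH CW2 GENERATE $0 AS ngram, ((double)$3 + 1.0) / ((double)$4 + (double)$5) AS prob;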
Advanced State features on Storm
● Efficient in-memory state
● Rich out-of-the-box in-memory data structures (unions, intersections and sketches)
● Advanced semantics like sliding windows supported inherently
● Rich expression of state via PIG
● Fast recovery of the in-memory state
● Potentially query-able state and immutable jobs on those states
● Scheduling of tasks based on data locality
PIG Hybrid Mode
One script to bind it ALL!
Fast path + Slow path
1. User Event Processing
[Diagram: a Realtime User Event passes through a UDF that looks up the user's latest profile from the User Profile & History (offline model), producing an Enriched User Event.]
Motivation
● Merge the latest user event into the user profile model.
● The user profile model crunches huge data, periodically.
● Separate processing logic is needed for the realtime vs the batched pipeline.
Current Solutions
• Batch processing and real-time processing systems have been developed in isolation and are maintained separately.
• The whole system must be architected explicitly, as there is currently no system supporting both.
• The shared state store schema is tied to application logic.
• Read-write sync logic and locking need to be custom designed between the pipelines.
Hybrid Mode
--get live queries as a batch {(q1),(q2),(q3)...}
LiveQueries = LOAD 'twitter' USING org.pos.udfs.kafka.KafkaEmitter('60');
--generate the relation Ngrams having {(n1,q1),(n2,q2)...} from LiveQueries
Ngrams = FOREACH LiveQueries GENERATE FLATTEN(nqfunc.split_ngram_query($0));
--store the above Ngrams in HBase for the sliding window
STORE Ngrams INTO 'testtable' USING org.pos.state.hbase.WindowHbaseStore('fam');
--load the current 5 hr window from the datasource, which is of the form {(n1,c1),(n2,c2)}
NgramModel = LOAD 'testtable,-1000,-1,10' USING org.pos.state.hbase.WindowHbaseStore('fam') AS (word, cnt);
--group all to find the total in the next step {(ALL,{(n1,c1),(n2,c2),(n3,c3)...})}
GM = GROUP NgramModel ALL;
--find the total count of all tuples in the current window
TotalNgramCount = FOREACH GM GENERATE SUM($1.$1);
--find the unique count of tuples in the current window
VocabCount = FOREACH GM GENERATE COUNT($1);
--Next steps get all the data per ngram in a format which helps in calculating the MLE:
--{(ngram1,query1,ngram1,ngram1_frequency,total_ngrams,vocab_size),(ngram2,query2,ngram2,ngram2_frequency,total_ngrams,vocab_size)}
CW1 = JOIN Ngrams BY $0, NgramModel BY $0;
Annotations on the window spec 'testtable,-1000,-1,10':
● Every 10th batch, read in a relative state of 1000 batches back from the current batch, and process that range in MR/offline.
● A large batch range and a low frequency of processing, plus a high data payload, help decide the Storm vs MR parts.
● The point of merge/interaction with the online relation (Ngrams) defines the boundary for the MR job.
Hybrid Mode Architecture & Program
Next
● Perf testing @ scale
● Install & Setup - environments, dependencies, paths, software installs on different systems
● Demo & Documentation
● State related optimizations
● Hybrid Mode
Thank You
Twitter:
@mridul_jain
@gupta_kapilg
@jyotiahuja11
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.
Backup
Job Creation & Deployment
test-load-2: (Name: LOStore Schema: null)
|
|---F: (Name: LOForEach Schema: group#4:chararray,#8:long)
| |
| (Name: LOGenerate[false,false] Schema: group#4:chararray,#8:long)
| | |
| | group:(Name: Project Type: chararray Uid: 4 Input: 0 Column: (*))
| | |
| | (Name: Project Type: long Uid: 8 Input: 1 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: group#4:chararray)
| |
| |---(Name: LOInnerLoad[1] Schema: #8:long)
|
|---E: (Name: LOForEach Schema: group#4:chararray,#8:long)
| |
| (Name: LOGenerate[false,false] Schema: group#4:chararray,#8:long)
| | |
| | group:(Name: Project Type: chararray Uid: 4 Input: 0 Column: (*))
| | |
| | (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 8)
| | |
| | |---C:(Name: Project Type: bag Uid: 5 Input: 1 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: group#4:chararray)
| |
| |---C: (Name: LOInnerLoad[1] Schema: bag_of_tokenTuples_from_null::token#4:chararray)
|
|---D: (Name: LOCogroup Schema: group#4:chararray,C#5:bag{#9:tuple(bag_of_tokenTuples_from_null::token#4:chararray)})
| |
| bag_of_tokenTuples_from_null::token:(Name: Project Type: chararray Uid: 4 Input: 0 Column: 0)
|
|---C: (Name: LOForEach Schema: bag_of_tokenTuples_from_null::token#4:chararray)
| |
| (Name: LOGenerate[true] Schema: bag_of_tokenTuples_from_null::token#4:chararray)
| | |
| | bag_of_tokenTuples_from_null:(Name: Project Type: bag Uid: 2 Input: 0 Column: (*))
| |
| |---bag_of_tokenTuples_from_null: (Name: LOInnerLoad[0] Schema: token#4:chararray)
|
|---B: (Name: LOForEach Schema: bag_of_tokenTuples_from_null#2:bag{tuple_of_tokens#3:tuple(token#4:chararray)})
| |
| (Name: LOGenerate[false] Schema: bag_of_tokenTuples_from_null#2:bag{tuple_of_tokens#3:tuple(token#4:chararray)})
| | |
| | (Name: UserFunc(org.apache.pig.builtin.TOKENIZE) Type: bag Uid: 2)
| | |
| | |---(Name: Project Type: bytearray Uid: 1 Input: 0 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: #1:bytearray)
|
|---A: (Name: LOLoad Schema: null)RequiredFields:null
Logical Plan
[Diagram: the logical operator chain A → B → C → D → E → F, matching the labels in the dump above.]
Grouping Implementation - MR
[Diagram: in the physical plan, D becomes D-LR (Local Rearrange) → D-GR (Global Rearrange) → D-Pkg (Package). In the MR implementation, tokens such as "cow" and "jumped" pass through D-LR in the Map and are packaged per key by D-Pkg in the Reduce.]
Relation D (for grouping) gets broken into 3 operators going from the Logical to the Physical Plan.
test-load-2: Store(file:///Users/mridul/workspace/PIGOnStorm/screen:org.apache.pig.builtin.PigStorage) - scope-23
|
|---F: New For Each(false,false)[bag] - scope-22
| |
| Project[chararray][0] - scope-18
| |
| Project[long][1] - scope-20
|
|---E: New For Each(false,false)[bag] - scope-17
| |
| Project[chararray][0] - scope-12
| |
| POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-15
| |
| |---Project[bag][1] - scope-14
|
|---D: Package[tuple]{chararray} - scope-9
|
|---D: Global Rearrange[tuple] - scope-8
|
|---D: Local Rearrange[tuple]{chararray}(false) - scope-10
| |
| Project[chararray][0] - scope-11
|
|---C: New For Each(true)[bag] - scope-7
| |
| Project[bag][0] - scope-5
|
|---B: New For Each(false)[bag] - scope-4
| |
| POUserFunc(org.apache.pig.builtin.TOKENIZE)[bag] - scope-2
| |
| |---Project[bytearray][0] - scope-1
|
|---A: Load(sentences.txt:org.pos.main.RandomSentenceLoadFunc('sentence','2','the cow jumped over the moon,the cow man went to the store
and bought some candy moon, the cow went to the moon')) - scope-0
Physical Plan
[Diagram: the physical operator chain A → B → C → D-LR → D-GR → D-Pkg → E → F.]
Clustering Algo (boundary detection to form a job):
● How are POs clustered into an MR job?
● How are POs within a job clustered into Maps and Reduces?
[Diagram: MR Job1 contains a Map with PO1 and PO2 (pass-through) and a Reduce with PO3 and PO4; PO5 lands in a separate MR Job2.]
A Map can contain multiple physical operators where data is passed directly from one operator to another (PO1 → PO2 here). The Reduce contains a reducing operator PO3 followed by PO4, which processes the output of PO3. Shuffling needs to be done between PO2 and PO3; this defines a boundary condition between a Map and a Reduce within an MR job. An MR job can have only 1 Map and 1 Reduce, so PO5, which is another reducer, cannot be in the same job as MR Job1, and a new job MR Job2 is created as a result of this boundary condition.
The Load PO gets mapped to an MROp which becomes part of the data loading framework for an MR job. After loading the data, it is passed tuple by tuple to the subsequent Map by the MR framework.
WordCount PhysicalPlan -----> WordCount MRPlan
[Diagram: A (load) feeds an MROp(Map) containing B, C and D-LR; D-GR marks the shuffle; an MROp(Reduce) contains D-Pkg, E and F. Together they form one MR Job.]
WordCount PhysicalPlan -----> WordCount StormPlan
[Diagram: the same chain becomes a Storm Topology: A as the spout, a StormOp(Bolt) containing B, C and D-LR, and a StormOp(Bolt) containing D-Pkg, E and F.]
Storm UDF scope-24
A: Load(sentences.txt:org.pos.main.RandomSentenceLoadFunc('sentence','2','the cow jumped over the moon,the cow man went to the store and bought some candy moon, the
cow went to the moon')) - scope-0--------
Storm UDF scope-25
D: Local Rearrange[tuple]{chararray}(false) - scope-10
| |
| Project[chararray][0] - scope-11
|
|---C: New For Each(true)[bag] - scope-7
| |
| Project[bag][0] - scope-5
|
|---B: New For Each(false)[bag] - scope-4
| |
| POUserFunc(org.apache.pig.builtin.TOKENIZE)[bag] - scope-2
| |
| |---Project[bytearray][0] - scope-1
Input: A--------
Storm UDF scope-26
F: New For Each(false,false)[bag] - scope-22
| |
| Project[chararray][0] - scope-18
| |
| Project[long][1] - scope-20
|
|---E: New For Each(false,false)[bag] - scope-17
| |
| Project[chararray][0] - scope-12
| |
| POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-15
| |
| |---Project[bag][1] - scope-14
|
|---D: Package[tuple]{chararray} - scope-9--------
Storm Plan
Data Flow & Execution
Data Flow @ Runtime - MR
[Diagram: the WordCount MR Job from above: A (load) → MROp(Map) with B, C, D-LR → shuffle at D-GR → MROp(Reduce) with D-Pkg, E, F.]
A loads data from HDFS; the MR framework passes it tuple by tuple to the subsequent Map (the MROp here). The MROp callback attaches the tuple given by the framework to the root node of its local tree (B here). The leaf node in every MROp then pulls the data tuple recursively (i.e. D → C → B); this is the same tuple which was attached initially to B. The process repeats.
Data Flow @ Runtime - Storm
[Diagram: the WordCount Storm Topology from above: A (spout) → StormOp(Bolt) with B, C, D-LR → StormOp(Bolt) with D-Pkg, E, F.]
The Spout pulls data from a datasource or queue and emits it to the next Bolt. The Bolt attaches the tuple to the root node B; the leaf node D pulls the tuple and emits it.
Grouping Implementation - Storm
[Diagram: in the Storm implementation of the physical plan's D-LR → D-GR → D-Pkg, the D-LR tasks emit tokens such as "cow" and "jumped" directly to D-Pkg tasks.]
Each task of D-Pkg could maintain multiple states corresponding to specific tokens (like a reducer).
Distinct Implementation - Storm
[Diagram: the same D-LR → D-Pkg structure as for grouping.]
Each task of D-Pkg could maintain multiple states corresponding to specific tokens (like a reducer) and emit only the distinct tokens from each, directly.
Sort Implementation - Storm
[Diagram: all D-LR tasks emit to a single D-Pkg task via global grouping.]
A single D-Pkg task maintains a sorted TreeMap, which is emitted at the end of the batch.
Cogroup Implementation - Storm
[Diagram: relations A and B each pass through an LR; field grouping is a property of the edge into D-Pkg, so e.g. the "cow" tuples from both relations land in the same task.]
Relations A and B are merged into a single stream by Trident, field-grouped on the grouping key, before being passed to the respective POPackage. Each D-Pkg task emits a tuple containing the field grouped on and a bag of the grouped tuples from relations A and B, at the end of the batch.
Data Format Transformation during Data Flow (Storm)
[Diagram: a queue feeds a StormOp(Bolt) containing A, B, C and D-LR.]
The bolt takes a Storm tuple and converts it into a PIG tuple before forwarding/attaching it as input to the POs. Before emitting from the bolt, it converts the PIG tuple back into a Storm tuple; the payload is packaged into a single PIG tuple, i.e. the outgoing stream has complex data structures (like bags etc.) which are packed as blobs in a tuple.

More Related Content

What's hot

Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm ConceptsAndré Dias
 
Parallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopParallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopThibault Debatty
 
Storm 2012-03-29
Storm 2012-03-29Storm 2012-03-29
Storm 2012-03-29Ted Dunning
 
Buzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningBuzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningMapR Technologies
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormAndrea Iacono
 
A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013Nathan Bijnens
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
anti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHanti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHLeo Chu
 
Streaming kafka search utility for Mozilla's Bagheera
Streaming kafka search utility for Mozilla's BagheeraStreaming kafka search utility for Mozilla's Bagheera
Streaming kafka search utility for Mozilla's BagheeraVarunkumar Manohar
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelTakahiro Inoue
 
Lightning Talk: MongoDB Sharding
Lightning Talk: MongoDB ShardingLightning Talk: MongoDB Sharding
Lightning Talk: MongoDB ShardingMongoDB
 

What's hot (19)

Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
 
Parallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopParallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with Hadoop
 
Storm 2012-03-29
Storm 2012-03-29Storm 2012-03-29
Storm 2012-03-29
 
Buzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningBuzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time Learning
 
Hadoop
HadoopHadoop
Hadoop
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache Storm
 
A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
myHadoop 0.30
myHadoop 0.30myHadoop 0.30
myHadoop 0.30
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
anti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHanti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIH
 
Streaming kafka search utility for Mozilla's Bagheera
Streaming kafka search utility for Mozilla's BagheeraStreaming kafka search utility for Mozilla's Bagheera
Streaming kafka search utility for Mozilla's Bagheera
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Lightning Talk: MongoDB Sharding
Lightning Talk: MongoDB ShardingLightning Talk: MongoDB Sharding
Lightning Talk: MongoDB Sharding
 

Similar to Pig on Storm

9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
44CON 2014: Using hadoop for malware, network, forensics and log analysis
44CON 2014: Using hadoop for malware, network, forensics and log analysis44CON 2014: Using hadoop for malware, network, forensics and log analysis
44CON 2014: Using hadoop for malware, network, forensics and log analysisMichael Boman
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureGabriele Modena
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationGeorge Long
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientistsJenn Rawlins
 
Streaming options in the wild
Streaming options in the wildStreaming options in the wild
Streaming options in the wildAtif Akhtar
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
 
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...HostedbyConfluent
 
Hadoop Summit 2014 - recap
Hadoop Summit 2014 - recapHadoop Summit 2014 - recap
Hadoop Summit 2014 - recapUserReport
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Guido Schmutz
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjugDavid Morin
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Hadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without JavaHadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without JavaGlenn K. Lockwood
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 

Similar to Pig on Storm (20)

9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
44CON 2014: Using hadoop for malware, network, forensics and log analysis
44CON 2014: Using hadoop for malware, network, forensics and log analysis44CON 2014: Using hadoop for malware, network, forensics and log analysis
44CON 2014: Using hadoop for malware, network, forensics and log analysis
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
 
Cascalog
CascalogCascalog
Cascalog
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientists
 
Streaming options in the wild
Streaming options in the wildStreaming options in the wild
Streaming options in the wild
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
 
Hadoop Summit 2014 - recap
Hadoop Summit 2014 - recapHadoop Summit 2014 - recap
Hadoop Summit 2014 - recap
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
 
Pywps
PywpsPywps
Pywps
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Hadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without JavaHadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without Java
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

Pig on Storm

  • 1. PIG on Storm P R E S E N T E D B Y M r i d u l J a i n ⎪ J u n e 3 , 2 0 1 4 2 0 1 4 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a 1 2014 Hadoop Summit, San Jose, California
  • 2. • Intro – PIG, Storm • PIG on Storm • PIG – Hybrid Mode 2 2014 Hadoop Summit, San Jose, California
  • 3. Quick Intuition: PIG 3 2014 Hadoop Summit, San Jose, California Q = LOAD “sentences” USING PigStorage() AS (query:chararray); words = FOREACH Q GENERATE FLATTEN(Tokenize(query)); word_grps = GROUP words BY $0; word_counts = FOREACH word_grps GENERATE $0, COUNT($1);
  • 4. Quick Intuition: Storm 4 2014 Hadoop Summit, San Jose, California Kafka Loader Tokenizer Group By Count “Obama wins elections” “Obama care ” “Obama wins elections” “Obama Care” “Obama” “Care”“Obama” “wins” “elections” “Obama” “Obama” “wins” “elections” “care”
  • 5. Quick Intuition: PIG on Storm 5 2014 Hadoop Summit, San Jose, California Kafka Loader Tokenizer Group By Count “Obama wins elections” “Obama care ” “Obama wins elections” “Obama Care” “Obama” “Care”“Obama” “wins” “elections” “Obama” “Obama” “wins” “elections” “care” Q = LOAD “sentences” USING KafkaLoader() AS (query:chararray); words = FOREACH Q GENERATE FLATTEN(Tokenize(query)); word_grps = GROUP words BY $0; word_counts = FOREACH word_grps GENERATE $0, COUNT($1);
  • 6. Storm Modes 6 2014 Hadoop Summit, San Jose, California Bolt A Bolt B Bolt CSpout Bolt A Bolt B Bolt C C S E1 E2 E3 Kafka AMQ Event Processing Batch Processing Event Data Stream Bunch of events, tagged by BatchID
  • 8. Coord Spout Emitter1 Bolt A Bolt B Bolt CEmitter2 Emitter3 Kafka AMQ tuple 1 tuple 2 tuple 3 Tuples (marked by batch identifer) are emitted, as and when they arrive. 8 2014 Hadoop Summit, San Jose, California
  • 10. Coord Spout Emitter1 Bolt A Bolt B Bolt C Emitter2 Emitter3 Kafka AMQ tuple 4 tuple 5 tuple 6 Multiple batches can run in parallel in the topology. Ordering of batches at any node can be guaranteed using commiters. 10 2014 Hadoop Summit, San Jose, California
  • 11. PIG on Storm Write once run anywhere 11 2014 Hadoop Summit, San Jose, California
  • 12. PIG Script Map Reduce Storm • Express in PIG and run on Storm - simplified code & inbuilt ops. • Think & write in PIG - scripts as well as UDFs. • The same script would generally run on MR or Storm - existing scripts easy to move over to realtime, quickly • Easy pluggability to any streaming data source. Batch Aggregation • Batches supported in PIG on which aggregation happen • Aggregation across batches also possible now! Rich State Semantics • A state can now be associated & represented as a PIG relation & operated upon as a usual PIG relation. • Sliding Windows now available in Storm via PIG - automatically updates the window with every new batch: HBase & any other store pluggable • Global Mutable State - updated state available with every batch and exclusively accessible during commits: PigStorage, HBase & any other store pluggable. • Richer operations & state mgmt - upcoming! Hybrid Mode Mode which decides what parts of your PIG script to run on Storm & what on MR, automatically. 12 2014 Hadoop Summit, San Jose, California
  • 13. Think streaming in PIG A = Load "a.source" from StorageA(); B = Load "b.source" from StorageB(); C = foreach A generate PIGUDF(*); D = group A by $0; E = foreach D generate PIGUDF1(*) F = cross A,B; A script variable which contains the PIG types are open pipes which will get data as time passes, than all records available upfront unlike PIG. Two streams A and B are open here. • Semantics are same as PIG's i.e programmer deals with batches of records and thinks in the same terms. • Each batch here corresponds to a single batch in storm. • The tuples for a batch get generated as timepasses, in a streaming fashion; though tuples start moving in the pipeline as and when they are generated within a batch, than waiting for whole batch to finish. • Pipelining of batches is supported as the batch doesn't have to traverse the topology (the whole pig script completely here), before the next batch can start. • In-line with PIG's philosophy, all operations like joining, merging of stream and every other stream transformation is explicitly done by the programmer. 13 2014 Hadoop Summit, San Jose, California
  • 14. Language Models for Trend Detection Copy Query @ time 1600hrs today: Obama wins elections Current Window (Past x hrs from current time) Calculator Yesterday's Window (Past x hrs from current time, yesterday) Calculator Total Count of each n-gram from 1200-1600hrs today: Obama wins elections Total Count 10 30 5 Total Vocab size for the window 1200-1600hrs today: 1000 Total Count of each n-gram from 1200-1600hrs yesterday: Obama wins elections Total Count 5 10 2 Total Vocab size for the window 1200-1600hrs yesterday: 1200 Probability("Obama wins elections") in current window ----------------------------------------- Probability("Obama wins elections") in Yesterday's window ● Detects trends in Search or Twitter signals by comparing n-gram frequencies across current and historic time windows ● Needs notion of time windows ● Needs state mgmt for historic data 14 2014 Hadoop Summit, San Jose, California
  • 16. register '/home/mridul/posapps/trending/ngram_query.py' using jython as nqfunc; --get live queries as a batch {(q1),(q2),(q3)...} LiveQueries = LOAD 'twitter' USING org.pos.udfs.kafka.KafkaEmitter('60'); --generate the relation Ngrams having {(n1,q1),(n2,q2)...} from LiveQueries Ngrams = FOREACH LiveQueries GENERATE FLATTEN(nqfunc.split_ngram_query($0)); --store the above Ngram in in Hbase for sliding window STORE Ngram INTO 'testtable' USING org.pos.state.hbase.WindowHbaseStore('fam'); --load the current 5 hr window from the datasource which is of the form {(n1,c1),(n2,c2),(n3,c3)...} NgramModel = LOAD 'testtable,-6,-1' USING org.pos.state.hbase.WindowHbaseStore('fam') as (word, cnt); --group all to find the total in next step {(ALL,{(n1,c1),(n2,c2),(n3,c3)...})} GM = GROUP NgramModel ALL; --find total count of all tuples in the current window TotalNgramCount = FOREACH GM GENERATE SUM($1.$1); --find the unique count of tuples in the current window VocabCount = FOREACH GM GENERATE COUNT($1); --Next steps get all the data per ngram in a fmt which helps in calculating MLE -- {(ngram1,query1,ngram1,ngram1_frequency,total_ngrams,vocab_size),(ngram2,query2,ng ram2,ngram2_frequency,total_ngrams,vocab_size)} CW1 = JOIN Ngrams BY $0, NgramModel BY $0; Kafka Spout WindowHbaseStore Spout 1 2 3 4 5 6 7 8 CW 1 Joined batch search/twitter --Join the streams to calculate the counts for an Ngram in the query and unique vocab, from NgramModel (every batch) 16 2014 Hadoop Summit, San Jose, California
  • 17. Advanced State features on Storm
● Efficient in-memory state
● Rich out-of-the-box in-memory data structures (unions, intersections and sketches)
● Advanced semantics like sliding windows supported inherently
● Rich expression of state via PIG
● Fast recovery of the in-memory state
● Potentially query-able state and immutable jobs on those states
● Scheduling of tasks based on data locality
  • 18. PIG Hybrid Mode
One script to bind it ALL!
  • 19. Fast path + Slow path
Motivation 1: User Event Processing
[Diagram: Realtime User Event → UDF (latest user-profile lookup) → Enriched User Event, with the UDF reading from the User Profile & History offline model]
● Merge the latest user event into the user profile model.
● The user profile model crunches huge data, periodically.
● Separate processing logic for the realtime vs. the batched pipeline.
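A minimal sketch of how this fast path + slow path could be expressed in PIG on Storm, reusing the loaders from the trending example; the topic name 'user_events', the table 'user_profiles', the window spec and the field names are all hypothetical:

--fast path: realtime user events arrive in 60s batches (hypothetical topic)
Events = LOAD 'user_events' USING org.pos.udfs.kafka.KafkaEmitter('60');
--slow path: profile model computed offline, exposed as windowed state (hypothetical table/spec)
Profiles = LOAD 'user_profiles,-1,-1' USING org.pos.state.hbase.WindowHbaseStore('fam') AS (userid, profile);
--merge point: enrich each realtime event with the latest profile
Enriched = JOIN Events BY $0, Profiles BY userid;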
  • 20. Current Solutions
● Batch-processing and realtime-processing systems have been developed in isolation and are maintained separately.
● The whole system has to be architected explicitly, as no current system supports both.
● The shared state store's schema is tied to the application logic.
● Read-write sync logic and locking need to be custom-designed between the pipelines.
  • 21. Hybrid Mode
--get live queries as a batch {(q1),(q2),(q3)...}
LiveQueries = LOAD 'twitter' USING org.pos.udfs.kafka.KafkaEmitter('60');
--generate the relation Ngrams having {(n1,q1),(n2,q2)...} from LiveQueries
Ngrams = FOREACH LiveQueries GENERATE FLATTEN(nqfunc.split_ngram_query($0));
--store the above Ngrams in HBase for the sliding window
STORE Ngrams INTO 'testtable' USING org.pos.state.hbase.WindowHbaseStore('fam');
--load a large historic window from the datasource, which is of the form {(n1,c1),(n2,c2)}
NgramModel = LOAD 'testtable,-1000,-1,10' USING org.pos.state.hbase.WindowHbaseStore('fam') as (word, cnt);
--group all to find the total in the next step {(ALL,{(n1,c1),(n2,c2),(n3,c3)...})}
GM = GROUP NgramModel ALL;
--find the total count of all tuples in the current window
TotalNgramCount = FOREACH GM GENERATE SUM($1.$1);
--find the unique count of tuples in the current window
VocabCount = FOREACH GM GENERATE COUNT($1);
--Next steps get all the data per ngram in a format which helps in calculating MLE
--{(ngram1,query1,ngram1,ngram1_frequency,total_ngrams,vocab_size),(ngram2,query2,ngram2,ngram2_frequency,total_ngrams,vocab_size)}
CW1 = JOIN Ngrams BY $0, NgramModel BY $0;
Every 10th batch, read in a relative state of 1000 batches from the current batch, and process that range in MR/offline.
A large batch range and a low frequency of processing, plus a high data payload, help decide the Storm vs. MR parts.
The point of merge/interaction with the online relation (Ngrams) defines the boundary of the MR job.
  • 22. Hybrid Mode Architecture & Program
  • 23. Next
● Perf testing @ scale
● Install & setup - environments, dependencies, paths, software installs on different systems
● Demo & documentation
● State-related optimizations
● Hybrid Mode
  • 24. Thank You
Twitter: @mridul_jain @gupta_kapilg @jyotiahuja11
We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
  • 25. Backup
  • 26. Job Creation & Deployment
  • 27. Logical Plan: A → B → C → D → E → F
test-load-2: (Name: LOStore Schema: null)
|
|---F: (Name: LOForEach Schema: group#4:chararray,#8:long)
    |   |
    |   (Name: LOGenerate[false,false] Schema: group#4:chararray,#8:long)
    |   |   |
    |   |   group:(Name: Project Type: chararray Uid: 4 Input: 0 Column: (*))
    |   |   |
    |   |   (Name: Project Type: long Uid: 8 Input: 1 Column: (*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: group#4:chararray)
    |   |
    |   |---(Name: LOInnerLoad[1] Schema: #8:long)
    |
    |---E: (Name: LOForEach Schema: group#4:chararray,#8:long)
        |   |
        |   (Name: LOGenerate[false,false] Schema: group#4:chararray,#8:long)
        |   |   |
        |   |   group:(Name: Project Type: chararray Uid: 4 Input: 0 Column: (*))
        |   |   |
        |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 8)
        |   |   |
        |   |   |---C:(Name: Project Type: bag Uid: 5 Input: 1 Column: (*))
        |   |
        |   |---(Name: LOInnerLoad[0] Schema: group#4:chararray)
        |   |
        |   |---C: (Name: LOInnerLoad[1] Schema: bag_of_tokenTuples_from_null::token#4:chararray)
        |
        |---D: (Name: LOCogroup Schema: group#4:chararray,C#5:bag{#9:tuple(bag_of_tokenTuples_from_null::token#4:chararray)})
            |   |
            |   bag_of_tokenTuples_from_null::token:(Name: Project Type: chararray Uid: 4 Input: 0 Column: 0)
            |
            |---C: (Name: LOForEach Schema: bag_of_tokenTuples_from_null::token#4:chararray)
                |   |
                |   (Name: LOGenerate[true] Schema: bag_of_tokenTuples_from_null::token#4:chararray)
                |   |   |
                |   |   bag_of_tokenTuples_from_null:(Name: Project Type: bag Uid: 2 Input: 0 Column: (*))
                |   |
                |   |---bag_of_tokenTuples_from_null: (Name: LOInnerLoad[0] Schema: token#4:chararray)
                |
                |---B: (Name: LOForEach Schema: bag_of_tokenTuples_from_null#2:bag{tuple_of_tokens#3:tuple(token#4:chararray)})
                    |   |
                    |   (Name: LOGenerate[false] Schema: bag_of_tokenTuples_from_null#2:bag{tuple_of_tokens#3:tuple(token#4:chararray)})
                    |   |   |
                    |   |   (Name: UserFunc(org.apache.pig.builtin.TOKENIZE) Type: bag Uid: 2)
                    |   |   |
                    |   |   |---(Name: Project Type: bytearray Uid: 1 Input: 0 Column: (*))
                    |   |
                    |   |---(Name: LOInnerLoad[0] Schema: #1:bytearray)
                    |
                    |---A: (Name: LOLoad Schema: null) RequiredFields: null
  • 28. Grouping Implementation - MR
Physical Plan: D LR → D GR → D Pkg
[Diagram - MR Implementation: D LR tasks on the Map side emit "cow"/"jumped" tokens, which are shuffled by key into D Pkg tasks on the Reduce side]
Relation D (for grouping) gets broken into 3 operators from the Logical to the Physical Plan.
  • 29. Physical Plan: A → B → C → D LR → D GR → D Pkg → E → F
test-load-2: Store(file:///Users/mridul/workspace/PIGOnStorm/screen:org.apache.pig.builtin.PigStorage) - scope-23
|
|---F: New For Each(false,false)[bag] - scope-22
    |   |
    |   Project[chararray][0] - scope-18
    |   |
    |   Project[long][1] - scope-20
    |
    |---E: New For Each(false,false)[bag] - scope-17
        |   |
        |   Project[chararray][0] - scope-12
        |   |
        |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-15
        |   |
        |   |---Project[bag][1] - scope-14
        |
        |---D: Package[tuple]{chararray} - scope-9
            |
            |---D: Global Rearrange[tuple] - scope-8
                |
                |---D: Local Rearrange[tuple]{chararray}(false) - scope-10
                    |   |
                    |   Project[chararray][0] - scope-11
                    |
                    |---C: New For Each(true)[bag] - scope-7
                        |   |
                        |   Project[bag][0] - scope-5
                        |
                        |---B: New For Each(false)[bag] - scope-4
                            |   |
                            |   POUserFunc(org.apache.pig.builtin.TOKENIZE)[bag] - scope-2
                            |   |
                            |   |---Project[bytearray][0] - scope-1
                            |
                            |---A: Load(sentences.txt:org.pos.main.RandomSentenceLoadFunc('sentence','2','the cow jumped over the moon,the cow man went to the store and bought some candy moon, the cow went to the moon')) - scope-0
  • 30. Clustering Algo (boundary detection to form a job):
● How are POs clustered into an MR Job?
● How are POs within a job clustered into Maps and Reduces?
[Diagram: PO1 → PO2 form a pass-through Map and PO3 → PO4 form a Reduce, together MR Job 1; PO5 forms the Reduce of MR Job 2]
A Map can contain 2 Physical Operators where data is passed directly from one operator to the other. A Reduce contains a reducing operator, PO3, followed by PO4, which processes the output of PO3. Shuffling needs to happen between PO2 and PO3; this defines a boundary condition between a Map and a Reduce within an MR Job. An MR Job can have only one Map and one Reduce. PO5 is another reducer, which cannot be in the same job as MR Job 1, so a new job, MR Job 2, is created as a result of this boundary condition. A Load PO gets mapped to an MROp, which becomes part of the data-loading framework for an MR Job; after loading, the data is passed tuple by tuple to the subsequent Map by the MR framework.
  • 31. WordCount Physical Plan → WordCount MR Plan
[Diagram: A → B → C → D LR grouped into MROp(Map); D Pkg → E → F grouped into MROp(Reduce), with D GR as the shuffle between them; together they form one MR Job]
  • 32. WordCount Physical Plan → WordCount Storm Plan
[Diagram: A → B → C → D LR grouped into one StormOp(Bolt); D Pkg → E → F grouped into another StormOp(Bolt), with D GR on the edge between them; together they form the Storm Topology]
  • 33. Storm Plan
Storm UDF scope-24
A: Load(sentences.txt:org.pos.main.RandomSentenceLoadFunc('sentence','2','the cow jumped over the moon,the cow man went to the store and bought some candy moon, the cow went to the moon')) - scope-0
--------
Storm UDF scope-25
D: Local Rearrange[tuple]{chararray}(false) - scope-10
|   |
|   Project[chararray][0] - scope-11
|
|---C: New For Each(true)[bag] - scope-7
    |   |
    |   Project[bag][0] - scope-5
    |
    |---B: New For Each(false)[bag] - scope-4
        |   |
        |   POUserFunc(org.apache.pig.builtin.TOKENIZE)[bag] - scope-2
        |   |
        |   |---Project[bytearray][0] - scope-1
Input: A
--------
Storm UDF scope-26
F: New For Each(false,false)[bag] - scope-22
|   |
|   Project[chararray][0] - scope-18
|   |
|   Project[long][1] - scope-20
|
|---E: New For Each(false,false)[bag] - scope-17
    |   |
    |   Project[chararray][0] - scope-12
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-15
    |   |
    |   |---Project[bag][1] - scope-14
    |
    |---D: Package[tuple]{chararray} - scope-9
--------
  • 34. Data Flow & Execution
  • 35. Data Flow @ Runtime - MR
[Diagram: A → B → C → D LR in MROp(Map), D GR, and D Pkg → E → F in MROp(Reduce), forming the MR Job]
  • 36. Data Flow @ Runtime - MR
[Diagram: the Map side in detail - A → B → C → D LR inside MROp(Map) of the MR Job]
  • 37. Data Flow @ Runtime - MR
[Diagram: A → B → C → D LR inside MROp(Map) of the MR Job]
A loads data from HDFS. The MR framework passes it tuple by tuple to the subsequent Map (the MROp here). The MROp callback attaches the tuple given by the framework to the root node of its local tree (B here). The leaf node in every MROp pulls the data tuple recursively (i.e. D → C → B); this is the same tuple that was attached initially to B. The process repeats.
  • 38. Data Flow @ Runtime - Storm
[Diagram: A → B → C → D LR in one StormOp(Bolt), D GR, and D Pkg → E → F in another StormOp(Bolt), forming the Storm Topology]
The Spout pulls data from a datasource or queue and emits it to the next Bolt. The Bolt attaches the tuple to the root node B. The leaf node D pulls the tuple and emits it.
  • 39. Grouping Implementation - Storm
Physical Plan: D LR → D GR → D Pkg
[Diagram - Storm Implementation: D LR tasks emit "cow"/"jumped" tokens, field-grouped into D Pkg tasks]
Each task of D Pkg could maintain multiple states corresponding to specific tokens (like a reducer).
  • 40. Distinct Implementation - Storm
Physical Plan: D LR → D GR → D Pkg
[Diagram - Storm Implementation: D LR tasks emit "cow"/"jumped" tokens, field-grouped into D Pkg tasks]
Each task of D Pkg could maintain multiple states corresponding to specific tokens (like a reducer) and emit only the distinct tokens from each, directly.
  • 41. Sort Implementation - Storm
Physical Plan: D LR → D GR → D Pkg
[Diagram - Storm Implementation: D LR tasks emit "cow"/"jumped" tokens, globally grouped into a single D Pkg task]
A single D Pkg task maintains a sorted TreeMap, which is emitted at the end of the batch. (Global Grouping)
  • 42. Cogroup Implementation - Storm
[Diagram: A LR and B LR tasks feed D Pkg tasks; relations A and B are merged into a single stream by Trident, field-grouped on the grouping key ("cow", "jumped"), before being passed to the respective POPackage. Field grouping is a property of the edge.]
Each D Pkg task emits a tuple containing the field grouped on and a bag of the grouped tuples from relations A and B, e.g. ("cow", {"cow", "cow"}), emitted at the end of the batch.
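For reference, the PIG statements that give rise to the three Storm implementations above are the ordinary operators; a minimal sketch (relation names are illustrative, with C being the tokens relation from the word-count plan):

tokens_distinct = DISTINCT C;                  --slide 40: each D Pkg task emits distinct tokens
tokens_sorted   = ORDER C BY $0;               --slide 41: single D Pkg task with a sorted TreeMap
cogrouped       = COGROUP A BY $0, B BY $0;    --slide 42: field-grouped merge of relations A and B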
  • 43. Data Format Transformation during Data Flow (Storm)
[Diagram: Queue → StormOp(Bolt) containing A → B → C → D LR]
● Takes a Storm tuple and converts it into a PIG tuple before forwarding/attaching it as input to the POs.
● Converts from the PIG tuple back to a Storm tuple before emitting from a bolt.
● The payload is packaged into a single PIG tuple, i.e. the outgoing stream has complex data structures (like bags etc.) packed as blobs in a tuple.
