SlideShare a Scribd company logo
1 of 56
Download to read offline
May 9, 2014
Collaborative
Filtering with Spark
Chris Johnson
@MrChrisJohnson
Friday, May 9, 14
Who am I??
•Chris Johnson
– Machine Learning guy from NYC
– Focused on music recommendations
– Formerly a graduate student at UT Austin
Friday, May 9, 14
3
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14
4
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14
Collaborative Filtering - “The Netflix Prize”
5
Friday, May 9, 14
Collaborative Filtering
6
Hey,
I like tracks P, Q, R, S!
Well,
I like tracks Q, R, S, T!
Then you should check out
track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
Friday, May 9, 14
7
Collaborative Filtering at Spotify
• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing
Friday, May 9, 14
Section name 8
Friday, May 9, 14
Explicit Matrix Factorization 9
Movies
Users
Chris
Inception
•Users explicitly rate a subset of the movie catalog
•Goal: predict how users will rate new movies
Friday, May 9, 14
• = bias for user
• = bias for item
• = regularization parameter
Explicit Matrix Factorization 10
Chris
Inception
? 3 5 ?
1 ? ? 1
2 ? 3 2
? ? ? 5
5 2 ? 4
•Approximate ratings matrix by the product of low-
dimensional user and movie matrices
•Minimize RMSE (root mean squared error)
• = user rating for movie
• = user latent factor vector
• = item latent factor vector
X Y
Friday, May 9, 14
Implicit Matrix Factorization 11
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
•Replace Stream counts with binary labels
– 1 = streamed, 0 = never streamed
•Minimize weighted RMSE (root mean squared error) using a
function of stream counts as weights
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• =i tem latent factor vector
X Y
Friday, May 9, 14
Alternating Least Squares 12
• Initialize user and item vectors to random noise
• Fix item vectors and solve for optimal user vectors
– Take the derivative of loss function with respect to user’s vector, set
equal to 0, and solve
– Results in a system of linear equations with closed form solution!
• Fix user vectors and solve for optimal item vectors
• Repeat until convergence
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
Alternating Least Squares 13
• Note that:
• Then, we can pre-compute once per iteration
– and only contain non-zero elements for tracks that
the user streamed
– Using sparse matrix operations we can then compute each user’s
vector efficiently in time where is the number of
tracks the user streamed
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
14
Alternating Least Squares
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
Section name 15
Friday, May 9, 14
Scaling up Implicit Matrix Factorization
with Hadoop
16
Friday, May 9, 14
Hadoop at Spotify 2009
17
Friday, May 9, 14
Hadoop at Spotify 2014
18
700 Nodes in our London data center
Friday, May 9, 14
Implicit Matrix Factorization with Hadoop
19
Reduce stepMap step
u % K = 0
i % L = 0
u % K = 0
i % L = 1
...
u % K = 0
i % L = L-1
u % K = 1
i % L = 0
u % K = 1
i % L = 1
... ...
... ... ... ...
u % K = K-1
i % L = 0
... ...
u % K = K-1
i % L = L-1
item vectors
item%L=0
item vectors
item%L=1
item vectors
i % L = L-1
user vectors
u % K = 0
user vectors
u % K = 1
user vectors
u % K = K-1
all log entries
u % K = 1
i % L = 1
u % K = 0
u % K = 1
u % K = K-1
Figure via Erik Bernhardsson
Friday, May 9, 14
Implicit Matrix Factorization with Hadoop
20
One map task
Distributed
cache:
All user vectors
where u % K = x
Distributed
cache:
All item vectors
where i % L = y
Mapper Emit contributions
Map input:
tuples (u, i, count)
where
u % K = x
and
i % L = y
Reducer New vector!
Figure via Erik Bernhardsson
Friday, May 9, 14
Hadoop suffers from I/O overhead
21
IO Bottleneck
Friday, May 9, 14
Spark to the rescue!!
22
Vs
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Spark
Hadoop
Friday, May 9, 14
Section name 23
Friday, May 9, 14
First Attempt
24
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
First Attempt
25
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
node 2 node 3 node 4 node 5
Friday, May 9, 14
First Attempt
26
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
node 2 node 3 node 4 node 5
Friday, May 9, 14
First Attempt
27
Friday, May 9, 14
First Attempt
28
•Issues:
–Unnecessarily sending multiple copies of item vector to each node
–Unnecessarily shuffling data across cluster at each iteration
–Not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14
Second Attempt
29
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
30
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
31
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
32
Friday, May 9, 14
Second Attempt
33
Friday, May 9, 14
Second Attempt
34
•Issues:
–Still Unnecessarily shuffling data across cluster at each iteration
–Still not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14
So, what are we missing?...
35
•Partitioner: Defines how the elements in a key-value pair RDD are
partitioned across the cluster.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
So, what are we missing?...
36
•partitionBy(partitioner): Partitions all elements of the same key to the
same node in the cluster, as defined by the partitioner.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
So, what are we missing?...
37
•mapPartitions(func): Similar to map, but runs separately on each
partition (block) of the RDD, so func must be of type Iterator[T] =>
Iterator[U] when running on an RDD of type T.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
function()
function()
function()
Friday, May 9, 14
So, what are we missing?...
38
• persist(storageLevel): Set this RDD's storage level to persist (cache)
its values across operations after the first time it is computed.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
Third Attempt
39
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
40
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
41
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
42
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
43
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
44
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third attempt
45
Friday, May 9, 14
Third attempt
46
Friday, May 9, 14
Third attempt
47
Friday, May 9, 14
ALS Running Times
48
Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf
Friday, May 9, 14
Section name 49
Fin
Friday, May 9, 14
Section name 50
Friday, May 9, 14
Section name 51
Friday, May 9, 14
Section name 52
Friday, May 9, 14
Section name 53
Friday, May 9, 14
Section name 54
Friday, May 9, 14
Section name 55
Friday, May 9, 14
Section name 56
Friday, May 9, 14

More Related Content

What's hot

CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Alexandros Karatzoglou
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Esh Vckay
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Faisal Siddiqi
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsChris Johnson
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsErik Bernhardsson
 
Talk@rmit 09112017
Talk@rmit 09112017Talk@rmit 09112017
Talk@rmit 09112017Shuai Zhang
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
An introduction to Recommender Systems
An introduction to Recommender SystemsAn introduction to Recommender Systems
An introduction to Recommender SystemsDavid Zibriczky
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at SpotifyRohan Agrawal
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Anoop Deoras
 
Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Erik Bernhardsson
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016DataStax
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ SpotifyOscar Carlsson
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at SpotifyNeville Li
 

What's hot (20)

CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music Recommendations
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Talk@rmit 09112017
Talk@rmit 09112017Talk@rmit 09112017
Talk@rmit 09112017
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
An introduction to Recommender Systems
An introduction to Recommender SystemsAn introduction to Recommender Systems
An introduction to Recommender Systems
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at Spotify
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
 
Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ Spotify
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at Spotify
 

Viewers also liked

Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2njit-ronbrown
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization MachinesPavel Kalaidin
 
Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Domonkos Tikk
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorizationrubyyc
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFMLiangjie Hong
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix FactorizationTatsuya Yokota
 
Matrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsMatrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsAladejubelo Oluwashina
 
آموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومآموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومfaradars
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringDKALab
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBenjamin Bengfort
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender SystemsT212
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineNYC Predictive Analytics
 

Viewers also liked (15)

Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization Machines
 
Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorization
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFM
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix Factorization
 
Matrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsMatrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender Systems
 
آموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومآموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دوم
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative Filtering
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix Factorization
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
 

Similar to Collaborative Filtering with Spark

Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Sc Huang
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
IntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfIntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfAlphaIssaghaDiallo
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part IIQuantUniversity
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Final Presentation - Edan&Itzik
Final Presentation - Edan&ItzikFinal Presentation - Edan&Itzik
Final Presentation - Edan&Itzikitzik cohen
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network Cigdem Aslay
 
Music Recommender Systems
Music Recommender SystemsMusic Recommender Systems
Music Recommender Systemsfuchaoqun
 
Recent advances in deep recommender systems
Recent advances in deep recommender systemsRecent advances in deep recommender systems
Recent advances in deep recommender systemsNAVER Engineering
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptxJasonTuran2
 
Download
DownloadDownload
Downloadbutest
 
Download
DownloadDownload
Downloadbutest
 
Ire presentation
Ire presentationIre presentation
Ire presentationRaj Patel
 
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINEUnderstanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINERaj Patel
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 

Similar to Collaborative Filtering with Spark (20)

Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
IntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfIntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdf
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part II
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Matrix Factorization
Matrix FactorizationMatrix Factorization
Matrix Factorization
 
Final Presentation - Edan&Itzik
Final Presentation - Edan&ItzikFinal Presentation - Edan&Itzik
Final Presentation - Edan&Itzik
 
Project Matsu
Project MatsuProject Matsu
Project Matsu
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network
 
Music Recommender Systems
Music Recommender SystemsMusic Recommender Systems
Music Recommender Systems
 
Recent advances in deep recommender systems
Recent advances in deep recommender systemsRecent advances in deep recommender systems
Recent advances in deep recommender systems
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
Download
DownloadDownload
Download
 
Download
DownloadDownload
Download
 
Ire presentation
Ire presentationIre presentation
Ire presentation
 
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINEUnderstanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and Hadoop
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
L04.pdf
L04.pdfL04.pdf
L04.pdf
 

Recently uploaded

Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024Mind IT Systems
 
Deep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampDeep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampVICTOR MAESTRE RAMIREZ
 
Webinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptWebinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptkinjal48
 
Watermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesWatermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesShyamsundar Das
 
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfTobias Schneck
 
AI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyAI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyRaymond Okyere-Forson
 
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...OnePlan Solutions
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.Sharon Liu
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?AmeliaSmith90
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies
 
ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxAutus Cyber Tech
 
Fields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxFields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxJoão Esperancinha
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeNeo4j
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorShane Coughlan
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesSoftwareMill
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilVICTOR MAESTRE RAMIREZ
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmonyelliciumsolutionspun
 
Streamlining Your Application Builds with Cloud Native Buildpacks
Streamlining Your Application Builds  with Cloud Native BuildpacksStreamlining Your Application Builds  with Cloud Native Buildpacks
Streamlining Your Application Builds with Cloud Native BuildpacksVish Abrams
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Jaydeep Chhasatia
 

Recently uploaded (20)

Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024
 
Deep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampDeep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - Datacamp
 
Webinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptWebinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.ppt
 
Watermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesWatermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security Challenges
 
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
 
AI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyAI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human Beauty
 
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in Trivandrum
 
ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptx
 
Fields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxFields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptx
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG time
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS Calculator
 
Salesforce AI Associate Certification.pptx
Salesforce AI Associate Certification.pptxSalesforce AI Associate Certification.pptx
Salesforce AI Associate Certification.pptx
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retries
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-Council
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
 
Streamlining Your Application Builds with Cloud Native Buildpacks
Streamlining Your Application Builds  with Cloud Native BuildpacksStreamlining Your Application Builds  with Cloud Native Buildpacks
Streamlining Your Application Builds with Cloud Native Buildpacks
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
 

Collaborative Filtering with Spark

  • 1. May 9, 2014 Collaborative Filtering with Spark Chris Johnson @MrChrisJohnson Friday, May 9, 14
  • 2. Who am I?? •Chris Johnson – Machine Learning guy from NYC – Focused on music recommendations – Formerly a graduate student at UT Austin Friday, May 9, 14
  • 3. 3 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
  • 4. 4 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
  • 5. Collaborative Filtering - “The Netflix Prize” 5 Friday, May 9, 14
  • 6. Collaborative Filtering 6 Hey, I like tracks P, Q, R, S! Well, I like tracks Q, R, S, T! Then you should check out track P! Nice! Btw try track T! Image via Erik Bernhardsson Friday, May 9, 14
  • 7. 7 Collaborative Filtering at Spotify • Discover (personalized recommendations) • Radio • Related Artists • Now Playing Friday, May 9, 14
  • 9. Explicit Matrix Factorization 9 Movies Users Chris Inception •Users explicitly rate a subset of the movie catalog •Goal: predict how users will rate new movies Friday, May 9, 14
  • 10. • = bias for user • = bias for item • = regularization parameter Explicit Matrix Factorization 10 Chris Inception ? 3 5 ? 1 ? ? 1 2 ? 3 2 ? ? ? 5 5 2 ? 4 •Approximate ratings matrix by the product of low- dimensional user and movie matrices •Minimize RMSE (root mean squared error) • = user rating for movie • = user latent factor vector • = item latent factor vector X Y Friday, May 9, 14
  • 11. Implicit Matrix Factorization 11 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 •Replace Stream counts with binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a function of stream counts as weights • = bias for user • = bias for item • = regularization parameter • = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector X Y Friday, May 9, 14
  • 12. Alternating Least Squares 12 • Initialize user and item vectors to random noise • Fix item vectors and solve for optimal user vectors – Take the derivative of loss function with respect to user’s vector, set equal to 0, and solve – Results in a system of linear equations with closed form solution! • Fix user vectors and solve for optimal item vectors • Repeat until convergence code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 13. Alternating Least Squares 13 • Note that: • Then, we can pre-compute once per iteration – and only contain non-zero elements for tracks that the user streamed – Using sparse matrix operations we can then compute each user’s vector efficiently in time where is the number of tracks the user streamed code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 14. 14 Alternating Least Squares code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 16. Scaling up Implicit Matrix Factorization with Hadoop 16 Friday, May 9, 14
  • 17. Hadoop at Spotify 2009 17 Friday, May 9, 14
  • 18. Hadoop at Spotify 2014 18 700 Nodes in our London data center Friday, May 9, 14
  • 19. Implicit Matrix Factorization with Hadoop 19 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1 Figure via Erik Bernhardsson Friday, May 9, 14
  • 20. Implicit Matrix Factorization with Hadoop 20 One map task Distributed cache: All user vectors where u % K = x Distributed cache: All item vectors where i % L = y Mapper Emit contributions Map input: tuples (u, i, count) where u % K = x and i % L = y Reducer New vector! Figure via Erik Bernhardsson Friday, May 9, 14
  • 21. Hadoop suffers from I/O overhead 21 IO Bottleneck Friday, May 9, 14
  • 22. Spark to the rescue!! 22 Vs http://www.slideshare.net/Hadoop_Summit/spark-and-shark Spark Hadoop Friday, May 9, 14
  • 24. First Attempt 24 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 25. First Attempt 25 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
  • 26. First Attempt 26 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
  • 28. First Attempt 28 •Issues: –Unnecessarily sending multiple copies of item vector to each node –Unnecessarily shuffling data across cluster at each iteration –Not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
  • 29. Second Attempt 29 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 30. Second Attempt 30 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 31. Second Attempt 31 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 34. Second Attempt 34 •Issues: –Still Unnecessarily shuffling data across cluster at each iteration –Still not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
  • 35. So, what are we missing?... 35 •Partitioner: Defines how the elements in a key-value pair RDD are partitioned across the cluster. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 36. So, what are we missing?... 36 •partitionBy(partitioner): Partitions all elements of the same key to the same node in the cluster, as defined by the partitioner. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 37. So, what are we missing?... 37 •mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 function() function() function() Friday, May 9, 14
  • 38. So, what are we missing?... 38 • persist(storageLevel): Set this RDD's storage level to persist (cache) its values across operations after the first time it is computed. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 39. Third Attempt 39 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 40. Third Attempt 40 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 41. Third Attempt 41 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 42. Third Attempt 42 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 43. Third Attempt 43 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 44. Third Attempt 44 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 48. ALS Running Times 48 Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf Friday, May 9, 14