General Sequence Learning using Recurrent Neural Networks
How ML
-0.15, 0.2, 0, 1.5 → Numerical, great!
A, B, C, D → Categorical, great!
The cat sat on the mat. → Uhhh…….
How text is dealt with (ML perspective)
Text → Features (BoW, TF-IDF, LSA, etc.) → Linear Model (SVM, softmax)
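For concreteness, a minimal sketch of this classic pipeline with scikit-learn (the toy data and variable names are ours, not the slides'):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["a useless product", "a fantastic movie"]  # toy corpus
train_labels = [0, 1]                                     # negative, positive

# TF-IDF features feeding a linear SVM
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
print(model.predict(["fantastic, not useless"]))
```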
Structure is important!
The cat sat on the mat.
sat the on mat cat the
● For certain tasks, structure is essential:
○ Humor
○ Sarcasm
● For certain tasks, n-grams can get you a long way:
○ Sentiment analysis
○ Topic detection
● Specific words can be strong indicators:
○ useless, fantastic (sentiment)
○ hoop, green tea, NASDAQ (topic)
Structure is hard
N-grams are the typical way of preserving some structure:
the cat · cat sat · sat on · on the · the mat
Beyond bi- or tri-grams, occurrences become very rare and dimensionality becomes huge (1-10 million+ features), as the sketch below illustrates.
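A small scikit-learn sketch of the blowup (our illustration): counting unigram-plus-bigram features.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the rug"]  # toy corpus
vec = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vec.fit_transform(corpus)
# On real corpora this vocabulary easily reaches millions of mostly-rare features.
print(len(vec.vocabulary_))
```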
How text is dealt with (ML perspective)
Text → Features (BoW, TF-IDF, LSA, etc.) → Linear Model (SVM, softmax)
How text should be dealt with?
Text → RNN → Linear Model (SVM, softmax)
How an RNN works
the cat sat on the mat
The diagram builds up piece by piece over several slides:
● input to hidden: each word's vector is projected into the hidden layer
● hidden to hidden: the hidden state is carried forward from step to step
● activities (vectors of values) flow along the arrows; projections (activities x weights) connect them
● the final hidden state is a learned representation of the sequence
● hidden to output: that representation is projected to the output (shown in the diagram as: cat)
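Putting those pieces together, a minimal NumPy sketch of the forward pass just described (shapes and names are illustrative, not from the slides):

```python
import numpy as np

def rnn_forward(indices, E, W_xh, W_hh, W_hy):
    """Run a simple RNN over a token-index sequence and return the output."""
    h = np.zeros(W_hh.shape[0])
    for i in indices:
        x = E[i]                           # embedding lookup (activities)
        h = np.tanh(x @ W_xh + h @ W_hh)   # input-to-hidden + hidden-to-hidden
    return h @ W_hy                        # hidden-to-output

# Tiny random example: vocab of 6, 3-dim embeddings, 4-dim hidden state.
rng = np.random.RandomState(0)
E, W_xh, W_hh, W_hy = (rng.randn(6, 3), rng.randn(3, 4),
                       rng.randn(4, 4), rng.randn(4, 6))
print(rnn_forward([0, 1, 2, 3, 0, 4], E, W_xh, W_hh, W_hy))
```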
From text to RNN input
String input: “The cat sat on the mat.”
Tokenize: the cat sat on the mat .
Assign index: 0 1 2 3 0 4 5
Embedding lookup into a learned matrix (one row per vocabulary entry):
the → 2.5 0.3 -1.2
cat → 0.2 -3.3 0.7
sat → -4.1 1.6 2.8
on → 1.1 5.7 -0.2
mat → 1.4 0.6 -3.9
. → -3.8 1.5 0.1
Each token index selects its row, so the RNN sees the sequence of rows for the, cat, sat, on, the, mat, .
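In code, the lookup is just row indexing into the learned matrix (values taken from the slide):

```python
import numpy as np

# Learned matrix: one 3-dim row per vocabulary entry (the, cat, sat, on, mat, .)
E = np.array([[ 2.5,  0.3, -1.2],
              [ 0.2, -3.3,  0.7],
              [-4.1,  1.6,  2.8],
              [ 1.1,  5.7, -0.2],
              [ 1.4,  0.6, -3.9],
              [-3.8,  1.5,  0.1]])

indices = [0, 1, 2, 3, 0, 4, 5]  # "the cat sat on the mat ."
inputs = E[indices]              # 7 x 3: one embedding row per token
print(inputs)
```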
You can stack them too
the cat sat on the mat
[Same diagram with a second recurrent layer: input to hidden, hidden to hidden within each layer, hidden to output (cat) from the top layer.]
But aren’t RNNs unstable?
Simple RNNs trained with SGD are unstable and difficult to train.
But modern RNNs with various tricks blow up much less often!
● Gating Units
● Gradient Clipping
● Steeper gates
● Better initialization
● Better optimizers
● Bigger datasets
Simple Recurrent Unit
[Diagram: the unit unrolled over two time steps. h_{t-1} and x_t combine through element-wise addition and an activation function to produce h_t; h_t and x_{t+1} likewise produce h_{t+1}.]
[Legend markers: element-wise addition (+), activation functions, routes information can propagate along, and elements involved in modifying information flow and values.]
Gated Recurrent Unit - GRU
[Diagram: one GRU step. x_t and h_{t-1} feed a reset gate r and an update gate z; a candidate state h̃ is formed from x_t and the reset-gated h_{t-1}, and the new state h_t blends h_{t-1} and h̃ with weights z and 1-z.]
[Legend markers: element-wise addition (+), element-wise multiplication (⊙), routes information can propagate along, and elements involved in modifying information flow and values.]
Gated Recurrent Unit - GRU
[Diagram: the same unit unrolled over two time steps, producing h_t from x_t and h_{t-1}, then h_{t+1} from x_{t+1} and h_t. A sketch of one step in code follows.]
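One GRU step in NumPy, in a common formulation (biases omitted for brevity; which of z and 1-z multiplies the old state varies between papers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(x_t @ Wz + h_prev @ Uz)               # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur)               # reset gate
    h_tilde = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)   # candidate state (~ in the diagram)
    return z * h_prev + (1.0 - z) * h_tilde           # blend old state and candidate
```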
Gating is important
For sentiment analysis of longer sequences of text (a paragraph or so), a simple RNN has difficulty learning at all, while a gated RNN does so easily.
Which One?
There are two types of gated RNNs:
● Gated Recurrent Units (GRU) by K. Cho, recently introduced and used for machine translation and speech recognition tasks.
● Long short-term memory (LSTM) by S. Hochreiter and J. Schmidhuber, which has been around since 1997 and has been used far more. Various modifications to it exist.
Which One?
GRU is simpler, faster, and optimizes quicker (at least on sentiment). Because it only has two gates (compared to four), it is approximately 1.5-1.75x faster in a Theano implementation.
If you have a huge dataset and don’t mind waiting, LSTM may be better in the long run due to its greater complexity, especially if you add peephole connections.
Exploding Gradients?
Exploding gradients are a major problem for traditional RNNs trained with SGD, and one source of RNNs’ reputation for being hard to train.
In 2012, R. Pascanu and T. Mikolov proposed clipping the norm of the gradient to alleviate this; a sketch follows.
Modern optimizers don’t seem to have this problem - at least for text classification.
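A sketch of norm clipping in NumPy (the threshold value is illustrative):

```python
import numpy as np

def clip_grad_norm(grads, threshold=5.0):
    """Rescale all gradients if their global L2 norm exceeds the threshold."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```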
Better Gating Functions
Interesting paper at a NIPS workshop (Q. Lyu, J. Zhu): make the gates “steeper” so they change more rapidly from “off” to “on”, so the model learns to use them quicker.
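One simple way to get a steeper gate (our illustration; the paper's exact formulation may differ) is to scale the sigmoid's pre-activation:

```python
import numpy as np

def steep_sigmoid(x, slope=3.75):
    # slope > 1 makes the gate switch from "off" to "on" more sharply;
    # the value 3.75 is illustrative, not taken from the paper.
    return 1.0 / (1.0 + np.exp(-slope * x))
```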
Better Initialization
Andrew Saxe showed last year that initializing weight matrices with random orthogonal matrices works better than random Gaussian (or uniform) matrices.
In addition, Richard Socher (and more recently Quoc Le) have used identity initialization schemes, which work great as well.
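Sketches of both schemes in NumPy (shapes are illustrative):

```python
import numpy as np

def orthogonal_init(shape):
    """Saxe-style init: the orthonormal factor of a random Gaussian matrix."""
    a = np.random.randn(*shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else vt

W_recurrent = orthogonal_init((512, 512))  # random orthogonal
W_identity = np.eye(512)                   # identity init (Socher / Le)
```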
Understanding Optimizers
[Figure: the 2D moons dataset, courtesy of scikit-learn.]
Comparing Optimizers
Adam (D. Kingma) combines the early optimization speed of Adagrad (J. Duchi) with the better later convergence of various other methods like Adadelta (M. Zeiler) and RMSprop (T. Tieleman).
Warning: generalization performance of Adam seems slightly worse for smaller datasets.
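For reference, one Adam update with the paper's default hyperparameters (a sketch, not Passage's implementation):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Apply one Adam update to weights w given gradient g at step t (t >= 1)."""
    m = b1 * m + (1 - b1) * g            # biased first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2       # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```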
It adds up
Up to 10x more efficient training once you add all the tricks together, compared to a naive implementation - much more stable, rarely diverging.
Around 7.5x faster in wall-clock time, since the various tricks add a bit of computation time.
Too much? - Overfitting
RNNs can overfit very well, as we will see. As they continue to fit the training dataset, their performance on test data will plateau or even worsen.
Keep track of it using a validation set: save the model at each iteration over the training data and pick the earliest, best validation performance (sketched below).
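A minimal early-stopping loop; fit_epoch, score, and save are hypothetical hooks standing in for whatever your training code provides, not Passage API:

```python
# Assumes a model object with fit/score hooks and train/validation splits.
n_epochs = 10
best_score, best_epoch = float("-inf"), 0
for epoch in range(n_epochs):
    model.fit_epoch(train_X, train_Y)            # one pass over training data
    score = model.score(valid_X, valid_Y)        # validation performance
    save(model, "model_%d.pkl" % epoch)          # snapshot every epoch
    if score > best_score:
        best_score, best_epoch = score, epoch
# Afterwards, reload the earliest best checkpoint: "model_%d.pkl" % best_epoch
```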
The Showdown
Model #1 vs Model #2:
● Linear model: using bigrams and grid search on min_df for the vectorizer and the regularization coefficient for the model.
● RNN: a 512-dim embedding feeding a 512-dim hidden state feeding the output. Using whatever I tried that worked :) Adam, GRU, steeper sigmoid gates, and ortho/identity init are good defaults.
Sentiment & Helpfulness
Effect of Dataset Size
● RNNs have poor generalization properties on small datasets.
○ With 1K labeled examples, 25-50% worse than the linear model…
● RNNs have better generalization properties on large datasets.
○ With 1M labeled examples, 0-30% better than the linear model.
● The crossover happens somewhere between 10K and 1M examples.
○ It depends on the dataset.
The Thing we don’t talk about
For 1 million paragraph-sized text examples to converge:
● The linear model takes 30 minutes on a single CPU core.
● The RNN takes 90 minutes on a Titan X.
● The RNN takes five days on a single CPU core.
The RNN is about 250x slower on CPU than the linear model…
This is why we use GPUs.
Visualizing representations of words learned via sentiment
[t-SNE (L.J.P. van der Maaten) plot: individual words colored by average sentiment.]
Negative
Positive
Model learns to separate negative and positive words, not too surprising
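A sketch of how such a picture is made; E (the learned embedding matrix) and word_sentiment (per-word average sentiment scores) are assumed inputs with names of our choosing:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# E: vocab_size x dim learned embeddings; word_sentiment: vocab_size scores
coords = TSNE(n_components=2).fit_transform(E)
plt.scatter(coords[:, 0], coords[:, 1], c=word_sentiment, cmap="coolwarm")
plt.show()
```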
[Clusters visible in the plot: Qualifiers, Quantities of Time, Product nouns, Punctuation.]
Much cooler: the model also begins to learn components of language from only binary sentiment labels.
The library - Passage
● Tiny RNN library built on top of Theano
● https://github.com/IndicoDataSolutions/Passage
● Still alpha - we’re working on it!
● Supports simple, LSTM, and GRU recurrent layers
● Supports multiple recurrent layers
● Supports deep input to and deep output from hidden layers
○ no deep transitions currently
● Supports embedding and one-hot input representations
● Can be used for both regression and classification problems
○ Regression needs preprocessing for stability - working on it
● Much more in the pipeline
An example
Sentiment analysis of movie reviews - 25K labeled examples
The walkthrough builds up the script step by step:
● RNN imports
● preprocessing
● load training data
● tokenize data
● configure model
● make and train model
● load test data
● predict on test data
A sketch of these steps follows.
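A sketch in the spirit of Passage's README at the time; the library is alpha, so exact module paths and argument names may differ, and the load_data helper here is a hypothetical stand-in:

```python
# RNN imports
from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GRU, Dense
from passage.models import RNN

# load training data (load_data is a hypothetical helper, not Passage API)
train_text, train_labels = load_data("train.csv")

# preprocessing / tokenize data
tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)

# configure model: embedding -> GRU -> sigmoid output
layers = [
    Embedding(size=128, n_features=tokenizer.n_features),
    GRU(size=128),
    Dense(size=1, activation="sigmoid"),
]

# make and train model
model = RNN(layers=layers, cost="BinaryCrossEntropy")
model.fit(train_tokens, train_labels)

# load test data, predict on test data
test_text, _ = load_data("test.csv")
predictions = model.predict(tokenizer.transform(test_text))
```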
The results
Top 10! - barely :)
Summary
● RNNs look to be a competitive tool in certain situations
for text analysis.
● Especially if you have a large 1M+ example dataset
○ A GPU or great patience is essential
● Otherwise it can be difficult to justify over linear models
○ Speed
○ Complexity
○ Poor generalization with small datasets
Contact
alec@indico.io