GBM PACKAGE IN R: A GUIDE TO GRADIENT BOOSTED MACHINES
7/24/2014
Presentation Outline
• Algorithm Overview
• Basics
• How it solves problems
• Why to use it
• Deeper investigation while going through live code
What is GBM?
• Predictive modeling algorithm
• Classification & Regression
• Decision tree as a basis*
• Boosted
• Multiple weak models combined algorithmically
• Gradient boosted
• Iteratively fits new trees to the residuals of the current model (see the sketch after this slide)
• Stochastic
(some additional references on last slide)
* Technically, GBM can boost other base learners (e.g. linear terms), but decision trees are the dominant usage; Friedman's formulation was optimized for trees, and R's implementation represents each base learner internally as a tree
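To make the "iteratively fits residuals" idea concrete, here is a minimal hand-rolled sketch of boosting for squared loss, using rpart as the weak learner purely for illustration; the data, names, and settings below are invented, and the real gbm package additionally uses loss-specific gradients, subsampling, and terminal-node estimates.
## Hand-rolled boosting on residuals (illustrative sketch only, not the gbm internals)
library(rpart)
set.seed(42)
n  <- 500
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- sin(6 * df$x1) + 2 * df$x2 + rnorm(n, sd = 0.2)
n_trees   <- 100
shrinkage <- 0.1
pred      <- rep(mean(df$y), n)          ## start from the mean (the "intercept")
for (m in seq_len(n_trees)) {
  df$res <- df$y - pred                  ## residuals = negative gradient of squared loss
  tree   <- rpart(res ~ x1 + x2, data = df,
                  control = rpart.control(maxdepth = 2, cp = 0))
  pred   <- pred + shrinkage * predict(tree, df)   ## take a small step toward the residuals
}
sqrt(mean((df$y - pred)^2))              ## training RMSE falls as trees are added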
Predictive Modeling Landscape:
General Purpose Algorithms
(for illustrative purposes only; not to scale, precise, or comprehensive; author's perspective)
• Linear Models: Linear Models (lm), Generalized Linear Models (glm), Regularized Linear Models (glmnet)
• Decision Trees: Classification and Regression Trees (rpart), Random Forest (randomForest), Gradient Boosted Machines (gbm)
• Others: Nearest Neighbor (kNN), Neural Networks (nnet), Support Vector Machines (kernlab), Naïve Bayes (klaR), Splines (earth)
(the original slide arranges these along a rough complexity axis)
More Comprehensive List: http://caret.r-forge.r-project.org/modelList.html
GBM’s decision tree structure
Why GBM?
• Characteristics
• Competitive Performance
• Robust
• Loss functions
• Fast (relatively)
• Usages
• Quick modeling
• Variable selection
• Final-stage precision modeling
Competitive Performance
• Competitive with high-end algorithms such as
RandomForest
• Reliable performance
• Avoids nonsensical predictions
• Rare to produce worse predictions than simpler models
• Often in winning Kaggle solutions
• Cited within winning solution descriptions in numerous
competitions, including $3M competition
• Many of the highest ranked competitors use it frequently
• Used in 4 of 5 personal top 20 finishes
Robust
• Explicitly handles NAs (see the sketch after this list)
• Scaling/normalization is unnecessary
• Handles more factor levels than random forest (1024 vs. 32)
• Handles perfectly correlated independent variables
• No [known] limit to number of independent variables
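As a sketch of this robustness (the data and column names below are fabricated for illustration), gbm.fit accepts missing values and unscaled, mixed-type predictors directly; the MissingNode column of pretty.gbm.tree shows where NA rows are routed.
## Illustrative only: NAs, unscaled inputs, and a moderately wide factor handled directly
library(gbm)
set.seed(2)
n <- 200
x <- data.frame(
  size  = c(rnorm(n - 20, mean = 50, sd = 10), rep(NA, 20)),   ## contains missing values
  price = runif(n, 0, 1e6),                                    ## very different scale; no normalization
  grp   = factor(sample(letters[1:15], n, replace = TRUE))     ## factor with many levels
)
y <- ifelse(is.na(x$size), 40, x$size) + x$price / 1e5 + rnorm(n)
fit <- gbm.fit(x = x, y = y, distribution = "gaussian",
               n.trees = 100, interaction.depth = 2,
               shrinkage = 0.1, n.minobsinnode = 10, verbose = FALSE)
pretty.gbm.tree(fit, 1)   ## the MissingNode column shows where NA rows go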
Loss Functions
• Gaussian: squared loss
• Laplace: absolute loss
• Bernoulli: logistic, for 0/1
• Huberized: hinge, for 0/1
• Adaboost: exponential loss, for 0/1
• Multinomial: more than one class (produces probability matrix)
• Quantile: flexible alpha (e.g. optimize for a ~2-standard-deviation threshold); see the sketch after this list
• Poisson: Poisson distribution, for counts
• CoxPH: Cox proportional hazard, for right-censored
• Tdist: t-distribution loss
• Pairwise: rankings (e.g. search result scoring)
• Concordant pairs
• Mean reciprocal rank
• Mean average precision
• Normalized discounted cumulative gain
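As a sketch of how these are selected in practice (illustrative calls on fabricated data), the distribution argument takes the loss name as a string, or a list for parameterized losses such as quantile:
## Illustrative only: selecting loss functions via the distribution argument
library(gbm)
set.seed(3)
d <- data.frame(x = runif(300))
d$y_num <- 2 * d$x + rnorm(300)                   ## continuous target
d$y_bin <- rbinom(300, 1, plogis(4 * d$x - 2))    ## 0/1 target
m_l2  <- gbm(y_num ~ x, data = d, distribution = "gaussian",  n.trees = 200, shrinkage = 0.05)  ## squared loss
m_l1  <- gbm(y_num ~ x, data = d, distribution = "laplace",   n.trees = 200, shrinkage = 0.05)  ## absolute loss
m_bin <- gbm(y_bin ~ x, data = d, distribution = "bernoulli", n.trees = 200, shrinkage = 0.05)  ## logistic, 0/1
m_q   <- gbm(y_num ~ x, data = d,
             distribution = list(name = "quantile", alpha = 0.975),   ## flexible alpha
             n.trees = 200, shrinkage = 0.05)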
Drawbacks
• Several hyper-parameters to tune
• I typically use roughly the same parameters to start, unless I
suspect the data set might have peculiar characteristics
• For creating a final model, tuning several parameters is advisable
• Still has capacity to overfit
• Despite internal cross-validation, it is still particularly prone to
overfit ID-like columns (suggestion: withhold them)
• Can have trouble with highly noisy data
• Black box
• However, the GBM package does provide tools to analyze the resulting models
Deeper Analysis via Walkthrough
• Hyper-parameter explanations (some, not all)
• Quickly analyze performance
• Analyze influence of variables
• Peek under the hood…then follow a toy problem
For those not attending the presentation: the code at the back of this deck was run and discussed at this point. The remaining four slides mainly supplemented the discussion of the code and its comments, and there was not sufficient time to cover them in full.
Same analysis with a simpler data set
Note that one can recreate the predictions of this first tree by finding the terminal node for any observation and using the Prediction value (the final column in the data frame). Those values across all desired trees, plus the initial value (the mean in this case), sum to the prediction.
Matches predictions 1 & 3
Matches predictions 2,4 & 5
Same analysis with a simpler data set
Explanation
1 tree built.
The tree has only one decision, at node 0.
Node 0 split on the 3rd field (SplitVar: 2, zero-indexed): values below 1.5 (ordered levels 0 & 1, i.e. a & b) went to node 1; values above 1.5 (levels 2 & 3, i.e. c & d) went to node 2; missing values (none here) go to node 3.
Node 1 (X3 = a/b) is a terminal node (SplitVar -1) and predicts the mean plus -0.925.
Node 2 (X3 = c/d) is a terminal node and predicts the mean plus 1.01.
Node 3 (missing) is a terminal node and effectively predicts the mean plus 0.
Later saw that gbm1$initF shows the intercept, which in this case is the mean.
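A rough reconstruction of this toy setup is sketched below; the data here is fabricated (only the structure, a single informative 4-level factor and one tree, mirrors the slide), so the exact node values will differ.
## Fabricated toy data: one informative factor, one tree, shrinkage 1 so node values are easy to read
library(gbm)
set.seed(1)
toy <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = factor(sample(c("a", "b", "c", "d"), 100, replace = TRUE))
)
toy$y <- ifelse(toy$x3 %in% c("c", "d"), 1, -1) + rnorm(100, sd = 0.1)
gbm1 <- gbm(y ~ ., data = toy, distribution = "gaussian",
            n.trees = 1, interaction.depth = 1, shrinkage = 1,
            bag.fraction = 1, n.minobsinnode = 10)
pretty.gbm.tree(gbm1, 1)                ## SplitVar 2 is the third predictor (x3), zero-indexed
gbm1$initF                              ## the intercept; equals mean(toy$y) for gaussian
head(predict(gbm1, toy, n.trees = 1))   ## initF + the terminal-node Prediction value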
gbm: fitting a GBM to data (function signature with defaults; a usage sketch follows)
gbm(formula = formula(data),
    distribution = "bernoulli",
    n.trees = 100,
    interaction.depth = 1,
    n.minobsinnode = 10,
    shrinkage = 0.001,
    bag.fraction = 0.5,
    train.fraction = 1.0,
    cv.folds = 0,
    weights,
    data = list(),
    var.monotone = NULL,
    keep.data = TRUE,
    verbose = "CV",
    class.stratify.cv = NULL,
    n.cores = NULL)
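A hedged usage sketch of this interface on fabricated data, with cross-validation enabled so gbm.perf can pick the number of trees:
## Illustrative only: formula interface with built-in CV, then select n.trees by CV error
library(gbm)
set.seed(7)
d <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000))
d$y <- 3 * d$x1 + sin(2 * pi * d$x2) + rnorm(1000, sd = 0.3)
fit <- gbm(y ~ ., data = d,
           distribution = "gaussian",
           n.trees = 1000,
           interaction.depth = 3,
           n.minobsinnode = 10,
           shrinkage = 0.01,
           bag.fraction = 0.5,
           cv.folds = 5,        ## adds cross-validated error curves
           n.cores = 1)
best.iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)   ## tree count minimizing CV error
preds <- predict(fit, newdata = d, n.trees = best.iter)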
Effect of shrinkage & trees
Source: https://www.youtube.com/watch?v=IXZKgIsZRm0 (GBM explanation by the author of scikit-learn's gradient boosting implementation)
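A small sketch (fabricated data) of the tradeoff the figure illustrates: lower shrinkage generally needs more trees to reach the same held-out error, which is why the two are tuned together.
## Illustrative only: compare held-out error for two shrinkage values using a validation fraction
library(gbm)
set.seed(8)
d <- data.frame(x1 = runif(2000), x2 = runif(2000))
d$y <- sin(6 * d$x1) + 2 * d$x2 + rnorm(2000, sd = 0.3)
for (s in c(0.1, 0.01)) {
  fit  <- gbm(y ~ ., data = d, distribution = "gaussian",
              n.trees = 1000, interaction.depth = 2,
              shrinkage = s, train.fraction = 0.7)    ## last 30% of rows used as validation
  best <- gbm.perf(fit, method = "test", plot.it = FALSE)
  cat("shrinkage =", s, "-> best n.trees =", best,
      " validation error =", round(fit$valid.error[best], 4), "\n")
}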
Code Dump
• The code has been copied from a text R script into PowerPoint, so
the format isn’t great, but it should look OK if copying and pasting
back out to a text file. If not, here it is on Github.
• The code shown uses a competition data set that is comparable to
real world data and uses a simple GBM to predict sale prices of
construction equipment at auction.
• A GBM model was fit against 100k rows with 45-50 variables in about 2-4 minutes during the presentation. It improves RMSE from ~24.5k (predicting the overall mean) to ~9.7k when scored on data the model had not seen (and on future dates, so the 100k/50k train/test split should be valid), with fairly stable train:test performance.
• After predictions are made and scored, some GBM utilities are used
to see which variables the model found most influential, see how the
top 2 variables are used (per factor for one; throughout a continuous
distribution for the other), and see interaction effects of specific
variable pairs.
• Note: GBM was used by my teammate and me to finish 12th out of 476 in this competition (albeit with a complex ensemble of GBMs)
Code Dump: Page1
library(Metrics) ##load evaluation package
setwd("C:/Users/Mark_Landry/Documents/K/dozer/")
##Done in advance to speed up loading of data set
train<-read.csv("Train.csv")
## Kaggle data set: http://www.kaggle.com/c/bluebook-for-bulldozers/data
train$saleTransform<-strptime(train$saledate,"%m/%d/%Y %H:%M")
train<-train[order(train$saleTransform),]
save(train,file="rTrain.Rdata")
load("rTrain.Rdata")
xTrain<-train[(nrow(train)-149999):(nrow(train)-50000),5:ncol(train)]
xTest<-train[(nrow(train)-49999):nrow(train),5:ncol(train)]
yTrain<-train[(nrow(train)-149999):(nrow(train)-50000),2]
yTest<-train[(nrow(train)-49999):nrow(train),2]
dim(xTrain); dim(xTest)
sapply(xTrain,function(x) length(levels(x)))
## check levels; gbm is robust, but still has a limit of 1024 per factor; for the initial model, remove the highest-cardinality factors
## after iterating through the model, would want to go back and compress these factors to investigate
## their usefulness (or other information analysis)
xTrain$saledate<-NULL; xTest$saledate<-NULL
xTrain$fiModelDesc<-NULL; xTest$fiModelDesc<-NULL
xTrain$fiBaseModel<-NULL; xTest$fiBaseModel<-NULL
xTrain$saleTransform<-NULL; xTest$saleTransform<-NULL
Code Dump: Page2
library(gbm)
## Set up parameters to pass in; there are many more hyper-parameters available, but these are the most common to control
GBM_NTREES = 400
## 400 trees in the model; can scale back later for predictions, if desired or overfitting is suspected
GBM_SHRINKAGE = 0.05
## shrinkage is a regularization parameter dictating how fast/aggressive the algorithm moves across the loss gradient
## 0.05 is somewhat aggressive; default is 0.001, values below 0.1 tend to produce good results
## decreasing shrinkage generally improves results, but requires more trees, so the two should be adjusted in tandem
GBM_DEPTH = 4
## depth 4 means each tree will evaluate four decisions;
## will always yield [3*depth + 1] nodes and [2*depth + 1] terminal nodes (depth 4 = 9)
## because each decision yields 3 nodes, but each decision will come from a prior node
GBM_MINOBS = 30
## regularization parameter to dictate how many observations must be present to yield a terminal node
## higher number means more conservative fit; 30 is fairly high, but good for exploratory fits; default is 10
## Fit model
g<-gbm.fit(x=xTrain,y=yTrain,distribution = "gaussian",n.trees = GBM_NTREES,shrinkage = GBM_SHRINKAGE,
interaction.depth = GBM_DEPTH,n.minobsinnode = GBM_MINOBS)
## gbm fit; provide all remaining independent variables in xTrain; provide targets as yTrain;
## gaussian distribution will optimize squared loss;
Code Dump: Page3
## get predictions; first on train set, then on unseen test data
tP1 <- predict.gbm(object = g,newdata = xTrain,GBM_NTREES)
hP1 <- predict.gbm(object = g,newdata = xTest,GBM_NTREES)
## compare model performance to default (overall mean)
rmse(yTrain,tP1) ## 9452.742 on data used for training
rmse(yTest,hP1) ## 9740.559 ~3% drop on unseen data; does not seem to be overfit
rmse(yTest,mean(yTrain)) ## 24481.08 overall mean; cut error rate (from perfection) by 60%
## look at variables
summary(g) ## summary will plot and then show the relative influence of each variable to the entire GBM model (all trees)
## test dominant variable mean
library(sqldf)
trainProdClass<-as.data.frame(cbind(as.character(xTrain$fiProductClassDesc),yTrain))
testProdClass<-as.data.frame(cbind(as.character(xTest$fiProductClassDesc),yTest))
colnames(trainProdClass)<-c("fiProductClassDesc","y"); colnames(testProdClass)<-c("fiProductClassDesc","y")
ProdClassMeans<-sqldf("SELECT fiProductClassDesc,avg(y) avg, COUNT(*) n FROM trainProdClass GROUP BY fiProductClassDesc")
ProdClassPredictions<-sqldf("SELECT case when n > 30 then avg ELSE 31348.63 end avg FROM ProdClassMeans P LEFT JOIN testProdClass t ON t.fiProductClassDesc = P.fiProductClassDesc")
rmse(yTest,ProdClassPredictions$avg) ## 29082.64 ? peculiar result on the fiProductClassDesc means, which seemed fairly stable and useful
##seems to say that the primary factor alone is not helpful; full tree needed
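## (a possible explanation, not verified: the LEFT JOIN above starts from ProdClassMeans, so the returned rows
## are not guaranteed to line up row-for-row with yTest; that ordering mismatch could account for the peculiar RMSE)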
Code Dump: Page4
## Investigate actual GBM model
pretty.gbm.tree(g,1) ## show underlying model for the first decision tree
summary(xTrain[,10]) ## underlying model showed variable 9 to be first point in tree (9 with 0 index = 10th column)
g$initF ## view what is effectively the "y intercept"
mean(yTrain) ## equivalence shows gaussian y intercept is the mean
t(g$c.splits[1][[1]]) ## show whether each factor level should go left or right
plot(g,10) ## plot fiProductClassDesc, the variable with the highest rel.inf
plot(g,3) ## plot YearMade, continuous variable with 2nd highest rel.inf
interact.gbm(g,xTrain,c(10,3)) ## compute H statistic to show interaction; integrates
interact.gbm(g,xTrain,c(10,3)) ## example of uninteresting interaction
Selected References
• CRAN
• Documentation
• vignette
• Algorithm publications:
• Greedy Function Approximation: A Gradient Boosting Machine; Friedman, 2/1999
• Stochastic Gradient Boosting; Friedman, 3/1999
• Overviews
• Gradient boosting machines, a tutorial: Frontiers (4/13)
• Wikipedia (pretty good article, really)
• Video of author of GBM in Python: Gradient Boosted Regression Trees in scikit-learn
• Very helpful, but the R implementation is not based on decision "stumps", so some things differ in R (e.g. the number of trees need not be so high)