Besides normality of residuals and homogeneity of variance, one of the biggest assumptions of linear modeling is independence of predictors. If one or more of the predictors in a model are correlated, then the model may produce unstable parameter estimates with highly inflated standard errors, resulting in an overall significant model with no significant predictors. In other words, bad news if your goal is to try and determine the contribution of each predictor in explaining the response. But there is hope!
There are several methods for dealing with multicollinearity. The simplest is to construct a correlation matrix and corresponding scatterplots. If the correlations between predictors approach 1, then multicollinearity might be a problem. In that case, one can make some educated guesses about which predictors to retain in the analysis (based on biological significance, ease of measurement, etc.).
Another way to identify collinear predictors is by calculating a variance inflation factor (VIF) for each predictor. The variance inflation factor represents the proportion of variance in one predictor explained by all the other predictors in the model. A VIF = 1 indicates no collinearity, whereas increasingly higher values suggest increasing multicollinearity. The approach suggested by Zuur et al. (2010, reference below) is to calculate VIFs for each parameter in the model, and if they are larger than some cutoff, sequentially drop the predictor with the largest VIF, recalculate, and repeat until all values are below the cutoff (they suggest a cutoff of 2). VIFs are especially nice for dealing with collinearity of interaction terms.
The vif function in the “car” package in R will calculate VIFs for a linear model. I’ve written a quick function that will identify if any VIFs > cutoff, remove the largest value, recalculate, and repeat until all VIFS < cutoff. It produces a final model with the same name as the original model. I’ve also included a function for calculating VIFs for linear mixed effects models built using the lmer function in the “lme4” package (From: http://hlplab.wordpress.com/2011/02/24/diagnosing-collinearity-in-lme4/).
A WORD OF CAUTION! The following code is a brute force method to removing variables. Its always important to consider the identity of variables that are being removed, and whether certain variables are more biologically meaningful, or whether certain variables are better representations of underlying processes. I recommend going through the sequential deletion step-by-step after this method to see whether you agree with what is happening mathematically. To that end, the function also outputs a table showing the VIF values and deleted variables at each step in the sequence.
# Load psych package library(psych) #Calls: pairs.panels # Load car package library(car) #Calls: vif # Load plyr package library(plyr) #Calls: rbind.fill # Create a fake data frame with two pairs of highly collinear variables df=data.frame(y=rnorm(500,0,20),x1=rnorm(500,50,100),x2=rnorm(500,10,40)) df$x3=df$x1+runif(500,-50,50); df$x4=df$x2+runif(500,-5,5) # Fit a linear model to the data mod=lm(y~.,data=df); plot(mod,which=1:2) # Generate scatterplots with Pearson correlations pairs.panels(df) # Use function `vif` to calculate variance inflation factors for each variable # in the model vif(mod) # Choose a VIF cutoff under which a variable is retained (Zuur et al. 2010 # MEE recommends 2) cutoff=2 # Create function to sequentially drop the variable with the largest VIF until # all variables have VIF > cutoff flag=TRUE viftable=data.frame() while(flag==TRUE) { vfit=vif(mod) viftable=rbind.fill(viftable,as.data.frame(t(vfit))) if(max(vfit)>cutoff) { mod= update(mod,as.formula(paste(".","~",".","-",names(which.max(vfit))))) } else { flag=FALSE } } # Look at the final model print(mod) # And associated VIFs print(vfit) # And show the order in which variables were dropped print(viftable)
And the function for lmer (from here):
vif.mer <- function (fit) { ## adapted from rms::vif v <- vcov(fit) nam <- names(fixef(fit)) ## exclude intercepts ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)")) if (ns > 0) { v <- v[-(1:ns), -(1:ns), drop = FALSE] nam <- nam[-(1:ns)] } d <- diag(v)^0.5 v <- diag(solve(v/(d %o% d))) names(v) <- nam v }
Created by Pretty R at inside-R.org
References:
Zurr et al. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1: 3-14.
Note, often collinarities will be generated with interaction terms. By calculating a centered interaction, you can remove these, rather than worrying about removing the interaction. You just have to be careful about interpretation of main effects afterwards!
Good point, Jarrett! Centering is yet another good way to deal with collinearity. I will have to ponder on situations where certain techniques are more or less appropriate…
Centering and scaling is a good way to deal with multicollinearity. Thanks for R code and ideas about programming.
Hello, If i fit a Logistic Mixed Model with lme4 and I want to take into account the colineality and correlation between discrete covariates, what is your opinion? I should’nt use “vif.mer “, isn’t it?
Thanks.
Great post!!
Martí
Hmm, good question. You could calculate variance inflation factors based on a pseudo-R^2 including the variance of the random effects (i.e., the conditional R^2 from this post: https://jonlefcheck.net/2013/03/13/r2-for-linear-mixed-effects-models/). I’m not sure how legitimate that is or whether you would want to take a different approach to assessing collinearity (like scatterplots, etc.). Good luck! Cheers, Jon
Thank you so much. I’ll try it. I think the the conditional R^2 is useful to select the “best” model such as AIC or BIC selection. However, in the new version of lme4 we can fit the model with the script ” print(model, corr=FALSE).”. I don’t know if it takes account the correlation.
Thank you again.
Martí
There is an error in the function to test the collinearity! After ## exclude intercepts
It’s missing this part:
ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
Cheers,
Matteo
Hi Matteo, Thanks for the heads up but I’m afraid I don’t follow your comment. Can you be more specific? Cheers, Jon
Nevermind, didn’t scroll down far enough! Error corrected (as well as a few other minor errors). Thanks again, Jon