R-squared can be thought of as a standardized version of MSE: it represents the fraction of the variance in the actual values of the response variable that is captured by the regression model, whereas MSE measures the residual error in the original units.
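To make that connection concrete: the MSE is the residual sum of squares divided by n, and the variance of the observations is the total sum of squares divided by n, so R2 can be rewritten as 1 - MSE / Var(y). A minimal sketch, using made-up numbers for ytest and preds:

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
ytest = np.array([3.0, 5.0, 7.0, 9.0])
preds = np.array([2.8, 5.3, 6.9, 9.4])

mse = np.mean((ytest - preds) ** 2)           # mean squared error
var = np.mean((ytest - np.mean(ytest)) ** 2)  # variance of the observations

r2 = 1 - mse / var  # R2 expressed in terms of MSE and variance
print("MSE:", mse)
print("R2 :", r2)
```

Both quantities divide by the same n, so the n cancels and only the ratio of the two sums of squares matters.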
How is the r2 score calculated?
Today we’re going to introduce some terms that are important to machine learning:
- r2 score
- Mean squared error (MSE)
What is r2 score?
The r2 score usually falls between 0 and 100%, though it can be negative for a model that fits worse than simply predicting the mean. It is closely related to the MSE (see below), but not the same. Wikipedia defines r2 as
"…the proportion of the variance in the dependent variable that is predictable from the independent variable(s)."
Another definition is "(total variance explained by model) / total variance." So if it is 100%, the two variables are perfectly correlated, i.e., there is no unexplained variance at all. A low value indicates a low level of correlation, which usually means the regression model is not valid, but not in all cases.
Reading the code below, we do this calculation in three steps to make it easier to understand. g is the sum of the squared differences between the observed values and the predicted ones, (ytest[i] - preds[i]) ** 2, i.e. the residual sum of squares. y is the sum of the squared differences between each observed value y[i] and the average of the observed values np.mean(ytest), i.e. the total sum of squares. And then the results are printed thus:
print("total sum of squares", y)
print("total sum of residuals", g)
print("r2 calculated", 1 - (g / y))
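Putting those steps together into one self-contained, runnable script might look like this; the ytest and preds values here are made up for illustration, since in the original post they come from a fitted model:

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
ytest = np.array([3.0, 5.0, 7.0, 9.0])
preds = np.array([2.8, 5.3, 6.9, 9.4])

# g: residual sum of squares (observed minus predicted, squared, summed)
g = sum((ytest[i] - preds[i]) ** 2 for i in range(len(ytest)))

# y: total sum of squares (observed minus mean of observed, squared, summed)
y = sum((ytest[i] - np.mean(ytest)) ** 2 for i in range(len(ytest)))

print("total sum of squares", y)
print("total sum of residuals", g)
print("r2 calculated", 1 - (g / y))
```

The loops are written explicitly to mirror the step-by-step explanation; vectorized NumPy expressions would do the same job in one line each.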
Our goal here is to explain. We can of course let scikit-learn do this with the r2_score() method:
print("R2 score : %.2f" % r2_score(ytest, preds))
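For completeness, scikit-learn also provides mean_squared_error, so both metrics introduced in this post can be computed side by side. The data below is made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values, for illustration only
ytest = np.array([3.0, 5.0, 7.0, 9.0])
preds = np.array([2.8, 5.3, 6.9, 9.4])

# MSE: average squared residual, in the squared units of the response
print("MSE : %.3f" % mean_squared_error(ytest, preds))
# R2: unitless fraction of variance explained
print("R2 score : %.2f" % r2_score(ytest, preds))
```

Because MSE depends on the units of the response variable while R2 is unitless, comparing both gives a fuller picture of model fit.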
Similarly, there is no single correct answer as to what R2 should be. 100% means perfect correlation. Yet there are models with a low R2 that are still good models.
Our takeaway message here is that you cannot look at these metrics in isolation when sizing up your model. You have to look at other metrics as well, plus understand the underlying math. We will get into all of this in subsequent blog posts.