Data Science Notes

Correlation

  • We should look at the correlation of our features with our target but also at the correlation between our different features.
  • Correlation is the degree to which things move together
  • The amount of sun and the amount of ice cream sold tend to move together; if one is high, the other also tends to be high, and if one is low, the other also tends to be low. This is positive correlation.
  • Negative correlation: if one is high, the other tends to be low
  • No correlation: neither variable tells you anything about the other; a scatter plot of the two shows no discernible pattern
  • Correlation is calculated as a single number which ranges from -1 to 1
    $\rho_{XY}=\mathrm{corr}(X,Y)$
  • $\rho=0$ means no correlation (the variables are uncorrelated)
  • Correlation is a statistical measure of a linear relationship between two variables
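  • As a quick illustration of the definition above, a minimal sketch computing $\rho$ for two variables (the numbers below are made up for the example):

import numpy as np

# Made-up measurements: hours of sun and ice creams sold
sun = np.array([2, 4, 5, 7, 9, 10])
ice_cream = np.array([15, 25, 30, 40, 52, 60])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is rho
rho = np.corrcoef(sun, ice_cream)[0, 1]
print(rho)  # close to +1: strong positive correlation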
  • Why should we look at the correlations of our features during the data exploration stage? Primarily because we care about two things:
    1. Strength: how strong the correlation is, i.e. how tightly the feature and the target move together
    2. Direction: whether the correlation is positive or negative
  • Our model should include features that are correlated with the target; we want features whose movement is associated with a large movement in the target value, i.e. a correlation that is not close to zero
  • Python command to calculate the correlations between all pairs of features (the correlation matrix), assuming data is a pandas DataFrame:
data.corr()
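  • A quick sketch of how the matrix is often used in practice: pick out each feature's correlation with the target and sort it ('PRICE' is a hypothetical name for the target column):

# Correlation of every column with the hypothetical target column 'PRICE',
# sorted so the strongest positive and negative correlations stand out
corr_matrix = data.corr()
print(corr_matrix['PRICE'].sort_values(ascending=False))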
  • There are multiple ways to calculate a correlation; the default in pandas is the Pearson correlation. Assumption: it is only strictly valid for continuous variables. That means it is not strictly valid for dummy variables, such as whether a property is on the Charles River or not.
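  • If the Pearson assumption is a concern, pandas can also compute rank-based correlations via the method argument (a sketch; these are alternatives, not a full fix for non-continuous data):

data.corr(method='spearman')  # Spearman rank correlation
data.corr(method='kendall')   # Kendall's tau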
  • If two features are highly correlated, is that a good thing or a bad thing for our regression modelling? The answer is: it depends
  • High correlation between features can be problematic, so this is something we want to discover early on
  • Example: predict bone density using age, body fat, and weight
  • Because body fat and weight move together, it is difficult to tell their effects on bone density apart; you cannot see their individual contributions, and one of the features is essentially redundant. This is multicollinearity: the features do not each provide unique and independent information to the regression
  • Multicollinearity leads to unreliable coefficient estimates and nonsensical findings; simply put, the model starts getting confused
  • High correlation between features $\ne$ multicollinearity; high correlations don’t automatically mean that you have this problem
  • However, high correlations can be an early warning sign, so they are worth investigating
  • The correlation matrix is not a silver bullet for data exploration. It may not answer all our questions, but it gives us a bit more perspective; it has its pros and cons, and, as noted, high correlations don't necessarily imply multicollinearity
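  • One way to go beyond pairwise correlations and check for multicollinearity more directly is the variance inflation factor (VIF). A minimal sketch using statsmodels on simulated bone-density-style features (the data below is invented for illustration):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated features: bodyfat is constructed to move closely with weight
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 200)
weight = rng.normal(75, 12, 200)
bodyfat = 0.4 * weight + rng.normal(0, 2, 200)
X = sm.add_constant(pd.DataFrame({'age': age, 'bodyfat': bodyfat, 'weight': weight}))

# VIF for each feature; values well above ~5-10 are a common warning sign
for i, col in enumerate(X.columns):
    if col != 'const':
        print(col, variance_inflation_factor(X.values, i))

  • In this simulated setup, bodyfat and weight should show noticeably larger VIFs than age, mirroring the bone density example above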
  • Limitations of correlation:
    1. Continuous data only
    2. Correlation does not imply causation: just because two things move together doesn't mean that one causes the other. People who drank water before 1880 are all dead now; this does not mean drinking water will kill you.
    3. Linear relationships only: Pearson correlation only checks for a linear relationship, and a low Pearson coefficient does not mean there is no relationship at all (see the sketch after this list)
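  • A tiny sketch of that last point: a perfect quadratic relationship with a Pearson coefficient of roughly zero:

import numpy as np

# y is completely determined by x, yet the Pearson coefficient is ~0
x = np.linspace(-3, 3, 101)
y = x ** 2
print(np.corrcoef(x, y)[0, 1])  # approximately 0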
  • Anscombe’s quartet:
  • These four datasets have very similar descriptive statistics and nearly identical regression lines, yet they show completely different relationships. They demonstrate that outliers and non-linear relationships often only become apparent after visualizing the data
  • This is why it is important to look at correlations and descriptive statistics in conjunction with some charts
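  • Anscombe's quartet ships with seaborn as an example dataset (downloaded on first use), so the point can be checked directly; a sketch:

import seaborn as sns
import matplotlib.pyplot as plt

# Four groups with near-identical means and standard deviations...
anscombe = sns.load_dataset("anscombe")
print(anscombe.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))

# ...but plotting each group reveals completely different relationships
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2)
plt.show()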
  • To plot scatter plots between all pairs of features, use seaborn's pairplot:
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(data)
plt.show()
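  • A correlation heatmap is a common complement to the pair plot, showing all pairwise coefficients at a glance (a sketch, again assuming data is a pandas DataFrame):

import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of the correlation matrix with each coefficient annotated
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, cmap='coolwarm')
plt.show()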