Data Science Notes

Correlation

  • We should look at the correlation of our features with our target but also at the correlation between our different features.
  • Correlation is the degree to which things move together
  • The amount of sun and the amount of ice cream sold tend to move together; if one is high, the other also tends to be high, and if one is low, the other also tends to be low. This is positive correlation.
  • Negative correlation: if one is high, the other tends to be low
  • No correlation: neither variable tells you anything about the other; a scatter plot of the two shows no discernible pattern
  • Correlation is calculated as a single number which ranges from -1 to 1
    $\rho_{XY}=\mathrm{corr}(X,Y)$
  • $\rho=0$ means no correlation (the variables are uncorrelated)
  • Correlation is a statistical measure of a linear relationship between two variables
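  • As a quick illustration of the definition above, a minimal sketch computing $\rho$ for two variables (the numbers below are made up for the example):

import numpy as np

# Made-up measurements: hours of sun and ice creams sold
sun = np.array([2, 4, 5, 7, 9, 10])
ice_cream = np.array([15, 25, 30, 40, 52, 60])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is rho
rho = np.corrcoef(sun, ice_cream)[0, 1]
print(rho)  # close to +1: strong positive correlation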
  • Why should we look at the correlations of our features during the data exploration stage? Primarily because we care about two things:
    1. Strength: how strong the correlation is, i.e. how tightly the feature and the target move together
    2. Direction: whether the correlation is positive or negative
  • Our model should include features that are correlated with the target; we want features whose movement is associated with a large movement in the target value, i.e. a correlation that is not close to zero
  • Python command to calculate the correlations between all pairs of features (the correlation matrix), assuming data is a pandas DataFrame:
data.corr()
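  • A quick sketch of how the matrix is often used in practice: pick out each feature's correlation with the target and sort it ('PRICE' is a hypothetical name for the target column):

# Correlation of every column with the hypothetical target column 'PRICE',
# sorted so the strongest positive and negative correlations stand out
corr_matrix = data.corr()
print(corr_matrix['PRICE'].sort_values(ascending=False))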
  • There are multiple ways to calculate a correlation; the default in pandas is the Pearson correlation. Assumption: it is only strictly valid for continuous variables. That means it is not strictly valid for dummy variables, such as whether a property is on the Charles River or not.
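  • If the Pearson assumption is a concern, pandas can also compute rank-based correlations via the method argument (a sketch; these are alternatives, not a full fix for non-continuous data):

data.corr(method='spearman')  # Spearman rank correlation
data.corr(method='kendall')   # Kendall's tau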
  • If two features are highly correlated, is that a good thing or a bad thing for our regression modelling? The answer is: it depends
  • High correlation between features can be problematic, so this is something we want to discover early on
  • Example: predict bone density using age, body fat, and weight
  • Because body fat and weight move together, it is difficult to tell their effects on bone density apart; you cannot see their individual contributions, and one of the features is essentially redundant. This is multicollinearity: the features do not each provide unique and independent information to the regression
  • Multicollinearity leads to unreliable coefficient estimates and nonsensical findings; simply put, the model starts getting confused
  • High correlation between features $\ne$ multicollinearity; high correlations don’t automatically mean that you have this problem
  • However, high correlations can be an early warning sign, so they are worth investigating
  • The correlation matrix is not a silver bullet for data exploration. It may not answer all our questions, but it gives us a bit more perspective; it has its pros and cons, and, as noted, high correlations don't necessarily imply multicollinearity
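  • One way to go beyond pairwise correlations and check for multicollinearity more directly is the variance inflation factor (VIF). A minimal sketch using statsmodels on simulated bone-density-style features (the data below is invented for illustration):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated features: bodyfat is constructed to move closely with weight
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 200)
weight = rng.normal(75, 12, 200)
bodyfat = 0.4 * weight + rng.normal(0, 2, 200)
X = sm.add_constant(pd.DataFrame({'age': age, 'bodyfat': bodyfat, 'weight': weight}))

# VIF for each feature; values well above ~5-10 are a common warning sign
for i, col in enumerate(X.columns):
    if col != 'const':
        print(col, variance_inflation_factor(X.values, i))

  • In this simulated setup, bodyfat and weight should show noticeably larger VIFs than age, mirroring the bone density example above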
  • Limitations of correlation:
    1. Continuous data only
    2. Correlation does not imply causation: just because two things move together doesn't mean that one causes the other. People who drank water before 1880 are all dead now; this does not mean drinking water will kill you.
    3. Linear relationships only: Pearson correlation only checks for a linear relationship, and a low Pearson coefficient does not mean there is no relationship at all (see the sketch after this list)
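  • A tiny sketch of that last point: a perfect quadratic relationship with a Pearson coefficient of roughly zero:

import numpy as np

# y is completely determined by x, yet the Pearson coefficient is ~0
x = np.linspace(-3, 3, 101)
y = x ** 2
print(np.corrcoef(x, y)[0, 1])  # approximately 0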
  • Anscombe’s quartet:
  • These four datasets have very similar descriptive statistics and nearly identical regression lines, yet they show completely different relationships. They demonstrate that outliers and non-linear relationships often only become apparent after visualizing the data
  • This is why it is important to look at correlations and descriptive statistics in conjunction with some charts
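  • Anscombe's quartet ships with seaborn as an example dataset (downloaded on first use), so the point can be checked directly; a sketch:

import seaborn as sns
import matplotlib.pyplot as plt

# Four groups with near-identical means and standard deviations...
anscombe = sns.load_dataset("anscombe")
print(anscombe.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))

# ...but plotting each group reveals completely different relationships
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2)
plt.show()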
  • To plot scatter plots between all pairs of features, use seaborn's pairplot:
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(data)
plt.show()
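  • A correlation heatmap is a common complement to the pair plot, showing all pairwise coefficients at a glance (a sketch, again assuming data is a pandas DataFrame):

import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of the correlation matrix with each coefficient annotated
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, cmap='coolwarm')
plt.show()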