High Correlation filter
If two people in a meeting have the same ideas, then one of them is not required
Introduction
In the previous article, Variance and Low Variance Filter, we looked at one feature selection technique. In this article, we’re going to cover another feature selection technique, known as the High Correlation filter.
Dataset & Notebook
In this article, we are going to use the same Titanic dataset, with some imputed values. (Check out this notebook for reference.)
What is a High Correlation Filter?
A high correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance).
We can calculate the correlation between each pair of independent numerical variables. If the correlation coefficient crosses a certain threshold value, we can drop one of the two variables (which variable to drop is highly subjective, and the decision should always be made with the domain in mind).
Note: As with the Low Variance Filter, we can apply this only to numerical variables and not to categorical variables.
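Because the choice of which variable to drop is subjective, it can help to list the candidate pairs before removing anything. The following is a minimal sketch rather than the notebook's code: the helper name correlated_pairs, the toy DataFrame, and the default threshold of 0.6 are all assumptions made for illustration.

```python
from itertools import combinations

import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.6):
    """List pairs of numeric columns whose absolute correlation exceeds threshold.

    Hypothetical helper for illustration; returns (col_a, col_b, abs_corr) tuples,
    strongest pair first.
    """
    # Only numeric columns take part; categorical columns are ignored.
    corr = df.select_dtypes(include="number").corr().abs()
    pairs = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)  # each unordered pair once
        if corr.loc[a, b] > threshold
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy data, made up for illustration: b is an exact multiple of a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
    "label": ["x", "y", "x", "y", "x"],  # non-numeric, so it is skipped
})
pairs = correlated_pairs(toy)
print(pairs)
```

The function only reports pairs. Deciding which member of each pair to drop is left to the reader, in line with the point about domain knowledge.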
As a general guideline, we should keep those variables that show a decent or high correlation with the target variable.
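One way to act on this guideline is to compare how strongly each member of a correlated pair relates to the target, and to treat the weaker one as the candidate for removal. This is only a sketch under stated assumptions: the helper weaker_of_pair and the column names feat_a, feat_b, and target are hypothetical, and the values are made up.

```python
import pandas as pd


def weaker_of_pair(df: pd.DataFrame, col_a: str, col_b: str, target: str) -> str:
    """Return whichever of col_a / col_b is less correlated with the target.

    Hypothetical helper for illustration; on a tie it returns col_b.
    """
    corr_a = abs(df[col_a].corr(df[target]))
    corr_b = abs(df[col_b].corr(df[target]))
    return col_a if corr_a < corr_b else col_b


# Toy data, made up for illustration: feat_a and feat_b move together,
# but feat_a tracks the binary target more closely.
toy = pd.DataFrame({
    "feat_a": [1, 2, 3, 4, 5, 6],
    "feat_b": [2, 1, 4, 3, 6, 5],
    "target": [0, 0, 0, 1, 1, 1],
})
candidate = weaker_of_pair(toy, "feat_a", "feat_b", "target")
print(candidate)
```

The returned column is only a suggestion. Domain knowledge can still justify keeping it.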
Explanation
Let us consider our Titanic dataset:
import pandas as pd
# Load the Titanic dataset and preview the first rows
data = pd.read_csv("https://raw.githubusercontent.com/syedjafer/datasets/main/titanic.csv")
data.head()
Now we can calculate the correlation between the different variables:
# numeric_only=True restricts the calculation to numeric columns (pandas 1.5+)
data.corr(numeric_only=True)
We can also visualize the correlation matrix as a heatmap using seaborn.
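A heatmap of this kind could be produced roughly as follows. This is a minimal sketch, not the notebook's original plotting code: the helper name plot_correlation_heatmap, the colour map, and the toy DataFrame with placeholder values are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_heatmap(df: pd.DataFrame):
    """Draw an annotated heatmap of the correlations between numeric columns."""
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(6, 5))
    # vmin/vmax pin the colour scale to the full correlation range [-1, 1].
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation matrix")
    return ax


# Placeholder values, made up for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
ax = plot_correlation_heatmap(toy)
```

On the article's dataset, the call would be plot_correlation_heatmap(data). Outside a notebook, call plt.show() afterwards to display the figure.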
From the correlation matrix, we can see that the variables Parch and SibSp have a relatively high correlation with each other, so we can remove either one of them.
Here we are removing the variable Parch.
# drop() returns a new DataFrame, so assign the result back
data = data.drop(columns=["Parch"])
We have now removed one variable, and so reduced the data by one dimension.
Note: Generally, if the absolute correlation between a pair of variables is greater than about 0.5 to 0.6, we should seriously consider dropping one of those variables.
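The manual steps above can be generalized into a small function that applies a threshold to every pair of numeric columns. The following is a sketch under stated assumptions, not a definitive implementation: the name high_correlation_filter, the greedy keep-first strategy, and the toy data are illustrative choices, and the 0.6 default is taken from the upper end of the guideline above.

```python
import pandas as pd


def high_correlation_filter(df: pd.DataFrame, threshold: float = 0.6):
    """Greedily drop numeric columns that are highly correlated with a kept column.

    Hypothetical helper for illustration. Returns the filtered DataFrame and
    the list of dropped column names.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    kept, dropped = [], []
    for col in corr.columns:
        # Drop col if it is too strongly correlated with any column kept so far.
        if any(corr.loc[col, k] > threshold for k in kept):
            dropped.append(col)
        else:
            kept.append(col)
    return df.drop(columns=dropped), dropped


# Toy data, made up for illustration: b is an exact multiple of a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
    "label": ["x", "y", "x", "y", "x"],  # non-numeric, left untouched
})
filtered, dropped = high_correlation_filter(toy)
print(dropped)
print(list(filtered.columns))
```

Note that this sketch always keeps the first column of each correlated pair in column order. It does not look at the target variable or at domain knowledge, so its output is best treated as a list of candidates to review.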
When to use a high correlation filter?
A pair of highly correlated variables increases multicollinearity in the dataset. So, we can use this technique to find highly correlated features and drop one variable from each such pair.
Conclusion
- We have seen the implementation of a high correlation filter from scratch, and we understood how it reduces the dimensions of the data.
- Apply high correlation filters only to numerical columns (don't apply them to categorical variables).
- Remember, when two variables are closely related to each other, we need to keep only one of them.