Introduction
Feature selection plays a key role in reducing the dimensions of a dataset. Dimensionality reduction has several benefits, including shorter computation and training time for models built on the dataset and easier visualization with fewer dimensions. Missing Value Ratio is one of the basic feature selection techniques.
What is the Missing Value Ratio and how do we calculate it?
In a dataset, some columns may contain missing values. We can try to fill those missing values using the techniques mentioned here. Otherwise, we compute the missing value ratio of each column, and if the percentage of missing values is greater than a chosen threshold, we eliminate that column from the dataset.
Formula to calculate the missing value ratio:
Missing value ratio (%) = (number of missing values in the column / total number of rows) × 100
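As a minimal sketch of this calculation, the snippet below uses a small made-up DataFrame (the name toy and its values are assumptions for illustration, not the Titanic data). It relies on the fact that the column-wise mean of isnull() is the fraction of missing values, which is equivalent to dividing the count of missing values by the number of rows.

```python
import numpy as np
import pandas as pd

# Made-up toy data, purely for illustration (not the Titanic dataset).
toy = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, 41.0],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
    "fare":  [7.25, 71.28, 8.05, 53.10],
})

# isnull() marks each missing cell as True; the column-wise mean of those
# booleans is the fraction missing, so multiplying by 100 gives a percentage.
missing_ratio = toy.isnull().mean() * 100
print(missing_ratio)
```

Here one of four age values and three of four cabin values are missing, so the ratios come out as 25% and 75%, while fare has no missing values.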
Explanation
Let us consider the Titanic dataset:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/syedjafer/datasets/main/titanic.csv")
data.head()
In the output above, we can see some NaN (Not a Number) values in the Cabin column. Now let us find the percentage of missing values in each column.
data.isnull().sum()/len(data)*100
We can see that the Cabin column has the highest missing value percentage, about 77%. Now let's apply a threshold of 40% (the right value depends on the project) and filter the columns, so that any column whose missing percentage is higher than the threshold is eliminated.
# Percentage of missing values per column, indexed by column name
missing_val_percentage = data.isnull().sum() / len(data) * 100
threshold_level = 40  # in percent
filtered_columns = []
for column in data.columns:
    # Look up by column name rather than by position
    if missing_val_percentage[column] <= threshold_level:
        filtered_columns.append(column)
filtered_columns
We can then create a new dataset from the filtered columns:
new_data = data[filtered_columns]
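The same filtering can also be written without an explicit loop. The sketch below is one possible way to do it, using a boolean mask over the columns; the helper name drop_high_missing, its threshold_pct parameter, and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd


def drop_high_missing(df: pd.DataFrame, threshold_pct: float = 40.0) -> pd.DataFrame:
    """Return a copy of df without columns whose missing percentage exceeds threshold_pct."""
    # Percentage of missing values per column, indexed by column name.
    missing_pct = df.isnull().mean() * 100
    # Boolean mask: True for columns at or below the threshold.
    keep = missing_pct <= threshold_pct
    return df.loc[:, keep].copy()


# Made-up toy data, purely for illustration (not the Titanic dataset).
toy = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, 41.0],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
    "fare":  [7.25, 71.28, 8.05, 53.10],
})

reduced = drop_high_missing(toy, threshold_pct=40)
print(list(reduced.columns))
```

With the Titanic data loaded as data, calling drop_high_missing(data, 40) should select the same columns as the loop above, since both keep columns whose missing percentage is at most the threshold.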
When to use this?
If the dataset has too many missing values, we can use this approach to reduce the number of variables by dropping the variables that have a large share of missing values.
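Since the threshold depends on the project, it can help to see how many columns would survive at several candidate values before committing to one. The following is a rough sketch on made-up data; the candidate thresholds and the name kept_per_threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data, purely for illustration (not the Titanic dataset).
toy = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, 41.0],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
    "fare":  [7.25, 71.28, 8.05, 53.10],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# For each candidate threshold (in percent), count the columns that would be kept.
kept_per_threshold = {t: int((missing_pct <= t).sum()) for t in (10, 30, 50, 80)}
print(kept_per_threshold)
```

A threshold that removes most of the columns, or none of them, is a hint to revisit either the threshold or the decision to drop columns rather than fill in the missing values.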
Notebook link:
Please find this Colab notebook, where you can run all the Python code used in this article.
Conclusion
In this article, we covered Missing Value Ratio, one of the most common feature selection techniques, which also helps reduce the number of features in a dataset.