The Curse of Dimensionality
The Problem of Data Points in Higher Dimensions.
Prerequisite
- Polynomial Regression - learnml.hashnode.dev/the-polynomial-regress..
Introduction
Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset corresponds to the number of attributes/features that exist in it. A dataset with a large number of attributes, generally on the order of a hundred or more, is referred to as high-dimensional data. Some of the difficulties that come with high-dimensional data manifest while analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models. The difficulties related to training machine learning models on high-dimensional data are referred to as the ‘Curse of Dimensionality’.
Note: All the experiments executed in this article are available in this Colab notebook.
What is the problem with more dimensions?
Let us consider a dataset,
import numpy as np

np.random.seed(0)  # make the random draws reproducible
x = 2 - 3 * np.random.normal(0, 1, 20)                               # 20 feature values
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)  # noisy cubic target
We can plot this data dimension by dimension.
1D - One Dimension
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.rugplot(x=x)  # draw every value of x as a tick on the x-axis
plt.xlabel("X-Axis")
plt.title("1D")
plt.show()
We can see that the points on the x-axis (at the bottom) are very close to each other.
2D - Two Dimensions
Now let us consider the y-axis as well and plot the graph in 2D.
plt.scatter(x, y, s=10)  # plot each (x, y) pair as a small point
plt.show()
Now we can see the points have moved some distance apart from each other. They are still close, but not as close as in 1D.
3D - Three Dimensions
Now let us introduce one more feature using polynomial features and plot the points again.
from sklearn.preprocessing import PolynomialFeatures

# degree=2 adds x^2 as an extra feature (columns: 1, x, x^2)
polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x.reshape(-1, 1))  # PolynomialFeatures expects a 2D array
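The notebook's exact plotting code is not shown here; one way to draw the 3D scatter (a sketch assuming matplotlib's 3D projection, with x, x², and y as the three axes) is:
# Hedged sketch: scatter the points in 3D using (x, x^2, y) as the coordinates
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(x, x ** 2, y, s=10)
ax.set_xlabel("x")
ax.set_ylabel("x^2")
ax.set_zlabel("y")
ax.set_title("3D")
plt.show()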
In three-dimensional space, we can see the points moving even farther apart from each other.
So we figured out the first problem:
As the dimension increases, the data points move from a dense region to a sparse one.
There is another problem when data points live in higher dimensions: every point becomes approximately equidistant from every other point. This affects algorithms that are based on distance (kNN and others).
Mathematical Proof: journalofbigdata.springeropen.com/articles/..
Distance concentration refers to the problem of all the pairwise distances between different samples/points in the space converging to the same value as the dimensionality of the data increases. Several machine learning models such as clustering or nearest neighbors’ methods use distance-based metrics to identify similarities or proximity of the samples. Due to distance concentration, the concept of proximity or similarity of the samples may not be qualitatively relevant in higher dimensions.
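To see distance concentration in action, here is a small illustrative experiment (not part of the original notebook): sample random points in higher and higher dimensions and compare the largest and smallest pairwise distances. As the dimension grows, the ratio gets closer and closer to 1, meaning every point looks roughly equidistant from every other point.
# Illustrative sketch: the max/min pairwise distance ratio shrinks toward 1 as the dimension grows
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in [1, 2, 10, 100, 1000]:
    points = rng.random((100, dim))   # 100 random points in a unit hypercube
    dists = pdist(points)             # all pairwise Euclidean distances
    print(dim, round(dists.max() / dists.min(), 2))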
Explanation
Let's take the setup from our previous article on Polynomial Regression and see what happens when we increase the dimensions (by adding more features).
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

degree = []
rmse_list = []
r2_list = []
for deg in range(2, 200, 3):
    # expand x into polynomial features of the current degree
    polynomial_features = PolynomialFeatures(degree=deg)
    x_poly = polynomial_features.fit_transform(x.reshape(-1, 1))
    degree.append(deg)
    # fit a linear regression on the expanded features
    model = LinearRegression()
    model.fit(x_poly, y)
    y_poly_pred = model.predict(x_poly)
    # evaluate the fit
    rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
    r2 = r2_score(y, y_poly_pred)
    rmse_list.append(rmse)
    r2_list.append(r2)
    print(rmse, r2)
Plotting these metrics against the degree results in the graph below.
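A minimal sketch of the plotting code that produces it (reusing the matplotlib import from earlier; the exact styling in the original notebook may differ):
# Hedged sketch: visualise how RMSE and R2 change as the degree (number of features) grows
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(degree, rmse_list, marker="o")
ax1.set_xlabel("Degree")
ax1.set_ylabel("RMSE")
ax2.plot(degree, r2_list, marker="o")
ax2.set_xlabel("Degree")
ax2.set_ylabel("R2 score")
plt.tight_layout()
plt.show()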
As we increase the dimensions (new features), after a certain point both the RMSE and the R2 score degrade drastically. This observation is called the Hughes phenomenon.
Hughes Phenomenon: For a fixed-size dataset, the performance of a machine learning model decreases as the dimensionality increases.
Notebook
COLAB Notebook: colab.research.google.com/drive/1g3xfikqXZm..
Conclusion
The Curse of Dimensionality affects the model when we have many features. To avoid falling into this pit, we need to drop the features that contribute less to training, or else combine the n features into fewer features to solve this problem.
.
.
.
Wait
.
.
.
.
Is there any way to combine n features into fewer, that is, to reduce the dimensions? Yes, there is.
Dimensionality reduction is an important technique to overcome the curse of dimensionality in data science and machine learning. As the number of predictors (or dimensions, or features) in the dataset increases, it becomes computationally more expensive (i.e. increased storage space, longer computation time) and exponentially more difficult to produce accurate predictions in classification or regression models. Moreover, it is hard to wrap our heads around visualizing data points in more than 3 dimensions.
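As a small preview (a hedged sketch using scikit-learn's PCA, which is just one of several dimensionality reduction techniques; the degree used here is only for illustration), the many polynomial features from the experiment above can be compressed into a couple of components:
# Illustrative sketch: compress high-dimensional polynomial features into 2 components with PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

x_poly = PolynomialFeatures(degree=10).fit_transform(x.reshape(-1, 1))  # 11 features per sample
pca = PCA(n_components=2)
x_reduced = pca.fit_transform(x_poly)
print(x_poly.shape, "->", x_reduced.shape)  # (20, 11) -> (20, 2)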
We will see more about dimensionality reduction in upcoming articles.
Interview Questions
- What is the Curse of Dimensionality?
- What is the Curse of Dimensionality and how can Unsupervised Learning help with it?
- Why is data more sparse in a high-dimensional space?
- How does the Curse of Dimensionality affect Machine Learning models?
- How does High Dimensionality affect Distance-Based Mining Applications?
- Does kNN suffer from the Curse of Dimensionality, and if so, why?
- How does the Curse of Dimensionality affect k-Means Clustering?