The Curse of Dimensionality
The Problem of Data Points in Higher Dimensions.
Prerequisite
- Polynomial Regression - learnml.hashnode.dev/the-polynomial-regress..
Introduction
Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset corresponds to the number of attributes/features that exist in it. A dataset with a large number of attributes, generally on the order of a hundred or more, is referred to as high-dimensional data. Some of the difficulties that come with high-dimensional data manifest while analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models. The difficulties related to training machine learning models on high-dimensional data are referred to as the ‘Curse of Dimensionality’.
Note: All the experiments executed in this article are available in this Colab notebook.
What is the problem with more dimensions?
Let us consider a dataset,
import numpy as np

np.random.seed(0)  # make the random draws reproducible
x = 2 - 3 * np.random.normal(0, 1, 20)                               # 20 feature values
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)  # noisy cubic target
We can plot this data dimension by dimension.
1D - One Dimension
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.rugplot(x=x)  # draw every value of x as a tick on the x-axis
plt.xlabel("X-Axis")
plt.title("1D")
plt.show()
We can see that the points on the x-axis (at the bottom) are very close to each other.
2D - Two Dimensions
Now let us consider the y-axis as well and plot the graph in 2D.
plt.scatter(x, y, s=10)  # plot each (x, y) pair as a small point
plt.show()
Now we can see the points have moved some distance apart from each other. They are still close, but not as close as in 1D.
3D - Three Dimensions
Now let us introduce one more feature using polynomial features and plot the points again.
from sklearn.preprocessing import PolynomialFeatures

# degree=2 adds x^2 as an extra feature (columns: 1, x, x^2)
polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x.reshape(-1, 1))  # PolynomialFeatures expects a 2D array
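The notebook's exact plotting code is not shown here; one way to draw the 3D scatter (a sketch assuming matplotlib's 3D projection, with x, x², and y as the three axes) is:
# Hedged sketch: scatter the points in 3D using (x, x^2, y) as the coordinates
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(x, x ** 2, y, s=10)
ax.set_xlabel("x")
ax.set_ylabel("x^2")
ax.set_zlabel("y")
ax.set_title("3D")
plt.show()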
In three-dimensional space, we can see the points moving even farther apart from each other.
So we figured out the first problem:
As the dimension increases, the data points move from a dense region to a sparse one.
There is another problem when data points live in higher dimensions: every point becomes approximately equidistant from every other point. This affects algorithms that are based on distance (kNN and others).
Mathematical Proof: journalofbigdata.springeropen.com/articles/..
Distance concentration refers to the problem of all the pairwise distances between different samples/points in the space converging to the same value as the dimensionality of the data increases. Several machine learning models such as clustering or nearest neighbors’ methods use distance-based metrics to identify similarities or proximity of the samples. Due to distance concentration, the concept of proximity or similarity of the samples may not be qualitatively relevant in higher dimensions.
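To see distance concentration in action, here is a small illustrative experiment (not part of the original notebook): sample random points in higher and higher dimensions and compare the largest and smallest pairwise distances. As the dimension grows, the ratio gets closer and closer to 1, meaning every point looks roughly equidistant from every other point.
# Illustrative sketch: the max/min pairwise distance ratio shrinks toward 1 as the dimension grows
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in [1, 2, 10, 100, 1000]:
    points = rng.random((100, dim))   # 100 random points in a unit hypercube
    dists = pdist(points)             # all pairwise Euclidean distances
    print(dim, round(dists.max() / dists.min(), 2))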
Explanation
Let's take the setup from our previous article on Polynomial Regression and see what happens when we increase the dimensions (by adding more features).
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

degree = []
rmse_list = []
r2_list = []
for deg in range(2, 200, 3):
    # expand x into polynomial features of the current degree
    polynomial_features = PolynomialFeatures(degree=deg)
    x_poly = polynomial_features.fit_transform(x.reshape(-1, 1))
    degree.append(deg)
    # fit a linear regression on the expanded features
    model = LinearRegression()
    model.fit(x_poly, y)
    y_poly_pred = model.predict(x_poly)
    # evaluate the fit
    rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
    r2 = r2_score(y, y_poly_pred)
    rmse_list.append(rmse)
    r2_list.append(r2)
    print(rmse, r2)
Plotting these metrics against the degree results in the graph below.
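A minimal sketch of the plotting code that produces it (reusing the matplotlib import from earlier; the exact styling in the original notebook may differ):
# Hedged sketch: visualise how RMSE and R2 change as the degree (number of features) grows
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(degree, rmse_list, marker="o")
ax1.set_xlabel("Degree")
ax1.set_ylabel("RMSE")
ax2.plot(degree, r2_list, marker="o")
ax2.set_xlabel("Degree")
ax2.set_ylabel("R2 score")
plt.tight_layout()
plt.show()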
As we increase the dimensions (new features), after a certain point both the RMSE and the R2 score degrade drastically. This observation is called the Hughes phenomenon.
Hughes Phenomenon: For a fixed-size dataset, the performance of a machine learning model decreases as the dimensionality increases.
Notebook
COLAB Notebook: colab.research.google.com/drive/1g3xfikqXZm..
Conclusion
The Curse of Dimensionality affects the model when we have many features. To avoid falling into this pit, we need to drop the features that contribute less to training, or else combine the n features into fewer features to solve this problem.
.
.
.
Wait
.
.
.
.
Is there any way to combine n features into fewer, that is, to reduce the dimensions? Yes, there is.
Dimensionality reduction is an important technique to overcome the curse of dimensionality in data science and machine learning. As the number of predictors (or dimensions, or features) in the dataset increases, it becomes computationally more expensive (i.e. increased storage space, longer computation time) and exponentially more difficult to produce accurate predictions in classification or regression models. Moreover, it is hard to wrap our heads around visualizing data points in more than 3 dimensions.
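As a small preview (a hedged sketch using scikit-learn's PCA, which is just one of several dimensionality reduction techniques; the degree used here is only for illustration), the many polynomial features from the experiment above can be compressed into a couple of components:
# Illustrative sketch: compress high-dimensional polynomial features into 2 components with PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

x_poly = PolynomialFeatures(degree=10).fit_transform(x.reshape(-1, 1))  # 11 features per sample
pca = PCA(n_components=2)
x_reduced = pca.fit_transform(x_poly)
print(x_poly.shape, "->", x_reduced.shape)  # (20, 11) -> (20, 2)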
We will see more about dimensionality reduction in upcoming articles.
Interview Questions
- What is the Curse of Dimensionality?
- What is the Curse of Dimensionality and how can Unsupervised Learning help with it?
- Why is data more sparse in a high-dimensional space?
- How does the Curse of Dimensionality affect Machine Learning models?
- How does High Dimensionality affect Distance-Based Mining Applications?
- Does kNN suffer from the Curse of Dimensionality, and if so, why?
- How does the Curse of Dimensionality affect k-Means Clustering?