What is the Curse of Dimensionality
Curse of Dimensionality refers to non-intuitive properties of data observed when working in high-dimensional space*, specifically related to the usability and interpretation of distances and volumes. This is one of my favourite topics in Machine Learning and Statistics since it has broad applications (not specific to any machine learning method), it is very counter-intuitive and hence awe-inspiring, it has profound implications for any analytics technique, and it has a ‘cool’ scary name like some Egyptian curse!
For a quick grasp, consider this example: Say you dropped a coin on a 100 meter line. How do you find it? Simple, just walk along the line and search. But what if it’s a 100 x 100 sq. m. field? It’s already getting tough, trying to search (roughly) a football ground for a single coin. But what if it’s a 100 x 100 x 100 cu. m. space?! You know, the football ground now has a thirty-story height. Good luck finding a coin there! That, in essence, is the “curse of dimensionality”.
Many ML methods use Distance Measure
Most segmentation and clustering methods rely on computing distances between observations. The well-known k-Means segmentation assigns points to the nearest center. DBSCAN and hierarchical clustering also require distance metrics. Distribution and density based outlier detection algorithms also compare distances against other distances to mark outliers.
Supervised classification solutions like the k-Nearest Neighbours method also use the distance between observations to assign a class to an unknown observation. The Support Vector Machine method involves transforming observations around selected kernels based on the distance between the observation and the kernel.
A common form of recommendation system relies on distance based similarity among user and item attribute vectors. Even when other forms of distance are used, the number of dimensions plays a role in analytic design.
One of the most common distance metrics is the Euclidean distance, which is simply the straight-line distance between two points in multi-dimensional hyper-space. The Euclidean distance between point i and point j in n-dimensional space can be computed as:
$$d(i, j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$
where $x_{ik}$ is the value of point i along dimension k.
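As a minimal sketch of the Euclidean distance just described, the function below works for points of any dimension. The name `euclidean_distance` and the tuple representation of a point are illustrative assumptions, not something from a particular library.

```python
import math


def euclidean_distance(point_i, point_j):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(point_i) != len(point_j):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared differences along every dimension, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_i, point_j)))


print(euclidean_distance((0, 0), (3, 4)))              # 2-D example
print(euclidean_distance((1, 2, 3, 4), (4, 3, 2, 1)))  # 4-D example
```

The same function handles 2, 10 or 100 dimensions, which is exactly why the behaviour of this quantity in high dimensions matters for so many methods.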
Distance plays havoc in high-dimension
Consider the simple process of data sampling. Suppose the black outside box in Fig. 1 is the data universe, with a uniform distribution of data points across the whole volume, and that we want to sample 1% of observations, as enclosed by the red inside box. The black box is a hyper-cube in multi-dimensional space, with each side representing the range of values in that dimension. For the simple 3-dimensional example in Fig. 1, we may have the following ranges:
Figure 1 : Sampling
What proportion of each range should we sample to obtain that 1% sample? For 2 dimensions, 10% of each range will achieve an overall 1% sample, so we may select x ∈ (0, 10) and y ∈ (0, 50) and expect to capture 1% of all observations. This is because (10%)² = 1%. Do you expect this proportion to be higher or lower for 3 dimensions?
Even though our search now extends in an additional direction, the proportion actually increases to 21.5%, the cube root of 1%. And it does not merely increase: for just one additional dimension, it doubles! You can see that we have to cover more than one-fifth of each dimension just to get one-hundredth of the overall volume! In 10 dimensions, this proportion is 63%, and in 100 dimensions, which is not an uncommon number of dimensions in real-life machine learning, one has to sample 95% of the range along each dimension to sample 1% of observations! This mind-bending result happens because in high dimensions the spread of data points becomes larger even if they are uniformly spread.
This has consequences for the design of experiments and for sampling. The process becomes computationally very expensive, to the extent that the sampled range along each dimension asymptotically approaches the full range of the population even though the sample size remains much smaller than the population.
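The arithmetic behind these percentages is just a d-th root: to capture a share p of a uniformly filled hyper-cube, each side of the sampling box must span p^(1/d) of its range. Here is a small sketch of that calculation; the function name `side_fraction` is only an illustrative choice.

```python
def side_fraction(sample_share, n_dims):
    """Fraction of each dimension's range needed to enclose `sample_share`
    of a uniformly filled hyper-cube with `n_dims` dimensions."""
    return sample_share ** (1.0 / n_dims)


# Share of each range needed to capture 1% of the observations.
for d in (1, 2, 3, 10, 100):
    print(f"{d:>3} dimensions: {side_fraction(0.01, d):.1%} of each range")
```

Running it should reproduce the figures quoted above for 2, 3, 10 and 100 dimensions, up to rounding.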
Consider another huge consequence of high dimensionality. Many algorithms measure the distance between two data points to define some sort of near-ness (DBSCAN, kernels, k-Nearest Neighbours) in reference to some pre-defined distance threshold. In 2 dimensions, we can imagine that two points are near if one falls within a certain radius of the other. Consider the left image in Fig. 2. What share of uniformly spaced points within the black square falls inside the red circle? That is about $\pi r^2 / (2r)^2 = \pi/4 \approx 78.5\%$.
Figure 2 : Near-ness
So if you fit the biggest circle possible inside the square, you cover 78% of the square. Yet the biggest sphere possible inside the cube covers only $\frac{4}{3}\pi r^3 / (2r)^3 = \pi/6 \approx 52.4\%$ of the volume. This share shrinks rapidly, to about 0.25% in just 10 dimensions! What it essentially means is that in a high-dimensional world every single data point is in the corners and nothing really is at the center of the volume. In other words, the center volume reduces to nothing because there is (almost) no center! This has huge consequences for distance based clustering algorithms. All the distances start looking the same, and any distance being more or less than another is more a random fluctuation in the data than any measure of dissimilarity!
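To check these shares for any number of dimensions, one can use the standard formula for the volume of an n-dimensional ball, π^(n/2) / Γ(n/2 + 1) · rⁿ, and divide by the volume (2r)ⁿ of the enclosing hyper-cube. The sketch below does this; the function name `inscribed_ball_share` is an illustrative assumption.

```python
import math


def inscribed_ball_share(n_dims):
    """Share of a hyper-cube's volume covered by the largest ball that fits inside it."""
    # n-ball volume: pi^(n/2) / Gamma(n/2 + 1) * r^n
    # cube volume:   (2r)^n
    # The radius r cancels, so the share depends only on the number of dimensions.
    return math.pi ** (n_dims / 2) / (math.gamma(n_dims / 2 + 1) * 2 ** n_dims)


for d in (2, 3, 5, 10, 20):
    print(f"{d:>2} dimensions: {inscribed_ball_share(d):.4%} of the cube")
```

For 2 and 3 dimensions this gives π/4 and π/6 respectively, and the share keeps collapsing towards zero as dimensions are added.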
Fig. 3 shows randomly generated 2-D data and the corresponding all-to-all distances. The Coefficient of Variation of the distances, computed as Standard Deviation divided by Mean, is 45.9%. The corresponding number for similarly generated 5-D data is 26.5%, and for 10-D data it is 19.1%. Admittedly this is one sample, but the trend supports the conclusion that in high dimensions every distance is about the same, and none is near or far!
Figure 3 : Distance Clustering
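Here is a rough sketch of how such an experiment could be run. It is not the code behind Fig. 3: it assumes points drawn uniformly from the unit hyper-cube, 200 points per run and a fixed seed, all of which are my own choices, so the exact percentages will differ from the ones quoted above. The function name `distance_cv` is likewise hypothetical.

```python
import itertools
import math
import random
import statistics


def distance_cv(n_points, n_dims, seed=0):
    """Coefficient of Variation (std / mean) of all pairwise Euclidean distances
    among points drawn uniformly from the unit hyper-cube (an assumed distribution)."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    # All-to-all distances: every unordered pair of points once.
    distances = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return statistics.pstdev(distances) / statistics.fmean(distances)


for d in (2, 5, 10, 100):
    print(f"{d:>3}-D: CV of distances = {distance_cv(200, d):.1%}")
```

The thing to look for is the trend rather than the exact values: the Coefficient of Variation should fall steadily as dimensions are added, meaning the distances bunch ever more tightly around their mean.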
High-dimension affects other things too
Apart from distances and volumes, the number of dimensions creates other practical problems. Solution run-time and system-memory requirements often escalate non-linearly as the number of dimensions increases. Due to the exponential increase in feasible solutions, many optimization methods cannot reach the global optimum and have to make do with a local optimum. Further, instead of a closed-form solution, optimization must use search based algorithms like gradient descent, genetic algorithms and simulated annealing. More dimensions also introduce the possibility of correlation among variables, which can make parameter estimation difficult in regression approaches.
Dealing with High-dimension
This will be a separate blog post in itself, but correlation analysis, clustering, information value, variance inflation factor and principal component analysis are some of the ways in which the number of dimensions can be reduced.
* The number of variables, attributes or features a data point is made up of is called the dimension of the data. For instance, any point in space can be represented using 3 co-ordinates of length, breadth and height, and so has 3 dimensions.