Skip to main content
Statistics LibreTexts

Understanding the Geometry of High Dimensional Data through Simulation

Introduction

From Section 2 of the research paper titled Geometric Representation of High Dimension, Low Sample Size Data (HDLSS) by Peter Hall, it discusses that as dimension of the data, \(d\), increases drastically, i.e. \(d \rightarrow \infty \),we expect to see two characteristics that are vastly different from low dimensional data, namely:

  1. All points are approximately equidistant from one another (i.e. their pair wise distances are nearly the same).
  2. Most points lie on boundary of the data cloud.

To gain some insight on this phenomenon, we will simulate data of high dimensions from a well-known distribution, the standard normal distribution, and observe if the data behaves according to the assumptions/expectations above. The simulation is performed using the the Monte Carlo method, which is briefly explained in the previous section here.

High Dimensional Data Simulation Process

Here, we will go over the basics on what one should think of when creating a simulation to visualize high dimensional data. The idea here is that we want to create a random sample of high-dimensional vectors, and observe how each of these vector (points) lie on the vector space from one another.

To illustrate, we want to sample a number of observations, denoted \(n\), from standard normal distribution (for simplification) with \(d\) dimensions. Then, each observation is a vector of the form \(X = [X_1, X_2, \dots, X_d]' \), where each component is from a normal distribution with mean 0 and standard deviation 1. Then, we will observe distances between these points by calculating and comparing the norms. The following steps are a guideline on how to set this up in R:

  1. Fix the number of dimensions, \(d\), and sample size, \(n\) however you want, but keep in mind that \(d > n\) for HDLSS.
  2. Sample \(d\) (dimension) data points from a standard normal distribution. This will be our first observation vector \(Z(d) = (Z_1 , . . . , Z_d )\) .
  3. Repeat (Step 1) \(n\) number of times to generate \(n\) samples. We now have our data set.
  4. For each vector in our data set, compute the square of the Euclidean norm and divide by \(d\). 
  5. Plot a histogram of Euclidean norm to examine, mean, standard deviation, and distribution.
  6. To examine the pair-wise distances between each observation, compute pair-wise differences of the samples after (Step 3) and proceed as before. 

Sample R Code

You can try the above in R using this sample code to get started:

hdlss_simulation = function(sample_size, dimension) {
    data = data.frame(replicated(sample_size, rnorm(dimension, 0,1)))
    #Generates the data and stores into a data frame. Each row corresponds to a single observation.
    
    squared_norm = apply(data, 1, function(sample) sum(sample^2)/dimension)
    #Calculate squared norm and divide each vector by its dimension.
    
    hist(squared_norm)
}

As for the pair-wise distance between each point, we can use the nifty function dist()in R. Here we are using the Euclidean distance:

pairdist = as.matrix(dist(t(data), method = "euclidean"))
pairs = pairdist[upper.tri(pairdist)

#Note 1: 'pairdist' is the distance between every observation vector in matrix form. To find the distances
between each row (i.e. observation), transpose the matrix because as.matrix has it reversed. 
#Note 2: Since the distance matrix is symmetric about the diagonal, upper.tri() serves to take only the
values above the diagonal. 

At this point, you can plot the histogram as before.

(Side note: To get a better visualization of how different sample sizes and dimensions affect or change the distribution of the norms, fix your histogram's x-axis. Another option is to try using a scatter plot with a fixed x-axis to see how the variance of the norm changes.)

What we expect to see

Let \(Z\) denote our vector of standard normal random variable.
Formula for Euclidean Norm : \(||X|| = \sqrt{X^TX}\)

Single Vector

Find distribution of \(\frac{Z^TZ}{d}\)

\(Z^TZ = \sum\limits_{i = 1}^d (Z^i)^2\)

Since each \(Z^i\) come from standard normal distribution, we have the square of a standard normal distribution which gives a \(X^{2}_{1}\), chi-square random variable with \(\mu = 1\) and \(\sigma^2 = 2\).

Given that we have d chi-square random variables, \(Z^TZ \sim X^{2}_{d}\), or chi-square with \(\mu = d\) and \(\sigma^2 = 2d\).

Dividing \(Z^TZ\) by \(d\), we get that \(\frac{Z^TZ}{d}\) is chi-square with \(\mu = 1\) and \(sigma^2 = 2/d\). 

Pair-wise Distances

Find distribution of \(\frac{\sum_{d} (Z^i - Z^j)^2}{d}\), where \(i \neq j\)

\(Z^i - Z^j \sim N(0, 2)\), because the expectation of difference between two standard normal is a standard normal random variable with the variance be the sum of the variance of the two variables.

Since we only know the square of a standard normal random variable, not the square of a \(N(0,2)\) random variable, therefore we divide \(Z^i - Z^j\) by \(\sqrt{2}\) to get \(\frac{Z^i - Z^j}{\sqrt{2}} \sim N(0,1)\).

Then \((\frac{Z^i - Z^j}{\sqrt{2}})^2 \sim N(0,1)^2 \sim X^{2}_{1}\). Given d of these random variables, we have \(\mu = d\), \(\sigma^2 = 2d\). To get only  \(\sum\limits_{d}(Z^i - Z^j)^2\), we multiply  \(\sum\limits_{d}(\frac{Z^i - Z^j}{\sqrt{2}})^2\) by 2. As a result, our \(\mu = 2d\) and \(\sigma^2 = 8d\).

Dividing \(\sum\limits_{d}({Z^i - Z^j})^2\) by d \(\mu = 2\) and \(\sigma^2 = 8/d\)

Dynamic Histogram of the Norms

The following link is a R-Shiny app to demonstrate how the histograms should look like. You can also select the sample size, norm of interest, and dimensions to see how the distributions change whenever any one of these parameters are changed. Please click here to access the Shiny app.

Analysis

Our simulation results matches our expectations. We see that as dimensions increase, the mean of of the data is centered at 1 and the standard deviation decreases for the norm. In other words, this means that the norms are approximately equal when we increase the dimensions significantly for a fixed sample size. For pair-wise differences between the points, we note that the mean is centered at 2 and the standard deviation decreases slowly as well. Likewise, this suggests that the as dimensions increase for a fixed sample size, the each point is approximately equidistant from one another. Hence, the points tend to lie on the border of the data cloud in high dimensions.

Contributors

  • Ariel Sim
  • Kavi Tan
  • Wolfgang Polonik