Principal Component Analysis (PCA): Data insights

The bane of any unsupervised learning algorithm is the number of dimensions, or features, in the data under analysis. The higher the number of features (represented as columns in a data matrix, with rows being the data records), the more the number of feature configurations of interest grows, and it grows exponentially.

Consider the example below:

a.      In a 1-D example, with 1 variable that can take 10 positions, the regions of interest are limited to those 10 positions.

b.      In a 2-D example, the same variable needs to be analyzed along 2 axes, and the number of configurations grows to 10 × 10 = 100 positions.

c.       In a 3-D example, the same analysis grows to 10 × 10 × 10 = 1,000 positions.

This is the Curse of Dimensionality. For an algorithm to learn from training data and generalize with the highest possible accuracy at a manageable compute cost, the focus is on ensuring that the features under consideration capture the most data variability.
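A tiny sketch of this growth in plain Python (the figure of 10 positions per axis is just the toy number used above):

```python
# Number of distinct positions to cover when each axis allows 10 values,
# as the number of dimensions grows.
for dims in (1, 2, 3, 10):
    print(f"{dims}-D: {10 ** dims:,} positions")
# 1-D: 10   2-D: 100   3-D: 1,000   10-D: 10,000,000,000
```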

Principal Component Analysis (PCA) identifies the principal components (axes) along which the data varies the most, i.e. the directions holding most of the information. In other words, instead of analysing the data across all n features (assume n = 10), PCA explains the data with p features, where p = n − q [assume n = 10 and q = 7 features dropped by PCA, so p = 3]. A little accuracy is sacrificed in exchange for faster processing and lower compute, storage and cost (lossy compression).
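A minimal sketch of this trade-off, assuming NumPy and scikit-learn are available; the 200 × 10 random dataset and the choice of 3 components are illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # 200 records, n = 10 toy features

X_std = StandardScaler().fit_transform(X)    # z-score standardization (Step 2 below)
pca = PCA(n_components=3)                    # keep p = 3 principal components
X_reduced = pca.fit_transform(X_std)         # 200 x 3 transformed dataset

print(X_reduced.shape)                       # (200, 3)
print(pca.explained_variance_ratio_.sum())   # share of variance retained
```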

PCA works through the steps below.

Step 1.      Let matrix A contain the data of adults in the age range of 20 to 30 years, with columns representing the features (Height, Weight, BMI, BP, Resting Heart Rate, SPO2, # of meals in a day, Smoking [categorical data], etc.) and rows representing the individual records.
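A minimal sketch of such a matrix A using pandas, with a few made-up records and a subset of the columns named above (all values are purely illustrative):

```python
import pandas as pd

# Hypothetical matrix A: rows are individual records, columns are features.
A = pd.DataFrame(
    {
        "Height_cm": [172, 158, 185, 167],
        "Weight_kg": [70, 58, 96, 81],
        "BMI": [23.7, 23.2, 28.1, 29.0],
        "Resting_Heart_Rate_bpm": [62, 71, 55, 78],
        "Smoking": [0, 1, 0, 1],   # categorical feature, encoded numerically
    }
)
print(A)
```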

Step 2.      The features can have different scales and hence need to be standardized to eliminate any bias.

Example:

Feature “Height”, in cm, can have a range of 150 to 193.

Feature “Resting Heart Rate”, in bpm, can have a range of 50 to 80.

Feature “Weight”, in kg, can have a range of 55 to 120.

Feature “Consumption_of_Fast_Food” can be a categorical value of 1 (2 meals in a week) or 2 (more than 2 meals in a week).

Standardization eliminates these disparities in scale of measure. The popular way of standardizing is the z-score [(value − mean) / standard deviation]. This ensures that every feature has a mean of 0 (zero) and a standard deviation of 1 (one).
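A minimal sketch of z-score standardization with NumPy only; the two-column toy matrix is made up:

```python
import numpy as np

X = np.array([[170.0, 60.0],
              [180.0, 75.0],
              [160.0, 90.0]])          # toy data: Height (cm), Weight (kg)

# z-score: (value - column mean) / column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # ~0 for every column
print(X_std.std(axis=0))    # 1 for every column
```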

Step 3.      On the standardized matrix, compute the covariance matrix of the columns. This helps in understanding how the columns depend on one another.

a.      The covariance matrix has the variances of the features on its principal diagonal, and the remaining elements are the covariances between pairs of features.

In a sample 3×3 covariance matrix, Cov(x,y) = Cov(y,x). The covariance matrix is a symmetric matrix (i.e. the upper and lower triangles of elements across the main diagonal are equal), and because the data has been standardized, the diagonal elements will all be approximately equal to 1.

[Sample covariance-matrix output: the diagonal elements (variances) are boxed in red, and matching off-diagonal elements are highlighted to depict the symmetry.]
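A minimal sketch of Step 3 with NumPy; the 100 × 3 toy matrix is made up. Note that np.cov treats rows as variables by default, hence rowvar=False for a records-as-rows matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                    # 100 records, 3 toy features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # Step 2: z-score

cov = np.cov(X_std, rowvar=False)                # 3 x 3 covariance matrix
print(np.round(cov, 3))                          # symmetric, diagonal ~1
print(np.allclose(cov, cov.T))                   # True: Cov(x, y) == Cov(y, x)
```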

Step 4.     On the covariance matrix generated, compute its eigenvalues and eigenvectors. Consider expressing the real number 12 as a product of prime factors:

12 = 2 × 2 × 3. This factorisation reveals properties of the number, for instance that any multiple of 12 is divisible by 3. In a similar way, the eigenvalues (scalars) and eigenvectors (column vectors) explain the characteristics of a matrix.

Eigenvectors and eigenvalues always come in pairs, i.e. for every eigenvector there is a corresponding eigenvalue.

a.      Eigenvalue:

This is a scalar indicating the variance, or information, explained along the corresponding direction. We consider the absolute values of these scalars, with a higher eigenvalue indicating more variance captured. The computed eigenvalues are sorted in descending order of absolute value. Assuming a 90% level of explained variance is acceptable, eigenvalues are selected from the top of the sorted list until their cumulative share meets the 90% requirement. In the example below, the last 3 components are dropped while still meeting the accepted level of 90%.

[Example output: the 14 unsorted eigenvalues (there are 14 features in the original dataset) and, after sorting in descending order, their cumulative percentages.]

b.      Eigenvector

With the selection of eigenvalues, the components that explain the maximum variance (contain the maximum information) are shortlisted. These shortlisted components need to be considered along with their directions/axes; hence the eigenvalue and eigenvector always come in pairs. While sorting the eigenvalues in descending order of absolute value, care is taken that their corresponding eigenvectors are carried along. An eigenvector indicates the direction of the axis associated with its eigenvalue.
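A minimal sketch of Step 4 with NumPy; the toy correlated data and the 90% threshold mirror the discussion above, but the actual eigenvalues here come from random data. np.linalg.eigh is used because the covariance matrix is symmetric, and eigenvalues and eigenvectors are sorted together so the pairs stay aligned:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy correlated data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)        # eigvecs[:, i] pairs with eigvals[i]

# Sort the pairs together, in descending order of absolute eigenvalue.
order = np.argsort(np.abs(eigvals))[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Cumulative share of variance, then the smallest k reaching the 90% level.
cumulative = np.cumsum(np.abs(eigvals)) / np.abs(eigvals).sum()
k = int(np.searchsorted(cumulative, 0.90) + 1)
print(np.round(cumulative, 3), "-> keep", k, "components")
```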

Step 5.     By dropping the last 3 components based on the acceptable level of 90%, the dimension of the data is reduced from 14 to 11. In other terms, the number of principal components retained for analysis is 11. The eigenvectors of these selected components are assembled, as columns, into a Feature Vector matrix.

Step 6:     The final step is to recast the data along the axes represented by the selected principal components. This is done by

TransformedDataset = Transpose[FeatureVector] x Transpose[StandardizedDataSet]

or, equivalently, with records as rows: TransformedDataset = [StandardizedDataSet] x [FeatureVector].
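A minimal sketch of Steps 5 and 6 together with NumPy; the toy data and the number of retained components are illustrative, and the names W and T for the Feature Vector and the transformed dataset are mine. The projection is written in the records-as-rows form X_std @ W:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))    # toy correlated data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)               # Step 2
cov = np.cov(X_std, rowvar=False)                          # Step 3

eigvals, eigvecs = np.linalg.eigh(cov)                     # Step 4
order = np.argsort(np.abs(eigvals))[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

cumulative = np.cumsum(np.abs(eigvals)) / np.abs(eigvals).sum()
k = int(np.searchsorted(cumulative, 0.90) + 1)

W = eigvecs[:, :k]        # Step 5: Feature Vector (one eigenvector per column)
T = X_std @ W             # Step 6: transformed dataset, shape (200, k)

print(W.shape, T.shape)
```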

In the obesity dataset used as an example above, the features were relatively simple. In a more complex scenario, such as analysing the locomotive behavior of C. elegans in the presence or absence of carcinogens in a human body fluid, the test can have many features that determine the movement of C. elegans.

The test under consideration will have variations beginning with the drag coefficient of the medium used (such as agar), the number of worms used, the length of each worm, the lifecycle stage of each worm, the placement of the worms on the Petri dish prior to the introduction of the sample, the placement of the microscope lens recording the movement of the worms, the characteristics of the sample used, and the crawling behavior (slow, sinusoidal undulations) in which the worms form grooves with their heads when moving on a medium such as agar gel. The movement of a worm can also include head thrusts that may need to be considered.

This multitude of variables, with varying degrees and scales of measure, would be a good use case to check the applicability of PCA for analysis, and it deepens my curiosity.

All rights reserved by www.orangehue.in @ 2024.