PCA in Python

The basic goal of PCA (principal component analysis) is to reduce the dimensionality of the data while keeping as much of its information (variance) as possible. The steps of PCA are explained below in a simple and straightforward manner. Let's get started.

Step 1: Standardize the data.

When the value ranges of the attributes differ widely, the results can be misleading.

For instance, an attribute in the range of 0 to 500 may dominate an attribute in the range of 0 to 50.

Standardization rescales every attribute to zero mean and unit variance, so that all attributes are on the same scale.

import numpy as np

import pandas as pd

from sklearn.preprocessing import StandardScaler

from sklearn.datasets import load_iris

from sklearn.utils import extmath


iris = load_iris()

# Load iris into a dataframe and set the field names


df = pd.DataFrame(iris['data'], columns=iris['feature_names'])

y = iris.target

df.head()

Outcome:

# Standardize the features and view the result

X = df.iloc[:, 0:4]

X_std = StandardScaler().fit_transform(X)

df = pd.DataFrame(X_std, columns=iris['feature_names'])

df.head()

Outcome:

Comparing the two outputs shows the difference between the original data and the standardized data.
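The standardization in Step 1 can also be reproduced by hand, which makes clear what StandardScaler does. The following is a minimal sketch, assuming the iris data from scikit-learn; the names X_raw, X_manual and X_scaled are illustrative and not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris()["data"]  # shape (150, 4)

# z-score by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0), as StandardScaler does
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The same transformation done by scikit-learn
X_scaled = StandardScaler().fit_transform(X_raw)

print(np.allclose(X_manual, X_scaled))  # True
```

After this step every column has a mean of (numerically) zero and a standard deviation of one, so no attribute dominates because of its units.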

Step 2: Compute the covariance matrix.

The covariance matrix is a square, symmetric matrix that holds the covariance of each pair of variables. It is used to identify relationships between the variables and, through them, redundant information. The data considered here has 4 variables (call them a, b, c and d), so the covariance matrix looks like this:

[[Covariance(a, a), Covariance(a, b), Covariance(a,c), Covariance(a, d)],

[Covariance(b, a), Covariance(b, b), Covariance(b, c), Covariance(b, d)],

[Covariance(c, a), Covariance(c, b), Covariance(c, c), Covariance(c, d)],

[Covariance(d, a), Covariance(d, b), Covariance(d, c), Covariance(d, d)]]

The diagonal elements are the variances of the individual variables, while the off-diagonal entries are the covariances between pairs of variables. Together, these values describe the magnitude and direction of the spread of multivariate data in multidimensional space. Examining them tells us how the data spreads across dimensions.

Positive covariance:

Covariance is positive when the sum of the products of the paired deviations from the means, (A - mean(A)) * (B - mean(B)), is positive. It signifies that the variables move in the same direction. To put it another way, if a value in variable A is greater, the corresponding value in variable B is likely to be higher as well. In summary, they are positively associated with each other.

Negative covariance:

A negative covariance signifies the inverse. That is to say, the two variables have a negative association. For instance, if a value in variable A is greater, the corresponding value in variable B is likely to be lower.

Near zero covariance:

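Covariance is close to zero when the deviations of the two variables from their means show no consistent pattern, so that the positive and negative products cancel out.

There is no linear relationship between the variables in this situation.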
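The three cases above can be illustrated with synthetic data. This is only a sketch under stated assumptions: the arrays a, b_pos, b_neg and b_ind are made up for the illustration, and sample_cov is a hypothetical helper that divides by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
noise = rng.normal(scale=0.1, size=1000)

b_pos = 2 * a + noise            # moves in the same direction as a
b_neg = -2 * a + noise           # moves in the opposite direction
b_ind = rng.normal(size=1000)    # generated independently of a


def sample_cov(x, y):
    """Sample covariance: mean product of deviations, divided by n - 1."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)


cov_pos = sample_cov(a, b_pos)
cov_neg = sample_cov(a, b_neg)
cov_ind = sample_cov(a, b_ind)

print("positive case :", cov_pos)
print("negative case :", cov_neg)
print("near-zero case:", cov_ind)
```

The sign of the result tells the direction of the relationship. The magnitude depends on the units of the variables, which is one more reason to standardize first.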

# Sample covariance of two variables (divides by n - 1)

def Covariance(x, y):

    XX, YY = x.mean(), y.mean()

    return np.sum((x - XX)*(y - YY))/(len(x) - 1)

# Covariance matrix: X holds one variable per row, entry (i, j) is Covariance(X[i], X[j])

def cov_mat(X):

    n_vars = len(X)

    mat = np.array([[Covariance(X[i], X[j]) for j in range(n_vars)] for i in range(n_vars)])

    return mat


# Calculate the covariance matrix of the standardized data (one variable per row)

Cov = cov_mat(X_std.T)

print(Cov)

Outcome:

[[ 1.00671141 -0.11835884  0.87760447  0.82343066]

[-0.11835884  1.00671141 -0.43131554 -0.36858315]

[ 0.87760447 -0.43131554  1.00671141  0.96932762]

[ 0.82343066 -0.36858315  0.96932762  1.00671141]]

Alternatively, we can compute the covariance matrix of the standardized data directly with NumPy:

Cov = np.cov(X_std.T)

Outcome:

array([[ 1.00671141, -0.11835884, 0.87760447, 0.82343066],

[-0.11835884, 1.00671141, -0.43131554, -0.36858315],

[ 0.87760447, -0.43131554, 1.00671141, 0.96932762],

[ 0.82343066, -0.36858315, 0.96932762, 1.00671141]])
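The diagonal of this matrix is 1.00671141 rather than exactly 1 because StandardScaler divides by the population standard deviation (n in the denominator) while np.cov divides by n - 1. With 150 samples the ratio is 150/149. The sketch below checks this; the names X_raw, X_scaled, cov_scaled and corr_raw are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris()["data"]
n = X_raw.shape[0]                       # 150 samples
X_scaled = StandardScaler().fit_transform(X_raw)

cov_scaled = np.cov(X_scaled.T)          # rows are variables, divides by n - 1
corr_raw = np.corrcoef(X_raw.T)          # correlation matrix of the raw data

print(n / (n - 1))                                        # 150/149
print(np.allclose(cov_scaled, corr_raw * n / (n - 1)))    # True
print(np.allclose(cov_scaled, cov_scaled.T))              # True: symmetric
```

In other words, the covariance matrix of standardized data is the correlation matrix of the original data, up to the factor n / (n - 1).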

Step 3: Determine the eigenvalues and eigenvectors.

(i) The eigenvalues represent the magnitude of the spread (the variance) of the data along the corresponding eigenvectors.

(ii) The eigenvectors are the new directions in N-dimensional space (the new axes). Ranked in decreasing order of eigenvalue, they give the principal components in order of importance.

(iii) Eigenvectors and eigenvalues always come in pairs, with one eigenvalue for each eigenvector.

(iv) The number of pairs is equal to the number of data dimensions.

(v) The amount of variance carried by each principal component is given by its eigenvalue, which is simply the coefficient attached to the corresponding eigenvector.
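The properties listed above can be checked numerically on a small example. This is a minimal sketch, assuming a made-up 2 x 2 symmetric matrix C standing in for a covariance matrix; it uses np.linalg.eigh, which is meant for symmetric matrices and returns the eigenvalues in ascending order.

```python
import numpy as np

# Illustrative symmetric matrix (not taken from the iris data)
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

eig_vals, eig_vecs = np.linalg.eigh(C)   # columns of eig_vecs are the eigenvectors

# One eigenvalue per eigenvector, and as many pairs as dimensions
print(eig_vals.shape, eig_vecs.shape)    # (2,) (2, 2)

# Defining property of each pair: C v = lambda v
for lam, v in zip(eig_vals, eig_vecs.T):
    print(np.allclose(C @ v, lam * v))   # True

# The eigenvalues add up to the total variance (the trace of C)
print(np.isclose(eig_vals.sum(), np.trace(C)))            # True

# The new axes are orthonormal
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(2)))      # True
```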


# np.linalg.eig returns the eigenvalues first, then a matrix whose columns are the eigenvectors

e_value, e_vector = np.linalg.eig(Cov)

# Fix the sign of each eigenvector so that the result is deterministic

e_vector, _ = extmath.svd_flip(e_vector, np.empty_like(e_vector).T)

print('Eigen value ',e_value)

print('Eigen vector ',e_vector)

Outcome:

The first print shows the four eigenvalues, and the second a 4 x 4 matrix whose columns are the corresponding eigenvectors. The eigenvalues add up to the trace of Cov, that is 4 * 1.00671141, or about 4.0268.

Step 4: Choose the principal components and project the data.

Sort the eigenvectors in descending order of their eigenvalues. This ranks the principal components in order of importance. At this stage, we decide whether to keep all of the components or to discard those with low eigenvalues, and then build a matrix from the remaining eigenvectors, called the feature vector. Because we keep only p eigenvectors (components) out of n, the final data set has only p dimensions.

# Sort the eigenpairs by decreasing eigenvalue and keep the n_components leading eigenvectors

n_components = 2

order = np.argsort(e_value)[::-1]

matrix_w = e_vector[:, order[:n_components]]

# Get the PCA reduced data

Xpca = X_std.dot(matrix_w)

df = pd.DataFrame(Xpca)

df.head()

Outcome:

These are the reduced features after performing PCA.
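As a sanity check, the manual procedure can be compared with scikit-learn's PCA class. The sketch below is self-contained and makes a few assumptions: it uses np.linalg.eigh instead of np.linalg.eig, the names X_scaled, X_manual, X_sklearn and explained_ratio are illustrative, and the projections are compared in absolute value because each component is only defined up to its sign.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris()["data"])

# Manual route: eigendecomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_scaled.T))
order = np.argsort(eig_vals)[::-1]              # largest eigenvalue first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Share of the total variance carried by each component
explained_ratio = eig_vals / eig_vals.sum()

# Project onto the two leading eigenvectors
X_manual = X_scaled @ eig_vecs[:, :2]

# Library route
pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X_scaled)

print(np.allclose(np.abs(X_manual), np.abs(X_sklearn)))             # True
print(np.allclose(explained_ratio[:2], pca.explained_variance_ratio_))  # True
print(explained_ratio.cumsum())   # cumulative variance kept by 1, 2, 3, 4 components
```

The cumulative explained variance is a common guide for choosing how many components to keep: components are added until the retained share is judged sufficient for the task.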