Introdução à Matemática Linear

Principal Component Analysis

The analytical technique that will end this journey toward understanding linear algebra is the principal component analysis (PCA). But don’t worry, as I bring good news! By now, you know everything that is needed to understand this methodology. In essence, the principal component analysis is a data projection into a new set of axes, or a change of basis that occurs via a linear transformation. Mappings are not a problem for us as we are now experts in manoeuvring vectors through space. So far, we know that when a linear transformation is involved, a matrix is needed, but we still need to understand the goal of this new method to define this linear transformation.

So I will put it out there. The principal component analysis will create a new set of axes called the principal axes, where we will project the data and get these so-called principal components. These are a linear combination of the original features that will be equipped with outstanding characteristics. Characteristics which are not only uncorrelated, but the first components also capture most of the variance in the data, which makes this methodology a good technique for reducing the dimensions of complex data sets. While often seen in a dimensionality reduction context, this technique also has other applications, as these new relationships are latent variables.

Cool, we know what it does and where we can use it, so let’s define this linear transformation. The word axes was involved in the explanation, meaning we need orthogonal vectors. We also know that by definition, if we have a symmetrical positive matrix, the eigenvectors are not only orthogonal but they also have positive and real eigenvalues. Multiplying a matrix by its transpose or the other way around will result in a symmetrical matrix, and with this, we can accommodate a matrix of any size.

Let’s pause and check where we are because I am sensing that the desired result is near. We have defined PCA and we’ve concluded we need to get a new set of axes onto which we can project data. The eigenvectors cover this, but there is one piece missing. These so called components needed to reflect variance. I want to bring to your attention a specific case of matrix multiplication where we multiply A^T by A. Let’s think about this operation in terms of dot products. We will end up with a symmetrical matrix where the non-diagonal entries represent how much the rows relate to the columns.

With this, we are one step away from a covariance matrix. The numbers in this particular matrix reflect how much variables vary with each other. The only thing we’re missing is a way to centre the data around the mean. Given that the covariance matrix is symmetric positive, its eigenvalues are positive and the eigenvectors are perpendicular. Considering that the eigenvalues represent a scaling factor that comes from a covariance matrix, the largest value will correspond to the highest direction of variance; therefore, the correspondent eigenvector will be the first principal component.

We can apply this same logic to each of the other vectors and then order them. The result will be the principal axes on which we can project the data to obtain the principal components. I know that’s a lot of words, but no numbers, and no equations! That’s not our style! So let’s consider an example, a fabricated one, but still, it will help us comprehend this technique. Say that we have the following set of data:

user totalBetsValuetotalWontotalDaysPlayedaverageBetSizetotalSessions 1 3000 0 4 30 4 2 10453 0 1 100 1 3 21500 4230 6 50 7 4 10000 2000 12 10 14 5 340 10 10 1 10 6 5430 2000 4 5 70 7 43200 4320 10 4 32 8 2450 100 8 5 12

The data can represent, for example, the features of users for some online casino games. Our goal is to transform this data with principal component analysis, so the first step is to calculate a covariance matrix. For that, we need to centre the data. To do this, we can use the following equation:

x − x− x ∗ = ------j- σj

(6.1)

This means that, for each element in column j, we will subtract the mean of the same column and then divide the results by the standard deviation of this same column. If we do this to each column, we have standardised the data. So let’s start by computing each column’s mean and standard deviation:

Metric totalBetsValuetotalWontotalDaysPlayedaverageBetSizetotalSessions − x j 12046.62 1582.50 6.87 25.62 18.75 σj 13329.42 1751.10 3.51 32.23 21.264

Applying equation 6.1, the table with standardised data is as follows:

use123r totalBet--0s00.V..7a611l82ue tot∗a--l001W...o995n001 tota∗lDays---P010l...a862y275edaverageB∗e020t...S137i416ze totalS∗e---s000s...i685o935ns ∗ 4 -0.15 0.24 1.46 - 0.48 -0.22 5 -0.88 -0.90 0.89 - 0.76 -0.41 6 -0.50 0.24 -0.82 - 0.64 2.41 7 2.34 1.56 0.89 - 0.67 0.62 8 -0.72 -0.85 0.32 - 0.64 -0.32

We are now ready for linear algebra, so let’s go from that ugly table into a more familiar format in which to display data, a matrix. And that is right, it will be good old matrix A! So let A be such that:

( ) | − 0.68 − 0.90 − 0.82 0.14 − 0.69| | | | − 0.12 − 0.90 − 1.67 2.31 − 0.83| || || | 0.71 1.51 − 0.25 0.76 − 0.55| || || | − 0.15 0.24 1.46 − 0.48 − 0.22| | | A = || − 0.88 − 0.90 0.89 − 0.76 − 0.41|| | | || − 0.50 0.24 − 0.82 − 0.64 2.41 || | | | 2.34 1.56 0.89 − 0.67 0.62 | || || ( − 0.72 − 0.85 0.32 − 0.64 − 0.32)

So A^T is:

( ) | − 0.68 − 0.12 0.71 − 0.15 − 0.88 − 0.50 2.34 − 0.72| | | | − 0.90 − 0.90 1.51 0.24 − 0.90 0.24 1.56 − 0.85| || || AT = | − 0.82 − 1.67 − 0.25 1.46 0.89 − 0.82 0.89 0.32 | || || | 0.14 2.31 0.76 − 0.48 − 0.76 − 0.64 − 0.67 − 0.64| | | ( − 0.69 − 0.83 − 0.55 − 0.22 − 0.41 2.41 0.62 − 0.32)

One more step, and we will have the desired covariance matrix. The only thing we’re missing is financial freedom, oh, oops… Sorry, my mind drifts away sometimes. I meant to say, we need to multiply A^T by A and then divide by the number of entries. Shall we call the resultant of this operation matrix M?

( ) 1.00 0.84 0.23 0.02 0.13 | | || 0.84 1.00 0.29 − 0.14 0.33 || | | AT A | 0.23 0.29 1.00 − 0.73 − 0.01| --8-- = M = || || | 0.02 − 0.14 − 0.73 1.00 − 0.47| | | |( 0.13 0.33 − 0.01 − 0.47 1.00 |)

So as expected, M is a symmetrical matrix. Where are the eigenvectors? There are several ways to get them. One is to use the eigen decomposition, so let’s start by exploring that. This numerical technique will return three matrices, two of which will have what we are after, the eigenvalues and the eigenvectors. This time I will use a computer to calculate the eigenvectors and eigenvalues. We are looking for a representation of M like:

− 1 M = P ΣP

Where P is:

( ) 0.45 0.53 0.44 − 0.43 0.34 | | | 0.53 0.42 − 0.38 0.60 − 0.13| || || | − 0.14 0.01 − 0.54 − 0.04 0.82 | P = || || | 0.69 − 0.68 − 0.11 − 0.19 0.05 | | | |( 0.00 − 0.24 0.58 0.64 0.42 |)

And:

( ) 2.29 0 0 0 0 | | | 0 1.48 0 0 0 | || || | 0 0 1.02 0 0 | Σ = | | || 0 0 0 0.14 0 || | | |( 0 0 0 0 0.08 |)

Right, so P is a matrix with the eigenvectors, and this will be where we find our principal axes. It happens to be already sorted by eigenvalue magnitude. On the other hand, in the matrix Σ we have all the eigenvalues. These represent what can be called the explainability of the variance, how much of the variance present in the data is ”captured” by each component. This means we can derive excellent insights from these magnitudes when we want to apply principal component analysis as a dimensionality reduction technique. For example, we could percentage the eigenvalues and then understand how much variance they each explain:

∑5 λi = 5 i=1

(6.2)

We have five eigenvalues, and we can sum them using 6.2, where λ_i represents the eigenvalues. Now, by dividing each of them by five, we find the individual percentage of variance that each distinct eigenvalue explains.

PPPPP01234543222CCCCC00000600V12345ariance explained %

Figure 6.1: The amount of information captured by each component.

We can see that two components, namely PC1 and PC2 explain 76% of the variability. Considering the case of dimensionality reduction, let’s pick the two largest eigenvalues and transform our original data set to create the first two components for each user in the data set. Choosing the first two components will reduce the columns of P. We will call this new version P_t, and it has the following format:

( ) 0.45 0.53 || || | 0.53 0.42 | | | || − 0.14 0.01 || Pt = | | || 0.69 − 0.68|| | | ( 0.00 − 0.24)

The linear combinations that represent the first two components are described by the following two equations:

PC1 = 0.45 ⋅ totalBetsValue + 0.53 ⋅ totalWon − 0.14 ⋅ totalDaysPlayed +0.69 ⋅ averageBetSize + 0.00 ⋅ totalSessions

(6.3)

PC2 = 0.53 ⋅ totalBetsValue + 0.42 ⋅ totalWon + 0.01 ⋅ totalDaysPlayed − 0.68 ⋅ averageBetSize − 0.24 ⋅ totalSessions

(6.4)

The only thing missing is to project the data into this new space. We need to transform our versions of the scaled data with the matrix P_t. For that, we can use the following equation:

T Mreduced = M8 ×5 ⋅ P 5×2

(6.5)

By performing the matrix multiplication in 6.5, we will create the first and second components for all users in the features set:

user PC1 PC2 1 -1.46 -0.26 2 -2.56 1.69 3 0.51 1.65 4 0.84 -0.81 5 -0.30 -1.60 6 0.63 -0.55 7 2.81 1.10 8 -0.48 -1.22

These components are also called latent or hidden variables; relationships that are hidden in the data and are the result of linear combinations. It is possible to give these some meaning (a way of interpretation), and for that, we can use the expressions that define them, as shown in 6.3 and 6.4 that define them.

I want to point out that this step is dependent on the business problem and what you think makes most sense. This is probably one of the only times when mathematics is similar to poetry. Some will disagree with me and say that equations are pure poetry, and whilst I may not have reached that level of sensitivity, I do wonder what it’s like to live with that state of mind. Maybe it is excellent.

Anyway, back to our example. So, let’s say that principal component number one could be intensity. The higher positive coefficients in equation 6.3 are those that correspond to totalBets, totalWon, and averageBetSize. The second principal component is harder to attribute a meaning to, but risk exposure could make sense as these players have significant losses:

tt−−123−−123oottaa3232llBSeetsssViaolnuse

Figure 6.2: A view of the totalSession and the totalBetsValue.

Seeing as we reduced the data from five dimensions to two and we attributed some meaning to the new components, we can plot these new coordinates and see if anything interesting comes from it:

ir−−−123−−123nitsek32132nsietxyposure

Figure 6.3: A peak into the secret space! First two components.

This is somehow an interesting perspective in the context of gambling. Some players like to risk their balance but play in a less intense session. We can also observe a case of a player that enjoys risk and intensity. These hidden variables are excellent indicators for situations where human emotion is involved, as they can capture some of that temperament.

There are other methods that can be leveraged to get to the principal components. While we have so far explored the eigen decomposition, it is also possible to calculate these metrics via the single value decomposition. We defined M as a covariance matrix while A was the matrix with entries that are a standardised version of the data in the original features table. So, M can be defined by the following equation:

AT ⋅ A M = -------- N

(6.6)

In equation 6.6, N is the number of rows in matrix A. Now, say that, instead of performing an eigen decomposition on M, we choose to use the single value decomposition and we do it on matrix A. It follows that:

A = U ΣV T

Substituting this information into equation 6.6 it follows that:

T T T (U--⋅ Σ-⋅ V--)--(U-⋅-Σ-⋅-V--) M = N

We can manipulate that equation a little bit:

T T T V--⋅-Σ---⋅ U--⋅ U-⋅-Σ-⋅-V-- M = N

Finally:

2 T M = V--⋅-Σ--⋅-V-- N

(6.7)

From here, we can conclude that the right singular vectors are the principal directions. Following the same logic, just as we did with the eigen decomposition, we will obtain the principal components by:

AV

And A is equal to UΣV ^T, so:

T AV = U ΣV V

Which means that:

AV = U Σ

So the principal components are given by UΣ where the entries of Σ are calculated by:

σ2i λi = --- N

Conclusion

PCA is a crucial tool for dimensionality reduction, helping to simplify data analysis by transforming high-dimensional data into a lower-dimensional space while preserving as much variance as possible. Understanding and applying PCA enables effective data reduction and visualization, providing valuable insights into complex datasets.