
© Copyright Quantopian Inc.
© Modifications Copyright QuantRocket LLC
Licensed under the Creative Commons Attribution 4.0.


Principal Component Analysis

by Rene Zhang and Max Margenot

Applications in many fields, such as image processing, bioinformatics, and quantitative finance, involve large-scale data. Both the size and complexity of this data can make the computations required for analysis practically infeasible. Principal Component Analysis (PCA) is a classical method for dimension reduction: it uses the first several principal components, statistical features that explain most of the variation of an $m \times n$ data matrix $\mathbf{X}$, to describe the matrix economically.

We will introduce PCA with an image processing example. A grayscale digital image can be represented by a matrix whose $(i,j)^{th}$ entry corresponds to the grayscale measurement at the $(i,j)^{th}$ pixel. The following grayscale image has $200 \times 200$ pixels (the size can be changed in the code below). We store it in a matrix $\mathbf{X}$, so $\mathbf{X}$ has $200$ rows and $200$ columns.

We start with a simple checkerboard pattern, add some random normal noise, and add a gradient.
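A minimal sketch of this construction, assuming a $200 \times 200$ image; the block size, noise scale, and gradient direction are illustrative choices, not the lecture's original values:

```python
import numpy as np
import matplotlib.pyplot as plt

m, n = 200, 200
block = 20  # assumed size of each checkerboard square, in pixels

# Checkerboard: squares alternate between 0 and 1.
ii, jj = np.indices((m, n))
checkerboard = ((ii // block) + (jj // block)) % 2

# Random normal noise plus a left-to-right brightness gradient.
noise = np.random.normal(0, 0.1, (m, n))
gradient = np.linspace(0, 1, n)[np.newaxis, :]  # broadcasts across rows

X = checkerboard + noise + gradient

plt.imshow(X, cmap="gray")
plt.show()
```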

Treat each row as a variable, with observations in the columns. Denote the covariance matrix of $\mathbf{X}$ as $\mathbf{C}$, where the size of $\mathbf{C}$ is $m \times m$. $\mathbf{C}$ is the matrix whose $(i,j)^{th}$ entry is the covariance between the $i^{th}$ row and the $j^{th}$ row of $\mathbf{X}$.
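Continuing the sketch, np.cov treats each row as a variable by default (rowvar=True), which matches this convention:

```python
C = np.cov(X)      # covariance between the rows of X
print(C.shape)     # (200, 200), i.e. m x m
```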

Performing principal component analysis decomposes the matrix $\mathbf{C}$ into:

$$\mathbf{C} = \mathbf{L}\mathbf{P}\mathbf{L}^{\top},$$

where $\mathbf{P}$ is a diagonal matrix $\mathbf{P}=\text{diag}(\lambda_1,\lambda_2,\dots,\lambda_m)$, with $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_m \geq 0$ being the eigenvalues of matrix $\mathbf{C}$. The matrix $\mathbf{L}$ is an orthogonal matrix, consisting of the eigenvectors of matrix $\mathbf{C}$.

The function LA.eigh returns the eigenvalues in ascending order in $P$. Let us first reverse the order so they run from largest to smallest, and verify that $\mathbf{L}\mathbf{P}\mathbf{L}^{\top}=\mathbf{C}$.
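A sketch of this step, assuming the matrix C from above (LA here is numpy.linalg, and eigh returns the eigenvalues as a vector in ascending order):

```python
import numpy.linalg as LA

P, L = LA.eigh(C)   # eigenvalues ascending; eigenvectors in the columns of L
P = P[::-1]         # reorder: largest eigenvalue first
L = L[:, ::-1]      # reorder the eigenvectors to match

# Verify the decomposition reconstructs C (up to floating-point error).
print(np.allclose(L.dot(np.diag(P)).dot(L.T), C))  # True
```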

Here we plot all of the eigenvalues:
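A minimal plotting sketch, using the reordered eigenvalue vector P from the previous step:

```python
plt.plot(P, "-o")
plt.xlabel("eigenvalue index (largest first)")
plt.ylabel("eigenvalue")
plt.show()
```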

The $i^{th}$ principal component is given as the $i^{th}$ row of $\mathbf{V}$, where

$$\mathbf{V} =\mathbf{L}^{\top} \mathbf{X}.$$

If we multiply both sides on the left by $\mathbf{L}$, we get the following:

$$\mathbf{L}\mathbf{L}^{\top} \mathbf{X}= \mathbf{L}\mathbf{V}.$$

The matrix $\mathbf{L}$ consists of the eigenvectors of a symmetric covariance matrix, so it is orthogonal: $\mathbf{L}\mathbf{L}^{\top} = \mathbf{I}$ and hence $\mathbf{L}\mathbf{L}^{\top}\mathbf{X} = \mathbf{X}$. The relationship among the matrices $\mathbf{X}$, $\mathbf{L}$, and $\mathbf{V}$ can therefore be expressed as

$$\mathbf{X} = \mathbf{L}\mathbf{V}.$$

To approximate $\mathbf{X}$, we use the $k$ eigenvectors with the largest eigenvalues:

$$\mathbf{X} \approx \mathbf{L}[:, 1:k]\,\mathbf{L}[:, 1:k]^{\top} \mathbf{X}.$$

Denote the approximated $\mathbf{X}$ as $\tilde{\mathbf{X}} = \mathbf{L}[:, 1:k]\,\mathbf{L}[:, 1:k]^{\top} \mathbf{X}$. When $k = m$, $\tilde{\mathbf{X}}$ should be the same as $\mathbf{X}$.
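A sketch of the rank-$k$ approximation (the helper name approximate is ours, not the lecture's):

```python
def approximate(X, L, k):
    """Approximate X using the k eigenvectors with the largest eigenvalues."""
    Lk = L[:, :k]
    return Lk.dot(Lk.T).dot(X)

# With k = m, the approximation recovers X up to rounding error.
print(np.allclose(approximate(X, L, X.shape[0]), X))  # True
```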

The proportion of total variance due to the $i^{th}$ principal component is given by the ratio $\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_m}.$ These proportions sum to $1$. Since $\lambda_i$ is the $i^{th}$ diagonal entry of $\mathbf{P}$,

$$\sum_{i}\frac{\lambda_i}{\operatorname{trace}(\mathbf{P})} = 1,$$

where $\operatorname{trace}(\mathbf{P})$ is the sum of the diagonal entries of $\mathbf{P}$.
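A quick check of this identity, using the eigenvalue vector P from earlier:

```python
variance_ratio = P / P.sum()   # proportion of total variance per component
print(variance_ratio[:5])      # the leading components dominate
print(variance_ratio.sum())    # 1.0
```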

Recall that the number of principal components kept is denoted $k$. Let $k$ be $10$, $20$, $30$, and $60$ as examples and take a look at the corresponding approximated images.
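A sketch of this comparison, reusing the approximate helper defined above (the figure layout is an assumption):

```python
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, k in zip(axes, [10, 20, 30, 60]):
    ax.imshow(approximate(X, L, k), cmap="gray")
    ax.set_title("k = %d" % k)
plt.show()
```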

The number of variables in $X$ is $200$. When reducing the dimension to $k=60$, which uses only $30\%$ of the principal components, the approximated image is close to the original one.

Moving forward, we do not have to do PCA by hand; luckily, scikit-learn has an implementation that we can use. Next, let us look at an example in quantitative finance using sklearn.
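As a bridge, here is a minimal sklearn sketch on the image matrix. scikit-learn expects observations in rows and variables in columns, so we transpose X (our variables were the rows); centering and the eigendecomposition are handled internally:

```python
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X.T)
print(pca.explained_variance_ratio_[:5])  # proportion of variance per component
```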

PCA on a Portfolio

Construct a portfolio of 10 stocks: IBM, MSFT, FB, T, INTC, ABX, NEM, AU, AEM, GFI. Five of them are technology-related and five are gold mining companies.

In this case, there are 10 variables (companies), and each column is a variable.
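A self-contained sketch of the setup. Random draws stand in for real data here (and will not show the structure real returns would); in practice, load a DataFrame of daily returns for these tickers from your data provider:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

tickers = ["IBM", "MSFT", "FB", "T", "INTC", "ABX", "NEM", "AU", "AEM", "GFI"]

# Stand-in for real daily returns: one row per day, one column per ticker.
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(250, len(tickers))),
                       columns=tickers)

pca = PCA(n_components=2)  # keep the first two principal components
pca.fit(returns)
print(pca.explained_variance_ratio_)
```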

Notice that the bulk of the variance of the returns of these assets can be explained by the first two principal components.

Now we collect the first two principal components and plot their contributions.
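One way to plot those contributions, continuing from the sketch above (the DataFrame layout is an assumption, not the lecture's original code):

```python
import matplotlib.pyplot as plt

factor_exposures = pd.DataFrame(
    pca.components_,                  # one row per principal component
    index=["factor 1", "factor 2"],
    columns=tickers,
).T

factor_exposures.plot(kind="bar")
plt.ylabel("weight")
plt.show()
```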

From these principal components we can construct "statistical risk factors", similar to more conventional common risk factors. These should give us an idea of how much of the portfolio's returns comes from some unobservable statistical feature.

The factor returns here are an analogue to the principal component matrix $\mathbf{V}$ in the image processing example.
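A sketch of how such factor returns can be computed, continuing from above; pca.transform centers the returns and projects them onto the components:

```python
factor_returns = pd.DataFrame(
    pca.transform(returns),           # one return series per statistical factor
    index=returns.index,
    columns=["factor 1", "factor 2"],
)
print(factor_returns.head())
```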

The factor exposures are an analogue to the eigenvector matrix $\mathbf{L}$ in the image processing example.

Creating statistical risk factors allows us to further break down the returns of a portfolio to get a better idea of the risk. This can be used as an additional step after performance attribution with more common risk factors to try to account for additional unknown risks.





This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian") or QuantRocket LLC ("QuantRocket"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, neither Quantopian nor QuantRocket has taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information believed to be reliable at the time of publication. Neither Quantopian nor QuantRocket makes any guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.