Why PCA is used

PCA creates a visualization of data that minimizes residual variance in the least squares sense and maximizes the variance of the projection coordinates.

In a previous article, we explained why pre-treating data for PCA is necessary. Consider a matrix X with N rows (the "observations") and K columns (the "variables"). For this matrix, we construct a variable space with as many dimensions as there are variables (see the figure below). Each variable represents one coordinate axis. For each variable, the length has been standardized according to a scaling criterion, normally by scaling to unit variance.

You can find more details on scaling to unit variance in the previous blog post. The figure shows a K-dimensional variable space; for simplicity, only three variable axes are displayed. In the next step, each observation (row) of the X-matrix is placed in the K-dimensional variable space. Consequently, the rows of the data table form a swarm of points in this space: the observations (rows) in the data matrix X can be understood as a swarm of points in the variable space (K-space).
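As an illustration only (this sketch assumes NumPy and uses made-up numbers, not data from the article), an N-by-K data matrix can be set up like this:

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 100, 3                       # 100 observations, 3 variables
    # Each variable gets its own average and spread, so the columns live on
    # different scales before any pre-treatment.
    X = rng.normal(loc=[10.0, 0.5, -3.0], scale=[2.0, 0.1, 5.0], size=(N, K))
    print(X.shape)                      # (100, 3): a swarm of 100 points in 3-D space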

Next comes mean-centering, the subtraction of the variable averages from the data. In the mean-centering procedure, you first compute the variable averages. This vector of averages corresponds to a point in the K-space (shown in red in the figure), situated in the middle of the point swarm at its center of gravity. Subtracting the averages from the data then corresponds to re-positioning the coordinate system so that the average point becomes the origin.
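As a minimal sketch of this step (NumPy assumed, with illustrative numbers), mean-centering followed by scaling to unit variance looks like this:

    import numpy as np

    X = np.array([[2.0, 40.0],
                  [4.0, 60.0],
                  [6.0, 80.0]])          # 3 observations, 2 variables

    means = X.mean(axis=0)               # the variable averages (the "average point")
    stds = X.std(axis=0, ddof=1)         # sample standard deviations

    X_centered = X - means               # origin moved to the average point
    X_scaled = X_centered / stds         # each variable now has unit variance

    print(X_scaled.mean(axis=0))         # approximately [0, 0]
    print(X_scaled.std(axis=0, ddof=1))  # [1, 1]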

The mean-centering procedure corresponds to moving the origin of the coordinate system to coincide with the average point (shown in red). After mean-centering and scaling to unit variance, the data set is ready for computation of the first summary index, the first principal component (PC1).

This component is the line in the K-dimensional variable space that best approximates the data in the least squares sense. The line goes through the average point. Each observation (a yellow dot in the figure) may now be projected onto this line in order to get a coordinate value along the PC line. This new coordinate value is known as the score.
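One way to compute PC1 and the scores along it is a singular value decomposition of the mean-centered data; the following is a rough sketch (NumPy assumed, random data used purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 3))         # 50 observations, 3 variables
    Xc = X - X.mean(axis=0)              # mean-centered data

    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Vt[0]                          # unit vector along the first principal component

    scores_pc1 = Xc @ pc1                # projection of each observation onto PC1
    print(scores_pc1[:5])                # the first five score values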

The first principal component (PC1) is the line that best accounts for the shape of the point swarm; it represents the direction of maximum variance in the data. Projecting each observation onto this line gives its score. Usually, one summary index or principal component is insufficient to model the systematic variation of a data set.

Thus, a second summary index, a second principal component (PC2), is calculated. Returning to pre-treatment: the reason it is critical to perform standardization prior to PCA is that PCA is quite sensitive to the variances of the initial variables.

That is, if there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate over those with small ranges. For example, a variable with a much wider range will dominate over a variable that ranges between 0 and 1, which will lead to biased results.

So, transforming the data to comparable scales can prevent this problem. Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.
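If scikit-learn is available, the same standardization (subtract the mean, divide by the standard deviation) can be done with StandardScaler; this is only a sketch with made-up numbers:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])         # the second variable has a much larger range

    Z = StandardScaler().fit_transform(X)
    print(Z.mean(axis=0))                # approximately [0, 0]
    print(Z.std(axis=0))                 # [1, 1] (population standard deviation)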

The aim of the next step, computing the covariance matrix, is to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes variables are highly correlated in such a way that they contain redundant information, and it is the covariance matrix that lets us identify these correlations. What do the covariances that appear as entries of the matrix tell us about the correlations between the variables? In short, it is the sign that matters: a positive covariance means the two variables increase or decrease together (they are correlated), while a negative covariance means one increases when the other decreases (they are inversely correlated).
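As a rough sketch (NumPy assumed, with synthetic data in which two variables are deliberately made to co-vary), the covariance matrix of the standardized data can be computed like this:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)    # variables 0 and 2 co-vary

    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # standardized data
    cov = np.cov(Z, rowvar=False)        # K-by-K covariance matrix (columns = variables)
    print(np.round(cov, 2))              # a large off-diagonal entry links variables 0 and 2

Note that with unit-variance scaling, this covariance matrix is also the correlation matrix of the original variables.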

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data.
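A minimal sketch of that computation (NumPy assumed; the covariance matrix below is illustrative):

    import numpy as np

    cov = np.array([[1.0, 0.8, 0.3],
                    [0.8, 1.0, 0.2],
                    [0.3, 0.2, 1.0]])    # an illustrative covariance matrix

    # eigh is intended for symmetric matrices such as a covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort from largest to smallest eigenvalue: the first column is then PC1.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    print(eigenvalues)                   # the variance carried by each component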

Principal components are new variables that are constructed as linear combinations, or mixtures, of the initial variables. These combinations are done in such a way that the new variables (i.e., the principal components) are uncorrelated and most of the information within the initial variables is compressed into the first components. So the idea is: 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until you have something like what is shown in the scree plot below. Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as your new variables.
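As a sketch of this idea (assuming scikit-learn; the random data and the two-component cut-off are illustrative, not a recommendation):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 10))       # 10 variables, so up to 10 components

    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_) # the values a scree plot is drawn from

    # Keep only the first two components as the new variables.
    X_reduced = PCA(n_components=2).fit_transform(X)
    print(X_reduced.shape)               # (100, 2)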

Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most of the information in the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it; and the larger the dispersion along a line, the more information it carries. To put all this simply, just think of principal components as new axes that provide the best angle from which to see and evaluate the data, so that the differences between the observations are more visible.

There are as many principal components as there are variables in the data. The principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set. The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance. This continues until a total of p principal components have been calculated, equal to the original number of variables.
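These two properties, decreasing variance and mutual uncorrelatedness, can be checked numerically; a small sketch (scikit-learn assumed, random data for illustration):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 4))        # p = 4 variables

    scores = PCA().fit_transform(X)      # scores for all p = 4 principal components
    print(scores.var(axis=0, ddof=1))    # variances in decreasing order
    print(np.round(np.corrcoef(scores, rowvar=False), 6))   # close to the identity matrix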

What you first need to know about eigenvectors and eigenvalues is that they always come in pairs: every eigenvector has an eigenvalue.



