Chemometrics pills: PCA

Today I want to talk a bit about chemometrics!

What is chemometrics?

Chemometrics is “a discipline that uses statistical tools to provide maximum information by analyzing chemical data”. Several statistical methods can be used to extract information from data. One of the most widely used is PCA.

What is PCA?

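PCA is an abbreviation for Principal Component Analysis. It is a way to extract the most valuable information from datasets with thousands of variables, by summarizing them into a few new variables (the principal components) that capture most of the variation in the data.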

Let’s take an example from an old post of mine on spectroscopy. In that post I explained how food scientists can use spectroscopy to assess the authenticity of a product (in my example, almonds). The almonds are analyzed, and the result of the analysis is a matrix containing n objects (the samples) described by m variables. There can be up to 5000 variables, so it is virtually impossible to understand the data using “classical” (so-called univariate) statistics, which look at one variable at a time. That is why we have to use “multivariate statistics”. By applying PCA it was possible to distinguish the almonds according to their geographical origin. But…

How does the PCA work?

Since I am Italian, I decided to explain PCA using pasta. Let’s assume that I analyzed my samples and obtained a large amount of data, something like this:

The data
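To make the idea of a data matrix a bit more concrete, here is a minimal Python sketch using NumPy. The numbers are random placeholders that I made up for illustration (they are not real measurements), and the sizes (6 objects, 5000 variables) are just assumptions chosen to echo the spectroscopy example.

```python
import numpy as np

# Hypothetical stand-in for real measurements: random numbers only.
# Rows are the objects (samples), columns are the variables.
rng = np.random.default_rng(0)
n_objects, m_variables = 6, 5000
X = rng.normal(size=(n_objects, m_variables))

# Univariate statistics look at a single variable (one column) at a time...
one_variable = X[:, 0]

# ...while multivariate methods such as PCA use the whole matrix at once,
# usually after subtracting the mean of each variable (mean-centering).
X_centered = X - X.mean(axis=0)

print(X.shape)             # objects x variables
print(one_variable.shape)  # one value per object
```

With thousands of columns, inspecting them one by one quickly becomes impractical, which is where PCA comes in.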

It is very difficult to extract information from all these data, so it is essential to apply a chemometric tool. Let’s say that I want to classify all these types of pasta, so I apply PCA. This statistical approach allows me to reduce the amount of data while keeping the most valuable information. The result of a PCA is usually shown as a Cartesian graph in which the x axis is called PC1 (Principal Component 1, the direction that captures the largest share of the variation in the data) and the y axis is called PC2 (Principal Component 2, which captures the second largest share). In our case, the result of the PCA is the following:

The result of the PCA
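For the curious, the coordinates on such a graph can be computed in a few lines. The sketch below is one common way to do it (mean-centering followed by a singular value decomposition), not necessarily how the graph above was produced. The pasta measurements and the helper name `pca_scores` are invented for illustration only.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Hypothetical helper: PCA via SVD on mean-centered data.

    Returns the scores (coordinates of each object on the PCs),
    the loadings (weight of each variable in each PC) and the
    fraction of the total variance explained by each PC.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]

# Invented numbers, NOT real data.
# Columns: length (mm), width (mm), cooking time (min).
names = ["anellini", "penne", "spaghetti", "lasagne"]
X = np.array([
    [8.0, 8.0, 6.0],
    [40.0, 10.0, 11.0],
    [250.0, 2.0, 9.0],
    [180.0, 80.0, 20.0],
])

scores, loadings, explained = pca_scores(X)
for name, (pc1, pc2) in zip(names, scores):
    print(f"{name:10s} PC1={pc1:9.2f} PC2={pc2:9.2f}")
print("explained variance ratio:", np.round(explained, 3))
```

Plotting the first column of `scores` against the second gives the kind of PC1 vs PC2 graph discussed here. In practice the variables are often also scaled to unit variance before PCA when they are measured in different units; I left that out to keep the sketch short.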

How can we interpret the PCA graph?

If you have a look at the picture, you will notice that:

  • the different types of pasta are grouped,
  • PC1 seems to be related to the size of the pasta: the smaller the pasta (anellini), the lower the value of PC1; the longer the pasta (spaghetti), the higher the value of PC1,
  • PC2 is more difficult to understand, but if you are a true pasta lover, you will deduce that PC2 is related to the cooking time of the pasta: the shorter the time (anellini), the lower the value of PC2; the longer the time (lasagne), the higher the value of PC2.

The PCA explained
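Instead of guessing from the picture, the meaning of each PC can also be checked numerically by looking at the loadings, that is, how strongly each original variable contributes to each component. Here is a small sketch on synthetic data. The variable names (`size`, `cooking_time`, `noise`) and the numbers are my own assumptions, built so that the first variable varies the most and the second comes next.

```python
import numpy as np

# Synthetic data, invented for illustration: 50 objects, 3 variables.
# The spread of each variable is set on purpose (10, 1 and 0.1).
rng = np.random.default_rng(42)
variables = ["size", "cooking_time", "noise"]
X = rng.normal(size=(50, 3)) * np.array([10.0, 1.0, 0.1])

# PCA via SVD on the mean-centered matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt[:2].T  # shape: (variables, components)

# For each PC, report the variable with the largest absolute loading.
for k in range(2):
    top = int(np.argmax(np.abs(loadings[:, k])))
    print(f"PC{k + 1} is dominated by '{variables[top]}'")
```

Only the absolute value of a loading matters for this kind of reading, because the sign of a principal component is arbitrary.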

If you have more questions on PCA, don’t hesitate to ask me!

References

Adams, “Chemometrics in Analytical Spectroscopy”, RSC

Varmuza and Filzmoser, “Introduction to Multivariate Statistical Analysis in Chemometrics”, CRC Press
