PCA – hierarchical tree – partition: Why do we need to choose for visualizing data?

Principal component methods such as PCA (principal component analysis) or MCA (multiple correspondence analysis) can be used as a pre-processing step before clustering.

But principal component methods give also a framework to visualize data. Thus, the clustering methods can be represented onto the map provided by the principal component method. In the figure below, the hierarchical tree is represented in 3D onto the principal component map (using the first 2 component obtained with PCA). And then, a partition has been done and individuals are coloured according to their belonging cluster.

arbre_temperature

Thus, the graph gives simultaneously the information given by  the principal component map, the hierarchical tree and the clusters (see th function HCPC in the FactoMineR package).

library(FactoMineR) 

temperature <- read.table("http://factominer.free.fr/livre/temperat.csv",
       header=TRUE, sep=";", dec=".", row.names=1)

res.pca <- PCA(temperature[1:23,], scale.unit=TRUE, ncp=Inf,
      graph = FALSE,quanti.sup=13:16,quali.sup=17)

res.hcpc <- HCPC(res.pca) 

The approaches complement one another in two ways:

  • firstly, a continuous view (the trend identified by the principal components) and a discontinuous view (the clusters) of the same data set are both represented in a unique framework;
  • secondly, the two-dimensional map provides no information about the position of the individuals in the other dimensions; the tree and the clusters, defined from more dimensions, offer some information “outside of the map”; two individuals close together on the map can be in the same cluster (and therefore not too far from one another along the other dimensions) or in two different clusters (as they are far from one another along other dimensions).

So why do we need to choose when we want to better visualize the data?

The example shows the common use of PCA and clustering methods, but rather than PCA we can use correspondence analysis on contingency tables, or multiple correspondence analysis on categorical variables.

If you want to learn more, you can see this video, or you cab enroll in this MOOC (free) and you can see this unpublished paper.

Advertisements

3 thoughts on “PCA – hierarchical tree – partition: Why do we need to choose for visualizing data?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s