Missing values imputation with missMDA

“The best thing to do with missing values is not to have any” (Gertrude Mary Cox)

Unfortunately, missing values are ubiquitous and occur for plenty of reasons. One solution is single imputation, which consists of replacing missing entries with plausible values. It leads to a complete dataset that can be analyzed by any statistical method.

Based on dimensionality reduction methods, the missMDA package imputes large and complex datasets with quantitative, categorical or mixed variables. It does so with principal component methods, which take into account both the similarities between the observations and the relationships between the variables. It has proven to be very competitive in terms of prediction quality compared with state-of-the-art methods.

With 3 lines of code, we impute the orange dataset available in missMDA:

library(missMDA)
data(orange)
nbdim <- estim_ncpPCA(orange) # estimate the number of dimensions to impute
res.comp <- imputePCA(orange, ncp = nbdim$ncp) # the completed table is in res.comp$completeObs

In the same way, imputeMCA imputes datasets with categorical variables and imputeFAMD imputes mixed datasets.
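To give an intuition for what a principal component method does when it imputes, here is a minimal base-R sketch of iterative PCA: missing cells start at the column mean, a low-rank PCA reconstruction is fitted, and only the missing cells are updated until they stabilise. This is an illustration under simplifying assumptions (no scaling, no regularisation), not the code of imputePCA, which by default relies on a regularised variant of this idea. The function name impute_pca_sketch and the simulated data are hypothetical.

```r
# Illustrative iterative PCA imputation (hypothetical helper, base R only).
impute_pca_sketch <- function(X, ncp = 2, max_iter = 500, tol = 1e-8) {
  X <- as.matrix(X)
  miss <- is.na(X)
  # Step 0: fill missing cells with the column means
  col_means <- colMeans(X, na.rm = TRUE)
  Xhat <- X
  Xhat[miss] <- col_means[col(X)[miss]]
  for (iter in seq_len(max_iter)) {
    # Centre the current completed table and keep a rank-ncp reconstruction
    mu <- colMeans(Xhat)
    Xc <- sweep(Xhat, 2, mu)
    s <- svd(Xc, nu = ncp, nv = ncp)
    fitted <- s$u %*% diag(s$d[seq_len(ncp)], ncp, ncp) %*% t(s$v)
    fitted <- sweep(fitted, 2, mu, "+")
    # Update the missing cells only; observed cells are never modified
    new <- Xhat
    new[miss] <- fitted[miss]
    change <- sum((new - Xhat)^2)
    Xhat <- new
    if (change < tol) break
  }
  Xhat
}

# Simulated data with a 2-dimensional structure plus a little noise
set.seed(42)
n <- 50; p <- 6
scores   <- matrix(rnorm(n * 2), n, 2)
loadings <- matrix(rnorm(2 * p), 2, p)
X_full <- scores %*% loadings + matrix(rnorm(n * p, sd = 0.1), n, p)

# Remove about 10% of the cells at random
miss <- matrix(runif(n * p) < 0.10, n, p)
X_na <- X_full
X_na[miss] <- NA

imp <- impute_pca_sketch(X_na, ncp = 2)

# Compare with plain mean imputation on the cells that were removed
mean_imp <- X_na
mean_imp[miss] <- colMeans(X_na, na.rm = TRUE)[col(X_na)[miss]]
rmse <- function(a, b) sqrt(mean((a - b)^2))
rmse_pca  <- rmse(imp[miss], X_full[miss])
rmse_mean <- rmse(mean_imp[miss], X_full[miss])
c(pca = rmse_pca, mean = rmse_mean)
```

Because the reconstruction uses the links between variables, the imputed values can differ from one row to another within the same column, which mean imputation cannot do.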

With a completed dataset, we can pursue our analyses… however, we need to be careful and not forget that the data was incomplete! In a future post, we will see how to visualize the uncertainty on these predicted values.

You can find more information in this JSS paper, on this website, and in this tutorial given at useR! 2016 at Stanford.

You can also watch this playlist on YouTube to practice with R.

Text Mining on Wine Description

Here is an example of text mining with correspondence analysis.
Within the context of research into the characteristics of the wines from Chenin vines in the Loire Valley (French wines), a set of 10 dry white wines from Touraine was studied: 5 Touraine Protected Appellation of Origin (AOC) from Sauvignon vines, and 5 Vouvray AOC from Chenin vines.
These wines were described by 12 professionals. The instructions were: for each wine, give one or more words which, in your opinion, characterise the sensory aspects of the wine. This data was brought together in a table with the wines as rows and the words as columns, where the general term Xij is the number of times that a word j was associated with a wine i (data are available here).

This contingency table has been analysed using Correspondence Analysis (CA) to provide an image summarising the diversity of the wines.
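As a rough illustration of the two steps described above (building the wines-by-words table, then running CA), here is a base-R sketch. The wine names and the words are made up for the example and are not the study data; CA is computed by hand from a singular value decomposition, whereas in practice the CA function of FactoMineR would do this work.

```r
# Hypothetical free-text answers, all judges pooled, one string per wine
descriptions <- c(
  WineA = "fruity fresh acidic fruity floral",
  WineB = "fruity floral fresh",
  WineC = "oaky sweet honey oaky",
  WineD = "sweet honey fruity oaky"
)

# Contingency table: N[i, j] = number of times word j was given for wine i
words <- strsplit(descriptions, " ", fixed = TRUE)
long <- data.frame(
  wine = rep(names(words), lengths(words)),
  word = unlist(words, use.names = FALSE)
)
N <- unclass(table(long$wine, long$word))

# Correspondence analysis by hand
P <- N / sum(N)            # relative frequencies
row_mass <- rowSums(P)     # weight of each wine
col_mass <- colSums(P)     # weight of each word
# Standardised deviations from the independence model
S <- diag(1 / sqrt(row_mass)) %*%
  (P - row_mass %o% col_mass) %*%
  diag(1 / sqrt(col_mass))
sv <- svd(S)

inertia <- sv$d^2          # inertia carried by each dimension
row_coord <- diag(1 / sqrt(row_mass)) %*% sv$u %*% diag(sv$d)
col_coord <- diag(1 / sqrt(col_mass)) %*% sv$v %*% diag(sv$d)
rownames(row_coord) <- rownames(N)
rownames(col_coord) <- colnames(N)

round(row_coord[, 1:2], 2) # wines on the first two dimensions
round(col_coord[, 1:2], 2) # words on the first two dimensions
```

The total inertia is the chi-square statistic of the table divided by the total count, which is a convenient check that the decomposition is right.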

PCA – hierarchical tree – partition: Why do we need to choose for visualizing data?

Principal component methods such as PCA (principal component analysis) or MCA (multiple correspondence analysis) can be used as a pre-processing step before clustering.

But principal component methods also provide a framework to visualize data. Thus, the results of the clustering methods can be represented on the map provided by the principal component method. In the figure below, the hierarchical tree is represented in 3D on the principal component map (using the first two components obtained with PCA). A partition has then been built, and individuals are coloured according to the cluster they belong to.

[Figure: hierarchical tree drawn in 3D on the principal component map of the temperature data]

Thus, the graph simultaneously gives the information provided by the principal component map, the hierarchical tree and the clusters (see the function HCPC in the FactoMineR package).

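library(FactoMineR)

# Read the temperature data (row names are in the first column)
temperature <- read.table("http://factominer.free.fr/livre/temperat.csv",
       header=TRUE, sep=";", dec=".", row.names=1)

# PCA on the first 23 rows; columns 13 to 16 are supplementary quantitative
# variables and column 17 is a supplementary qualitative variable
res.pca <- PCA(temperature[1:23,], scale.unit=TRUE, ncp=Inf,
      graph = FALSE,quanti.sup=13:16,quali.sup=17)

# Hierarchical clustering on the principal components
res.hcpc <- HCPC(res.pca)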

The approaches complement one another in two ways:

  • firstly, a continuous view (the trend identified by the principal components) and a discontinuous view (the clusters) of the same data set are both represented in a single framework;
  • secondly, the two-dimensional map provides no information about the position of the individuals in the other dimensions; the tree and the clusters, defined from more dimensions, offer some information “outside of the map”; two individuals close together on the map can be in the same cluster (and therefore not too far from one another along the other dimensions) or in two different clusters (as they are far from one another along other dimensions).
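The second point can be checked on a small example. The sketch below uses base R only (prcomp, hclust, cutree) on the built-in USArrests data rather than FactoMineR; keeping three components and cutting the tree into four clusters are arbitrary choices made for the illustration. It looks for the two individuals that are closest on the two-dimensional map while belonging to different clusters.

```r
# PCA on standardised variables (built-in dataset, used only as an example)
pca <- prcomp(USArrests, scale. = TRUE)

# Cluster on more dimensions than the map shows (3 components: arbitrary choice)
scores <- pca$x[, 1:3]
tree <- hclust(dist(scores), method = "ward.D2")
clusters <- cutree(tree, k = 4)   # 4 clusters: arbitrary choice
table(clusters)

# Distances between individuals on the 2D map only
d_map <- as.matrix(dist(pca$x[, 1:2]))

# Keep only pairs of individuals that fall in different clusters
different <- outer(clusters, clusters, "!=")
d_map[!different] <- Inf

# Closest pair on the map that the clustering nevertheless separates
idx <- which(d_map == min(d_map), arr.ind = TRUE)[1, ]
rownames(pca$x)[idx]

# Draw the map coloured by cluster when working interactively
if (interactive()) {
  plot(pca$x[, 1:2], col = clusters, pch = 19,
       main = "Individuals on the first two components, coloured by cluster")
  text(pca$x[idx, 1:2], labels = rownames(pca$x)[idx], pos = 3)
}
```

Such a pair looks similar on the map, yet the clusters, built from three components, signal that the two individuals differ along the dimension the map does not show.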

So why do we need to choose when we want to better visualize the data?

The example shows the joint use of PCA and clustering methods, but instead of PCA we can use correspondence analysis on contingency tables, or multiple correspondence analysis on categorical variables.

If you want to learn more, you can watch this video, enroll in this MOOC (free), or read this unpublished paper.

How to perform PCA with R?

This post shows how to perform PCA with R and the package FactoMineR.

If you want to learn more about methods such as PCA, you can enroll in this MOOC (everything is free): MOOC on Exploratory Multivariate Data Analysis

Dataset

Here is a wine dataset, with 10 wines and 27 sensory attributes (like sweetness, bitterness, fruity odor, and so on), 2 preference variables, and a qualitative variable corresponding to the wine labels (there are 2 labels, Sauvignon and Vouvray). The values in the data table correspond to the average score given by several judges for the same wine and descriptive variable. The aim of doing PCA here is to characterize the wines according to their sensory characteristics.
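For readers who start from the raw scores, here is a small base-R sketch of how such a table of averages can be built. The judges, wines, attributes and scores are simulated and purely hypothetical; only the averaging step reflects the description above.

```r
set.seed(1)
# Hypothetical raw scores: 3 judges x 4 wines x 2 sensory attributes
raw <- expand.grid(
  judge = paste0("J", 1:3),
  wine = paste0("Wine", 1:4),
  attribute = c("sweetness", "bitterness")
)
raw$score <- round(runif(nrow(raw), 0, 10), 1)

# One row per wine, one column per attribute, averaged over the judges
wine_table <- tapply(raw$score, list(raw$wine, raw$attribute), mean)
wine_table
```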

Performing PCA … with additional information

Here are the lines of code used. Note that we use the information given by the qualitative variable.

### Read data
wine

### Loading FactoMineR
library(FactoMineR)

### PCA with supplementary variables
res

### Print the main results
summary(res)
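To make the idea of supplementary information concrete, here is a base-R sketch on the built-in iris data (not the wine data). Supplementary variables take no part in building the components: a supplementary quantitative variable is located through its correlations with the components, and each category of a supplementary qualitative variable is placed at the barycentre of the individuals that carry it. Treating Petal.Width and Species as supplementary is an arbitrary choice made for the illustration.

```r
# Active variables: the only ones used to build the components
active <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
# Supplementary information (arbitrary choice for this example)
sup_quanti <- iris$Petal.Width
sup_quali <- iris$Species

# Standardised PCA on the active variables
pca <- prcomp(active, scale. = TRUE)
scores <- pca$x

# Supplementary quantitative variable: correlation with each component
sup_quanti_coord <- cor(sup_quanti, scores)
round(sup_quanti_coord, 2)

# Supplementary qualitative variable: mean position of each category
sup_quali_coord <- aggregate(scores, by = list(Species = sup_quali), FUN = mean)
sup_quali_coord
```

Reading the category barycentres next to the individuals is what colouring the map by label makes visible.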

Two graphs are given by default, one for the individuals, one for the quantitative variables.

But it is interesting to consider the qualitative variable to better understand the differences between wines. Wines are colored according to their label.

### Drawing wines according to the label
plot(res, habillage="Label")

[Figure: PCA map of the wines, coloured according to their label]

Interpretation

The graph of the individuals shows, for instance, that S Michaud and S Trotignon are very “close”.

Exploratory Multivariate Data Analysis with R: enroll now in the MOOC

Exploratory multivariate data analysis has long been studied and taught in a “French way” in France. You can enroll in a MOOC (completely free) on Exploratory Multivariate Data Analysis. The MOOC will start on the 27th of February.


This MOOC focuses on 4 essential and basic methods, those with the largest potential in terms of applications: principal component analysis (PCA) when variables are quantitative, correspondence analysis (CA) and multiple correspondence analysis (MCA) when variables are categorical, and clustering.

This course is application-oriented: many examples and numerous exercises done with FactoMineR (a package for the free R software) will make participants efficient and reliable when facing data analysis.

Interactive plots in PCA with Factoshiny

A beautiful graph tells more than a lengthy speech!

So it is crucial to improve the graphs obtained by Principal Component Analysis or (Multiple) Correspondence Analysis. The package Factoshiny allows us to easily improve these graphs interactively.

The package Factoshiny makes interacting with R and FactoMineR simpler, thus facilitating the selection and addition of supplementary information. The main advantage of this package is that you don’t need to know the lines of code, and moreover that you can modify the graphical options and see instantly how the graphs are improved. You can watch this video to see how to use Factoshiny.

