Unfortunately, missing values are ubiquitous and occur for many reasons. One solution is **single imputation**, which consists of replacing missing entries with plausible values. This leads to a complete dataset that can be analyzed with any statistical method.

Based on dimensionality reduction methods, the **missMDA** package successfully imputes large and complex datasets with **quantitative**, **categorical**, and **mixed** variables. Indeed, it imputes data with principal component methods that take into account the similarities between observations and the relationships between variables. It has proven to be very competitive in prediction quality compared with state-of-the-art methods.

With 3 lines of code, we impute the dataset *orange*, available in missMDA:

```r
library(missMDA)
data(orange)
nbdim <- estim_ncpPCA(orange) # estimate the number of dimensions to impute
res.comp <- imputePCA(orange, ncp = nbdim)
```

In the same way, **imputeMCA** imputes datasets with categorical variables and **imputeFAMD** imputes mixed datasets.
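The workflow is the same as for imputePCA; here is a minimal sketch for categorical data, assuming the *vnf* example dataset shipped with missMDA:

```r
library(missMDA)
data(vnf)                              # categorical example dataset from missMDA
ncp <- estim_ncpMCA(vnf)$ncp           # estimate the number of dimensions to impute
res.comp <- imputeMCA(vnf, ncp = ncp)  # completed data are in res.comp$completeObs
```

For a mixed dataset, estim_ncpFAMD and imputeFAMD play the same roles.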

With a completed dataset, we can pursue our analyses… however, we need to be careful and not forget that the data was incomplete! In a future post, we will see how to visualize the uncertainty of these predicted values.

You can find more information in this JSS paper, on this website, and in this tutorial given at useR! 2016 at Stanford.

You can also watch this playlist on Youtube to practice with R.

You can also enroll in this MOOC.


Within the context of research into the characteristics of the wines from Chenin vines in the Loire Valley (French wines), a set of 10 dry white wines from Touraine were studied: 5 Touraine Protected Appellation of Origin (AOC) from Sauvignon vines, and 5 Vouvray AOC from Chenin vines.

These wines were described by 12 professionals. The instructions were: for each wine, give one or more words which, in your opinion, characterise the sensory aspects of the wine. This data was brought together in a table with the wines as rows and the words as columns.

This contingency table was analysed using Correspondence Analysis (CA) to provide an image summarising the diversity of the wines. Prior to the analysis, the least frequently used words were removed and a number of “neighbouring” words were grouped together (for example, *sweet*, *smooth*, and *syrupy*, all of which refer to the same perception: the sweet taste of the wine).

CA is implemented using the following commands:

```r
library(FactoMineR)
wine <- read.table("http://factominer.free.fr/bookV2/wine.csv",
                   header = TRUE, row.names = 1, sep = ";", check.names = FALSE)
res.ca <- CA(wine, col.sup = 11, row.sup = 31)
summary(res.ca)
```

Looking at the graph, we can identify three poles of wines:

- Aubuissières Silex (6), characterised by *sweet* (cited 11 times), is the only wine to contain more than trace-level residual sugars. This unusual characteristic for a dry wine stands out, as it is only rarely cited for the other wines (never more than twice for one wine), and accounts for over a third of the words associated with this wine. The graph also highlights the wine’s *lack of character*; although this term was only cited 3 times for this wine, we have classed it in second place (among other things, this characteristic is really a lack of a characteristic and is therefore less evocative).
- Aubuissières Marigny (7) and Fontainerie Coteaux (10). These two wines were mainly characterised by the terms *oak* and *woody*, which were cited 7 and 5 times respectively, whereas these words were only used 3 times elsewhere. This description can, of course, be linked to the fact that these two wines are the only two to have been cask aged. According to this plane, *foreign flavour* best characterises these wines, but we chose to place it second due to the low frequency of this term (4), even if it was cited for these two wines alone. It should also be noted that the effect of ageing wine in casks does not only lead to positive characteristics.
- The five Touraine wines (Sauvignon; 1–5). Characterising these wines was more difficult. The terms *lots of character*, *fresh*, *delicate*, *discrete*, and *citrus* were cited for these wines, which seems to fit with the traditional image of a Sauvignon wine, according to which this vine yields fresh, flavoursome wines. We can also add two more marginal characteristics: *musty* (and *little character*, respectively), cited 8 times (4 times, respectively), and which are never used to describe the Sauvignon wines.

Once these three poles are established, we can go on to qualify the dimensions. The first distinguishes the Sauvignons from the Chenin wines based on freshness and flavour. The second opposes the cask-aged Chenin wines (with an oak flavour) to the one containing residual sugar (with a sweet flavour).

Having determined these outlines, the term *lack of character*, which was only used for wines 6 and 8, seems to appear in the right place, i.e., far from the wines which could be described as flavoursome, whether the flavour be due to the Sauvignon vines or from being aged in oak casks.

Finally, this plane offers an image of the Touraine white wines, according to which the Sauvignons are similar to one another and the Chenins are more varied. From a viticulturist’s point of view, this analysis identifies the marginal characteristics of the Chenin vine. In practice, this vine yields rather varied wines which seem particularly different from the Sauvignons as they are somewhat similar and rather typical.

You can find a complete description of this data in the book *Exploratory Multivariate Data Analysis by Example Using R* (Husson, Lê, Pagès).

Here are some materials: a video on another example of text mining, a video to better understand the CA method, and this video to see how to run CA with the R package FactoMineR.

You can also enroll in this MOOC.


But principal component methods also provide a framework to visualize data. Thus, clustering results can be represented on the map provided by the principal component method. In the figure below, the hierarchical tree is represented in 3D on the principal component map (using the first 2 components obtained with PCA). A partition has then been made, and individuals are coloured according to the cluster they belong to.

Thus, the graph simultaneously gives the information provided by the principal component map, the hierarchical tree, and the clusters (see the function HCPC in the FactoMineR package).

```r
library(FactoMineR)
temperature <- read.table("http://factominer.free.fr/livre/temperat.csv",
                          header = TRUE, sep = ";", dec = ".", row.names = 1)
res.pca <- PCA(temperature[1:23,], scale.unit = TRUE, ncp = Inf, graph = FALSE,
               quanti.sup = 13:16, quali.sup = 17)
res.hcpc <- HCPC(res.pca)
```

The approaches complement one another in two ways:

- firstly, a continuous view (the trend identified by the principal components) and a discontinuous view (the clusters) of the same data set are both represented in a unique framework;
- secondly, the two-dimensional map provides no information about the position of the individuals in the other dimensions; the tree and the clusters, defined from more dimensions, offer some information “outside of the map”; two individuals close together on the map can be in the same cluster (and therefore not too far from one another along the other dimensions) or in two different clusters (as they are far from one another along other dimensions).

So why should we choose between them, when the goal is to better visualize the data?

The example shows the common use of PCA and clustering methods, but rather than PCA we can use correspondence analysis on contingency tables, or multiple correspondence analysis on categorical variables.
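The same pipeline can be sketched for categorical data; as an illustration (assuming the *tea* survey dataset shipped with FactoMineR, whose first 18 variables are categorical), MCA simply replaces PCA before the clustering step:

```r
library(FactoMineR)
data(tea)                                  # survey data with categorical variables
res.mca <- MCA(tea[, 1:18], ncp = 20, graph = FALSE)  # keep many dimensions before clustering
res.hcpc <- HCPC(res.mca, graph = FALSE)   # hierarchical clustering on the MCA coordinates
```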

If you want to learn more, you can see this video, you can enroll in this MOOC (free), and you can see this unpublished paper.


If you want to learn more about methods such as PCA, you can enroll in this MOOC (everything is free): MOOC on Exploratory Multivariate Data Analysis

Here is a wine dataset, with 10 wines and 27 sensory attributes (like sweetness, bitterness, fruity odor, and so on), 2 preference variables, and a qualitative variable corresponding to the wine labels (there are 2 labels, Sauvignon and Vouvray). The values in the data table correspond to the average score given by several judges for the same wine and descriptive variable. The aim of doing PCA here is to characterize the wines according to their sensory characteristics.

Here are the lines of code used. Note that we use the information given by the qualitative variable.

### Read data

```r
# (this line was truncated in the original post; a plausible reconstruction,
# assuming a semicolon-separated file with the wine names in the first column)
wine <- read.table("wine.csv", header = TRUE, sep = ";", row.names = 1)
```

### Loading FactoMineR

```r
library(FactoMineR)
```

### PCA with supplementary variables

```r
# (this line was also truncated; with the 27 sensory attributes first, the 2
# preference variables and the wine label last, a plausible call is)
res <- PCA(wine, quanti.sup = 28:29, quali.sup = 30)
```

### Print the main results

```r
summary(res)
```

Two graphs are given by default, one for the individuals, one for the quantitative variables.

But it is interesting to consider the qualitative variable to better understand the differences between wines. Wines are colored according to their label.

```r
## Drawing wines according to the label
plot(res, habillage = "Label")
```

The graph of the individuals shows, for instance, that S Michaud and S Trotignon are very “close”. It means that the scores for S Michaud and S Trotignon are approximately the same, whatever the variable. In the same way, Aub Marigny and Font Coteaux are wines with similar sensory scores for the 27 attributes. On the other hand, Font Brulés and S Trotignon have very different sensory profiles, because the first principal component, representing the main axis of variability between wines, separates them strongly.

The variables astringency, visual intensity, mushroom odor and candied fruit odor, found to the right, have correlations close to 1 with the first dimension. Since the correlation with the 1st dimension is close to 1, the values of these variables move in the same direction as the coordinates on the 1st dimension: wines with a small coordinate on the 1st dimension have low values for these variables, and wines with large coordinates have high values for them. Thus, the wines to the right of the plot have high (and positive) coordinates on the 1st dimension, and therefore high values for these variables; by the same logic, wines to the left have small coordinates on the 1st dimension, and thus low values for these variables.

For the variables passionfruit odor, citrus odor and freshness, everything is the other way around. The correlation with the 1st dimension is close to -1, so the values move in the opposite direction: wines with a small coordinate on the 1st dimension have high values for these variables, and wines with large coordinates have small values for them.

Overall, we see that the first dimension splits apart wines that are considered fruity and flowery (on the left) from wines that are woody or have vegetal odors (on the right). And this is the main source of variability.

So then, how can we interpret the 2nd dimension, the vertical axis? At the top, wines have large coordinates on the vertical axis. Since the correlation coefficients between the 2nd dimension and variables such as acidity or bitterness are close to 1, wines at the top take large values for these variables, while wines at the bottom, with small coordinates on the 2nd dimension, take small values for them. For sweetness, the correlation coefficient is close to -1, so wines with a small coordinate on the 2nd dimension are sweet, while wines with large coordinates are not.

Overall, the 2nd dimension separates the wines at the top, acidic and bitter, from sweet wines at the bottom.
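These correlation-based readings do not have to be done by eye: FactoMineR's dimdesc function lists, for each dimension, the variables most significantly correlated with it (res being the PCA result computed above):

```r
# automatic description of the first two PCA dimensions:
# for each dimension, the variables are sorted by their correlation with it
dimdesc(res, axes = 1:2)
```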


This MOOC focuses on 4 essential and basic methods, those with the largest potential in terms of applications: principal component analysis (PCA) when variables are quantitative, correspondence analysis (CA) and multiple correspondence analysis (MCA) when variables are categorical, and clustering.

This course is application-oriented; its many examples and numerous exercises, carried out with FactoMineR (a package for the free R software), will make participants efficient and confident when faced with data analysis.


So it is crucial to improve the graphs obtained by Principal Component Analysis or (Multiple) Correspondence Analysis. The package **Factoshiny** allows us to easily improve these graphs interactively.

The package Factoshiny makes interacting with R and FactoMineR simpler, thus facilitating the selection and addition of supplementary information. The main advantage of this package is that you don’t need to know the lines of code, and moreover you can modify the graphical options and see instantly how the graphs improve. You can watch **this video** to see how to use Factoshiny.

The interface allows us to define the parameters of the methods and to modify the graphical options. The results (the graphs and the indicators) are updated automatically. For instance, in the animation, individuals are colored according to the category they belong to, for a given qualitative variable. Then we modify the threshold used to label the individuals according to their quality of representation in the plane: this way, individuals that are badly represented get transparent labels.

Once the “beautiful graphs” are done, you can download the plots, but you can also obtain the lines of code to redo the analysis. It is also possible to save the object resulting from Factoshiny and reuse it later to further modify the graphs, using the configuration described previously: the interface reopens as it was when we left it, so we can modify the parameters of the method or the graphical options.
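As a minimal sketch (assuming Factoshiny is installed, and using FactoMineR's decathlon data as an example), the interface is launched with a single call, and the object it returns can be passed back later to reopen the session:

```r
library(Factoshiny)
data(decathlon, package = "FactoMineR")  # example quantitative dataset
res.shiny <- PCAshiny(decathlon)         # opens the interactive PCA interface
# later, reopen the interface exactly as it was left:
# res.shiny2 <- PCAshiny(res.shiny)
```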
