Can we believe in the imputations?

A popular approach to dealing with missing values is to impute the data to get a complete dataset on which any statistical method can be applied. Many imputation methods are available, and they provide a completed dataset in any case, whatever the number of individuals and/or variables, the percentage of missing values, the pattern of missing values, the relationships between variables, etc.

However, can we believe in these imputations and in the analyses performed on these imputed datasets?

Multiple imputation generates several imputed datasets, and the between-imputation variance reflects the uncertainty of the predictions of the missing entries (using an imputation model). In the missMDA package we propose a way to visualize the uncertainty associated with the predictions. The rough idea is to project all the multiply imputed datasets onto the PCA graphical representations obtained from the “mean” imputed dataset.

For instance, for the incomplete orange data, the two following graphs read as follows: observation 6 has no uncertainty (there is no missing value for this observation), whereas there is more variability in the position of observation 10. For the variables, the clouds of points represent the uncertainty of the predictions. The ellipses and the clouds are quite small, which encourages us to carry on with the analysis of the imputed dataset.

[Figures: missMDA_ind_orange (individuals), missMDA_var_orange (variables)]

The graphs above were obtained after performing multiple imputation with PCA; they can be produced simply by using the function plot.MIPCA as follows:

library(missMDA)
data(orange)
nbdim <- estim_ncpPCA(orange) # estimate the number of dimensions to impute 
res.comp <- MIPCA(orange, ncp = nbdim$ncp, nboot = 1000)
plot(res.comp)
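The spread shown by the ellipses can also be inspected numerically. The following is a minimal sketch, not part of the original post: it assumes that the imputed datasets are stored in the res.MI component returned by MIPCA, uses nboot = 100 instead of 1000 to keep the run short, and the names imputed, cell_sd and spread are introduced only for illustration.

```r
library(missMDA)
data(orange)

set.seed(123)  # the imputations are random; fix the seed for reproducibility
nbdim <- estim_ncpPCA(orange)
res.comp <- MIPCA(orange, ncp = nbdim$ncp, nboot = 100)

# Stack the imputed datasets into an array: individuals x variables x imputations
imputed <- simplify2array(lapply(res.comp$res.MI, as.matrix))

# Standard deviation of each cell across the imputed datasets
cell_sd <- apply(imputed, c(1, 2), sd)

# Keep only the cells that were missing in the original data
miss <- is.na(orange)
spread <- cell_sd
spread[!miss] <- NA
round(spread, 2)
```

Large values in this table point to the missing entries whose predictions vary most from one imputed dataset to the next.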

Now we have hints to answer the famous questions: “I have a dataset with xx% of missing values, can I impute it with your method?” or “Is 30% of missing values too much or not?” or “What is the maximum percentage of missing values?” Indeed, the percentage of missing values affects the quality of the imputation, but it is not the only factor. The structure of the data (i.e. the relationships between variables) is also very important. It is possible to have small ellipses with a high percentage of missing values, and the other way around. That is why these graphs are useful. The following ones suggest that we must be very careful with subsequent analyses on the imputed dataset, and even that we should stop the analysis of this dataset. When there’s nothing good to do, it’s better to do nothing!

[Figures: missMDA_ind2 (individuals), missMDA_var2 (variables)]

This methodology is also available for categorical data with the functions MIMCA and plot.MIMCA to visualize the uncertainty around the prediction of categories.

You can contact us for more information:
julie.josse@polytechnique.edu       @JulieJosseStat
husson@agrocampus-ouest.fr

Multiple imputation for continuous and categorical data

“The idea of imputation is both seductive and dangerous” (R.J.A. Little & D.B. Rubin).

Indeed, with single imputation a predicted value is treated as an observed one and the uncertainty of the prediction is ignored, which leads to poor inferences in the presence of missing values. That is why Multiple Imputation is recommended.

The missMDA package quickly generates several imputed datasets with quantitative and/or categorical variables. It is based on dimensionality reduction methods such as PCA for continuous variables or multiple correspondence analysis for categorical variables. Compared to the packages Amelia and mice, it better handles cases where the number of variables is larger than the number of units, and cases where regularization is needed (i.e. when the imputation model is prone to overfitting). For categorical variables, it is particularly useful when there are many variables and many levels, and also when some levels are rare.

With 3 lines of code, we generate 1000 imputed datasets for the quantitative orange data available in missMDA:

library(missMDA)
data(orange)
nbdim <- estim_ncpPCA(orange) # estimate the number of dimensions to impute
res.comp <- MIPCA(orange, ncp = nbdim$ncp, nboot = 1000)
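Once the imputed datasets are available, an analysis is run on each of them and the results are combined. The sketch below is an illustration added here, not code from the original post: it pools the mean of one variable with Rubin's rules, written out by hand in base R. It assumes the imputed datasets are in the res.MI component, uses nboot = 100 to keep the run short, and the variable names (j, est, within, q_bar, u_bar, b, total_var) are illustrative.

```r
library(missMDA)
data(orange)

set.seed(123)
nbdim <- estim_ncpPCA(orange)
res.comp <- MIPCA(orange, ncp = nbdim$ncp, nboot = 100)

j <- which.max(colSums(is.na(orange)))  # variable with the most missing values
M <- length(res.comp$res.MI)            # number of imputed datasets
n <- nrow(orange)

# Estimate and its sampling variance, computed on each imputed dataset
est    <- sapply(res.comp$res.MI, function(d) mean(d[, j]))
within <- sapply(res.comp$res.MI, function(d) var(d[, j]) / n)

# Rubin's rules: pooled estimate, within and between-imputation variance
q_bar <- mean(est)
u_bar <- mean(within)
b     <- var(est)
total_var <- u_bar + (1 + 1 / M) * b

c(estimate = q_bar, std.error = sqrt(total_var))
```

The between-imputation term b is what a single imputation would ignore; it inflates the standard error to account for the uncertainty of the predicted values.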

In the same way, MIMCA can be used for categorical data:

library(missMDA)
data(vnf)
nb <- estim_ncpMCA(vnf, ncp.max = 5) # time-consuming; the estimated number of dimensions is nb$ncp = 4
res <- MIMCA(vnf, ncp = 4, nboot = 10)
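For categorical data, one simple way to get a feel for the variability of the imputations is to check how often the imputed datasets agree on the category predicted for each missing cell. This is a hedged sketch added for illustration: it assumes the imputed data frames are stored in the res.MI component returned by MIMCA, and the names miss, draws and agreement are not from the package.

```r
library(missMDA)
data(vnf)

set.seed(123)
res <- MIMCA(vnf, ncp = 4, nboot = 10)

# Row and column positions of the missing cells in the original data
miss <- which(is.na(vnf), arr.ind = TRUE)

# One row per missing cell, one column per imputed dataset
draws <- sapply(res$res.MI, function(d) as.matrix(d)[miss])

# Share of imputed datasets that agree with the most frequent category
agreement <- apply(draws, 1, function(x) max(table(x)) / length(x))
summary(agreement)
```

An agreement close to 1 means that the imputed category is stable across datasets, while low values flag cells whose prediction is uncertain.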

You can find more information in this JSS paper, on this website, and in this tutorial given at useR! 2016 at Stanford.

You can also watch this playlist on YouTube to practice with R.

You can also contact us:
julie.josse@polytechnique.edu       @JulieJosseStat
husson@agrocampus-ouest.fr

Multiple Factor Analysis to analyse several data tables

How can we take into account and compare information from different sources? Multiple Factor Analysis is a principal component method that deals with datasets containing quantitative and/or categorical variables that are structured in groups.

Here is a course with videos that present the method named Multiple Factor Analysis.

Multiple Factor Analysis (MFA) allows you to study complex data tables, where a group of individuals is characterized by variables structured in groups, possibly coming from different information sources. The method is of interest because it can analyze a data table as a whole and also compare the information provided by the various sources.
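As a rough illustration of what "variables structured in groups" means in practice, here is a short sketch based on the wine dataset shipped with FactoMineR. This example is not taken from the course; the group sizes, types and names follow the example in the package documentation as far as we recall it, and res.mfa is an illustrative name.

```r
library(FactoMineR)
data(wine)

# 31 variables split into 6 consecutive groups:
# 2 categorical variables (origin), then 5 groups of sensory scores
res.mfa <- MFA(wine,
               group = c(2, 5, 3, 10, 9, 2),
               type = c("n", rep("s", 5)),   # "n" = categorical, "s" = scaled quantitative
               name.group = c("orig", "olf", "vis", "olfag", "gust", "ens"),
               num.group.sup = c(1, 6),      # groups 1 and 6 are supplementary
               ncp = 5,
               graph = FALSE)

res.mfa$eig                  # variance explained by each dimension
head(res.mfa$ind$coord)      # coordinates of the individuals
```

Active groups build the dimensions, while supplementary groups are only projected afterwards, which is one way to compare the information carried by different sources.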

Four videos present a course on MFA, highlighting the way to interpret the data. Then you will find videos presenting the way to implement MFA in FactoMineR.

With this course, you will be able to perform MFA on your own and interpret the results obtained.

Multiple Correspondence Analysis with FactoMineR

How to analyse categorical data? Here is a course with videos that present Multiple Correspondence Analysis in a French way. The best-known application of Multiple Correspondence Analysis is the analysis of surveys.

Four videos present a course on MCA, highlighting the way to interpret the data. Then you will find videos presenting the way to implement MCA in FactoMineR and to deal with missing values in MCA thanks to the package missMDA, and lastly a video on drawing interactive graphs with Factoshiny. Finally, you will see that the new package FactoInvestigate allows you to automatically obtain an interpretation of your MCA results.

With this course, you will be able to perform MCA on your own and interpret the results obtained.

For more information, see the book below. Here are some reviews of the book and a link to order it.

[Image: bookR]

Correspondence Analysis with FactoMineR

How to analyse a contingency table (a count or document-word matrix)? Here is a course with videos that present Correspondence Analysis in a French way. Five videos present a course on CA, highlighting the way to interpret the data. Then you will find videos presenting the way to implement CA in FactoMineR.

With this course, you will be able to perform Correspondence Analysis on your own and interpret the results obtained.

For more information, see the book below. Here are some reviews of the book and a link to order it.

[Image: bookR]

PCA course using FactoMineR

Here is a course with videos that present Principal Component Analysis in a French way. Three videos present a course on PCA, highlighting the way to interpret the data. Then you will find videos presenting the way to implement PCA in FactoMineR and to deal with missing values in PCA thanks to the package missMDA, and lastly a video on drawing interactive graphs with Factoshiny. Finally, you will see that the new package FactoInvestigate allows you to automatically obtain an interpretation of your PCA results.

With this course, you will be able to perform PCA on your own and interpret the results obtained.

For more information, see the book below. Here are some reviews of the book and a link to order it.

[Image: bookR]

Text Mining on Wine Description

Here is an example of text mining with correspondence analysis.
Within the context of research into the characteristics of the wines from Chenin vines in the Loire Valley (French wines), a set of 10 dry white wines from Touraine were studied: 5 Touraine Protected Appellation of Origin (AOC) from Sauvignon vines, and 5 Vouvray AOC from Chenin vines.
These wines were described by 12 professionals. The instructions were: for each wine, give one or more words which, in your opinion, characterise the sensory aspects of the wine. These data were brought together in a table with the wines as rows and the words as columns, where the general term Xij is the number of times that word j was associated with wine i (data are available here).

This contingency table has been analysed using Correspondence Analysis (CA) to provide an image summarising the diversity of the wines.