All you need to know on Multiple Factor Analysis …

Multiple factor analysis deals with datasets where variables are organized in groups, typically because the variables come from different sources. The method highlights a structure common to all the groups, as well as the specificity of each group. It allows you to compare the results of several PCAs or MCAs in a single frame of reference. Each group of variables can be continuous, categorical, or a contingency table.
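The idea of a single frame of reference can be sketched in base R. The sketch below illustrates the weighting principle usually described for MFA with continuous groups: each group of standardized variables is divided by the square root of the first eigenvalue of its own PCA, so that no group dominates the global PCA. The grouping of the mtcars columns is a hypothetical choice made for illustration, and this is not the code used by FactoMineR or Factoshiny.

```r
# Hypothetical grouping of built-in mtcars columns (illustration only)
groups <- list(engine  = c("mpg", "disp", "hp"),
               chassis = c("drat", "wt", "qsec"))

# Largest eigenvalue of the correlation matrix = variance of the first PC
first_eigenvalue <- function(x) {
  eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values[1]
}

# Standardize each group, then divide it by the square root of its
# first eigenvalue (the MFA-style group weighting)
weighted <- lapply(groups, function(cols) {
  x <- mtcars[, cols]
  scale(x) / sqrt(first_eigenvalue(x))
})

# After weighting, the first eigenvalue of every group equals 1
balanced <- sapply(weighted, function(x) {
  eigen(cov(x), symmetric = TRUE, only.values = TRUE)$values[1]
})

# Global PCA on the concatenated, weighted groups
global <- prcomp(do.call(cbind, weighted), center = FALSE, scale. = FALSE)
global_eig <- global$sdev^2

print(balanced)
print(round(global_eig, 3))
```

With this weighting, the first eigenvalue of the global analysis lies between 1 and the number of groups: a value close to the number of groups suggests that the first dimension is shared by all of them.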

Implementation with R software

See this video and the audio transcription of this video:

MFA_img

Course videos

Theoretical and practical information on Multiple Factor Analysis is available in these 4 course videos:

  1. Introduction
  2. Weighting and global PCA
  3. Study of the groups of variables
  4. Complements: qualitative groups, frequency tables

Here are the slides and the audio transcription of the course.

Materials

Here is the material used in the videos:

All you need to know on clustering with Factoshiny…

The function Factoshiny of the package Factoshiny offers a complete clustering strategy that allows you:

  • to draw a hierarchical tree and a partition
  • to describe and characterize the clusters by quantitative and categorical variables
  • to handle large numbers of individuals thanks to the complementarity of K-means and hierarchical clustering algorithms
  • to consider categorical variables or contingency tables
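The complementarity of K-means and hierarchical clustering mentioned in the list above can be sketched with base R only. This is a minimal illustration on the built-in USArrests data, with an arbitrary choice of 10 K-means centers and 3 final clusters; it is not the code that Factoshiny runs internally.

```r
set.seed(123)                       # K-means starts from random centers
x <- scale(USArrests)               # built-in data, standardized

# Step 1: K-means compresses the individuals into a few small groups
km <- kmeans(x, centers = 10, nstart = 25)

# Step 2: hierarchical (Ward) clustering on the K-means centers;
# members = km$size passes the size of each group to hclust
tree <- hclust(dist(km$centers), method = "ward.D2", members = km$size)

# Step 3: cut the tree and give each individual the cluster of its center
center_cluster <- cutree(tree, k = 3)
partition <- setNames(unname(center_cluster)[km$cluster], rownames(x))

# Describe the clusters by the mean of each quantitative variable
print(table(partition))
print(aggregate(USArrests, by = list(cluster = partition), FUN = mean))
```

The tree is built on 10 centers instead of 50 individuals here; the same two-step scheme is what makes a hierarchical tree affordable when the number of individuals is very large.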

Implementation with R software

See this video and the audio transcription of this video:

CLASSIFFacto

Course videos

Theoretical and practical information on clustering is available in these 4 course videos (here are the slides and the audio transcription of the courses):

 

clustering1

Introduction

 

 

 

 

Materials

Here is the material used in the videos:

 

All you need to know to analyse a survey with MCA …

All you need to do with MCA to analyse a survey is in Factoshiny!

MCA – Multiple Correspondence Analysis – is a method for exploring and visualizing data obtained from a survey or a questionnaire, i.e. datasets with categorical variables.
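To see what MCA does with categorical variables, here is a minimal base R sketch under one common presentation of the method: MCA as a correspondence analysis of the indicator (one-hot) matrix. The three variables taken from mtcars are an arbitrary stand-in for survey questions, and this is an illustration of the principle, not the implementation used by FactoMineR.

```r
# Three categorical variables built from mtcars (illustrative choice)
survey <- data.frame(cyl  = factor(mtcars$cyl),
                     gear = factor(mtcars$gear),
                     am   = factor(mtcars$am))

# Indicator matrix: one 0/1 column per category of each variable
indicator <- do.call(cbind, lapply(names(survey), function(v) {
  f <- survey[[v]]
  z <- sapply(levels(f), function(l) as.numeric(f == l))
  colnames(z) <- paste(v, levels(f), sep = "_")
  z
}))

# Correspondence analysis of the indicator matrix through an SVD
P <- indicator / sum(indicator)
row_mass <- rowSums(P)
col_mass <- colSums(P)
expected <- row_mass %o% col_mass
S <- (P - expected) / sqrt(expected)
eig <- svd(S)$d^2                   # inertia of each dimension

n_var <- ncol(survey)               # number of variables
n_cat <- ncol(indicator)            # total number of categories
total_inertia <- sum(eig)

print(round(eig, 4))
print(c(total = total_inertia, expected = n_cat / n_var - 1))
```

The total inertia equals the number of categories divided by the number of variables, minus 1. It depends only on the questionnaire layout, which is one reason percentages of inertia in MCA tend to look small.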

The function Factoshiny of the package Factoshiny allows you to perform MCA in a really easy way. You can include extra information such as quantitative variables, manage missing data, draw and improve the graphs interactively, draw confidence ellipses, get several numeric indicators as outputs, perform clustering on the MCA results, and even get an automatic interpretation of the results. Finally, the function returns the lines of code to parameterize the analysis and redo the graphs, which makes the analysis reproducible.

Implementation with R software

See this video and the audio transcription of this video:

ACM_img

The lines of code to do an MCA:

install.packages("Factoshiny")  # the package name must be quoted
library(Factoshiny)
data(tea)
result <- Factoshiny(tea)

 

Course videos

Theoretical and practical information on Multiple Correspondence Analysis is available in these 4 course videos:

Here are the slides and the audio transcription of the course.

Materials

Here is the material used in the videos:

Management of missing data

This video gives more information on the management of missing data in MCA.

If you want to see more methods on Exploratory Data Analysis, follow this link.

All you need to know on Correspondence Analysis …

Correspondence Analysis – CA – is an exploratory multivariate method for exploring and visualizing contingency tables, i.e. tables on which a chi-squared test can be performed. CA is particularly useful in text mining.
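The link between CA and the chi-squared test can be checked with a few lines of base R. The sketch below uses the built-in HairEyeColor data (collapsed to a hair by eye table, an arbitrary choice for illustration) and computes CA through an SVD of the standardized residuals; it is a didactic sketch, not the FactoMineR implementation.

```r
# Contingency table: hair colour (rows) by eye colour (columns)
tab <- apply(HairEyeColor, c(1, 2), sum)
n <- sum(tab)

# Correspondence matrix, row and column masses
P <- tab / n
row_mass <- rowSums(P)
col_mass <- colSums(P)

# Standardized residuals with respect to the independence model
expected <- row_mass %o% col_mass
S <- (P - expected) / sqrt(expected)

# The squared singular values are the inertias of the CA dimensions
dec <- svd(S)
inertia <- dec$d^2

# Total inertia equals the chi-squared statistic divided by n
chi2 <- unname(chisq.test(tab)$statistic)
print(c(total_inertia = sum(inertia), chi2_over_n = chi2 / n))

# Principal coordinates of the rows on the first two dimensions
row_coord <- (dec$u %*% diag(dec$d))[, 1:2] / sqrt(row_mass)
rownames(row_coord) <- rownames(tab)
print(round(row_coord, 3))
```

In other words, CA decomposes the departure from independence measured by the chi-squared statistic into a sequence of dimensions.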

The function Factoshiny of the package Factoshiny allows you to perform CA in an easy way. You can include extra information, manage missing data, draw and improve the graph interactively, get several numeric indicators as outputs, perform clustering on the CA results, and even get an automatic interpretation of the results. Finally, the function returns the lines of code to parameterize the analysis and redo the graphs, which makes the analysis reproducible.

Implementation with R software

See this video and the audio transcription of this video:

CA_img

The lines of code to do a Correspondence Analysis:

install.packages("Factoshiny")  # the package name must be quoted
library(Factoshiny)
data(children)
result <- Factoshiny(children)

 

Course videos

Theoretical and practical information on Correspondence Analysis is available in these 6 course videos:

  1. Introduction
  2. Visualizing the row and column clouds
  3. Inertia and percentage of inertia
  4. Simultaneous representation
  5. Interpretation aids
  6. Text mining with correspondence analysis

Here are the slides and the audio transcription of the course.

Materials

Here is the material used in the videos:

Follow this link if you want to see more methods on Exploratory Data Analysis.

All you need to know on PCA …

All you need to do with PCA is in Factoshiny!

PCA – Principal Component Analysis – is a well-known method for exploring and visualizing data. The function Factoshiny of the package Factoshiny allows you to perform PCA in a really easy way. You can include extra information such as categorical variables, manage missing data, draw and improve the graphs interactively, get several numeric indicators as outputs, perform clustering on the PCA results, and even get an automatic interpretation of the results. Finally, the function returns the lines of code to parameterize the analysis and redo the graphs, which makes the analysis reproducible.

See this video and the audio transcription of this video:

PCAFacto

The lines of code to do a PCA:

install.packages("Factoshiny")  # the package name must be quoted
library(Factoshiny)
data(decathlon)
result <- Factoshiny(decathlon)
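For a script that runs without the interactive interface, the same kind of analysis can be written directly with FactoMineR. The sketch below assumes that FactoMineR is installed; the choice of supplementary columns (11 and 12 as quantitative, 13 as categorical) follows the layout of the decathlon data and is an assumption about what you would select in the interface, not the exact code that Factoshiny returns.

```r
library(FactoMineR)
data(decathlon)

# PCA on the ten events; rank and points are supplementary quantitative
# variables, the competition is a supplementary categorical variable
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

# Eigenvalues and percentages of variance
print(head(res.pca$eig))

# Coordinates of the first individuals on the principal components
print(head(res.pca$ind$coord))
```

Setting graph = FALSE keeps the call silent; the graphs can then be drawn separately, for instance with plot(res.pca).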

Theoretical and practical information on PCA is available in these 3 course videos:

  1. Data – practicalities
  2. Studying individuals and variables
  3. Interpretation aids

Here are the slides and the audio transcription of the course.

Here is the material used in the videos:

And here is a video that gives more information on the management of missing data.

Enjoy making beautiful visualizations of your data!

If you want to see more methods on Exploratory Data Analysis, follow this link.

Factoshiny: an updated version on CRAN!

The newest version of R package Factoshiny (2.2) is now on CRAN!
It gives a graphical user interface that allows you to implement exploratory multivariate analyses such as PCA, correspondence analysis, multiple factor analysis or clustering.
This interface lets you modify the graphs interactively, manages missing data, gives the lines of code to parameterize the analysis and redo the graphs (reproducibility), and produces an automatic report on the results of the analysis.

remove.packages("Factoshiny")
install.packages("Factoshiny")

Try it! There is only one function to remember: the Factoshiny function (same name as the package):

library(Factoshiny)
data(decathlon)
result <- Factoshiny(decathlon)

Here is a video that shows how to perform PCA with Factoshiny.

PCAfactoshiny

Enroll now in the MOOC on Exploratory Multivariate Data Analysis with R

Exploratory multivariate data analysis has long been studied and taught in a “French way” in France. You can enroll in a completely free MOOC on Exploratory Multivariate Data Analysis. The MOOC will start on the 2nd of March 2020.

image_cours

This MOOC focuses on 5 essential and basic methods, those with the largest potential in terms of applications: principal component analysis (PCA) when variables are quantitative, correspondence analysis (CA) and multiple correspondence analysis (MCA) when variables are categorical, and clustering. An extension to Multiple Factor Analysis (MFA) will give you the opportunity to analyse more complex datasets that are structured by groups.

This course is application-oriented: many examples and numerous exercises done with FactoMineR (a package of the free R software) will make participants efficient and reliable when facing data analysis.

See you soon.

 

Can we believe in the imputations?

A popular approach to deal with missing values is to impute the data to get a complete dataset on which any statistical method can be applied. Many imputation methods are available and provide a completed dataset in any case, whatever the number of individuals and/or variables, the percentage of missing values, the pattern of missing values, the relationships between variables, etc.

However, can we believe in these imputations and in the analyses performed on these imputed datasets?

Multiple imputation generates several imputed datasets, and the between-imputation variance reflects the uncertainty of the predictions of the missing entries (using an imputation model). In the missMDA package we propose a way to visualize the uncertainty associated with the predictions. The rough idea is to project all the multiple imputed datasets on the PCA graphical representations obtained from the “mean” imputed dataset.

For instance, for the incomplete orange data, the two following graphs read as follows: observation 6 has no uncertainty (there is no missing value for this observation) whereas there is more variability on the position of observation 10. For the variables, the clouds of points represent the uncertainties on the predictions. Ellipses as well as clouds are quite small, which encourages us to carry on the analysis on the imputed dataset.

missMDA_ind_orangemissMDA_var_orange

The graphs above were obtained after multiple imputation with PCA, simply by using the function plot.MIPCA as follows:

library(missMDA)
data(orange)
nbdim <- estim_ncpPCA(orange) # estimate the number of dimensions to impute 
res.comp <- MIPCA(orange, ncp = nbdim$ncp, nboot = 1000)
plot(res.comp)

Now we have hints to answer the famous questions: “I have a dataset with xx% of missing values, can I impute it with your method?” or “Is 30% of missing values too much or not?” or “What is the maximum percentage of missing values?” Indeed, the percentage of missing values affects the quality of the imputation, but it is not the only factor: the structure of the data (i.e. the relationships between variables) is very important. It is possible to have small ellipses with a high percentage of missing values, and the other way around. That is why these graphs are useful. The following ones suggest that we must be very careful with subsequent analyses on the imputed dataset, and even that we should stop analysing this dataset. When there’s nothing good to do, it’s better to do nothing!

missMDA_ind2missMDA_var2

This methodology is also available for categorical data with the functions MIMCA and plot.MIMCA to visualize the uncertainty around the prediction of categories.

You can contact us for more information:
julie.josse@polytechnique.edu       @JulieJosseStat
husson@agrocampus-ouest.fr

Multiple imputation for continuous and categorical data

“The idea of imputation is both seductive and dangerous” (R.J.A. Little & D.B. Rubin).

Indeed, with single imputation a predicted value is treated as an observed one and the uncertainty of the prediction is ignored, which leads to poor inferences. That is why Multiple Imputation is recommended.
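The way multiple imputation accounts for prediction uncertainty can be sketched in base R with Rubin's pooling rules. In the sketch below, the imputation step is a deliberately simple stochastic regression on the built-in airquality data (an assumption made to keep the example self-contained, not the PCA-based model of missMDA), and the quantity of interest is the mean of Ozone.

```r
set.seed(42)
M <- 20                                   # number of imputed datasets

# Simple imputation model fitted on the complete cases
fit <- lm(Ozone ~ Temp + Wind, data = airquality)
miss <- is.na(airquality$Ozone)
pred <- predict(fit, newdata = airquality[miss, ])
sigma_hat <- summary(fit)$sigma

estimates <- numeric(M)                   # estimate from each imputed dataset
variances <- numeric(M)                   # its estimated variance

for (m in seq_len(M)) {
  ozone <- airquality$Ozone
  # Imputed value = prediction + random noise, so imputations differ
  ozone[miss] <- pred + rnorm(sum(miss), sd = sigma_hat)
  estimates[m] <- mean(ozone)
  variances[m] <- var(ozone) / length(ozone)
}

# Rubin's rules: pooled estimate, within and between-imputation variance
pooled  <- mean(estimates)
within  <- mean(variances)
between <- var(estimates)
total   <- within + (1 + 1 / M) * between

print(c(pooled = pooled, within = within, between = between, total = total))
```

A single imputation would report only the within part; the between part is the extra variance that comes from not knowing the missing values.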

The missMDA package quickly generates several imputed datasets with quantitative variables and/or categorical variables. It is based on dimensionality reduction methods such as PCA for continuous variables or multiple correspondence analysis for categorical variables. Compared to the packages Amelia and mice, it better handles cases where the number of variables is larger than the number of units, and cases where regularization is needed (i.e. when the imputation model is prone to overfitting issues). For categorical variables, it is particularly interesting with many variables and many levels, but also with rare levels.

With 3 lines of code, we generate 1000 imputed datasets for the quantitative orange data available in missMDA:

library(missMDA)
data(orange)
nbdim <- estim_ncpPCA(orange) # estimate the number of dimensions to impute
res.comp <- MIPCA(orange, ncp = nbdim$ncp, nboot = 1000)

In the same way, MIMCA can be used for categorical data:

library(missMDA)
data(vnf)
nb <- estim_ncpMCA(vnf,ncp.max=5) ## Time-consuming, nb = 4
res <- MIMCA(vnf, ncp=4,nboot=10)

You can find more information in this JSS paper, on this website, and in this tutorial given at useR! 2016 at Stanford.

You can also watch this playlist on Youtube to practice with R.

You can also contact us:
julie.josse@polytechnique.edu       @JulieJosseStat
husson@agrocampus-ouest.fr

Clustering with FactoMineR

Here is a course with videos that present hierarchical clustering and its complementarity with principal component methods. Four videos present a course on clustering: how to determine the number of clusters, how to describe the clusters, and how to perform the clustering when there are lots of individuals and/or lots of variables. Then you will find videos presenting the way to implement hierarchical clustering in FactoMineR.

arbre_temperature

For more information, you can see the book below. Here are some reviews on the book and a link to order the book.

bookR