Exploratory Analysis of Multivariate Data
Matt Moores
Department of Statistics, University of Warwick
WDSI Vacation School 2017
Matt Moores (Warwick) Exploratory Analysis of Multivariate Data WDSI Vacation School 2017 1 / 27
1 Multivariate Data
2 Principal Components Analysis (PCA)
3 Clustering
Multivariate Data
Multivariate Data
High-dimensional (more than 2D)
Complexity increases combinatorially
Correlation between columns
Mixed data types
Missing data
Visualisation
library(visdat)
vis_dat(typical_data)
Missing Data
Missing completely at random (MCAR)
“the dog ate my homework”
Missing at random (MAR)
Missing not at random (MNAR)
nonignorable nonresponse
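The three mechanisms can be illustrated with a small simulation. This is only a sketch on made-up data: the variables (age, income), the sample size and the missingness probabilities are assumptions chosen for illustration and do not come from the slides.

```r
set.seed(1)
n <- 1000
age    <- rnorm(n, mean = 40, sd = 10)        # always observed
income <- 20 + 0.5 * age + rnorm(n, sd = 5)   # will be partly missing

# MCAR: missingness is unrelated to any variable
mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: missingness depends only on the observed variable (age)
mar <- ifelse(runif(n) < plogis((age - 40) / 5), NA, income)

# MNAR: missingness depends on the unobserved value itself (income)
mnar <- ifelse(runif(n) < plogis((income - 40) / 5), NA, income)

# Compare the mean of the observed values with the true mean
obs_means <- c(true = mean(income),
               MCAR = mean(mcar, na.rm = TRUE),
               MAR  = mean(mar,  na.rm = TRUE),
               MNAR = mean(mnar, na.rm = TRUE))
round(obs_means, 2)
```

In this simulation the MCAR mean stays close to the true mean, while the MAR and MNAR means are pulled downwards because larger values are more likely to be dropped. Under MAR the bias can be corrected by conditioning on age; under MNAR it cannot be corrected without a model for the missingness itself.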
Examples
Surveys
Consumer data
Medical records
Morphological measurements
Structured data with millions of dimensions:
*-omics
Images
Text
Purple Rock Crab
Figure 1: Leptograpsus variegatus
Image courtesy Wikimedia Commons, user Benjamint444
Morphology
library(MASS)
vis_dat(crabs)
Data Dictionary
sp: species (“B” or “O” for blue or orange)
sex: “F” or “M”
index: 1:50 within each group
FL: frontal lobe size (mm)
RW: rear width (mm)
CL: carapace length (mm)
CW: carapace width (mm)
BD: body depth (mm)
Campbell & Mahon (1974) Aust. J. Zoology 22: 417–425.
Exploratory Data Analysis
library(GGally)
ggpairs(crabs, ggplot2::aes(colour=sp))
Principal Components Analysis (PCA)
PCA
PCA is a transformation of the original variables:
Useful for high-dimensional, multicollinear data
Principal components are uncorrelated
orthogonal vectors
Decreasing order of variance
An example of unsupervised machine learning
Pearson (1901) Philosophical Magazine 2(11): 559–572.
PCA for 2D Gaussian
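A two-dimensional example of this kind can be simulated in a few lines. This is a sketch only: the covariance matrix, sample size and seed are assumptions chosen for illustration, not the values behind the original figure.

```r
library(MASS)  # for mvrnorm(); MASS ships with R

set.seed(42)
# Covariance matrix with strong positive correlation (illustrative values)
Sigma <- matrix(c(4, 3,
                  3, 4), nrow = 2, byrow = TRUE)
x <- mvrnorm(n = 500, mu = c(0, 0), Sigma = Sigma)

pca <- prcomp(x)      # centres the data, no rescaling by default

pca$rotation          # columns are the principal axes (unit vectors)
pca$sdev^2            # variance along each axis, in decreasing order

# The axes are orthogonal and the scores are uncorrelated
crossprod(pca$rotation)   # approximately the identity matrix
cor(pca$x)[1, 2]          # approximately zero
```

For this choice of Sigma the population eigenvalues are 7 and 1, with axes along the directions (1, 1) and (1, -1), so the sample variances in pca$sdev^2 should be close to those values.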
PCA for crabs
library(FactoMineR)
crab_pca <- PCA(crabs, quanti.sup=3, quali.sup=1:2,
                graph=FALSE)
knitr::kable(crab_pca$eig, digits=4,
             col.names = c("eigenvalue","% variance","cumulative %"))
         eigenvalue   % variance   cumulative %
comp 1       4.7888      95.7767        95.7767
comp 2       0.1517       3.0337        98.8104
comp 3       0.0466       0.9327        99.7431
comp 4       0.0111       0.2227        99.9658
comp 5       0.0017       0.0342       100.0000
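As a cross-check, the eigenvalues can be recomputed with base R. This is a sketch, assuming that the active variables are the five measurements FL, RW, CL, CW and BD, and that PCA() standardises them (its default, scale.unit = TRUE), so that the eigenvalues are those of the correlation matrix. The names measurements, eig and eig_table are introduced here for illustration.

```r
library(MASS)   # crabs data

measurements <- crabs[, c("FL", "RW", "CL", "CW", "BD")]

# Eigen-decomposition of the correlation matrix
eig <- eigen(cor(measurements), symmetric = TRUE)$values

eig_table <- data.frame(
  eigenvalue   = eig,
  pct_variance = 100 * eig / sum(eig),
  cumulative   = 100 * cumsum(eig) / sum(eig),
  row.names    = paste("comp", seq_along(eig))
)
round(eig_table, 4)

# prcomp() on standardised data gives the same values
all.equal(prcomp(measurements, scale. = TRUE)$sdev^2, eig)
```

The eigenvalues sum to 5, the number of standardised variables, so each percentage of variance is simply the eigenvalue divided by 5.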
Scree Plot
library(factoextra)
fviz_screeplot(crab_pca)
Pairs Plot
Species
fviz_pca_ind(crab_pca, axes = c(1,3), habillage=1,
             addEllipses=TRUE, ellipse.level=0.95)
Sex
fviz_pca_ind(crab_pca, axes = c(1,2), habillage=2,
             addEllipses=TRUE, ellipse.level=0.95)
Another example: genetic variation
Novembre et al. (2008) Nature 456(7218): 98–101.
n = 1,387 European individuals from the POPRES project
Affymetrix 500K single nucleotide polymorphism (SNP) chip
n ≪ p
Country of origin of each individual’s grandparents
Data publicly available from the database of Genotypes and Phenotypes
(dbGaP), http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gap
PCA of 197,146 genetic loci
Figure 2: PCA
Clustering
Clustering
Another unsupervised approach for finding patterns in data:
Grouping observations based on distance
Many different distance metrics (Euclidean, Manhattan, Gower, etc.)
Also many different clustering algorithms
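The choice of metric changes which observations count as close. A minimal sketch with base R's dist(), using the USArrests data as an example; the names x_scaled, d_euc and d_man are introduced here for illustration.

```r
x_scaled <- scale(USArrests)   # standardise so each variable has unit variance

d_euc <- dist(x_scaled, method = "euclidean")
d_man <- dist(x_scaled, method = "manhattan")

# Distances between the first three states under each metric
round(as.matrix(d_euc)[1:3, 1:3], 2)
round(as.matrix(d_man)[1:3, 1:3], 2)

# Manhattan distance is never smaller than Euclidean distance
all(d_man >= d_euc - 1e-12)
```

Gower distance is intended for mixed data types and is not available in dist(); one option is daisy() from the cluster package with metric = "gower".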
Heatmap
data("USArrests")
USdf <- scale(USArrests)
US.dist <- get_dist(USdf, method = "euclidean")
fviz_dist(US.dist)
k-means
res_km <- eclust(USdf, "kmeans", k = 3)
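eclust() is a convenience wrapper from factoextra. The same kind of fit can be sketched with base R's kmeans(). The seed and nstart = 25 are assumptions made here, so the cluster labels (and possibly the assignments) need not match the eclust() output exactly.

```r
USdf <- scale(USArrests)

set.seed(123)                           # k-means starts from random centres
km <- kmeans(USdf, centers = 3, nstart = 25)

table(km$cluster)                       # cluster sizes
round(km$centers, 2)                    # centres on the standardised scale
km$tot.withinss                         # objective that k-means minimises
```

Using several random starts (nstart) guards against a poor local optimum, since k-means only finds a local minimum of the within-cluster sum of squares.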
Choropleth
library(maps)
UScat <- data.frame(state = tolower(names(res_km$cluster)), clu
states_map <- map_data("state")
ggplot(UScat, aes(map_id = state)) +
geom_map(aes(fill = cluster), map = states_map) + expand_l
Dendrogram
res.hc <- eclust(USdf, "hclust")
fviz_dend(res.hc, rect = TRUE)
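A comparable dendrogram can be sketched with base R. The Euclidean distance, Ward linkage ("ward.D2") and the cut at k = 4 clusters are assumptions made for this example; they are not stated on the slide.

```r
USdf <- scale(USArrests)

hc <- hclust(dist(USdf, method = "euclidean"), method = "ward.D2")

plot(hc, cex = 0.6, main = "USArrests", xlab = "", sub = "")
rect.hclust(hc, k = 4)          # outline four clusters on the dendrogram

groups <- cutree(hc, k = 4)     # cluster label for each state
table(groups)
```

Unlike k-means, hierarchical clustering does not need the number of clusters in advance: the tree is built once and can be cut at any height afterwards.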