This document discusses using topological data analysis techniques to address challenges with small sample sizes and distinct subgroups in data. It provides three case studies applying topological tools: 1) exploring profoundly gifted students with small educational samples, 2) identifying risk clusters in auto insurance claims data with subgroups, and 3) validating a psychometric survey using topology over traditional factor analysis. Topological data analysis is presented as a robust solution to problems traditional statistical and machine learning algorithms struggle with for small and complex data.
Women in Data Science 2018 Slides--Small Samples, Subgroups, and Topology
1. Human Behavior, Small Samples, and the Problem of Subgroups
The Power of Topology
3/5/2018
2. Introduction
Big data hype
Less publicized but very important types of data:
1. Small data
2. Data with distinct subgroups
Industries where these are common:
Education
Insurance
Biotechnology/pharmaceuticals
Industrial psychology
3. Problems Unique to Small Data
Types of small data problems:
Rare diseases (100 cases worldwide with unknown genetic causes)
Pilot studies (10’s or 100’s of participants)
Small educational programs (10’s of students enrolled in the previous year)
Main issues:
Statistical models require minimum sample sizes; below them, effect estimates suffer from computational issues (singularities, p>>n problems) or wide confidence intervals.
Machine learning algorithms need to converge for stable estimates and models.
Small samples can induce sparsity in the data space, which is problematic for
clustering and general data mining techniques.
4. Problems Unique to Subgroups in Data
Types of data in which subgroups are common:
Medicine (diverse causes of a given disease, subtypes of disease)
Education (different types of students and risk types for failure)
Industrial psychology (different personality trait patterns)
Main issues:
Washing out of effects in a full model
Examples:
Small subgroup defined by extremely high extraversion and openness related to public speaking outcome
Rare genetic variant combination predicting high likelihood of response to a drug within a disease population
Defining robust partitions within a piecewise regression model to deal with this phenomenon
Mixed results from many methods employing this strategy
Difficult with small sample sizes for most piecewise regression models
5. Unique Solutions: Topology
Branch of mainly pure mathematics
Study of changes in function behavior on different shapes
Identify invariant properties of shapes
Classify similarities/differences between shapes
Deep connections to physics and differential equations
6. Topological Data Analysis
Data as discrete point clouds
Topological spaces, called simplicial complexes, built from these:
Connect points within a certain distance of each other
Topologically similar to a graph
Tools using simplicial complexes for data analysis called topological data analysis (TDA)
2-d neighborhoods are defined by Euclidean distance.
Points within a given circle are mutually connected, forming a simplex.
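The construction described on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes a Vietoris-Rips-style rule (connect every pair of points whose Euclidean distance is at most a threshold); the slides do not say which construction was used. The function name `rips_complex`, the toy points, and the threshold `eps=1.2` are all hypothetical.

```python
from itertools import combinations
from math import dist


def rips_complex(points, eps):
    """Return (edges, triangles) of a Vietoris-Rips-style complex at scale eps.

    Assumption: two points are connected when their Euclidean distance is
    at most eps; three mutually connected points form a 2-simplex (triangle).
    """
    n = len(points)
    edges = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= eps}
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if (i, j) in edges and (i, k) in edges and (j, k) in edges]
    return sorted(edges), triangles


# Toy point cloud (made up for illustration): three nearby points and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8), (4.0, 4.0)]
edges, triangles = rips_complex(points, eps=1.2)
print(edges)      # pairs of point indices joined by an edge
print(triangles)  # triples of point indices forming a filled triangle
```

With these toy values the three nearby points are mutually connected and form one triangle, while the distant point stays isolated.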
7. Tool 1: Persistent Homology
Filtration
Series of simplicial complexes based on varying
distance thresholds
Features appear and disappear as lens changes
Nested sequence of features with deep algebraic properties
Persistence as length of feature existence in the sequences
(plotted as persistence diagrams)
Termed persistent homology
A bit like an MRI-type examination of data
Persistence as organ size and type
Gives a comprehensive view of data
Persistent homology related to hierarchical clustering
Statistical methods to compare datasets
[Figure: persistence diagram with Birth and Death axes; second panel with a time axis]
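The link between persistence and hierarchical clustering can be made concrete for connected components (dimension 0). The sketch below is a simplified illustration, not the software used in the talk: it sweeps the distance threshold upward, merges components with a union-find structure, and records when each component dies. The function name `h0_persistence` and the toy points are hypothetical.

```python
from itertools import combinations
from math import dist


def h0_persistence(points):
    """Return (birth, death) pairs for connected components (dimension 0).

    Every point is born at threshold 0. A component dies at the distance
    where it merges into another one; one component never dies.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process candidate edges in order of increasing length (the filtration).
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    pairs = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # two components merge at threshold d
            pairs.append((0.0, d))   # one of them dies here
    pairs.append((0.0, float("inf")))  # the last surviving component
    return pairs


# Toy 1-d point cloud (made up for illustration): two tight pairs far apart.
points = [(0.0,), (1.0,), (5.0,), (6.5,)]
diagram = h0_persistence(points)
print(diagram)
```

The finite death values are the merge heights of a single-linkage dendrogram on the same points, which is one way to read the connection to hierarchical clustering. Long-lived features (here, the gap of 4.0 between the two pairs) are the ones usually treated as signal.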
8. Tool 2: Morse-Smale Clustering
Multivariate technique from TDA similar to mode clustering
Find peaks and valleys in data by filtering on a defined function:
A watershed on mountains
Dribbling a soccer ball across a field of hills
Separate data based on shared peaks and valleys
Many nice developments on convergence and theoretical properties
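The peaks-and-valleys idea can be sketched on a discrete neighbor graph. This is a simplified illustration in the spirit of Morse-Smale partitioning, not the implementation used in the talk: each point follows its steepest uphill neighbor to a peak and its steepest downhill neighbor to a valley, and points that share the same (peak, valley) pair fall in the same cluster. The function values and the chain-shaped neighbor graph are made up for illustration; in practice the function might be a density estimate or an outcome variable.

```python
def climb(i, values, neighbors, sign):
    """Follow steepest ascent (sign=+1) or descent (sign=-1) to a local extremum."""
    while True:
        best = max(neighbors[i], key=lambda j: sign * values[j])
        if sign * values[best] <= sign * values[i]:
            return i  # no neighbor improves: i is a local peak (or valley)
        i = best


def morse_smale_labels(values, neighbors):
    """Label each point by the (peak index, valley index) its paths reach."""
    return [(climb(i, values, neighbors, +1), climb(i, values, neighbors, -1))
            for i in range(len(values))]


# Toy function values on a 1-d chain of points (hypothetical data).
values = [0.0, 2.0, 5.0, 3.0, 1.0, 4.0, 6.0, 2.5]
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
             for i in range(len(values))}
labels = morse_smale_labels(values, neighbors)
for i, label in enumerate(labels):
    print(i, label)
```

Here the two peaks (indices 2 and 6) and three valleys (indices 0, 4, 7) split the chain into four clusters, one per monotone stretch of the function.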
9. Tool 3: Homotopy and Path Equivalence
Homotopy arrow example
Red to blue by wiggling the start-to-finish path
Yellow arrow and hole problem
Homotopy method in LASSO
Wiggles an easy regression path into the optimal regression path
Recent success solving ordinary differential equations
Avoids local optima that can trap other regression estimators
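The idea of deforming an easy problem into a hard one can be illustrated with generic homotopy continuation for root finding. This is not the LASSO homotopy algorithm or the DGLARS/HLASSO estimators discussed later; it is a minimal sketch of the underlying principle. The target function `f`, the starting point `x0`, and the step counts are assumptions chosen for illustration.

```python
def homotopy_root(f, df, x0, steps=50, newton_iters=5):
    """Track the root of H(x, t) = (1 - t) * (x - x0) + t * f(x) from t=0 to t=1.

    At t=0 the problem is trivial (root x = x0); at t=1 it is the target
    problem f(x) = 0. Each small step in t is corrected with Newton's method.
    """
    x = x0
    for k in range(1, steps + 1):
        t = k / steps
        for _ in range(newton_iters):
            h = (1 - t) * (x - x0) + t * f(x)
            dh = (1 - t) + t * df(x)  # derivative of H with respect to x
            x -= h / dh
    return x


def f(x):
    # Hypothetical target equation: x^3 - 2x - 5 = 0.
    return x**3 - 2 * x - 5


def df(x):
    return 3 * x**2 - 2


root = homotopy_root(f, df, x0=2.0)
print(root, f(root))
```

Because each step starts from the solution of a nearby, slightly easier problem, the iterate stays on a continuous solution path instead of jumping between basins, which is the intuition behind the claim about avoiding local optima.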
10. Case Study 1: Small Educational Samples
Problem set-up
1. Understand subgroups of profoundly gifted students (IQ>160)
2. Explore impact of educational interventions on early career awards
Sample
1. 17 profoundly gifted students:
Gross’s 2003 sample
Intelligence testing data available
Early achievement testing (verbal, math) available
2. 16 of these same students with follow-up data related to:
Educational intervention data
Early career recognition/awards
11. Data Mining
[Figure: Intelligence and Achievement Dendrogram; leaves labeled 1-17; complete-linkage hierarchical clustering, hclust (*, "complete") on dist(mydata[, 2:4]); vertical axis: Height]
Distinct population that separates out very early in the filtration (box)
Students with an IQ>200 and achievement scores 5+ grades ahead for math and verbal (multivariate outliers)
Corroborates previous evidence of a “high flat” profile distinct from other types of profound giftedness
12. Logistic Regression Coefficient Comparison
Comparison of 2 machine learning models and 2 topologically-based models
Too few observations for traditional logistic regression
Multivariate adaptive regression splines (MARS): inadequate fit (R^2=0.27)
Bayesian model averaging (BMA): extremely large confidence interval
DGLARS and HLASSO (topologically-based): good fit, small confidence intervals, consistent results across replication
Table columns: IQ Score, Early English, Early Math, Early Entry, Grade Skip, Subject Acceleration, Radical Acceleration
Reported coefficients by model (blank cells dropped, so values are not aligned to columns):
MARS: 0.44
BMA: -6.25, 5.79, 0.97, 1.38, 2.41, 33.10
DGLARS: 2.20, 4.66
HLASSO: 0.02, -0.26, 1.44, 3.27
13. Case Study 2: Actuarial Modeling with Subgroups
Problem set-up
Understand risk factors associated with auto insurance claims
Understand subgroups with different types of risk
Sample
Open-source Swedish automobile claims dataset from 1977
2182 claims, 6 predictors
14. Risk Clusters
Three distinct subgroups with varying risk type
Group 1: relatively high dependence on make and number of claims
Group 2: relatively high dependence on bonus and number of years insured
Group 3: almost solely dependent on number of claims and geographic zone
15. Case Study 3: Psychometric Test Design
Set-up:
Explore/validate survey measuring identity importance/expression across social contexts
Create subscales within the survey
Sample:
406 participants in a pilot study
91 test items
Random samples of 130 participants taken with replacement as validation samples
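The resampling step can be sketched as follows. This is a minimal illustration of drawing validation samples with replacement; the number of validation samples (`n_samples=5`), the seed, and the use of integer participant IDs 1 to 406 are assumptions, since the slides give only the pilot size and the sample size.

```python
import random


def validation_samples(participant_ids, sample_size=130, n_samples=5, seed=0):
    """Draw n_samples resamples of sample_size participants, with replacement."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return [rng.choices(participant_ids, k=sample_size)
            for _ in range(n_samples)]


# Hypothetical participant IDs for the 406-person pilot study.
ids = list(range(1, 407))
samples = validation_samples(ids)
print(len(samples), len(samples[0]))
```

Each resample can then be clustered separately, and the resulting subscales compared across samples to check consistency.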
16. Advantages of Topology Over Factor Analysis
Factor analysis: loss of information with each projection to a lower-dimensional space (errors)
Topological methods work by partitioning existing space into homogeneous components (no maps, no error)
2D example
18. Insights Gained
Some aspects of identity are fluid; others are fixed
Political and racial/ethnic identity fixed
Other types, such as athletic or gender, fairly fluid
No statistically significant differences between samples
Subscales consistent
Validates measure
19. Conclusions
Unique challenges in data science:
Subgroups
Small samples
Failure of statistical and machine learning algorithms
Topological data analysis as robust solutions
20. Reference Papers
Carlsson, G. (2009). Topology and data. Bulletin of the American Mathematical Society, 46(2), 255-308.
Edelsbrunner, H., & Harer, J. (2008). Persistent homology-a survey. Contemporary Mathematics, 453, 257-282.
Farrelly, C. M. (2017). Extensions of Morse-Smale Regression with Application to Actuarial Science. arXiv preprint arXiv:1708.05712. Accepted as new model by Casualty Actuarial Society, December 2017.
Farrelly, C. M. (2018). Topology and Geometry for Small Sample Sizes: An Application to Research on the Profoundly Gifted.
Farrelly, C. M., Schwartz, S. J., Amodeo, A. L., Feaster, D. J., Steinley, D. L., Meca, A., & Picariello, S. (2017). The analysis of bridging constructs with hierarchical clustering methods: An application to identity. Journal of Research in Personality, 70, 93-106.
Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse-Smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.