Women in Data Science 2018 Slides--Small Samples, Subgroups, and Topology

Human Behavior, Small Samples, and the Problem of Subgroups
The Power of Topology
3/5/2018

Introduction
 Big data hype
 Less publicized but very important types of data:
1. Small data
2. Data with distinct subgroups
 Industries where these are common:
 Education
 Insurance
 Biotechnology/pharmaceuticals
 Industrial psychology

Problems Unique to Small Data
 Types of small data problems:
 Rare diseases (100 cases worldwide with unknown genetic causes)
 Pilot studies (tens or hundreds of participants)
 Small educational programs (tens of students enrolled in the previous year)
 Main issues:
 Statistical models require minimum sample sizes to estimate effects without computational issues or wide confidence intervals (singularities, p>>n problems).
 Machine learning algorithms need to converge for stable estimates and models.
 Small samples can induce sparsity in the data space, which is problematic for clustering and general data mining techniques.
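The singularity issue behind p>>n problems can be illustrated with a minimal sketch, assuming an ordinary least squares fit through the normal equations. The toy design matrices below are hypothetical and are not data from the talk. With fewer observations than predictors, the matrix X^T X has a zero determinant and cannot be inverted.

```python
from fractions import Fraction

def gram_matrix(X):
    """Return X^T X for a design matrix given as a list of rows."""
    p = len(X[0])
    return [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]

def determinant(M):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    A = [[Fraction(v) for v in row] for row in M]
    n = len(A)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot in this column: matrix is singular
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det
        det *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return det

# Hypothetical toy data: n = 2 observations, p = 3 predictors (p > n).
X_small = [[1, 2, 3],
           [4, 5, 6]]
# The same predictors with n = 4 observations.
X_larger = X_small + [[7, 8, 10], [2, 0, 1]]

det_small = determinant(gram_matrix(X_small))    # rank at most 2, so singular
det_larger = determinant(gram_matrix(X_larger))  # full rank, so invertible
```

Exact fractions are used so that the zero determinant is not blurred by floating-point rounding.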

Problems Unique to Subgroups in Data
 Types of data in which subgroups are common:
 Medicine (diverse causes of a given disease, subtypes of disease)
 Education (different types of students and risk types for failure)
 Industrial psychology (different personality trait patterns)
 Main issues:
 Washing out of effects in a full model
 Examples:
 Small subgroup defined by extremely high extraversion and openness related to public speaking outcome
 Rare genetic variant combination predicting high likelihood of response to a drug within a disease population
 Defining robust partitions within a piecewise regression model to deal with this phenomenon
 Mixed results from many methods employing this strategy
 Difficult with small sample sizes for most piecewise regression models
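The washing-out effect can be sketched with a simple least squares slope. The numbers below are hypothetical and chosen only for illustration: a majority group shows no effect, while a small subgroup shows a strong one. Fitting one model to everyone dilutes the subgroup's slope.

```python
def ols_slope(xs, ys):
    """Slope of a simple least squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical majority group: 20 cases, outcome unrelated to the predictor.
majority_x = [-2, -1, 0, 1, 2] * 4
majority_y = [0.0] * 20

# Hypothetical small subgroup: 4 cases with a strong effect (slope of 3).
subgroup_x = [-2, -1, 1, 2]
subgroup_y = [3.0 * x for x in subgroup_x]

pooled_slope = ols_slope(majority_x + subgroup_x, majority_y + subgroup_y)
subgroup_slope = ols_slope(subgroup_x, subgroup_y)
```

In this toy setting the pooled slope is one fifth of the subgroup slope; with realistic noise an effect of that size could easily go undetected.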

Unique Solutions: Topology
 Branch of mainly pure mathematics
 Study of changes in function behavior on different shapes
 Identify invariant properties of shapes
 Classify similarities/differences between shapes
Deep connections to physics and differential equations

Topological Data Analysis
 Data as discrete point clouds
 Topological spaces, called simplicial complexes, built from these:
 Connect points within a certain distance of each other
 Topologically similar to a graph
 Tools using simplicial complexes for data analysis called topological data analysis (TDA)
2-d neighborhoods are defined by Euclidean distance. Points within a given circle are mutually connected, forming a simplex.
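The construction above can be sketched in a few lines. This is a minimal illustration that assumes the Vietoris-Rips rule (connect two points when their Euclidean distance is at most a chosen radius, and fill in a triangle when all three of its edges are present). The slides do not say which construction was used, and the function name, points, and radius are hypothetical.

```python
from itertools import combinations
from math import dist

def rips_complex(points, radius):
    """Edges and triangles of a Vietoris-Rips complex at scale `radius`."""
    indices = range(len(points))
    # 1-simplices: pairs of points within `radius` of each other.
    edges = {(i, j) for i, j in combinations(indices, 2)
             if dist(points[i], points[j]) <= radius}
    # 2-simplices: triples whose three edges are all present.
    triangles = {(i, j, k) for i, j, k in combinations(indices, 3)
                 if (i, j) in edges and (i, k) in edges and (j, k) in edges}
    return edges, triangles

# Hypothetical point cloud: three nearby points and one far-away point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
edges, triangles = rips_complex(points, radius=1.5)
```

The three nearby points become mutually connected and form one triangle, while the distant point stays isolated at this radius.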

Tool 1: Persistent Homology
 Filtration
 Series of simplicial complexes based on varying distance thresholds
 Features appear and disappear as lens changes
 Nested sequence of features with deep algebraic properties
 Persistence as length of feature existence in the sequences (plotted as persistence diagrams)
 Termed persistent homology
 A bit like an MRI-type examination of data
 Persistence as organ size and type
 Gives a comprehensive view of data
 Persistent homology related to hierarchical clustering
 Statistical methods to compare datasets
[Figure: persistence diagram (axes: Birth, Death; range 0 to 10) alongside a plot over time (0 to 10)]
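The link to hierarchical clustering can be made concrete for the simplest features, connected components. The sketch below is an illustration only: it assumes a Rips-style filtration indexed by pairwise Euclidean distance and tracks components with a union-find structure. Every component is born at threshold 0 and dies when it merges into another, which is the same sequence of merges produced by single-linkage clustering. The function name and points are hypothetical.

```python
from itertools import combinations
from math import dist

def h0_persistence(points):
    """(birth, death) pairs for connected components as the threshold grows."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component's representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process candidate edges from shortest to longest.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    pairs = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # two components merge at threshold d
            pairs.append((0.0, d))   # one of them dies here
    pairs.append((0.0, float("inf")))  # the last component never dies
    return pairs

# Hypothetical point cloud: two tight pairs that are far apart.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
diagram = h0_persistence(points)
```

Short-lived pairs reflect points merging within a cluster; the long-lived pair reflects the gap between the two clusters.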

Tool 2: Morse-Smale Clustering
 Multivariate technique from TDA similar to mode clustering
 Find peaks and valleys in data by filtering on a defined function:
 A watershed on mountains
 Dribbling a soccer ball across a field of hills
 Separate data based on shared peaks and valleys
 Many nice developments on convergence and theoretical properties
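The peaks-and-valleys idea can be sketched on a neighborhood graph. This is a simplified discrete approximation, not the published Morse-Smale algorithm or the implementation used in the talk: each point repeatedly moves to its highest-valued neighbor until it reaches a local peak, does the same downhill to reach a local valley, and points sharing the same (valley, peak) pair form one cluster. The function name, values, and chain-shaped neighbor graph are hypothetical.

```python
def morse_smale_labels(values, neighbors):
    """Label each point by the (valley, peak) pair its uphill/downhill walks reach."""
    def follow(i, better):
        # Move to the best-valued neighbor until no neighbor improves.
        while True:
            best = i
            for j in neighbors[i]:
                if better(values[j], values[best]):
                    best = j
            if best == i:
                return i
            i = best

    labels = []
    for i in range(len(values)):
        peak = follow(i, lambda a, b: a > b)
        valley = follow(i, lambda a, b: a < b)
        labels.append((valley, peak))
    return labels

# Hypothetical function values along a chain of 7 points (two peaks, three valleys).
values = [1, 3, 2, 0, 2.5, 4, 1]
n = len(values)
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]

labels = morse_smale_labels(values, neighbors)
clusters = sorted(set(labels))
```

In practice the neighbor graph would typically come from nearest neighbors in the predictor space and the values from the outcome being modeled; both choices are assumptions here.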

Tool 3: Homotopy and Path Equivalence
 Homotopy arrow example
 Red to blue by wiggling start to finish path
 Yellow arrow and hole problem
 Homotopy method in LASSO
 Wiggles easy regression path to optimal regression path
 Recent success solving ordinary differential equations
 Avoids local optima that can trap other regression estimators
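The idea of deforming an easy regression problem into the target one can be sketched with a LASSO path. The code below is not the homotopy LASSO or DGLARS algorithm from the talk. It swaps in a simpler continuation strategy, coordinate descent with warm starts: start at a penalty large enough that all coefficients are zero (the easy problem), then lower the penalty step by step, reusing each solution as the starting point for the next. The objective assumed is (1/(2n))·||y − Xb||² + λ·||b||₁, and the data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the LASSO proximal step)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, beta, sweeps=200, tol=1e-10):
    """Coordinate descent for the LASSO, started from `beta` (warm start)."""
    n, p = len(X), len(X[0])
    beta = list(beta)
    for _ in range(sweeps):
        max_change = 0.0
        for j in range(p):
            rho = 0.0
            norm = 0.0
            for i in range(n):
                # Prediction from all predictors except j.
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                norm += X[i][j] ** 2
            new = soft_threshold(rho / n, lam) / (norm / n)
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

def lasso_path(X, y, lambdas):
    """Solve from the largest penalty to the smallest, warm-starting each step."""
    beta = [0.0] * len(X[0])
    path = []
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_cd(X, y, lam, beta)
        path.append((lam, list(beta)))
    return path

# Hypothetical data with orthogonal predictors; y = 2*x1 + 1*x2 exactly.
X = [[1, 1], [1, -1], [1, 1], [1, -1]]
y = [3, 1, 3, 1]
path = lasso_path(X, y, [2.0, 1.5, 0.5, 0.0])
```

Because the predictors in this toy example are orthogonal, each coefficient along the path is simply the least squares coefficient shrunk toward zero by the penalty, which makes the path easy to check by hand.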

Case Study 1: Small Educational Samples
 Problem set-up
1. Understand subgroups of profoundly gifted students (IQ>160)
2. Explore impact of educational interventions on early career awards
 Sample
1. 17 profoundly gifted students:
 Gross’s 2003 sample
 Intelligence testing data available
 Early achievement testing (verbal, math) available
2. 16 of these same students with follow-up data related to:
 Educational intervention data
 Early career recognition/awards

Data Mining
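[Figure: Intelligence and Achievement Dendrogram. Complete-linkage hierarchical clustering, hclust (*, "complete"), on dist(mydata[, 2:4]). Vertical axis: Height (0 to 60). Leaf order: 9, 3, 13, 5, 1, 7, 8, 14, 10, 11, 12, 6, 16, 15, 17, 2, 4.]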
 Distinct population that separates out very early in the filtration (box)
 Students with an IQ>200 and achievement scores 5+ grades ahead for math and verbal (multivariate outliers)
 Corroborates previous evidence of a “high flat” profile distinct from other types of profound giftedness
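The dendrogram label indicates complete-linkage clustering on pairwise distances. A minimal sketch of that procedure follows, assuming Euclidean distance. The five rows of (IQ, verbal grades ahead, math grades ahead) are hypothetical toy values, not the study's data, and the function name is an illustration.

```python
from math import dist

def complete_linkage(points):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest pair of members.
                height = max(dist(points[i], points[j])
                             for i in clusters[a] for j in clusters[b])
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical toy rows: (IQ, verbal grades ahead, math grades ahead).
toy_students = [(165, 2, 2), (168, 3, 2), (170, 2, 3), (205, 6, 6), (212, 7, 6)]
merges = complete_linkage(toy_students)
last_a, last_b, last_height = merges[-1]
```

In this toy example the two high-scoring rows join the rest only at the final, much taller merge, which is how a multivariate outlier group shows up in a dendrogram.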

Logistic Regression Coefficient Comparison
 Comparison of 2 machine learning models and 2 topologically-based models
 Too few observations for traditional logistic regression
 Multivariate adaptive regression splines (MARS) inadequate fit (R^2=0.27)
 Bayesian model averaging (BMA) extremely large confidence interval
 DGLARS and HLASSO (topologically-based) good fit, small confidence intervals, consistent results across replication
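Coefficients by model (table columns: IQ Score, Early English, Early Math, Early Entry, Grade Skip, Subject Acceleration, Radical Acceleration; empty cells were not preserved, so each row lists only its reported values in left-to-right order):
MARS: 0.44
BMA: -6.25, 5.79, 0.97, 1.38, 2.41, 33.10
DGLARS: 2.20, 4.66
HLASSO: 0.02, -0.26, 1.44, 3.27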

Case Study 2: Actuarial Modeling with Subgroups
 Problem set-up
 Understand risk factors associated with auto insurance claims
 Understand subgroups with different types of risk
 Sample
 Open-source Swedish automobile claims dataset from 1977
 2182 claims, 6 predictors

Risk Clusters
 Group 1: relatively high dependence on make and number of claims
 Group 2: relatively high dependence on bonus and number of years insured
 Group 3: almost solely dependent on number of claims and geographic zone
 Three distinct subgroups with varying risk type

Case Study 3: Psychometric Test Design
 Set-up:
 Explore/validate survey measuring identity importance/expression across social contexts
 Create subscales within the survey
 Sample:
 406 participants in a pilot study
 91 test items
 Random samples of 130 participants taken with replacement as validation samples
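Drawing the validation samples can be sketched as follows. The pilot size (406) and sample size (130) come from the slide; the number of samples, the seed, and the function name are assumptions made for illustration.

```python
import random

def validation_samples(n_participants, sample_size, n_samples, seed=0):
    """Draw participant indices with replacement for each validation sample."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    ids = range(n_participants)
    return [rng.choices(ids, k=sample_size) for _ in range(n_samples)]

# 406 pilot participants, samples of 130; n_samples=5 is an assumed value.
samples = validation_samples(406, 130, n_samples=5)
```

Because sampling is with replacement, the same participant can appear more than once within a sample.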

Advantages of Topology Over Factor Analysis
Loss of information with each projection to a lower-dimensional space (errors)
Topological methods work by partitioning existing space into homogeneous components (no maps, no error)
2D example
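The information lost in a projection can be quantified in a 2D toy case. The sketch below uses projection onto the principal axis as a stand-in for dimension reduction; factor analysis is a different model, so this is an illustration of the general point rather than of that method. The points and function name are hypothetical.

```python
from math import atan2, cos, sin

def projection_loss(points):
    """Variation lost when 2D points are projected onto their principal axis.

    Returns (lost, total) as sums of squared deviations from the mean.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)
    # Angle of the direction with maximum variance (closed form in 2D).
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    ux, uy = cos(theta), sin(theta)
    kept = sum((x * ux + y * uy) ** 2 for x, y in centered)
    total = sxx + syy
    return total - kept, total

# Hypothetical 2D points that do not lie on a single line.
points = [(0, 0), (1, 1), (2, 0), (3, 1)]
lost, total = projection_loss(points)
```

Unless the points lie exactly on a line, some variation is always discarded, even by the best one-dimensional projection.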

Exploratory Analysis
[Figure: heatmap of the 91 survey items, labeled ILLCa_<identity>_<context>. Identity domains: school_success, gender, age, sexual_or, beauty, sport, religion, politics, tribe, look, music, race, status. Social contexts: family, school, dating, freetime, religion, neighborhood, group. Color scale: -0.2 to 1.]

Insights Gained
 Some aspects of identity are fluid; others are fixed
 Political and racial/ethnic identity fixed
 Other types, such as athletic or gender, fairly fluid
 No statistically significant differences between samples
 Subscales consistent
 Validates measure

Conclusions
 Unique challenges in data science
 Subgroups
 Small samples
 Failure of statistical and machine learning algorithms
 Topological data analysis as a robust solution

Reference Papers
 Carlsson, G. (2009). Topology and data. Bulletin of the American Mathematical Society, 46(2), 255-308.
 Edelsbrunner, H., & Harer, J. (2008). Persistent homology: a survey. Contemporary Mathematics, 453, 257-282.
 Farrelly, C. M. (2017). Extensions of Morse-Smale Regression with Application to Actuarial Science. arXiv preprint arXiv:1708.05712. Accepted as new model by Casualty Actuarial Society, December 2017.
 Farrelly, C. M. (2018). Topology and Geometry for Small Sample Sizes: An Application to Research on the Profoundly Gifted.
 Farrelly, C. M., Schwartz, S. J., Amodeo, A. L., Feaster, D. J., Steinley, D. L., Meca, A., & Picariello, S. (2017). The analysis of bridging constructs with hierarchical clustering methods: An application to identity. Journal of Research in Personality, 70, 93-106.
 Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse-Smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.
