How Data Science is Preventing College Dropouts and Advancing Student Success

© Copyright 2015 Pivotal. All rights reserved.
Srivatsan Ramanujam,
Regunathan Radhakrishnan
Principal Data Scientists
Pivotal Data Science

Agenda
• Justifying the cost of college education
• Institutional Data Lake
• Overview of typical data sources
– Structured
– Unstructured
• Models to predict student success
– Predict time-to-graduate
– Predict term GPA
– Predict course grade
• Operationalizing student success models

Justifying the Cost of College Education

Educators’ Concerns
• The cost of education has been rising steeply, and student debt has risen steeply as a consequence
• How do we justify the value of a college education?
• How do we ensure students graduate on time, don’t drop out, and get better jobs?
• Which factors can educators influence to improve student graduation rates?

Institutional Data Lake

Business Goals
• Educators would like to study the factors that affect student success
• They want to introduce policies and tools that positively impact student success
[Figure: Institutional Data Lake]
• Data sources (structured and unstructured): admissions data, registration data, demographics data, online forums data, Blackboard data, card swipes at campus facilities
• Behaviors: student cluster analysis
• Outcomes: GPA prediction, retention prediction

Advantages of Institutional Data Lake
• Derive insights from data that help drive institution policies
• Deploy data-driven apps that positively impact student success
[Figure: Institutional Data Lake feeding analytics]
• Data sources (structured and unstructured): admissions data, registration data, demographics data, online forums data, Blackboard data, card swipes at campus facilities
• Analytics
– Education research, e.g. predict time-to-grad, predict drop-outs
– Data-driven apps, e.g. course recommender, term GPA predictor

Problem Statement
• Given data on a student’s activity and profile, predict the student’s success (e.g. time to graduate)
– Identify the key attributes that influence time to graduate, so the institution can act on them
• Our approach:
– Create a 360-degree view of each student’s activity and profile
– Apply machine learning methods to predict time to graduate
– Interrogate the resulting models to understand the key factors
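
As a rough illustration of the "identify key attributes" step, the sketch below ranks student attributes by the strength of their Pearson correlation with time to graduate. This is a deliberately simple stand-in, not the models used in the talk (which are not specified at this point). The records, field names, and values are all hypothetical.

```python
from math import sqrt

# Hypothetical student records: every field name and value is invented for illustration.
students = [
    {"credits_per_term": 15, "first_term_gpa": 3.6, "library_visits": 40, "years_to_graduate": 4.0},
    {"credits_per_term": 12, "first_term_gpa": 2.9, "library_visits": 22, "years_to_graduate": 5.0},
    {"credits_per_term": 16, "first_term_gpa": 3.8, "library_visits": 35, "years_to_graduate": 4.0},
    {"credits_per_term": 9,  "first_term_gpa": 2.4, "library_visits": 30, "years_to_graduate": 6.0},
    {"credits_per_term": 14, "first_term_gpa": 3.1, "library_visits": 18, "years_to_graduate": 4.5},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_attributes(records, target):
    """Return (attribute, correlation) pairs, strongest absolute correlation first."""
    ys = [r[target] for r in records]
    features = [name for name in records[0] if name != target]
    scores = {f: pearson([r[f] for r in records], ys) for f in features}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


ranking = rank_attributes(students, "years_to_graduate")
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Correlation only flags attributes worth a closer look; it says nothing about causation, and a real analysis would interrogate a fitted predictive model instead.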

Data Science Toolkit
• Key languages
• Key tools: MLlib, PL/X, modeling tools, visualization tools
• Platform

Pipeline of a Data Science Driven App
• Data feeds: Ingest, Filter, Enrich, Sink
• Data lake
• Modeling: model building, model tuning, continuous model improvement
• Apps
• Business levers
• Tools shown: SpringXD, Greenplum, MLlib, PL/X
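
The Ingest, Filter, Enrich, Sink stages named on this slide can be pictured as a chain of small streaming steps. The sketch below imitates that chain with plain Python generators; it is not SpringXD code, and the event format (`student_id,facility,hour`), the validity rule, and the `after_hours` flag are all assumptions made for illustration.

```python
def ingest(raw_lines):
    """Ingest: parse raw 'student_id,facility,hour' lines into dict events."""
    for line in raw_lines:
        student_id, facility, hour = line.strip().split(",")
        yield {"student_id": student_id, "facility": facility, "hour": int(hour)}


def keep_valid(events):
    """Filter: drop events with a missing student ID or an impossible hour."""
    for event in events:
        if event["student_id"] and 0 <= event["hour"] <= 23:
            yield event


def enrich(events):
    """Enrich: add a derived flag marking late-night or early-morning swipes."""
    for event in events:
        yield {**event, "after_hours": event["hour"] >= 20 or event["hour"] < 6}


def sink(events, store):
    """Sink: append each event to a store (a list standing in for a database table)."""
    for event in events:
        store.append(event)
    return store


# Hypothetical card-swipe feed; the second and fourth lines are intentionally invalid.
raw_feed = ["s001,library,21", ",gym,10", "s002,library,14", "s003,lab,99"]
store = sink(enrich(keep_valid(ingest(raw_feed))), [])
for row in store:
    print(row)
```

Because each stage is a generator, events flow through one at a time, which loosely mirrors how a streaming pipeline processes a feed without loading it all into memory.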

Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
• plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Figure: MPP architecture. SQL reaches the master host (with a standby master), which connects through the interconnect to segment hosts, each running multiple segments.]
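
The execution pattern behind this slide (the same function runs independently on each segment's slice of the data, and the results are gathered afterwards) can be imitated in plain Python. The sketch below is a simulation only: it is not PL/Python or Greenplum code, the `segments` data and function names are hypothetical, and it assumes all rows for a given student live on one segment, so partial results can be merged by simple union.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical rows of (student_id, term_gpa), already distributed across three "segments".
# Assumption: all rows for one student sit on the same segment.
segments = [
    [("s001", 3.2), ("s001", 3.6), ("s004", 2.8)],
    [("s002", 3.9), ("s002", 3.7)],
    [("s003", 2.5), ("s003", 3.1), ("s003", 2.9)],
]


def mean_gpa_per_student(rows):
    """The per-segment task: needs only local rows, so segments never communicate."""
    by_student = {}
    for student_id, gpa in rows:
        by_student.setdefault(student_id, []).append(gpa)
    return {sid: round(mean(gpas), 2) for sid, gpas in by_student.items()}


def run_on_all_segments(func, segments):
    """Apply func to every segment independently, then union the partial results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(func, segments))
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


result = run_on_all_segments(mean_gpa_per_student, segments)
print(result)
```

Threads are used here only to show the shape of the computation; in an MPP database each segment is a separate process, often on a separate host, so the work runs truly in parallel.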

MADlib: Scalable, In-Database Machine Learning
http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf

Functions
Supervised Learning
Regression Models
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Ordinal Regression
• Robust Variance, Clustered Variance
• Support Vector Machines
Tree Methods
• Decision Tree
• Random Forest
Other Methods
• Conditional Random Field
• Naïve Bayes
Unsupervised Learning
• Association Rules (Apriori)
• Clustering (K-means)
• Topic Modeling (LDA)
Statistics
Descriptive
• Cardinality Estimators
• Correlation
• Summary
Inferential
• Hypothesis Tests
Other Statistics
• Probability Functions
Other Modules
• Conjugate Gradient
• Linear Solvers
• PMML Export
• Random Sampling
• Term Frequency for Text
Time Series
• ARIMA
Data Types and Transformations
• Array Operations
• Dimensionality Reduction (PCA)
• Encoding Categorical Variables
• Matrix Operations
• Matrix Factorization (SVD, Low Rank)
• Norms and Distance Functions
• Sparse Vectors
Model Evaluation
• Cross Validation
Predictive Analytics Library
@ApacheMADlib

Overview of Typical Data Sources

Typical Data Sources
Student 360:
• Application, admission
• Academic activity (e.g. courses, GPA)
• Previous education profile
• Demographic profile
• Blackboard Learn activity
• Network activity
• Card swipes activity
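
One way to picture the Student 360 view is as a single record per student, assembled by joining each source on a shared student ID. The sketch below does this with plain dictionaries. The source names, field names, and values are hypothetical, and in the setting described by the talk this join would more plausibly happen in the database than in application code.

```python
from collections import defaultdict

# Hypothetical extracts from three of the sources, keyed by a shared student ID.
admissions = {
    "s001": {"admit_term": "Fall 2012", "sat_math": 640},
    "s002": {"admit_term": "Fall 2013", "sat_math": 710},
}
academics = {
    "s001": {"term_gpa": 3.4, "credits": 15},
}
card_swipes = [("s001", "library"), ("s001", "gym"), ("s002", "library"), ("s001", "library")]


def build_student_360(admissions, academics, card_swipes):
    """Merge per-student attributes and aggregated swipe counts into one profile each."""
    profiles = defaultdict(dict)

    # Attribute-style sources: copy their fields straight into the profile.
    for source in (admissions, academics):
        for student_id, fields in source.items():
            profiles[student_id].update(fields)

    # Event-style source: aggregate raw swipes into one count per facility.
    swipe_counts = defaultdict(lambda: defaultdict(int))
    for student_id, facility in card_swipes:
        swipe_counts[student_id][facility] += 1
    for student_id, counts in swipe_counts.items():
        for facility, n in counts.items():
            profiles[student_id][f"swipes_{facility}"] = n

    return dict(profiles)


profiles = build_student_360(admissions, academics, card_swipes)
for student_id, profile in sorted(profiles.items()):
    print(student_id, profile)
```

Students missing from a source simply lack those fields (here `s002` has no academic record), which is the kind of gap a downstream model has to handle explicitly.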