1. © Copyright 2015 Pivotal. All rights reserved.
How Data Science is Preventing College Dropouts and Advancing Student Success
Srivatsan Ramanujam, Regunathan Radhakrishnan
Principal Data Scientists
Pivotal Data Science
2.
Agenda
Justifying the cost of college education
Institutional Data Lake
Overview of typical data sources
– Structured
– Unstructured
Models to predict student success
– Predict time-to-graduate
– Predict term GPA
– Predict course Grade
Operationalizing student success models
3.
Justifying the Cost of College Education
4.
Educators’ Concerns
The cost of education has been rising steeply and, as a consequence, so has student debt
How do we justify the value of a college education?
How do we ensure students graduate on time, don’t drop out, and get better jobs?
Which factors can educators influence to improve student graduation rates?
6.
Business Goals
Educators would like to study factors that affect student success
To introduce policies and tools that positively impact student success
Institutional Data Lake
• Data sources (structured and unstructured): admissions data, registration data, demographics data, online forums data, Blackboard data, card swipes at campus facilities
• Behaviors: Student Cluster Analysis
• Outcomes: GPA Prediction, Retention Prediction
7.
Advantages of Institutional Data Lake
Derive insights from data that help drive institution policies
Deploy data-driven apps that positively impact student success
Institutional Data Lake
• Data sources (structured and unstructured): admissions data, registration data, demographics data, online forums data, Blackboard data, card swipes at campus facilities
• Analytics / education research, e.g. predict time-to-grad, predict drop-outs
• Data-driven apps, e.g. course recommender, term GPA predictor
8.
Problem Statement
Given data related to a student’s activity and profile, predict the student’s success (e.g. time to graduate)
– Identify key attributes that influence the time to graduate, so that the institution can take action on some of them
Our approach:
– Create a 360-degree view of each student’s activity and profile
– Apply machine learning methods to predict time to graduate
– Interrogate the developed models to understand key factors
9.
Data Science Toolkit
KEY LANGUAGES
KEY TOOLS
– Modeling Tools (including MLlib, PL/X)
– Visualization Tools
PLATFORM
10.
Pipeline of a Data Science Driven App
• Data Feeds: Ingest, Filter, Enrich, Sink (SpringXD)
• Data Lake (Greenplum)
• Model Building, Model Tuning, Continuous Model Improvement (MLlib, PL/X)
• Business Levers
• Apps
11.
Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
• plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Diagram: SQL enters at the Master Host (with a Standby Master); the Interconnect links it to the Segment Hosts, each running several Segments]
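The data-parallel pattern on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Greenplum code: the `segments` dictionary, `fit_mean`, and `run_on_all_segments` are hypothetical names, the GPA values are made up, and a thread pool stands in for the segment hosts that would each run the PL/Python function body on their local slice of the data.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical stand-in for a distributed table: each key plays the role of
# one segment and holds only its local rows (made-up term GPAs).
segments = {
    "seg0": [3.2, 3.8, 2.9],
    "seg1": [3.5, 3.1],
    "seg2": [2.4, 3.9, 3.0, 3.3],
}

def fit_mean(rows):
    """Stand-alone per-segment work; in PL/X this would be the UDF body."""
    return {"n": len(rows), "mean": mean(rows)}

def run_on_all_segments(func, data):
    """Apply func to every segment's rows independently of the others."""
    with ThreadPoolExecutor(max_workers=len(data)) as pool:
        futures = {name: pool.submit(func, rows) for name, rows in data.items()}
        return {name: fut.result() for name, fut in futures.items()}

results = run_on_all_segments(fit_mean, segments)
```

Because no segment needs another segment's rows, the work is embarrassingly parallel, which is the case the slide says PL/X handles well.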
12.
MADlib: Scalable, In-Database Machine Learning
http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf
13.
Functions
Supervised Learning
Regression Models
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Ordinal Regression
• Robust Variance, Clustered Variance
• Support Vector Machines
Tree Methods
• Decision Tree
• Random Forest
Other Methods
• Conditional Random Field
• Naïve Bayes
Unsupervised Learning
• Association Rules (Apriori)
• Clustering (K-means)
• Topic Modeling (LDA)
Statistics
Descriptive
• Cardinality Estimators
• Correlation
• Summary
Inferential
• Hypothesis Tests
Other Statistics
• Probability Functions
Other Modules
• Conjugate Gradient
• Linear Solvers
• PMML Export
• Random Sampling
• Term Frequency for Text
Time Series
• ARIMA
Data Types and Transformations
• Array Operations
• Dimensionality Reduction (PCA)
• Encoding Categorical Variables
• Matrix Operations
• Matrix Factorization (SVD, Low Rank)
• Norms and Distance Functions
• Sparse Vectors
Model Evaluation
• Cross Validation
Predictive Analytics Library
@ApacheMADlib
15.
Typical Data Sources
Student 360
• Application, admission
• Academic activity (e.g. courses, GPA)
• Previous education profile
• Demographic profile
• Blackboard Learn activity
• Network activity
• Card swipes activity
17.
Admit Rate and Yield Rate
Admit rate = number admitted / number of applications
Yield rate = number enrolled / number admitted
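As a quick worked example of the two definitions above, the following sketch computes both rates. The function names and the counts are made up for illustration and do not come from the institution's data.

```python
def admit_rate(num_admitted, num_applications):
    """Share of applications that resulted in an admission offer."""
    return num_admitted / num_applications

def yield_rate(num_enrolled, num_admitted):
    """Share of admitted students who went on to enroll."""
    return num_enrolled / num_admitted

# Made-up counts for a single admission cycle.
applications, admitted, enrolled = 10000, 4000, 1000
cycle_admit_rate = admit_rate(admitted, applications)
cycle_yield_rate = yield_rate(enrolled, admitted)
```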
18.
Percentage of Foreign Applicants Increases, while Percentage
of Residents Decreases
21.
Changes in Enrollment Info: College, Major, Program
How many changes?
When?
• Late grads are more likely to change college, major or program, and to make the change later in their student career
25.
Percentage of Assignments Submitted/Course/Term
• Ratio = number of assignments submitted per course / actual number of assignments for that course
• Average this ratio across the courses taken in each academic period
• Students who drop out are less likely to submit all assignments in a course
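The feature described above can be sketched as follows. The record layout, course names, and counts are hypothetical; the only logic taken from the slide is the per-course ratio followed by an average across courses within each academic period.

```python
from collections import defaultdict

# Hypothetical records for one student:
# (academic_period, course, assignments_submitted, assignments_total)
records = [
    ("Fall 2008", "MATH101", 8, 10),
    ("Fall 2008", "CS101", 5, 5),
    ("Spring 2009", "PHYS101", 3, 6),
]

def submission_ratio_by_period(rows):
    """Per-course submitted/total ratio, averaged over courses in each period."""
    per_period = defaultdict(list)
    for period, _course, submitted, total in rows:
        if total > 0:  # a course with no assignments contributes nothing
            per_period[period].append(submitted / total)
    return {period: sum(r) / len(r) for period, r in per_period.items()}

ratios = submission_ratio_by_period(records)
```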
26.
How Quickly do Students Submit Assignments?
• Calculate the time difference between first_submission and the student’s submission
• Average across courses and assignments per academic period
• Time is expressed in hours
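A minimal sketch of this feature is shown below. It assumes that first_submission means the earliest submission by any student for the same assignment, which the slide does not state explicitly, and that the rows have already been filtered to one academic period. The log layout and timestamps are made up.

```python
from datetime import datetime

# Hypothetical submission log: (student, course, assignment, submitted_at)
submissions = [
    ("s1", "CS101", "hw1", datetime(2008, 9, 10, 9, 0)),
    ("s2", "CS101", "hw1", datetime(2008, 9, 10, 15, 0)),
    ("s1", "CS101", "hw2", datetime(2008, 9, 20, 8, 0)),
    ("s2", "CS101", "hw2", datetime(2008, 9, 21, 8, 0)),
]

def mean_delay_hours(rows):
    """Average hours between the first submission and each student's own."""
    # Assumption: first_submission is the earliest timestamp per assignment.
    first = {}
    for _student, course, assignment, ts in rows:
        key = (course, assignment)
        first[key] = min(first.get(key, ts), ts)
    delays = {}
    for student, course, assignment, ts in rows:
        hours = (ts - first[(course, assignment)]).total_seconds() / 3600
        delays.setdefault(student, []).append(hours)
    return {student: sum(h) / len(h) for student, h in delays.items()}

delay = mean_delay_hours(submissions)
```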
27.
Participation Ratio
Participation ratio = number of posted messages/number of viewed messages
• The number of viewed messages is always greater than the number of posted messages
• Normal students view more messages than they post
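The ratio defined on this slide is a one-line computation; the sketch below adds a guard for students who viewed nothing, which is an assumption on our part rather than something the slide specifies. The counts are made up.

```python
def participation_ratio(num_posted, num_viewed):
    """Posted messages divided by viewed messages; None when nothing was viewed."""
    if num_viewed == 0:
        return None
    return num_posted / num_viewed

# Made-up counts for one student in one term.
ratio = participation_ratio(num_posted=12, num_viewed=48)
```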
28.
Popularity of Messages by Posters
• Median viewer count largest for normal grads
• Msg_hit_count is not the highest for normal grads
30.
Network Activity – Usage by Class Label
• Incomplete and late grads tend to use the network more
32.
Students’ network logs and card activity
Card Swipes – Laundry and Dining Facility Use
• Late grads are more likely to use laundry facilities while classes are in session
• Late grads also tend to have breakfast, lunch and dinner later than on-time grads
33.
Models
Predict Time-to-Graduate
Predict Term GPA
Predict Course Grade
34.
Modeling Approach: Two Types of Features
Static features
• Remain the same irrespective of time window
• E.g. gender, admission attributes, SAT score, etc.
Time-sensitive features
• Depend on the time window (w1, w2)
• E.g. course activity, bblearn activity, etc.
[Timeline: the Fall 2008 cohort joins; for each term (Fall 2008, Spring 2009, Summer 2009) features are extracted until the end of the term (windows w1, w2), followed by modeling and scoring of whether or not a student will graduate on time]
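The distinction between the two feature types can be sketched as follows. The record layout, the event names, and the window end dates are hypothetical; the sketch only illustrates that static features are identical across windows while time-sensitive ones are recomputed from the events seen up to the end of each window.

```python
from datetime import date

# Hypothetical student record: static attributes plus dated activity events.
student = {
    "static": {"gender": "F", "sat_score": 1310},
    "events": [
        (date(2008, 9, 15), "course_activity"),
        (date(2008, 11, 2), "bblearn_activity"),
        (date(2009, 2, 20), "course_activity"),
        (date(2009, 6, 5), "bblearn_activity"),
    ],
}

def build_features(record, window_end, cohort_start=date(2008, 9, 1)):
    """Copy static features as-is; count time-sensitive events only when
    they fall inside [cohort_start, window_end]."""
    features = dict(record["static"])
    for when, kind in record["events"]:
        if cohort_start <= when <= window_end:
            features[f"n_{kind}"] = features.get(f"n_{kind}", 0) + 1
    return features

# Assumed term boundaries, chosen only for illustration.
w1 = build_features(student, window_end=date(2008, 12, 31))  # end of Fall 2008
w2 = build_features(student, window_end=date(2009, 5, 31))   # end of Spring 2009
```

Scoring at the end of a later term reuses the same static columns and simply widens the window for the time-sensitive ones.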
35.
Modeling – Algorithms and Libraries
• Logistic Regression (MADlib)
• XGBoost (https://github.com/dmlc/xgboost)
• AdaBoost (Scikit-Learn)
• RandomForest (Scikit-Learn)
• Extracted three categories of features from all data sources
- Static: ~50 features
- Time-sensitive: ~115 features
- Card swipes, network logs: ~110 features
• Built models for three different problems
- Time-to-graduate
- Course grade
- Term GPA
37.
Operationalization Pipeline
[Pipeline diagram, flattened]
• Structured, unstructured data sources in data lake
• Refreshed data (incoming daily/weekly/monthly updates)
• Feature gen. pipeline, producing three feature sets:
- Static features
- Static + time-sensitive LMS features
- Static + time-sensitive LMS + network + card logs features
• Modeling pipeline:
- In-database parallel grid-search (XGBoost)
- MADlib Logistic Regression
- Sklearn AdaBoost
- Sklearn RandomForest
• Model selection
• Serialize to disk
• Scoring results:
- Student ID
- Feature names, values, importance scores
- Predictions (late, normal, dropped)
• User notification on smartphone app
• Cleared by Data Scientist
38.
In-Database Parallel Grid Search Using XGBoost
https://github.com/vatsan/gp_xgboost_gridsearch
• XGBoost (eXtreme Gradient Boosting) is a popular library used in many prize-winning Kaggle contests
• Implemented in C++ with Python and R bindings
• Supports multi-core execution
• We implemented in-database parallel grid search for XGBoost using PL/Python
39.
In-Database Grid Search – Approach
https://github.com/vatsan/gp_xgboost_gridsearch
[Approach diagram, flattened]
• Structured, unstructured data sources; refreshed data (incoming daily/weekly/monthly updates)
• Feature gen. pipeline produces the training dataset (distributed table)
• Grid search params dict is expanded into a grid params table
• Driver function (PL/Python) on the master: pickle and distribute
• Segments: each gets one param list (param-list-1 . . . param-list-n) plus the serialized training set, and produces one model (mdl-1 . . . mdl-n)
• Model selection, scored results
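The approach in this diagram can be sketched in plain Python. This is a hedged illustration, not the code in the linked repository: the grid, `expand_grid`, `train_one`, and the toy threshold classifier are all hypothetical stand-ins (no XGBoost or PL/Python call is made), and a simple loop plays the role of the segments that would each train one model in parallel.

```python
import itertools
import pickle

# Hypothetical grid-search params dict for a toy threshold classifier.
grid = {"threshold": [0.3, 0.5, 0.8], "invert": [False, True]}

def expand_grid(params):
    """Expand the params dict into one row per combination (the grid params table)."""
    keys = sorted(params)
    combos = itertools.product(*(params[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]

# Toy training set of (feature, label) pairs, pickled once by the driver.
training_set = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
payload = pickle.dumps(training_set)

def train_one(param_row, serialized):
    """Stand-in for per-segment work: unpickle the data and score one param row."""
    data = pickle.loads(serialized)
    correct = 0
    for x, label in data:
        pred = int(x > param_row["threshold"])
        if param_row["invert"]:
            pred = 1 - pred
        correct += int(pred == label)
    return {"params": param_row, "accuracy": correct / len(data)}

param_table = expand_grid(grid)
models = [train_one(row, payload) for row in param_table]  # one model per row
best = max(models, key=lambda m: m["accuracy"])            # model selection
```

Each call to `train_one` needs only its own parameter row and the serialized training set, which is what lets the real pipeline hand one row to each segment.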
40.
Model Training and Scoring: XGBoost
https://github.com/vatsan/gp_xgboost_gridsearch
[Two panels: Training; Scoring]
41.
Conclusions
We built an institutional data lake for our customer, to serve as a platform for education research and data-driven apps for student success
Reviewed typical data sources in an institutional data lake
– Interesting features that make up student 360
Using open source, scalable data science toolkits, we built three different models for predicting student success
We set up a pipeline for operationalizing our models, to be consumed by a data-driven smartphone app
42.
Pivotal Data Science Blogs
1. Scaling native (C++) apps on Pivotal MPP
2. Predicting commodity futures through Tweets
3. A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4. Using data science to predict TV viewer behavior
5. Twitter NLP: Scaling part-of-speech tagging
6. Distributed deep learning on MPP and Hadoop
7. Multi-variate time series forecasting
8. Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal