1) The document discusses using linear regression on 1 terabyte of data by leveraging Amazon Web Services' free tier and distributed computing algorithms in Python and R.
2) It notes the challenges of going beyond linear models with big data, including better prediction and real-time analytics.
3) A proposed solution is "universal association discovery" to find relationships between random variables regardless of form using functions on observation graphs, though this approach currently only works for continuous variables.
Linear Regression on 1TB Data & Real-Time Analytics Challenges
1. Linear Regression on 1 Terabyte of Data?
Some Crazy Observations and Actions
Hesen Peng
Amazon.com
Big Data Exploration with Amazon
2. Model building procedure for a major internet company
• Planning and idea generation
• Data collection
• Model building and offline evaluation
• Implementation for application online
• Performance evaluation in real world
Slide annotations: Experiment Design, Clinical Trial; Major Machine Learning/Stat research; Interesting weekend project; Unsupervised Machine Learning, Survival analysis; Power Point
4. Wanna try it out?
• Use Amazon Web Services! (with free tier)
– http://aws.amazon.com/education/
• Write simple distributed algorithm:
– Python: MRJob (https://github.com/Yelp/mrjob)
– R: RHadoop (https://github.com/RevolutionAnalytics/RHadoop)
– Launch your own Sun/Oracle Grid Engine environment for parallel computing (http://star.mit.edu/cluster/)
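One way a "simple distributed algorithm" for linear regression can look is sketched below. It is an illustration only, not code from the talk: each chunk of data is reduced to a handful of sums (the map step), the sums are added across chunks (the reduce step), and the coefficients are solved once at the end. The function names `map_chunk`, `combine`, and `solve` are hypothetical, and the example uses a single predictor and plain Python rather than MRJob or RHadoop. In a real MRJob job the same logic would presumably sit in the mapper and reducer.

```python
from functools import reduce


def map_chunk(rows):
    """Map step: reduce one chunk of (x, y) pairs to five sufficient statistics."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: statistics from two chunks simply add element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats):
    """Closed-form least squares for y = intercept + slope * x."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Toy data standing in for chunks stored on different machines (y = 1 + 2x).
chunks = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]
intercept, slope = solve(reduce(combine, map(map_chunk, chunks)))
print(intercept, slope)
```

Because only five numbers per chunk travel over the network, the amount of data shuffled between machines does not grow with the size of the chunks.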
5. New Challenges
• Association beyond linear
– Make better use of data: (most) factors are statistically significant in linear models with 1 TB of data
– (Better?) Prediction
• Everything goes to real time
– Build / update model, analytics, data storage in real time
– Faster response to new happenings
– Save engineering overhead
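The remark that most factors become statistically significant at this scale can be illustrated with the standard t-statistic for a sample correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below holds a practically negligible correlation fixed and lets the sample size grow. The value r = 0.001 and the sample sizes are hypothetical choices for illustration, not figures from the talk.

```python
import math


def t_statistic(r, n):
    """t-statistic for testing zero correlation, given sample correlation r and n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


r = 0.001  # hypothetical effect size, far too small to matter in practice
for n in (10**4, 10**6, 10**8, 10**10):
    print(n, round(t_statistic(r, n), 2))
```

The statistic grows with the square root of n, so the same tiny effect that is invisible at ten thousand rows sits far beyond any conventional significance threshold at ten billion rows. Significance alone therefore stops being a useful filter.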
6. Real time big data analytics work flow
• Real time data input (training + testing data)
• Real time analytics front end
• Dashboarding / monitoring
• Model building / update
• Prediction server
• Outlier detection and pre-processing
• Huge Statistical Challenge
• Tree design rather than ring design, enabling parallel construction and update
7. Where are we?
(Bullets: timeliness of model build. Sub-items: increasing complexity of association.)
• Offline model building and scheduled updating
– Linear regression / GLM using Mahout etc.
– Random Forest, SVM, Hashing, and beyond
– Mutual information, Brownian distance covariance, Mira score, and density estimation!
• Batch processing and near real time updating
– Batch update to the linear model
– Batch update of random forest, adaptively throwing away trees
– ?
• Real time data processing / cleaning and model building
– Linear model built and consumed in real time
– ?
– Real time universal association discovery!
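For the "linear model built and consumed in real time" cell, one possible shape is a model whose state is updated one observation at a time and can be queried at any moment. The sketch below is an assumption about how that could be done, not the system described in the talk: it keeps running means and co-moments (a Welford-style update) for a single predictor. The class name `OnlineSimpleRegression` and its methods are hypothetical.

```python
class OnlineSimpleRegression:
    """Simple linear regression updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.sxx = 0.0  # running sum of squared deviations of x
        self.sxy = 0.0  # running sum of cross deviations of x and y

    def update(self, x, y):
        """Fold one new (x, y) pair into the running statistics in O(1)."""
        self.n += 1
        dx = x - self.mean_x  # deviation from the old mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.sxx += dx * (x - self.mean_x)
        self.sxy += dx * (y - self.mean_y)

    def predict(self, x):
        """Predict y from the coefficients implied by the data seen so far."""
        if self.n < 2 or self.sxx == 0.0:
            raise ValueError("need at least two distinct x values")
        slope = self.sxy / self.sxx
        intercept = self.mean_y - slope * self.mean_x
        return intercept + slope * x


model = OnlineSimpleRegression()
for x in range(5):
    model.update(float(x), 1.0 + 2.0 * x)  # toy stream following y = 1 + 2x
print(model.predict(10.0))
```

Memory use is constant regardless of how many observations have arrived, which is what makes this kind of model cheap to keep current.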
8. Universal association discovery
• Discover associations between two random vectors
• Regardless of dimension and association form (linear / nonlinear / higher order interaction).
• E.g. Mutual information, Brownian distance covariance, Mira score (1NN edge sum)
9. Intuition
Hesen Peng, Tianwei Yu. SeMira: Universal Association Discovery and Variable Selection among Continuous Variables using Functions on the Observation Graph
10. Mira score: another function on the distance graph
• Where d(i) is the distance between observation i and its nearest neighbor.
• O(N^2 P)
• How to adapt to real time analytics?
– Segment data for batch processing
– Keep partial data in memory and change the calculation function
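The formula on this slide did not survive, so the sketch below works only from the two surviving descriptions: "1NN edge sum" and d(i) as the distance from observation i to its nearest neighbor. It assumes the score is the sum of d(i) over all observations, with distances taken in the joint (X, Y) space. Both of those readings, the Euclidean metric, and the shuffled baseline are assumptions made for illustration. The exact definition is in the SeMira paper. The function name `nn_edge_sum` is hypothetical.

```python
import math
import random


def nn_edge_sum(points):
    """Sum over observations of the Euclidean distance to the nearest other observation.

    Compares every pair of N points in P dimensions, hence O(N^2 P).
    """
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total


# Toy data: y depends on x nonlinearly, with essentially zero linear correlation.
xs = [i / 50 for i in range(-50, 51)]
ys = [x * x for x in xs]
associated = nn_edge_sum(list(zip(xs, ys)))

# Baseline: break the association by shuffling y against x.
shuffled_ys = ys[:]
random.Random(0).shuffle(shuffled_ys)
independent = nn_edge_sum(list(zip(xs, shuffled_ys)))

print(associated, independent)
```

In this toy case the associated sample lies on a curve, so its points sit closer to their nearest neighbors and the edge sum is smaller than for the shuffled sample, even though a linear model would see no relationship.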
11. From O(N^2 P) to O(NP)
• A whole distance matrix between observations
• Only keep the most up-to-date few in memory and calculate NN distance between observations kept in memory
• Yes, loss of power; assuming association is independent of sequence of observation
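The windowing idea above can be sketched as follows, again assuming the score is a sum of nearest-neighbor distances in the joint space. Each arriving observation is compared only against a fixed-size buffer of the most recent observations, so the cost per observation is O(WP) for window size W and the total is O(NP) when W is held constant. The function name `streaming_nn_edge_sum`, the default window of 50, and the toy data are hypothetical choices, not details from the talk.

```python
import math
import random
from collections import deque


def streaming_nn_edge_sum(stream, window=50):
    """Accumulate nearest-neighbor distances against only the last `window` observations."""
    recent = deque(maxlen=window)  # oldest observation is dropped automatically
    total = 0.0
    for point in stream:
        if recent:
            total += min(math.dist(point, q) for q in recent)
        recent.append(point)
    return total


rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(500)]
ys = [x * x for x in xs]  # nonlinear association, arriving in random order
shuffled = ys[:]
rng.shuffle(shuffled)  # baseline with the association broken

assoc = streaming_nn_edge_sum(zip(xs, ys))
indep = streaming_nn_edge_sum(zip(xs, shuffled))
print(assoc, indep)
```

The nearest neighbor found inside the window is generally farther away than the true nearest neighbor in the full data, which is one way to see the loss of power. The comparison is also only meaningful if the arrival order carries no information about the association.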
12. We are still at Day 1
• Mira score: only capable of detecting association
between continuous variables
– SeMira: variable selection
– No prediction yet
• Functions on the distance graph are a gold mine.
• Real time analytics = $$$
– Fraud detection
– Clustering
– Recommendation systems
13. Join Us!
• Ask Hesen for referral: hesepeng@amazon.com
• http://www.amazon.com/gp/jobs
• Jobs of all levels:
– Research Scientist
– Business Intelligence Engineer
– Software Development Engineers
– Machine Learning scientist
– Manager in Machine Learning