SlideShare a Scribd company logo
1 of 13
Linear Regression on 1 Terabytes of Data?
Some Crazy Observations and Actions
Hesen Peng
Amazon.com
Big Data Exploration with Amazon
Model building procedure for
a major internet company
Planning and
Idea Generation
Data collection
Model building
and offline
evaluation
Implementation
for application
online
Performance
evaluation in
real world
Experiment
Design,
Clinical Trial
Major Machine
Learning/Stat
research
Interesting
weekend project
Unsupervised
Machine Learning,
Survival analysis
Power Point
Linear regression with 1TB of data
Wanna try it out?
• Use Amazon Web Service! (with free tire)
– http://aws.amazon.com/education/
• Write simple distributed algorithm:
– Python: MRJob (https://github.com/Yelp/mrjob)
– R: RHadoop (https://github.com/RevolutionAnalytics/RHadoop)
– Launch your own Sun/Oracle Grid Engine
environment for parallel computing
(http://star.mit.edu/cluster/)
New Challenges
• Association beyond linear
– Make better use of data: (most) factors are statistically
significant in linear models with 1 TB of data
– (Better?) Prediction
• Everything goes to real time
– Build/ update model, analytics, data storage in real
time
– Faster response to new happenings
– Save engineering overhead
Real time big data analytics work flow
Real time data input
(training + testing data)
Real time analytics front
end
Dashboarding/
monitoring
Model building / update
Prediction server
Outlier detection and
pre-processing
Huge Statistical
ChallengeTree design rather than
ring design, enabling
parallel construction and
update
Where are we?
Offline model
building and
scheduled updating
Linear regression / GLM
using Mahout etc
Random
Forest, SVM, Hashing, and
beyond
Mutual
information, Brownian
Covariate, Mira score, and
density estimation!
Batch processing and
near real time
updating
Batch update to the linear
model
Batch update of random
forest, adaptively throw
away trees
?
Real time data
processing / cleaning
and model building
Linear model built and
consumed in real time
?
Real time universal
association discovery !
Timeliness of model build
Complexityof
association
Universal association discovery
• Discovere associations between to random
vectors
• Regardless of dimension and association form
(linear / nonlinear/ higher order interaction).
• E.g. Mutual information, Brownian Distance
Covariate, Mira score (1NN edge sum)
Intuition
Hesen Peng, Tianwe Yu. SeMira: Universal Association Discovery and Variable Selection
among Continuous Variables using Functions on the Observation Graph
Mira score: another function on the
distance graph
• Where d(i) is the distance between observation i
and its nearest neighbore.
• O(N2P)
• How to adapt to real time analytics?
– Segment data for batch processing
– Keep partial data in memory and change the
calculation function
From O(N2P) to O(NP)
A whole distance
matrix between
observations
Only keep the most up-to-
date few in memory and
calculate NN distance btw
observations kept in memory
Yes, loss of power;
assuming association is
independent of
sequence of observation
We are still at Day 1
• Mira score: only capable of detecting association
between continuous variables
– SeMira: variable selection
– No prediction yet
• Functions on the distance graph is a gold mine.
• Real time analytics = $$$
– Fraud detection
– Clustering
– Recommendation systems
Join Us!
• Ask Hesen for referral:
hesepeng@amazon.com
• http://www.amazon.com/gp/jobs
• Jobs of all levels:
– Research Scientist
– Business Intelligence Engineer
– Software Development Engineers
– Machine Learning scientist
– Manager in Machine Learning

More Related Content

What's hot

VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1BigML, Inc
 
Azure Machine Learning and ML on Premises
Azure Machine Learning and ML on PremisesAzure Machine Learning and ML on Premises
Azure Machine Learning and ML on PremisesIvo Andreev
 
Data! Data! Data! I Can't Make Bricks Without Clay!
Data! Data! Data! I Can't Make Bricks Without Clay!Data! Data! Data! I Can't Make Bricks Without Clay!
Data! Data! Data! I Can't Make Bricks Without Clay!Turi, Inc.
 
Introduction to Machine Learning in Python using Scikit-Learn
Introduction to Machine Learning in Python using Scikit-LearnIntroduction to Machine Learning in Python using Scikit-Learn
Introduction to Machine Learning in Python using Scikit-LearnAmol Agrawal
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...Andy Petrella
 
Modern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and PracticesModern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and PracticesWill Gardella
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTuri, Inc.
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine LearningJoaquin Vanschoren
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesVimal Gupta
 
Machine learning for_finance
Machine learning for_financeMachine learning for_finance
Machine learning for_financeStefan Duprey
 
One Algorithm to Rule Them All: How to Automate Statistical Computation
One Algorithm to Rule Them All: How to Automate Statistical ComputationOne Algorithm to Rule Them All: How to Automate Statistical Computation
One Algorithm to Rule Them All: How to Automate Statistical ComputationWork-Bench
 
R Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceR Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceWork-Bench
 
K Nearest Neighbor V1.0 Supervised Machine Learning Algorithm
K Nearest Neighbor V1.0 Supervised Machine Learning AlgorithmK Nearest Neighbor V1.0 Supervised Machine Learning Algorithm
K Nearest Neighbor V1.0 Supervised Machine Learning AlgorithmDataMites
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningTamir Taha
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
 
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph CompletionNaomi Shiraishi
 
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15MLconf
 
Magellan FOSS4G Talk, Boston 2017
Magellan FOSS4G Talk, Boston 2017Magellan FOSS4G Talk, Boston 2017
Magellan FOSS4G Talk, Boston 2017Ram Sriharsha
 

What's hot (20)

VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1
 
Azure Machine Learning and ML on Premises
Azure Machine Learning and ML on PremisesAzure Machine Learning and ML on Premises
Azure Machine Learning and ML on Premises
 
Data! Data! Data! I Can't Make Bricks Without Clay!
Data! Data! Data! I Can't Make Bricks Without Clay!Data! Data! Data! I Can't Make Bricks Without Clay!
Data! Data! Data! I Can't Make Bricks Without Clay!
 
Introduction to Machine Learning in Python using Scikit-Learn
Introduction to Machine Learning in Python using Scikit-LearnIntroduction to Machine Learning in Python using Scikit-Learn
Introduction to Machine Learning in Python using Scikit-Learn
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
Modern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and PracticesModern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and Practices
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine Learning
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbies
 
Machine learning for_finance
Machine learning for_financeMachine learning for_finance
Machine learning for_finance
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
One Algorithm to Rule Them All: How to Automate Statistical Computation
One Algorithm to Rule Them All: How to Automate Statistical ComputationOne Algorithm to Rule Them All: How to Automate Statistical Computation
One Algorithm to Rule Them All: How to Automate Statistical Computation
 
R Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceR Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal Dependence
 
K Nearest Neighbor V1.0 Supervised Machine Learning Algorithm
K Nearest Neighbor V1.0 Supervised Machine Learning AlgorithmK Nearest Neighbor V1.0 Supervised Machine Learning Algorithm
K Nearest Neighbor V1.0 Supervised Machine Learning Algorithm
 
L11. The Future of Machine Learning
L11. The Future of Machine LearningL11. The Future of Machine Learning
L11. The Future of Machine Learning
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
 
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
 
Magellan FOSS4G Talk, Boston 2017
Magellan FOSS4G Talk, Boston 2017Magellan FOSS4G Talk, Boston 2017
Magellan FOSS4G Talk, Boston 2017
 

Similar to Linear Regression on 1TB Data & Real-Time Analytics Challenges

Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSotiris Beis
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
Tools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl WintersTools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl WintersMelinda Thielbar
 
Tools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl WintersTools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl WintersMelinda Thielbar
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 
Data analytics in computer networking
Data analytics in computer networkingData analytics in computer networking
Data analytics in computer networkingStenio Fernandes
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
 
Machine Learning for Forecasting: From Data to Deployment
Machine Learning for Forecasting: From Data to DeploymentMachine Learning for Forecasting: From Data to Deployment
Machine Learning for Forecasting: From Data to DeploymentAnant Agarwal
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsNYC Predictive Analytics
 
network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...Ashish Gupta
 
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...Dr. Cornelius Ludmann
 
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...Amazon Web Services
 
TechnicalBackgroundOverview
TechnicalBackgroundOverviewTechnicalBackgroundOverview
TechnicalBackgroundOverviewMotaz El-Saban
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-stepsShesha R
 
network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...Ashish Gupta
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupSri Ambati
 

Similar to Linear Regression on 1TB Data & Real-Time Analytics Challenges (20)

Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Tools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl WintersTools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl Winters
 
Tools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl WintersTools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl Winters
 
OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
Data analytics in computer networking
Data analytics in computer networkingData analytics in computer networking
Data analytics in computer networking
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
Machine Learning for Forecasting: From Data to Deployment
Machine Learning for Forecasting: From Data to DeploymentMachine Learning for Forecasting: From Data to Deployment
Machine Learning for Forecasting: From Data to Deployment
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media Analytics
 
network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...
 
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
Continuous Evaluation of Collaborative Recommender Systems in Data Stream Man...
 
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
 
TechnicalBackgroundOverview
TechnicalBackgroundOverviewTechnicalBackgroundOverview
TechnicalBackgroundOverview
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 

Recently uploaded

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Linear Regression on 1TB Data & Real-Time Analytics Challenges

  • 1. Linear Regression on 1 Terabytes of Data? Some Crazy Observations and Actions Hesen Peng Amazon.com Big Data Exploration with Amazon
  • 2. Model building procedure for a major internet company Planning and Idea Generation Data collection Model building and offline evaluation Implementation for application online Performance evaluation in real world Experiment Design, Clinical Trial Major Machine Learning/Stat research Interesting weekend project Unsupervised Machine Learning, Survival analysis Power Point
  • 4. Wanna try it out? • Use Amazon Web Service! (with free tire) – http://aws.amazon.com/education/ • Write simple distributed algorithm: – Python: MRJob (https://github.com/Yelp/mrjob) – R: RHadoop (https://github.com/RevolutionAnalytics/RHadoop) – Launch your own Sun/Oracle Grid Engine environment for parallel computing (http://star.mit.edu/cluster/)
  • 5. New Challenges • Association beyond linear – Make better use of data: (most) factors are statistically significant in linear models with 1 TB of data – (Better?) Prediction • Everything goes to real time – Build/ update model, analytics, data storage in real time – Faster response to new happenings – Save engineering overhead
  • 6. Real time big data analytics work flow Real time data input (training + testing data) Real time analytics front end Dashboarding/ monitoring Model building / update Prediction server Outlier detection and pre-processing Huge Statistical ChallengeTree design rather than ring design, enabling parallel construction and update
  • 7. Where are we? Offline model building and scheduled updating Linear regression / GLM using Mahout etc Random Forest, SVM, Hashing, and beyond Mutual information, Brownian Covariate, Mira score, and density estimation! Batch processing and near real time updating Batch update to the linear model Batch update of random forest, adaptively throw away trees ? Real time data processing / cleaning and model building Linear model built and consumed in real time ? Real time universal association discovery ! Timeliness of model build Complexityof association
  • 8. Universal association discovery • Discovere associations between to random vectors • Regardless of dimension and association form (linear / nonlinear/ higher order interaction). • E.g. Mutual information, Brownian Distance Covariate, Mira score (1NN edge sum)
  • 9. Intuition Hesen Peng, Tianwe Yu. SeMira: Universal Association Discovery and Variable Selection among Continuous Variables using Functions on the Observation Graph
  • 10. Mira score: another function on the distance graph • Where d(i) is the distance between observation i and its nearest neighbore. • O(N2P) • How to adapt to real time analytics? – Segment data for batch processing – Keep partial data in memory and change the calculation function
  • 11. From O(N2P) to O(NP) A whole distance matrix between observations Only keep the most up-to- date few in memory and calculate NN distance btw observations kept in memory Yes, loss of power; assuming association is independent of sequence of observation
  • 12. We are still at Day 1 • Mira score: only capable of detecting association between continuous variables – SeMira: variable selection – No prediction yet • Functions on the distance graph is a gold mine. • Real time analytics = $$$ – Fraud detection – Clustering – Recommendation systems
  • 13. Join Us! • Ask Hesen for referral: hesepeng@amazon.com • http://www.amazon.com/gp/jobs • Jobs of all levels: – Research Scientist – Business Intelligence Engineer – Software Development Engineers – Machine Learning scientist – Manager in Machine Learning