Analytics Drives Big Data Drives Infrastructure
Confessions of Storage turned Analytics Geeks
Dr. Aloke Guha
29th IEEE Conference on Massive Data Storage
May 8th, 2013
aloke@cruxly.com
What’s Common Between
a Sensor that could Distinguish a fine Cognac,
and Predicting Movies You’d Like on Netflix?
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
The Sommelier “Robot”
Predicting What Movies You’d Watch
(Analytics, BigData, DataStore)+
Many Analytics Techniques . . .
[Diagram: analytics techniques and milestones]
• Labels shown: Statistics, Regression, Linear, Time-Series, Decision Trees, R, AI, Expert Systems, Machine Learning, Neural Networks, SVM, LDA, Naïve Bayes, K-nearest neighbor, Random Forests, Genetic Algorithms, . . .
• Milestones shown: SNARC (Minsky) 1951; AI (McCarthy) 1956; Dendral (Feigenbaum) 1965; Fraser and Burnell (1970); Vapnik (1992); Ihaka and Gentleman (1993)
Common Analytics Processing pre-2000
• Sources: Local
• Data: Numeric, Homogeneous
• Processing: Local
• Consumer: Local
• Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems . . .
Flavor Predictor – Neural Networks
USPTO #5,373,452 (1994) 1988
Pattern Recognition – Genetic Algorithms
USPTO #5,140,530 (1992)
Small to Big
http://article.wn.com/view/2013/04/04/Big_data_forefather_Michael_Stonebraker_shows_no_signs_of_sl/#/related_news
Typical Analytics: 2000-2006
• Sources: Global, Social Networks
• Data: Heterogeneous, Numeric, Text
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc.
2007- : Internet Data Analytics
Financial Risk Scoring: Detect
Risk Scoring: detect incremental change in the number of occurrences where corporate officers mention “risk” (or equivalent terms) during the earnings call
Financial Risk Scoring: Listen
Risk Scoring: detect incremental change in the number of occurrences where corporate officers mention “risk” (or semantically equivalent terms) during the corporate earnings call
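The risk-scoring idea on these two slides can be sketched in a few lines. This is only an illustration under stated assumptions: the term list `RISK_TERMS`, the function names, and the sample transcripts are hypothetical, and plain word matching stands in for whatever semantic-equivalence method the actual system used.

```python
import re
from typing import Iterable

# Hypothetical terms treated as equivalent to "risk"; the slides do not list them.
RISK_TERMS = ("risk", "risks", "risky", "uncertainty", "exposure")

_WORD = re.compile(r"[a-z']+")


def risk_mentions(transcript: str, terms: Iterable[str] = RISK_TERMS) -> int:
    """Count words in one earnings-call transcript that match a risk term."""
    vocabulary = set(terms)
    return sum(1 for word in _WORD.findall(transcript.lower()) if word in vocabulary)


def incremental_change(transcripts: list[str]) -> list[int]:
    """Call-over-call change in risk mentions, oldest transcript first."""
    counts = [risk_mentions(t) for t in transcripts]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# Made-up transcripts, for illustration only.
calls = [
    "Demand was strong and we see little risk this quarter.",
    "We see risk in supply, currency exposure, and macro uncertainty.",
]
print(incremental_change(calls))  # [2]
```

A rising value from one call to the next is the signal the slide describes; how large a change should count as significant is left open here.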
Banking: Credit Worthiness – remember 2008?
Analyze bank reports to assess loans, payments, recoveries, etc. for key bank
indexes, groups of banks, or individual banks
Share of Voice: Online Buzz
Sentiment Analysis
Analytics Processing: 2007-
• Sources: Global, Mobile, New Social (Instagram, . . .)
• Data: Multi-Dimensional, Heterogeneous, Audio/Video
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch, Streaming, . . .
2008 - : Real-Time/Streaming Analytics
Brand Marketing
Brand Management
Customer Support
Customer Support
Lead Generation
. . . More Data, Faster
http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA
“Internet of Things”
http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-for-m2m-technology-to-drive-connected-smarter-cities/
Message Queuing Telemetry Transport
Machine-to-Machine
AumniData: Batch Processing
[Diagram: batch pipeline]
• Sources: Twitter, Blog/Web Site, YouTube, RSS/ATOM Feed
• Collection: Data Collectors (Batch Scheduled), Requestor/URL Scanner
• Processing: NLP + Cruxly Intent Detection (AWS), multiple instances; NLP Stack + AumniData Classifier + Analytics* (RackSpace VM)
• Storage: Content Store; Content / Metadata Index (MySQL); Dashboard Store (SQL Server)
• Presentation: Dashboard Application (3rd party App); Dashboard Configuration (TomCat); Custom Analytics Display; Ad-Hoc Query; Summary
Cruxly: Stream Processing
[Diagram: stream pipeline]
• Source: Twitter
• Ingest: Streaming API Clients (Heroku Workers, 24x7); each sends a Request (Keywords) and receives Tweets (Keywords)
• Processing: NLP (NER, etc.) + Cruxly Intent Detection (AWS), multiple instances
• Storage: Tweets Content Store (DynamoDB); Tweet ID + Intent Signal (Heroku PostgreSQL)
• Presentation: Reports / Dashboard; Tracker Editor (web app - Heroku)
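The flow on this slide (keyword-filtered tweets, intent detection, then separate stores for content and for tweet ID plus intent signal) can be sketched roughly as follows. Everything here is a hypothetical stand-in: plain dicts replace DynamoDB and PostgreSQL, a list replaces the streaming API, and `detect_intent` is a toy rule that does not represent Cruxly's actual intent detection.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Tweet:
    tweet_id: str
    text: str


def keyword_filter(stream: Iterable[Tweet], keywords: set[str]) -> Iterator[Tweet]:
    """Stand-in for a streaming API client: yield tweets containing a tracked keyword."""
    for tweet in stream:
        lowered = tweet.text.lower()
        if any(keyword in lowered for keyword in keywords):
            yield tweet


def detect_intent(text: str) -> str:
    """Toy placeholder for the intent-detection stage (hypothetical rules)."""
    lowered = text.lower()
    if "?" in lowered:
        return "question"
    if any(cue in lowered for cue in ("want", "need", "looking for")):
        return "purchase"
    return "none"


def process(stream: Iterable[Tweet], keywords: set[str],
            content_store: dict, signal_store: dict) -> None:
    """Store full content and the (tweet ID, intent signal) pair separately."""
    for tweet in keyword_filter(stream, keywords):
        content_store[tweet.tweet_id] = tweet.text                 # content store
        signal_store[tweet.tweet_id] = detect_intent(tweet.text)   # intent signal store


# Made-up tweets, for illustration only.
tweets = [
    Tweet("1", "I need a new laptop"),
    Tweet("2", "Lovely weather today"),
    Tweet("3", "Is this laptop any good?"),
]
content: dict = {}
signals: dict = {}
process(tweets, {"laptop"}, content, signals)
print(signals)  # {'1': 'purchase', '3': 'question'}
```

Keeping the small intent records apart from the full tweet content mirrors the two stores in the diagram: reports can query the compact signal table without touching the larger content store.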
Data Analytics Demands . . .
[Diagram: analytics stack]
• Stages: Store, Process, Analyze, View
• Components shown: Data Collector, Text / Sensor Data / Stream . . ., Storm, Yarn, NLP, Classify, Index, Query / RT Query, Ad Hoc / Search / SQL, Machine Learning Library, Stats Library, R, Custom Analytics, Dashboards, Chart, Report
Storage Implications: Back to the Future
MB/s – Batch
IOPS – Stream
Both?
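A back-of-the-envelope sketch of why the two workloads stress storage differently: batch jobs need sustained bandwidth (MB/s), while streams need many small operations per second (IOPS). The function names and all figures below are hypothetical, chosen only to make the arithmetic visible, and decimal units (1 GB = 1000 MB) are assumed.

```python
def batch_bandwidth_mb_s(dataset_gb: float, window_hours: float) -> float:
    """Sustained bandwidth needed to scan a dataset once within a batch window."""
    return dataset_gb * 1000 / (window_hours * 3600)


def stream_iops(events_per_s: float, ios_per_event: float) -> float:
    """Small I/O operations per second needed to keep up with an event stream."""
    return events_per_s * ios_per_event


# Illustrative numbers only: a 3.6 TB scan in one hour, and 5,000 events/s
# with two small writes per event (content plus index entry).
print(batch_bandwidth_mb_s(dataset_gb=3600, window_hours=1))  # 1000.0
print(stream_iops(events_per_s=5000, ios_per_event=2))        # 10000
```

A system that serves both workloads has to meet both figures at once, which is the "Both?" question the slide raises.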
Storage Implications: Back to the Future II, III
[Diagram: Hadoop cluster]
• Nodes: Master, Slave #1 . . . Slave #N, Mgmt Node
• MapReduce: Job Tracker, Task Trackers
• HDFS: Name Node, Data Nodes, HDFS client
• Tools: Zookeeper, Hive, Pig, Oozie, HUE
• Open questions: Storage Capacity Scaling? Storage Tiering? Import/Export Data?
A More General Data Analytics Framework?
[Diagram: proposed framework]
• Data Ingesters: Basic, Smart
• Stores: Content Store; Metadata / In-Mem Store
• Processing: Stream and Batch
• Layers: Analytics Processing; Sensor Processing: Data Integration; Visualization Library / Interactive Query; Local Storage / Flash / DAS; MapReduce / Distributed Data Store
Conclusion
• Data Analytics ⇒ Big Data ⇒ Scale-Out
• Variety ⇒ Infrastructure
• Volume ⇒ Bandwidth Support
• Velocity ⇒ Streaming Support
• We Solved the Processing Problem
• We Need to Solve the Larger Storage Problem
Grateful Acknowledgements
• Kapil Tundwal
• Dr. Kirill Kireyev
• Dr. Andrew Lampert
• Venky Madireddy
• Dr. Shumin Wu
• Joan Wrabetz
