Machine Learning at Scale
Madhukara Phatak
Zinnia Systems
@madhukaraphatak
Agenda
• Zinnia and Big data
• Hadoop Saga
• Machine learning – State of the art
• Scale Challenges
• People challenges
• Machine learning at Zinnia
• Case studies
• Demo
Zinnia and Big data
• BSS/OSS product company
• Big data is normal in telecom
• CDRs (call data records) run to around 3 TB for companies like Airtel
• Needed a solution for processing over 6 months of data
• Started work around 3 years ago
Hadoop Saga
• Hadoop was the default choice
• Challenges in the ecosystem in India
• Hype vs reality
• Work
– Building the ML library Nectar
– Working with companies to build Hadoop expertise and solutions
– POCs
Machine Learning in Hadoop
• Apache Mahout was the choice, but it is too hard to map it to any new requirements
• The Map/Reduce implementation suffered from speed and complexity problems
• Accuracy of the results was often poor
• We set out to build our own and realized it was too much overhead to build even the simplest things
ML and Map Reduce
• M/R forgets everything once one operation is done
• Everything has to go through HDFS, which is slower because of disk overheads
• Mahout long tried to make it as fast as possible, but has more or less given up
• At Zinnia, we moved on to aggregation and KPI-based solutions rather than pure ML.
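
The cost described above can be sketched with a toy example. This is only an illustration, not Hadoop or Mahout code: `CountingSource` is a hypothetical stand-in for a dataset stored on HDFS, and the model (a one-parameter fit of y = w * x by gradient descent) is chosen only because it is iterative. One loop re-reads the input on every iteration, as a chain of separate M/R jobs would, and the other reads it once and keeps it in memory.

```python
class CountingSource:
    """Hypothetical stand-in for a dataset on disk; counts how often it is read."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self.rows)


def gradient_step(data, w, lr=0.05):
    """One gradient-descent step for the model y = w * x with squared error."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def fit_reloading(source, iterations):
    """Each iteration behaves like a separate job that re-reads its input."""
    w = 0.0
    for _ in range(iterations):
        data = source.read()
        w = gradient_step(data, w)
    return w


def fit_in_memory(source, iterations):
    """Read the input once and reuse it across all iterations."""
    data = source.read()
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(data, w)
    return w
```

Both functions arrive at the same weight; the only difference is the number of reads, which grows with the iteration count in the first case and stays at one in the second.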
Apache Spark
• Apache Spark is a framework for lightning-fast cluster computing.
• Built by AMPLab and now Databricks.
• Runs on Hadoop 2.0
• Built for iterative algorithms, such as those in ML
• There is suddenly interest in big data ML again, as with Spark it is finally possible to run fast and accurately
• Mahout is moving on to Spark
MLlib
• Standard Spark library for machine learning
• Built into Spark
• Very small code base – 1200 lines of Scala code
• 40x – 100x faster than Mahout
• Supports
– Linear and Logistic regression
– SVM
– Recommender systems
ML-Scale challenges
• Choosing an algorithm
• Accuracy of algorithm implementation
• Modeling when data is noisy and big
• Faster sampling
• Real time processing
• Accuracy vs Performance
ML-People challenges
• Hard to find data scientists
• Unique combination of skills – programming at scale and maths.
• Mathematical reasoning and practicality of implementation.
Machine learning at Zinnia Systems
• 4-person team
• We work on public data and use ML algorithms to extract interesting insights.
• We work on the following
– Predictive modeling
– Text analysis
– Recommender systems
– Classification systems
Case study – Movie Twitter sentiment analysis
• Everyone likes movies and wants to catch a good movie every week.
• There are too many critic reviews, so it is difficult to say whom to trust.
• Can we find out what the real audience thinks about a movie so that we can make the right choice?
Movie twitter sentiment analysis
• We build a Naïve Bayes model from labeled public tweets.
• We collect tweets about movies every day and run them through the model to get predictions.
• We aggregate these scores to give our Twitter score.
• On par with the IMDb score.
• Demo
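
A minimal sketch of the pipeline above, assuming a multinomial Naïve Bayes classifier with add-one smoothing over whitespace tokens. The slides do not give the features, the label names or the aggregation formula, so `tokenize`, the "pos"/"neg" labels and `twitter_score` (share of tweets predicted positive, scaled to 0-10) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real tweets need more cleaning)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of tweets
        self.vocab = set()

    def train(self, labeled_tweets):
        for text, label in labeled_tweets:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def twitter_score(model, tweets):
    """Assumed aggregation: share of tweets predicted positive, scaled to 0-10."""
    positive = sum(1 for t in tweets if model.predict(t) == "pos")
    return 10.0 * positive / len(tweets)
```

Training happens once on the labeled tweets; the daily step is then just `predict` over the newly collected tweets followed by the aggregation.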
Movie Recommendation System
• Want to explore older movies based on your current likes?
• We pull data from Facebook on your and your friends' movie likes, and recommend movies from our collection of 17,000 movies.
• Model built using public Netflix data
• Demo
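
The slides do not say which recommendation algorithm was used, so the following is only a sketch of one common approach, item-based collaborative filtering on "like" data. It assumes likes arrive as a mapping from user to a set of movie titles; the function names and the choice of Jaccard similarity are illustrative, not taken from the talk.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes_by_user, user, top_n=3):
    """Rank movies the user has not liked by similarity to the ones they have."""
    # Invert the input to movie -> set of users who liked it.
    fans = defaultdict(set)
    for u, movies in likes_by_user.items():
        for m in movies:
            fans[m].add(u)

    liked = likes_by_user[user]
    scores = {}
    for candidate in fans:
        if candidate in liked:
            continue
        # Sum the candidate's similarity to every movie the user already likes.
        scores[candidate] = sum(jaccard(fans[candidate], fans[m]) for m in liked)

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return [m for m in ranked if scores[m] > 0][:top_n]
```

Movies that share no fans with anything the user likes get a score of zero and are left out rather than padded into the list.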
Kick start in ML
• https://www.coursera.org/course/ml
• https://github.com/zinniasystems/spark-ml-class
• https://class.coursera.org/nlp/lecture/preview

More Related Content

What's hot

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation PlatformKarthik Murugesan
 
Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic a...
Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic a...Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic a...
Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic a...Mail.ru Group
 
Machine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh PoduskaMachine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh PoduskaData Con LA
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamSri Ambati
 
Future of ai on the jvm
Future of ai on the jvmFuture of ai on the jvm
Future of ai on the jvmAdam Gibson
 
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...Codemotion
 
Facebook ML Infrastructure - 2018 slides
Facebook ML Infrastructure - 2018 slidesFacebook ML Infrastructure - 2018 slides
Facebook ML Infrastructure - 2018 slidesKarthik Murugesan
 
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2OStrata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2OSri Ambati
 
Productionizing Machine Learning in Our Health and Wellness Marketplace
Productionizing Machine Learning in Our Health and Wellness MarketplaceProductionizing Machine Learning in Our Health and Wellness Marketplace
Productionizing Machine Learning in Our Health and Wellness MarketplaceDatabricks
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learningStanley Wang
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
Ml infra at an early stage
Ml infra at an early stageMl infra at an early stage
Ml infra at an early stageNick Handel
 
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital OneUsing H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital OneSri Ambati
 
Machine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache SparkMachine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache SparkInSemble
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Turi, Inc.
 
From Chatbots to Augmented Conversational Assistants
From Chatbots to Augmented Conversational AssistantsFrom Chatbots to Augmented Conversational Assistants
From Chatbots to Augmented Conversational AssistantsDatabricks
 
Deep learning in production with the best
Deep learning in production   with the bestDeep learning in production   with the best
Deep learning in production with the bestAdam Gibson
 
Introduction to ML.NET
Introduction to ML.NETIntroduction to ML.NET
Introduction to ML.NETMarco Parenzan
 

What's hot (20)

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic a...
Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic a...Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic a...
Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic a...
 
Machine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh PoduskaMachine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh Poduska
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Future of ai on the jvm
Future of ai on the jvmFuture of ai on the jvm
Future of ai on the jvm
 
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
 
Facebook ML Infrastructure - 2018 slides
Facebook ML Infrastructure - 2018 slidesFacebook ML Infrastructure - 2018 slides
Facebook ML Infrastructure - 2018 slides
 
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2OStrata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2O
 
Productionizing Machine Learning in Our Health and Wellness Marketplace
Productionizing Machine Learning in Our Health and Wellness MarketplaceProductionizing Machine Learning in Our Health and Wellness Marketplace
Productionizing Machine Learning in Our Health and Wellness Marketplace
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
Ml infra at an early stage
Ml infra at an early stageMl infra at an early stage
Ml infra at an early stage
 
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital OneUsing H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
 
Machine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache SparkMachine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache Spark
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015
 
From Chatbots to Augmented Conversational Assistants
From Chatbots to Augmented Conversational AssistantsFrom Chatbots to Augmented Conversational Assistants
From Chatbots to Augmented Conversational Assistants
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
Deep learning in production with the best
Deep learning in production   with the bestDeep learning in production   with the best
Deep learning in production with the best
 
Introduction to ML.NET
Introduction to ML.NETIntroduction to ML.NET
Introduction to ML.NET
 

Viewers also liked

Building RESTtful services in MEAN
Building RESTtful services in MEANBuilding RESTtful services in MEAN
Building RESTtful services in MEANMadhukara Phatak
 
Online Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkOnline Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkDavide Nardone
 
Get MEAN! Node.js and the MEAN stack
Get MEAN!  Node.js and the MEAN stackGet MEAN!  Node.js and the MEAN stack
Get MEAN! Node.js and the MEAN stackNicholas McClay
 
Twitter sentiment analysis project report
Twitter sentiment analysis project reportTwitter sentiment analysis project report
Twitter sentiment analysis project reportBharat Khanna
 
Create Rest API in Nodejs
Create Rest API in Nodejs Create Rest API in Nodejs
Create Rest API in Nodejs Irfan Maulana
 
Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment AnalysisJaganadh Gopinadhan
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis worksCJ Jenkins
 
The MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.js
The MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.jsThe MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.js
The MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.jsMongoDB
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Create Restful Web Application With Node.js Express Framework
Create Restful Web Application With Node.js Express FrameworkCreate Restful Web Application With Node.js Express Framework
Create Restful Web Application With Node.js Express FrameworkEdureka!
 
Sentiment analysis of twitter data
Sentiment analysis of twitter dataSentiment analysis of twitter data
Sentiment analysis of twitter dataBhagyashree Deokar
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisFabio Benedetti
 
Twitter sentiment-analysis Jiit2013-14
Twitter sentiment-analysis Jiit2013-14Twitter sentiment-analysis Jiit2013-14
Twitter sentiment-analysis Jiit2013-14Rachit Goel
 
Sentiment analysis of tweets
Sentiment analysis of tweetsSentiment analysis of tweets
Sentiment analysis of tweetsVasu Jain
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 

Viewers also liked (18)

Building RESTtful services in MEAN
Building RESTtful services in MEANBuilding RESTtful services in MEAN
Building RESTtful services in MEAN
 
MongoDB and Node.js
MongoDB and Node.jsMongoDB and Node.js
MongoDB and Node.js
 
Cours1
Cours1Cours1
Cours1
 
Online Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkOnline Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache Spark
 
Get MEAN! Node.js and the MEAN stack
Get MEAN!  Node.js and the MEAN stackGet MEAN!  Node.js and the MEAN stack
Get MEAN! Node.js and the MEAN stack
 
Twitter sentiment analysis project report
Twitter sentiment analysis project reportTwitter sentiment analysis project report
Twitter sentiment analysis project report
 
Create Rest API in Nodejs
Create Rest API in Nodejs Create Rest API in Nodejs
Create Rest API in Nodejs
 
Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment Analysis
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis works
 
The MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.js
The MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.jsThe MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.js
The MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.js
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Create Restful Web Application With Node.js Express Framework
Create Restful Web Application With Node.js Express FrameworkCreate Restful Web Application With Node.js Express Framework
Create Restful Web Application With Node.js Express Framework
 
Sentiment analysis of twitter data
Sentiment analysis of twitter dataSentiment analysis of twitter data
Sentiment analysis of twitter data
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment Analysis
 
Twitter sentiment-analysis Jiit2013-14
Twitter sentiment-analysis Jiit2013-14Twitter sentiment-analysis Jiit2013-14
Twitter sentiment-analysis Jiit2013-14
 
Sentiment analysis of tweets
Sentiment analysis of tweetsSentiment analysis of tweets
Sentiment analysis of tweets
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 

Similar to Machine learninginspark

Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Cloudera, Inc.
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game ChangerCaserta
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014gmalouf678
 
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...Lucidworks
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamRaymond Tay
 
Machine Learning Startup
Machine Learning StartupMachine Learning Startup
Machine Learning StartupBen Lackey
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Scaling a High Traffic Web Application: Our Journey from Java to PHP
Scaling a High Traffic Web Application: Our Journey from Java to PHPScaling a High Traffic Web Application: Our Journey from Java to PHP
Scaling a High Traffic Web Application: Our Journey from Java to PHP120bi
 
Scaling High Traffic Web Applications
Scaling High Traffic Web ApplicationsScaling High Traffic Web Applications
Scaling High Traffic Web ApplicationsAchievers Tech
 
How I became ML Engineer
How I became ML Engineer How I became ML Engineer
How I became ML Engineer Kevin Lee
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesAlice Zheng
 
Mapping Life Science Informatics to the Cloud
Mapping Life Science Informatics to the CloudMapping Life Science Informatics to the Cloud
Mapping Life Science Informatics to the CloudChris Dagdigian
 
Saturn - UCSD CNS Research Review
Saturn - UCSD CNS Research ReviewSaturn - UCSD CNS Research Review
Saturn - UCSD CNS Research ReviewKabirNagrecha
 
Saturn: Joint Optimization for Large-Model Deep Learning
Saturn: Joint Optimization for Large-Model Deep LearningSaturn: Joint Optimization for Large-Model Deep Learning
Saturn: Joint Optimization for Large-Model Deep LearningKabirNagrecha
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisCrowdFlower
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014Ricard Clau
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Christopher Curtin
 

Similar to Machine learninginspark (20)

Machine Learning at Scale
Machine Learning at ScaleMachine Learning at Scale
Machine Learning at Scale
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
 
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beam
 
Machine Learning Startup
Machine Learning StartupMachine Learning Startup
Machine Learning Startup
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
Scaling a High Traffic Web Application: Our Journey from Java to PHP
Scaling a High Traffic Web Application: Our Journey from Java to PHPScaling a High Traffic Web Application: Our Journey from Java to PHP
Scaling a High Traffic Web Application: Our Journey from Java to PHP
 
Scaling High Traffic Web Applications
Scaling High Traffic Web ApplicationsScaling High Traffic Web Applications
Scaling High Traffic Web Applications
 
How I became ML Engineer
How I became ML Engineer How I became ML Engineer
How I became ML Engineer
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Mapping Life Science Informatics to the Cloud
Mapping Life Science Informatics to the CloudMapping Life Science Informatics to the Cloud
Mapping Life Science Informatics to the Cloud
 
Saturn - UCSD CNS Research Review
Saturn - UCSD CNS Research ReviewSaturn - UCSD CNS Research Review
Saturn - UCSD CNS Research Review
 
Saturn: Joint Optimization for Large-Model Deep Learning
Saturn: Joint Optimization for Large-Model Deep LearningSaturn: Joint Optimization for Large-Model Deep Learning
Saturn: Joint Optimization for Large-Model Deep Learning
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Machine learning in Spark

  • 1. Machine Learning at Scale Madhukara Phatak Zinnia Systems @madhukaraphatak
  • 2. Agenda • Zinnia and Big data • Hadoop Saga • Machine learning – State of the Art • Scale Challenges • People challenges • Machine learning at Zinnia • Case studies • Demo
  • 3. Zinnia and Big data • BSS/OSS product company • Big data is normal in Telecom • CDRs (call data records) run to around 3 TB for companies like Airtel • Need a solution for processing over 6 months of data • Started work around 3 years ago
  • 4. Hadoop Saga • Hadoop was the default choice • Challenges in the ecosystem in India • Hype vs Reality • Work – Building the ML library Nectar – Working with companies to build Hadoop expertise and solutions – POCs
  • 5. Machine Learning in Hadoop • Apache Mahout was the choice, but it is too hard to map it to any new requirements • The Map/Reduce implementation suffered from speed and complexity problems • Accuracy of the results was often poor • We set out to build our own and realized it was too much overhead to build even the simplest things
  • 6. ML and Map Reduce • M/R forgets everything once one operation is done • Everything has to go through HDFS, which is slower because of disk overhead • Mahout long tried to make it as fast as possible, but has more or less given up • At Zinnia, we moved on to aggregation and KPI based solutions rather than pure ML.
  • 7. Apache Spark • Apache Spark is a framework for lightning fast cluster computing. • Built by AMPLab and now Databricks. • Runs on Hadoop 2.0 • Built for iterative algorithms, aka ML • There is suddenly interest in big data ML again, as it is finally possible to run fast and accurate ML with Spark • Mahout is moving on to Spark
  • 8. MLlib • Standard Spark library for machine learning • Built into Spark • Very small code base – 1200 lines of Scala code • 40x – 100x faster than Mahout • Supports – Linear and Logistic regression – SVM – Recommender systems
  • 9. ML-Scale challenges • Choosing an algorithm • Accuracy of algorithm implementation • Modeling when data is noisy and big • Faster sampling • Real time processing • Accuracy vs Performance
  • 10. ML-People challenges • Hard to find data scientists • Unique combination of skills – Programming at scale and maths. • Mathematical reasoning and practicality of implementation.
  • 11. Machine learning at Zinnia Systems • 4-person team • We work on public data and use ML algorithms to extract interesting insights. • We work on the following – Predictive modeling – Text analysis – Recommender systems – Classification systems
  • 12. Case study – Movie Twitter sentiment analysis • Everyone likes movies and wants to catch a good movie every week. • There are too many critic reviews, so it is difficult to say whom to trust. • Can we find out what the real audience thinks about the movies, so that we can make the right choice?
  • 13. Movie Twitter sentiment analysis • We build a Naïve Bayes model from labeled public tweets. • We collect tweets about movies every day and run them through the model to get predictions. • We aggregate these scores to give our Twitter score. • On par with the IMDb score. • Demo
  • 14. Movie Recommendation System • Want to explore older movies based on your current likes? • We pull your and your friends' movie likes from Facebook and recommend movies from our collection of 17,000 movies. • Model built using public Netflix data • Demo
  • 15. Kick start in ML • https://www.coursera.org/course/ml • https://github.com/zinniasystems/spark-ml-class • https://class.coursera.org/nlp/lecture/preview
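
The sentiment pipeline on slide 13 (train Naïve Bayes on labeled tweets, predict on each day's tweets, aggregate into one score) can be sketched roughly as below. This is a minimal pure-Python illustration, not the Zinnia implementation: the whitespace tokenizer, the "pos"/"neg" label names, the add-one smoothing, and the 0 to 10 scale of `twitter_score` are all assumptions made for the example, and the slides do not say how the real scores were aggregated.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()


def train_naive_bayes(labeled_tweets):
    """Count documents per label and words per label from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_tweets:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        # Log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed word likelihood.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def twitter_score(model, tweets):
    """Hypothetical aggregate: share of tweets predicted "pos", scaled to 0-10."""
    if not tweets:
        return None
    positive = sum(1 for tweet in tweets if predict(model, tweet) == "pos")
    return 10.0 * positive / len(tweets)


# Toy training data, invented for illustration only.
TRAINING = [
    ("loved this movie great acting", "pos"),
    ("great fun brilliant film", "pos"),
    ("boring plot terrible acting", "neg"),
    ("awful film waste of time", "neg"),
]
MODEL = train_naive_bayes(TRAINING)
```

Calling `twitter_score(MODEL, tweets)` on a day's tweets for one movie gives a single number per movie. At the scale the deck describes, the same counting would more plausibly be done with the Naïve Bayes support in Spark's MLlib than in plain Python.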