SlideShare a Scribd company logo
1 of 46
Download to read offline
Presented by: Huy Nguyen @huyng
Computer vision at scale with
Hadoop and Storm
Copyright © 2014 Yahoo, Inc.
Outline
▪ Computer vision at 10,000 feet
▪ Hadoop training system
▪ Storm continuous stream processing
▪ Hadoop batch processing
▪ Q&A
How can we help Flickr users better
organize, search, and browse
their photo collection?
photo credit: Quinn Dombrowski
What is AutoTagging?
Any photo you upload to Flickr is now automatically
classified using computer vision software
Class Probability
Flowers 0.98
Outdoors 0.95
Cat 0.001
Grass 0.6
Classifiers
Classifiers
Classifiers
Demo
Auto Tags Feeding Search
▪
▪
▪
▪
▪
▪
Our Goals
▪ Train thousands of classifiers
▪ Classify millions of photos per day
▪ Compute auto tags on backlog of billions of photos
▪ Enable reprocessing bi-weekly and every quarter
▪ Do it all within 90 days of being acquired!
Our system for training classifiers
▪ High level overview of image classification
▪ Hadoop workflow
Classification in a nutshell
How do we define a classifier?
A classifier is simply the line “z = wx -
b” which divides positive and negative
examples with the largest margin.
Entirely defined by w and b
Training is simply adjusting w and b to
find the line with the largest margin
dividing negative and positive example
data
Classification is done by computing z,
given some data point x
Image classification in a nutshell
To classify images we need to convert
raw pixels into a “feature” to plug into x
compute
feature
How are image classifiers made?
flower Class Name
Negative
Examples
(approx. 100K)
Positive Examples
(approx. 10K)
Repeat a few thousand times
for each class
Compute
Features
Train
Classifier
w,b
Hadoop Classifier Training
Hadoop Classifier Training
HDFS
Pixel Data
+
Label Data
1 <base64>
2 <base64>
3 <base64>
4 <base64>
.
.
n <base64>
dog 1 TRUE
dog 2 FALSE
dog 3 TRUE
cat 1 FALSE
cat 2 TRUE
cat 3 FALSE
.
.
Pixel Data Label Data
photo_id, pixel_data class, photo_id, label
Hadoop Classifier Training - JOIN
map
map
map
reduce
Pixels
Labels
1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat,TRUE), (dog, FALSE)]
3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
.
.
▪ mappers emit (photo_id, pixels) OR (photo_id, class, label)
▪ reducers, takes advantage of sorting to emit (photo_id, pixels, json(label data))
Hadoop Classifier Training - TRAIN
1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat,TRUE), (dog, FALSE)]
3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
.
.
map
map
map
reduce
compute
feature
training
classifier
dog
clf.
cat
clf.
▪ mappers compute features, emits a (class, feat64, label) for each class
image associated with image
▪ reducers, takes advantage of sorting to train. Emits (class, w, b)
Hadoop Classifier Training
▪ Workflow coordinated with OOZIE
Hadoop Classifier Training
▪ 10,000 mappers
▪ 1,000 reducers
▪ A few thousand classifiers in less than 6 hours
▪ Replaced OpenMPI based system
▪ Easier deployment
▪ Access to more hardware
▪ Much simpler code
Area of Improvements
▪ Reducers have all training data in memory
▪ A potential bottleneck if we want to expand training set for any given
set of classifiers
▪ Known path to avoiding this bottleneck
▪ Incrementally update models as new data arrives
Our system for tagging the flickr corpus
▪ Batch processing with Hadoop
▪ Stream processing with Storm
Batch processing in Hadoop
www HDFS Autotags
Mapper
Search
Engine
Indexer
HDFS
Indexing
Reducer
Autotags
MapperAutotags
Mapper
Hadoop Auto tagging
HDFS
Pixel Data
Autotag Results
1 <base64>
2 <base64>
3 <base64>
4 <base64>
.
.
n <base64>
1 { “cat”: .98, “dog”: 0.3 }
2 { “cat”: .97, “dog”: 0.2 }
3 { “cat”: .2, “dog”: 0.3 }
4 { “cat”: .5, “dog”: 0.9 }
.
.
n { “cat”: .90, “dog”: 0.23 }
Pixel Data Autotag Results
photo_id, pixel_data photo_id, results
Hadoop classification performance
▪ 10,000 Mappers
▪ Processed in about 1 week entire corpus
Main challenge: packaging our code
▪ Heterogenous Python & C++ codebase
▪ How do we get all of the shared library dependencies
into a heterogenous system distributed across many
machines?
Main challenge: packaging our code
▪ tried: pyinstaller, didn’t work
▪ rolled our own solution using strace
▪ /autotags_package
▪ /lib
▪ /lib/python2.7/site-packages
▪ /bin
▪ /app
Stream processing (our first attempt)
www Redis
PHP
WorkersPHP
WorkersPHP
Workers
fork autotags a.jpg
search
db
Why we started looking into Storm
Why we started looking into Storm
Feature
Computation
Classification
300 ms 10 ms
Classifier
Load
300 ms
Classifier loading accounted for
49% of total processing time
Solutions we evaluated at the time
▪ Mem-mapping model
▪ Batching up operations
▪ Daemonize our autotagging code
Enter
Storm
Storm computational framework
▪ Spouts
▪ The source of your data
stream. Spouts “emit”
tuples
▪ Tuples
▪ An atomic unit of work
▪ Bolts
▪ Small programs that
processes your tuple
stream
▪ Topology
▪ A graph of bolts and
spouts
Defining a topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("images", new ImageSpout(), 10);
builder.setBolt("autotag", new AutoTagBolt(), 3)
.shuffleGrouping("images")
builder.setBolt("dbwriter", new DBWriterBolt(), 2)
.shuffleGrouping("autotag")
Defining a bolt
class Autotag(storm.BasicBolt):
def __init__(self):
self.classifier.load()
def process(self, tuple):
<process tuple here>
Our final solution
www Redis Autotags
Spout
Autotags
Bolt
Search
Engine
Indexer
Bolt
DB
Writer
Bolt
Topology
Storm architecture
Supervisors
(10 nodes)
Nimbus
(1 node)
Zookeepers
(3 nodes)
Spouts and bolts live here
Fault tolerant against node failures
Nimbus server dies
▪ no new jobs can be submitted
▪ current jobs stay running
▪ simply restart
Supervisor node dies
▪ All unprocessed jobs for that node get reassigned
▪ Supervisor restarted
Zookeeper node dies
▪ Restart node, no data loss as long as not too many die at once
storm commands
storm jar topology-jar-path class
storm commands
storm list
storm commands
storm kill topology-name
storm commands
storm deactivate topology-jar-path class
storm commands
storm activate topology-name
Is Storm just another queuing system?
Yes, and more
▪ A framework for separating application design from
scaling concerns
▪ Has a great deployment story
▪ atomic start, kill, deactivate for free
▪ no more rsync
▪ No single point of failure
▪ Wide industry adoption
▪ Is programming language agnostic
Try storm out!
https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
https://github.com/nathanmarz/storm-deploy
Thank you
Presentation copyright © 2014 Yahoo, Inc.

More Related Content

What's hot

A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningMakoto Yui
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to HivemallMakoto Yui
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskVíctor Zabalza
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersDataWorks Summit
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...Herman Wu
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Databricks
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentSasha Ovsankin
 
Get Competitive with Driverless AI
Get Competitive with Driverless AIGet Competitive with Driverless AI
Get Competitive with Driverless AISri Ambati
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sjHolden Karau
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...Databricks
 
2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users GroupNitay Joffe
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 

What's hot (20)

A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
20160908 hivemall meetup
20160908 hivemall meetup20160908 hivemall meetup
20160908 hivemall meetup
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product Development
 
Get Competitive with Driverless AI
Get Competitive with Driverless AIGet Competitive with Driverless AI
Get Competitive with Driverless AI
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
R meetup talk
R meetup talkR meetup talk
R meetup talk
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
 
2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 

Similar to Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)

OpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyOpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyGanesan Narayanasamy
 
Information from pixels
Information from pixelsInformation from pixels
Information from pixelsDave Snowdon
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disquszeeg
 
Puppet and CloudStack
Puppet and CloudStackPuppet and CloudStack
Puppet and CloudStackke4qqq
 
Python: Migrating from Procedural to Object-Oriented Programming
Python: Migrating from Procedural to Object-Oriented ProgrammingPython: Migrating from Procedural to Object-Oriented Programming
Python: Migrating from Procedural to Object-Oriented ProgrammingDamian T. Gordon
 
Viktor Tsykunov "Microsoft AI platform for every Developer"
Viktor Tsykunov "Microsoft AI platform for every Developer"Viktor Tsykunov "Microsoft AI platform for every Developer"
Viktor Tsykunov "Microsoft AI platform for every Developer"Lviv Startup Club
 
Puppet Camp Dublin - 06/2012
Puppet Camp Dublin - 06/2012Puppet Camp Dublin - 06/2012
Puppet Camp Dublin - 06/2012Roland Tritsch
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015Debashis Saha
 
NUS-ISS Learning Day 2019-Pandas in the cloud
NUS-ISS Learning Day 2019-Pandas in the cloudNUS-ISS Learning Day 2019-Pandas in the cloud
NUS-ISS Learning Day 2019-Pandas in the cloudNUS-ISS
 
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppet
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
Puppet and Apache CloudStack
Puppet and Apache CloudStackPuppet and Apache CloudStack
Puppet and Apache CloudStackPuppet
 
Infrastructure as code with Puppet and Apache CloudStack
Infrastructure as code with Puppet and Apache CloudStackInfrastructure as code with Puppet and Apache CloudStack
Infrastructure as code with Puppet and Apache CloudStackke4qqq
 
Advanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentAdvanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentMike Brittain
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopJulien SIMON
 
Python在豆瓣的应用
Python在豆瓣的应用Python在豆瓣的应用
Python在豆瓣的应用Qiangning Hong
 
Machine Learning Models on Mobile Devices
Machine Learning Models on Mobile DevicesMachine Learning Models on Mobile Devices
Machine Learning Models on Mobile DevicesLars Gregori
 
Usecase examples of Packer
Usecase examples of Packer Usecase examples of Packer
Usecase examples of Packer Hiroshi SHIBATA
 

Similar to Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen) (20)

OpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyOpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon Valley
 
Information from pixels
Information from pixelsInformation from pixels
Information from pixels
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disqus
 
Puppet and CloudStack
Puppet and CloudStackPuppet and CloudStack
Puppet and CloudStack
 
Python: Migrating from Procedural to Object-Oriented Programming
Python: Migrating from Procedural to Object-Oriented ProgrammingPython: Migrating from Procedural to Object-Oriented Programming
Python: Migrating from Procedural to Object-Oriented Programming
 
Viktor Tsykunov "Microsoft AI platform for every Developer"
Viktor Tsykunov "Microsoft AI platform for every Developer"Viktor Tsykunov "Microsoft AI platform for every Developer"
Viktor Tsykunov "Microsoft AI platform for every Developer"
 
Puppet Camp Dublin - 06/2012
Puppet Camp Dublin - 06/2012Puppet Camp Dublin - 06/2012
Puppet Camp Dublin - 06/2012
 
Django at Scale
Django at ScaleDjango at Scale
Django at Scale
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
NUS-ISS Learning Day 2019-Pandas in the cloud
NUS-ISS Learning Day 2019-Pandas in the cloudNUS-ISS Learning Day 2019-Pandas in the cloud
NUS-ISS Learning Day 2019-Pandas in the cloud
 
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Puppet and Apache CloudStack
Puppet and Apache CloudStackPuppet and Apache CloudStack
Puppet and Apache CloudStack
 
Infrastructure as code with Puppet and Apache CloudStack
Infrastructure as code with Puppet and Apache CloudStackInfrastructure as code with Puppet and Apache CloudStack
Infrastructure as code with Puppet and Apache CloudStack
 
Advanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentAdvanced Topics in Continuous Deployment
Advanced Topics in Continuous Deployment
 
Devf (Shoe Lovers)
Devf (Shoe Lovers)Devf (Shoe Lovers)
Devf (Shoe Lovers)
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
 
Python在豆瓣的应用
Python在豆瓣的应用Python在豆瓣的应用
Python在豆瓣的应用
 
Machine Learning Models on Mobile Devices
Machine Learning Models on Mobile DevicesMachine Learning Models on Mobile Devices
Machine Learning Models on Mobile Devices
 
Usecase examples of Packer
Usecase examples of Packer Usecase examples of Packer
Usecase examples of Packer
 

More from Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

More from Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Recently uploaded

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)

  • 1. Presented by: Huy Nguyen @huyng Computer vision at scale with Hadoop and Storm Copyright © 2014 Yahoo, Inc.
  • 2. Outline ▪ Computer vision at 10,000 feet ▪ Hadoop training system ▪ Storm continuous stream processing ▪ Hadoop batch processing ▪ Q&A
  • 3.
  • 4. How can we help Flickr users better organize, search, and browse their photo collection? photo credit: Quinn Dombrowski
  • 5. What is AutoTagging? Any photo you upload to Flickr is now automatically classified using computer vision software Class Probability Flowers 0.98 Outdoors 0.95 Cat 0.001 Grass 0.6 Classifiers Classifiers Classifiers
  • 7. Auto Tags Feeding Search ▪ ▪ ▪ ▪ ▪ ▪
  • 8. Our Goals ▪ Train thousands of classifiers ▪ Classify millions of photos per day ▪ Compute auto tags on backlog of billions of photos ▪ Enable reprocessing bi-weekly and every quarter ▪ Do it all within 90 days of being acquired!
  • 9. Our system for training classifiers ▪ High level overview of image classification ▪ Hadoop workflow
  • 10. Classification in a nutshell How do we define a classifier? A classifier is simply the line “z = wx - b” which divides positive and negative examples with the largest margin. Entirely defined by w and b Training is simply adjusting w and b to find the line with the largest margin dividing negative and positive example data Classification is done by computing z, given some data point x
  • 11. Image classification in a nutshell To classify images we need to convert raw pixels into a “feature” to plug into x compute feature
  • 12. How are image classifiers made? flower Class Name Negative Examples (approx. 100K) Positive Examples (approx. 10K) Repeat a few thousand times for each class Compute Features Train Classifier w,b
  • 14. Hadoop Classifier Training HDFS Pixel Data + Label Data 1 <base64> 2 <base64> 3 <base64> 4 <base64> . . n <base64> dog 1 TRUE dog 2 FALSE dog 3 TRUE cat 1 FALSE cat 2 TRUE cat 3 FALSE . . Pixel Data Label Data photo_id, pixel_data class, photo_id, label
  • 15. Hadoop Classifier Training - JOIN map map map reduce Pixels Labels 1 <pixelsb64> [(cat, FALSE), (dog, TRUE)] 2 <pixelsb64> [(cat,TRUE), (dog, FALSE)] 3 <pixelsb64> [(cat, FALSE), (dog, TRUE)] . . ▪ mappers emit (photo_id, pixels) OR (photo_id, class, label) ▪ reducers, takes advantage of sorting to emit (photo_id, pixels, json(label data))
  • 16. Hadoop Classifier Training - TRAIN 1 <pixelsb64> [(cat, FALSE), (dog, TRUE)] 2 <pixelsb64> [(cat,TRUE), (dog, FALSE)] 3 <pixelsb64> [(cat, FALSE), (dog, TRUE)] . . map map map reduce compute feature training classifier dog clf. cat clf. ▪ mappers compute features, emits a (class, feat64, label) for each class image associated with image ▪ reducers, takes advantage of sorting to train. Emits (class, w, b)
  • 17. Hadoop Classifier Training ▪ Workflow coordinated with OOZIE
  • 18. Hadoop Classifier Training ▪ 10,000 mappers ▪ 1,000 reducers ▪ A few thousand classifiers in less than 6 hours ▪ Replaced OpenMPI based system ▪ Easier deployment ▪ Access to more hardware ▪ Much simpler code
  • 19. Area of Improvements ▪ Reducers have all training data in memory ▪ A potential bottleneck if we want to expand training set for any given set of classifiers ▪ Known path to avoiding this bottleneck ▪ Incrementally update models as new data arrives
  • 20. Our system for tagging the flickr corpus ▪ Batch processing with Hadoop ▪ Stream processing with Storm
  • 21. Batch processing in Hadoop www HDFS Autotags Mapper Search Engine Indexer HDFS Indexing Reducer Autotags MapperAutotags Mapper
  • 22. Hadoop Auto tagging HDFS Pixel Data Autotag Results 1 <base64> 2 <base64> 3 <base64> 4 <base64> . . n <base64> 1 { “cat”: .98, “dog”: 0.3 } 2 { “cat”: .97, “dog”: 0.2 } 3 { “cat”: .2, “dog”: 0.3 } 4 { “cat”: .5, “dog”: 0.9 } . . n { “cat”: .90, “dog”: 0.23 } Pixel Data Autotag Results photo_id, pixel_data photo_id, results
  • 23. Hadoop classification performance ▪ 10,000 Mappers ▪ Processed in about 1 week entire corpus
  • 24. Main challenge: packaging our code ▪ Heterogenous Python & C++ codebase ▪ How do we get all of the shared library dependencies into a heterogenous system distributed across many machines?
  • 25. Main challenge: packaging our code ▪ tried: pyinstaller, didn’t work ▪ rolled our own solution using strace ▪ /autotags_package ▪ /lib ▪ /lib/python2.7/site-packages ▪ /bin ▪ /app
  • 26. Stream processing (our first attempt) www Redis PHP WorkersPHP WorkersPHP Workers fork autotags a.jpg search db
  • 27. Why we started looking into Storm
  • 28. Why we started looking into Storm Feature Computation Classification 300 ms 10 ms Classifier Load 300 ms Classifier loading accounted for 49% of total processing time
  • 29. Solutions we evaluated at the time ▪ Mem-mapping model ▪ Batching up operations ▪ Daemonize our autotagging code
  • 31.
  • 32. Storm computational framework ▪ Spouts ▪ The source of your data stream. Spouts “emit” tuples ▪ Tuples ▪ An atomic unit of work ▪ Bolts ▪ Small programs that processes your tuple stream ▪ Topology ▪ A graph of bolts and spouts
  • 33. Defining a topology TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("images", new ImageSpout(), 10); builder.setBolt("autotag", new AutoTagBolt(), 3) .shuffleGrouping("images") builder.setBolt("dbwriter", new DBWriterBolt(), 2) .shuffleGrouping("autotag")
  • 34. Defining a bolt class Autotag(storm.BasicBolt): def __init__(self): self.classifier.load() def process(self, tuple): <process tuple here>
  • 35. Our final solution www Redis Autotags Spout Autotags Bolt Search Engine Indexer Bolt DB Writer Bolt Topology
  • 36. Storm architecture Supervisors (10 nodes) Nimbus (1 node) Zookeepers (3 nodes) Spouts and bolts live here
  • 37. Fault tolerant against node failures Nimbus server dies ▪ no new jobs can be submitted ▪ current jobs stay running ▪ simply restart Supervisor node dies ▪ All unprocessed jobs for that node get reassigned ▪ Supervisor restarted Zookeeper node dies ▪ Restart node, no data loss as long as not too many die at once
  • 38. storm commands storm jar topology-jar-path class
  • 40. storm commands storm kill topology-name
  • 41. storm commands storm deactivate topology-jar-path class
  • 43. Is Storm just another queuing system?
  • 44. Yes, and more ▪ A framework for separating application design from scaling concerns ▪ Has a great deployment story ▪ atomic start, kill, deactivate for free ▪ no more rsync ▪ No single point of failure ▪ Wide industry adoption ▪ Is programming language agnostic
  • 46. Thank you Presentation copyright © 2014 Yahoo, Inc.