Full Stack Deep Learning
Josh Tobin, Sergey Karayev, Pieter Abbeel
Data Management
Full Stack Deep Learning
[Figure: ML infrastructure and tooling landscape. Panels: Data; Development, Training/Evaluation; Deployment; "All-in-one". Labels: Labeling, Storage, Database, Versioning, Processing, Software Engineering, Frameworks, Resource Management, Distributed Training, Experiment Management, Hyperparameter Tuning, CI / Testing, Web, Hardware / Mobile, Interchange, Monitoring]
Data Management - overview
Full Stack Deep Learning 3
https://veekaybee.github.io/2019/02/13/data-science-is-different/
Data Management - overview
Full Stack Deep Learning
In this module
4
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
• Most DL applications require lots of labeled data

• Exceptions: RL, GANs, "semi-supervised" learning

• Publicly available datasets = No competitive advantage

• But can serve as starting point
6
Sources
Data Management - sources
Full Stack Deep Learning
Usually: spend $$$ and time to label own data
7
https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg
Data Management - sources
Full Stack Deep Learning 8
Data Management - sources
Data flywheel
Enables rapid improvement with user labels
Full Stack Deep Learning
9
Data Management - sources
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html
Semi-supervised learning
Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source: Doersch et al., 2015)
Fig. 1. A great summary of how self-supervised learning tasks can be constructed (Image source: LeCun’s talk)
Use parts of data to label other parts
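A minimal sketch of "use parts of data to label other parts", using text instead of image patches: each word of an unlabeled sentence is hidden in turn and becomes the prediction target. The function name and the "[MASK]" token are illustrative assumptions, not something taken from the slides.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (input, target) training pairs.

    Each pair hides exactly one word; the hidden word is the label,
    so no human annotation is needed.
    """
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples


sentence = "the cat sat on the mat"
examples = make_masked_examples(sentence)
for masked_input, target in examples[:2]:
    print(masked_input, "->", target)
```

The same idea applies to images (predict the relative position of two patches, as in the figure) or audio; only the way the data is split into input and target changes.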
Full Stack Deep Learning
Image data augmentation
• Must do for training vision models

• Keras, fast.ai provide functions that do this

• Hook into your Dataset class so augmentation runs in CPU
workers, in parallel with GPU training
10
https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c
Data Management - sources
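A rough sketch of hooking augmentation into a Dataset class, assuming a map-style dataset whose __getitem__ is called by the framework's data-loading workers (for example a PyTorch DataLoader with worker processes). AugmentedDataset and random_flip_and_crop are hypothetical names, and images are plain nested lists so the example has no dependencies; per-worker random seeding is left out.

```python
import random


def random_flip_and_crop(image, crop, rng):
    """image is a list of rows (H x W). Returns a crop x crop patch, maybe mirrored."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch


class AugmentedDataset:
    """Augmentation runs inside __getitem__, so whichever CPU worker
    fetches the example also pays for augmenting it."""

    def __init__(self, images, labels, crop, seed=0):
        self.images, self.labels, self.crop = images, labels, crop
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        patch = random_flip_and_crop(self.images[index], self.crop, self.rng)
        return patch, self.labels[index]


# One 4x4 "image" with distinct pixel values, labeled "cat".
images = [[[r * 4 + c for c in range(4)] for r in range(4)]]
ds = AugmentedDataset(images, ["cat"], crop=2)
patch, label = ds[0]
```

Because the label is returned unchanged, this only suits augmentations that preserve the label; boxes or masks would need the same transform applied to them.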
Full Stack Deep Learning
Other data augmentation
• Tabular

• Delete some cells to simulate missing
data

• Text

• No well-established techniques, but you can
replace words with synonyms or change the
order of things.

• Speech/video

• Change speed, insert pauses, etc.
11
https://github.com/makcedward/nlpaug
Data Management - sources
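The tabular and text ideas above can be sketched in a few lines. This is an illustration only: drop_cells and replace_synonyms are made-up helper names, and the synonym table is a toy, hand-written dictionary rather than a real lexical resource.

```python
import random


def drop_cells(rows, drop_prob, rng):
    """Tabular: replace each cell with None with probability drop_prob."""
    return [
        [None if rng.random() < drop_prob else cell for cell in row]
        for row in rows
    ]


def replace_synonyms(text, synonyms, replace_prob, rng):
    """Text: swap a word for one of its listed synonyms with probability replace_prob."""
    out = []
    for word in text.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < replace_prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


rng = random.Random(0)

table = [[1.0, "red", 3], [2.5, "blue", 7]]
noisy_table = drop_cells(table, drop_prob=0.3, rng=rng)

SYNONYMS = {"quick": ["fast", "rapid"], "big": ["large"]}  # toy assumption
augmented = replace_synonyms("the quick fox saw a big dog", SYNONYMS, 1.0, rng)
```

Libraries such as nlpaug (linked on the slide) package more careful versions of the text transforms.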
Full Stack Deep Learning 12
https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/
Data Management - sources
Synthetic data
Underrated idea that is almost
always worth starting with
Full Stack Deep Learning 13
This can get pretty deep!
Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests
Data Management - sources
Full Stack Deep Learning 14
https://microsoft.github.io/AirSim/
Data Management - sources
https://openai.com/blog/ingredients-for-robotics-research/
Especially for driving and robotics
Full Stack Deep Learning
Questions?
15
Full Stack Deep Learning
Data Labeling
1. User Interfaces

2. Sources of labor

3. Service companies
17
Data Management - labeling
Full Stack Deep Learning 18
Standard set of features:

- bounding boxes,
segmentations,
keypoints, cuboids

- set of applicable
classes
Data Management - labeling
Full Stack Deep Learning 19
Training the annotators is crucial
Quality assurance is key
Data Management - labeling
Full Stack Deep Learning 20
• Hire own annotators, promote best ones to quality control

• Pros: secure, fast (once hired), less QC needed

• Cons: expensive, slow to scale, admin overhead

• ...or, crowdsource (Mechanical Turk)
• Pros: cheaper, more scalable

• Cons: not secure, significant QC effort required

• ...or, full-service data labeling companies
Sources of Labor
Data Management - labeling
Full Stack Deep Learning
Service Companies
21
• Data labeling requires separate software stack, temporary labor, and
quality assurance. Makes sense to outsource.
• Dedicate several days to selecting the best one for you:

• Label gold standard data yourself

• Sales calls with several contenders, ask for work sample on same data

• Ensure agreement with your gold standard, and evaluate on value
Data Management - labeling
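One way to check a vendor's work sample against your gold standard is a plain agreement rate plus a count of which labels get confused. The sketch below assumes labels are stored as dictionaries keyed by item id; agreement_report and the sample data are hypothetical.

```python
from collections import Counter


def agreement_report(gold, vendor):
    """Compare vendor labels to gold labels on the item ids both have.

    Returns (agreement_rate, Counter of (gold_label, vendor_label) disagreements).
    """
    shared = sorted(set(gold) & set(vendor))
    if not shared:
        raise ValueError("no overlapping items to compare")
    disagreements = Counter(
        (gold[item], vendor[item]) for item in shared if gold[item] != vendor[item]
    )
    agreed = len(shared) - sum(disagreements.values())
    return agreed / len(shared), disagreements


gold = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "bird"}
vendor_a = {"img1": "cat", "img2": "dog", "img3": "dog", "img4": "bird"}

rate, confusions = agreement_report(gold, vendor_a)
print(rate, dict(confusions))
```

Running the same report for each contender on the same sample makes the comparison between vendors direct; price then enters as a separate consideration.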
Full Stack Deep Learning 22
FigureEight is the original AI data labeling company
Data Management - labeling
Full Stack Deep Learning 23
Scale.ai is a dominant up-and-comer
Data Management - labeling
Full Stack Deep Learning 24
And there are a ton of others
Data Management - labeling
Full Stack Deep Learning 25
And there are a ton of others
Data Management - labeling
Full Stack Deep Learning
Software
26
• Full-service data labeling is always pricy

• But some companies offer their software without labor
Data Management - labeling
Full Stack Deep Learning 27
Prodigy
Data Management - labeling
Full Stack Deep Learning 28
Prodigy
Data Management - labeling
Full Stack Deep Learning 29
• Conclusions

• outsource to full-service company if you can afford it

• if not, then at least use existing software

• hiring part-time makes more sense than trying to make crowdsourcing
work
Data Management - labeling
Full Stack Deep Learning
Questions?
30
Full Stack Deep Learning
Data Storage
1. Building blocks

- Filesystem

- Object Storage

- Database

- "Data Lake"

2. What goes where

3. Where to learn more
32
Data Management - storage
Full Stack Deep Learning
Filesystem
• Foundational layer of storage.

• Fundamental unit is a "file", which can be text or binary, is not versioned,
and is easily overwritten.

• Can be as simple as a locally mounted disk containing all the files you need.

• Can be networked (e.g. NFS): accessible over network by multiple machines.

• Can be distributed (e.g. HDFS): stored and accessed over multiple
machines.

• Be mindful of access patterns! Can be fast, but not parallel.
33
Data Management - storage
Full Stack Deep Learning
Object Storage
• An API over the filesystem. GET, PUT, DELETE files to a service, without
worrying where they are actually stored.

• Fundamental unit is an "object". Usually binary: image, sound file, etc.

• Versioning, redundancy can be built into the service.

• Can be parallel, but not fast.

• Amazon S3 is the canonical example.

• For on-prem setup, Ceph is a good solution.
34
Data Management - storage
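To make "an API over the filesystem" concrete, here is a toy object store backed by a local directory. It is a sketch under stated assumptions, not how S3 or Ceph are implemented: ToyObjectStore is a made-up class, and it has no versioning, redundancy, or network layer.

```python
import hashlib
import tempfile
from pathlib import Path


class ToyObjectStore:
    """PUT/GET/DELETE by key; callers never see where the bytes live."""

    def __init__(self, root):
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # Hash the key so the on-disk layout stays an internal detail.
        return self._root / hashlib.sha256(key.encode("utf-8")).hexdigest()

    def put(self, key, data):
        self._path(key).write_bytes(data)

    def get(self, key):
        return self._path(key).read_bytes()

    def delete(self, key):
        self._path(key).unlink()


store = ToyObjectStore(tempfile.mkdtemp())
store.put("images/cat.jpg", b"\xff\xd8fake-jpeg-bytes")
payload = store.get("images/cat.jpg")
```

The key looks like a path, but it is only a name: the service is free to place, replicate, or version the bytes however it likes.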
Full Stack Deep Learning
Database
• Persistent, fast, scalable storage and retrieval of structured data.

• Mental model: everything is actually in RAM, but software ensures that
everything is logged to disk and never lost.

• Fundamental unit is a "row": unique id, references to other rows, values in
columns.

• Not for binary data! Store references instead.

• Usually for data that will be accessed again (not logs).

• Postgres is the right choice >90% of the time. Best-in-class SQL, great support
for unstructured JSON, actively developed.
35
Data Management - storage
Full Stack Deep Learning
SQL
• SQL is the right interface for
structured data.

• If you avoid using it, you will
eventually reinvent a slow
and crappy subset of it.
36
Data Management - storage
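A small illustration of "store references instead" and of SQL as the interface: rows hold an object key pointing at the binary, and labels live in their own table. SQLite from the Python standard library stands in for Postgres here so the sketch runs anywhere; the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE photos (
        id INTEGER PRIMARY KEY,
        object_key TEXT NOT NULL,  -- reference to the binary in object storage
        title TEXT
    );
    CREATE TABLE labels (
        photo_id INTEGER NOT NULL REFERENCES photos(id),
        label TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO photos (id, object_key, title) VALUES (?, ?, ?)",
    [(1, "images/0001.jpg", "Sunset"), (2, "images/0002.jpg", "Harbor")],
)
conn.executemany(
    "INSERT INTO labels (photo_id, label) VALUES (?, ?)",
    [(1, "landscape"), (2, "landscape"), (2, "boats")],
)

# One declarative query replaces hand-written loops over files.
rows = conn.execute(
    """
    SELECT p.object_key, COUNT(l.label) AS n_labels
    FROM photos AS p
    LEFT JOIN labels AS l ON l.photo_id = p.id
    GROUP BY p.id
    ORDER BY p.id
    """
).fetchall()
print(rows)
```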
Full Stack Deep Learning
"Data Lake"
• Unstructured aggregation of data from multiple sources, e.g. databases,
logs, expensive data transformations.

• "Schema on read": dump everything in, then transform for specific
needs later.

37
https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02
Data Management - storage
Full Stack Deep Learning
What goes where
• Binary data (images, sound files, compressed texts) is stored as
objects.

• Metadata (labels, user activity) is stored in database.

• If you need features that are not obtainable from the database (e.g. from logs), set up
a data lake and a process to aggregate the needed data.

• At training time, copy the data that is needed onto filesystem (local or
networked).
38
Data Management - storage
Full Stack Deep Learning
Where to learn more
39
https://dataintensive.net
Data Management - storage
Full Stack Deep Learning
Questions?
40
Full Stack Deep Learning
Data Versioning
Level 0: unversioned

Level 1: versioned via snapshot at training time

Level 2: versioned as a mix of assets and code

Level 3: specialized data versioning solution
42
Data Management - versioning
Full Stack Deep Learning
Level 0
• Data lives on filesystem/S3 and database

• Problem: Deployments must be versioned. Deployed machine learning
models are part code, part data. If data is not versioned, deployed
models are not versioned.

• Problem you will face: inability to get back to a previous level of
performance
43
Data Management - versioning
Full Stack Deep Learning
Level 1
• Data is versioned by storing a snapshot of everything at training time

• This allows you to version deployed models, and to get back to past
performance, but is super hacky.

• Would be far better to be able to version data just as easily as code.
44
Data Management - versioning
Full Stack Deep Learning
Level 2
• Data is versioned as a mix of assets and code.

• Heavy files are stored in S3, with unique ids. Training data is stored as JSON or
similar, referring to these ids and including relevant metadata (labels, user activity,
etc).

• JSON files can get big, but using git-lfs lets us store them just as easily as code

• Can improve further with "lazydata": only syncing files that are needed.

• The git signature + of the raw data file defines the version of the dataset

• Often helpful to add timestamp
45
Data Management - versioning
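A sketch of the Level 2 idea: heavy files stay in object storage, a small JSON manifest refers to them by id, and the manifest's identity defines the dataset version. The slides rely on git (with git-lfs) for that identity; the SHA-256 content hash below is a stand-in chosen so the example is self-contained, and the manifest fields and ids are made up.

```python
import hashlib
import json


def dataset_version(manifest):
    """Hash a canonical JSON encoding so equal manifests get equal versions."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


manifest_v1 = {
    "created": "2020-01-01T00:00:00Z",  # made-up timestamp
    "examples": [
        {"object_id": "a3f1", "label": "cat"},
        {"object_id": "9c2e", "label": "dog"},
    ],
}

# Deep-copy via JSON, then relabel one example: the version must change.
manifest_v2 = json.loads(json.dumps(manifest_v1))
manifest_v2["examples"][1]["label"] = "wolf"

v1 = dataset_version(manifest_v1)
v2 = dataset_version(manifest_v2)
print(v1, v2)
```

Recording this version string next to each trained model is what lets a deployed model be traced back to the exact data it saw.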
Full Stack Deep Learning
Level 3
• Specialized solutions for versioning data.

• Avoid these until you can fully explain how they will improve your project.

• Leading solutions are DVC, Pachyderm, Quilt.
46
Data Management - versioning
Full Stack Deep Learning
DVC
47
Data Management - versioning
Full Stack Deep Learning
Pachyderm
48
Data Management - versioning
Full Stack Deep Learning
Dolt
49
A nice simple solution for
versioning databases,
that speaks SQL.
Data Management - versioning
Full Stack Deep Learning
Dolt
50
Data Management - versioning
Full Stack Deep Learning
Questions?
51
Full Stack Deep Learning
Motivational Example
• We have to train a photo popularity predictor every night.

• For each photo, training data must include:

• Metadata such as posting time, title, location (in database)

• Some features of the user, such as how many times they
logged in today (need to compute from logs)

• Outputs of photo classifiers: content, style (need to run classifiers)
53
Data Management - processing
Full Stack Deep Learning 54
Task Dependencies
• Some tasks can't be
started until other tasks
are finished.

• Finishing a task should
"kick off" the tasks that depend on it
Data Management - processing
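The dependency rule can be sketched with the standard library's graphlib (Python 3.9+): a task becomes ready only once everything it depends on is done. The task names follow the photo popularity example but are otherwise hypothetical, and run_task is a placeholder for real work.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# task -> set of tasks it depends on
DEPENDENCIES = {
    "load_photo_metadata": set(),
    "compute_user_features_from_logs": set(),
    "run_photo_classifiers": set(),
    "build_training_table": {
        "load_photo_metadata",
        "compute_user_features_from_logs",
        "run_photo_classifiers",
    },
    "train_popularity_model": {"build_training_table"},
}


def run_all(dependencies, run_task):
    """Run every task after its dependencies; return the execution order."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    finished = []
    while sorter.is_active():
        for task in sorter.get_ready():
            run_task(task)
            sorter.done(task)  # unblocks tasks that were waiting on this one
            finished.append(task)
    return finished


order = run_all(DEPENDENCIES, run_task=lambda name: None)
print(order)
```

The three independent tasks come out of get_ready() together, which is the point where a real workflow manager would run them in parallel.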
Full Stack Deep Learning
• Run everything from scratch. If things exist, no need to recompute.
55
https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418
Data Workflows
Data Management - processing
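The "if things exist, no need to recompute" rule is the core of a Makefile-style workflow. A minimal sketch, assuming each step writes one output file: build is a hypothetical helper that checks only for the file's existence, whereas make also compares timestamps.

```python
import tempfile
from pathlib import Path


def build(target, recipe):
    """Run recipe() only when target is missing. Returns True if it ran."""
    target = Path(target)
    if target.exists():
        return False  # output already there, skip the work
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(recipe())
    return True


workdir = Path(tempfile.mkdtemp())
calls = []


def expensive_features():
    calls.append("ran")  # track how often the recipe really executes
    return "user_id,logins_today\n1,3\n"


first = build(workdir / "features.csv", expensive_features)   # computes
second = build(workdir / "features.csv", expensive_features)  # reuses the file
```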
Full Stack Deep Learning
Makefile limitations
• What if re-computation needs to depend on content, not on date?

• What if the dependencies are not files, but disparate programs and
databases?

• What if the work needs to be spread over multiple machines?

• What if there are 100s of "Makefiles" all executing at the same time, with
shared dependencies?
56
Data Management - processing
Full Stack Deep Learning
Airflow
57
https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418
Data Management - processing
Full Stack Deep Learning
Distributing work
• The workflow manager has a queue for the tasks, and manages workers
that pull from it, restarting jobs if they fail.
58
http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/
Data Management - processing
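A toy version of the queue-and-workers pattern, using threads on one machine in place of a real cluster. This is not Airflow's implementation; run_with_workers, the retry limit, and the sample tasks are all assumptions made for the sketch.

```python
import queue
import threading


def run_with_workers(tasks, num_workers=2, max_attempts=3):
    """tasks maps name -> zero-argument callable. Returns (results, failed)."""
    pending = queue.Queue()
    for name in tasks:
        pending.put((name, 1))  # (task name, attempt number)
    results, failed = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work
                pending.task_done()
                return
            name, attempt = item
            try:
                value = tasks[name]()
                with lock:
                    results[name] = value
            except Exception as exc:
                if attempt < max_attempts:
                    pending.put((name, attempt + 1))  # requeue for a retry
                else:
                    with lock:
                        failed[name] = repr(exc)
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    pending.join()  # returns once every queued item, retries included, is done
    for _ in threads:
        pending.put(None)
    for thread in threads:
        thread.join()
    return results, failed


attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient failure")
    return "ok"


results, failed = run_with_workers(
    {"flaky": flaky, "steady": lambda: 42, "broken": lambda: 1 / 0}
)
```

A failed task is requeued before its queue slot is marked done, so the manager never concludes that all work is finished while a retry is still outstanding.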
Full Stack Deep Learning
Try to keep things simple
• Don't overengineer

• e.g. unix: powerful parallelism,
streaming, highly optimized
59
https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
[Figure: run times of 26 minutes, 70 seconds, and 18 seconds; two annotations read "in parallel"]
Data Management - processing
Full Stack Deep Learning
Questions?
60
Full Stack Deep Learning
Thank you!
61

 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Testing and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningTesting and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningSergey Karayev
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningSergey Karayev
 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningSergey Karayev
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningSergey Karayev
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningSergey Karayev
 



Data Management Overview

  • 1. Full Stack Deep Learning Josh Tobin, Sergey Karayev, Pieter Abbeel Data Management
  • 2. [Diagram. Labels: “All-in-one”, Deployment, Development, Training/Evaluation, Data, Experiment Management, Labeling, Interchange, Storage, Monitoring, Hardware / Mobile, Software Engineering, Frameworks, Database, Hyperparameter Tuning, Distributed Training, Web, Versioning, Processing, CI / Testing, Resource Management]
  • 3. [Image] (Source: https://veekaybee.github.io/2019/02/13/data-science-is-different/)
  • 4. In this module
      1. Sources: where training data comes from
      2. Labeling: how to label proprietary data at scale
      3. Storage: data (images, sound files, etc.) and metadata (labels, user activity) should be stored appropriately
      4. Versioning: the dataset is updated through user activity or additional labeling, which affects the model
      5. Processing: aggregating and converting raw data and metadata
  • 6. Sources • Most DL applications require lots of labeled data • Exceptions: RL, GANs, "semi-supervised" learning • Publicly available datasets = no competitive advantage • But can serve as a starting point
  • 7. Usually: spend $$$ and time to label own data (Image: https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg)
  • 8. Data flywheel: enables rapid improvement with user labels
  • 9. Semi-supervised learning: use parts of data to label other parts • Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source: Doersch et al., 2015) • Fig. 1. A great summary of how self-supervised learning tasks can be constructed. (Image source: LeCun’s talk) (Source: https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html)
  • 10. Image data augmentation • Must do for training vision models • Keras, fast.ai provide functions that do this • Hook into your Dataset class to do it in CPU workers in parallel with GPU training (Source: https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c)
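One way to read "hook into your Dataset class" is sketched below. This is a minimal illustration and not the course's code: the class name `AugmentedDataset`, its parameters, and the use of nested lists as images are all assumptions made to keep the example dependency-free. The point is that augmentation runs inside `__getitem__`, so any loader that fetches items in background CPU workers (for example, a PyTorch `DataLoader` with `num_workers > 0`) performs it in parallel with GPU training.

```python
import random


class AugmentedDataset:
    """Hypothetical map-style dataset that augments each image on access."""

    def __init__(self, images, labels, flip_prob=0.5, seed=None):
        self.images = images          # each image: list of rows of pixel values
        self.labels = labels
        self.flip_prob = flip_prob    # probability of a horizontal flip
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        # Augment at access time so the stored data stays untouched.
        if self.rng.random() < self.flip_prob:
            image = [row[::-1] for row in image]  # horizontal flip
        return image, self.labels[idx]
```

For example, `AugmentedDataset([[[1, 2], [3, 4]]], ["cat"], flip_prob=1.0)[0]` yields the flipped image `[[2, 1], [4, 3]]` with its label unchanged.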
  • 11. Other data augmentation • Tabular: delete some cells to simulate missing data • Text: no well-established techniques, but you can replace words with synonyms or change the order of things • Speech/video: change speed, insert pauses, etc. (Source: https://github.com/makcedward/nlpaug)
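The tabular and text ideas on this slide can be sketched in a few lines. The function names and the tiny synonym table are hypothetical, chosen only for illustration; a library such as nlpaug (linked on the slide) offers far richer versions.

```python
import random


def drop_cells(rows, drop_prob, rng):
    """Replace each cell with None with probability drop_prob (simulated missing data)."""
    return [[None if rng.random() < drop_prob else value for value in row]
            for row in rows]


def replace_synonyms(text, synonyms, rng):
    """Swap each word that has an entry in `synonyms` for a randomly chosen synonym."""
    words = text.split()
    return " ".join(rng.choice(synonyms[w]) if w in synonyms else w
                    for w in words)


rng = random.Random(0)
table = [[5.1, "red"], [4.7, "blue"]]
noisy_table = drop_cells(table, drop_prob=0.2, rng=rng)
sentence = replace_synonyms("a quick test", {"quick": ["fast"]}, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when debugging a training run.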
  • 12. Synthetic data: underrated idea that is almost always worth starting with (Source: https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/)
  • 13. This can get pretty deep! (Andrew Moffat, https://github.com/amoffat/metabrite-receipt-tests)
  • 14. Especially for driving and robotics (Sources: https://microsoft.github.io/AirSim/, https://openai.com/blog/ingredients-for-robotics-research/)
  • 15. Questions?
  • 17. Data Labeling: 1. User interfaces 2. Sources of labor 3. Service companies
  • 18. Standard set of features: bounding boxes, segmentations, keypoints, cuboids; set of applicable classes
  • 19. Training the annotators is crucial • Quality assurance is key
  • 20. Sources of Labor • Hire own annotators, promote best ones to quality control. Pros: secure, fast (once hired), less QC needed. Cons: expensive, slow to scale, admin overhead • ...or crowdsource (Mechanical Turk). Pros: cheaper, more scalable. Cons: not secure, significant QC effort required • ...or use full-service data labeling companies
  • 21. Service Companies • Data labeling requires a separate software stack, temporary labor, and quality assurance, so it makes sense to outsource • Dedicate several days to selecting the best one for you: label gold standard data yourself; hold sales calls with several contenders and ask for a work sample on the same data; ensure agreement with your gold standard, and evaluate on value
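Checking "agreement with your gold standard" can be as simple as the fraction of gold items a vendor labeled identically. The sketch below assumes classification-style labels keyed by item id and counts unlabeled items as disagreements; the function name and sample data are hypothetical, and tasks such as bounding boxes would need a different measure (for example, an overlap threshold).

```python
def agreement_rate(gold, candidate):
    """Fraction of gold-standard items that the candidate labeled identically.

    Items missing from `candidate` count as disagreements.
    Assumes labels are never None.
    """
    if not gold:
        raise ValueError("gold standard must not be empty")
    matches = sum(candidate.get(item_id) == label
                  for item_id, label in gold.items())
    return matches / len(gold)


gold = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "dog"}
vendor_sample = {"img1": "cat", "img2": "dog", "img3": "dog"}  # img4 not returned
rate = agreement_rate(gold, vendor_sample)  # 2 of 4 gold items match
```

Comparing this rate against each vendor's quoted price gives one rough way to "evaluate on value".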
  • 22. FigureEight is the original AI data labeling company
  • 23. Scale.ai is a dominant up-and-comer
  • 24-25. And there are a ton of others
  • 26. Software • Full-service data labeling is always pricey • But some companies offer their software without labor
  • 27-28. Prodigy
  • 29. Conclusions • Outsource to a full-service company if you can afford it • If not, then at least use existing software • Hiring part-time makes more sense than trying to make crowdsourcing work
  • 30. Questions?
  • 32. Data Storage: 1. Building blocks (filesystem, object storage, database, "data lake") 2. What goes where 3. Where to learn more
  • 33. Filesystem • Foundational layer of storage • Fundamental unit is a "file", which can be text or binary, is not versioned, and is easily overwritten • Can be as simple as a locally mounted disk containing all the files you need • Can be networked (e.g. NFS): accessible over the network by multiple machines • Can be distributed (e.g. HDFS): stored and accessed over multiple machines • Be mindful of access patterns! Can be fast, but not parallel
  • 34. Object Storage • An API over the filesystem: GET, PUT, DELETE files to a service, without worrying where they are actually stored • Fundamental unit is an "object", usually binary: image, sound file, etc. • Versioning and redundancy can be built into the service • Can be parallel, but not fast • Amazon S3 is the canonical example • For an on-prem setup, Ceph is a good solution
  • 35. Database • Persistent, fast, scalable storage and retrieval of structured data • Mental model: everything is actually in RAM, but software ensures that everything is logged to disk and never lost • Fundamental unit is a "row": unique id, references to other rows, values in columns • Not for binary data! Store references instead • Usually for data that will be accessed again (not logs) • Postgres is the right choice >90% of the time: best-in-class SQL, great support for unstructured JSON, actively developed
  • 36. SQL • SQL is the right interface for structured data • If you avoid using it, you will eventually reinvent a slow and crappy subset of it
  • 37. "Data Lake" • Unstructured aggregation of data from multiple sources, e.g. databases, logs, expensive data transformations • "Schema on read": dump everything in, then transform for specific needs later (Source: https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02)
  • 38. What goes where • Binary data (images, sound files, compressed texts) is stored as objects • Metadata (labels, user activity) is stored in the database • If you need features that are not obtainable from the database (e.g. from logs), set up a data lake and a process to aggregate the needed data • At training time, copy the data that is needed onto a filesystem (local or networked)
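The split between objects and metadata might look like the sketch below: the database row holds a label plus a reference (an object key), never the image bytes. Everything here is an assumption for illustration: the stdlib `sqlite3` module stands in for Postgres so the example runs anywhere, and the table, column, and key names are hypothetical.

```python
import sqlite3


def make_metadata_db(rows):
    """Create an in-memory metadata table; `rows` is a list of (object_key, label)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photos ("
        "  id INTEGER PRIMARY KEY,"
        "  object_key TEXT NOT NULL UNIQUE,"  # reference into object storage
        "  label TEXT"
        ")"
    )
    conn.executemany(
        "INSERT INTO photos (object_key, label) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


def keys_for_label(conn, label):
    """Return object keys to copy onto the training filesystem for one label."""
    cursor = conn.execute(
        "SELECT object_key FROM photos WHERE label = ? ORDER BY id", (label,)
    )
    return [row[0] for row in cursor.fetchall()]


conn = make_metadata_db([
    ("images/0001.jpg", "cat"),
    ("images/0002.jpg", "dog"),
    ("images/0003.jpg", "cat"),
])
cat_keys = keys_for_label(conn, "cat")
```

At training time, the returned keys would be fetched from the object store and copied to local or networked disk.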
  • 39. Where to learn more: https://dataintensive.net
  • 40. Questions?
  • 42. Data Versioning • Level 0: unversioned • Level 1: versioned via snapshot at training time • Level 2: versioned as a mix of assets and code • Level 3: specialized data versioning solution
  • 43. Level 0 • Data lives on filesystem/S3 and database • Problem: deployments must be versioned. Deployed machine learning models are part code, part data. If data is not versioned, deployed models are not versioned • Problem you will face: inability to get back to a previous level of performance
  • 44. Level 1 • Data is versioned by storing a snapshot of everything at training time • This allows you to version deployed models, and to get back to past performance, but is super hacky • Would be far better to be able to version data just as easily as code
  • 45. Level 2 • Data is versioned as a mix of assets and code • Heavy files are stored in S3, with unique ids. Training data is stored as JSON or similar, referring to these ids and including relevant metadata (labels, user activity, etc.) • JSON files can get big, but using git-lfs lets us store them just as easily as code • Can improve further with "lazydata": only syncing files that are needed • The git signature + of the raw data file defines the version of the dataset • Often helpful to add a timestamp
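A Level 2 setup can be pictured as a small JSON manifest that refers to heavy files by id and carries their metadata. The sketch below is an assumption-laden illustration: it fingerprints the manifest with a SHA-256 hash as a stand-in for the git-based signature the slide mentions, and the field names (`id`, `object_key`, `label`) are hypothetical.

```python
import hashlib
import json


def manifest_version(records):
    """Return a content-derived version string for a dataset manifest.

    Records are sorted by id and serialized canonically, so the same
    data always yields the same version regardless of input order.
    """
    canonical = json.dumps(
        sorted(records, key=lambda r: r["id"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


records = [
    {"id": "a1", "object_key": "images/a1.jpg", "label": "cat"},
    {"id": "b2", "object_key": "images/b2.jpg", "label": "dog"},
]
version = manifest_version(records)

# Relabeling one item produces a different dataset version.
relabeled = [dict(records[0], label="dog"), records[1]]
new_version = manifest_version(relabeled)
```

In a git-based workflow the manifest file itself would be committed (via git-lfs if large), and the commit would play the role this hash plays here.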
  • 46. Level 3 • Specialized solutions for versioning data • Avoid these until you can fully explain how they will improve your project • Leading solutions are DVC, Pachyderm, Quill
  • 47. DVC [figure with numbered callouts 1 to 4]
  • 48. Pachyderm
  • 49-50. Dolt: a nice simple solution for versioning databases that speaks SQL
  • 51. Questions?
  • 53. Motivational Example • We have to train a photo popularity predictor every night • For each photo, training data must include: metadata such as posting time, title, location (in database); some features of the user, such as how many times they logged in today (need to compute from logs); outputs of photo classifiers for content and style (need to run classifiers)
  • 54. Task Dependencies • Some tasks can't be started until other tasks are finished • Finishing a task should "kick off" the tasks that depend on it
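The dependency idea can be sketched with Python's standard `graphlib` module (Python 3.9+). The task names below are hypothetical and loosely follow the photo popularity example; the loop groups tasks into "waves" in which each finished wave unlocks the tasks that were waiting on it.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
deps = {
    "export_metadata": set(),
    "compute_user_features": set(),
    "run_photo_classifiers": set(),
    "train_popularity_model": {
        "export_metadata",
        "compute_user_features",
        "run_photo_classifiers",
    },
}


def run_in_waves(dependencies):
    """Return lists of tasks that could run together, in dependency order."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)                 # finishing them unlocks dependents
    return waves


waves = run_in_waves(deps)
```

Here the three data-preparation tasks form the first wave and could run in parallel, and training forms the second.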
  • 55. Data Workflows • Run everything from scratch. If things exist, no need to recompute. (Source: https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418)
  • 56. Makefile limitations • What if re-computation needs to depend on content, not on date? • What if the dependencies are not files, but disparate programs and databases? • What if the work needs to be spread over multiple machines? • What if there are 100s of "Makefiles" all executing at the same time, with shared dependencies?
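The first limitation, recomputing based on content instead of modification date, can be illustrated with a content hash. This is a minimal sketch with hypothetical names; a real pipeline would persist the hashes between runs instead of keeping them in a dictionary.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint the input by its bytes, ignoring timestamps entirely."""
    return hashlib.sha256(data).hexdigest()


def needs_recompute(name, data, seen_hashes):
    """Return True if `data` differs from what was last processed under `name`.

    Updates `seen_hashes` in place when the content has changed.
    """
    digest = content_hash(data)
    if seen_hashes.get(name) == digest:
        return False
    seen_hashes[name] = digest
    return True


seen = {}
first = needs_recompute("user_logs", b"login,alice", seen)         # new input
unchanged = needs_recompute("user_logs", b"login,alice", seen)     # same bytes
changed = needs_recompute("user_logs", b"login,alice\nlogin,bob", seen)
```

A file that is rewritten with identical bytes gets a new modification time but the same hash, so a content-based check skips the redundant work.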
  • 57. Airflow (Source: https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418)
  • 58. Distributing work • The workflow manager has a queue for the tasks, and manages workers that pull from it, restarting jobs if they fail (Source: http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/)
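The queue-and-retry behavior can be simulated in a single process. This sketch is not how Airflow is implemented; the function name, the `max_attempts` parameter, and the sequential loop standing in for a pool of workers are all assumptions made for illustration.

```python
from collections import deque


def run_queue(tasks, max_attempts=3):
    """Run (name, callable) tasks from a queue, re-queuing failures.

    Returns (results, failed): results maps task name to return value,
    failed lists tasks that still raised after max_attempts tries.
    """
    queue = deque((name, fn, 1) for name, fn in tasks)
    results, failed = {}, []
    while queue:
        name, fn, attempt = queue.popleft()   # a "worker" pulls the next task
        try:
            results[name] = fn()
        except Exception:
            if attempt < max_attempts:
                queue.append((name, fn, attempt + 1))  # restart the job later
            else:
                failed.append(name)
    return results, failed


calls = {"n": 0}


def flaky():
    """Fails on the first call only, like a transient network error."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "ok"


def always_fails():
    raise RuntimeError("permanent failure")


results, failed = run_queue(
    [("flaky", flaky), ("steady", lambda: 1), ("bad", always_fails)]
)
```

In a real deployment the queue lives in a separate service and workers run on multiple machines, but the contract is the same: pull, run, and re-queue on failure.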
  • 59. Try to keep things simple • Don't overengineer • e.g. Unix: powerful parallelism, streaming, highly optimized [Figure labels: 26 minutes; 70 seconds; 18 seconds; "in parallel" (twice)] (Source: https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)
  • 60. Questions?
  • 61. Thank you!