WTF is Data Science?
Dylan Gregersen
OpenWest 2018
My name is Dylan Gregersen
I like these things... You can find me at…
dylangregersen
I am the lead data
scientist at...
Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
Data Science is the process of
collecting, cleaning, analyzing,
visualizing, and communicating
data in order to solve problems
in the real world.
Data science is...
What people think data science is...
People often think data science
is all about mathematics,
algorithms, and something called
“machine learning”
What most data science is...
Data science actually consists
mostly of data collection,
cleaning, and organization
(often 80% of the work)
What people forget that data science is
People tend to forget the skills
needed in data science to
communicate results so someone
can take an action in the real
world
Data science is a process
When doing data science we...
1. Conceptual Data Model: Collect
data and create a conceptual data
model of real world phenomena
2. Understand the data: We use that
data model to understand something
about the phenomena
3. Solve a Problem: We apply that
understanding to solve a problem
4. Take action: Ultimately, we succeed
when our solution leads to actions
Data science is successful when you learn
something about the real world which
helps you solve a problem by taking an
action.
Example: What is my conference room utilization?
Identifying the problem
U: What is my conference room utilization?
Me: What problem are you trying to solve?
U: I want to know which rooms are underutilized
Me: Why do you want to know?
U: To improve the efficiency of conference room use
Me: What are you going to do with that information?
A: Repurpose rooms whose meeting usage is less than 50%
Problem: Conference rooms should be used efficiently
Action: repurpose rooms with usage less than 50%; also investigate heavily used areas
Metric: room utilization = hours in use / available hours per day
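The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function name `room_utilization`, the `days` parameter, and the 8-hour bookable day are assumptions, and overlapping meetings are not de-duplicated.

```python
from datetime import timedelta


def room_utilization(event_durations, available_hours_per_day, days=1):
    """Return hours in use divided by available hours.

    event_durations: iterable of timedelta, one per meeting held in the room.
    available_hours_per_day and days are assumptions about how long the
    room can be booked; they are not taken from the slides.
    """
    available_hours = available_hours_per_day * days
    if available_hours <= 0:
        raise ValueError("available hours must be positive")
    hours_in_use = sum(d.total_seconds() for d in event_durations) / 3600.0
    return hours_in_use / available_hours


# Hypothetical day: three meetings in a room that is bookable 8 hours a day.
meetings = [timedelta(hours=1), timedelta(minutes=90), timedelta(minutes=30)]
utilization = room_utilization(meetings, available_hours_per_day=8)
print(f"{utilization:.1%}")
```

A room scoring below 0.5 here would fall under the 50% rule in the action statement.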
What problem are you
trying to solve?
What action will you take
with this number?
Identifying the problem
U: What is my conference room utilization?
Me: What problem are you trying to solve?
U: I want to know which departments are using the rooms the most.
Me: Why do you want to know?
U: To adjust the rooms to meet their needs
Me: What are you going to do with that information?
A: Buy new technology or furniture to better meet those needs
Problem: Change meeting rooms to fit the needs of each department
Action: make purchasing decisions about technology or furniture
Metrics: room utilization, organizer’s department, occupancy size,
technology or furniture used
Solving the Problem
Start by figuring out a plan.
Document requirements and
get feedback from your end
user
Problem: Conference rooms should be
used efficiently
Action: repurpose rooms with usage
less than 50%; also investigate heavily used areas
Metric: room utilization = hours in use
/ available hours per day
Solving the Problem
Having a plan...
● Helps you stay focused
● Helps you communicate with your
end users
● Helps you build in the things you’ll need in
production: data quality, alerts,
testing, security, code reviews
Solving the Problem
Now with a plan
Collect data and create a
conceptual data model of real
world phenomena
Small project: you might use Python and
store the data in a folder called “raw_data”
Large project: you might use Python + Kafka
and store the data in AWS S3
{
….
"id": "6iunsmr8qv1k1c5avlek045oup",
"iCalUID": "6iunsmr8qv1k1c5avlek045oup@google.com",
"summary": "OpenWest: WTF is data science?",
"status": "confirmed",
"start": {
"dateTime": "2018-06-08T11:30:00-06:00"
},
"end": {
"dateTime": "2018-06-08T12:30:00-06:00"
},
….
}
Metadata: room_id, customer_id, time_range
Google Event File
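For the small-project case, storing collected events untouched might look like the sketch below. The helper `save_raw_events` and the `room_<id>.json` file naming are assumptions made for illustration; only the event fields shown above are used, and the example writes to a temporary directory so it leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def save_raw_events(events, room_id, raw_dir="raw_data"):
    """Write one room's events to <raw_dir>/room_<room_id>.json, unmodified.

    The helper name and file naming are hypothetical, not from the talk.
    """
    out_dir = Path(raw_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"room_{room_id}.json"
    with out_path.open("w", encoding="utf-8") as fh:
        json.dump(events, fh, indent=2)
    return out_path


# The event from the slide, limited to the fields that are shown there.
event = {
    "id": "6iunsmr8qv1k1c5avlek045oup",
    "iCalUID": "6iunsmr8qv1k1c5avlek045oup@google.com",
    "summary": "OpenWest: WTF is data science?",
    "status": "confirmed",
    "start": {"dateTime": "2018-06-08T11:30:00-06:00"},
    "end": {"dateTime": "2018-06-08T12:30:00-06:00"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = save_raw_events([event], room_id=42, raw_dir=Path(tmp) / "raw_data")
    reloaded = json.loads(path.read_text(encoding="utf-8"))
```

Keeping the raw files unchanged means the cleaning step can be rerun whenever the processing logic changes.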
1. Conceptual Data Model
80% of data science work is
cleaning and structuring the
data.
Small project: you might use Python to
process “raw_data” into “processed_data”
Large project: you might use AWS Glue to
process the AWS S3 data and store it in AWS
Redshift
INSERT INTO customer (customer_id) VALUES (1212);
INSERT INTO room (room_id, customer_id) VALUES (42, 1212);
-- event_duration is in seconds
INSERT INTO event (event_id, event_start_utc, event_duration, event_status)
VALUES ('6iunsmr8qv1k1c5avlek045oup', '2018-06-08T17:30:00Z', 3600.0, 'confirmed');
INSERT INTO fact_room_event (room_id, event_id)
VALUES (42, '6iunsmr8qv1k1c5avlek045oup');
Structured Data - Star Schema
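The step from the Google event file to the event row in the star schema can be sketched as follows. The function name `event_to_row` is an assumption; the field names come from the slides. Events without a `dateTime` value would need separate handling that is not shown here.

```python
from datetime import datetime, timezone


def event_to_row(event):
    """Flatten one calendar event into the columns of the event table."""
    start = datetime.fromisoformat(event["start"]["dateTime"])
    end = datetime.fromisoformat(event["end"]["dateTime"])
    return {
        "event_id": event["id"],
        # Normalize the local offset (-06:00 in the example) to UTC.
        "event_start_utc": start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Duration in seconds, matching the 3600.0 shown in the schema.
        "event_duration": (end - start).total_seconds(),
        "event_status": event["status"],
    }


event = {
    "id": "6iunsmr8qv1k1c5avlek045oup",
    "status": "confirmed",
    "start": {"dateTime": "2018-06-08T11:30:00-06:00"},
    "end": {"dateTime": "2018-06-08T12:30:00-06:00"},
}
row = event_to_row(event)
print(row)
```

Storing the start time in UTC and the duration in seconds keeps later queries free of time zone arithmetic.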
We use that data model to
understand something about
the phenomena
2. Understand the Data
Explore, manipulate the data.
Question the data quality and
return to cleaning if necessary.
Small project: you might use Python to load
“processed_data” and make plots
Large project: you might use SQL to query
AWS Redshift and use Python to visualize
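As a stand-in for querying a warehouse, the sketch below loads the star schema into an in-memory SQLite database, runs one data quality check, and totals hours in use per room. SQLite replaces Redshift here only so the example runs anywhere; the second and third events are hypothetical rows added to make the checks visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE room (room_id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE event (
        event_id TEXT PRIMARY KEY,
        event_start_utc TEXT,
        event_duration REAL,   -- seconds
        event_status TEXT
    );
    CREATE TABLE fact_room_event (room_id INTEGER, event_id TEXT);
""")
conn.execute("INSERT INTO room VALUES (42, 1212)")
conn.executemany(
    "INSERT INTO event VALUES (?, ?, ?, ?)",
    [
        ("6iunsmr8qv1k1c5avlek045oup", "2018-06-08T17:30:00Z", 3600.0, "confirmed"),
        # Hypothetical rows, not from the slides.
        ("hypothetical-event-2", "2018-06-08T19:00:00Z", 1800.0, "confirmed"),
        ("hypothetical-event-3", "2018-06-08T20:00:00Z", -60.0, "cancelled"),
    ],
)
conn.executemany(
    "INSERT INTO fact_room_event VALUES (?, ?)",
    [
        (42, "6iunsmr8qv1k1c5avlek045oup"),
        (42, "hypothetical-event-2"),
        (42, "hypothetical-event-3"),
    ],
)

# Question the data quality: a non-positive duration points back to cleaning.
suspect = conn.execute(
    "SELECT event_id FROM event WHERE event_duration <= 0 ORDER BY event_id"
).fetchall()

# Explore: hours in use per room, counting only confirmed, valid events.
hours_by_room = conn.execute("""
    SELECT f.room_id, SUM(e.event_duration) / 3600.0 AS hours_in_use
    FROM fact_room_event AS f
    JOIN event AS e ON e.event_id = f.event_id
    WHERE e.event_status = 'confirmed' AND e.event_duration > 0
    GROUP BY f.room_id
""").fetchall()

print(suspect)
print(hours_by_room)
```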
3. Solve a Problem
We apply that understanding to
solve a problem
Problem: Conference rooms should be
used efficiently
Action: repurpose rooms with usage
less than 50%; also investigate heavily used areas
Metric: room utilization = hours in use
/ available hours per day
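Turning the metric into the stated action can be sketched as a simple ranking. The function name `rooms_to_review`, the `top_n` parameter, and all utilization values are hypothetical; the room names Camp Ivanhoe and Fire Swamp come from the talk's example, and the third room is invented for contrast.

```python
def rooms_to_review(utilization_by_room, low_threshold=0.5, top_n=1):
    """Return (underused rooms, busiest rooms) from a room -> utilization map.

    low_threshold=0.5 mirrors the 50% rule in the action statement.
    """
    by_usage = sorted(utilization_by_room.items(), key=lambda item: item[1], reverse=True)
    busiest = [room for room, _ in by_usage[:top_n]]
    # Least-used first, so the strongest repurposing candidates lead the list.
    underused = [room for room, value in reversed(by_usage) if value < low_threshold]
    return underused, busiest


# Hypothetical utilization values for illustration only.
utilization_by_room = {
    "Camp Ivanhoe": 0.12,
    "Fire Swamp": 0.93,
    "Room C (hypothetical)": 0.55,
}
underused, busiest = rooms_to_review(utilization_by_room)
print("Repurposing candidates:", underused)
print("Heavily used, worth a closer look:", busiest)
```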
3. Solve a Problem
Did we solve the problem?
What action are you going to
take?
4. Take Action
Ultimately, we succeed when
our solution leads to actions
Small project: you might periodically recreate the analysis
so the user can take new actions.
Large project: you might provide a tool for
the user to recreate the analysis on their own.
In our example, our Facilities Gal goes and
looks at the bottom three rooms and decides
that Camp Ivanhoe really isn’t needed.
She also checks Fire Swamp and asks
some people why it is used so much.
Data science is a process
1. Collect data and create a conceptual
data model of real world phenomena
2. We use that data model to
understand something about the
phenomena
3. We apply that understanding to
solve a problem
4. Ultimately, we succeed when our
solution leads to actions
Cool! So what about
machine learning and
predictive modeling?
The data science process
has a hierarchy of needs
Data Basics
The data science hierarchy of needs
describes the stages of data
complexity and insights
The Data Science Process
First point of value
Descriptive Analytics are your first
stage where you can actually answer
questions.
Especially important for business end
users who want the results of your
data.
Businesses spend 1-3
months to get this into
production the first time
They spend 1-3 years to
really get this right
1-2 years to do this well
1-2 years to integrate these
1+ years to grow modeling into
optimizations
In the real world,
data science is a team
activity
Data-Driven Companies Build Data Science Teams
Data Engineer
Data Architect
Data Analyst
Developer
Product Manager
QA
Statistician
Chief Data Officer
Senior Data Analyst
Data Steward
Data Engineer
Business Analyst
Myth of the data scientist
Data science requires many different
jobs and skills.
Being a “data scientist” is very much
like being a “full stack developer”.
The most data-driven companies are
creating data specific jobs: data
engineers, data architects, data
analysts, data researchers.
How do you get started?
Start with descriptive analytics
Best way to build your intuition about how the data
science process works. Become good at
identifying the root question, the problem to solve,
and the possible actions to be taken.
Open Data Sets:
www.kaggle.com/datasets
www.data.gov
www.github.com/awesomedata/awesome-public-datasets
www.google.com/search?q=open+data+sets
● The best tools are powerful.
● The best tools are easy to use and learn.
● The best tools support teamwork.
● The best tools are beloved by the community.
Excel is still a standard across the data world and is a
perfectly fine way to get started.
Data science is successful when
you learn something about the real
world which helps you solve a
problem by taking an action.
You set yourself up for success if you...
● Foster a determination to discover the
underlying problems to solve
● Learn to work with data
What is data science?
References and Resources
● Rachel Schutt & Cathy O’Neil (2013) Doing Data Science: Straight Talk From the
Frontline, Sebastopol, CA: O’Reilly
● DJ Patil & Hilary Mason (2015) Data Driven. Sebastopol, CA: O’Reilly
● DJ Patil (2011) Building Data Science Teams. Sebastopol, CA: O’Reilly
● Monica Rogati (2017) The AI Hierarchy of Needs
● Nick Crocker (2014) Thirty Things I’ve Learned
● Tavish Srivastava (2015) 13 Tips to make you awesome in Data Science / Analytics Jobs
● Daniel Tunkelang (2017) 10 Things Everyone Should Know About Machine Learning
● DJ Patil - Everything We Wish We'd Known About Building Data Products
Thank You!

 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 

Wtf is data science?

  • 1. WTF is Data Science? Dylan Gregersen OpenWest 2018
  • 2. My name is Dylan Gregersen I like these things... You can find me at… dylangregersen I am the lead data scientist at...
  • 3. Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline Data Science is the process of collecting, cleaning, analyzing, visualizing, and communicating data in order to solve problems in the real world. Data science is...
  • 4. What people think data science is... People often think data science is all about mathematics, algorithms, and something called “machine learning” Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
  • 5. What most data science is... Data science actually consists mostly of data collection, cleaning, and organization (often 80% of the work) Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
  • 6. What people forget that data science is People tend to forget the skills needed in data science to communicate results so someone can take an action in the real world Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
  • 7. Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline Data science is a process When doing data science we... 1. Conceptual Data Model: Collect data and create a conceptual data model of real world phenomena 2. Understand the data: We use that data model to understand something about the phenomena 3. Solve a Problem: We apply that understanding to solve a problem 4. Take action: Ultimately, we succeed when our solution leads to actions
  • 8. Data science is successful when you learn something about the real world which helps you solve a problem by taking an action.
  • 9. Data science is successful when you learn something about the real world which helps you solve a problem by taking an action. Example: What is my conference room utilization?
  • 10. Identifying the problem U: What is my conference room utilization?
  • 11. Identifying the problem U: What is my conference room utilization? Me: What problem are you trying to solve? U: I want to know which rooms are underutilized Me: Why do you want to know? U: To improve the efficiency of conference room use Me: What are you going to do with that information? A: Repurpose rooms whose meeting usage is less than 50%
  • 12. Problem: Conference rooms should be used efficiently Action: repurpose rooms with usage less than 50%, also heavily used areas Metric: room utilization = hours in use / available hours per day Identifying the problem U: What is my conference room utilization? Me: What problem are you trying to solve? U: I want to know which rooms are underutilized Me: Why do you want to know? U: To improve the efficiency of conference room use Me: What are you going to do with that information? A: Repurpose rooms whose meeting usage is less than 50%
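The metric on this slide (room utilization = hours in use / available hours per day) can be sketched in a few lines of Python. This is only an illustration: the function name `room_utilization`, the 8 available hours per day, and the three bookings are assumptions made for the example, and it assumes bookings do not overlap.

```python
from datetime import datetime


def room_utilization(events, available_hours_per_day, num_days):
    """Return hours in use divided by available hours over the period.

    `events` is a list of (start, end) datetime pairs for one room.
    Assumes the bookings do not overlap.
    """
    hours_in_use = sum(
        (end - start).total_seconds() / 3600.0 for start, end in events
    )
    return hours_in_use / (available_hours_per_day * num_days)


# Hypothetical bookings for one room on a single day.
events = [
    (datetime(2018, 6, 8, 9, 0), datetime(2018, 6, 8, 10, 0)),
    (datetime(2018, 6, 8, 11, 30), datetime(2018, 6, 8, 12, 30)),
    (datetime(2018, 6, 8, 14, 0), datetime(2018, 6, 8, 15, 0)),
]

# Assumed: the room is available 8 hours per day.
utilization = room_utilization(events, available_hours_per_day=8, num_days=1)

# The action threshold from the slide: usage below 50%.
underutilized = utilization < 0.5
```

With three one-hour bookings in an eight-hour day the room would fall under the 50% threshold, so it would be a candidate for repurposing.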
  • 13. What problem are you trying to solve? What action will you take with this number?
  • 14. Problem: Change meeting rooms to fit the needs of department Action: make purchasing decisions about technology or furniture Metrics: room utilization, organizer’s department, occupancy size, technology or furniture used U: What is my conference room utilization? Me: What problem are you trying to solve? U: I want to know which departments are using the rooms the most. Me: Why do you want to know? U: To adjust the rooms to meet their needs Me: What are you going to do with that information? A: Buy new technology or furniture to better meet those needs Identifying the problem
  • 15. Solving the Problem Start by figuring out a plan. 1. Conceptual Data Model: Collect data and create a conceptual data model of real world phenomena 2. Understand the data: We use that data model to understand something about the phenomena 3. Solve a Problem: We apply that understanding to solve a problem 4. Take action: Ultimately, we succeed when our solution leads to actions
  • 16. Solving the Problem Start by figuring out a plan. Document requirements and get feedback from your end user Problem: Conference rooms should be used efficiently Action: repurpose rooms with usage less than 50%, also heavily used areas Metric: room utilization = hours in use / available hours per day
  • 17. Solving the Problem Having a plan... ● Helps you stay focused ● Helps you communicate with your end users ● Build in things you’ll need in production: data quality, alerts, testing, security, code reviews
  • 18. Solving the Problem Now with a plan 1. Conceptual Data Model: Collect data and create a conceptual data model of real world phenomena 2. Understand the data: We use that data model to understand something about the phenomena 3. Solve a Problem: We apply that understanding to solve a problem 4. Take action: Ultimately, we succeed when our solution leads to actions
  • 19. Collect data and create a conceptual data model of real world phenomena. For a small project you might use Python and store the data in a folder called “raw_data”. For a large project you might use Python + Kafka and store the data in AWS S3. { …. "id": "6iunsmr8qv1k1c5avlek045oup", "iCalUID": "6iunsmr8qv1k1c5avlek045oup@google.com", "summary": "OpenWest: WTF is data science?", "status": "confirmed", "start": { "dateTime": "2018-06-08T11:30:00-06:00" }, "end": { "dateTime": "2018-06-08T12:30:00-06:00" }, …. } Metadata: room_id, customer_id, time_range Google Event File 1. Conceptual Data Model
  • 20. 80% of data science work is cleaning and structuring the data. For a small project you might use Python to process “raw_data” into “processed_data”. For a large project you might use AWS Glue to process AWS S3 data and store it in AWS Redshift. { …. "id": "6iunsmr8qv1k1c5avlek045oup", "iCalUID": "6iunsmr8qv1k1c5avlek045oup@google.com", "summary": "OpenWest: WTF is data science?", "status": "confirmed", "start": { "dateTime": "2018-06-08T11:30:00-06:00" }, "end": { "dateTime": "2018-06-08T12:30:00-06:00" }, …. } Metadata: room_id, customer_id, time_range Google Event File 1. Conceptual Data Model
  • 21. 80% of data science work is cleaning and structuring the data. 1. Conceptual Data Model { …. "id": "6iunsmr8qv1k1c5avlek045oup", "iCalUID": "6iunsmr8qv1k1c5avlek045oup@google.com", "summary": "OpenWest: WTF is data science?", "status": "confirmed", "start": { "dateTime": "2018-06-08T11:30:00-06:00" }, "end": { "dateTime": "2018-06-08T12:30:00-06:00" }, …. } Metadata: room_id, customer_id, time_range Google Event File INSERT INTO customer 1212 AS customer_id INSERT INTO room 42 AS room_id 1212 AS customer_id INSERT INTO event "6iunsmr8qv1k1c5avlek045oup" AS event_id "2018-06-08T17:30:00Z" AS event_start_utc 3600.0 AS event_duration "confirmed" AS event_status INSERT INTO fact_room_event room_id event_id Structured Data - Star Schema
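The step from the raw Google event to the structured rows on this slide could look roughly like the sketch below. It covers only the small-project case in plain Python. The function name `flatten_event` and the dictionary layout are assumptions for illustration; the field names and values (`event_start_utc`, `event_duration`, room 42, customer 1212) are taken from the slide.

```python
from datetime import datetime, timezone

# The fields of the raw event that the slide actually shows.
raw_event = {
    "id": "6iunsmr8qv1k1c5avlek045oup",
    "status": "confirmed",
    "start": {"dateTime": "2018-06-08T11:30:00-06:00"},
    "end": {"dateTime": "2018-06-08T12:30:00-06:00"},
}

# Metadata that arrives alongside the event file (values from the slide).
metadata = {"room_id": 42, "customer_id": 1212}


def flatten_event(event):
    """Turn a nested calendar event into one flat row for the event table."""
    start = datetime.fromisoformat(event["start"]["dateTime"])
    end = datetime.fromisoformat(event["end"]["dateTime"])
    start_utc = start.astimezone(timezone.utc)
    return {
        "event_id": event["id"],
        "event_start_utc": start_utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event_duration": (end - start).total_seconds(),  # seconds
        "event_status": event["status"],
    }


event_row = flatten_event(raw_event)

# The fact table only links a room to an event.
fact_row = {"room_id": metadata["room_id"], "event_id": event_row["event_id"]}
```

Storing the start time in UTC and the duration in seconds means later queries do not have to deal with time zone offsets.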
  • 22. We use that data model to understand something about the phenomena 2. Understand the Data
  • 23. Explore, manipulate the data. Question the data quality and return to cleaning if necessary. For a small project you might use Python to load “processed_data” and make plots. For a large project you might use SQL to query AWS Redshift and use Python to visualize. 2. Understand the Data INSERT INTO customer 1212 AS customer_id INSERT INTO room 42 AS room_id 1212 AS customer_id INSERT INTO event "6iunsmr8qv1k1c5avlek045oup" AS event_id "2018-06-08T17:30:00Z" AS event_start_utc 3600.0 AS event_duration "confirmed" AS event_status INSERT INTO fact_room_event room_id event_id Structured Data - Star Schema
  • 24. Explore, manipulate the data. Question the data quality and return to cleaning if necessary. 2. Understand the Data INSERT INTO customer 1212 AS customer_id INSERT INTO room 42 AS room_id 1212 AS customer_id INSERT INTO event "6iunsmr8qv1k1c5avlek045oup" AS event_id "2018-06-08T17:30:00Z" AS event_start_utc 3600.0 AS event_duration "confirmed" AS event_status INSERT INTO fact_room_event room_id event_id Structured Data - Star Schema
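"Question the data quality" can start with a few simple checks on the structured event rows. The sketch below is one possible starting point, not the talk's method: the rows, the function name `quality_issues`, and the two checks (non-positive durations, repeated event ids) are all hypothetical.

```python
# Hypothetical rows shaped like the event table on the slide.
events = [
    {"event_id": "a1", "event_duration": 3600.0, "event_status": "confirmed"},
    {"event_id": "b2", "event_duration": -1800.0, "event_status": "confirmed"},
    {"event_id": "c3", "event_duration": 1800.0, "event_status": "cancelled"},
    {"event_id": "a1", "event_duration": 3600.0, "event_status": "confirmed"},
]


def quality_issues(rows):
    """Return (event_id, problem) pairs for rows that look suspicious."""
    issues = []
    seen = set()
    for row in rows:
        if row["event_duration"] <= 0:
            issues.append((row["event_id"], "non-positive duration"))
        if row["event_id"] in seen:
            issues.append((row["event_id"], "duplicate event_id"))
        seen.add(row["event_id"])
    return issues


issues = quality_issues(events)
```

Any issue found here is a reason to go back to the cleaning step before computing metrics.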
  • 25. 3. Solve a Problem We apply that understanding to solve a problem
  • 26. 3. Solve a Problem We apply that understanding to solve a problem Problem: Conference rooms should be used efficiently Action: repurpose rooms with usage less than 50%, also heavily used areas Metric: room utilization = hours in use / available hours per day
  • 27. 3. Solve a Problem Did we solve the problem? What action are you going to take? Problem: Conference rooms should be used efficiently Action: repurpose rooms with usage less than 50%, also heavily used areas Metric: room utilization = hours in use / available hours per day
  • 28. 4. Take Action Ultimately, we succeed when our solution leads to actions
  • 29. 4. Take Action Ultimately, we succeed when our solution leads to actions. For a small project you might periodically recreate the results so the user can take new actions. For a large project you might provide a tool for the user to recreate them on their own.
  • 30. 4. Take Action Ultimately, we succeed when our solution leads to actions. In our example, our Facilities Gal goes and looks at the bottom three rooms. She decides that Camp Ivanhoe really isn’t needed. She also checks Fire Swamp and asks some people why it is used so much.
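Picking out the bottom three rooms and the most heavily used room is a simple ranking once utilization has been computed per room. In the sketch below the utilization values are invented, and every room name other than Camp Ivanhoe and Fire Swamp is a placeholder.

```python
# Hypothetical utilization per room (fraction of available hours in use).
utilization_by_room = {
    "Camp Ivanhoe": 0.12,
    "Fire Swamp": 0.91,
    "Room C": 0.35,
    "Room D": 0.48,
    "Room E": 0.67,
}

# Rank rooms from least used to most used.
ranked = sorted(utilization_by_room.items(), key=lambda item: item[1])

# The rooms someone would go and look at first.
bottom_three = [name for name, _ in ranked[:3]]

# Rooms under the 50% threshold are candidates for repurposing.
repurpose_candidates = [name for name, value in ranked if value < 0.5]

# The most heavily used room, worth asking people about.
busiest = ranked[-1][0]
```

The ranking only narrows down where to look. The decision about a room still comes from someone checking it and talking to the people who use it.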
  • 31. Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline Data science is a process 1. Collect data and create a conceptual data model of real world phenomena 2. We use that data model to understand something about the phenomena 3. We apply that understanding to solve a problem 4. Ultimately, we succeed when our solution leads to actions
  • 32. ? ? ? Cool! So what about machine learning and predictive modeling?
  • 33. The data science process has a hierarchy of needs
  • 34. Data Basics The data science hierarchy of needs describes the stages of data complexity and insights
  • 35. The Data Science Process
  • 36. The Data Science Process
  • 37. The Data Science Process
  • 38. The Data Science Process
  • 39. First point of value Descriptive Analytics are your first stage where you can actually answer questions. Especially important for business end users who want the results of your data.
  • 40. First point of value Businesses spend 1-3 months to get this into production the first time They spend 1-3 years to really get this right Descriptive Analytics are your first stage where you can actually answer questions.
  • 41. Businesses spend 1-3 months to get this into production the first time They spend 1-3 years to really get this right 1-2 years to do this well 1-2 years integrate these 1+ years grow modeling to optimizations
  • 42. In the real world, data science is a team activity
  • 43. Data-Driven Companies Build Data Science Teams Data Engineer Data Architect Data Analyst Developer Product Manager QA Statistician Chief Data Officer Senior Data Analyst Data Steward Data Engineer Business Analyst
  • 44. Myth of the data scientist Data science requires many different jobs and skills. Being a “data scientist” is very much like being a “full stack developer”. The most data-driven companies are creating data specific jobs: data engineers, data architects, data analysts, data researchers.
  • 45. How do you get started?
  • 46. Start with descriptive analytics Best way to build your intuition about how the data science process works. Become good at identifying the root question, problem to solve, and the possible actions to be taken.
  • 47. Start with descriptive analytics Best way to build your intuition about how the data science process works. Become good at identifying the root question, problem to solve, and the possible actions to be taken. Open Data Sets: www.kaggle.com/datasets www.data.gov www.github.com/awesomedata/awesome-public-datasets www.google.com/search?q=open+data+sets
  • 48. Start with descriptive analytics Best way to build your intuition about how the data science process works. Become good at identifying the root question, problem to solve, and the possible actions to be taken. ● The best tools are powerful. ● The best tools are easy to use and learn. ● The best tools support teamwork. ● The best tools are beloved by the community. Excel is still a standard across the data world and is a perfectly fine way to get started.
  • 49. Data science is successful when you learn something about the real world which helps you solve a problem by taking an action. You set yourself up for success if you... ● Foster a determination to discover the underlying problems to solve ● Learn to work with data What is data science?
  • 50. References and Resources ● Rachel Schutt & Cathy O’Neil (2013) Doing Data Science: Straight Talk From the Frontline, Sebastopol, CA: O’Reilly ● DJ Patil & Hilary Mason (2015) Data Driven. Sebastopol, CA: O’Reilly ● DJ Patil (2011) Building Data Science Teams. Sebastopol, CA: O’Reilly ● Monica Rogati (2017) The AI Hierarchy of Needs ● Nick Crocker (2014) Thirty Things I’ve Learned ● Tavish Srivastava (2015) 13 Tips to make you awesome in Data Science / Analytics Jobs ● Daniel Tunkelang (2017) 10 Things Everyone Should Know About Machine Learning ● DJ Patil - Everything We Wish We'd Known About Building Data Products
  • 51. Data science is successful when you learn something about the real world which helps you solve a problem by taking an action. You set yourself up for success if you... ● Foster a determination to discover the underlying problems to solve ● Learn to work with data Thank You!