Data science is having a growing effect on our lives, from the content we see on social media feeds to the decisions businesses are making. Along with its successes, data science has inspired much hype about what it is and what it can do. So I plan to demystify data science and discuss what it really is. What does a day in the life look like? What tools and skills are needed? How is data science successfully applied in the real world? In this talk, I’ll provide insight into these questions and also speculate about the future of data science and its place in business and technology.
Presented at OpenWest 2018
Wtf is data science?
1. WTF is Data Science?
Dylan Gregersen
OpenWest 2018
2. My name is Dylan Gregersen
I like these things... You can find me at…
dylangregersen
I am the lead data
scientist at...
3. Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
Data Science is the process of
collecting, cleaning, analyzing,
visualizing, and communicating
data in order to solve problems
in the real world.
Data science is...
4. What people think data science is...
People often think data science
is all about mathematics,
algorithms, and something called
“machine learning”
Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
5. What most data science is...
Data science actually consists
mostly of data collection,
cleaning, and organization
(often 80% of the work)
Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
6. What people forget that data science is
People tend to forget the skills
needed in data science to
communicate results so someone
can take an action in the real
world
Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
7. Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
Data science is a process
When doing data science we...
1. Conceptual Data Model: Collect
data and create a conceptual data
model of real world phenomena
2. Understand the data: We use that
data model to understand something
about the phenomena
3. Solve a Problem: We apply that
understanding to solve a problem
4. Take action: Ultimately, we succeed
when our solution leads to actions
8. Data science is successful when you learn
something about the real world which
helps you solve a problem by taking an
action.
Example: What is my conference room utilization?
11. Identifying the problem
U: What is my conference room utilization?
Me: What problem are you trying to solve?
U: I want to know which rooms are underutilized
Me: Why do you want to know?
U: To improve the efficiency of conference room use
Me: What are you going to do with that information?
A: Repurpose rooms whose meeting usage is less than 50%
12. Problem: Conference rooms should be used efficiently
Action: repurpose rooms with usage less than 50%; also review heavily used areas
Metric: room utilization = hours in use / available hours per day
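As a rough sketch, the metric and the 50% action threshold can be written as two small Python functions. The function names, the 8-hour default for available hours, and the sample numbers are assumptions made for illustration; the talk does not specify them.

```python
def room_utilization(hours_in_use: float, available_hours_per_day: float = 8.0) -> float:
    """Return the fraction of available hours a room was in use on one day."""
    if available_hours_per_day <= 0:
        raise ValueError("available_hours_per_day must be positive")
    return hours_in_use / available_hours_per_day


def is_underutilized(utilization: float, threshold: float = 0.5) -> bool:
    """Flag a room whose utilization falls below the action threshold (50%)."""
    return utilization < threshold


if __name__ == "__main__":
    # Hypothetical day: a room booked for 3 of its 8 available hours.
    u = room_utilization(hours_in_use=3.0, available_hours_per_day=8.0)
    print(f"utilization={u:.3f}, underutilized={is_underutilized(u)}")
```

Keeping the threshold as a parameter makes it easy to revisit the number once the end user has seen real results.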
13. What problem are you
trying to solve?
What action will you take
with this number?
14. Problem: Change meeting rooms to fit the needs of the departments
Action: make purchasing decisions about technology or furniture
Metrics: room utilization, organizer’s department, occupancy size,
technology or furniture used
U: What is my conference room utilization?
Me: What problem are you trying to solve?
U: I want to know which departments are using the rooms the most.
Me: Why do you want to know?
U: To adjust the rooms to meet their needs
Me: What are you going to do with that information?
A: Buy new technology or furniture to better meet those needs
Identifying the problem
16. Solving the Problem
Start by figuring out a plan.
Document requirements and
get feedback from your end
user
Problem: Conference rooms should be
used efficiently
Action: repurpose rooms with usage
less than 50%; also review heavily used areas
Metric: room utilization = hours in use
/ available hours per day
17. Solving the Problem
Having a plan...
● Helps you stay focused
● Helps you communicate with your
end users
● Helps you build in things you’ll need in
production: data quality, alerts,
testing, security, code reviews
18. Solving the Problem
Now with a plan
19. Collect data and create a
conceptual data model of real
world phenomena
For a small project, you might use Python and
store the data in a folder called “raw_data”.
For a large project, you might use Python and Kafka
and store the data in AWS S3.
{
….
"id": "6iunsmr8qv1k1c5avlek045oup",
"iCalUID": "6iunsmr8qv1k1c5avlek045oup@google.com",
"summary": "OpenWest: WTF is data science?",
"status": "confirmed",
"start": {
"dateTime": "2018-06-08T11:30:00-06:00"
},
"end": {
"dateTime": "2018-06-08T12:30:00-06:00"
},
….
}
Metadata: room_id, customer_id, time_range
Google Event File
1. Conceptual Data Model
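For the small-project case, the collection step might look like the sketch below: write each event, untouched, into a "raw_data" folder together with its metadata. The function name `save_raw_event`, the wrapper layout with "metadata" and "event" keys, and the `time_range` format are assumptions for illustration. Only the event fields shown on the slide are included; the elided fields are left out.

```python
import json
from pathlib import Path

# Fields copied from the slide; the fields elided there are omitted here too.
EVENT = {
    "id": "6iunsmr8qv1k1c5avlek045oup",
    "iCalUID": "6iunsmr8qv1k1c5avlek045oup@google.com",
    "summary": "OpenWest: WTF is data science?",
    "status": "confirmed",
    "start": {"dateTime": "2018-06-08T11:30:00-06:00"},
    "end": {"dateTime": "2018-06-08T12:30:00-06:00"},
}

# room_id and customer_id reuse the example values from the talk;
# the time_range format is a made-up placeholder.
METADATA = {
    "room_id": 42,
    "customer_id": 1212,
    "time_range": ["2018-06-08T00:00:00-06:00", "2018-06-09T00:00:00-06:00"],
}


def save_raw_event(event: dict, metadata: dict, raw_dir: Path) -> Path:
    """Store one event exactly as received, next to the metadata describing it."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{event['id']}.json"
    record = {"metadata": metadata, "event": event}
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Calling `save_raw_event(EVENT, METADATA, Path("raw_data"))` writes one JSON file per event. Keeping the raw copy unmodified means the cleaning step can always be rerun from scratch.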
20. 80% of data science work is
cleaning and structuring the
data.
For a small project, you might use Python to
process “raw_data” into “processed_data”.
For a large project, you might use AWS Glue to
process the AWS S3 data and store it in AWS
Redshift.
1. Conceptual Data Model
INSERT INTO customer (customer_id)
VALUES (1212);
INSERT INTO room (room_id, customer_id)
VALUES (42, 1212);
INSERT INTO event (event_id, event_start_utc, event_duration, event_status)
VALUES ('6iunsmr8qv1k1c5avlek045oup', '2018-06-08T17:30:00Z', 3600.0, 'confirmed');
INSERT INTO fact_room_event (room_id, event_id)
VALUES (42, '6iunsmr8qv1k1c5avlek045oup');
Structured Data - Star Schema
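One way to get from the Google event file to a row for the event table is sketched below. It converts the local start time to UTC and computes the duration in seconds, which reproduces the example values in the star schema. The function name `to_event_row` is an assumption, and reading `event_duration` as seconds is inferred from the 3600.0 value for a one-hour talk.

```python
from datetime import datetime, timezone

# Only the fields needed for the event table, copied from the slide.
EVENT = {
    "id": "6iunsmr8qv1k1c5avlek045oup",
    "status": "confirmed",
    "start": {"dateTime": "2018-06-08T11:30:00-06:00"},
    "end": {"dateTime": "2018-06-08T12:30:00-06:00"},
}


def to_event_row(event: dict) -> dict:
    """Flatten one raw calendar event into the columns of the event table."""
    start = datetime.fromisoformat(event["start"]["dateTime"])
    end = datetime.fromisoformat(event["end"]["dateTime"])
    start_utc = start.astimezone(timezone.utc)
    return {
        "event_id": event["id"],
        "event_start_utc": start_utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event_duration": (end - start).total_seconds(),
        "event_status": event["status"],
    }


if __name__ == "__main__":
    print(to_event_row(EVENT))
```

Normalizing to UTC at this stage keeps later queries simple, at the cost of having to convert back to local time when a "day" boundary matters.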
22. We use that data model to
understand something about
the phenomena
2. Understand the Data
23. Explore, manipulate the data.
Question the data quality and
return to cleaning if necessary.
For a small project, you might use Python to load
“processed_data” and make plots.
For a large project, you might use SQL to query
AWS Redshift and use Python to visualize the results.
2. Understand the Data
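To illustrate querying the star schema tables (customer, room, event, fact_room_event), here is a minimal sketch that uses an in-memory SQLite database as a stand-in for AWS Redshift. The column types, the 8 available hours per day, and the function names are assumptions; the inserted row reuses the example values from the talk.

```python
import sqlite3

# Assumption for illustration: each room is bookable 8 hours per day.
AVAILABLE_HOURS_PER_DAY = 8.0

SCHEMA = """
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
CREATE TABLE room (room_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE event (
    event_id TEXT PRIMARY KEY,
    event_start_utc TEXT,
    event_duration REAL,   -- seconds (inferred from the 3600.0 example)
    event_status TEXT
);
CREATE TABLE fact_room_event (room_id INTEGER, event_id TEXT);
"""

# Utilization per room and UTC day: booked hours / available hours.
UTILIZATION_QUERY = """
SELECT f.room_id,
       substr(e.event_start_utc, 1, 10) AS day_utc,
       SUM(e.event_duration) / 3600.0 / ? AS utilization
FROM fact_room_event AS f
JOIN event AS e ON e.event_id = f.event_id
WHERE e.event_status = 'confirmed'
GROUP BY f.room_id, day_utc
ORDER BY utilization
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the tables and load the single example event."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO customer VALUES (1212)")
    conn.execute("INSERT INTO room VALUES (42, 1212)")
    conn.execute(
        "INSERT INTO event VALUES (?, ?, ?, ?)",
        ("6iunsmr8qv1k1c5avlek045oup", "2018-06-08T17:30:00Z", 3600.0, "confirmed"),
    )
    conn.execute(
        "INSERT INTO fact_room_event VALUES (?, ?)",
        (42, "6iunsmr8qv1k1c5avlek045oup"),
    )
    return conn


def utilization_by_room_day(conn, available_hours=AVAILABLE_HOURS_PER_DAY):
    """Return (room_id, day_utc, utilization) rows, least used first."""
    return conn.execute(UTILIZATION_QUERY, (available_hours,)).fetchall()
```

Note that grouping by the UTC date is a simplification: a room's working day in local time can straddle two UTC dates, which is exactly the kind of data-quality question this step is meant to surface.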
25. 3. Solve a Problem
We apply that understanding to
solve a problem
26. 3. Solve a Problem
Problem: Conference rooms should be
used efficiently
Action: repurpose rooms with usage
less than 50%; also review heavily used areas
Metric: room utilization = hours in use
/ available hours per day
27. 3. Solve a Problem
Did we solve the problem?
What action are you going to
take?
29. 4. Take Action
Ultimately, we succeed when
our solution leads to actions
For a small project, you might periodically rerun the
analysis so the user can take new actions.
For a large project, you might provide a tool that lets
the user rerun it on their own.
30. 4. Take Action
Ultimately, we succeed when
our solution leads to actions
In our example, our Facilities Gal goes and
looks at the bottom three rooms. She decides
that Camp Ivanhoe really isn’t needed.
She also checks Fire Swamp and asks
some people why it is used so much.
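The list handed to the end user could be produced by something like the sketch below: rank rooms by utilization, report the bottom three that fall under the 50% threshold, and name the busiest room for a follow-up conversation. The function name and every utilization value are made up for illustration; only the room names Camp Ivanhoe and Fire Swamp come from the talk.

```python
# Hypothetical utilization values; "Room A" to "Room C" are placeholder names.
SAMPLE_UTILIZATION = {
    "Camp Ivanhoe": 0.05,
    "Fire Swamp": 0.95,
    "Room A": 0.30,
    "Room B": 0.45,
    "Room C": 0.70,
}


def rooms_to_review(utilization_by_room: dict, bottom_n: int = 3, threshold: float = 0.5):
    """Return (underused rooms among the bottom_n, busiest room or None)."""
    ranked = sorted(utilization_by_room.items(), key=lambda item: item[1])
    underused = [name for name, u in ranked[:bottom_n] if u < threshold]
    busiest = ranked[-1][0] if ranked else None
    return underused, busiest


if __name__ == "__main__":
    underused, busiest = rooms_to_review(SAMPLE_UTILIZATION)
    print("Candidates to repurpose:", underused)
    print("Ask why this one is so busy:", busiest)
```

The output is a short list a person can act on, not a decision; the judgment about which room to repurpose stays with the end user.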
31. Rachel Schutt & Cathy O’Neil in Doing Data Science: Straight Talk From the Frontline
Data science is a process
1. Collect data and create a conceptual
data model of real world phenomena
2. We use that data model to
understand something about the
phenomena
3. We apply that understanding to
solve a problem
4. Ultimately, we succeed when our
solution leads to actions
32. Cool! So what about
machine learning and
predictive modeling?
39. First point of value
Descriptive Analytics are your first
stage where you can actually answer
questions.
Especially important for business end
users who want the results of your
data.
41. Businesses spend 1-3
months to get this into
production the first time
They spend 1-3 years to
really get this right
1-2 years to do this well
1-2 years to integrate these
1+ years to grow modeling into
optimizations
42. In the real world,
data science is a team
activity
43. Data-Driven Companies Build Data Science Teams
Data Engineer
Data Architect
Data Analyst
Developer
Product
Manager
QA
Statistician
Chief Data Officer
Senior Data
Analyst
Data Steward
Data Engineer
Business
Analyst
44. Myth of the data scientist
Data science requires many different
jobs and skills.
Being a “data scientist” is very much
like being a “full stack developer”.
The most data-driven companies are
creating data specific jobs: data
engineers, data architects, data
analysts, data researchers.
46. Start with descriptive analytics
The best way to build your intuition about how the data
science process works. Become good at
identifying the root question, the problem to solve,
and the possible actions to be taken.
47. Start with descriptive analytics
Open Data Sets:
www.kaggle.com/datasets
www.data.gov
www.github.com/awesomedata/awesome-public-datasets
www.google.com/search?q=open+data+sets
48. Start with descriptive analytics
● The best tools are powerful.
● The best tools are easy to use and learn.
● The best tools support teamwork.
● The best tools are beloved by the community.
Excel is still a standard across the data world and is a
perfectly fine way to get started.
49. Data science is successful when
you learn something about the real
world which helps you solve a
problem by taking an action.
You set yourself up for success if you...
● Foster a determination to discover the
underlying problems to solve
● Learn to work with data
What is data science?
50. References and Resources
● Rachel Schutt & Cathy O’Neil (2013) Doing Data Science: Straight Talk From the
Frontline, Sebastopol, CA: O’Reilly
● DJ Patil & Hilary Mason (2015) Data Driven. Sebastopol, CA: O’Reilly
● DJ Patil (2011) Building Data Science Teams. Sebastopol, CA: O’Reilly
● Monica Rogati (2017) The AI Hierarchy of Needs
● Nick Crocker (2014) Thirty Things I’ve Learned
● Tavish Srivastava (2015) 13 Tips to make you awesome in Data Science / Analytics Jobs
● Daniel Tunkelang (2017) 10 Things Everyone Should Know About Machine Learning
● DJ Patil - Everything We Wish We'd Known About Building Data Products
51. Data science is successful when
you learn something about the real
world which helps you solve a
problem by taking an action.
You set yourself up for success if you...
● Foster a determination to discover the
underlying problems to solve
● Learn to work with data
Thank You!