Joel Grus gives a funny and beginner-friendly talk about his journey on the road to data science. For animations, see the original slides here: https://docs.google.com/presentation/d/1gqs54MMCgJpIVgcXUFm82MKfdmHA_pQRvThT6fAdb6g/edit?usp=sharing. More insights in Joel's new book, "Data Science from Scratch."
10. Data Science Is A Broad Field
[Venn diagram: "Data Science" sits at the intersection of "Some Stuff," "More Stuff," and "Even More Stuff," with joke labels for the partial overlaps: people who think they're data scientists but aren't really data scientists, people who are a danger to everyone around them, and people who say "machine learnings."]
14-37. a data scientist should be able to
run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model.
specialization is for engineers.
("engineers" crossed out: grad students!)
JOEL GRUS
40. The Math Way
"I like to start with matrix decompositions. How's your measure theory?"
41-42. The Math Way
The Good:
Solid foundation
Math is the noblest known pursuit
The Bad:
Some weirdos don't think math is fun
Can be pretty forbidding
Can miss practical skills
43. "So, did you count the words in that document?"
"No, but I have an elegant proof that the number of words is finite!"
46. The Tools Way
"Here's a list of the 25 libraries you really ought to know. How's your R programming?"
47-48. The Tools Way
The Good:
Don't have to understand the math
Practical
Can get started doing fun stuff right away
The Bad:
Don't have to understand the math
Can get started doing bad science right away
49. "So, did you build that model?"
"Yes, and it fits the training data almost perfectly!"
52. Example: k-means clustering
Unsupervised machine learning technique
Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares
i.e. in a way such that the clusters are as "small" as possible (for a particular conception of "small")
60. Data Science from Scratch
This is to certify that Joel Grus
has honorably completed the course of study outlined in
the book Data Science from Scratch: First Principles with
Python, and is entitled to all the Rights, Privileges, and
Honors thereunto appertaining.
Joel Grus, June 23, 2015
Certificate Programs?
62. Learning By Building
You don't really understand something until you
build it
For example, I understand garbage disposals
much better now that I had to replace one that
was leaking water all over my kitchen
More relevantly, I thought I understood
hypothesis testing, until I tried to write a book
chapter + code about it.
67. Example: k-means clustering
Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares
Global optimization is hard, so use a greedy iterative approach
68. Fun Motivation: Image Posterization
Image consists of pixels
Each pixel is a triplet (R,G,B)
Imagine pixels as points in space
Find k clusters of pixels
Recolor each pixel to its cluster mean
I think it's fun, anyway
[example image shown posterized to 8 colors]
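The recipe above (cluster the pixels, recolor each one to its cluster's mean) is easy to sketch end to end without any imaging library. This is my own compressed version, not the talk's code; the helper names are invented, and it runs a plain k-means loop inline on a toy list of four pixels:

```python
import random

def squared_distance(p, q):
    # squared Euclidean distance between two equal-length tuples
    return sum((x - y) ** 2 for x, y in zip(p, q))

def component_mean(points):
    # component-wise mean of a non-empty list of equal-length tuples
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def posterize(pixels, k, num_iters=10):
    # cluster the pixels with a greedy k-means loop,
    # then recolor each pixel to its cluster's mean
    means = random.sample(pixels, k)
    for _ in range(num_iters):
        clusters = [[] for _ in means]
        for p in pixels:
            j = min(range(k), key=lambda j: squared_distance(p, means[j]))
            clusters[j].append(p)
        # keep the old mean if a cluster came up empty
        means = [component_mean(c) if c else means[j]
                 for j, c in enumerate(clusters)]
    return [min(means, key=lambda m: squared_distance(p, m)) for p in pixels]

random.seed(0)
pixels = [(255, 0, 0), (250, 10, 10), (0, 0, 255), (10, 10, 250)]
# the two reddish pixels end up sharing one color, the two bluish another
print(posterize(pixels, 2))
```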
69. Example: k-means clustering
given some points, find k clusters by
    choose k "means"
    repeat:
        assign each point to cluster of closest "mean"
        recompute mean of each cluster
sounds simple! let's code!
70-81. def k_means(points, k, num_iters=10):
    means = list(random.sample(points, k))
    assignments = [None for _ in points]
    for _ in range(num_iters):
        # assign each point to closest mean
        for i, point_i in enumerate(points):
            d_min = float('inf')
            for j, mean_j in enumerate(means):
                d = sum((x - y)**2
                        for x, y in zip(point_i, mean_j))
                if d < d_min:
                    d_min = d
                    assignments[i] = j
        # recompute means
        for j in range(k):
            cluster = [point for i, point in enumerate(points)
                       if assignments[i] == j]
            means[j] = mean(cluster)
    return means

start with k randomly chosen points
start with no cluster assignments
for each iteration
for each point
for each mean
compute the distance
assign the point to the cluster of the mean with the smallest distance
find the points in each cluster
and compute the new means

Not impenetrable, but a lot less helpful than it could be
Can we make it simpler?
83. def k_means(points, k, num_iters=10):
    # start with k of the points as "means"
    means = random.sample(points, k)
    # and iterate finding new means
    for _ in range(num_iters):
        means = new_means(points, means)
    return means
84. def new_means(points, means):
    # assign points to clusters
    # each cluster is just a list of points
    clusters = assign_clusters(points, means)
    # return the cluster means
    return [mean(cluster)
            for cluster in clusters]
85. def assign_clusters(points, means):
    # one cluster for each mean
    # each cluster starts empty
    clusters = [[] for _ in means]
    # assign each point to cluster
    # corresponding to closest mean
    for point in points:
        index = closest_index(point, means)
        clusters[index].append(point)
    return clusters
86. def closest_index(point, means):
    # return index of closest mean
    return argmin(distance(point, mean)
                  for mean in means)

def argmin(xs):
    # return index of smallest element
    return min(enumerate(xs),
               key=lambda pair: pair[1])[0]
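One nit: the refactored version calls two helpers, distance and mean, that the slides never define. A plausible minimal sketch (my guess, matching the inline squared-distance computation from the long version of k_means):

```python
def distance(point, mean):
    # squared Euclidean distance, matching the inline
    # sum((x - y)**2 ...) computation in the long version
    return sum((x - y) ** 2 for x, y in zip(point, mean))

def mean(cluster):
    # component-wise mean of a non-empty list of equal-length points
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n
                 for i in range(len(cluster[0])))

print(distance((0, 0), (3, 4)))  # → 25
print(mean([(0, 0), (2, 4)]))    # → (1.0, 2.0)
```

Note that mean raises on an empty cluster; a production version would need to decide what to do when a mean attracts no points.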
88. As a Pedagogical Tool
Can be used "top down" (as we did here)
Implement high-level logic
Then implement the details
Nice for exposition
Can also be used "bottom up"
Implement small pieces
Build up to high-level logic
Good for workshops
89. Example: Decision Trees
Want to predict whether
a given Meetup is worth
attending (True) or not
(False)
Inputs are dictionaries
describing each Meetup
{ "group" : "DAML",
"date" : "2015-06-23",
"beer" : "free",
"food" : "dim sum",
"speaker" : "@joelgrus",
"location" : "Google",
"topic" : "shameless self-promotion" }
{ "group" : "Seattle Atheists",
"date" : "2015-06-23",
"location" : "Round the Table",
"beer" : "none",
"food" : "none",
"topic" : "Godless Game Night" }
91. Example: Decision Trees
class LeafNode:
    def __init__(self, prediction):
        self.prediction = prediction

    def predict(self, input_dict):
        return self.prediction

class DecisionNode:
    def __init__(self, attribute, subtree_dict):
        self.attribute = attribute
        self.subtree_dict = subtree_dict

    def predict(self, input_dict):
        value = input_dict.get(self.attribute)
        subtree = self.subtree_dict[value]
        return subtree.predict(input_dict)
92. Example: Decision Trees
Again inspiration from functional programming:
type Input = Map.Map String String
data Tree = Predict Bool
          | Subtrees String (Map.Map String Tree)
Predict Bool: always predict a specific value
Subtrees: look at the named entry (e.g. the "beer" entry); a map from each possible "beer" value to a subtree
93. Example: Decision Trees
type Input = Map.Map String String
data Tree = Predict Bool
          | Subtrees String (Map.Map String Tree)
predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input =
    predict subtree input
  where subtree = subtrees Map.! (input Map.! a)
94. Example: Decision Trees
type Input = Map.Map String String
data Tree = Predict Bool
          | Subtrees String (Map.Map String Tree)
We can do the same: we'll say a decision tree is either
True
False
(attribute, subtree_dict)
e.g.
("beer",
 { "free" : True,
   "none" : False,
   "paid" : ("speaker", {...}) })
95. Example: Decision Trees
predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input =
    predict subtree input
  where subtree = subtrees Map.! (input Map.! a)

def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    else:
        # destructure tree
        attribute, subtree_dict = tree
        # find appropriate subtree
        value = input_dict[attribute]
        subtree = subtree_dict[value]
        # classify using subtree
        return predict(subtree, input_dict)
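A quick sanity check of the tuple encoding. The tree literal below is illustrative, patterned on the "beer"/"speaker" example from the earlier slide; the contents of the "paid" subtree are my invention:

```python
def predict(tree, input_dict):
    # a tree is either a bare True/False leaf, which predicts itself,
    # or an (attribute, subtree_dict) pair
    if tree in (True, False):
        return tree
    attribute, subtree_dict = tree
    # recurse into the subtree matching this input's value for the attribute
    return predict(subtree_dict[input_dict[attribute]], input_dict)

tree = ("beer",
        {"free": True,
         "none": False,
         "paid": ("speaker",
                  {"@joelgrus": True,
                   "someone else": False})})

print(predict(tree, {"beer": "free"}))                          # → True
print(predict(tree, {"beer": "none"}))                          # → False
print(predict(tree, {"beer": "paid", "speaker": "@joelgrus"}))  # → True
```

Note the recursion only consults attributes it actually needs: the "none" query never looks at "speaker" at all, which mirrors how the Haskell version only forces the lookups along one path.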
97. In Conclusion
Teaching data science is fun, if you're smart
about it
Learning data science is fun, if you're smart
about it
Writing a book is not that much fun
Having written a book is pretty fun
Making slides is actually kind of fun
Functional programming is a lot of fun