There is often a divide between data teams, engineering, and product managers in organizations, but with the dawn of data-driven companies and applications, it is more pressing now than ever to be able to automate your analyses to personalize your users' experiences. LinkedIn's People You May Know, Netflix's and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company.
As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a general-purpose, performant programming language, its rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), its web frameworks and community, and its utilities and libraries for handling data at scale. In this talk I will walk through a fictional company bringing its first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, cover some of the pitfalls of putting models into production, and show how to make sure your users (and engineers) are as happy as they can be.
https://github.com/Jay-Oh-eN/pydatasv2014
Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014
1. Jonathan Dinu
Co-Founder, Zipfian Academy
jonathan@zipfianacademy.com
@clearspandex
@ZipfianAcademy
Data Engineering 101: Building your first data product
May 4th, 2014
2. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
6. Disclaimer:
All characters appearing in this presentation are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.
7. Disclaimer:
This presentation contains strong opinions that you may or may not agree with. All thoughts are my own.
Jonathan Dinu
Co-Founder, Zipfian Academy
jonathan@zipfianacademy.com
@clearspandex
8. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Creating Value for Users
• Q&A
9. nwsrdr (News Reader)
Get nwsrdr on your desktop: getnews.com/bookmarklet
When browsing the web, simply click the +nwsrdr button to save any page to nwsrdr.
Image source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark-Button.png
10. nwsrdr
• Auto-categorize Articles
• Find Similar Articles
• Recommend articles
• Suggest Feeds to Follow
• No Ads!
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
11. nwsrdr
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model!
20. Data Generating Products
Products that enhance a user's experience the more “data” a user provides
Ex: Recommender Systems
31. What
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model
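Of the techniques listed here, triangle closing is perhaps the quickest to illustrate. The sketch below is an assumption about how it could work for suggesting feeds, not the talk's implementation: if you follow A, and A follows B, then B is a candidate for you. The `follows` dictionary and the `suggest_follows` helper are hypothetical names invented for this example.

```python
from collections import Counter

def suggest_follows(follows, user, top_n=3):
    """Rank accounts that `user` does not follow yet by how many of the
    accounts `user` already follows also follow them (closing the triangle)."""
    already = follows.get(user, set())
    counts = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            if candidate != user and candidate not in already:
                counts[candidate] += 1
    return [name for name, _ in counts.most_common(top_n)]

# Toy follow graph (made-up data): who follows whom.
follows = {
    "ann": {"bob", "cat"},
    "bob": {"dan", "eve"},
    "cat": {"dan"},
    "dan": set(),
    "eve": set(),
}

print(suggest_follows(follows, "ann"))  # ['dan', 'eve']
```

Here "dan" ranks first because two of ann's follows point to him, while "eve" is reachable through only one.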
44. Acquire
# `api` and `collection` are assumed to be defined earlier in the talk
# parse resulting JSON and insert into a MongoDB collection
for content in api.json()['response']['docs']:
    if not collection.find_one(content):
        collection.insert_one(content)  # collection.insert() in pymongo 2.x

# only returns 10 per page
print("There are only %i documents returned 0_o" %
      len(api.json()['response']['docs']))
45. Acquire
# there are many more than 10 articles, however
total_art = articles_left = api.json()['response']['meta']['hits']

print("There are currently %s articles in the NYT archive" % total_art)

#=> There are currently 15277775 articles in the NYT archive
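Since each response holds only 10 documents while the hit count runs into the millions, fetching more means paging. The following is a minimal sketch of that loop under stated assumptions: `fetch_page` is a hypothetical callable standing in for the real API request, and `fake_fetch` with its `fake_archive` is made-up data so the example runs offline. It is not the request code from the talk.

```python
import math

PAGE_SIZE = 10  # the slides note that only 10 docs come back per page

def page_count(total_hits, page_size=PAGE_SIZE):
    """Number of pages needed to cover `total_hits` results."""
    return int(math.ceil(total_hits / float(page_size)))

def iter_docs(fetch_page, total_hits, max_pages=None):
    """Yield docs one page at a time.

    `fetch_page(page)` is a hypothetical callable returning the list of
    docs for a zero-based page number.
    """
    pages = page_count(total_hits)
    if max_pages is not None:
        pages = min(pages, max_pages)
    for page in range(pages):
        for doc in fetch_page(page):
            yield doc

# Stand-in for a real API call (made-up data).
fake_archive = [{"_id": i} for i in range(25)]

def fake_fetch(page):
    start = page * PAGE_SIZE
    return fake_archive[start:start + PAGE_SIZE]

docs = list(iter_docs(fake_fetch, total_hits=len(fake_archive)))
print(len(docs))             # 25
print(page_count(15277775))  # 1527778
```

The `max_pages` cap is a deliberate choice: with roughly 1.5 million pages in the full archive, a real crawl would usually start with a small limit and respect the provider's rate limits.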
65. Vectorize
Tokenize article text and create feature vectors with NLTK
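66. Vectorize
import nltk
from nltk import tokenize
from nltk.corpus import stopwords

wnl = nltk.WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once, not once per word

def tokenize_and_normalize(chunks):
    # split the joined text into sentences, then each sentence into words
    words = [tokenize.word_tokenize(sent)
             for sent in tokenize.sent_tokenize("".join(chunks))]
    flatten = [inner for sublist in words for inner in sublist]
    stripped = []

    for word in flatten:
        if word not in stop_words:
            try:
                # repair UTF-8 text that was mis-decoded as Latin-1, then lowercase
                stripped.append(word.encode('latin-1').decode('utf8').lower())
            except (UnicodeEncodeError, UnicodeDecodeError):
                print("Cannot encode: " + word)

    # drop single-character tokens (mostly punctuation)
    no_punks = [word for word in stripped if len(word) > 1]
    return [wnl.lemmatize(t) for t in no_punks]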
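Once articles are reduced to token lists (for example by a function like `tokenize_and_normalize`), they still need to become numeric feature vectors. The sketch below shows one simple option, a bag-of-words count vector, using only the standard library rather than NLTK. The helper names `build_vocabulary` and `vectorize` and the sample tokens are assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(tokens, vocabulary):
    """Return a list of term counts aligned with the vocabulary columns."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # ignore tokens never seen when building the vocabulary
            vector[vocabulary[tok]] = n
    return vector

# Made-up token lists standing in for two tokenized articles.
docs = [["data", "product", "data"], ["pipeline", "product"]]
vocabulary = build_vocabulary(docs)
vectors = [vectorize(d, vocabulary) for d in docs]

print(vocabulary)  # {'data': 0, 'pipeline': 1, 'product': 2}
print(vectors)     # [[2, 0, 1], [0, 1, 1]]
```

Vectors like these are the usual input to a Naive Bayes classifier or a clustering step; at larger scale a sparse representation would be the more practical choice.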
89. Pipeline
Immutable, append-only set of raw data
Computation is a view on data
*Lambda Architecture by Nathan Marz
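To make the two statements on this slide concrete, here is a toy sketch, not the Lambda Architecture itself: raw records are only ever appended, and anything derived is recomputed from them by a plain function. `RawStore` and `articles_per_user` are hypothetical names, and the records are made up.

```python
class RawStore(object):
    """Append-only store: records can be added but never changed in place."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # copy so later changes by the caller cannot alter stored data
        self._records.append(dict(record))

    def records(self):
        # hand out copies only, so readers cannot mutate the raw data
        return tuple(dict(r) for r in self._records)

def articles_per_user(records):
    """A view: always recomputed from raw data, never the source of truth."""
    counts = {}
    for r in records:
        counts[r["user"]] = counts.get(r["user"], 0) + 1
    return counts

store = RawStore()
store.append({"user": "ann", "url": "http://example.com/a"})
store.append({"user": "bob", "url": "http://example.com/b"})
store.append({"user": "ann", "url": "http://example.com/c"})

print(articles_per_user(store.records()))  # {'ann': 2, 'bob': 1}
```

Because the view holds no state of its own, a bug in `articles_per_user` can be fixed and the view rebuilt without any loss of raw data.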
90. Functional Data Science
• Modularity
• Define interfaces
• Separate data from computation
• Data lineage
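One way to read these bullets in code: each pipeline stage is a small function with the same interface (a list in, a new list out), and the pipeline is their composition. This is a sketch under that assumption; `compose`, `clean`, and `dedupe` are names invented for the example.

```python
from functools import reduce

def compose(*stages):
    """Chain stages left to right; every stage takes and returns a list."""
    def run(records):
        return reduce(lambda data, stage: stage(data), stages, records)
    return run

def clean(records):
    """Trim whitespace, lowercase, and drop empty entries (returns a new list)."""
    return [r.strip().lower() for r in records if r.strip()]

def dedupe(records):
    """Remove duplicates while keeping first-seen order (returns a new list)."""
    seen, out = set(), []
    for r in records:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

raw = ["  Python ", "python", "", "Data"]
pipeline = compose(clean, dedupe)
result = pipeline(raw)

print(result)  # ['python', 'data']
print(raw)     # ['  Python ', 'python', '', 'Data'] (input is left untouched)
```

Since no stage mutates its input, stages can be tested in isolation, reordered, or swapped out, and the sequence of stage names doubles as a simple record of data lineage.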
91. Pipeline
Need a robust and flexible pipeline!
92. Pipeline
Whatever you do, DO NOT cross the streams
94. Gotchas!
• Only have a static subset of articles
• Pipeline not automated for re-training
98. How to Scale
1. Start small (data) and fast (development)
2. Test
3. Increase size of data set
4. Test
5. Optimize and productionize
6. PROFIT! $$$
99. How to Scale
1. Develop locally
2. Test
3. Distribute computation (run on cluster)
4. Test
5. Tune parameters
6. PROFIT! $$$
Can also use a streaming algorithm or single-machine, disk-based “medium data” technologies (e.g. a database or memory-mapped files)
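As an illustration of the streaming option, the sketch below computes a mean and variance in a single pass with constant memory, using Welford's online algorithm. The choice of statistic and the names `RunningStats` and `word_counts` are assumptions for this example; the talk does not name a specific streaming algorithm.

```python
class RunningStats(object):
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0

def word_counts(lines):
    """Generator: yields one word count per line without loading all lines."""
    for line in lines:
        yield len(line.split())

stats = RunningStats()
# In practice `lines` could be an open file object, read lazily line by line.
for count in word_counts(["a b c", "d", "e f g h", "i j"]):
    stats.update(count)

print(stats.n, stats.mean, stats.variance)
```

The same pattern applies to a file far larger than memory, because only the three running values are ever held at once.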
107. Obtain: Data Sources
(ranked by ease of use)
1. DaaS (Data as a Service)
2. Bulk Download
3. APIs
4. Web Scraping
108. Data Sources: DaaS
(Data as a Service)
• Time Series/Numeric: Quandl
• Financial Modeling: Quantopian
• Email Contextualization: Rapleaf
• Location and POI: Factual
109. Data Sources: Bulk Download
(just like the good ol’ days)
• File Transfer Protocol (FTP): CDC
• Amazon Web Services: Public Datasets
• Infochimps: Data Marketplace
• Academia: UCI Machine Learning Repository
110. Data Sources: APIs
(if it’s not RESTed, I’m not buying)
• Geographic: Foursquare
• Social: Facebook
• Audio: Rdio
• Content: Tumblr
• Realtime: Twitter
• Hidden: Yahoo Finance
111. Data Sources: Web Scraping
(DIY for life)
1. wget and curl
2. Web Spider/Crawler
3. API scraping
4. Manual Download
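The core step of a web spider or crawler is pulling links out of a fetched page. Here is a minimal sketch of that step using Python 3's standard-library `html.parser`. The `LinkExtractor` class and the sample `page` string are made up for illustration, and the HTML is inlined so the example runs without network access.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Inline sample page (made-up content) instead of a live download.
page = (
    '<html><body>'
    '<a href="/a.html">A</a> '
    '<a href="http://example.com/b">B</a>'
    '<a name="anchor-only">no link</a>'
    '</body></html>'
)

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/a.html', 'http://example.com/b']
```

A real crawler would also resolve relative links against the page URL, keep track of visited pages, and honor the site's robots.txt and rate limits.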
112. Data Formats
• Delimited Values: TSV, CSV, WSV
• JSON
• XML
• Ad Hoc Formats (avoid these if you can)
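For the delimited formats, Python's standard `csv` module handles both comma- and tab-separated data, including quoted fields that contain the delimiter. The sketch below assumes a header row. `read_delimited` is a hypothetical helper name and the sample text is invented.

```python
import csv
import io

def read_delimited(text, delimiter=","):
    """Parse delimited text with a header row into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Made-up samples; note the quoted field containing a comma.
csv_text = 'title,section\nData Products,tech\n"Hello, World",news\n'
tsv_text = "title\tsection\nData Products\ttech\n"

rows = read_delimited(csv_text)
tsv_rows = read_delimited(tsv_text, delimiter="\t")

print(rows[1]["title"])        # Hello, World
print(tsv_rows[0]["section"])  # tech
```

Splitting each line on "," by hand would break on the quoted "Hello, World" field, which is the main reason to prefer the library parser.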
113. Data Formats
• JSON is made up of hash tables and arrays
• Hash tables: {"foo": 1, "bar": 2, "baz": "3"}
• Arrays: [1, 2, 3]
• Arrays of arrays: [[1, 2, 3], ["foo", "bar", "baz"]]
• Array of hashes: [{"foo": 1, "bar": 2}, {"baz": 3}]
• Hashes of hashes: {"foo": {"bar": 2, "baz": 3}}
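In Python these structures map directly onto dicts and lists through the standard `json` module. The short sketch below uses an invented document to show the mapping, and also shows that strict JSON requires double-quoted keys.

```python
import json

# Made-up document combining a hash, an array, and an array of hashes.
text = (
    '{"user": "ann", "tags": ["python", "data"], '
    '"saved": [{"url": "http://example.com/a", "read": true}]}'
)

doc = json.loads(text)
print(type(doc).__name__)          # dict (hash table)
print(type(doc["tags"]).__name__)  # list (array)
print(doc["saved"][0]["read"])     # True (JSON true becomes Python True)

# Serializing and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(doc))
print(round_trip == doc)           # True

# Keys must be double-quoted strings; a bare key is rejected.
try:
    json.loads('{baz: "3"}')
    bare_key_ok = True
except ValueError:
    bare_key_ok = False
print(bare_key_ok)                 # False
```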
121. Data Formats
Programming languages like Python, Ruby, and R have built-in or readily available parsers for data formats such as JSON and CSV. For other, more esoteric formats you will probably have to write your own.
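As a sketch of what writing your own parser can look like, the example below handles an invented ad hoc format of semicolon-separated `key=value` pairs. The format itself and the `parse_record` name are assumptions for illustration and do not come from the talk.

```python
def parse_record(line):
    """Parse one line of a made-up 'key=value; key=value' format into a dict."""
    record = {}
    for field in line.strip().split(";"):
        if not field.strip():
            continue  # tolerate a trailing ';' or blank segments
        # split on the first '=' only, so values may themselves contain '='
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError("missing '=' in field: %r" % field)
        record[key.strip()] = value.strip()
    return record

line = "user=ann; action=save; url=http://example.com/a?x=1\n"
record = parse_record(line)
print(record["user"])  # ann
print(record["url"])   # http://example.com/a?x=1
```

Even this small parser has to decide what to do about embedded delimiters and malformed fields, which is a good argument for the earlier advice to avoid ad hoc formats when a standard one will do.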