2. 3+ years in startups
Built systems
supporting 1mil+
users
Delivered to Fortune
10 companies
Greg Goltsov
Data Hacker
gregory.goltsov.info
@gregoltsov
20. WITH new_users AS (SELECT ...),
unverified_user_ids AS (SELECT ...)
SELECT COUNT(new_users.id)
FROM new_users
WHERE new_users.id NOT IN
(SELECT id FROM unverified_user_ids);
Postgres WITH
41. ├── Makefile <- Makefile with commands like `make data` or `make train`
├── data
│   ├── external <- Data from third party sources.
│   ├── interim <- Intermediate data that has been transformed.
│   ├── processed <- The final, canonical data sets for modeling.
│   └── raw <- The original, immutable data dump.
├── docs <- A default Sphinx project; see sphinx-doc.org for details
├── models <- Trained and serialized models, model predictions
├── notebooks <- Jupyter notebooks
├── references <- Data dictionaries, manuals, and all other explanatory materials.
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
├── requirements.txt <- The requirements file for reproducing the env
├── src <- Source code for use in this project.
│   ├── data <- Scripts to download or generate data
│   │   └── make_dataset.py
│   ├── features <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   ├── models <- Scripts to train models and then use trained models to make
│   │   │        predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   └── visualization <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
42. dataset.readthedocs.io
Just write SQL
# connect, return rows as objects with attributes
db = dataset.connect('postgresql://…', row_type=stuf)
rows = db.query('SELECT country, COUNT(*) c
FROM user GROUP BY country')
# get data into pandas, that's where the fun begins!
rows_df = pandas.DataFrame.from_records(rows)
43. # sklearn-pandas
mapper = DataFrameMapper([
(['country'], [sklearn.preprocessing.Imputer(),
sklearn.preprocessing.StandardScaler()]),
...])
# pipeline to convert DataFrame to ML representation
pipeline = sklearn.pipeline.Pipeline([
('featurise', mapper),
('feature_selection', feature_selection.SelectKBest()),
('random_forest', ensemble.RandomForestClassifier())])
# set up search space for the best model
cv = grid_search.GridSearchCV(pipeline, param_grid=dict(…))
cv.fit(X, y)  # best_estimator_ exists only after fitting
best_model = cv.best_estimator_