OpenRefine - Data Science Training for Librarians

Data Scientist Training for Librarians
Harvard College Observatory
March 28, 2013

Tom Morris
@tfmorris

Who am I?
•
Independent software engineering & product
management consultant
•
Developer on open source OpenRefine project
•
Curious data geek
•
Contact:
– Twitter: @tfmorris
– Email: tfmorris@gmail.com


Data Analysis Lifecycle
●
Find / Extract
●
Prepare
– Characterize
– Clean
– Integrate / Extend
●
Analyze
●
Visualize / Report


Provenance
●
Provenance is key! - both before and after
you get data
●
Record source (e.g. download URL) and date
●
Unix command line
– build up a repeatable transformation pipeline script
– use make to avoid repeating steps that are already up to date

●
OpenRefine maintains an undo history (but...)
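
One way to record source and date, sketched in Python. The function name `provenance_record`, the field names, and the example URL are hypothetical, chosen only for illustration; the checksum is an extra that makes it easy to confirm later that the source file has not changed.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source_url, data, retrieved_at=None):
    """Build a small provenance record for a blob of downloaded bytes."""
    if retrieved_at is None:
        # Default to "now" in UTC so records are comparable across machines.
        retrieved_at = datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        "retrieved_at": retrieved_at.isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }


if __name__ == "__main__":
    # Stand-in for the bytes of a real download.
    raw = b"id,name\n00123,Ada\n"
    record = provenance_record("http://example.org/data.csv", raw)
    print(json.dumps(record, indent=2))
```

Saving this record next to the untouched source file keeps the "before you get data" half of provenance; the pipeline script covers the "after" half.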


Irreversible Transforms
●
Be careful of anything which isn't reversible
●
Keep source files and plan recovery strategy
●
Common gotchas:
– Character encoding – can't replace
substitution character with its original
value
– Leading 0s on identifiers
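
Both gotchas can be reproduced in a few lines of Python. This is an illustrative sketch with made-up sample data, not something taken from OpenRefine itself.

```python
import csv
import io

raw_csv = "zip,city\n02138,Cambridge\n"

# The csv module keeps every field as a string, so the leading 0 survives.
rows = list(csv.DictReader(io.StringIO(raw_csv)))
zip_as_text = rows[0]["zip"]

# Converting to a number drops the leading 0. The original width cannot be
# recovered from the number alone.
zip_as_number = int(zip_as_text)

# Decoding Latin-1 bytes as UTF-8 with errors="replace" turns the undecodable
# byte into U+FFFD. Nothing in the result says which character it used to be.
latin1_bytes = "Café".encode("latin-1")
decoded = latin1_bytes.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(repr(zip_as_text), repr(zip_as_number), repr(decoded))
```

In both cases the only recovery path is to go back to the source file, which is why it needs to be kept.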


Provenance projects
●
Stanford Panda (Provenance and Data) -
http://infolab.stanford.edu/panda/
●
Open Provenance Model -
http://openprovenance.org/
●
Both focus on bi-directional traceability


Tools vs Scale
●
Editor with macro facility: emacs, vim
●
Spreadsheet: Excel, OO Calc
●
OpenRefine
●
Unix shell commands – awk, sed, grep, cut,
sort, head, tail
●
“Real” programming – Python, Ruby, Java
●
Map-Reduce


Regular Expressions
●
Useful in so many contexts
●
A little confusing to learn, but
●
Absolutely worth the effort!
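
A small taste, sketched in Python: pulling a four-digit year out of messy date strings. The pattern, the helper name `first_year`, and the sample strings are invented for illustration; real catalog data will need a more careful pattern.

```python
import re

# A year from 1500 to 2099 that is not part of a longer run of digits.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def first_year(text):
    """Return the first plausible year in text as an int, or None."""
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


samples = ["c1987.", "[1923?]", "London : Penguin, 2004", "n.d."]
years = [first_year(s) for s in samples]

if __name__ == "__main__":
    for sample, year in zip(samples, years):
        print(repr(sample), "->", year)
```

The same pattern syntax carries over, with minor dialect differences, to editors, grep, sed, and OpenRefine expressions.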


OpenRefine
●
Power tool for working with messy data
●
Free and open source
●
Desktop based (data stays private)
●
Faceted browsing interface
●
Lots of input & output formats
●
Powerful transformations
●
Useful for analysis & web scraping/APIs too

OpenRefine Data Formats
●
CSV/TSV/separator based
●
Fixed width field
●
JSON & XML
●
Excel & OpenOffice Calc
●
Google Spreadsheets & Fusion Tables
●
RDF
●
URLs & zip files too!

Data Characterization
●
Coded vs free-form fields
●
Distribution of values
– Missing values – skip, impute, ...
– Outliers – cause? Can they be rescaled?
●
Delimiters & escaping (e.g. HTML, XML)
●
Formatting problems
●
Character encoding issues?
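
A rough Python sketch of characterizing one column: count missing values and tally the distribution of the rest, similar in spirit to a text facet. The helper name `characterize` and the sample column are hypothetical.

```python
from collections import Counter


def characterize(values):
    """Summarize one column: size, missing count, distinct values, top values."""
    # Treat None and whitespace-only cells as missing.
    cleaned = [v.strip() if v is not None else "" for v in values]
    missing = sum(1 for v in cleaned if v == "")
    counts = Counter(v for v in cleaned if v != "")
    return {
        "total": len(cleaned),
        "missing": missing,
        "distinct": len(counts),
        "top": counts.most_common(3),
    }


column = ["Book", "book", "Book ", "", "Serial", None, "Book"]
summary = characterize(column)

if __name__ == "__main__":
    print(summary)
```

Here "Book" and "book" show up as separate values, which is the kind of formatting problem a quick tally surfaces in a field that was supposed to be coded.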


Hands-on
●
Let's play with some data!
●
http://code.google.com/p/google-refine/


Export
●
OpenRefine exports most import formats:
Excel, CSV, TSV, OpenOffice, Google
Spreadsheets, Fusion tables, JSON, RDF
●
Template-based exporter for everything else:
custom JSON formats, etc.
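
The idea behind a template-based export, sketched in Python rather than in OpenRefine's own template syntax: output is assembled from a prefix, a per-row template, a row separator, and a suffix. The function names, field names, and sample rows below are all hypothetical.

```python
import json


def render(rows, prefix, row_template, separator, suffix):
    """Join one rendered chunk per row between a fixed prefix and suffix."""
    body = separator.join(row_template(row) for row in rows)
    return prefix + body + suffix


def row_template(row):
    # json.dumps handles quoting and escaping of each cell value.
    return '    {"identifier": %s, "label": %s}' % (
        json.dumps(row["id"]),
        json.dumps(row["title"]),
    )


rows = [
    {"id": "00123", "title": "Moby-Dick"},
    {"id": "00456", "title": 'The "Whale"'},
]

output = render(rows, '{\n  "records": [\n', row_template, ",\n", "\n  ]\n}\n")

if __name__ == "__main__":
    print(output)
```

Escaping each value is the part that is easy to forget; without it, a quote inside a title produces invalid JSON.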


Scaling Up
●
Experiment with a (representative) sample of
your data
●
Reuse regexes, filters, etc. with heavier-duty
tools – awk, sed, Map-Reduce
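
One way to draw a sample from a file too large to load, sketched in Python using reservoir sampling (Algorithm R). This is a swapped-in illustration, not a feature of OpenRefine; the function name `reservoir_sample` and the fixed seed are assumptions. A uniform random sample is only representative if the rows are comparable, so check it against what you know about the full data.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


if __name__ == "__main__":
    # Stand-in for iterating over the lines of a large file.
    records = ("row %d" % n for n in range(100000))
    print(reservoir_sample(records, 5))
```

Because the input is read once and only k items are held in memory, the same function works on a file handle or a pipe.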


Resources
●
Berkeley Data Science course
http://datascienc.es/schedule/
– week 2 - Data Preparation has good R examples
http://berkeleydatascience.files.wordpress.com/2012/02/2012
●
Mike Loukides "Data Hand Tools"
http://radar.oreilly.com/2011/04/data-hand-tools
●
Jeremy Howard "Getting in shape for the sport
of Data Science"
http://media.kaggle.com/MelbURN.html

More resources
●
MIT IAP Data Science course materials
– http://dataiap.github.com/dataiap/
●
Quora
– http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
– http://www.quora.com/What-are-some-good-methods-for-data-pre-processing-in-machine-learning

●
OKFN School of Data handbook
– http://handbook.schoolofdata.org
●
Hilary Mason
– http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

Resources mentioned
●
Harvard Business Review competition at
Kaggle
– Competition ends 8/27/2012 4:00 AM UTC!
– https://www.kaggle.com/c/harvard-business-review-vision-statement-prospect

●
Stanford Data Wrangler
– http://vis.stanford.edu/wrangler/


Thanks!

•
Questions now?
•
Questions later:
– Twitter: @tfmorris
– Email: tfmorris@gmail.com

