4. Provenance
●
Provenance is key, both before and after
you get data
●
Record source (e.g. download URL) and date
●
Unix command line
– build up a repeatable transformation pipeline script
– use make to avoid repeating steps
●
OpenRefine maintains an undo history (but...)
2013-03-28 Tom Morris @tfmorris 4
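A minimal sketch of the "record source and date" bullet, assuming you want a small JSON file saved next to each download. The function name `record_provenance`, the `.provenance.json` suffix, and the field names are illustrative choices, not something the slides prescribe. The timestamp is taken when the function runs, so call it right after downloading.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_provenance(data_path, source_url):
    """Write <data_path>.provenance.json next to the data file and return the record."""
    data_path = Path(data_path)
    record = {
        "file": data_path.name,
        "source_url": source_url,  # where the data came from
        # Assumption: this runs immediately after the download finishes.
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        # A checksum lets you confirm later that the source file is unchanged.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
    }
    sidecar = data_path.with_name(data_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```

Keeping the record in a separate file leaves the downloaded data byte-for-byte untouched.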
5. Irreversible Transforms
●
Be careful of anything which isn't reversible
●
Keep source files and plan recovery strategy
●
Common gotchas:
– Character encoding – once a character has
been replaced by the substitution character,
its original value can't be recovered
– Leading 0s on identifiers – lost when
identifiers are read as numbers
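Both gotchas can be reproduced in a few lines. This is a sketch with made-up sample values (a Latin-1 encoded word and a ZIP-code-like identifier); the helper names are hypothetical.

```python
def lossy_decode(raw):
    """Decode bytes as UTF-8, substituting U+FFFD for anything invalid."""
    return raw.decode("utf-8", errors="replace")


def id_roundtrip(identifier):
    """Parse an identifier as a number, then turn it back into text."""
    return str(int(identifier))


original = "caf\u00e9".encode("latin-1")  # the file was really Latin-1
text = lossy_decode(original)             # but it was read as UTF-8

# The accented letter is now U+FFFD; nothing in `text` says what it used to be.
print(repr(text))
# Re-encoding does not give back the original bytes.
print(text.encode("utf-8") == original)
# The leading zero is gone after a numeric round trip.
print(id_roundtrip("02134"))
```

Once either result is saved over the source file, the original values are gone, which is why the source files need to be kept.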
6. Provenance projects
●
Stanford Panda (Provenance and Data) -
http://infolab.stanford.edu/panda/
●
Open Provenance Model -
http://openprovenance.org/
●
Both focus on bi-directional traceability
8. Regular Expressions
●
Useful in so many contexts
●
A little confusing to learn, but
●
Absolutely worth the effort!
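Two small examples of the kind of cleanup regular expressions make easy. The patterns, function names, and sample strings are illustrative assumptions, not taken from the talk.

```python
import re

WHITESPACE = re.compile(r"\s+")
ZIP_CODE = re.compile(r"\b\d{5}\b")  # assumes plain 5-digit US ZIP codes


def normalize_whitespace(value):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return WHITESPACE.sub(" ", value).strip()


def find_zip(address):
    """Return the first 5-digit group in the address as text, or None."""
    match = ZIP_CODE.search(address)
    return match.group(0) if match else None


print(normalize_whitespace("  New   York \n"))
print(find_zip("Boston, MA 02134"))
```

The ZIP code is returned as a string, so a leading 0 is preserved.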
9. OpenRefine
●
Power tool for working with messy data
●
Free and open source
●
Desktop based (data stays private)
●
Faceted browsing interface
●
Lots of input & output formats
●
Powerful transformations
●
Useful for analysis & web scraping/APIs too
10. OpenRefine Data Formats
●
CSV/TSV/separator based
●
Fixed width field
●
JSON & XML
●
Excel & OpenOffice Calc
●
Google Spreadsheets & Fusion Tables
●
RDF
●
URLs & zip files too!
11. Data Characterization
●
Coded vs free-form fields
●
Distribution of values
– Missing values – skip, impute, ...
– Outliers – cause? Can they be rescaled?
●
Delimiters & escaping (e.g. HTML, XML)
●
Formatting problems
●
Character encoding issues?
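A rough sketch of characterizing one column: how many values are missing, how many distinct values there are, and which are most common. The set of missing-value markers and the function name are assumptions; adjust them to the data at hand.

```python
from collections import Counter

# Assumed markers for "missing"; real data sets use many others.
MISSING = {"", "na", "n/a", "null", "none"}


def characterize(values, top=5):
    """Summarize a column of strings: missing count and value distribution."""
    cleaned = [v.strip() for v in values]
    present = [v for v in cleaned if v.lower() not in MISSING]
    counts = Counter(present)
    return {
        "total": len(cleaned),
        "missing": len(cleaned) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(top),
    }


summary = characterize(["M", "F", "m", "", "N/A", "F"])
print(summary)
```

A handful of distinct values suggests a coded field; in this sample "M" and "m" are counted separately, which is the sort of formatting problem the summary is meant to surface.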
12. Hands-on
●
Let's play with some data!
●
http://code.google.com/p/google-refine/
13. Export
●
OpenRefine exports most import formats:
Excel, CSV, TSV, OpenOffice, Google
Spreadsheets, Fusion Tables, JSON, RDF
●
Template-based exporter for everything else:
custom JSON formats, etc.
14. Scaling Up
●
Experiment with a (representative) sample of
your data
●
Reuse regexes, filters, etc. with heavier-duty
tools – awk, sed, MapReduce
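One way to draw a sample from a file too large to load is reservoir sampling (Algorithm R), sketched here. A uniform random sample is only one notion of "representative" and is an assumption of this example; the function name and fixed seed are illustrative.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = row     # replace an entry with probability k/(i+1)
    return sample


rows = ["row-%d" % n for n in range(1000)]
picked = reservoir_sample(rows, 10)
print(len(picked))
```

Because `rows` can be any iterable, an open file object can be passed in directly and is read only once.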
15. Resources
●
Berkeley Data Science course
http://datascienc.es/schedule/
– week 2 - Data Preparation has good R examples
http://berkeleydatascience.files.wordpress.com/2012/02/2012
●
Mike Loukides "Data Hand Tools"
http://radar.oreilly.com/2011/04/data-hand-tools
●
Jeremy Howard "Getting in Shape for the Sport
of Data Science"
http://media.kaggle.com/MelbURN.html
16. More resources
●
MIT IAP Data Science course materials
– http://dataiap.github.com/dataiap/
●
Quora
– http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
– http://www.quora.com/What-are-some-good-methods-for-data-pre-processing-in-machine-learning
●
OKFN School of Data handbook
– http://handbook.schoolofdata.org
●
Hilary Mason
– http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
17. Resources mentioned
●
Harvard Business Review competition at
Kaggle
– Competition ends 8/27/2012 4:00 AM UTC!
– https://www.kaggle.com/c/harvard-business-review-vision-statement-prospect
●
Stanford Data Wrangler
– http://vis.stanford.edu/wrangler/