2. What is OpenRefine?
• Data wrangling tool (ETL, data cleaning)
• Formerly Google Refine (and before that Freebase Gridworks), now an open-source, community-maintained project
• Widely used by data scientists to understand their data
• Designed to traverse messy data
3. Where is OpenRefine on the data science spectrum?
• Extract/scrape (API vs. no API)
– import.io
– most programming languages
• Clean/wrangle (clean vs. messy)
– > OpenRefine <
– Microsoft Excel
– most programming languages
– knowledge of RegEx especially useful here
• Analyze (inferential vs. predictive)
– Microsoft Excel
– R
– Stata
– SAS
– most programming languages
• Visualize
– Tableau
– Microsoft Excel
– R
– most programming languages
Goal: abstract everything away from programming except for the more complicated stuff
4. What is messy data?
• Messy data is not simply data that needs to be transformed to look the way you’d like
– Clean data: one operation successfully transforms a column (or a subset of a column)
– E.g. splitting a column by comma delimiter, adding a month column based on a date column, transposing rows, removing all values over 100
• Messy data is data for which no single rule easily transforms a column (i.e. so many operations on a single column are required that it has to be done manually); e.g.
– You have ‘company_name’ but there are typos for the same entities
– You have a ‘price’ column but some values are encoded as 999 (for N/A), others are the strings “N/A” or “NA”, some are blank, and some are written as strings (‘40$’)
• When does this ever happen?
– ONLY when the data is input by a human!
– If the data is input by a program (no matter how thoughtlessly organized it may be), it will at least be consistent
• What does clean vs. messy data look like?
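To make the clean vs. messy distinction concrete, here is a minimal Python sketch (not part of the original slides). The clean case is the comma split mentioned above, where one rule handles every row. The messy case is the ‘price’ column, where each human encoding needs its own rule. The function names, the set of "missing" spellings, and the sample values are assumptions chosen for illustration.

```python
import re

# Clean case: a single rule handles every row.
def split_name(value):
    """Split 'Smith, John' into ('Smith', 'John') on the comma delimiter."""
    last, first = (part.strip() for part in value.split(",", 1))
    return last, first

# Messy case: every way a human encoded a value needs its own rule.
MISSING_TOKENS = {"", "n/a", "na"}  # assumption: string spellings of "missing"
MISSING_SENTINEL = 999              # assumption: 999 stands for N/A, as on the slide

def parse_price(raw):
    """Return a float price, or None when the cell encodes a missing value."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.lower() in MISSING_TOKENS:
        return None
    # Drop currency symbols and thousands separators: '40$' -> '40'
    digits = re.sub(r"[^0-9.\-]", "", text)
    try:
        value = float(digits)
    except ValueError:
        return None  # nothing numeric left in the cell
    return None if value == MISSING_SENTINEL else value

# Hypothetical sample column mixing all the encodings listed above.
prices = [12.5, 999, "N/A", "NA", "", "40$", " 7 "]
cleaned = [parse_price(p) for p in prices]
```

Note that treating 999 as missing is itself a judgment call: a genuine price of 999 would be discarded, which is exactly the kind of ambiguity that makes the column messy.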
7. OpenRefine is one tool among many
• No one tool is good at everything (“do one thing and do it well”)
• What should you not use OpenRefine for?
– Very large datasets (over 80 columns and 100,000 rows)
• Use SAS/Stata/some programming language on a server
– Clean data
• Probably easier to use Excel (you can do a lot with filtering, splitting to
columns, VLOOKUP joins, removing duplicates, sorting) as it is a more
familiar environment
8. OpenRefine demo: what we will discuss
• Installation
• Creating a project
• Introduce the layout
• Demo main functionalities (default facets, custom facets, clustering)
• Use cases
9. OpenRefine demo: what you will learn
• How does OpenRefine add value relative to other software?
– String clustering
– Quick view of columns with inconsistent types
• Can be done in Excel/SAS/Stata but it takes a few steps to look for outliers
or blanks or strings whereas OpenRefine automatically groups/counts these
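The two ideas above can be sketched in plain Python. This is an illustrative approximation, not OpenRefine's actual implementation: `fingerprint` follows the general idea of a key-collision method (normalize case, punctuation, accents, and word order, then group values whose keys collide), and `type_facet` mimics the grouped counts a facet shows for blanks, numbers, and text. All names and sample values are assumptions.

```python
import string
import unicodedata
from collections import Counter, defaultdict

def fingerprint(value):
    """Build a key that ignores case, punctuation, accents, and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)

def cluster(values):
    """Group distinct spellings whose fingerprints collide."""
    groups = defaultdict(Counter)
    for v in values:
        groups[fingerprint(v)][v] += 1
    # Keep only keys that hide more than one distinct spelling.
    return {key: dict(c) for key, c in groups.items() if len(c) > 1}

def type_facet(values):
    """Count cells by rough type: blank, number, or text."""
    def kind(v):
        s = str(v).strip()
        if s == "":
            return "blank"
        try:
            float(s)
            return "number"
        except ValueError:
            return "text"
    return Counter(kind(v) for v in values)

# Hypothetical sample data.
names = ["Acme Inc.", "acme inc", "Inc Acme", "ACME, Inc.", "Globex"]
clusters = cluster(names)
facet = type_facet([10, "12", "", "N/A", " ", "40$"])
```

A key-collision approach like this catches differences in case, punctuation, and word order, but not true misspellings such as "Amce"; those need a distance-based (nearest-neighbor) comparison instead.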
• Transformations with clean data
– Converting weekly data into monthly (collapsing)
– Splitting multiple values in a column (e.g. name “Smith, John” by ‘,’)
– Transposing rows into columns (e.g. turning a column with “Male” or “Female” values into 1/0 columns for a regression model)
– Renaming/reordering columns
• Working with messy data
– Crosswalking inconsistent names
– Identifying numeric typos (e.g. outliers), so you can quickly visualize a column
– “Wrangling” inconsistent data
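As a point of comparison for the clean-data transformations listed above, here is a minimal Python sketch of two of them: collapsing weekly values into monthly totals, and turning a “Male”/“Female” column into 1/0 columns. The function names and sample data are hypothetical, and the sketch assumes each week is assigned to the month of its start date, which is only one of several reasonable conventions for weeks that straddle two months.

```python
from collections import defaultdict
from datetime import date

def collapse_weekly_to_monthly(rows):
    """Sum (week_start_date, value) pairs into {(year, month): total}."""
    totals = defaultdict(float)
    for week_start, value in rows:
        # Assumption: a week belongs to the month in which it starts.
        totals[(week_start.year, week_start.month)] += value
    return dict(totals)

def to_dummies(values):
    """Turn a categorical column into one 1/0 column per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical sample data.
weekly = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 8), 5.0),
    (date(2024, 1, 29), 2.5),
    (date(2024, 2, 5), 4.0),
]
monthly = collapse_weekly_to_monthly(weekly)
dummies = to_dummies(["Male", "Female", "Female"])
```

Each of these is a single rule applied uniformly to a column, which is why they count as clean-data transformations rather than messy-data work.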