SlideShare a Scribd company logo
1 of 10
OpenRefine
Alex Petralia
What is OpenRefine?
• Data wrangling tool (ETL, data cleaning)
• Originally developed by Google, now open-sourced
• Widely used by data scientists to understand their data
• Designed to traverse messy data
Extract/scrape
(API vs. no API)
• import.io
• most programming
languages
Clean/wrangle
(clean vs. messy)
• > OpenRefine <
• Microsoft Excel
• most programming
languages
• knowledge of
RegEx especially
useful here
Analyze
(inferential vs.
predictive)
• Microsoft Excel
• R
• Stata
• SAS
• most programming
languages
Visualize
• Tableau
• Microsoft Excel
• R
• most programming
languages
Where is OpenRefine on the data science spectrum?
Goal: abstract everything away from programming
except for the more complicated stuff
What is messy data?
• Messy data is not data that simply needs to be transformed to get it how
you’d like to see it
– Clean data: one operation successfully transforms a column (or a subset of a column)
– Eg. splitting a column by comma delimiter, adding a month column based on date
column, transposing rows, removing all values over 100
• Messy data is when there is no single rule that easily transforms a column
(ie. so many operations on a single column are required it has to be done
manually); eg.
– You have ‘company_name’ but there are typos for the same entities
– You have a ‘price’ column but some are encoded 999 (for N/A) + others are strings “N/A”
or “NA” + some blanks are in there + some are written as strings (‘40$’)
• When does this ever happen?
– ONLY when the data is input by a human!
– If the data is input by a program (no matter thoughtlessly organized it may be), it will at
least be consistent
• What does clean vs. messy data look like?
OpenRefine is one tool among many
• No one tool is good at everything (“do one thing and do it well”)
• What should you not use OpenRefine for?
– Very large datasets (over 80 columns and 100,000 rows)
• Use SAS/Stata/some programming language on a server
– Clean data
• Probably easier to use Excel (you can do a lot with filtering, splitting to
columns, VLOOKUP joins, removing duplicates, sorting) as it is a more
familiar environment
OpenRefine demo: what we will discuss
• Installation
• Creating a project
• Introduce the layout
• Demo main functionalities (default facets, custom facets, clustering)
• Use cases
OpenRefine demo: what you will learn
• How does OpenRefine add value relative to other software?
– String clustering
– Quick view of columns with inconsistent types
• Can be done in Excel/SAS/Stata but it takes a few steps to look for outliers
or blanks or strings whereas OpenRefine automatically groups/counts these
DEMO

More Related Content

What's hot

5 Secret Weapons Of A Great Salesforce Architect
5 Secret Weapons Of A Great Salesforce Architect5 Secret Weapons Of A Great Salesforce Architect
5 Secret Weapons Of A Great Salesforce ArchitectSebastian Wagner
 
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...Denodo
 
Solr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologySolr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologyLucidworks
 
How to Analyze and Tune MySQL Queries for Better Performance
How to Analyze and Tune MySQL Queries for Better PerformanceHow to Analyze and Tune MySQL Queries for Better Performance
How to Analyze and Tune MySQL Queries for Better Performanceoysteing
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j InternalsTobias Lindaaker
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databasesArangoDB Database
 
Workshop Tel Aviv - Graph Data Science
Workshop Tel Aviv - Graph Data ScienceWorkshop Tel Aviv - Graph Data Science
Workshop Tel Aviv - Graph Data ScienceNeo4j
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022Jim Dowling
 
Salesforce complete overview
Salesforce complete overviewSalesforce complete overview
Salesforce complete overviewNitesh Mishra ☁
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar itemsViet-Trung TRAN
 
Building Data Pipelines in Python
Building Data Pipelines in PythonBuilding Data Pipelines in Python
Building Data Pipelines in PythonC4Media
 
DBT Kit - Support Group in a Box for DBT (Dialectical Behavior Therapy)
DBT Kit - Support Group in a Box for DBT (Dialectical Behavior Therapy)DBT Kit - Support Group in a Box for DBT (Dialectical Behavior Therapy)
DBT Kit - Support Group in a Box for DBT (Dialectical Behavior Therapy)Mo Goltz
 
SQL Server Reporting Services
SQL Server Reporting ServicesSQL Server Reporting Services
SQL Server Reporting ServicesAhmed Elbaz
 
My Laws of Test Driven Development (2023)
My Laws of Test Driven Development (2023)My Laws of Test Driven Development (2023)
My Laws of Test Driven Development (2023)Dennis Doomen
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introductionAlexey Grigorev
 
Ibm db2 10.5 for linux, unix, and windows data movement utilities guide and...
Ibm db2 10.5 for linux, unix, and windows   data movement utilities guide and...Ibm db2 10.5 for linux, unix, and windows   data movement utilities guide and...
Ibm db2 10.5 for linux, unix, and windows data movement utilities guide and...bupbechanhgmail
 

What's hot (20)

Fuzzy Regression&Bulanık Regresyon
Fuzzy Regression&Bulanık RegresyonFuzzy Regression&Bulanık Regresyon
Fuzzy Regression&Bulanık Regresyon
 
5 Secret Weapons Of A Great Salesforce Architect
5 Secret Weapons Of A Great Salesforce Architect5 Secret Weapons Of A Great Salesforce Architect
5 Secret Weapons Of A Great Salesforce Architect
 
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
 
Solr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologySolr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW Technology
 
How to Analyze and Tune MySQL Queries for Better Performance
How to Analyze and Tune MySQL Queries for Better PerformanceHow to Analyze and Tune MySQL Queries for Better Performance
How to Analyze and Tune MySQL Queries for Better Performance
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j Internals
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
 
Workshop Tel Aviv - Graph Data Science
Workshop Tel Aviv - Graph Data ScienceWorkshop Tel Aviv - Graph Data Science
Workshop Tel Aviv - Graph Data Science
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
 
Salesforce complete overview
Salesforce complete overviewSalesforce complete overview
Salesforce complete overview
 
Solr formation Sparna
Solr formation SparnaSolr formation Sparna
Solr formation Sparna
 
Graph databases
Graph databasesGraph databases
Graph databases
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
 
Building Data Pipelines in Python
Building Data Pipelines in PythonBuilding Data Pipelines in Python
Building Data Pipelines in Python
 
DBT Kit - Support Group in a Box for DBT (Dialectical Behavior Therapy)
DBT Kit - Support Group in a Box for DBT (Dialectical Behavior Therapy)DBT Kit - Support Group in a Box for DBT (Dialectical Behavior Therapy)
DBT Kit - Support Group in a Box for DBT (Dialectical Behavior Therapy)
 
SQL Server Reporting Services
SQL Server Reporting ServicesSQL Server Reporting Services
SQL Server Reporting Services
 
My Laws of Test Driven Development (2023)
My Laws of Test Driven Development (2023)My Laws of Test Driven Development (2023)
My Laws of Test Driven Development (2023)
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
Ibm db2 10.5 for linux, unix, and windows data movement utilities guide and...
Ibm db2 10.5 for linux, unix, and windows   data movement utilities guide and...Ibm db2 10.5 for linux, unix, and windows   data movement utilities guide and...
Ibm db2 10.5 for linux, unix, and windows data movement utilities guide and...
 

Viewers also liked

TXDHC OpenRefine Training
TXDHC OpenRefine TrainingTXDHC OpenRefine Training
TXDHC OpenRefine TrainingLiz Grumbach
 
Introduction to OpenRefine
Introduction to OpenRefineIntroduction to OpenRefine
Introduction to OpenRefineHeather Myers
 
A Quick Tour of OpenRefine
A Quick Tour of OpenRefineA Quick Tour of OpenRefine
A Quick Tour of OpenRefineTony Hirst
 
OpenRefine Class Tutorial
OpenRefine Class TutorialOpenRefine Class Tutorial
OpenRefine Class TutorialAshwin Dinoriya
 
Google refine tutotial
Google refine tutotialGoogle refine tutotial
Google refine tutotialVijaya Prabhu
 
Open Data & Data Visualization: dalle licenze ai grafici | Bologna, 16 giugno...
Open Data & Data Visualization: dalle licenze ai grafici | Bologna, 16 giugno...Open Data & Data Visualization: dalle licenze ai grafici | Bologna, 16 giugno...
Open Data & Data Visualization: dalle licenze ai grafici | Bologna, 16 giugno...Dataninja
 
Using entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APIUsing entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APISpazioDati
 
Web semántica y linked data la web como bd
Web semántica y linked data  la web como bdWeb semántica y linked data  la web como bd
Web semántica y linked data la web como bdAlvaro Graves
 
Employing Google Refine to publish Linked Data
Employing Google Refine to publish Linked DataEmploying Google Refine to publish Linked Data
Employing Google Refine to publish Linked DataFadi Maali
 
Data and Donuts: Data cleaning with OpenRefine
Data and Donuts: Data cleaning with OpenRefineData and Donuts: Data cleaning with OpenRefine
Data and Donuts: Data cleaning with OpenRefineC. Tobin Magle
 
Google refine tutotial
Google refine tutotialGoogle refine tutotial
Google refine tutotialVijaya Prabhu
 
7th_LinkedData(20131008)
7th_LinkedData(20131008)7th_LinkedData(20131008)
7th_LinkedData(20131008)真 岡本
 
OpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for LibrariansOpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for Librarianstfmorris
 

Viewers also liked (18)

TXDHC OpenRefine Training
TXDHC OpenRefine TrainingTXDHC OpenRefine Training
TXDHC OpenRefine Training
 
Introduction to OpenRefine
Introduction to OpenRefineIntroduction to OpenRefine
Introduction to OpenRefine
 
A Quick Tour of OpenRefine
A Quick Tour of OpenRefineA Quick Tour of OpenRefine
A Quick Tour of OpenRefine
 
OpenRefine Class Tutorial
OpenRefine Class TutorialOpenRefine Class Tutorial
OpenRefine Class Tutorial
 
Google refine tutotial
Google refine tutotialGoogle refine tutotial
Google refine tutotial
 
Open Data & Data Visualization: dalle licenze ai grafici | Bologna, 16 giugno...
Open Data & Data Visualization: dalle licenze ai grafici | Bologna, 16 giugno...Open Data & Data Visualization: dalle licenze ai grafici | Bologna, 16 giugno...
Open Data & Data Visualization: dalle licenze ai grafici | Bologna, 16 giugno...
 
Using entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APIUsing entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion API
 
Web semántica y linked data la web como bd
Web semántica y linked data  la web como bdWeb semántica y linked data  la web como bd
Web semántica y linked data la web como bd
 
SPARQL
SPARQLSPARQL
SPARQL
 
Employing Google Refine to publish Linked Data
Employing Google Refine to publish Linked DataEmploying Google Refine to publish Linked Data
Employing Google Refine to publish Linked Data
 
Data and Donuts: Data cleaning with OpenRefine
Data and Donuts: Data cleaning with OpenRefineData and Donuts: Data cleaning with OpenRefine
Data and Donuts: Data cleaning with OpenRefine
 
Linked Data (再)入門
Linked Data (再)入門Linked Data (再)入門
Linked Data (再)入門
 
Google refine tutotial
Google refine tutotialGoogle refine tutotial
Google refine tutotial
 
7th_LinkedData(20131008)
7th_LinkedData(20131008)7th_LinkedData(20131008)
7th_LinkedData(20131008)
 
SPARQL Tutorial
SPARQL TutorialSPARQL Tutorial
SPARQL Tutorial
 
RDF Refineの使い方
RDF Refineの使い方RDF Refineの使い方
RDF Refineの使い方
 
第7回 Linked Data 勉強会 @yayamamo
第7回 Linked Data 勉強会 @yayamamo第7回 Linked Data 勉強会 @yayamamo
第7回 Linked Data 勉強会 @yayamamo
 
OpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for LibrariansOpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for Librarians
 

Similar to OpenRefine Tutorial

Expressing and sharing workflows
Expressing and sharing workflowsExpressing and sharing workflows
Expressing and sharing workflowsDaniel S. Katz
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about SparkGiivee The
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Data Pipelines with Python - NWA TechFest 2017
Data Pipelines with Python - NWA TechFest 2017Data Pipelines with Python - NWA TechFest 2017
Data Pipelines with Python - NWA TechFest 2017Casey Kinsey
 
Algorithms and Data Structures
Algorithms and Data StructuresAlgorithms and Data Structures
Algorithms and Data Structuressonykhan3
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenDatabricks
 
Graph databases for SQL Server profesionnals
Graph databases for SQL Server profesionnalsGraph databases for SQL Server profesionnals
Graph databases for SQL Server profesionnalsMSDEVMTL
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyRobert Viseur
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 

Similar to OpenRefine Tutorial (20)

Expressing and sharing workflows
Expressing and sharing workflowsExpressing and sharing workflows
Expressing and sharing workflows
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Data Pipelines with Python - NWA TechFest 2017
Data Pipelines with Python - NWA TechFest 2017Data Pipelines with Python - NWA TechFest 2017
Data Pipelines with Python - NWA TechFest 2017
 
Algorithms and Data Structures
Algorithms and Data StructuresAlgorithms and Data Structures
Algorithms and Data Structures
 
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
محاضرة برنامج التحليل الكمي   R program د.هديل القفيديمحاضرة برنامج التحليل الكمي   R program د.هديل القفيدي
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
Graph databases for SQL Server profesionnals
Graph databases for SQL Server profesionnalsGraph databases for SQL Server profesionnals
Graph databases for SQL Server profesionnals
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
R training
R trainingR training
R training
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
DataHub
DataHubDataHub
DataHub
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 

Recently uploaded

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Recently uploaded (20)

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

OpenRefine Tutorial

  • 2. What is OpenRefine? • Data wrangling tool (ETL, data cleaning) • Originally developed by Google, now open-sourced • Widely used by data scientists to understand their data • Designed to traverse messy data
  • 3. Extract/scrape (API vs. no API) • import.io • most programming languages Clean/wrangle (clean vs. messy) • > OpenRefine < • Microsoft Excel • most programming languages • knowledge of RegEx especially useful here Analyze (inferential vs. predictive) • Microsoft Excel • R • Stata • SAS • most programming languages Visualize • Tableau • Microsoft Excel • R • most programming languages Where is OpenRefine on the data science spectrum? Goal: abstract everything away from programming except for the more complicated stuff
  • 4. What is messy data? • Messy data is not data that simply needs to be transformed to get it how you’d like to see it – Clean data: one operation successfully transforms a column (or a subset of a column) – Eg. splitting a column by comma delimiter, adding a month column based on date column, transposing rows, removing all values over 100 • Messy data is when there is no single rule that easily transforms a column (ie. so many operations on a single column are required it has to be done manually); eg. – You have ‘company_name’ but there are typos for the same entities – You have a ‘price’ column but some are encoded 999 (for N/A) + others are strings “N/A” or “NA” + some blanks are in there + some are written as strings (‘40$’) • When does this ever happen? – ONLY when the data is input by a human! – If the data is input by a program (no matter thoughtlessly organized it may be), it will at least be consistent • What does clean vs. messy data look like?
  • 5.
  • 6.
  • 7. OpenRefine is one tool among many • No one tool is good at everything (“do one thing and do it well”) • What should you not use OpenRefine for? – Very large datasets (over 80 columns and 100,000 rows) • Use SAS/Stata/some programming language on a server – Clean data • Probably easier to use Excel (you can do a lot with filtering, splitting to columns, VLOOKUP joins, removing duplicates, sorting) as it is a more familiar environment
  • 8. OpenRefine demo: what we will discuss • Installation • Creating a project • Introduce the layout • Demo main functionalities (default facets, custom facets, clustering) • Use cases
  • 9. OpenRefine demo: what you will learn • How does OpenRefine add value relative to other software? – String clustering – Quick view of columns with inconsistent types • Can be done in Excel/SAS/Stata but it takes a few steps to look for outliers or blanks or strings whereas OpenRefine automatically groups/counts these
  • 10. DEMO

Editor's Notes

  1. Transformations with clean data Converting weekly data into monthly (collapsing) Splitting multiple values in a column (eg. name “Smith, John” by ‘,’) Transposing rows in to columns (eg. column with “Male” or “Female” as values and turn them into columns with 1/0s for regression model) Renaming/reordering columns
  2. Working with messy data Crosswalking inconsistent names Identifying numeric typos (eg. outliers), so you can quickly visualize a column “Wrangling” inconsistent data