Vinod Gupta School of Management, IIT Kharagpur




                      Google Refine
                        Tutorial

                                April 8, 2012

                        Sathishwaran.R - 10BM60079
                         Vijaya Prabhu - 10BM60097



This Tutorial was created using Google Refine Version 2.5 on a Windows 7 platform
Data Cleansing
• Data cleansing means identifying wrong or inaccurate
  records in a data set and making appropriate
  corrections to them.
• It involves identifying incomplete, inaccurate, and
  incorrect parts of the data and then either replacing them
  with correct data or deleting them.
• Data cleansing results in data that is consistent with
  other standard data and is useful for performing
  various analyses.
• Errors in the data can be due to data entry mistakes
  by the user, failures during transmission of data, or
  improper data definitions.
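The two options named above (replace with correct data, or delete) can be sketched in a few lines of Python. This is only an illustration outside of Google Refine; the records, the field names, and the `clean` helper are hypothetical and not taken from the tutorial data.

```python
from typing import Dict, List, Optional

# Hypothetical records; field names are assumptions for illustration only.
records: List[Dict[str, str]] = [
    {"country": "India", "type": "Flood", "cost": "1200"},
    {"country": "", "type": "Storm", "cost": "300"},            # incomplete: no country
    {"country": "Japan", "type": "earthquake", "cost": "n/a"},  # inconsistent case, bad cost
]

def clean(record: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return a corrected copy of the record, or None if it should be deleted."""
    if not record["country"].strip():
        return None  # cannot be repaired here, so the record is deleted
    fixed = dict(record)
    fixed["type"] = fixed["type"].strip().title()  # replace with a consistent spelling
    if not fixed["cost"].isdigit():
        fixed["cost"] = ""  # blank out a value that is not a number
    return fixed

cleaned = [r for r in (clean(rec) for rec in records) if r is not None]
```

Note that deleting the second record loses its type and cost as well, which is the loss-of-information cost of deletion.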

Need for Data Cleansing
• Incorrect or inaccurate data may lead to false
  conclusions and, in finance, can cause investments
  to be misdirected.
• Governments also need accurate population and
  census data to direct funds to the deserving
  areas.
• Many organizations tap into customer
  information. If the data is not accurate, e.g. if
  an address is wrong, the business runs the
  risk of sending wrong information and thus
  losing customers.

Challenges in Data Cleansing
• Loss of Information: In many cases a record may be
  incomplete, so the whole record may have to be
  deleted, which leads to loss of information. This can
  become costly if a large amount of data is deleted.
• Maintenance of Data: Once the data is cleansed, any
  later change in the data specification should affect
  only the new values. Hence data management
  solutions should be designed so that the
  processes of data entry and retrieval are altered to
  provide correct data.
• Data cleansing is an iterative process which needs
  significant work in exploration and correction of entries.

About Google Refine
• Google Refine is a powerful tool that can be effectively
  used for data cleansing.
• It helps in working with raw data, cleaning it up,
  transforming it from one format to another, extending
  it with web services, and linking it to databases.
• It is very easy to use and has a web interface.
• It is freely available and works well with any browser.
• Google Refine is a desktop application that runs a
  small web server on your system; you point
  your browser at that server to use Refine.

Getting Started - Installation
1. Download the zip file (choose the appropriate Windows,
   Mac, or Linux version) from
   http://code.google.com/p/google-refine/wiki/Downloads?tm=2
2. Uncompress the zip file.
3. Run the “google-refine.exe” file (on Windows).
4. A command window opens and Google Refine
   starts, taking the user to the home page in the
   default browser.
Google Refine Homepage




Importing Data
• Google Refine supports TSV, CSV, Excel (.xls
  and .xlsx), JSON, XML, and Google data
  document formats.
• Once imported, the data is stored in Google Refine’s
  own data format.
• For this tutorial we have used TSV data on disasters
  worldwide from 1900-2008, available from
  http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008
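To make the TSV format concrete, here is a small Python sketch that parses tab-separated text into rows. It is not how Google Refine imports data internally; the column names and the two sample rows are assumptions standing in for the disasters file.

```python
import csv
import io
from typing import Dict, List

# Hypothetical sample standing in for the real TSV file (columns are assumed).
sample_tsv = "Country\tType\tCost\nIndia\tFlood\t1200\nJapan\tEarthquake\t\n"

def read_tsv(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated text into a list of rows keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

rows = read_tsv(sample_tsv)
```

For a file on disk, the same reader can be given `open(path, newline="")` instead of the in-memory string.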

Data Uploaded
Creating Project
Project Created
Faceting
• Faceting is about seeing the big picture and then
  filtering down to the rows you want to change
  in bulk.
• We can create a facet on a column to get
  details about that column and then
  filter to a subset of rows with a constraint.
• We can use text, numeric, timeline, and
  scatterplot facets. Various customized
  facets can also be designed.
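The idea behind a text facet can be sketched in Python: count the rows for each distinct value in a column, then filter to the rows matching one choice. This is a rough analogy, not Refine's implementation; the `Type` values below are made up for illustration.

```python
from collections import Counter

# Hypothetical rows; only the column being faceted is shown.
rows = [
    {"Type": "Flood"},
    {"Type": "flood"},
    {"Type": "Storm"},
    {"Type": "Flood"},
]

def text_facet(rows, column):
    """Count how many rows hold each distinct value in the column."""
    return Counter(row[column] for row in rows)

def filter_rows(rows, column, choice):
    """Keep only the rows whose column equals the selected facet choice."""
    return [row for row in rows if row[column] == choice]

facet = text_facet(rows, "Type")
selected = filter_rows(rows, "Type", "Flood")
```

Here "Flood" and "flood" are counted as separate choices, which is the kind of inconsistency a facet makes visible.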

The Type column has 18 unique options
Removing Redundancy




Even though they are of the same type, values show as
different options due to case
Reduced to 15 unique options
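A small Python sketch of why fixing case reduces the number of unique options. It assumes the fix is a title-case transform applied to every cell; the values are invented and the counts do not correspond to the 18 and 15 options in the tutorial data.

```python
# Hypothetical column values that differ only in letter case.
values = ["Flood", "flood", "FLOOD", "Storm", "storm", "Drought"]

def to_titlecase(value: str) -> str:
    """Trim surrounding spaces and capitalize the first letter of each word."""
    return value.strip().title()

unique_before = len(set(values))
unique_after = len({to_titlecase(v) for v in values})
```

With these values, `unique_before` is 6 and `unique_after` is 3.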
Numeric Faceting




Highly clustered towards low values
Cost column is blank and has no value
Calamities with low cost
Calamities with high cost
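What a numeric facet reports can be approximated in Python: a histogram of the numeric values plus a separate count of blank cells, with a range filter on top. This is a sketch under assumptions; the cost values, the bin width of 1000, and the threshold for "low cost" are all invented.

```python
from collections import Counter

# Hypothetical cost cells as text; empty strings are blank cells.
costs = ["100", "250", "", "90000", "120", "", "300"]

def numeric_facet(values, bin_width):
    """Bucket numeric values into fixed-width bins and count blanks separately."""
    bins = Counter()
    blanks = 0
    for raw in values:
        if raw.strip() == "":
            blanks += 1
            continue
        number = float(raw)
        bin_start = int(number // bin_width) * bin_width
        bins[bin_start] += 1
    return dict(bins), blanks

bins, blanks = numeric_facet(costs, 1000)

# Selecting a range on the facet corresponds to a filter such as this one.
low_cost = [c for c in costs if c and float(c) < 1000]
```

In this sample four of the five numeric values fall in the lowest bin, which mirrors the "highly clustered towards low values" observation.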
Clustering
• Clustering is used to merge choices which look similar.
Data Merged
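The slides do not say which clustering method was used. As an assumption, the sketch below uses a simplified key-collision approach in the spirit of a "fingerprint" key: values that reduce to the same normalized key are treated as candidates for merging. The sample values are hypothetical.

```python
import string
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Build a key: trim, lowercase, drop punctuation, sort the unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values sharing a key; keep groups with two or more spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical choices that look similar but are spelled differently.
values = ["Mass movement wet", "Mass Movement Wet", "wet mass movement.", "Flood"]
clusters = cluster(values)
```

Merging then means choosing one spelling per cluster and writing it back to every row in that cluster.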
Using Expressions
• Expressions are used to transform existing data to create new data.
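The expressions used in the screenshots are not recoverable from the text. As a rough Python analogy only, an expression can be thought of as a function evaluated once per row to fill a new column. The column names `Start`, `End`, and `DurationYears` are hypothetical.

```python
# Hypothetical rows; column names are assumptions for illustration.
rows = [
    {"Start": "1900", "End": "1902"},
    {"Start": "2004", "End": "2004"},
]

def add_column(rows, new_name, expression):
    """Create a new column by evaluating the expression on every row."""
    return [{**row, new_name: expression(row)} for row in rows]

# The expression derives new data from two existing columns.
with_duration = add_column(
    rows,
    "DurationYears",
    lambda row: int(row["End"]) - int(row["Start"]),
)
```

The original rows are left untouched, so the transformation can be checked before the old columns are dropped.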
Data Augmentation
• The Reconciliation option in Google Refine allows
  data to be linked to web pages. Suppose we
  want details on the country where the
  calamity struck; we can perform the
  following steps.
Reconciliation




Data Enrichment




Export




How to Use Twitter Data

Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Output
Friends Events using Facebook data




• After splitting the cell using separator },{




• After updating the other columns and rearranging them, we get the events.
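The effect of splitting a cell on the separator },{ can be sketched in Python. The cell content below is a hypothetical stand-in, since the actual Facebook response is not shown in the text, and the brace-restoring step is an assumption about how the fragments could be turned back into usable records.

```python
import json

def split_events(cell: str):
    """Split a cell on '},{' and restore the braces the separator consumed."""
    events = []
    for part in cell.split("},{"):
        if not part.startswith("{"):
            part = "{" + part
        if not part.endswith("}"):
            part = part + "}"
        events.append(json.loads(part))
    return events

# Hypothetical cell holding two events as adjacent JSON objects.
cell = '{"name":"Picnic","location":"Park"},{"name":"Concert","location":"Hall"}'
events = split_events(cell)
```

This naive split would break if the text },{ appeared inside a field value, so it only suits simple, flat records.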
Thank You




