Data mining and data linking
 
Getting data from papers (beyond the PDF) http://dx.doi.org/10.1016/j.ympev.2009.07.011
Extracting tables
Tables from paper as comma-separated values (CSV)
Taxon and institutional voucher,Locality ID,Collection locality,Geographic coordinates/approximate location,Elevation (m),GenBank accession number 12S,16S,COI,c-myc
1. UTA A-52449,1,"Puntarenas, CR","(10°18′N, 84°48′W)",1520,EF562312,EF562365,None,EF562417
2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430
3. FMNH 257669,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562320,EF562372,EF562380,EF562432
4. FMNH 257670,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562317,EF562336,EF562376,EF562421
5. FMNH 257671,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562314,EF562374,EF562409,None
6. FMNH 257672,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562318,None,EF562382,None
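A minimal sketch of reading such a table with Python's standard csv module. The two data rows are copied from the slide; the helper name read_vouchers and the choice to treat the literal string "None" as a missing accession are assumptions made for illustration.

```python
import csv
import io

# Header and two rows as they appear on the slide.
SAMPLE = '''Taxon and institutional voucher,Locality ID,Collection locality,Geographic coordinates/approximate location,Elevation (m),GenBank accession number 12S,16S,COI,c-myc
1. UTA A-52449,1,"Puntarenas, CR","(10°18′N, 84°48′W)",1520,EF562312,EF562365,None,EF562417
2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430
'''

# The four columns that hold GenBank accession numbers.
GENE_COLUMNS = ["GenBank accession number 12S", "16S", "COI", "c-myc"]


def read_vouchers(text):
    """Parse the CSV text into a list of dicts, one per specimen (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: the string "None" in the table means "no sequence".
        for column in GENE_COLUMNS:
            if row[column] == "None":
                row[column] = None
        row["Elevation (m)"] = int(row["Elevation (m)"])
        rows.append(row)
    return rows


vouchers = read_vouchers(SAMPLE)
for voucher in vouchers:
    print(voucher["Collection locality"], voucher["Elevation (m)"], voucher["COI"])
```

The quoted fields matter here: the csv module keeps "Puntarenas, CR" and the coordinate pair as single values even though they contain commas, which a naive split on "," would not.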
Cleaning data
(10°18′N, 84°42′W)
We can read this, but a computer would prefer just numbers.
2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430
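One way to get "just numbers" is to convert degrees and minutes to signed decimal degrees. The sketch below is illustrative only: the function names are hypothetical, and it assumes each string holds exactly one latitude followed by one longitude, with no seconds component.

```python
import re

# One coordinate such as 10°18′N. The minute mark may be a prime (′),
# a curly apostrophe (’) or a straight apostrophe (').
DM_PATTERN = re.compile(r"(\d+)°(\d+)[′’']([NSEW])")


def dm_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes to decimal degrees; south and west are negative."""
    value = int(degrees) + int(minutes) / 60
    return -value if hemisphere in "SW" else value


def parse_lat_lon(text):
    """Return (latitude, longitude) from a string like '(10°18′N, 84°42′W)'."""
    parts = DM_PATTERN.findall(text)
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates in {text!r}")
    # Assumption: latitude comes first, longitude second.
    lat, lon = (dm_to_decimal(*part) for part in parts)
    return round(lat, 4), round(lon, 4)


print(parse_lat_lon("(10°18′N, 84°42′W)"))
```

For the row above this gives roughly 10.3 degrees north and 84.7 degrees west, expressed as a positive latitude and a negative longitude.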
Tools for cleaning data
Achatina fulica (giant African snail)
Reconciliation services
Names reconciled using uBio and Google Refine
What can we do with data mining?
Extract information on ecological relationships
 
Text mining
Morphological and molecular description of Haematoloechus meridionalis n. sp. (Digenea: Plagiorchioidea: Haematoloechidae) from Rana vaillanti brocchi of Guanacaste, Costa Rica
Halipegus eschi n. sp. (Digenea: Hemiuridae) in Rana vaillanti from Guanacaste Province, Costa Rica
Haematoloechus danbrooksi n. sp. (Digenea: Plagiorchioidea) from Rana vaillanti from Los Tuxtlas, Veracruz, Mexico
<parasite name> (n. sp.)  from  <host name>
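The template above can be sketched as a regular expression. This is an illustration under stated assumptions, not the method used in the talk: names are assumed to be plain binomials (capitalised genus, lower-case species), an optional parenthesised classification may follow "n. sp.", and "in" is accepted alongside "from" because one of the titles uses it. Subspecies such as "brocchi" are not captured.

```python
import re

# <parasite name> n. sp. [(classification)] from|in <host name>
ASSOCIATION = re.compile(
    r"(?P<parasite>[A-Z][a-z]+\s+[a-z]+)\s+n\.\s*sp\.\s+"  # new species name
    r"(?:\([^)]*\)\s+)?"                                    # optional (Digenea: ...)
    r"(?:from|in)\s+"
    r"(?P<host>[A-Z][a-z]+\s+[a-z]+)"                       # host binomial
)

TITLES = [
    "Morphological and molecular description of Haematoloechus meridionalis n. sp. "
    "(Digenea: Plagiorchioidea: Haematoloechidae) from Rana vaillanti brocchi of "
    "Guanacaste, Costa Rica",
    "Halipegus eschi n. sp. (Digenea: Hemiuridae) in Rana vaillanti from "
    "Guanacaste Province, Costa Rica",
    "Haematoloechus danbrooksi n. sp. (Digenea: Plagiorchioidea) from Rana vaillanti "
    "from Los Tuxtlas, Veracruz, Mexico",
]


def host_parasite_pairs(titles):
    """Return (parasite, host) tuples for every title that fits the template."""
    pairs = []
    for title in titles:
        match = ASSOCIATION.search(title)
        if match:
            # Collapse any repeated whitespace inside the names.
            parasite = " ".join(match.group("parasite").split())
            host = " ".join(match.group("host").split())
            pairs.append((parasite, host))
    return pairs


pairs = host_parasite_pairs(TITLES)
for parasite, host in pairs:
    print(f"{parasite} -> {host}")
```

Titles that phrase the association differently are silently skipped, so a pattern like this finds only a subset of the host-parasite records in the literature.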
Sources of host-parasite associations
What do crustaceans live on? Categories shown: green plants, bacteria, fungi, vertebrates, arthropods
What do insects live on? Categories shown: green plants, bacteria, fungi, vertebrates, arthropods
Host names in GenBank
Extracting links between data sets
http://iphylo.org/~rpage/challenge
 
Citation links
Are there other kinds of links?
data linking
Extracting these links
Regular expressions to the rescue!
Regular expressions
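As a rough sketch of how regular expressions can pull identifiers out of running text, the example below looks for two kinds of link targets that appear in this talk: GenBank accession numbers and DOIs. The patterns are simplifications chosen for illustration. The accession pattern covers only the classic nucleotide shapes (one letter plus five digits, or two letters plus six digits), and the sample sentence is made up around identifiers taken from the slides.

```python
import re

# Classic GenBank nucleotide accessions: U12345 or EF562312.
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

# A DOI starts with "10.", a registrant code, a slash, then a suffix.
DOI = re.compile(r"\b10\.\d{4,9}/\S+")


def extract_links(text):
    """Return the accession numbers and DOIs found in a piece of text."""
    accessions = ACCESSION.findall(text)
    # Trim punctuation that belongs to the sentence rather than the DOI.
    dois = [doi.rstrip(".,;)") for doi in DOI.findall(text)]
    return {"accessions": accessions, "dois": dois}


# Hypothetical sentence built from identifiers that appear in the slides.
sample = (
    "Sequences EF562312 and AJ971044 are discussed in "
    "doi:10.1016/j.ympev.2009.07.011."
)
links = extract_links(sample)
print(links)
```

A pattern this loose will also match any other token of the same shape, such as a catalogue or grant number, so matches are best checked against the target database before they are treated as links.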
demo
Perils of data mining (matching the wrong things)
Taxa found in one paper
Image search on taxonomic name
Electra pilosa
Carmen Electra versus Electra (guess which one is more popular?)
But what about this?
Homo sapiens
AJ711044
should be AJ971044
Error in paper led to wrong image. How do I fix this error in the paper?
Is there a better way to make these links? (what if they were made for us?)
Digital Object Identifier (DOI)
 
Identifies a publication
Globally unique
10.1016/j.ympev.2006.04.006
Paper
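A small sketch of how a DOI maps to its paper: the identifier splits into a registrant prefix and a publisher-assigned suffix, and appending it to a resolver address gives a link that redirects to the publication. The function names are hypothetical; https://doi.org/ is the current resolver, while the older http://dx.doi.org/ form is the one used earlier in these slides.

```python
from urllib.parse import quote

# Prefixes people commonly put in front of a bare DOI.
KNOWN_PREFIXES = ("doi:", "http://dx.doi.org/", "https://doi.org/")


def normalise_doi(doi):
    """Strip any resolver or 'doi:' prefix and check the basic shape."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    return doi


def split_doi(doi):
    """Return (prefix, suffix), e.g. ('10.1016', 'j.ympev.2006.04.006')."""
    prefix, suffix = normalise_doi(doi).split("/", 1)
    return prefix, suffix


def doi_to_url(doi, resolver="https://doi.org/"):
    """Build a resolver URL; characters unsafe in a URL are percent-encoded."""
    return resolver + quote(normalise_doi(doi), safe="/")


print(split_doi("10.1016/j.ympev.2006.04.006"))
print(doi_to_url("10.1016/j.ympev.2006.04.006"))
```

Because the resolver, not the publisher's web address, is what gets cited, the link can keep working when the paper moves to a new location.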
Why have DOIs?
Link rot
Refs
 
 
Cites 2006 2006
Forward Cites 2006 2009
Shoulders of giants
progress is incremental
reuse past results
Forward Cites 2006 2008
 
Species Genes
data linking
Data citation
 
Linked data
 
What does the future hold?


Editor's Notes

  1. Publication is a closed object
  2. Publication is a closed object
  3. All the same thing
  4. Citation, user sees bibliography and may be able to follow links
  5. ~/Desktop/GrandChallenge/Data/DVD/LAB0370A/10557903/00420003/06003691/main.xml
  6. Citation, user sees bibliography and may be able to follow links
  7. Data citation with PageRank scores