Content Mining:
Technology and Policy Developments
@jenny_molloy World Health Organisation – 9 April 2015
What is content?
What is mining?
1982
“Automatically generating logical representations of
text passages... by means of an analysis of the
coherence structure of the passages.”
Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th
conference on Computational linguistics - Volume 1 (COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, Czechoslovakia, 127-132.
DOI=10.3115/991813.991833 http://dx.doi.org/10.3115/991813.991833
2008
“The use of automated methods for exploiting
the enormous amount of knowledge available in
the biomedical literature.”
Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational
Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946.
Legal Considerations
Copyright
Database
rights
Contract
Law
2011
2014
From 2014
UK Law
Workshops, hackdays, presentations, collaborations,
discussions with librarians and publishers.
Putting new rights into action.
In Europe
2013
Shortly after
2013 2015
Research commissioned through H2020... any EU Directive >5 years away.
Ireland already considering following UK - plus other member states?
OUR MISSION
“make 100,000,000 facts
from the scholarly literature
open, accessible and reusable”
SOFTWARE OVERVIEW
quickscrape & thresher	

norma	

AMI fact extraction
THE SCALE OF THE TASK
• ~ 27,000 peer reviewed journals*	

• > 5,000 publishers	

• ~ 3,000 new papers per day
*Ulrich’s database: http://ulrichsweb.serialssolutions.com/login
STRUCTURED INFORMATION
• chemical names and structures	

• species	

• metabolism	

• phylogenetic trees
SOFTWARE PIPELINE
PRODUCT: journals (ISSNs) → fulltext URLs → metadata + content + files → facts
PROCESS: crawl → scrape → extract
CRAWLING
The latest journal
tables of contents
at Journal TOCs
http://www.journaltocs.hw.ac.uk/
SCRAPERS
• all have the same plumbing	

• scraping software (thresher) handles the plumbing	

• scraperJSON is a config file	

• supports large collections of scrapers	

• no programming required	

• not limited to one piece of software
BASIC SCRAPER JSON
name of the scraper:	

the URL(s) it applies to:	

the elements to capture:	

element name:	

where to find it:
{
  "name": "PLOS",
  "url": "plosw*.org",
  "elements": {
    "title": {
      "selector": "//h1[@property='dc:title']"
    }
  }
}
http://github.com/ContentMine/scraperJSON
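To make the idea of a declarative scraper concrete, here is a minimal Python sketch of how a definition like the one above could be applied to a page. It is not thresher's implementation. It assumes that the "url" field is treated as a regular expression, it uses a made-up XHTML page and article URL, and it relies on the limited XPath subset in Python's standard library, which only accepts relative paths.

```python
import json
import re
import xml.etree.ElementTree as ET

# The scraper definition from the slide, loaded as ordinary JSON.
SCRAPER = json.loads("""
{
  "name": "PLOS",
  "url": "plosw*.org",
  "elements": {
    "title": {"selector": "//h1[@property='dc:title']"}
  }
}
""")


def applies_to(scraper, url):
    # Assumption: the "url" field is a regular expression matched against the page URL.
    return re.search(scraper["url"], url) is not None


def scrape(scraper, xhtml):
    """Apply every element selector to a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    results = {}
    for name, spec in scraper["elements"].items():
        selector = spec["selector"]
        # ElementTree rejects absolute paths, so "//h1" is rewritten to ".//h1".
        if selector.startswith("/"):
            selector = "." + selector
        results[name] = [
            "".join(node.itertext()).strip() for node in root.findall(selector)
        ]
    return results


# Hypothetical page and URL, for illustration only.
PAGE = "<html><body><h1 property='dc:title'>An example article title</h1></body></html>"
URL = "http://journals.plos.org/plosone/article"

if applies_to(SCRAPER, URL):
    print(scrape(SCRAPER, PAGE))
```

The scraper itself stays pure data: supporting another journal means writing another JSON file, not more code.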
SCRAPERS
SCRAPERS
{
  "name": "PLoS",
  "url": "plosw*.org",
  "elements": {
    "title": {
      "selector": "//h1[@property='dc:title']"
    }
  }
}
SCRAPERS
{
  "title": "Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation"
}
bibJSON output
THRESHER & QUICKSCRAPE
• reference implementation of scraperJSON	

• thresher is the scraping library	

• http://github.com/ContentMine/thresher	

• quickscrape is the command-line tool	

• http://github.com/ContentMine/quickscrape	

• Node.js, MIT licensed
JOURNAL SCRAPERS
http://github.com/ContentMine/journal-scrapers	

a self-testing collection of scraperJSON scrapers for academic journals	

PLOS MDPI
PeerJ Wiley
ScienceDirect Taylor & Francis
NPG, AAAS, RSC, ACS Springer
NORMALISATION
quickscrape → HTML, PDF, XML, DOC, CSV → Norma → sHTML → AMI fact extraction
NORMALISATION
before:
• un-navigable
• non-unicode
• pixel glyphs
• no structure
after:
• processable
• sectioned
• tagged
• structured
NORMALISATION
mending on a journal-by-journal basis
invalid XHTML
from PLOS ONE
invalid XHTML
from BMC
NORMALISATION
document structure
before: un-sectioned
HTML from Hindawi
after: sectioned and
tagged HTML
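The sectioning step can be illustrated with a short Python sketch. It is not Norma's code. It assumes a well-formed, flat body in which each h2 heading starts a new section, and the class="section" and title attributes are hypothetical tagging conventions chosen for this example.

```python
import xml.etree.ElementTree as ET


def section_body(body):
    """Group each h2 and the elements that follow it into a <div class="section">."""
    sectioned = ET.Element("body")
    current = sectioned  # content before the first heading stays at the top level
    for child in list(body):
        if child.tag == "h2":
            current = ET.SubElement(sectioned, "div", {"class": "section"})
            # Hypothetical convention: tag the section with its heading text.
            current.set("title", "".join(child.itertext()).strip())
        current.append(child)
    return sectioned


# Made-up un-sectioned input, for illustration only.
FLAT = (
    "<body><p>Abstract text.</p><h2>Methods</h2><p>We did X.</p>"
    "<h2>Results</h2><p>We saw Y.</p><p>And Z.</p></body>"
)

sectioned = section_body(ET.fromstring(FLAT))
print(ET.tostring(sectioned, encoding="unicode"))
```

Once sections are explicit, later steps can address "the Methods section" directly instead of guessing where it starts and ends.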
FACT EXTRACTION
we can’t turn a
hamburger into a cow
but we can
turn PDFs
into science
FACT EXTRACTION
AMI software: https://bitbucket.org/petermr/ami-core
pixel → path → shape → char → word … → para → document → SCIENCE
FACT EXTRACTION
• titles	

• scale	

• units	

• ticks	

• quantity	

• + data
DATA!!
2000+ points
VECTOR PDF
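Once the ticks and their labels have been recognised, recovering data values from a plot is a matter of calibrating each axis. The Python sketch below shows that arithmetic for linear axes. It is an illustration rather than AMI's method, and the tick positions and point coordinates are invented.

```python
def make_axis(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value on a linear axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical tick positions read off a plot: two labelled ticks per axis.
x_axis = make_axis(100, 0.0, 500, 10.0)  # tick "0" at 100 px, tick "10" at 500 px
y_axis = make_axis(400, 0.0, 80, 8.0)    # pixel rows grow downwards, values grow upwards

# Hypothetical marker centres found in the vector graphics, in pixel coordinates.
points_px = [(100, 400), (300, 240), (500, 80)]
points = [(x_axis(px), y_axis(py)) for px, py in points_px]
print(points)
```

Two labelled ticks per axis are enough for a linear scale. A logarithmic axis would need the same interpolation applied to the logarithm of the tick values.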
FACT EXTRACTION
raw mobile photo	

shadows, contrast,
noise, skew
binarization:	

pixels = 0, 1
clipping
AMI-chem for extracting chemical formulae
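Binarization reduces every pixel to 0 or 1. The Python sketch below uses a single global threshold set to the mean grey level. This is the simplest possible choice and an assumption of this example, not a description of AMI-chem. Photos with shadows and uneven contrast generally need an adaptive threshold instead.

```python
def binarize(gray, threshold=None):
    """Map a grayscale image (rows of 0-255 integers) to 0/1 pixels."""
    if threshold is None:
        # Assumption: a global mean threshold, the simplest possible choice.
        flat = [p for row in gray for p in row]
        threshold = sum(flat) / len(flat)
    # Dark ink becomes 1, light background becomes 0.
    return [[1 if p < threshold else 0 for p in row] for row in gray]


# Made-up 3x4 image: light paper with a few dark ink pixels.
image = [
    [250, 245, 240, 250],
    [248, 30, 35, 244],
    [251, 28, 240, 247],
]

for row in binarize(image):
    print("".join(str(p) for p in row))
```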
FACT EXTRACTION
thinning down to 1 pixel
chemical optical character recognition
AMI-chem for extracting chemical formulae
FACT EXTRACTION
thinning topology
AMI-phylo for extracting phylogenetic trees
FACT EXTRACTION
Newick format can be viewed at:	

http://www.unc.edu/~bdmorris/treelib-js/demo.html
AMI-phylo for extracting phylogenetic trees
serialization
((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),
((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),
(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),
((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,
((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,
((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,
(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),
((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,
(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),
(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,
(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,
(n163,n227)),((n53,n131),n159)))))));
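A Newick string like the one above is straightforward to read back into a tree structure. The following Python sketch is a minimal recursive-descent parser for the subset used here: nested parentheses with named leaves, no branch lengths and no labels on internal nodes. It is an illustration, not part of AMI-phylo.

```python
def parse_newick(text):
    """Parse a Newick string with named leaves into nested lists of leaf names."""
    # Remove line breaks and spaces, then the terminating semicolon.
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return children
        # A leaf name runs until the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(node):
    """Return the leaf names of a parsed tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


tree = parse_newick("((n35,n98),\n n191);")
print(tree)
print(leaves(tree))
```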
Mining Examples
Building bacterial supertrees
Mining chemical reactions
Better genome annotation
Chemistry
AMI reads and recognises chemical
structures.
It can even create reaction animations.
Natural language processing
can be used to analyse
chemical methods. These are
FACTS but the paper itself may
be copyrighted.
Clinical Trials
Clinical trials offer clear use cases
for content mining.
Data extraction from graphs could be very
useful for meta-analyses where raw data is
unavailable.
Only ~4% of phylogenetic analyses
make underlying data available.
Supertrees
Content Mining enables AUTOMATED
extraction from daily literature and
conversion to NeXML:
- Machine-readable
- Open
- Reusable
RAW data would be optimal!
PLUTo: Ross Mounce & Peter Murray-Rust
Annotation
Many applications:
- Find primers
- Enhance positive controls
- Find novel sequence information
- More detailed and accurate annotation
Potential to improve
quality and efficiency
of genomic research.
WHO
Thank you very much
for your attention!
Any questions?
Peter Murray-Rust
Ross Mounce
Richard Smith-Unna
Steph Unna
Jenny Molloy
Mark MacGillivray
Graham Steel
With thanks to:
Charles Oppenheim
Michelle Brook
Follow
@TheContentMine
contentmine.org
Find the code on
github.com/ContentMine
Funded by:
Why might ContentMine be of interest?
Training for public health data researchers.
'Science on a Stick' standardised scholarly HTML
corpus for mining.
Potential to mine other standardised PDF documents
such as reports.
Open source, academic-led, easy to use and
customise.
All images are licensed under CC-BY unless otherwise stated
What is Content?
Phylogenetic tree from Figure 1 in Chen Z, Schiffman M, Herrero R, DeSalle R, Anastos K, et al. (2011) Evolution and Taxonomic
Classification of Human Papillomavirus 16 (HPV16)-Related Variant Genomes: HPV31, HPV33, HPV35, HPV52, HPV58 and HPV67. PLoS ONE 6(5):
e20183. doi:10.1371/journal.pone.0020183
Graph from He F, Fromion V, Westerhoff HV. (Im)Perfect robustness and adaptation of metabolic networks subject to metabolic and gene-expression
regulation: marrying control engineering with metabolic control analysis. BMC Syst Biol. 2013;7:131. doi:10.1186/1752-0509-7-131. PubMed PMID:
24261908; PubMed Central PMCID: PMC4222491.
Table from Table 1 in Young GR, Mavrommatis B, Kassiotis G. Microarray analysis reveals global modulation of endogenous retroelement transcription by
microbes. Retrovirology. 2014;11:59. doi:10.1186/1742-4690-11-59. PubMed PMID: 25063042; PubMed Central PMCID: PMC4222864.
Text from Laidlaw CT, Condon JM, Belk MC. Viability Costs of Reproduction and Behavioral Compensation in Western Mosquitofish (Gambusia affinis).
PLoS One. 2014;9(11):e110524. doi:10.1371/journal.pone.0110524. PubMed PMID: 25365426; PubMed Central PMCID: PMC4217728.
Cell microscopy image from Pettinato G, Vanden Berg-Foels WS, Zhang N, Wen X. ROCK Inhibitor Is Not Required for Embryoid Body Formation from
Singularized Human Embryonic Stem Cells. PLoS One. 2014;9(11):e100742. doi:10.1371/journal.pone.0100742. PubMed PMID: 25365581; PubMed
Central PMCID: PMC4217711.
Supertrees:
Lang JM, Darling AE, Eisen JA. Phylogeny of bacterial and archaeal genomes using conserved genes: supertrees and supermatrices. PLoS One.
2013;8(4):e62510. doi:10.1371/journal.pone.0062510. PubMed PMID: 23638103; PubMed Central PMCID: PMC3636077.
McDowell A, Nagy I, Magyari M, Barnard E, Patrick S. The opportunistic pathogen Propionibacterium acnes: insights into typing, human disease, clonal
diversification and CAMP factor evolution. PLoS One. 2013;8(9):e70897. doi:10.1371/journal.pone.0070897. PubMed PMID: 24058439; PubMed Central
PMCID: PMC3772855.
Chemistry:
Diagram from Klejnstrup ML, Frandsen RJ, Holm DK, Nielsen MT, Mortensen UH, Larsen TO, Nielsen JB. Genetics of Polyketide Metabolism in
Aspergillus nidulans. Metabolites. 2012;2(1):100-133. doi:10.3390/metabo2010100. PubMed PMID: 24957370; PubMed Central PMCID: PMC3901194.
Methods text from Greshock, T. J., Grubbs, A. W., Jiao, P., Wicklow, D. T., Gloer, J. B., & Williams, R. M. (2008). Isolation, Structure Elucidation, and
Biomimetic Total Synthesis of Versicolamide B, and the Isolation of Antipodal (−)-Stephacidin A and (+)-Notoamide B from Aspergillus versicolor NRRL
35600. Angewandte Chemie International Edition, 47(19), 3573-3577.
Annotation:
Stubben, C. J., & Challacombe, J. F. (2014). Mining locus tags in PubMed Central to improve microbial gene annotation. BMC Bioinformatics, 15(1), 43.
Figure from Haeussler, M., Gerner, M., & Bergman, C. M. (2011). Annotating genes and genomes with DNA sequences extracted from biomedical
articles. Bioinformatics, 27(7), 980-986.

Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 

ContentMine Presentation for WHO Health Data Seminar

  • 1. Content Mining: Technology and Policy Developments @jenny_molloy World Health Organisation – 9 April 2015
  • 3. What is mining? 1982 “Automatically generating logical representations of text passages... by means of an analysis of the coherence structure of the passages.” Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th conference on Computational linguistics - Volume 1 (COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, Czechoslovakia, 127-132. DOI=10.3115/991813.991833 http://dx.doi.org/10.3115/991813.991833 2008 “The use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature.” Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946.
  • 5. 2011 2014 From 2014 UK Law Workshops, hackdays, presentations, collaborations, discussions with librarians and publishers. Putting new rights into action.
  • 6. In Europe 2013 Shortly after 2013 2015 Research commissioned through H2020... any EU Directive >5 years away. Ireland already considering following UK, plus other member states?
  • 7. OUR MISSION “make 100,000,000 facts from the scholarly literature open, accessible and reusable”
  • 8. SOFTWARE OVERVIEW quickscrape & thresher norma AMI fact extraction
  • 9. THE SCALE OF THE TASK • ~ 27,000 peer reviewed journals* • > 5,000 publishers • ~ 3,000 new papers per day *Ulrich’s database: http://ulrichsweb.serialssolutions.com/login
  • 10. STRUCTURED INFORMATION • chemical names and structures • species • metabolism • phylogenetic trees
  • 12. CRAWLING The latest journal tables of contents at Journal TOCs http://www.journaltocs.hw.ac.uk/
  • 13. SCRAPERS • all have the same plumbing • scraping software (thresher) handles the plumbing • scraperJSON is a config file • supports large collections of scrapers • no programming required • not limited to one piece of software
  • 14. BASIC SCRAPER JSON { "name": "PLOS", "url": "plosw*.org", "elements": { "title": { "selector": "//h1[@property='dc:title']" } } } Annotations: "name" is the name of the scraper; "url" is the URL(s) it applies to; "elements" lists the elements to capture; "title" is an element name; "selector" is where to find it. http://github.com/ContentMine/scraperJSON
  • 16. SCRAPERS { "name": "PLoS", "url": "plosw*.org", "elements": { "title": { "selector": "//h1[@property='dc:title']" } } }
  • 17. SCRAPERS { "title": "Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation" } bibJSON output
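The scraper slides above can be illustrated with a small sketch: a scraperJSON-style definition names an element and an XPath selector, and the scraper returns the matching text under that name. This is not thresher itself (which is a Node.js library); it is a Python stand-in that uses the standard library's limited XPath support. The sample page markup and the `scrape` helper are hypothetical, and the "url" field is copied verbatim from the slide and not interpreted here.

```python
import json
import xml.etree.ElementTree as ET

# Scraper definition as shown on the slide (the "url" value is copied verbatim
# and is not used in this sketch).
SCRAPER = json.loads("""
{
  "name": "PLOS",
  "url": "plosw*.org",
  "elements": {
    "title": {"selector": "//h1[@property='dc:title']"}
  }
}
""")

# Hypothetical, well-formed page fragment; real journal pages are messier.
SAMPLE_PAGE = """
<html>
  <body>
    <h1 property="dc:title">Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation</h1>
    <p>Article body.</p>
  </body>
</html>
"""


def scrape(scraper, page_source):
    """Apply every element selector in the scraper to the page (hypothetical helper)."""
    root = ET.fromstring(page_source.strip())
    result = {}
    for name, spec in scraper["elements"].items():
        selector = spec["selector"]
        # ElementTree only accepts relative paths, so "//x" becomes ".//x".
        if selector.startswith("//"):
            selector = "." + selector
        node = root.find(selector)
        if node is not None:
            result[name] = "".join(node.itertext()).strip()
    return result


if __name__ == "__main__":
    print(json.dumps(scrape(SCRAPER, SAMPLE_PAGE), indent=2))
```

Because the definition is plain data, adding another element to capture means adding another entry under "elements" rather than writing new code, which is the point the "no programming required" bullet makes.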
  • 18. THRESHER & QUICKSCRAPE • reference implementation of scraperJSON • thresher is the scraping library • http://github.com/ContentMine/thresher • quickscrape is the command-line tool • http://github.com/ContentMine/quickscrape • Node.js, MIT licensed
  • 19. JOURNAL SCRAPERS http://github.com/ContentMine/journal-scrapers a self-testing collection of scraperJSON scrapers for academic journals PLOS MDPI PeerJ Wiley ScienceDirect Taylor & Francis NPG, AAAS, RSC, ACS Springer
  • 21. NORMALISATION before: • un-navigable • non-unicode • pixel glyphs • no structure after: • processable • sectioned • tagged • structured
  • 22. NORMALISATION mending on a journal-by-journal basis invalid XHTML from PLOS ONE invalid XHTML from BMC
  • 23. NORMALISATION document structure before: un-sectioned HTML from Hindawi after: sectioned and tagged HTML
  • 24. FACT EXTRACTION we can’t turn a hamburger into a cow but we can turn PDFs into science
  • 25. FACT EXTRACTION AMI software: https://bitbucket.org/petermr/ami-core pixel path shape char word… para document SCIENCE
  • 26. FACT EXTRACTION • titles • scale • units • ticks • quantity • + data DATA!! 2000+ points VECTOR PDF
  • 27. FACT EXTRACTION raw mobile photo shadows, contrast, noise, skew binarization: pixels = 0, 1 clipping AMI-chem for extracting chemical formulae
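The binarization step named on this slide ("pixels = 0, 1") can be sketched as a simple threshold over grey levels. The slide does not say how AMI chooses its threshold, so the fixed global threshold of 128, the `binarize` function, and the tiny sample image below are all assumptions made for illustration.

```python
def binarize(gray_rows, threshold=128):
    """Map 0-255 grey levels to 1 (dark ink, below threshold) or 0 (background)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


# Hypothetical 3x4 greyscale patch: a dark stroke on a light, unevenly lit page.
photo = [
    [250, 240, 30, 245],
    [235, 20, 25, 230],
    [200, 190, 40, 210],
]

if __name__ == "__main__":
    for row in binarize(photo):
        print("".join(str(bit) for bit in row))
```

A single global threshold is the simplest choice; photos with strong shadows, which the slide mentions, generally need a locally adaptive threshold instead.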
  • 28. FACT EXTRACTION thinning chemical optical character recognition down to 1-pixel AMI-chem for extracting chemical formulae
  • 29. FACT EXTRACTION thinning topology AMI-phylo for extracting phylogenetic trees
  • 30. FACT EXTRACTION Newick format can be viewed at: http://www.unc.edu/~bdmorris/treelib-js/demo.html AMI-phylo for extracting phylogenetic trees serialization ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182), ((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86), (n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))), ((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212, ((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49, ((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11, (n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43), ((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170, (n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))), (((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91, (n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171, (n163,n227)),((n53,n131),n159)))))));
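To show what the Newick serialization above encodes, here is a minimal sketch of a parser for label-only Newick strings like the one on the slide: nested parentheses are clades, comma-separated names are leaves, and a semicolon ends the tree. It assumes no branch lengths, no internal node labels, and no quoted names, which holds for the slide's string but not for Newick in general. The function names are hypothetical, and the short example tree is a fragment taken from the slide's string.

```python
def parse_newick(text):
    """Parse a label-only Newick string into nested lists of leaf names."""
    text = "".join(text.split())  # drop whitespace left over from line wrapping
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
            return children
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """Return the leaf names of a parsed tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


if __name__ == "__main__":
    tree = parse_newick("(((n35,n98),n191),n22);")
    print(tree)
    print(leaves(tree))
```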
  • 31. Mining Examples Building bacterial supertrees Mining chemical reactions Better genome annotation
  • 32. Chemistry AMI reads and recognises chemical structures. Can even create reaction animations. Natural language processing can be used to analyse chemical methods. These are FACTS but the paper itself may be copyrighted.
  • 33. Clinical Trials Clinical trials offer clear use cases for content mining. Data extraction from graphs could be very useful for meta-analyses where raw data is unavailable.
  • 34. Supertrees Only ~4% of phylogenetic analyses make underlying data available. Content Mining enables AUTOMATED extraction from daily literature and conversion to NeXML: - Machine-readable - Open - Reusable RAW data would be optimal! PLUTo: Ross Mounce & Peter Murray-Rust
  • 35. Annotation Many applications: - Find primers - Enhance positive controls - Find novel sequence information - More detailed and accurate annotation Potential to improve quality and efficiency of genomic research.
  • 36. WHO Thank you very much for your attention! Any questions? Peter Murray-Rust Ross Mounce Richard Smith-Unna Steph Unna Jenny Molloy Mark MacGillivray Graham Steel With thanks to: Charles Oppenheim Michelle Brook Follow @TheContentMine contentmine.org Find the code on github.com/ContentMine Funded by: Why might ContentMine be of interest? Training for public health data researchers. 'Science on a Stick' standardised scholarly HTML corpus for mining. Potential to mine other standardised PDF documents such as reports. Open source, academic-led, easy to use and customise.
  • 37. All images are licensed under CC-BY unless otherwise stated What is Content? Phylogenetic Tree from Figure 1 in Evolution and Taxonomic Classification of Human Papillomavirus 16 (HPV16)-Related Variant Genomes: HPV31, HPV33, HPV35, HPV52, HPV58 and HPV67. Chen Z, Schiffman M, Herrero R, DeSalle R, Anastos K, et al. (2011) Evolution and Taxonomic Classification of Human Papillomavirus 16 (HPV16)-Related Variant Genomes: HPV31, HPV33, HPV35, HPV52, HPV58 and HPV67. PLoS ONE 6(5): e20183. doi: 10.1371/journal.pone.0020183 Graph from He F, Fromion V, Westerhoff HV. (Im)Perfect robustness and adaptation of metabolic networks subject to metabolic and gene-expression regulation: marrying control engineering with metabolic control analysis. BMC Syst Biol. 2013;7 131. doi:10.1186/1752-0509-7-131. PubMed PMID: 24261908; PubMed Central PMCID: PMC4222491. Table from Table 1 Young GR, Mavrommatis B, Kassiotis G. Microarray analysis reveals global modulation of endogenous retroelement transcription by microbes. Retrovirology. 2014;11 59. doi:10.1186/1742-4690-11-59. PubMed PMID: 25063042; PubMed Central PMCID: PMC4222864. Text from Laidlaw CT, Condon JM, Belk MC. Viability Costs of Reproduction and Behavioral Compensation in Western Mosquitofish (Gambusia affinis). PLoS One. 2014;9(11) e110524. doi:10.1371/journal.pone.0110524. PubMed PMID: 25365426; PubMed Central PMCID: PMC4217728. Cell microscopy image from Pettinato G, Vanden Berg-Foels WS, Zhang N, Wen X. ROCK Inhibitor Is Not Required for Embryoid Body Formation from Singularized Human Embryonic Stem Cells. PLoS One. 2014;9(11) e100742. doi:10.1371/journal.pone.0100742. PubMed PMID: 25365581; PubMed Central PMCID: PMC4217711. Supertrees: Lang JM, Darling AE, Eisen JA. Phylogeny of bacterial and archaeal genomes using conserved genes: supertrees and supermatrices. PLoS One. 2013;8(4) e62510. doi:10.1371/journal.pone.0062510. PubMed PMID: 23638103; PubMed Central PMCID: PMC3636077. 
McDowell A, Nagy I, Magyari M, Barnard E, Patrick S. The opportunistic pathogen Propionibacterium acnes: insights into typing, human disease, clonal diversification and CAMP factor evolution. PLoS One. 2013;8(9) e70897. doi:10.1371/journal.pone.0070897. PubMed PMID: 24058439; PubMed Central PMCID: PMC3772855. Chemistry: Diagram from Klejnstrup ML, Frandsen RJ, Holm DK, Nielsen MT, Mortensen UH, Larsen TO, Nielsen JB. Genetics of Polyketide Metabolism in Aspergillus nidulans. Metabolites. 2012;2(1) 100-133. doi:10.3390/metabo2010100. PubMed PMID: 24957370; PubMed Central PMCID: PMC3901194. Methods text from Greshock, T. J., Grubbs, A. W., Jiao, P., Wicklow, D. T., Gloer, J. B., & Williams, R. M. (2008). Isolation, Structure Elucidation, and Biomimetic Total Synthesis of Versicolamide B, and the Isolation of Antipodal (−) Stephacidin A and (+) Notoamide B from Aspergillus versicolor NRRL 35600. Angewandte Chemie International Edition, 47(19), 3573-3577. Annotation: Stubben, C. J., & Challacombe, J. F. (2014). Mining locus tags in PubMed Central to improve microbial gene annotation. BMC bioinformatics, 15(1), 43. Figure from Haeussler, M., Gerner, M., & Bergman, C. M. (2011). Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics, 27(7), 980-986.