This document summarizes Peter Murray-Rust's work on developing software to extract structured data and information from scientific documents. It discusses tools to extract data from text, tables, images, computational logs, and more. It provides examples of extracting chemical information, disease and species data, and phylogenetic trees from figures. The goal is to liberate scientific data locked up in unstructured documents to enable new discoveries.
A Global Commons for Scientific Data: Molecules and Wikidata
1. A global Commons for scientific Data
Peter Murray-Rust,
Dept of Chemistry, University of Cambridge
and ContentMine
At Molecular Engineering, Cavendish, Cambridge,
UK, 2016-11-07
contentmine.org is supported by a grant to PMR as a
2. The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
3. Some topics
• Content in scientific publications
• Extracting data from text and tables
• Dictionaries
• Extracting data from images
• Extracting data from computational logs and theses
• Wikidata
Everything is open (CC-BY). Please steal and re-use
8. Output of scholarly publishing
1 year’s scholarly output!
• 586,364 Crossref DOIs per month (201507) [1]
• 2.5 million (papers + supplemental data) / year [citation needed]*
• each 3 mm thick
• 4500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
10. Most publishers destroy structured information (LaTeX, Word), reducing it to PDF …
• Characters (NOT words or higher structure)
  “WORD” is simply 4 characters, with no space characters
• Paths (NOT circles, squares …) “Vectors”
… APIs then destroy it further into pixels (e.g. PNG or JPG)
ContentMine will read 10,000 PNGs a day and try to recover the science.
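The point that a PDF stores positioned characters rather than words can be illustrated with a minimal sketch. It assumes a hypothetical `Glyph` record (character, left edge, width) and a simple gap threshold; it is an illustration of the problem, not ContentMine's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character on a text line."""
    char: str
    x: float      # left edge of the character
    width: float  # advance width of the character


def group_into_words(glyphs, gap_factor=0.3):
    """Rebuild words by splitting wherever the horizontal gap is large.

    gap_factor is an assumed tuning value: a gap wider than
    gap_factor * (previous glyph width) is treated as a word break.
    """
    words, current, prev = [], "", None
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = glyph.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                words.append(current)
                current = ""
        current += glyph.char
        prev = glyph
    if current:
        words.append(current)
    return words


# "WORD" set tight, then a 5-unit gap before "IS"
line = [Glyph(c, x, 10.0) for c, x in
        [("W", 0), ("O", 10), ("R", 20), ("D", 30), ("I", 45), ("S", 55)]]
words = group_into_words(line)
```

Real documents also need line detection, kerning tolerance, and font-size changes, which this sketch ignores.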
17. Search for “Zika” in EuropePMC and Wikidata
• https://github.com/ContentMine/amidemos/blob/master/WIKIDATA.md#contentmine-demos (list of demos)
• https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html (datatables extracted: disease, gene, species, etc.)
• Lars Willighagen (NL) and Tom Arrow: visualisation of single facts and groups from the corpus. https://tarrow.github.io/factvis/#cmid=CM.wikidatacountry136
• https://contentmine-demo.herokuapp.com/cooccurrences Co-occurrence of diseases; suggest selecting 25 and disease.
33. Annotation sent to hypothes.is
Annotation fields: prefix, suffix, source, user, text, uri
maybe 100+ annotations per paper
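The annotation fields listed on this slide can be sketched as a payload. The layout below assumes the W3C Web Annotation text-quote selector shape (exact, prefix, suffix under a target with a source); the slide does not show the payload ContentMine actually sent, and every value here is hypothetical.

```python
import json


def build_annotation(uri, exact, prefix, suffix, text, user=None):
    """Assemble one annotation as a plain dict (assumed layout, see above)."""
    selector = {
        "type": "TextQuoteSelector",
        "exact": exact,    # the annotated phrase itself
        "prefix": prefix,  # text just before the phrase
        "suffix": suffix,  # text just after the phrase
    }
    payload = {
        "uri": uri,
        "text": text,
        "target": [{"source": uri, "selector": [selector]}],
    }
    if user is not None:
        payload["user"] = user
    return payload


# Hypothetical example values, for illustration only
annotation = build_annotation(
    uri="https://example.org/paper.html",
    exact="Zika virus",
    prefix="infection with ",
    suffix=" was confirmed",
    text="species: Zika virus",
    user="acct:example@hypothes.is",
)
annotation_json = json.dumps(annotation, indent=2)
```

The prefix and suffix let the phrase be re-located in the page text even when the same phrase occurs many times in one paper.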
35. Wikidata: monoclinic systems with mass < 200
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fmass%0AWHERE%0A%7B%0A%20%20%09%3Fitem%20wdt%3AP556%20wd%3AQ624543%20.%0A%20%20%09%3Fitem%20wdt%3AP2067%20%3Fmass%20.%0A%20%20%20%20FILTER%28%3Fmass%20%3C%20200%20%29%0A%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D
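The query is hard to read in its URL-encoded form. The sketch below decodes the fragment of the link above with the standard library so the SPARQL can be inspected; it makes no network call. In Wikidata, P556 is "crystal system" and P2067 is "mass"; Q624543 is taken to be the monoclinic system, as the slide title indicates.

```python
from urllib.parse import unquote

# The part of the query URL after "#", joined into one string
ENCODED_QUERY = (
    "SELECT%20%3Fitem%20%3FitemLabel%20%3Fmass%0AWHERE%0A%7B%0A%20%20%09"
    "%3Fitem%20wdt%3AP556%20wd%3AQ624543%20.%0A%20%20%09"
    "%3Fitem%20wdt%3AP2067%20%3Fmass%20.%0A%20%20%20%20"
    "FILTER%28%3Fmass%20%3C%20200%20%29%0A%09"
    "SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20"
    "wikibase%3Alanguage%20%22en%22%20%7D%0A%7D"
)

# Percent-decoding recovers the SPARQL text, including its line breaks
query = unquote(ENCODED_QUERY)

# Strip the mixed tab/space indentation for display
query_lines = [line.strip() for line in query.splitlines()]

for line in query_lines:
    print(line)
```

The decoded query selects items whose crystal system is Q624543 and whose mass is below 200, with English labels supplied by the label service.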
45. Ross Mounce (Bath), Panton Fellow
• Sharing research data:
http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:
Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)
47. Note jaggy and broken pixels
NEW bacteria must have a phylogenetic tree
[Figure: phylogenetic tree image; labels: Binomial Name, Culture/Strain, GENBANK ID, Length, Weight, Evolution Rate]
49. IJSEM phylotrees
• International Journal of Systematic and Evolutionary Microbiology
• All new microorganisms are expected to be published there
• Consistent (though primitive) approach to trees
53. Automatic Open Notebook of computations
Everything is posted to Github before being analyzed
54. 20 commonest organisms (in > 30 papers) in trees from IJSEM*
Bacillus subtilis [131238]
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens
• [number] = identifier in Wikidata
• Missing = not found with Wikidata API
Half do not appear to be in Wikidata
Can the Wikipedia Scientists comment?
*Int. J. Syst. Evol. Microbiol.
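A lookup of this kind can be sketched against the public Wikidata API. The slide does not say which API call was used; the sketch assumes the `wbsearchentities` action, and the helper names and the User-Agent string are hypothetical. The network call is wrapped in `lookup` and is not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://www.wikidata.org/w/api.php"


def search_url(name, language="en", limit=1):
    """Build a wbsearchentities request URL for a binomial name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def first_id(response):
    """Return the first matching item id, or None when nothing is found."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None


def lookup(name):
    """Query Wikidata for one name (performs a network request)."""
    # Hypothetical User-Agent; replace with your own contact details
    request = Request(search_url(name),
                      headers={"User-Agent": "ijsem-lookup-sketch/0.1"})
    with urlopen(request) as reply:
        return first_id(json.load(reply))


url = search_url("Bacillus subtilis")
```

A name search returns the best text match, so a hit still needs checking that the item is really the taxon intended.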
55. Display your own tree
• Cut and paste…
• ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));
• View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or
• http://www.trex.uqam.ca/index.php?action=newick&project=trex
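The tree above is in Newick format: nested parentheses, comma-separated children, and a closing semicolon. A minimal sketch of a reader for this topology-only form follows. It assumes no branch lengths and no internal node labels, which matches the tree on this slide; the function names are illustrative. Whitespace is stripped first, so a tree pasted with line wraps still parses.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        if start == pos:
            raise ValueError(f"expected a leaf name at position {pos}")
        return text[start:pos]

    tree = node()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """List leaf names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


# A small fragment in the same style as the tree on this slide
small = parse_newick("((((n35,n98),n191),\nn22),n17);")
small_leaves = leaves(small)
```

The nested tuples mirror the bracket structure directly, so `len(leaves(tree))` gives the number of tips in the tree.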
72. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
CLICK HERE FOR ANIMATION
(may be browser dependent)
73. Precision + Recall for ImageAnalysis?
• Chemical Patents (obfuscation) ca 25% PR
• Binomial names from text > 99% PR
• Binomial from images (lookup) 95%+
• Trees from images (pred.)
• Molecules: image ca 90% SVG >
• Analysis massively hampered by Copyright
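For reference, precision and recall can be computed from counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the counts are hypothetical and are not the measurements behind the figures on this slide.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw counts.

    precision = TP / (TP + FP): fraction of extracted items that are correct
    recall    = TP / (TP + FN): fraction of real items that were extracted
    """
    extracted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / extracted if extracted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 90 correct extractions, 10 wrong, 30 missed
precision, recall = precision_recall(90, 10, 30)
```

With these counts precision is 0.9 and recall is 0.75, which shows why the two figures need reporting separately.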
74. Software Availability and collaboration
• All software OSI-compliant (non-GPL): Apache2, MIT, BSD
• http://bitbucket.org/wwmm (euclid, Jumbo6, svg, pdf2svg)
• http://bitbucket.org/petermr (svgbuilder, xhtml2stm, imageanalysis, diagramanalyzer)
• http://bitbucket.org/AndyHowlett/ami2-poc
• http://github.com/petermr/ami-plugin
• http://github.com/ContentMine
• http://boofcv.org
• collaboration with PDFBox, TabulaPDF, JailbreakingThePDF
• Extracted data CC0
76. Questions and comments
Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation
PM-R has offered to mentor an MSc project this summer
for anyone interested.
contentmine.org