2. G***** for Science
If you had a magic wand what would you want to
be able to search for in the current “paper”
literature[1]?
Anything “readable”, including text, equations,
diagrams, metadata, supplementary data.
[1] Including closed access.
3. The Right to Read is the Right to Mine
http://contentmine.org
4. • Crawl scientific literature
(Open Bibliography)
• Scrape each scientific article
(ContentMine-quickscrape)
• Extract the facts (ContentMine-AMI)
• Index (Wikipedia)
• Republish (WikiData)
Machine Extraction of scientific facts
5. RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific
literature
Science
Plugins
Science
Volunteers
6. Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
7. “nuggets” in a scientific paper
quantity
units
Value ranges
Humans aren’t designed to mine this …
chemical
project places
9. Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
10. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram,
recognizes the paths and
generates the molecules. Then
she creates a stop-fram animation
showing how the 12 reactions
lead into each other
CLICK HERE FOR ANIMATION
(may be browser dependent)
11. But we can now
turn PDFs into
Science
We can’t turn a hamburger into a cow
12. Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 in an hour with > 95% accuracy
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type
Culture Collection
13. (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933 –
“Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada” .
((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),(
(((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n
215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),
n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n
102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((
n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n1
60))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139
,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222)))
)))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(
n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((
n53,n131),n159)))))));
http://en.wikipedia.org/wiki/Digital_image_processing
http://en.wikipedia.org/wiki/Newick_format http://en.wikipedia.org/wiki/Phylogenetics
14. Content Mining is now legal in UK
• Since 2014-06-01
• For non-commercial research purposes
• Without permission or licence from publishers
Come to the UK!
15. Some Community Ideas
• Exoplanets
• Elementary particles.
• NASA missions
• Biodiversity (what lives where and when)
• Phytochemistry (chemistry in plants)
• Neurophysiology data (electrical spikes)
• Evolutionary biology
• Daily dinosaurs for 5-year-olds
• Legal and political aspects
• Licence and Funder metadata