Making eTheses USEFUL
Peter Murray-Rust*,
University of Cambridge and OKF
ETD2014, Leicester, UK 2014-07-24
*Shuttleworth Fellow 2014-5
Overview
• We waste > 10,000,000,000 USD of eThesis value*
• Everyone else is becoming OPEN; not Universities
• What we CAN DO NOW: ContentMining
• What we SHOULD do: Open Notebook Science
• We don’t need commercial organisations to manage
theses.
• The time has come; we can do it now
*My numbers are DEBATABLE! Please add your thoughts to
http://pads.cottagelabs.com/p/etd2014 or tweet #etd2014
Jean-Claude Bradley
Jean-Claude Bradley was one of the
most influential open scientists of our
time. He was an innovator in all that
he did, from Open Education to
bleeding edge Open Science; in 2006,
he coined the phrase Open Notebook
Science. His loss is felt deeply by
friends and colleagues around the
world.
On Monday July 14, 2014 we gathered
at Cambridge University to honour his
memory and the legacy he leaves
behind with a highly distinguished set
of invited speakers to revisit and build
upon the ideas which inspired and
defined his life’s work.
Wikipedia CC BY-SA
The cost and value
The economic value of data
• I believe that we spend globally ca 400 billion
USD / yr on public research.
• The outputs include:
– Knowledge / papers / patents
– Organizations
– People
– Materials
– Data – many billions/year and much is lost
US Taxpayers spend 139 Billion USD / yr
on Scientific Research
4 Billion USD spent on the human genome
yielded 800 Billion USD and 4 M job-years
Scholarly publication
• Citizens pay $400,000,000,000…
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” … ($7 USD arXiv)
• … costs $10,000,000,000 …
• … “publishers” forbid access to 99.9% of citizens of the
world …
• … Value???
• Please challenge these numbers… #etd2014 or
http://pads.cottagelabs.com/p/etd2014
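The arithmetic behind these round figures can be checked in a few lines. This is only a sketch that reuses the slide's own debatable estimates; the variable names are illustrative and nothing here is measured data.

```python
# Rough check of the slide's estimates (all inputs are the speaker's
# debatable round numbers, not measured data).
public_research_spend_usd = 400_000_000_000  # paid by citizens per year
articles_per_year = 1_500_000
publish_cost_per_article_usd = 7_000
arxiv_cost_per_article_usd = 7

# Cost to create one article = total spend / number of articles.
cost_to_create_per_article = public_research_spend_usd / articles_per_year
# Total cost of "publishing" = articles x cost per article.
total_publishing_cost = articles_per_year * publish_cost_per_article_usd
# The same volume of articles at the arXiv per-article figure.
total_at_arxiv_rate = articles_per_year * arxiv_cost_per_article_usd

print(f"cost to create one article: ${cost_to_create_per_article:,.0f}")
print(f"total cost of publishing:   ${total_publishing_cost:,.0f}")
print(f"same volume at arXiv rate:  ${total_at_arxiv_rate:,.0f}")
```

On these inputs the slide's $300,000 and $10,000,000,000 are roundings of roughly $267,000 and $10.5 billion.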
…three problems—flawed design, non-
publication, and poor reporting—together
meant >85% of research funds were wasted, a
global total loss >100 billion USD per year.
[Lancet 2009]
[Even more] waste clearly occurs after
publication: from poor access, poor
dissemination, and poor uptake of the findings
of research. [PLOS Medicine 2014-05-27]
Bad publication wastes science
Authors don’t deposit data (Ross Mounce)
Where is the Digital Enlightenment?
• Science is done in C20th ways …
• …communicated in C19th ways …
• … losing the power of C21st
Linked Open Data – the world’s knowledge
very little physical science and THESES?? 
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
[Diagram labels: DBPedia; BIO; Comp; Lib; PDB; Ontologies; GOV; GOV.uk; Music, Art, Literature; Social; Knowledge bases; RDF triples]
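To make the point about theses concrete, here is a minimal sketch of how one thesis could be described as RDF triples and written out in N-Triples syntax. The thesis URI, title and author are hypothetical placeholders; the predicates are Dublin Core terms. It is an illustration only, not how any particular repository works.

```python
# A thesis described as subject-predicate-object triples, serialised as
# N-Triples. The thesis URI and the literal values are hypothetical.
DCTERMS = "http://purl.org/dc/terms/"
THESIS = "http://example.org/thesis/123"  # hypothetical identifier

triples = [
    (THESIS, DCTERMS + "title", "An example chemistry thesis"),
    (THESIS, DCTERMS + "creator", "A. Student"),
    (THESIS, DCTERMS + "license", "http://creativecommons.org/licenses/by/4.0/"),
]


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes and newlines for a literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def to_ntriples(rows) -> str:
    lines = []
    for subject, predicate, obj in rows:
        # Values that look like URIs become resources, the rest literals.
        if obj.startswith(("http://", "https://")):
            obj_text = f"<{obj}>"
        else:
            obj_text = f'"{escape_literal(obj)}"'
        lines.append(f"<{subject}> <{predicate}> {obj_text} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```

Once a thesis is expressed this way, a machine can link it to the other datasets in the cloud without parsing the PDF.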
eTheses
• Citizens pay $20,000,000,000*…
• … for research in 200,000 science theses*…
• … cost $100,000 each to create* …
• … re-use ??? (near zero)
• … Value???
• *Please challenge these numbers…
• NOTE: we pay publishers $15,000,000,000 for
journals and APCs
“Free” and “Open”
• “Free software is a matter of liberty, not price.
‘Free speech’, not ‘free beer’.” (R M Stallman)
• “A piece of data or content is open if anyone is
free to use, reuse, and redistribute it”
(OKFN) http://opendefinition.org/
• “open” (access) has multiple incompatible “definitions”. Major split
is “human eyeballs” vs copying and machine “reusability”
• “Open” is a marketing term for publishers, who frequently (often
deliberately) do not grant full Openness.
“Gratis” vs “Libre”
Critical Historical Open Events
• Free Software Foundation (RMS,
1985) and Linux (Torvalds, 1991)
• The World Wide Web (TBL, 1991)
• The human genome (1990-2001)
• The life of Aaron Swartz (1986-2013)
https://en.wikipedia.org/wiki/Bermuda_Principles
• Automatic release of sequence assemblies larger than 1
kb (preferably within 24 hours).
• Immediate publication of finished annotated
sequences.
• Aim to make the entire sequence freely available in the
public domain for both research and development in
order to maximise benefits to society.
http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-
reviewed literature] by all scientists, scholars, teachers,
students, and other curious minds. …
…Removing access barriers to this literature will
accelerate research, enrich education, share the
learning of the rich with the poor and the poor with
the rich, make this literature as useful as it can be, and
lay the foundation for uniting humanity in a common
intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
Panton Principles for Open Data in
Science (2010)
• PUBLISH YOUR DATA OPENLY
• …make an explicit and robust statement of your wishes.
• Use a recognized waiver or license that is appropriate for
data.
• open as defined by the Open Knowledge/Data Definition
(… NOT non-commercial)
• Explicit dedication of data … into the public domain via
PDDL or CCZero
Peter Murray-Rust, Cameron Neylon, Rufus Pollock, John
Wilbanks
Panton Authors and Fellows
Problems of Commercial
Elsevier wants to control Open Data
[asked by Michelle Brook]
Mendeley
From Wikipedia, the free encyclopedia
• … a social media site used by many scientists
to store metadata …
• … purchased by Elsevier in 2013
• David Dobbs, in The New Yorker, described
motive as:
– to acquire its user data,
– to destroy or coöpt an open-science icon that
threatens its business model.
• PM-R: Mendeley can also Snoop and Control
New ways for Theses
• Content Mining
• Open Notebook Theses
Traditional Research and Publication
[Diagram labels: “Lab” work; Write; rewrite; Re-experiment; paper/thesis; publish; ???; Validation??; DATA output often seriously restricted]
Content-Mining (TDM)
• Now COMPLETELY LEGAL IN UK since 2014-06-01 …
• … Whatever the publishers tell you. Do NOT sign their
APIs
• Contentmine.org …
• … sponsored by Shuttleworth Foundation …
• … to extract 100,000,000 facts from scientific literature
• And STM publishers are throwing millions to stop us
But we can now turn PDFs into Science
We can’t turn a hamburger into a cow
How a machine reads a chemical thesis
nodes are compounds; arrows are reactions
PROPERTIES (Name-Value-Units-Error)
[Figure: property statements in the text, each annotated as Name (N), Value (V), Units (U) and, where given, Error (E)]
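A Name-Value-Units-Error record can be sketched with a regular expression. This is only an illustration: the pattern, the `Property` type and the sample sentence are made up for this example and are not the tools shown in the talk.

```python
import re
from typing import List, NamedTuple, Optional


class Property(NamedTuple):
    name: str
    value: float
    error: Optional[float]
    units: str


# Hypothetical pattern for "<name> <value> [± <error>] <units>".
PATTERN = re.compile(
    r"^\s*(?P<name>.+?)\s+"
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?:\s*(?:±|\+/-)\s*(?P<error>\d+(?:\.\d+)?))?"
    r"\s*(?P<units>\S+)\s*$"
)


def extract_properties(text: str) -> List[Property]:
    found = []
    # Assume one property statement per ";"-separated segment.
    for segment in text.split(";"):
        match = PATTERN.match(segment)
        if match is None:
            continue  # not a statement this sketch understands
        error = match.group("error")
        found.append(Property(
            name=match.group("name"),
            value=float(match.group("value")),
            error=float(error) if error is not None else None,
            units=match.group("units"),
        ))
    return found


# Made-up sample text, for illustration only.
sample = "m.p. 123.5 ± 0.5 °C; yield 85 %; density 1.05 g/mL"
for prop in extract_properties(sample):
    print(prop)
```

Real theses are far messier than this sample, which is why dedicated tools are needed, but the target structure is the same four fields.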
“nuggets” in a scientific paper
[Figure labels: quantity; units; value ranges; chemical; project places]
Humans aren’t designed to mine this …
Natural Language Processing
Part of speech tagging (Wordnet, Brown Corpus, etc.)
Parsing chemical sentences
http://wwmm.ch.cam.ac.uk/chemicaltagger
Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
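The idea of tagging a chemical sentence can be shown with a toy example. This sketch uses a tiny hand-made lexicon and invented tag names; it is not ChemicalTagger's API, grammar or tag set, and the sample sentence is made up.

```python
import re

# Tiny hand-made lexicon; the tag names are invented for this sketch.
ACTION_VERBS = {"added", "stirred", "heated", "filtered", "dissolved", "washed"}
UNITS = {"g", "mg", "mL", "mmol", "h", "min", "°C"}

# Numbers, the °C unit, words, then any other single non-space character.
TOKEN = re.compile(r"\d+(?:\.\d+)?|°C|[A-Za-z]+|[^\sA-Za-z\d]")


def tag(sentence: str):
    """Label each token as NUMBER, UNIT, ACTION, WORD or PUNCT."""
    tagged = []
    for token in TOKEN.findall(sentence):
        if re.fullmatch(r"\d+(?:\.\d+)?", token):
            label = "NUMBER"
        elif token in UNITS:
            label = "UNIT"
        elif token.lower() in ACTION_VERBS:
            label = "ACTION"
        elif token[0].isalpha():
            label = "WORD"
        else:
            label = "PUNCT"
        tagged.append((token, label))
    return tagged


def actions_with_conditions(tagged):
    """Pair each ACTION with the first NUMBER + UNIT that follows it."""
    results, current = [], None
    for i, (token, label) in enumerate(tagged):
        if label == "ACTION":
            current = token
        elif (label == "NUMBER" and current is not None
              and i + 1 < len(tagged) and tagged[i + 1][1] == "UNIT"):
            results.append((current, float(token), tagged[i + 1][0]))
            current = None
    return results


sentence = "The mixture was stirred for 2 h and heated to 80 °C."
print(actions_with_conditions(tag(sentence)))
```

The output is a list of (action, value, unit) records, which is the kind of semantic markup a machine can then search or aggregate.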
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have processed 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B EUR.
Evolution of ultraviolet
vision in the largest avian
radiation - the passerines
Anders Ödeen, Olle Håstad and Per Alström
AMI: PDF → HTML
Styles, superscripts and diåcritics preserved!
PDF
Species: Turdus iliacus, Taeniopygia guttata, Serinus canaria, Lanius excubitor, Melopsittacus undulatus, Pavo cristatus, Sturnus vulgaris, Dolichonyx oryzivorus, Ficedula hypoleuca, Vaccinium myrtillus, Falco tinnunculus
Genera: Turdus, Pomatostomus, Leothrix, Amytornis, Acanthisitta, Orthonyx x 2, Malurus, Cnemophilus x 4, Philesturnus x 2, Motacilla x 2, Toxorhampus x 2
Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
AMI output (NexML, HTML):
Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
Posterior probability: 0.84, 0.91, 0.93, 0.95
AMI can MEASURE branch lengths! 23.12, 34.54, 37.21, 38.55
Open Notebook Science
• Graduate students understand it: do you?
Free/Open Software Development
[Diagram labels: Engineered repository; World community; CODE; rewrite; validate; fork; Re-use; inspires]
GitHub, Bitbucket, StackOverflow, Apache, OSI
Example: ContentMine at
http://github.com/ContentMine/quickscrape
Sophie Kershaw, Panton Fellow, Training PhD Students
“Do you think you would be
more confident in the future
about trying to apply Open
techniques to your work..?”
• 50% Yes, by myself
• 41% Yes, with help/guidance
• 9% No opinion/neutral
• 0% No
Rotation-Based Learning (RBL)
Phase 1: Initiator
• No communication
permitted between groups
• Attempt to reproduce
existing literature
• Deliver a coherent research
story by the end of Phase 1
Phase 2: Successor
• Communication between
groups still prohibited
• Validate and develop the
inherited research story
• Critique your predecessors
• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?
Throughout Phases 1 & 2:
• Daily lectures on open
science culture & techniques
• First-hand application to own
research work
• Version control using GitHub
• Daily group supervision
Open Source software inspires Open Science
Jean-Claude Bradley 2006
Open Notebook Science, ONS
http://michaelnielsen.org/blog/reinventing-discovery/
http://en.wikipedia.org/wiki/Reinventing_Discovery
http://gowers.wordpress.com/2013/11/03/dbd1-initial-post/
http://polymathprojects.org/2013/11/04/polymath9-pnp/#comments
The Polymath project
Tim Gowers and the world
Jean-Claude Bradley 2006
And spectra were included as well
Open Notebook Science
[Diagram labels: TOOLS; Open engineered repository; World community; INSTRUMENT; validate; merge; MODEL; CODE; DATA; knowledge; calibrate]
Problems are solved communally; nothing is needlessly duplicated;
“publication” is continuous
Machines and humans working together
CC-BY

Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 

Recently uploaded (20)

Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Mental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young mindsMental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young minds
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 

Making eTheses More Useful with Open Science Techniques

  • 1. Making eTheses USEFUL Peter Murray-Rust*, University of Cambridge and OKF ETD2014, Leicester, UK 2014-07-24 *Shuttleworth Fellow 2014-5
  • 2. Overview • We waste > 10,000,000,000 USD of eThesis value* • Everyone else is becoming OPEN; not Universities • What we CAN DO NOW: ContentMining • What we SHOULD do: Open Notebook Science • We don’t need commercial organisations to manage theses. • The time has come; We can do it now *My numbers are DEBATABLE! Please add your thoughts to http://pads.cottagelabs.com/p/etd2014 or tweet #etd2014
  • 3. Jean-Claude Bradley Jean-Claude Bradley was one of the most influential open scientists of our time. He was an innovator in all that he did, from Open Education to bleeding edge Open Science; in 2006, he coined the phrase Open Notebook Science. His loss is felt deeply by friends and colleagues around the world. On Monday July 14, 2014 we gathered at Cambridge University to honour his memory and the legacy he leaves behind with a highly distinguished set of invited speakers to revisit and build upon the ideas which inspired and defined his life’s work. Wikipedia CC BY-SA
  • 4. The cost and value
  • 5. The economic value of data • I believe that we spend globally ca 400 billion USD / yr on public research. • The outputs include: – Knowledge / papers / patents – Organizations – People – Materials – Data – many billions/year and much is lost
  • 6. US Taxpayers spend 139 Billion USD / yr on Scientific Research 4 Billion USD on human genome yielded 800 Billion USD and 4 M job-years
  • 7. Scholarly publication • Citizens pay $400,000,000,000… • … for research in 1,500,000 articles … • … cost $300,000 each to create … • … $7000 each to “publish” … ($7 USD arXiv) • … costs $10,000,000,000 … • … “publishers” forbid access to 99.9% of citizens of the world … • … Value??? • Please challenge these numbers… #etd2014 or http://pads.cottagelabs.com/p/etd2014
  • 8. …three problems—flawed design, non-publication, and poor reporting—together meant >85% of research funds were wasted, a global total loss >100 billion USD per year. [Lancet 2009] [Even more] waste clearly occurs after publication: from poor access, poor dissemination, and poor uptake of the findings of research. [PLOS Medicine 2014-05-27] Bad publication wastes science
  • 9. Authors don’t deposit data (Ross Mounce)
  • 10. Where is the Digital Enlightenment? • Science is done in C20th ways … • …communicated in C19th ways … • … losing the power of C21st
  • 11. Linked Open Data – the world’s knowledge very little physical science and THESES??  http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png DBPedia BIO Comp Lib PDB Ontologies GOV GOV.uk Music, Art Literature Social Knowledge bases RDF triples
  • 12. eTheses • Citizens pay $20,000,000,000*… • … for research in 200,000 science theses*… • … cost $100,000 each to create* … • … re-use ??? (near zero) • … Value??? • *Please challenge these numbers… • NOTE: we pay publishers $15,000,000,000 for journals and APCs
  • 13. “Free” and “Open” • “Free software is a matter of liberty, not price. ‘free speech’, not ‘free beer’.” (R M Stallman) • “A piece of data or content is open if anyone is free to use, reuse, and redistribute it” (OKFN) http://opendefinition.org/ • “open” (access) has multiple incompatible “definitions”. Major split is “human eyeballs” vs copying and machine “reusability” • “Open” is a marketing term for publishers, who frequently (often deliberately) do not grant full Openness. “Gratis” vs “Libre”
  • 14. Critical Historical Open Events • Free Software Foundation (RMS, 1985) and Linux (Torvalds, 1991) • The World Wide Web (TBL, 1991) • The human genome (1990-2001) The life of Aaron Swartz (1986-2013)
  • 15. https://en.wikipedia.org/wiki/Bermuda_Principles • Automatic release of sequence assemblies larger than 1 kb (preferably within 24 hours). • Immediate publication of finished annotated sequences. • Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
  • 16. http://www.budapestopenaccessinitiative.org/read … an unprecedented public good. … … completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. … …Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)
  • 17. Panton Principles for Open Data in science (2010) • PUBLISH YOUR DATA OPENLY • …make an explicit and robust statement of your wishes. • Use a recognized waiver or license that is appropriate for data. • open as defined by the Open Knowledge/Data Definition (… NOT non-commercial) • Explicit dedication of data … into the public domain via PDDL or CCZero Peter Murray-Rust, Cameron Neylon, Rufus Pollock, John Wilbanks
  • 20. Elsevier wants to control Open Data [asked by Michelle Brook]
  • 21. Mendeley From Wikipedia, the free encyclopedia • … a social media site used by many scientists to store metadata … • … purchased by Elsevier in 2013 • David Dobbs, in The New Yorker, described motive as: – to acquire its user data, – to destroy or coöpt an open-science icon that threatens its business model. • PM-R: Mendeley can also Snoop and Control
  • 22. New ways for Theses • Content Mining • Open Notebook Theses
  • 23. Traditional Research and Publication “Lab” work paper/thesis Write rewrite Re-experiment publish ??? Validation?? DATA output often seriously restricted
  • 24. Content-Mining (TDM) • Now COMPLETELY LEGAL IN UK since 2014-06-01 … • … Whatever the publishers tell you. Do NOT sign their APIs • Contentmine.org … • … sponsored by Shuttleworth Foundation … • … to extract 100,000,000 facts from scientific literature • And STM publishers are throwing millions to stop us
  • 25. But we can now turn PDFs into Science We can’t turn a hamburger into a cow
  • 26. How a machine reads a chemical thesis nodes are compounds; arrows are reactions
  • 27. PROPERTIES (Name-Value-Units-Error) Name Value Units NV U NV U N V U N E V E U
  • 28. “nuggets” in a scientific paper quantity units Value ranges Humans aren’t designed to mine this …  chemical project places
  • 29. Natural Language Processing Part of speech tagging (Wordnet, Brown Corpus, etc.)
  • 32. Automatic semantic markup of chemistry Could be used for analytical, crystallization, etc.
  • 33. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  • 34. Evolution of ultraviolet vision in the largest avian radiation - the passerines Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4 PDF  HTML  Styles, superscripts and diåcritics preserved! AMI
  • 35. PDF  Turdus iliacus Taeniopygia guttata Serinus canaria Lanius excubitor Melopsittacus undulatus Pavo cristatus Sturnus vulgaris Dolichonyx oryzivorus Ficedula hypoleuca Vaccinium myrtillus Falco tinnunculus Turdus Pomatostomus Leothrix Amytornis Acanthisitta Orthonyx x 2 Malurus Cnemophilus x 4 Philesturnus x 2 Motacilla x 2 Toxorhampus x 2
  • 36. Typical phylo tree: 60 nodes, complex and minuscule annotation, vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
  • 38. Open Notebook Science • Graduate students understand it: do you?
  • 39. Free/Open Software Development Engineered repository World community CODE rewrite validate CODE fork CODE Re-use CODE Re-use Github, BitBucket StackOverflow, Apache inspires OSI Example: ContentMine at http://github.com/ContentMine/quickscrape
  • 40. Sophie Kershaw, Panton Fellow, Training PhD Students
  • 41. “Do you think you would be more confident in the future about trying to apply Open techniques to your work..?” • 50% Yes, by myself • 41% Yes, with help/guidance • 9% No opinion/neutral • 0% No
  • 42. Rotation-Based Learning (RBL) Phase 1: Initiator • No communication permitted between groups • Attempt to reproduce existing literature • Deliver a coherent research story by the end of Phase 1 Phase 2: Successor • Communication between groups still prohibited • Validate and develop the inherited research story • Critique your predecessors • Role of research producer vs. research user • Can this approach help to foster awareness of reproducibility issues? Throughout Phases 1 & 2: • Daily lectures on open science culture & techniques • First-hand application to own research work • Version control using GitHub • Daily group supervision
  • 44. Open Source software inspires Open Science Jean-Claude Bradley 2006
  • 45. Open Notebook Science, ONS Jean-Claude Bradley 2006
  • 51. And spectra were included as well Jean-Claude Bradley 2006
  • 52. TOOLS Open Notebook Science Open engineered repository World community INSTRUMENT validate merge MODEL CODE DATA DATA knowledge calibrate Problems are solved communally; Nothing is needlessly duplicated; “publication“ is continuous Machines and humans Working together CC-BY
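Slides 7 and 12 invite readers to challenge the cost figures. As a quick sanity check, the sketch below multiplies the per-item costs by the item counts exactly as stated on those slides. The constant and function names are illustrative only, and the inputs are the speaker's own "debatable" estimates, not independent data.

```python
# Figures as stated on slides 7 and 12 (flagged by the speaker as debatable).
ARTICLES_PER_YEAR = 1_500_000
COST_TO_CREATE_ARTICLE_USD = 300_000
COST_TO_PUBLISH_ARTICLE_USD = 7_000
SCIENCE_THESES_PER_YEAR = 200_000
COST_TO_CREATE_THESIS_USD = 100_000


def total_usd(count: int, unit_cost_usd: int) -> int:
    """Return count multiplied by unit cost, in whole US dollars."""
    return count * unit_cost_usd


research_total = total_usd(ARTICLES_PER_YEAR, COST_TO_CREATE_ARTICLE_USD)
publishing_total = total_usd(ARTICLES_PER_YEAR, COST_TO_PUBLISH_ARTICLE_USD)
theses_total = total_usd(SCIENCE_THESES_PER_YEAR, COST_TO_CREATE_THESIS_USD)

for label, amount in [
    ("article research", research_total),
    ("article publishing", publishing_total),
    ("science theses", theses_total),
]:
    print(f"{label}: {amount / 1e9:.1f} billion USD")
```

The products come to 450, 10.5 and 20 billion USD. The first two sit a little above the rounded totals on slide 7 (400 and 10 billion USD), and the third matches slide 12, so the slide figures are internally consistent to within rounding.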

Editor's Notes

  1. Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture. In this talk, I’m going to stress the importance of data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor in the areas of validation and metabolism.
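
The notes describe visitors as pluggable, context-specific data extractors, and slide 27 frames extracted properties as Name-Value-Units. The sketch below illustrates that idea in Python under loud assumptions: it is not AMI's code or API, it works on plain text rather than SVG, and every class, function and pattern in it (`Property`, `TemperatureVisitor`, `YieldVisitor`, `run_visitors`) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class Property:
    """One extracted fact in Name-Value-Units form."""
    name: str
    value: float
    units: str


class Visitor(Protocol):
    """Hypothetical plug-in interface: each visitor extracts one kind of fact."""
    def visit(self, text: str) -> List[Property]: ...


class TemperatureVisitor:
    """Finds phrases such as 'melting point of 123.5 °C' (illustrative pattern)."""
    PATTERN = re.compile(
        r"(?P<name>melting point|boiling point)\s+(?:of\s+)?"
        r"(?P<value>-?\d+(?:\.\d+)?)\s*(?P<units>°C|K)"
    )

    def visit(self, text: str) -> List[Property]:
        return [Property(m["name"], float(m["value"]), m["units"])
                for m in self.PATTERN.finditer(text)]


class YieldVisitor:
    """Finds phrases such as 'yield 85 %' (illustrative pattern)."""
    PATTERN = re.compile(
        r"(?P<name>yield)\s+(?:of\s+)?(?P<value>\d+(?:\.\d+)?)\s*(?P<units>%)"
    )

    def visit(self, text: str) -> List[Property]:
        return [Property(m["name"], float(m["value"]), m["units"])
                for m in self.PATTERN.finditer(text)]


def run_visitors(text: str, visitors: Iterable[Visitor]) -> List[Property]:
    """Apply each visitor in turn and collect everything they extract."""
    found: List[Property] = []
    for visitor in visitors:
        found.extend(visitor.visit(text))
    return found


sample = ("The product had a melting point of 123.5 °C and a "
          "boiling point 210 °C; yield 85 %.")
facts = run_visitors(sample, [TemperatureVisitor(), YieldVisitor()])
for fact in facts:
    print(fact)
```

The point of the design is that a new kind of extractor can be added by writing one more class with a `visit` method, without touching `run_visitors` or the existing visitors.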