[1]
Chemicalize.org, SureChemOpen, PubChem and
the InChIKey: A heavenly conjunction with
transformative utility
Christopher Southan, TW2Informatics, Göteborg, Sweden,
ChemAxon UGM, Budapest, May 2013
Image credit: http://www.eso.org/public/images/yb_vlt_moon_cnn_cc/
[2]
Dr Christopher Southan, Ph.D., M.Sc., B.Sc.
TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan@hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIn: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
Presentations: http://www.slideshare.net/cdsouthan
[3]
Abstract
The ChemAxon name-to-structure functionality is not only a component of the SureChem
patent extraction pipeline but also powers chemicalize.org. Both operations now
submit their structures to PubChem as sources. The former has deposited structures that
bring the patent-extracted total in PubChem to 14.5 mill. CIDs. The deposition from
chemicalize.org is ~0.3 mill., but it has been actively selected by users and is 20%
unique. The final conjunction is that all three sources generate the InChIKey (IK), which
turns Google into a de facto merge of PubChem and ChemSpider of ~50 mill. structures.
Chemicalize.org users can convert new patents, other external or internal documents,
and web-based text. Individual results can be Googled or searched against
SureChemOpen, and bulk extractions can be triaged against PubChem. It thus becomes
possible to connect chemistry between patents, papers, abstracts and database
records via exact match or similarity searching. When SureChem and
chemicalize.org update their submissions, relationships with the other ~200 PubChem
sources (including ChEMBL and vendor databases) are re-computed and new CID
links made. The synergy between SureChem and chemicalize.org is powerful because
matches between them (~0.15 mill.), found via SureChemOpen, give occurrence statistics
and the location of the structure within patents. The applications of chemicalize.org
are extended by web tools such as Venny, for determining intersects from multiple
extractions, and CheS-Mapper, for cluster visualization. These utility expansions will be
illustrated by documents specifying BACE1 inhibitors for Alzheimer’s disease.
[4]
Auspicious Conjunctions 2012-13
• PubChem: global chemistry to slice ‘n dice
• SureChemOpen: majority of patent chemistry opened up
• Chemicalize.org: chemistry extractable from any text tombs
• Chemical images: patents extracted in SureChemOpen, OSRA
handles papers
• InChIKey indexing in Google
• ChemSpider: crowdsourcing chemistry quality
• Expanding toolbox, e.g. OPSIN, Venny, CheS-Mapper
• SciBite alerts
• Expanding preview and surfacing options, e.g. ChEMBLntd, GitHub,
OSDD, Open Lab Books, figshare, etc.
• Rise of mobile chemistry
[5]
Databases <> structures <> documents
[Figure: structure counts linking databases to document types (abstracts, patents, papers); values shown: 15 mill, 0.2 mill (MeSH), 0.8 mill (ChEMBL), 12K]
Google InChIKey ~50 million (47m PubChem + 33m UniChem + 28m ChemSpider)
[6]
Triaging chemistry from text
• Identify the structure specification types, e.g.
– Semantic names (all sources)
– Code names (press releases, papers and abstracts)
– IUPAC names (papers, patents and abstracts)
– Images (papers, patents, & Google images)
– SMILES (open lab books)
– InChI strings (open lab books)
– SDF files (open lab books & GitHub)
• Convert these to a structure (e.g. SDF, SMILES, InChI), then:
– Search InChIKey in Google
– Search major databases
– Search SureChemOpen
– Compare extracted sets for intersects and diffs
– Extend exact match connectivity with similarity searching
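The "Search InChIKey in Google" step can be sketched in a few lines. This is a minimal illustration, not the workflow used in the talk: the function names are hypothetical, the aspirin key is only a stand-in, and the check covers the shape of a standard InChIKey (14 letters, 10 letters, 1 letter) rather than its validity. The first 14-character block hashes the connectivity layer, so comparing only that block is one simple way to loosen an exact match.

```python
import re
from urllib.parse import quote_plus

# Shape of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(text):
    """Return True if text has the shape of a standard InChIKey."""
    return bool(INCHIKEY_RE.match(text.strip()))


def google_query_url(inchikey):
    """Build a Google search URL that looks for the key as an exact phrase."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey.strip()}"')


def skeleton(inchikey):
    """Return the first block, which hashes connectivity but not stereochemistry."""
    return inchikey.strip().split("-")[0]


# Stand-in example (aspirin), unrelated to the BACE documents in the talk.
aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(google_query_url(aspirin))
print(skeleton(aspirin))
```

Searching on the output of `skeleton()` instead of the full key would also surface stereoisomers and isotopic variants of the same connectivity, at the cost of more hits to inspect.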
[7]
PubChem Composition
[8]
SureChemOpen Composition (in PubChem)
[9]
Chemicalize.org Composition (in PubChem)
[10]
BACE2 Conjunctions
[11]
BACE2 Conjunctions
[12]
Chemicalize.org Triage
[13]
BACE2 Conjunctions
1. WO2013054291 > chemicalize.org
2. Download 450 structures
3. Upload to PubChem search
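Step 3 could also be scripted rather than uploaded by hand. The sketch below assumes PubChem's PUG REST interface, which the slides do not mention, and assumes that a key with no match comes back as an HTTP 404. All function names are hypothetical. Each InChIKey from the downloaded set is looked up and the set is split into structures already in PubChem and structures that are not.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(inchikey):
    """URL asking PUG REST for the CIDs that carry this InChIKey."""
    return f"{PUG_REST}/compound/inchikey/{inchikey}/cids/JSON"


def parse_cids(payload):
    """Pull the CID list out of a PUG REST JSON reply (empty if absent)."""
    return json.loads(payload).get("IdentifierList", {}).get("CID", [])


def fetch_cids(inchikey, timeout=30.0):
    """Live lookup; a 404 reply is treated here as 'not in PubChem' (assumption)."""
    try:
        with urlopen(cid_lookup_url(inchikey), timeout=timeout) as reply:
            return parse_cids(reply.read().decode("utf-8"))
    except HTTPError as err:
        if err.code == 404:
            return []
        raise


def triage(inchikeys, lookup=fetch_cids):
    """Split keys into {key: cids} already in PubChem and a list of novel keys."""
    known, novel = {}, []
    for key in inchikeys:
        cids = lookup(key)
        if cids:
            known[key] = cids
        else:
            novel.append(key)
    return known, novel
```

Calling `triage(keys)` on a few hundred keys issues one request per key, so a short pause between requests would be a sensible addition to stay within PubChem's request-rate limits.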
[14]
Clustering document extraction sets: CheS-Mapper
[15]
Venny: intersects, diffs, de-dupes and merges
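The four operations named on this slide map directly onto set operations over InChIKeys. The sketch below is not Venny itself, only an illustration of the same comparisons in Python. The two lists are hypothetical extraction sets, filled with the keys of common drugs as stand-ins rather than real BACE extractions.

```python
def dedupe(keys):
    """Drop blanks and repeats while keeping first-seen order."""
    cleaned = (k.strip() for k in keys)
    return list(dict.fromkeys(k for k in cleaned if k))


# Stand-in keys: aspirin, caffeine, ibuprofen, paracetamol.
ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
CAFFEINE = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
IBUPROFEN = "HEFNNWSXXWATRW-UHFFFAOYSA-N"
PARACETAMOL = "RZVAJINKPMORJF-UHFFFAOYSA-N"

# Hypothetical extraction results from two documents.
patent_keys = dedupe([ASPIRIN, CAFFEINE, ASPIRIN, IBUPROFEN])
paper_keys = dedupe([CAFFEINE, PARACETAMOL])

patent_set, paper_set = set(patent_keys), set(paper_keys)
shared = patent_set & paper_set        # intersect
only_patent = patent_set - paper_set   # diff: in the patent, not the paper
only_paper = paper_set - patent_set    # diff: in the paper, not the patent
merged = patent_set | paper_set        # merge

print(len(patent_keys), "unique keys in patent set")
print(len(shared), "shared,", len(only_patent), "patent-only,", len(only_paper), "paper-only")
print(len(merged), "in merged set")
```

Working on InChIKeys rather than names or SMILES keeps the comparison independent of how each source happened to write the structure.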
[16]
Conclusions
• Transformative opening up of chemistry > biology via structure > document connectivity
• Open mining of patent metadata and data
• Expanding toolbox
• Inexorable expansion of open-access publishing
But:
• Journal chemistry extraction > database records still slow
• Text mining of journals still restricted
• Author annotation and direct db submission rare
• Pharmaceutical research publications are still blinding structures (see
PMID: 23159359)
[17]
References
http://www.slideshare.net/cdsouthan/the-patent-chemistry-big-bang-in-pubchem
http://www.slideshare.net/cdsouthan/cs-cax-bioitchemicalizeposter03apr
http://www.ncbi.nlm.nih.gov/pubmed/23399051
http://www.ncbi.nlm.nih.gov/pubmed/23618056
http://www.ncbi.nlm.nih.gov/pubmed/23506624
