www.guidetopharmacology.org
20 million public patent-extracted chemical
structures: a look at the gift horse
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
http://www.guidetopharmacology.org/index.jsp
Prepared for Global Health Compound Design webinar, 30th Nov
Recording should become available below
http://www.mmv.org/research-development/computational-chemistry/global-health-compound-design-webinars
http://www.slideshare.net/cdsouthan/20-mill-public-patent-structures-looking-at-the-gift-horse
Outline
• Good and bad news about chemistry from patents
• Chemical Named Entity Recognition, pros and cons
• Major submitters to PubChem
• New WIPO initiative
• Overlaps between sources
• Examples of CNER caveats
• Roll your own extractions
• Curated activity-to-target mappings
• MMV example
• Conclusions
• References
Looking at informatics gift horses
• We will look at just patent chemistry here
• But any source repays detailed analysis
• What are the statistics of entity and relationship capture?
• Can we assess real-world comparative utility?
• No source is free of caveats, overlaps, complexities, quirks and errors
• So can we ameliorate these during exploitation?
• PubChem submitters can be sliced, diced and compared in detail
• Public sources welcome feedback but may not have resources to implement
• The example below shows the analysis of four “horses” at once
Medicinal chemistry from patents:
good news, part I
• This presentation will focus on bioactivity value, not IP assessments (but I
can try to address IP-related questions)
• Patents are a Cinderella scientific data source whose utility is
underestimated by academics
• They are typically published two to five years before a paper containing
some of the same examples
• They may contain anywhere from 2x to 10x as much SAR as the
eventual paper
• For some filings from world-class medicinal chemistry teams (academic or
commercial), the SAR never appears anywhere else
Good news, part II
• Paradoxically, documents are more “open” than papers (e.g. for text mining)
• The non-redundant primary med. chem. data corpus (first-filings with
composition of matter, classified as C07+A61) is well below 100K
• Examiners' search reports and inventiveness assessments are public
• Citations of papers and other patents are usually extensive
• Massive synthetic protocol and analytical data archive
• Estimated total bioactive compounds ~4-6 million
• A treasure trove for compound design, chemical property extraction (see
slides from previous speaker, Igor Tetko) and many other uses
Bad news: part I
• Data mining is more difficult than for papers
• Access historically dominated by commercial products
• Need to engage with quirks of patent family redundancy, Kind Codes,
patent classifications, 100s of pages of turgid legal text, Markush nests
• Major portals pushing towards 50 million documents
• Some applicants are guilty of varying degrees of obfuscation to make data
mining more difficult (e.g. the “Novel Compound” titles)
• What gets into public databases are not patented structures, merely
structures extracted from patents
Bad news: part II
• Finding first-filings can be difficult
• Judging data quality is a challenge
• Few journal authors cite their patents
• A large proportion of SAR data is “binned” rather than discrete values
• Some applicants don’t declare data values at all
• From public extractions so far, the ratio of bioactive examples to
“other” (including non-med. chem. and artefacts) is ~5:15 million
• Comparing sources indicates constitutive divergence of extraction
• Automated extraction has inadvertently contaminated public databases
with a variety of artefactual structures, running into millions
Chemical Named Entity Recognition (CNER)
• Automated process of documents in > structures out
• SureChEMBL pipeline shown above, other sources similar
• Name-to-Struc (n2s) by look-up and/or IUPAC translation, image-to-struc
(i2s) and mol files from USPTO Complex Work Units (CWUs)
• Indexing is usually added, e.g. by abstract, description and claims
History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP), manual extraction of patents
and papers; 4.3 mil by 2016, ~40% from patents (a guess of ~1.5 mil), now static :)
• 2011 - IBM phase 1, CNER, 2.5 mil
       - SLING Consortium, EPO extraction, 0.1 mil (static)
• 2012 - SCRIPDB, CNER, 4.0 mil (static)
• 2013 - SureChem, CNER, 9.0 mil (> SureChEMBL)
• 2014 - BindingDB, USPTO manual assay mapping, 0.1 mil (active)
• 2015 - CNER
• SureChEMBL, 13.0 mil (active)
• IBM phase 2, 7.0 mil (static)
• NextMove Software, 1.4 mil, synthesis mapping (static)
• 2016 (Nov) - all large sources above = 19.46 mil + ~1.5 mil Thomson
CNER: good news and bad news
• SureChEMBL is the major contribution to public patent chemistry by far
• 17.51 million cpds in UniChem on 22 Nov
• 16.25 million in PubChem up to August
• 8.43 million are novel (i.e. source-unique CIDs)
• In situ chemistry is indexed and downloadable within days of publication
• Complemented by SciBite's automated “bio-entity” indexing (on the fly)
• Powerful query interface
• UniChem cross-indexing (e.g. to PubChem and/or ChEMBL)
But
• SureChEMBL remains the only active CNER source – others are static
• Current feed hiccups are being addressed
• Extraction performance compromised by poor OCR quality in WO
documents and instances of very dense image tables
• Some types of CNER artefacts are introduced in subsequent slides
Major PubChem CNER patent sources at the CID level:
corroboration but also divergence
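[Venn diagram of the three sources. Compound Identifiers (CIDs) in millions, with a union of 17.8 (in 2015):
SCRIPDB = 4.0 (SID:CID 1.5); IBM = 7.9 (SID:CID 1.2); SureChEMBL = 14.6 (SID:CID 1.0).
Segment counts as extracted, without their positions in the diagram: 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]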
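The segment counts in a three-way comparison like this come down to set arithmetic on CID lists. A minimal Python sketch, assuming each source's CIDs have already been downloaded into a set of integers; the function name `venn3_regions` and the toy sets are hypothetical and are not real PubChem CIDs:

```python
def venn3_regions(a, b, c):
    """Return the sizes of the seven exclusive regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }


# Toy CID sets standing in for three sources (hypothetical values).
scripdb = {1, 2, 3, 4}
ibm = {3, 4, 5, 6, 7}
surechembl = {4, 6, 7, 8, 9, 10}

regions = venn3_regions(scripdb, ibm, surechembl)
union_size = len(scripdb | ibm | surechembl)

# The seven exclusive regions always add up to the union.
assert sum(regions.values()) == union_size
print(regions, union_size)
```

The same check applies to the slide: the seven extracted segment counts sum to 17.75, which matches the stated union of 17.8 million after rounding.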
Patent CNER vs. manual bioactivity sources in PubChem:
corroboration along with (expected) divergence
[Venn diagram. Counts (2015) are CIDs in millions:
SCRIPDB + IBM + SureChEMBL = 17.8; Thomson (Reuters) Pharma = 4.3; ChEMBL = 1.4.
Segment counts as extracted, without their positions in the diagram: 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55]
A “new horse” (Oct 2016)
• ~ 7 million structures so far from WO and US from 1978
• WIPO collaboration with InfoChem and NextMove
CNER fragmentation
• Mainly split IUPAC strings but some authentic intermediates
• Compare with selective manual extraction by Thomson/Derwent
Bioactivity-gap: most patent chemistry has no linked data
Comparing the total CNER patent set with a bioactivity-centric source,
e.g. Guide to PHARMACOLOGY (GtoPdb) at 6037 CIDs (2015 numbers)
Patent-unique structures: strange big things
https://www.blogger.com/blogger.g?blogID=2155351992730855318#editor/target=post;postID=8959213643856200429;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=2;src=postname
(post on “chessbordane”)
Mixtures from patents: more confounding than useful
PubChem ameliorates the issue by splitting SID mixtures to component CIDs
while maintaining the mappings
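As an illustration of the idea only (not PubChem's actual pipeline), a mixture written as a dot-disconnected SMILES string can be split into its components while keeping a record of which submission each component came from. The SIDs, the helper name `split_mixture` and the structures below are hypothetical. A real workflow would also canonicalise each component before comparing strings.

```python
def split_mixture(smiles):
    """Split a dot-disconnected SMILES string into its component SMILES."""
    return [part for part in smiles.split(".") if part]


# Hypothetical substance records: submitter record ID -> SMILES as deposited.
substances = {
    "SID_A": "CC(=O)O.CCO",  # acetic acid and ethanol deposited as one mixture
    "SID_B": "CCO",          # ethanol deposited on its own
}

# Map each component structure back to every substance record that contains it.
component_to_sids = {}
for sid, smiles in substances.items():
    for component in split_mixture(smiles):
        component_to_sids.setdefault(component, []).append(sid)

print(component_to_sids)
```

Keeping the component-to-submission mapping lets a user tell a structure extracted on its own from one that appears only inside a mixture record.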
Continual re-extraction of common chemistry
US6589997: missing punctuation > CNER > mixtures
Virtuals I: stereo enumerations from US 20080085923
260 CIDs > 581 SIDs from IBM,
SureChEMBL, SCRIPDB, Thomson
Pharma and Discovery Gate
Virtuals II: deuterated enumerations from US20080045558
986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB; see
http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
Some good news: supplementing CNER with DIY extraction
Either for unprocessed patent documents (e.g. on publication day) or where
the extraction of examples by CNER is clearly gapped
More good news:
expert activity-to-target patent mapping complements CNER
Expert activity-to-target mapping II
http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=2331
Utility example from MMV
Pick up from the SureChEMBL interface with MMV as applicant or C07 + malaria
Following through:
SureChEMBL > PubChem
• CID > “similar compounds” (Tanimoto
90% neighbours) 58 CIDs > cluster
• Generally picks out analogue series
from same patent (i.e. the 118s)
• But note structures from other
sources nesting into the cluster
(e.g. 426, 509, 920, 280 and 308)
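The “Tanimoto 90% neighbours” step can be sketched in a few lines of Python. This is a toy version under stated assumptions: fingerprints are represented as sets of on-bit indices, and the names and bit patterns are invented for illustration. PubChem's own neighbouring uses its substructure fingerprints, which are not reproduced here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical query fingerprint with 20 bits set.
query = set(range(20))

# Hypothetical candidate fingerprints.
fingerprints = {
    "analogue_1": set(range(19)),             # 19 shared bits, union of 20
    "analogue_2": set(range(20)) | {20, 21},  # 20 shared bits, union of 22
    "unrelated": {0, 1, 50, 51, 52},          # 2 shared bits, union of 23
}

threshold = 0.90
neighbours = sorted(
    name for name, fp in fingerprints.items() if tanimoto(query, fp) >= threshold
)
print(neighbours)
```

With these toy fingerprints the two analogues pass the 0.90 cut-off and the unrelated structure does not. This matches the behaviour described above, where the neighbour set mostly recovers one analogue series but can also pull in close structures from other sources.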
Conclusions
• The open patent chemistry “Big Bang” value massively outweighs the
caveats (i.e. it’s a very nice horse - thanks…)
• The majority of med. chem. exemplifications are now out there
• All contributing sources are to be congratulated, and PubChem for
wrangling most of them
• But, it is important to look closely at the gift horse
• We can then resolve and understand quirks, artefacts and pitfalls
• PubChem slicing and filtering can partially ameliorate these
• Activity-to-target mapping for SAR extraction is the main pinch point
• Those without commercial sources are now more enabled for patent mining
• Those with commercial sources can now synergise with open ones
References
http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”)
http://www.ncbi.nlm.nih.gov/pubmed/26194581
http://www.ncbi.nlm.nih.gov/pubmed/23506624
http://www.ncbi.nlm.nih.gov/pubmed/25415348
http://nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037
Southan C: Examples of SAR-centric patent mining using open resources, in Elsevier
COMPREHENSIVE MEDICINAL CHEMISTRY III, July 2017, in press
N.b. from the reproducibility aspect, anyone needing technical tips to
reproduce or extend the PubChem queries used for these slides is welcome
to contact me

More Related Content

What's hot

The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
Dr. Haxel Consult
 
Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...
Chris Southan
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural products
Sunghwan Kim
 

What's hot (16)

ChEMBL+KNIME
ChEMBL+KNIMEChEMBL+KNIME
ChEMBL+KNIME
 
Integrating Patents with Research Data
Integrating Patents with Research DataIntegrating Patents with Research Data
Integrating Patents with Research Data
 
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
 
Benchmarking Commercial RDF Stores with Publications Office Dataset
Benchmarking Commercial RDF Stores with Publications Office DatasetBenchmarking Commercial RDF Stores with Publications Office Dataset
Benchmarking Commercial RDF Stores with Publications Office Dataset
 
The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
 
Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivity
 
5HT2A modulators in GtoPdb and other databses
5HT2A modulators in GtoPdb and other databses5HT2A modulators in GtoPdb and other databses
5HT2A modulators in GtoPdb and other databses
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
 
BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice)
BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice)BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice)
BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice)
 
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
 
Digging out Structures for Repurposing: Non-competitive Intelligence ...
Digging out Structures for Repurposing: Non-competitive Intelligence        ...Digging out Structures for Repurposing: Non-competitive Intelligence        ...
Digging out Structures for Repurposing: Non-competitive Intelligence ...
 
Data exchange alternatives, GIGA TAG (2009)
Data exchange alternatives, GIGA TAG (2009)Data exchange alternatives, GIGA TAG (2009)
Data exchange alternatives, GIGA TAG (2009)
 
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural products
 
BHL Technologies: Review for BHL-Australia
BHL Technologies: Review for BHL-AustraliaBHL Technologies: Review for BHL-Australia
BHL Technologies: Review for BHL-Australia
 

Viewers also liked

EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
ChemAxon
 

Viewers also liked (6)

Correct drug structures for pharmacology
Correct drug structures for pharmacologyCorrect drug structures for pharmacology
Correct drug structures for pharmacology
 
biologydriven
biologydrivenbiologydriven
biologydriven
 
Southan real drugs_paris_oct_11_2014
Southan real drugs_paris_oct_11_2014Southan real drugs_paris_oct_11_2014
Southan real drugs_paris_oct_11_2014
 
Presentation given at UCSF Precision Medicine meeting 4/11/2015
Presentation given at UCSF Precision Medicine meeting 4/11/2015 Presentation given at UCSF Precision Medicine meeting 4/11/2015
Presentation given at UCSF Precision Medicine meeting 4/11/2015
 
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
 
GtoPdb_ITMAT_2017
GtoPdb_ITMAT_2017GtoPdb_ITMAT_2017
GtoPdb_ITMAT_2017
 

Similar to 20 million public patent structures: looking at the gift horse

Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...
Sean Ekins
 
PubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligencePubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligence
Sunghwan Kim
 
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Neue Lösungen für Life Sciences und die Pharmaindustrie mit Graphdatenbanken
Neue Lösungen für Life Sciences und die Pharmaindustrie mit GraphdatenbankenNeue Lösungen für Life Sciences und die Pharmaindustrie mit Graphdatenbanken
Neue Lösungen für Life Sciences und die Pharmaindustrie mit Graphdatenbanken
Neo4j
 
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Chemistry data delivery from the US-EPA to support environmental chemistry
Chemistry data delivery from the US-EPA to support environmental chemistryChemistry data delivery from the US-EPA to support environmental chemistry
Chemistry data delivery from the US-EPA to support environmental chemistry
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 

Similar to 20 million public patent structures: looking at the gift horse (20)

Patent chemisty big bang: utilities for SMEs
Patent chemisty big bang: utilities for SMEsPatent chemisty big bang: utilities for SMEs
Patent chemisty big bang: utilities for SMEs
 
Pros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemPros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChem
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
 
Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...
 
Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases
 
Is 20TB really Big Data?
Is 20TB really Big Data?Is 20TB really Big Data?
Is 20TB really Big Data?
 
EUGM 2014 - Alfonso Pozzan (Aptuit): Expanding the scope of “literature data”...
EUGM 2014 - Alfonso Pozzan (Aptuit): Expanding the scope of “literature data”...EUGM 2014 - Alfonso Pozzan (Aptuit): Expanding the scope of “literature data”...
EUGM 2014 - Alfonso Pozzan (Aptuit): Expanding the scope of “literature data”...
 
PubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligencePubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligence
 
FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCP
 
Enhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort DataEnhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort Data
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
 
Using open data, services and source software to deliver the EPA CompTox Chem...
Using open data, services and source software to deliver the EPA CompTox Chem...Using open data, services and source software to deliver the EPA CompTox Chem...
Using open data, services and source software to deliver the EPA CompTox Chem...
 
Neue Lösungen für Life Sciences und die Pharmaindustrie mit Graphdatenbanken
Neue Lösungen für Life Sciences und die Pharmaindustrie mit GraphdatenbankenNeue Lösungen für Life Sciences und die Pharmaindustrie mit Graphdatenbanken
Neue Lösungen für Life Sciences und die Pharmaindustrie mit Graphdatenbanken
 
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference DatabaseDevelopment of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistry
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Patent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTSPatent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTS
 
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
 
Chemistry data delivery from the US-EPA to support environmental chemistry
Chemistry data delivery from the US-EPA to support environmental chemistryChemistry data delivery from the US-EPA to support environmental chemistry
Chemistry data delivery from the US-EPA to support environmental chemistry
 

More from Chris Southan

Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2
Chris Southan
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug Development
Chris Southan
 

More from Chris Southan (20)

Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulations
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updae
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug Development
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCP
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteins
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFER
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 poster
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagens
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biology
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand up
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide Tribulations
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIR
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology update
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProt
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
Pub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityPub Med to PubChem Connectivity
Pub Med to PubChem Connectivity
 
The IUPHAR/MMV Guide to Malaria Pharmacology
The  IUPHAR/MMV Guide to Malaria Pharmacology  The  IUPHAR/MMV Guide to Malaria Pharmacology
The IUPHAR/MMV Guide to Malaria Pharmacology
 

Recently uploaded

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Sérgio Sacani
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
University of Hertfordshire
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
RohitNehra6
 

Recently uploaded (20)

fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Natural Polymer Based Nanomaterials

20 million public patent structures: looking at the gift horse

  • 1. www.guidetopharmacology.org 20 million public patent-extracted chemical structures: a look at the gift horse Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh http://www.guidetopharmacology.org/index.jsp Prepared for Global Health Compound Design webinar, 30th Nov Recording should become available below http://www.mmv.org/research-development/computational-chemistry/global-health-compound-design-webinars http://www.slideshare.net/cdsouthan/20-mill-public-patent-structures-looking-at-the-gift-horse 1
  • 2. Outline • Good and bad news about chemistry from patents • Chemical Named Entity Recognition, pros and cons • Major submitters to PubChem • New WIPO initiative • Overlaps between sources • Examples of CNER caveats • Roll your own extractions • Curated activity-to-target mappings • MMV example • Conclusions • References 2
  • 3. Looking at informatics gift horses • We will look at just patent chemistry here • But any source repays detailed analysis • What are the statistics of entity and relationship capture? • Can we assess real-world comparative utility? • No source is free of caveats, overlaps, complexities, quirks and errors • So can we ameliorate these during exploitation? • PubChem submitters can be sliced, diced and compared in detail • Public sources welcome feedback but may not have resources to implement • The example below shows the analysis of four “horses” at once 3
  • 4. Medicinal chemistry from patents: good news, part I • This presentation will focus on bioactivity value, not IP assessments (but I can try to address IP-related questions) • Patents are a Cinderella scientific data source whose utility is underestimated by academics • They typically publish two to five years before a paper with some of the same examples • They may contain 2x to 10x as much SAR as an eventual paper • For some filings from world-class medicinal chemistry teams (academic or commercial), the SAR never appears anywhere else 4
  • 5. Good news, part II • Paradoxically, documents are more “open” than papers (e.g. for text mining) • The non-redundant primary med. chem. data corpus (first-filings with composition of matter, classified as C07+A61) is well below 100K • Examiners' search reports and inventiveness assessments are public • Citations of papers and other patents are usually extensive • Massive synthetic protocol and analytical data archive • Estimated total bioactive compounds ~4-6 million • A treasure trove for compound design, chemical property extraction (see slides from previous speaker, Igor Tetko) and many other uses 5
  • 6. Bad news: part I • Data mining is more difficult than for papers • Access historically dominated by commercial products • Need to engage with quirks of patent family redundancy, Kind Codes, patent classifications, 100s of pages of turgid legal text, Markush nests • Major portals pushing towards 50 million documents • Some applicants are guilty of varying degrees of obfuscation to make data mining more difficult (e.g. the “Novel Compound” titles) • What gets into public databases are not patented structures, merely structures extracted from patents 6
  • 7. Bad news: part II • Finding first-filings can be difficult • Judging data quality is a challenge • Few journal authors cite their patents • A large proportion of SAR data is “binned” rather than discrete values • Some applicants don’t declare data values at all • From public extractions so far, the proportion of bioactive examples: “other” (including non-med. chem. and artefacts) is ~ 5:15 million • Comparing sources indicates constitutive divergence of extraction • Automated extraction has inadvertently contaminated public databases with a variety of artefactual structures, running into millions 7
  • 8. Chemical Named Entity Recognition (CNER) • Automated process of documents in > structures out • SureChEMBL pipeline shown above, other sources similar • Name-to-Struc (n2s) by look-up and/or IUPAC translation, image-to-struc (i2s) and mol files from USPTO Complex Work Units (CWUs) • Indexing usually added e.g. abstract, descriptions, claims 8
  • 9. History of patent chemistry feeds into PubChem • 2006 - Thomson (Reuters) Pharma (TRP) manual extraction of patents and papers, 2016 4.3 mil, ~40% patents, guess ~1.5 mil, now static :) • 2011 - IBM phase 1 CNER 2.5 mil; SLING Consortium EPO extraction 0.1 mil (static) • 2012 - SCRIPDB, CNER 4.0 mil (static) • 2013 - SureChem, CNER 9.0 mil (> SureChEMBL) • 2014 - BindingDB USPTO manual assay mapping 0.1 mil (active) • 2015 - CNER: SureChEMBL 13.0 mil (active); IBM phase 2, 7.0 mil (static); NextMove Software 1.4 mil synthesis mapping (static) • 2016 (Nov) all large sources above = 19.46 mil + ~1.5 mil Thomson 9
  • 10. CNER: good news and bad news • SureChEMBL is the major contribution to public patent chemistry by far • 17.51 million cpds in UniChem on 22 Nov • 16.25 million in PubChem up to August • 8.43 million are novel (i.e. source-unique CIDs) • In situ chemistry is indexed and downloadable within days of publication • Complemented by SciBite's automated “bio-entity” indexing (on the fly) • Powerful query interface • UniChem cross-indexing (e.g. to PubChem and/or ChEMBL) But • SureChEMBL remains the only active CNER source, the others are static • Current feed hiccups are being addressed • Extraction performance compromised by poor OCR quality in WO documents and instances of very dense image tables • Some types of CNER artefacts are introduced in subsequent slides 10
  • 11. Major PubChem CNER patent sources at the CID level: corroboration but also divergence • [Venn diagram] SCRIPDB = 4.0 (SID:CID 1.5), IBM = 7.9 (SID:CID 1.2), SureChEMBL = 14.6 (SID:CID 1.0) • Venn segment counts: 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95 • Compound Identifiers (CIDs) in millions with a union of 17.8 (in 2015) 11
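
As a rough illustration of how a three-source overlap like the one on slide 11 can be tallied, the sketch below counts the seven Venn regions for three identifier sets and then checks that the seven segment values quoted on the slide add up to the stated union. The CID sets are toy values and the function name `venn3_regions` is hypothetical; the slide does not say how its counts were produced.

```python
from typing import Dict, Set


def venn3_regions(a: Set[int], b: Set[int], c: Set[int]) -> Dict[str, int]:
    """Count identifiers in each of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Toy CID sets (hypothetical values, not real PubChem identifiers).
scripdb = {1, 2, 3, 4}
ibm = {3, 4, 5, 6, 7}
surechembl = {4, 6, 7, 8, 9, 10}

regions = venn3_regions(scripdb, ibm, surechembl)
union_size = len(scripdb | ibm | surechembl)

# The seven regions are disjoint, so their counts always sum to the union.
assert sum(regions.values()) == union_size

# Same consistency check on the segment values from the slide (millions of CIDs).
slide_segments = [0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]
slide_total = sum(slide_segments)
print(regions)
print(f"slide segments sum to {slide_total:.2f} million (quoted union: 17.8)")
```

The slide's segments sum to about 17.75 million, which agrees with the quoted union of 17.8 to within rounding.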
  • 12. Patent CNER vs. manual bioactivity sources in PubChem: corroboration along with (expected) divergence • [Venn diagram] SCRIPDB + IBM + SureChEMBL = 17.8, Thomson (Reuters) Pharma = 4.3, ChEMBL = 1.4 • Venn segment counts: 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55 • Counts (2015) are CIDs in millions 12
  • 13. A “new horse” (Oct 2016) • ~7 million structures so far from WO and US from 1978 • WIPO collaboration with InfoChem and NextMove 13
  • 14. CNER fragmentation • Mainly split IUPAC strings but some authentic intermediates • Compare with selective manual extraction by Thomson/Derwent 14
  • 15. Bioactivity-gap: most patent chemistry has no linked data • Comparing the total CNER patent set with a bioactivity-centric source e.g. Guide to PHARMACOLOGY (GtoPdb) at 6037 CIDs (2015 numbers) 15
  • 16. Patent-unique structures: strange big things • https://www.blogger.com/blogger.g?blogID=2155351992730855318#editor/target=post;postID=8959213643856200429;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=2;src=postname (post on “chessbordane”) 16
  • 17. Mixtures from patents: more confounding than useful • PubChem ameliorates the issue by splitting SID mixtures to component CIDs while maintaining the mappings 17
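
The idea of splitting a mixture record into components while keeping the mapping can be sketched as follows. This is only a toy illustration, not PubChem's actual processing: it assumes the mixture is written as a SMILES string in which `.` separates disconnected components, and it compares components as raw strings rather than canonical structures. The function name `split_mixture` and the example record are hypothetical.

```python
from typing import Dict, List


def split_mixture(smiles: str) -> List[str]:
    """Split a multi-component SMILES on '.' into unique components, keeping first-seen order."""
    seen = set()
    components = []
    for part in smiles.split("."):
        if part and part not in seen:
            seen.add(part)
            components.append(part)
    return components


# Hypothetical mixture record: ethanol, acetic acid, and ethanol repeated.
mixture = "CCO.CC(=O)O.CCO"
components = split_mixture(mixture)

# Keep the component-to-mixture mapping so the parent record stays traceable.
component_to_mixture: Dict[str, str] = {c: mixture for c in components}

print(components)
```

A real pipeline would canonicalise each component with a cheminformatics toolkit before de-duplicating, since the same structure can be written as many different SMILES strings.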
  • 18. Continual re-extraction of common chemistry 18
  • 19. US6589997: missing punctuation > CNER > mixtures 19
  • 20. Virtuals I: stereo enumerations from US20080085923 • 260 CIDs > 581 SIDs from IBM, SureChEMBL, SCRIPDB, Thomson Pharma and Discovery Gate 20
  • 21. Virtuals II: deuterated enumerations from US20080045558 • 986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB • http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs 21
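
For the two enumeration examples on slides 20 and 21, the extent of multi-source re-submission can be expressed as a SID:CID ratio, the same measure quoted per source on slide 11. The short calculation below uses only the counts given on those two slides; the helper name `sid_to_cid_ratio` is hypothetical.

```python
def sid_to_cid_ratio(sids: int, cids: int) -> float:
    """Average number of substance records (SIDs) per unique compound (CID)."""
    if cids <= 0:
        raise ValueError("cids must be a positive integer")
    return sids / cids


# Counts taken from slides 20 and 21.
stereo_ratio = sid_to_cid_ratio(sids=581, cids=260)
deuterated_ratio = sid_to_cid_ratio(sids=2818, cids=986)

print(f"stereo enumerations:     {stereo_ratio:.2f} SIDs per CID")
print(f"deuterated enumerations: {deuterated_ratio:.2f} SIDs per CID")
```

Both ratios come out above 2, i.e. on average each of these virtual structures was submitted by more than two sources.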
  • 22. Some good news: supplementing CNER with DIY extraction • Either for unprocessed patent documents (e.g. on publication day) or where the extraction of examples by CNER is clearly gapped 22
  • 23. More good news: expert activity-to-target patent mapping complements CNER 23
  • 24. Expert activity-to-target mapping II • http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=2331 24
  • 25. Utility example from MMV • Pick up from the SureChEMBL interface with MMV as applicant or C07 + malaria 25
  • 26. Following through: SureChEMBL > PubChem • CID > “similar compounds” (Tanimoto 90% neighbours) 58 CIDs > cluster • Generally picks out analogue series from same patent (i.e. the 118s) • But note structures from other sources nesting into the cluster (e.g. 426, 509, 920, 280 and 308) 26
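
The “Tanimoto 90% neighbours” step on slide 26 can be illustrated with a minimal sketch. It assumes each structure has already been reduced to a set of “on” fingerprint bits. The bit sets, compound labels and function names below are hypothetical toy values, and PubChem's own similarity search relies on its own fingerprint rather than anything shown here.

```python
from typing import Dict, List, Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits in either fingerprint."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def neighbours(query: Set[int], library: Dict[str, Set[int]], threshold: float = 0.9) -> List[str]:
    """Return the names of library entries whose similarity to the query meets the threshold."""
    return sorted(name for name, fp in library.items() if tanimoto(query, fp) >= threshold)


# Toy fingerprints as sets of on-bit positions (not real structures).
query_fp = set(range(20))
library = {
    "analogue_1": set(range(19)),          # 19 shared bits out of 20 in the union
    "analogue_2": set(range(20)) | {20},   # 20 shared bits out of 21 in the union
    "unrelated": set(range(10, 30)),       # 10 shared bits out of 30 in the union
}

hits = neighbours(query_fp, library, threshold=0.9)
print(hits)
```

With a 0.9 cut-off only the two close analogues are returned, which mirrors how a neighbour search tends to pull out an analogue series around a query compound.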
  • 27. Conclusions • The open patent chemistry “Big Bang” value massively outweighs the caveats (i.e. it’s a very nice horse - thanks…) • The majority of med. chem. exemplifications are now out there • All contributing sources are to be congratulated, and PubChem for wrangling most of them • But, it is important to look closely at the gift horse • We can then resolve and understand quirks, artefacts and pitfalls • PubChem slicing and filtering can partially ameliorate these • Activity-to-target mapping for SAR extraction is the main pinch point • Those without commercial sources are now more enabled for patent mining • Those with commercial sources can now synergise with open ones 27
  • 28. References • http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”) • http://www.ncbi.nlm.nih.gov/pubmed/26194581 • http://www.ncbi.nlm.nih.gov/pubmed/23506624 • http://www.ncbi.nlm.nih.gov/pubmed/25415348 • http://nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037 • Southan C: Examples of SAR-centric patent mining using open resources, in Elsevier COMPREHENSIVE MEDICINAL CHEMISTRY III, July 2017, in press • N.b. from the reproducibility aspect, anyone needing technical tips to reproduce or extend the PubChem queries used for these slides is welcome to contact me 28