20 million public patent structures: looking at the gift horse

www.guidetopharmacology.org
20 million public patent-extracted chemical
structures: a look at the gift horse
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
http://www.guidetopharmacology.org/index.jsp
Prepared for Global Health Compound Design webinar, 30th Nov
Recording should become available at:
http://www.mmv.org/research-development/computational-chemistry/global-health-compound-design-webinars
http://www.slideshare.net/cdsouthan/20-mill-public-patent-structures-looking-at-the-gift-horse

Outline
• Good and bad news about chemistry from patents
• Chemical Named Entity Recognition, pros and cons
• Major submitters to PubChem
• New WIPO initiative
• Overlaps between sources
• Examples of CNER caveats
• Roll your own extractions
• Curated activity-to-target mappings
• MMV example
• Conclusions
• References

Looking at informatics gift horses
• We will look at just patent chemistry here
• But any source repays detailed analysis
• What are the statistics of entity and relationship capture?
• Can we assess real-world comparative utility?
• No source is free of caveats, overlaps, complexities, quirks and errors
• So can we ameliorate these during exploitation?
• PubChem submitters can be sliced, diced and compared in detail
• Public sources welcome feedback but may not have resources to implement
• The example below shows the analysis of four “horses” at once

Medicinal chemistry from patents:
good news, part I
• This presentation will focus on bioactivity value, not IP assessments (but I
can try to address IP-related questions)
• Patents are a Cinderella scientific data source whose utility is underestimated by academics
• They are typically published two to five years before a paper with some of the same examples
• They may contain anywhere from 2x to 10x as much SAR as an eventual paper
• For some filings from world-class medicinal chemistry teams (academic or commercial), the SAR never appears anywhere else

Good news, part II
• Paradoxically, documents are more “open” than papers (e.g. for text mining)
• The non-redundant primary med. chem. data corpus (first-filings with
composition of matter, classified as C07+A61) is well below 100K
• Examiners' search reports and inventiveness assessments are public
• Citations of papers and other patents are usually extensive
• Massive synthetic protocol and analytical data archive
• Estimated total bioactive compounds ~4-6 million
• A treasure trove for compound design, chemical property extraction (see
slides from previous speaker, Igor Tetko) and many other uses

Bad news: part I
• Data mining is more difficult than for papers
• Access historically dominated by commercial products
• Need to engage with quirks of patent family redundancy, Kind Codes, patent classifications, 100s of pages of turgid legal text, Markush nests
• Major portals pushing towards 50 million documents
• Some applicants are guilty of varying degrees of obfuscation to make data
mining more difficult (e.g. the “Novel Compound” titles)
• What gets into public databases are not patented structures, merely
structures extracted from patents

Bad news: part II
• Finding first-filings can be difficult
• Judging data quality is a challenge
• Few journal authors cite their patents
• A large proportion of SAR data is “binned” rather than discrete values
• Some applicants don’t declare data values at all
• From public extractions so far, the ratio of bioactive examples to “other” (including non-med. chem. and artefacts) is ~5:15 million
• Comparing sources indicates constitutive divergence of extraction
• Automated extraction has inadvertently contaminated public databases
with a variety of artefactual structures, running into millions

Chemical Named Entity Recognition (CNER)
• Automated process of documents in > structures out
• SureChEMBL pipeline shown above, other sources similar
• Name-to-Struc (n2s) by look-up and/or IUPAC translation, image-to-struc (i2s) and mol files from USPTO Complex Work Units (CWUs)
• Indexing by document section is usually added, e.g. abstract, descriptions, claims
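
As a rough illustration of the look-up half of n2s, the sketch below matches words in a text against a small name-to-SMILES dictionary. The dictionary, the function name `lookup_names` and the sample sentence are all hypothetical. Real pipelines use far larger dictionaries plus IUPAC parsing and image processing, none of which is reproduced here.

```python
import re

# Hypothetical mini-dictionary of names to SMILES (illustration only).
NAME_TO_SMILES = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "ethanol": "CCO",
}

def lookup_names(text, dictionary=NAME_TO_SMILES):
    """Return (name, SMILES) pairs for dictionary names found in the text."""
    hits = []
    seen = set()
    # Crude tokenisation: runs of letters, digits and hyphens.
    for token in re.findall(r"[A-Za-z][A-Za-z0-9\-]*", text):
        key = token.lower()
        if key in dictionary and key not in seen:
            seen.add(key)
            hits.append((key, dictionary[key]))
    return hits

sample = "The residue was dissolved in ethanol and compared with Aspirin."
print(lookup_names(sample))
```

Note that a pure look-up also captures solvents and reagents (ethanol here), which is one reason common chemistry is extracted again and again.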

History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP), manual extraction of patents and papers; 4.3 mil by 2016, ~40% from patents (at a guess, ~1.5 mil); now static
• 2011 - IBM phase 1, CNER, 2.5 mil
• 2011 - SLING Consortium, EPO extraction, 0.1 mil (static)
• 2012 - SCRIPDB, CNER, 4.0 mil (static)
• 2013 - SureChem, CNER, 9.0 mil (> SureChEMBL)
• 2014 - BindingDB, USPTO manual assay mapping, 0.1 mil (active)
• 2015 - CNER
  • SureChEMBL, 13.0 mil (active)
  • IBM phase 2, 7.0 mil (static)
  • NextMove Software, 1.4 mil, synthesis mapping (static)
• 2016 (Nov) - all large sources above = 19.46 mil + ~1.5 mil Thomson

CNER: good news and bad news
• SureChEMBL is the major contribution to public patent chemistry by far
• 17.51 million cpds in UniChem on 22 Nov
• 16.25 million in PubChem up to August
• 8.43 million are novel (i.e. source-unique CIDs)
• In situ chemistry is indexed and downloadable within days of publication
• Complemented by SciBite's automated “bio-entity” indexing (on the fly)
• Powerful query interface
• UniChem cross-indexing (e.g. to PubChem and/or ChEMBL)
But:
• SureChEMBL remains the only active CNER source; the others are static
• Current feed hiccups are being addressed
• Extraction performance compromised by poor OCR quality in WO
documents and instances of very dense image tables
• Some types of CNER artefacts are introduced in subsequent slides

Major PubChem CNER patent sources at the CID level:
corroboration but also divergence
[Venn diagram: Compound Identifiers (CIDs) in millions, with a union of 17.8 (in 2015)]
• SCRIPDB = 4.0 (SID:CID 1.5)
• IBM = 7.9 (SID:CID 1.2)
• SureChEMBL = 14.6 (SID:CID 1.0)
• Venn region counts (region labels not shown): 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95
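
The seven region counts of such a three-way comparison partition the union, which is why the seven values on this slide sum to about 17.8. A minimal sketch of the calculation follows. It assumes each source is available as a plain set of CIDs; the function name `venn3_regions` and the toy identifiers are made up for illustration.

```python
def venn3_regions(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }

# Toy CID sets standing in for three PubChem sources (made-up identifiers).
scripdb = {1, 2, 3, 4}
ibm = {3, 4, 5, 6, 7}
surechembl = {4, 6, 7, 8, 9, 10}

regions = venn3_regions(scripdb, ibm, surechembl)
union_size = len(scripdb | ibm | surechembl)
# The seven regions partition the union, so their sizes must add up to it.
assert sum(regions.values()) == union_size
print(regions, union_size)
```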

Patent CNER vs. manual bioactivity sources in PubChem:
corroboration along with (expected) divergence
[Venn diagram: counts (2015) are CIDs in millions]
• SCRIPDB + IBM + SureChEMBL = 17.8
• Thomson (Reuters) Pharma = 4.3
• ChEMBL = 1.4
• Venn region counts (region labels not shown): 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55

A “new horse” (Oct 2016)
• ~ 7 million structures so far from WO and US from 1978
• WIPO collaboration with InfoChem and NextMove

CNER fragmentation
• Mainly split IUPAC strings but some authentic intermediates
• Compare with selective manual extraction by Thomson/Derwent

Bioactivity-gap: most patent chemistry has no linked data
Comparing the total CNER patent set with a bioactivity-centric source, e.g. Guide to PHARMACOLOGY (GtoPdb), at 6037 CIDs (2015 numbers)

Patent-unique structures: strange big things
Post on “chessbordane”:
https://www.blogger.com/blogger.g?blogID=2155351992730855318#editor/target=post;postID=8959213643856200429;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=2;src=postname

Mixtures from patents: more confounding than useful
PubChem ameliorates the issue by splitting SID mixtures to component CIDs
while maintaining the mappings
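
The idea of splitting a mixture into components while keeping the mapping can be sketched with SMILES, where a dot separates disconnected components. This is a naive string split for illustration only, not PubChem's actual standardisation. It ignores edge cases such as ring-closure digits that bond across a dot, so a cheminformatics toolkit would be the safer choice in practice. The function name and the example record are hypothetical.

```python
def split_mixture(smiles):
    """Split a dot-disconnected SMILES into its component SMILES strings."""
    return [part for part in smiles.split(".") if part]

# Hypothetical multi-component record: sodium acetate written as two ions.
mixture = "CC(=O)[O-].[Na+]"
components = split_mixture(mixture)

# Keep the mapping from the original record to its components.
mapping = {mixture: components}
print(mapping)
```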

Continual re-extraction of common chemistry

US6589997: missing punctuation > CNER > mixtures

Virtuals I: stereo enumerations from US 20080085923
260 CIDs > 581 SIDs from IBM, SureChEMBL, SCRIPDB, Thomson Pharma and Discovery Gate

Virtuals II: deuterated enumerations from US20080045558
986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB
http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs

Some good news: supplementing CNER with DIY extraction
Useful either for unprocessed patent documents (e.g. on publication day) or where the CNER extraction of examples clearly has gaps
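
One possible first step for a DIY extraction is to pull candidate example names out of the patent text before passing them to a name-to-structure tool. The sketch below assumes headings of the form "Example N: name" or "EXAMPLE N. name", which is a hypothetical layout. Real patents vary widely, so the pattern would need adapting per document. The function name and sample text are illustrative only.

```python
import re

# Matches headings such as "Example 12: <name>" or "EXAMPLE 3. <name>".
EXAMPLE_PATTERN = re.compile(
    r"^\s*example\s+(\d+)\s*[:.]\s*(.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_example_names(text):
    """Return {example number: candidate name string} from patent text."""
    return {int(num): name for num, name in EXAMPLE_PATTERN.findall(text)}

sample = """
Example 1: 2-acetoxybenzoic acid
The title compound was prepared as described above.
EXAMPLE 2. N-(4-hydroxyphenyl)acetamide
"""
print(extract_example_names(sample))
```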

More good news:
expert activity-to-target patent mapping complements CNER

Expert activity-to-target mapping II
http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=2331

Utility example from MMV
Pick up from the SureChEMBL interface with MMV as applicant or C07 + malaria

Following through:
SureChEMBL > PubChem
• CID > “similar compounds” (Tanimoto 90% neighbours) 58 CIDs > cluster
• Generally picks out analogue series from same patent (i.e. the 118s)
• But note structures from other sources nesting into the cluster (e.g. 426, 509, 920, 280 and 308)
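
For readers unfamiliar with the Tanimoto threshold used above, the coefficient is the number of fingerprint bits shared by two structures divided by the number of bits set in either. The sketch below applies a 0.9 cut-off to made-up fingerprints represented as sets of on-bit indices. These are not PubChem's actual fingerprints, and the function name is illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)

# Made-up fingerprints: each set holds the indices of bits that are switched on.
query = set(range(20))
analogue = set(range(20)) | {25}   # shares all 20 query bits, plus one extra
unrelated = {2, 5, 40, 41}         # shares only 2 bits with the query

for name, fp in [("analogue", analogue), ("unrelated", unrelated)]:
    score = tanimoto(query, fp)
    label = "neighbour" if score >= 0.9 else "not a neighbour"
    print(name, round(score, 3), label)
```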

Conclusions
• The open patent chemistry “Big Bang” value massively outweighs the
caveats (i.e. it’s a very nice horse - thanks…)
• The majority of med. chem. exemplifications are now out there
• All contributing sources are to be congratulated, as is PubChem for wrangling most of them
• But, it is important to look closely at the gift horse
• We can then resolve and understand quirks, artefacts and pitfalls
• PubChem slicing and filtering can partially ameliorate these
• Activity-to-target mapping for SAR extraction is the main pinch point
• Those without commercial sources are now more enabled for patent mining
• Those with commercial sources can now synergise with open ones

References
http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”)
http://www.ncbi.nlm.nih.gov/pubmed/26194581
http://www.ncbi.nlm.nih.gov/pubmed/23506624
www.ncbi.nlm.nih.gov/pubmed/25415348
//nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037
Southan C: Examples of SAR-centric patent mining using open resources, in Elsevier COMPREHENSIVE MEDICINAL CHEMISTRY III, July 2017, in press
N.b. from the reproducibility aspect, anyone needing technical tips to reproduce or extend the PubChem queries used for these slides is welcome to contact me
