Which Drug Did You Mean?
1. Which Drug Did You Mean?
Resolving the linkage spaghetti between
semantic names, structures, bioactivity
and mixtures
Christopher Southan
ChrisDS Consulting, Göteborg, Sweden
Prepared for BioIT, Boston, April 2012, Track 14, Tuesday
See also
http://cdsouthan.blogspot.se/2012/06/will-real-bosinhib-please-stand-up-take.html
[1]
2. History of Drug Names
Approximate timelines
• Compound registration system: structure and ID
• Patent: IUPAC name or image
• Internal code name(s), externally blinded
• Code name(s) with structure declared externally
• Journal papers
• International Nonproprietary Name (INN)
• INN indexed in MeSH
• USAN, BAN, JAN
• Brand name(s)
• Combination brand
[2]
4. Causes of Drug Linkage Spaghetti (I)
• Tautomer/stereo multiplexing and structure interconversion differences (e.g.
complex antibiotics)
• Popular structures > 100s of submitters > many vendors > more noise
• Opaque ecosystem of primary submitters, secondary linkers, declared circularity,
cryptic circularity, and submitters having independent portals with different rules
• Older drugs accumulate hundreds of synonyms and database x-refs, with errors
• Accumulated wet assay results are dependent on how long the drug has been in
which public screening collection
• Deprecated structures not always refreshed between databases globally
• Pro-drugs, metabolites or tested combinations rarely have explicit x-refs
[4]
5. Causes of Drug Linkage Spaghetti (II)
• Literature extractions flowing into drug databases (including MeSH) can have
– Author errors and paucity of standards in the primary report
– No quality filtration at the result level
– Curation errors and different annotation rules
– No discrimination of independent de-novo checking from annotation recycling
• Large-scale patent extraction feeds into databases bring in
– Forests of analogues with no data links
– High redundancy for drugs and leads
– Structural differences between pipeline outputs
– Opportunistic permutations of salts and mixtures
– Opportunistic virtual deuteration of all best-selling drugs
• Drug discovery operations use many drugs as reference compounds in their
internal screening collections. This means
– Name > structure cross-mapping, internal, public and commercial
– Integration of internal and external data across the same drugs
[5]
6. Atorvastatin
• The scale of links provides a good cross section of problems
• Relationship cross-mappings and the PubChem tool-box
facilitate navigation through the links
• External submissions each get a substance ID (SID); SIDs are
merged into compound records (CID) via chemistry rules (see
PubChem documentation)
• This drug has accumulated years of submissions from different
sources, BioAssay entries and pharmacology literature links
• The parent CID 60823 has
– 99 synonyms
– 6 stereo forms
– 70 canonically related structures
– 449 substance records
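
The SID-to-CID merge described above can be pictured with a toy sketch. This is not PubChem's actual rule set: the records, SIDs, submitter names and structure keys below are all invented for illustration, and a simple exact match on a precomputed key stands in for the real chemistry rules.

```python
from collections import defaultdict

# Toy substance records: (SID, submitter, structure key).
# Every value here is hypothetical; a real key would come from a
# chemistry toolkit after structure standardisation.
substances = [
    (1, "vendor_a", "KEY-PARENT"),
    (2, "vendor_b", "KEY-PARENT"),
    (3, "patent_feed", "KEY-CALCIUM-SALT"),
    (4, "vendor_a", "KEY-CALCIUM-SALT"),
    (5, "literature", "KEY-PARENT"),
]


def merge_substances(records):
    """Group SIDs that share a structure key under one compound entry."""
    by_key = defaultdict(list)
    for sid, _submitter, key in records:
        by_key[key].append(sid)
    # Assign sequential toy CIDs in order of first appearance.
    return {cid: sids for cid, sids in enumerate(by_key.values(), start=1)}


compounds = merge_substances(substances)
for cid, sids in compounds.items():
    print(f"toy CID {cid}: {len(sids)} substance records -> SIDs {sids}")
```

Any difference in how a submitter drew the structure (salt, stereo, tautomer) changes the key, which is one way a single drug ends up spread across several CIDs.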
[6]
11. Drug BioAssay Data: Splitting by
Submitted Structure Differences
Mainly uHTS and counterscreens
from Scripps & Burnham
AIDs 406848-53 in ChEMBL –
(antimalarial assay specified salt)
ChEMBL Antimalarial strain assays
(also specified salt), in vivo plus
three target links
Mainly qHTS from NCGC, no hits
[11]
12. Pharmacological Activity in vivo is ~70% Active
Metabolites i.e. not Atorvastatin
Hazardous Substances Data Bank x-ref in the CID, but no
direct links to the metabolites (yet).
Only one in-vitro assay result for 9808225.
CID 60823
CID 9808225
CID 9851106
[12]
13. Salt Confusion (I) Atorvastatin Calcium
CID 656846, Mw 1209, CAS 344423-98-9
FDA package insert label: hemicalcium trihydrate
CID 60822, Mw 1155, CAS 134523-03-8
CID 11227182, Mw 598
INN = atorvastatin
USAN/BAN = atorvastatin calcium
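
The three Mw values can be reproduced with simple arithmetic, which helps show why the salt forms land on different CIDs. The sketch below assumes the atorvastatin free acid formula C33H35FN2O5 and standard atomic weights. Assigning each stoichiometry to a particular Mw is an inference from the numbers on the slide, not something the slide states.

```python
# Standard atomic weights (rounded); only the elements needed here.
ATOMIC_WEIGHT = {
    "C": 12.011, "H": 1.008, "F": 18.998,
    "N": 14.007, "O": 15.999, "Ca": 40.078,
}


def mol_weight(formula):
    """Sum atomic weights for a formula given as {element: count}."""
    return sum(ATOMIC_WEIGHT[element] * n for element, n in formula.items())


# Assumed free acid formula for atorvastatin: C33H35FN2O5.
free_acid = mol_weight({"C": 33, "H": 35, "F": 1, "N": 2, "O": 5})
anion = free_acid - ATOMIC_WEIGHT["H"]        # carboxylate after losing H+
water = mol_weight({"H": 2, "O": 1})

one_to_one = anion + ATOMIC_WEIGHT["Ca"]      # one anion per calcium
hemicalcium = 2 * anion + ATOMIC_WEIGHT["Ca"] # two anions per calcium
trihydrate = hemicalcium + 3 * water          # hemicalcium salt plus 3 H2O

print(round(one_to_one), round(hemicalcium), round(trihydrate))
```

Under these assumptions the rounded values match the 598, 1155 and 1209 shown for the three CIDs.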
[13]
14. Salt Confusion (II): What gets to Patients
CID 656846
CID 53252956
CID 23665101
No INNs, USANs or clinical trials entries for these salts
[14]
15. Mixtures: Problematic all Round
• Atorvastatin parent (CID 60823) has 379 mixture SIDs and 147 mixture CIDs
permutated from 122 component CIDs
• Of the 122 components, 58 have a MeSH pharmacology tag, 92 have
BioAssay results, 70 are in DrugBank, 101 are in ChEMBL, and 47 are below
200 Mw (and thus probably salts, not drugs)
• Of the 147 mixture CIDs, only the 2 atorvastatin dimers have assay results or
pharmacology, so none of the drug mixtures have direct data links
• None are among the DrugBank CIDs and only atorvastatin calcium is in ChEMBL
• 138 of the 147 have been extracted from patents by Derwent/Thomson and
are unlikely to get data links
• The small number of important drug combinations that do have data and/or
trial results are difficult to identify
• Tested drug mixtures rarely get public code names, some get trade names but
never INNs
• Chemistry rules may split mixtures and synonyms in databases
• PubMed "Drug Combinations"[MeSH Term] = 54,186 but no SID or CID links
• Mixture components can be designated with a space, /, + or "co"
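
The separator problem in the last bullet can be sketched as a small name splitter. This is a rough illustration only: the function name and the four labels it returns are hypothetical, and it assumes "co" refers to the British-style "co-" prefix (as in co-amoxiclav), which flags a combination without naming its parts.

```python
import re

# Explicit separators from the bullet above. Whitespace is handled
# separately because it also occurs inside single-drug names such as
# "atorvastatin calcium".
_EXPLICIT_SEPARATORS = re.compile(r"\s*[/+]\s*")


def split_combination(name):
    """Return a rough reading of a possible drug-combination name."""
    cleaned = name.strip().lower()
    if cleaned.startswith("co-"):
        # A "co-" name signals a combination but hides the components.
        return {"kind": "co-name", "components": [cleaned]}
    parts = [p for p in _EXPLICIT_SEPARATORS.split(cleaned) if p]
    if len(parts) > 1:
        return {"kind": "explicit", "components": parts}
    if " " in cleaned:
        # Could be a salt, a formulation or a mixture: needs a lookup.
        return {"kind": "ambiguous", "components": cleaned.split()}
    return {"kind": "single", "components": [cleaned]}


for example in ["aspirin + enalapril", "atorvastatin calcium", "Co-amoxiclav"]:
    print(example, "->", split_combination(example))
```

The "ambiguous" case is the awkward one: a space alone cannot distinguish a salt from a two-drug mixture, so name parsing has to be backed by structure or dictionary lookups.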
[15]
16. The Famous Polypill: A Fuzzy Term
CID 44602839 Thomson Pharma
18 clinicaltrials.gov entries, but
only partial component links
aspirin 81 mg, enalapril 2.5 mg, atorvastatin 20 mg and hydrochlorothiazide 12.5 mg
(polypill) PMID: 21647425: Australian New Zealand Clinical Trials Registry
ACTRN12607000099426
DrugBank and TTD negative
[16]
17. Caduet: an Approved Combination
Drugbank Wikipedia
http://clinicaltrials.gov/ct2/show/NCT01107743
[17]
19. A more Recent Combination
But, QA149 is negative in PubChem, DrugBank and TTD
[19]
20. Spaghetti is Resolvable but Errors are Tough:
Will the Real LX4211 Please Stand Up?
http://cenblog.org/the-haystack/2012/03/liveblogging-first-time-disclosures-from-acssandiego/
See also: http://cdsouthan.blogspot.se/2012/03/live-chemical-structure-blogging-but.html
[20]
21. Summary
• You can navigate the linkage spaghetti in name, synonym, structure,
bioactivity and mixture space, but this needs perspicacity and
circumspection.
• The current drug information ecosystem with multiple stakeholders seems
destined to remain "fuzzy"
• Beyond the informatics challenges, the consequences, particularly from frank
errors, could be more serious
• WHO INNs and naming stems play a key positive role, but:
– No open authoritative database, only 7000 PDF entries (!)
– No transparent coordination between USAN, FDA, MeSH, national offices, or
clinical trials registries
– Susceptible to commercial flanking tactics
• Drug combinations have a bright pharmacological future but a difficult
informatics one
• The fuzz includes scientific challenges (e.g. complex structures,
dynamic tautomerism, active metabolites, formulation differences,
paucity of standardised and comparable activity data).
• Efforts are being made to improve the situation, including from the
databases represented in this Workshop session.
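
As a small illustration of why naming stems help, the sketch below classifies names by suffix. The two-entry stem table is a hand-picked assumption for illustration (-vastatin for HMG-CoA reductase inhibitors, -pril for ACE inhibitors), not an extract of the WHO stem list, and the function name is hypothetical.

```python
# Hand-picked illustrative stems; not the authoritative WHO list.
INN_STEMS = {
    "vastatin": "HMG-CoA reductase inhibitor",
    "pril": "ACE inhibitor",
}


def stem_class(name):
    """Return the class implied by the stem of the first word, if any."""
    words = name.lower().split()
    if not words:
        return None
    first_word = words[0]  # ignore salt words such as "calcium"
    # Try longer stems first so a short stem cannot shadow a longer one.
    for stem in sorted(INN_STEMS, key=len, reverse=True):
        if first_word.endswith(stem):
            return INN_STEMS[stem]
    return None


for drug in ["atorvastatin calcium", "enalapril", "aspirin"]:
    print(drug, "->", stem_class(drug))
```

Names without an INN, such as code names or mixture labels, fall through to None, which is the gap the summary points at.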
[21]
22. Questions Welcome
ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710, Skype: cdsouthan
Email: cdsouthan@hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIN: http://www.linkedin.com/in/cdsouthan
Website: http://www.cdsouthan.info/CDS_prof.htm
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan
FYI : A short piece on identifying the names and molecular details of
drugs in clinicaltrials.gov
http://www.samedanltd.com/magazine/13/issue/166/article/3152
[22]