Connecting the Dots in Early Drug Discovery

We have created a large Neo4j database that integrates the results from text mining, experimental data and biological background knowledge. The utility of this graph is twofold:
- Identify promising compounds to be tested as a starting point for drug development.
- Better understand the results of large scale compound testing in cellular assays using imaging technology.
Currently the database contains 25 million article abstracts and data for 2 million compounds and 60,000 genes: overall 29 million nodes and 270 million relationships.


We show some details of how the graph was built, and give examples of how combining text mining with experimental results leads to new insights and to better understanding and design of biological experiments.

Connecting the Dots in Early Drug Discovery

  1. Novartis Institutes for BioMedical Research (NIBR). Connecting the dots in early drug discovery. Stephan Reiling, Senior Scientist, Novartis Institutes for BioMedical Research
  2. Connecting the dots in early drug discovery. Stephan Reiling, In-Silico Lead Discovery Group, Novartis Institutes for BioMedical Research (NIBR), Cambridge. GraphConnect 2016, San Francisco
  3. Why (might you be interested in this talk)
     • The talk shows how a lot of heterogeneous data can be integrated into one big graph
       – Greater than the sum of its parts
     • Text mining and pattern detection can lead to valuable insights
       – Nobody can read 25 million scientific papers
     • Data mining this graph can give novel biological insights
       – Connecting the dots
  4. Why (did we build the graph)
     Treatment effects in cellular phenotypic assays. Figure label: Compound treatment
  5. Why
     • What we have (the dots)
       – Almost 1 billion data points of compound activity data on protein targets (~99% of which can be summarized as “not active”)
       – More and more results of phenotypic assays
     • What we lack (the connections)
       – A good way to use biological knowledge or background information to make a connection
       – A storage for “biological knowledge” that can be “queried”
     Figure labels: Compound, Gene, Disease (Phenotype)
  6. How (did we build the graph)
     Text mining for chemicals, diseases, proteins. Example abstract:
     In continuation of our investigation on novel stearoyl-CoA desaturase (SCD) 1 inhibitors, we have already reported on the structural modification of the benzoylpiperidines that led to a series of novel and highly potent spiropiperidine-based SCD1 inhibitors. In this report, we would like to extend the scope of our previous investigation and disclose details of the synthesis, SAR, ADME, PK, and pharmacological evaluation of the spiropiperidines with high potency for SCD1 inhibition. Our current efforts have culminated in the identification of 5-fluoro-1'-{6-[5-(pyridin-3-ylmethyl)-1,3,4-oxadiazol-2-yl]pyridazin-3-yl}-3,4-dihydrospiro[chromene-2,4'-piperidine] (10e), which demonstrated a very strong potency for liver SCD1 inhibition (ID(50)=0.6 mg/kg). This highly efficacious inhibition is presumed to be the result of a combination of strong enzymatic inhibitory activity (IC(50) (mouse)=2 nM) and good oral bioavailability (F >95%). Pharmacological evaluation of 10e has demonstrated potent, dose-dependent reduction of the plasma desaturation index in C57BL/6J mice on a high carbohydrate diet after a 7-day oral administration (q.d.). In addition, it did not cause any noticeable skin abnormalities up to the highest dose (10 mg/kg).
  7. How (did we build the graph)
     Text mining for chemicals, diseases, proteins. Entities recognized in the abstract shown on the previous slide:
       Hit | Type | Recognized text | SMILES
       T1 | GeneOrProtein | stearoyl-CoA desaturase |
       T2 | Mechanism | inhibitors |
       T3 | G | benzoylpiperidines |
       T4 | D | spiropiperidine | O=C(NC(Cc1c[nH]c2ccccc12)C(=O)N3CCC4(CC3)CCc5ccccc45)NC6CN7CCC6CC7
       T5 | GeneOrProtein | SCD1 |
       T6 | Mechanism | inhibitors |
       T7 | GeneOrProtein | SCD1 |
       T8 | M | 5-fluoro-1'-{6-[5-(pyridin-3-ylmethyl)-1,3,4-oxadiazol-2-yl]pyridazin-3-yl}-3,4-dihydrospiro[chromene-2,4'-piperidine] | FC1=C2CCC3(OC2=CC=C1)CCN(CC3)C=3N=NC(=CC3)C=3OC(=NN3)CC=3C=NC=CC3
       T9 | GeneOrProtein | SCD1 |
       T10 | G | carbohydrate |
       T11 | Disease | skin abnormalities |
  8. How (did we build the graph)
     National Institutes of Health (NIH) PubMed: http://www.ncbi.nlm.nih.gov/pubmed
     • ~25,000,000 article abstracts
     • 5,600 journals
     • 1946 – current
     • Tagged with “MeSH terms” (MeSH: Medical Subject Heading)
     Example: http://www.ncbi.nlm.nih.gov/pubmed/?term=20801551
  9. How
     Structure of the MeSH term hierarchy (partial)
     • Yellow: Diseases
     • Blue: Processes and Mechanisms
     • Green: Anatomy
     • Red: Chemicals and Drugs
     • Grey: Organisms
  12. How
     Association rule mining of co-occurrences
     Example articles:
       Article 1: Compound A, Gene 1, Gene 2
       Article 2: Compound A, Compound B, Gene 1
       Article 3: Compound A, Mesh term X, Gene 1
       Article 4: Compound C, Gene 1
     • Identification of entities (compounds, MeSH terms, genes, diseases, …) from PubMed annotations or text mining
     • The Apriori algorithm from association rule mining is used to identify frequently co-mentioned entities (aka market basket analysis)
     • Associations above a certain association strength (lift) and number of articles in which they are co-mentioned (support) are stored
     • The association strength is scaled to 0-1 and stored as the uncertainty of the association (high lift = low uncertainty)
     • Articles are stored as well, including the entities that are mentioned in them
     • This only captures the fact that something is frequently co-mentioned with something else, not any causality (similar to correlation)
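The support and lift measures named on this slide can be sketched in a few lines of Python using the four example articles. This is a minimal illustration, not the pipeline used in the talk: it counts every pair by brute force instead of running the Apriori algorithm (which prunes candidate itemsets), and the function name `pair_statistics` and the `min_support` parameter are assumptions. The slides do not say how lift is scaled to the 0-1 uncertainty, so that step is left out.

```python
from collections import Counter
from itertools import combinations

# The four example articles from the slide, each reduced to its set of entities.
articles = {
    "Article 1": {"Compound A", "Gene 1", "Gene 2"},
    "Article 2": {"Compound A", "Compound B", "Gene 1"},
    "Article 3": {"Compound A", "Mesh term X", "Gene 1"},
    "Article 4": {"Compound C", "Gene 1"},
}


def pair_statistics(baskets, min_support=1):
    """Return support (article count) and lift for every co-mentioned pair.

    Brute-force pair counting; a hypothetical stand-in for Apriori.
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        if count < min_support:
            continue
        # lift = P(a and b) / (P(a) * P(b)); 1.0 means "no more than chance".
        lift = (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        stats[(a, b)] = {"support": count, "lift": lift}
    return stats


stats = pair_statistics(list(articles.values()))
for pair, values in sorted(stats.items(), key=lambda kv: -kv[1]["lift"]):
    print(pair, values)
```

Note that Gene 1 appears in every article, so its lift with any other entity is 1.0 however often the two are co-mentioned. This is why lift is used as the association strength and support only as a threshold.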
  13. What (can you do with this)
     Example: disease – compound – target from text mining
     Every relationship in the graph has a property “uncertainty” in the range of 0-1. This allows querying for the connections with the highest confidence (lowest summed uncertainty):
       MATCH p = (cpd:Compound) -[:is_associated]-> (g:Gene) -[:is_associated]-> (d:Disease) <-[:is_associated]- (cpd)
       RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
       ORDER BY unc
     Color code: Disease, Gene, Compound
     From Wikipedia: Tafamidis (INN, or Fx-1006A, trade name Vyndaqel) is a drug for the amelioration of transthyretin-related hereditary amyloidosis (also familial amyloid polyneuropathy, or FAP), a rare but deadly neurodegenerative disease.
     From Wikipedia: Canavan disease is caused by a defective ASPA gene which is responsible for the production of the enzyme aspartoacylase. Decreased aspartoacylase activity prevents the normal breakdown of N-acetyl aspartate, wherein the accumulation of N-acetylaspartate, or lack of its further metabolism interferes with growth of the myelin sheath of the nerve fibers of the brain.
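To make the triangle pattern in this query concrete, the following Python sketch finds compound – gene – disease triangles in a tiny in-memory edge list and ranks them by summed uncertainty, mirroring the `reduce(...)` expression. All node names and uncertainty values are made up for illustration, and `ranked_triangles` is a hypothetical helper, not part of the graph described in the talk.

```python
# Hypothetical is_associated edges: (source, target) -> uncertainty in [0, 1].
associations = {
    ("cpd_X", "gene_1"): 0.2,
    ("gene_1", "disease_P"): 0.3,
    ("cpd_X", "disease_P"): 0.1,
    ("cpd_Y", "gene_1"): 0.5,
    ("cpd_Y", "disease_P"): 0.6,
    ("cpd_X", "gene_2"): 0.4,  # gene_2 has no disease edge, so no triangle
}

# Hypothetical node labels, standing in for the Neo4j labels.
labels = {
    "cpd_X": "Compound",
    "cpd_Y": "Compound",
    "gene_1": "Gene",
    "gene_2": "Gene",
    "disease_P": "Disease",
}


def ranked_triangles(associations, labels):
    """Find compound -> gene -> disease <- compound triangles, lowest uncertainty first."""
    results = []
    for (cpd, gene), u_cg in associations.items():
        if labels[cpd] != "Compound" or labels[gene] != "Gene":
            continue
        for (source, disease), u_gd in associations.items():
            if source != gene or labels[disease] != "Disease":
                continue
            # The triangle closes only if the compound is also linked to the disease.
            u_cd = associations.get((cpd, disease))
            if u_cd is None:
                continue
            total = round(u_cg + u_gd + u_cd, 6)  # same sum as reduce() in the query
            results.append((total, cpd, gene, disease))
    return sorted(results)


ranked = ranked_triangles(associations, labels)
for total, cpd, gene, disease in ranked:
    print(f"{cpd} - {gene} - {disease}: summed uncertainty {total}")
```

Because uncertainties add along the path, a triangle ranks highly only when all three of its associations are individually well supported.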
  14. What (can you do with this)
     So why not just load Wikipedia?
       MATCH p = (cpd:Compound {name: 'N-acetylaspartate'}) -[r:is_associated]-> (m:Disease)
       RETURN m.name as Disease, r.uncertainty as Uncertainty
       ORDER BY r.uncertainty LIMIT 5
     Result:
       Disease | Uncertainty
       Canavan Disease | 0.1
       Pelizaeus-Merzbacher Disease | 0.364
       Alexander Disease | 0.432
       Diffuse Axonal Injury | 0.432
       Brain Diseases, Metabolic | 0.451
  15. What (can you do with this)
     Now this is getting more interesting (for us)
     N-acetylaspartate association with cellular components:
       MATCH p = (cpd:Compound {name: 'N-acetylaspartate'}) -[r:is_associated]-> (m:CellularComponent)
       RETURN m.name as CellularComponent, r.uncertainty as Uncertainty
       ORDER BY r.uncertainty LIMIT 5
     Result:
       CellularComponent | Uncertainty
       Axons | 0.582
       Myelin Sheath | 0.611
       Extracellular Fluid | 0.772
     N-acetylaspartate association with biological processes:
       MATCH p = (cpd:Compound {name: 'N-acetylaspartate'}) -[r:is_associated]-> (m:BiologicalProcess)
       RETURN m.name as BiologicalProcess, r.uncertainty as Uncertainty
       ORDER BY r.uncertainty LIMIT 5
     Result:
       BiologicalProcess | Uncertainty
       Energy Metabolism | 0.476
       Dominance, Cerebral | 0.532
       Functional Laterality | 0.586
       Cerebrovascular Circulation | 0.653
       Lipid Metabolism | 0.72
  16. How (did we build the graph)
     30 million nodes, 480 million relationships
     Data sources (*: NIBR internal data; for the other data sources integrated, see the Acknowledgments / References slide):
       1. MeSH Hierarchy
       2. PubMed articles (pubmed_id, title, abstract; Lucene full text searches enabled)
       3. PubMed Associations
       4. Comparative Toxicogenomics Database (CTD)
       5. Compound Target Scores*
       6. Public compound annotations
       7. Entity relations from sentences
       8. Protein-protein interactions data set from CCSB
       9. MetaCore gene - gene interactions (binds, activates, regulates expression, …)
       10. Similarity relations for all the compounds in the graph* (~2M compounds)
       11. Gene ontology
       12. Protein annotations
       13. Pathways / gene sets
     Objects:
     • 25,430,635 articles
     • 1,951,819 compounds
     • 257,000 MeSH and SCR terms
     • 59,859 genes
     • 24,769 GO terms
     • 10,570 diseases
     Relationships: 91 different relationships
     • Compound – is_active – Gene
     • X – is_associated – X
     • Gene – binding – Gene
     • Gene – ubiquitinates – Gene
     • Compound – affects_ubiquitination – Gene
     • Article – mentions – (compound, gene, mesh)
     Relationship counts:
     • 209,031,615 mentions
     • 50,334,440 is_similar
     • 6,951,257 literature_association
     • 762,002 is_active
  17. How
     The different relationships and nodes in the graph
     15 Nodes: Article, BiologicalProcess, CellType, CellularComponent, Compound, Disease, Gene, GeneSet, Go, Mesh, Pathway, Pfam, Phenotype, Similar2D, Tissue
     91 Relationships: acetylation, affects_geranoylation, affects_stability, is_active, adp_ribosylation, affects_glucuronidation, affects_sulfation, is_associated, affects_ADP_ribosylation, affects_glutathionylation, affects_sumoylation, is_child_of, affects_N_linked_glycosylation, affects_glycation, affects_transport, is_part_of, affects_O_linked_glycosylation, affects_glycosylation, affects_ubiquitination, is_query, affects_abundance, affects_hydrolysis, affects_uptake, is_similar, affects_acetylation, affects_hydroxylation, binding, member_of, affects_activity, affects_import, cleavage, mentions, affects_acylation, affects_lipidation, co_regulation_of_transcription, methylation, affects_alkylation, affects_localization, complex_formation, mirna_binding, affects_amination, affects_metabolic_processing, covalent_modification, neddylation, affects_binding, affects_methylation, deacetylation, oxidation, affects_carbamoylation, affects_mutagenesis, demethylation, phosphorylation, affects_carboxylation, affects_nitrosation, deneddylation, ppi, affects_chemical_synthesis, affects_oxidation, dephosphorylation, receptor_binding, affects_cleavage, affects_phosphorylation, desumoylation, s_nitrosylation, affects_cotreatment, affects_prenylation, deubiquitination, sulfation, affects_degradation, affects_reaction, glycosylation, sumoylation, affects_ethylation, affects_reduction, go_component, transcription_regulation, affects_export, affects_response_to_substance, go_function, transformation, affects_expression, affects_ribosylation, go_process, transport, affects_farnesylation, affects_secretion, gpi_anchor, ubiquitination, affects_folding, affects_splicing, hydroxylation
  18. How (did we build the graph)
     Overall build process
     Diagram labels: PubMed XML files (titles, abstracts), internal data sources, MeSH hierarchies, ctdbase, PubChem, ChEMBL, ChEBI, CCSB, MetaStore; information extraction, compound similarities, gene sets, protein annotations, gene ontologies; MongoDB, PostgreSQL; CSV file staging; Neo4j
     • Information extraction (entity recognition, relationship detection, association rule mining) is done on a Linux cluster
     • The Neo4j “endpoint” is focused on graph mining
     • MongoDB and PostgreSQL are also used for data mining purposes
  19. What (can you do with this)
     Example: Analysis of compound activities
     Figure labels: A to H, Active compounds, Inactive compounds
  20. What
     Example: Analysis of compound activities
     Figure labels: A to H, 1 to 6, Active compounds, Inactive compounds
     1. Find genes directly affected by the compounds
  21. What
     Example: Analysis of compound activities
     Figure labels: A to H, 1 to 10, Active compounds, Inactive compounds
     1. Find genes directly affected by the compounds
     2. Find all genes that are indirectly affected with some confidence (below a given uncertainty)
  22. What
     Example: Analysis of compound activities
     Figure labels: A to H, 1 to 10, Active compounds, Inactive compounds
     1. Find genes directly affected by the compounds
     2. Find all genes that are indirectly affected with some confidence (below a given uncertainty)
     3. Assign a large distance to nodes that cannot be reached
     4. Identify nodes that
        • cannot be reached by most of the inactive compounds
        • or are “closer” to the actives than the inactives
  23. What
     Example: Analysis of compound activities
     Query reachable nodes:
       MATCH (cpd:Compound) WHERE any( nvs in cpd.cpd_id WHERE nvs in ['cpd1','cpd2',…])
       WITH cpd
       MATCH p = (cpd) -[r*1..2]-> (m)
       WITH cpd, p, m, reduce(u=0.0, r in relationships(p) | u+r.uncertainty ) as uncertainty
       WHERE uncertainty < 0.9
       RETURN cpd.cpd_id as Compound_ID, m.id as ID, uncertainty as Distance
       ORDER BY uncertainty
     Matrix of compound – node “distances”. The sum of relationship uncertainties is used as the distance from compound to node; the distance to an unreachable node is set to 1.0.
       Compound_ID | Active | C582554 | C495901 | C495900
       1 | 0 | 1.00 | 1.00 | 1.00
       2 | 1 | 0.78 | 0.89 | 0.88
       3 | 1 | 1.00 | 1.00 | 1.00
       4 | 0 | 1.00 | 1.00 | 1.00
       5 | 0 | 1.00 | 0.78 | 0.67
       6 | 0 | 1.00 | 1.00 | 1.00
       7 | 0 | 1.00 | 1.00 | 1.00
       8 | 0 | 0.88 | 0.88 | 0.90
       9 | 0 | 1.00 | 0.88 | 0.82
       10 | 1 | 1.00 | 1.00 | 1.00
       11 | 0 | 1.00 | 1.00 | 1.00
       12 | 0 | 1.00 | 0.80 | 0.83
       13 | 0 | 1.00 | 1.00 | 1.00
       14 | 1 | 1.00 | 1.00 | 1.00
       15 | 1 | 0.82 | 1.00 | 1.00
       16 | 1 | 0.78 | 0.89 | 0.88
       17 | 1 | 0.80 | 1.00 | 1.00
       18 | 1 | 0.80 | 1.00 | 1.00
       19 | 1 | 0.78 | 0.89 | 0.88
       20 | 1 | 0.80 | 1.00 | 1.00
     Result of recursive partitioning (decision tree), and one surrogate split with equivalent performance: 2 nodes of interest
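The workflow on the last few slides (query nodes reachable within two hops, sum the uncertainties as a distance, set unreachable nodes to 1.0, then look for nodes that separate actives from inactives) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the graph, compound names and activity flags are invented, and the final step ranks nodes by a simple difference of mean distances, which is a deliberately simpler stand-in for the recursive partitioning (decision tree) used in the talk.

```python
# Hypothetical directed graph: source -> {target: uncertainty in [0, 1]}.
EDGES = {
    "cpd1": {"geneA": 0.3, "geneB": 0.5},
    "cpd2": {"geneA": 0.4},
    "cpd3": {"geneC": 0.2},
    "geneA": {"geneD": 0.4},
    "geneB": {"geneD": 0.5},
    "geneC": {"geneD": 0.8},
}
ACTIVE = {"cpd1": True, "cpd2": True, "cpd3": False}  # hypothetical assay results
UNREACHABLE = 1.0  # distance assigned to nodes that cannot be reached
CUTOFF = 0.9       # same threshold as "WHERE uncertainty < 0.9" in the query


def reachable_distances(start, edges, max_hops=2, cutoff=CUTOFF):
    """Smallest summed uncertainty to each node within max_hops, below cutoff."""
    best = {}
    frontier = [(start, 0.0)]
    for _ in range(max_hops):
        next_frontier = []
        for node, dist in frontier:
            for target, uncertainty in edges.get(node, {}).items():
                total = dist + uncertainty
                # Uncertainties are non-negative, so a path at or above the
                # cutoff can never drop below it again and is pruned here.
                if total < cutoff:
                    best[target] = min(best.get(target, UNREACHABLE), total)
                    next_frontier.append((target, total))
        frontier = next_frontier
    return best


def distance_matrix(compounds, edges):
    """Compound x node matrix, with unreachable nodes set to UNREACHABLE."""
    per_compound = {c: reachable_distances(c, edges) for c in compounds}
    nodes = sorted({n for dists in per_compound.values() for n in dists})
    return {
        c: {n: per_compound[c].get(n, UNREACHABLE) for n in nodes}
        for c in compounds
    }


def rank_nodes(matrix, active):
    """Rank nodes by how much closer they are to actives than to inactives."""
    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    nodes = list(next(iter(matrix.values())))
    scores = []
    for node in nodes:
        act = mean(matrix[c][node] for c in matrix if active[c])
        inact = mean(matrix[c][node] for c in matrix if not active[c])
        scores.append((node, inact - act))
    return sorted(scores, key=lambda item: item[1], reverse=True)


matrix = distance_matrix(list(ACTIVE), EDGES)
ranking = rank_nodes(matrix, ACTIVE)
for node, score in ranking:
    print(f"{node}: inactive mean minus active mean = {score:.2f}")
```

A decision tree, as used in the talk, can additionally combine several nodes and report surrogate splits, which the mean-difference ranking cannot do.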
  24. What
     Example: Analysis of compound activities
     Only showing the active compounds and their connections to the identified nodes.
     • Green: relationships derived from in-house data
     • Grey: relationships found from text mining
     Figure labels: Compound 1 to Compound 13
  25. Figure labels: Compound 1 to Compound 13
       MATCH p = (g1:Gene) -[r*1..2 {datasource: 'metacore'}]-> (g2:Gene)
       WHERE g2.gene_symbol in ['FOXO','MTOR'] and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
       RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
       ORDER BY unc LIMIT 20
  26. Articles that mention both genes:
       MATCH p = (g1:Gene) <-[:mentions]- (a:Article) -[:mentions]-> (g2:Gene)
       WHERE g2.gene_symbol in ['FOXO','MTOR'] and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
       RETURN p
     MetaCore gene – gene paths:
       MATCH p = (g1:Gene) -[r*1..2 {datasource: 'metacore'}]-> (g2:Gene)
       WHERE g2.gene_symbol in ['FOXO','MTOR'] and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
       RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
       ORDER BY unc LIMIT 20
  27. Figure labels: Compound 1 to Compound 13
  28. Where (is this going)
     • More tweaks to what we have
       – Improvements to text mining
       – Analysis of verbs (actions) / information extraction
       – Monitor change over time (what is new “emerging knowledge”)
     • Full text analysis
       – Enable analysis and inclusion of internal documents
     • Incorporate additional data sources
       – Gene expression data (tissue expression and perturbations)
       – Mutations
       – Proteomics
     • Refining the “uncertainty” measure
       – How best to compare uncertainties from different data sources
     • Expand user base
     • Automated updates
  29. Acknowledgments / References
     Acknowledgments:
     • ISLD group – John Davies – Miguel Camargo – Eugen Lounkine – Elisabet Gregori-Puigjane – Mark Bray – Pierre Farmer – Ansgar Schuffenhauer
     • Text mining group – Therese Vachon – Pierre Parrisot – Andrea Splendiani – Fatima Oezdemir-Zaech – Frederic Sutter
     • CPC – Sylvain Cottens – Doug Auld
     • DMP – Jeremy Jenkins – Ben Cornett – Florian Nigsch
     • NX – Stephen Litster
     Source references:
     • Protein information:
       – Pfam: R.D. Finn et al., The Pfam protein families database: towards a more sustainable future, Nucleic Acids Research (2016) Database Issue 44: D279-D285. http://pfam.xfam.org/
       – UniProt: The UniProt Consortium, UniProt: a hub for protein information, Nucleic Acids Res. 43: D204-D212 (2015). http://www.uniprot.org/
     • Comparative Toxicogenomics Database:
       – Davis AP et al., The Comparative Toxicogenomics Database's 10th year anniversary: update 2015. Nucleic Acids Res. 2015 Jan; 43 (Database issue): D914-20. Curated chemical–gene data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. World Wide Web (URL: http://ctdbase.org/). [May 2016].
     • MetaCore:
       – Thomson Reuters LifeSciences. http://thomsonreuters.com/en/products-services/pharma-life-sciences/pharmaceutical-research/metacore.html
     • Protein-protein interaction data set:
       – Center for Cancer Systems Biology (CCSB) at the Dana Farber Cancer Institute. http://ccsb.dfci.harvard.edu/
     • Gene Ontology:
       – The Gene Ontology Consortium. Gene Ontology Consortium: going forward. (2015) Nucl. Acids Res. 43 Database issue D1049–D1056. http://geneontology.org/
     • Pathways (Reactome pathway database, http://reactome.org/):
       – A. Fabregat et al., The Reactome pathway Knowledgebase, Nucl. Acids Res. (04 January 2016) 44 (D1): D481-D487
       – D. Croft et al., The Reactome pathway knowledgebase, Nucl. Acids Res. (1 January 2014) 42 (D1): D472-D477
  30. Thank you
