
Informatics and data analytics to support exposome-based discovery


Published on

International Society of Exposure Science 10/20/2015

Published in: Health & Medicine


  1. 1. Informatics and data analytics to support exposome-based discovery Perspectives from a NIEHS workshop Chirag J Patel International Society of Exposure Science Henderson, NV (by way of Boston, MA) 10/20/15 chirag@hms.harvard.edu @chiragjp www.chiragjpgroup.org
  2. 2. Arjun Manrai (Harvard)* Yuxia Cui (NIEHS) Pierre Bushel (NIEHS) Molly Hall (Penn State, now U Penn)* Spyros Karakitsios (Aristotle U, Greece) Carolyn Mattingly (NCSU) Marylyn Ritchie (Geisinger Health/Penn State) Charles Schmitt (NIEHS) Denis Sarigiannis (Aristotle U, Greece) Duncan Thomas (USC) David Wishart (U Alberta, Canada) David Balshaw (NIEHS) The workgroup discussed informatics capability for high-throughput exposome research (late 2014 to early 2015)
  3. 3. We are now in the era of high-throughput biology and biomedicine (it is now possible to assay thousands to millions of datapoints).
  4. 4. We are now in the era of high-throughput biology and biomedicine: examples of genomic advances: genetic arrays, gene expression, common genetic variants, epigenome (methylation), whole genome sequencing (WGS), full genome sequencing, mRNA-seq, epigenome (3D, histone). Scale: $3 \times 10^9$ nucleotide bases, $3$ to $4 \times 10^4$ genes, $10^6$ to $10^7$ variants.
  5. 5. Informatics has enabled discovery in genomics investigations. 1. infrastructure/standards, 2. analytics, 3. databases
  6. 6. Information infrastructure has enabled discovery in genomics (example: UCSC genome browser)
  7. 7. Analytic methods have enabled discovery in genomics (example: genome-wide association studies [GWAS]): a search engine for genetic influence in phenotypes. Example: GWAS in Type 2 Diabetes, N=8K T2D, 39K controls (Voight et al, Nature Genetics 2012). [Figure: genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis; x-axis: chromosome (1 to 22, X); y-axis: -log10(P). Top panel: results of the unconditional meta-analysis, with previously established loci in red, loci identified by the study in green, and the ten signals taken forward but not confirmed in stage 2 in blue. Lower panel: the equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs; newly discovered conditional signals outside established loci are marked in orange at suggestive significance (P < 10^-5), and secondary signals close to confirmed T2D loci in purple (P < 10^-4). Genes used to name signals were chosen by proximity to the index SNP and should not be presumed to indicate causality.]
  8. 8. Accessible data repositories have enabled discovery in genomics investigations (example: the Database of Genotypes and Phenotypes, dbGaP): 758,000 individuals, >400 studies, >>1B datapoints (genotypes and phenotypes), >950 manuscripts (Paltoo et al., Nature Genetics 2014)
  9. 9. We claim that there is a need for informatics (analytic methods, databases, and standards) for exposome-driven discovery. EWAS akin to GWAS?
  10. 10. Why? courtesy: colabria.com
  11. 11. P = G + E
  12. 12. $\sigma^2_P = \sigma^2_G + \sigma^2_E$
  13. 13. $H^2 = \sigma^2_G / \sigma^2_P$. Heritability ($H^2$) is the proportion of phenotypic variability attributed to genetic variability in a population
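The decomposition on slides 11 to 13 can be checked numerically. The following is a minimal sketch, not part of the original talk: it assumes G and E are independent and normally distributed, and the function name `simulate_heritability` and its parameters are hypothetical. Under those assumptions the ratio of genetic to phenotypic variance should approach sd_g^2 / (sd_g^2 + sd_e^2).

```python
import random
import statistics


def simulate_heritability(n=20000, sd_g=1.0, sd_e=1.0, seed=0):
    """Simulate P = G + E with independent G and E; return var(G) / var(P)."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, sd_g) for _ in range(n)]  # genetic component (assumed normal)
    e = [rng.gauss(0.0, sd_e) for _ in range(n)]  # environmental component (assumed normal)
    p = [gi + ei for gi, ei in zip(g, e)]         # phenotype: P = G + E
    return statistics.pvariance(g) / statistics.pvariance(p)


h2 = simulate_heritability()                        # equal variances: expect a value near 0.5
h2_low = simulate_heritability(sd_g=1.0, sd_e=3.0)  # E dominates: expect a value near 0.1
print(f"H2 (equal variances) = {h2:.2f}")
print(f"H2 (E dominates)     = {h2_low:.2f}")
```

If G and E were correlated or interacted, the simple additive split would no longer hold, which is one reason the E term is hard to pin down in practice.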
  14. 14. H2 estimates for complex traits are low and variable: massive opportunity for high-throughput E research. [Figure: bar chart of heritability, Var(G)/Var(Phenotype), on an axis from 0 to 100 for complex traits listed from eye color, hair curliness, and type-1 diabetes down to lung cancer, leukemia, and stomach cancer. Source: SNPedia.com.] Highlighted: Type 2 Diabetes (25%), Heart Disease (25-30%), Autism (50%???), Gaugler et al, Nature Genetics (2014)
  15. 15. H2 estimates for complex traits are low and variable: massive opportunity for high-throughput E research. [Figure: the same heritability bar chart as on the previous slide, Var(G)/Var(Phenotype) on an axis from 0 to 100. Source: SNPedia.com.] Highlighted: H2 < 50%
  16. 16. Polderman et al., "Meta-analysis of the heritability of human traits based on fifty years of twin studies", Nature Genetics, 2015. 17,804 traits of the phenome; 2,748 publications; 14,558,903 twin pairs. Average H2 (genome): 0.49. Exposome plays an equal role. [Slide shows the first page of the article. Abstract: Despite a century of research on complex traits in humans, the relative importance and specific nature of the influences of genes and environment on human traits remain controversial. We report a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all published twin studies of complex traits. Estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. For a majority (69%) of traits, the observed twin correlations are consistent with a simple and parsimonious model where twin resemblance is solely due to additive genetic variation. The data are inconsistent with substantial influences from shared environment or non-additive genetic variation. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. All the results can be visualized using the MaTCH webtool.]
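The classical twin design described on slide 16 contrasts monozygotic (MZ) and dizygotic (DZ) twin correlations. One textbook way to turn those correlations into variance components is Falconer's approximation, sketched below. This is an illustration only, not necessarily the estimation procedure of the cited meta-analysis; the function name `falconer` and the example correlations are invented for the sketch.

```python
def falconer(r_mz, r_dz):
    """Falconer's approximation from twin correlations.

    Returns (h2, c2, e2): additive genetic, shared-environment, and
    non-shared-environment shares of phenotypic variance.
    """
    h2 = 2.0 * (r_mz - r_dz)  # MZ twins share about twice the segregating variation DZ twins do
    c2 = 2.0 * r_dz - r_mz    # resemblance not explained by additive genetics
    e2 = 1.0 - r_mz           # whatever makes identical twins differ
    return h2, c2, e2


# Hypothetical correlations, chosen only to illustrate the arithmetic.
h2, c2, e2 = falconer(r_mz=0.60, r_dz=0.35)
print(f"h2={h2:.2f}  c2={c2:.2f}  e2={e2:.2f}")
```

By construction the three shares sum to 1, so any variance not assigned to additive genetics lands in one of the two environmental terms (together with measurement error).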
  17. 17. What is the potential chemical (external and internal) space of the exposome? Perhaps on the order of thousands. >84,000: TSCA and EPA Inventory (2014). >13,000: Davis et al, Comparative Toxicogenomics Database (2015). 3,600 toxicants + 1,634 drugs: Toxic Exposome Database, Wishart et al (2015). 100-1,000?: uBiome
  18. 18. What will the exposome data structure look like? A high-dimensional 3D matrix of (1) exposure measurements on (2) individuals as a function of (3) time. [Figure: data cube with axes individuals, exposome factors (pollutants, diet, metabolites, ..., gut flora, xenobiotics, drugs, mixtures of exposures), and time, shown next to the genome (static) and a CVD phenotype; panels (A), (B), (C) are labeled GWAS, RVAS, pathway analysis, etc.; EWAS, PheWAS, etc.; and integrative. Example cell: nutrient value for individual i.]
  19. 19. What will the exposome data structure look like? A high-dimensional 3D matrix of (1) exposure measurements on (2) individuals as a function of (3) time. [Figure: the same data cube as on the previous slide, with phenotypes (CVD, BP, cancer) shown and the additional labels "longitudinal", "system", and "genome".]
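The individuals by exposures by time matrix on slides 18 and 19 can be sketched as a small container. This is an assumption-laden illustration rather than a proposed standard: the class name `ExposomeCube`, its methods, and the example values are all hypothetical, and a real implementation would more likely sit on an array library.

```python
from dataclasses import dataclass, field


@dataclass
class ExposomeCube:
    """Hypothetical container indexed as values[individual][exposure][time]."""
    individuals: list
    exposures: list
    times: list
    values: list = field(init=False)

    def __post_init__(self):
        # Position lookups for each of the three axes.
        self._i = {name: k for k, name in enumerate(self.individuals)}
        self._j = {name: k for k, name in enumerate(self.exposures)}
        self._t = {name: k for k, name in enumerate(self.times)}
        # None marks a missing measurement.
        self.values = [[[None] * len(self.times) for _ in self.exposures]
                       for _ in self.individuals]

    def set(self, individual, exposure, time, value):
        self.values[self._i[individual]][self._j[exposure]][self._t[time]] = value

    def trajectory(self, individual, exposure):
        """One exposure in one individual across all time points."""
        return list(self.values[self._i[individual]][self._j[exposure]])

    def profile(self, individual, time):
        """All exposures for one individual at one time point."""
        row = self.values[self._i[individual]]
        t = self._t[time]
        return {name: row[j][t] for name, j in self._j.items()}


# Made-up example: two individuals, two exposures, two sampling years.
cube = ExposomeCube(["i1", "i2"], ["lead", "cotinine"], [2001, 2003])
cube.set("i1", "lead", 2001, 1.2)
cube.set("i1", "lead", 2003, 0.9)
print(cube.trajectory("i1", "lead"))
print(cube.profile("i2", 2001))
```

Keeping time as an explicit axis is what separates this layout from the genome, which the slides treat as static.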
  20. 20. Data-driven investigation for novel exposome factors in the phenome: exposome-wide, phenome-wide, and genome-exposome-wide discovery. Informatics methods to integrate heterogeneous data (E, G, and P) and to conduct EWAS, GxEWAS, and PheWAS. [Figure: data cube linking the exposome (pollutants, diet, metabolites, ..., gut flora, xenobiotics, drugs, mixtures of exposures) and the phenome (height, weight, CVD, BP, T2D, cancer, ..., mixtures of phenotypes) across individuals and time, next to the genome (static); panels (A), (B), (C) are labeled GWAS, RVAS, pathway analysis, etc.; EWAS, PheWAS, etc.; and integrative.]
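An EWAS in the sense of slide 20 tests many exposures against one phenotype and corrects for the number of tests, much as a GWAS does for variants. The sketch below is a deliberately simplified illustration on synthetic data. It assumes unadjusted Pearson correlations, a normal approximation via the Fisher z-transform, and a Bonferroni threshold; published EWAS analyses typically also adjust for covariates and survey design, which is omitted here. All names (`ewas_scan`, `exposure_3`, and so on) are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ewas_scan(exposures, phenotype, alpha=0.05):
    """Test every exposure against one phenotype; return rows sorted by p-value."""
    n = len(phenotype)
    threshold = alpha / len(exposures)  # Bonferroni correction for the number of tests
    rows = []
    for name, values in exposures.items():
        r = pearson_r(values, phenotype)
        z = math.atanh(r) * math.sqrt(n - 3)        # Fisher z-transform, approximately N(0, 1) under the null
        p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided p-value
        rows.append((name, r, p, p < threshold))
    return sorted(rows, key=lambda row: row[2])


# Synthetic data: 20 exposures, only exposure_3 truly influences the phenotype.
rng = random.Random(1)
n = 500
exposures = {f"exposure_{j}": [rng.gauss(0, 1) for _ in range(n)] for j in range(20)}
phenotype = [0.5 * x + rng.gauss(0, 1) for x in exposures["exposure_3"]]

ranked = ewas_scan(exposures, phenotype)
hits = [name for name, r, p, significant in ranked if significant]
print("top-ranked exposure:", ranked[0][0])
print("Bonferroni-significant:", hits)
```

A phenome-wide scan (PheWAS) uses the same loop with the roles swapped: one exposure tested against every phenotype.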
  21. 21. Integration challenges in conducting data-driven investigation for novel exposome factors in the phenome: the exposome is heterogeneous and G does not equal E. Platform: mass-spec, targeted vs. untargeted. Scale: external vs. internal. Time-dependent: sampling and life trajectories. Type: continuous vs. categorical. Correlation: dense!
  22. 22. Interdependencies of the exposome: correlation globes paint a dense and complex view of exposure (JAMA 2015; Pac Symp Biocomput. 2015)
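A correlation globe like the one referenced on slide 22 is, at its core, a list of exposure pairs whose correlation exceeds some cutoff. The sketch below builds such an edge list from synthetic data. It is an illustration only: the cutoff `min_abs_r`, the function name `correlation_edges`, and the simulated shared source are assumptions, not the method of the cited papers.

```python
import math
import random
from itertools import combinations
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(exposures, min_abs_r=0.3):
    """Return (exposure_a, exposure_b, r) for every pair with |r| >= min_abs_r."""
    edges = []
    for a, b in combinations(sorted(exposures), 2):
        r = pearson_r(exposures[a], exposures[b])
        if abs(r) >= min_abs_r:
            edges.append((a, b, r))
    return edges


# Synthetic data: exposure_a and exposure_b share a common source; exposure_c is independent.
rng = random.Random(7)
n = 1000
shared = [rng.gauss(0, 1) for _ in range(n)]
exposures = {
    "exposure_a": [s + rng.gauss(0, 0.5) for s in shared],
    "exposure_b": [s + rng.gauss(0, 0.5) for s in shared],
    "exposure_c": [rng.gauss(0, 1) for _ in range(n)],
}

edges = correlation_edges(exposures)
for a, b, r in edges:
    print(f"{a} <-> {b}: r = {r:.2f}")
```

With hundreds of exposures the number of pairs grows quadratically, which is why the resulting picture is dense.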
  23. 23. $\sigma^2_P = \sigma^2_G + \sigma^2_E$; $\sigma^2_E$ ???
  24. 24. EWAS-derived phenotype-exposure association map: a 2-D view of 86 phenotype by 252 exposure associations. http://bit.ly.com/pemap [Figure: heatmap of phenotypes by exposures with positive (+) and negative (-) associations; color key and histogram with values from -0.4 to 0.4. Exposures include dietary nutrients and serum vitamins, phytoestrogens, physical activity, smoking and alcohol measures, polycyclic aromatic hydrocarbons, phthalates, heavy metals, volatile compounds, dioxins and furans, organochlorine pesticides, PCBs, perfluorinated compounds, and infection markers. Phenotypes include clinical chemistry, blood counts, blood pressure and pulse, lipids, glucose and insulin measures, bone density, and body measures.]
  25. 25. 1 to 66 exposures identified for 81 phenotypes. Additive effect of E factors: describe less than 10% of variability in P (on average: 8%). Stan Shaw, Hugues Aschard, JP Ioannidis. $\sigma^2_E$? Exposome may enable realization of remainder of P (>40%). Recall: H2 <= 50%. [Figure: bar chart of R^2 * 100 (axis from 0 to 40) per phenotype, ordered from triglycerides, total cholesterol, and LDL-cholesterol down to PSA total and basophils percent.]
  26. 26. What do we do now? Recommendations from the workgroup
  27. 27. Data workgroup recommendation highlights: Comprehensive catalog of documented environmental associations (e.g., risk, variance explained) to strengthen the case for the exposome. Where is evidence robust (e.g., air pollution and CVD)? Where do we see non-replication? Where is heritability low and ripe for the exposome? Identify technologies that can measure the exposome: targeted and untargeted metabolomics.
  28. 28. Data workgroup recommendation highlights: Develop high-throughput data analytic capability. Statistical methodologies for the 3D matrix! Encourage a shift from 1 E to many Es. Link external and internal exposome measures. Develop data repositories to house and disseminate individual-level exposome data. Assess the variability of the exposome in diverse populations. [Figure: the exposome, phenome, and genome data cube from slide 20.]
  29. 29. Data workgroup recommendation highlights: Identify data standards for exposome research. Develop data standards to enable the re-use of research to build large exposome-rich cohorts. Identify analytics standards for reproducible research. Software libraries and tools to share methods and findings. Incentivize other parties (e.g., researchers, funders, and industry) to integrate the exposome in their existing programs.
  30. 30. Data workgroup recommendation highlights: Educate. Identify example datasets (e.g., NHANES, DEMOCOPHES). Hackathons and challenges to recruit data scientists. Develop big data training support (e.g., K awards) directed at exposome-related research.
  31. 31. google:“niehs chear”
  32. 32. Informatics will enable us to decipher the role of the emerging exposome in phenotypes to capture the missing $\sigma^2_P$: $\sigma^2_P = \sigma^2_G + \sigma^2_E$
  33. 33. Arjun Manrai (Harvard)* Yuxia Cui (NIEHS) Pierre Bushel (NIEHS) Molly Hall (Penn State, now Penn)* Spyros Karakitsios (Aristotle U, Greece) Carolyn Mattingly (NCSU) Marylyn Ritchie (Geisinger/Penn State) Charles Schmitt (NIEHS) Denis Sarigiannis (Aristotle U, Greece) Duncan Thomas (USC) David Wishart (U Alberta, Canada) David Balshaw (NIEHS) Thanks again to the group. Funded in part by the NIEHS.
  34. 34. chirag@hms.harvard.edu @chiragjp www.chiragjpgroup.org Thank you.
