
Biomedical Informatics 706: Precision Medicine with exposures

Lecture for Zak Kohane's precision medicine course at HMS

Published in: Health & Medicine

  1. Towards a more precise medicine with the exposome. Chirag J Patel. BMI 703 Precision Medicine I: Genomic Medicine, 10/19/16. chirag@hms.harvard.edu, @chiragjp, www.chiragjpgroup.org
  2. P = G + E. Phenotype (P): Type 2 Diabetes, Cancer, Alzheimer’s, Gene expression. Genome (G): Variants. Environment (E): Infectious agents, Nutrients, Pollutants, Drugs.
  3. We are great at G investigation! Over 2400 Genome-wide Association Studies (GWAS): https://www.ebi.ac.uk/gwas/
  4. Nothing comparable to elucidate E influence! E: ??? We lack high-throughput methods and data to discover new E in P…
  5. A similar paradigm for discovery should exist for E! Why?
  6. $\sigma^2_P = \sigma^2_G + \sigma^2_E$
  7. $H^2 = \sigma^2_G / \sigma^2_P$. Heritability ($H^2$) is the range of phenotypic variability attributed to genetic variability in a population. Indicator of the proportion of phenotypic differences attributed to G.
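
The variance decomposition and heritability ratio on slides 6 and 7 can be illustrated with a small numerical sketch. This is not from the lecture: the function names, the variance values, and the assumption that G and E are independent and normally distributed (no gene-environment interaction or covariance) are all hypothetical choices made for illustration.

```python
import random
import statistics


def heritability(var_g: float, var_e: float) -> float:
    """Return H^2 = var_G / var_P, taking var_P = var_G + var_E (slides 6 and 7)."""
    var_p = var_g + var_e
    if var_p <= 0:
        raise ValueError("total phenotypic variance must be positive")
    return var_g / var_p


def simulate_heritability(var_g: float, var_e: float, n: int = 20000, seed: int = 0) -> float:
    """Estimate H^2 from simulated data under the (assumed) model P = G + E.

    G and E are drawn independently from zero-mean normal distributions,
    which is an illustrative assumption, not a claim about real traits.
    """
    rng = random.Random(seed)
    g = [rng.gauss(0.0, var_g ** 0.5) for _ in range(n)]
    e = [rng.gauss(0.0, var_e ** 0.5) for _ in range(n)]
    p = [gi + ei for gi, ei in zip(g, e)]
    return statistics.pvariance(g) / statistics.pvariance(p)


if __name__ == "__main__":
    # Hypothetical variances: 3 units genetic, 2 units environmental.
    print(heritability(3.0, 2.0))
    print(simulate_heritability(3.0, 2.0))
```

Under these assumptions, whatever share of the variance is not attributed to G is left for E, which is the opening for E discovery that the following slides argue for.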
  8. Height is an example of a heritable trait: Francis Galton shows how it's done (1887). “mid-height of 205 parents described 60% of variability of 928 offspring”
  9. G estimates for burdensome diseases are low and variable: massive opportunity for high-throughput E discovery. Highlighted: Type 2 Diabetes, Heart Disease, Autism (50%???). [Figure: heritability, Var(G)/Var(Phenotype), on a 0 to 100 scale, by trait. Source: SNPedia.com. Traits shown, in plot order: Eye color, Hair curliness, Type-1 diabetes, Height, Schizophrenia, Epilepsy, Graves' disease, Celiac disease, Polycystic ovary syndrome, Attention deficit hyperactivity disorder, Bipolar disorder, Obesity, Alzheimer's disease, Anorexia nervosa, Psoriasis, Bone mineral density, Menarche (age at), Nicotine dependence, Sexual orientation, Alcoholism, Lupus, Rheumatoid arthritis, Crohn's disease, Migraine, Thyroid cancer, Autism, Blood pressure (diastolic), Body mass index, Depression, Coronary artery disease, Insomnia, Menopause (age at), Heart disease, Prostate cancer, QT interval, Breast cancer, Ovarian cancer, Hangover, Stroke, Asthma, Blood pressure (systolic), Hypertension, Osteoarthritis, Parkinson's disease, Longevity, Type-2 diabetes, Gallstone disease, Testicular cancer, Cervical cancer, Sciatica, Bladder cancer, Colon cancer, Lung cancer, Leukemia, Stomach cancer.]
  10. G estimates for complex traits are low and variable: massive opportunity for high-throughput E discovery. $\sigma^2_E$: Exposome! [Figure: same heritability chart as slide 9. Source: SNPedia.com]
  11. The implications for precision medicine: By itself, the genome is a poor to modest diagnostic. Science Translational Medicine, 2012
  12. The implications for precision medicine: By itself, the genome is a poor to modest diagnostic. Science Translational Medicine, 2016. “twins do not develop or die from the same maladies…” [Figure: % of cases that would test positive; heritability (%) in red: 86, 60, 30, 21, 76]
  13. Grace Mahoney and Alan LeGoallec E vs. G in clinical traits Sivateja Tangirala gene expression in twins Yeran Li seasonality and disease Adam Brown (epi)genomic drug repositioning Danielle Rasooly family history & infectious disease Chirag Lakhani clinical traits in twins environmental databases Jake Chung, Nam Pho exposome analytics
  14. The implications for precision medicine: By itself, the genome is a poor to modest diagnostic. Science Translational Medicine, 2012. Can we use administrative data (e.g., insurance) to ascertain heritability and environment?
  15. We are great at finding specific G! Over 2400 Genome-wide Association Studies (GWAS): https://www.ebi.ac.uk/gwas/
  16. A similar paradigm for discovery should exist for E!
  17. It took a new paradigm of GWAS for discovery: Human Genome Project to GWAS. Sequencing of the genome: 2001. HapMap project (http://hapmap.ncbi.nlm.nih.gov/): characterize common variation, 2001-current day. Measurement tools: high-throughput variant assay, < $99 for ~1M variants, ~2003 (ongoing). GWAS: comprehensive, high-throughput analyses. WTCCC, Nature, 2007.
      [Article shown on slide] Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. The Wellcome Trust Case Control Consortium. Nature, Vol 447, 7 June 2007, doi:10.1038/nature05911.
      There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at $P < 5 \times 10^{-7}$: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a … [excerpt cut off]
  18. Explaining the other 50%: A big data-driven paradigm for robust discovery of E in disease via EWAS and the exposome. What to measure? How to measure? How to analyze in relation to health? “A more comprehensive view of environmental exposure is needed ... to discover major causes of diseases...” Wild, 2005; Rappaport and Smith, 2010, 2011; Buck-Louis and Sundaram, 2012; Miller and Jones, 2014; Patel CJ and Ioannidis JPA, 2014
      [Figure from the Perspectives article shown on slide] Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum. External environment: radiation, diet, pollution, infections, drugs, life-style, stress. Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora. Exposome chemical classes: reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins.
      [Article text shown on slide; column edges are cut off] … critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly … disruptors and can be measured through serum … (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).
      Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples. With successful characterization of both … [excerpt cut off]
  19. Connecting E with Disease: Missing the “System” of Exposures? [Figure: 2x2 table of E+ / E- against diseased / non-diseased, with a question mark.] Exposed to many things, but do not assess the multiplicity. Fragmented literature of associations. Challenge to discover E associated with disease.
  20. Examples of exposome-driven discovery machinery
  21. Gold standard for breadth of human exposure information: National Health and Nutrition Examination Survey [1]. Since the 1960s; now biannual: 1999 onwards; 10,000 participants per survey. >250 exposures (serum + urine); GWAS chip; >85 quantitative clinical traits (e.g., serum glucose, lipids, body mass index); Death index linkage (cause of death). [1] http://www.cdc.gov/nchs/nhanes.htm
      [NHANES brochure text shown on slide]
      The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics. Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.
      All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.
      Survey Operations. Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish). An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.
      In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey. NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.
      Uses of the Data. Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs. Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey. NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.
      • Physical fitness and physical functioning
      • Reproductive history and sexual behavior
      • Respiratory disease (asthma, chronic bronchitis, emphysema)
      • Sexually transmitted diseases
      • Vision
  22. Gold standard for breadth of exposure & behavior data: National Health and Nutrition Examination Survey. Nutrients and Vitamins: vitamin D, carotenes. Infectious Agents: hepatitis, HIV, Staph. aureus. Plastics and consumables: phthalates, bisphenol A. Physical Activity: e.g., steps. Pesticides and pollutants: atrazine; cadmium; hydrocarbons. Drugs: statins; aspirin.
  23. EWAS in Type 2 Diabetes: Visualizing >200 associations with a Manhattan Plot. FBG > 125 mg/dL; age, sex, race, SES, BMI; FDR<10%. Highlighted: Heptachlor Epoxide OR=3.2, 1.8; PCB170 OR=4.5, 2.3; γ-tocopherol (vitamin E) OR=1.8, 1.6; β-carotene OR=0.6, 0.6. [Figure: Manhattan plot; y-axis: −log10(pvalue), 0 to 2; x-axis exposure categories: acrylamide, allergen test, bacterial infection, cotinine, dialkyl, dioxins, furans (dibenzofuran), heavy metals, hydrocarbons, latex, nutrients (carotenoid, minerals, vitamins A, B, C, D, E), PCBs, perchlorate, pesticides (atrazine, chlorophenol, organochlorine, organophosphate, pyrethroid), phenols, phthalates, phytoestrogens, polybrominated ethers, polyfluorochemicals, viral infection, volatile compounds.] PLOS ONE. 2010
  24. What E are associated with all-cause mortality and telomere length?
  25. How does it work? Searching for exposures and behaviors associated with all-cause mortality. NHANES 1999-2004 with National Death Index linked mortality; 246 behaviors and exposures (serum/urine/self-report). NHANES 1999-2001: N=330 to 6008 (26 to 655 deaths), ~5.5 years of followup; Cox proportional hazards on baseline exposure and time to death; false discovery rate < 5%. NHANES 2003-2004: N=177 to 3258 (20-202 deaths), ~2.8 years of followup; p < 0.05. Int J Epidem. 2013
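
Slide 25 reads like a two-stage screen: a false discovery rate threshold in one set of surveys, then p < 0.05 in a later survey. The sketch below shows that filtering logic only. The slide does not say which FDR procedure was used, so the Benjamini-Hochberg step-up rule here is a swapped-in, commonly used choice; the function names, exposure names, and p-values are hypothetical, and fitting the Cox models that would produce the p-values is out of scope.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Benjamini-Hochberg step-up rule (an assumed choice, not confirmed by the slide).
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected


def two_stage_screen(first_p, second_p, fdr=0.05, second_alpha=0.05):
    """Keep exposures passing the FDR threshold in the first data set
    and p < second_alpha in the second data set."""
    names = list(first_p)
    flags = benjamini_hochberg([first_p[name] for name in names], alpha=fdr)
    return [name for name, hit in zip(names, flags)
            if hit and second_p.get(name, 1.0) < second_alpha]


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    first = {"exposure_a": 0.0001, "exposure_b": 0.004,
             "exposure_c": 0.03, "exposure_d": 0.6}
    second = {"exposure_a": 0.01, "exposure_b": 0.2, "exposure_c": 0.04}
    print(two_stage_screen(first, second))
```

Requiring a finding to clear both stages is one way to limit the selective reporting that the lecture raises as a concern for one-exposure-at-a-time studies.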
  26. EWAS in All-cause mortality: 253 exposure/behavior associations in survival. Multivariate Cox (age, sex, income, education, race/ethnicity, occupation [in red]). [Figure: volcano plot; x-axis: Adjusted Hazard Ratio, 0.4 to 2.8; y-axis: -log10(pvalue), 0 to 8; legend: FDR < 5%, sociodemographics, replicated factor.] Labeled exposures and behaviors: 1 Physical Activity, 2 Does anyone smoke in home?, 3 Cadmium, 4 Cadmium, urine, 5 Past smoker, 6 Current smoker, 7 trans-lycopene (11). Labeled sociodemographics: 1 age (10 year increment), 2 SES_1, 3 male, 4 SES_0, 5 black, 6 SES_2, 7 SES_3, 8 education_hs, 9 other_eth, 10 mexican, 11 occupation_blue_semi, 12 education_less_hs, 13 occupation_never, 14 occupation_blue_high, 15 occupation_white_semi, 16 other_hispanic (69). IJE, 2013
  27. EWAS (re)-identifies factors associated with all-cause mortality: Volcano plot of 200 associations. Multivariate Cox (age, sex, income, education, race/ethnicity, occupation [in red]). [Figure: same volcano plot as slide 26, with annotations: age (10 years), income (quintile 2), income (quintile 1), male, black, income (quintile 3), any one smoke in home?, serum and urine cadmium [1 SD], past smoker?, current smoker?, serum lycopene [1 SD], physical activity [low, moderate, high activity]*.] *derived from METs per activity and categorized by Health.gov guidelines. R2 ~ 2%
  28. 452 associations in Telomere Length: Polychlorinated biphenyls associated with longer telomeres?! [Figure: volcano plot; x-axis: effect size, −0.2 to 0.2, annotated from shorter telomeres to longer telomeres; y-axis: −log10(pvalue), 0 to 4; FDR<5% marked. Labeled: PCBs, Trunk Fat, Alk. Phos, CRP, Cadmium, Cadmium (urine), cigs per day, retinyl stearate, VO2 Max, pulse rate.] R2 ~ 1%. Adjusted by age, age², race, poverty, education, occupation. Median N=3000; N range: 300-7000. IJE, 2016
  29. Samples exposed to PCBs associated with difference in genes implicated in telomere length GWAS? Expression differences for 24 GWAS implicated genes. Queried the Gene Expression Omnibus for PCBs: Affymetrix human arrays (GPL570); 7 gene expression experiments on humans; 52 exposed; 14 unexposed. IJE, 2016
      [Article shown on slide] Differential gene expression and a functional analysis of PCB-exposed children: Understanding disease and disorder development. Sisir K. Dutta (a), Partha S. Mitra (a), Somiranjan Ghosh (a), Shizhu Zang (a), Dean Sonneborn (b), Irva Hertz-Picciotto (b), Tomas Trnovec (c), Lubica Palkovicova (c), Eva Sovcikova (c), Svetlana Ghimbovschi (d), Eric P. Hoffman (d). (a) Molecular Genetics Laboratory, Howard University, Washington, DC, USA; (b) Department of Public Health Sciences, University of California Davis, Davis, CA, USA; (c) Slovak Medical University, Bratislava, Slovak Republic; (d) Center for Genetic Medicine, Children's National Medical Center, Washington, DC, USA. Environment International 40 (2012) 143–154. Received 20 December 2010; accepted 10 July 2011.
      Abstract: The goal of the present study is to understand the probable molecular mechanism of toxicities and the associated pathways related to observed pathophysiology in high PCB-exposed populations. We have performed a microarray-based differential gene expression analysis of children (mean age 46.1 months) of … [excerpt cut off]
  30. Suggestive, but need more N! Could PCBs influence expression of genes implicated in telomere length GWAS? [Figure: volcano plot; x-axis: log(difference), −0.50 to 0.75; y-axis: −log10(pvalue), 0 to 2. Labeled probes: 1555203_s_at (SLC44A4), 1555203_s_at (MYNN), 224206_x_at (MYNN).] myoneurin: bladder, leukemia, colorectal cancer GWASs. IJE, in press
  31. How can we study the elusive environment in larger scale for biomedical discovery? JAMA, 2014; JECH, 2014; Proc Symp Biocomp, 2015; ARPH, in press
      • bioinformatics to connect exposome with phenome
      • new ‘omics technologies to measure the exposome
      • dense correlations
      • reverse causality
      • confounding
      • (longitudinal) publicly available data
      [Article shown on slide] Opinion, Viewpoint: Studying the Elusive Environment in Large Scale. Chirag J. Patel, PhD, Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts. John P. A. Ioannidis, MD, DSc, Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California, Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California, and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California.
      It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment.[1] Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime.[1] Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting.
      To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3).[2,3] The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al[4] screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease.
      There are notable hurdles in analyzing “big” environmental data. These same problems affect epidemiology of 1-risk-factor-at-a-time, but in EWAS their prevalence becomes more clearly manifest at large scale. When study… [text cut off] … the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
      Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non–time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated).
      However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the many correlations), the intervention is not really simple. In essence what is tested are multiple perturbations of factors correlated with the one targeted for interven… [text cut off]
      High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through metabolomics, proteomics, and biosensors.[7] Eventually, patterns of … US federally funded gene expression experiment data be deposited in public repositories such as the Gene Expression Omnibus … repository has been instrumental in development of technolo… measurement of gene expression, data standardization, and … of data for discovery. Just as with the Gene Expression Omnibus … [column edge cut off]
      Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004. A, Serum cotinine: 37 total correlations. B, Serum total mercury: 42 total correlations. C, Serum cadmium: 68 total correlations. D, Serum trans-β-carotene: 68 total correlations. Legend: negative correlation, positive correlation; infectious agents, pollutants, nutrients and vitamins, demographic attributes. Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (stronge… [text cut off] The size of each node is proportional to the number of edges for a node, and the thickness of each edge indicates the magnitude of the correlation.
  32. Interdependencies of the exposome: Correlation globes paint a complex view of exposure. For each pair of E: Spearman ρ (575 factors: 81,937 correlations); permuted data to produce “null ρ”; sought replication in > 1 cohort. Red: positive ρ; blue: negative ρ; thickness: |ρ|. Pac Symp Biocomput. 2015; JECH. 2015
  33. Interdependencies of the exposome: Correlation globes paint a complex view of exposure. For each pair of E: Spearman ρ (575 factors: 81,937 correlations); permuted data to produce “null ρ”; sought replication in > 1 cohort. Red: positive ρ; blue: negative ρ; thickness: |ρ|. Effective number of variables: 500 (10% decrease). Pac Symp Biocomput. 2015; JECH. 2015
  34. Interdependencies of the exposome: Telomeres vs. all-cause mortality. [Figure: correlation globes for Telomere Length and All-cause mortality.] http://bit.ly/globebrowse
  35. Browse these and 82 other phenotype-exposome globes! http://www.chiragjpgroup.org/exposome_correlation
  36. What nodes have the most correlations / have the most connections? (“hubs of the network”) (What factors are correlated with others the most?) income... AJE 2015
  37. Lower income associated with 43 of 330 (>13%) exposures and biomarkers in the US population. Higher income: lower levels of biomarkers. (Another 23 associated with higher levels, for 20% in total.) [Figure: volcano plot of overall income/poverty ratio effects (per 1SD), with validated results marked; x-axis: Effect Size per 1SD of income/poverty ratio, -0.3 to 0.0; y-axis: -log10(pvalue), 10 to 30. Labeled: Pulse rate, Eosinophils number, Lymphocyte number, Monocyte, Segmented neutrophils number, Blood 2,5-Dimethylfuran, Cadmium, Lead, Cotinine, C-reactive protein, Floor, GFAAS, Protoporphyrin, Glycohemoglobin, Glucose (plasma), g-tocopherol, Hepatitis A Antibody, Homocysteine, Herpes I, Herpes II, Red cell distribution width, Alkaline phosphatase, Globulin, Glucose (serum), Gamma glutamyl transferase, Triglycerides, Blood Benzene, Blood 1,4-Dichlorobenzene, Blood Ethylbenzene, Blood Styrene, Blood Toluene, Blood m-/p-Xylene, White blood cell count, Mono-benzyl phthalate, 3-fluorene, 2-fluorene, 3-phenanthrene, 2-phenanthrene, 1-pyrene, Cadmium (urine), Albumin (urine), Lead (urine).] AJE, 2015
  38. 38. Studying the Elusive Environment in Large Scale. It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment.1 Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime.1 Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting. To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3).2,3 The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al4 screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease. There are notable hurdles in analyzing “big” environmental data. These same problems affect epidemiology of 1-risk-factor-at-a-time, but in EWAS their prevalence becomes more clearly manifest at large scale. When study- [...] the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non–time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated). However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the many correlations), the intervention is not really simple. In essence what is tested are multiple perturbations of factors correlated with the one targeted for interven- [...] VIEWPOINT. Chirag J. Patel, PhD, Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts. John P. A. Ioannidis, MD, DSc, Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California, Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California, and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California.
Opinion Viewpoint. JAMA, 2014; JECH, 2014; Proc Symp Biocomp, 2015. How can we study the elusive environment in larger scale for biomedical discovery? High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through metabolomics, proteomics, and biosensors.7 Eventually, patterns of [...] US federally funded gene expression experiment data be d[...]ited in public repositories such as the Gene Expression Omnibu[...] repository has been instrumental in development of technolo[...] measurement of gene expression, data standardization, and [...] of data for discovery. Just as with the Gene Expression Omnib[...]
Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Healt[...] Nutrition Examination Survey (NHANES) Participants, 2003-2004. Panels: A, Serum cotinine; B, Serum total mercury; C, Serum cadmium; D, Serum trans-β-carotene. Total correlations per panel, as listed: 37, 42, 68, 68. Edge legend: negative correlation, positive correl[...]. Node categories: infectious agents, pollutants, nutrients and vitamins, demographic attributes. Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (stronge[...] The size of each node is proportional to the number of edges for a node, and thickness of each edge indicates the magnitude of the correlation.
•bioinformatics to connect exposome with phenome •new ‘omics technologies to measure the exposome •dense correlations •reverse causality •confounding •(longitudinal) publicly available data
  39. 39. BD2K Patient-Centered Information Commons Integrated repositories of individual-level information PI: Isaac Kohane http://pic-sure.org
  40. 40. with Paul Avillach, Michael McDuffie, Jeremy Easton-Marks, Cartik Saravanamuthu and the BD2K PIC-SURE team NHANES 1999-2006 API available now http://bit.ly/nhanes_pici BD2K Patient-Centered Information Commons NHANES exposome browser
  41. 41. http://github.com repository to deposit and control code
  42. 42. http://chiragjpgroup.org/exposome-analytics-course Scientific Data, in press IJE, 2016 Reproduce our results!
  43. 43. Reproduce our results! Warm-up exercises before you do: - What is the average BMI? For females? Males? Kids (0-10 year olds)? Teens? - Identify an “exposure” variable: what is its average? For females? For males? - Associate that exposure variable with a phenotype: what analytic procedure would you use?
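As a starting point for the first warm-up questions, a grouped average can be sketched with the standard library alone. Everything here is an assumption for illustration: the record fields (`sex`, `age`, `bmi`), the toy values, and the teen age range (13-19) are not taken from the course materials. Population estimates from real NHANES data generally also call for the survey sampling weights, which this sketch ignores.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical participant records (not real NHANES rows).
records = [
    {"sex": "female", "age": 34, "bmi": 24.0},
    {"sex": "female", "age": 8, "bmi": 16.0},
    {"sex": "male", "age": 45, "bmi": 28.0},
    {"sex": "male", "age": 15, "bmi": 22.0},
    {"sex": "female", "age": 16, "bmi": 20.0},
    {"sex": "male", "age": 9, "bmi": 18.0},
]


def age_group(record):
    """Assign a coarse age group; the teen range is an assumption."""
    if record["age"] <= 10:
        return "kid"
    if 13 <= record["age"] <= 19:
        return "teen"
    return "other"


def mean_by(rows, key_fn, value="bmi"):
    """Unweighted mean of `value` within each group defined by key_fn."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row[value])
    return {group: mean(vals) for group, vals in groups.items()}


overall_bmi = mean(r["bmi"] for r in records)
bmi_by_sex = mean_by(records, lambda r: r["sex"])
bmi_by_age = mean_by(records, age_group)
```

The same `mean_by` helper works for an exposure variable by passing a different `value` field.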
  44. 44. P We are many phenotypes simultaneously: Can we better categorize these P? Body Measures Body Mass Index Height Blood pressure & fitness Systolic BP Diastolic BP Pulse rate VO2 Max Metabolic Glucose LDL-Cholesterol Triglycerides Inflammation C-reactive protein white blood cell count Kidney function Creatinine Sodium Uric Acid Liver function Aspartate aminotransferase Gamma glutamyltransferase Aging Telomere length
  45. 45. Creation of a phenotype-exposure association map: A 2-D view of 83 phenotype by 252 exposure associations. Clusters of exposures associated with clusters of phenotypes? 252 biomarkers of exposure × 83 clinical trait phenotypes; NHANES 1999-2000, 2001-2002, 2005-2006; ~21K regressions: replicated significant (FDR < 5%) in 2003-2004; adjusted by age, age², sex, race, income, chronic disease. Hugues Aschard, JP Ioannidis. [Figure: heatmap of 83 phenotypes by 252 exposures; color key: association size > 0 or < 0]
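The "~21K regressions" figure follows from 83 × 252 = 20,916 phenotype-exposure pairs, each tested and then filtered at FDR < 5%. The slide does not name the FDR procedure; the sketch below assumes the Benjamini-Hochberg step-up rule, and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans: True where the test is declared significant.

    Benjamini-Hochberg step-up: find the largest rank k such that
    p_(k) <= fdr * k / m, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= fdr * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


n_phenotypes, n_exposures = 83, 252
n_regressions = n_phenotypes * n_exposures  # 20,916, i.e. "~21K"

# Hypothetical p-values from ten of those regressions.
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(toy_pvalues)
```

In the design described on the slide, associations passing this filter in the discovery surveys would then be re-tested in the 2003-2004 survey.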
  46. 46. Creation of a phenotype-exposure association map: A 2-D view of connections between P and E. http://bit.ly.com/pemap [Figure: heatmap of phenotypes (rows) by exposures (columns); color key and histogram with values from -0.4 to 0.4 (+/-); annotated clusters: nutrients; BMI, weight, BMD; metabolic; renal function; PCBs; metabolic; blood parameters; hydrocarbons]
  47. 47. EWAS-derived phenotype-exposure association map: Zooming in to WBC and BMI phenotype clusters. http://bit.ly.com/pemap [Figure: zoomed heatmap with the full map as an inset; color key from -0.4 to 0.4 (+/-). Labels shown in the zoomed panels: Body Mass Index, Waist circumference, Trunk fat, Total fat, Weight, Total lean fat, Thigh circumference, Calf circumference, Trunk Lean, Skinfold, CRP; Trans-b-carotene, a-carotene, cis-b-carotene, b-cryptoxanthin, lutein/zeaxanthin, Vitamin D, Magnesium, Folate, VO2 Max, PCB180, Cotinine, 100 cigs, Cig in last 30, Cadmium, Benzene, Toluene, Smoke in home?, Styrene, Current smoker, 3-fluorene, 2-fluorene; White blood cell count, Segmented neutrophils, Monocyte number, Lymphocyte number, Eosinophils number, Basophils number, Alkaline phosphatase, Homocysteine, Hemoglobin, Pulse rate]
  48. 48. Toward a phenotype-exposure association map: (Re)-categorizing phenotypes with E. [Figure: dendrogram of phenotypes (distance axis from 7 to 0); leaves labeled as category:phenotype (e.g., liver:Albumin, kidney:Bicarbonate); highlighted clusters: inflammation, adiposity, kidney function, metabolic traits]
  49. 49. Toward a phenotype-exposure association map: (Re)-categorizing phenotypes with E. [Figure: same dendrogram of phenotypes (distance axis from 7 to 0); highlighted: “bad” cholesterol, “good” cholesterol]
  50. 50. Toward a phenotype-exposure association map: (Re)-categorizing phenotypes with E. [Figure: same dendrogram of phenotypes (distance axis from 7 to 0); highlighted: height + BMD]
  51. 51. Additive effect of E factors: 1 to 66 exposures identified for 81 phenotypes. Describe < 20% of variability in P (on average: 8%). σ²_E? Recall: Avg(h²) = 50%. Long road ahead to capture σ²_P. [Figure: bar chart of R² × 100 (axis from 0 to 40) for each phenotype, starting with Triglycerides, Total Cholesterol, and LDL-cholesterol]
  52. 52. Connecting Environmental Exposure with Disease: Missing the “System” of Exposures? Exposed to many things, but do not assess the multiplicity. Fragmented literature of associations. Challenge to discover E associated with disease. [Figure: 2×2 table of exposure (E+, E-) by disease status (diseased, non-diseased), with a question mark]
  53. 53. Example of fragmentation: Is everything we eat associated with cancer? Schoenfeld and Ioannidis, AJCN 2012. 50 random ingredients from Boston Cooking School Cookbook: any associated with cancer? Of 50, 40 studied in cancer risk. Weak statistical evidence: non-replicated, inconsistent effects, non-standardized. FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studie[...] outliers are not shown (effect estimates >10).
  54. 54. https://www.youtube.com/watch?v=0Rnq1NpHdmw
  55. 55. A maze of associations is one way to a fragmented literature and Vibration of Effects. Young, 2011; JCE, 2015. Model labels on the slide: univariate; sex; sex & age; sex & race; sex & race & age; P < 0.05. Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one can work towards a suitably small p-value. © ktsdesign – Fotolia. [Article excerpt, cropped at the column edges in the slide image: e modelling oblem is akin to – but less well sed and more poorly understood than – e testing. For example, consider the use r regression to adjust the risk levels of atments to the same background level There can be many covariates, and t of covariates can be in or out of the With ten covariates, there are over 1000 models. Consider a maze as a metaphor elling (Figure 3). The red line traces the path out of the maze. The path through ze looks simple, once it is known. ways in the literature for dealing with model selection, so we propose a new, composite 2. Publication bias is general recognition that a paper much better chance of acceptance if hing new is found. This means that, for ation, the claim in the paper has to sed on a p-value less than 0.05. From g’s point of view5, this is quality by tion. The journals are placing heavy ce on a statistical test rather than nation of the methods and steps that o a conclusion. As to having a p-value han 0.05, some might be tempted to the system10 through multiple testing, ple modelling or unfair treatment of or some combination of the three that to a small p-value. Researchers can be creative in devising a plausible story to statistical finding. 2 The data cleaning team creates a modelling data set and a holdout set and]
  56. 56. Distribution of associations and p-values due to model choice: Estimating the Vibration of Effects (or Risk). Variable of interest: e.g., 1 SD of log(serum Vitamin D). Adjusting variable set (n=13): SES [3rd tertile]; education [>HS]; race [white]; body mass index [normal]; total cholesterol; any heart disease; family heart disease; any hypertension; any diabetes; any cancer; current/past smoker [no smoking]; drink 5/day; physical activity. All-subsets Cox regression: 2^13 + 1 = 8,193 models. Data source: NHANES 1999-2004; 417 variables of interest; time to death; N ≥ 1000 (≥ 100 deaths). Outputs: effect sizes and p-values. JCE, 2015. [Figure: hazard ratio vs −log10(pvalue) across models, with median p-value/HR for each k (number of adjusting variables) and percentile indicators (1, 50, 99); Vitamin D (1SD(log)): RHR = 1.14, RP = 4.68; Thyroxine (1SD(log)): RHR = 1.15, RP = 2.90]
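The all-subsets design can be sketched as an enumeration of adjustment sets followed by a summary of how much the estimates "vibrate". The sketch assumes that RHR is the ratio of the 99th to the 1st percentile hazard ratio and that RP is the difference between the 99th and 1st percentile of −log10(p-value); the slide does not spell these definitions out. It also takes already-fitted (hazard ratio, p-value) pairs as input, since fitting Cox models needs a survival library that is not shown here. The identifiers and the nearest-rank percentile rule are illustrative choices. Enumerating every subset of 13 variables, including the empty set, gives 2^13 = 8,192 adjustment sets (the slide counts 8,193 models).

```python
from itertools import combinations
from math import ceil, log10

# The 13 adjusting variables listed on the slide (identifiers are illustrative).
ADJUSTERS = [
    "ses", "education", "race", "bmi", "total_cholesterol",
    "any_heart_disease", "family_heart_disease", "any_hypertension",
    "any_diabetes", "any_cancer", "smoking", "drink_5_per_day",
    "physical_activity",
]


def all_adjustment_sets(adjusters):
    """Yield every subset of the adjusters, from the empty set to the full set."""
    for k in range(len(adjusters) + 1):
        yield from combinations(adjusters, k)


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending-sorted list (q in 0..100)."""
    index = max(0, ceil(q / 100 * len(sorted_values)) - 1)
    return sorted_values[index]


def vibration(results):
    """Summarize (hazard_ratio, p_value) pairs, one pair per fitted model.

    Returns (RHR, RP) under the definitions assumed in the text above.
    """
    hazard_ratios = sorted(hr for hr, _ in results)
    neg_log_p = sorted(-log10(p) for _, p in results)
    rhr = percentile(hazard_ratios, 99) / percentile(hazard_ratios, 1)
    rp = percentile(neg_log_p, 99) - percentile(neg_log_p, 1)
    return rhr, rp


n_sets = sum(1 for _ in all_adjustment_sets(ADJUSTERS))

# Two hypothetical model fits, just to exercise the summary.
toy_rhr, toy_rp = vibration([(0.70, 1e-6), (0.80, 1e-3)])
```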
  57. 57. The Vibration of Effects: Vitamin D and Thyroxine and attenuated risk in mortality. [Figure: −log10(pvalue) against hazard ratio by number of adjusting variables (0 to 13), with 1st, 50th, and 99th percentile indicators. Panels: Vitamin D (1SD(log)), RHR = 1.14, RP = 4.68; Thyroxine (1SD(log)), RHR = 1.15, RP = 2.90.] JCE, 2015
  58. 58. The Vibration of Effects: shifts in the effect size distribution due to select adjustments (e.g., adjusting cadmium levels with smoking status). [Figure: −log10(pvalue) against hazard ratio for Cadmium (1SD(log)), RHR = 1.29, RP = 8.29; a second Cadmium (1SD(log)) panel is labeled adjustment=current_past_smoking.] JCE, 2015
  59. 59. [Figure: grid of vibration-of-effects plots, −log10(pvalue) against hazard ratio, one panel per nutrient (1SD(log)). Panels with RHR and RP: Vitamin E as alpha-tocopherol, RHR = 1.15, RP = 3.17; Beta-carotene, RHR = 1.15, RP = 2.34; Caffeine, RHR = 1.10, RP = 1.99; Calcium, RHR = 1.13, RP = 1.15; Carbohydrate, RHR = 1.12, RP = 1.57; Carotene, RHR = 1.14, RP = 1.53; Cholesterol, RHR = 1.08, RP = 0.64; Copper, RHR = 1.17, RP = 2.86; Beta-cryptoxanthin, RHR = 1.15, RP = 1.39; Folic acid, RHR = 1.09, RP = 0.41; Folate, DFE, RHR = 1.14, RP = 2.39; Food folate, RHR = 1.14, RP = 4.64; Dietary fiber, RHR = 1.15, RP = 2.79; Total Folate, RHR = 1.15, RP = 2.11; Iron, RHR = 1.12, RP = 1.91. Highlighted: β-carotene, caffeine, cholesterol, food folate.] JCE, 2015
  60. 60. [Figure: grid of vibration-of-effects plots, −log10(pvalue) against hazard ratio, one panel per nutrient (1SD(log)). Panels with RHR and RP: Potassium, RHR = 1.14, RP = 2.28; Protein, RHR = 1.11, RP = 1.42; Retinol, RHR = 1.13, RP = 0.67; SFA 4:0, RHR = 1.11, RP = 1.29; SFA 6:0, RHR = 1.11, RP = 1.71; SFA 8:0, RHR = 1.10, RP = 2.55; SFA 10:0, RHR = 1.11, RP = 1.87; SFA 12:0, RHR = 1.08, RP = 1.79; SFA 14:0, RHR = 1.11, RP = 1.61; SFA 16:0, RHR = 1.11, RP = 0.84; SFA 18:0, RHR = 1.10, RP = 0.73; Selenium, RHR = 1.09, RP = 1.24; Total saturated fatty acids, RHR = 1.11, RP = 0.93; Sodium, RHR = 1.12, RP = 3.74; Total sugars, RHR = 1.13, RP = 2.51; Total fat, RHR = 1.11, RP = 0.54; Theobromine, RHR = 1.08, RP = 1.19; Vitamin A, RHR = 1.13, RP = 1.09; Vitamin A, RAE, RHR = 1.16, RP = 1.31; Retinol (second panel), RHR = 1.15, RP = 0.74. Highlighted: sodium, sugars, SFA 6:0, SFA 8:0, SFA 10:0.]
  61. 61. The Vibration of Effects: beware of the Janus effect (both risk and protection?!). Janus (two-faced) risk profile: risk and significance depend on the modeling scenario! Plot regions: “risk”, “protection”, “significant”. Image: Britannica.com. http://bit.ly/effectvibration JCE, 2015
  62. 62. Emerging technologies to ascertain the exposome will enable biomedical discovery. High-throughput E data standards & exposome: mitigate fragmented literature of associations. Confounding, reverse causality: how to handle at large dimension? E.g., EWASs in telomere length and mortality and 81 quantitative phenotypes. Prioritize biological and epidemiological studies.
  63. 63. How can we study the elusive environment in larger scale for biomedical discovery? JAMA, 2014; JECH, 2014; Proc Symp Biocomp, 2015
• bioinformatics to connect exposome with phenome
• new ‘omics technologies to measure the exposome
• dense correlations
• reverse causality
• confounding
• (longitudinal) publicly available data
Excerpt shown on the slide (JAMA, Opinion, Viewpoint; text lost where the slide image is cropped is marked […]): Studying the Elusive Environment in Large Scale. Chirag J. Patel, PhD, Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts. John P. A. Ioannidis, MD, DSc, Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California, Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California, and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California.
It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment [1]. Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime [1]. Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting.
To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3) [2,3]. The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al [4] screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease.
There are notable hurdles in analyzing “big” environmental data. These same problems affect epidemiology of 1-risk-factor-at-a-time, but in EWAS their prevalence becomes more clearly manifest at large scale. When study[…] the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non-time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated).
However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the many correlations), the intervention is not really simple. In essence what is tested are multiple perturbations of factors correlated with the one targeted for interven[…]
High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through metabolomics, proteomics, and biosensors [7]. Eventually, patterns of […] US federally funded gene expression experiment data be deposited in public repositories such as the Gene Expression Omnibus. The repository has been instrumental in development of technology for measurement of gene expression, data standardization, and reuse of data for discovery. Just as with the Gene Expression Omnibus, an […]
Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004. Panels: A, Serum cotinine; B, Serum total mercury; C, Serum cadmium; D, Serum trans-β-carotene. Total correlations listed: 37, 42, 68, 68. Edge legend: negative correlation, positive correlation. Node categories: infectious agents, pollutants, nutrients and vitamins, demographic attributes. Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (strongest 10%). The size of each node is proportional to the number of edges for a node, and the thickness of each edge indicates the magnitude of the correlation.
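The figure caption describes how each globe is built: pairwise correlations between exposures, edges kept when the absolute correlation exceeds 0.2, and node size set by the number of edges. A minimal sketch of that construction on toy data. Pearson correlation is an assumption here (the caption does not say which correlation was used), and the exposure names and values are hypothetical:

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_edges(exposures, threshold=0.2):
    """Keep exposure pairs whose |correlation| exceeds the threshold."""
    edges = {}
    for a, b in combinations(sorted(exposures), 2):
        r = pearson(exposures[a], exposures[b])
        if abs(r) > threshold:
            edges[(a, b)] = r  # sign: positive or negative; |r|: edge thickness
    return edges


def node_degree(exposures, edges):
    """Number of edges per node, which the caption maps to node size."""
    degree = {name: 0 for name in exposures}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Toy measurements for three hypothetical exposures in five participants.
toy = {
    "exposure_a": [1, 2, 3, 4, 5],
    "exposure_b": [2, 4, 6, 8, 10],
    "exposure_c": [1, -1, 0, -1, 1],
}
edges = correlation_edges(toy)
degree = node_degree(toy, edges)
print(edges, degree)
```

With 317 exposures the same loop would visit 317 × 316 / 2 = 50,086 pairs, which is why dense correlation becomes a practical concern at this scale.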
  64. 64. Up to 1% of the metabolome with robust significance (Bonferroni-corrected): What are their identities?
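For readers unfamiliar with the correction named on this slide, a minimal sketch of a Bonferroni threshold: with m features tested, a feature counts as significant only if its p-value is below alpha / m. The function name, the alpha of 0.05, and the p-values are assumptions for illustration; they are synthetic and not the study's data:

```python
def bonferroni_hits(p_values, alpha=0.05):
    """Return indices of tests that pass the Bonferroni-corrected threshold."""
    m = len(p_values)
    cutoff = alpha / m  # family-wise threshold shared by all m tests
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Synthetic p-values for 1,000 hypothetical metabolomic features.
p_values = [1e-7, 2e-6, 4e-5] + [0.5] * 997
hits = bonferroni_hits(p_values)

# Share of features that survive the correction.
fraction = len(hits) / len(p_values)
print(hits, fraction)
```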
  65. 65. We found up to 1% of the features of the untargeted metabolome associated with pre- vs. post-shift. What are they? Dark matter of the exposome? False positives? Francine Laden (Harvard Chan), Jaime Hart (Harvard Chan), Dean Jones (Emory), Doug Walker (Emory), Jake Chung
  66. 66. To efficiently sift, prioritize, and integrate associations for precision medicine: Catalog of exposure-phenotype findings. Columns: putative identity (hmdb id); mass-to-charges; study design; phenotype; effect size; pvalue. Example row: Trimethylamine-N-oxide; 59.035, 76.126; matched case-control; myocardial infarction; OR: 2.5; 1×10^-3. ARPH, in press
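One way to picture a catalog entry is as a typed record with one field per column on the slide. This is only a sketch: the class name, field names, and the helper function are hypothetical, and the HMDB identifier is left empty because the slide does not give one:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ExposurePhenotypeFinding:
    """One row of a hypothetical exposure-phenotype catalog."""
    putative_identity: str
    hmdb_id: Optional[str]
    mass_to_charges: Tuple[float, ...]
    study_design: str
    phenotype: str
    effect_measure: str
    effect_size: float
    p_value: float


# The example row shown on the slide.
tmao = ExposurePhenotypeFinding(
    putative_identity="Trimethylamine-N-oxide",
    hmdb_id=None,  # not given on the slide
    mass_to_charges=(59.035, 76.126),
    study_design="matched case-control",
    phenotype="myocardial infarction",
    effect_measure="OR",
    effect_size=2.5,
    p_value=1e-3,
)


def findings_for(catalog: List[ExposurePhenotypeFinding], phenotype: str):
    """Return catalog rows for one phenotype, smallest p-value first."""
    rows = [f for f in catalog if f.phenotype == phenotype]
    return sorted(rows, key=lambda f: f.p_value)


print(findings_for([tmao], "myocardial infarction"))
```

Keeping study design and effect measure as explicit fields is what would let findings from different studies be compared side by side.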
  67. 67. Catalogs of GWAS findings have enabled integration and critical evaluation of genotype-phenotype associations https://www.ebi.ac.uk/gwas/
  68. 68. Precision medicine research may be more fruitful outside of typical US-based populations…
  69. 69. 90 countries; 300 surveys; >10,000 socioeconomic, environmental, and behavior variables; N=600K in sub-Saharan Africa alone; HIV status, chronic disease. Eran Bendavid (Stanford)
  70. 70. What factors (of ~1000) are associated with female HIV status in Zambia (2007 and 2013)? AUC of prediction (training and testing): 80%. Factors shown: age (40-45); BMI; contraception method; marital status (divorce); marital status (widowed); # daughters; genital ulcer?; household size; # children
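The AUC quoted on this slide can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A minimal sketch of that pairwise definition; the function name is hypothetical and the labels and scores are synthetic, not the survey data:

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    case scores higher; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic example: 1 = HIV positive, 0 = HIV negative (hypothetical scores).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

The quadratic loop is fine for illustration; a rank-based formula gives the same value more efficiently on large surveys.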
  71. 71. Conclusions: Toward a more precise medicine with G (and E)
Precision medicine research will be fruitful in areas of high G, E, and P variation.
G describes a variable proportion of P (sometimes modest).
High-throughput E may enable us to complement G.
E is difficult to study (biases, lack of tools, VoE).
[Figure: Heritability, Var(G)/Var(Phenotype), on a 0 to 100 scale, annotated with σ²_E. Traits listed: Eye color; Hair curliness; Type-1 diabetes; Height; Schizophrenia; Epilepsy; Graves' disease; Celiac disease; Polycystic ovary syndrome; Attention deficit hyperactivity disorder; Bipolar disorder; Obesity; Alzheimer's disease; Anorexia nervosa; Psoriasis; Bone mineral density; Menarche, age at; Nicotine dependence; Sexual orientation; Alcoholism; Lupus; Rheumatoid arthritis; Crohn's disease; Migraine; Thyroid cancer; Autism; Blood pressure, diastolic; Body mass index; Depression; Coronary artery disease; Insomnia; Menopause, age at; Heart disease; Prostate cancer; QT interval; Breast cancer; Ovarian cancer; Hangover; Stroke; Asthma; Blood pressure, systolic; Hypertension; Osteoarthritis; Parkinson's disease; Longevity; Type-2 diabetes; Gallstone disease; Testicular cancer; Cervical cancer; Sciatica; Bladder cancer; Colon cancer; Lung cancer; Leukemia; Stomach cancer.]
Excerpt shown on the slide (Opinion, Viewpoint; text lost where the slide image is cropped is marked […]): High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through metabolomics, proteomics, and biosensors [7]. Eventually, patterns of […] US federally funded gene expression experiment data be deposited in public repositories such as the Gene Expression Omnibus. The repository has been instrumental in development of technology for measurement of gene expression, data standardization, and reuse of data for discovery. Just as with the Gene Expression Omnibus, an […]
Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004. Panels: A, Serum cotinine; B, Serum total mercury; C, Serum cadmium; D, Serum trans-β-carotene. Total correlations listed: 37, 42, 68, 68. Edge legend: negative correlation, positive correlation. Node categories: infectious agents, pollutants, nutrients and vitamins, demographic attributes. Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (strongest 10%). The size of each node is proportional to the number of edges for a node, and the thickness of each edge indicates the magnitude of the correlation.
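The heritability axis, Var(G)/Var(Phenotype), ties back to the deck's P = G + E framing. A minimal sketch under a simplifying assumption that G and E are uncorrelated and do not interact, so that Var(P) = Var(G) + Var(E). The function name is hypothetical, and the values are toy numbers chosen so the covariance is exactly zero:

```python
from statistics import pvariance


def heritability(g, e):
    """Var(G) / Var(P) for phenotypes built as P = G + E."""
    p = [gi + ei for gi, ei in zip(g, e)]
    return pvariance(g) / pvariance(p)


# Toy genetic and environmental contributions for five individuals.
# They are constructed to be exactly uncorrelated.
g = [1.0, 2.0, 3.0, 4.0, 5.0]
e = [1.0, -1.0, 0.0, -1.0, 1.0]

h2 = heritability(g, e)
var_p = pvariance([gi + ei for gi, ei in zip(g, e)])

# With zero covariance, the variances add: Var(P) = Var(G) + Var(E).
print(round(h2, 3), var_p, pvariance(g) + pvariance(e))
```

Whatever share of the variance is not attributed to G in this decomposition is the σ²_E term the slide points to.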
  72. 72. Need to consider both G and E towards a more precise medicine and dissecting P. P = G + E. [Figure: −log10(pvalue) by exposure category: acrylamide; allergen test; bacterial infection; cotinine; diakyl; dioxins; furans dibenzofuran; heavy metals; hydrocarbons; latex; nutrients carotenoid; nutrients minerals; nutrients vitamin A; nutrients vitamin B; nutrients vitamin C; nutrients vitamin D; nutrients vitamin E; pcbs; perchlorate; pesticides atrazine; pesticides chlorophenol; pesticides organochlorine; pesticides organophosphate; pesticides pyrethyroid; phenols; phthalates; phytoestrogens; polybrominated ethers; polyflourochemicals; viral infection; volatile compounds.] [Figure: correlation interdependency globes. Panels labeled: A, Serum cotinine; B, Serum total mercury. Total correlations listed: 37, 42, 68, 68. Node categories: infectious agents, pollutants, nutrients and vitamins, demographic attributes.]
  73. 73. Acknowledgements. Harvard DBMI: Isaac Kohane, Susanne Churchill, Stan Shaw, Jenn Grandfield, Sunny Alvear, Michal Preminger. Harvard Chan: Hugues Aschard, Francesca Dominici. Ken Mandl. Stanford: John Ioannidis, Atul Butte (UCSF). U Queensland: Jian Yang, Peter Visscher. Cochrane: Belinda Burford. RagGroup: Chirag Lakhani, Adam Brown, Danielle Rasooly, Arjun Manrai, Erik Corona, Nam Pho. NIH Common Fund; Big Data to Knowledge. Chirag J Patel, chirag@hms.harvard.edu, @chiragjp, www.chiragjpgroup.org
