Building a search engine for exposures in disease
1. Building a search engine to find
environmental and phenotypic factors
associated with disease and health
Chirag J Patel
University of Puerto Rico, Humacao U-STAR
02/21/17
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
2. P = G + E
Type 2 Diabetes
Cancer
Alzheimer’s
Gene expression
Phenotype Genome
Variants
Environment
Infectious agents
Nutrients
Pollutants
Drugs
3. We are great at G investigation!
over 2400
Genome-wide Association Studies (GWAS)
https://www.ebi.ac.uk/gwas/
G
4. Nothing comparable to elucidate E influence!
E: ???
We lack high-throughput methods
and data to discover new E in P…
8. $H^2 = \sigma_G^2 / \sigma_P^2$
Heritability ($H^2$) is the proportion of phenotypic variance in a population that is attributable to genetic variance.
It indicates how much of the phenotypic differences between individuals can be attributed to G.
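The ratio above can be illustrated with a small simulation. This is a minimal sketch, assuming the additive model P = G + E from the earlier slide with independent, normally distributed G and E; the function name, sample size, and the "true" value of 0.6 are illustrative choices, not values from the talk.

```python
import random
import statistics

def simulate_heritability(h2_true=0.6, n=20000, seed=0):
    """Simulate P = G + E and return the estimate Var(G) / Var(P)."""
    rng = random.Random(seed)
    # Assumption: G and E are independent, so Var(P) = Var(G) + Var(E).
    g = [rng.gauss(0.0, h2_true ** 0.5) for _ in range(n)]
    e = [rng.gauss(0.0, (1.0 - h2_true) ** 0.5) for _ in range(n)]
    p = [gi + ei for gi, ei in zip(g, e)]
    return statistics.pvariance(g) / statistics.pvariance(p)

if __name__ == "__main__":
    print(round(simulate_heritability(), 2))
```

In real data G is not observed directly, so heritability has to be estimated indirectly (for example from relatives); the simulation only shows what the ratio means.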
10. Height is an example of a heritable trait:
Francis Galton shows how it's done (1887)
“mid-height of 205 parents
described 60% of variability of 928
offspring”
what explains the other 40%???
nutrition?
economics?
12. Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Polycystic ovary syndrome
Attention deficit hyperactivity disorder
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Bone mineral density
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Rheumatoid arthritis
Crohn's disease
Migraine
Thyroid cancer
Autism
Blood pressure, diastolic
Body mass index
Depression
Coronary artery disease
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Blood pressure, systolic
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
[Figure: bar chart of heritability (%) per trait, x-axis 0 to 100]
Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com
G estimates for burdensome diseases are low and variable:
massive opportunity for high-throughput E discovery
Type 2 Diabetes
Heart Disease
Autism (50%???)
13. G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
[Figure: same heritability chart as slide 12; Heritability: Var(G)/Var(Phenotype), Source: SNPedia.com]
$\sigma_E^2$: Exposome!
15. It took a new paradigm of GWAS for discovery:
Human Genome Project to GWAS
Sequencing of the genome
2001
HapMap project:
http://hapmap.ncbi.nlm.nih.gov/
Characterize common variation
2001-current day
High-throughput variant
assay
< $99 for ~1M variants
Measurement tools
~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000
cases of seven common diseases and
3,000 shared controls
The Wellcome Trust Case Control Consortium*
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10^-7: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a …
Vol 447|7 June 2007|doi:10.1038/nature05911
WTCCC, Nature, 2007.
Comprehensive, high-throughput analyses
GWAS
16. Explaining the other 50%:
A big data-driven paradigm for robust discovery of
E in disease via EWAS and the exposome
what to measure? how to measure?
PERSPECTIVES
[Figure] Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum.
External environment: radiation, diet, pollution, infections, drugs, life-style, stress
Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora
Exposome chemical classes: reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins
… critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly …
…ruptors and can be measured through serum …
…some (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).
Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples.
With successful characterization of both …
“A more comprehensive view of
environmental exposure is
needed ... to discover major
causes of diseases...”
how to analyze in relation to health?
Wild, 2005
Rappaport and Smith, 2010, 2011
Buck-Louis and Sundaram 2012
Miller and Jones, 2014
Patel CJ and Ioannidis JPA, 2014
17. What is a Genome-Wide Association Study (GWAS)?:
Data-driven search for G factors in P
[Figure: (a) observed test statistic; (b) Manhattan plot of −log10(P) by chromosome (1-22, X)]
NATURE|Vol 447|7 June 2007
WTCCC, 2007
AA Aa aa
case
control
Robust, transparent, and comprehensive search for G in P
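The case/control by genotype (AA, Aa, aa) comparison on this slide can be sketched as a chi-square test on a 2×3 table, repeated for every variant in a real GWAS. This is an illustrative sketch only: the function name and the counts in the usage example are hypothetical, every genotype column is assumed to be non-empty, and published GWAS typically use more elaborate tests (for example trend tests with covariate adjustment).

```python
import math

def genotype_chi2(cases, controls):
    """Chi-square test of independence for a 2x3 genotype table.

    cases / controls: counts of (AA, Aa, aa). Returns (chi2, p_value).
    Assumes every genotype column has at least one observation.
    """
    table = [list(cases), list(controls)]
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 2 degrees of freedom the chi-square survival function
    # has the closed form exp(-x / 2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value

if __name__ == "__main__":
    chi2, p = genotype_chi2(cases=(200, 500, 300), controls=(300, 500, 200))
    print(chi2, -math.log10(p))  # the second value is what a Manhattan plot shows
```

Plotting −log10(P) for each variant against its chromosomal position gives the Manhattan plot shown on the slide.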
19. GWAS example
Example of the big data paradigm:
GWAS drives discovery of G in P
ARTICLES
[Figure: Manhattan plots of −log10(P) by chromosome (1-22, X); upper panel: unconditional analysis, lower panel: conditional analysis]
Legend: locus established previously; locus identified by current study; locus not confirmed by current study; suggestive statistical association (P < 1 × 10^-5); association in identified or established region (P < 1 × 10^-4)
Labeled loci: BCL11A, THADA, NOTCH2, ADAMTS9, IRS1, IGF2BP2, WFS1, ZBED3, CDKAL1, HHEX/IDE, KCNQ1 (2 signals), TCF7L2, KCNJ11, CENTD2, MTNR1B, HMGA2, ZFAND6, PRC1, FTO, HNF1B, DUSP9, TSPAN8/LGR5, HNF1A, CDC123/CAMK1D, CHCHD9, CDKN2A/2B, SLC30A8, TP53INP1, JAZF1, KLF14, PPAR
Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10^-5), whereas secondary signals close to already confirmed T2D loci are shown in purple (P < 10^-4).
Voight et al, Nature Genetics 2012
N=8K T2D, 39K Controls
Impossible to reach this scale in E based investigations
20. Connecting E with Disease:
Missing the “System” of Exposures?
E+ E-
diseased
non-
diseased
?
Exposed to many things, but do not assess the multiplicity.
Fragmented literature of associations.
Challenge to discover E associated with disease.
22. Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey1
since the 1960s
now biennial (two-year cycles): 1999 onwards
10,000 participants per survey
The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.
Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.
All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.
Survey Operations
Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).
An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.
In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.
NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.
Uses of the Data
Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.
Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey. NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.
• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bronchitis, emphysema)
• Sexually transmitted diseases
• Vision
1 http://www.cdc.gov/nchs/nhanes.htm
>250 exposures (serum + urine)
GWAS chip
>85 quantitative clinical traits
(e.g., serum glucose, lipids, body
mass index)
Death index linkage (cause of
death)
23. Gold standard for breadth of exposure & behavior data:
National Health and Nutrition Examination Survey
Nutrients and Vitamins
vitamin D, carotenes
Infectious Agents
hepatitis, HIV, Staph. aureus
Plastics and consumables
phthalates, bisphenol A
Physical Activity
e.g., steps
Pesticides and pollutants
atrazine; cadmium; hydrocarbons
Drugs
statins; aspirin
25. Type 2 Diabetes Mellitus:
A complex, multifactorial disease
•Insulin production vs. use
•beta-cell function
•insulin sensitivity (BMI)
•Moves glucose from blood into
cells
•Complications arise due to
glucose in blood, hyperglycemia
•diagnosed by blood glucose
levels
CDC,
body weight, diet, lifestyle, age
27. What E are correlated with heart disease risk factors?
28. EWAS on Serum Lipid Levels:
Triglycerides, LDL-Cholesterol, HDL-Cholesterol
• Risk factors for coronary heart disease (CHD)
• Targets for intervention (e.g., statins)
• Influenced by smoking, physical activity, diet, genetics (1)
• LDL-C Δ1%: 1% increased risk for CHD (2)
• HDL-C Δ1%: 2% decreased risk for CHD (3)
• Triglycerides: higher risk for CHD
(1) Teslovich et al. Nature (2010)
(2) Grundy et al. ATVB (2004)
(3) Gotto et al. JACC (2004)
29. EWAS in HDL-C:
17 Validated Factors
FDR < 5%
Factor classes: carotenes, cotinine, heavy metals, organochlorine pesticides, hydrocarbons, vitamins, minerals
[Figure: panels A-D; x-axis E, y-axis log10(HDL-C)]
adjusted for BMI, SES, ethnicity, age, age², sex
N=1000-3000
1-5 mg/dL
R² ~ 15%
Int J Epidem. 2012
30. EWAS in Triglycerides and LDL-C
22 factors
organochlorine pesticides
polychlorinated biphenyls
carotenoids
vitamin E
vitamin A
8 factors
carotenoids
vitamin E
vitamin A
Int J Epidem. 2012.
1-15 mg/dL
R2 ~ 15, 2%
32. Persistent pollutants and endocrine disruptors found in
T2D and Heart Disease risk factors:
How are these factors linked with these diseases?
•organochlorine pesticides
•polychlorinated biphenyls
•dibenzofurans
•dioxins
•found all over the world
•persist in food chain
Porta et al, Environ Int 2008
•heart disease,
•T2D/insulin resistance
Porta et al, Lancet, 2006
Lee et al, Diabetes Care, 2006
Lee et al, Diabetologia, 2007
Everett et al, Environ Res, 2010
Lind et al, EHP, 2011
(Korea, Japan, Europe)
Biological mechanisms remain elusive...
capacitors
adhesives
33. Challenges in exposome data mining:
confounding and reverse causality hinder inference!
example: HDL-C
Could the disease “lead” to
exposure?
“Reverse causality”
γ-tocopherol
?
tocopherol (vitamin E) supplements for
T2D individuals?
T2D
Could there be something confounding
the association?
statin use
β-carotene
confounders
high HDL
??
34. Longitudinal Study:
“Silver Standard” to mitigate risk of reverse causality
•exposure changing through time
•reverse causality bias
•compute disease risk
age/time
HDL-Cholesterol
(mg/dL)
[high]
[low]
[γ-tocopherol]
tocopherol (vitamin E) supplements for
CHD individuals?
T2D
?
γ-tocopherol
36. What E are associated with aging:
all-cause mortality and telomere length?
37. How does it work?:
Searching for exposures and behaviors associated with all-
cause mortality.
NHANES: 1999-2004
National Death Index linked mortality
246 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths)
~5.5 years of followup
Cox proportional hazards
baseline exposure and time to death
False discovery rate < 5%
NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)
~2.8 years of followup
p < 0.05
Int J Epidem. 2013
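The two-stage screen on this slide (false discovery rate < 5% in the discovery cohort, then p < 0.05 in the replication cohort) can be sketched as below. This is a minimal sketch assuming the Benjamini-Hochberg procedure is used for the FDR step; the slide does not name the procedure, and the function names and p-values are hypothetical. A real analysis would first obtain the p-values from the Cox models and would usually also require a consistent direction of effect.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def validated(discovery_p, replication_p, fdr=0.05, alpha=0.05):
    """Indices passing FDR < fdr in discovery and p < alpha in replication."""
    q = benjamini_hochberg(discovery_p)
    return [i for i in range(len(q)) if q[i] < fdr and replication_p[i] < alpha]

if __name__ == "__main__":
    # Hypothetical p-values for three exposures in the two cohorts.
    print(validated([0.001, 0.2, 0.004], [0.03, 0.01, 0.5]))
```

Only the first exposure survives both stages in this toy input: the second fails the discovery FDR filter and the third fails replication.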
41. 452 associations in Telomere Length:
Polychlorinated biphenyls associated with longer telomeres?!
[Figure: volcano plot of effect size (x-axis, −0.2 to 0.2; left: shorter telomeres, right: longer telomeres) against −log10(pvalue) (y-axis, 0 to 4), with the FDR<5% threshold marked]
Labeled factors: PCBs, Trunk Fat, Alk. Phos, CRP, Cadmium, Cadmium (urine), cigs per day, retinyl stearate, VO2 Max, pulse rate
R² ~ 1%
adjusted by age, age², race, poverty, education, occupation
median N=3000; N range: 300-7000
IJE, 2016
42. Samples exposed to PCBs associated with difference in genes
implicated in telomere length GWAS?
Expression differences for 24 GWAS implicated genes
Queried the Gene Expression Omnibus for PCBs
Affymetrix human arrays (GPL570)
7 gene expression experiments on humans
52 exposed; 14 unexposed
Dutta SK, Mitra PS, Ghosh S, Zang S, Sonneborn D, Hertz-Picciotto I, Trnovec T, Palkovicova L, Sovcikova E, Ghimbovschi S, Hoffman EP.
Differential gene expression and a functional analysis of PCB-exposed children: Understanding disease and disorder development.
Environment International 40 (2012) 143–154. Received 20 December 2010; accepted 10 July 2011.
Affiliations: Molecular Genetics Laboratory, Howard University, Washington, DC, USA; Department of Public Health Sciences, University of California Davis, Davis, CA, USA; Slovak Medical University, Bratislava, Slovak Republic; Center for Genetic Medicine, Children's National Medical Center, Washington, DC, USA
Abstract (excerpt): The goal of the present study is to understand the probable molecular mechanism of toxicities and the associated pathways related to observed pathophysiology in high PCB-exposed populations. We have performed a microarray-based differential gene expression analysis of children (mean age 46.1 months) of …
IJE, 2016
43. Suggestive, but need more N!
[Figure: volcano plot; x-axis log(difference), −0.50 to 0.75; y-axis −log10(pvalue), 0 to 2]
1555203_s_at (SLC44A4)
1555203_s_at (MYNN)
224206_x_at (MYNN)
Could PCBs influence expression of genes
implicated in telomere length GWAS?
myoneurin
bladder, leukemia, colorectal cancer GWASs
IJE, 2016
44. Studying the Elusive Environment in Large Scale
It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment.[1] Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime.[1] Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting.
To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3).[2,3] The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al[4] screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease.
There are notable hurdles in analyzing “big” environmental data. These same problems affect epidemiology of 1-risk-factor-at-a-time, but in EWAS their prevalence becomes more clearly manifest at large scale. When studying …
… the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non–time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated).
However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the many correlations), the intervention is not really simple. In essence what is tested are multiple perturbations of factors correlated with the one targeted for intervention …
VIEWPOINT
Chirag J. Patel, PhD, Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts.
John P. A. Ioannidis, MD, DSc, Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California, Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California, and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California.
Opinion
JAMA, 2014
JECH, 2014
Pac Symp Biocomput, 2015
How can we study the elusive environment in larger scale for
biomedical discovery?
High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through metabolomics, proteomics, and biosensors.[7] Eventually, patterns of …
… US federally funded gene expression experiment data be deposited in public repositories such as the Gene Expression Omnibus … repository has been instrumental in development of technolo… measurement of gene expression, data standardization, and … of data for discovery. Just as with the Gene Expression Omnibus …
Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004
A Serum cotinine: 37 total correlations
B Serum total mercury: 42 total correlations
C Serum cadmium: 68 total correlations
D Serum trans-β-carotene: 68 total correlations
Legend: negative correlation, positive correlation; node categories: infectious agents, pollutants, nutrients and vitamins, demographic attributes
Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (stronge… The size of each node is proportional to the number of edges for a node, and … thickness of each edge indicates the magnitude of the correlation.
Opinion Viewpoint
•bioinformatics to connect exposome with phenome
•new ‘omics technologies to measure the exposome
•dense correlations
•reverse causality
•confounding
•(longitudinal) publicly available data
45. Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Red: positive ρ
Blue: negative ρ
thickness: |ρ|
for each pair of E:
Spearman ρ
(575 factors: 81,937 correlations)
permuted data to produce
“null ρ”
sought replication in > 1
cohort
Pac Symp Biocomput. 2015
JECH. 2015
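The per-pair computation described here (Spearman ρ for each pair of exposures, with permuted data giving a "null ρ") can be sketched with the standard library alone. This is an illustrative sketch, not the authors' pipeline: the function names, the number of permutations, and the toy data are assumptions, and the inputs are assumed to be non-constant so the correlation is defined.

```python
import random

def ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = average
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def null_rhos(x, y, n_perm=1000, seed=0):
    """Spearman rho after shuffling y, which breaks any real association."""
    rng = random.Random(seed)
    shuffled = list(y)
    result = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        result.append(spearman(x, shuffled))
    return result

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [2.1, 1.9, 3.5, 3.9, 6.2, 5.8]  # hypothetical exposure levels
    observed = spearman(x, y)
    null = null_rhos(x, y)
    p_perm = sum(abs(r) >= abs(observed) for r in null) / len(null)
    print(observed, p_perm)
```

Comparing the observed ρ with the permuted values gives an empirical p-value per pair; across all pairs these p-values still need a multiplicity correction.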
46. Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Red: positive ρ
Blue: negative ρ
thickness: |ρ|
for each pair of E:
Spearman ρ
(575 factors: 81,937 correlations)
permuted data to produce
“null ρ”
sought replication in > 1
cohort
Effective number of
variables:
500 (10% decrease)
Pac Symp Biocomput. 2015
JECH. 2015
47. Telomere Length All-cause mortality
http://bit.ly/globebrowse
Interdependencies of the exposome:
Telomeres vs. all-cause mortality
48. Testing all associations systematically:
Consideration of multiplicity of hypotheses and correlational web!
Explicit in number of hypotheses
tested
False discovery rate;
family-wise error rate;
Report database size!
Does my correlation matter?
How does my new correlation
compare to the family of correlations?
0.17 (e.g., carotene and diabetes)
is average ρ much less than 0.17? greater?
ρ
JAMA 2014
JECH 2015
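Two of the points on this slide can be made concrete: how strict a family-wise threshold becomes at the reported database size, and where a single correlation such as 0.17 sits within the family of correlations. This is a rough sketch; Bonferroni is used as one example of family-wise error control, the function names are illustrative, and the small family of correlations in the usage example is hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that controls the family-wise error rate."""
    return alpha / n_tests

def fraction_at_least(family_rhos, rho):
    """Share of correlations in the family with |rho| at least as large."""
    target = abs(rho)
    return sum(1 for r in family_rhos if abs(r) >= target) / len(family_rhos)

if __name__ == "__main__":
    # 81,937 is the number of pairwise correlations reported on the earlier slide.
    print(bonferroni_threshold(81937))
    # Hypothetical family of correlations, for illustration only.
    family = [0.05, -0.30, 0.10, 0.20]
    print(fraction_at_least(family, 0.17))
```

If a large share of the family is at least as strong as the new correlation, the new finding is unremarkable relative to the background web of correlations.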
49. Studying the Elusive Environment in Large Scale
It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment.1 Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime.1 Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting.
To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3).2,3 The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al4 screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease.
There are notable hurdles in analyzing “big” environmental data. These same problems affect epidemiology of 1-risk-factor-at-a-time, but in EWAS their prevalence becomes more clearly manifest at large scale. When study-
the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non–time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated).
However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the many correlations), the intervention is not really simple. In essence what is tested are multiple perturbations of factors correlated with the one targeted for interven-
VIEWPOINT
Chirag J. Patel, PhD
Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts.
John P. A. Ioannidis, MD, DSc
Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California, Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California, and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California.
Opinion
JAMA, 2014
JECH, 2014
Proc Symp Biocomp, 2015
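The workflow described in the Viewpoint text above (scan every exposure, adjust for multiplicity, report all probed associations, then validate the strongest in an independent data set) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the cohorts are simulated, the names `simulate`, `scan`, and `ewas` are hypothetical, and the p-value uses a Fisher z normal approximation in place of a survey-weighted regression.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z approximation (assumption of this sketch)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def scan(exposures, phenotype):
    """Test every exposure against the phenotype; nothing is filtered out."""
    n = len(phenotype)
    results = {}
    for name, values in exposures.items():
        r = pearson(values, phenotype)
        results[name] = (r, approx_p_value(r, n))
    return results


def ewas(discovery, replication, alpha=0.05):
    """Bonferroni-adjusted discovery, then same-sign validation in independent data."""
    disc = scan(*discovery)
    rep = scan(*replication)
    m = len(disc)  # number of hypotheses tested, reported alongside the results
    validated = []
    for name, (r, p) in disc.items():
        if p <= alpha / m:
            r_rep, p_rep = rep[name]
            if p_rep <= alpha and r * r_rep > 0:
                validated.append(name)
    return disc, validated


def simulate(seed, n=200, n_noise=20):
    """Hypothetical cohort: one true signal plus unrelated exposures."""
    rng = random.Random(seed)
    phenotype = [rng.gauss(0, 1) for _ in range(n)]
    exposures = {"signal": [p + rng.gauss(0, 0.5) for p in phenotype]}
    for j in range(n_noise):
        exposures[f"noise_{j}"] = [rng.gauss(0, 1) for _ in range(n)]
    return exposures, phenotype


all_results, validated = ewas(simulate(seed=1), simulate(seed=2))
print(len(all_results), "associations probed and reported")
print("validated in the independent data set:", validated)
```

Returning the full `all_results` table, not only the validated hits, mirrors the point about reporting all probed associations.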
How can we study the elusive environment in larger scale for
biomedical discovery?
High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through metabolomics, proteomics, and biosensors.7 Eventually, patterns of
US federally funded gene expression experiment data be d
ited in public repositories such as the Gene Expression Omnibu
repository has been instrumental in development of technolo
measurement of gene expression, data standardization, and
of data for discovery. Just as with the Gene Expression Omnib
Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004
A Serum cotinine: 37 total correlations
B Serum total mercury: 42 total correlations
C Serum cadmium: 68 total correlations
D Serum trans-β-carotene: 68 total correlations
Edges: negative correlation, positive correlation
Node categories: infectious agents, pollutants, nutrients and vitamins, demographic attributes
Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (stronge
The size of each node is proportional to the number of edges for a node, and thickness of each edge indicates the magnitude of the correlation.
•bioinformatics to connect exposome with phenome
•new ‘omics technologies to measure the exposome
•dense correlations
•reverse causality
•confounding
•(longitudinal) publicly available data
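The "dense correlations" point, and the globe figure's rule of drawing an edge when the absolute correlation exceeds 0.2, can be illustrated with a small sketch. The four exposure vectors below are made-up toy values, and `globe_edges` is a hypothetical helper name; real NHANES analyses would also need survey weights and missing-data handling.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def globe_edges(exposures, node, threshold=0.2):
    """Edges from `node` to every other exposure with |rho| above the threshold."""
    edges = []
    for other, values in exposures.items():
        if other == node:
            continue
        rho = pearson(exposures[node], values)
        if abs(rho) > threshold:
            edges.append((other, rho))
    return edges


# Toy measurements for four hypothetical exposures in six participants.
exposures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],   # rises with "a"
    "c": [6, 5, 4, 3, 2, 1],     # falls as "a" rises
    "d": [1, -1, -1, 1, 1, -1],  # nearly unrelated to the others
}

for name in exposures:
    edges = globe_edges(exposures, name)
    # The edge count corresponds to node size in the figure; the sign to edge type.
    print(name, "has", len(edges), "edges:", [(o, round(r, 2)) for o, r in edges])
```

Counting edges per node is one simple way to see which exposures sit in the densest part of the correlational web before interpreting any single association.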
52. You can use these data!
http://chiragjpgroup.org/exposome-analytics-course
Contact me for project ideas!
@chiragjp
chirag_patel@hms.harvard.edu
53. Connecting Environmental Exposure with Disease:
Missing the “System” of Exposures?
E+ E-
diseased
non-diseased
?
Exposed to many things, but do not assess the multiplicity.
Fragmented literature of associations.
Challenge to discover E associated with disease.
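For reference, the single-exposure analysis that the 2×2 layout on this slide stands for can be written as an odds ratio with a Woolf (log-scale) 95% confidence interval. The counts below are hypothetical, chosen only to show the arithmetic.

```python
import math


def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio and 95% Woolf confidence interval from a 2x2 table."""
    or_ = (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se = math.sqrt(
        1 / exposed_cases
        + 1 / exposed_controls
        + 1 / unexposed_cases
        + 1 / unexposed_controls
    )
    lower = math.exp(math.log(or_) - 1.96 * se)
    upper = math.exp(math.log(or_) + 1.96 * se)
    return or_, lower, upper


# Hypothetical counts: E+ diseased, E+ non-diseased, E- diseased, E- non-diseased.
or_, lower, upper = odds_ratio(30, 70, 10, 90)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Repeating this one exposure at a time, without accounting for how many tables were examined, is what produces the fragmented literature the slide describes.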
54. Example of fragmentation:
Is everything we eat associated with cancer?
Schoenfeld and Ioannidis, AJCN 2012
50 random ingredients from
Boston Cooking School
Cookbook
Any associated with cancer?
FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studies
outliers are not shown (effect estimates >10).
Of 50, 40 studied in cancer risk
Weak statistical evidence:
non-replicated
inconsistent effects
non-standardized
60. Possible to survey P (fasting glucose) of diabetics
consented through ResearchKit?
Adam Brown
Stanley Shaw (MGH)
Dennis Ausiello (MGH)
http://bit.ly/glucosuccess
61. Does the high physical activity population have lower fasting
glucose?: YES!
mashing up 24K step counts with glucose (N=600)
62. Is step count on previous day associated with fasting
glucose the next day?: YES!
mashing up 24K step counts with glucose (N=600)
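The previous-day question on this slide amounts to pairing each day's step count with the next day's fasting glucose before testing the association. A minimal sketch for one participant, with invented values and a hypothetical helper `lagged_pairs`; it ignores repeated measures across participants, which a real analysis of the study data would need to model.

```python
import math
from datetime import date, timedelta


def lagged_pairs(steps_by_day, glucose_by_day, lag_days=1):
    """Pair steps on day d with glucose on day d + lag_days, skipping missing days."""
    pairs = []
    for day, steps in sorted(steps_by_day.items()):
        target = day + timedelta(days=lag_days)
        if target in glucose_by_day:
            pairs.append((steps, glucose_by_day[target]))
    return pairs


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented records for a single participant (steps per day, fasting glucose in mg/dL).
start = date(2017, 1, 1)
steps_by_day = {
    start + timedelta(days=i): s
    for i, s in enumerate([2000, 4000, 8000, 12000, 6000])
}
glucose_by_day = {
    start + timedelta(days=i): g
    for i, g in enumerate([150, 140, 130, 118, 105])
}

pairs = lagged_pairs(steps_by_day, glucose_by_day)
rho = pearson([s for s, _ in pairs], [g for _, g in pairs])
print(len(pairs), "lagged pairs; correlation =", round(rho, 2))
```

The last step count has no next-day glucose reading, so it is dropped; that kind of silent attrition is worth reporting alongside the correlation.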
68. Harvard DBMI
Isaac Kohane
Susanne Churchill
Stan Shaw
Nathan Palmer
Jenn Grandfield
Sunny Alvear
Michal Preminger
Chirag J Patel
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
NIH Common Fund
Big Data to Knowledge
Acknowledgements
RagGroup
Chirag Lakhani
Adam Brown
Danielle Rasooly
Nam Pho
Jake Chung
Alan LeGoallec
Arjun Manrai
Sivateja Tangirala
Shreyas Bhave
Rolando Acosta
Dr. Edwin Traverso Aviles