1. Statistical Foundations for the design, analysis and
interpretation of Molecular Biomarker studies
Athula Herath, PhD, MBCS, CITP, CEng
https://uk.linkedin.com/in/athulaherath
March 2006 (Printed October 2015)
2. Shortcomings of the Statistical Analysis of molecular profiling data
• Analysts are often tempted to look for the “smoking gun”: the one or few
most important genes/proteins/metabolites.
• The underlying biology is often lost.
• Normal and faulty biological processes are often manifested through
the combined effects of a large number of biological entities (i.e. they
are multivariate) rather than one or a few
– For example:
• A large number of malfunctioning genes set the stage for cardiovascular disease
• Almost 300 genes are involved in asthma
• 140 faulty genes contribute to the problem of failing memory (Alzheimer's and others)
• It is therefore prudent to use biology to explore the effects
instead, and to form, test, validate and reform hypotheses
• An additional advantage is that we are building the biological
story simultaneously with the analysis
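The multivariate point above can be illustrated with a small calculation. This is a sketch under strong simplifying assumptions that are not from the slides: every gene carries the same small standardized shift between two groups, genes are independent with unit variance, and a simple sum score is used to combine them. The function names and the example numbers are hypothetical.

```python
from math import sqrt


def single_gene_z(effect_size, n_per_group):
    """Expected two-sample z statistic for ONE gene.

    effect_size is the standardized mean difference between the groups;
    variance is assumed known and equal to 1 (an illustrative assumption).
    """
    return effect_size * sqrt(n_per_group / 2.0)


def combined_score_z(effect_size, n_per_group, n_genes):
    """Expected z statistic for the SUM of n_genes independent genes.

    The mean difference of the sum grows with n_genes, while its standard
    deviation grows only with sqrt(n_genes), so the signal accumulates.
    """
    return effect_size * sqrt(n_genes) * sqrt(n_per_group / 2.0)


# Hypothetical numbers: 100 genes, each shifted by 0.2 standard deviations,
# with 20 samples per group.
per_gene = single_gene_z(0.2, 20)           # about 0.63, far from significant
combined = combined_score_z(0.2, 20, 100)   # about 6.32, a clear signal
print(f"single gene z = {per_gene:.2f}, combined z = {combined:.2f}")
```

Under these assumptions no individual gene would stand out as a "smoking gun", yet the set as a whole separates the groups clearly. Real genes are correlated, so the gain in practice would be smaller than this idealised case suggests.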
3. Statistical Foundation for Molecular Biomarker studies
[Diagram labels: Biological processes; biomarkers; Clinical Samples;
Molecular Profiling (transcriptomics, proteomics, metabolomics etc.);
Traditional clinical chemistry analysis]
Epidemiology
• Study the association between the environmental factors and the
disease of individuals
• Create a list of factors (hypothesis)
• Assess the effects of these systematically
• Draw conclusions/refine
4. Biologically motivated Data Analysis
• Hypothesis Free
– Independently discover and make
inferences from data.
• Hypothesis Driven
– Establish a biological relationship
between entities (Form Hypothesis)
– Test
– Conclude/Refine
Pragmatic and opportunistic approach.
5. Biology driven analysis, how?
1. Be pragmatic and subject-specific (e.g. breast cancer, Alzheimer's etc., or even
narrower areas within wider subject areas) when establishing the active knowledge
repository described in step 5.
2. Filter and extract (using keywords, synonyms etc.) the appropriate molecular
entities and pathways from public and commercial curated pathway databases (e.g.
Entrez Gene, KEGG, GenMAPP, GO, UniProt, … ).
3. Collate all genetic polymorphism data (OMIM, dbSNP, HapMap) on the relevant
molecular entities.
4. Use 1, 2 and 3 above as seeds and establish new relationships from the literature
using literature-mining tools.
5. Collate the results from stages 1-4 into a repository (a de novo, focused, targeted,
disease-oriented active knowledge repository).
6. Suitably parameterize the facts to form multivariate models (e.g. in a Bayesian
framework), building an associated statistical model repository.
7. Construct a suitable inference engine to generate plausible hypotheses.
8. Use the hypotheses as an aid to design biomarker studies.
9. Integrate the experimental data from relevant molecular profiling experiments
(transcriptomics, proteomics, metabolomics etc.).
10. Drive the statistical analysis (testing and verification of hypotheses).
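As a rough illustration of steps 6 and 10, the sketch below shows two simple calculations that could sit behind such a workflow. Neither is taken from the slides: the hypergeometric enrichment test and the odds-form Bayes update are swapped in as common, minimal stand-ins, and all function names, parameters and numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_measured, n_in_pathway, n_changed, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_measured   : entities measured in the profiling experiment
    n_in_pathway : measured entities belonging to the pathway drawn from
                   the knowledge repository (the hypothesis)
    n_changed    : measured entities flagged as changed in the experiment
    n_overlap    : changed entities that are also in the pathway
    """
    total = comb(n_measured, n_changed)
    upper = min(n_in_pathway, n_changed)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_measured - n_in_pathway, n_changed - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def posterior_probability(prior, likelihood_ratio):
    """Update a prior belief (0 < prior < 1) using Bayes' rule in odds form.

    prior            : probability assigned from the repository that the
                       hypothesis holds (an assumed parameterization)
    likelihood_ratio : P(data | hypothesis) / P(data | no hypothesis)
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical example: 20 entities measured, 5 in the pathway,
# 4 changed, 3 of which are pathway members.
p = enrichment_p_value(20, 5, 4, 3)
print(f"enrichment p-value = {p:.4f}")

# Hypothetical example: a prior of 0.2 and data four times more likely
# under the hypothesis than without it.
post = posterior_probability(0.2, 4.0)
print(f"posterior probability = {post:.2f}")
```

The first function tests one pre-specified pathway rather than scanning every measured entity, which is the sense in which the repository drives the analysis. The second shows how a parameterized fact from the repository might be revised once experimental data are integrated. A full Bayesian multivariate model would be considerably richer than this single-number update.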