  1. 1. Statistical Foundations for the design/analysis and interpretation of Molecular Biomarker studies Athula Herath, PhD, MBCS, CITP, CEng March 2006 (Printed October 2015)
  2. 2. Shortcomings of the Statistical Analysis of molecular profiling data
     • Analysts are often tempted to find the “smoking gun”: the one or few most important genes/proteins/metabolites.
     • The underlying biology is often lost.
     • Normal and faulty biological processes often manifest through the effects of a large number of biological entities (i.e. they are multivariate) rather than one or a few. For example:
       – A large number of malfunctioning genes set the stage for cardiovascular disease
       – Almost 300 genes are involved in asthma
       – 140 faulty genes contribute to the problem of failing memory (Alzheimer's and other conditions)
     • It is therefore prudent to use biology to explore the effects instead, and to form, test, validate and reform hypotheses.
     • An additional advantage is that the biological story is built simultaneously with the analysis.
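As one illustration of analysing a biological process as a whole rather than hunting for a single marker, the sketch below computes a pathway over-representation p-value with a hypergeometric test. This technique is not named in the slides; it is swapped in here as a common example of a set-level analysis. The function name `overrepresentation_p` and all the numbers are hypothetical.

```python
from math import comb


def overrepresentation_p(total, in_set, flagged, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    total   -- number of genes measured
    in_set  -- number of measured genes that belong to the pathway
    flagged -- number of genes flagged as altered in the study
    overlap -- number of flagged genes that belong to the pathway
    """
    ways = comb(total, flagged)
    upper = min(in_set, flagged)
    tail = sum(
        comb(in_set, i) * comb(total - in_set, flagged - i)
        for i in range(overlap, upper + 1)
    )
    return tail / ways


# Made-up numbers, for illustration only: 1000 genes measured, 20 of them
# in the pathway, 50 flagged as altered, 5 of the flagged ones in the pathway.
p = overrepresentation_p(total=1000, in_set=20, flagged=50, overlap=5)
print(f"pathway-level p-value: {p:.4f}")
```

A small p-value here would suggest that the pathway as a whole is affected, even if no single gene in it stands out on its own.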
  3. 3. Statistical Foundation for Molecular Biomarker studies
     Study the association between environmental factors and disease in individuals:
     • Create a list of factors (hypotheses)
     • Assess the effects of these systematically
     • Draw conclusions/refine
     [Diagram labels: biomarkers; Biological processes; Epidemiology; Clinical Samples; Molecular Profiling (transcriptomics, proteomics, metabolomics etc.); Traditional clinical chemistry analysis]
  4. 4. Biologically motivated Data Analysis
     • Hypothesis Free – Independently discover and make inferences from data.
     • Hypothesis Driven – Establish a biological relationship between entities (Form Hypothesis), then Test, Conclude and Refine.
     Pragmatic and opportunistic approach.
  5. 5. Biology driven analysis, how?
     1. Be pragmatic and subject specific (e.g. breast cancer, Alzheimer's etc., or even narrower areas within wider subject areas) when establishing the active knowledge repositories described in step 5.
     2. Filter and extract (using keywords, synonyms etc.) the appropriate molecular entities and pathways from public and commercial curated pathway databases (e.g. Entrez Gene, KEGG, GenMAPP, GO, UniProt, …).
     3. Collate all genetic polymorphism data (OMIM, dbSNP, HapMap) on the relevant molecular entities.
     4. Use the results of steps 1, 2 and 3 as seeds and establish new relationships from the literature using literature mining tools.
     5. Collate the results from steps 1-4 into a repository (a de novo, focussed, targeted, disease-oriented active knowledge repository).
     6. Suitably parameterize the facts to form multivariate models (e.g. in a Bayesian framework), giving an associated statistical model repository.
     7. Construct a suitable inference engine to generate plausible hypotheses.
     8. Use the hypotheses as an aid to design biomarker studies.
     9. Integrate the experimental data from relevant molecular profiling experiments (transcriptomics, proteomics, metabolomics etc.).
     10. Drive the statistical analysis (testing and verification of hypotheses).
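A minimal sketch of how steps 6, 7 and 10 might fit together, assuming the simplest possible Bayesian parameterization: each candidate hypothesis carries a prior probability derived from the knowledge repository, and experimental data enters as a likelihood ratio. The slides do not specify a model, so the gene names, priors, likelihood ratios and the functions `posterior` and `rank_hypotheses` are all hypothetical.

```python
def posterior(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


def rank_hypotheses(priors, likelihood_ratios):
    """Return (hypothesis, posterior) pairs, most plausible first."""
    scored = {h: posterior(p, likelihood_ratios[h]) for h, p in priors.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical priors standing in for evidence collated in the repository
# (steps 1-5): the probability that each gene is associated with the disease.
priors = {"GENE_A": 0.30, "GENE_B": 0.05, "GENE_C": 0.05}

# Hypothetical likelihood ratios standing in for profiling data (step 9):
# how much more probable the observed data is if the association is real.
likelihood_ratios = {"GENE_A": 2.0, "GENE_B": 10.0, "GENE_C": 0.5}

ranking = rank_hypotheses(priors, likelihood_ratios)
for name, prob in ranking:
    print(f"{name}: posterior probability {prob:.3f}")
```

In this toy setting a hypothesis with weak prior support but strong experimental evidence (GENE_B) moves up the ranking, while one with neither (GENE_C) drops. A real model repository would use richer multivariate models than independent per-gene odds.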