Genome Wide Methodologies and
     Future Perspectives
               Brian Krueger, PhD
                 Duke University
      Center for Human Genome Variation
History of Genetic Linkage

 • Mendel’s Laws
    – Law of segregation
         •   Each parent randomly passes one of two alleles to offspring
    – Law of Independent Assortment
         •   Separate genes for separate traits are passed independently to
             offspring
         •   Traits should appear in the offspring of a dihybrid cross of two
             heterozygous parents in a 9:3:3:1 ratio
    – Laws hold true for genes on different chromosomes or
      genes located far apart on the same chromosome
 • Linkage
    – Bateson and Punnett quickly found traits that didn’t
      assort independently
    – Thomas Hunt Morgan and his student Alfred
      Sturtevant found that recombination frequency is a
      good predictor of distance between genes
         •   Genes that are inherited together must be closer to one another
             – linked
         •   Generated the first linkage maps
    – Serves as an important basis for understanding
      genetic association studies
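The 9:3:3:1 ratio above can be recovered by brute force. This is a minimal sketch that enumerates the 16 equally likely gamete pairings of an AaBb x AaBb dihybrid cross (a Punnett square) and tallies the four dominant/recessive phenotype classes; the allele names are illustrative.

```python
from collections import Counter
from itertools import product

# All gametes an AaBb parent can produce: AB, Ab, aB, ab
gametes = ["".join(g) for g in product("Aa", "Bb")]

def phenotype(g1, g2):
    # Uppercase alleles are dominant; one copy is enough to show the trait
    return ("A" in g1 + g2, "B" in g1 + g2)

# Enumerate the 4 x 4 Punnett square and count phenotype classes
counts = Counter(phenotype(g1, g2) for g1, g2 in product(gametes, gametes))
print(sorted(counts.values(), reverse=True))  # [9, 3, 3, 1]
```

Because independent assortment makes every gamete pairing equally likely, exact enumeration reproduces Mendel's ratio without any random sampling.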
Linkage Studies

 • Model Organisms
     – Fruit Flies, plants, etc
     – Extremely important for understanding human
       genetics
     – Fruit flies can produce new generations of 400+
       offspring approximately every week!
          •   Can very quickly understand the genetics of trait heritability

 • Familial Linkage Studies
     – Require multiple generations
     – Take decades to develop
     – Complicated by family participation
 • Association studies
     – Subtly different from linkage studies
     – Try to apply knowledge of familial linkage to entire
       populations
Genome Wide Association Studies

 • GWA studies
     – Aim to find genetic variants that are associated with
       traits
     – Typically used to elucidate complex disease traits
     – Focus on SNPs, Indels, CNVs
     – Most often Case/Control Studies
 • SNP (Single Nucleotide Polymorphism)
     – Change in a single nucleotide position
 • Indel (Insertion/Deletion)
     – Describes the insertion or deletion of nucleotides
 • CNV (Copy Number Variation)
     – Large deletions or duplications of genetic material
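A minimal sketch of the statistics behind a single SNP in a case/control GWA study: build a 2x2 table of allele counts in cases vs. controls and compute a Pearson chi-square statistic by hand. The allele counts are invented for illustration, and 3.84 is the familiar 5% critical value for one degree of freedom (before any multiple-testing correction).

```python
def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    obs = [[a, b], [c, d]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # counts under independence
            chi2 += (obs[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical SNP: risk allele on 300/1000 case chromosomes
# vs. 200/1000 control chromosomes
table = [[300, 700], [200, 800]]
chi2 = chi_square_2x2(table)
print(round(chi2, 2), "associated" if chi2 > 3.84 else "not associated")
```

A real GWA study repeats this (or a logistic-regression equivalent) at every assayed SNP and applies a genome-wide significance threshold, but the per-SNP test is this simple.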
GWA Study History

 • Human Genome Project (1990-2000)
    – Decade long international project to determine the
      complete human genome sequence
    – Provided the reference genome for future research on
      genome variation
 • Human HapMap (2002-2009)
    – Sequencing whole genomes is expensive
    – Needed a shortcut to understand how variation
      contributes to disease
    – Mapped millions of common known SNPs in 269
      individuals
    – Theory that common SNPs are inherited and could be
      predictive of associated disease
    – Determine how SNPs from case/control studies
      associate with human disease
Defining Association

 • Variants are not always causal!
     – SNPs sometimes only serve as markers
     – Can play absolutely no role in the disease and even be
       located on different chromosomes from the gene
       actually responsible for the phenotype
 • Population stratification
     – Variants differ by population
     – Variants important markers of disease in one
       population or ethnicity may not be effective markers
       in another
     – For GWA studies to be effective predictors in multiple
       populations, large datasets for each ethnicity must be
       obtained
GWAS SNP Genotyping

 • Bead array genotyping
    – Uses a chip containing beads with
      covalently attached baits
    – Baits hybridized to fragmented DNA
    – Baits SPECIFIC for the DNA just upstream
      of a SNP
    – Base extension with fluorescently labeled
      bases allows interrogation of the SNP
      (each base has a different color!)
    – A single bead chip can assay millions of SNPs
    – Colorimetric output plotted
         •   Blue indicates homozygous for one version of the SNP - CC
         •   Purple is heterozygous - CA
         •   Red homozygous for the other version of the SNP - AA
 [Figure: genotyping cluster plots for rs1372493 - raw Intensity (B) vs Intensity (A), and Norm R vs Norm Theta with cluster sizes 2317, 834, and 74]
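The cluster plots on this slide can be reduced to a simple rule: compute a normalized angle (theta) between the two color channels and bin it. This sketch uses invented intensities and illustrative thresholds, not a real genotyping pipeline's values.

```python
import math

def call_genotype(intensity_a, intensity_c, low=0.33, high=0.67):
    """Classify a sample by normalized theta between the two channels."""
    theta = (2 / math.pi) * math.atan2(intensity_c, intensity_a)
    if theta < low:
        return "AA"   # signal almost entirely in the A channel
    if theta > high:
        return "CC"   # signal almost entirely in the C channel
    return "CA"       # both channels light up: heterozygote

# Hypothetical (A-channel, C-channel) intensity pairs for three samples
samples = [(9000, 500), (4500, 4200), (300, 8800)]
print([call_genotype(a, c) for a, c in samples])  # ['AA', 'CA', 'CC']
```

Production callers fit clusters per SNP rather than using fixed cutoffs, but the geometry, homozygotes near the axes and heterozygotes near the diagonal, is exactly what the plotted clusters show.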
GWAS SNP Genotyping and Validation

 • Real-time PCR
    – Use specific PCR probes to verify SNPs
    – Good for validating a handful of SNPs at a time
 • Mass Array
    – Use mass spec to find SNPs
    – Detected by looking at fragment weight
      differences
    – Good for detecting or validating a large number
      of SNPs rapidly
 • Sanger sequencing
    – Gold standard validation method
    – Can determine the SNP at its exact position
    – Very robust
GWA Study History

 • To this point in time, the power of most GWA
   studies was lacking
     – GWA not really genome wide
     – Looked at common variants across genome
     – Missed rare variants and not always descriptive of
       disease causation
 • Whole Genome Sequencing (WGS)
     –   Actually assays the entire genome
     –   Discovers all variants
     –   Prohibitively costly before 2008
     –   Current cost of WGS ~$4000
 • Thousand Genomes Project (2008-)
     – Facilitated by plummeting sequencing costs and
       technological advancements
     – Goal to fully sequence the genomes of 1000 healthy
       individuals to provide a true picture of genome wide
       variation
Second Generation Sequencing

 • Developed to increase throughput of
   Sanger sequencing
 • Can sequence many molecules in parallel
     – Does not require homogeneous input
     – Sequenced as clusters
 • Sequencing by synthesis
     – Bases are added, signals scanned, and then
       washed
     – Cycle repeated (30-2000x)
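The add/scan/wash cycle above can be sketched as a toy simulation: every cluster incorporates one labeled base per cycle, the instrument records one signal per cluster per cycle, and the reads grow base by base. The cluster sequences here are made up for the example.

```python
# Hypothetical clusters, each a clonal template of known sequence
clusters = {"cluster_1": "ACGT", "cluster_2": "TTAG", "cluster_3": "GCCA"}
reads = {name: "" for name in clusters}

n_cycles = 4
for cycle in range(n_cycles):
    # one "image" per cycle: the base detected at each cluster position
    signals = {name: seq[cycle] for name, seq in clusters.items()}
    for name, base in signals.items():
        reads[name] += base  # append this cycle's call to the growing read

print(reads)  # each read now matches its cluster's template
```

The read length of a real run is simply the number of cycles performed, which is why the slide's 30-2000x cycle count maps directly onto read length.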
2nd Gen: Sequencing by Synthesis Overview




   Genomic DNA → Fragment DNA → Ligate Adaptors → Generate Clusters (On
   Flowcell or Beads) → Add Bases → Detect Signals → Repeat hundreds of
   times on millions of clusters
Flavors of Sequencing

 • Whole Genome Sequencing
    – Obtain whole blood or tissue sample
    – Create sequencing libraries of all DNA
      fragments
 • Whole Exome Sequencing
    –   Utilizes a selection protocol
     –   Attach complementary RNA strands to beads
    –   Fish out ONLY coding DNA sequences
    –   Create sequencing libraries from enriched DNA
    –   Reduces cost significantly
 • Custom Capture
    – Same protocol as Exome sequencing
    – Only target desired DNA sequences
 • Amplicon Sequencing
    – Use PCR to amplify target DNA
    – Sequence amplified DNA (Amplicon)
NGS Study Designs for Gene Discovery

  Multiplex families

                           Case-control studies



   Trio sequencing of
   sporadic diseases
De novo Mutation Calling/Filtering

  • Variant calling
      – Individual variant calling
      – Multi-sample variant calling
  • Cross-checking public databases
      – Exome Variant Server: 6500 exome-sequenced individuals
  • Visual inspection
  • Sanger sequencing confirmation
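The filtering steps on this slide reduce to set subtraction: keep child variants absent from both parents, then drop anything already catalogued in a public database (standing in for the Exome Variant Server). All variant calls and the database contents below are invented for the example.

```python
# Variants as (chromosome, position, change) tuples - hypothetical calls
child    = {("chr1", 1000, "A>G"), ("chr2", 5000, "C>T"),
            ("chr3", 42, "T>C"), ("chr7", 300, "G>A")}
mother   = {("chr1", 1000, "A>G")}
father   = {("chr7", 300, "G>A")}
database = {("chr2", 5000, "C>T")}  # known common polymorphisms

# Not seen in either parent -> candidate de novo mutations
candidate_de_novo = child - mother - father
# Already in a population database -> likely a missed inherited variant
novel_de_novo = candidate_de_novo - database

print(sorted(novel_de_novo))  # only chr3:42 T>C survives both filters
```

The survivors would then go on to visual inspection and Sanger confirmation, as the slide describes.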
Detecting Copy Number Variants

  ERDS (Estimation by Read Depth with SNVs)
    The average read depth (RD) of every 2-kb window is calculated, followed
    by GC correction. A paired hidden Markov model is then applied to infer
    the copy number of every window using both RD and heterozygosity
    information.




 [Figure: read depth across 2-kb windows, showing a homozygous deletion, a heterozygous deletion, and a duplication]
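A greatly simplified sketch of the read-depth idea behind ERDS: compare each 2-kb window's mean depth against the diploid expectation and label it. This omits the GC correction and the paired HMM the real method uses, and the depth values are made up.

```python
def classify_windows(depths, expected):
    """Label each window's mean read depth relative to copy number 2."""
    calls = []
    for d in depths:
        ratio = d / expected
        if ratio < 0.25:
            calls.append("homozygous deletion")    # copy number ~0
        elif ratio < 0.75:
            calls.append("heterozygous deletion")  # copy number ~1
        elif ratio > 1.5:
            calls.append("duplication")            # copy number >= 3
        else:
            calls.append("normal")                 # copy number ~2
    return calls

# Hypothetical mean depth per consecutive 2-kb window
window_depths = [30, 31, 2, 15, 29, 62, 30]
print(classify_windows(window_depths, expected=30))
```

The HMM in the real method adds smoothing across neighboring windows so isolated noisy windows don't produce spurious calls; this per-window threshold version shows only the core signal.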
Illumina

 • Uses a flow cell
 • Cluster generated on slide via bridge
   amplification
 • Sequencing by synthesis
     –   Performed by flowing labeled bases over flow
         cell
     –   4 pictures taken (one for each base)
     –   Cluster color determined at each cycle allows
         interrogation of sequence
 • Advantages
     –   Low cost per base
     –   Very high throughput
 • Limitations
     –   High cost per experiment
     –   Short read length (30-150bp)
     –   Acquired a company that uses new tech to
         reach read lengths of 2-10Kb
                                                         Schadt et al 2010 HMG
Ion Torrent

 • Emulsion PCR is used to generate clusters
   on a bead
 • Sequencing by synthesis
      – Semiconductor sequencing, related to
        pyrosequencing
      – Instead of detecting released pyrophosphate with
        light, the system senses the H+ released as each
        base is flowed over the beads
 • Advantages
     – Short run time
     – Does not require modified bases
     – Longer read length (200bp)
 • Limitations
     – Low data output
     – High homopolymer error rate
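The homopolymer error rate follows from the flow chemistry: when one nucleotide is flowed, an entire run of that base incorporates at once, so the signal amplitude scales with run length and long runs become hard to distinguish (e.g. 7 vs. 8). A toy sketch with an invented template and the standard four-base flow order:

```python
import itertools

def flow_signals(template, flow_order="TACG"):
    """One (base, amplitude) per flow: amplitude = homopolymer run length."""
    # collapse the template into homopolymer runs, e.g. TTTTTTTACG -> T7 A1 C1 G1
    runs = [(base, len(list(g))) for base, g in itertools.groupby(template)]
    signals, i = [], 0
    for flow in itertools.cycle(flow_order):
        if i == len(runs):
            break
        if runs[i][0] == flow:
            signals.append((flow, runs[i][1]))  # whole run reacts in one flow
            i += 1
        else:
            signals.append((flow, 0))           # no incorporation this flow
    return signals

print(flow_signals("TTTTTTTACG"))  # [('T', 7), ('A', 1), ('C', 1), ('G', 1)]
```

A single-base signal is easy to call, but telling a 7-unit signal from an 8-unit one requires the instrument to resolve a ~14% amplitude difference, which is exactly where homopolymer errors arise.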
Third Generation Sequencing

 • Defined as single molecule sequencing
 • Less complex sample prep
 • Much longer read length
       – The short read length of SGS is a huge
         disadvantage for de novo sequencing applications
 • Two categories
      – Sequencing by synthesis
      – Direct sequencing
          • Passing molecule through a nanopore
          • Using atomic force microscopy
 •   Bleeding edge technology
      – Many technical hurdles
      – Currently very high error rates
Pacific Biosciences

 • Utilizes single molecule sequencing by
   synthesis
 • Extremely complex system
     – Each well contains a single DNA molecule and
       an immobilized polymerase
     – No reagent washing
     – Employs confocal microscopy to only detect
       fluorescence at the polymerase
 • Advantages
     – Very long read length (1-15kb)
     – Low complexity sample prep
     – Very fast data generation (real time)
 • Disadvantages
     – Prone to sequencing errors (~15% error
       rate)
     – Company on the verge of bankruptcy
Third/Second Generation Sequencing

 • Currently only one viable high throughput
   long read sequencing platform
     – PacBio system has a 15% error rate
     – Need long reads for many applications from de
       novo sequencing to haplotyping
 • Second generation sequencers are high
   throughput and accurate
     – Short reads are hard to assemble and leave
       gaps in repetitive sequences
 • Can use both as a highly accurate and
   extremely powerful tool for de novo
   sequencing applications
     –   Use PacBio assembly as a scaffold
     –   Correct errors by aligning HiSeq reads on top
     –   Effective error rate of 0.1%
     –   Expensive but extremely fast and accurate
         compared to other methods                       Koren et al 2012 Nature Biotechnology
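The hybrid correction idea above can be sketched as a pileup majority vote: align accurate short reads to a long, error-prone read and replace each long-read base with the short-read consensus. Real correctors (e.g. the Koren et al. approach) do the alignment themselves; here the offsets and all sequences are invented for the example.

```python
from collections import Counter

long_read = "ACGTTAGXCATG"            # 'X' marks a long-read error
short_reads = [("ACGTTAGC", 0),       # (sequence, offset on the long read)
               ("TAGCCATG", 4),
               ("GTTAGCCA", 2)]

def correct(long_read, short_reads):
    # one vote counter per long-read position
    columns = [Counter() for _ in long_read]
    for seq, offset in short_reads:
        for k, base in enumerate(seq):
            columns[offset + k][base] += 1
    corrected = []
    for raw, col in zip(long_read, columns):
        # keep the long-read base where no short read covers the position
        corrected.append(col.most_common(1)[0][0] if col else raw)
    return "".join(corrected)

print(correct(long_read, short_reads))  # 'X' replaced by the consensus 'C'
```

Because PacBio errors are largely random while Illumina reads are accurate, even modest short-read coverage outvotes each long-read error, which is how the combination reaches the ~0.1% effective error rate quoted above.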
Future: Nanopore Sequencing

 • Leading candidate is Oxford
   Nanopore
 • Concept
     – Detect flow of electrons through the
       pore
     – Each base causes a detectable change in
       the current
     – Results in direct sequencing
     – Theoretically could be used to sequence
       RNA and protein too
 • Advantages
     – Long read length
     – Plug and play
     – Easily scalable
 • Disadvantages
     – No hard data yet                          Credit: John MacNeill/TechnologyReview
     – No specific release date
Future: Direct sequencing

 • Concept stage techniques
     – Significant technical hurdles to overcome
     – Mostly proof of concept experiments
 • IBM DNA Transistor                                                     Credit: IBM

     – Bases read as single stranded DNA passes
       through the transistor
     – Gold bands represent metal, gray bands are
       the dielectric
 • Atomic force microscopy sequencing
     – Use AFM tip to detect each base of single
       stranded DNA




                                                    Credit: Lee et al US PAT 20040124084
Sequencing Applications

 • Old techniques which used to take days or
   years to perform can now be completed
   in hours
 • Next generation sequencing has opened a
   new door for addressing very complicated
   genetic questions
     – Has huge potential to revolutionize human
       healthcare
     – Survey complex tumor types
     – Research into macro and micro community
       genomics
     – Reveal evolutionary history
De novo Sequencing

• Human genome took 10 years to complete and
  cost $3 billion
    – Done by laboriously cloning overlapping segments of the
      human genome into BAC libraries and Sanger
      sequencing each one
    – Genome assembled using computers to line up
      overlapping sequences
• Current estimate is around $4000
    – Can be completed in a week
    – Companies like Complete Genomics say they have already
      sequenced thousands of human genomes
• Future
    – Long read sequencers will make agricultural sequencing
      more viable
    – Whole genome sequencing for human diagnostics will
      become routine
    – Increasing the catalog of organismal genomes will improve
      our understanding of evolution and development
Genome Mutation Analysis

 • Previously done by completing
   complicated and time consuming familial
   linkage studies and targeted Sanger
   sequencing
 • Next generation sequencing can look at
   every gene at once
     – Can produce a genetic map of the complete
       genome
     – Used to detect genetic polymorphisms
     – See every possible mutation
 • Future
     – Whole genome sequence analysis
     – Targeted genome sequencing analysis using
       predetermined sequence selection arrays (ex:
       Exome Enrichment)
Pharmacogenetics

 • Very hot topic in the biotech and
   insurance industries
 • Use genetic typing to guess how a person
   might respond to different drug
   treatments
 • Currently relies on microarrays
 • NGS could provide significantly more
   information at more loci
     – Microarrays only look at a handful of
       polymorphisms
     – Current NGS approaches port the microarray
       technique to enrich pools for sequencing
 • Future
     – As the catalog of human genomes increases, it
       will be easier to calculate responses to
       treatment before drugs are administered
                                                       Gauthier et al 2007 Cancer Cell
Epigenetics

 • Defined as heritable genetic information
   that is not coded in the DNA bases
     – DNA methylation
     – Histone modifications
 • Previous mechanisms for detecting these
   Chromatin or DNA modifications relied on
   targeted probing
     – ChIP-PCR
     – Bisulfite sequencing
     – Footprinting assays
 • Next generation sequencing changed
   everything
     – Whole genome methylation mapping (MAP-IT)
     – Whole genome histone modification and
       protein binding mapping (ChIP-Seq -
       acetylation, methylation, etc)
 • ENCODE project
ENCyclopedia Of DNA Elements (ENCODE)

 • International project
     – Follow up to the human genome project
 • Only ~2% of the human genome codes for
   protein
     – Creating and maintaining DNA is biochemically
       expensive
     – What’s the other 98% of the genome doing?
 • ENCODE goals
     – Determine the functional elements of the
       human genome
     – Protein Coding
     – Non-Coding RNA
     – mRNA Expression
     – Regulatory protein binding sites
     – Histone modifications
 • Preliminary estimates show biochemical
   activity across 80% of human DNA!
Transcriptome/Expression Analysis

 • Gene expression analysis is important for
   disease discovery and cancer diagnosis
 • Expression analysis first relied on Northern
   blotting followed by DNA microarrays
     – Both cases require a probe
     – Need to “know” what you are looking for
     – Low resolution screening
 • Next generation approaches screen the
   entire transcriptome (RNA-Seq)
     – Single base resolution of expression
     – Can see level of expression and also visualize
       mutations in expressed sequences
 • Future
     – Important for diagnosing/treating cancer and
       heritable diseases
Phenotypic Correlation

 • NGS data generates huge datasets with
   85-99.9% base accuracy
     – Must determine which signals are real, and
       which are noise/errors
     – Most promising hits are validated by other
       assays (Sanger, qRT, Mass Spec)
     – How do we determine which hits to validate?
 • Currently have very small datasets, even
   in pharmacogenetics that have limited
   utility
 • Validated hits can be distractions                See NYTimes Series on whole genome
     – Tumor diversity presents multiple escape      Sequencing: http://nyti.ms/No4fgd
       routes during targeted treatment
 • Future
     – Require large validated datasets that are
       ethnically and geographically diverse
Metagenomics

 • Used to survey macro and micro
   environments
     –   Microbial communities (Soil/Gut)
     –   Tumors
     –   Plant communities
     –   Coral reef ecosystems
 • Previous techniques coupled mtDNA or
   ribosomal Sanger sequencing with BLAST
   analysis
     – Limited by number of sequenced species
     – Can determine who is there, but not what they
       are doing
 • NGS approaches now being used to
   determine exactly what organisms are
   present and how they interact
     – Can get expression data and link it back to
       community groups
     – Survey community diversity
Data

 • Absolutely the largest roadblock for next
   generation sequencing
 • Terabytes of data are useless if we can’t
   efficiently analyze the data
 • How long should data be kept?
     – Depends on application
        • Human Diagnostic sequencing?
        • Research sequencing?
 • Where should data be kept and
   processed?
     – Local or Cloud (Amazon, etc)?
     – Cost of infrastructure vs cost of cloud service
     – Security issues
 • Future
     – Cloud based solutions will become more
       attractive

GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests GlycosidesNandakishor Bhaurao Deshmukh
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 

Recently uploaded (20)

Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
Explainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosExplainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenarios
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and Pitfalls
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
Environmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptxEnvironmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptx
 
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxQ4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptx
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptx
 
complex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdfcomplex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdf
 
Interferons.pptx.
Interferons.pptx.Interferons.pptx.
Interferons.pptx.
 
projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and moment
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annova
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarter
 
Abnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxAbnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptx
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptx
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPR
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 

Genome Wide Methodologies and Future Perspectives

  • 1. Genome Wide Methodologies and Future Perspectives Brian Krueger, PhD Duke University Center for Human Genome Variation
  • 2. History of Genetic Linkage • Mendel’s Laws – Law of segregation • Each parent randomly passes one of two alleles to offspring – Law of Independent Assortment • Separate genes for separate traits are passed independently to offspring • Traits from a dihybrid cross appear in offspring in the ratio of 9:3:3:1 – Laws hold true for genes on different chromosomes or genes located far away from one another • Linkage – Bateson and Punnett quickly found traits that didn’t assort independently – Thomas Hunt Morgan and his student Alfred Sturtevant found that recombination frequency is a good predictor of distance between genes • Genes that are inherited together must be closer to one another – linked • Generated the first linkage maps – Serves as an important basis for understanding genetic association studies
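The 9:3:3:1 dihybrid ratio above can be recovered by brute-force enumeration of gametes. A minimal sketch (the allele symbols and the enumeration itself are illustrative, not from the slides):

```python
from collections import Counter
from itertools import product

# Cross two AaBb parents: enumerate the 4x4 gamete combinations and
# count phenotypes (uppercase allele = dominant trait expressed).
gametes = ["".join(g) for g in product("Aa", "Bb")]  # AB, Ab, aB, ab
phenotypes = Counter()
for g1, g2 in product(gametes, repeat=2):
    combined = g1 + g2
    pheno = ("A" if "A" in combined else "a") + ("B" if "B" in combined else "b")
    phenotypes[pheno] += 1
print(phenotypes)  # Counter({'AB': 9, 'Ab': 3, 'aB': 3, 'ab': 1})
```

The 16 equally likely gamete pairings fall out as Mendel's 9:3:3:1 ratio, which is exactly why deviations from it signalled linkage to Bateson and Punnett.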
  • 3. Linkage Studies • Model Organisms – Fruit Flies, plants, etc – Extremely important for understanding human genetics – Fruit flies can produce new generations of 400+ offspring approximately every week! • Can very quickly understand the genetics of trait heritability • Familial Linkage Studies – Require multiple generations – Take decades to develop – Complicated by family participation • Association studies – Differ only subtly from linkage studies – Try to apply knowledge of familial linkage to entire populations
  • 4. Genome Wide Association Studies • GWA studies – Aim to find genetic variants that are associated with traits – Typically used to elucidate complex disease traits – Focus on SNPs, Indels, CNVs – Most often Case/Control Studies • SNP (Single Nucleotide Polymorphism) – Change in a single nucleotide position • Indel (Insertion/Deletion) – Describes the insertion or deletion of nucleotides • CNV (Copy number variations) – Large deletions or duplications of genetic material
  • 5. GWA Study History • Human Genome Project (1990-2000) – Decade long international project to determine the complete human genome sequence – Provided the reference genome for future research on genome variation • Human HapMap (2002-2009) – Sequencing whole genomes is expensive – Needed a shortcut to understand how variation contributes to disease – Mapped millions of common known SNPs in 269 individuals – Theory that common SNPs are inherited along with disease alleles and could therefore be predictive of associated disease – Determine how SNPs from case/control studies associate with human disease
  • 6. Defining Association • Variants are not always causal! – SNPs sometimes only serve as markers – Can play absolutely no role in the disease and even be located on different chromosomes from the gene actually responsible for the phenotype • Population stratification – Variants differ by population – Variants that are important markers of disease in one population or ethnicity may not be effective markers in another – For GWA studies to be effective predictors in multiple populations, large datasets for each ethnicity must be obtained
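A case/control GWA study ultimately scores each SNP by asking whether allele counts differ between the two groups. A minimal sketch using a hand-rolled chi-square test on made-up allele counts (real studies use genome-wide significance thresholds, not the single-test 3.84 cutoff shown here):

```python
# Chi-square test of a 2x2 allele-count table [[a, b], [c, d]]
# (rows = cases/controls, columns = allele A/allele C counts).
def chi_square_2x2(a, b, c, d):
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        expected = row * col / n            # expected count from margins
        stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical SNP: cases carry 1200 A / 800 C alleles, controls 1000 / 1000.
stat = chi_square_2x2(1200, 800, 1000, 1000)
print(round(stat, 2))   # 40.4, far above the 3.84 critical value (p < 0.05, 1 df)
```

A significant statistic only shows statistical association; as the slide stresses, the SNP may be a linked marker rather than the causal variant.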
  • 7. GWAS SNP Genotyping • Bead array genotyping – Uses a chip containing beads with covalently attached baits – Baits hybridized to fragmented DNA – Baits SPECIFIC for the DNA just upstream of a SNP – Base extension with fluorescently labeled bases allows interrogation of the SNP (each base has a different color!) – A single bead chip can assay millions of SNPs – Colorimetric output plotted • Blue indicates homozygous for one version of the SNP – CC • Purple is heterozygous – CA • Red homozygous for the other version of the SNP – AA [Figure: genotype cluster plots for rs1372493, raw channel intensities (A vs. B) and normalized coordinates (Norm Theta vs. Norm R)]
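The cluster plot described above can be reduced to a toy genotype caller: the angle between the two channel intensities separates the three clusters. The thresholds and intensity units below are invented for illustration and are not Illumina's actual clustering algorithm:

```python
import math

# Toy two-channel genotype caller: channel A reports one allele's signal,
# channel B the other's. Theta near 0 -> homozygous first allele, near 1
# -> homozygous second allele, middle -> heterozygous.
def call_genotype(intensity_a, intensity_b, alleles=("A", "C")):
    theta = (2 / math.pi) * math.atan2(intensity_b, intensity_a)
    r = intensity_a + intensity_b          # overall signal strength
    if r < 500:
        return "no-call"                   # too dim to genotype reliably
    if theta < 0.25:
        return alleles[0] * 2              # e.g. "AA"
    if theta > 0.75:
        return alleles[1] * 2              # e.g. "CC"
    return alleles[0] + alleles[1]         # heterozygous "AC"

print(call_genotype(9000, 300))    # "AA"
print(call_genotype(4000, 4200))   # "AC"
```

Plotting theta against R for thousands of samples reproduces the blue/purple/red clusters on the slide.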
  • 8. GWAS SNP Genotyping and Validation • Realtime PCR – Use specific PCR probes to verify SNPs – Good for validating a handful of SNPs at a time • Mass Array – Use mass spec to find SNPs – Detected by looking at fragment weight differences – Good for detecting or validating a large number of SNPs rapidly • Sanger sequencing – Gold standard validation method – Can determine the SNP at its exact position – Very robust
  • 9. GWA Study History • To this point in time, the power of most GWA studies was lacking – GWA not really genome wide – Looked at common variants across genome – Missed rare variants and not always descriptive of disease causation • Whole Genome Sequencing (WGS) – Actually assays the entire genome – Discovers all variants – Prohibitively costly before 2008 – Current cost of WGS ~$4000 • Thousand Genomes Project (2008-) – Facilitated by plummeting sequencing costs and technological advancements – Goal to fully sequence the genomes of 1000 healthy individuals to provide a true picture of genome wide variation
  • 10. Second Generation Sequencing • Developed to increase throughput of Sanger sequencing • Can sequence many molecules in parallel – Does not require homogeneous input – Sequenced as clusters • Sequencing by synthesis – Bases are added, signals scanned, and then washed – Cycle repeated (30-2000x)
  • 11. 2nd Gen: Sequencing by Synthesis Overview [Figure: workflow schematic – genomic DNA is fragmented, adaptors are ligated, clusters are generated (on a flowcell or beads), then bases are added and signals detected; the cycle is repeated hundreds of times on millions of clusters]
  • 12. Flavors of Sequencing • Whole Genome Sequencing – Obtain whole blood or tissue sample – Create sequencing libraries of all DNA fragments • Whole Exome Sequencing – Utilizes a selection protocol – Attach complementary RNA strands to beads – Fish out ONLY coding DNA sequences – Create sequencing libraries from enriched DNA – Reduces cost significantly • Custom Capture – Same protocol as Exome sequencing – Only target desired DNA sequences • Amplicon Sequencing – Use PCR to amplify target DNA – Sequence amplified DNA (Amplicon)
  • 13. NGS Study Designs for Gene Discovery [Figure: three study designs – multiplex families, case-control studies, and trio sequencing of sporadic diseases]
  • 14. De novo Mutation Calling/Filtering [Figure: workflow – individual and multi-sample variant calling, cross-checking against public databases (e.g., the Exome Variant Server’s 6500 sequenced exomes), visual inspection, and Sanger sequencing confirmation]
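The filtering cascade on this slide boils down to set subtraction: keep child variants absent from both parents and from public databases. A minimal sketch with made-up call sets (real pipelines work on VCFs and also weigh genotype quality before Sanger confirmation):

```python
# Candidate de novo variants: present in the child, absent from both
# parents and from a population database (a stand-in here for resources
# like the Exome Variant Server).
def candidate_de_novos(child, mother, father, population_db):
    """Each call set maps (chrom, pos) -> genotype string."""
    return {
        site: gt
        for site, gt in child.items()
        if site not in mother
        and site not in father
        and site not in population_db
    }

child  = {("chr1", 100): "A/G", ("chr2", 555): "C/T"}
mother = {("chr1", 100): "A/G"}   # chr1 variant is inherited, not de novo
father = {}
evs    = {("chr7", 999): "G/T"}   # toy population database
print(candidate_de_novos(child, mother, father, evs))
# {('chr2', 555): 'C/T'}
```

Everything surviving this filter would then go to visual inspection and Sanger confirmation, as the workflow shows.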
  • 15. Detecting Copy Number Variants ERDS (Estimation by Read Depth with SNVs): the average read depth (RD) of every 2-kb window is calculated, followed by GC correction. A paired Hidden Markov Model is then applied to infer the copy number of every window, utilizing both RD and heterozygosity information. [Figure: example windows showing a homozygous deletion, a heterozygous deletion, and a duplication]
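A bare-bones version of the read-depth idea behind ERDS: bin reads into 2-kb windows, then estimate copy number from the ratio of window depth to the genome-wide mean. This toy sketch omits the GC correction and paired HMM that ERDS actually uses:

```python
# Count reads per fixed-size window from their start positions.
def window_depths(read_starts, genome_len, window=2000):
    n_windows = (genome_len + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts

# Naive per-window copy number: depth relative to the mean, scaled to a
# diploid (copy number 2) baseline.
def naive_copy_numbers(depths):
    mean = sum(depths) / len(depths)
    return [round(2 * d / mean) for d in depths]

print(window_depths([10, 1500, 2100, 5999], genome_len=6000))  # [2, 1, 1]

depths = [100, 98, 51, 0, 102, 205]   # toy per-window average depths
print(naive_copy_numbers(depths))
# [2, 2, 1, 0, 2, 4] -> normal, normal, het deletion, hom deletion,
#                       normal, duplication
```

The HMM in ERDS plays the role that rounding plays here, but with smoothing across windows and the heterozygosity evidence folded in.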
  • 16. Illumina • Uses a flow cell • Cluster generated on slide via bridge amplification • Sequencing by synthesis – Performed by flowing labeled bases over flow cell – 4 pictures taken (one for each base) – Cluster color determined at each cycle allows interrogation of sequence • Advantages – Low cost per base – Very high throughput • Limitations – High cost per experiment – Short read length (30-150bp) – Acquired a company that uses new tech to reach read lengths of 2-10Kb Schadt et al 2010 HMG
  • 17. Ion Torrent • Emulsion PCR is used to generate clusters on a bead • Sequencing by synthesis – Related to pyrosequencing, which relies on release of pyrophosphate for detection – Instead of a visual cue, the system senses the release of H+ as each base is flowed over the beads • Advantages – Short run time – Does not require modified bases – Longer read length (200bp) • Limitations – Low data output – High homopolymer error rate
  • 18. Third Generation Sequencing • Defined as single molecule sequencing • Less complex sample prep • Much longer read length – SGS short read length is a huge disadvantage for de novo sequencing applications • Two categories – Sequencing by synthesis – Direct sequencing • Passing molecule through a nanopore • Using atomic force microscopy • Bleeding edge technology – Many technical hurdles – Currently very high error rates
  • 19. Pacific Biosciences • Utilizes single molecule sequencing by synthesis • Extremely complex system – Each well contains a single DNA molecule and an immobilized polymerase – No reagent washing – Employs confocal microscopy to only detect fluorescence at the polymerase • Advantages – Very long read length (1-15kb) – Low complexity sample prep – Very fast data generation (real time) • Disadvantages – Prone to sequencing errors (~15% error rate) – Company on the verge of bankruptcy
  • 20. Third/Second Generation Sequencing • Currently only one viable high throughput long read sequencing platform – PacBio system has a 15% error rate – Need long reads for many applications from de novo sequencing to haplotyping • Second generation sequencers are high throughput and accurate – Short reads are hard to assemble and leave gaps in repetitive sequences • Can use both together as a highly accurate and extremely powerful tool for de novo sequencing applications – Use PacBio assembly as a scaffold – Correct errors by aligning HiSeq reads on top – Effective error rate of 0.1% – Expensive but extremely fast and accurate compared to other methods Koren et al 2012 Nature Biotechnology
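The hybrid correction idea can be sketched as majority voting: accurate short reads aligned onto a long read out-vote its errors at each position. The alignment offsets are assumed given here; real pipelines such as the Koren et al approach also handle the alignment itself and indel errors:

```python
from collections import Counter

# Majority-vote error correction of one long read. aligned_short_reads
# is a list of (offset, sequence) pairs, already placed on the long read.
def correct_long_read(long_read, aligned_short_reads):
    votes = [Counter({base: 1}) for base in long_read]  # long read's own vote
    for offset, seq in aligned_short_reads:
        for i, base in enumerate(seq):
            votes[offset + i][base] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)

long_read = "ACGTAXGTACGT"                 # 'X' marks a sequencing error
shorts = [(3, "TAGGTA"), (4, "AGGTAC")]    # two accurate short reads
print(correct_long_read(long_read, shorts))
# "ACGTAGGTACGT"
```

With enough short-read coverage the residual error rate drops sharply, matching the slide's ~0.1% effective figure in spirit.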
  • 21. Future: Nanopore Sequencing • Leading candidate is Oxford Nanopore • Concept – Detect flow of electrons through the pore – Each base causes a detectable change in the current – Results in direct sequencing – Theoretically could be used to sequence RNA and protein too • Advantages – Long read length – Plug and play – Easily scalable • Disadvantages – No hard data yet – No specific release date [Image credit: John MacNeill/TechnologyReview]
  • 22. Future: Direct sequencing • Concept stage techniques – Significant technical hurdles to overcome – Mostly proof of concept experiments • IBM DNA Transistor – Bases read as single stranded DNA passes through the transistor – Gold bands represent metal, gray bands are the dielectric [Image credit: IBM] • Atomic force microscopy sequencing – Use AFM tip to detect each base of single stranded DNA [Image credit: Lee et al US PAT 20040124084]
  • 23. Sequencing Applications • Analyses that used to take days or years to perform can now be completed in hours • Next generation sequencing has opened a new door for addressing very complicated genetic questions – Has huge potential to revolutionize human healthcare – Survey complex tumor types – Research into macro and micro community genomics – Reveal evolutionary history
  • 24. De novo Sequencing • Human genome took 10 years to complete and cost $3 billion – Done by laboriously cloning overlapping segments of the human genome into BAC libraries and Sanger sequencing each one – Genome assembled using computers to line up overlapping sequences • Current estimate is around $4000 – Can be completed in a week – Companies like Complete Genomics say they have already sequenced thousands of human genomes • Future – Long read sequencers will make agricultural sequencing more viable – Whole genome sequencing for human diagnostics will become routine – Increasing the catalog of organismal genomes will improve our understanding of evolution and development
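Assembling by "lining up overlapping sequences" can be illustrated with a greedy pairwise merge at the longest suffix-prefix overlap. Real assemblers use overlap or de Bruijn graphs; this toy helper with invented fragments is only a sketch:

```python
# Merge fragment b onto fragment a at their longest suffix-prefix
# overlap of at least min_overlap bases; return None if none exists.
def merge(a, b, min_overlap=3):
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

print(merge("ATTAGACCTG", "CCTGCCGGAA"))  # "ATTAGACCTGCCGGAA"
print(merge("AAAA", "TTTT"))              # None - no overlap, gap remains
```

The `None` case is exactly why short reads "leave gaps in repetitive sequences": without a trustworthy overlap, fragments cannot be joined.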
  • 25. Genome Mutation Analysis • Previously done by completing complicated and time consuming familial linkage studies and targeted Sanger sequencing • Next generation sequencing can look at every gene at once – Can produce a genetic map of the complete genome – Used to detect genetic polymorphisms – See every possible mutation • Future – Whole genome sequence analysis – Targeted genome sequencing analysis using predetermined sequence selection arrays (ex: Exome Enrichment)
  • 26. Pharmacogenetics • Very hot topic in the biotech and insurance industries • Use genetic typing to predict how a person might respond to different drug treatments • Currently relies on microarrays • NGS could provide significantly more information at more loci – Microarrays only look at a handful of polymorphisms – Current NGS approaches port the microarray technique to enrich pools for sequencing • Future – As the catalog of human genomes increases, it will be easier to calculate responses to treatment before drugs are administered Gauthier et al 2007 Cancer Cell
  • 27. Epigenetics • Defined as heritable genetic information that is not coded in the DNA bases – DNA methylation – Histone modifications • Previous mechanisms for detecting these chromatin or DNA modifications relied on targeted probing – ChIP-PCR – Bisulfite sequencing – Footprinting assays • Next generation sequencing changed everything – Whole genome methylation mapping (MAP-IT) – Whole genome histone modification and protein binding mapping (ChIP-Seq - acetylation, methylation, etc) • ENCODE project
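The logic of bisulfite sequencing can be sketched directly: unmethylated cytosines read as T after conversion, while methylated cytosines remain C, so comparing a converted read to the reference yields a per-cytosine call. A toy plus-strand-only example with invented sequences (no error handling or CpG-context logic):

```python
# Call methylation status of each reference cytosine from a
# bisulfite-converted read aligned at the same coordinates.
def methylation_calls(reference, bisulfite_read):
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, bisulfite_read)):
        if ref == "C":
            calls[i] = "methylated" if obs == "C" else "unmethylated"
    return calls

ref  = "ACGTCCGA"
read = "ATGTCTGA"   # C at 1 converted, C at 4 protected, C at 5 converted
print(methylation_calls(ref, read))
# {1: 'unmethylated', 4: 'methylated', 5: 'unmethylated'}
```

Whole-genome bisulfite approaches apply this comparison at every cytosine, which is what makes genome-wide methylation maps possible.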
  • 28. ENCyclOpedia of Dna Elements (ENCODE) • International project – Follow up to the human genome project • Only 2% of the human genome codes for protein – Creating and maintaining DNA is biochemically expensive – What’s the other 98% of the genome doing? • ENCODE goals – Determine the functional elements of the human genome – Protein Coding – Non-Coding RNA – mRNA Expression – Regulatory protein binding sites – Histone modifications • Preliminary estimates show that 80% of human DNA is functional!
  • 29. Transcriptome/Expression Analysis • Gene expression analysis is important for disease discovery and cancer diagnosis • Expression analysis first relied on Northern blotting followed by DNA microarrays – Both cases require a probe – Need to “know” what you are looking for – Low resolution screening • Next generation approaches screen the entire transcriptome (RNA-Seq) – Single base resolution of expression – Can see level of expression and also visualize mutations in expressed sequences • Future – Important for diagnosing/treating cancer and heritable diseases
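Single-base expression quantification from RNA-Seq usually starts from per-gene read counts normalized for gene length and sequencing depth. A minimal TPM (transcripts per million) sketch; the gene names, counts, and lengths below are made up for illustration:

```python
# TPM: divide each gene's count by its length (reads per kilobase),
# then rescale so the values across genes sum to one million.
def tpm(counts, lengths_kb):
    rates = {g: counts[g] / lengths_kb[g] for g in counts}
    per_million = sum(rates.values()) / 1e6
    return {g: rate / per_million for g, rate in rates.items()}

counts     = {"GENE_A": 900, "GENE_B": 4500, "GENE_C": 300}
lengths_kb = {"GENE_A": 1.8, "GENE_B": 1.5, "GENE_C": 6.0}
expression = tpm(counts, lengths_kb)
print({g: round(v) for g, v in expression.items()})
```

Length normalization matters because, at equal expression, a 6-kb transcript yields far more fragments than a 1.5-kb one; TPM makes the per-gene values comparable within a sample.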
  • 30. Phenotypic Correlation • NGS data generates huge datasets with 85-99.9% base accuracy – Must determine which signals are real, and which are noise/errors – Most promising hits are validated by other assays (Sanger, qRT, Mass Spec) – How do we determine which hits to validate? • Currently have only very small datasets, even in pharmacogenetics, and these have limited utility • Validated hits can be distractions – Tumor diversity presents multiple escape routes during targeted treatment (see the NYTimes series on whole genome sequencing: http://nyti.ms/No4fgd) • Future – Require large validated datasets that are ethnically and geographically diverse
  • 31. Metagenomics • Used to survey macro and micro environments – Microbial communities (Soil/Gut) – Tumors – Plant communities – Coral reef ecosystems • Previous techniques coupled mtDNA or ribosomal Sanger sequencing with BLAST analysis – Limited by number of sequenced species – Can determine who is present, but not what is going on • NGS approaches now being used to determine exactly what organisms are present and how they interact – Can get expression data and link it back to community groups – Survey community diversity
  • 32. Data • Absolutely the largest roadblock for next generation sequencing • Terabytes of data are useless if we can’t efficiently analyze them • How long should data be kept? – Depends on application • Human Diagnostic sequencing? • Research sequencing? • Where should data be kept and processed? – Local or Cloud (Amazon, etc)? – Cost of infrastructure vs cost of cloud service – Security issues • Future – Cloud based solutions will become more attractive

Editor's Notes

  1. Before we can talk about whole genome sequencing and how it could revolutionize personalized medicine, it is important that we first discuss the history of genetics. I know that you’ve covered much of this information in previous lectures, so I’m going to highlight a little bit of the history that I think is important for today’s discussion.
  2. I’m going to assume that you haven’t made it to this point in your study of biology without having heard of Gregor Mendel. Mendel is considered the father of modern genetics because he was the first to notice and then study the phenomenon that plants seemed to pass their traits onto their offspring. He developed two laws through his research studying pea plants in the mid-1800s. His first law of segregation stated that each parent randomly passes one of two alleles to their offspring. His second law states that separate genes for separate traits are passed independently to their offspring in a ratio of 9:3:3:1. These laws held true in Mendel’s experiments; however, the significance of Mendel’s work was not realized until the early 1900s, when scientists were again trying to understand how traits were inherited. Mendel’s work was ignored for almost 50 years! However, the rediscovery of Mendel’s work quickly led to the discovery that not all traits assort independently. Much of this early work was done by Bateson and Punnett, who showed that some traits seemed to assort together or be linked. Further work by Thomas Hunt Morgan and his student Alfred Sturtevant found that genes did appear to be linked together and that their frequency of recombination or loss of linkage was a good predictor of the distance between genes. This allowed for the creation of the first linkage maps of chromosomes. Understanding linkage is important for understanding some of the early work done in genetic association studies.
  3. You may be wondering how scientists were able to make linkage maps of genes in the early 20th century, nearly 20 years before the discovery that DNA was the genetic material that made up chromosomes. Much of this work was performed in model organisms. Mendel used peas, for example, while others like Sturtevant used the humble fruit fly. Although model organisms sometimes become the target of needless political bickering, these organisms have been and will continue to be extremely important for understanding human genetics. In the early days, fruit flies were a great source of genetic material, not only because they can produce large numbers of offspring very quickly but also because their salivary glands contain giant chromosomes which can be stained and visualized using standard microscope procedures. Upon staining, these chromosomes display a distinct banding pattern based on chromatin density, or how compacted the DNA is in each region. These bands could be used to track genetic recombination events. Of course the work completed in model organisms was later applied to humans; however, studying genetics in humans is complicated by the fact that we don’t produce 400 offspring every generation, and the family history of disease can be hard to determine if family members have passed away or if relatives refuse to participate. Because of these factors, studying human disease in the early days of genetics was extremely time consuming. It could take decades to develop informative maps to track disease. Further, the chromosomes contained in human cells are much smaller than those found in fruit fly salivary glands. So new techniques and better microscopes had to be invented before the power of chromosome staining could be applied to humans. As techniques have matured and the field of genetics has improved, we have been able to determine how many Mendelian diseases are inherited using these simple techniques.
The next logical step for genetics was to use the information gathered in familial studies and apply these lessons to entire populations. Modern genetics has facilitated our ability to look at genetic associations on a population wide and genome wide scale.
  4. Genome wide association studies aim to find genetic variants that are associated with traits. These can be used to understand human disease, but more generally they can be used to trace any genetic trait. These studies focus on looking at specific changes in the DNA. These include SNPs, single nucleotide polymorphisms, or changes at a single nucleotide position in the DNA. They also include indels, insertions or deletions of specific nucleotides. Finally, these studies also track copy number variations, which include large deletions or duplications of genetic material.
  5. Before we start the discussion of genome wide studies, it is important to introduce the human genome project. Prior to the completion of the human genome project, tracking down genes was a very time consuming process of trial and error using cloning, PCR and Sanger sequencing. For the most part, traits could only be loosely mapped to large genetic regions. Geneticists realized that knowing the sequence of the entire genome would greatly enhance their ability to narrow down the location of all of the genes in the genome. However, the human genome project took over a decade to complete and cost nearly 3 billion dollars. Of course the best way to look at genome variation in individuals is to sequence all of their DNA individually and find the specific mutations that are responsible for their traits. This just was not possible at the time because genome sequencing was prohibitively expensive, but scientists could use what they already knew about human genetics and single nucleotide polymorphisms to develop a condensed map of the human genome that could then be used to track genetic changes. To create this abridged map, a team of scientists came together with the goal of mapping all of the known common SNPs that are present in 1% or more of the population in 269 people from 4 different ethnic backgrounds. In essence, the human haplotype map, or hapmap for short, serves as the baseline group for future studies. Once the hapmap was in place, groups of affected individuals could be genotyped to find common SNPs that were predictive of their disease, trait, or response to a drug. In theory, this was a great workaround for the problem of having to whole genome sequence individuals, however this approach does not necessarily tell scientists which genes in particular are responsible for the trait and in many cases only serves as a marker.
  6. We can explore this idea in a little more depth. Sometimes changes in the DNA have no effect whatsoever on the phenotype; however, these changes can be used as markers. This is sometimes a hard concept to understand. Remember how Punnett and Bateson showed that some genes can be linked. Well, you can think of SNPs as markers that can be linked with a trait but not responsible for it. Sometimes SNPs are very near the genes responsible for the trait, other times they are on completely different chromosomes. Because the majority of your DNA is inherited from your mother and father, there are many common SNPs that have been passed down through generations. Sometimes SNPs can be predictive of disease when looking at large sets of data and calculating statistics. Again, this was the whole idea behind the human HapMap. Certain variants can be statistically associated with a trait while not at all being causal! To further complicate the situation, because different populations of humans have been isolated from one another, each population has passed different sets of SNPs to their offspring. This means that important SNPs in one ethnicity may not be conserved in another. For genome wide association studies to be effective predictors, large control datasets from each population must be obtained.
  7. So how do scientists go about determining the SNP genotype for individuals? Much of this is currently done using array technology. The technology we use to do this in the Genomic Analysis Facility relies on bead chip arrays. These beads have DNA baits on them that are complementary to DNA just upstream of a SNP. These beads are exposed to fragmented DNA that has been isolated from an individual and then these captured fragments serve as the template for a DNA extension reaction. This is done using fluorescently labeled DNA bases. Each base has a different color so it’s easy to detect which SNP genotype is present at each targeted SNP location. These signals are then detected by a laser and the genotypes plotted. In the output here you can see a blue, a purple and a red cluster. These clusters indicate the SNP genotype at this position for each individual tested in this assay.
  8. Some other methods for SNP genotyping include quantitative real-time PCR, which can be used to genotype samples at a handful of SNPs. This is typically used when looking at a small number of individual samples or a small number of SNP locations. Another commonly used technology is a mass array. Instead of using PCR probes to find SNPs, the mass array does an extension reaction with very heavy modified bases. Mass spectrometry is then used to detect which heavy base is present in each SNP reaction. Finally, the gold standard technique for detecting and confirming SNPs is Sanger sequencing. Sanger sequencing is beneficial because it allows researchers to look directly at the DNA. It is a very robust technique; however, it is hard to use to genotype a large number of samples at a time. These days, Sanger sequencing is mostly used as a confirmatory assay to back up the findings of other SNP genotyping techniques.
  9. Up until 2008 or so, most genome wide association studies were lacking because they weren’t really genome wide. They employed genome wide biomarkers, but their utility in trait discovery was limited because they relied on common variants. These studies missed rare variants, and because many of the SNPs were not mutations in specific genes, GWA studies were not descriptive of disease causation. The major benefit of whole genome sequencing is that it does assay the entire genome. This technique discovers all of the variants in the human genome, which has its drawbacks and benefits. One of the biggest criticisms of whole genome sequencing is that, because this technique finds all of the genetic variation, it makes it hard to determine which mutations are actually important. Without large control datasets to set the benchmark for normal variation, it’s a huge task to sort out which variants are signal and which are noise. Enter the Thousand Genomes Project. This project is similar to the human HapMap project, in that it was set up to provide a large database of genomic variation, except the Thousand Genomes Project focuses on the genomes of 1000 individuals from a range of ethnic backgrounds. Again, the important genomic variants can differ among ethnic populations.
10. What caused the price of whole genome sequencing to decrease so rapidly and facilitate this new age of whole genome sequencing? Next generation (or second generation) sequencing was developed to increase the throughput of sequencing. Sanger sequencing had been advanced to the point that 384 sequences could be analyzed at the same time, but this still was not high enough throughput to quickly sequence genomes. Next generation sequencing’s main advantage is that it can sequence millions to billions of fragments in parallel, and it does not require a homogeneous input sequence because sequences are obtained as independent clusters. All of the second generation technologies rely on sequencing by synthesis, using a polymerase to generate a complementary DNA strand.
11. The technical theory behind all of the second generation sequencing technologies is very similar. Genomic DNA is isolated and then fragmented to a particular size range. In the genomic analysis facility, we try to fragment the DNA so that the majority of it is 300 base pairs in length. We then perform some biochemistry to ligate small pieces of DNA, called adaptors, to the ends of these fragments. The adaptors are very important because they attach a “known” DNA sequence to each fragment, which is needed both for amplifying the library and for capturing the library on a flow cell or bead. The flow cell or bead, depending on the sequencing technology, has known complementary DNA attached to its surface, which allows it to capture the DNA fragments. Once the fragments have been captured, they are amplified on the surface to create large fragment clusters. These clusters are important because they allow for the later detection of a sequencing signal. Once the clusters are made, they are placed in the sequencer where the sequences can be read. This is done by sequentially flowing labeled bases over the bead or flow cell and detecting the signals that are emitted from each spot. I’ll go into the details later of how each technology actually sequences the DNA. This cycle of base addition and detection is repeated hundreds of times on millions of clusters to obtain the final sequencing information.
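The cycle described here can be sketched in miniature, assuming a toy model where every cluster yields one base per chemistry cycle (the cluster strings and one-base-per-cycle readout below are illustrative stand-ins for real fragments and image-based base calling):

```python
# Toy model of cyclic sequencing by synthesis: every cluster on the flow
# cell is interrogated in parallel, one base per cycle. Cluster sequences
# here are made-up strings standing in for real DNA fragments.

def sequence_clusters(cluster_templates, cycles):
    """Read one base per cycle from every cluster simultaneously."""
    reads = ["" for _ in cluster_templates]
    for cycle in range(cycles):
        # One chemistry cycle: each cluster incorporates at most one base,
        # and the instrument records the signal from all clusters at once.
        for i, template in enumerate(cluster_templates):
            if cycle < len(template):
                reads[i] += template[cycle]  # stand-in for signal detection
    return reads

clusters = ["ACGTACGT", "TTGCA", "GGGCCTA"]
print(sequence_clusters(clusters, 5))  # ['ACGTA', 'TTGCA', 'GGGCC']
```

The key property the sketch captures is that the number of chemistry cycles sets the read length, while the clusters are all read in parallel, which is what makes millions of simultaneous reads possible.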
12. There are many different types of sequencing that can be done using this technology. I’ll highlight the ones that we perform most often in the genomic analysis facility. Of course there’s whole genome sequencing, which is essentially what I described on the previous slide. There’s also whole exome sequencing, which utilizes a selection protocol to sequence only the DNA found in exons, or coding DNA sequences. This technique is advantageous in that it saves a lot of money on the sequencing end if the question you are asking relates to coding sequences. The selection is done by attaching complementary RNA to beads to fish out only the fragments of DNA that correspond to coding sequences. A similar technique that further reduces cost is custom capture. This technique is employed when a researcher wants to sequence specific DNA targets of interest. An example would be a capture library of only the genes known to be involved in a specific type of cancer. Finally, there’s amplicon sequencing. This technique is usually used only when you want to sequence a small number of genetic regions in a large number of samples. It is cost effective only for a small number of regions because specific primers need to be created for each targeted region. As the number of regions increases, methods such as custom capture become more cost effective.
13. There are a few different types of studies that we do in the Center for Human Genome Variation. These include multiplex family studies, case-control studies and trio sequencing studies. In multiplex family studies, we sequence DNA from families with a known history of disease and then probe their genomes for variants that appear to be conserved in affected individuals. The power of multiplex family studies is that the genetic background in these individuals is conserved, which affords us a greater signal to noise ratio. When sequencing entire families is not possible, trio sequencing is employed. This is a powerful technique for sporadic disease, where the disease is not conserved within a family lineage and new genetic mutations appear to be responsible. For these studies, we sequence the genomes of both parents and the affected child to try to determine where de novo mutations may have occurred. Finally, we’ve already touched on case-control studies a little bit when we talked about the Human HapMap and the 1000 Genomes Project. Case-control studies look at how traits or diseases can be identified at the population level. In these studies, unaffected controls are sequenced and compared to affected individuals to try to find genetic mutations that are more highly represented in cases than in controls.
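For a single variant, the case-control comparison boils down to a 2x2 table of carriers versus non-carriers. A minimal sketch with entirely hypothetical counts (a real study would add a significance test such as Fisher’s exact test and correct for multiple testing across millions of variants):

```python
# Hypothetical case-control comparison for one variant: an odds ratio
# above 1 suggests the variant is over-represented in cases.

def odds_ratio(case_carriers, case_noncarriers,
               control_carriers, control_noncarriers):
    """Odds ratio from a 2x2 carrier table."""
    return (case_carriers * control_noncarriers) / \
           (case_noncarriers * control_carriers)

# Made-up counts: 40 of 100 cases carry the variant vs. 10 of 100 controls.
print(odds_ratio(40, 60, 10, 90))  # 6.0
```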
14. So how do we discover these disease-causing mutations in the sequencing data? Using trio sequencing as an example, we sequence the genomes of a mother, a father, and the affected child. We run this sequencing data through a computer pipeline that takes the small DNA fragments we sequenced and aligns them so that we can compare the three genomes to one another. We use this data to find differences among the three genomes and then compare those differences to a database of known disease-causing mutations. In this example, the two parents have a normal sequence at this location while the affected child appears to be heterozygous for a mutation. Sanger sequencing of all three individuals at this locus confirms that the affected child is heterozygous. So how do we know that this mutation is actually the cause of the disease based on the sequence information alone? We don’t, but we can make an educated guess if the mutation occurs in a protein coding region. Additionally, it is becoming more and more common to follow up these association studies in other models to provide functional information about a mutation. This is done by making specific mutations in animal or cell culture models and seeing if the mutation produces a similar phenotype.
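The de novo step of the trio comparison can be sketched as a toy genotype check (real pipelines call genotypes from aligned reads with quality filters; the two-letter genotype strings here are purely illustrative):

```python
# Toy de novo candidate check: flag any child allele seen in neither parent.

def candidate_de_novo(mother, father, child):
    """Return child alleles absent from both parental genotypes."""
    parental_alleles = set(mother) | set(father)
    return [allele for allele in child if allele not in parental_alleles]

# Both parents homozygous reference ("AA") at this locus;
# the affected child is heterozygous ("AG").
print(candidate_de_novo("AA", "AA", "AG"))  # ['G']
```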
15. In addition to point mutation detection, we can also use next generation sequencing data to detect copy number variations, including deletions and duplications. Because next generation sequencing looks at millions of small sequence fragments, we get multiple reads for each region of the genome. Based on how often we see a signal for each sequence, we can determine how many copies of each sequence are likely to be present. Using ERDS analysis, we can look at specific sequences and see if they are completely missing, such as in a homozygous deletion. We can call heterozygous deletions if we obtain half as much sequence information as we expect, or find duplications if we obtain more sequence information than expected.
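The read-depth logic can be illustrated with a toy classifier. The thresholds below are purely illustrative, and ERDS itself is a far more sophisticated model that combines read depth with SNV information:

```python
# Toy copy-number call from read depth: compare a region's observed
# coverage to the genome-wide expectation. Thresholds are illustrative.

def classify_copy_number(observed_depth, expected_depth):
    ratio = observed_depth / expected_depth
    if ratio < 0.1:
        return "homozygous deletion"    # sequence essentially absent
    if ratio < 0.75:
        return "heterozygous deletion"  # roughly half the expected depth
    if ratio > 1.5:
        return "duplication"            # more sequence than expected
    return "normal"

# Assuming a hypothetical genome-wide average depth of 30x:
print(classify_copy_number(1, 30))   # homozygous deletion
print(classify_copy_number(15, 30))  # heterozygous deletion
print(classify_copy_number(60, 30))  # duplication
```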
16. I’ve spent the last few slides talking about the applications of genome sequencing in the genomic analysis facility, but I’d also like to provide you with information on the current crop of next generation sequencers and where the field is heading. I’m going to start by discussing the technology that we currently use in the genomic analysis facility, which is the Illumina system. You will get a chance to see this technology in action when you come to tour the lab on Friday. The Illumina platform uses a flow cell to process sequencing samples. Clusters are created via bridge amplification within each lane of the flow cell. Sequencing is performed by flowing fluorescently labeled bases over the clusters and imaging cluster tiles for each colored base; determining the color of each cluster after each cycle allows interrogation of the sequence. The advantage of the Illumina system is that it is very high throughput and has a low cost per base. It can generate anywhere from 40 to 240 million reads per lane of a flow cell, depending on the sequencing system and read length. Some of the limitations are that each experiment can cost between $10,000 and $20,000 depending on the depth of sequencing and the length of the run. The Illumina system is also not ideal for de novo genome sequencing because the short read length limits its ability to accurately sequence large repeats. However, Illumina has recently purchased a company that says it can increase Illumina read lengths into the 2-10 kb range, making it an attractive choice for future sequencing experiments that require longer reads.
17. The Ion Torrent system uses a completely new detection system. For the most part, next generation sequencing systems use light to determine the sequence; the Ion Torrent system instead uses semiconductor technology to detect the sequence based on changes in pH. Each time the polymerase incorporates a base during a sequencing cycle, a hydrogen ion is released, resulting in a detectable change in the pH of the reaction well. This is one of the advantages of the system: because it does not rely on optical detection, a camera doesn’t have to image each cluster; instead, there is a pH-detecting sensor under each bead, which significantly increases the speed at which sequencing is performed. The technique also doesn’t require expensive modified bases or complicated chemistry to perform the sequencing reactions. Another benefit is that the read lengths are slightly longer than those of the HiSeq system, at around 200 bp. On the downside, the system has a high homopolymer error rate and low data output. It is not ideal for large scale projects and has a relatively high cost per base compared to Illumina sequencing. Ion Torrent sequencing is well suited to the array-based custom capture sequencing that is starting to be used by diagnostic labs.
18. We are currently entering the era of third generation sequencing. The defining characteristic of third generation sequencing is that sequences are determined from single DNA molecules, not clusters. One of the main criticisms of cluster-based second generation sequencing is that mutations can occur and be amplified during cluster generation. Third generation sequencing also has the added benefits of less complex sample prep and longer read lengths. Current third generation sequencers can produce reads up to 15 kb, with an average read length of 3 kb. These technologies fall into two categories: like second generation sequencers, some employ sequencing by synthesis, while the more bleeding-edge technologies sequence the DNA by reading the strand directly. Direct reading of the DNA can be accomplished by passing the DNA through a specialized pore or by using atomic force microscopy. Currently there is only one functioning third generation technology on the market, and this sector still faces many technical hurdles.
19. The Pacific Biosciences sequencing system is extremely complex. It uses sequencing by synthesis to “watch” a polymerase sequence a single DNA strand in real time. This is done by immobilizing a single polymerase at the bottom of a nanowell and using confocal microscopy to look at a small slice of the visual field and detect fluorescence only at the polymerase. In this system reagents exist in a large reaction volume, and sequencing is allowed to proceed at a very rapid rate. The advantages of the system are the extremely long read lengths, low complexity of sample prep, and very fast generation of sequencing data. The major disadvantage is the very high error rate, on the order of 15%; because of this, the company is struggling to maintain its customer base and attract new customers.
20. PacBio may still have the last laugh, though. The long reads that its system generates are invaluable for de novo sequencing applications, such as determining the genomic sequence of previously unsequenced organisms, but the high error rate of those reads makes them nearly useless on their own. On the flip side, short read sequencers are very accurate, but their data is hard to assemble into complete genomes because holes are generated in highly repetitive sequences. One recently published solution is to use HiSeq or 454 reads to “fix” the long reads generated by the PacBio system. By using the PacBio reads as a scaffold, HiSeq reads can be aligned to them and a consensus obtained to correct the reads. The resulting effective error rate is better than that of any other next generation sequencing system on the market, although the cost of such a project is very high. This hybrid sequencing approach is really a bridge technique until PacBio can improve its error rate or other long read technologies come on-line.
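The scaffold-and-consensus idea can be sketched as a per-position majority vote. Alignment positions are supplied directly in this toy version; the published hybrid-correction pipelines compute them by aligning the short reads to each long read:

```python
# Sketch of hybrid error correction: short, accurate reads vote on each
# position of a long, error-prone read, and the majority base wins.
from collections import Counter

def correct_long_read(long_read, aligned_short_reads):
    """aligned_short_reads: list of (start_position, sequence) pairs."""
    votes = [Counter({base: 1}) for base in long_read]  # long read's own vote
    for start, seq in aligned_short_reads:
        for offset, base in enumerate(seq):
            votes[start + offset][base] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)

# The long read has an error (X) at position 2; two short reads cover it.
print(correct_long_read("ACXTA", [(0, "ACG"), (1, "CGT")]))  # ACGTA
```

Positions covered by no short read simply keep the long read’s base, which is why uncorrected stretches retain the original error rate.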
21. The future of DNA sequencing is in direct interrogation of single molecules. Most of these technologies are in the concept stage, although a small startup says that it will be releasing its product at the end of the year. But they have been saying they’d release a product at the end of the year for the last 3 years, so how close they are to an actual product is anyone’s guess. The concept behind Oxford Nanopore’s technology is to feed a single DNA molecule through a pore in a membrane and detect the flow of electrons as each base passes through; every DNA base should cause a detectable change in current flow. The system theoretically can be used to directly sequence RNA and protein too. Oxford says the system has a very long read length, is infinitely scalable, and is plug and play. They even introduced a USB drive device that can turn any computer into a high throughput sequencer. At this point, Oxford has yet to release any real product or data, so we can assume that most of this is probably pure speculation and marketing hype.
22. Other candidates in the direct sequencing game are concept or proof-of-concept stage devices. One of these is a project from IBM called the DNA Transistor, which utilizes a pore to direct DNA through a dielectric transistor and read bases in a manner similar to the Oxford Nanopore system. Other labs have pioneered the use of atomic force microscopy to sequence short stretches of DNA; however, a commercial system has yet to materialize.
23. To this point I’ve spent some time talking about sequencing and where the technology is heading. Now I’d like to discuss how these sequencers can be used to answer complex biological questions. Sequencers can now perform, in a matter of hours, genetic analyses that used to take days or years. Next generation sequencing has opened a new door and has huge potential to revolutionize human healthcare and scientific research.
24. One of these areas, which I touched on briefly already, is de novo sequencing. Previously this was done by laboriously cloning overlapping segments into plasmids and Sanger sequencing each one. This is how the human genome was sequenced by the NIH; it took more than a decade and cost 3 billion dollars. Using the current crop of next generation sequencers, we can perform the same de novo sequencing for around $4,000 and complete it in a week. Biotech companies like Complete Genomics report that they have already sequenced a few thousand human genomes. The future of de novo sequencing is evolving and relies on long read sequencers to make the alignment and data analysis process more efficient. In the case of agriculture, many plants are considered impossible to sequence because of the highly repetitive nature of their DNA. As sequencing costs decrease, de novo sequencing of human genomes will become a routine diagnostic tool, and along the same lines, using de novo sequencing to expand the catalog of organismal genomes will improve our understanding of evolution and development.
25. Another area where next generation DNA sequencing has the potential to change science and medicine is genome mutation analysis. This type of genome analysis is one of the biggest focuses of the work that we do in the genomic analysis facility. As I stated earlier, complicated and time consuming linkage studies coupled with Sanger sequencing were used in the past to elucidate genetic diseases. Now next generation sequencing can look directly at the entire genome and produce a complete genetic map of a patient’s DNA. In the future, whole genome mutation mapping will be used to diagnose human disease. This will occur slowly and likely proceed through targeted genome sequencing, using selection arrays or panels to look at specific regions of the genome that have already been linked to disease.
26. Along the same lines as genetic mutation mapping is the current hot topic of pharmacogenetics. Pharmaceutical and insurance companies are very interested in understanding how genetic data can predict a patient’s response to drug treatment. These screens currently rely on microarrays, which look at only a handful of polymorphisms, but next generation sequencing could provide sequence level information at many more loci. The value of these screens is enormous if a drug-phenotype association is determined; however, pharmacogenetics suffers from relatively small datasets with low predictive power for most drugs. As the amount of genetic data increases, it should become easier to predict treatment outcomes.
27. Next generation sequencing has already revolutionized the field of epigenetics, which looks at heritable genetic information that isn’t coded in the DNA sequence itself. These marks include DNA methylation and histone modifications that affect gene expression. Previous methods for detecting these modifications relied on low throughput and sometimes extremely complicated techniques to determine the presence of modified sequences or DNA-bound proteins. Next generation sequencing changed everything because now researchers can look at chromatin on a global scale and not just at their favorite convenient genomic locus. Whole genome epigenetic sequencing is revealing the complexities of chromatin acetylation and methylation, which in the future may be important for understanding dysregulation of gene expression in a wide variety of diseases.
28. In the last year, the ENCODE consortium published its initial findings on a wide range of genomic topics. ENCODE stands for the Encyclopedia of DNA Elements and was a follow-up to the human genome project. It quickly became clear that knowing the human genome sequence alone wasn’t very helpful for understanding how the genome functions. Of the 3 billion bases sequenced, only 1-2% of the DNA actually codes for functional proteins. This means that 98-99% of the genome is involved in other processes! Unfortunately, this led to the propagation of the idea that the genome was mostly ‘junk’ DNA. The problem with this idea of junk DNA is that DNA bases are biochemically expensive and maintaining the genome isn’t an easy task. This junk DNA must serve some purpose, and finding that purpose is the goal of the ENCODE project. We know that non-coding DNA plays an important role in gene expression via promoters, enhancers and other protein binding sites. Before whole genome sequencing, finding all of these sites was an impossible task. The application of chromatin immunoprecipitation coupled with whole genome sequencing has allowed the ENCODE consortium to determine the binding locations of 119 gene expression regulatory proteins in 147 different cell types. Combining the RNA-sequencing information with the protein binding information, the ENCODE consortium has determined that over 80% of the genome is functional in some way, in that it is either bound by protein or converted into RNA. As the available data expand to include all of the roughly 1,800 DNA regulatory proteins and additional cell types, there’s no doubt that the amount of functional genomic material will increase. Of course, this 80% figure is still somewhat controversial in the field, with some biologists saying that they believe only 10-15% of the genome actually contributes to phenotypes. So how do these results relate to pharmacogenetics and personalized medicine?
These results show that sometimes knowing the absolute sequence of the DNA isn’t enough to fully understand how the genome is functioning. This is especially true in cancer, where uncontrolled cell growth can result from overexpression of a perfectly normal protein: the protein itself is unmutated, but because its expression is dysregulated, it drives unwanted cellular proliferation.
29. Transcriptome and gene expression analysis are important for understanding which genes are expressed at any given time in a sample. This kind of information is very important for disease discovery and cancer diagnosis. Before microarrays, expression analysis was done by northern blot; however, both systems require a probe to determine which genes are expressed. The problem with these systems, even with microarrays, is that you need to “know” what you’re looking for, or at least guess. Next generation sequencing eliminates the guesswork because it is now possible to survey every expressed gene at the sequence level, to look at how much of each gene is expressed, and to determine whether there are any mutations in those expressed sequences. Transcriptome profiling using next generation sequencing has already been used experimentally to diagnose disease and define cancer targets. Refining this technique for broader use in the clinic has important applications in medicine.
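The counting step at the heart of sequencing-based expression profiling can be sketched simply: every read that aligns to a gene increments that gene’s tally, and the tallies proxy expression level. The gene assignments below are made-up toy data; a real pipeline would first align each read to a reference transcriptome:

```python
# Toy RNA-seq expression counting: tally aligned reads per gene.
from collections import Counter

def count_expression(read_gene_assignments):
    """Tally aligned reads per gene; higher counts imply higher expression."""
    return Counter(read_gene_assignments)

# Hypothetical gene labels for six aligned reads:
aligned = ["TP53", "GAPDH", "GAPDH", "TP53", "GAPDH", "MYC"]
counts = count_expression(aligned)
print(counts["GAPDH"], counts["TP53"], counts["MYC"])  # 3 2 1
```

Because the counts come with the underlying read sequences, the same data can also reveal mutations in the expressed transcripts, which probe-based methods cannot do.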
30. One of the major challenges of next generation sequencing is the data deluge: sifting through millions of bases of DNA to find mutations that actually result in a phenotype. The error rate of current NGS technologies complicates the issue further. The most promising hits must still be validated by other means like real-time PCR, mass spec, or Sanger sequencing, and one of the major questions plaguing the field is how to choose which data to validate. Just because a mutation occurs in a gene doesn’t necessarily mean it contributes to the phenotype. Dataset size is a huge problem for most phenotypic correlation studies, and most lack the size or the diversity to be very useful for predicting complex diseases. On the other hand, validated hits can be a distraction, especially in tumors, where high diversity gives cancer cells multiple escape routes to continue killing patients. This idea was highlighted well in a great series the New York Times ran last year on whole genome sequencing. For genomes to provide valuable information on phenotypes, we need to continue generating large validated datasets that are ethnically and even geographically diverse.
31. Another field that can benefit significantly from next generation sequencing technology is metagenomics. This field seeks to determine how communities of organisms interact. This may seem like an odd topic to highlight in a lecture about human genetics and personalized medicine, but there are communities living in and on you that affect your health in ways we are only just beginning to understand. Metagenomics can be employed to explore everything from micro scale soil and gut microbial communities all the way to macro scale coral reef communities. Previous techniques relied on sequencing of mitochondrial DNA or polymorphic regions of ribosomal DNA to classify the composition of communities. These studies are time intensive and not completely informative, because you can only determine who is present and not necessarily what is expressed or how the organisms might be interacting. Next generation sequencing can get around many of these limitations by surveying community diversity and linking DNA and expression back to community members.
32. I touched on the data problem a few slides back, but data analysis is by far the biggest roadblock for next generation sequencing. The amount of data obtained from these studies is mind boggling, and it is essentially useless if we can’t efficiently analyze it. Another big question is how long the data should be kept. If you ask the sequencing companies what the solution to this problem is, they’ll say that saving the data indefinitely is too expensive, and if you need the data again, you should just redo the run. Of course there’s a huge economic benefit in that for them. However, I think the answer to this question really depends on the application, and most doctors would tell you that for diagnostic purposes, sequencing data should be stored forever. If the data is kept forever, then where should it be stored? Should local resources be used to maintain the data, or is hosting with an external company preferable? Again, this comes down to a cost-benefit analysis, but I think that in the future, cloud or external systems will become more attractive. This is because large companies like Amazon and Google can afford to invest in very high power resources, limiting the cost to universities of investing in constantly changing technologies that quickly become obsolete. We are rapidly approaching a time when every person will have their genome sequenced multiple times in their life. Determining what to do with all of that data, and who has access to it, is a complicated ethical question that I’m sure will be a future discussion topic in this course.