Next Generation Sequencing for
Model and Non-Model Organism
           2nd day

        Jun Sese and Kentaro Shimizu
           sesejun@cs.titech.ac.jp

              Ph.D course lecture @
    Institute of Plant Biology, Univ. of Zurich
                    26/05/2011
Today’s Menu

•   Lecture
    •   Current RNA-Seq analysis
    •   Genome and RNA Assembly
    •   Introduction to AWK
        •   First step of programming
•   Exercise
    •   Visualization of mapped reads
    •   RNA-Seq analysis
    •   Genome assembly

Sequencerʼs Output


                      Genome Sequence

Mapping Program



 Mapping Result

                                  RNA-Seq
  Visualization      Further Analysis
RNA-Seq
•   Which genes are highly expressed?
•   Need to normalize by sequence length
    •   RPKM (Reads Per Kilobase per Million mapped reads)
        [Mortazavi et al. Nature Methods. 2008]
        •   An initial gene expression counting method


Think about two genes expressed in a cell.
Suppose that one mRNA molecule is expressed from each gene.
Short Gene             Long Gene




    2                       8

              The longer gene yields more reads.
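The length normalization can be sketched in a few lines of Python. This is only an illustration of the RPKM formula, not the counting code of any particular tool. The gene lengths (1,000 bp and 4,000 bp) and the library size of one million mapped reads are made-up values, chosen so that the short gene receives 2 reads and the long gene 8.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    kilobases = gene_length_bp / 1000.0
    millions = total_mapped_reads / 1_000_000.0
    return read_count / (kilobases * millions)


# Hypothetical example: both genes are expressed at the same level,
# but the long gene is four times longer and collects four times the reads.
short_gene = rpkm(read_count=2, gene_length_bp=1000, total_mapped_reads=1_000_000)
long_gene = rpkm(read_count=8, gene_length_bp=4000, total_mapped_reads=1_000_000)
print(short_gene, long_gene)
```

After normalization the two genes get the same value, although their raw read counts differ.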
RNA-Seq (contd)
•   Some corrections including multiple-test and
    fragment bias will be required.
    •  Srivastava and Chen. NAR. 2010
    •  Li, Jiang and Wong. Genome Research. 2010
    •  No standard method.
•   After mapping reads, some tools are available to
    count reads.
    •  Cufflinks
    •  HTSeq
    •  R packages
        • DEGSeq [Wang et al. 2010]
        • edgeR [Robinson, McCarthy and Smyth. 2010]
        • DESeq [Anders and Huber. 2010]
Sequencerʼs Output                 Sequencer
                        Assemble

                      Genome Sequence

Mapping Program



 Mapping Result



  Visualization      Further Analysis
Assembly
•   Genome/Gene assembly is a kind of puzzle.
    • Assemble a long sequence by combining short reads

           ATATGGATG              CTAAGCAT
                       TGCCATAT
         CGAGGCAT
                                    GATGCTAAG
                    CATATGCGA
                                  GGCATGCC


        GATGCTAAG
            CTAAGCAT
                  CATATGCGA
                        CGAGGCAT
                            GGCATGCC
                                TGCCATAT
                                    ATATGGATG
        GATGCTAAGCATATGCGAGGCATGCCATATGGATG
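The puzzle above can be replayed in Python. The sketch below is not an assembler: it takes the reads already in the order shown in the layout and merges each one onto the growing sequence using the longest suffix-prefix overlap. Finding that order is the hard part that real assemblers solve. The function names are illustrative.

```python
def overlap_length(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_in_order(reads):
    """Merge reads that are already sorted by their position in the layout."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap_length(contig, read)
        contig += read[k:]  # append only the part that is not overlapping
    return contig


# Reads in the order of the layout on the slide
reads = [
    "GATGCTAAG",
    "CTAAGCAT",
    "CATATGCGA",
    "CGAGGCAT",
    "GGCATGCC",
    "TGCCATAT",
    "ATATGGATG",
]
contig = merge_in_order(reads)
print(contig)
```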
Assembly programs also
depend on sequence length
•   Sanger sequence
    •   Arachne
•   Roche 454
    •   Mira3, Newbler
•   Illumina/SOLiD sequencers
    •   Velvet, ABySS, SOAPdenovo,...
•   Recently, gene (RNA) assembly programs have been
    developed
    •   Oases http://www.ebi.ac.uk/~zerbino/oases/
    •   Trinity [Grabherr et al. Nature Biotech. 2011]
Overlap-Layout-Consensus
•   Mainly used to assemble Sanger and Roche 454 sequences.




                           Kasahara and Morishita.
                           Large-scale genome sequence processing.
                           2006.
de Bruijn Graph approach
• Used in recent short read assemblers
  • Velvet, ABySS,...
• Generate k-mer graph (de Bruijn graph), and then find minimum
  paths covering all edges
• Originally introduced in Pevzner, Tang and Waterman, PNAS,
   2001.
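How a k-mer graph is built can be sketched in Python. This is a toy construction under simplifying assumptions: the two reads are made up, sequencing errors and reverse complements are ignored, and the path-finding step is not shown. Velvet and ABySS use far more elaborate data structures.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it.

    Every k-mer found in a read becomes one edge from its first k-1 bases
    to its last k-1 bases. Repeated edges are kept, so the list length
    reflects how often a k-mer was seen.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads
graph = de_bruijn_graph(["ATGGCG", "GGCGTA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Edges seen in both reads (here GG to GC and GC to CG) appear twice, which is the kind of multiplicity information assemblers use when resolving paths.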




                             Miller, Koren and Sutton. Genomics. 2010.
Genome assembly problem
  has no correct answer.
•   A true genome sequence does exist.
•   In reality, however, we cannot know the whole genome sequence
    exactly.
•   In most genome assembly studies, summary statistics are
    used to judge whether an assembly succeeded.
    •   Number of contigs
    •   Total length of contigs
    •   N50
•   If you have sequenced ESTs, you can use them to
    check the assembly quality.
    •   Note: Do not use those ESTs in the genome assembly itself;
        the training set and the test set must stay
        independent.
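The three statistics are easy to compute from a list of contig lengths. The sketch below uses a common definition of N50: the contig length L such that contigs of length L or longer contain at least half of all assembled bases. Individual tools may differ in detail, and the lengths used here are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


lengths = [2, 3, 4, 5, 6]  # hypothetical contig lengths
print("Number of contigs:", len(lengths))
print("Total length     :", sum(lengths))
print("N50              :", n50(lengths))
```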
Assembled sequences vary
   between assemblers
•   Compare 5 assemblers for RNA assembly with
    Roche 454 reads
    •   Kumar and Blaxter. BMC Genomics. 2010.
    •   Compare Newbler, SeqMan, CLC (Commercial),
        CAP3 and MIRA3 (Free)
    •   No winner
    •   Newbler 2.5 generates longest contigs
    •   SeqMan is the best for recapturing known genes
    •   MIRA3 is competitive with Newbler and SeqMan
Assembled sequences vary
    between assemblers (contd)
•   Compare 6 assemblers for genome assembly
    • Bao et al. J. Hum Gen. 2010.
    • Used 1.5 million reads from human genome resequencing
      data. Read length is 76 bp.
    • The authors conclude that SOAPdenovo was the best.
        •High genome coverage, low memory use and fast.
    • SSAKE and ABySS generated much longer contigs than
      SOAPdenovo.
    • Because so few reads were used, this comparison is
      not practical.
        •They subsampled reads because their machine had only
         32GB of memory.
    • Genome assembly requires tuning various parameters to get
      a “good” result. The authors did not describe their
      parameter tuning.
Change File Format
                    Sequencerʼs Output
                              Sequence Format

                                           Genome Sequence

                    Mapping Program      BWA, Bowtie, etc.



                     Mapping Result      Output Format

   We have to
change file format     Visualization         Further Analysis
Introduction to AWK
•   “grep” is a very useful command, but we may need more
    complicated searches.
    • e.g., select lines whose third column is “Chr1.”
        •‘grep “Chr1” file’ also selects lines that contain
         “Chr1” only in the first column.
    • e.g., select lines whose values are less than 100.
        •grep cannot compare numeric values.
•   Replace a word with another word in a file.
    • Editors can do that if the file is small.
•   AWK is one of the traditional and simple solutions.
    • For more complicated tasks, scripting languages like Perl,
      Python and Ruby are useful.
•   Here we introduce the “minimum” you need to know about AWK.
    • You can find many introductory documents about awk on
      the Web.
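For comparison, the two searches that grep cannot do can also be written in a scripting language. The Python sketch below works on tab-separated lines; the function name `select_rows` and the example rows are made up for illustration.

```python
def select_rows(lines, column, predicate):
    """Return tab-separated rows whose 1-based `column` satisfies `predicate`."""
    selected = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and predicate(fields[column - 1]):
            selected.append(fields)
    return selected


# Hypothetical input: note that "Chr1" also occurs in the first column
rows = [
    "Chr1\tread1\tChr2\t50\n",
    "Chr2\tread2\tChr1\t150\n",
    "Chr3\tread3\tChr1\t99\n",
]

# Lines whose third column is exactly "Chr1"
chr1_rows = select_rows(rows, 3, lambda value: value == "Chr1")
# Lines whose fourth column is numerically less than 100
small_rows = select_rows(rows, 4, lambda value: float(value) < 100)

print([r[1] for r in chr1_rows])
print([r[1] for r in small_rows])
```

AWK expresses the same conditions in one line, which is why it is introduced first.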
AWK in a nutshell
      • Process each line
      • $n means n-th column.
       • $1 is first column and $2 is second column.
      • $0 means whole line
$ cat nums.tab
11.2     13.8
10.9     7.7
15.2     7.0
9.4      10.9
8.8      9.1

# Same as “cut -f2 nums.tab”
$ awk '{print $2}' nums.tab
13.8
7.7
7.0
10.9
9.1
# Print only lines whose second column equals “10.9”
# Compare with ‘grep “10.9” nums.tab’
$ awk '{if($2 == "10.9") print $0}' nums.tab
9.4      10.9
# Compare as numerical values
$ awk '{if($2 > 10) print $0}' nums.tab
11.2     13.8
9.4      10.9
# Combine two conditions with “&&”
$ awk '{if($2 > 10 && $2 < 12) print $0}' nums.tab
9.4      10.9
AWK in a nutshell (2)
$ cat nums.tab
11.2     13.8
10.9     7.7
15.2     7.0
9.4      10.9
8.8      9.1

# Print lines that contain “9” in the second column
$ awk '{if($2 ~ /9/) print $0}' nums.tab
9.4      10.9
8.8      9.1
# Print lines whose first column starts with “9”
$ awk '{if($1 ~ /^9/) print $0}' nums.tab
9.4      10.9
# Replace a specific string
$ awk '{gsub(/10/,"15"); print $0}' nums.tab
11.2     13.8
15.9     7.7
15.2     7.0
9.4      15.9
8.8      9.1

“ ” delimits a plain string, and / / delimits a regular expression.
Sequencerʼs Output


                      Genome Sequence

Mapping Program



 Mapping Result

                                  RNA-Seq
  Visualization      Further Analysis
Convert SAM to BAM
       •   SAM files are very large.
       •   We convert the SAM file into a BAM file, a compressed binary
           format that programs can read efficiently.
       •   Install SAMtools
           •  http://samtools.sourceforge.net/
# $ curl -O http://switch.dl.sourceforge.net/project/samtools/samtools/0.1.16/samtools-0.1.16.tar.bz2
# $ bzip2 -dc samtools-0.1.16.tar.bz2 | tar xvf -
# $ ln -s samtools-0.1.16 samtools
# $ cd samtools
# $ make
# $ cd ..

$ ./samtools/samtools faidx TAIR10_chr_all.fas
# Generate TAIR10_chr_all.fas.fai
# “\” indicates that the command continues on the next line.
# If you type the command on one line, omit the “\”.
$ ./samtools/samtools view -bt TAIR10_chr_all.fas.fai \
-o tha_reads.bam tha_reads.sam
# Generate tha_reads.bam
$ ./samtools/samtools sort tha_reads.bam tha_reads.sorted
# Sort reads and generate tha_reads.sorted.bam
$ ./samtools/samtools index tha_reads.sorted.bam tha_reads.sorted.bai
# Generate the index of the BAM file as tha_reads.sorted.bai
Visualize mapped result (IGV)
•   1. Install IGV
                     $ unzip IGV_1.5.64.zip #install
•   2. Start IGV
                     $ java -Xmx1g -jar IGV_1.5.64/igv.jar #start IGV
                     # Wait a minute. A new window will appear.
•   3. Select A.thaliana (TAIR10)
•   4. File > Load from File > Select “tha_reads.sorted.bam”
•   5. Zoom in, zoom in... but it is difficult to find mapped reads :(
Mapped reads on Chr1
       • Use SRR038985_chr1.sam
        • Includes all reads mapped onto Chromosome 1
       • Convert the SAM file into BAM, and load it in IGV
# We can skip this step: $ ./samtools/samtools faidx TAIR10_chr_all.fas
$ ./samtools/samtools view -bt TAIR10_chr_all.fas.fai \
-o SRR038985_chr1.bam SRR038985_chr1.sam
$ ./samtools/samtools sort SRR038985_chr1.bam SRR038985_chr1.sorted
$ ./samtools/samtools index \
SRR038985_chr1.sorted.bam SRR038985_chr1.sorted.bai
Visualize mapped result (Ensembl)
       •   Install BEDTools
           •   http://code.google.com/p/bedtools/
       •   Using bamToBed in BEDTools, you can convert
           BAM format into BED format.
       •   The BED format describes simple track information.
           The Ensembl and UCSC genome browsers can read this
           file and display its contents.
# Skip install process
# $ curl -O http://bedtools.googlecode.com/files/BEDTools.v2.12.0.tar.gz
# $ tar zxvf BEDTools.v2.12.0.tar.gz
# $ ln -s BEDTools-Version-2.12.0 BEDTools
# $ cd BEDTools-Version-2.12.0
# $ make
# $ cd ..

$ ./BEDTools/bin/bamToBed -i SRR038985_chr1.sorted.bam \
> SRR038985_chr1.sorted.bed
Visualize mapped result (Ensembl)
•   Go to http://plants.ensembl.org in
    your browser
•   Select Arabidopsis thaliana
•   Click “Manage your data” in the left
    column
•   Select “Upload Data” in the left column
•   Name for this upload: my_reads
•   Data format: BED
•   Upload file: select your BED file

•   DON’T push Upload yet!!!
Problems...
  •   Two problems
      •  The BED file is too large to upload. The maximum file size we can
         upload to Ensembl Plants is 5MB.
        •    We have to select a region in the BED file.
      •  The chromosome names are different.
        •    In our BED file, the chromosome name is like “Chr1,” while in
             Ensembl, the name is just “1.”
            •    We have to convert the name.
  •   Finally, we can upload the BED file!
      •  It takes about a minute. Don’t push the “Upload” button
         repeatedly.
$ awk '{if($3 < 1000000) print $0}' SRR038985_chr1.sorted.bed \
> SRR038985_chr1_to_1M.sorted.bed
# You can change the region by replacing “$3 < 1000000” with
# “$3 < 1000000 && $3 > 50000”
$ awk 'gsub(/^Chr/,"")' SRR038985_chr1_to_1M.sorted.bed \
> SRR038985_chr1_to_1M.ensembl.bed
$ ls -lh SRR038985_chr1_to_1M.ensembl.bed
# Please check that the file size is less than 5MB
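The same two fixes (select a region, then rename the chromosome) can be written as one Python function. This is a sketch under assumptions: the function name and the example BED lines are made up, and unlike the second awk command it keeps lines that do not start with “Chr” instead of dropping them.

```python
def trim_bed_for_ensembl(lines, max_end=1_000_000, prefix="Chr"):
    """Keep BED records ending before `max_end` and strip `prefix` from names."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a BED record (chrom, start, end are mandatory)
        if int(fields[2]) >= max_end:
            continue  # third column is the end position
        if fields[0].startswith(prefix):
            fields[0] = fields[0][len(prefix):]  # "Chr1" becomes "1"
        kept.append("\t".join(fields))
    return kept


# Hypothetical BED records
bed_lines = [
    "Chr1\t100\t150\tread_a\t37\t+\n",
    "Chr1\t999990\t1000040\tread_b\t37\t-\n",
]
result = trim_bed_for_ensembl(bed_lines)
print(result)
```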
Visualize mapped result (Ensembl)
•   Click link “1:0-100000”
•   You can see your reads on the “my_reads” track.
    •  Only you can see your track
    •  You have to upload the BED file again after you log out of your
       computer.
Count tags on each gene
 •   Most RNA-Seq tools depend on other libraries.
     • We have to install several programs to use them.
     • Some of them require administrator privileges.
 •   Instead, we provide a simple Python script that counts the number of tags.
 # We skip downloading the GFF file.
 # The GFF file contains gene positions on chromosomes.
 # $ curl -O ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff

 $ python count_reads_on_gene.py SRR038985_chr1.sam \
 TAIR10_GFF3_genes.gff > SRR038985_chr1.exp

SRR038985_chr1.exp
Gene Name      Count
 ...
 AT1G01046       0
 AT1G01050       1
 AT1G01060       0
 ....

Sort by count in reverse order
 % sort -k2 -nr SRR038985_chr1.exp
 AT1G18745 59
 AT1G16635 47
 AT1G21650 27
 AT1G75163 16
 ...
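The lecture script count_reads_on_gene.py is not shown, so the following is only a guess at the general idea, not its actual implementation. It assumes that a read is assigned to a gene when its leftmost mapped position falls inside the gene interval from the GFF file; strand, exons and reads spanning gene boundaries are ignored. The gene IDs and reads in the example are made up.

```python
from collections import Counter


def parse_genes(gff_lines):
    """Collect (chromosome, start, end, gene ID) for every 'gene' feature."""
    genes = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        genes.append((fields[0], int(fields[3]), int(fields[4]), attrs.get("ID", "unknown")))
    return genes


def count_reads(sam_lines, genes):
    """Count mapped reads whose start position lies inside each gene."""
    counts = Counter({gene_id: 0 for _, _, _, gene_id in genes})
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:
            continue  # read is unmapped
        for gene_chrom, start, end, gene_id in genes:
            if gene_chrom == chrom and start <= pos <= end:
                counts[gene_id] += 1
    return counts


# Hypothetical GFF and SAM content
gff = [
    "##gff-version 3\n",
    "Chr1\tTAIR10\tgene\t100\t500\t.\t+\t.\tID=GENE_A;Name=GENE_A\n",
    "Chr1\tTAIR10\tmRNA\t100\t500\t.\t+\t.\tID=GENE_A.1;Parent=GENE_A\n",
    "Chr1\tTAIR10\tgene\t1000\t1500\t.\t-\t.\tID=GENE_B\n",
]
sam = [
    "@SQ\tSN:Chr1\tLN:30427671\n",
    "r1\t0\tChr1\t120\t37\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t16\tChr1\t480\t37\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "r4\t0\tChr1\t700\t37\t4M\t*\t0\t0\tACGT\tIIII\n",
]
counts = count_reads(sam, parse_genes(gff))
for gene_id, n in counts.items():
    print(gene_id, n)
```

The inner loop checks every gene for every read, which is fine for a toy example but too slow for a whole genome; sorting the genes or using an interval index would be the usual remedy.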
A.lyrata reads and visualization
•   The A.lyrata genome paper was published in April 2011.
•   The genome sequence consists of small contigs
    •   This state is similar to that just after sequence assembly
•   We map reads onto A.lyrata and visualize the data in IGV.
    •   In Ensembl Plants, the A.lyrata genome is already available.
        However, unpublished genome sequences are not
        available on the site.
        •   This is a limitation of web applications (web sites).
    •   Here we use IGV again.
        •   IGV does not contain A.lyrata genome information.
        •   We start by importing genome and gene
            information.
Mapping A.lyrata Reads
#Archive includes these files
#$ curl -O ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Alyrata/assembly/Alyrata_107_RM.fa.gz
#This file contains all chromosome sequences. Need not concatenate.
#$ curl -O ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Alyrata/annotation/Alyrata_107_gene.gff3.gz
#$ gzip -d Alyrata_107_RM.fa.gz
#$ gzip -d Alyrata_107_gene.gff3.gz
$ ./bwa/bwa index -c Alyrata_107_RM.fa
$ python csfasta2fastq.py --bwa lyr_reads > lyr_reads.bwa
$ ./bwa/bwa aln -c Alyrata_107_RM.fa lyr_reads.bwa > lyr_reads.sai
$ ./bwa/bwa samse Alyrata_107_RM.fa lyr_reads.sai lyr_reads.bwa \
> lyr_reads.sam
$ ./samtools/samtools faidx Alyrata_107_RM.fa
$ ./samtools/samtools view -bt Alyrata_107_RM.fa.fai -o \
lyr_reads.bam lyr_reads.sam
$ ./samtools/samtools sort lyr_reads.bam lyr_reads.sorted
$ ./samtools/samtools index lyr_reads.sorted.bam lyr_reads.sorted.bai

$ java -jar ./IGV_1.5.64/igv.jar
Visualization of Mapped Result

•   Load genome and genes. In IGV, File > Import Genome

    •   Name: A.lyrata (as you like!)

    •   Sequence File: Select your Alyrata_107_RM.fa

    •   Cytoband File: [empty]

    •   Gene File: Select your Alyrata_107_gene.gff3

        •   IGV checks the file contents, so you need to wait a moment.

    •   Then, save.

    •   Select a file in which to save the genome information.

•   Load read information. In IGV, File > Load from File.

    •   Select “lyr_reads.sorted.bam”
Assemble reads with velvet
• This is a toy example. We just check the usage.
• Genome/gene assembly requires a huge amount of main memory.
  • Velvet requires “AT LEAST” 12GB.
• Two steps are required: velveth and velvetg
  • For SOLiD reads, use velveth_de and velvetg_de
    • Options are the same.
    • Before running Velvet, we have to change the format using ABI’s script
        called denovo2.0 (SOLiD only)
       • http://solidsoftwaretools.com/gf/project/denovo/frs/?action=FrsReleaseBrowse&frs_package_id=65
• After this process (if the reads come from a genome), you can run gene
    prediction programs (Fgenesh, EuGene, GenomeThreader etc.).

•   Modern assemblers use a de Bruijn graph (k-mer graph). Changing the
    parameter k changes the assembly result drastically.
    • We have to generate many assemblies with various
      parameters to obtain the best one.
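Trying several values of k is easy to script. The Python sketch below only builds the velveth_de and velvetg_de command lines for each k and prints them; it does not run them. The options are copied from the commands used in this lecture, while the function name and the per-k output directory names (assemble_chr1_k17 and so on) are made up for illustration.

```python
def velvet_commands(k_values,
                    reads="chr1_de/doubleEncoded_input.de",
                    prefix="assemble_chr1"):
    """Return one (velveth, velvetg) pair of argument lists per k."""
    commands = []
    for k in k_values:
        if k % 2 == 0:
            # Velvet works with odd hash lengths, so reject even values here
            raise ValueError(f"k must be odd, got {k}")
        out_dir = f"{prefix}_k{k}"  # hypothetical naming scheme
        velveth = ["./velvet/velveth_de", out_dir, str(k), "-fasta", "-short", reads]
        velvetg = ["./velvet/velvetg_de", out_dir, "-exp_cov", "auto"]
        commands.append((velveth, velvetg))
    return commands


for velveth, velvetg in velvet_commands([15, 17, 19]):
    print(" ".join(velveth))
    print(" ".join(velvetg))
```

Each argument list could be passed to subprocess.run to execute it, after which the contig statistics of every output directory can be compared.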
# Download and install velvet
# $ curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.1.03.tgz
# $ gzip -dc velvet_1.1.03.tgz | tar xvf -
# $ ln -s velvet_1.1.03 velvet
# $ cd velvet
# $ make color
# Download ABI’s scripts and extract them
$ gzip -dc denovo2.tgz | tar xvf -
# Preprocessing for velvet
$ perl ./denovo2/utils/solid_denovo_preprocessor_v1.2.pl --run_type \
fragment -output chr1_de --f3_file SRR038985_chr1.csfasta
# Run velvet
$ ./velvet/velveth_de assemble_chr1 17 -fasta -short \
chr1_de/doubleEncoded_input.de
$ ./velvet/velvetg_de assemble_chr1 -exp_cov auto
# assemble_chr1/contigs.fa contains generated contigs

# Show assembly statistics
$ perl ./denovo2/utils/assembly_stats.pl assemble_chr1/contigs.fa

Sum contig length      :   303616
Num contigs            :   3796
Mean contig length     :   79
Median contig length   :   66
N50 value              :   79
Max                    :   583
For Roche 454 (or IonTorrent)
•   Reads are longer than Illumina and SOLiD reads
•   Traditional Sanger sequence analysis can be used
    •  Homology search with BLAST or BLAT
    •  Assembly with CAP3 or MIRA3
•   Combining Roche 454 with Illumina/SOLiD will produce better
    results.
    •  Recent assemblies of large genomes have used this
       combination.
•   One problem with BLAST/BLAT is that the
    programs do not support modern file formats such as SAM/BAM.
    •  Some programs such as GMAP support the new formats.
•   To solve the problem, we write a format-conversion script and
    use it.
Mapping 454 Reads
•   We use EST sequences
•   EST sequences contain poly-A tails and vector sequences.
    • For short reads, we skipped this step because the
      reads are too short to tell whether they contain vector
      sequence.
•   Procedure
    • Remove these sequences
        •Use lucy
    • Map trimmed sequences against the genome
        •BLAST and BLAT
    • Convert the result to SAM format
    • Convert the SAM to BAM and check the result in a viewer.
# Download and install lucy from http://lucy.sourceforge.net/
# curl -O http://jaist.dl.sourceforge.net/project/lucy/lucy/lucy%201.20/lucy1.20.tar.gz
# gzip -dc lucy1.20.tar.gz | tar xvf -
# cd lucy-1.20p
# make; cd ..; ln -s lucy-1.20p lucy
# Download blat executable file (For Mac OS X) and set it up
# $ curl -O http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/blat/blat
# $ chmod 755 blat
# To trim vector sequences, run lucy
$ ./lucy/lucy -vector pDNR.vec pDNRsplice.spl -out \
roche454_test_trim.fasta roche454_test_trim.qual \
roche454_test.fna roche454_test.qual
# Run blat. (About 7mins required)
# We may need to change the score matrix to get meaningful alignments
$ ./blat -t=dna -q=dna -tileSize=8 -out=blast TAIR10_chr_all.fas \
roche454_test_trim.fasta roche454_test_TAIR10.result
# Convert the result into a SAM file
# The -t option specifies the maximum E-value of hits written to the SAM file.
$ ruby blastn2sam.rb -t 0.00001 -s roche454_test_TAIR10.result \
> roche454_test_TAIR10_e5.sam
# After this process, you can follow the same procedure as for short reads
# (convert SAM to BAM and visualize the data in IGV.)
Concluding Remarks
•   The analysis in this lecture is a first step in bioinformatics and
    computer science.
•   Software and methods for analyzing next-generation
    sequencing data are still at an early stage.
    •   Only mapping and assembly software is widely used.
        Other processes are under development.
    •   To use NGS, we have to keep track of software updates
        and unpublished information.
        • Use mailing lists and Q&A sites.
•   Most software in biology has a limited number of users.
    •   Think about Microsoft Word. Many users, but many...
    •   Much software has poor documentation.
    •   Bugs always exist.
        • Good software is updated frequently to fix bugs and
          keep up with new information.
•   If no software exists for your experiment, a simple script may
    help your analysis.

More Related Content

What's hot

Making powerful science: an introduction to NGS and beyond
Making powerful science: an introduction to NGS and beyondMaking powerful science: an introduction to NGS and beyond
Making powerful science: an introduction to NGS and beyondAdamCribbs1
 
Making powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysisMaking powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysisAdamCribbs1
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Maté Ongenaert
 
Overcoming the challenges of designing efficient and specific CRISPR gRNAs
Overcoming the challenges of designing efficient and specific CRISPR gRNAsOvercoming the challenges of designing efficient and specific CRISPR gRNAs
Overcoming the challenges of designing efficient and specific CRISPR gRNAsIntegrated DNA Technologies
 
Cpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesCpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesIntegrated DNA Technologies
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
Optimized methods to use Cas9 nickases in genome editing
Optimized methods to use Cas9 nickases in genome editingOptimized methods to use Cas9 nickases in genome editing
Optimized methods to use Cas9 nickases in genome editingIntegrated DNA Technologies
 
Using BioNano Maps to Improve an Insect Genome Assembly​
Using BioNano Maps to Improve an Insect Genome Assembly​Using BioNano Maps to Improve an Insect Genome Assembly​
Using BioNano Maps to Improve an Insect Genome Assembly​Jennifer Shelton
 
Analysis of ChIP-Seq Data
Analysis of ChIP-Seq DataAnalysis of ChIP-Seq Data
Analysis of ChIP-Seq DataPhil Ewels
 
Reducing off-target events in CRISPR genome editing applications with a novel...
Reducing off-target events in CRISPR genome editing applications with a novel...Reducing off-target events in CRISPR genome editing applications with a novel...
Reducing off-target events in CRISPR genome editing applications with a novel...Integrated DNA Technologies
 
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Integrated DNA Technologies
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Genome Reference Consortium
 
Target capture of DNA from FFPE samples— recommendations for generating robus...
Target capture of DNA from FFPE samples— recommendations for generating robus...Target capture of DNA from FFPE samples— recommendations for generating robus...
Target capture of DNA from FFPE samples— recommendations for generating robus...Integrated DNA Technologies
 
Rare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Rare Variant Analysis Workflows: Analyzing NGS Data in Large CohortsRare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Rare Variant Analysis Workflows: Analyzing NGS Data in Large CohortsGolden Helix Inc
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
 

What's hot (20)

ChipSeq Data Analysis
ChipSeq Data AnalysisChipSeq Data Analysis
ChipSeq Data Analysis
 
Making powerful science: an introduction to NGS and beyond
Making powerful science: an introduction to NGS and beyondMaking powerful science: an introduction to NGS and beyond
Making powerful science: an introduction to NGS and beyond
 
Making powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysisMaking powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysis
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1
 
Overcoming the challenges of designing efficient and specific CRISPR gRNAs
Overcoming the challenges of designing efficient and specific CRISPR gRNAsOvercoming the challenges of designing efficient and specific CRISPR gRNAs
Overcoming the challenges of designing efficient and specific CRISPR gRNAs
 
Cpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesCpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexes
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Optimized methods to use Cas9 nickases in genome editing
Optimized methods to use Cas9 nickases in genome editingOptimized methods to use Cas9 nickases in genome editing
Optimized methods to use Cas9 nickases in genome editing
 
Using BioNano Maps to Improve an Insect Genome Assembly​
Using BioNano Maps to Improve an Insect Genome Assembly​Using BioNano Maps to Improve an Insect Genome Assembly​
Using BioNano Maps to Improve an Insect Genome Assembly​
 
Analysis of ChIP-Seq Data
Analysis of ChIP-Seq DataAnalysis of ChIP-Seq Data
Analysis of ChIP-Seq Data
 
Reducing off-target events in CRISPR genome editing applications with a novel...
Reducing off-target events in CRISPR genome editing applications with a novel...Reducing off-target events in CRISPR genome editing applications with a novel...
Reducing off-target events in CRISPR genome editing applications with a novel...
 
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
 
Target capture of DNA from FFPE samples— recommendations for generating robus...
Target capture of DNA from FFPE samples— recommendations for generating robus...Target capture of DNA from FFPE samples— recommendations for generating robus...
Target capture of DNA from FFPE samples— recommendations for generating robus...
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
Rare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Rare Variant Analysis Workflows: Analyzing NGS Data in Large CohortsRare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
Rare Variant Analysis Workflows: Analyzing NGS Data in Large Cohorts
 
20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop
 
agbt 2016 workshop lindsay
agbt 2016 workshop lindsayagbt 2016 workshop lindsay
agbt 2016 workshop lindsay
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 

Viewers also liked

Introducing data analysis: reads to results
Introducing data analysis: reads to resultsIntroducing data analysis: reads to results
Introducing data analysis: reads to resultsAGRF_Ltd
 
Normalization of microarray
Normalization of microarrayNormalization of microarray
Normalization of microarray弘毅 露崎
 
フリーソフトではじめるNGS融合遺伝子解析入門
フリーソフトではじめるNGS融合遺伝子解析入門フリーソフトではじめるNGS融合遺伝子解析入門
フリーソフトではじめるNGS融合遺伝子解析入門Amelieff
 
DNAマイクロアレイの解析と多重検定補正
DNAマイクロアレイの解析と多重検定補正DNAマイクロアレイの解析と多重検定補正
DNAマイクロアレイの解析と多重検定補正弘毅 露崎
 
フリーソフトではじめるがん体細胞変異解析入門 第33回勉強会資料
フリーソフトではじめるがん体細胞変異解析入門 第33回勉強会資料フリーソフトではじめるがん体細胞変異解析入門 第33回勉強会資料
フリーソフトではじめるがん体細胞変異解析入門 第33回勉強会資料Amelieff
 
RNAseqによる変動遺伝子抽出の統計: A Review
RNAseqによる変動遺伝子抽出の統計: A ReviewRNAseqによる変動遺伝子抽出の統計: A Review
RNAseqによる変動遺伝子抽出の統計: A Reviewsesejun
 
FDRの使い方 (Kashiwa.R #3)
FDRの使い方 (Kashiwa.R #3)FDRの使い方 (Kashiwa.R #3)
FDRの使い方 (Kashiwa.R #3)Haruka Ozaki
 
次世代シーケンサが求める機械学習
次世代シーケンサが求める機械学習次世代シーケンサが求める機械学習
次世代シーケンサが求める機械学習sesejun
 
Kogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisKogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisJunsu Ko
 
バイオインフォマティクスによる遺伝子発現解析
バイオインフォマティクスによる遺伝子発現解析バイオインフォマティクスによる遺伝子発現解析
バイオインフォマティクスによる遺伝子発現解析sesejun
 
