The document describes a typical bioinformatics workflow for analyzing Illumina sequencing data. It involves several common processing steps: removing adapter contamination, trimming reads for quality, mapping reads to a genome or transcriptome, filtering for uniquely mapped reads, and filtering for high quality alignments. Each step progressively reduces the total number of reads until arriving at a data set suitable for final analysis. The document emphasizes understanding why each step is important and how it affects the data. It also provides a tip to use the "ls -ltr" command after each step to check that output files were properly created and contain data.
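The "ls -ltr" tip can also be scripted. The sketch below is one possible way to do it in Python, not part of the original slides: it reports any expected output file that is missing or empty. The function name and the file names in the usage line are hypothetical.

```python
import os

def check_outputs(paths):
    """Return (path, problem) pairs for output files that are missing or empty."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append((path, "missing"))
        elif os.path.getsize(path) == 0:
            problems.append((path, "empty"))
    return problems

# Hypothetical usage after a trimming step:
# print(check_outputs(["trimmed_1.fastq", "trimmed_2.fastq"]))
```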
5. A typical bioinformatics workflow
Illumina data
(FASTQ format)
Remove adapter contamination
scythe
cutadapt
trimgalore
skewer
Btrim
Trimmomatic
Lots of tools
you could use!
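To make the idea of adapter removal concrete, here is a minimal Python sketch. It is not how any of the tools listed above work internally: it only handles exact matches, whereas real adapter trimmers tolerate sequencing errors. The function name, the `min_overlap` parameter, and the adapter sequence in the usage line are illustrative assumptions.

```python
def trim_adapter(read, quals, adapter, min_overlap=3):
    """Remove a 3' adapter (exact matches only) from a read and its quality string."""
    # Case 1: the whole adapter occurs inside the read; cut from its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos], quals[:pos]
    # Case 2: the read ends with the start of the adapter (a partial adapter).
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k], quals[:-k]
    # No adapter found: return the read unchanged.
    return read, quals

# Hypothetical usage; the adapter sequence is only an example value.
seq, qual = trim_adapter("ACGTACGTACAGATC", "I" * 15, "AGATCGGAAGAGC")
```

The `min_overlap` threshold is a design choice: very short matches at the end of a read are likely to occur by chance, so they are left alone.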
6. Trim reads for low quality bases
sickle
Qtrim
FastQC
FastX
PRINSEQ
Trimmomatic
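As an illustration of quality trimming, the sketch below removes low quality bases from the 3' end of a read. It assumes Phred+33 encoded quality strings and a simple "cut until the last base is good enough" rule; the tools listed above use their own strategies (for example sliding windows), so this is a simplification rather than a reimplementation of any of them. The function name and the default threshold of 20 are assumptions.

```python
def trim_3prime(read, quals, min_q=20, offset=33):
    """Trim trailing bases whose Phred quality is below min_q."""
    end = len(read)
    # Walk back from the 3' end while the last kept base is below the threshold.
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return read[:end], quals[:end]

# Hypothetical usage: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
seq, qual = trim_3prime("ACGTACGT", "IIIII5##")
```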
7. Map reads to genome/transcriptome
BWA
Bowtie
TopHat
SHRiMP
BFAST
MAQ
From ebi.ac.uk/~nf/hts_mappers/
There are a lot of
read mappers out there!
[Figure: timeline of read mappers by year of release, 2001 to 2015. X-axis: Years.]
8. Map reads to genome/transcriptome
BWA
Bowtie
TopHat
SHRiMP
BFAST
MAQ
From ebi.ac.uk/~nf/hts_mappers/
[Figure: the read mapper timeline, overlaid with the first page of the paper "ARYANA: Aligning Reads by Yet Another Approach"]
19. The effect of applying many 'bioinformatics axes'
Illumina data
(FASTQ format)
2 FASTQ files
Files are ~6.5 GB
52.5 million reads total
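A read count like the one above can be checked directly from a FASTQ file. The sketch below assumes the common layout of exactly four lines per read (header, sequence, '+' line, qualities) with no line wrapping; the function name and the file name in the usage comment are hypothetical.

```python
def count_fastq_reads(lines):
    """Count reads in FASTQ text, assuming exactly four lines per record."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4

# Hypothetical usage on an uncompressed file:
# with open("sample_1.fastq") as handle:
#     print(count_fastq_reads(handle))
```

For paired-end data stored as two FASTQ files, the two counts should match.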
20. Remove adapters & trim
50.1 million reads
21. Align to transcriptome with Bowtie
35.8 million reads map
22. Filter for uniquely mapped reads
31.4 million reads align uniquely
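One way to picture the uniqueness filter is sketched below. It assumes single-end reads and a mapper that was asked to report every valid alignment, so that a multi-mapping read appears more than once. Real mappers signal multi-mapping in different ways, and the slides do not say which method was used here. The tuple layout (read ID, reference name, position) is a hypothetical simplification of an alignment record.

```python
from collections import Counter

def uniquely_mapped(alignments):
    """Keep alignments whose read ID occurs exactly once in the input."""
    counts = Counter(read_id for read_id, _ref, _pos in alignments)
    return [aln for aln in alignments if counts[aln[0]] == 1]

# Hypothetical usage: r2 maps to two transcripts, so it is discarded.
hits = [("r1", "tA", 10), ("r2", "tA", 50), ("r2", "tB", 7), ("r3", "tC", 1)]
unique_hits = uniquely_mapped(hits)
```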
23. Filter for high quality alignments
22.7 million reads have alignment scores of zero
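A filter of this kind could look like the sketch below. It assumes the alignments are in SAM format with the score stored in the optional `AS:i:` tag, and a scoring scheme in which zero is the best possible score and mismatches or gaps are penalised with negative values (as in Bowtie 2's end-to-end mode). The slides do not state the Bowtie version or settings, so both points are assumptions, and the function names are hypothetical.

```python
def alignment_score(sam_line):
    """Return the integer value of the AS:i: tag, or None if it is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM fields are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("AS:i:"):
            return int(field[len("AS:i:"):])
    return None

def keep_score_zero(sam_lines):
    """Keep alignment lines (not '@' header lines) whose alignment score is 0."""
    return [line for line in sam_lines
            if not line.startswith("@") and alignment_score(line) == 0]
```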
24. Data suitable for final analysis
Reduced data from 52.5 to 22.7 million reads
25. It can be helpful to know how the different steps in a workflow reduce your data
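The read counts from the preceding slides can be turned into a small retention table. The sketch below uses only the numbers given on the slides (in millions of reads); the function name and the table layout are illustrative choices.

```python
# Read counts (millions) as reported on the slides.
STEPS = [
    ("Illumina data (FASTQ format)", 52.5),
    ("Remove adapters & trim", 50.1),
    ("Align to transcriptome with Bowtie", 35.8),
    ("Filter for uniquely mapped reads", 31.4),
    ("Filter for high quality alignments", 22.7),
]

def retention(steps):
    """Return (name, reads, % of previous step, % of starting reads) per step."""
    start = steps[0][1]
    prev = start
    rows = []
    for name, reads in steps:
        rows.append((name, reads, 100 * reads / prev, 100 * reads / start))
        prev = reads
    return rows

for name, reads, pct_prev, pct_start in retention(STEPS):
    print(f"{name:<38}{reads:>6.1f}{pct_prev:>8.1f}{pct_start:>8.1f}")
```

Overall, about 43% of the starting reads (22.7 of 52.5 million) remain after the last step. The largest single drop is at the alignment step, where 14.3 million reads fail to map.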