2. •Automated genome sequencing requires automated gene assignment
•Includes detection of open reading frames (ORFs)
•Identification of introns and exons
•Gene prediction is a very difficult problem in pattern recognition
•Coding regions generally do not have conserved sequences
•Much progress has been made with prokaryotic gene prediction
•Eukaryotic genes are more difficult to predict correctly
3. Ab initio methods
•Predict genes from the given sequence alone
•Use gene signals
•Start/stop codons
•Intron splice sites
•Transcription factor binding sites, ribosomal binding sites
•Poly-A sites
•Coding length must be a multiple of three nucleotides (whole codons)
•Use gene content
•Nucleotide composition – modelled with HMMs
Homology-based methods
•Matches to known genes
•Matches to cDNA
Consensus-based methods
•Use output from more than one program
4. Prokaryotic gene structure
•ATG is the start codon (GTG or TTG less frequently)
•Ribosome binding site (Shine-Dalgarno sequence) complementary to the 3’ end of the 16S rRNA of the ribosome
•AGGAGGT
•Stop codon (TAA, TAG or TGA)
•Transcription termination site (ρ-independent termination)
•Stem-loop secondary structure followed by a string of Ts
5. •Translate sequence into 6 reading frames
•In random sequence a stop codon occurs about once every 20 codons (3 of the 64 codons are stops)
•Look for frames longer than 30 codons without a stop (normally 50-60 codons)
•Presence of start codon and Shine-Dalgarno sequence
•Translate putative ORF into protein, and search databases
•Non-randomness of 3rd base of codon, which is more frequently G/C
•Plotting wobble base GC% can identify ORFs
•3rd base also repeats, so this repetition gives a clue to gene location
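The six-frame scan described on this slide can be sketched roughly as follows. This is a minimal illustration, not the method of any particular gene finder: the function names (`find_orfs`, `gc3`), the 30-codon default and the decision to accept ATG, GTG and TTG as starts are assumptions taken from the slide text, and the code assumes an uppercase or lowercase DNA string containing only A, C, G and T.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # ATG is most common; GTG/TTG are rarer starts
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for start..stop ORFs.

    Returns tuples (strand, frame, start, end, n_codons). Coordinates are
    0-based on the scanned strand, end is exclusive and includes the stop
    codon, and n_codons excludes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs


def gc3(orf_seq):
    """Fraction of G/C at the third (wobble) position of each codon."""
    thirds = orf_seq.upper()[2::3]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)
```

A candidate returned by `find_orfs` could then be checked with `gc3` on the corresponding slice, and translated for a database search as the slide suggests.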
6. Markov chains and HMMs
• In a kth-order Markov model, each position depends on the k previous positions
• The higher the order of the Markov model used to describe a gene, the more non-randomness the model captures
• Genes are described in codons or hexamers
• HMMs are trained with known genes
• Codon pairs are often found, so 6-nucleotide patterns often occur in ORFs – 5th-order Markov chain
• 5th-order HMMs give very accurate gene predictions
• A problem may be that short genes do not contain enough hexamers
• Interpolated Markov Model (IMM) samples Markov chains of different lengths. A weighting scheme places less weight on rare k-mers
• Final probability combines the probabilities of all weighted k-mers
• Typical and atypical genes
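To make the idea of a kth-order Markov chain concrete, here is a small sketch of a fixed-order model that is trained on example sequences and then used to compare a "coding" model against a "non-coding" one. It is not the GeneMark or Glimmer implementation and does not include IMM weighting; the function names, the add-one pseudocount and the log-odds comparison are illustrative assumptions, and sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import defaultdict


def train_markov(seqs, k=5, pseudocount=1.0):
    """Estimate P(base | previous k bases) from training sequences.

    Returns a function prob(context, base). A pseudocount is added to each
    of the four bases so unseen k-mers do not get zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k=5):
    """Sum of log P(base | previous k bases) over the sequence."""
    seq = seq.upper()
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


def log_odds(seq, coding, noncoding, k=5):
    """Positive values mean the coding model explains seq better."""
    return log_likelihood(seq, coding, k) - log_likelihood(seq, noncoding, k)
```

With k=5 each probability is conditioned on a pentamer, so every scored unit is a hexamer, which matches the slide's point about codon pairs. The short-gene problem shows up here as contexts with few counts, where the pseudocount dominates the estimate.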
7. GeneMark (http://exon.gatech.edu/genemark/)
Trained on complete microbial genomes
Most closely related organism used for predictions
Glimmer (Gene Locator and Interpolated Markov ModelER)
(http://www.cbcb.umd.edu/software/glimmer/)
FGENESB (http://linux1.softberry.com/)
5th-order HMM
Trained with bacterial sequences
Linear discriminant analysis (LDA)
RBSFinder (ftp://ftp.tigr.org)
Takes output from Glimmer and searches for S-D sequences close to start sites
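The idea of looking for a Shine-Dalgarno sequence just upstream of a predicted start can be sketched as below. This is not RBSFinder's actual algorithm; it is a simple mismatch-tolerant search. The motif AGGAGGT comes from the earlier slide, while the function name `find_sd`, the 20 nt upstream window and the one-mismatch tolerance are assumptions chosen for illustration.

```python
def find_sd(seq, start, motif="AGGAGGT", window=20, max_mismatch=1):
    """Search upstream of a start codon for a Shine-Dalgarno-like motif.

    seq          -- DNA string (coding strand)
    start        -- 0-based index of the first base of the start codon
    window       -- how many bases upstream to search (assumed value)
    max_mismatch -- mismatches tolerated against the motif (assumed value)

    Returns (position, mismatches, spacer) for the best match, where spacer
    is the number of bases between the motif and the start codon, or None.
    """
    seq = seq.upper()
    lo = max(0, start - window)
    best = None
    for i in range(lo, start - len(motif) + 1):
        candidate = seq[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(candidate, motif))
        if mismatches <= max_mismatch and (best is None or mismatches < best[1]):
            best = (i, mismatches, start - (i + len(motif)))
    return best
```

A start site predicted by a gene finder could be kept or shifted depending on whether `find_sd` reports a match at a plausible spacing.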
10. Gene prediction in Eukaryotes
Low gene density (3% in humans)
Space between genes is very large, with multiply repeated sequences and transposable elements
Eukaryotic genes are split (introns/exons)
Transcript is capped (methylation of 5’ residue)
Splicing in spliceosome
Alternative splicing
Polyadenylation (~250 As added) downstream of CAATAAA(T/C) consensus box
Major issue is identification of splice sites
GT-AG rule (GTAAGT/Y12NCAG 5’/3’ intron splice junctions)
Codon use frequencies
ATG start codon
Kozak sequence (CCGCCATGG)
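The GT-AG rule alone gives only weak evidence, which the following sketch illustrates by listing every GT...AG span as a candidate intron. It is a deliberately naive enumeration, not a splice-site predictor: the function name and the minimum and maximum intron lengths are assumptions, and real programs score the extended consensus around each site instead of just the dinucleotides.

```python
def candidate_introns(seq, min_len=20, max_len=10000):
    """List (start, end) spans that begin with GT and end with AG.

    start is the 0-based index of the G in GT, end is exclusive (one past
    the G in AG). min_len and max_len are illustrative length limits.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    spans = []
    for d in donors:
        for a in acceptors:
            end = a + 2
            if min_len <= end - d <= max_len:
                spans.append((d, end))
    return spans
```

Even on a short sequence this usually returns several overlapping candidates, which is why the slide calls splice site identification the major issue and why codon usage and other signals are needed to choose among them.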
12. Discriminant analysis
•Plot 2D graph of coding length versus 3’ splice site
•Place diagonal line (LDA) that separates true coding from non-coding sequences based on learnt knowledge
•QDA fits quadratic curve
•FGENES uses LDA
•MZEF (Michael Zhang’s Exon Finder) uses QDA
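A two-feature linear discriminant of the kind described above can be sketched with Fisher's LDA in plain Python. This is not how FGENES is implemented; the function names, the use of a midpoint threshold and the interpretation of the second feature as a numeric 3’ splice site score are assumptions made for illustration. Each training point is an (x, y) pair of the two features.

```python
def fit_lda(class0, class1):
    """Fit Fisher's linear discriminant for two classes of 2-D points.

    Returns (w, threshold): a point p is assigned to class1 when
    w[0]*p[0] + w[1]*p[1] > threshold. The threshold is placed at the
    projection of the midpoint between the two class means.
    """
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    m0, m1 = mean(class0), mean(class1)

    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, m in ((class0, m0), (class1, m1)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy

    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("scatter matrix is singular; need more varied data")

    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    dmx, dmy = m1[0] - m0[0], m1[1] - m0[1]
    w = ((syy * dmx - sxy * dmy) / det, (-sxy * dmx + sxx * dmy) / det)

    mid = ((m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def is_coding(point, w, threshold):
    """Classify a (length, splice_score) point with the fitted line."""
    return w[0] * point[0] + w[1] * point[1] > threshold
```

The fitted `w` and `threshold` define the straight separating line from the slide; QDA would instead allow each class its own covariance, which produces a curved boundary.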
13. Neural Nets
•A series of input, hidden and output layers
•Gene structure information is fed to the input layer, separated into several classes
•Hexamer frequencies
•Splice sites
•GC composition
•Weights are calculated in the hidden layer to generate the exon output
•When the input layer is challenged with a new sequence, the rules that were generated to output exons are applied to the new sequence
14. HMMs
•GenScan (http://genes.mit.edu/GENSCAN.html)
5th-order HMM
•Combines hexamer frequencies with coding signals
•Initiation codons
•TATA boxes
•CAP site
•Poly-A
•Trained on vertebrate, Arabidopsis and maize data
•Extensively used in human genome project
•HMMgene (http://www.cbs.dtu.dk/services/HMMgene)
•Identifies sub-regions of exons from cDNA or proteins
•Locks such regions and uses HMM extension into neighboring regions
17. Homology-based programs
•Use translations to search for ESTs, cDNAs and proteins in databases
•GenomeScan
(http://genes.mit.edu/genomescan.html)
•Combines GENSCAN with BLASTX
•EST2Genome
(http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
•Compares EST and cDNA to user sequence
•TwinScan
•Similar to GenomeScan
19. Consensus-based programs
•Use several different programs to generate lists of predicted exons
•Only common predicted exons are retained
•GeneComber
(http://www.bioinformatics.ubc.ca/gencombver/index.php)
•Combines HMMgene with GenScan
•DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
•Combines FGENESH, GENSCAN and HMMgene
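Retaining only the exons that several programs agree on can be sketched as a simple vote over exon coordinates. This is not the scoring used by GeneComber or DIGIT; the function name, the representation of an exon as a (start, end) tuple and the requirement of an exact coordinate match are assumptions for illustration.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=None):
    """Keep exons predicted by at least min_votes programs.

    predictions -- one list of (start, end) exons per program
    min_votes   -- defaults to the number of programs, i.e. only exons
                   that every program reports are retained
    """
    if min_votes is None:
        min_votes = len(predictions)
    # set() ensures a program cannot vote twice for the same exon.
    votes = Counter(exon for pred in predictions for exon in set(pred))
    return sorted(exon for exon, n in votes.items() if n >= min_votes)
```

Lowering `min_votes` trades specificity for sensitivity: an exon missed by one program can still be kept if the others agree on it.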
22. •Promoters are short regions upstream of the transcription start site
•Contain short (6-8 nt) transcription factor recognition sites
•Extremely laborious to define by experiment
•Sequence is not translated into protein, so no homology matching is possible
•Each promoter is unique, with a unique combination of factor binding sites – thus no consensus promoter
23. Prokaryotic gene
[Figure: polymerase bound at the -35 and -10 boxes upstream of the ORF, with a TF bound at a TF site]
•σ70 factor binds to -35 and -10 boxes and recruits the full polymerase enzyme
•-35 box consensus sequence: TTGACA
•-10 box consensus sequence: TATAAT
•Transcription factors activate or repress transcription
•Bind to regulatory elements
•DNA loops to allow long-distance interactions
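A naive search for the two consensus boxes can be sketched as follows. The box sequences come from the slide; the function names, the 15-19 bp spacer range between the boxes and the tolerance of two mismatches per box are assumptions added for illustration, and this is not the method used by any named promoter predictor.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_promoters(seq, box35="TTGACA", box10="TATAAT",
                           spacers=range(15, 20), max_mismatch=2):
    """Find -35/-10 box pairs separated by an allowed spacer length.

    Returns (pos35, pos10, spacer, total_mismatches) tuples sorted so the
    closest matches to the consensus come first. Spacer range and mismatch
    tolerance are illustrative assumptions.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        m35 = hamming(seq[i:i + len(box35)], box35)
        if m35 > max_mismatch:
            continue
        for spacer in spacers:
            j = i + len(box35) + spacer
            if j + len(box10) > len(seq):
                break  # longer spacers would run off the sequence
            m10 = hamming(seq[j:j + len(box10)], box10)
            if m10 <= max_mismatch:
                hits.append((i, j, spacer, m35 + m10))
    return sorted(hits, key=lambda h: h[3])
```

Because the boxes are short and mismatches are tolerated, a scan like this reports many chance matches on real genomes, which is consistent with the high false-positive rates noted for ab initio promoter prediction.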
24. Eukaryotic gene structure
[Figure: Pol II at the TATA box and Inr, with TF sites upstream]
Polymerase I, II and III
Basal transcription factors (TFIID, TFIIA, TFIIB, etc.)
TATA box (TATA(A/T)A(A/T))
“Housekeeping” genes often do not contain TATA boxes
Initiator site (Inr) (C/T)(C/T)CA(C/T)(C/T) coincides with transcription start
Many TF sites
Activation/repression
25. Ab initio methods
•Promoter signals
•TATA boxes
•Hexamer frequencies
•Consensus sequence matching
•PSSM
•Numerous false positives (FPs)
•HMMs incorporate neighboring information
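A position-specific scoring matrix (PSSM) can be built from aligned example sites and slid along a sequence, as in the sketch below. The function names, the uniform background, the add-one pseudocount and the score threshold are assumptions; in practice the aligned sites would come from a curated collection, and any example sites passed in here are for illustration only.

```python
import math


def build_pssm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length aligned sites.

    Returns one dict per position mapping each base to
    log2(P(base at position) / P(base in background)).
    """
    background = background or {b: 0.25 for b in "ACGT"}
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos].upper() for site in sites]
        total = len(sites) + 4 * pseudocount
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in "ACGT"
        })
    return pssm


def scan(seq, pssm, threshold):
    """Return (position, score) for every window scoring >= threshold."""
    seq = seq.upper()
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(column[base] for column, base in zip(pssm, window))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits
```

Lowering the threshold recovers weaker sites at the cost of more chance matches, which is the trade-off behind the numerous false positives listed above.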
26. Promoter prediction in prokaryotes
•Find operon
•Upstream of first gene is promoter
•Wang rules (distance between genes, no ρ-independent termination, number of genomes that display linkage)
•BPROM (http://www.softberry.com)
•Based on arbitrary setting of operon gene distances
•200 bp upstream of first gene
•Many FPs
•FindTerm (http://sun1.softberry.com)
•Searches for ρ-independent termination signals
27. Prediction in eukaryotes
• Searching for consensus sequences in databases (TransFac)
• Increase specificity by searching for CpG islands
• High density of transcription factor binding sites
• CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
• CG% in moving window
• Eponine (http://servlet.sanger.ac.uk:8080/eponine/)
• Matches TATA box, CCAAT box, CpG island to PSSM
• Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html)
• Detects high concentrations of TF sites
• FirstEF (http://rulai.cshl.org/tools/FirstEF/)
• QDA of first exon boundary
• McPromoter (http://genes.mit.edu/McPromoter.html)
• Neural net of DNA bendability, TATA box, initiator box
• Trained for Drosophila and human sequences
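The moving-window CG% idea behind CpG island detection can be sketched as below. This is not CpGProD's algorithm. The window of 200 bp, GC fraction of at least 0.5 and observed/expected CpG ratio of at least 0.6 are commonly quoted criteria used here as assumed defaults, and the function names and step size are illustrative.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a window.

    Expected CpG count is (#C * #G) / window length.
    """
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    observed = window.count("CG")
    expected = c * g / n
    obs_exp = observed / expected if expected else 0.0
    return (c + g) / n, obs_exp


def cpg_island_windows(seq, window=200, step=50, min_gc=0.5, min_obs_exp=0.6):
    """Slide a window along seq and report windows that look CpG-rich.

    Returns (start, end, gc_fraction, obs_exp) tuples; thresholds are
    assumed defaults, not taken from a specific program.
    """
    seq = seq.upper()
    hits = []
    for i in range(0, len(seq) - window + 1, step):
        gc, obs_exp = cpg_stats(seq[i:i + window])
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append((i, i + window, round(gc, 2), round(obs_exp, 2)))
    return hits
```

Overlapping windows that pass the thresholds would normally be merged into a single island before looking for a promoter at its 5’ end.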
28. Phylogenetic footprinting technique
•Identify conserved regulatory sites
•Human-chimpanzee too close
•Human-fish too distant
•Human-mouse appropriate
•ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)
•Aligns two sequences by global alignment algorithm
•Identifies conserved regions and compares to TRANSFAC database
•High scoring hits returned as positives
•rVISTA (http://rvista.dcode.org)
•Identifies TRANSFAC sites in two orthologous sequences
•Aligns sequences with local alignment algorithm
•Highest identity regions returned as hits
•Bayes aligner
(http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl)
•Aligns two sequences with Bayesian algorithm
•Even weakly conserved regions identified
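Once two orthologous upstream regions have been aligned, conserved stretches can be picked out with a sliding identity window, as in this sketch. It assumes the alignment has already been computed by some other tool and is supplied as two equal-length gapped strings; the function name, window size and identity cut-off are illustrative assumptions and do not reproduce ConSite, rVISTA or the Bayes aligner.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Report alignment windows whose identity meets the cut-off.

    aln_a, aln_b -- aligned sequences of equal length, with '-' for gaps
    Returns (start_column, identity) tuples. Gap columns never count as
    matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for i in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[i:i + window].upper(), aln_b[i:i + window].upper())
        matches = sum(a == b and a != "-" for a, b in pairs)
        identity = matches / window
        if identity >= min_identity:
            hits.append((i, identity))
    return hits
```

The conserved windows would then be compared against known transcription factor binding profiles, which is the second step the tools on this slide perform.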
29. Expression-profiling based method
Microarray analyses allow identification of co-regulated genes
Assume that promoters contain similar regulatory sites
Find such sites by EM and Gibbs sampling using iteration of PSSM
Co-expressed genes may be regulated at higher levels
MEME (http://meme.sdsc.edu/meme/website/meme-intro.html)
AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl)
Gibbs sampling algorithm