2. CpG Island
• A region of the genome with a higher frequency of CpG
sites than the rest of the genome.
• Formal definition: a CpG island is a region of at
least 200 bp with a GC percentage greater
than 50%.
• CpG is shorthand for “C-phosphate-G”, that
is, cytosine and guanine separated by only one
phosphate.
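The two thresholds in the definition above can be checked mechanically. The following is a minimal sketch, not a production CpG island finder; the function names (`gc_fraction`, `count_cpg_sites`, `meets_length_and_gc`) are made up for illustration, and the input is assumed to be a plain string over A, C, G, T.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def count_cpg_sites(seq: str) -> int:
    """Number of positions where a C is immediately followed by a G."""
    # "CG" cannot overlap with itself, so str.count finds every site.
    return seq.upper().count("CG")


def meets_length_and_gc(seq: str, min_len: int = 200, min_gc: float = 0.5) -> bool:
    """True if seq is at least min_len bases long and its GC fraction exceeds min_gc."""
    return len(seq) >= min_len and gc_fraction(seq) > min_gc


candidate = "CG" * 100  # toy 200 bp sequence, purely illustrative
print(meets_length_and_gc(candidate), count_cpg_sites(candidate))
```

The thresholds are exposed as parameters because the slide states only one choice of values.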
4. Importance of CpG Islands
• A CpG island acts as a proxy for identifying a gene.
• CpG islands often occur at the start of a gene.
• Cytosines in CpG dinucleotides can be methylated (have a methyl group
attached) to form 5-methylcytosine.
6. Importance of Methylation
• Our bodies consist of trillions of cells. Every cell contains
the same copy of DNA with the same blueprint of genetic code,
so how do cells decide among themselves which function each has
to perform?
• How does a heart cell know it is a heart cell?
• How does a skin cell know it is a skin cell?
• They need outside instructions, which come from small carbon-hydrogen
groups called methyl groups.
• Methylation also helps explain how characteristics can change across
generations without changes to the DNA sequence itself.
7. Epigenetics & CpG Islands
• The literal meaning of ‘epigenetics’ is ‘above genetics’.
Epigenetic mechanisms determine the methylation of CpG islands.
• CpG islands regulate the expression of nearby genes.
• Proteins involved in gene expression can be repelled or
attracted by the methyl group.
8. Background: Epigenetics
• Environmental factors such as what we do, what we eat, what we
smoke and how stressed we are influence where methyl groups bind.
• A bad diet can lead methyl groups to bind in the wrong
place; with these bad instructions, cells can become abnormal and
cause disease.
• Epigenetics is also controlled by histones. Histones are proteins that
act as spools around which DNA winds itself. Histones can
change how tightly or loosely the DNA is wound around them.
• If the DNA is wound loosely, the gene is expressed more.
• If the DNA is wound tightly, the gene is expressed less.
10. Background: Epigenetics
• So a methyl group is more like a ‘switch’ and histones
are more like a ‘knob’.
• Every cell of your body has a distinct methylation
and histone pattern that gives the cell its
marching orders.
• DNA can be thought of as the body's ‘hardware’, and the
epigenome is more like software that tells the
hardware what work it has to do, which justifies
its name.
11. Now Some Computer Science
• Task: design a method that, given a candidate
string (k-mer), scores it according to how confident we are that it
came from a CpG island.
• Apply a sequence model: a probabilistic
model that associates probabilities with
sequences.
12. Sequence Models
• Sequence models learn from examples.
• Say we have sampled 100K 5-mers from inside
CpG islands and 100K 5-mers from outside.
• Can we guess whether CGCGC came from a CpG
island?
• # CGCGC inside: 315
• # CGCGC outside: 12
• P(inside) = 315/(315 + 12) ≈ 0.96
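The count-based estimate on this slide can be sketched in a few lines. This is an illustration only: the helper name `p_inside` is hypothetical, the two `Counter` objects stand in for counts over the inside and outside training samples, and the samples are assumed to be of equal size, as in the 100K/100K setup above.

```python
from collections import Counter


def p_inside(kmer, inside_counts, outside_counts):
    """Estimate P(inside | kmer) as count_inside / (count_inside + count_outside).

    Returns None when the k-mer was never observed in either sample,
    since the ratio is undefined in that case.
    """
    n_in = inside_counts[kmer]    # Counter returns 0 for unseen keys
    n_out = outside_counts[kmer]
    total = n_in + n_out
    if total == 0:
        return None
    return n_in / total


# Counts taken from the slide; all other k-mers are omitted here.
inside_counts = Counter({"CGCGC": 315})
outside_counts = Counter({"CGCGC": 12})

estimate = p_inside("CGCGC", inside_counts, outside_counts)
print(round(estimate, 3))  # 0.963
```

Returning `None` for an unseen k-mer makes the weakness of pure counting explicit: the estimate simply does not exist when the training data never contained the string.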
13. Sequence Models
• To estimate P(x), we count the number of times x appears in the
training set labelled INSIDE and divide by the total number of
times x appears in the training set.
• But for sufficiently large k, we might see no
occurrences of x, or very few. To overcome this
limitation, we work with the joint probability distribution.
• $P(x) = P(x_k, x_{k-1}, \ldots, x_1)$, where $P(x)$ is the
probability of sequence x.
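The slides between this one and the Markov chain slide are not part of this excerpt. The standard way to make the joint probability tractable, assumed here rather than taken from the source, is to expand it with the chain rule and then apply a first-order Markov assumption (each symbol depends only on its predecessor):

```latex
\begin{align*}
P(x) &= P(x_k, x_{k-1}, \ldots, x_1) \\
     &= P(x_k \mid x_{k-1}, \ldots, x_1)\, P(x_{k-1} \mid x_{k-2}, \ldots, x_1) \cdots P(x_2 \mid x_1)\, P(x_1) \\
     &\approx P(x_1) \prod_{i=2}^{k} P(x_i \mid x_{i-1})
\end{align*}
```

Each factor $P(x_i \mid x_{i-1})$ can be estimated from dinucleotide counts, which are far less sparse than counts of whole k-mers.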
17.
• P(x) now equals the product
of all the Markov chain
edge weights along our
string-driven walk
through the chain.
• Node labels are symbols
and transition labels are
conditional probabilities.
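The walk through the chain can be sketched as follows. This is a minimal illustration under stated assumptions: the training strings are made-up toy data, the pseudocount of 1 and the uniform initial distribution are choices made here, and the function names (`train_markov_chain`, `log_prob`, `log_odds`) are hypothetical. Log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long strings.

```python
import math

ALPHABET = "ACGT"


def train_markov_chain(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current).

    Every transition starts at `pseudocount` so that no edge weight is zero.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    trans = {}
    for a in ALPHABET:
        row_total = sum(counts[a].values())
        trans[a] = {b: counts[a][b] / row_total for b in ALPHABET}
    return trans


def log_prob(seq, trans):
    """Log P(seq): uniform initial term plus the sum of log edge weights."""
    lp = math.log(1.0 / len(ALPHABET))  # assumed uniform start distribution
    for cur, nxt in zip(seq, seq[1:]):
        lp += math.log(trans[cur][nxt])
    return lp


def log_odds(seq, trans_inside, trans_outside):
    """Positive when the inside chain explains seq better than the outside chain."""
    return log_prob(seq, trans_inside) - log_prob(seq, trans_outside)


# Toy training data, invented for illustration only.
inside_chain = train_markov_chain(["CGCGCGCG", "GCGCGCCG"])
outside_chain = train_markov_chain(["ATATTAAT", "TTAACGAT"])

print(log_odds("CGCGC", inside_chain, outside_chain))
print(log_odds("ATATA", inside_chain, outside_chain))
```

Training one chain on inside examples and one on outside examples, then comparing the two log-probabilities, gives a score for a candidate k-mer without ever needing to have seen that exact k-mer.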
21. Hidden Markov Model
• In simpler Markov models (like a Markov chain), the
state is directly visible to the observer, and
therefore the state transition probabilities are the
only parameters.
• In a hidden Markov model, the state is not directly
visible, but the output, dependent on the state, is
visible. Each state has a probability distribution
over the possible output tokens. The adjective
'hidden' refers to the state sequence through which
the model passes.
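To make the hidden/visible distinction concrete, here is a small sketch that generates data from an HMM. It uses a fair-versus-loaded coin setting; all probabilities below are assumed values chosen for illustration, and the names (`START`, `TRANS`, `EMIT`, `sample_hmm`) are hypothetical. An observer would see only the flips, never the state sequence.

```python
import random

STATES = ("fair", "loaded")
# Assumed parameters, for illustration only.
START = {"fair": 0.5, "loaded": 0.5}
TRANS = {
    "fair": {"fair": 0.9, "loaded": 0.1},
    "loaded": {"fair": 0.1, "loaded": 0.9},
}
EMIT = {
    "fair": {"H": 0.5, "T": 0.5},
    "loaded": {"H": 0.8, "T": 0.2},
}


def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_hmm(n, rng):
    """Return (hidden state sequence, observed flips), each of length n."""
    states, flips = [], []
    state = draw(START, rng)
    for _ in range(n):
        states.append(state)
        flips.append(draw(EMIT[state], rng))  # output depends on the state
        state = draw(TRANS[state], rng)       # then the state moves on
    return states, flips


rng = random.Random(0)
hidden, observed = sample_hmm(20, rng)
print("observed:", "".join(observed))
print("hidden:  ", "".join(s[0].upper() for s in hidden))
```

In a plain Markov chain the second printed line would be the data itself; in an HMM only the first line is available, and the second has to be inferred.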
28. Hidden Markov Model: Viterbi Algorithm
• Given the flips, can we say when the dealer was using
the loaded coin?
• We want to find p*, the most likely path given the
emissions.
• Viterbi algorithm is a dynamic programming algorithm
for finding the most likely sequence of hidden states –
called the Viterbi path – that results in a sequence of
observed events.
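A compact sketch of the dynamic program for the coin setting follows. The model parameters are assumed values for illustration (the source does not give them), all of them are non-zero so that taking logarithms is safe, and the function name `viterbi` and its signature are choices made here rather than a library API.

```python
import math

STATES = ("fair", "loaded")
# Assumed parameters, for illustration only; none are zero.
START = {"fair": 0.5, "loaded": 0.5}
TRANS = {
    "fair": {"fair": 0.9, "loaded": 0.1},
    "loaded": {"fair": 0.1, "loaded": 0.9},
}
EMIT = {
    "fair": {"H": 0.5, "T": 0.5},
    "loaded": {"H": 0.8, "T": 0.2},
}


def viterbi(observations, states, start, trans, emit):
    """Return (most likely hidden state path, its joint log-probability)."""
    if not observations:
        return [], 0.0
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start[s]) + math.log(emit[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


flips = "HTTHTHHHHHHHHTHT"
path, logp = viterbi(flips, STATES, START, TRANS, EMIT)
print(flips)
print("".join(s[0].upper() for s in path))
```

The table `best` has one entry per position and state, so the running time is proportional to the number of flips times the square of the number of states, instead of growing exponentially as it would if every path were enumerated.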