Presented by Raphael Mrode, ILRI, at the workshop on Essential Knowledge for Effective Improvement and Dissemination of Genetics in Sheep and Goats, Addis Ababa, Ethiopia, 3–5 November 2020
Genomic selection in Livestock
Genomic selection in Livestock
Raphael Mrode, ILRI
Essential Knowledge for Effective Improvement and Dissemination of
Genetics in Sheep and Goats
3 – 5 November 2020
Addis Ababa , Ethiopia
2. 2
The basic goal: genetic progress and
increased productivity
Identifying animals with the best genetic merit as parents of
the next generation leads to genetic improvement
[Figure: distribution of phenotypes in the parental generation, with the animals selected to be parents marked, and distribution of offspring phenotypes, with the genetic improvement between the two distributions labelled]
3. 3
The basic goal: genetic progress and
increased productivity
To achieve this goal we need accurate estimated breeding
values (EBVs)
However, what is available are phenotypes (Y), which are
influenced by genetic and environmental effects
Y = Genetic (G) + Environment (E)
Thus Var(Y) = Var(G) + Var(E), assuming G and E are uncorrelated
Sources of Var(E) could be environmental and management factors
Sources of Var(G) are the different forms of inheritance, leading to
different components of genetic variance (e.g. additive genetic
variance, additive maternal genetic variance and so on)
Accurate estimation of G (EBVs) is our challenge
4. 4
Reality of Field data
Often we deal with field data → a variety of
environmental factors, animals with different degrees of
relatedness, many generations and unbalanced data
We need a framework to model phenotypic observations,
accounting for non-genetic systematic sources of
variation (Var(E)), to estimate EBVs accurately
The linear mixed model provides such a framework
6. 6
Linear mixed model
In matrix notation, a mixed linear model may be represented as
y = Xb + Za + e
where
y = n x 1 vector of observations; n = number of records.
b = p x 1 vector of fixed effects; p = number of levels for fixed effects
(e.g. lactation number)
a = q x 1 vector of random animal effects; q = number of levels for
random effects
e = n x 1 vector of random residual effects
X = design matrix of order n x p, that relates records to fixed effects
Z = design matrix of order n x q, that relates records to random animal
effects
Both X and Z are termed design or incidence matrices.
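The construction of X and Z can be sketched as follows. This is a minimal illustration only: the records, animal identities and lactation numbers are hypothetical and are not taken from the presentation.

```python
import numpy as np

# Hypothetical records: (animal id, lactation number, observation)
records = [(1, 1, 4.5), (2, 1, 2.9), (3, 2, 3.9), (2, 2, 3.5), (4, 1, 5.0)]

animals = sorted({r[0] for r in records})      # levels of the random animal effect
lactations = sorted({r[1] for r in records})   # levels of the fixed effect

n, p, q = len(records), len(lactations), len(animals)
y = np.array([r[2] for r in records])          # n x 1 vector of observations
X = np.zeros((n, p))                           # relates records to fixed effects
Z = np.zeros((n, q))                           # relates records to animals

for row, (animal, lact, _) in enumerate(records):
    X[row, lactations.index(lact)] = 1.0
    Z[row, animals.index(animal)] = 1.0
```

Each row of X and of Z contains a single 1, because every record belongs to exactly one lactation number and one animal; an animal with repeated records (animal 2 here) has more than one 1 in its column of Z.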
7. 7
Assumptions of the linear mixed model
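It is assumed that residual effects are independently
distributed with variance $\sigma^2_e$; therefore, $\operatorname{var}(e) = I\sigma^2_e = R$;
$\operatorname{var}(a) = G = I\sigma^2_a$ or $A\sigma^2_a$, where A is the numerator
relationship matrix. Thus
$$
\operatorname{Var}\begin{bmatrix} a \\ e \end{bmatrix} =
\begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}
$$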
8. 8
MME for breeding values
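Mixed model equations (MME) with the relationship
matrix incorporated are
$$
\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + A^{-1}\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix} =
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
$$
with $\alpha = \sigma^2_e/\sigma^2_a = (1-h^2)/h^2$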
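A minimal sketch of building and solving these MME is given below. The pedigree (two unrelated parents and their offspring), the heritability and the records are hypothetical, and `solve_mme` is an illustrative helper name rather than a function from any package.

```python
import numpy as np

def solve_mme(X, Z, A_inv, y, alpha):
    """Build and solve the MME; returns (b_hat, a_hat)."""
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + A_inv * alpha]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]

# Hypothetical pedigree: animals 1 and 2 are unrelated, animal 3 is their offspring
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
h2 = 0.25                      # assumed heritability
alpha = (1 - h2) / h2          # equals var_e / var_a

X = np.ones((3, 1))            # overall mean as the only fixed effect
Z = np.eye(3)                  # one record per animal
y = np.array([10.0, 12.0, 14.0])

b_hat, a_hat = solve_mme(X, Z, np.linalg.inv(A), y, alpha)
```

Inverting A directly is only sensible for a toy example; for real pedigrees the inverse is normally built directly from the pedigree.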
9. 9
Limitations of A matrix
The relationship matrix A based on pedigree is an
average relationship, which assumes an infinite number of loci.
Real relationships are a bit different due to finite
genome size
Therefore A is the expectation of realized relationships
For example, two half-sibs have an expected relationship of 0.25, but their realized relationship might be 0.3 or 0.2
10. 10
Use of microsatellites as markers and
limitations
Initially microsatellites were used as genetic markers in the 1980s
and 1990s.
Microsatellites are sets of short repeated DNA sequences at a
particular locus on a chromosome, which vary in number in
different individuals and so can be used as markers
The most significant genetic marker can be 10 cM or more from the
QTL, therefore QTL are not mapped precisely.
The association between marker and QTL may not persist through
the population.
The phase between marker and QTL may have to be estimated for
each family
11. 11
Single Nucleotide Polymorphism (SNP).
• SNP is a DNA sequence variation occurring when a
single nucleotide — A, T, C, or G — in the genome
differs between paired chromosomes in an individual.
• For example, two sequenced DNA fragments from an
individual, AAGCCTA and AAGCTTA, contain a difference in
a single nucleotide.
• In this case we say that there are two alleles: C and T.
Almost all common SNPs have only two alleles.
13. Whole Genome Sequence & Genotyping chips
Whole-genome sequencing (WGS) began with human (2001) and mouse, followed by the chicken (2004), the dog
(2005), bovine (2006), horse (2007), pig (2009), ...
New technologies for genotyping and sequencing
Simultaneous genotyping of many SNP
From few dozens up to several million SNP
Two main technology providers, Illumina and Affymetrix
Illumina products in cattle
3000 (7000 1000 20000=« LD »)
54 000=« 50k »
777 000=« HD »
14. 14
Genomic Selection (GS)
GS is the use of genomic estimated breeding values (GEBVs) for the
selection of animals.
Genomic selection requires that markers (SNPs) are in linkage
disequilibrium (LD) with the QTL across the whole population
Thus the use of SNPs as markers enables all QTL in the genome
to be traced through the tracing of chromosome segments
defined by adjacent SNPs.
15. 15
Steps in Genomic Selection (GS)
• Genotype animals with phenotypes (sires with daughter records for
sex-limited traits)
• Estimate SNP solutions (SNP key) in the reference population
• Validate in another data set of genotyped and phenotyped animals whose records are excluded, to determine the accuracy of
the SNP key
• Genotype animals at birth or at a young age (no phenotypes) and use the SNP key to
predict their GEBVs and do selection
Reference population: genotyped and phenotyped animals
Validation candidates: genotyped and phenotyped, but phenotypes excluded
Selection candidates: genotyped but no phenotypes
16. 16
Main advantages of Genomics
Young bulls can be genotyped early in life and breeding values
computed
Can be used to select young bulls to be progeny tested, thereby
reducing cost
Accuracy for young bulls is about 20-40% higher than that of the parent
average
Reduction in generation interval
17. 17
Genomic Selection : efficiency
Two main factors:
• Accuracy of SNP effect estimation
– size of reference population
– heritability of the trait
– statistical methodology used
• Linkage disequilibrium (LD) between markers and QTL
– marker density
– effective size of the population => number of « independent » segments
• Relationship between candidates and reference population
18. 18
Size of the Reference population
Greatly influences the accuracy of genomic evaluations
(Goddard, 2008)
19. Validation test
[Figure: two scatter plots with axes running from 20 to 45. Panel 1 (training set): DYD and estimated BV. Panel 2 (training set and validation set): DYD (proxy of true BV) and GEBV, annotated "Overestimation = inflation".]
Two important parameters:
• R2 or r(DYD, GEBV) (should be « large enough »)
• slope of the regression (should be close to 1)
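The two validation parameters can be computed as in the sketch below. The DYD and GEBV values are hypothetical, and the regression is assumed to be that of DYD on GEBV, so that a slope below 1 corresponds to inflated (over-dispersed) GEBVs.

```python
import numpy as np

# Hypothetical validation animals: DYD (proxy of true BV) and predicted GEBV
dyd = np.array([22.0, 27.5, 31.0, 35.5, 41.0])
gebv = np.array([20.0, 26.0, 33.0, 38.0, 44.0])

# Regression of DYD on GEBV; polyfit returns the slope first for degree 1
slope, intercept = np.polyfit(gebv, dyd, 1)

# Correlation between DYD and GEBV, and its square
r = np.corrcoef(dyd, gebv)[0, 1]
r2 = r ** 2

print(f"slope = {slope:.3f}, r = {r:.3f}, R2 = {r2:.3f}")
```

In this made-up data set the GEBVs are more spread out than the DYDs, so the slope falls below 1.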
20. 20
Increasing the size of the Reference
populations
Genotype as many progeny-tested sires as possible
International collaborations
Holstein: 2 big consortia
– USA + Canada ~>35000 bulls + UK + Italy
– Eurogenomics: France, The Netherlands, (Germany),
Nordic countries, Spain, Poland ~34000 bulls (?)
For small breeds or other species (goats, sheep, beef cattle): not
enough sires, so
combine with many genotyped cows
Records on about 4-5 cows provide information equivalent to one
proven sire (Goddard (2009) and Daetwyler et al. (2013))
21. 21
General linear model
The general linear model underlying genomic evaluation is of the form
$$
y = Xb + \sum_{i=1}^{m} M_i g_i + e
$$
where
m is the number of SNPs; y is the data vector;
b is the vector of the mean or fixed effects;
$g_i$ is the genetic effect of the ith SNP genotype and e is the error.
The matrix M is of dimension n (number of animals) by m, and $M_i$ relates
the ith SNP to the data
It is assumed that all the additive genetic variance is explained by the marker
effects, such that the estimate of an animal's total genetic merit or breeding value
(a) is
$$
a = \sum_{i=1}^{m} M_i g_i
$$
22. 22
Data types used for genomic evaluation
• y = YD (yield deviation) = individual record corrected for
all fixed and non-genetic random effects
• y = DYD (daughter yield deviation) = twice the average, for
a bull, of the YDs of his daughters corrected for ½ the genetic
merit of their dams, with associated weight = EDC
(Equivalent Daughter Contribution)
• y = de-regressed proofs, obtained by solving the MME
to get the right-hand side
• EBVs: NO
23. 23
Coding and scaling genotypes
• The genotypes of animals (elements of M) are
commonly coded as 2 and 0 for the two homozygotes
(AA and BB) and 1 for the heterozygote (AB).
• Or if alleles are expressed in terms of nucleotides, and
reference allele at a locus is G and the alternative allele
is C, then code 0 = GG , 1 = GC and 2 = CC.
• The diagonal elements of MM’ then indicate the
individual relationship with itself (inbreeding) and the
off-diagonal indicate the number of alleles shared by
relatives
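The coding rule can be sketched as below, counting copies of the alternative allele so that GG = 0, GC = 1 and CC = 2. The genotypes and the helper name `code_genotype` are hypothetical, used here for illustration only.

```python
import numpy as np

def code_genotype(genotype, alt_allele):
    """Count copies of the alternative allele in a two-letter genotype."""
    return sum(1 for allele in genotype if allele == alt_allele)

# Hypothetical genotypes: 3 animals (rows) at 2 SNPs (columns)
raw = [["GG", "AT"],
       ["GC", "TT"],
       ["CC", "AA"]]
alt = ["C", "T"]   # assumed alternative allele at each SNP

M = np.array([[code_genotype(g, alt[j]) for j, g in enumerate(row)]
              for row in raw])

MMt = M @ M.T      # animals x animals cross-product of coded genotypes
```

With this coding the first and third animals, which carry no alternative alleles in common, get a zero off-diagonal in MM'.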
24. 24
Scaling of genotypes
• SNPs → 2 alleles (A/B), but only one effect is defined: the substitution effect mi
• Commonly elements of M are scaled
– to set the mean values of the allele effects to zero
– to account for differences in allele frequencies of the various SNPs
– Let the frequency of the second or alternative allele at locus j be pj
– Elements of M can be scaled by subtracting 2pj.
– If the elements of column j of a matrix P equal 2pj, then matrix Z,
which contains the scaled elements of M, is: Z = M - P.
• Furthermore, the elements of Z can be normalised by dividing
the column for marker j by its standard deviation, assumed to
be $\sqrt{2p_j(1-p_j)}$.
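A minimal sketch of the centring and optional normalisation follows. The genotype matrix is hypothetical, and the allele frequencies are simply computed from those same genotypes, which is an assumption of the sketch.

```python
import numpy as np

# Hypothetical coded genotypes: 4 animals (rows) x 3 SNPs (columns)
M = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 2]], dtype=float)

n = M.shape[0]
p = M.sum(axis=0) / (2 * n)      # frequency of the alternative allele per SNP
P = np.tile(2 * p, (n, 1))       # every row of P holds 2 * p_j
Z = M - P                        # centred genotypes

sd = np.sqrt(2 * p * (1 - p))    # assumed standard deviation of each column
Z_norm = Z / sd                  # normalised genotypes
```

When the frequencies come from the same animals, every column of Z sums to zero, which is what setting the mean of the allele effects to zero means in practice.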
25. 25
Mixed linear model for computing SNP
effect
• The most common random model used assumes
– the effects of the SNPs are normally distributed,
– all SNPs are from a common normal distribution (e.g. the same genetic variance for all
SNPs).
• There are two equivalent models with these assumptions
• (1) SNP-BLUP - a model fitting individual SNP effects simultaneously.
– DGV for selection candidates are calculated as DGV = Zĝ, where ĝ are the estimates of
random SNP effects.
– Assumes $\sigma^2_g$ is known, but this may not be the case in practice, and $\sigma^2_g$ may be
approximated from $\sigma^2_a$.
• (2) GBLUP - a model that estimates DGV directly, with a (co)variance among
breeding values of $G\sigma^2_a$, where G is the genomic relationship matrix, the
realised proportion of the genome that animals share in common, estimated
from the SNPs.
26. 26
SNP BLUP model
In matrix form, the model is
y = Xb + Zg + e
where
y = vector of observations: these can be de-regressed
EBVs or phenotypes corrected for all fixed effects
g = vector of additive genetic effects
corresponding to allele substitution effects for each SNP
Z = scaled matrix of genotypes
The MME are below, with $\alpha = \sigma^2_e/\sigma^2_g$
$$
\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + I\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{g} \end{bmatrix} =
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
$$
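The SNP-BLUP equations can be set up and solved as in the sketch below. The centred genotypes, observations and value of α are hypothetical, and `snp_blup` is an illustrative helper name, not a function from any package.

```python
import numpy as np

def snp_blup(X, Z, y, alpha):
    """Solve the SNP-BLUP MME; returns (b_hat, g_hat)."""
    m = Z.shape[1]
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + np.eye(m) * alpha]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]

# Hypothetical data: 4 animals, 3 SNPs, overall mean as the only fixed effect
Z = np.array([[ 1.0, -0.5,  0.0],
              [ 0.0,  0.5, -1.0],
              [-1.0,  0.5,  0.0],
              [ 0.0, -0.5,  1.0]])
X = np.ones((4, 1))
y = np.array([9.5, 13.0, 12.5, 15.0])
alpha = 10.0                      # assumed ratio var_e / var_g

b_hat, g_hat = snp_blup(X, Z, y, alpha)
dgv = Z @ g_hat                   # direct genomic values of these animals
```

Adding Iα to Z'Z shrinks the SNP effects towards zero and keeps the system solvable even when there are more SNPs than animals.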
27. 27
SNP-BLUP
• If y in MME = de-regressed breeding values of bulls,
then
– Each observation may be associated with differing reliabilities.
– Thus a weighted analysis may be required to account for these
differences in bull reliabilities.
– Weight (wti) = effective daughter contribution, or wti = (1/reldtr) – 1,
where reldtr is the bull's reliability from daughters
with parent information excluded
28. 28
SNP-BLUP
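• The MME then are
$$
\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + I\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{g} \end{bmatrix} =
\begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}
$$
• where R = D and D is a diagonal matrix with diagonal
element i = wti.
• In practice, the value of $\sigma^2_g$ may not be known, and $\sigma^2_g$
could be obtained
• either as $\sigma^2_g = \sigma^2_a / m$, with m = the number of markers
• or as $\sigma^2_g = \sigma^2_a / \left(2\sum_j p_j(1 - p_j)\right)$,
• in which case $\alpha = 2\sum_j p_j(1 - p_j) \times [\sigma^2_e/\sigma^2_a]$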
30. 30
Example 1
• The observations are the daughter yield deviations for fat yield and the effective
daughter contribution (EDC) for each bull is also given.
• The EDC can be used as weights in the analysis, but they are ignored in this presentation
• It is assumed that the genetic variance for fat yield is 35.241 kg2 and the residual variance is
245 kg2
• Animals 13 to 20 are assumed to be the reference population and animals 21 to 26 the
validation candidates.
• SNP effects are predicted using all 10 SNPs.
• The incidence matrix X = 1q, a vector of ones of order q, with q = 8, the number of
animals in the reference population
31. 31
Computing the matrices we need
• The incidence matrix X = 1q, a vector of ones of order q, with q = 8, the number of
animals in the reference population
• X' = [ 1 1 1 1 1 1 1 1]
• The computation of Z requires calculating the allele
frequency for each SNP.
32. 32
Computing Matrices
• The allele frequency for the ith SNP was computed as
$$
p_i = \frac{\sum_{j=1}^{n} m_{ij}}{2 \times n}
$$
with n = 14, the number of animals with genotypes, and mij are
elements of M.
• The allele frequencies are 0.321, 0.179, 0.357, 0.357, 0.143,
0.607, 0.071, 0.964, 0.571 and 0.393 respectively.
• Using those frequencies, 2Σpj(1 – pj) = 3.5383. Thus α =
3.5383*(245/35.242) = 24.598
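The arithmetic behind α can be checked with the short sketch below. It uses the rounded allele frequencies listed on the slide, so the results agree with 3.5383 and 24.598 only to about three significant digits; the variable names are illustrative.

```python
# Allele frequencies as listed (rounded to three decimals)
freqs = [0.321, 0.179, 0.357, 0.357, 0.143,
         0.607, 0.071, 0.964, 0.571, 0.393]

var_a = 35.242   # genetic variance as used in the alpha calculation (kg2)
var_e = 245.0    # residual variance (kg2)

k = 2 * sum(p * (1 - p) for p in freqs)   # 2 * sum of p_j (1 - p_j)
var_g = var_a / k                         # variance attributed to each SNP
alpha = var_e / var_g                     # same as k * var_e / var_a

print(round(k, 3), round(alpha, 2))
```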
33. 33
Z matrix
• Z = M – P and is (rows: reference animals 13 to 20; columns: SNPs 1 to 10)

 1.357 -0.357  0.286  0.286 -0.286 -1.214 -0.143  0.071 -0.143  1.214
 0.357 -0.357 -0.714 -0.714 -0.286  0.786 -0.143  0.071 -0.143 -0.786
 0.357  0.643  1.286  0.286  0.714 -1.214 -0.143  0.071 -0.143  1.214
-0.643 -0.357  1.286  0.286 -0.286 -0.214 -0.143  0.071  0.857  0.214
-0.643  0.643  0.286  1.286 -0.286 -1.214 -0.143  0.071 -0.143  1.214
 0.357  0.643 -0.714  0.286 -0.286  0.786 -0.143  0.071  0.857  0.214
-0.643 -0.357  0.286  0.286 -0.286  0.786 -0.143  0.071  0.857 -0.786
-0.643  0.643  0.286 -0.714 -0.286 -0.214 -0.143  0.071  0.857 -0.786

• We have computed X and Z.
• The remaining matrices X'Z, Z'X and Z'Z are computed by
multiplication. Then Iα is added to Z'Z and the MME are formed.
• When solved, we obtain these solutions:
34. 34
Computing GEBVs
• Solutions
Mean effect: 9.944
SNP effects (ĝ):
SNP 1    0.087
SNP 2   -0.311
SNP 3    0.262
SNP 4   -0.080
SNP 5    0.110
SNP 6    0.139
SNP 7    0.000
SNP 8    0.000
SNP 9   -0.061
SNP 10  -0.016
• The SNP solutions are also called the SNP key
35. 35
GEBVs for Validation animals
• The DGV for the reference animals (animals 13- 20) is
then computed as Zĝ.
• For the validation animals (animals 21 -26) , DGV = Z2ĝ
where Z2 contains the centralised genotypes for the
validation candidates
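The computation DGV = Zĝ can be sketched as below for the reference animals, using the SNP solutions from the example. Two assumptions are made: the signs in Z are restored from Z = M - P and the listed allele frequencies, and the rows are taken to be animals 13 to 20 in order. The centred genotypes Z2 of the validation animals would be handled in exactly the same way.

```python
import numpy as np

# SNP key from the example (SNP 1 to SNP 10)
g_hat = np.array([0.087, -0.311, 0.262, -0.080, 0.110,
                  0.139, 0.000, 0.000, -0.061, -0.016])

# Centred genotypes; rows assumed to be animals 13 to 20, columns SNP 1 to 10
Z = np.array([
    [ 1.357, -0.357,  0.286,  0.286, -0.286, -1.214, -0.143, 0.071, -0.143,  1.214],
    [ 0.357, -0.357, -0.714, -0.714, -0.286,  0.786, -0.143, 0.071, -0.143, -0.786],
    [ 0.357,  0.643,  1.286,  0.286,  0.714, -1.214, -0.143, 0.071, -0.143,  1.214],
    [-0.643, -0.357,  1.286,  0.286, -0.286, -0.214, -0.143, 0.071,  0.857,  0.214],
    [-0.643,  0.643,  0.286,  1.286, -0.286, -1.214, -0.143, 0.071, -0.143,  1.214],
    [ 0.357,  0.643, -0.714,  0.286, -0.286,  0.786, -0.143, 0.071,  0.857,  0.214],
    [-0.643, -0.357,  0.286,  0.286, -0.286,  0.786, -0.143, 0.071,  0.857, -0.786],
    [-0.643,  0.643,  0.286, -0.714, -0.286, -0.214, -0.143, 0.071,  0.857, -0.786],
])

dgv = Z @ g_hat    # one direct genomic value per animal

for animal, value in zip(range(13, 21), dgv):
    print(animal, round(float(value), 3))
```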
37. 37
GBLUP
Equivalent model to SNP-BLUP
Uses the usual BLUP MME, but with $A^{-1}$ replaced by $G^{-1}$
The DGV is computed directly as the sum of the SNP
effects (a = Zg)
The model is
y = Xb + Wa + e
where a = vector of DGVs and W is the design matrix
linking records to animals
Matrix X is as defined before and W is an identity
matrix (a diagonal matrix with all diagonal elements =
1)
38. 38
GBLUP
Given that a = Zg,
then $\operatorname{var}(a) = ZZ'\sigma^2_g$.
Note that
$$
\sigma^2_g = \frac{\sigma^2_a}{2\sum_j p_j(1-p_j)}
$$
then the matrix ZZ' can be scaled such that
$$
G = \frac{ZZ'}{2\sum_j p_j(1-p_j)}
$$
and $\operatorname{var}(a) = G\sigma^2_a$.
Division by $2\sum_j p_j(1-p_j)$ makes G analogous to A.
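Building G from coded genotypes can be sketched as follows. The genotype matrix is hypothetical, the allele frequencies are taken from the same animals, and `genomic_relationship` is an illustrative helper name.

```python
import numpy as np

def genomic_relationship(M, p):
    """G = ZZ' / (2 * sum p_j (1 - p_j)), with Z = M - 2p."""
    Z = M - 2 * p                      # subtracts 2 * p_j from column j
    return Z @ Z.T / (2 * np.sum(p * (1 - p)))

# Hypothetical coded genotypes: 4 animals (rows) x 3 SNPs (columns)
M = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 2]], dtype=float)
p = M.sum(axis=0) / (2 * M.shape[0])   # allele frequencies from these animals

G = genomic_relationship(M, p)
```

G is symmetric with one row and column per genotyped animal, so it has the same layout as the block of A for those animals.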
41. 41
GBLUP
• MME are
$$
\begin{bmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + G^{-1}\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix} =
\begin{bmatrix} X'R^{-1}y \\ W'R^{-1}y \end{bmatrix}
$$
• where α now equals $\sigma^2_e/\sigma^2_a$. Solutions for the example are in the previous
table
• Advantages:
– Existing software for genetic evaluation can be used by replacing A with G
– The systems of equations are of the size of the number of animals, which tends to be smaller
than the number of SNPs.
– In pedigreed populations G discriminates among sibs and other relatives,
capturing information on Mendelian sampling.
– The method is attractive for populations without good pedigree, as G will
capture this information among the genotyped individuals
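The equivalence of GBLUP and SNP-BLUP can be checked numerically, as in the sketch below. Everything in it is hypothetical: the genotypes and records are simulated, the allele frequencies are treated as known rather than estimated from the sample (which keeps G invertible), and R is taken as an identity matrix, so the analysis is unweighted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 40                                   # animals, SNPs (hypothetical)
p = rng.uniform(0.2, 0.8, size=m)              # assumed known allele frequencies
M = rng.binomial(2, p, size=(n, m)).astype(float)
Z = M - 2 * p                                  # centred genotypes
y = rng.normal(10.0, 2.0, size=n)              # simulated observations
X = np.ones((n, 1))
var_a, var_e = 35.241, 245.0
k = 2 * np.sum(p * (1 - p))

# SNP-BLUP: alpha = var_e / var_g with var_g = var_a / k
alpha_g = k * var_e / var_a
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + alpha_g * np.eye(m)]])
rhs = np.concatenate([X.T @ y, Z.T @ y])
sol_snp = np.linalg.solve(lhs, rhs)
dgv_snp = Z @ sol_snp[1:]                      # DGV as the sum of SNP effects

# GBLUP: alpha = var_e / var_a, W = I, R = I
G = Z @ Z.T / k
W = np.eye(n)
alpha_a = var_e / var_a
lhs = np.block([[X.T @ X, X.T @ W],
                [W.T @ X, W.T @ W + alpha_a * np.linalg.inv(G)]])
rhs = np.concatenate([X.T @ y, W.T @ y])
sol_gblup = np.linalg.solve(lhs, rhs)
dgv_gblup = sol_gblup[1:]                      # DGV estimated directly

print(np.allclose(dgv_snp, dgv_gblup))
```

The GBLUP system has n + 1 equations while the SNP-BLUP system has m + 1, which illustrates the point about system size when animals are fewer than SNPs.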
43. 43
Single Step Method
GBLUP computes genomic breeding values only for
genotyped animals.
How can non-genotyped animals benefit from genomic
information?
Let g2 be the genetic (genomic) values of genotyped animals and
g1 the genetic values of non genotyped animals
An estimate of g1 based on genomic information is obtained by
regression of g1 on g2 and added to information from BLUP through
the usual MME
44. 44
Single Step Method
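• We define the variance of the vector of g1 (non-genotyped) and
g2 (genotyped):
$$
H = \operatorname{Var}\begin{bmatrix} g_1 \\ g_2 \end{bmatrix} =
\begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} =
\begin{bmatrix}
A_{11} + A_{12}A_{22}^{-1}(G - A_{22})A_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\
GA_{22}^{-1}A_{21} & G
\end{bmatrix}
$$
where subscript 1 refers to non-genotyped animals and subscript 2 to genotyped animals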
45. 45
Single Step Method
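• The model is just as before but uses all data (genotyped
and ungenotyped):
$$
y = Xb + Za + e
$$
• The MME are the usual ones but with $A^{-1}$ replaced with $H^{-1}$:
$$
\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{-1}\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix} =
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
$$
• Surprisingly, $H^{-1}$ has a simple form:
$$
H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{bmatrix}
$$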
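The simple form of the inverse, in which $G^{-1} - A_{22}^{-1}$ is added to the genotyped block of $A^{-1}$, can be checked numerically against H built block by block. In the sketch below the pedigree (two unrelated, non-genotyped parents and two genotyped full sibs) and the values in G are hypothetical.

```python
import numpy as np

# Hypothetical A: animals 1-2 are unrelated parents, 3-4 their full-sib offspring
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
# Hypothetical genomic relationships among the genotyped animals (3 and 4)
G = np.array([[1.02, 0.45],
              [0.45, 0.98]])

n1 = 2                                   # number of non-genotyped animals
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]
A22_inv = np.linalg.inv(A22)

# H assembled block by block
H = np.block([[A11 + A12 @ A22_inv @ (G - A22) @ A22_inv @ A21, A12 @ A22_inv @ G],
              [G @ A22_inv @ A21, G]])

# Simple form of the inverse: A^-1 plus (G^-1 - A22^-1) in the genotyped block
H_inv = np.linalg.inv(A)
H_inv[n1:, n1:] += np.linalg.inv(G) - A22_inv

print(np.allclose(H @ H_inv, np.eye(4)))
```

In practice this means H itself never has to be formed or inverted; only G and A22 are inverted explicitly.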
46. better lives through livestock
ilri.org
ILRI thanks all donors and organizations who globally supported its work through their contributions to the CGIAR Trust Fund