Genomic selection in Livestock

Raphael Mrode, ILRI
Essential Knowledge for Effective Improvement and Dissemination of
Genetics in Sheep and Goats
3 – 5 November 2020
Addis Ababa, Ethiopia

2
The basic goal: genetic progress and
increased productivity
• Identifying animals with the best genetic merit as parents of
the next generation → genetic improvement
[Figure: distribution of phenotypes in the parental generation, with the animals selected to be parents marked, and distribution of offspring phenotypes; the shift between the two distributions is the genetic improvement.]

3
The basic goal: genetic progress and
increased productivity
• To achieve this goal we need accurate estimated breeding
values (EBVs)
• However, what is available are phenotypes (Y), which are
influenced by genetic and environmental effects
– Y = Genetic (G) + Environment (E)
– Thus Var(Y) = Var(G) + Var(E), assuming G and E are uncorrelated
– Sources of Var(E) include environmental and management factors
– Sources of Var(G) are the different forms of inheritance, which lead to
different components of genetic variance (e.g. additive genetic
variance, additive maternal genetic variance and so on)
• Accurate estimation of G (EBVs) is our challenge
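The decomposition Var(Y) = Var(G) + Var(E) can be checked numerically. The sketch below is only an illustration: the variances, the sample size and the name `h2` (the share of phenotypic variance that is genetic) are assumptions chosen for the example, and G and E are drawn independently.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000                        # number of simulated animals (illustrative)
var_g, var_e = 35.0, 245.0         # assumed genetic and environmental variances

g = rng.normal(0.0, np.sqrt(var_g), n)   # genetic values G
e = rng.normal(0.0, np.sqrt(var_e), n)   # environmental deviations E, independent of G
y = g + e                                # phenotypes Y = G + E

var_y = y.var()
h2 = var_g / (var_g + var_e)             # genetic share of the phenotypic variance
print(f"Var(Y) = {var_y:.1f}, Var(G) + Var(E) = {g.var() + e.var():.1f}, h2 = {h2:.3f}")
```

With independent draws the two printed variances agree up to sampling noise; if G and E were correlated, a covariance term would have to be added.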

4
Reality of Field data
• Often we deal with field data → a variety of
environmental factors, animals with different degrees of
relatedness, many generations and unbalanced data
• We need a framework to model phenotypic observations that
accounts for non-genetic systematic sources of
variation (Var(E)), so that EBVs are estimated accurately
• The linear mixed model provides such a framework

5
Example data and pedigree files
• Data file Pedigree file

6
Linear mixed model
In matrix notation, a mixed linear model may be represented as
y = Xb + Za + e
where
y = n x 1 vector of observations; n = number of records.
b = p x 1 vector of fixed effects; p = number of levels for fixed (lactation
number) effects
a = q x 1 vector of random animal effects; q = number of levels for
random effects
e = n x 1 vector of random residual effects
X = design matrix of order n x p, that relates records to fixed effects
Z = design matrix of order n x q, that relates records to random animal
effects
X and Z are both termed design or incidence matrices.
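As an illustration of how the incidence matrices are filled, the sketch below builds X and Z for a handful of made-up records. The records, the two lactation levels and the four animals are hypothetical and serve only to show the layout.

```python
import numpy as np

# Hypothetical records: (animal index, lactation number, observation)
records = [(0, 1, 4.5), (1, 2, 2.9), (2, 1, 3.9), (3, 2, 3.5), (1, 1, 5.0)]
n_animals = 4                      # q: number of levels of the random animal effect
lactations = [1, 2]                # p: levels of the fixed lactation effect

n = len(records)                   # n: number of records
y = np.array([rec[2] for rec in records])
X = np.zeros((n, len(lactations)))
Z = np.zeros((n, n_animals))
for row, (animal, lact, _) in enumerate(records):
    X[row, lactations.index(lact)] = 1.0   # links the record to its lactation level
    Z[row, animal] = 1.0                   # links the record to the animal that made it

print(X.shape, Z.shape)
```

Each row of X and of Z contains a single 1, and an animal with repeated records (animal 1 here) has more than one 1 in its column of Z.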

7
Assumptions of the linear mixed model
• It is assumed that residual effects are independently
distributed with variance $\sigma^2_e$; therefore $\operatorname{var}(e) = I\sigma^2_e = R$.
• $\operatorname{var}(a) = G = I\sigma^2_a$ or $A\sigma^2_a$, where A is the numerator
relationship matrix. Thus
\[
\operatorname{var}\begin{bmatrix} a \\ e \end{bmatrix}
= \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}
\]

8
MME for breeding values
• Mixed model equations (MME) with the relationship
matrix incorporated are
\[
\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + A^{-1}\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
\]
• with $\alpha = \sigma^2_e/\sigma^2_a = (1 - h^2)/h^2$
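A minimal sketch of how these equations can be assembled and solved numerically. The function name `solve_mme` and the toy data are hypothetical; the three animals are assumed unrelated, so A is an identity matrix, and the only fixed effect is an overall mean.

```python
import numpy as np

def solve_mme(X, Z, A_inv, y, alpha):
    """Assemble and solve the mixed model equations; returns (b_hat, a_hat)."""
    p = X.shape[1]
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + A_inv * alpha]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:p], sol[p:]

# Hypothetical toy data: 3 unrelated animals (A = I), one record each
h2 = 0.25                          # assumed heritability
alpha = (1 - h2) / h2              # sigma2_e / sigma2_a = 3.0
X = np.ones((3, 1))                # overall mean as the only fixed effect
Z = np.eye(3)                      # one record per animal
A_inv = np.eye(3)
y = np.array([10.0, 12.0, 14.0])

b_hat, a_hat = solve_mme(X, Z, A_inv, y, alpha)
print(b_hat, a_hat)
```

In this special case each breeding value is the deviation of the record from the mean multiplied by h2, which makes the shrinkage role of α visible. With real pedigrees A_inv would come from the pedigree rather than being an identity matrix.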


9
Limitations of A matrix
• The relationship matrix A based on pedigree is an
average relationship, which assumes infinite loci.
• Real relationships differ somewhat from this average because
the genome is of finite size
• Therefore A is the expectation of the realized relationships
• Two half-sibs, whose expected relationship is 0.25, might have a realized relationship of 0.3 or 0.2

10
Use of microsatellites as markers and
limitations
• Microsatellites were the genetic markers used initially, in the 1980s
and 1990s.
• Microsatellites are sets of short repeated DNA sequences at a
particular locus on a chromosome, which vary in number between
individuals and so can be used as markers
• The most significant marker can be 10 cM or more from the
QTL, so QTL are not mapped precisely.
• The association between marker and QTL may not persist across
the population.
• The phase between marker and QTL may have to be estimated for
each family

11
Single Nucleotide Polymorphism (SNP).
• SNP is a DNA sequence variation occurring when a
single nucleotide — A, T, C, or G — in the genome
differs between paired chromosomes in an individual.
• For example, two sequenced DNA fragments from an
individual, AAGCCTA and AAGCTTA, differ at
a single nucleotide.
• In this case we say that there are two alleles: C and T.
Almost all common SNPs have only two alleles.

Whole Genome Sequence & Genotyping chips
• Began with human (2001) and mouse, followed by WGS of the chicken (2004), the dog
(2005), bovine (2006), horse (2007), pig (2009), ...
• New technologies for genotyping and sequencing
– Simultaneous genotyping of many SNPs
– From a few dozen up to several million SNPs
• Two main technology providers, Illumina and Affymetrix
• Illumina products in cattle
– 3000 (7000 1000 20000) = « LD »
– 54 000 = « 50k »
– 777 000 = « HD »

14
Genomic Selection (GS)
• GS is the selection of animals on their genomic breeding values (GEBV).
• Genomic selection requires that markers (SNPs) are in linkage
disequilibrium (LD) with the QTL across the whole population
• The use of SNPs as markers thus enables all QTL in the genome
to be traced by tracing the chromosome segments
defined by adjacent SNPs.

15
Steps in Genomic Selection (GS)
• Genotype animals with phenotypes (sires with daughter records for
sex-limited traits)
• Estimate SNP solutions (the SNP key) in the reference population
• Validate in another data set, with its records excluded, to determine the accuracy of
the SNP key
• Genotype animals at birth or at a young age (no phenotypes), use the SNP key to
predict their GEBV, and select on it
Groups of animals:
– Reference population: genotyped and phenotyped animals
– Validation candidates: genotyped and phenotyped, but phenotypes excluded
– Selection candidates: genotyped but no phenotypes

16
Main advantages of Genomics
• Young bulls can be genotyped early in life and their breeding values
computed
• Can be used to select the young bulls to be progeny tested, thereby
reducing cost
• Accuracy for young bulls about 20-40% higher than that of parent
average
• Reduction in generation interval

17
Genomic Selection: efficiency
• Two main factors:
– Accuracy of SNP effect estimation
    size of reference population
    heritability of the trait
    statistical methodology used
– Linkage disequilibrium (LD) between markers and QTL
    marker density
    effective size of the population => number of « independent » segments
• Relationship between candidates and reference population

18
Size of the Reference population
 Greatly influences the accuracy of genomic evaluations
(Goddard, 2008)

Validation test
[Figure: two scatter plots, both axes running from 20 to 45. Left panel (training set): DYD against estimated BV. Right panel (training set and validation set): DYD (proxy of true BV) against GEBV.]
Two important parameters:
• R2 or r(DYD, GEBV) (should be « large enough »)
• slope of the regression (should be close to 1)
Overestimation = “inflation”
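The two validation parameters can be computed as in the sketch below. The GEBV and DYD values are made up for illustration, and the regression is taken as DYD on GEBV, so a slope below 1 means that differences among GEBV are larger than the differences later observed in DYD.

```python
import numpy as np

# Hypothetical validation bulls: GEBV predicted without their records, DYD observed later
gebv = np.array([24.0, 28.0, 31.0, 35.0, 38.0, 42.0])
dyd = np.array([25.0, 27.5, 30.0, 33.5, 35.0, 39.0])

r = np.corrcoef(dyd, gebv)[0, 1]           # r(DYD, GEBV)
r2 = r ** 2
# slope of the regression of DYD on GEBV = cov(DYD, GEBV) / var(GEBV)
slope = np.cov(dyd, gebv, ddof=1)[0, 1] / np.var(gebv, ddof=1)

print(f"R2 = {r2:.3f}, slope = {slope:.3f}")
```

For these invented numbers the correlation is high but the slope is clearly below 1, which is the pattern described as inflation.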

20
Increasing the size of the Reference
populations
• Genotype as many progeny-tested sires as possible
• International collaborations
– Holstein: 2 big consortia
    USA + Canada (~>35000 bulls) + UK + Italy
    Eurogenomics: France, The Netherlands, (Germany),
    Nordic countries, Spain, Poland (~34000 bulls (?))
• For small breeds or other species (goats, sheep, beef cattle): not
enough sires
– combine with many genotyped cows
– About 4-5 cow records provide information equivalent to one
proven sire (Goddard (2009) and Daetwyler et al. (2013))


21
General linear model
The general linear model underlying genomic evaluation is of the form
\[
y = Xb + \sum_{i=1}^{m} M_i g_i + e
\]
where
m is the number of SNPs; y is the data vector;
b is the vector of the mean or fixed effects;
$g_i$ is the genetic effect of the ith SNP genotype; and e is the error.
The matrix M is of dimension n (number of animals) by m, and $M_i$ relates
the ith SNP to the data.
It is assumed that all the additive genetic variance is explained by the marker
effects, so that the estimate of an animal’s total genetic merit or breeding value
(a) is
\[
a = \sum_{i=1}^{m} M_i g_i
\]

22
Data types used for genomic evaluation
• y = YD (yield deviation) = individual record corrected for
all fixed and non-genetic random effects
• y = DYD (daughter yield deviation) = for a bull, twice the average
of the YD of his daughters, corrected for ½ the genetic
merit of their dams, with associated weight = EDC
(Equivalent Daughter Contribution)
• y = de-regressed proofs, obtained by solving the MME
to get the right-hand side
• EBVs: NO

23
Coding and scaling genotypes
• The genotypes of animals (elements of M) are
commonly coded as 2 and 0 for the two homozygotes
(AA and BB) and 1 for the heterozygote (AB).
• Or if alleles are expressed in terms of nucleotides, and
reference allele at a locus is G and the alternative allele
is C, then code 0 = GG , 1 = GC and 2 = CC.
• The diagonal elements of MM’ then indicate the
individual relationship with itself (inbreeding) and the
off-diagonal indicate the number of alleles shared by
relatives
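A small sketch of this coding, counting copies of the alternative allele so that the genotype codes are 0, 1 and 2. The genotype calls, the alternative alleles and the helper name `code_genotype` are hypothetical.

```python
import numpy as np

def code_genotype(genotype: str, alt_allele: str) -> int:
    """Count the copies of the alternative allele in a two-letter genotype."""
    return sum(1 for allele in genotype if allele == alt_allele)

# Hypothetical calls for 3 animals at 2 SNPs; the alternative alleles are assumed known
calls = [["GG", "AT"],
         ["GC", "TT"],
         ["CC", "AA"]]
alt = ["C", "T"]

M = np.array([[code_genotype(g, alt[j]) for j, g in enumerate(row)] for row in calls])
MMt = M @ M.T     # animal-by-animal cross-products of the genotype codes

print(M)
print(MMt)
```

The off-diagonal elements of `MMt` grow with the number of alternative alleles two animals carry in common, and the diagonal elements reflect each animal's own genotype codes.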

24
Scaling of genotypes
• SNPs → 2 alleles (A/B), but only one effect is defined: the substitution effect $m_i$
• Commonly the elements of M are scaled
– to set the mean value of the allele effects to zero
– to account for differences in allele frequencies among the SNPs
– Let the frequency of the second (alternative) allele at locus j be $p_j$
– Elements of M can be scaled by subtracting $2p_j$.
– If every element of column j of a matrix P equals $2p_j$, then the matrix Z,
which contains the scaled elements of M, is Z = M - P.
• Furthermore, the elements of Z may be normalised by dividing
the column for marker j by its standard deviation, assumed to
be $\sqrt{2p_j(1 - p_j)}$.
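The scaling can be sketched in a few lines. The genotype matrix below is hypothetical, and the optional normalisation assumes the usual binomial standard deviation, the square root of 2p(1 - p), for each marker.

```python
import numpy as np

# Hypothetical genotype matrix M (4 animals x 3 SNPs), coded as copies of the alternative allele
M = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 2]], dtype=float)

p = M.sum(axis=0) / (2 * M.shape[0])     # frequency p_j of the alternative allele per SNP
P = np.tile(2 * p, (M.shape[0], 1))      # every row of P holds 2 p_j
Z = M - P                                # centred genotypes
Z_std = Z / np.sqrt(2 * p * (1 - p))     # optional: divide each column by its assumed SD

print(p)
print(Z)
```

When the frequencies are computed from the same animals, every column of Z sums to zero, which is what sets the mean of the allele effects to zero.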

25
Mixed linear model for computing SNP
effect
• The most common random model used assumes
– the effects of the SNPs are normally distributed,
– all SNP effects come from a common normal distribution (i.e. the same genetic variance for all
SNPs).
• There are two equivalent models with these assumptions
• (1) SNP-BLUP: a model fitting individual SNP effects simultaneously.
– Direct genomic values (DGV) for selection candidates are calculated as DGV = Zĝ, where ĝ are the estimates of
random SNP effects.
– Assumes $\sigma^2_g$ is known, but this may not be the case in practice, and $\sigma^2_g$ may be
approximated from $\sigma^2_a$.
• (2) GBLUP: a model that estimates DGV directly, with a (co)variance among
breeding values of $G\sigma^2_a$, where G is the genomic relationship matrix, the
realised proportion of the genome that animals share in common, estimated
from the SNPs.

26
SNP BLUP model
• In matrix form, the model is
• y = Xb + Zg + e
• y = vector of observations: these can be de-regressed
EBVs or phenotypes corrected for all fixed effects
• g = vector of additive genetic effects
corresponding to the allele substitution effects of each SNP,
and Z = scaled matrix of genotypes
• The MME are below, with $\alpha = \sigma^2_e/\sigma^2_g$
\[
\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + I\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{g} \end{bmatrix}
=
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
\]

27
SNP-BLUP
• If y in the MME = de-regressed breeding values of bulls,
then
– Each observation may be associated with a different reliability.
– Thus a weighted analysis may be required to account for these
differences in bull reliabilities.
– Weight ($wt_i$) = effective daughter contribution, or $wt_i = (1/rel_{dtr}) - 1$,
where $rel_{dtr}$ is the bull’s reliability from daughters
with parent information excluded

28
SNP-BLUP
• The MME then are
\[
\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + I\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{g} \end{bmatrix}
=
\begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}
\]
• where R = D and D is a diagonal matrix with diagonal
element i = $wt_i$.
• In practice, the value of $\sigma^2_g$ may not be known, and $\sigma^2_g$
could be obtained
– either as $\sigma^2_g = \sigma^2_a/m$, with m = the number of markers
– or as $\sigma^2_g = \sigma^2_a / \left(2\sum_j p_j(1 - p_j)\right)$
– and then $\alpha = 2\sum_j p_j(1 - p_j) \cdot [\sigma^2_e/\sigma^2_a]$
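A hedged sketch of the weighted equations. The function takes the diagonal of R^{-1} directly, so the caller decides how the weights enter; with R = D as above, that diagonal is the reciprocal of the diagonal of D. The function names and all numbers are hypothetical.

```python
import numpy as np

def snp_blup_weighted(X, Z, y, r_inv_diag, alpha):
    """Solve the weighted SNP-BLUP MME, with R^{-1} = diag(r_inv_diag)."""
    R_inv = np.diag(r_inv_diag)
    p, m = X.shape[1], Z.shape[1]
    lhs = np.block([[X.T @ R_inv @ X, X.T @ R_inv @ Z],
                    [Z.T @ R_inv @ X, Z.T @ R_inv @ Z + np.eye(m) * alpha]])
    rhs = np.concatenate([X.T @ R_inv @ y, Z.T @ R_inv @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:p], sol[p:]

def alpha_from_variances(p_freq, var_a, var_e):
    """alpha = 2 * sum(p_j (1 - p_j)) * var_e / var_a."""
    return 2.0 * np.sum(p_freq * (1.0 - p_freq)) * var_e / var_a

# Hypothetical inputs, for illustration only
p_freq = np.array([0.5, 0.25, 0.5])
Z = np.array([[1.0, -0.5, 0.0], [0.0, 0.5, -1.0], [-1.0, 0.5, 0.0], [0.0, -0.5, 1.0]])
X = np.ones((4, 1))
y = np.array([9.0, 13.0, 12.0, 15.0])
alpha = alpha_from_variances(p_freq, var_a=35.0, var_e=245.0)

b_w, g_w = snp_blup_weighted(X, Z, y, np.array([50.0, 80.0, 20.0, 60.0]), alpha)
b_1, g_1 = snp_blup_weighted(X, Z, y, np.ones(4), alpha)   # unit diagonal = unweighted MME
print(b_w, g_w)
```

With a unit diagonal the function reduces to the unweighted SNP-BLUP equations, which gives a simple way to check an implementation.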

29
Example 1
Animal  Sire  Dam  Mean  EDC  Fat DYD  SNP genotype (SNP 1 to 10)
13 0 0 1 558 9.0 2 0 1 1 0 0 0 2 1 2
14 0 0 1 722 13.4 1 0 0 0 0 2 0 2 1 0
15 13 4 1 300 12.7 1 1 2 1 1 0 0 2 1 2
16 15 2 1 73 15.4 0 0 2 1 0 1 0 2 2 1
17 15 5 1 52 5.9 0 1 1 2 0 0 0 2 1 2
18 14 6 1 87 7.7 1 1 0 1 0 2 0 2 2 1
19 14 9 1 64 10.2 0 0 1 1 0 2 0 2 2 0
20 14 9 1 103 4.8 0 1 1 0 0 1 0 2 2 0
21 1 3 1 13 7.6 2 0 0 0 0 1 2 2 1 2
22 14 8 1 125 8.8 0 0 0 1 1 2 0 2 0 0
23 14 11 1 93 9.8 0 1 1 0 0 1 0 2 2 1
24 14 10 1 66 9.2 1 0 0 0 1 1 0 2 0 0
25 14 7 1 75 11.5 0 0 0 1 1 2 0 2 1 0
26 14 12 1 33 13.3 1 0 1 1 0 2 0 1 0 0

30
Example 1
• The observations are the daughter yield deviations (DYD) for fat yield; the effective
daughter contribution (EDC) of each bull is also given.
• The EDC can be used as weights in the analysis, but they are ignored in this presentation
• The genetic variance for fat yield is assumed to be 35.241 kg² and the residual variance
245 kg²
• Animals 13 to 20 are taken as the reference population and animals 21 to 26 as
validation candidates.
• SNP effects are predicted using all 10 SNPs.

31
Computing the matrices we need
• The incidence matrix X = $1_q$, a column vector of ones, with q = 8, the number of
animals in the reference population
• X’ = [ 1 1 1 1 1 1 1 1]
• The computation of Z requires calculating the allele
frequency of each SNP.

32
Computing Matrices
• The allele frequency for the ith SNP was computed as
\[
p_i = \frac{\sum_{j=1}^{n} m_{ij}}{2n}
\]
with n = 14, the number of animals with genotypes, and $m_{ij}$ the
elements of M.
• The allele frequencies are 0.321, 0.179, 0.357, 0.357, 0.143,
0.607, 0.071, 0.964, 0.571 and 0.393, respectively.
• Using those frequencies, $2\sum_j p_j(1 - p_j) = 3.5383$. Thus α =
3.5383*(245/35.241) = 24.598
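These numbers can be reproduced from the genotypes in the Example 1 table. The sketch below is only a check of the arithmetic; the variable names are my own, and the variances 35.241 and 245 are the ones assumed for the example.

```python
import numpy as np

# SNP genotypes of animals 13 to 26 (rows) at the 10 SNPs (columns), from the Example 1 table
M = np.array([
    [2, 0, 1, 1, 0, 0, 0, 2, 1, 2],   # 13
    [1, 0, 0, 0, 0, 2, 0, 2, 1, 0],   # 14
    [1, 1, 2, 1, 1, 0, 0, 2, 1, 2],   # 15
    [0, 0, 2, 1, 0, 1, 0, 2, 2, 1],   # 16
    [0, 1, 1, 2, 0, 0, 0, 2, 1, 2],   # 17
    [1, 1, 0, 1, 0, 2, 0, 2, 2, 1],   # 18
    [0, 0, 1, 1, 0, 2, 0, 2, 2, 0],   # 19
    [0, 1, 1, 0, 0, 1, 0, 2, 2, 0],   # 20
    [2, 0, 0, 0, 0, 1, 2, 2, 1, 2],   # 21
    [0, 0, 0, 1, 1, 2, 0, 2, 0, 0],   # 22
    [0, 1, 1, 0, 0, 1, 0, 2, 2, 1],   # 23
    [1, 0, 0, 0, 1, 1, 0, 2, 0, 0],   # 24
    [0, 0, 0, 1, 1, 2, 0, 2, 1, 0],   # 25
    [1, 0, 1, 1, 0, 2, 0, 1, 0, 0],   # 26
], dtype=float)

n = M.shape[0]                           # 14 genotyped animals
p = M.sum(axis=0) / (2 * n)              # allele frequency of each SNP
two_sum_pq = 2 * np.sum(p * (1 - p))     # 2 * sum of p_j (1 - p_j)
alpha = two_sum_pq * 245.0 / 35.241      # variance ratio used in the SNP-BLUP MME

print(np.round(p, 3))
print(round(two_sum_pq, 4), round(alpha, 3))
```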


33
Z matrix
• Z = M – P and, for the reference animals 13 to 20 (rows) and the 10 SNPs (columns), is
\[
Z = \begin{bmatrix}
 1.357 & -0.357 &  0.286 &  0.286 & -0.286 & -1.214 & -0.143 & 0.071 & -0.143 &  1.214 \\
 0.357 & -0.357 & -0.714 & -0.714 & -0.286 &  0.786 & -0.143 & 0.071 & -0.143 & -0.786 \\
 0.357 &  0.643 &  1.286 &  0.286 &  0.714 & -1.214 & -0.143 & 0.071 & -0.143 &  1.214 \\
-0.643 & -0.357 &  1.286 &  0.286 & -0.286 & -0.214 & -0.143 & 0.071 &  0.857 &  0.214 \\
-0.643 &  0.643 &  0.286 &  1.286 & -0.286 & -1.214 & -0.143 & 0.071 & -0.143 &  1.214 \\
 0.357 &  0.643 & -0.714 &  0.286 & -0.286 &  0.786 & -0.143 & 0.071 &  0.857 &  0.214 \\
-0.643 & -0.357 &  0.286 &  0.286 & -0.286 &  0.786 & -0.143 & 0.071 &  0.857 & -0.786 \\
-0.643 &  0.643 &  0.286 & -0.714 & -0.286 & -0.214 & -0.143 & 0.071 &  0.857 & -0.786
\end{bmatrix}
\]
• We have now computed X and Z.
• The remaining matrices X’Z, Z’X and Z’Z are computed by
multiplication. Iα is then added to Z’Z and the MME are formed.
• When solved, we get these solutions:

34
Computing GEBVs
• Solutions
Mean effect: 9.944
SNP effects (ĝ):
SNP   ĝ
1     0.087
2    -0.311
3     0.262
4    -0.080
5     0.110
6     0.139
7     0.000
8     0.000
9    -0.061
10   -0.016
• The SNP solutions are also called the SNP key
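The solutions can be reproduced with a few lines of NumPy. This is a sketch of the unweighted SNP-BLUP analysis under the assumptions of the example (allele frequencies from all 14 genotyped animals, animals 13 to 20 as reference, variances 35.241 and 245); the variable names are my own.

```python
import numpy as np

# Genotypes of animals 13 to 26 (rows) at the 10 SNPs, from the Example 1 table
M = np.array([
    [2, 0, 1, 1, 0, 0, 0, 2, 1, 2], [1, 0, 0, 0, 0, 2, 0, 2, 1, 0],
    [1, 1, 2, 1, 1, 0, 0, 2, 1, 2], [0, 0, 2, 1, 0, 1, 0, 2, 2, 1],
    [0, 1, 1, 2, 0, 0, 0, 2, 1, 2], [1, 1, 0, 1, 0, 2, 0, 2, 2, 1],
    [0, 0, 1, 1, 0, 2, 0, 2, 2, 0], [0, 1, 1, 0, 0, 1, 0, 2, 2, 0],
    [2, 0, 0, 0, 0, 1, 2, 2, 1, 2], [0, 0, 0, 1, 1, 2, 0, 2, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 2, 2, 1], [1, 0, 0, 0, 1, 1, 0, 2, 0, 0],
    [0, 0, 0, 1, 1, 2, 0, 2, 1, 0], [1, 0, 1, 1, 0, 2, 0, 1, 0, 0],
], dtype=float)
dyd = np.array([9.0, 13.4, 12.7, 15.4, 5.9, 7.7, 10.2, 4.8])   # fat DYD of animals 13 to 20

p = M.sum(axis=0) / (2 * M.shape[0])     # allele frequencies from all 14 animals
Z = (M - 2 * p)[:8]                      # centred genotypes of the 8 reference animals
X = np.ones((8, 1))                      # overall mean
alpha = 2 * np.sum(p * (1 - p)) * 245.0 / 35.241

lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + np.eye(10) * alpha]])
rhs = np.concatenate([X.T @ dyd, Z.T @ dyd])
sol = np.linalg.solve(lhs, rhs)
mean_hat, g_hat = sol[0], sol[1:]

print(round(mean_hat, 3))
print(np.round(g_hat, 3))
```

SNPs 7 and 8 do not vary among the reference animals, so their centred columns are constant and their effects are estimated as zero.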

35
GEBVs for Validation animals
• The DGV for the reference animals (animals 13-20) are
then computed as Zĝ.
• For the validation animals (animals 21-26), DGV = Z2ĝ,
where Z2 contains the centred genotypes of the
validation candidates
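Applying the SNP key to the validation animals is a single matrix product. The sketch below takes the rounded SNP solutions of the example as given and centres the genotypes with the allele frequencies of all 14 genotyped animals; the variable names are my own.

```python
import numpy as np

# Genotypes of animals 13 to 26 (rows) at the 10 SNPs, from the Example 1 table
M = np.array([
    [2, 0, 1, 1, 0, 0, 0, 2, 1, 2], [1, 0, 0, 0, 0, 2, 0, 2, 1, 0],
    [1, 1, 2, 1, 1, 0, 0, 2, 1, 2], [0, 0, 2, 1, 0, 1, 0, 2, 2, 1],
    [0, 1, 1, 2, 0, 0, 0, 2, 1, 2], [1, 1, 0, 1, 0, 2, 0, 2, 2, 1],
    [0, 0, 1, 1, 0, 2, 0, 2, 2, 0], [0, 1, 1, 0, 0, 1, 0, 2, 2, 0],
    [2, 0, 0, 0, 0, 1, 2, 2, 1, 2], [0, 0, 0, 1, 1, 2, 0, 2, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 2, 2, 1], [1, 0, 0, 0, 1, 1, 0, 2, 0, 0],
    [0, 0, 0, 1, 1, 2, 0, 2, 1, 0], [1, 0, 1, 1, 0, 2, 0, 1, 0, 0],
], dtype=float)
# SNP key: the rounded SNP solutions of the example
g_hat = np.array([0.087, -0.311, 0.262, -0.080, 0.110, 0.139, 0.000, 0.000, -0.061, -0.016])

p = M.sum(axis=0) / (2 * M.shape[0])     # allele frequencies from all 14 animals
Z_all = M - 2 * p                        # centred genotypes
dgv_ref = Z_all[:8] @ g_hat              # animals 13 to 20
dgv_val = Z_all[8:] @ g_hat              # animals 21 to 26 (Z2 times g_hat)

print(np.round(dgv_val, 3))
```

Because the key is rounded to three decimals, the last digit of a DGV can differ by one unit from values computed with unrounded solutions.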

36
Solutions for validation animals
\[
\begin{bmatrix} \hat{a}_{21} \\ \hat{a}_{22} \\ \hat{a}_{23} \\ \hat{a}_{24} \\ \hat{a}_{25} \\ \hat{a}_{26} \end{bmatrix}
=
\begin{bmatrix}
 1.357 & -0.357 & -0.714 & -0.714 & -0.286 & -0.214 &  1.857 &  0.071 & -0.143 &  1.214 \\
-0.643 & -0.357 & -0.714 &  0.286 &  0.714 &  0.786 & -0.143 &  0.071 & -1.143 & -0.786 \\
-0.643 &  0.643 &  0.286 & -0.714 & -0.286 & -0.214 & -0.143 &  0.071 &  0.857 &  0.214 \\
 0.357 & -0.357 & -0.714 & -0.714 &  0.714 & -0.214 & -0.143 &  0.071 & -1.143 & -0.786 \\
-0.643 & -0.357 & -0.714 &  0.286 &  0.714 &  0.786 & -0.143 &  0.071 & -0.143 & -0.786 \\
 0.357 & -0.357 &  0.286 &  0.286 & -0.286 &  0.786 & -0.143 & -0.929 & -1.143 & -0.786
\end{bmatrix}
\begin{bmatrix} 0.087 \\ -0.311 \\ 0.262 \\ -0.080 \\ 0.110 \\ 0.139 \\ 0.000 \\ 0.000 \\ -0.061 \\ -0.016 \end{bmatrix}
=
\begin{bmatrix} 0.027 \\ 0.114 \\ -0.240 \\ 0.143 \\ 0.054 \\ 0.354 \end{bmatrix}
\]

37
GBLUP
• Equivalent model to SNP-BLUP
• BLUP MME but with $A^{-1}$ replaced by $G^{-1}$
• The DGV are computed directly, as the sum of the SNP
effects (a = Zg)
• The model is
• y = Xb + Wa + e
• where a = vector of DGVs and W is the design matrix
linking records to animals
• Matrix X is as defined before and W is an identity
matrix (a diagonal matrix with all diagonal elements =
1)

38
GBLUP
• Given that a = Zg,
• then $\operatorname{var}(a) = ZZ'\sigma^2_g$.
• Note that
\[
\sigma^2_g = \frac{\sigma^2_a}{2\sum_j p_j(1 - p_j)}
\]
• so the matrix ZZ’ can be scaled such that
\[
G = \frac{ZZ'}{2\sum_j p_j(1 - p_j)}
\]
• and $\operatorname{var}(a) = G\sigma^2_a$.
• Division by $2\sum_j p_j(1 - p_j)$ makes G analogous to A.
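The construction of G and the equivalence with SNP-BLUP can be illustrated on the 10-SNP example. The sketch below is not the MME formulation: to avoid inverting G, which can be singular when there are few SNPs, it uses the equivalent form in which the DGV are G (G + Iα)^{-1} (y - mean), with α = σ²e/σ²a. That choice, and the variable names, are my own assumptions.

```python
import numpy as np

# Genotypes of animals 13 to 26 (rows) at the 10 SNPs, from the Example 1 table
M = np.array([
    [2, 0, 1, 1, 0, 0, 0, 2, 1, 2], [1, 0, 0, 0, 0, 2, 0, 2, 1, 0],
    [1, 1, 2, 1, 1, 0, 0, 2, 1, 2], [0, 0, 2, 1, 0, 1, 0, 2, 2, 1],
    [0, 1, 1, 2, 0, 0, 0, 2, 1, 2], [1, 1, 0, 1, 0, 2, 0, 2, 2, 1],
    [0, 0, 1, 1, 0, 2, 0, 2, 2, 0], [0, 1, 1, 0, 0, 1, 0, 2, 2, 0],
    [2, 0, 0, 0, 0, 1, 2, 2, 1, 2], [0, 0, 0, 1, 1, 2, 0, 2, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 2, 2, 1], [1, 0, 0, 0, 1, 1, 0, 2, 0, 0],
    [0, 0, 0, 1, 1, 2, 0, 2, 1, 0], [1, 0, 1, 1, 0, 2, 0, 1, 0, 0],
], dtype=float)
dyd = np.array([9.0, 13.4, 12.7, 15.4, 5.9, 7.7, 10.2, 4.8])   # records of animals 13 to 20
var_a, var_e = 35.241, 245.0

p = M.sum(axis=0) / (2 * M.shape[0])
Z = M - 2 * p                            # centred genotypes, all 14 animals
k = 2 * np.sum(p * (1 - p))
G = Z @ Z.T / k                          # genomic relationship matrix (14 x 14)

ref = slice(0, 8)                        # animals with records
alpha = var_e / var_a
V_inv = np.linalg.inv(G[ref, ref] + np.eye(8) * alpha)   # inverse of var(y) / var_a
ones = np.ones(8)
mean_hat = (ones @ V_inv @ dyd) / (ones @ V_inv @ ones)  # generalised least squares mean
dgv = G[:, ref] @ V_inv @ (dyd - mean_hat)               # DGV of all 14 animals

# The same DGV through SNP effects: g_hat = Z_ref' V^{-1} (y - mean) / k
g_hat = Z[ref].T @ V_inv @ (dyd - mean_hat) / k
print(np.round(dgv, 3))
```

The DGV obtained through G equal Z times the SNP effects, which is the equivalence between the two models. Solutions obtained with a G built from other markers or regularised before inversion need not agree to the last digit.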

39
G matrix from 42K SNPs
Gall = (lower triangle shown; the matrix is symmetric)
13 0.957
14 -0.108 0.973
15 0.452 -0.116 1.182
16 0.209 -0.058 0.424 1.025
17 0.234 -0.083 0.425 0.312 1.037
18 -0.040 0.438 0.097 -0.047 -0.043 1.151
19 -0.089 0.458 0.039 -0.067 -0.070 0.426 1.175
20 -0.093 0.460 0.053 -0.058 -0.063 0.432 0.707 1.183
21 0.077 -0.082 0.064 0.104 0.082 -0.071 -0.069 -0.069 1.031
22 -0.056 0.418 0.093 -0.046 -0.038 0.408 0.355 0.342 -0.044 1.139
23 -0.005 0.464 -0.038 -0.035 -0.038 0.206 0.223 0.215 0.011 0.280 0.993
24 -0.070 0.468 0.075 -0.027 -0.053 0.403 0.521 0.550 -0.079 0.424 0.260 1.198
25 -0.052 0.416 0.098 -0.009 -0.031 0.386 0.363 0.342 -0.038 0.370 0.219 0.419 1.125
26 -0.070 0.493 -0.084 -0.039 -0.044 0.258 0.241 0.270 -0.072 0.253 0.178 0.259 0.214 1.009

40
A matrix for the same individuals
13 1.008
14 0.033 1.037
15 0.545 0.021 1.041
16 0.288 0.021 0.536 1.016
17 0.285 0.031 0.541 0.293 1.020
18 0.047 0.580 0.036 0.028 0.032 1.062
19 0.033 0.613 0.021 0.021 0.031 0.365 1.095 symmetric
20 0.033 0.613 0.021 0.021 0.031 0.365 0.613 1.095
21 0.099 0.031 0.082 0.118 0.074 0.028 0.031 0.031 1.021
22 0.046 0.586 0.032 0.031 0.039 0.351 0.373 0.373 0.044 1.068
23 0.096 0.569 0.067 0.043 0.047 0.329 0.357 0.357 0.042 0.338 1.050
24 0.041 0.574 0.027 0.019 0.026 0.331 0.406 0.406 0.028 0.335 0.335 1.056
25 0.033 0.548 0.035 0.039 0.039 0.315 0.336 0.336 0.037 0.321 0.310 0.310 1.029
26 0.035 0.588 0.023 0.024 0.039 0.337 0.376 0.376 0.036 0.347 0.341 0.348 0.325 1.070

41
GBLUP
• MME are
\[
\begin{bmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + G^{-1}\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'R^{-1}y \\ W'R^{-1}y \end{bmatrix}
\]
• where α now equals $\sigma^2_e/\sigma^2_a$. The solutions that follow are for the example data
tabulated earlier
• Advantages:
– Existing software for genetic evaluation can be used by replacing A with G
– The system of equations has the size of the number of animals, which tends to be smaller
than the number of SNPs.
– In pedigreed populations, G discriminates among sibs and other relatives,
capturing information on Mendelian sampling.
– The method is attractive for populations without good pedigree, as G will
capture this information among the genotyped individuals

42
Solutions for the example data
• Reference Animals
• 13 0.069
• 14 0.116
• 15 0.049
• 16 0.260
• 17 -0.500
• 18 -0.359
• 19 0.146
• 20 -0.231
•
• Selection or validation candidates
• 21 0.028
• 22 0.115
• 23 -0.240
• 24 0.143
• 25 0.054
• 26 0.353

43
Single Step Method
GBLUP computes genomic breeding values only for
genotyped animals.
How can non-genotyped animals benefit from genomic
information?
Let g2 be the genetic (genomic) values of genotyped animals and
g1 the genetic values of non-genotyped animals
An estimate of g1 based on genomic information is obtained by
regressing g1 on g2; it is combined with the information from BLUP through
the usual MME

44
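Single Step Method
• We define the variance of the vector of g1 (non-genotyped) and
g2 (genotyped) as
\[
H = \operatorname{var}\begin{bmatrix} g_1 \\ g_2 \end{bmatrix}
= \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix}
= \begin{bmatrix}
A_{11} + A_{12}A_{22}^{-1}(G - A_{22})A_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\
GA_{22}^{-1}A_{21} & G
\end{bmatrix}
\]
where subscript 1 refers to non-genotyped and subscript 2 to genotyped animals.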
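A small sketch of how H can be assembled from the blocks of A and from G. The function name `build_H`, the three-animal pedigree and the genomic relationships are hypothetical; the result is ordered with non-genotyped animals first and genotyped animals second.

```python
import numpy as np

def build_H(A, G, genotyped):
    """Single-step relationship matrix, non-genotyped animals first, genotyped second."""
    idx2 = np.asarray(genotyped)
    idx1 = np.setdiff1d(np.arange(A.shape[0]), idx2)
    A11 = A[np.ix_(idx1, idx1)]
    A12 = A[np.ix_(idx1, idx2)]
    A22 = A[np.ix_(idx2, idx2)]
    A22_inv = np.linalg.inv(A22)
    H11 = A11 + A12 @ A22_inv @ (G - A22) @ A22_inv @ A12.T
    H12 = A12 @ A22_inv @ G
    return np.block([[H11, H12], [H12.T, G]])

# Hypothetical pedigree relationships: animal 0 is the sire of half-sibs 1 and 2
A = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.25],
              [0.5, 0.25, 1.0]])
# Hypothetical genomic relationships of the genotyped animals 1 and 2
G = np.array([[1.02, 0.30],
              [0.30, 0.98]])

H = build_H(A, G, genotyped=[1, 2])
print(np.round(H, 3))
```

A useful sanity check is that H reduces to A when G equals A22, that is, when the genomic relationships add nothing to the pedigree.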

45
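Single Step Method
• The model is just as before but uses all data (genotyped
and non-genotyped animals):
\[
y = Xb + Za + e
\]
where a stacks the genetic values g1 and g2.
• The MME are the usual ones, but with $A^{-1}$ replaced by $H^{-1}$:
\[
\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{-1}\alpha \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
\]
• Surprisingly, $H^{-1}$ has a simple form:
\[
H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{bmatrix}
\]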

better lives through livestock
ilri.org
ILRI thanks all donors and organizations who globally supported its work through their contributions to the CGIAR Trust Fund
