Advanced Methods of Statistical Analysis used in Animal Breeding.
4. Data: gathering or obtaining the desired information under study.
Primary Source
Data are collected by the researcher directly.
They can be collected through experiments, surveys, questionnaires,
interviews, and observations.
Secondary Source
Data come from resources that have already been published.
They were collected, compiled or written by other researchers, e.g. books,
journals, newspapers.
Any reference must be acknowledged.
5. Variables : any measurable characteristic or quantity which can assume
a range of numerical values within certain limits, e.g. income, height, age,
weight, price.
• Discrete: a discrete variable may assume only a countable number of
values; intermediate values are not meaningful.
• Examples: mastitis, disease incidence
• Continuous: a continuous variable may assume any real value within some
range; it takes fractional or integral values.
• Examples: milk yield, fat yield, body weight
7. A mathematical model is an equation or a set of equations which
represents the behavior of a system.
• Linear Model : a unit increase in an independent variable causes a
proportionate change in the dependent variable; the parameters enter the
model linearly.
• Y = β0 + β1X + β2X2 + e
A linear model spells out exactly which effects influence which observation.
The different effects (such as breed and feeding regime) are estimated
simultaneously, and in the process they are corrected for each other.
Linear models are the most common type of statistical model used in animal
breeding to predict breeding values from phenotypic observations.
• Non-linear Model : if one or more parameters of a model do not appear
linearly, the model is known as a nonlinear model.
• Y = a X^b e^{-cX} + e
3/11/2016
MODEL
8. The model usually consists of factors.
• Discrete factors or class variables such as sex, year, herd
• Continuous factors or covariables such as age
The model contains 3 components :
Predictand: the dependent variable
Predictor: the independent variable
Error term
Model
9. Types of Analysis :
(A) Univariate Analysis : one variable is used to describe a person, place,
or thing, e.g. individuals are grouped on the basis of a single performance
trait.
(B) Bivariate Analysis : two variables are used.
(C) Multivariate Analysis : more than two variables are used.
10. Univariate Analysis :
1. Linear Regression Model
2. Least Squares Model
a. Random Effect Model: heritability and repeatability estimation
b. Fixed Effect Model
c. Mixed Effect Model: BLUP
11. Linear Regression Model :
Yi = β0 + β1Xi + ei
where Yi = dependent variable
Xi = independent variable
β0 and β1 are fixed but unknown parameters; they cannot be found exactly
and must be estimated from the data.
ei = error term
The estimation of the regression coefficients is based on the principle of
Least Squares Analysis.
13. Random Effects
Effects whose levels are considered to be drawn from an infinitely large
population of levels.
Animal effects are often random: in repeated experiments other animals may
be drawn from the population, e.g. sire and dam effects.
Fixed Effects
Effects for which the defined classes comprise all the possible levels
of interest, e.g. herd, season and year effects.
Effects can be considered fixed when the number of levels is relatively
small and remains confined to this number under repeated sampling.
14. Predictors: used for prediction of random effects.
Estimators: used for estimation of fixed effects.
Principle of Least Squares Analysis :
The sum of squared differences between the observed and the estimated/predicted
values of the dependent variable must be a minimum (zero for a perfect fit).
If \hat{Y}_i = b_0 + b_1 X_i, then
\sum_i (Y_i - \hat{Y}_i)^2 = \sum_i e_i^2 is minimised.
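The least squares principle for the simple regression line can be sketched in a few lines of Python. This is only an illustration: the function names and the small data set (age at first calving against daily milk yield) are hypothetical and not taken from the slides.

```python
def least_squares_line(x, y):
    """Return (b0, b1) that minimise sum((y_i - b0 - b1 * x_i) ** 2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residual_sum_of_squares(x, y, b0, b1):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical records: age at first calving (months), milk yield (kg/day)
age = [24.0, 26.0, 28.0, 30.0, 32.0]
milk = [10.1, 11.2, 11.8, 13.1, 13.9]

b0, b1 = least_squares_line(age, milk)
rss = residual_sum_of_squares(age, milk, b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, residual SS = {rss:.4f}")
```

Any other choice of b0 and b1 gives a residual sum of squares at least as large as the one printed here.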
15. Random Effect Model :
Yij = μ + Si + eij
where Yij = jth observation of the ith sire
μ = general mean effect
Si = effect of the ith sire
eij = error term
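As a rough sketch of how the sire model is used for heritability estimation, the code below runs a balanced one-way analysis of variance and converts the sire variance component into a heritability. Two things are assumptions not stated on the slide: the progeny are treated as paternal half-sibs, so that h2 = 4 x sire variance / (sire variance + error variance), and the records are hypothetical.

```python
def sire_variance_components(groups):
    """Balanced one-way ANOVA for Y_ij = mu + S_i + e_ij.

    groups: one list of progeny records per sire, all of equal length.
    Returns (sire variance, error variance).
    """
    s = len(groups)          # number of sires
    k = len(groups[0])       # progeny per sire (balanced design assumed)
    assert all(len(g) == k for g in groups)

    grand_mean = sum(sum(g) for g in groups) / (s * k)
    sire_means = [sum(g) / k for g in groups]

    ss_between = k * sum((m - grand_mean) ** 2 for m in sire_means)
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, sire_means) for y in g)

    ms_between = ss_between / (s - 1)
    ms_within = ss_within / (s * (k - 1))

    var_e = ms_within
    var_s = (ms_between - ms_within) / k   # E(MS between) = var_e + k * var_s
    return var_s, var_e


# Hypothetical progeny records for three sires
records = [[8, 14, 10, 14], [10, 16, 12, 14], [7, 13, 9, 11]]

var_s, var_e = sire_variance_components(records)
# Assumption: paternal half-sib families, so var_s estimates 1/4 of the additive variance
h2 = 4 * var_s / (var_s + var_e)
print(f"sire variance = {var_s:.3f}, error variance = {var_e:.3f}, h2 = {h2:.3f}")
```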
16. Fixed Effect Model :
Yij = μ + Fi + eij
where Yij = jth observation of the ith herd
μ = general mean effect
Fi = effect of the ith herd
eij = error term
17. Mixed Effect Models
Mixed models are used so that fixed effects and breeding values (treated
as 'random effects') are estimated jointly.
18. Yijk = μ + Fi + Sj + eijk
where Yijk = kth observation of the jth sire in the ith farm
μ = general mean effect
Fi = effect of the ith farm
Sj = effect of the jth sire
eijk = error term
20. BLUP
BLUP (Best Linear Unbiased Prediction) of breeding values is based on a
linear mixed model.
21. BLUP estimates of the realized values of the random variables u are:
Best in the sense that they have minimum mean squared error within the
class of linear unbiased estimators, i.e. the best prediction is the one
that minimises the prediction error;
Linear in the sense that they are linear functions of the data, y;
Unbiased in the sense that the average value of the estimate is equal to
the average value of the quantity being estimated;
Predictors to distinguish them from estimators of fixed effects.
(G.K. Robinson, 1991)
22. Best: maximizes the correlation between the true and the estimated
values of effects by minimizing the error variance.
Linear: the estimates of the required factors are linear functions of the
observations.
Unbiased: estimates of fixed effects and estimable functions are such
that E(T) = θ.
23. 3 different kinds of BLUP (Henderson, 1973)
Henderson model-1: only random effects
Henderson model-2: both random and fixed effects
Henderson model-3: random + fixed + interaction effects
24. The general linear mixed model equation in matrix form is
y = Xβ + Zu + e
where
y is an n × 1 vector of n observed records
X is a known incidence matrix of order n × p, which relates the records in y
to the fixed effects in β
β is a p × 1 vector of p levels of fixed effects (to be estimated)
Z is a known incidence matrix of order n × q, which relates the records in y
to the random effects in u
u is a q × 1 vector of q levels of random effects, such as individual genetic
values (to be predicted)
e is an n × 1 vector of random residual terms
25. Expectations and Variance Covariance (VCV) Matrices
With E(u) = 0 and E(e) = 0, the expectation of y is
E(y) = Xβ
which is also known as the 1st moment. The 2nd moments describe the
variance covariance structure of y:
Var(u) = G, Var(e) = R, Cov(u, e) = 0
G is the dispersion matrix of the random effects other than errors and
R is the dispersion matrix of the error terms; both are square matrices
assumed to be non-singular and positive definite, with elements that are
assumed known.
We usually write V = Var(y) = ZGZ^T + R
26. Estimating Fixed Effects & Predicting Random Effects :
For a mixed model, y, X and Z are observed, whereas β, u, R and G are
generally unknown.
Two complementary estimation issues:
Estimation of β and u
Estimation of fixed effects:
BLUE = Best Linear Unbiased Estimator
\hat{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} y
Prediction of random effects:
BLUP = Best Linear Unbiased Predictor
\hat{u} = G Z^T V^{-1} (y - X \hat{\beta})
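The BLUE and BLUP formulas can be evaluated directly for a small example. The sketch below assumes NumPy is available and uses a hypothetical data set (five records, two herds as fixed effects, three unrelated sires as random effects); the variance components are simply assumed to be known, as on the previous slide.

```python
import numpy as np

# Hypothetical records: 5 observations, 2 herds (fixed), 3 sires (random)
y = np.array([9.0, 12.0, 11.0, 6.0, 7.0])
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)  # herd of each record
Z = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0],
              [0, 0, 1], [0, 1, 0]], dtype=float)                    # sire of each record

# Assumed known variance components (illustrative values only)
sigma2_s, sigma2_e = 1.0, 4.0
G = sigma2_s * np.eye(3)   # sires treated as unrelated
R = sigma2_e * np.eye(5)

# V = Z G Z^T + R
V = Z @ G @ Z.T + R
V_inv = np.linalg.inv(V)

# BLUE of fixed effects: (X^T V^-1 X)^-1 X^T V^-1 y
beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)

# BLUP of random effects: G Z^T V^-1 (y - X beta_hat)
u_hat = G @ Z.T @ V_inv @ (y - X @ beta_hat)

print("herd effects (BLUE):", np.round(beta_hat, 3))
print("sire effects (BLUP):", np.round(u_hat, 3))
```

Inverting V directly is fine for a toy example; it becomes impractical when the number of records is large.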
27. • BLUP eliminates the non-genetic biases in estimating breeding values.
• It also removes genetic biases by taking into account the effects of
non-random mating, the genetic merit of dams, and selection.
29. Advantages:
– Handles unbalanced designs
– Uses information from all measured relatives to improve estimates
BLUP can be used to estimate a variety of genetic values:
– GCA, SCA, line values (i.e., genotypic values of pure lines)
– One can also use BLUP to estimate environmental effects and G x E
30. REML Variance Component Estimation
REML = Restricted Maximum Likelihood.
Standard ML variance estimation assumes the fixed effects are known
without error.
REML is an approach that produces unbiased estimators in some special
cases and less biased estimates than ML in general.
Depending on whom you ask, REML stands for Residual Maximum
Likelihood or Restricted Maximum Likelihood.
31. Variance components are estimated by REML from residuals calculated
after fitting the fixed-effects part of the model by ordinary least squares.
It maximizes a marginal likelihood function, so it is also called Residual
Maximum Likelihood or Marginal Maximum Likelihood.
For linear mixed effects models, the REML estimators of variance
components produce the same estimates as the unbiased ANOVA-based
estimators formed by taking appropriate linear combinations of mean
squares, when the latter are positive and the data are balanced.
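36. ML vs REML
Before getting onto iterative algorithms, it is helpful to review the difference
between the log-likelihood function l used to calculate maximum likelihood
estimates and the function l_2 used for REML. In their standard forms, with H
the variance matrix of y and \hat{\tau} the generalised least squares estimate of \tau:
l = const - (1/2) \log|H| - (1/2) (y - X\tau)^T H^{-1} (y - X\tau)
l_2 = const - (1/2) \log|H| - (1/2) \log|X^T H^{-1} X| - (1/2) (y - X\hat{\tau})^T H^{-1} (y - X\hat{\tau})
The term \log|X^T H^{-1} X| makes the adjustment for the degrees of freedom used in
estimating treatment effects, so that REML estimates of variance components
are less biased than ML estimates.
The other major differences are:
l_2 is not a function of the fixed effects \tau
The constant in l_2 is a function of the fixed design matrix X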
38. ML vs REML
ML estimates of variance components are biased because no account is taken
of the degrees of freedom lost in estimating the fixed effects.
REML takes care of this bias and also avoids negative estimates of
variance components.
(Searle et al., 1992)
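The degrees-of-freedom argument is easiest to see in the simplest possible model, y_i = μ + e_i, where the only fixed effect is the mean. There the ML estimate of the error variance divides the residual sum of squares by n, whereas the REML estimate divides by n - 1. The sketch below uses hypothetical records and a hypothetical function name.

```python
def ml_and_reml_variance(y):
    """Error variance estimates for the model y_i = mu + e_i.

    ML divides the residual sum of squares by n.
    REML divides by n - 1, accounting for the one degree of freedom
    used to estimate the fixed effect mu.
    """
    n = len(y)
    mean = sum(y) / n
    ss = sum((v - mean) ** 2 for v in y)
    return ss / n, ss / (n - 1)


# Hypothetical records
records = [18.0, 20.0, 22.0, 24.0, 26.0]
ml_var, reml_var = ml_and_reml_variance(records)
print(f"ML estimate = {ml_var}, REML estimate = {reml_var}")
```

The ML estimate is always the smaller of the two, which is the downward bias mentioned above.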
39. Genetic Evaluation
REML and BLUP applied to multi-trait mixed models have become the standard
method for genetic evaluation in all terrestrial animal species.
The main benefits of using this methodology include:
(1) increasing the accuracy of selection;
(2) managing the accumulation of inbreeding;
(3) estimating genetic trend without a control;
(4) the possibility of conducting large-scale genetic evaluation across
populations.
(N.H. Nguyen and R.W. Ponzoni, Vol. 29 No. 3 & 4, Jul-Dec 2006)
41. Multivariate analysis consists of a collection of methods
that can be used when several measurements are made
on each individual or object in one or more samples.
44. 1. Multivariate Regression :
Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + e_i
Y_i = \beta_0 + \sum_{j=1}^{n} \beta_j X_j + e_i
2. Discriminant Analysis :
Used when there is more than one population and each animal in the
population has multiple characters.
45. • Purpose : to find the Discriminant Function (D) that increases the
differences among populations by minimising the variances within the
populations and maximising the mean differences between the
populations with respect to the characters.
• D = λ1 d1 + λ2 d2 + λ3 d3
• = ∑ λi di
where di = ith mean difference of the populations in relation to the
character,
λi = weighting coefficient attached to the difference.
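One common way to obtain the weighting coefficients λi is Fisher's solution, in which the vector of weights equals the inverse of the pooled within-population covariance matrix times the vector of mean differences. The slides do not say which solution was used, so the sketch below is an assumption; it relies on NumPy and on hypothetical records for two populations and three characters.

```python
import numpy as np


def discriminant_weights(pop1, pop2):
    """Fisher-type weights: lambda = S^-1 d, with S the pooled covariance."""
    d = pop1.mean(axis=0) - pop2.mean(axis=0)          # mean differences d_i
    n1, n2 = len(pop1), len(pop2)
    pooled = ((n1 - 1) * np.cov(pop1, rowvar=False)
              + (n2 - 1) * np.cov(pop2, rowvar=False)) / (n1 + n2 - 2)
    lam = np.linalg.solve(pooled, d)                   # weighting coefficients
    return lam, d


# Hypothetical records: milk yield (kg/day), fat (%), body weight (100 kg)
pop1 = np.array([[12.0, 4.1, 4.5], [14.0, 3.9, 4.8], [11.0, 4.4, 4.4],
                 [13.0, 4.0, 5.0], [15.0, 3.8, 4.6]])
pop2 = np.array([[8.0, 5.0, 3.9], [9.0, 4.8, 4.2], [7.0, 5.3, 4.0],
                 [10.0, 4.7, 4.4], [8.5, 5.1, 3.8]])

lam, d = discriminant_weights(pop1, pop2)
D = float(lam @ d)   # D = sum of lambda_i * d_i
print("weights:", np.round(lam, 3), " D =", round(D, 3))
```

With these weights, D equals the difference between the mean discriminant scores of the two populations.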
46. 3. Principal Component (Z) Analysis :
Linear combinations of the characters are formed so as to maximise the
variance accounted for in the original set of characters.
• Z1 = a1 x1 + a2 x2 + a3 x3 + a4 x4
• Z2 = b1 x1 + b2 x2 + b3 x3 + b4 x4
• Z3 = c1 x1 + c2 x2 + c3 x3 + c4 x4
• Z4 = d1 x1 + d2 x2 + d3 x3 + d4 x4
ai, bi, ci, di are the relative weighting factors attached to each character, with
\sum a_i^2 = \sum b_i^2 = \sum c_i^2 = \sum d_i^2 = 1
It is mainly confined to a single population.
The principal component with the highest variance is the 1st principal component.
The principal component with the next highest variance is the 2nd principal component.
In India, this analysis was first reported in large animals by Dhara and
Chakravarty (1996), for predicting the breeding value of milk production on
the basis of a selected number of principal components.
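The weighting factors of the principal components can be obtained as the eigenvectors of the covariance matrix of the characters. The sketch below assumes NumPy and uses hypothetical records on four characters; working from the covariance matrix of mean-centred data (rather than the correlation matrix) is a choice made here, not something stated on the slide.

```python
import numpy as np


def principal_components(data):
    """Return (variances, weights, scores); row r of weights defines Z_(r+1)."""
    centred = data - data.mean(axis=0)            # characters measured from their means
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # reorder to descending variance
    variances = eigvals[order]
    weights = eigvecs[:, order].T
    scores = centred @ weights.T                  # Z values for every animal
    return variances, weights, scores


# Hypothetical records: milk (kg/day), fat (%), body weight (kg), height (cm)
data = np.array([[12.0, 4.1, 450.0, 130.0],
                 [14.0, 3.9, 480.0, 134.0],
                 [11.0, 4.4, 440.0, 128.0],
                 [13.0, 4.0, 500.0, 136.0],
                 [15.0, 3.8, 460.0, 131.0],
                 [10.0, 4.6, 430.0, 127.0]])

variances, weights, scores = principal_components(data)
print("share of variance:", np.round(variances / variances.sum(), 4))
print("weights of Z1:", np.round(weights[0], 4))
```

Each row of weights has squares summing to 1, and the first row belongs to the component with the highest variance.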
47. 4. Genetic Divergence Analysis / Genetic distance analysis / D2
analysis : (given by P.C. Mahalanobis, 1928)
D2 is obtained by summing the squares of the deviations of the same
transformed or untransformed traits between two genotypes, in various
combinations:
D^2 = \sum_{i=1}^{p} d_i^2 = \sum_{i=1}^{p} (Y_i^j - Y_i^k)^2
i = trait index, varying from 1 to p,
j, k = genotypes, j ≠ k.
D2 follows a chi-square distribution with p degrees of freedom.
The more critical the trait, the more distinctly different groups develop.
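The link between the sum-of-squares form of D2 and the usual matrix form can be checked numerically. In the sketch below (NumPy assumed), the traits are transformed to uncorrelated variables with a Cholesky factor of the pooled covariance matrix; the choice of the Cholesky transformation, the genotype means and the covariance matrix are all hypothetical.

```python
import numpy as np


def d_square(mean_j, mean_k, pooled_cov):
    """Matrix form: D^2 = d^T S^-1 d for genotypes j and k."""
    diff = mean_j - mean_k
    return float(diff @ np.linalg.solve(pooled_cov, diff))


def d_square_transformed(mean_j, mean_k, pooled_cov):
    """Sum of squared differences of uncorrelated (transformed) traits."""
    chol = np.linalg.cholesky(pooled_cov)      # lower L with L L^T = S
    y_j = np.linalg.solve(chol, mean_j)        # transformed trait means, genotype j
    y_k = np.linalg.solve(chol, mean_k)        # transformed trait means, genotype k
    return float(np.sum((y_j - y_k) ** 2))


# Hypothetical trait means (milk kg/day, fat %, body weight kg) and covariance
mean_j = np.array([12.0, 4.0, 450.0])
mean_k = np.array([9.0, 4.8, 410.0])
pooled_cov = np.array([[4.0, -0.3, 10.0],
                       [-0.3, 0.09, -0.5],
                       [10.0, -0.5, 400.0]])

print("D2 (matrix form):     ", round(d_square(mean_j, mean_k, pooled_cov), 4))
print("D2 (transformed sum): ", round(d_square_transformed(mean_j, mean_k, pooled_cov), 4))
```

Both functions return the same value, since the transformation only re-expresses the same distance in uncorrelated units.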
48. Few Notes on GD :
GD helps to maintain variation between animals within a population, since
many groups can be formed within the population.
It explains the extent and range of variability within a population.
It explains the evolutionary divergence.
A progeny testing programme needs GD for distinguishing unrelated sires.
49. 5. Canonical Variate Analysis
Used in a single population.
The traits of the animals are divided into two sets and the relationship
between the two sets is evaluated.
2 sets of characters: Y set - response characters
X set - predictor characters
The Y set is maximally correlated with the X set.
• M = p1 y1 + p2 y2
• N = q1 x1 + q2 x2 + q3 x3
The canonical coefficients (p1, p2 and q1, q2, q3) are chosen in such a way that
the correlation between the two canonical variates (M and N) becomes
maximum; that correlation is called the canonical correlation.
In India, first used on dairy buffaloes by Thomas and Chakravarty (1999).
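A minimal sketch of how the first pair of canonical variates can be computed is given below. It assumes NumPy and uses one standard route (whitening each set with a Cholesky factor and taking a singular value decomposition), which is not necessarily the procedure behind the slide; the simulated records for two Y characters and three X characters are hypothetical.

```python
import numpy as np


def first_canonical_pair(y_set, x_set):
    """Return (canonical correlation, p, q) with M = Y p and N = X q."""
    yc = y_set - y_set.mean(axis=0)
    xc = x_set - x_set.mean(axis=0)
    n = len(yc)
    s_yy = yc.T @ yc / (n - 1)
    s_xx = xc.T @ xc / (n - 1)
    s_yx = yc.T @ xc / (n - 1)

    l_y = np.linalg.cholesky(s_yy)               # S_yy = L_y L_y^T
    l_x = np.linalg.cholesky(s_xx)               # S_xx = L_x L_x^T
    # K = L_y^-1 S_yx L_x^-T, built with two triangular solves
    k = np.linalg.solve(l_y, s_yx)
    k = np.linalg.solve(l_x, k.T).T

    u, d, vt = np.linalg.svd(k)                  # largest singular value comes first
    p = np.linalg.solve(l_y.T, u[:, 0])          # canonical coefficients for the Y set
    q = np.linalg.solve(l_x.T, vt[0])            # canonical coefficients for the X set
    return d[0], p, q


# Hypothetical simulated records: 30 animals, 3 X characters, 2 Y characters
rng = np.random.default_rng(0)
x_set = rng.normal(size=(30, 3))
effects = np.array([[0.8, 0.1], [0.2, 0.5], [0.0, 0.3]])
y_set = x_set @ effects + rng.normal(scale=0.5, size=(30, 2))

r, p, q = first_canonical_pair(y_set, x_set)
M = y_set @ p
N = x_set @ q
print("canonical correlation:", round(float(r), 4))
```

The correlation between M and N computed from the scores matches the returned canonical correlation.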
50. Fundamentals of a Bayesian Analysis :
Formulate a probability model for the data.
Decide on a prior distribution, which quantifies the uncertainty in the values
of the unknown model parameters before the data are observed.
Observe the data and combine the likelihood with the prior distribution to
obtain the posterior distribution.
Summarize important features of the posterior distribution, or calculate
quantities of interest based on the posterior distribution.
These quantities constitute statistical outputs, such as point estimates and
intervals.
51. Bayesian inference: a form of inference which regards parameters as being
random variables possessed of prior distributions reflecting the accumulated
state of knowledge.
Bayes estimation: the estimation of population parameters by the use of
methods of inverse probability.
(A.L. Pretorius and A.J. van der Merwe, 2000)
52. Bayes' theorem is based on conditional probability.
Conditional probability :
The probability of event B occurring when it is known that some event A
has already occurred; it is denoted by P(B|A). It applies when A and B are
dependent events.
P(B|A) = P(A ∩ B) / P(A)
P(A ∩ B) = P(A) × P(B|A)
or
P(A ∩ B) = P(B) × P(A|B)
53. If B_1, B_2, …, B_K are mutually disjoint events with P(B_i) ≠ 0
(i = 1, 2, …, K), then for any arbitrary event A which is a subset of
\bigcup_{i=1}^{K} B_i such that P(A) ≠ 0, we have
P(B_i | A) = \frac{P(B_i) P(A | B_i)}{\sum_{j=1}^{K} P(B_j) P(A | B_j)}
54. 1. The probabilities P(B_1), P(B_2), …, P(B_K) are called "a priori
probabilities" as they exist before we get any information from the
experiment itself.
2. The probabilities P(A | B_i), i = 1, 2, …, K, are called "likelihoods" because
they indicate how likely the event A under consideration is to occur given
each B_i.
3. The probabilities P(B_i | A) are called "posterior probabilities" because they
are determined after the results of the experiment are known.
P(B_i | A) = \frac{P(B_i) P(A | B_i)}{\sum_{j=1}^{K} P(B_j) P(A | B_j)}
55. Few Important Notes :
The notions of "prior" and "posterior" in Bayes' theorem are relative to a
given sample outcome. That is, if a posterior distribution has been
determined from a particular sample, this posterior distribution is
considered the prior distribution relative to a new sample.
Prior → Posterior-1 → Posterior-2
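The chain from prior to Posterior-1 to Posterior-2 can be illustrated with a small discrete example. The scenario is hypothetical: a sire is either a carrier (B1) or a non-carrier (B2) of a recessive defect, with equal prior probabilities, and each observation A is a normal calf out of a known carrier dam, so that P(A|B1) = 0.75 and P(A|B2) = 1.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: P(B_i | A) = P(B_i) P(A | B_i) / sum_j P(B_j) P(A | B_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                      # this is P(A)
    return [j / total for j in joint]


# Hypothetical a priori probabilities: B1 = carrier sire, B2 = non-carrier sire
priors = [0.5, 0.5]
# Likelihood of one normal calf from a carrier dam under B1 and B2
likelihoods = [0.75, 1.0]

posterior_1 = posterior(priors, likelihoods)        # after the first normal calf
posterior_2 = posterior(posterior_1, likelihoods)   # Posterior-1 acts as the new prior

print("Posterior-1:", [round(p, 4) for p in posterior_1])
print("Posterior-2:", [round(p, 4) for p in posterior_2])
```

Updating twice with one calf at a time gives the same result as a single update with the likelihood of two normal calves (0.75 squared against 1).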
56. References :
Thompson, R. and Mantysaari, E. (2004) Prospects for statistical methods
in animal breeding. Jour. Ind. Soc. Ag. Statistics 57 (Special Volume): 15-25.
Narain, P. Statistics and Its Application to Agriculture and Genetics.
IARI, New Delhi.
Chakravarty, A.K. Multivariate Analysis in Animal Breeding. NDRI, Karnal.
Verbyla, A.P. (1990) A conditional derivation of residual maximum
likelihood. Australian Journal of Statistics 32: 227-230.
Henderson, C.R. (1975) Best linear unbiased estimation and prediction
under a selection model. Biometrics 31: 423-447.
Henderson, C.R. (1976) A simple method for the inverse of a numerator
relationship matrix used in prediction of breeding values. Biometrics
32: 69-83.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2003) Bayesian Data
Analysis, 2nd ed. London, Chapman & Hall.
Carlin, B.P. and Louis, T.A. (2000) Bayes and Empirical Bayes Methods for
Data Analysis, 2nd ed. Boca Raton, Chapman & Hall.
Duchateau, L., Janssen, P. and Rowlands, G.L. (1998) Linear Mixed Models. An
introduction with applications in veterinary research. ILRI, Nairobi, Kenya,
159-170.