1. ACHARYA N.G RANGA AGRICULTURAL
UNIVERSITY
Submitted by :-
P.TEJASREE
TAD/2023-10
Ph.D 1st Year
Dept of GPBR
Genomic Selection
Submitted to :-
Dr. V.L.N. Reddy
Professor & Head
Dept of Molecular Biology
and Biotechnology
S.V AGRICULTURAL COLLEGE, TIRUPATHI
2. Genomic selection (GS) is a form of marker-assisted selection that uses markers across the entire genome
to estimate genomic estimated breeding values (GEBVs), taking the additive genetic effects of all genes into
account. It was proposed by Meuwissen et al. (2001).
• GS uses a training population of known phenotypes and
genotypes to construct a model of each marker’s effect on
the trait. The model is then applied to predict the
phenotypic performance of the untested individuals having
only genotypes.
• The GEBV of an individual is the sum total of effects
associated with all the marker alleles present in the
individual.
• The phenotype is used as the response and the genotype as the predictor.
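The definition of the GEBV as a sum of marker-allele effects can be illustrated with a minimal sketch. The function name `gebv`, the 0/1/2 allele-dosage coding and the effect values are assumptions made for illustration; they do not come from any study cited in these slides.

```python
def gebv(genotype, marker_effects):
    """Sum the marker effects weighted by allele dosage (0, 1 or 2 copies)."""
    if len(genotype) != len(marker_effects):
        raise ValueError("one effect is needed per marker")
    return sum(g * e for g, e in zip(genotype, marker_effects))


# Hypothetical effects for four markers (trait units per allele copy),
# as they would be estimated from a training population.
effects = [0.25, -0.5, 0.125, 0.0]

# Two untested lines that have genotypes only.
line_a = [2, 0, 1, 2]
line_b = [0, 2, 2, 1]

print("GEBV line A:", gebv(line_a, effects))
print("GEBV line B:", gebv(line_b, effects))
```

In this toy case line A would rank above line B, although neither line has been phenotyped.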
3. • A GS model that uses information about known QTLs has been termed gene-assisted genomic selection
(Rutkoski et al. 2010).
• Prediction accuracy is estimated as the correlation between the GEBVs and the measured phenotypes of the
individuals for which phenotypes are available.
• Because untested individuals can be selected on their GEBVs alone, GS decreases the length of the breeding cycle.
• GS is increasingly being used in both plant and animal breeding programs to accelerate genetic gain for
traits governed by minor genes.
• The breeding value (BV) is determined by progeny testing and is based only on the additive genetic effects.
• In contrast, the genotypic value of an individual/line is the phenotype expected from its genotype.
• The genotypic value, therefore, is based on both additive and nonadditive genetic effects.
16. Software
BLUPF90 Family
GCTA (Genome-wide Complex Trait Analysis)
RR-BLUP (Ridge Regression Best Linear Unbiased Prediction)
Beagle
AlphaSim
BLUP
PLINK
ASReml
TASSEL
SAS
CERVUS
GenSel
GMS tool
solGS
BWGS
BGLR
Gselection
lme4GS
STGS
MTGS
17. Workflow of the two main functions of BWGS: bwgs.cv does model cross-validation on a training set, and bwgs.predict
does model calibration on a training set and GEBV prediction of a target set of genotypes. MAF = minor allele frequency;
maxNA = maximum % of marker missing data.
[Figure: BWGS workflow, with Training and Testing panels]
18. Going into more detail, the BWGS workflow comprises three main steps:
1. A step of (missing) genotyping data imputation. This option can be useful for sources of genotyping data such as
GBS. The following options are available:
•MNI: missing data are replaced by the mean allele frequency of the given marker. This imputation method is only
suited to cases with few missing values, typically marker data from SNP chips or KASP.
•EMI: missing data are replaced using the expectation-maximization method described in the function A.mat of the
R package rrBLUP.
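The mean-imputation option (MNI) described above can be sketched as follows. This is a stand-alone illustration, not the BWGS code: the function name `impute_marker_mean`, the -1/0/1 genotype coding and the use of `None` for a missing call are all assumptions.

```python
def impute_marker_mean(genotypes):
    """Replace missing calls (None) with the mean of the observed calls of that marker.

    genotypes: list of rows (one per line), each row a list of marker calls.
    Returns a new matrix; the input is left unchanged.
    """
    n_markers = len(genotypes[0])
    imputed = [list(row) for row in genotypes]
    for j in range(n_markers):
        observed = [row[j] for row in genotypes if row[j] is not None]
        if not observed:
            raise ValueError(f"marker {j} has no observed calls")
        marker_mean = sum(observed) / len(observed)
        for row in imputed:
            if row[j] is None:
                row[j] = marker_mean
    return imputed


# Hypothetical 4 lines x 3 markers, coded -1/0/1, with three missing calls.
geno = [
    [1, -1, None],
    [1, None, 1],
    [-1, -1, 1],
    [None, 1, -1],
]
filled = impute_marker_mean(geno)
for row in filled:
    print(row)
```

Each missing call receives the same value within a marker, which is why this approach is reasonable only when few calls are missing.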
19. 2. A step of dimension reduction, i.e. reducing the number of markers. This reduction may be necessary to speed up
computation on large datasets, depending on the computer resources available. The following methods are available:
•RMR: random sampling (without replacement) of a subset of markers. To be used with the parameter “reduct.marker.size”.
•LD (with r2 and MAP): enables “pruning” of markers that are in LD > r2. Only the marker with the fewest missing values is
kept for each pair in LD > r2. To allow faster computation, r2 is estimated chromosome by chromosome, so a MAP file is
required with information on the assignment of markers to chromosomes. The MAP file should contain at least three columns:
marker_name, chromosome_name and distance_from_origin (either genetic or physical distance, only used for sorting
markers, LD being re-estimated from the marker data).
•ANO (with pval): one-way ANOVAs are carried out with the R function lm on the trait “pheno”. Markers are tested one at a
time, and only markers with pvalue < pval are kept for GEBV prediction.
•ANO+LD (with pval and r2; MAP is optional): combines a first step of marker selection with ANO, then a second step of
pruning using the LD option.
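The LD pruning rule (for each pair of markers with r2 above the threshold, keep the one with fewer missing values) can be sketched for a single chromosome as below. The functions `r_squared` and `ld_prune` are hypothetical names, r2 is taken here as the squared Pearson correlation of allele dosages, and the greedy pairwise scan is an assumption; the actual BWGS implementation may differ.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two markers, using lines observed for both."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return sxy * sxy / (sxx * syy)


def ld_prune(markers, r2_max):
    """Prune one chromosome. markers: dict name -> list of dosages (None = missing)."""
    names = list(markers)
    n_missing = {m: sum(c is None for c in calls) for m, calls in markers.items()}
    removed = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in removed or b in removed:
                continue
            if r_squared(markers[a], markers[b]) > r2_max:
                # Drop the marker with more missing calls (the later one on ties).
                removed.add(a if n_missing[a] > n_missing[b] else b)
    return [m for m in names if m not in removed]


# Hypothetical chromosome with three markers scored on six lines.
chrom = {
    "m1": [0, 2, 2, 0, 2, None],
    "m2": [0, 2, 2, 0, 2, 0],
    "m3": [2, 2, 0, 0, 2, 0],
}
print(ld_prune(chrom, r2_max=0.8))
```

Here m1 and m2 are in complete LD, so only m2 (no missing calls) survives, while m3 is kept because its r2 with m2 is low.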
20. Options for selecting a subset of the training population are:
•RANDOM: a subset of sample.pop.size is randomly selected for training the model, and the unselected part
of the population is used for validation. The process is repeated nFolds * nTimes to have the same number of
replicates as with cross-validation.
•OPTI: an optimization algorithm based on CDmean selects a subset that maximizes the average CD
(coefficient of determination) in the validation set. Since the process is long and has some stochastic
components, it is repeated only nTimes.
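The RANDOM option can be sketched as a repeated random split. The argument names mirror the parameters named above (sample.pop.size, nFolds, nTimes), but the function `random_splits`, its `seed` argument and the line names are hypothetical and are not part of BWGS.

```python
import random


def random_splits(line_ids, sample_pop_size, n_folds, n_times, seed=0):
    """Draw n_folds * n_times random training subsets; the rest validates."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_folds * n_times):
        training = set(rng.sample(line_ids, sample_pop_size))
        validation = [line for line in line_ids if line not in training]
        splits.append((sorted(training), validation))
    return splits


# Hypothetical population of 20 lines, 12 of which train the model each time.
lines = [f"line{i:02d}" for i in range(20)]
splits = random_splits(lines, sample_pop_size=12, n_folds=5, n_times=2)
print(len(splits), "replicates")
print("first training set:", splits[0][0])
```

Unlike k-fold cross-validation, the validation sets of different replicates may overlap; only the number of replicates is matched.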
3. A step of model building and cross-validation:
In the general case of genomic selection, the number of explanatory variables, i.e. markers, (largely) exceeds the
number of observations, making the classical linear model equation unsolvable. In a review, [24] classified most
of the methods that have been proposed to overcome this “big data” problem into penalized regression (which makes
the equations solvable) or semi-parametric methods.
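Why the classical model is unsolvable, and how a penalty repairs it, can be written out for ridge regression, one member of the penalized-regression family. The notation (y, X, β, λ, n lines, p markers) is introduced here for illustration and is not taken from the slides.

```latex
% Linear model with n observations and p markers, p > n
y = X\beta + e, \qquad X \in \mathbb{R}^{n \times p}

% Ordinary least squares needs the inverse of X'X, but
\hat{\beta}_{\mathrm{OLS}} = (X^{\top}X)^{-1}X^{\top}y,
\qquad \operatorname{rank}(X^{\top}X) \le n < p
% so the p x p matrix X'X is singular and the inverse does not exist.

% Ridge regression adds a penalty \lambda > 0 on the size of the marker effects
\hat{\beta}_{\lambda}
  = \arg\min_{\beta}\; \lVert y - X\beta \rVert^{2} + \lambda \lVert \beta \rVert^{2}
  = (X^{\top}X + \lambda I_{p})^{-1}X^{\top}y
% X'X + \lambda I_p is positive definite, hence always invertible.
```

In ridge regression BLUP the penalty is usually taken as the ratio of the residual variance to the marker-effect variance, so it is estimated from the data rather than chosen by hand.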
23. Genotyping and phenotyping of the breeding lines
Supplementary Table S1 Metadata and descriptive statistics of the analysis of the breeding lines for yield
during the Boro season of 2018-19 to 2021-22 under favorable ecosystems
Year of field trial       2018-19      2019-20      2020-21      2021-22
No. of genotypes tested   431          816          1491         1029
No. of trials             26           43           71           43
No. of locations          1            9            1            7
Trial design              RCB, SA      RCB          RCB          RCB, ARCB
Range (t ha−1)            5.31–6.32    4.52–6.32    5.05–6.15    5.11–6.56
Average (t ha−1)          5.77±0.026   5.81±0.14    5.79±0.20    5.77±0.27
CV (%)                    4.44         2.45         3.46         4.65
Genotyping was done with 1024 genome-wide SNP markers, including 92 trait-specific markers, known as the 1K-RiCA panel.
The genotyping data of the 1K-RiCA SNPs were filtered using TASSEL v5.0 with the following criteria: individuals
with more than 15% heterozygous loci were removed, and markers with more than 15% missing values or a minor
allele frequency below 0.05 were removed. After filtering, 814–889 markers were retained for
downstream analysis.
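The filtering criteria above (heterozygosity per individual, missing rate and minor allele frequency per marker) can be sketched in plain Python. The study used TASSEL v5.0; this sketch is not TASSEL's procedure. The function `filter_genotypes`, the 0/1/2 dosage coding with 1 as the heterozygote, and the order of the two steps (individuals first, then markers) are assumptions.

```python
def filter_genotypes(geno, max_het=0.15, max_missing=0.15, min_maf=0.05):
    """geno: dict line -> list of dosages 0/1/2 (1 = heterozygous, None = missing).

    Returns (names of kept lines, indices of kept markers).
    """
    # Step 1: remove lines with too many heterozygous calls.
    kept_lines = {}
    for line, calls in geno.items():
        observed = [c for c in calls if c is not None]
        het_rate = sum(c == 1 for c in observed) / len(observed) if observed else 1.0
        if het_rate <= max_het:
            kept_lines[line] = calls
    if not kept_lines:
        return [], []

    # Step 2: remove markers with too many missing calls or a low MAF.
    n_markers = len(next(iter(kept_lines.values())))
    kept_markers = []
    for j in range(n_markers):
        column = [calls[j] for calls in kept_lines.values()]
        observed = [c for c in column if c is not None]
        missing_rate = 1 - len(observed) / len(column)
        if not observed or missing_rate > max_missing:
            continue
        allele_freq = sum(observed) / (2 * len(observed))
        if min(allele_freq, 1 - allele_freq) >= min_maf:
            kept_markers.append(j)
    return list(kept_lines), kept_markers


# Hypothetical data: 5 lines x 4 markers.
geno = {
    "A": [0, 2, 0, 0],
    "B": [2, 2, 0, None],
    "C": [0, 2, 2, None],
    "D": [2, 2, 2, 0],
    "E": [1, 1, 1, 0],  # mostly heterozygous
}
lines_kept, markers_kept = filter_genotypes(geno)
print(lines_kept, markers_kept)
```

In this toy data line E is dropped for heterozygosity, marker 1 for being monomorphic (MAF of 0) and marker 3 for missing calls.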
24. Statistical methods used in this article
1. rrBLUP Model: The study employed the Ridge Regression Best Linear Unbiased Prediction (rrBLUP) model to
estimate marker effects and calculate Genomic Estimated Breeding Values (GEBVs) for the breeding lines.
2. Cross-Validation: The researchers conducted cross-validation by randomly sampling different proportions of the
entries as a training population to assess the accuracy of Genomic Selection. This method helps evaluate the
predictive performance of the models and optimize the training population size.
3. Regression Analysis: Regression analysis was used to determine the rate of genetic gain in yield over time from
gBLUP values across trials.
4. Post Facto Analysis: The study included post facto analysis of the crosses made in the breeding program from
1994 to 2022. This analysis involved retrieving and analyzing data from the BRRI crossing database to evaluate the
frequency of crosses using specific varieties as parents.
5. Heritability Estimation: Heritability estimates were likely used to assess the reliability and quality of the
phenotypic data used in the genomic selection analysis.
6. Data Filtering and Quality Control: Various data filtering techniques were applied to ensure the quality of
genotypic data, such as removing individuals with high levels of heterozygosity and filtering out markers with high
rates of missing values or low minor allele frequencies.
These statistical methods were instrumental in analyzing the genomic data, estimating breeding values, optimizing
training populations, assessing genetic gain, and evaluating the performance of the genomic selection approach in
rice breeding in Bangladesh.
25. • Grain yield data of 1445 breeding lines tested in 64 historical trials during 2014–2019 under the irrigated
breeding program of Bangladesh Rice Research Institute (BRRI) were used to estimate baseline genetic
gain.
• Genomic BLUPs for 3767 breeding lines evaluated at multiple locations in 183 trials during 2019–
2022 were extracted and used to estimate the rate of change in genetic improvement of rice yield.
• A two-stage linear mixed model analysis was performed to extract the performance BLUP for the yield of
each line.
The R package ‘emmeans’ was used to implement the models.
26. In the second stage, the BLUEs obtained from the first-stage model were used as the response variables in the
mixed model analysis. The BLUEs for yield within each environment were modeled according to Bates et al.
(2015).
The R package “lme4” was used to implement the models.
27. Estimation of genomic estimated breeding values and optimization of training population size
The rrBLUP model was used to estimate the marker effects in R using the mixed.solve function of the
rrBLUP package.
Individual GEBVs were then obtained using estimated marker effects. The prediction accuracy from the rrBLUP
model was used to estimate GS relative efficiency (REc).
The GS accuracy was estimated as the correlation coefficient of the GEBVs and the phenotypic values for all
accessions.
The REc was estimated from an equation in rG.O and H², where rG.O is the accuracy of GS and H² is the estimated
heritability.
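The GS accuracy described above, the correlation coefficient between GEBVs and phenotypic values, can be computed with a short sketch. The function name `pearson_r` and the five pairs of values are hypothetical; they are not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical GEBVs and observed yields (t/ha) for five lines.
gebvs = [5.6, 5.9, 5.7, 6.1, 5.8]
phenotypes = [5.4, 6.0, 5.9, 6.3, 5.7]

accuracy = pearson_r(gebvs, phenotypes)
print(f"prediction accuracy r = {accuracy:.3f}")
```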
28. Sparse testing of training population
The efficiency of GS depends on the relative proportion (size) and genetic relationship of the training population with
the whole breeding population under the model.
Training populations comprising 60% of the total breeding
lines were considered for yield testing at four locations
following the sparse testing model of GS.
To save resources and to connect the
trials, 40% of the total entries of the whole breeding
population were sampled first as a common share for each
training population.
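The allocation described above can be sketched as follows: 40% of the entries are drawn once as a common set, and each location's training population is then topped up to 60%. How the remaining 20% per location was chosen is not stated in the slides; drawing it independently at random from the non-common lines is an assumption, as are the function name `sparse_testing_design` and the population of 100 lines.

```python
import random


def sparse_testing_design(lines, locations, common_frac=0.40, train_frac=0.60, seed=1):
    """Return (common lines, dict location -> training lines)."""
    rng = random.Random(seed)
    n_common = round(common_frac * len(lines))
    n_train = round(train_frac * len(lines))

    # Lines shared by every location; they connect the trials.
    common = rng.sample(lines, n_common)
    common_set = set(common)
    rest = [line for line in lines if line not in common_set]

    design = {}
    for location in locations:
        # Assumed rule: location-specific lines are drawn independently at random.
        extra = rng.sample(rest, n_train - n_common)
        design[location] = sorted(common + extra)
    return common, design


lines = [f"L{i:03d}" for i in range(100)]  # hypothetical breeding population
locations = ["Cumilla", "Gazipur", "Habiganj", "Rangpur"]
common, design = sparse_testing_design(lines, locations)
for location, tested in design.items():
    print(location, len(tested), "lines tested,", len(common), "shared")
```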
29. Estimation of genetic gain for yield
Genetic gain was estimated as the rate of change in breeding value per unit of time.
The BLUP values were regressed on the year in which the lines were evaluated to get the baseline genetic gain.
Genomic BLUP values of each line were extracted from the trials conducted during 2020, 2021, and 2022 following the same
principle using the R package rrBLUP, and the rate of change in genetic improvement in yield was determined by regressing
them on the trial year.
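Regressing BLUP values on the trial year gives the genetic gain as the slope of the fitted line. A minimal sketch, assuming ordinary least squares with one BLUP per year; the function name `genetic_gain` and the yearly values are hypothetical and are not the study's data.

```python
def genetic_gain(years, blups):
    """Least-squares slope of BLUP on year, i.e. gain per year in BLUP units."""
    n = len(years)
    if n != len(blups) or n < 2:
        raise ValueError("need at least two (year, BLUP) pairs")
    mean_year = sum(years) / n
    mean_blup = sum(blups) / n
    sxy = sum((y - mean_year) * (b - mean_blup) for y, b in zip(years, blups))
    sxx = sum((y - mean_year) ** 2 for y in years)
    if sxx == 0:
        raise ValueError("all observations come from the same year")
    return sxy / sxx


# Hypothetical mean yield BLUPs (t/ha) of the lines evaluated in each year.
years = [2016, 2017, 2018, 2019]
blups = [5.70, 5.75, 5.80, 5.85]

gain = genetic_gain(years, blups)
print(f"genetic gain = {gain:.4f} t/ha per year")
```

The same function applies unchanged when every line contributes its own (year, BLUP) pair rather than a yearly mean.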
30. Assessment of baseline genetic gain for yield
The analysis of 1108 individual lines tested in 44 trials from 2016 to 2019 under the irrigated favorable ecosystem showed that
the yield BLUP varied from 5.79 t ha−1 (in 2016) to 5.88 t ha−1 (in 2019), with an average of 5.78 t ha−1.
The simple regression analysis of the BLUP values on the trial year showed a baseline genetic gain for yield of 0.0174 t
ha−1 year−1.
In general, low rates of genetic gain in South and Southeast Asian rice breeding are likely due to long breeding cycles caused
by repeated use of older, popular varieties as parents, and by limited selection intensity for yield in multi-location trials.
Estimation of current genetic gain
The baseline rate of genetic improvement is low and inadequate
compared with the expected genetic gain of at least 0.044 t ha−1
per year (approximately 1% annually).
31. Post facto analysis of BRRI crosses
The frequency of crosses using these two varieties (BRRI dhan28 and BRRI dhan29)
as parents was estimated as a percentage of the total number of crosses made under
the irrigated breeding program after their release.
Post facto analysis of BRRI crosses showed that in its irrigated breeding program, the most popular
varieties BRRI dhan28 and BRRI dhan29 were repeatedly used as parents. Also, frequent use of
landrace varieties in the crossing programs without proper pre-breeding activities has resulted in
limited improvement in additive breeding value for grain yield of rice in the irrigated breeding
program.
32. Assessment of the accuracy of genomic selection
Accuracy of GS was estimated through Pearson’s correlation between the predicted performance and the actual performance.
A study to optimize training population size, using
669 breeding lines tested at multiple locations in the Boro
season of 2020, showed that the average prediction accuracy
increased gradually to 27.42% as the training population
grew to 80% of the entries of the breeding population,
and then jumped sharply to 62.64%
when 100% of the lines were in the training population.
33. Genomic selection for choosing parents
The lines were selected based on GEBV for yield and used in
the crossing program, thereby increasing the frequency of favorable
alleles for yield in the breeding population.
34. Genomic selection at the early generation yield testing stage
The sparse testing model of GS was practiced in the irrigated breeding program at the OYT stage, which is the first stage
of yield trial. In the 2020–21 Boro season, out of 650 breeding lines, 249 lines at Cumilla, 289 lines at Gazipur, 232 lines at
Habiganj, and 275 lines at Rangpur were tested as the training population.
The genomic prediction for these four sites showed a range of
predicted yield of 5.94–6.81 t/ha at Cumilla, 5.04–
6.96 t/ha at Gazipur, 7.40–8.18 t/ha at Habiganj, and 6.22–
6.86 t/ha at Rangpur.
The prediction accuracy with the training population was
0.456 at Cumilla, 0.715 at Gazipur, 0.499 at Habiganj, and
0.456 at Rangpur.
35. In Boro 2021–22, out of 548 breeding lines,
292 lines at Cumilla, 249 lines at Gazipur, 280
lines at Habiganj, and 125 lines at Rangpur were tested
as the training population; the prediction accuracy
was 0.396, 0.603, 0.378, and
0.329, respectively.
The predicted yield based on GEBV was
4.89–4.95 t/ha at Cumilla, 5.23–7.50 t/ha at
Gazipur, 6.4262–6.4263 t/ha at Habiganj, and
4.70–5.66 t/ha at Rangpur.
36. Genomic selection at F5:6 LST without yield evaluation
GS was performed on 505 F5:6 LST (Line Stage
Testing) lines derived from 77 crosses using
genotyping data of 860 SNPs and yield data of their
39 parents.
The prediction accuracy was 0.103 when
correlation analysis was performed between the
predicted yield (gBLUP) of the LST lines and the
BLUE values extracted for the same set of lines from
the OYT trials.
However, the correlation coefficient between the
gBLUP and BLUEs of the parents was as high as
0.708.
37. Supplementary Table S3 Timeframe of the current breeding cycle of the irrigated breeding program
Breeding stage                      Years required   Crop season
Hybridization                       1.0              T. Aman + Boro
F2-F5 in RGA                        1.5              T. Aman + Boro
F5 in LST                           0.5              T. Aman
OYT                                 0.5              Boro
Phenotyping for key target traits   0.5              T. Aman
AYT                                 0.5              Boro
Total                               4.5
38. Thus, GS has cut down another 0.5–1.0 years that
would be required for yield testing in the
advanced yield trials and phenotyping for grain
quality and pest reaction before selecting
parents for recycling.
Applying GS together with MAS for key target
traits at the line fixation stage (F4-F5) has further
reduced cycle time by at least half a year
A heritability between 0.4 and 0.6 for a quantitative trait
like yield is considered optimum
for the best-quality data.
39. Based on the above findings, it can be concluded that by applying GS, superior lines with high breeding value can be
reliably captured with and without extensive field phenotyping.
With GS practiced for 4 consecutive years from 2019 to 2022, genetic improvement for yield has been recorded at the rate
of 117 kg ha−1 year−1, which is around 6.77-fold higher than the baseline gain.
The main findings of the study are:
• The study emphasizes the importance of creating and capturing higher genetic accuracy within the shortest possible
time to enhance genetic gain for any trait.
• The application of genomic selection (GS) together with trait-specific marker-assisted selection (MAS) has
significantly expedited yield improvement and shortened the breeding cycle, leading to a seven-fold larger annual
genetic gain and a reduction of around 1.5 years in the breeding cycle from the existing 4.5 years.
• GS has been routinely used since the Boro 2018-19 season for selecting high breeding value lines to recycle in the
crossing program.
40. Future prospects
Overall, the future prospects of genomic selection in rice breeding include
advancements in precision breeding, accelerated genetic gain, trait stacking,
climate resilience, multi-omics integration, and the widespread adoption of
genomic technologies in developing countries. By harnessing the power of
genomics, breeders can drive innovation, sustainability, and resilience in rice
production to meet the challenges of a rapidly changing agricultural
landscape.
41. References
• Partha S. Biswas, M. M. Emam Ahmed, Wazifa Afrin, Anisar Rahman, A. K. M. Shalahuddin, Rafiqul Islam,
Fahamida Akter, Md Abu Syed, Md Ruhul Amin Sarker, K. M. Ifterkharuddaula and Mohammad Rafiqul
Islam. 2023. Enhancing genetic gain through the application of genomic selection in developing irrigated rice for
the favorable ecosystem in Bangladesh.
• Neeraj Budhlakoti, Amar Kant Kushwaha, Anil Rai, K K Chaturvedi, Anuj Kumar, Anjan Kumar Pradhan, Uttam
Kumar, Rajeev Ranjan Kumar, Philomin Juliana, D C Mishra and Sundeep Kumar. 2022. Genomic Selection: A
Tool for Accelerating the Efficiency of Molecular Breeding for Development of Climate-Resilient Crops.
• Réka Howard, Alicia L. Carriquiry and William D. Beavis. 2014. Parametric and Nonparametric Statistical
Methods for Genomic Selection of Traits with Additive and Epistatic Genetic Architectures.
• https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7141418/
• B.D Singh and A.K Singh. Marker Assisted Plant Breeding: Principles and Practices.
Editor's Notes
Genomic selection emerged as an important tool which can utilize such information for modeling the crop yield for effective and rapid selection under different environmental conditions to meet the production challenges in a climate-changing world.