2. Introduction
• Genetic markers
– heritable polymorphisms that can be measured in one or
more populations of individuals
– heart of modern genetics
– enable the study of important questions in population
genetics, ecological genetics and evolution
• Advent of next-generation sequencing (NGS)
– whole genome sequencing
– re-sequencing: discovering, sequencing and genotyping
thousands of markers across almost any genome
• comprehensive genome-wide association studies for any
organism
• genome-wide studies on wild populations
3. NGS marker discovery and
genotyping methods
• RRL and CRoPS (reduced-representation libraries and
complexity reduction of polymorphic sequences)
• RAD seq (Restriction-site associated DNA sequencing)
• GBS (Genotyping by sequencing)
– the digestion of multiple samples of genomic DNA
– a selection or reduction of the resulting restriction fragments
– NGS of the final set of fragments, which should be less than 1
kb in size
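The digest-and-select steps above can be sketched in silico. This is a minimal illustration, not part of any published pipeline: it assumes the enzyme ApeKI (used in the original GBS protocol of Elshire et al. 2011), whose recognition site is G^CWGC with W = A or T, and it treats the 1 kb limit as a simple length cutoff. The names `digest` and `size_select` are hypothetical.

```python
import re

# Assumption: ApeKI recognition site G^CWGC (W = A or T); the cut falls
# after the first G. A lookahead is used so overlapping sites are all found.
APEKI_SITE = re.compile(r"(?=GC[AT]GC)")
CUT_OFFSET = 1  # number of bases between the site start and the cut


def digest(seq, site=APEKI_SITE, cut_offset=CUT_OFFSET):
    """Cut seq at every recognition site and return the fragments in order."""
    cuts = [m.start() + cut_offset for m in site.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, max_len=1000):
    """Keep only fragments shorter than max_len bases (here, under 1 kb)."""
    return [f for f in fragments if len(f) < max_len]


if __name__ == "__main__":
    toy_genome = "AAAGCAGCTTTTGCTGCCC"
    fragments = digest(toy_genome)
    print(fragments)
    print(size_select(fragments, max_len=8))
```

Because GCWGC is its own reverse complement, scanning one strand is enough to find every site in this toy model.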
7. GBS results in Maize
• Parental line
– 98% of 1,146,449 HQ reads aligned with the maize genome
– 868,336 reads aligned perfectly to the maize genome
• 276 RILs
– 6 lanes, 48-plex, 2,090 Mbp per lane on average
– From 145,836,644 raw reads, 83% passed filtering process (120,438,739
GBS reads)
– 436,372 reads were produced per DNA sample and 95% of samples
– 809,651 sequence tags covering 51.8 Mbp or 2.3% of the maize
genome
– 167,494 dominant markers could be placed on a framework
map of 25,185 sequence tags.
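The read counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers from the slide; the one assumption is that the per-sample figure is a mean over the 276 RILs (rather than over all 6 × 48 wells), which is what the division suggests. The filter pass rate comes out at about 83%, and the tag coverage implies a genome size of roughly 2,250 Mbp.

```python
# All input figures are taken from the slide.
raw_reads = 145_836_644        # raw reads from 6 lanes
gbs_reads = 120_438_739        # reads that passed filtering
n_rils = 276                   # number of RILs genotyped
tag_coverage_mbp = 51.8        # Mbp covered by sequence tags
genome_fraction = 0.023        # 2.3% of the maize genome

# Fraction of raw reads that passed the filtering process.
pass_rate = gbs_reads / raw_reads

# Assumption: the per-sample read count is the mean over the 276 RILs.
reads_per_sample = gbs_reads / n_rils

# Genome size implied by "51.8 Mbp or 2.3% of the maize genome".
implied_genome_mbp = tag_coverage_mbp / genome_fraction

print(f"pass rate: {pass_rate:.1%}")
print(f"reads per sample: {reads_per_sample:,.0f}")
print(f"implied genome size: {implied_genome_mbp:,.0f} Mbp")
```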
8. TASSEL-GBS
• The new bottleneck is the efficient bioinformatics
analysis of the vast and ever-expanding sea of
data
• TASSEL-GBS (Trait Analysis by aSSociation, Evolution and Linkage)
– Not limited to the specific restriction enzymes
utilized in those protocols: works with nearly any
restriction enzyme and barcoding approach
– Specifically designed to efficiently handle large
quantities of data from large numbers of samples
13. Population genetic-based filtering of
putative SNPs
• Putative SNPs from GBS may be of low quality
– sequencing error
– paralogous sequence tags from different loci
• To detect and filter out error-prone SNPs
– minor allele frequency (MAF)
– inbreeding coefficient (or "index of panmixia")
$$F_{IT} = 1 - \frac{H_o}{H_e}, \qquad H_e = 2q(1 - q)$$
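The two filters above can be sketched for a single biallelic SNP. Here H_o is the observed proportion of heterozygous genotypes, H_e the expected proportion, and q the minor allele frequency. The function names and the threshold values (`min_maf`, `min_f`) are hypothetical, chosen only for illustration; suitable cutoffs depend on the population, and this is not the TASSEL-GBS implementation.

```python
def allele_stats(n_aa, n_ab, n_bb):
    """Return (MAF, F_IT) from genotype counts at one biallelic SNP."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped samples at this site")
    freq_b = (2 * n_bb + n_ab) / (2 * n)   # frequency of allele B
    q = min(freq_b, 1 - freq_b)            # minor allele frequency
    h_obs = n_ab / n                       # observed heterozygosity H_o
    h_exp = 2 * q * (1 - q)                # expected heterozygosity H_e
    # F_IT is undefined for a monomorphic site (H_e = 0).
    f_it = 1 - h_obs / h_exp if h_exp > 0 else float("nan")
    return q, f_it


def keep_snp(n_aa, n_ab, n_bb, min_maf=0.01, min_f=0.5):
    """Keep a SNP only if it passes both (hypothetical) thresholds."""
    q, f_it = allele_stats(n_aa, n_ab, n_bb)
    return q >= min_maf and f_it >= min_f


if __name__ == "__main__":
    # Mostly homozygous site, as expected in inbred lines.
    print(allele_stats(45, 10, 45), keep_snp(45, 10, 45))
    # Every sample "heterozygous": typical of paralogous tags merged as one locus.
    print(allele_stats(0, 100, 0), keep_snp(0, 100, 0))
```

In inbred material a true SNP should show few heterozygotes and hence a high F_IT, while paralogous tags tend to produce an excess of apparent heterozygotes and a low or negative F_IT.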
14. Capacity for large numbers of markers
and samples
• 31,978 samples took 495 CPU-hours on a 64-core Linux
machine with 512 GB of RAM
• 383 samples require approximately 1 CPU-hour on a
MacBook Pro with a 2.6 GHz Intel Core i7 processor
and 16 GB of RAM running OS X.
15. UNEAK pipeline in TASSEL-GBS
• In the absence of a reference genome,
– SNP calling may be much less accurate with short-read
sequencing technologies
– true SNPs, sequencing errors and SNPs between
paralogs can be difficult to distinguish
• Universal Network-Enabled Analysis Kit (UNEAK)
– To enable genome-wide association studies (GWAS)
and genomic selection (GS)
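Reference-free SNP discovery of this kind relies on pairwise comparison of sequence tags. The sketch below is a simplified illustration under stated assumptions, not the UNEAK code: it assumes equal-length tags, treats two tags that differ at exactly one base as a candidate SNP pair, and keeps only pairs in which each tag has no other one-mismatch neighbor, since larger clusters are more likely to involve sequencing errors or paralogs. The function names and the `min_count` cutoff are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def candidate_snp_pairs(tag_counts, min_count=2):
    """Return tag pairs that differ by one base and have no other neighbor.

    tag_counts maps each tag sequence to its read count; tags seen fewer
    than min_count times are discarded first as likely errors.
    """
    tags = sorted(t for t, c in tag_counts.items() if c >= min_count)
    neighbors = {t: set() for t in tags}
    # All-against-all comparison: quadratic, fine only for a toy example.
    for a, b in combinations(tags, 2):
        if hamming(a, b) == 1:
            neighbors[a].add(b)
            neighbors[b].add(a)
    pairs = []
    for a in tags:
        if len(neighbors[a]) == 1:
            (b,) = neighbors[a]
            if a < b and len(neighbors[b]) == 1:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    counts = {
        "ACGTAC": 50, "ACGTAT": 40,                # clean two-tag pair
        "GGGTTT": 30, "GGGTTA": 25, "GGGTTC": 20,  # three-tag cluster, rejected
        "TTTTTT": 60,                              # no neighbor
        "ACGAAC": 1,                               # rare tag, likely an error
    }
    print(candidate_snp_pairs(counts))
```

The all-against-all loop would not scale to real tag sets; it is used here only to keep the idea visible.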
18. SNP discovery in switchgrass
Full-sib population (n=130): 400,107
Half-sib population (n=168): 476,005
66 diverse populations (n=540): 700,236
• The average coverage of the three data sets was less
than 1X
• Using the most informative markers (0.2<MAF<0.3), 3,000
paternal SNPs were grouped into 18 linkage groups
• Paternal linkage map: 41,709 markers; maternal map:
46,508 markers
19. Strengths and Weaknesses of GBS
• Strengths of GBS and TASSEL-GBS
– The large number of markers potentially produced
– Low cost and minimal startup cost
– Integration of SNP discovery with SNP calling
• Weakness
– The amount of missing data when conducted at low
coverage
20. Reference
• Elshire R, Glaubitz J, Sun Q, Poland J, Kawamoto K, et al. (2011) A
robust, simple genotyping-by-sequencing (GBS) approach for
high diversity species. PLoS ONE 6.
• Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, et al.
(2014) TASSEL-GBS: A High Capacity Genotyping by Sequencing
Analysis Pipeline. PLoS ONE 9.
• Lu F, Lipka AE, Glaubitz J, Elshire R, Cherney JH, et al. (2013)
Switchgrass Genomic Diversity, Ploidy, and Evolution: Novel
Insights from a Network-Based SNP Discovery Protocol. PLoS
Genet 9.
• Davey J, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, et al.
(2011) Genome-wide genetic marker discovery and genotyping
using next-generation sequencing. Nat Rev Genet 12:499-510.
Editor's Notes
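Today's presentation is about GBS; I will briefly explain the concept and the analysis methods, focusing on the figures.
Polymorphisms that are inherited within populations of organisms
They make important research possible in population genetics, ecological genetics, evolution and related fields
Recent advances in NGS technology
Re-sequencing can produce thousands of markers in any genome
This makes comprehensive GWAS analyses and genome-wide studies of wild populations possible
Marker development and genotyping methods based on NGS have been developed
RRL 2008 RAD 2008
Because the GBS method is simple, many studies have used it, so we will look at GBS in more detail
The GBS method as first introduced used ApeKI,
which, compared with other restriction enzymes, is a rare cutter and is methylation-sensitive, so it cuts evenly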
868,336 reads : no mismatch
78% : single genomic location
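83% : barcode and cut site present within the first 72 bp, no N
Illumina filtering (70%) : underestimate
Using a framework map of 644 markers and a binomial test, about 25,000 tags were mapped in a first round, and a second round placed about 160,000 more tags
After GBS was successfully applied in maize, GBS analyses were also carried out in barley, wheat, sorghum and animals
As GBS is used more widely, the analysis method is becoming increasingly important, and TASSEL-GBS is used as a main pipeline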
Trait Analysis by aSSociation, Evolution and Linkage
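Here the SNP discovery pipeline and the production pipeline are separated so that computing resources are used efficiently
Without filtering, sequencing errors and paralogous sequences interfere with SNP detection
Therefore the SNPs are filtered using the MAF value
Analyzing many samples naturally takes a long time;
here, GBS analysis was run on 31,978 maize samples
Up to this point the analysis assumed a reference genome, but TASSEL can also run the analysis without a reference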
Pairwise alignment
1,242,860 putative SNPs were discovered
MMC modulated modularity clustering