Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & Spark MLLib/ML & SparkR
1. Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & Spark MLLib/ML & SparkR
Deborah Siegel
Northwest Genomics Center
UW Center for Mendelian Genomics
twitter: @dsiegel
coauthoring a book on data analysis in SparkR with Amanda Casari
Seattle Spark Meetup October 16, 2015
3. Reference Genome
Humans have two copies of most of our chromosomes, so we have two alleles at each position of our genome.
AATCATGTGTGGCTACTTACTGTCACT
AATCATGTGTGGCTACTTACTGTCACT
AATCATGTGTAGCTACTTACTGTCACT
We will represent our data with Manhattan distance: for each person in our sample, we count how many alternate alleles are present in their genotype at that position.
Genotype   Description             Representation
GG         homozygous ref allele   {0}
GA         heterozygous            {1}
AA         homozygous alt allele   {2}
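The encoding above can be sketched in a few lines of Python. This is only an illustration, not the code used in the talk: the function name `alt_allele_count` is hypothetical, and it assumes a biallelic site where any allele that differs from the reference is the alternate allele.

```python
def alt_allele_count(genotype: str, ref: str) -> int:
    """Count the alleles in a diploid genotype that differ from the reference allele.

    Assumes a biallelic site, so every non-reference allele is the alternate allele.
    """
    if len(genotype) != 2:
        raise ValueError("expected a diploid genotype such as 'GA'")
    return sum(1 for allele in genotype if allele != ref)


# The three genotypes from the slide, with G as the reference allele.
counts = {g: alt_allele_count(g, ref="G") for g in ("GG", "GA", "AA")}
print(counts)  # {'GG': 0, 'GA': 1, 'AA': 2}
```

Each sample then becomes a vector with one such count per variant position.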
4. Motivation
Manolio et al. Finding the missing heritability of complex diseases. Nature. 2009 Oct 8. Used with permission.
5. Motivation: Genes Mirror Geography within Europe
John Novembre, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R. Boyko, Adam Auton, Amit Indap, Karen S. King, Sven Bergmann, Matthew R. Nelson, Matthew Stephens, Carlos D. Bustamante (2008). Genes mirror geography within Europe. Nature. DOI: 10.1038/nature07331. Used with permission.
6. Feature Space
Data: 1000 Genomes Project, publicly available on AWS
ℝ^3,000,000,000: genomic positions
ℝ^50,818,468: chr22
ℝ^493,782: biallelic SNP variants with complete data in our sample
ℝ^1838: common variants in our sample
ℝ^2: dimension reduced
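The two filters named on this slide (complete data, then common variants) can be sketched on toy data in plain Python. This is an assumption-laden illustration, not the Spark/ADAM code from the talk: the function names, the toy samples, and the `min_alt_freq` threshold are all hypothetical, and "common" is taken here to mean an alternate allele frequency at or above that threshold.

```python
from typing import Dict, List, Optional

# sample ID -> variant ID -> alt allele count (None marks a missing call)
Genotypes = Dict[str, Dict[str, Optional[int]]]


def complete_variants(data: Genotypes) -> List[str]:
    """Variants with a non-missing call in every sample, in sorted order."""
    all_variants = sorted({v for calls in data.values() for v in calls})
    return [v for v in all_variants
            if all(calls.get(v) is not None for calls in data.values())]


def common_variants(data: Genotypes, variants: List[str],
                    min_alt_freq: float) -> List[str]:
    """Keep variants whose alt allele frequency is at least min_alt_freq (assumed definition)."""
    n_alleles = 2 * len(data)  # two alleles per diploid sample
    return [v for v in variants
            if sum(calls[v] for calls in data.values()) / n_alleles >= min_alt_freq]


# Toy data: v2 has a missing call, v3 has no alternate alleles at all.
data = {
    "s1": {"v1": 0, "v2": 0, "v3": 0, "v4": 1},
    "s2": {"v1": 1, "v2": None, "v3": 0, "v4": 1},
    "s3": {"v1": 2, "v2": 1, "v3": 0, "v4": 0},
}
kept = common_variants(data, complete_variants(data), min_alt_freq=0.25)
features = {sample: [calls[v] for v in kept] for sample, calls in data.items()}
print(kept)      # ['v1', 'v4']
print(features)  # {'s1': [0, 1], 's2': [1, 1], 's3': [2, 0]}
```

Each filter shrinks the number of columns per sample, which is the step-by-step reduction in dimension the slide lists.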
8. Workflow
Spark/ADAM:
  vcf file
  parquet file of genotype objects
  Filter - Populations
  Filter - Complete data
  Filter - Common Variants
  Case Class of SampleVariants
  create RDD[Row] with sample ID and ordered array of alt allele counts
  create schema
  create DataFrame
Spark ML:
  ML VectorAssembler
  ML fit PCA (model)
  ML transform (projection)
  ML VectorSlicer
  df to rdd
  convert vector slices to strings
  rdd to df
  write to parquet on hdfs
SparkR:
  read parquet from hdfs
  to local data.frame
  plot PC1 & PC2
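The fit and projection steps of the workflow can be illustrated on a single machine with NumPy. This is a minimal sketch, not the Spark ML pipeline from the talk: it swaps in an SVD-based PCA, and the function name `pca_project`, the sample IDs, and the small matrix of alt allele counts are made up for illustration.

```python
import numpy as np


def pca_project(X, k: int = 2) -> np.ndarray:
    """Project the rows of X onto their first k principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # center each variant (column)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


# Hypothetical input: one row per sample, one column per variant (alt allele counts).
sample_ids = ["s1", "s2", "s3", "s4"]
X = [
    [0, 1, 2, 0],
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [2, 0, 0, 2],
]
proj = pca_project(X, k=2)
for sid, (pc1, pc2) in zip(sample_ids, proj):
    print(f"{sid}\tPC1={pc1:.3f}\tPC2={pc2:.3f}")
```

The resulting two columns per sample are what would be plotted as PC1 against PC2. The sign of each component is arbitrary, so a plot may come out mirrored between runs or libraries.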
10. Thank You!
Also:
Northwest Genomics Center
Amanda Casari
Frank Nothaft & Big Data Genomics http://bdgenomics.org/
Neil Ferguson for his excellent blog post - http://bdgenomics.org/blog/2015/07/10/genomic-analysis-using-adam/
Call to Action:
Talk with folks in the development community about PCA on Spark and SparkR.