This document discusses genomic-scale data pipelines. It introduces Dr. Denis Bauer and the Transformational Bioinformatics team, describes how genomic data is projected to grow to exabyte scale by 2025, and outlines genomic research workflows and challenges such as processing, analyzing, and visualizing large variant call format (VCF) data. It presents two cloud data pipeline patterns used by the team: (1) a Spark server cluster pipeline for machine learning on large genomic datasets, and (2) a serverless pipeline using AWS Lambda and Step Functions for scalable genomic searches.
2. Transformational Bioinformatics Team
Slide labels: Team, Collaborators, News, Software
Names listed: Denis Bauer, PhD; Oscar Luo, PhD; Rob Dunne, PhD; Piotr Szul; Aidan O’Brien; Laurence Wilson, PhD; Adrian White; Andy Hindmarch; David Levy; Dan Andrews; Kaitao Lai, PhD; Arash Bayat; John Hildebrandt; Mia Chapman; Ian Blair; Kelly Williams; Jules Damji; Gaetan Burgio; Lynn Langit; Natalie Twine, PhD; Prabha Pillay
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
3. Big Data in 2025…Petabytes?
[Bar chart, petabytes, axis 0 to 2500: Astronomy 1000, Twitter 17, YouTube 2000]
4. GENOMIC Big Data in 2025 - Exabytes
[Bar chart, exabytes, axis 0 to 25: Astronomy 1, Twitter 0.17, YouTube 2, Genomic 20]
5. Genome holds Blueprint for Every Cell
6. Affects Looks, Disease Risk, and Behavior
8. Genomic Research Workflow
https://www.projectmine.com/about/
BigData Focus
9. Finding the Disease Gene(s)
Spot the letter that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
[Figure: letters compared between cases and controls across Gene1 and Gene2]
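The "spot the letter" idea on this slide can be sketched in a few lines of Python. This is a deliberately oversimplified illustration, just as the slide notes: the function name `candidate_positions` and the toy sequences are hypothetical, and real association testing is statistical rather than all-or-nothing.

```python
from typing import Dict, List


def candidate_positions(cases: List[str], controls: List[str]) -> Dict[int, str]:
    """Return {position: letter} where one letter is shared by every case
    and does not occur in any control at that position."""
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    length = len(cases[0])
    if any(len(seq) != length for seq in cases + controls):
        raise ValueError("all sequences must be aligned to the same length")

    hits: Dict[int, str] = {}
    for pos in range(length):
        case_letters = {seq[pos] for seq in cases}
        if len(case_letters) != 1:
            continue  # not common amongst all affected
        letter = next(iter(case_letters))
        if all(seq[pos] != letter for seq in controls):
            hits[pos] = letter  # absent in all unaffected
    return hits


# Toy aligned sequences (made up for illustration).
cases = ["ACGTTA", "ACGTCA", "ACGTGA"]
controls = ["ACCTTA", "ACATCA"]
print(candidate_positions(cases, controls))  # {2: 'G'}
```

Only position 2 qualifies here: every case carries G there, and no control does.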
12. Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
[Figure: methods plotted by accuracy (low to high) against speed (low to high)]
18. DEMO: Who is a Bondi Hipster?
19. Supervised ML: Wide Random Forests
20. Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~$8620.00
• Yarn Cluster
• 12 workers
• 16 x Intel Xeon E5-2660 CPUs @ 2.20 GHz
• 128 GB RAM
• Spark 1.6.1
• 128 executors
• 6 GB per executor (0.75 TB in total)
• Synthetic dataset
[Chart labels: Whole Genome Range, GWAS Range]
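The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; it does not assume which runtime and cost pair belongs to the whole-genome range and which to the GWAS range, and the variable names are illustrative.

```python
# Figures quoted on the slide; the labels "small" and "large" are ours.
small = {"hours": (5, 50), "aws_cost_usd": 215.50}
large = {"hours": (200, 2000), "aws_cost_usd": 8620.00}

executors = 128
gb_per_executor = 6

# 128 executors x 6 GB = 768 GB, expressed in TB (1 TB = 1024 GB).
total_memory_tb = executors * gb_per_executor / 1024

# How much more expensive and slower the larger run is.
cost_ratio = large["aws_cost_usd"] / small["aws_cost_usd"]
hour_ratios = tuple(l / s for l, s in zip(large["hours"], small["hours"]))

print(total_memory_tb)  # 0.75
print(cost_ratio)       # 40.0
print(hour_ratios)      # (40.0, 40.0)
```

Both the runtime range and the quoted AWS cost scale by the same factor of 40 between the two settings, and the executor memory adds up to the 0.75 TB shown in the cluster description.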
21. Future Directions for VariantSpark RF
• Mixed feature types: Unordered, Categorical, Continuous
• Build Community: Python API, Non-Genomic Demos
Implementation by
22. Try it out: VariantSpark Notebook
https://docs.databricks.com/spark/latest/training/variant-spark.html
23. Genome Editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy
“Editing does not work every time, e.g. only 7 in 10 embryos were mutation free.”
Aim: Develop a computational guidance framework to enable edits that work the first time, every time
Ma et al. Nature 2017 *
* Controversy around the paper: stay tuned
24. Make Process Parallel and Scalable
SPEED
• Each search can be broken down into parallel tasks, each taking seconds
SCALE
• Researchers might want to search the target for one gene or 100,000
Scalability + Agility =
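The fan-out idea on this slide, many small independent search tasks, can be sketched locally with a thread pool. This is only an analogy under stated assumptions: `find_sites` is a toy stand-in for one search task, the motif "GG" and the gene sequences are made up, and local threads stand in for whatever parallel workers the real pipeline uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def find_sites(sequence: str, motif: str = "GG") -> List[int]:
    """Toy search task: start positions of `motif` in `sequence` (overlaps allowed)."""
    width = len(motif)
    return [i for i in range(len(sequence) - width + 1)
            if sequence[i:i + width] == motif]


def search_all(genes: Dict[str, str], max_workers: int = 8) -> Dict[str, List[int]]:
    """Run one independent task per gene and collect the results by gene name."""
    names = list(genes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `names`.
        results = pool.map(find_sites, (genes[name] for name in names))
        return dict(zip(names, results))


genes = {"geneA": "ACGGTGGA", "geneB": "TTTT", "geneC": "GGG"}
print(search_all(genes))  # {'geneA': [2, 5], 'geneB': [], 'geneC': [0, 1]}
```

Because each task depends only on its own input, the same code handles one gene or many; only the number of workers needs to change.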
25. One of the first Serverless Applications in Research
Featured in
27. X-Ray Tracing Demo of GT-Scan2
• Find performance bottlenecks
• Fix and test
[Diagram labels: Webapp, Resources (S3, DynamoDB), Lambda]
33. Cloud Data Pipeline Pattern
Columns: Problem, Data, Technologies, MVPs, Pipeline
Problem: Analyze GWAS
Data: vcf -> S3/Spark
Ingest: S3 -> Databricks DBFS
ETL: Apache Spark
Analyze: Variant-Spark ML
Viz: Notebook, SQL, R, Python
Pipeline: Spark Server Cluster
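To make the ETL step of this pattern concrete, here is a minimal pure-Python sketch of turning VCF text into a numeric genotype matrix of the kind a machine learning step could consume. It is not the team's Spark implementation: the function names and the tiny VCF below are hypothetical, and only the standard tab-separated VCF column layout and the GT field are relied on.

```python
from typing import Dict, Iterable, List, Tuple


def genotype_to_dosage(gt: str) -> int:
    """Count non-reference alleles in a GT value such as '0/1' or '1|1'.
    Missing alleles ('.') are counted as 0 in this sketch."""
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return (sample names, {variant key: dosage per sample})."""
    samples: List[str] = []
    matrix: Dict[str, List[int]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            continue
        chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
        gt_index = fields[8].split(":").index("GT")
        key = f"{chrom}:{pos}:{ref}>{alt}"
        matrix[key] = [genotype_to_dosage(call.split(":")[gt_index])
                       for call in fields[9:]]
    return samples, matrix


# Made-up example input.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "s1", "s2", "s3"]),
    "\t".join(["1", "12345", ".", "A", "G", ".", "PASS", ".", "GT",
               "0/0", "0/1", "1/1"]),
    "\t".join(["2", "500", ".", "C", "T", ".", "PASS", ".", "GT:DP",
               "1|0:12", "./.:0", "0|0:9"]),
]
samples, matrix = parse_vcf(vcf_lines)
print(samples)  # ['s1', 's2', 's3']
print(matrix)   # {'1:12345:A>G': [0, 1, 2], '2:500:C>T': [1, 0, 0]}
```

Each variant becomes one row of 0/1/2 values across samples, which is the wide, variables-by-samples shape that the analysis stage works on.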
34. Spark Server Cluster Pipeline Pattern
Jupyter Notebook
35. Cloud Genomic-Scale Data Pipelines
• Problem #1 – ML on Large Data
• Solution: Spark server cluster + custom machine learning
• Problem #2 – Burstable Search
• Solution: Serverless pipeline
http://www.nature.com/nature/journal/v462/n7276/fig_tab/nature08645_F1.html
Bauer et al. Trends Mol Med. 2014 PMID: 24801560.
https://www.gt-scan.net/ and AMA with Dr. Bauer: https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/
https://aws.amazon.com/xray/details/
X-Ray helped us isolate the component in our GT-Scan2 pipeline that was slowing overall execution; replacing it with a more performant function cut runtime roughly fourfold (125 s to 32 s).
Recent team presentation: https://www.slideshare.net/AustralianNationalDataService/gtscan2-bringing-bioinformatics-to-the-cloud-may-tech-talk
Quickly access a managed Spark cluster - AWS EC2 / spot instances
Link to your data and perform whole genome analysis in real-time