Deck from a talk at YOW Data in Sydney. Covers VariantSpark, a custom Apache Spark machine learning library, and GT-Scan2, which uses an AWS Lambda architecture for bioinformatics.
2. Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Team
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai, PhD
Natalie Twine, PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
3. Big Data in 2025…Petabytes?
Bar chart (petabytes, axis 0 to 2500): Astronomy 1000, Twitter 17, YouTube 2000
4. Genome holds the blueprint for every cell
5. It affects looks, disease risk, and behavior
6. GENOMIC Big Data in 2025 - Exabytes
Bar chart (exabytes, axis 0 to 25): Astronomy 1, Twitter 0.017, YouTube 2, Genomic 20
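The two charts use different units, so a quick conversion puts all four domains on one scale. The sketch below assumes decimal prefixes (1 EB = 1000 PB), which is what the 1000 PB and 1 EB figures for astronomy imply; the variable names are illustrative only.

```python
PB_PER_EB = 1000  # assumption: decimal (SI) prefixes, 1 EB = 1000 PB

# Values read from the petabyte chart
petabytes_2025 = {"Astronomy": 1000, "Twitter": 17, "YouTube": 2000}
# Value read from the exabyte chart
genomic_eb = 20

# Convert everything to exabytes so the four domains are comparable
exabytes_2025 = {name: pb / PB_PER_EB for name, pb in petabytes_2025.items()}
exabytes_2025["Genomic"] = genomic_eb

for name, eb in exabytes_2025.items():
    print(f"{name:10s} {eb:>7.3f} EB")

# How much larger is the genomic estimate than the next largest domain?
ratio_vs_youtube = genomic_eb / exabytes_2025["YouTube"]
print(f"Genomic is {ratio_vs_youtube:.0f}x YouTube")
```

On these figures the genomic estimate is an order of magnitude above the largest of the other three domains.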
9. Finding the disease gene(s)
Spot the variant that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
Figure: cases vs. controls, with variants shown across Gene1 and Gene2
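The "spot the variant" rule can be sketched in a few lines of Python. This is a toy illustration of the oversimplified rule on the slide, not how VariantSpark works: the genotype values and variant names are made up, and real association studies rely on statistical models rather than an all-or-nothing filter.

```python
# Toy genotype table: rows are individuals, columns are variants.
# 1 = carries the variant, 0 = does not. All values are invented for illustration.
VARIANTS = ["Gene1:varA", "Gene1:varB", "Gene2:varC", "Gene2:varD"]

cases = [      # affected individuals
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
controls = [   # unaffected individuals
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]


def candidate_variants(cases, controls, names):
    """Return variants carried by every case and by no control."""
    hits = []
    for j, name in enumerate(names):
        in_all_cases = all(row[j] == 1 for row in cases)
        in_no_controls = all(row[j] == 0 for row in controls)
        if in_all_cases and in_no_controls:
            hits.append(name)
    return hits


print(candidate_variants(cases, controls, VARIANTS))
```

With millions of variants and thousands of samples this exhaustive scan is no longer the bottleneck; the hard part is that real signals are rarely this clean, which is why a machine learning approach is used instead.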
10. Cloud Data Pipeline Pattern
Problem
• Define business problem
Data
• Quality
• Quantity
• Location
Candidate Technologies
• Ingest
• Clean
• Analyze
• Predict
• Visualize
Build MVPs
• Iterate
• Learn
• Assemble
Assemble Pipeline
• Validate sections
• Test at scale
13. What is CSIRO’s solution?
• For scale at reasonable cost: use Apache Hadoop
• For scale at speed: use Apache Spark
• For usability in bioinformatics: create a domain-specific ML API (library)
• For global use: leverage Cloud Pipeline Patterns
14. GWAS Analysis with VariantSpark
On-premise cluster with Apache Hadoop & Spark
Genomics Analysts
CSIRO corporate data center
21. Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
Plot axes: Accuracy (low to high) vs. Speed (low to high)
22. Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50 h (AWS: ~$215.50)
100K trees: 200 – 2000 h (AWS: ~$8620.00)
• YARN cluster
• 12 workers
• 16 x Intel Xeon E5-2660 @ 2.20 GHz CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6 GB / executor (0.75 TB)
• Synthetic dataset
Plot regions: Whole Genome Range, GWAS Range
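The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes that each quoted AWS cost corresponds to the upper end of its runtime range; that pairing is an assumption, not something the slide states, and the variable names are illustrative.

```python
# Figures taken from the slide
executors = 128
gb_per_executor = 6
variables = 50_000_000   # 50 M variables
samples = 10_000         # 10 K samples

# Assumption: each AWS cost is quoted for the longest runtime in its range
runs = [
    {"hours_max": 50, "aws_cost_usd": 215.50},
    {"hours_max": 2000, "aws_cost_usd": 8620.00},
]

# Total executor memory: 128 x 6 GB = 768 GB, i.e. 0.75 TB (1 TB = 1024 GB)
total_memory_tb = executors * gb_per_executor / 1024

# Implied cluster price per hour for each run
hourly_rates = [run["aws_cost_usd"] / run["hours_max"] for run in runs]

# Number of genotype values in the full variables-by-samples matrix
genotype_cells = variables * samples

print(f"Executor memory: {total_memory_tb} TB")
print("Implied hourly rates (USD):", [round(rate, 2) for rate in hourly_rates])
print(f"Genotype cells: {genotype_cells:,}")
```

Under that assumption both runs imply the same hourly cluster price, so the 40-fold cost difference comes entirely from runtime.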
23. Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
24. Future Directions for VariantSpark RF
Additional feature types
• Unordered categorical
• For scores: continuous
Different feature ranges
• Small and big inputs
• For gene expression analysis
25. Genome editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy
Editing does not work every time, e.g. only 7 in 10 embryos were mutation-free
Aim: Develop a computational guidance framework to enable edits the first time, every time
Ma et al. Nature 2017 *
* Controversy around the paper – stay tuned
26. Make process parallel and scalable
• SPEED: Each search can be broken down into parallel tasks so that it takes only seconds
• SCALE: Researchers might want to search the targets for one gene or for 100,000
Scalability + Agility =
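The idea of breaking one search into independent tasks can be sketched locally. This is a minimal illustration, not GT-Scan2's implementation: a thread pool stands in for serverless function invocations, the target pattern (a 20-nt sequence followed by an NGG motif) is an assumed simplification, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

SITE_LEN = 23  # assumption: 20-nt target followed by a 3-nt NGG motif


def find_sites(chunk, offset):
    """Scan one chunk; return (absolute_position, site) for each window ending in GG."""
    hits = []
    for i in range(len(chunk) - SITE_LEN + 1):
        site = chunk[i:i + SITE_LEN]
        if site[21:] == "GG":
            hits.append((offset + i, site))
    return hits


def split_with_overlap(seq, chunk_size):
    """Yield (chunk, offset) pairs; chunks overlap by SITE_LEN - 1 bases
    so that a site spanning a chunk boundary is still found exactly once."""
    for start in range(0, len(seq), chunk_size):
        yield seq[start:start + chunk_size + SITE_LEN - 1], start


def parallel_search(seq, chunk_size=50, workers=4):
    """Search all chunks concurrently and merge the results in position order."""
    tasks = list(split_with_overlap(seq, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda task: find_sites(*task), tasks)
    return sorted(hit for part in results for hit in part)


if __name__ == "__main__":
    demo = "ATCGATCGATCGATCGATCGAGG" + "T" * 40 + "GCTAGCTAGCTAGCTAGCTACGG"
    for position, site in parallel_search(demo, chunk_size=10):
        print(position, site)
```

Because each chunk is scanned independently, the same split works whether the tasks run on four local threads or on thousands of short-lived cloud functions; only the dispatch layer changes.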
27. One of the first Serverless Applications in Research
Featured in "This is My Architecture"
http://www.nature.com/nature/journal/v462/n7276/fig_tab/nature08645_F1.html
Bauer et al. Trends Mol Med. 2014 PMID: 24801560.
GT-Scan: https://www.gt-scan.net/
AMA with Dr. Bauer: https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/
Recent team presentation: https://www.slideshare.net/AustralianNationalDataService/gtscan2-bringing-bioinformatics-to-the-cloud-may-tech-talk
Quickly access a managed Spark cluster on AWS EC2 / spot instances
Link to your data and perform whole-genome analysis in real time