Dr. Denis Bauer & Lynn Langit
Genomic-scale Data Pipelines
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Team
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai, PhD
Natalie Twine, PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
Big Data in 2025…Petabytes? (bar chart; Astronomy: 1000, Twitter: 17, YouTube: 2000)
Genome holds the blueprint for every cell
It affects looks, disease risk, and behavior
GENOMIC Big Data in 2025 - Exabytes (bar chart; Astronomy: 1, Twitter: 0.17, YouTube: 2, Genomic: 20)
VCF Data
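As a rough illustration of what VCF data looks like to a program, the sketch below reads one tab-separated VCF record and counts alternate alleles per sample from the GT field. The header, the record, and the sample names are toy values made up for this example; the column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) follows the VCF layout.

```python
from typing import Dict, List, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    # sample name -> number of alternate alleles (None if the call is missing)
    genotypes: Dict[str, Optional[int]]


def parse_header(line: str) -> List[str]:
    """Return the sample names from a '#CHROM ...' header line."""
    return line.rstrip("\n").split("\t")[9:]


def parse_record(line: str, samples: List[str]) -> Variant:
    """Parse one VCF data line into a Variant."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
    genotypes: Dict[str, Optional[int]] = {}
    for sample, column in zip(samples, fields[9:]):
        gt = column.split(":")[gt_index]
        alleles = gt.replace("|", "/").split("/")  # phased or unphased
        if "." in alleles:
            genotypes[sample] = None
        else:
            genotypes[sample] = sum(1 for a in alleles if a != "0")
    return Variant(chrom, int(pos), ref, alt, genotypes)


# Toy input (hypothetical samples S1 and S2)
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
record = "20\t14370\trs1\tG\tA\t29\tPASS\tNS=2\tGT:GQ\t0|0:48\t1/1:43"
variant = parse_record(record, parse_header(header))
```

Here `variant.genotypes` maps each sample to 0, 1, or 2 copies of the alternate allele, which is the kind of feature matrix a downstream ML step would consume.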
Genomic Research Workflow
https://www.projectmine.com/about/
Focus
Finding the disease gene(s)
Spot the variant that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
(Figure: variants in Gene1 and Gene2 compared between cases and controls)
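The oversimplified rule on this slide (present in all affected, absent in all unaffected) can be sketched as a filter over a genotype table. This is a minimal illustration, not the analysis the team runs: the variant IDs, sample names, and the dictionary layout are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def candidate_variants(
    genotypes: Dict[str, Dict[str, int]],
    cases: Iterable[str],
    controls: Iterable[str],
) -> List[str]:
    """Return variants carried by every case and by no control.

    genotypes maps variant ID -> {sample name -> alternate allele count}.
    A sample missing from a variant's entry is treated as a non-carrier.
    """
    cases = list(cases)
    controls = list(controls)
    hits = []
    for variant_id, calls in genotypes.items():
        in_all_cases = all(calls.get(s, 0) > 0 for s in cases)
        in_no_controls = all(calls.get(s, 0) == 0 for s in controls)
        if in_all_cases and in_no_controls:
            hits.append(variant_id)
    return hits


# Toy data: two hypothetical variants, two cases, two controls
toy_genotypes = {
    "gene1_variant": {"case1": 1, "case2": 2, "ctrl1": 0, "ctrl2": 0},
    "gene2_variant": {"case1": 1, "case2": 0, "ctrl1": 1, "ctrl2": 0},
}
hits = candidate_variants(toy_genotypes, ["case1", "case2"], ["ctrl1", "ctrl2"])
```

Real cohorts rarely separate this cleanly, which is why the slide marks the rule as oversimplified and why the talk moves on to machine learning.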
CloudDataPipelinePattern
Problem
• Define the business problem
Data
• Quality
• Quantity
• Location
Candidate Technologies
• Ingest
• Clean
• Analyze
• Predict
• Visualize
Build MVPs
• Iterate
• Learn
• Assemble
Assemble Pipeline
• Validate sections
• Test at scale
Machine Learning Pipeline Pattern
What is CSIRO’s solution?
• For scale at reasonable cost: use Apache Hadoop
• For scale at speed: use Apache Spark
• For usability in bioinformatics: create a domain-specific ML API (library)
• For global use: leverage Cloud Pipeline Patterns
GWAS Analysis with Variant-Spark
On-premise cluster with Apache Hadoop & Spark
Genomics Analysts
CSIRO corporate data center
Why Apache Spark?
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
Supervised ML: Wide Random Forests
Solving Important Questions…
Cancer genomics?
DEMO: Who is a Hipster?
VariantSpark & Databricks Notebook
Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
(Chart axes: Accuracy, low to high; Speed, low to high)
Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~$8,620.00
• YARN cluster
• 12 workers
• 16 x Intel Xeon E5-2660 @ 2.20GHz CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6 GB / executor (0.75 TB)
• Synthetic dataset
Whole Genome Range
GWAS Range
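The figures on this slide can be cross-checked with a little arithmetic. The sketch below recomputes the total executor memory and the hourly rate implied by each AWS estimate. It assumes each quoted cost corresponds to the upper end of its runtime range; the slide does not state the instance type or pricing behind the numbers.

```python
# Total executor memory: 128 executors at 6 GB each
executors = 128
gb_per_executor = 6
total_gb = executors * gb_per_executor  # 768 GB
total_tb = total_gb / 1024              # 0.75 TB, matching the slide


def implied_hourly_rate(cost_usd: float, hours: float) -> float:
    """Cost divided by runtime, rounded to cents."""
    return round(cost_usd / hours, 2)


# Assumption: each AWS cost is quoted for the longest runtime in its range
rate_short_runs = implied_hourly_rate(215.50, 50)
rate_long_runs = implied_hourly_rate(8620.00, 2000)

print(f"executor memory: {total_gb} GB = {total_tb} TB")
print(f"implied rate: ${rate_short_runs}/h and ${rate_long_runs}/h")
```

Under that assumption both estimates work out to the same hourly rate, which suggests the two cost figures were derived from one cluster price applied to the two runtime ranges.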
Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
Future Directions for VariantSpark RF
Additional feature types
• Unordered categorical
• Continuous (for scores)
Different feature ranges
• Small and big inputs (for gene expression analysis)
Genome editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy
Editing does not work every time, e.g. only 7 in 10 embryos were mutation free
Aim: Develop a computational guidance framework to enable edits the first time, every time
Ma et al. Nature 2017 *
* Controversy around the paper – stay tuned
Make process parallel and scalable
• SPEED: Each search can be broken down into parallel tasks so that it takes only seconds
• SCALE: Researchers might want to search the target for one gene or 100,000
Scalability + Agility =
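The SPEED point above, splitting one search into independent tasks, can be sketched as follows. This is an illustration only: it looks for a literal motif in a sequence using a local thread pool, whereas the talk runs its tasks as serverless functions, and the actual target-site criteria used by GT-Scan2 are not given in the slides. The function names and chunk size are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Tuple


def find_sites(sequence: str, motif: str, offset: int = 0) -> List[int]:
    """Return start positions of motif in sequence (overlapping matches included)."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(offset + start)
        start = sequence.find(motif, start + 1)
    return sites


def chunk_with_overlap(sequence: str, chunk_size: int, overlap: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs; each chunk runs `overlap` characters into
    the next one so a match spanning a boundary is not lost."""
    for offset in range(0, len(sequence), chunk_size):
        yield offset, sequence[offset:offset + chunk_size + overlap]


def parallel_find(sequence: str, motif: str, chunk_size: int = 1000, workers: int = 4) -> List[int]:
    """Search each chunk as an independent task and merge the results."""
    overlap = len(motif) - 1
    chunks = list(chunk_with_overlap(sequence, chunk_size, overlap))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: find_sites(c[1], motif, c[0]), chunks)
    return sorted({pos for sites in results for pos in sites})


# Toy sequence; the parallel result should equal the single-pass result
toy_sequence = "ACGGTAGGCCGG" * 50
parallel_hits = parallel_find(toy_sequence, "GG", chunk_size=7)
serial_hits = find_sites(toy_sequence, "GG")
```

Because every chunk is searched independently, the same split works whether the tasks run on threads, on a cluster, or as one function invocation per chunk, which is what makes the one-gene and 100,000-gene cases differ only in how many tasks are launched.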
One of the first Serverless Applications in Research
Featured in This is My Architecture
GT-Scan2
Considering Services for GT-Scan2
• Use AWS Step Functions
• Simplify workflow
• Simplify task timeouts
• Simplify task failures
• Must evaluate costs
• SNS vs. Step Functions
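To make the timeout and failure points above concrete, here is a hedged sketch that builds a small Step Functions definition in Amazon States Language, with a per-task timeout, a retry policy, and a catch-all failure state. The task names, the ARNs, and the retry settings are placeholders chosen for illustration; they are not taken from the GT-Scan2 deployment.

```python
import json
from typing import Dict, List


def build_state_machine(lambda_arns: List[str],
                        timeout_seconds: int = 60,
                        max_attempts: int = 3) -> Dict:
    """Chain Lambda tasks in sequence, each with a timeout, retries, and a catch."""
    names = [f"Task{i + 1}" for i in range(len(lambda_arns))]
    states: Dict[str, Dict] = {}
    for i, (name, arn) in enumerate(zip(names, lambda_arns)):
        state = {
            "Type": "Task",
            "Resource": arn,
            "TimeoutSeconds": timeout_seconds,  # task timeout handled by the service
            "Retry": [{
                "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                "IntervalSeconds": 2,
                "MaxAttempts": max_attempts,
                "BackoffRate": 2.0,
            }],
            # Anything still failing after the retries goes to the Failed state
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Failed"}],
        }
        if i + 1 < len(names):
            state["Next"] = names[i + 1]
        else:
            state["End"] = True
        states[name] = state
    states["Failed"] = {
        "Type": "Fail",
        "Error": "PipelineFailed",
        "Cause": "A task failed after all retries",
    }
    return {"StartAt": names[0], "States": states}


# Placeholder ARNs (hypothetical function names, region, and account)
arns = [
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:hypothetical-find-targets",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:hypothetical-score-targets",
]
definition = build_state_machine(arns)
definition_json = json.dumps(definition, indent=2)
```

Declaring retries and timeouts in the definition moves that logic out of the function code, which is the simplification the slide refers to; the cost comparison with SNS still has to be evaluated separately.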
CloudDataPipelinePattern
Stages: Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
1. Analyze/GWAS (Spark)
• Data: vcf -> S3/Hadoop
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook SQL, R or Python
2. Search/GTScan2 (Serverless)
• Data: S3/fastq -> DynamoDB; S3/fastq, bed
• Ingest: S3
• ETL: Lambda
• Analyze: Lambda
• Viz: Lambda/API Gateway
Spark Pipeline Pattern
Jupyter Notebook
Serverless Architecture Pattern
(Diagram components: Lambda function 1, Lambda function 2, Lambda function 3, buckets with objects, DynamoDB, API Gateway, Users, Step Functions)
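One piece of this pattern, a Lambda function behind API Gateway, can be sketched as below. This is a minimal handler in the shape expected by API Gateway's Lambda proxy integration (a `queryStringParameters` field on the event, and a response with `statusCode`, `headers`, and `body`). The `gene` parameter and the response payload are hypothetical, and the storage lookup is left as a comment so the example runs without AWS access.

```python
import json
from typing import Any, Dict


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Entry point for an API Gateway proxy request."""
    # queryStringParameters is None when the request has no query string
    params = event.get("queryStringParameters") or {}
    gene = params.get("gene")
    if not gene:
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": "missing 'gene' parameter"}),
        }
    # A deployed function would read from or write to DynamoDB or S3 here.
    result = {"gene": gene, "status": "queued"}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }


# Local invocation with hand-built events (no AWS account required)
ok_response = handler({"queryStringParameters": {"gene": "GENE1"}}, None)
bad_response = handler({"queryStringParameters": None}, None)
```

Keeping each function this small is what lets the platform scale them independently, with the buckets and DynamoDB holding state between invocations.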
Cloud Genomic Data Pipelines
• Problem #1 – Analyze
• Find the mutated genes
• Solution: Spark-based machine learning
• Problem #2 – Scan
• Find the nucleotide (DNA letters)
• Solution: Serverless
Genomics Big Data Pipelines
Dr. Denis Bauer & Lynn Langit


Editor's Notes

  1. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  2. https://www.genome.gov/18016863/a-brief-guide-to-genomics/ https://www.thinglink.com/scene/617714375666434050
  3. http://images.wisegeek.com/woman-in-greek-tank-top-looking-at-thumb.jpg http://nborganics.com.au/index.php/product/herbs-coriander/
  4. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  5. http://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/
  6. https://academics.cloud.databricks.com/#notebook/170398/command/170419 – AND-- http://www.drjasonfox.com/
  7. Quickly access a managed Spark cluster - AWS EC2 / spot instances. Link to your data and perform whole genome analysis in real time.
  8. https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
  9. http://www.nature.com/nature/journal/v462/n7276/fig_tab/nature08645_F1.html Bauer et al. Trends Mol Med. 2014 PMID: 24801560.
  10. https://www.gt-scan.net/ --AND-- AMA with Dr. Bauer -- https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/
  11. Recent team presentation - https://www.slideshare.net/AustralianNationalDataService/gtscan2-bringing-bioinformatics-to-the-cloud-may-tech-talk
  12. Quickly access a managed Spark cluster - AWS EC2 / spot instances. Link to your data and perform whole genome analysis in real time.