A Workflow-Driven Discovery and Training Ecosystem
for Distributed Analysis of Biomedical Big Data
İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”
1985
today
Two discoveries in drug
design from 1987 and 1991.
Ross Walker Group
SDSC continues to be a leader in scientific computing and big data!
Gordon: First Flash-based Supercomputer for Data-intensive Apps
Comet: Serving the Long
Tail of Science
27 standard racks
= 1944 nodes
= 46,656 cores
= 249 TB DRAM
= 622 TB SSD
~ 2 Pflop/s
• 36 GPU nodes
• 4 Large Memory nodes
• 7 PB Lustre storage
• High performance
virtualization
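As a quick consistency check on the Comet figures above, the per-node and per-rack values can be derived from the stated totals. This is a minimal sketch: it uses only the numbers on the slide and assumes decimal units (1 TB = 1000 GB); the variable names are illustrative.

```python
# Totals as stated on the slide for Comet.
RACKS = 27
NODES = 1944
CORES = 46_656
DRAM_TB = 249
SSD_TB = 622

# Derived per-rack and per-node figures (assumption: 1 TB = 1000 GB).
nodes_per_rack = NODES / RACKS
cores_per_node = CORES / NODES
dram_gb_per_node = DRAM_TB * 1000 / NODES
ssd_gb_per_node = SSD_TB * 1000 / NODES

print(f"nodes per rack:    {nodes_per_rack:.0f}")
print(f"cores per node:    {cores_per_node:.0f}")
print(f"DRAM per node:     {dram_gb_per_node:.1f} GB")
print(f"SSD per node:      {ssd_gb_per_node:.1f} GB")
```

The totals divide evenly into 72 nodes per rack and 24 cores per node, with roughly 128 GB of DRAM and 320 GB of SSD per node.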
SDSC Data Science Office
-- Expertise, Systems and Training
for Data Science Applications --
SDSC Data Science Office (DSO)
SDSC DSO is a collaborative virtual organization at SDSC for collective
lasting innovation in data science research, development and education.
DSO
SDSC Expertise and Strengths
Big Data Platforms
Training
Industry
Applications
Life Sciences is an ongoing strategic
application thrust at SDSC…
Genomic Analysis is a Big Data and Big Compute Problem
BIG DATA
COMPUTING AT
SCALE
Enables dynamic data-driven applications
Computer-Aided Drug Discovery
Personalized Precision Medicine
Requires:
• Data management
• Data-driven methods
• Scalable tools for
dynamic coordination
and resource
optimization
• Skilled interdisciplinary
workforce
Team work and
process management
Vaccine Development
Metagenomics
…
New era of data
science!
Needs and Trends for the New Era Data Science
-- the Big Data Era Goals --
• More data-driven
• More dynamic
• More process-driven
• More collaborative
• More accountable
• More reproducible
• More interactive
• More heterogeneous
Volume: Scalable batch processing
Velocity: Stream processing
Variety: Extensible data storage, access and integration
Genomic Data Management and Processing in the Big
Data Era has Unique Challenges!
HBase
Hive Pig
Zookeeper
Giraph
Storm
Spark
MapReduce
YARN
MongoDB
Cassandra
HDFS
Flink
Lower levels:
Storage and scheduling
Higher levels:
Interactivity
These challenges push for new tools to tackle them.
COORDINATION AND
WORKFLOW MANAGEMENT
DATA INTEGRATION
AND PROCESSING
DATA MANAGEMENT
AND STORAGE
How do we use
these new tools
and combine them
with existing
domain-specific
solutions in
scientific
computing and
data science?
Layer 1: Data Management and Storage
Layer 2: Data Integration and Processing
HBase
Hive Pig
Zookeeper
Giraph
Storm
Spark
MapReduce
YARN
MongoDB
Cassandra
HDFS
Flink
+ Application
specific libraries
Most of the time, more than one
analysis needs to take place…
And each analysis has multiple steps
to integrate!
Pipelining is a way to put the steps together.
Source: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform
Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink
Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html
Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
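The pipelining idea above, putting the steps of an analysis together, can be sketched in a few lines. This is a minimal illustration, not the approach of any tool named in the slides: the `pipeline` helper and the three toy steps (`normalize`, `drop_short`, `gc_content`) are hypothetical names chosen for the example.

```python
from functools import reduce


def pipeline(*steps):
    """Compose steps left to right: the output of one step feeds the next."""
    def run(data):
        return reduce(lambda acc, step: step(acc), steps, data)
    return run


# Hypothetical toy steps operating on short DNA reads.
def normalize(reads):
    """Strip whitespace and upper-case each read."""
    return [r.strip().upper() for r in reads]


def drop_short(reads, min_len=4):
    """Discard reads shorter than min_len bases."""
    return [r for r in reads if len(r) >= min_len]


def gc_content(reads):
    """Fraction of G and C bases in each read."""
    return [(r.count("G") + r.count("C")) / len(r) for r in reads]


analyze = pipeline(normalize, drop_short, gc_content)
result = analyze([" acgt", "gg", "GGCC\n"])
print(result)  # [0.5, 1.0]
```

Because each step only depends on the output of the previous one, steps can be reordered, replaced, or reused in other analyses without touching the rest of the chain.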
Layer 3: Coordination and Workflow Management
ACQUIRE → PREPARE → ANALYZE → REPORT → ACT
…
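The five stages above can be expressed as a small dependency graph that a coordinator executes in order. This is only a sketch under stated assumptions: it uses Python's standard `graphlib` module rather than Kepler, and the stage functions, the `threshold` parameter, and the sample values are hypothetical.

```python
from graphlib import TopologicalSorter


# Hypothetical stage implementations; each takes and returns a context dict.
def acquire(ctx):
    return {**ctx, "raw": [3.0, 1.0, 2.0]}


def prepare(ctx):
    return {**ctx, "clean": sorted(ctx["raw"])}


def analyze(ctx):
    values = ctx["clean"]
    return {**ctx, "mean": sum(values) / len(values)}


def report(ctx):
    return {**ctx, "summary": f"mean={ctx['mean']:.2f}"}


def act(ctx):
    return {**ctx, "alert": ctx["mean"] > ctx["threshold"]}


STAGES = {"acquire": acquire, "prepare": prepare, "analyze": analyze,
          "report": report, "act": act}

# Each stage maps to the set of stages it depends on.
DEPENDENCIES = {
    "prepare": {"acquire"},
    "analyze": {"prepare"},
    "report": {"analyze"},
    "act": {"report"},
}


def run_workflow(stages, dependencies, context):
    """Run every stage after its dependencies; return final context and run order."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        context = stages[name](context)
        executed.append(name)
    return context, executed


final_context, executed = run_workflow(STAGES, DEPENDENCIES, {"threshold": 1.5})
print(executed)
print(final_context["summary"], final_context["alert"])
```

Keeping the dependency graph separate from the stage code is the design choice that matters here: the same stages can be rewired, and the recorded execution order doubles as a simple trace of what ran.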
kepler-project.org
Workflows for Data Science
Center of Excellence at SDSC
Building functional, operational
and reproducible solution
architectures using big data
and HPC tools is what we do.
Focus on the question, not the technology!
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management
wifire.ucsd.edu
Data-Parallel Bioinformatics
bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery
nbcr.ucsd.edu
WorDS.sdsc.edu
bioKepler:
A Kepler Module for Bio Big Data Analysis
Source: Larry Smarr, Calit2
• Metagenomic Sequencing
• JCVI Produced
• ~150 Billion DNA Bases From Seven of LS Stool Samples Over 1.5 Years
• ~3 Trillion DNA Bases From NIH Human Microbiome Program Data Base
• 255 Healthy People, 21 with IBD
Illumina
HiSeq 2000
at JCVI
SDSC Gordon Data Supercomputer
Example from 2013: Inflammatory Bowel Disease (IBD)
• Supercomputing (W. Li, JCVI/HLI/UCSD):
• ~20 CPU-Years on SDSC’s Gordon
• ~4 CPU-Years on Dell’s HPC Cloud
• Produced Relative Abundance of ~10K Bacteria, Archaea, Viruses in ~300 People
• ~3 Million Filled Spreadsheet Cells
Ongoing Research:
Optimization of Heterogeneous Resource Utilization using bioKepler
National Resources (Gordon, Comet, Stampede, Lonestar)
Cloud Resources
Optimized
Local Cluster Resources
Uses existing genomics
tools and computing
systems!
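To make the resource-utilization question above concrete, here is a minimal sketch of one generic heuristic: greedy earliest-finish-time assignment. It is not bioKepler's method, which the slides do not describe. The task names, work units, and resource speeds are all hypothetical.

```python
def schedule(tasks, resources):
    """Greedily assign each task to the resource that would finish it soonest.

    tasks:     {task name: amount of work, in arbitrary units}
    resources: {resource name: speed, in work units per hour}
    Returns (plan, makespan_hours).
    """
    free_at = {name: 0.0 for name in resources}  # hour at which each resource is idle
    plan = {}
    # Place the largest tasks first so they get the fastest free resources.
    for task, work in sorted(tasks.items(), key=lambda kv: -kv[1]):
        best = min(resources, key=lambda r: free_at[r] + work / resources[r])
        free_at[best] += work / resources[best]
        plan[task] = best
    return plan, max(free_at.values())


# Hypothetical speeds for three classes of resources.
resources = {"national_hpc": 4.0, "cloud": 2.0, "local_cluster": 1.0}
# Hypothetical work estimates for three analysis tasks.
tasks = {"align": 8.0, "assemble": 4.0, "annotate": 2.0}

plan, makespan = schedule(tasks, resources)
print(plan)
print(f"makespan: {makespan} h")
```

With these made-up numbers every resource finishes after two hours. A real scheduler would also have to account for data transfer, queue wait times, and cost, which this sketch ignores.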
Computing is just one part
of it…
…new methods needed!
Needs of a Dynamic Ecosystem of
Genomic Discovery
• Exploratory methods to see temporal changes and patterns in sequence data
• Efficient updates to analysis as quickly as new sequence data is generated
• Regular reruns of annotations as reference databases evolve
• Integration of genomic data with other types of data, e.g., image, environmental, social graphs
• Dynamic ability to check quality and provenance of data and analysis
• Transparent support for computing platforms designed for genomic discovery and pattern analysis
• Workflow coordination and system integration
• People and culture to make it happen collaboratively!
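Two of the needs listed above, rerunning annotations when reference databases evolve and checking provenance, can be combined in a small sketch. It assumes a provenance record is simply a fingerprint of everything an analysis depended on. The function names, the in-memory `provenance` dictionary, and the version strings are hypothetical.

```python
import hashlib
import json


def fingerprint(inputs):
    """Stable SHA-256 hash of the inputs an analysis depended on."""
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_rerun(provenance, sample_id, inputs):
    """True if the sample was never annotated or any recorded input has changed."""
    return provenance.get(sample_id) != fingerprint(inputs)


def record_run(provenance, sample_id, inputs):
    """Store the fingerprint of a completed run."""
    provenance[sample_id] = fingerprint(inputs)


provenance = {}  # hypothetical in-memory provenance store
inputs = {"reads_checksum": "abc123", "reference_db": "refdb-v1", "tool": "annotator-1.0"}

first_check = needs_rerun(provenance, "sample-01", inputs)    # never run before
record_run(provenance, "sample-01", inputs)
second_check = needs_rerun(provenance, "sample-01", inputs)   # nothing changed

updated = {**inputs, "reference_db": "refdb-v2"}              # reference database evolved
third_check = needs_rerun(provenance, "sample-01", updated)

print(first_check, second_check, third_check)  # True False True
```

Only samples whose recorded fingerprint no longer matches are reprocessed, so an update to the reference database triggers reruns without recomputing everything else.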
Examples from 2016: Apache Big Data Technologies
in Life Sciences
• Lightning Fast Genomics with ADAM
  • Goal
    • Study genetic variations in populations at scale (e.g., 1000 Genomes Project)
  • Technology stack
    • Apache Avro (data serialization, schema definition)
    • Apache Parquet (compact columnar storage)
    • Apache Spark (distributed parallel processing)
    • Spark MLlib (machine learning, clustering)
  • Source: AMPLab, UC Berkeley (http://bdgenomics.org/)
• Compressive Structural Bioinformatics using MMTF
  • Goal
    • 100+ speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB)
  • Technology stack
    • MMTF (Macromolecular Transmission Format, compact storage in Hadoop Sequence Files)
    • Apache Spark (in-memory, parallel distributed workflows using compressed data)
    • Spark ML (clustering)
  • Source: SDSC, UC San Diego (http://mmtf.rcsb.org/)
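The "compact columnar storage" item in the ADAM stack above can be illustrated without any big data library. The sketch below is a plain-Python stand-in, not the Parquet or ADAM API: it pivots row records into columns so that an analysis touching one field reads only that field. The records, field names, and helper functions are hypothetical, and every row is assumed to carry the same fields.

```python
# Hypothetical genotype calls stored row by row.
rows = [
    {"sample": "s1", "position": 101, "genotype": "0/1"},
    {"sample": "s2", "position": 101, "genotype": "1/1"},
    {"sample": "s3", "position": 101, "genotype": "0/0"},
    {"sample": "s4", "position": 101, "genotype": "0/1"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def alt_allele_frequency(genotypes):
    """Fraction of alleles equal to '1' across all genotype calls."""
    alleles = [allele for call in genotypes for allele in call.split("/")]
    return alleles.count("1") / len(alleles)


columns = to_columns(rows)
# Only the genotype column is needed; sample and position are never read.
frequency = alt_allele_frequency(columns["genotype"])
print(frequency)  # 0.5
```

A columnar file format applies the same idea on disk, which also lets similar values sit next to each other and compress well.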
Development of tools and technologies that enable models to bridge across
diverse scales of biological organization, while leveraging all types and
sources of data
NBCR Example: Distilling Medical Image
Data for Biomedical Action nbcr.ucsd.edu
Identify gaps in multiscale modeling capabilities and develop new
methods and tools that allow us to bridge across these gaps
Spatial and Temporal Scales
Levels of organization: Molecular & Macromolecular, Sub-Cellular, Cell, Tissue, Organ
Spatial scales: Å; nm - μm; 0.1 mm - mm; cm
Temporal scales: fs - μs; μs - ms; ms - s; s - lifespan
Driving Biomedical Projects propel technology development across multi-scale
modeling capability gaps, from simulation to data assembly & integration
A challenge: Data Integration
Challenge to bridge across diverse scales of biological organization, to understand emergent behavior, and
the molecular mechanisms underlying biological function & disease
Integrated Multi-Scale Modeling Toolkits in NBCR
User Interface
NBCR Products
Battling complexity while facilitating collaboration and increasing reproducibility.
Cyberinfrastructure Innovation Based on User Needs
Domain-specific tools, workflows,
data and computing infrastructure.
Components for Multi-Scale Modeling
A handful of customizable and
extensible tools, workflows, user
interfaces and publishable research
objects.
NBCR Products
Workflows
Scientific Tools
Past
Experiments
• UI generation
• Logical workflow generation
• Uncertainty quantification
• Workflow execution
• Provenance tracking
• System integration
[Figure: bar charts of cell proliferation (scale 0 to 1.4) for cancer cells with the p53-R175H mutant and for cells with no p53, comparing medium or no compound, Prima-1, stictic acid, and numbered candidate compounds such as 35ZWF, 25KKL, 22LSV and 32CTM]
15 new reactivation compounds
reactivation compounds
kill cells with p53 cancer
mutant
BENEFITS:
• Increase reuse
• Reproducibility
• Scale execution, problem & solution
• Compare methods
• Train students
• Train	students
Minimization Actor Equilibration Actor
AMBER GPU MD Workbench
Rommie Amaro, PI, UCSD
Computational chemistry, biophysics
Andrew McCammon, UCSD
Computational chemistry, biophysics,
chemical physics
Mark Ellisman, UCSD
Molecular & cellular biology
Andrew McCulloch, UCSD
Bioengineering, biophysics
Michel Sanner, TSRI
Drug discovery & molecular
visualization
Phil Papadopoulos, UCSD/SDSC
Computer engineering, cyberinfrastructure
technology
Ilkay Altintas, UCSD/SDSC
Workflows, provenance
Michael Holst, UCSD
Math, physics
Arthur Olson, TSRI
Computational chemistry, drug
discovery, visualization
LEADERSHIP
TEAM
Training at the interface
Challenge: how do we build the next generation of
interdisciplinary scientists?
Data-to-Structural-Models Simulation-Based Drug Discovery
Biomedical Big Data Training Collaboratory
http://biobigdata.ucsd.edu
• BBDTC website is up and evolving!
• BBDTC contains seven full, open biomedical training courses
• Four-course biomedical big data series is planned for Winter 2017
Working with Industry Partners at SDSC
SDSC Provides a Range of Strategies for
Engaging with Industry
• Sponsored research agreements
• Service agreements for use of systems & consulting
• Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow Technologies)
• Training programs in Data Science & Analytics
• Industry Partners Program for “jump starting” collaborations
Working with industry helps companies be more competitive, drives innovation, and fosters a
healthy ecosystem between the research and private sectors.
Example for Industrial Collaboration:
Janssen R&D Rheumatoid Arthritis Study
• Janssen was interested in correlating genomic profile with response to TNFα inhibitor golimumab
• Sequenced 438 patients (full genome)
• SDSC assisted with re-alignment and variant calling using new/improved algorithms
• Needed analysis done in a reasonable timeframe (a few weeks)
Questions?
Ilkay Altintas, Ph.D.
Email: ialtintas@ucsd.edu

More Related Content

What's hot

A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Scienceijtsrd
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...Geoffrey Fox
 
Introducing the hadoop ecosystem
Introducing the hadoop ecosystemIntroducing the hadoop ecosystem
Introducing the hadoop ecosystemGeert Van Landeghem
 
Thesis blending big data and cloud -epilepsy global data research and inform...
Thesis  blending big data and cloud -epilepsy global data research and inform...Thesis  blending big data and cloud -epilepsy global data research and inform...
Thesis blending big data and cloud -epilepsy global data research and inform...Anup Singh
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at YorkMing Li
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentationKlawal13
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Modeinventionjournals
 
Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Jian Qin
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET Journal
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?LIBER Europe
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
iMarine catalogue of services
iMarine catalogue of servicesiMarine catalogue of services
iMarine catalogue of servicesiMarine283644
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data Geoffrey Fox
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataJoey Li
 

What's hot (20)

A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Science
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Introducing the hadoop ecosystem
Introducing the hadoop ecosystemIntroducing the hadoop ecosystem
Introducing the hadoop ecosystem
 
IJET-V3I2P14
IJET-V3I2P14IJET-V3I2P14
IJET-V3I2P14
 
Thesis blending big data and cloud -epilepsy global data research and inform...
Thesis  blending big data and cloud -epilepsy global data research and inform...Thesis  blending big data and cloud -epilepsy global data research and inform...
Thesis blending big data and cloud -epilepsy global data research and inform...
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at York
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentation
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articles
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
iMarine catalogue of services
iMarine catalogue of servicesiMarine catalogue of services
iMarine catalogue of services
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 

Viewers also liked

The Use of Discovery Driven Planning to Manage High Uncertainty Projects
The Use of Discovery Driven Planning to Manage High Uncertainty ProjectsThe Use of Discovery Driven Planning to Manage High Uncertainty Projects
The Use of Discovery Driven Planning to Manage High Uncertainty ProjectsJose Briones
 
JPJ1407 Expressive, Efficient, and Revocable Data Access Control for Multi-...
JPJ1407   Expressive, Efficient, and Revocable Data Access Control for Multi-...JPJ1407   Expressive, Efficient, and Revocable Data Access Control for Multi-...
JPJ1407 Expressive, Efficient, and Revocable Data Access Control for Multi-...chennaijp
 
Building a Distributed Data Pipeline
Building a Distributed Data PipelineBuilding a Distributed Data Pipeline
Building a Distributed Data PipelineTom Lous
 
Secure data sharing in cloud computing using revocable storage identity-based...
Secure data sharing in cloud computing using revocable storage identity-based...Secure data sharing in cloud computing using revocable storage identity-based...
Secure data sharing in cloud computing using revocable storage identity-based...Shakas Technologies
 
SECURE DATA SHARING IN CLOUD COMPUTING USING REVOCABLE-STORAGE IDENTITY-BASED...
SECURE DATA SHARING IN CLOUD COMPUTING USING REVOCABLE-STORAGE IDENTITY-BASED...SECURE DATA SHARING IN CLOUD COMPUTING USING REVOCABLE-STORAGE IDENTITY-BASED...
SECURE DATA SHARING IN CLOUD COMPUTING USING REVOCABLE-STORAGE IDENTITY-BASED...Nexgen Technology
 
Building a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & AirflowBuilding a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & AirflowTom Lous
 
Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...Learning and Development Freelancer
 
The need to redefine genomic data sharing - moving towards Open Science Oct ...
The need to redefine genomic data sharing - moving towards Open Science  Oct ...The need to redefine genomic data sharing - moving towards Open Science  Oct ...
The need to redefine genomic data sharing - moving towards Open Science Oct ...Fiona Nielsen
 
How to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationHow to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationJoaquin Dopazo
 
Genomic futures v_pitt_kent_osu
Genomic futures v_pitt_kent_osuGenomic futures v_pitt_kent_osu
Genomic futures v_pitt_kent_osuBen Busby
 
Advanced genomics v_medical_pitt_kent_osu
Advanced genomics v_medical_pitt_kent_osuAdvanced genomics v_medical_pitt_kent_osu
Advanced genomics v_medical_pitt_kent_osuBen Busby
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFlávio Codeço Coelho
 
Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of BioinformaticsManaging & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of BioinformaticsRaul Chong
 
Current Practices, Trends and Emerging roles in Learning and Development
Current Practices, Trends and Emerging roles in Learning and DevelopmentCurrent Practices, Trends and Emerging roles in Learning and Development
Current Practices, Trends and Emerging roles in Learning and DevelopmentLearning and Development Freelancer
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Big Data and Genomic Medicine by Corey Nislow
Big Data and Genomic Medicine by Corey NislowBig Data and Genomic Medicine by Corey Nislow
Big Data and Genomic Medicine by Corey NislowKnome_Inc
 
Day 2 Big Data panel at the NIH BD2K All Hands 2016 meeting
Day 2 Big Data panel at the NIH BD2K All Hands 2016 meetingDay 2 Big Data panel at the NIH BD2K All Hands 2016 meeting
Day 2 Big Data panel at the NIH BD2K All Hands 2016 meetingWarren Kibbe
 

Viewers also liked (20)

The Use of Discovery Driven Planning to Manage High Uncertainty Projects
The Use of Discovery Driven Planning to Manage High Uncertainty ProjectsThe Use of Discovery Driven Planning to Manage High Uncertainty Projects
The Use of Discovery Driven Planning to Manage High Uncertainty Projects
 
Uncertainty reduction
Uncertainty reductionUncertainty reduction
Uncertainty reduction
 
JPJ1407 Expressive, Efficient, and Revocable Data Access Control for Multi-...
JPJ1407   Expressive, Efficient, and Revocable Data Access Control for Multi-...JPJ1407   Expressive, Efficient, and Revocable Data Access Control for Multi-...
JPJ1407 Expressive, Efficient, and Revocable Data Access Control for Multi-...
 
Building a Distributed Data Pipeline
Building a Distributed Data PipelineBuilding a Distributed Data Pipeline
Building a Distributed Data Pipeline
 
Secure data sharing in cloud computing using revocable storage identity-based...
Secure data sharing in cloud computing using revocable storage identity-based...Secure data sharing in cloud computing using revocable storage identity-based...
Secure data sharing in cloud computing using revocable storage identity-based...
 
SECURE DATA SHARING IN CLOUD COMPUTING USING REVOCABLE-STORAGE IDENTITY-BASED...
SECURE DATA SHARING IN CLOUD COMPUTING USING REVOCABLE-STORAGE IDENTITY-BASED...SECURE DATA SHARING IN CLOUD COMPUTING USING REVOCABLE-STORAGE IDENTITY-BASED...
SECURE DATA SHARING IN CLOUD COMPUTING USING REVOCABLE-STORAGE IDENTITY-BASED...
 
Building a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & AirflowBuilding a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & Airflow
 
Coaching poster
Coaching posterCoaching poster
Coaching poster
 
Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...
 
The need to redefine genomic data sharing - moving towards Open Science Oct ...
The need to redefine genomic data sharing - moving towards Open Science  Oct ...The need to redefine genomic data sharing - moving towards Open Science  Oct ...
The need to redefine genomic data sharing - moving towards Open Science Oct ...
 
How to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationHow to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical information
 
Genomic futures v_pitt_kent_osu
Genomic futures v_pitt_kent_osuGenomic futures v_pitt_kent_osu
Genomic futures v_pitt_kent_osu
 
Advanced genomics v_medical_pitt_kent_osu
Advanced genomics v_medical_pitt_kent_osuAdvanced genomics v_medical_pitt_kent_osu
Advanced genomics v_medical_pitt_kent_osu
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
 
Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of BioinformaticsManaging & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
 
Current Practices, Trends and Emerging roles in Learning and Development
Current Practices, Trends and Emerging roles in Learning and DevelopmentCurrent Practices, Trends and Emerging roles in Learning and Development
Current Practices, Trends and Emerging roles in Learning and Development
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Big Data and Genomic Medicine by Corey Nislow
Big Data and Genomic Medicine by Corey NislowBig Data and Genomic Medicine by Corey Nislow
Big Data and Genomic Medicine by Corey Nislow
 
Day 2 Big Data panel at the NIH BD2K All Hands 2016 meeting
Day 2 Big Data panel at the NIH BD2K All Hands 2016 meetingDay 2 Big Data panel at the NIH BD2K All Hands 2016 meeting
Day 2 Big Data panel at the NIH BD2K All Hands 2016 meeting
 

Similar to A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

SDSC Industry News Q1 2015
SDSC Industry News Q1 2015SDSC Industry News Q1 2015
SDSC Industry News Q1 2015Ron Hawkins
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library EMC
 
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...Institute of Information Systems (HES-SO)
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsIlkay Altintas, Ph.D.
 
Hughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication RepositoriesHughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication RepositoriesASIS&T
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science James Hendler
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessAjay Ohri
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
 
The pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an exampleThe pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an exampleEnis Afgan
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 

A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

  • 1. A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data İlkay ALTINTAŞ, Ph.D. Chief Data Science Officer, San Diego Supercomputer Center Founder and Director, Workflows for Data Science Center of Excellence
  • 2. SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego Providing Cyberinfrastructure for Research and Education • Established as a national supercomputer resource center in 1985 by NSF • A world leader in HPC, data-intensive computing, and scientific data management • Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications” 1985 today Two discoveries in drug design from 1987 and 1991.
  • 3. Ross Walker Group SDSC continues to be a leader in scientific computing and big data! Gordon: First Flash-based Supercomputer for Data-intensive Apps Comet: Serving the Long Tail of Science 27 standard racks = 1944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD ~ 2 Pflop/s • 36 GPU nodes • 4 Large Memory nodes • 7 PB Lustre storage • High performance virtualization
  • 4. SDSC Data Science Office -- Expertise, Systems and Training for Data Science Applications -- SDSC Data Science Office (DSO) SDSC DSO is a collaborative virtual organization at SDSC for collective lasting innovation in data science research, development and education. DSO SDSC Expertise and Strengths Big Data Platforms Training Industry Applications
  • 5. Life Sciences is an ongoing strategic application thrust at SDSC…
  • 6. Genomic Analysis is a Big Data and Big Compute Problem BIG DATA COMPUTING AT SCALE Enables dynamic data-driven applications Computer-Aided Drug Discovery Personalized Precision Medicine Requires: • Data management • Data-driven methods • Scalable tools for dynamic coordination and resource optimization • Skilled interdisciplinary workforce Team work and process management Vaccine Development Metagenomics …
  • 7. New era of data science! Needs and Trends for the New Era Data Science -- the Big Data Era Goals -- • More data-driven • More dynamic • More process-driven • More collaborative • More accountable • More reproducible • More interactive • More heterogeneous
  • 8. Velocity Variety Volume Scalable batch processing Stream processing Extensible data storage, access and integration Genomic Data Management and Processing in the Big Data Era has Unique Challenges!
  • 9. HBase Hive Pig Zookeeper Giraph Storm Spark MapReduce YARN MongoDB Cassandra HDFS Flink Lower levels: Storage and scheduling Higher levels: Interactivity These challenges push for new tools to tackle them.
  • 10. COORDINATION AND WORKFLOW MANAGEMENT DATA INTEGRATION AND PROCESSING DATA MANAGEMENT AND STORAGE How do we use these new tools and combine them with existing domain-specific solutions in scientific computing and data science?
  • 11. COORDINATION AND WORKFLOW MANAGEMENT DATA INTEGRATION AND PROCESSING DATA MANAGEMENT AND STORAGE Layer 1: Data Management and Storage
  • 12. COORDINATION AND WORKFLOW MANAGEMENT DATA INTEGRATION AND PROCESSING DATA MANAGEMENT AND STORAGE Layer 2: Data Integration and Processing HBase Hive Pig Zookeeper Giraph Storm Spark MapReduce YARN MongoDB Cassandra HDFS Flink + Application-specific libraries
  • 13. Most of the time, more than one analysis needs to take place… And each analysis has multiple steps to integrate!
  • 14. Pipelining is a way to put the steps together. Source: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
  • 15. COORDINATION AND WORKFLOW MANAGEMENT DATA INTEGRATION AND PROCESSING DATA MANAGEMENT AND STORAGE Layer 3: Coordination and Workflow Management
  • 16. COORDINATION AND WORKFLOW MANAGEMENT ACQUIRE PREPARE ANALYZE REPORT ACT … kepler-project.org
  • 17. Workflows for Data Science Center of Excellence at SDSC Building functional, operational and reproducible solution architectures using big data and HPC tools is what we do. Focus on the question, not the technology! • Access and query data • Scale computational analysis • Increase reuse • Save time, energy and money • Formalize and standardize Real-Time Hazards Management wifire.ucsd.edu Data-Parallel Bioinformatics bioKepler.org Scalable Automated Molecular Dynamics and Drug Discovery nbcr.ucsd.edu WorDS.sdsc.edu
  • 18. bioKepler: A Kepler Module for Bio Big Data Analysis Data-Parallel Bioinformatics bioKepler.org
  • 19. Source: Larry Smarr, Calit2 • Metagenomic Sequencing • JCVI Produced • ~150 Billion DNA Bases From Seven of LS Stool Samples Over 1.5 Years • ~3 Trillion DNA Bases From NIH Human Microbiome Program Data Base • 255 Healthy People, 21 with IBD Illumina HiSeq 2000 at JCVI SDSC Gordon Data Supercomputer Example from 2013: Inflammatory Bowel Disease (IBD) • Supercomputing (W.Li, JCVI/HLI/UCSD): • ~20 CPU-Years on SDSC’s Gordon • ~4 CPU-Years on Dell’s HPC Cloud • Produced Relative Abundance of • ~10K Bacteria, Archaea, Viruses in ~300 People • ~3 Million Filled Spreadsheet Cells
  • 20. Ongoing Research: Optimization of Heterogeneous Resource Utilization using bioKepler National Resources (Gordon) (Comet) (Stampede) (Lonestar) Cloud Resources Optimized Local Cluster Resources Uses existing genomics tools and computing systems! Computing is just one part of it… …new methods needed!
  • 21. Needs of a Dynamic Ecosystem of Genomic Discovery • Exploratory methods to see temporal changes and patterns in sequence data • Efficient updates to analysis as quickly as new sequence data is generated • Regular reruns of annotations as reference databases evolve • Integration of genomic data with other types of data, e.g., image, environmental, social graphs • Dynamic ability to check quality and provenance of data and analysis • Transparent support for computing platforms designed for genomic discovery and pattern analysis • Workflow coordination and system integration • People and culture to make it happen collaboratively!
  • 22. Examples from 2016: Apache Big Data Technologies in Life Sciences • Lightning Fast Genomics with ADAM • Goal • Study genetic variations in populations at scale (e.g., 1000 Genomes Project) • Technology stack • Apache Avro (data serialization, schema definition) • Apache Parquet (compact columnar storage) • Apache Spark (distributed parallel processing) • Spark MLlib (machine learning, clustering) • Source: AMPLab, UC Berkeley (http://bdgenomics.org/) • Compressive Structural Bioinformatics using MMTF • Goal • 100+ speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB) • Technology stack • MMTF (Macromolecular Transmission Format, compact storage in Hadoop Sequence Files) • Apache Spark (in-memory, parallel distributed workflows using compressed data) • Spark ML (clustering) • Source: SDSC, UC San Diego (http://mmtf.rcsb.org/)
  • 23. Development of tools and technologies that enable models to bridge across diverse scales of biological organization, while leveraging all types and sources of data NBCR Example: Distilling Medical Image Data for Biomedical Action nbcr.ucsd.edu
  • 24. Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps Å nm – μm 0.1mm - mm cm fs - μs μs - ms ms - s s - lifespan Molecular & Macromolecular Sub-Cellular Cell Tissue Organ Spatial and Temporal Scales Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration
  • 25. A challenge: Data Integration Challenge to bridge across diverse scales of biological organization, to understand emergent behavior, and the molecular mechanisms underlying biological function & disease
  • 26. Integrated Multi-Scale Modeling Toolkits in NBCR User Interface NBCR Products Battling complexity while facilitating collaboration and increasing reproducibility. Cyberinfrastructure Innovation Based on User Needs Domain-specific tools, workflows, data and computing infrastructure. Components for Multi-Scale Modeling A handful of customizable and extensible tools, workflows, user interfaces and publishable research objects. NBCR Products Workflows Scientific Tools Past Experiments • UI generation • Logical workflow generation • Uncertainty quantification • Workflow execution • Provenance tracking • System integration
  • 28. Minimization Actor Equilibration Actor AMBER GPU MD Workbench
  • 29. Rommie Amaro, PI, UCSD Computational chemistry, biophysics Andrew McCammon, UCSD Computational chemistry, biophysics, chemical physics Mark Ellisman, UCSD Molecular & cellular biology Andrew McCulloch, UCSD Bioengineering, biophysics Michel Sanner, TSRI Drug discovery & molecular visualization Phil Papadopoulos, UCSD/SDSC Computer engineering, cyberinfrastructure technology Ilkay Altintas, UCSD/SDSC Workflows, provenance Michael Holst, UCSD Math, physics Arthur Olson, TSRI Computational chemistry, drug discovery, visualization LEADERSHIP TEAM
  • 30. Training at the interface Challenge: how do we build the next generation of interdisciplinary scientists? Data-to-Structural-Models Simulation-Based Drug Discovery
  • 31. Biomedical Big Data Training Collaboratory http://biobigdata.ucsd.edu • BBDTC website is up and evolving! • BBDTC contains seven full, open biomedical training courses • Four-course biomedical big data series is planned for Winter 2017
  • 32. Working with Industry Partners at SDSC
  • 33. SDSC Provides a Range of Strategies for Engaging with Industry • Sponsored research agreements • Service agreements for use of systems & consulting • Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow Technologies) • Training programs in Data Science & Analytics • Industry Partners Program for “jump starting” collaborations Working with industry helps companies be more competitive, drives innovation, and fosters a healthy ecosystem between the research and private sector.
  • 34. Example for Industrial Collaboration: Janssen R&D Rheumatoid Arthritis Study • Janssen was interested in correlating genomic profile with response to TNFα inhibitor golimumab • Sequenced 438 patients (full genome) • SDSC assisted with re-alignment and variant calling using new/improved algorithms • Needed analysis done in a reasonable timeframe (a few weeks)