A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data
1. A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data
İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence
2. SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”
(Timeline: 1985 – today, including two discoveries in drug design from 1987 and 1991.)
3. Ross Walker Group
SDSC continues to be a leader in scientific computing and big data!
Gordon: First Flash-based Supercomputer for Data-intensive Apps
Comet: Serving the Long Tail of Science
• 27 standard racks = 1,944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD ≈ 2 Pflop/s
• 36 GPU nodes
• 4 large-memory nodes
• 7 PB Lustre storage
• High-performance virtualization
4. SDSC Data Science Office
-- Expertise, Systems and Training for Data Science Applications --
The SDSC Data Science Office (DSO) is a collaborative virtual organization at SDSC for collective, lasting innovation in data science research, development and education.
(Diagram: the DSO at the intersection of SDSC expertise and strengths, big data platforms, training, and industry applications.)
5. Life Sciences is an ongoing strategic application thrust at SDSC…
6. Genomic Analysis is a Big Data and Big Compute Problem
BIG DATA + COMPUTING AT SCALE enables dynamic data-driven applications: computer-aided drug discovery, personalized precision medicine, vaccine development, metagenomics, …
Requires:
• Data management
• Data-driven methods
• Scalable tools for dynamic coordination and resource optimization
• Skilled interdisciplinary workforce
• Teamwork and process management
7. New era of data science!
Needs and Trends for the New Era of Data Science
-- the Big Data Era Goals --
• More data-driven
• More dynamic
• More process-driven
• More collaborative
• More accountable
• More reproducible
• More interactive
• More heterogeneous
10. COORDINATION AND WORKFLOW MANAGEMENT / DATA INTEGRATION AND PROCESSING / DATA MANAGEMENT AND STORAGE
How do we use these new tools and combine them with existing domain-specific solutions in scientific computing and data science?
12. Layer 2: Data Integration and Processing
(Stack diagram: Coordination and Workflow Management / Data Integration and Processing / Data Management and Storage)
Technologies in this layer: HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink + application-specific libraries
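Of the technologies in this layer, the MapReduce model underpins several of the others (Spark and Flink generalize the same dataflow). A minimal sketch of the map / shuffle / reduce phases in pure Python, with no Hadoop involved; the k-mer counting task is an illustrative stand-in for a real genomics job:

```python
from collections import defaultdict
from itertools import chain

# Map phase: emit (k-mer, 1) pairs from each sequence read.
def map_reads(read, k=3):
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

# Shuffle phase: group emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each k-mer.
def reduce_counts(groups):
    return {kmer: sum(values) for kmer, values in groups.items()}

reads = ["GATTACA", "TACAGAT"]
pairs = list(chain.from_iterable(map_reads(r) for r in reads))
counts = reduce_counts(shuffle(pairs))
```

In a real deployment each phase runs in parallel across HDFS blocks and cluster nodes; the single-process version above only shows the shape of the computation.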
13. Most of the time, more than one analysis needs to take place…
And each analysis has multiple steps to integrate!
14. Pipelining is a way to put the steps together.
Source: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform
Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink
Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html
Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
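At its core, pipelining is composing independent steps into one callable; a minimal sketch in plain Python (the step names are illustrative and not taken from any of the linked stacks):

```python
from functools import reduce

def make_pipeline(*steps):
    """Compose steps left-to-right into a single callable."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

# Illustrative steps for a toy sequence-cleaning pipeline.
def uppercase(reads):
    return [r.upper() for r in reads]

def drop_short(reads, min_len=4):
    return [r for r in reads if len(r) >= min_len]

def dedupe(reads):
    return sorted(set(reads))

pipeline = make_pipeline(uppercase, drop_short, dedupe)
result = pipeline(["gattaca", "cat", "GATTACA", "acgtac"])
```

Each step sees only the previous step's output, which is what lets real pipeline frameworks swap, parallelize, or rerun individual stages independently.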
17. Workflows for Data Science Center of Excellence at SDSC
Building functional, operational and reproducible solution architectures using big data and HPC tools is what we do.
Focus on the question, not the technology!
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
WorDS.sdsc.edu
19. Example from 2013: Inflammatory Bowel Disease (IBD)
Source: Larry Smarr, Calit2
• Metagenomic sequencing (JCVI-produced, Illumina HiSeq 2000 at JCVI):
• ~150 billion DNA bases from seven of Larry Smarr's stool samples over 1.5 years
• ~3 trillion DNA bases from the NIH Human Microbiome Program database (255 healthy people, 21 with IBD)
• Supercomputing (W. Li, JCVI/HLI/UCSD) on the SDSC Gordon data supercomputer:
• ~20 CPU-years on SDSC's Gordon
• ~4 CPU-years on Dell's HPC Cloud
• Produced relative abundance of ~10K bacteria, archaea, and viruses in ~300 people (~3 million filled spreadsheet cells)
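The "relative abundance" output above is, at its core, per-sample normalization of raw taxon read counts; a hedged sketch of that final step (toy counts, not the study's data):

```python
def relative_abundance(counts):
    """Normalize raw per-taxon read counts to fractions summing to 1."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}

# Toy sample: raw read counts per taxon (illustrative values only).
sample = {"Bacteroides": 600, "Firmicutes": 300, "Archaea": 100}
abundance = relative_abundance(sample)
```

The compute cost in the study came from the upstream alignment and classification of trillions of bases, not from this normalization; one such dict per sample, over ~10K taxa and ~300 people, yields the ~3 million spreadsheet cells mentioned above.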
20. Ongoing Research: Optimization of Heterogeneous Resource Utilization using bioKepler
Optimized across national resources (Gordon, Comet, Stampede, Lonestar), cloud resources, and local cluster resources.
Uses existing genomics tools and computing systems!
Computing is just one part of it… new methods needed!
21. Needs of a Dynamic Ecosystem of Genomic Discovery
• Exploratory methods to see temporal changes and patterns in sequence data
• Efficient updates to analysis as quickly as new sequence data gets generated
• Regular reruns of annotations as reference databases evolve
• Integration of genomic data with other types of data, e.g., image, environmental, social graphs
• Dynamic ability to check quality and provenance of data and analysis
• Transparent support for computing platforms designed for genomic discovery and pattern analysis
• Workflow coordination and system integration
• People and culture to make it happen collaboratively!
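One common way to support regular reruns as reference databases evolve is to fingerprint the inputs and recompute only when a fingerprint changes. A minimal sketch; the cache structure and function names here are assumptions for illustration, not a bioKepler feature:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash used to detect reference-database changes."""
    return hashlib.sha256(data).hexdigest()

def annotate_if_stale(sequence, reference_db, cache, annotate):
    """Rerun annotation only when the sequence or reference DB changed."""
    key = (fingerprint(sequence), fingerprint(reference_db))
    if key not in cache:
        cache[key] = annotate(sequence, reference_db)
    return cache[key]

# Toy annotation step (illustrative); records how often it actually runs.
calls = []
def annotate(seq, ref):
    calls.append(seq)
    return b"annotated:" + seq

cache = {}
r1 = annotate_if_stale(b"GATTACA", b"refdb-v1", cache, annotate)
r2 = annotate_if_stale(b"GATTACA", b"refdb-v1", cache, annotate)  # cache hit, no rerun
r3 = annotate_if_stale(b"GATTACA", b"refdb-v2", cache, annotate)  # DB evolved: rerun
```

The same fingerprinting idea also underpins provenance checks: a recorded hash of inputs lets you verify later which reference version an annotation was produced against.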
22. Examples from 2016: Apache Big Data Technologies in Life Sciences
• Lightning Fast Genomics with ADAM
• Goal
• Study genetic variations in populations at scale (e.g., 1000 Genomes Project)
• Technology stack
• Apache Avro (data serialization, schema definition)
• Apache Parquet (compact columnar storage)
• Apache Spark (distributed parallel processing)
• Spark MLlib (machine learning, clustering)
• Source: AMPLab, UC Berkeley (http://bdgenomics.org/)
• Compressive Structural Bioinformatics using MMTF
• Goal
• 100+-fold speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB)
• Technology stack
• MMTF (Macromolecular Transmission Format; compact storage in Hadoop sequence files)
• Apache Spark (in-memory, parallel distributed workflows using compressed data)
• Spark ML (clustering)
• Source: SDSC, UC San Diego (http://mmtf.rcsb.org/)
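Both stacks lean on columnar storage (Parquet in ADAM, column-style arrays in MMTF) because analyses typically touch a few fields of very many records. A toy sketch of row versus column layout in pure Python; the variant field names are illustrative, not the ADAM or MMTF schema:

```python
# Row-oriented layout: one record (dict) per variant.
rows = [
    {"chrom": "1", "pos": 101, "qual": 40},
    {"chrom": "1", "pos": 202, "qual": 10},
    {"chrom": "2", "pos": 303, "qual": 55},
]

# Column-oriented layout: one array per field, as Parquet-style storage holds it.
columns = {
    "chrom": ["1", "1", "2"],
    "pos": [101, 202, 303],
    "qual": [40, 10, 55],
}

# Filtering on one field scans only that column, not every whole record.
high_qual_idx = [i for i, q in enumerate(columns["qual"]) if q >= 30]
high_pos = [columns["pos"][i] for i in high_qual_idx]
```

In an on-disk columnar format this layout means only the filtered column's bytes are read and decompressed, which is a large part of the speedups claimed above.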
23. NBCR Example: Distilling Medical Image Data for Biomedical Action (nbcr.ucsd.edu)
Development of tools and technologies that enable models to bridge across diverse scales of biological organization, while leveraging all types and sources of data.
24. Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps
(Figure: spatial scales from Å to cm and temporal scales from fs to lifespan, across molecular & macromolecular, sub-cellular, cell, tissue, and organ levels.)
Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration.
25. A Challenge: Data Integration
Bridging across diverse scales of biological organization to understand emergent behavior and the molecular mechanisms underlying biological function & disease.
26. Integrated Multi-Scale Modeling Toolkits in NBCR
Battling complexity while facilitating collaboration and increasing reproducibility.
Cyberinfrastructure Innovation Based on User Needs: domain-specific tools, workflows, data and computing infrastructure.
Components for Multi-Scale Modeling: a handful of customizable and extensible tools, workflows, user interfaces and publishable research objects.
NBCR Products draw on workflows, scientific tools, and past experiments, with a user interface on top, and provide:
• UI generation
• Logical workflow generation
• Uncertainty quantification
• Workflow execution
• Provenance tracking
• System integration
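Of the capabilities listed, provenance tracking has a particularly simple core: record, for every workflow step, its name, inputs, output, and timing. A toy decorator-based illustration, not the NBCR implementation; the step name is hypothetical:

```python
import functools
import time

provenance_log = []

def track_provenance(step):
    """Record each step's name, inputs, output, and duration."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = step(*args, **kwargs)
        provenance_log.append({
            "step": step.__name__,
            "inputs": (args, kwargs),
            "output": result,
            "seconds": time.time() - start,
        })
        return result
    return wrapper

@track_provenance
def prepare_receptor(pdb_id):
    # Illustrative stand-in for a real molecular-modeling step.
    return f"{pdb_id}-prepared"

prepared = prepare_receptor("1ABC")
```

A log of this shape is what makes results auditable and reruns reproducible: given the recorded inputs, any step can be replayed and its output compared.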
29. LEADERSHIP TEAM
• Rommie Amaro, PI, UCSD: computational chemistry, biophysics
• Andrew McCammon, UCSD: computational chemistry, biophysics, chemical physics
• Mark Ellisman, UCSD: molecular & cellular biology
• Andrew McCulloch, UCSD: bioengineering, biophysics
• Michel Sanner, TSRI: drug discovery & molecular visualization
• Phil Papadopoulos, UCSD/SDSC: computer engineering, cyberinfrastructure technology
• Ilkay Altintas, UCSD/SDSC: workflows, provenance
• Michael Holst, UCSD: math, physics
• Arthur Olson, TSRI: computational chemistry, drug discovery, visualization
30. Training at the interface
Challenge: how do we build the next generation of interdisciplinary scientists?
Data-to-Structural-Models Simulation-Based Drug Discovery
31. Biomedical Big Data Training Collaboratory
http://biobigdata.ucsd.edu
• BBDTC website is up and evolving!
• BBDTC contains seven full, open biomedical training courses
• Four-course biomedical big data series is planned for Winter 2017
33. SDSC Provides a Range of Strategies for
Engaging with Industry
• Sponsored research agreements
• Service agreements for use of systems & consulting
• Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow
Technologies)
• Training programs in Data Science & Analytics
• Industry Partners Program for “jump starting” collaborations
Working with industry helps companies be more competitive, drives innovation, and fosters a healthy ecosystem between the research and private sectors.
34. Example of Industrial Collaboration: Janssen R&D Rheumatoid Arthritis Study
• Janssen was interested in correlating genomic profile with response to the TNFα inhibitor golimumab
• Sequenced 438 patients (full genome)
• SDSC assisted with re-alignment and variant calling using new/improved algorithms
• Needed analysis done in a reasonable timeframe (a few weeks)